# 📘 Lesson 2: R Basics Refresher

Welcome to our second lesson in the research-oriented data mining course! Today, we'll dive deeper into the fundamentals of R programming that are essential for working with data. Don't worry if you're new to R; we'll go step-by-step.

## Learning Objectives

By the end of this lesson, you will be able to:

- Understand and use core R data types (numeric, character, logical, factor, date).
- Work with basic R data structures like vectors, lists, and data frames.
- Read data in multiple ways.
- Get a preview of loading real-world data into R.

## 1 Basics of R

R is a powerful tool for all manner of calculations, data manipulation and scientific computations. Before
getting to the complex operations possible in R we must start with the basics. Like most languages, R has
its share of mathematical capability, variables, functions and data types.

### 1.1 Basic Math

In [None]:
1 + 2

In [None]:
4 - 2

In [None]:
3 * 4

In [None]:
8 / 2

In [None]:
8 / 3

In [None]:
2 + (4 - 1)

In [None]:
8 / (4 - 2)

### 1.2 Variables

#### 1.2.1 Variable Assignment

The valid assignment operators are <- and =, with the first being preferred.

In [None]:
x <- 2
x

In [None]:
y = 4
y

The arrow operator can also point in the other direction.

In [None]:
6 -> z
z

The assignment operation can be used successively to assign a value to multiple variables
simultaneously.

In [None]:
a <- b <- 12

In [None]:
b

In [None]:
a

In [None]:
assign("j", 4) # assign(name, value): the variable name is passed as a string

In [None]:
j

#### 1.2.2 Removing Variables

In [None]:
j <- 3
j

In [None]:
rm(j)

In [None]:
j

ERROR: Error: object 'j' not found


Variable names are case sensitive: lowercase and uppercase letters are distinct.

In [None]:
age <- 18
age

In [None]:
age

In [None]:
AGE

ERROR: Error: object 'AGE' not found


### 1.3 Data Types

There are numerous data types in R that store various kinds of data. The four main types of data most likely to be used are:

- numeric: 1, 3, 2.5, 3.14
- character (string, i.e. text): 'Alice', 'US'
- Date/POSIXct (time-based): 'Jan-02-2025'
- logical: TRUE/FALSE

In [None]:
age <- 22
name <- 'Bob'
birthday <- as.Date('2021-05-02') # Use as.Date() with YYYY-MM-DD format
isHightSchool <- FALSE

print(age)
print(name)
print(birthday)
print(isHightSchool)

[1] 22
[1] "Bob"
[1] "2021-05-02"
[1] FALSE



In [None]:
class(age)

In [None]:
class(name)

In [None]:
class(birthday)

In [None]:
class(isHightSchool)

#### 1.3.1 Numeric Data

Testing whether a variable is numeric is done with the function **is.numeric**.

In [None]:
is.numeric(age)

In [None]:
is.numeric(name)

In [None]:
is.function(name)

Another important, if less frequently used, type is integer. As the name implies, this is for whole
numbers only, **no decimals**. To assign an integer to a variable it is necessary to append an **L** to the value.
As with checking for a numeric, the **is.integer** function is used.

In [None]:
i <- 5

In [None]:
is.numeric(i)

In [None]:
is.integer(i)

In [None]:
i <- 15L

In [None]:
i

In [None]:
is.numeric(i)

In [None]:
is.integer(i)

In [None]:
class(2L)

In [None]:
class(2)

In [None]:
class(3.14)

In [None]:
is.integer(99L)

In [None]:
5L * 3.14

In [None]:
class(5L * 3.14)

In [None]:
5L / 2L

In [None]:
class(5L / 2L)

- modulus: %%
- integer division: %/%

In [None]:
5 / 2

In [None]:
5 %% 2

In [None]:
5 %/% 2

#### 1.3.2 Character Data

R has two primary ways of handling character data:
character and factor. While they may seem similar on the surface, they are treated quite differently.

In [None]:
name <- 'Bob'
name

In [None]:
name <- "Bob"
name

In [None]:
x <- 'data mining 12345 '
x

In [None]:
y <- factor('data mining 12345 ') # a factor with a single level
y

Characters are case sensitive, so "Data" is different from "data" or "DATA".

In [None]:
'data' == 'data'

In [None]:
'data' == 'DATA'

To find the length of a character (or numeric) use the **nchar** function.

In [None]:
x

In [None]:
nchar(x) # length

In [None]:
nchar('datamining')

In [None]:
nchar('hello')

In [None]:
nchar(100)

In [None]:
nchar(3.14)

In [None]:
y <- factor('R programming')
y

In [None]:
nchar(y)

ERROR: Error in nchar(y): 'nchar()' requires a character vector


#### 1.3.3 Dates

Dealing with dates and times can be difficult in any language, and to further complicate matters R has
numerous different types of dates. The most useful are Date and POSIXct. Date stores just a date
while POSIXct stores a date and time. Both objects are actually represented as the number of days
(Date) or seconds (POSIXct) since January 1, 1970.

In [None]:
date1 <- as.Date("2012-06-28")

In [None]:
date1

In [None]:
class(date1)

In [None]:
as.numeric(date1)

In [None]:
date_1970 <- as.Date("1970-01-01")

In [None]:
as.numeric(date_1970)

as.numeric returns the number of days since '1970-01-01', so this date gives 0.

In [None]:
date2 <- as.POSIXct("2012-06-28 17:42")
date2

[1] "2012-06-28 17:42:00 UTC"

In [None]:
class(date2)

In [None]:
as.numeric(date2) # seconds

In [None]:
date3 <- as.POSIXct("1970-01-01 00:00:00")
date3

[1] "1970-01-01 UTC"

In [None]:
as.numeric(date3)

In [None]:
class(as.numeric(date3))
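
Because both classes are stored as counts since January 1, 1970, ordinary arithmetic works on them. The following is a small sketch of that idea; the variable names are illustrative, and the time zone is set to UTC explicitly so the result does not depend on the local machine.

```r
d0 <- as.Date("1970-01-01")
d1 <- as.Date("2012-06-28")

d1 - d0                  # a difftime measured in days
shifted <- d0 + 30       # adding a number to a Date adds that many days
shifted

t1 <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
one_day <- as.numeric(t1)   # seconds elapsed after exactly one day
one_day

stamp <- format(d1, "%Y/%m/%d")   # format() turns a Date back into text
stamp
```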

#### 1.3.4 Logical

Logicals are a way of representing data that can be either TRUE or FALSE. Numerically, **TRUE is the same as 1 and FALSE is the same as 0.**

In [None]:
# TRUE = 1
TRUE - 1

In [None]:
# FALSE = 0
FALSE + 1

In [None]:
TRUE * 3

In [None]:
FALSE * 3

In [None]:
FALSE * 3 + TRUE * 2

In [None]:
k <- TRUE
k

In [None]:
class(k)

In [None]:
is.logical(k)

In [None]:
is.factor(k)

In [None]:
T # shorthand for TRUE (T can be overwritten, so TRUE is safer)

In [None]:
F

Logicals can result from comparing two numbers, or characters.

In [None]:
2 == 3

In [None]:
2 != 3

In [None]:
2 < 3

In [None]:
2 > 3

In [None]:
2 >= 3

In [None]:
"data" == "data"
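
Since TRUE counts as 1 and FALSE as 0, a comparison can be summed or averaged directly. The sketch below uses made-up scores and an arbitrary pass mark of 60 to show counting and proportions.

```r
scores <- c(55, 72, 90, 48, 81)   # made-up values for illustration
passed <- scores >= 60            # logical vector: FALSE TRUE TRUE FALSE TRUE

n_passed <- sum(passed)       # number of TRUE values
share_passed <- mean(passed)  # proportion of TRUE values
n_passed
share_passed
```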

### 1.4 Vectors

A vector is a collection of elements, all of the same type. For instance, c(1, 3, 2, 1, 5) is a vector
consisting of the numbers 1, 3, 2, 1, 5, in that order. Similarly, c("R", "Excel", "SAS", "Excel") is a
vector of the character elements "R", "Excel", "SAS", and "Excel". A vector cannot be of mixed type.

Element types: integer, character, logical, etc.

In [None]:
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x
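
A short sketch of what "cannot be of mixed type" means in practice: when values of different types are combined with c, R converts them to one common type instead of raising an error. The variable names are illustrative.

```r
mixed_num <- c(1, TRUE, FALSE)   # logicals are converted to numbers
mixed_chr <- c(1, "a", TRUE)     # everything is converted to character

class(mixed_num)
class(mixed_chr)
mixed_chr
```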

#### 1.4.1 Vector Operations

In [None]:
x * 3

In [None]:
x + 2

In [None]:
x - 3

In [None]:
x / 4

In [None]:
x^2

In [None]:
sqrt(x)

Earlier we created a vector of the first ten numbers using the c function, which creates a vector. A
shortcut is the : operator, which generates a sequence of consecutive numbers, in either direction.

In [None]:
1:10 # from:to

In [None]:
2 : 6 # step: 1

In [None]:
10:1 # step: -1

In [None]:
-2:3

In [None]:
5:-7
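
The : operator always moves in steps of 1 or -1. For other step sizes, base R offers the seq function; the sketch below shows a few calls with illustrative variable names.

```r
odd <- seq(from = 1, to = 10, by = 2)       # step of 2
down <- seq(from = 10, to = 1, by = -3)     # negative step counts down
grid <- seq(from = 0, to = 1, length.out = 5)  # fixed number of points

odd
down
grid
```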

In [None]:
x <- 1:5
x

In [None]:
y <- 10:14
y

In [None]:
x + y

In [None]:
x - y

In [None]:
x * y

In [None]:
x / y

In [None]:
x^y

In [None]:
length(x) # how many elements in the vector

In [None]:
length(y)

In [None]:
x + y

In [None]:
1 : 10

In [None]:
length((1 : 10))

In [None]:
x

In [None]:
c(1, 2)

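When the vectors have different lengths, the shorter one is recycled. Here x is 1:5, so adding c(1, 2) gives 1 + 1 = 2, 2 + 2 = 4, 3 + 1 = 4, 4 + 2 = 6, 5 + 1 = 6.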

In [None]:
x + c(1, 2)

Because 5 is not a multiple of 2, R also issues the warning “longer object length is not a multiple of shorter object length”.


In [None]:
x + c(1, 2, 3, 4, 5)
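
The recycling behaviour can be sketched more systematically. When the longer length is a multiple of the shorter one, the shorter vector is simply repeated; otherwise the result is still computed, with a warning. The names below are illustrative, and suppressWarnings is used only to keep the output quiet.

```r
v <- 1:6

even_fit <- v + c(10, 20)       # length 2 divides 6: repeated three times
even_fit
v + c(10, 20, 30)               # length 3 divides 6: repeated twice

# Length 2 does not divide 5, so R warns but still recycles.
uneven <- suppressWarnings(1:5 + c(1, 2))
uneven
```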

In [None]:
x

In [None]:
x <= 5

In [None]:
x > y

In [None]:
x < y

In [None]:
x <- 10:1
y <- -4:5

In [None]:
x

In [None]:
y

To test whether all the resulting elements are TRUE, use the all function. Similarly, the any function
checks whether any element is TRUE.

In [None]:
any(x < y) # TRUE if at least one element of x is less than the corresponding element of y

In [None]:
all(x < y) # TRUE only if every element of x is less than the corresponding element of y

In [None]:
name <- c("Alice", "Bob", "Chris", "Dora")
name

In [None]:
nchar(name) # the length of each element

In [None]:
z <- 5:15
z

In [None]:
nchar(z)

In [None]:
z

In [None]:
z[4] # vector[index]; indexing starts at 1

In [None]:
z[1:3] # val-x : val-y  --> FROM val-x TO val-y

In [None]:
z[3:5]

In [None]:
z[c(2, 4, 6)]

In [None]:
n <- c(One="a", Two="y", Last="r") # named vector: three name=value pairs
n

In [None]:
n[3]

In [None]:
w <- 1:3

In [None]:
w

In [None]:
c("a", "b", "c")

In [None]:
names(w) <- c("a", "b", "c")

In [None]:
names(w)

In [None]:
w

In [None]:
names(w) <- c("one", "two", "three")

In [None]:
w

In [None]:
names(w)

In [None]:
w <- c(11, 22, 33)

In [None]:
w

In [None]:
names(w) <- c("one", "two", "three")

In [None]:
names(w) # variable

In [None]:
names_w <- c("one", "two", "three")
names_w

#### 1.4.2 Factor Vectors

Factors are an important concept in R, especially when building models. Let’s create a simple
vector of text data that has a few repeats.

In [None]:
q <- c(1, 2, 3)

In [None]:
q

In [None]:
q2 <- c(q, "apple", "apple", "banana", "lemon", "lemon", "banana", TRUE, FALSE)

In [None]:
q2

In [None]:
q2Factor <- as.factor(q2)

In [None]:
q2Factor

In [None]:
class(q2Factor)

In [None]:
as.numeric(q2Factor)
# 1 2 3 apple(4) apple(4) banana(5) lemon(7) lemon(7) banana(5) TRUE(8) FALSE(6)

In [None]:
factor(x=c("High School", "College", "Masters", "Doctorate", "High School"), levels=c("High School", "College", "Masters", "Doctorate"), ordered=TRUE)

In [None]:
factor(x=c("Doctorate", "College", "High School", "Masters"), levels=c("High School", "College", "Masters", "Doctorate"), ordered=TRUE)
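
The point of ordered=TRUE is that the levels carry a rank, so comparisons become meaningful. A minimal sketch, reusing the education levels from the cells above under an illustrative variable name:

```r
edu <- factor(c("High School", "College", "Masters", "Doctorate", "High School"),
              levels = c("High School", "College", "Masters", "Doctorate"),
              ordered = TRUE)

levels(edu)
codes <- as.integer(edu)         # position of each value in the level order
codes
at_least_masters <- edu >= "Masters"   # comparison follows the level order
at_least_masters
highest <- as.character(max(edu))
highest
```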

### 1.5 Calling Functions

In [None]:
x <- 2:6
x

In [None]:
mean(x)

In [None]:
max(x)

In [None]:
min(x)

In [None]:
median(x)

### 1.6 Function Documentation


To get help on binary operators like +, * or ==, surround them with back ticks (`).

In [None]:
?`+`

In [None]:
?`==`

There are occasions when we have only a sense of the function we want to use. In that case we can
look up the function by using part of the name with **apropos**.

In [None]:
apropos("mea")

### 1.7 Missing Data


Missing data plays a critical role in both statistics and computing, and R has two types of missing data,
**NA** and **NULL**. While they are similar, they behave differently and that difference needs attention.

#### 1.7.1 NA

In [None]:
zChar <- c("Tiger", NA, "Lion", NA)
zChar

In [None]:
is.na(zChar)

In [None]:
z <- c(1, 2, NA, 8, 3, NA, 3)
z

In [None]:
is.na(z)

If we calculate the mean of z, the answer will be NA since mean returns NA if even a single element is
NA.

In [None]:
mean(z)

When na.rm is TRUE, mean first removes the missing data (na.rm stands for "NA remove"), then calculates the mean of the remaining values.

In [None]:
mean(z, na.rm=TRUE)
# (1 + 2 + 8 + 3 + 3) / 5 = 3.4

In [None]:
# sum, min, max, var, sd
sum(z)

In [None]:
sum(z, na.rm = TRUE)

#### 1.7.2 NULL

NULL is the absence of anything. It is not exactly missingness, it is nothingness. Functions can sometimes
return NULL and their arguments can be NULL. An important difference between NA and NULL is that
NULL is atomic and cannot exist within a vector. If used inside a vector, it simply disappears.

In [None]:
z <- c(1, NULL, 3)
z

In [None]:
d <- NULL

In [None]:
is.null(d)

In [None]:
is.null(7)
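
One way to see the difference between NA and NULL is to compare vector lengths: NA holds a position, while NULL leaves no trace. A small sketch with illustrative names:

```r
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)     # NA occupies a slot
length(with_null)   # NULL was dropped when the vector was built
is.na(with_na)
length(NULL)        # NULL itself has length zero
```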

### 1.8 Pipes

The pipe from the magrittr package works by
taking the value or object on the left-hand side of the pipe and inserting it into the first argument of the
function that is on the right-hand side of the pipe. A simple example would be using a pipe to
feed x to the mean function.

In [None]:
library(magrittr)

In [None]:
x <- 1:10
x

In [None]:
mean(x)

The pipe operator is written %>%.

In [None]:
x %>% mean # from left to right

In [None]:
z <- c(1, 2, NA, 8, 3, NA, 3)
z

In [None]:
sum(is.na(z))

In [None]:
z %>% is.na %>% sum

In [None]:
z %>% mean(na.rm=TRUE)
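
A chained pipe is equivalent to nesting the same calls inside out. The sketch below, which assumes the magrittr package is installed, builds both forms from an illustrative vector and checks that they agree.

```r
library(magrittr)

v <- c(1.234, 5.678, NA)

# Piped: read left to right.
piped <- v %>% mean(na.rm = TRUE) %>% round(digits = 1)

# Nested: the same calls, read inside out.
nested <- round(mean(v, na.rm = TRUE), digits = 1)

piped
identical(piped, nested)
```

Extra arguments such as na.rm or digits stay inside the parentheses; only the first argument is supplied by the pipe.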

## 2 Advanced Data Structures


Sometimes data require more complex storage than simple vectors and thankfully R provides a host of
data structures. The most common are the data.frame, matrix and list, followed by the array.
Of these, the data.frame will be most familiar to anyone who has used a spreadsheet, the matrix to
people familiar with matrix math and the list to programmers.

### 2.1 data.frames


On the surface a data.frame is just like an Excel spreadsheet in that it has columns and rows. In
statistical terms, each column is a variable and each row is an observation.

There are numerous ways to construct a data.frame, the simplest being to use the data.frame
function.

In [40]:
x <- 10:1
x

In [41]:
y <- -4:5
y

In [42]:
q <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
q

In [43]:
theDF <- data.frame(x, y, q)

In [44]:
theDF

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
theDF <- data.frame(First=x, Second=y, Number=q)

In [None]:
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
nrow(theDF)

In [None]:
ncol(theDF)

In [None]:
dim(theDF)

In [None]:
names(theDF)

In [None]:
names(theDF)[2]

In [None]:
rownames(theDF) # index starts with 1

In [None]:
rownames(theDF) <- c("#1", "#2", "#3", "#4", "#5", "#6", "#7", "#8", "#9", "#10")

In [None]:
theDF

,First,Second,Number
,<int>,<int>,<chr>
#1,10,-4,one
#2,9,-3,two
#3,8,-2,three
#4,7,-1,four
#5,6,0,five
#6,5,1,six
#7,4,2,seven
#8,3,3,eight
#9,2,4,nine
#10,1,5,ten


In [None]:
rownames(theDF) <- NULL
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
rownames(theDF)

In [None]:
head(theDF)

,First,Second,Number
,<int>,<int>,<chr>
1,10,-4,one
2,9,-3,two
3,8,-2,three
4,7,-1,four
5,6,0,five
6,5,1,six


In [None]:
head(theDF, n=3)

,First,Second,Number
,<int>,<int>,<chr>
1,10,-4,one
2,9,-3,two
3,8,-2,three


In [None]:
tail(theDF, n=1)

,First,Second,Number
,<int>,<int>,<chr>
10,1,5,ten


In [None]:
class(theDF)

Like many other aspects of R, there are multiple ways to access an individual
column. There is the $ operator and also the square brackets.

In [None]:
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
theDF[5, 1:2]

,First,Second
,<int>,<int>
5,6,0


The $ operator extracts a single column by name.

In [None]:
theDF$Number

In [None]:
theDF[10, 3] # 1st: row index, 2nd: col index

In [None]:
theDF[3, 2:3] # 2nd: 2:3 the col index is from 2 to 3

,Second,Number
,<int>,<chr>
3,-2,three


In [None]:
c(3, 5)

In [None]:
theDF[c(3, 5), 2] # 1st: c(3, 5) row index

In [None]:
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
theDF[c(1, 3, 5, 7, 9), 3]

In [None]:
c(2, 5)

In [None]:
2:5 # from 2 to 5, step 1

In [None]:
theDF[c(3, 5), 2:3] # rows 3 and 5, columns 2 through 3

,Second,Number
,<int>,<chr>
3,-2,three
5,0,five


In [None]:
theDF[, 3] # an empty row index selects all rows

In [None]:
theDF[, 2:3]

Second,Number
<int>,<chr>
-4,one
-3,two
-2,three
-1,four
0,five
1,six
2,seven
3,eight
4,nine
5,ten


In [None]:
theDF[2, ]

,First,Second,Number
,<int>,<int>,<chr>
2,9,-3,two


In [None]:
theDF[2:4, ]

,First,Second,Number
,<int>,<int>,<chr>
2,9,-3,two
3,8,-2,three
4,7,-1,four


In [None]:
class(c("First","Number"))

In [None]:
theDF[2, c("First","Second", "Number")]

,First,Second,Number
,<int>,<int>,<chr>
2,9,-3,two


In [None]:
theDF[,"First"]

In [None]:
class(theDF[,"First"])

In [None]:
theDF["Number"]

Number
<chr>
one
two
three
four
five
six
seven
eight
nine
ten


In [None]:
class(theDF["Number"])

In [None]:
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
theDF["Number"]

Number
<chr>
one
two
three
four
five
six
seven
eight
nine
ten


In [None]:
theDF[["Number"]]

In [None]:
class(theDF[["Number"]])

In [None]:
theDF[,"Number", drop=FALSE]

Number
<chr>
one
two
three
four
five
six
seven
eight
nine
ten


In [None]:
class(theDF[,"Number", drop=FALSE])

In [None]:
theDF

First,Second,Number
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [None]:
theDF[1, 3, drop=TRUE]


### 2.2 Lists

Often a container is needed to hold arbitrary objects of either the same type or varying types. R
accomplishes this through lists. They store any number of items of any type. A list can contain all
numerics or characters or a mix of the two or data.frames or, recursively, other lists.

In [1]:
list(1, 2, 3, 99, 1000)

In [8]:
list(c(1, 2, 3), c(88, 99, 77))

In [9]:
list(c(1, 2, 3), 3:7)

In [10]:
(list3 <- list(c(1, 2, 3), 3:7))

In [11]:
list3

In [12]:
theDF

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [13]:
list(theDF, 1:10, "Apple", c(999, 1000))

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [45]:
list5 <- list(theDF, 1:10, list3)

In [46]:
list5

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [16]:
names(list5)

NULL

In [49]:
names(list5) <- c("data.frame", "vector", "list")
names(list5)

In [18]:
list6 <- list(TheDataFrame=theDF, TheVector=1:10, TheList=list3)
list6

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [19]:
names(list6)

In [20]:
(emptyList <- vector(mode="list", length=4))

In [50]:
theDF

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [51]:
list5

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [52]:
list5[[3]]

In [53]:
list5[["vector"]]

A named element can also be accessed with the $ operator, for example list5$vector.

In [54]:
list5[[1]]

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [56]:
list5[[1]][,2, drop=TRUE]

- **drop = TRUE:** R drops the redundant dimension where it can. This is the default for matrices (a single row or column becomes a vector) and for single-column selections from a data frame. A single row selected from a data frame is kept as a data frame by default.

- **drop = FALSE:** R will preserve the original dimensions, even if the result is a single row or column. It will return a data frame (for data frames) or a matrix (for matrices) with one row or one column.
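
The effect of drop is easiest to see by checking the class or dimensions of each result. This is a minimal sketch on a small made-up data frame and matrix; the names df and m are illustrative.

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
m  <- matrix(1:6, nrow = 2)

col_default <- class(df[, "a"])                # single column: dropped to a vector
col_kept    <- class(df[, "a", drop = FALSE])  # stays a data.frame
row_default <- class(df[1, ])                  # single row: stays a data.frame

row_dim_default <- dim(m[1, ])                 # NULL: a matrix row becomes a plain vector
row_dim_kept    <- dim(m[1, , drop = FALSE])   # 1 x 3 matrix

col_default
col_kept
row_default
row_dim_kept
```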

In [33]:
length(list5)

In [34]:
list5[[4]] <- 2

In [35]:
length(list5)

In [36]:
list5[["NewElement"]] <- 3:6

In [37]:
list5

x,y,q
<int>,<int>,<chr>
10,-4,one
9,-3,two
8,-2,three
7,-1,four
6,0,five
5,1,six
4,2,seven
3,3,eight
2,4,nine
1,5,ten


In [38]:
length(list5)

In [39]:
names(list5)
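
Single and double square brackets behave differently on lists, which is a common source of confusion. The sketch below uses a small illustrative list to show that single brackets return a sub-list, double brackets return the element itself, and assigning NULL removes an element.

```r
items <- list(numbers = 1:3, label = "Apple")

sub_list <- items["numbers"]     # still a list, of length 1
element  <- items[["numbers"]]   # the integer vector itself
class(sub_list)
class(element)

items$label                      # $ is shorthand for [[ ]] with a name

items[["label"]] <- NULL         # assigning NULL removes the element
length(items)
```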

### 2.3 Matrices


A very common mathematical structure that is essential to statistics is a matrix. This is similar to a
data.frame in that it is rectangular with rows and columns except that every single element,
regardless of column, must be the same type, most commonly all numerics.

In [57]:
# create a 5x2 matrix
A <- matrix(1:10, nrow=5)

In [58]:
A

0,1
1,6
2,7
3,8
4,9
5,10


In [59]:
# create another 5x2 matrix
B <- matrix(21:30, nrow=5)


In [60]:
B

0,1
21,26
22,27
23,28
24,29
25,30


In [63]:
# create another 2x10 matrix
C <- matrix(21:40, nrow=2)

In [64]:
C

0,1,2,3,4,5,6,7,8,9
21,23,25,27,29,31,33,35,37,39
22,24,26,28,30,32,34,36,38,40


In [67]:
nrow(B)

In [68]:
ncol(B)

In [69]:
dim(A)

In [70]:
1 * 2

In [71]:
A * B

0,1
21,156
44,189
69,224
96,261
125,300


In [73]:
A

0,1
1,6
2,7
3,8
4,9
5,10


In [74]:
B

0,1
21,26
22,27
23,28
24,29
25,30


In [78]:
E <- matrix(21:30, nrow=5) # 21:30

In [79]:
E

0,1
21,26
22,27
23,28
24,29
25,30


In [80]:
E == B

0,1
TRUE,TRUE
TRUE,TRUE
TRUE,TRUE
TRUE,TRUE
TRUE,TRUE


Matrix multiplication is a commonly used operation in mathematics, requiring the number of columns
of the left-hand matrix to be the same as the number of rows of the right-hand matrix. Both A and B
are 5x2, so we will transpose B so it can be used on the right-hand side.

In [82]:
B

0,1
21,26
22,27
23,28
24,29
25,30


In [None]:
# B has 5 rows x 2 cols, so t(B) has 2 rows x 5 cols

In [81]:
A %*% t(B)

0,1,2,3,4
177,184,191,198,205
224,233,242,251,260
271,282,293,304,315
318,331,344,357,370
365,380,395,410,425
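
The dimension rule can be checked directly. The sketch below rebuilds the same two 5x2 matrices and looks at the shape of each product; tryCatch is used so that the non-conforming product reports its error message instead of stopping the script.

```r
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

outer_shape <- dim(A %*% t(B))   # (5x2) times (2x5) gives 5x5
inner_shape <- dim(t(A) %*% B)   # (2x5) times (5x2) gives 2x2
outer_shape
inner_shape

# A %*% B pairs 2 columns with 5 rows, so it cannot be computed.
attempt <- tryCatch(A %*% B, error = function(e) conditionMessage(e))
attempt
```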


In [87]:
colnames(A)

In [88]:
rownames(A)

In [85]:
colnames(A) <- c("Left", "Right")

In [86]:
rownames(A) <- c("1st", "2nd","3rd","4th","5th")

In [93]:
colnames(B)

In [94]:
rownames(B)

In [91]:
colnames(B) <- c("First", "Second")

In [92]:
rownames(B) <- c("One","Two","Three","Four","Five")

In [100]:
colnames(C)

In [98]:
rownames(C)

In [103]:
LETTERS[1:26]

In [99]:
colnames(C) <- LETTERS[1:10]

In [97]:
rownames(C) <- c("Top","Bottom")

In [104]:
A

,Left,Right
1st,1,6
2nd,2,7
3rd,3,8
4th,4,9
5th,5,10


Notice the effect when transposing a matrix and multiplying matrices. Transposing naturally flips
the row and column names. Matrix multiplication keeps the row names from the left matrix and the
column names from the right matrix.

In [105]:
t(A)

,1st,2nd,3rd,4th,5th
Left,1,2,3,4,5
Right,6,7,8,9,10


In [106]:
C

,A,B,C,D,E,F,G,H,I,J
Top,21,23,25,27,29,31,33,35,37,39
Bottom,22,24,26,28,30,32,34,36,38,40


In [107]:
A %*% C

,A,B,C,D,E,F,G,H,I,J
1st,153,167,181,195,209,223,237,251,265,279
2nd,196,214,232,250,268,286,304,322,340,358
3rd,239,261,283,305,327,349,371,393,415,437
4th,282,308,334,360,386,412,438,464,490,516
5th,325,355,385,415,445,475,505,535,565,595


### 2.4 Arrays


An array is essentially a multidimensional vector. It must all be of the same type, and individual
elements are accessed in a similar fashion using square brackets. The first element is the row index, the
second is the column index and the remaining elements are for outer dimensions.

In [111]:
theArray <- array(1:12, dim=c(2, 3, 2)) # 2 rows, 3 columns, 2 layers

In [112]:
theArray

The values 1 through 12 fill the array in order: down the rows first, then across the columns, then into the next layer.

- dim 1 (rows): 2
- dim 2 (columns): 3
- dim 3 (layers): 2

Layer 1 therefore holds the columns (1, 2), (3, 4), (5, 6), and layer 2 holds the columns (7, 8), (9, 10), (11, 12).

In [114]:
theArray[1, , 1]

In [115]:
theArray[, , 1] # empty: return all values for that dim, val: return a target value in that index

0,1,2
1,3,5
2,4,6
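
Indexing with all three positions filled in returns a single value, and leaving positions empty returns a slice. A brief sketch on the same 2x3x2 array, under an illustrative name:

```r
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)
one_value   <- arr[2, 3, 1]   # row 2, column 3, layer 1
other_value <- arr[1, 2, 2]   # row 1, column 2, layer 2
one_value
other_value

second_cols <- arr[, 2, ]     # column 2 from every layer, as a 2x2 matrix
second_cols
```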


## 3 Reading Data into R


### 3.1 Reading CSVs


In [116]:
theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
tomato <- read.table(file=theUrl, header=TRUE, sep=",")

In [117]:
tomato

Round,Tomato,Price,Source,Sweet,Acid,Color,Texture,Overall,Avg.of.Totals,Total.of.Avg
<int>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,Simpson SM,3.99,Whole Foods,2.8,2.8,3.7,3.4,3.4,16.1,16.1
1,Tuttorosso (blue),2.99,Pioneer,3.3,2.8,3.4,3.0,2.9,15.3,15.3
1,Tuttorosso (green),0.99,Pioneer,2.8,2.6,3.3,2.8,2.9,14.3,14.3
1,La Fede SM DOP,3.99,Shop Rite,2.6,2.8,3.0,2.3,2.8,13.4,13.4
2,Cento SM DOP,5.49,D Agostino,3.3,3.1,2.9,2.8,3.1,14.4,15.2
2,Cento Organic,4.99,D Agostino,3.2,2.9,2.9,3.1,2.9,15.5,15.1
2,La Valle SM,3.99,Shop Rite,2.6,2.8,3.6,3.4,2.6,14.7,14.9
2,La Valle SM DOP,3.99,Faicos,2.1,2.7,3.1,2.4,2.2,12.6,12.5
3,Stanislaus Alta Cucina,4.53,Restaurant Depot,3.4,3.3,4.1,3.2,3.7,17.8,17.7
3,Ciao,,Other,2.6,2.9,3.4,3.3,2.9,15.3,15.2


In [118]:
library(data.table)

In [121]:
theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
tomato3 <- fread(input=theUrl, sep=',', header=TRUE)

In [122]:
head(tomato3)

Round,Tomato,Price,Source,Sweet,Acid,Color,Texture,Overall,Avg of Totals,Total of Avg
<int>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,Simpson SM,3.99,Whole Foods,2.8,2.8,3.7,3.4,3.4,16.1,16.1
1,Tuttorosso (blue),2.99,Pioneer,3.3,2.8,3.4,3.0,2.9,15.3,15.3
1,Tuttorosso (green),0.99,Pioneer,2.8,2.6,3.3,2.8,2.9,14.3,14.3
1,La Fede SM DOP,3.99,Shop Rite,2.6,2.8,3.0,2.3,2.8,13.4,13.4
2,Cento SM DOP,5.49,D Agostino,3.3,3.1,2.9,2.8,3.1,14.4,15.2
2,Cento Organic,4.99,D Agostino,3.2,2.9,2.9,3.1,2.9,15.5,15.1
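
When no network connection is available, the same readers can be tried on inline text. The sketch below passes a three-line excerpt of the tomato data through the text argument of read.table and read.csv; the excerpt and variable names are only for illustration. Note how the blank Price field is read as NA.

```r
csv_text <- "Round,Tomato,Price
1,Simpson SM,3.99
3,Ciao,"

# read.table needs the header and separator spelled out.
small <- read.table(text = csv_text, header = TRUE, sep = ",",
                    stringsAsFactors = FALSE)
small
is.na(small$Price)

# read.csv is read.table with CSV-friendly defaults.
same <- read.csv(text = csv_text)
names(same)
```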


### 3.2 Excel Data


In [123]:
download.file(url='http://www.jaredlander.com/data/ExcelExample.xlsx', destfile='/content/sample_data/ExcelExample.xlsx', method='auto')

In [124]:
library(readxl)

In [125]:
excel_sheets('/content/sample_data/ExcelExample.xlsx')

In [130]:
tomatoXL <-read_excel('/content/sample_data/ExcelExample.xlsx', sheet=1)

In [131]:
tomatoXL

Round,Tomato,Price,Source,Sweet,Acid,Color,Texture,Overall,Avg of Totals,Total of Avg
<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,Simpson SM,3.99,Whole Foods,2.8,2.8,3.7,3.4,3.4,16.1,16.1
1,Tuttorosso (blue),2.99,Pioneer,3.3,2.8,3.4,3.0,2.9,15.3,15.3
1,Tuttorosso (green),0.99,Pioneer,2.8,2.6,3.3,2.8,2.9,14.3,14.3
1,La Fede SM DOP,3.99,Shop Rite,2.6,2.8,3.0,2.3,2.8,13.4,13.4
2,Cento SM DOP,5.49,D Agostino,3.3,3.1,2.9,2.8,3.1,14.4,15.2
2,Cento Organic,4.99,D Agostino,3.2,2.9,2.9,3.1,2.9,15.5,15.1
2,La Valle SM,3.99,Shop Rite,2.6,2.8,3.6,3.4,2.6,14.7,14.9
2,La Valle SM DOP,3.99,Faicos,2.1,2.7,3.1,2.4,2.2,12.6,12.5
3,Stanislaus Alta Cucina,4.53,Restaurant Depot,3.4,3.3,4.1,3.2,3.7,17.8,17.7
3,Ciao,,Other,2.6,2.9,3.4,3.3,2.9,15.3,15.2


In [128]:
wineXL1 <- read_excel('/content/sample_data/ExcelExample.xlsx', sheet=2)

In [129]:
head(wineXL1)

Cultivar,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450


In [132]:
wineXL1 <- read_excel('/content/sample_data/ExcelExample.xlsx', sheet='Wine')

In [133]:
head(wineXL1)

Cultivar,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450


### 3.3 Reading from Databases


In [134]:
install.packages('RSQLite') # SQLite driver; other databases (SQL Server, Oracle, ...) use other packages

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘plogr’




In [135]:
library(RSQLite)

In [136]:
download.file("http://www.jaredlander.com/data/diamonds.db",destfile = "/content/sample_data/diamonds.db", mode='wb')

In [137]:
drv <- dbDriver('SQLite')

In [138]:
class(drv)

In [139]:
con <- dbConnect(drv,'/content/sample_data/diamonds.db')

In [140]:
class(con)

In [141]:
dbListTables(con)

In [142]:
dbListFields(con, name='diamonds')

In [143]:
dbListFields(con, name='DiamondColors')

In [144]:
diamondsTable <- dbGetQuery(con,"SELECT * FROM diamonds",stringsAsFactors=FALSE)

In [145]:
diamondsTable

carat,cut,color,clarity,depth,table,price,x,y,z
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.20,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
0.24,Very Good,J,VVS2,62.8,57,336,3.94,3.96,2.48
0.24,Very Good,I,VVS1,62.3,57,336,3.95,3.98,2.47
0.26,Very Good,H,SI1,61.9,55,337,4.07,4.11,2.53
0.22,Fair,E,VS2,65.1,61,337,3.87,3.78,2.49
0.23,Very Good,H,VS1,59.4,61,338,4.00,4.05,2.39


In [146]:
colorTable <- dbGetQuery(con,"SELECT * FROM DiamondColors",stringsAsFactors=FALSE)

In [147]:
colorTable

Color,Description,Details
<chr>,<chr>,<chr>
D,Absolutely Colorless,No color
E,Colorless,Minute traces of color
F,Colorless,Minute traces of color
G,Near Colorless,Color is dificult to detect
H,Near Colorless,Color is dificult to detect
I,Near Colorless,Slightly detectable color
J,Near Colorless,Slightly detectable color
K,Faint Color,Noticeable color
L,Faint Color,Noticeable color
M,Faint Color,Noticeable color


In [148]:
longQuery <- "SELECT * FROM diamonds, DiamondColors
WHERE
diamonds.color = DiamondColors.Color"

In [149]:
diamondsJoin <- dbGetQuery(con, longQuery, stringsAsFactors=FALSE)

In [151]:
diamondsJoin

carat,cut,color,clarity,depth,table,price,x,y,z,Color,Description,Details
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43,E,Colorless,Minute traces of color
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31,E,Colorless,Minute traces of color
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31,E,Colorless,Minute traces of color
0.29,Premium,I,VS2,62.4,58,334,4.20,4.23,2.63,I,Near Colorless,Slightly detectable color
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75,J,Near Colorless,Slightly detectable color
0.24,Very Good,J,VVS2,62.8,57,336,3.94,3.96,2.48,J,Near Colorless,Slightly detectable color
0.24,Very Good,I,VVS1,62.3,57,336,3.95,3.98,2.47,I,Near Colorless,Slightly detectable color
0.26,Very Good,H,SI1,61.9,55,337,4.07,4.11,2.53,H,Near Colorless,Color is dificult to detect
0.22,Fair,E,VS2,65.1,61,337,3.87,3.78,2.49,E,Colorless,Minute traces of color
0.23,Very Good,H,VS1,59.4,61,338,4.00,4.05,2.39,H,Near Colorless,Color is dificult to detect


In [150]:
head(diamondsJoin)

,carat,cut,color,clarity,depth,table,price,x,y,z,Color,Description,Details
,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43,E,Colorless,Minute traces of color
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31,E,Colorless,Minute traces of color
3,0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31,E,Colorless,Minute traces of color
4,0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63,I,Near Colorless,Slightly detectable color
5,0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75,J,Near Colorless,Slightly detectable color
6,0.24,Very Good,J,VVS2,62.8,57,336,3.94,3.96,2.48,J,Near Colorless,Slightly detectable color
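
The same connect, query, join workflow can be tried without downloading a file by using an in-memory SQLite database. This is a sketch that assumes the DBI and RSQLite packages are installed; the table names stones and colors and their few rows are made up for illustration, and the join is written with an explicit INNER JOIN rather than the comma form used above.

```r
library(DBI)

# ":memory:" creates a temporary database that vanishes on disconnect.
con_mem <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con_mem, "stones",
             data.frame(id = 1:3, color = c("E", "I", "E")))
dbWriteTable(con_mem, "colors",
             data.frame(Color = c("E", "I"),
                        Description = c("Colorless", "Near Colorless")))

joined <- dbGetQuery(con_mem, "
  SELECT s.id, s.color, c.Description
  FROM stones AS s
  INNER JOIN colors AS c ON s.color = c.Color
  ORDER BY s.id")
joined

# Closing the connection releases the database.
dbDisconnect(con_mem)
```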


### 3.4 Data from Other Statistical Tools


In an ideal world another tool besides R would never be needed, but in reality data are sometimes locked
in a proprietary format such as those from SAS, SPSS or Octave. The foreign package provides a number
of functions similar to read.table to read in data from other tools.
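
As a rough sketch of how these readers are used, the example below round-trips a small made-up data frame through a Stata file. Stata is chosen here only because foreign can both write and read that format, which keeps the example self-contained; for the tools named above, the package provides read.spss, read.ssd, read.xport and read.octave.

```r
library(foreign)

# Made-up data, written to a temporary Stata file so there is something to read.
ratings <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))
dta_path <- tempfile(fileext = ".dta")
write.dta(ratings, dta_path)

# read.dta returns an ordinary data.frame, much like read.table.
ratings_back <- read.dta(dta_path)
class(ratings_back)
ratings_back$score
```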

### 3.5 R Binary Files


In [152]:
save(tomato, file="/content/sample_data/tomato.rdata") # save the tomato data.frame to disk as a binary file

In [153]:
rm(tomato) #remove tomato from memory

In [154]:
head(tomato)

ERROR: Error: object 'tomato' not found


In [None]:
load("/content/sample_data/tomato.rdata")
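
The full save, remove, load cycle can be sketched on a temporary file, so it runs anywhere. The data frame tasting is made up for illustration. The second half shows saveRDS and readRDS, a base R alternative that stores a single object and lets the caller choose the name it is restored under.

```r
tasting <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))  # made-up data

# save() and load() restore objects under their original names.
rdata_path <- tempfile(fileext = ".rdata")
save(tasting, file = rdata_path)
rm(tasting)
gone <- !exists("tasting")
restored <- load(rdata_path)   # load() returns the names it restored
restored

# saveRDS() and readRDS() handle one object and return it as a value.
rds_path <- tempfile(fileext = ".rds")
saveRDS(tasting, rds_path)
tasting_copy <- readRDS(rds_path)
identical(tasting_copy, tasting)
```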

### 3.6 Data Included with R


R and some packages come with data included, so we can easily have data to use. Accessing this data is
simple as long as we know what to look for. ggplot2, for instance, comes with a dataset about diamonds.
It can be loaded using the data function.

In [155]:
data(diamonds, package='ggplot2')

In [156]:
head(diamonds)

carat,cut,color,clarity,depth,table,price,x,y,z
<dbl>,<ord>,<ord>,<ord>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
0.24,Very Good,J,VVS2,62.8,57,336,3.94,3.96,2.48
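
To find out what is available, the data function can also list the datasets a package provides. The sketch below uses the datasets package, which is part of base R, so it runs without installing anything; mtcars is used purely as an example.

```r
# List the datasets bundled with the base 'datasets' package;
# the result holds a character matrix with columns such as Item and Title
bundled <- data(package = "datasets")$results
head(bundled[, c("Item", "Title")])

# Load one of them by name, the same way diamonds was loaded from ggplot2
data(mtcars, package = "datasets")
head(mtcars)
```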


### 3.7 Extract Data from Web Sites


#### Scraping Web Data

In [157]:
install.packages('rvest')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [158]:
library(rvest)

In [159]:
ribalta <- read_html('http://www.jaredlander.com/data/ribalta.html')

In [160]:
ribalta

{html_document}
<html xmlns="http://www.w3.org/1999/xhtml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\r\n<ul>\n<li class="address">\r\n    <span class="street">48 E 12t ...

In [161]:
class(ribalta)

In [162]:
ribalta %>% html_nodes('ul') %>% html_nodes('span')

{xml_nodeset (6)}
[1] <span class="street">48 E 12th St</span>
[2] <span class="city">New York</span>
[3] <span class="zip">10003</span>
[4] <span>\r\n    \t<span id="latitude" value="40.733384"></span>\r\n    \t<s ...
[5] <span id="latitude" value="40.733384"></span>
[6] <span id="longitude" value="-73.9915618"></span>

In [163]:
ribalta %>% html_nodes('.street')

{xml_nodeset (1)}
[1] <span class="street">48 E 12th St</span>

In [164]:
ribalta %>% html_nodes('.street') %>% html_text()

In [165]:
ribalta %>% html_nodes('#longitude') %>% html_attr('value')

In [166]:
ribalta %>% html_nodes('table.food-items') %>% magrittr::extract2(5) %>% html_table()

X1,X2,X3
<chr>,<chr>,<dbl>
Marinara Pizza Rosse,"basil, garlic and oregano.",9
Doc Pizza Rosse,buffalo mozzarella and basil.,15
Vegetariana Pizza Rosse,"mozzarella cheese, basil and baked vegetables.",15
Brigante Pizza Rosse,"mozzarella cheese, salami and spicy oil.",15
Calzone Pizza Rosse,"ricotta, mozzarella cheese, prosciutto cotto and black pepper.",16
Americana Pizza Rosse,"mozzarella cheese, wurstel and fries.",16
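
The selectors used above follow CSS conventions: a plain name such as 'ul' matches a tag, '.street' matches a class, and '#longitude' matches an id. The sketch below repeats the same calls on a small inline HTML string, so it does not depend on the web page being reachable. The HTML fragment is an assumption modeled on the output shown above, not a copy of the real page.

```r
library(rvest)

# A small hand-written HTML fragment, parsed from a string instead of a URL
page <- read_html('
<ul>
  <li class="address">
    <span class="street">48 E 12th St</span>
    <span class="city">New York</span>
    <span id="longitude" value="-73.9915618"></span>
  </li>
</ul>')

# Select by class, then pull out the text inside the matching node
streetText <- page %>% html_nodes('.street') %>% html_text()
streetText

# Select by id, then pull out the value of an attribute
longitude <- page %>% html_nodes('#longitude') %>% html_attr('value')
longitude
```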


### 3.8 Reading JSON Data


In [167]:
install.packages('jsonlite')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [168]:
library(jsonlite)

In [169]:
pizza <- fromJSON('http://www.jaredlander.com/data/PizzaFavorites.json')

In [170]:
pizza

Unnamed: 0_level_0,Name,Details
Unnamed: 0_level_1,<chr>,<list>
1,Di Fara Pizza,"1424 Avenue J, Brooklyn , NY , 11230"
2,Fiore's Pizza,"165 Bleecker St, New York , NY , 10012"
3,Juliana's,"19 Old Fulton St, Brooklyn , NY , 11201"
4,Keste Pizza & Vino,"271 Bleecker St, New York , NY , 10014"
5,L & B Spumoni Gardens,"2725 86th St, Brooklyn , NY , 11223"
6,New York Pizza Suprema,"413 8th Ave, New York , NY , 10001"
7,Paulie Gee's,"60 Greenpoint Ave, Brooklyn , NY , 11222"
8,Ribalta,"48 E 12th St, New York , NY , 10003"
9,Totonno's,"1524 Neptune Ave, Brooklyn , NY , 11224"


In [171]:
class(pizza)

In [172]:
class(pizza$Name)

In [173]:
class(pizza$Details)

In [175]:
pizza$Details

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,1424 Avenue J,Brooklyn,NY,11230

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,165 Bleecker St,New York,NY,10012

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,19 Old Fulton St,Brooklyn,NY,11201

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,271 Bleecker St,New York,NY,10014

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,2725 86th St,Brooklyn,NY,11223

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,413 8th Ave,New York,NY,10001

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,60 Greenpoint Ave,Brooklyn,NY,11222

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,48 E 12th St,New York,NY,10003

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,1524 Neptune Ave,Brooklyn,NY,11224


In [176]:
pizza$Details[[1]]

Unnamed: 0_level_0,Address,City,State,Zip
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,1424 Avenue J,Brooklyn,NY,11230


In [174]:
class(pizza$Details[[1]])
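
Because Details is a list column holding one small data.frame per row, it is often convenient to flatten it into ordinary columns. The sketch below shows one possible way to do that on an inline JSON string. The JSON layout is an assumption inferred from the output above, and the names pizzaDemo and pizzaFlat are illustrative.

```r
library(jsonlite)

# Inline JSON with the assumed layout: each entry has a Name and a
# Details array holding a single address record
pizzaJson <- '[
  {"Name": "Di Fara Pizza",
   "Details": [{"Address": "1424 Avenue J", "City": "Brooklyn", "State": "NY", "Zip": "11230"}]},
  {"Name": "Ribalta",
   "Details": [{"Address": "48 E 12th St", "City": "New York", "State": "NY", "Zip": "10003"}]}
]'

pizzaDemo <- fromJSON(pizzaJson)
class(pizzaDemo$Details)

# Stack the per-row data.frames and attach the names alongside them
pizzaFlat <- data.frame(
    Name = pizzaDemo$Name,
    do.call(rbind, pizzaDemo$Details),
    stringsAsFactors = FALSE
)
pizzaFlat
```

This works here because every Details entry has exactly one row with the same columns; entries with several rows or differing fields would need extra handling.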