# STA 141A Fundamentals of Statistical Data Science

### Lecture 2, 10/3/23, Vectors and Matrices

### Announcements

- HW1 online, due this Friday 11:59 PM
- Typo in HW1, see Piazza @10
- Office hours: 
    * Peter Kramlinger: T 10:45-11:45 AM, MSB 1143
    * Emily Chang: R 11 AM - 12 PM, MSB 1117
    * Pranash Ramesh: F 10 AM - 11 AM, MSB 1117

### Last week's topics

- Basics of R

Types: 

Vectors: 

### Today's topics

- Vectors
- Matrices
    - Creation
    - Subsetting
    - Operations
- Lists
- Arrays

Last week, we ended with summary statistics. Alternatively, you can perform arithmetic operations on the vector itself.

In [None]:
x <- c(5, pi, -6, 2.4, 2L, 1, 12, 99, NA)

In [None]:
x + 1

In [None]:
x + rep(c(7,4,4), times = 3)

In [None]:
x + c(1,2) # throws warning, but still gives output! 

In [None]:
x^2

In [None]:
x[3] = 3
sqrt(x)
log(x)
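
The warning in `x + c(1,2)` comes from R's recycling rule: the shorter vector is repeated until it is as long as the longer one. As a minimal sketch, `rep_len` can spell out that implicit step. The names `short`, `recycled`, `manual` and `implicit` are chosen here for illustration only.

```r
x <- c(5, pi, -6, 2.4, 2L, 1, 12, 99, NA)
short <- c(1, 2)

# Repeat `short` until it has as many entries as x. Since 9 is not a
# multiple of 2, R warns when it does this implicitly.
recycled <- rep_len(short, length(x))

manual   <- x + recycled
implicit <- suppressWarnings(x + short)  # same computation, warning silenced

identical(manual, implicit)
```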

#### Combinatorics

In [None]:
x[3] = -6
x

In [None]:
sort(x) # ?sort

In [None]:
order(x) # ?order

In [None]:
x[order(x)]
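
`sort(x)` and `x[order(x)]` look interchangeable, but they treat `NA` differently by default. The following sketch, with the illustrative names `sorted` and `ordered`, makes the difference visible.

```r
x <- c(5, pi, -6, 2.4, 2, 1, 12, 99, NA)

sorted  <- sort(x)      # sort() drops NA by default
ordered <- x[order(x)]  # order() places the index of NA last, so NA is kept

length(sorted)   # 8
length(ordered)  # 9

# Asking sort() to keep NA at the end reproduces x[order(x)]
identical(sort(x, na.last = TRUE), ordered)
```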

In [None]:
rev(x) # check ?rev

In [None]:
set.seed(123) # setting the seed allows replication of pseudo-random results
sample(x, size = 5, replace = TRUE) # -6 is sampled twice! 

#### Subsetting Vectors

Elements in vectors can be accessed via square brackets `[...]`. In contrast to Python, the first element has index `1`.


In [None]:
x

In [None]:
x[5]

In [None]:
x[c(6, 2:4)] # we can subset x by indexing multiple entries

In [None]:
x[-c(4:6)] # drop the fourth through sixth elements

In [None]:
x

In [None]:
which.max(x)

In [None]:
x

In [None]:
x[which.max(x)] # check ?which.max
max(x, na.rm = T)

To ease illustration, let's remove the `NA` value in `x`:

In [None]:
x <- na.omit(x)
x

As an alternative to indexing, elements in vectors can also be accessed via logical vectors: 

In [None]:
x > 3

In [None]:
x[x > 3] # x > 3 returns a logical vector

In [None]:
x[c(F, F, T, T, T, F, F, F)] 

In [None]:
x>3
which(x>3)

In [None]:
x[x>3]
x[which(x>3)]

In [None]:
abs(x[x>3] - x[which(x>3)]) > 10^-3 # boolean operations on vectors are element-wise
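
Logical subsetting and `which` agree here because `x` no longer contains `NA`. With missing values they differ, as this small sketch suggests; the vector `v` is a made-up example, not data from the lecture.

```r
v <- c(5, NA, 1, 12)       # hypothetical vector with a missing value

v > 3                      # TRUE NA FALSE TRUE
by_logical <- v[v > 3]         # an NA in the logical index yields an NA entry
by_which   <- v[which(v > 3)]  # which() only returns positions that are TRUE

by_logical  # 5 NA 12
by_which    # 5 12
```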

In [None]:
c(5, pi) %in% x # checks whether 5 and the constant pi are in x

Accessing the last entry is tricky as there is no `end` analogue. 

In [None]:
x[length(x)]
rev(x)[1]
tail(x, 1) # check ?tail and ?head

In [None]:
x[length(x)-1]
rev(x)[2]

#### Categorical vectors

Factors are classifications of categorical entries of a vector. Once created, a factor can only assume the pre-defined values, the so-called levels. Although mostly used for strings, factors can also be built from numbers.

In [None]:
set.seed(2022)
x <- sample(c("terrible", "bad", "awesome", "indifferent"), 10, replace = T) 

In [None]:
str(x) # character vector
x

In [None]:
y <- factor(x)
str(y) # factor vector

In [None]:
levels(y)

Although `indifferent` comes before `awesome` in `x`, the levels are alphabetically sorted. We can prevent this by calling: 

In [None]:
z <- factor(x, levels = unique(x)) # check ?unique
levels(z)
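
Internally, a factor stores integer codes that point into its levels, which is why the order of the levels matters. A small sketch on a made-up character vector `ratings` (not the sampled `x` above) illustrates this.

```r
ratings <- c("bad", "awesome", "bad", "indifferent")  # hypothetical data

y <- factor(ratings)                            # levels sorted alphabetically
levels(y)      # "awesome" "bad" "indifferent"
as.integer(y)  # 2 1 2 3

z <- factor(ratings, levels = unique(ratings))  # levels in order of appearance
levels(z)      # "bad" "awesome" "indifferent"
as.integer(z)  # 1 2 1 3
```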

Note that `terrible` has not been sampled, and consequently no `terrible` level has been generated. In such cases, or if we foresee that other levels will be needed further downstream, we can define them up front:

In [None]:
z <- factor(x, levels = c("awesome", "indifferent", "bad", "terrible"), ordered = TRUE) 
levels(z) # levels do not have to be strings!

In [None]:
attributes(z)

The `ordered` argument allows us to order entries. In this case, `terrible` is the highest (worst) possible entry in `z`, but the highest present is `bad`.

Ordered levels allow functions that require an ordinal structure. 

In [None]:
max(z)

The use of factors prevents typos and enhances data cleanliness.

In [None]:
z[4] <- "terrrible" # typo! R warns about the invalid level and stores NA
z
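
The protection against typos can be checked directly. The sketch below uses a short made-up factor (same levels as above, different entries) and silences the warning only to keep the output compact.

```r
z <- factor(c("bad", "awesome", "bad"),
            levels = c("awesome", "indifferent", "bad", "terrible"),
            ordered = TRUE)

# A misspelled level is rejected: R warns and stores NA instead
suppressWarnings(z[2] <- "terrrible")
was_na <- is.na(z[2])
was_na  # TRUE

# A correctly spelled level is accepted, even if it was not present before
z[2] <- "terrible"
max(z)  # terrible
```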

R is particularly useful in visualizing data structures. Categorical data can be summarized via `table`. 

In [None]:
z

In [None]:
table(z)
z
x

In [None]:
set.seed(123)
x <- sample(c("apple", "banana"), 10, replace = T)
table(z, x)

### Matrices

Matrices are a collection of vectors of the same length and type. They are generated by partitioning a vector with the command `matrix`: 

In [None]:
set.seed(2022)
x <- sample(1:10, 12, replace = TRUE)
x

In [None]:
attributes(x)

In [None]:
matrix(data = x, nrow = 3, ncol = 4) # breaks x by column

In [None]:
matrix(x, nr = 3, nc = 4, byrow = T) # fills the matrix by row; nr and nc partially match nrow and ncol
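
To see what `byrow` changes, it can help to fill a matrix with `1:12` instead of random draws. This is only a sketch; `v`, `by_col` and `by_row` are illustrative names.

```r
v <- 1:12

by_col <- matrix(v, nrow = 3, ncol = 4)                # default: fill column by column
by_row <- matrix(v, nrow = 3, ncol = 4, byrow = TRUE)  # fill row by row

by_col[1, ]  # 1 4 7 10
by_row[1, ]  # 1 2 3 4

# Filling by row equals transposing a column-filled matrix with swapped dimensions
identical(by_row, t(matrix(v, nrow = 4, ncol = 3)))
```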

In [None]:
A <- matrix(1:12, 3, 4)
colnames(A) <- c("A", "B", "C", "D")
rownames(A) <- c("a", "b", "c")
A

In [None]:
str(A)

In [None]:
attributes(A)

In [None]:
nrow(A)
ncol(A)
dim(A) # returns a vector of length 2: number of rows, number of columns

In [None]:
tail(A, 2)

In [None]:
cbind(A, E = 1:3) # ?cbind adds a new column, here with the name E

In [None]:
rbind(A, d = 1:4) 

In [None]:
A

In [None]:
diag(A) # extracts the diagonal from a matrix

In [None]:
diag(1:4) # generates a (diagonal) matrix

#### Subsetting 

The entries in a matrix can be accessed in two ways: 
i) By specifying the index of the vector of stacked column vectors, 
ii) by specifying row and column index.

In [None]:
A
a <- as.vector(A) # vector of stacked columns
a

In [None]:
A[8] # variant i), with index

In [None]:
A[c(F, F, F, F, F, F, F, T, F, F, F, F)] # variant i), with logical vector

In [None]:
A[2:3, 3:4] # variant ii): rows 2 to 3, columns 3 to 4

In [None]:
A[c(F, T, T), c(F, F, T, T)] # variant ii) 

In [None]:
A[c("b", "c"), c("C", "D")] # variant ii), with row and column names
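
The two variants are linked by the column-major storage of matrices: entry `(i, j)` sits at position `i + (j - 1) * nrow(A)` of the stacked vector. A minimal sketch with the base function `arrayInd` shows the conversion; `idx`, `pos`, `i` and `j` are illustrative names.

```r
A <- matrix(1:12, 3, 4)

idx <- 8
pos <- arrayInd(idx, dim(A))  # converts a linear index to (row, column)
i <- pos[1, 1]                # 2
j <- pos[1, 2]                # 3

A[i, j] == A[idx]             # both variants return the same entry
i + (j - 1) * nrow(A) == idx  # column-major position formula
```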

By accessing either a row or a column vector of a matrix, we can employ the techniques we learned last lecture for accessing vectors:

In [None]:
x <- A[,"C"]
x
str(x)

In [None]:
which(A == 8)

In [None]:
which(x == 8) # returns the index within the column vector x, which is the row index in A

As for vectors, logical operations on matrices are element-wise. 

In [None]:
A > 5 

#### Operations

As for vectors, statistics can be computed for the whole data provided in the matrix. 

In [None]:
A

In [None]:
mean(A)
sum(A) / length(A)

In [None]:
prod(A) 

The values in the columns or rows are often of a different nature and should be aggregated separately.

In [None]:
colMeans(A)
colSums(A)

In [None]:
t(A) # transpose

In [None]:
rowMeans(A)
rowSums(A)

In [None]:
colMeans(t(A))
colSums(t(A))

A very useful function for matrices is `apply`. It evaluates a function on each row or each column of the matrix in turn.

In [None]:
?apply

In [None]:
apply(A, MARGIN = 1, FUN = function(x) which(x == 4)) # MARGIN = 1 refers to rows

In [None]:
apply(A, 2, prod) # MARGIN = 2 refers to columns
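
As a sanity check on the `MARGIN` argument, `apply` can reproduce the dedicated row and column functions from above. The sketch uses an unnamed copy of the matrix so that the results carry no names.

```r
A <- matrix(1:12, 3, 4)

row_sums  <- apply(A, 1, sum)   # one value per row
col_means <- apply(A, 2, mean)  # one value per column

row_sums   # 22 26 30
col_means  # 2 5 8 11

all.equal(row_sums, rowSums(A))
all.equal(col_means, colMeans(A))
```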

Matrix notation is commonly used to write formulae compactly, and R supports matrix algebra directly.

In [None]:
dim(A)

In [None]:
set.seed(123)
B <- matrix(sample(1:4, 9, TRUE), 3, 3)
dim(B)

In [None]:
A %*% B # throws error: A is 3 x 4 and B is 3 x 3, so the arguments are non-conformable

In [None]:
B %*% A # matrix product B A # note the preserved column names!

In [None]:
t(A) %*% B # matrix product of the transpose of A times B

Element-wise operations are computed with standard operators: 

In [None]:
A

In [None]:
3 * A

In [None]:
A[,-4] * B # element-wise operation
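
The difference between `%*%` and `*` is easiest to see on a single entry: entry `(i, j)` of a matrix product is the sum over the element-wise product of row `i` of the left factor and column `j` of the right factor. The sketch below uses a fixed, made-up matrix `B` rather than the sampled one, so the numbers are reproducible by hand.

```r
A <- matrix(1:12, 3, 4)
B <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4), 3, 3)  # hypothetical 3 x 3 matrix, filled by column

P <- B %*% A                   # (3 x 3) times (3 x 4) gives a 3 x 4 matrix
dim(P)

i <- 2; j <- 3
P[i, j]                        # 33
sum(B[i, ] * A[, j])           # same value: 0*7 + 3*8 + 1*9

# The reversed product is not defined for these dimensions
inherits(try(A %*% B, silent = TRUE), "try-error")
```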

If a matrix is non-singular, its inverse can be computed using `solve`.

In [None]:
solve(B)

In [None]:
C <- solve(B) %*% B

In [None]:
mean((C - diag(diag(C)))^2) # mean squared deviation of C from a diagonal matrix; numerically close to zero
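
Because of floating-point rounding, a product such as `solve(B) %*% B` is usually compared with the identity via `all.equal` rather than `==`. The following sketch again uses a fixed, made-up `B` (determinant 25, hence non-singular) and also shows `solve` with a right-hand side `b`, which avoids forming the inverse explicitly.

```r
B <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4), 3, 3)  # hypothetical non-singular matrix
B_inv <- solve(B)

# Equal to the 3 x 3 identity up to numerical tolerance
isTRUE(all.equal(B_inv %*% B, diag(3)))

# Solving B x = b directly gives the same result as multiplying by the inverse
b <- c(1, 2, 3)
x_direct  <- solve(B, b)
x_inverse <- as.vector(B_inv %*% b)
isTRUE(all.equal(x_direct, x_inverse))
```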

### Lists

Lists differ from vectors and matrices in that their elements can be of any type. Their constructor is `list`. They are a step up in complexity from atomic vectors, because lists can contain other lists.

In [None]:
x <- c(name = "Fred", wife = "Mary", children_number = 3, children_age = c(4, 7, 9), resident = TRUE) # c() coerces all entries to character
x

In [None]:
x <- list(name = "Fred", 
          wife = "Mary", 
          children_number = 3,
          children_age = c(4, 7, 9), 
          resident = TRUE, 
          TRUE)

In [None]:
str(x)

In [None]:
x$children_age # access entry by entry name

In [None]:
x[[4]] # children_age is the fourth entry of the list

In [None]:
x[["children_age"]]

In [None]:
x[1]

Note the difference in using single brackets and double brackets: 

In [None]:
x[[1]]

In [None]:
str(x[[1]])
str(x[1])
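
The practical consequence is that single brackets return a (shorter) list, while double brackets return the element itself. A small sketch on a reduced, illustrative version of the list:

```r
x <- list(name = "Fred", children_age = c(4, 7, 9))

sub  <- x["children_age"]    # still a list, of length 1
elem <- x[["children_age"]]  # the numeric vector stored inside

is.list(sub)      # TRUE
is.numeric(elem)  # TRUE
length(sub)       # 1
length(elem)      # 3

mean(elem)        # works on the vector; mean(sub) would not, since sub is a list
```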

In [None]:
attributes(x) # outputs a list of 1

In [None]:
attributes(x)$names

In [None]:
x

In [None]:
x$children <- list(number = x$children_number, 
                   age = x$children_age, 
                   names = c("Aaron", "Bob", "Charly"))

In [None]:
str(x)

One can apply a function on a list with `lapply`, if applicable: 

In [None]:
y <- lapply(x, length)
y

In [None]:
str(y)

Since the result is a list with entries of dimension one, it can be transformed to a vector using `unlist`: 

In [None]:
unlist(y) # again a named vector
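
The two steps `lapply` and `unlist` can also be combined with `sapply`, which simplifies the result to a vector when possible. A minimal sketch on a reduced, illustrative list:

```r
x <- list(name = "Fred", children_age = c(4, 7, 9), resident = TRUE)

lens_list <- lapply(x, length)  # list with one integer per entry
lens_vec  <- unlist(lens_list)  # named integer vector

lens_vec  # name = 1, children_age = 3, resident = 1

identical(lens_vec, sapply(x, length))
```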

In [None]:
dim(x)

### Arrays

Arrays are vectors with a dimension attribute. A matrix is the special case of an array with two dimensions.


In [None]:
dim(x) # a plain list has no dimension attribute, so this returns NULL

In [None]:
?array

In [None]:
z <- array(data = 1:(2 * 3 * 2), dim = c(2, 3, 2)) # generates two 2 times 3 matrices 
dim(z)

In [None]:
z[,,1]

In [None]:
z[,,2]

In [None]:
z[,,3] # throws error

The entries of the matrices can be accessed by their corresponding index: 

In [None]:
z[,1,2] # returns first column vector of second matrix

In [None]:
z[,,2][,1] # alternative method

In [None]:
z[,2:3,2]
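
An array can be viewed as a plain vector that carries a `dim` attribute. The sketch below builds the same object both ways and checks that the two indexing styles agree; `v` is an illustrative name.

```r
z <- array(data = 1:12, dim = c(2, 3, 2))

# Setting the dim attribute on a plain vector gives the same layout
v <- 1:12
dim(v) <- c(2, 3, 2)
identical(dim(v), dim(z))
all(v == z)

# Direct indexing and two-step indexing return the same column
z[, 1, 2]  # 7 8
identical(z[, 1, 2], z[, , 2][, 1])
```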

### Exercises

Set the seed as `2023`. 

Create a matrix $X$ of dimension $20\times 2$ of pseudo-random independent draws of the standard Gaussian distribution using the function `rnorm`. 
Add a column vector $(1, \dots, 1)^t$ of dimension 20 as the first column of $X$. 

Name the columns of $X$ as Intercept, Column 1 and Column 2. 

Create a vector $\beta = (4, 0.1, 12)^t$ and generate a pseudo-random independent Gaussian sample $y$ with $\mbox{E}(y) = X\beta$ using `rnorm`. 

Compute the least-squares estimator $\widehat\beta = (X^tX)^{-1}X^ty$ for $\beta$ and assess how close the estimate is to $\beta$. 