# STA 141A Fundamentals of Statistical Data Science

### Lecture 5, 10/12/23, Data Frames

### Announcements

- Sample exam is online.

### Today's topics

- Basic data frames
- Including packages
- `tibble`
- `dplyr`
- `tidyr`

### Basic data frames

A data frame is a list with class `data.frame` in which all components have the same length, so that each component contributes one variable (column) with the same number of rows. 
These components can be vectors (numeric, character, or logical), factors, or lists. 
Because of this rectangular shape, the function `dim` can be applied to data frames. 

In [None]:
df <- data.frame(x = 1:3, 
                 y = factor(c("a", "b", "c")), 
                 w = list(z = sin(1:3)))
df

In [None]:
class(df)

In [None]:
dim(df)
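
Because a data frame is a list of equal-length columns, list functions work on it directly. The following minimal sketch uses a small, made-up data frame `d` (not part of the lecture data) to illustrate this:

```r
# Hypothetical toy data frame for illustration
d <- data.frame(x = 1:3, y = factor(c("a", "b", "c")))

is.list(d)        # a data frame is a list ...
length(d)         # ... whose length is the number of columns
names(d)          # component names are the column names
sapply(d, class)  # each component keeps its own class
dim(d)            # rows and columns, as for a matrix
```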

Although a data frame is not a matrix, it can be thought of as a matrix whose columns may have differing modes and attributes. 

In [None]:
A <- matrix(1:12, 3,4)
df <- data.frame(A) # wrapper for as.data.frame, check ?as.data.frame
df

In [None]:
A  <- as.matrix(df)

In [None]:
A

Its rows and columns can be extracted using matrix conventions, but also using list conventions: 

In [None]:
str(df[,2])

In [None]:
df$X2
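
The two conventions can be compared side by side. This is a small sketch on a toy data frame `d` (a hypothetical name, built like the example above); the comments describe the type of each result:

```r
A <- matrix(1:12, 3, 4)
d <- data.frame(A)     # columns are named X1, ..., X4

d[2, ]                 # matrix convention: second row (a data frame)
d[, 2]                 # matrix convention: second column (a vector)
d[, 2, drop = FALSE]   # same column, but kept as a data frame
d[["X2"]]              # list convention: component by name (a vector)
d["X2"]                # list convention, single brackets (a data frame)
```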

External files can be read into R with the function `read.table` or `read.csv`, depending on the format. Similarly, files can be generated by exporting data frames with the functions `write.table` or `write.csv`.
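
As a minimal sketch of such a round trip, the following cell writes a made-up data frame to a temporary CSV file and reads it back. The data and the file location are assumptions chosen for illustration only:

```r
# Hypothetical toy data; the file is written to a temporary location
d <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
path <- tempfile(fileext = ".csv")

write.csv(d, path, row.names = FALSE)  # export without a row-name column
d2 <- read.csv(path)                   # import; the header is read by default

all.equal(d, d2)                       # compare original and re-imported data
unlink(path)                           # clean up the temporary file
```

Setting `row.names = FALSE` avoids an extra unnamed column of row names in the file, which would otherwise reappear as a column after import.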

R is extremely powerful in processing data frames. Some data sets are available in basic R, e.g., `mtcars`. 

Whatever you do, your first step should be to inspect the data set by calling `head` and to learn about the nature of the data. 

In [None]:
head(mtcars, 8) # check data with ?mtcars

In [None]:
dim(mtcars)

In [None]:
colMeans(mtcars) 

In [None]:
summary(mtcars)

### Including packages

R allows you to include packages that offer further features. 

Packages can be created and published by anyone. Generally, you should not execute foreign code on your local machine without prior consideration. We will carefully select any package we include in our analyses and obtain these from [CRAN](https://cran.r-project.org/). 

For now, we want to include the packages `tibble`, `dplyr` and `tidyr` for manipulating data frames. All three are packages from the larger package collection `tidyverse`. 

In [None]:
# install.packages("tibble")
# install.packages("dplyr")
# install.packages("tidyr")
# install.packages("tidyverse") # a lot of (useful) packages, takes time

In [None]:
library("tibble")
library("dplyr")
library("tidyr")

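Some function names are defined both in an already attached package and in the newly included package, in which case the newly included version masks the earlier one. For example, `dplyr::filter` masks `stats::filter`; to access the `filter` function from `stats`, you need to call `stats::filter`. 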
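
The difference between the two `filter` functions can be seen in a short sketch. It assumes that `dplyr` is installed; the object names `ma` and `four_cyl` are made up for this illustration:

```r
library("dplyr")  # attaching dplyr masks stats::filter

# stats::filter applies a linear filter, here a moving average of width 3
ma <- stats::filter(1:5, rep(1 / 3, 3))
ma

# dplyr::filter picks rows of a data frame by a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)
```

Writing the package name explicitly with `::` works regardless of the order in which packages were attached.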

If you try to include a package that has not yet been installed, `library` throws an error. If you would rather get a logical return value and a warning, you can use `require`. 

In [None]:
library("analogue") # I have not installed analogue 

In [None]:
require("dplyr")
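
The logical return value of `require` can be used to react to a missing package. The sketch below uses a made-up package name that is assumed not to be installed, and additionally shows `requireNamespace`, which checks availability without attaching the package:

```r
pkg <- "notARealPackage123"  # hypothetical package name, assumed not installed

# require returns FALSE (with a warning) instead of stopping with an error
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
if (!ok) message("Package ", pkg, " is not available.")

# requireNamespace checks availability without attaching the package
requireNamespace("stats", quietly = TRUE)
```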

### `tibble` 

Having included `tibble`, we can coerce the data frame `mtcars` to a `tibble`. 

In [None]:
tbl <- as_tibble(mtcars) # coerces the data frame; row names are dropped
#str(tbl)

In [None]:
head(tbl, 10) # note that printing tbl shows only the first 10 rows by default

In [None]:
mpg # columns are not visible at top level: error unless an object mpg exists

In [None]:
tbl$mpg

In [None]:
mpg <- 4

In [None]:
attach(tbl)

In [None]:
mpg # still 4: the global variable takes precedence over the attached column

In [None]:
hp # no global variable hp exists, so the attached column is found

In [None]:
detach(tbl)
mpg # the global variable remains after detaching
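
Since `attach` interacts with existing global variables, as the cells above show, a common alternative is `with`, which evaluates an expression inside the data frame without changing the search path. A minimal sketch using the base data set `mtcars`:

```r
mpg <- 4                  # a global variable with the same name as a column

with(mtcars, mean(mpg))   # inside with(), the column takes precedence
mean(mtcars$mpg)          # equivalent explicit form
mpg                       # the global variable is untouched
```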

Be careful when subsetting tibbles (or data frames): 

In [None]:
str(tbl[,1]) # this returns a tibble

In [None]:
str(tbl$mpg) # this returns a vector
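
The behavior differs between plain data frames and tibbles. The following sketch, which assumes `tibble` is installed and uses the made-up name `tb`, contrasts the classes of the results:

```r
library("tibble")

tb <- as_tibble(mtcars)

class(mtcars[, 1])                # data frame: a single column becomes a vector
class(tb[, 1])                    # tibble: a single column stays a tibble
class(tb[[1]])                    # double brackets always extract the vector
class(mtcars[, 1, drop = FALSE])  # drop = FALSE keeps the data frame
```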

Often, columns of data frames have to be coerced to the correct type prior to the analysis. 

In [None]:
mtcars$cyl <- as.integer(mtcars$cyl)
head(mtcars)
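
One coercion that frequently causes trouble is turning a factor with numeric labels into numbers. The sketch below uses a made-up factor `f` to show that the labels have to be converted to character first:

```r
f <- factor(c("10", "20", "10"))  # numbers stored as factor levels

as.integer(f)                     # internal level codes, not the values
as.integer(as.character(f))       # the values themselves
```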

### `dplyr` 

`dplyr` is a useful package to bring data into the form you need. We will show this using the data set from the package `nycflights13`. 

In [None]:
# install.packages("nycflights13") # run once if the package is not yet installed
library("nycflights13")

In [None]:
head(flights) # check ?flights 

We can elegantly access specific information within the data set.

In [None]:
# filter = pick observations by their value
filter(flights, month == 1, day == 1) # all flights on Jan 1st

In [None]:
filter(flights, month %in% c(11, 12)) # same as filter(flights, month == 11 | month == 12)

Often, several manipulations are applied in sequence. `dplyr` makes the pipe operator `%>%` (from the package `magrittr`) available as a convenient syntax extension for such situations. 

In [None]:
flights %>% 
    filter(month %in% c(11,12)) %>% 
    filter(air_time > 120)
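
The pipe passes its left-hand side as the first argument of the function on its right, so a piped chain is just a more readable way to write nested calls. A minimal sketch on the base data set `mtcars` (the names `piped` and `nested` are made up for this illustration):

```r
library("dplyr")

piped  <- mtcars %>% filter(cyl == 4) %>% filter(mpg > 30)
nested <- filter(filter(mtcars, cyl == 4), mpg > 30)

identical(piped, nested)  # both forms describe the same computation
```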

In [None]:
# reorder rows
flights %>% arrange(desc(arr_delay)) # check ?desc

In [None]:
# select columns
flights %>% dplyr::select(year, month, day) # takes select from dplyr package

In [None]:
# add new variable
flights %>% 
    mutate(gain = arr_delay - dep_delay,
           speed = distance / air_time * 60)

In [None]:
# summarize
flights %>% summarize(delay = mean(dep_delay, na.rm = TRUE))

Note that the result is a tibble with one row and one column. To access the value itself, we need to extract the column: 

In [None]:
(flights %>% summarize(delay = mean(dep_delay, na.rm = TRUE)))$delay
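
An alternative that stays within the pipe is `dplyr`'s `pull`, which extracts a single column as a vector. The sketch below uses `mtcars` instead of `flights` so that it runs without `nycflights13`; the names `avg` and `avg_mpg` are made up:

```r
library("dplyr")

avg <- mtcars %>%
    summarize(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
    pull(avg_mpg)   # extract the column as a plain vector

avg
```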

In [None]:
flights %>% 
    group_by(month) %>% # split by month # returns a grouped tibble
    summarize(delay = mean(dep_delay, na.rm = TRUE)) # summarize! 
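
Several summaries can be computed per group in one call, and `n()` counts the rows in each group. A small sketch on `mtcars`, where the column names `n`, `avg_mpg` and `max_hp` are chosen freely for illustration:

```r
library("dplyr")

by_cyl <- mtcars %>%
    group_by(cyl) %>%              # one group per number of cylinders
    summarize(n = n(),             # group size
              avg_mpg = mean(mpg), # mean fuel efficiency per group
              max_hp = max(hp))    # largest horsepower per group
by_cyl
```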

### `tidyr`

Tables can be easily rearranged using `tidyr`. 

In [None]:
table4a # check ?table4a

In [None]:
table4a %>% 
    pivot_longer(cols = c('1999', '2000'), 
                 names_to = 'year', 
                 values_to = 'cases')

The opposite is achieved with `pivot_wider`: 

In [None]:
table2

In [None]:
table2 %>% pivot_wider(names_from = "type", values_from = "count")
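
That the two functions undo each other can be checked with a round trip on `table4a`. This is a sketch assuming `tidyr` and `dplyr` are installed; `long` and `wide` are made-up names:

```r
library("tidyr")
library("dplyr")  # loaded here for the pipe operator

long <- table4a %>%
    pivot_longer(cols = c("1999", "2000"),
                 names_to = "year",
                 values_to = "cases")

wide <- long %>%
    pivot_wider(names_from = "year", values_from = "cases")

# the round trip should recover the original table
all.equal(as.data.frame(wide), as.data.frame(table4a))
```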

If the data stems from multiple sources, it has to be merged into one data frame. 

In [None]:
base <- tibble(id = 1:3, 
               age = seq(55, 60, length = 3))
visits <- tibble(id = c(rep(1:2, 2), 4), 
                 visit = c(rep(1:2, 2), 1), 
                 outcome = rnorm(5))
base
visits

In [None]:
left_join(base, visits, by = "id") # keeps all rows of the left table (base)

In [None]:
right_join(base, visits, by = "id") # keeps all rows of the right table (visits)

In [None]:
df <- full_join(base, visits, by = "id") # keeps all rows of both tables; unmatched entries become NA
df 

In [None]:
na.omit(df)
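
Two further `dplyr` joins are useful in this situation: `inner_join` keeps only ids present in both tables, and `anti_join` lists the rows without a match. The sketch below uses made-up tables `base_demo` and `visits_demo` with fixed values, mirroring the structure of the tables above:

```r
library("tibble")
library("dplyr")

# Hypothetical toy tables with fixed values
base_demo   <- tibble(id = 1:3, age = c(55, 57.5, 60))
visits_demo <- tibble(id = c(1L, 2L, 1L, 2L, 4L),
                      visit = c(1, 2, 1, 2, 1))

both        <- inner_join(base_demo, visits_demo, by = "id") # ids in both tables
only_base   <- anti_join(base_demo, visits_demo, by = "id")  # base rows without visits
only_visits <- anti_join(visits_demo, base_demo, by = "id")  # visits without base entry

both
only_base
only_visits
```

Checking the anti joins before merging is a simple way to see which records would produce `NA` entries in a full join.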

### Exercise 

Load the data set `BPdata.csv`. Convert it to a `tibble` object and coerce its first column `Male` to a `logical` variable. 

Add a new column `BP` that contains the median blood pressure over the three measurements and select the columns `Male`, `Age` and `BP`. 

Remove all observations with `Age` between `30` and `60` and calculate the average age and blood pressure grouped by the indicator in the `Male` column. 