## LECTURE 4: DATA STRUCTURES IN R  (contd)

### STAT598z: Intro. to computing for statistics


***




### Vinayak Rao

#### Department of Statistics, Purdue University

In [None]:
options(repr.plot.width=3, repr.plot.height=3)

## Data frames

Very common and convenient data structures

Used to store tables:
+ Columns are variables and rows are observations


|       | Age | PhD   | GPA |
| ----- |:---:|:-----:| --- |
| Alice | 25  | TRUE  | 3.6 |
| Bob   | 24  | TRUE  | 3.4 |
| Carol | 21  | FALSE | 3.8 |

An R data frame is a list of equal-length vectors

In [None]:
df <- data.frame(age = c(25L, 24L, 21L),      # Warning: df is also the name of
                 PhD = c(TRUE, TRUE, FALSE),  #   an R function (the F density)
                 GPA = c(3.6, 3.4, 3.8))      # Values match the table above

In [None]:
print(df)

In [None]:
typeof(df)

In [None]:
class(df)

Since data frames are lists, we can use list indexing

Can also use matrix indexing (more convenient)
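
As a small sketch of the two indexing styles, the cell below builds a separate data frame (called `students` here, a name chosen only for illustration) and extracts the same column both ways.

```r
# Illustrative data frame; the name `students` is not from the lecture
students <- data.frame(age = c(25L, 24L, 21L),
                       GPA = c(3.6, 3.4, 3.8))

# List-style indexing
col_list <- students[["GPA"]]   # double brackets return the column vector
sub_list <- students["GPA"]     # single brackets return a one-column data frame

# Matrix-style indexing
col_mat <- students[, "GPA"]    # same column, selected as [rows, columns]
cell    <- students[2, "age"]   # a single entry

print(identical(col_list, col_mat))
print(class(sub_list))
```

Single brackets with one index keep the data frame structure, while double brackets (and `$`) return the underlying vector.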

In [None]:
print(df[2,'age']) 

In [None]:
print(df[2,])

In [None]:
print(df$GPA)

In [None]:
nrow(df)*ncol(df)

List functions apply as usual

Matrix functions also behave as you would expect

Useful functions are:
+ `length()`, `dim()`, `nrow()`, `ncol()`
+ `names()` (or `colnames()`), `rownames()`
+ `rbind()`, `cbind()`
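
A minimal sketch of these functions on a small data frame follows; the names `scores`, `scores2`, and `scores3` are made up for the example.

```r
# Illustrative data frame; names are hypothetical
scores <- data.frame(age = c(25L, 24L),
                     GPA = c(3.6, 3.4))

length(scores)   # number of columns (a data frame is a list of columns)
dim(scores)      # c(rows, columns)
names(scores)    # column names

# cbind() adds a column; rbind() adds a row with matching column names
scores2 <- cbind(scores, PhD = c(TRUE, TRUE))
scores3 <- rbind(scores2,
                 data.frame(age = 21L, GPA = 3.8, PhD = FALSE))
print(dim(scores3))
```

Note that `length()` counts columns, not rows, because it treats the data frame as a list.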

In [None]:
rownames(df) <- c("Alice", "Bob", "Carol")

In [None]:
df[4,1] <- 30L; print(df)
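
The assignment above writes to a fourth row that did not exist yet. The sketch below repeats the idea on a separate, illustrative data frame `d` to show what happens to the columns that were not assigned.

```r
# Illustrative data frame; the name `d` is hypothetical
d <- data.frame(age = c(25L, 24L, 21L),
                GPA = c(3.6, 3.4, 3.8))

d[4, 1] <- 30L   # assigning past the last row extends the data frame

print(nrow(d))              # the data frame now has one more row
print(is.na(d[4, "GPA"]))   # unassigned entries in the new row are NA
```

Because every column must have the same length, R pads the other columns of the new row with `NA`.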

Many R datasets are data frames 

In [None]:
library("datasets")
class(mtcars)

In [None]:
print(head(mtcars)) # Print part of a large object

## Tibbles
Tibbles are essentially data frames, with some convenience features

Interact well with the tidyverse packages (later)

In [None]:
library(tidyverse)
t_mt_cars <- as_tibble(mtcars)
class(t_mt_cars)

Tibbles print more nicely than data frames (but see RStudio's `View()`)

In [None]:
print(t_mt_cars)

You can refer to earlier columns of a tibble while you create it

In [None]:
sin_tb <- tibble(x = seq(-5, 5, .1), y = sin(x))
print(sin_tb)

## Factors
Categorical variables that take on a finite number of values

+ **Employee type**: `student/staff/faculty`
+ **Grade**: `A/B/C/F`

Useful when a variable can take only a fixed set of values
(unlike arbitrary character strings)

R implements these internally as integer vectors

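A factor has two attributes that distinguish it from a regular integer vector:

`levels` (accessed with `levels()`) specifies the possible values the factor can take
+ E.g. `c("male", "female")`

`class = "factor"` tells R to treat the vector as a factor and to check for values outside the levels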
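
To make the integer representation concrete, here is a small sketch using an illustrative factor `emp` (the name and values are made up, based on the employee-type example above).

```r
# Illustrative factor; the variable name `emp` is hypothetical
emp <- factor(c("staff", "student", "student", "faculty"))

typeof(emp)       # the underlying storage type
attributes(emp)   # the levels and class attributes
unclass(emp)      # integer codes, with the levels attribute still attached

codes <- as.integer(emp)     # each code is an index into levels(emp)
print(levels(emp)[codes])    # recovers the original strings
```

By default the levels are the sorted unique values, so the integer codes depend on that ordering rather than on the order of appearance.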

In [None]:
# Character vector for 4 students
grades_bad <- c("a", "a", "b", "f") 

In [None]:
# Factor vector for 4 students
grades <- factor(c("a", "a", "b", "f"))

In [None]:
print(grades)

In [None]:
typeof(grades)

In [None]:
class(grades)

In [None]:
levels(grades) # Not quite what we wanted!

In [None]:
grades <- factor(c("a", "a", "b", "f"))
str(grades)

In [None]:
grades[2] <- "c"   # "c" is not a level: R warns and stores NA

In [None]:
str(grades)

In [None]:
grades <- factor(c("a","a","b","a","f"),
             levels = c("a","b","c","f"))

In [None]:
str(grades)

In [None]:
table(grades)   # table also works with other data-types
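
As a brief sketch of why explicit levels matter for `table()`, the cell below tabulates the same values stored as a character vector and as a factor; the names `g_chr` and `g_fac` are illustrative.

```r
# Same grades as a character vector and as a factor with explicit levels
g_chr <- c("a", "a", "b", "a", "f")
g_fac <- factor(g_chr, levels = c("a", "b", "c", "f"))

print(table(g_chr))   # only values that actually occur
print(table(g_fac))   # one entry per level, including unused ones
```

The factor version reports a count of zero for the unused level, which the character version cannot know about.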

Factors can be ordered:

In [None]:
grades <- factor(c("a","a","b","f"),
            levels = c("f","c","b","a"),
            ordered = TRUE )
grades

In [None]:
 grades[1] > grades[3]
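
The same idea on a second, made-up ordered factor (`sizes` is a hypothetical name) shows that ordering follows the `levels` argument rather than alphabetical order.

```r
# Illustrative ordered factor; levels are listed from smallest to largest
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

print(sizes[2] > sizes[3])   # compares "large" with "medium" by level order
print(sort(sizes))           # sorts by level order
print(max(sizes))            # the largest level present
```

For an unordered factor, comparisons such as `>` are not meaningful, and R returns `NA` with a warning.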

`gl()`: Generate factor levels

Usage (from the R documentation):

``` R
gl(n, k, length = n * k, labels = seq_len(n),
   ordered = FALSE )
```
Look at the examples there:

In [None]:
# First control, then treatment:
gl(2, 8, labels = c("Control", "Treat")) 

In [None]:
gl(2, 1, 20) # 20 alternating 1s and 2s

In [None]:
gl(2, 2, 20) # alternating pairs of 1s and 2s
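
One way `gl()` might be used is to lay out a balanced design. The sketch below assumes a hypothetical 2 x 3 experiment with 2 replicates per cell; the factor names and labels are made up for illustration.

```r
# Hypothetical 2 x 3 design with 2 replicates per cell (12 runs in total)
treatment <- gl(2, 6, labels = c("Control", "Treat"))        # 6 of each, in blocks
dose      <- gl(3, 2, 12, labels = c("low", "mid", "high"))  # pairs, repeated to length 12

design <- data.frame(treatment, dose)
print(table(design))   # counts per treatment/dose combination
```

Choosing `k` for the outer factor as the product of the inner factor's `n` and `k` is what makes every combination appear equally often.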