Data frames and Factors
=======================

**Author:** Marcus Birkenkrahe



## What will you learn?



![img](../img/frame.jpg)

-   What is a data frame?
-   How do you create data frames?
-   How can you subset data frames?
-   Orange juice or Vitamin C?
-   What about lists?



## What is a data frame?



-   A data frame is a **table** with one row for each data point
-   A data frame consists of **vectors** of the same length
-   The vectors can have **different** data types
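
A minimal sketch of these three points, assuming a made-up data frame `pets` (the name and values are only for illustration): the columns are equal-length vectors of different types.

```r
## hypothetical example data: three vectors of length 3, three different types
pets <- data.frame(
  name   = c("Ada", "Bo", "Cy"),   # character
  weight = c(4.2, 11.5, 7.3),      # numeric (double)
  indoor = c(TRUE, FALSE, TRUE),   # logical
  stringsAsFactors = FALSE         # keep 'name' as character in older R versions
)
sapply(pets, class)    # one class per column
sapply(pets, length)   # every column has the same length
nrow(pets)             # one row per data point
```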



## Example: creating data frames



-   Open the `notebook.ipynb` file in the data frame practice workspace

-   Create three vectors to put into the data frame
    1.  the first uses the built-in constant vector `LETTERS`
    2.  the second uses the `sample` function to draw a random sample from the first
    3.  the third is a vector of integers generated with the colon operator



In [1]:
L3 <- LETTERS[1:3]   # create vector of capital letters
L3

In [1]:
fac <- sample(x=L3,        # generate a sample from L3
              size=10,     # draw 10 letters
              replace=TRUE)  # with replacement
fac

In [1]:
df_ex <- data.frame(1,1:10,fac) # create 10x3 data frame without explicit column names
str(df_ex)  # check structure (column names are auto-generated)
df_ex

In [1]:
df_ex <- data.frame("x"=1,"y"=1:10,fac) # data frame with named columns
str(df_ex)  # check structure (explicitly named)
df_ex

-   More information: check out `example(data.frame)`
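
Because `sample` draws at random, `fac` differs from run to run. A small sketch of how a fixed seed makes the draw repeatable; the seed value 42 and the names `draw1` and `draw2` are arbitrary choices for illustration:

```r
## fixing the seed makes the random draw repeatable
set.seed(42)                                   # 42 is an arbitrary choice
draw1 <- sample(LETTERS[1:3], size = 10, replace = TRUE)
set.seed(42)                                   # reset to the same seed
draw2 <- sample(LETTERS[1:3], size = 10, replace = TRUE)
identical(draw1, draw2)                        # same seed, same sample
```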



## Functions to test data structures



-   Check for vector, matrix, list, data frame:



In [1]:
is.vector(df_ex)      # FALSE: a data frame carries extra attributes
is.matrix(df_ex)      # FALSE
is.data.frame(df_ex)  # TRUE
is.list(df_ex)        # TRUE

-   Surprise: data frames are special [rectangular] lists

-   Can a data frame contain a data frame?
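
One way to explore the last question, sketched with two made-up data frames `df_a` and `df_b`: `data.frame()` merges the columns of a data frame argument, while assigning a data frame to a single column keeps it nested.

```r
df_a <- data.frame(id = 1:2)           # hypothetical example data
df_b <- data.frame(u = 3:4, v = 5:6)

## passing a data frame to data.frame() merges its columns into the result
flat <- data.frame(df_a, df_b)
ncol(flat)                    # 3 columns: id, u, v

## assigning a data frame to a single column keeps it nested
nested <- df_a
nested$inner <- df_b
ncol(nested)                  # 2 columns: id, inner
is.data.frame(nested$inner)   # the second column is itself a data frame
```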



### Practice



-   Create and print the data frame shown in the figure below
    
    ![img](../img/7_df.png "data frame example (source: guru99.com)")



### Solution



-   Define vectors with `c()`
-   Create data frame with `data.frame()`
-   You can rename columns with `colnames()`
-   You can auto-convert `character` to `factor` with `stringsAsFactors = TRUE`



In [1]:
## define vectors
ID <- c(10,20,30,40) # numeric vector (stored as double, not integer)
items <- c("book","pen","textbook","pencil_case") # character vector
store <- c(TRUE,FALSE,TRUE,FALSE) # logical vector
price <- c(2.5,8.0,10.0,7.0)  # numeric double vector

## create data frame and properties
df <- data.frame(ID,items,store,price)
df
rownames(df)   # row names (auto-created)
colnames(df)   # column names
str(df)        # data frame structure

## auto-convert characters to factors
df_fac <- data.frame(ID,items,store,price,
                     stringsAsFactors = TRUE )
str(df_fac)
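
Renaming columns with `colnames()` can be sketched as follows. The data frame is rebuilt here so the example runs on its own, and the new name `in_store` is made up for illustration:

```r
## rebuild the solution's data frame, then rename its columns
df <- data.frame(ID    = c(10, 20, 30, 40),
                 items = c("book", "pen", "textbook", "pencil_case"),
                 store = c(TRUE, FALSE, TRUE, FALSE),
                 price = c(2.5, 8.0, 10.0, 7.0))
colnames(df)                          # "ID" "items" "store" "price"
colnames(df)[3] <- "in_store"         # rename one column by position
colnames(df) <- toupper(colnames(df)) # rename all columns at once
colnames(df)
```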

## Lab session: Creating/subsetting data frames



![img](../img/penguins.jpg)

1.  Go to the DataCamp workspace "[dataframe practice](https://app.datacamp.com/workspace/w/58e9e598-2624-44c8-aae5-fe327732eb3f)".
2.  Save it to your own workspace and complete the exercises.
3.  Finish it at home and submit it as a bonus assignment in Canvas.



## Properties of real data frames



![img](../img/guineapigs.jpg)



## Some useful functions



-   `dim` gives you the data frame dimensions
-   `nrow` gives you the number of rows
-   `ncol` gives you the number of columns
-   `head(x, n = N)` gives you the first `N` rows (or elements) of `x`
-   `order` gives you the indices that sort a vector
-   `subset` gives you a subset of a vector, matrix, or data frame
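
Since `order` returns indices, it can also sort a whole data frame by one of its columns. A small sketch, assuming the alias `tg` for the built-in `ToothGrowth` dataset; the names `by_len` and `longest` are made up:

```r
tg <- ToothGrowth                      # alias for the built-in dataset
by_len <- tg[order(tg$len), ]          # rows sorted by increasing tooth length
head(by_len, n = 3)                    # three shortest teeth
longest <- tg[order(tg$len, decreasing = TRUE), ]
head(longest$len, n = 3)               # three longest teeth
```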



In [1]:
dim(df)          # dimension of df
nrow(df)         # no. of rows
ncol(df)         # no. of columns

tg <- ToothGrowth  # built-in dataset, aliased as tg
dim(tg)          # dimension of tg
nrow(tg)         # no. of rows
ncol(tg)         # no. of columns
head(tg$len,10)  # first 10 elements of vector

order(head(tg$len)) # indices that sort the first 6 elements

## print the first 6 elements in increasing, then decreasing order
tg$len[order(head(tg$len))]
tg$len[order(head(tg$len), decreasing = TRUE)]

## ?subset: type out the 'airquality' examples
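
A sketch in the spirit of the `airquality` examples in `?subset`, compared with the equivalent bracket indexing; the names `hot` and `hot2` are made up:

```r
## rows with Temp above 80, keeping only the Ozone and Temp columns
hot <- subset(airquality, Temp > 80, select = c(Ozone, Temp))
head(hot)

## the same selection with bracket indexing
hot2 <- airquality[airquality$Temp > 80, c("Ozone", "Temp")]
all.equal(hot, hot2)
```

`subset` lets you write column names without quotes and without repeating the data frame name, which is convenient in interactive work.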

## Data frame challenges



### Challenge 1



-   Try to create a non-rectangular data frame
-   Define vectors of different length
-   Combine them using `data.frame`
-   Explain the result!



#### Solution



In [1]:
## length 4 is a whole multiple of length 2: the shorter vector is recycled
data.frame(x1=c("moo","meh"),x2=1:4)

## length 3 is not a multiple of length 2: this call fails with
## "arguments imply differing number of rows: 2, 3"
try(data.frame(x1=c("moo","meh"),x2=1:3))

![img](../img/7_challenge.png "element-wise vector operation")
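
The two outcomes can be checked programmatically. This sketch captures the error message with `tryCatch` instead of stopping; the names `ok` and `res` are made up:

```r
## length 4 is a whole multiple of length 2: the shorter vector is recycled
ok <- data.frame(x1 = c("moo", "meh"), x2 = 1:4, stringsAsFactors = FALSE)
ok$x1                                   # "moo" "meh" "moo" "meh"

## length 3 is not a multiple of length 2: data.frame() signals an error
res <- tryCatch(data.frame(x1 = c("moo", "meh"), x2 = 1:3),
                error = function(e) conditionMessage(e))
res                                     # the error message as a string
```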



### Challenge 2



-   Use the dataset `ToothGrowth` (aliased as `tg` with `tg <- ToothGrowth`)
-   Find the number of cases in which tooth length is less than 16



In [1]:
## create index vector for observations with tooth length < 16
small <- tg$len < 16

## look at the result - surprised?
head(small)    # print first few vector elements
sum(small)     # number of teeth of length < 16
length(small)  # total number of observations; the rest have length >= 16

## print the tooth length values
tg$len[small]  # tg[small] won't work here - why not?
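
A sketch of the difference between `tg[small]` and `tg[small, ]`, using `try` so that the failing call does not stop the script; the names `res` and `tg_small` are made up:

```r
tg <- ToothGrowth
small <- tg$len < 16

## a single index inside [ ] selects columns, and tg has only 3 columns
res <- try(tg[small], silent = TRUE)
inherits(res, "try-error")        # the column-style call fails

## a comma turns the logical vector into a row index
tg_small <- tg[small, ]
nrow(tg_small) == sum(small)      # one row per TRUE
```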

## Factor advantage



-   Compare the following two plots
-   You have to have `ggplot2` installed
-   The code uses the quick plot function `qplot` (deprecated as of ggplot2 3.4.0, but still available)



In [1]:
library(ggplot2)

## plot mpg vs wt, cyl
qplot(data=mtcars,x=wt,y=mpg,colour=cyl)
ggsave(file...

## plot mpg vs wt, factor(cyl)
qplot(data=mtcars,x=wt,y=mpg,colour=factor(cyl))
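
The difference between the two plots comes from how `cyl` is stored: a numeric column is mapped to a continuous colour scale, a factor to one colour per level. A base R sketch of the conversion (no plotting needed; `cyl_f` is a made-up name):

```r
class(mtcars$cyl)            # "numeric": treated as a continuous scale
cyl_f <- factor(mtcars$cyl)  # convert to a categorical variable
class(cyl_f)                 # "factor"
levels(cyl_f)                # "4" "6" "8"
table(cyl_f)                 # number of cars per level
```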

## Orange Juice or Vitamin C?



### Extract factor levels



-   What's the class of `tg$supp`?
-   What're the levels of `tg$supp`?
-   We want to compare `mean` tooth length for each `level`



In [1]:
class(tg$supp)   # class check
levels(tg$supp)  # levels check

## select the rows for each level
tgoj <- tg[tg$supp == 'OJ',]  # Orange Juice
tgvc <- tg[tg$supp == 'VC',]  # Vitamin C

## compute the mean over all selected rows
mean(tgoj$len)
mean(tgvc$len)
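
The same comparison can be written in one step. This is an alternative sketch using `tapply`, not the approach taken above; it applies `mean` to `len` within each level of `supp`:

```r
tg <- ToothGrowth
means <- tapply(tg$len, tg$supp, mean)   # mean of len within each level of supp
means
means["OJ"] - means["VC"]                # difference between the two groups
```

With more than two levels, this avoids creating one subset data frame per level.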

### What's going on here?



`tg[tg$supp == 'OJ',]` is loaded with meaning:

-   `[i,j]`: select row `i`, column `j`
-   `i` can be a vector (several rows)
-   `j` can be a vector (several columns)
-   If either is missing: take all rows or columns
-   `==` produces logical values
-   `TRUE` means "take it", `FALSE` means "skip it"

>   `tg[tg$supp == 'OJ', ]` says:
> 
>   "Find which elements of the `tg$supp` vector equal `'OJ'` and
>   extract the corresponding rows of `tg`."
> 
>   = "Take from tg the rows in which the supplement was `OJ`."
> 
>   Notice that `tgoj` and `tgvc` are still data frames.
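
The steps above can be traced on a tiny example. The data frame `toy` is made up, with the same column names as `ToothGrowth`:

```r
## hypothetical toy data in the shape of ToothGrowth
toy <- data.frame(len  = c(4.2, 15.2, 11.5, 21.5),
                  supp = c("VC", "OJ", "VC", "OJ"))

keep <- toy$supp == "OJ"         # FALSE TRUE FALSE TRUE
toy[keep, ]                      # rows 2 and 4, all columns
toy[keep, "len"]                 # same rows, one column: a plain vector
toy[keep, "len", drop = FALSE]   # same selection, kept as a data frame
```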



## What about lists?



-   Data frames (and `data.table`) are really lists
-   Subsetting: same ol', same ol' (with `[[]]`)
-   Create lists with `list`
-   Useful for web data



In [1]:
class(mtcars)   # object class of data frame
typeof(mtcars)  # type or storage mode of data frame

## subsetting a data frame as a list
identical(mtcars$mpg[1], mtcars[[1]][1])

## wrap mtcars in a list of length one (other elements can be added)
mtcars_list <- list(mtcars)
typeof(mtcars_list)
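
A sketch of a list that holds a data frame alongside other information, and of the difference between `[[ ]]` and `[ ]`; the list `car_info` and its element names are made up for illustration:

```r
## a list can hold elements of different types and lengths
car_info <- list(data   = head(mtcars, 3),      # a data frame
                 source = "1974 Motor Trend",   # a character string
                 n_cars = nrow(mtcars))         # a number

class(car_info[["data"]])   # [[ ]] returns the element itself: "data.frame"
class(car_info["data"])     # [ ] returns a list of length one: "list"
car_info$n_cars             # $ works on lists as on data frames
```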

## Concept summary



-   A data frame is a table with one row for each data point
-   A data frame consists of vectors of the same length
-   You can change row and column names
-   You can convert `character` into `factor` vectors
-   You can subset data frames using `[]` or `$` operators
-   You can run R scripts from the command line (e.g. `Rscript`)
-   You can plot to file (e.g. using `ggsave`)



## Code summary



| Code | Description |
|---|---|
| <code>library</code>|load package|
| <code>data</code>|load dataset|
| <code>str(df)</code>|structure of data frame <code>df</code>|
| <code>dslabs::murders</code>|data set <code>murders</code> in <code>dslabs</code>|
| <code>Rscript</code>|run R on script <code>.R</code>|
| <code>R CMD BATCH</code>|execute R as batch command|
| <code>ls</code>, <code>cat</code>|(linux) shell commands|
| <code>littler</code>|R script program package|
| <code>data.frame</code>|create data frame|
| <code>example</code>|show examples of function|
| <code>LETTERS</code>|pre-stored alphabet (caps)|
| <code>sample</code>|generate sample from vector|
| <code>is.vector</code>|test for vector|
| <code>is.matrix</code>|test for matrix|
| <code>is.data.frame</code>|test for data frame|
| <code>is.list</code>|test for list|
| <code>rownames</code>|get/set row names|
| <code>colnames</code>|get/set column names|
| <code>$</code>|access named vector|
| <code>[]</code>|select index values|
| <code>mean</code>|compute mean (1 argument)|
| <code>length</code>|compute vector length|
| <code>identical</code>|check equality (2 arguments)|
| <code>max</code>|find maximum value|
| <code>dim</code>|dimensions of object|
| <code>nrow</code>, <code>ncol</code>|number of rows, columns|
| <code>head</code>|top lines (default: 6)|
| <code>order</code>|indices that sort a vector|
| <code>subset</code>|select subset|
| <code>list</code>|make list|
| <code>factor</code>|turn vector into factor vector|
| <code>ggplot2::ggsave</code>|save named plot|
| <code>ggplot2::qplot</code>|quick pretty plot|



## References



Matloff N (2019). fasteR: Fast Lane to Learning R! [Online: github](https://github.com/matloff/fasteR#--on-to-data-frames)

