### <center> R Data Frames </center> 

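* DataFrames are one of the main tools for data analysis in R. 
* Unlike Vectors and Matrices, which hold a single data type, DataFrames can organize columns of mixed data types, which makes them a very powerful data structure. 
* R has built-in datasets for quick reference and to practice with. Let's look at a few. Note that not all of them are DataFrames: `state.x77` and `USPersonalExpenditure` are stored as numeric matrices, while `women` is a DataFrame. 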

**Some built-in datasets**

In [88]:
head(state.x77) # Data about US states (stored as a numeric matrix)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766


In [89]:
USPersonalExpenditure # US personal expenditure (stored as a numeric matrix)

Unnamed: 0,1940,1945,1950,1955,1960
Food and Tobacco,22.2,44.5,59.6,73.2,86.8
Household Operation,10.5,15.5,29.0,36.5,46.2
Medical and Health,3.53,5.76,9.71,14.0,21.1
Personal Care,1.04,1.98,2.45,3.4,5.4
Private Education,0.341,0.974,1.8,2.6,3.64


In [90]:
women # Women height and weight dataset

height,weight
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


* To get a list of all available built-in datasets, use the **`data()`** function.

In [91]:
data()

Package,Item,Title
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
datasets,BJsales,Sales Data with Leading Indicator
datasets,BJsales.lead (BJsales),Sales Data with Leading Indicator
datasets,BOD,Biochemical Oxygen Demand
datasets,CO2,Carbon Dioxide Uptake in Grass Plants
datasets,ChickWeight,Weight versus age of chicks on different diets
datasets,DNase,Elisa assay of DNase
datasets,EuStockMarkets,"Daily Closing Prices of Major European Stock Indices, 1991-1998"
datasets,Formaldehyde,Determination of Formaldehyde
datasets,HairEyeColor,Hair and Eye Color of Statistics Students


#### <center> Working with DataFrames </center> 

**Some common DataFrame functions**

| Function | Description | 
|---- | ---- | 
|**`head(df, n)`** | Returns the top n rows, default=6 |
|**`tail(df, n)`** | Returns the bottom n rows, default=6| 
|**`str()`** | Returns the structure of the DataFrame and the data it contains.| 
|**`summary()`**| Returns a statistical summary of all the columns in the DataFrame.|
|**`data.frame()`**| Takes in vectors as arguments and makes them columns of a DataFrame. |
|**`df.name$column.name`** | Returns the values of the specified column as a vector |
|**`subset(x=df, subset=some condition)`** | Grabs the subset of rows of a data frame that satisfy the condition |
|**`order(some column)`** | Returns a vector of row indices that puts the column in sorted order (default=ascending)|

In [92]:
states <- state.x77

In [93]:
class(states)

In [94]:
head(x=states, n=5)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361


In [95]:
tail(x=states, n=5)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Virginia,4981,4701,1.4,70.08,9.5,47.8,85,39780
Washington,3559,4864,0.6,71.72,4.3,63.5,32,66570
West Virginia,1799,3617,1.4,69.48,6.7,41.6,100,24070
Wisconsin,4589,4468,0.7,72.48,3.0,54.5,149,54464
Wyoming,376,4566,0.6,70.29,6.9,62.9,173,97203


In [96]:
str(object=states)

 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...


In [97]:
summary(states)

   Population        Income       Illiteracy       Life Exp    
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
 Median : 2838   Median :4519   Median :0.950   Median :70.67  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
     Murder          HS Grad          Frost             Area       
 Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
 1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
 Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
 Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
 3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
 Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  
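
Note that the `str()` output above reports `states` as a numeric matrix (`num [1:50, 1:8]`) rather than a data frame. The sketch below shows one way to check this and to convert the matrix with `as.data.frame()`. The name `states.df` is only an illustrative choice and is not used elsewhere in this notebook.

```r
# state.x77 ships with R as a numeric matrix with row and column names
is.matrix(state.x77)
is.data.frame(state.x77)

# Convert it to a data frame (states.df is a hypothetical name)
states.df <- as.data.frame(state.x77)
is.data.frame(states.df)

# Columns can now be reached with $, which does not work on a matrix
head(states.df$Population)

# str() now lists one entry per column instead of a single numeric block
str(states.df)
```

After the conversion, the state names remain available as row names via `rownames(states.df)`.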

#### Creating a DataFrame

In [98]:
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temp <- c(22.2, 21.0, 23.0, 24.3, 25.0)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

In [99]:
df <- data.frame(days, temp, rain)

In [100]:
df

days,temp,rain
Mon,22.2,True
Tue,21.0,True
Wed,23.0,False
Thu,24.3,False
Fri,25.0,True


#### <center> Data Frame Selection and Indexing </center> 

* We can select elements from within a Data Frame using bracket notation. 

**Select using index location**

In [101]:
df[1,] # Select the first row

days,temp,rain
Mon,22.2,True


In [102]:
df[,1] # Select the first column

**Select using column names**

In [103]:
df[, "rain"] # Select rain data

In [105]:
df[1:3, c("days", "temp")] # Select first three rows for days and temps columns

days,temp
Mon,22.2
Tue,21.0
Wed,23.0


* To get the values of a particular column as a vector, use the dollar sign after the data frame name. 
* General format: **`df.name$column.name`**

In [106]:
df$rain # Get rain values

In [107]:
df$days

* You can also use the bracket notation to return a data frame with the same information.

In [108]:
df["rain"]

rain
True
True
False
False
True


In [109]:
df["days"]

days
Mon
Tue
Wed
Thu
Fri
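
The difference between the dollar sign and single brackets is the type of the result. The following sketch rebuilds the same small weather data frame and compares the two forms; the names `rain.vec`, `rain.vec2`, and `rain.df` are illustrative only.

```r
# Rebuild the weather data frame used above
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temp <- c(22.2, 21.0, 23.0, 24.3, 25.0)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days, temp, rain)

rain.vec  <- df$rain       # Logical vector
rain.vec2 <- df[["rain"]]  # Same vector, using double brackets
rain.df   <- df["rain"]    # One-column data frame

class(rain.vec)
class(rain.df)
identical(rain.vec, rain.vec2)

# A vector can be passed straight to functions such as mean();
# for a logical vector this gives the share of TRUE values
mean(rain.vec)
```

Use the vector form when the values feed into a calculation, and the data frame form when the result should keep its column name and tabular shape.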


**Filtering with a subset condition** 

* The **`subset()`** function grabs a subset of values based on some condition. 
* Note that `subset()` evaluates the condition inside the data frame, so column names are written bare rather than passed as character strings.
* For example, to grab the days when it rained:

In [110]:
subset(x=df, subset=(rain==TRUE)) # Grab data for the days it rained

Unnamed: 0,days,temp,rain
1,Mon,22.2,True
2,Tue,21.0,True
5,Fri,25.0,True


In [111]:
subset(x=df, subset=(temp>23)) # Grab data for the days temp > 23

Unnamed: 0,days,temp,rain
4,Thu,24.3,False
5,Fri,25.0,True
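
One practical difference between `subset()` and bracket indexing shows up when the condition column contains missing values. The sketch below uses a small made-up data frame (`df.na` is a hypothetical name and its values are invented for illustration) to compare the two.

```r
# Made-up data with one missing temperature
df.na <- data.frame(days = c("Mon", "Tue", "Wed"),
                    temp = c(22.2, NA, 25.0))

# subset() drops rows where the condition evaluates to NA
by.subset <- subset(x = df.na, subset = (temp > 23))

# Bracket indexing with a logical vector keeps an all-NA row for each NA
by.bracket <- df.na[df.na$temp > 23, ]

# Wrapping the condition in which() discards the NA positions
by.which <- df.na[which(df.na$temp > 23), ]

nrow(by.subset)
nrow(by.bracket)
nrow(by.which)
```

When the data may contain missing values, `subset()` or `which()` avoids the extra all-`NA` rows that plain logical indexing returns.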


**Ordering a Data Frame**

* **`order(some column)`** returns a vector of row indices that puts the column in sorted order (default=ascending)

In [112]:
order(df["temp"])

* Passing that index vector as the row selector returns the data frame sorted accordingly

In [113]:
df[order(df["temp"]),]

Unnamed: 0,days,temp,rain
2,Tue,21.0,True
1,Mon,22.2,True
3,Wed,23.0,False
4,Thu,24.3,False
5,Fri,25.0,True


In [114]:
df[order(-df["temp"]),] # Orders based on temp in descending order

Unnamed: 0,days,temp,rain
5,Fri,25.0,True
4,Thu,24.3,False
3,Wed,23.0,False
1,Mon,22.2,True
2,Tue,21.0,True


* Selecting the column with `$` also works. It hands `order()` a plain vector rather than a one-column data frame, which is the more robust form.

In [115]:
df[order(df$temp),] 

Unnamed: 0,days,temp,rain
2,Tue,21.0,True
1,Mon,22.2,True
3,Wed,23.0,False
4,Thu,24.3,False
5,Fri,25.0,True
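
`order()` also accepts a `decreasing` argument and several sort keys. The sketch below rebuilds the weather data frame and shows both; the result names `by.temp.desc` and `by.rain.then.temp` are illustrative.

```r
# Rebuild the weather data frame used above
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temp <- c(22.2, 21.0, 23.0, 24.3, 25.0)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days, temp, rain)

# Descending order without negating the column
by.temp.desc <- df[order(df$temp, decreasing = TRUE), ]
by.temp.desc

# Two keys: dry days first (FALSE sorts before TRUE),
# then hottest first within each group
by.rain.then.temp <- df[order(df$rain, -df$temp), ]
by.rain.then.temp
```

The `decreasing = TRUE` form also works for character columns, where negating the column with a minus sign is not possible.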


#### <center> Overview of Data Frame Operations </center> 

* Data Frames are the workhorse of R, so in this lecture we will effectively be building a "cheat sheet" of common operations.

**Creating an Empty Data Frame**

In [116]:
df <- data.frame() # Create an empty data frame

**Create Data Frame from Vectors**

In [117]:
c1 <- 1:10          # Vector of integers
c2 <- letters[1:10] # Vector of strings (first ten letters)

Unlike the previous example, here we also name the columns, using the form (column_name = vector).

In [118]:
df <- data.frame(col.name.1=c1, col.name.2=c2)
df

col.name.1,col.name.2
1,a
2,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j


**Importing and Exporting Data**

In [119]:
write.csv(df, file="data.filename.csv") # Write data frame into a csv file

In [120]:
df2 <- read.csv("data.filename.csv")  # Function to read csv files into a data frame

* Note that `write.csv()` writes the row names by default, so reading the file back gives an additional column (named X here) that holds the old index.  
* Often the index is not just a number but a field with useful information, such as an id or a name.

In [121]:
df2

X,col.name.1,col.name.2
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,7,g
8,8,h
9,9,i
10,10,j
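
If the extra index column is not wanted, it can be left out when writing or turned back into row names when reading. The sketch below uses a temporary file so that nothing is left on disk; the names `df.demo`, `no.index`, and `with.index` are illustrative.

```r
df.demo <- data.frame(col.name.1 = 1:10, col.name.2 = letters[1:10])

# Option 1: do not write the row names at all
tmp1 <- tempfile(fileext = ".csv")
write.csv(df.demo, file = tmp1, row.names = FALSE)
no.index <- read.csv(tmp1)
colnames(no.index)

# Option 2: write them (the default) and restore them as row names on read
tmp2 <- tempfile(fileext = ".csv")
write.csv(df.demo, file = tmp2)
with.index <- read.csv(tmp2, row.names = 1)
colnames(with.index)

# Remove the temporary files
unlink(c(tmp1, tmp2))
```

Option 2 is the one to prefer when the row names carry real information, such as the car names in `mtcars`.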


In [122]:
library(readxl) # R library needed for reading data from Excel files.

In [123]:
df3 <- read_excel("Sample-Sales-Data.xlsx", sheet="Sheet1") # Command to read Excel data

#### Getting Information about Data Frames

In [124]:
nrow(df) # No of rows

In [125]:
ncol(df) # No of columns

In [126]:
colnames(df) # Column names in a vector

* Note that the vector of row names can be very long for large data frames. 

In [127]:
rownames(df) # Row names in a vector.

* The **`str()`** function gives information such as the number of observations (rows), the number of variables (columns), the column names, and their data types and factor levels. 

In [128]:
str(df) # Information about the structure of a data frame

'data.frame':	10 obs. of  2 variables:
 $ col.name.1: int  1 2 3 4 5 6 7 8 9 10
 $ col.name.2: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
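
The `Factor` type shown for `col.name.2` depends on the R version: before R 4.0.0, `data.frame()` converted character vectors to factors by default, while newer versions keep them as character. A minimal sketch that makes the choice explicit with the `stringsAsFactors` argument follows (`df.chr` and `df.fct` are illustrative names).

```r
c1 <- 1:10
c2 <- letters[1:10]

# Keep strings as plain character vectors
df.chr <- data.frame(col.name.1 = c1, col.name.2 = c2,
                     stringsAsFactors = FALSE)

# Convert strings to factors, matching the str() output shown above
df.fct <- data.frame(col.name.1 = c1, col.name.2 = c2,
                     stringsAsFactors = TRUE)

class(df.chr$col.name.2)
class(df.fct$col.name.2)
nlevels(df.fct$col.name.2)
```

Setting the argument explicitly keeps a script's behavior the same across R versions.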


In [129]:
summary(df) # Statistical summary of a data frame

   col.name.1      col.name.2
 Min.   : 1.00   a      :1   
 1st Qu.: 3.25   b      :1   
 Median : 5.50   c      :1   
 Mean   : 5.50   d      :1   
 3rd Qu.: 7.75   e      :1   
 Max.   :10.00   f      :1   
                 (Other):4   

#### Referencing Cells

* You can think of the basics as using two sets of brackets for a single cell, and a single set of brackets for multiple cells.

**Referencing a single cell**

In [130]:
vec <- df[[5,2]]
vec

In [131]:
df[5,2]

In [132]:
df[5, "col.name.2"]  # This is the most typically used in practice

In [133]:
df[[2, "col.name.1"]] <- 999 # Reassigning a single cell

In [134]:
df

col.name.1,col.name.2
1,a
999,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j


**Referencing multiple cells**

In [135]:
new.df <- df[1:5, 1:2] # Get multiple cells into a new df
new.df

col.name.1,col.name.2
1,a
999,b
3,c
4,d
5,e


#### Referencing Rows

* We usually use the **`[row,]`** format

In [136]:
df[1,] # Returns a dataframe of that row

col.name.1,col.name.2
1,a


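* If you just want a plain vector of the values in that row, select the numeric columns first, because a vector can only hold one data type

In [137]:
vrow <- as.numeric(df[1, sapply(df, is.numeric)]) # Numeric values of row 1 (non-numeric columns are skipped)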
vrow
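
When every column is numeric, as in `mtcars`, a row can be flattened directly. The sketch below uses `unlist()`, which keeps the column names as element names; `row.vec` is an illustrative name.

```r
# First row of mtcars as a named numeric vector
row.vec <- unlist(mtcars[1, ])

class(row.vec)
names(row.vec)

# Elements can then be picked by column name
row.vec["mpg"]
```

For data frames with mixed column types, flattening a row forces all values to a common type, so it is usually better to keep the row as a one-row data frame.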

#### Referencing Columns 

In [138]:
cars <- mtcars
head(cars)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


**Methods that return a column as a vector**

In [139]:
colv1 <- cars$mpg 
colv2 <- cars[, "mpg"] 
colv3 <- cars[, 1]
colv4 <- cars[["mpg"]]

In [140]:
head(colv1)

**Methods that return a column as a Data Frame**

In [141]:
mpg.df <- cars["mpg"] # We use a single set of brackets, rather than two!
mpg.df <- cars[1]     # Can also use index location of column

In [142]:
head(mpg.df)

Unnamed: 0,mpg
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1


In [143]:
many.cols <- cars[c("mpg", "cyl")]
head(many.cols)

Unnamed: 0,mpg,cyl
Mazda RX4,21.0,6
Mazda RX4 Wag,21.0,6
Datsun 710,22.8,4
Hornet 4 Drive,21.4,6
Hornet Sportabout,18.7,8
Valiant,18.1,6


**Adding Rows to a Data Frame**

* The **`rbind()`** function adds new rows to a data frame.
* Below we create a new data frame with the same columns and a single row, then bind it to `df`.

In [144]:
df2 <- data.frame(col.name.1=2000, col.name.2="new")
df2

col.name.1,col.name.2
2000,new


In [145]:
dfnew <- rbind(df, df2)

In [146]:
dfnew

col.name.1,col.name.2
1,a
999,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j
2000,new
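
`rbind()` matches columns by name, so the new row must use the same column names as the existing data frame. The sketch below checks the row count after binding and shows what happens when the names differ; `base.df`, `new.row`, `bad.row`, `combined`, and `result` are illustrative names.

```r
base.df <- data.frame(col.name.1 = 1:3, col.name.2 = c("a", "b", "c"))
new.row <- data.frame(col.name.1 = 2000, col.name.2 = "new")

# Matching column names: the row is appended at the bottom
combined <- rbind(base.df, new.row)
nrow(combined)
tail(combined, n = 1)

# Different column names: rbind() stops with an error,
# which is caught here so the script keeps running
bad.row <- data.frame(x = 1, y = "z")
result <- tryCatch(rbind(base.df, bad.row),
                   error = function(e) "names do not match")
result
```

Checking `nrow()` before and after is a quick way to confirm that the row was actually added.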


**Adding Columns to a Data Frame**

In [147]:
df$new.col <- 2*df$col.name.1 # New column two times col 1
head(df)

col.name.1,col.name.2,new.col
1,a,2
999,b,1998
3,c,6
4,d,8
5,e,10
6,f,12


In [148]:
df$new.col.copy <- df$new.col  # Another new column, a copy of new.col
head(df)

col.name.1,col.name.2,new.col,new.col.copy
1,a,2,2
999,b,1998,1998
3,c,6,6
4,d,8,8
5,e,10,10
6,f,12,12


In [149]:
df[, "new.col.copy2"] <- df[, "new.col"] # Just reference columns differently
head(df)

col.name.1,col.name.2,new.col,new.col.copy,new.col.copy2
1,a,2,2,2
999,b,1998,1998,1998
3,c,6,6,6
4,d,8,8,8
5,e,10,10,10
6,f,12,12,12


In [150]:
df <- cbind(df, df$new.col.copy2) # cbind() also adds a column; left unnamed, it is labelled with the expression itself
head(df)

col.name.1,col.name.2,new.col,new.col.copy,new.col.copy2,df$new.col.copy2
1,a,2,2,2,2
999,b,1998,1998,1998,1998
3,c,6,6,6,6
4,d,8,8,8,8
5,e,10,10,10,10
6,f,12,12,12,12
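
The awkward column name `df$new.col.copy2` in the output above comes from passing an unnamed vector to `cbind()`. A minimal sketch of naming the column in the call follows; `small.df` and `a.doubled` are illustrative names.

```r
small.df <- data.frame(a = 1:3)

# Give the new column a name inside the cbind() call
small.df <- cbind(small.df, a.doubled = 2 * small.df$a)

colnames(small.df)
small.df
```

The `name = value` form works the same way as in `data.frame()`.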


**Setting Column Names**

In [151]:
colnames(df)  # Returns the column names of a data frame

In [152]:
colnames(df)[2] <- "NEW NAME"  # Assigns new name to second column
colnames(df)

In [153]:
# Assign a vector to rename all column names.
# Note: only five names are given for six columns, so the sixth name becomes NA.
colnames(df) <- c('col.name.1', 'col.name.2', 'newcol', 'copy.of.col2' ,'col1.times.2')
colnames(df)
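
Renaming by position is easy to get wrong once columns have been added or removed. The sketch below renames a column by its current name and checks the length of a full replacement vector before assigning it; `df.demo` and `new.names` are illustrative names.

```r
df.demo <- data.frame(col.name.1 = 1:3,
                      col.name.2 = c("a", "b", "c"),
                      new.col    = c(2, 4, 6))

# Rename one column by looking up its current name
colnames(df.demo)[colnames(df.demo) == "new.col"] <- "col1.times.2"
colnames(df.demo)

# Rename all columns, checking first that there is one name per column
new.names <- c("id", "letter", "doubled")
stopifnot(length(new.names) == ncol(df.demo))
colnames(df.demo) <- new.names
colnames(df.demo)
```

The length check stops the script early instead of silently leaving a column named `NA`.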

**Selecting Multiple Rows**

In [154]:
first.ten.rows <- df[1:10, ] # Similar fashion as with Matrices

In [155]:
first.ten.rows

col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
1,a,2,2,2,2
999,b,1998,1998,1998,1998
3,c,6,6,6,6
4,d,8,8,8,8
5,e,10,10,10,10
6,f,12,12,12,12
7,g,14,14,14,14
8,h,16,16,16,16
9,i,18,18,18,18
10,j,20,20,20,20


In [156]:
everything.but.row.two <- df[-2, ] # Selects everything but second row
head(everything.but.row.two)

Unnamed: 0,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
1,1,a,2,2,2,2
3,3,c,6,6,6,6
4,4,d,8,8,8,8
5,5,e,10,10,10,10
6,6,f,12,12,12,12
7,7,g,14,14,14,14


**Conditional selection**

* Remember to include the comma after the row condition, which requests all columns!

In [157]:
cars[(cars$mpg > 25),]  # Get all cars with mpg > 25

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


In [158]:
cars[(cars$mpg >20) & (cars$cyl==6),] # Selection with multiple conditions

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1


In [159]:
# Also select which columns to return
cars[(cars$mpg >20) & (cars$cyl==6), c("mpg", "cyl", "hp")]

Unnamed: 0,mpg,cyl,hp
Mazda RX4,21.0,6,110
Mazda RX4 Wag,21.0,6,110
Hornet 4 Drive,21.4,6,110


The **`subset()`** function can do the same thing.

In [160]:
subset(x=cars, subset= (mpg >20 & cyl==6), select=c("mpg", "cyl", "hp"))

Unnamed: 0,mpg,cyl,hp
Mazda RX4,21.0,6,110
Mazda RX4 Wag,21.0,6,110
Hornet 4 Drive,21.4,6,110


**Select Multiple Columns**

In [161]:
head(cars[,c(1,2,3)])

Unnamed: 0,mpg,cyl,disp
Mazda RX4,21.0,6,160
Mazda RX4 Wag,21.0,6,160
Datsun 710,22.8,4,108
Hornet 4 Drive,21.4,6,258
Hornet Sportabout,18.7,8,360
Valiant,18.1,6,225


In [162]:
head(cars[,c("mpg", "cyl", "disp")])

Unnamed: 0,mpg,cyl,disp
Mazda RX4,21.0,6,160
Mazda RX4 Wag,21.0,6,160
Datsun 710,22.8,4,108
Hornet 4 Drive,21.4,6,258
Hornet Sportabout,18.7,8,360
Valiant,18.1,6,225


In [163]:
head(cars[, -1]) # Drop column 1

Unnamed: 0,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,6,225,105,2.76,3.46,20.22,1,0,3,1


In [164]:
head(cars[, -c(1, 3)]) # Drop columns 1 and 3

Unnamed: 0,cyl,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,6,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,6,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,4,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,6,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,8,175,3.15,3.44,17.02,0,0,3,2
Valiant,6,105,2.76,3.46,20.22,1,0,3,1
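
Negative indices only work with column positions, not with column names. A sketch of two ways to drop columns by name follows; `drop.cols`, `kept`, and `kept2` are illustrative names.

```r
cars <- mtcars
drop.cols <- c("mpg", "disp")

# Keep every column whose name is not in drop.cols
kept <- cars[, !(colnames(cars) %in% drop.cols)]

# subset() accepts a negated list of bare column names in select
kept2 <- subset(cars, select = -c(mpg, disp))

colnames(kept)
identical(colnames(kept), colnames(kept2))
```

Dropping by name keeps working if the column order changes, which is not true of positional indices.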


**Dealing with Missing Data** 

* If we run **`is.na(df)`** by itself, we get a logical matrix with the same dimensions as the data frame. 
* The **`any()`** function then returns a single TRUE or FALSE, depending on whether there are any missing values. 

In [165]:
any(is.na(cars)) # Checks if there is any missing data in data frame

* If you want to check a specific column or columns, you just select them in the call. 

In [166]:
any(is.na(cars$mpg))

* To replace every missing value in the data frame at once:

In [167]:
df[is.na(df)] <- 0 # Replaces all missing data with the chosen value

* Typically we will do an assignment like this for a single column

In [168]:
cars$mpg[is.na(cars$mpg)] <- 0 # Replaces missing values in the mpg column only

In [169]:
# Replaces missing values of a column with the mean of its non-missing values
cars$mpg[is.na(cars$mpg)] <- mean(cars$mpg, na.rm=TRUE)

In [170]:
# Keep only the rows with no missing data in the mpg column
cars <- cars[!is.na(cars$mpg),]
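
Since `mtcars` has no missing values, the commands above have nothing to change. The sketch below makes a copy with two values deliberately set to `NA` so the effect is visible; `cars.na`, `cars.filled`, and `cars.complete` are illustrative names, and the choice of rows 2 and 5 is arbitrary.

```r
# Copy mtcars and blank out two mpg values for demonstration
cars.na <- mtcars
cars.na$mpg[c(2, 5)] <- NA

any(is.na(cars.na))      # TRUE now that values are missing
colSums(is.na(cars.na))  # Count of missing values per column

# Fill the gaps with the mean of the non-missing values
cars.filled <- cars.na
cars.filled$mpg[is.na(cars.filled$mpg)] <- mean(cars.na$mpg, na.rm = TRUE)
any(is.na(cars.filled))

# Or drop every row that has a missing value in any column
cars.complete <- cars.na[complete.cases(cars.na), ]
nrow(cars.complete)
```

Without `na.rm = TRUE`, `mean()` returns `NA` as soon as the column contains a missing value, so the fill would have no effect.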