## Data Frame

We learned in lecture that a data frame is a specific way to organize data points sampled from a sample space $\mathcal{G}$.
A data frame organizes a dataset $\mathcal{D} = (d_{1}, d_{2}, \cdots, d_{N})$ so that each row is a single outcome $d_{i}$ and each column is a position in the tuple $d_{i} = (x, y, z, \ldots)$.

The dataframe is a common way computational scientists think about data. 
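
To make the row and column picture concrete, here is a minimal sketch that builds a small data frame by hand. The name `toy` and the choice of three columns are assumptions made for illustration; the values are copied from the first three rows of the bad-drivers data used in this section.

```r
# Three outcomes d_1, d_2, d_3, each a tuple (state, total, speeding).
states   <- c("Alabama", "Alaska", "Arizona")
total    <- c(18.8, 18.1, 18.6)
speeding <- c(39L, 41L, 35L)

# Each vector becomes a column; position i across the vectors forms row i.
toy <- data.frame(state = states, total = total, speeding = speeding)

print(toy)
print(toy[2, ])   # the single outcome d_2: Alaska, 18.1, 41
```

Each row of `toy` is one outcome, and each column holds the same position of the tuple across all outcomes.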

### How to read a CSV file into a data frame

The function ``read.csv`` takes as an argument a string that gives either the path to a file on your local computer or a URL.
Below, the ``read.csv`` function is used to import into memory a **data frame** related to bad drivers from the statistical news outlet *FiveThirtyEight*.

The news article is here: https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/


In [34]:
## the line below is how you read in a csv
d = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv")

## this is just me making the column names a little easier to handle
new_cols <- c('state', 'total', 'speeding', 'alcohol', 'not_distracted', 'previous_accidents', 'insurance_premium', 'insurance_loss')
print('Column names and descriptions:')
data.frame(column_names = new_cols, column_descriptions = colnames(d))
colnames(d) = new_cols



[1] "Column names and descriptions:"


column_names,column_descriptions
<fct>,<fct>
state,State
total,Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
speeding,Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
alcohol,Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
not_distracted,Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
previous_accidents,Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
insurance_premium,Car.Insurance.Premiums....
insurance_loss,Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
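
The cell above reads from a URL. As a minimal sketch of the local-file case, the code below writes a tiny data frame to a temporary file and reads it back. The names `toy`, `path`, `from_disk`, and `from_text` are assumptions made for illustration, and the values are copied from the bad-drivers data.

```r
# A tiny data frame to write to disk
toy <- data.frame(state = c("Alabama", "Alaska"), total = c(18.8, 18.1))

# tempfile() gives a throwaway path on the local computer
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# read.csv accepts the local path in the same way it accepts a URL
from_disk <- read.csv(path)
print(from_disk)

# The `text` argument reads CSV content directly from a string
from_text <- read.csv(text = "state,total\nAlabama,18.8\nAlaska,18.1")
print(from_text)
```

Both calls should produce a data frame with two rows and the columns `state` and `total`.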


In [35]:
print('The actual data:')
d # Jupyter will by default print any variable 

[1] "The actual data:"


state,total,speeding,alcohol,not_distracted,previous_accidents,insurance_premium,insurance_loss
<fct>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
Alabama,18.8,39,30,96,80,784.55,145.08
Alaska,18.1,41,25,90,94,1053.48,133.93
Arizona,18.6,35,28,84,96,899.47,110.35
Arkansas,22.4,18,26,94,95,827.34,142.39
California,12.0,35,28,91,89,878.41,165.63
Colorado,13.6,37,28,79,95,835.5,139.91
Connecticut,10.8,46,36,87,82,1068.73,167.02
Delaware,16.2,38,30,87,99,1137.87,151.48
District of Columbia,5.9,34,27,100,100,1273.89,136.05
Florida,17.9,21,29,92,94,1160.13,144.18


### Selecting a column and the $ operator

With a data frame you can select a column by number, where the first column in the data frame is 1, the second column is 2, and so on.
You can also ask R to return a column of your data frame by column name.

The 3rd column of the data frame that we read records, by state, the percentage of drivers involved in fatal collisions who were speeding.

We can ask R for the 3rd column by using square brackets in the same way that we used square brackets to select rows and columns of matrices.

In [36]:
d[,3]

We can request a column from a data frame by writing
1. The name of the data frame
2. Square brackets

Inside the square brackets, a comma separates the rows you want selected (to the left) from the columns you want selected (to the right).

For example, if we wanted to look at the 10th row and 3rd column we can write

In [37]:
d[10,3]

If instead we wanted to view rows 10,11,12 and columns 3 and 4 we could write

In [38]:
d[ 10:12, c(3,4)  ]

,speeding,alcohol
,<int>,<int>
10,21,29
11,19,25
12,54,41


The above examples select columns and rows by number. 
One advantage of a dataframe is that we can select columns **by name**. 


For example, we may be interested in rows 10-12 of the column "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding"

In [39]:
d[10:12,"speeding"]

We can ask for **all** rows that correspond to a column by leaving the entry to the left of the comma blank.

In [40]:
d[,"speeding"]

Finally, there is a shorthand for the code ``d[,"speeding"]`` using the dollar sign operator.
The dollar sign operator can be thought of as a function that takes a column name as input and returns the column of data in your data frame with that name.

In [41]:
d$speeding
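
As a quick check that these forms agree, the sketch below compares them on a small data frame. The name `toy` is an assumption made for illustration, and the values are copied from the first three rows of the bad-drivers data. Double square brackets, `toy[["speeding"]]`, are a third equivalent form in base R.

```r
toy <- data.frame(
  state    = c("Alabama", "Alaska", "Arizona"),
  speeding = c(39L, 41L, 35L),
  alcohol  = c(30L, 25L, 28L)
)

by_bracket <- toy[, "speeding"]    # bracket notation, all rows
by_dollar  <- toy$speeding         # dollar sign shorthand
by_double  <- toy[["speeding"]]    # double brackets, another equivalent form

print(identical(by_bracket, by_dollar))
print(identical(by_dollar, by_double))

# A single selected column comes back as a plain vector...
print(is.data.frame(toy[, "speeding"]))
# ...unless drop = FALSE asks R to keep the data frame structure.
print(is.data.frame(toy[, "speeding", drop = FALSE]))
```

The last two lines show a detail worth remembering: selecting one column returns a vector, while `drop = FALSE` keeps a one-column data frame.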

### Logical indexing
We can use logical indexing, as we learned when selecting items in vectors and matrices, to select rows and columns of a data frame.
For example, suppose we are interested in **all** columns of our data frame, but only for the rows (states) where the Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding is above 40%.

In [42]:
d[ d$speeding > 40, ] 

,state,total,speeding,alcohol,not_distracted,previous_accidents,insurance_premium,insurance_loss
,<fct>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
2,Alaska,18.1,41,25,90,94,1053.48,133.93
7,Connecticut,10.8,46,36,87,82,1068.73,167.02
12,Hawaii,17.5,54,41,82,87,861.18,120.92
26,Missouri,16.1,43,34,92,84,790.32,144.45
39,Pennsylvania,18.2,50,31,96,88,905.99,153.86
45,Utah,11.3,43,16,88,96,809.38,109.48
48,Washington,10.6,42,33,82,86,890.03,111.62
51,Wyoming,17.4,42,32,81,90,791.14,122.04


Or maybe we are not interested in all columns, but just in the names of the states where this percentage is above 40%.
We can subset to a specific column by including the name of the column to the right of the comma inside the square brackets.

In [43]:
d[ d$speeding > 40, "state" ] 

If we wanted to include more than one column then we can include a vector that contains each column we want to select.

In [44]:
d[ d$speeding > 40, c("state","insurance_premium") ] 

,state,insurance_premium
,<fct>,<dbl>
2,Alaska,1053.48
7,Connecticut,1068.73
12,Hawaii,861.18
26,Missouri,790.32
39,Pennsylvania,905.99
45,Utah,809.38
48,Washington,890.03
51,Wyoming,791.14
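
It can help to look at the logical vector on its own before using it as an index. The sketch below does this on a small data frame. The names `toy`, `keep`, and `both`, and the thresholds 36 and 30, are assumptions chosen for illustration; the values are copied from the first five rows of the bad-drivers data.

```r
toy <- data.frame(
  state    = c("Alabama", "Alaska", "Arizona", "Arkansas", "California"),
  speeding = c(39L, 41L, 35L, 18L, 35L),
  alcohol  = c(30L, 25L, 28L, 26L, 28L)
)

# The comparison produces one TRUE/FALSE value per row
keep <- toy$speeding > 36
print(keep)

# Rows where keep is TRUE are returned; the blank after the comma keeps all columns
print(toy[keep, ])

# Conditions can be combined with & (and) or | (or)
both <- toy$speeding > 30 & toy$alcohol < 30
print(toy[both, c("state", "speeding", "alcohol")])

# sum() counts the TRUE values, i.e. the number of matching rows
print(sum(keep))
```

Because `TRUE` counts as 1 and `FALSE` as 0, `sum(keep)` gives the number of rows that satisfy the condition.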


### Functions for data frames

There are several useful functions that take as an argument a data frame.
The ``nrow`` function takes a data frame as an argument and returns the number of rows (i.e. the number of data points) in the data frame. 
The ``ncol`` function takes a data frame as an argument and returns the number of columns in the data frame. 

In [45]:
number_of_rows = nrow(d)
number_of_columns = ncol(d)

print(number_of_rows)
print(number_of_columns)

[1] 51
[1] 8


The ``colnames`` function takes as an argument a data frame and returns a vector that contains the name of each column in the data frame.

In [46]:
column_names = colnames(d)
print(column_names)

[1] "state"              "total"              "speeding"          
[4] "alcohol"            "not_distracted"     "previous_accidents"
[7] "insurance_premium"  "insurance_loss"    
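
Two related base R idioms, sketched on a small data frame: `dim` returns the number of rows and columns in one call, and `colnames` can appear on the left side of an assignment to rename a column. The names `toy` and `fatal_collisions` are hypothetical and used only for illustration.

```r
toy <- data.frame(
  state    = c("Alabama", "Alaska"),
  total    = c(18.8, 18.1),
  speeding = c(39L, 41L)
)

# dim() returns c(number of rows, number of columns)
dims <- dim(toy)
print(dims)
print(identical(dims, c(nrow(toy), ncol(toy))))

# Rename only the 2nd column ("fatal_collisions" is a hypothetical name)
colnames(toy)[2] <- "fatal_collisions"
print(colnames(toy))
```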


The ``summary`` function takes a data frame as an argument and returns a summary of each column. For numeric columns it reports the minimum value, 25th percentile (called the 1st quartile), median, mean (average), 75th percentile (called the 3rd quartile), and maximum value. For factor columns such as ``state`` it reports how many times each level appears.

In [47]:
summary(d)

        state        total          speeding        alcohol     
 Alabama   : 1   Min.   : 5.90   Min.   :13.00   Min.   :16.00  
 Alaska    : 1   1st Qu.:12.75   1st Qu.:23.00   1st Qu.:28.00  
 Arizona   : 1   Median :15.60   Median :34.00   Median :30.00  
 Arkansas  : 1   Mean   :15.79   Mean   :31.73   Mean   :30.69  
 California: 1   3rd Qu.:18.50   3rd Qu.:38.00   3rd Qu.:33.00  
 Colorado  : 1   Max.   :23.90   Max.   :54.00   Max.   :44.00  
 (Other)   :45                                                  
 not_distracted   previous_accidents insurance_premium insurance_loss  
 Min.   : 10.00   Min.   : 76.00     Min.   : 642.0    Min.   : 82.75  
 1st Qu.: 83.00   1st Qu.: 83.50     1st Qu.: 768.4    1st Qu.:114.64  
 Median : 88.00   Median : 88.00     Median : 859.0    Median :136.05  
 Mean   : 85.92   Mean   : 88.73     Mean   : 887.0    Mean   :134.49  
 3rd Qu.: 95.00   3rd Qu.: 95.00     3rd Qu.:1007.9    3rd Qu.:151.87  
 Max.   :100.00   Max.   :100.00     Max.   :130
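
To see where these numbers come from, the sketch below computes the same six statistics by hand for a single numeric vector and compares them with `summary`. The vector holds the `speeding` values of the first ten rows of the data; the name `by_hand` is an assumption made for illustration.

```r
# speeding values for the first ten rows of the bad-drivers data
speeding <- c(39L, 41L, 35L, 18L, 35L, 37L, 46L, 38L, 34L, 21L)

by_hand <- c(
  min(speeding),
  quantile(speeding, 0.25),   # 1st quartile
  median(speeding),
  mean(speeding),
  quantile(speeding, 0.75),   # 3rd quartile
  max(speeding)
)

print(unname(by_hand))
print(summary(speeding))
```

The two printed results should list the same six values, in the order minimum, 1st quartile, median, mean, 3rd quartile, maximum.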

The `str` function is a handy tool that tells you what type of data each column holds (e.g., integer, numeric, factor) and shows you the first few elements of each column.

In [48]:
str(d)

'data.frame':	51 obs. of  8 variables:
 $ state             : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ total             : num  18.8 18.1 18.6 22.4 12 13.6 10.8 16.2 5.9 17.9 ...
 $ speeding          : int  39 41 35 18 35 37 46 38 34 21 ...
 $ alcohol           : int  30 25 28 26 28 28 36 30 27 29 ...
 $ not_distracted    : int  96 90 84 94 91 79 87 87 100 92 ...
 $ previous_accidents: int  80 94 96 95 89 95 82 99 100 94 ...
 $ insurance_premium : num  785 1053 899 827 878 ...
 $ insurance_loss    : num  145 134 110 142 166 ...
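
`str` prints a report for you to read. If you need the column types as a value you can use in code, one option is to apply `class` to every column with `sapply`. This is a minimal sketch; the names `toy` and `col_types` are assumptions, and `stringsAsFactors = TRUE` is set explicitly so that the toy `state` column is a factor like the one in the output above.

```r
toy <- data.frame(
  state    = c("Alabama", "Alaska"),
  total    = c(18.8, 18.1),
  speeding = c(39L, 41L),
  stringsAsFactors = TRUE   # store the text column as a factor
)

# One class name per column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# class() also works on a single column
print(class(toy$total))
```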


The `head` function is an *extremely* handy tool that I use all the time. It shows you the first few rows of all columns from your data, like a preview. The `tail` function does the same except it shows the last few rows.

In [49]:
head(d)

,state,total,speeding,alcohol,not_distracted,previous_accidents,insurance_premium,insurance_loss
,<fct>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,Alabama,18.8,39,30,96,80,784.55,145.08
2,Alaska,18.1,41,25,90,94,1053.48,133.93
3,Arizona,18.6,35,28,84,96,899.47,110.35
4,Arkansas,22.4,18,26,94,95,827.34,142.39
5,California,12.0,35,28,91,89,878.41,165.63
6,Colorado,13.6,37,28,79,95,835.5,139.91
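
By default `head` and `tail` show six rows. Both accept an argument `n` that sets how many rows to return. The sketch below uses a small data frame built from the first ten rows of the data; the names `toy`, `first_three`, and `last_two` are assumptions made for illustration.

```r
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
            "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida"),
  speeding = c(39L, 41L, 35L, 18L, 35L, 37L, 46L, 38L, 34L, 21L)
)

first_three <- head(toy, n = 3)   # rows 1 to 3
last_two    <- tail(toy, n = 2)   # rows 9 and 10

print(first_three)
print(last_two)
```

Note that `tail` keeps the original row numbers, so the rows of `last_two` are still labeled 9 and 10.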


## Assignment

We are going to look at a dataset that was collected on university "fight songs" that are played during sporting events. The article about the dataset is here: https://projects.fivethirtyeight.com/college-fight-song-lyrics/

1. The data is at the URL https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv. Please read in this raw CSV as a data frame.
2. Use the ``nrow`` and ``ncol`` functions to describe the number of fight songs sampled and the number of different characteristics collected about each song.
3. How many songs were sampled with a BPM (beats per minute) above 150?
4. Select the rows where BPM is greater than 150 and select the column "nonsense", which records whether or not the song uses nonsense syllables (e.g. "Whoo-Rah" or "Hooperay").
5. Summarize the data frame. Why do you think the ``summary`` function does not produce useful information for many of the columns?