# The Data Frame
The data frame is the primary data structure you'll be working with in R for data analysis. Many of the same rules that apply to vectors still apply, except now we're working in two dimensions (rows and columns).

It's often useful to work with single columns (i.e., vectors), so much of the logic we covered for vectors is directly relevant when working with a single variable in a data frame. 

In [1]:
## pulling in a dataset that ships with R
## (ToothGrowth lives in the built-in `datasets` package, so no library() call is needed)
df <- ToothGrowth

The assignment operator `<-` means that the variable `df` now represents the dataset. We can index this dataset with brackets `[]` much in the same way that we indexed vectors. But the first thing to do is to run some summary functions so we can get a sense of what data we're dealing with:

In [2]:
head(df) ## print out the first few rows
summary(df) ## quick descriptives on each variable in the dataset
str(df) ## shows us the variable types
nrow(df) ## returns the number of rows in the data
colnames(df) ## returns the names of the columns as a string vector
View(df) ## view the data in a spreadsheet format (works in RStudio; errors in the Jupyter R kernel)

len,supp,dose
4.2,VC,0.5
11.5,VC,0.5
7.3,VC,0.5
5.8,VC,0.5
6.4,VC,0.5
10.0,VC,0.5


      len        supp         dose      
 Min.   : 4.20   OJ:30   Min.   :0.500  
 1st Qu.:13.07   VC:30   1st Qu.:0.500  
 Median :19.25           Median :1.000  
 Mean   :18.81           Mean   :1.167  
 3rd Qu.:25.27           3rd Qu.:2.000  
 Max.   :33.90           Max.   :2.000  

'data.frame':	60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...


ERROR: Error in View(df): ‘View()’ not yet supported in the Jupyter R kernel
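When `View()` isn't available, a few other base R functions give a similar overview. The sketch below is one way to do it; `tg` is a name used only in this example so the block runs on its own.

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

shape <- dim(tg)         ## number of rows and number of columns together
n_cols <- ncol(tg)       ## number of columns on its own
top_rows <- head(tg, 10) ## head() takes a second argument for how many rows to show
last_rows <- tail(tg, 3) ## tail() does the same from the bottom of the data

print(shape)
print(top_rows)
print(last_rows)
```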


If we want to pull a specific value out of the data, we can do so with the following format:  
`data[rows, columns]`

In [3]:
df[3,2]
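A few more variations on the `data[rows, columns]` pattern, as a small sketch. It works on a fresh copy of the dataset called `tg` (a name used only here) so it runs on its own.

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

one_value <- tg[3, 2]                   ## row 3, column 2 (the 'supp' column)
same_value <- tg[3, 'supp']             ## same cell, with the column referenced by name
block <- tg[c(1, 3), c('len', 'dose')]  ## rows 1 and 3, two columns picked by name

print(one_value)
print(block)
```

Passing a vector with `c()` in either slot lets you grab several rows or columns at once.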

One shortcut to grabbing a whole column out of the data is to use dollar sign notation `$`.  
Example: `data$column1` will return as a vector all the values in `column1`, where *column1* is the name of the actual column in the data.

In [5]:
head(df)

len,supp,dose
4.2,VC,0.5
11.5,VC,0.5
7.3,VC,0.5
5.8,VC,0.5
6.4,VC,0.5
10.0,VC,0.5


In [6]:
df$len

We can then work with this output just like you would any other vector:

In [10]:
df$len[1:5]

Notice how the code above is equivalent to indexing the whole data frame with two dimensions as below:

In [9]:
df[1:5,'len']

Whenever you reference a column in brackets like `data[rows, columns]`, you either have to refer to the column with the column number or with the column name **in quotes**. When using the dollar sign notation, you don't need to use quotes.

In [11]:
df[1:5,len]

ERROR: Error in `[.data.frame`(df, 1:5, len): object 'len' not found
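To check that these different ways of referencing a column really do return the same thing, you can compare them with `identical()`. This is a sketch on a fresh copy of the dataset (`tg` and `col_name` are names used only for this example).

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

by_dollar <- tg$len[1:5]     ## dollar sign notation, no quotes
by_name <- tg[1:5, 'len']    ## bracket notation, column name in quotes
by_number <- tg[1:5, 1]      ## bracket notation, column number

## an unquoted name inside brackets only works if it is a variable holding the column name
col_name <- 'len'
by_variable <- tg[1:5, col_name]

print(identical(by_dollar, by_name))
print(identical(by_dollar, by_number))
print(identical(by_dollar, by_variable))
```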


If you want to index rows *or* columns and return all values on the dimension you didn't index (e.g., index certain rows and return all columns), you still use a comma but leave blank the slot you don't want to index over.

In [12]:
df[1:7,]

len,supp,dose
4.2,VC,0.5
11.5,VC,0.5
7.3,VC,0.5
5.8,VC,0.5
6.4,VC,0.5
10.0,VC,0.5
11.2,VC,0.5


In [15]:
df[,'len'][1:5][1] ## notice how this gets returned as a vector, not as a data frame with one column
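If you'd rather keep a single column as a data frame instead of a vector, the bracket operator accepts a `drop = FALSE` argument. A minimal sketch, again on a fresh copy named `tg` (a name used only here):

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

as_vector <- tg[, 'len']               ## single column gets simplified to a numeric vector
as_frame <- tg[, 'len', drop = FALSE]  ## drop = FALSE keeps the data frame structure

print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
print(dim(as_frame))
```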

The crazy nesting rules that we learned about with vectors come back into play when we want to index rows of the data:

In [20]:
df[df$len > 30 & df$supp == 'VC',] ## return whole rows where 'len' is greater than 30 AND 'supp' is 'VC'

row,len,supp,dose
23,33.9,VC,2
26,32.5,VC,2


Remember, if you ever forget what's going on with the nesting stuff, break it down into parts:

In [16]:
df$len

In [18]:
head(df)

len,supp,dose
4.2,VC,0.5
11.5,VC,0.5
7.3,VC,0.5
5.8,VC,0.5
6.4,VC,0.5
10.0,VC,0.5


In [17]:
df$len > 30
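One way to break the row filter into parts is to save each logical vector under its own name and look at it before using it. This is only a sketch; `tg`, `long_teeth`, `is_vc`, and `keep` are names made up for this example.

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

long_teeth <- tg$len > 30    ## one TRUE/FALSE per row
is_vc <- tg$supp == 'VC'     ## one TRUE/FALSE per row
keep <- long_teeth & is_vc   ## TRUE only where both conditions hold

print(sum(keep))    ## TRUE counts as 1, so this is the number of matching rows
print(which(keep))  ## positions of the matching rows
print(tg[keep, ])   ## the logical vector goes in the rows slot
```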

## Coding new variables
You can use the dollar sign notation to create a new variable: write a column name that doesn't exist yet after the `$` and assign values to it.

To center a variable, for example:

In [22]:
df$len_c <- df$len - mean(df$len)
head(df)

len,supp,dose,len_c
4.2,VC,0.5,-14.613333
11.5,VC,0.5,-7.313333
7.3,VC,0.5,-11.513333
5.8,VC,0.5,-13.013333
6.4,VC,0.5,-12.413333
10.0,VC,0.5,-8.813333


In [21]:
df$len - mean(df$len)
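A quick sanity check on a centered variable is that its mean should be zero (up to floating-point rounding) while its spread stays the same. A sketch of that check, using a fresh copy named `tg` (a name used only here):

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth
tg$len_c <- tg$len - mean(tg$len)

centered_mean <- mean(tg$len_c)                      ## should be essentially zero
mean_is_zero <- abs(centered_mean) < 1e-10           ## allow for floating-point rounding
same_spread <- all.equal(sd(tg$len_c), sd(tg$len))   ## centering doesn't change the spread

print(mean_is_zero)
print(same_spread)
```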

**Making if / else contingencies in coding a new variable**  
For example, I want to create a new variable that has a score of 1 if 'len' is greater than the median of 'len', and a score of 0 otherwise.

In [24]:
df$len_d <- ifelse(df$len > median(df$len), 1, 0)
df[20:29,]

row,len,supp,dose,len_c,len_d
20,15.5,VC,1,-3.3133333,0
21,23.6,VC,2,4.7866667,1
22,18.5,VC,2,-0.3133333,0
23,33.9,VC,2,15.0866667,1
24,25.5,VC,2,6.6866667,1
25,26.4,VC,2,7.5866667,1
26,32.5,VC,2,13.6866667,1
27,26.7,VC,2,7.8866667,1
28,21.5,VC,2,2.6866667,1
29,23.3,VC,2,4.4866667,1


In [23]:
ifelse(df$len > median(df$len), 1, 0)

In [None]:
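## general shape of an if / else block; the condition must be a single TRUE or FALSE
x <- 5 ## placeholder value, just for illustration
if (x > 3) {
    print('x is greater than 3')
} else {
    print('x is 3 or less')
}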
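The difference between `ifelse()` and a plain `if` / `else` block is that `ifelse()` works element by element on a whole vector, while `if` makes one decision for one value. A rough sketch of the contrast (`tg`, `first_len`, and `first_label` are names used only for this example):

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

## vectorised: one result per row
len_d <- ifelse(tg$len > median(tg$len), 1, 0)
print(table(len_d))

## scalar: one decision for a single value
first_len <- tg$len[1]
if (first_len > median(tg$len)) {
    first_label <- 'above median'
} else {
    first_label <- 'at or below median'
}
print(first_label)
```

That's why `ifelse()` is the usual choice when coding a new column, since a column has one value per row.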

## Messing with a real dataset

In [42]:
df <- read.csv('https://davebraun.org/quant/current_data.csv')
head(df)

direction,orientation,station,subjective_distance
EAST,1,1,5
EAST,1,1,4
EAST,1,1,3
EAST,1,1,3
EAST,1,1,4
EAST,1,1,1


In [41]:
mean(df[ ,'subjective_distance'])

In [48]:
## mean of subjective_distance (column 4) for only the rows where direction is 'EAST'
mean(df[df$direction == 'EAST', 4])
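The same filter-then-summarize pattern works on any data frame. The sketch below uses the built-in `ToothGrowth` data rather than the CSV above so that it runs without a download, and it adds `tapply()` as one possible way to get the mean for every group at once (`tg`, `vc_mean`, and `group_means` are names used only here).

```r
## illustrative copy of the built-in dataset; `tg` is a name chosen for this sketch
tg <- ToothGrowth

## mean of 'len' for one group, using row filtering
vc_mean <- mean(tg[tg$supp == 'VC', 'len'])

## mean of 'len' for every level of 'supp' in one call
group_means <- tapply(tg$len, tg$supp, mean)

print(vc_mean)
print(group_means)
```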