# Intro to Data Science with `R`

<img src="../images/r_logo.png">

### Read in data

In [1]:
df <- read.csv('../../../datasets/baby-names/NationalNames1.csv')
df <- subset(df, select = -Id)

Remember the data frame? Well, it returns in `R`, as promised. The `read.csv` function creates a data frame from the file you just read in. In the above example, we read in the data set `NationalNames1.csv` and then remove the `Id` column with the `subset` function. Take a look at the first few rows below...
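
To experiment with the same two steps without the CSV file on disk, here is a minimal sketch that reads a few rows from an inline string instead. The `toy` and `csv_text` names and the `Id` values are made up for illustration; the other values are copied from outputs shown in this lesson.

```r
# Inline CSV text stands in for the real file (the Id values are invented)
csv_text <- "Id,Name,Year,Gender,Count
1,Anna,1880,F,2604
2,Emma,1880,F,2003
3,George,1880,M,5126"

# read.csv() accepts `text =` in place of a file path
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Drop the Id column, the same way the cell above does
toy <- subset(toy, select = -Id)

print(names(toy))
print(nrow(toy))
```

One thing to watch for: a column that contains nothing but `F` (or `T`) would be read as logical `FALSE` (or `TRUE`), which is why this toy data includes a row with `M`.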

In [2]:
head(df)

Unnamed: 0,Name,Year,Gender,Count
1,Anna,1880,F,2604
2,Emma,1880,F,2003
3,Margaret,1880,F,1578
4,Sarah,1880,F,1288
5,Annie,1880,F,1258
6,Florence,1880,F,1063


In both `R` and `Python`, the data frame tries to infer the data type of each column automatically. This is a really convenient feature, as it means we don't have to spend a whole lot of time converting our data into the appropriate types.

In [3]:
sapply(df, class)

`sapply` is a function that applies another function to each column in a data frame and returns the output generated for each column. In this case, it applies the `class` function to find the data type of each column.
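
As a small sketch of what to expect, the example below builds a three-row data frame by hand (the `toy` name is hypothetical, the values come from this lesson) and runs the same `sapply` call on it.

```r
# A hand-built stand-in for df, with explicit integer and character columns
toy <- data.frame(
  Name   = c("Anna", "Emma", "George"),
  Year   = c(1880L, 1880L, 1880L),
  Gender = c("F", "F", "M"),
  Count  = c(2604L, 2003L, 5126L),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)
```

On the real `df`, the text columns may show up as `factor` rather than `character`: versions of `R` before 4.0.0 converted strings to factors by default when reading a CSV. `str(toy)` gives a similar overview along with a preview of the values.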

## Subsetting

Subsetting allows us to narrow down the scope of our analysis. It removes any data that we deem unnecessary, which, in turn, permits our code to run faster.

One thing that you will begin to notice about `R` is how flexible it is. There are often several ways to do the same thing. Subsetting is a prime example of this.

In [4]:
female <- subset(df, Gender == 'F')
male <- subset(df, Gender == 'M' )

Above is an example of splitting the entire data frame into two separate data frames based on `Gender`. Here we call the `subset` function, which takes the data frame as its first argument, followed by a logical condition written in terms of the column names. We can combine several conditions with the `&` (and) or `|` (or) operators. For example, if I wanted all rows in the data frame that were in the year 1880 and had a count greater than or equal to 2000, it would look like this...

```splus
subset(df, Year == 1880 & Count >= 2000)
```

or you could do this...

In [5]:
year1880 <- df[df$Year == 1880 & df$Count >= 2000,]

That is the flexibility of `R` at work: we often find multiple solutions to the same problem. Go ahead and run the next line to see what that data frame looks like.
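
The two approaches agree whenever the condition is never missing, but they treat `NA` differently. The sketch below uses a hypothetical four-row data frame (`toy`, with an invented `NA` count for one name) to show the difference.

```r
# Toy data: the NA count is invented to illustrate missing values
toy <- data.frame(
  Name  = c("Anna", "Emma", "Minnie", "George"),
  Year  = c(1880, 1880, 1880, 1881),
  Count = c(2604, 2003, NA, 5126)
)

# subset() silently drops rows where the condition evaluates to NA
with_subset <- subset(toy, Year == 1880 & Count >= 2000)

# Bracket indexing keeps such rows, filled entirely with NA
with_bracket <- toy[toy$Year == 1880 & toy$Count >= 2000, ]

# Wrapping the condition in which() makes brackets behave like subset()
with_which <- toy[which(toy$Year == 1880 & toy$Count >= 2000), ]

print(nrow(with_subset))
print(nrow(with_bracket))
print(nrow(with_which))
```

If your data may contain missing values, `subset` or the `which` form is usually the safer choice.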

In [6]:
head(year1880)

Unnamed: 0,Name,Year,Gender,Count
1,Anna,1880,F,2604
2,Emma,1880,F,2003
293,George,1880,M,5126
294,Henry,1880,M,2444
295,Edward,1880,M,2364
296,Harry,1880,M,2152


## Sorting

We can also sort data frames based on the values in a column or multiple columns. In the following example, we sort the dataframe by Count in descending order and by Year in ascending order. We'll go ahead and call this sorted data frame `sorted`.

In [7]:
sorted <- df[order(-df$Count, df$Year),]

and let's see what that looks like...

In [8]:
head(sorted) # the top few rows
tail(sorted) # the bottom few rows

Name,Year,Gender,Count
Linda,1947,F,99680
Linda,1949,F,91010
Michael,1961,M,86916
James,1949,M,86856
Robert,1954,M,86258
James,1950,M,86221


Unnamed: 0,Name,Year,Gender,Count
608398,Zeidan,2014,M,5
608399,Ziar,2014,M,5
608400,Zichen,2014,M,5
608401,Ziden,2014,M,5
608402,Ziyah,2014,M,5
608403,Zykeem,2014,M,5


So keep in mind that we just sorted the data frame based on two columns. Notice the minus sign in front of `df$Count`: it tells `order` to sort that column in *descending* order, from largest value to smallest. Without it, a column is sorted in *ascending* order, which is the default.
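
Here is a minimal sketch of a two-key sort on a tiny data frame, where the result is easy to check by eye. The `toy` data and its counts are invented for illustration.

```r
# Invented counts, chosen so that two rows tie on Count
toy <- data.frame(
  Name  = c("Anna", "Emma", "George", "John"),
  Year  = c(1881, 1880, 1880, 1881),
  Count = c(10, 10, 30, 20)
)

# Each sort key is a separate, comma-separated argument to order():
# Count descending first, then Year ascending to break ties
sorted_toy <- toy[order(-toy$Count, toy$Year), ]
print(sorted_toy)

# Joining the keys with & produces a single all-TRUE logical vector,
# so order() would return the rows in their original positions
no_sort <- order(-toy$Count & toy$Year)
print(no_sort)
```

The two rows with a `Count` of 10 end up ordered by `Year`, which is the job of the second key.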

## Manipulating Data Frames with `dplyr`

<img src="../images/dplyr.png">

`dplyr` is a very convenient data manipulation package. As you will see, much of the functionality provided by this package should be familiar to you already. However, `dplyr` makes managing data frames simpler through the use of its grammar and the ability to pipe functions together.

### `dplyr` grammar

#### `rename`

The `rename` function is simple but extremely useful. It renames the columns of a data frame, which is possible in base `R` but more awkward than you might expect. `rename` takes the data frame as its first argument, followed by one `new_name = old_name` pair for each column you want to rename. For example, if you hate capital letters and would prefer the `Name` column to be called `name`, the following in `dplyr` would allow you to do so:

```splus
rename(df, name = Name)
```
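
For comparison, here is a sketch of one common way to do the same rename in base `R`, using a hypothetical two-column `toy` data frame.

```r
toy <- data.frame(Name = c("Anna", "Emma"), Count = c(2604, 2003))

# Find the position of the old name, then overwrite it in place
names(toy)[names(toy) == "Name"] <- "name"

print(names(toy))
```

Note the difference in behavior: the base `R` assignment changes `toy` itself, while `rename` returns a new data frame and leaves the original untouched unless you assign the result back.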

Now, let's try for ourselves.

In [9]:
library(dplyr)

print(names(df))
head(rename(df, name = Name, year = Year, gender = Gender, count = Count ))


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



[1] "Name"   "Year"   "Gender" "Count" 


Unnamed: 0,name,year,gender,count
1,Anna,1880,F,2604
2,Emma,1880,F,2003
3,Margaret,1880,F,1578
4,Sarah,1880,F,1288
5,Annie,1880,F,1258
6,Florence,1880,F,1063


#### `select`

`select()` will return a subset of specified columns of a dataframe.

In [10]:
head(df)

# just the Name column
print("Just the Name")
select(head(df), Name)

# select the Name and Count columns
print("Name and Count")
select(head(df), c(Name,Count))

# the Name column through Gender column
print("Name through Gender")
select(head(df), (Name:Gender))

# remove the Year column 
print("Remove the Year column")
select(head(df), -Year)

Unnamed: 0,Name,Year,Gender,Count
1,Anna,1880,F,2604
2,Emma,1880,F,2003
3,Margaret,1880,F,1578
4,Sarah,1880,F,1288
5,Annie,1880,F,1258
6,Florence,1880,F,1063


[1] "Just the Name"


Unnamed: 0,Name
1,Anna
2,Emma
3,Margaret
4,Sarah
5,Annie
6,Florence


[1] "Name and Count"


Unnamed: 0,Name,Count
1,Anna,2604
2,Emma,2003
3,Margaret,1578
4,Sarah,1288
5,Annie,1258
6,Florence,1063


[1] "Name through Gender"


Unnamed: 0,Name,Year,Gender
1,Anna,1880,F
2,Emma,1880,F
3,Margaret,1880,F
4,Sarah,1880,F
5,Annie,1880,F
6,Florence,1880,F


[1] "Remove the Year column"


Unnamed: 0,Name,Gender,Count
1,Anna,F,2604
2,Emma,F,2003
3,Margaret,F,1578
4,Sarah,F,1288
5,Annie,F,1258
6,Florence,F,1063


The `select` function also supports helper functions that pick columns based on partial string matches on their names. For example, if we wanted to select all columns whose names end with the letter 'r', we can call the `ends_with` function within `select` itself. If we ran this on our `df` object, we should expect only the `Yea`**`r`** and `Gende`**`r`** columns to be returned. Let's see if that works...

In [11]:
library(dplyr)

print("Only those columns whose headers end with an 'r'")
head(select(df, ends_with("r")))

[1] "Only those columns whose headers end with an 'r'"


Unnamed: 0,Year,Gender
1,1880,F
2,1880,F
3,1880,F
4,1880,F
5,1880,F
6,1880,F


As you can probably imagine, this isn't the only way to select based on a partial string, or all of the functionality of the `select` function for that matter. But for now, we are going to move on to other parts of `dplyr` that are handy.
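
As a brief taste of those other options, the sketch below tries two more helpers, `starts_with` and `contains`, on a hypothetical two-row `toy` data frame with the same column names as `df`.

```r
library(dplyr)

toy <- data.frame(
  Name   = c("Anna", "George"),
  Year   = c(1880, 1880),
  Gender = c("F", "M"),
  Count  = c(2604, 5126)
)

# Columns whose names begin with "C"
starts_c <- select(toy, starts_with("C"))

# Columns whose names contain an "e" anywhere
contains_e <- select(toy, contains("e"))

print(names(starts_c))
print(names(contains_e))
```

These helpers ignore case by default, which is why `ends_with("r")` and `ends_with("R")` would select the same columns.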

#### `filter`

`filter()` allows you to subset the rows of a data frame based on logical conditions. Here is an example...

In [12]:
library(dplyr)
head(filter(df, Count < 6))

Unnamed: 0,Name,Year,Gender,Count
1,Adelle,1880,F,5
2,Adina,1880,F,5
3,Ana,1880,F,5
4,Clair,1880,F,5
5,Delila,1880,F,5
6,Dosha,1880,F,5


This should be familiar. This is the same thing as running `subset(df, Count < 6)` or `df[df$Count < 6,]`. In plain English, we are simply asking for all the rows that have a `Count` less than 6 to be returned.
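
`filter` also accepts several conditions at once. The sketch below assumes a small `toy` data frame built from rows that appear in this lesson's outputs: conditions separated by commas must all hold, while `|` asks for either one.

```r
library(dplyr)

toy <- data.frame(
  Name  = c("Anna", "Emma", "Margaret", "George", "John"),
  Year  = c(1880, 1880, 1880, 1880, 1881),
  Count = c(2604, 2003, 1578, 5126, 8769)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, Year == 1880, Count >= 2000)

# Use | when either condition is enough
either <- filter(toy, Year == 1881 | Count < 2000)

print(both)
print(either)
```

The first call is equivalent to `filter(toy, Year == 1880 & Count >= 2000)`.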

#### `arrange`

`arrange` sorts the rows of a data frame by the values in one or more columns. The default is ascending order; if you prefer descending order, you can either wrap the column in `desc(`*`col_name`*`)` or, for numeric columns, prefix it with a minus sign, `-`*`col_name`*.
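
`arrange` takes several sort keys as well. As a sketch, the example below sorts a hypothetical `toy` data frame (with invented counts) by `Count` in descending order and then by `Year` in ascending order.

```r
library(dplyr)

# Invented counts, chosen so that two rows tie on Count
toy <- data.frame(
  Name  = c("Anna", "Emma", "George", "John"),
  Year  = c(1881, 1880, 1880, 1881),
  Count = c(10, 10, 30, 20)
)

# Keys are applied left to right: later keys only break ties in earlier ones
sorted_toy <- arrange(toy, desc(Count), Year)
print(sorted_toy)
```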

In [13]:
print("Ascending")
head(arrange(df, Count))

print("Descending")
head(arrange(df, desc(Count)))

[1] "Ascending"


Unnamed: 0,Name,Year,Gender,Count
1,Adelle,1880,F,5
2,Adina,1880,F,5
3,Ana,1880,F,5
4,Clair,1880,F,5
5,Delila,1880,F,5
6,Dosha,1880,F,5


[1] "Descending"


Unnamed: 0,Name,Year,Gender,Count
1,Linda,1947,F,99680
2,Linda,1949,F,91010
3,Michael,1961,M,86916
4,James,1949,M,86856
5,Robert,1954,M,86258
6,James,1950,M,86221


#### `mutate`

The `mutate` function is good for manipulating data of a data frame and then returning it as a new column. For example, if I wanted to subtract the mean `Count` from every value of `Count` to see if that row is above or below the mean, `mutate` would be the function to do so. Here is an example of that scenario.
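
`mutate` can create more than one column in a single call. The sketch below, on a hypothetical three-row `toy` data frame, adds each name's share of the total count and a flag for whether it is above the mean. The `Share` and `AboveMean` column names are made up for this example.

```r
library(dplyr)

toy <- data.frame(
  Name  = c("Anna", "Emma", "Margaret"),
  Count = c(2604, 2003, 1578)
)

# Each new column is a name = expression pair
result <- mutate(
  toy,
  Share     = Count / sum(Count),   # fraction of the total count
  AboveMean = Count > mean(Count)   # TRUE when above the average
)

print(result)
```

Like the other verbs, `mutate` returns a new data frame; `toy` itself still has only its original two columns.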

In [14]:
head(mutate(df, CountDetrend = Count - mean(Count)))

Unnamed: 0,Name,Year,Gender,Count,CountDetrend
1,Anna,1880,F,2604,2419.14567975503
2,Emma,1880,F,2003,1818.14567975503
3,Margaret,1880,F,1578,1393.14567975503
4,Sarah,1880,F,1288,1103.14567975503
5,Annie,1880,F,1258,1073.14567975503
6,Florence,1880,F,1063,878.145679755031


#### The Pipe Character `%>%` and more Verbiage

Up until this point, you may be wondering what `dplyr` has to offer that `R`'s base functionality doesn't. The thing is, `dplyr` doesn't add any new functionality, per se, but it **does** simplify the existing functionality.

Much of the power of `dplyr` comes from the `%>%` operator, which takes the output of one function and feeds it into the next function as its first argument. You can string several of these statements together to manipulate data step by step. For some, this concept is better seen than explained. Take the following, albeit trivial, example.

In [15]:
df %>% 
    filter(Name == 'Linda') %>% 
    head(10)

Unnamed: 0,Name,Year,Gender,Count
1,Linda,1881,F,38
2,Linda,1882,F,36
3,Linda,1887,F,50
4,Linda,1888,F,77
5,Linda,1901,F,86
6,Linda,1903,F,90
7,Linda,1904,F,101
8,Linda,1905,F,106
9,Linda,1906,F,98
10,Linda,1907,F,102


Okay, so in this example we are doing nothing more than selecting rows based on a specific value, nothing that we haven't done before. But let's tease this apart line by line. **`df %>%`** specifies that we are working with our data frame `df`, and the pipe operator **`%>%`** feeds it into the next function, which is **`filter(Name == 'Linda') %>%`**. Since we are operating on the output of the first line, we are telling `R` to *`filter`* on *`df`* and keep those rows that have the value *`Linda`* in the *`Name`* column. But notice the `%>%` operator at the end of this line too. Again, we are taking the output from this line, a data frame with only the rows whose `Name` column is `Linda`, and feeding it into the next function, **`head(10)`**. This is the end of our pipeline, and it returns the first 10 rows of this data frame of `Linda`s.
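
To make the equivalence concrete, the sketch below writes the same computation twice on a hypothetical four-row `toy` data frame: once as nested function calls and once as a pipeline.

```r
library(dplyr)

toy <- data.frame(
  Name  = c("Linda", "Margaret", "Linda", "Linda"),
  Year  = c(1881, 1881, 1882, 1887),
  Count = c(38, 1658, 36, 50)
)

# Nested calls are read from the inside out
nested <- head(filter(toy, Name == "Linda"), 2)

# The pipeline is read top to bottom, in the order the steps run
piped <- toy %>%
    filter(Name == "Linda") %>%
    head(2)

print(identical(nested, piped))
```

The pipeline does nothing the nested version cannot do; it only changes the order in which you read the steps. Versions of `R` from 4.1.0 onward also include a native pipe, `|>`, that works along similar lines.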

Like I said, this is a rather trivial example, and one for which we could certainly have found a simpler solution, but there are more complex data manipulation tasks that make the `%>%` operator a very useful tool in data science. In the following, we will explore the final verbs of the `dplyr` package.

#### `group_by`

`group_by` is great for beginning to explore data. Imagine we were interested in the most popular name for each year and for each gender. To do this, we would have to group the data frame by both the `Year` and `Gender` columns (factor levels are discussed more in Module 1) and then filter each group for its maximum `Count`. What should return is a data frame that has the most popular name for each gender per year.
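
The same idea can be traced by hand on a small scale. The sketch below assumes a six-row `toy` data frame assembled from rows shown in this lesson's outputs, so that each group's winner is easy to verify.

```r
library(dplyr)

toy <- data.frame(
  Name   = c("Anna", "Emma", "George", "Henry", "Margaret", "John"),
  Year   = c(1880, 1880, 1880, 1880, 1881, 1881),
  Gender = c("F", "F", "M", "M", "F", "M"),
  Count  = c(2604, 2003, 5126, 2444, 1658, 8769)
)

top_names <- toy %>%
    group_by(Year, Gender) %>%          # one group per Year/Gender pair
    filter(Count == max(Count)) %>%     # max() is computed within each group
    ungroup()                           # drop the grouping when finished

print(top_names)
```

The final `ungroup()` is optional here, but it is a useful habit: a grouped data frame stays grouped, and later verbs would keep operating group by group.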

In [16]:
df %>%
    group_by(Year,Gender) %>%
    filter(Count == max(Count)) %>%
    head()

Unnamed: 0,Name,Year,Gender,Count
1,Anna,1880,F,2604
2,George,1880,M,5126
3,Margaret,1881,F,1658
4,John,1881,M,8769
5,Ida,1882,F,1673
6,John,1882,M,9557


***Something to Think About:*** How would you find the most popular (by absolute count) male and female name of the entire dataset? How about just the most popular name of either gender per year?

#### `summarize`

This is the final verb in `dplyr`'s grammar that we will be discussing in this lesson. `summarize` (also `summarise`) is a function for computing summary statistics about the dataset. We won't dig into what summary statistics are too much in this lesson, as they will be one of the central concepts in Module 2.

Imagine we were interested in the mean name `Count` per year in this data frame; `summarize` is the function we would use for such a task... in conjunction with the `group_by` function, of course.
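
`summarize` can compute several statistics per group in one call. As a sketch, the example below uses a hypothetical six-row `toy` data frame (rows taken from this lesson's outputs) and returns the mean, the total, and the number of rows for each year. The `Total` and `Names` column names are made up for this example.

```r
library(dplyr)

toy <- data.frame(
  Name  = c("Anna", "Emma", "George", "Henry", "Margaret", "John"),
  Year  = c(1880, 1880, 1880, 1880, 1881, 1881),
  Count = c(2604, 2003, 5126, 2444, 1658, 8769)
)

by_year <- toy %>%
    group_by(Year) %>%
    summarize(
        Mean_Count = mean(Count),   # average count per year
        Total      = sum(Count),    # total count per year
        Names      = n()            # number of rows in each group
    )

print(by_year)
```

The result has one row per group, so six input rows collapse into two.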

In [17]:
df %>%
    group_by(Year) %>%
    summarize(Mean_Count = mean(Count)) %>%
    mutate(Year = as.factor(Year)) %>%
    tail()

Unnamed: 0,Year,Mean_Count
1,2009,111.012083297626
2,2010,116.276321595623
3,2011,106.999735566329
4,2012,110.567276261154
5,2013,106.274884938182
6,2014,108.100405588103


## And To Come...

In the lessons to come, we dive into some exploratory analysis. In order to do this, we are going to need to know how to find the summary statistics of variables, such as those shown in the image below. We are also going to need to know how to interpret these summary statistics, as well as how to look at the relationships between variables. This will help us down the line when we begin to run some machine learning models to predict the output of more interesting problems... but that will be down the road in Module 6.

### Summary Statistics for the Count Variable

<img src="../images/summary_sample_r.png">

We will also begin to explore further methods of data manipulation, such as normalization, which will allow us to make more meaningful interpretations and predictions. Below is a graph that normalizes the `Count` variable so that we can see the relative popularity of a name for that year. In Module 5, we will learn how to make graphs that communicate the patterns in your data effectively.

<img src="../images/popularity_sandy.jpg">