# Data Science with `R` Practice

This practice covers the fundamentals of data science with `R`. Much of the content is similar to your lab, [Introduction to Data Science with R](../labs/intro_data_science_r.ipynb), so the lab will serve as a good guidepost for answering some of the questions. We'll begin by reading in the data...

## Read in the Data

In [34]:
df <- read.csv('/dsa/data/all_datasets/baby-names/NationalNames2.csv')
head(df)

Id,Name,Year,Gender,Count
<int>,<fct>,<int>,<fct>,<int>
4,Elizabeth,1880,F,1939
8,Alice,1880,F,1414
12,Clara,1880,F,1226
13,Ella,1880,F,1156
18,Nellie,1880,F,995
21,Maude,1880,F,858


Now that the data is read in, how would you remove the `Id` column from the data frame in `R`? Remember, `R` is very flexible, so there are several ways to remove columns. Do the following:

**Activity 1**: Remove the column `Id` from the data frame and assign the result back to `df`.

**Activity 2**: Find the dimensions (the number of rows and columns) of this updated data frame. *Hint: You will have to look up the function to do this as it was not mentioned in the lab*. 

**Activity 3**: Display the ***last*** 10 rows of the updated data frame.

In [35]:
#' Code for Activity 1 goes here -----------------------
df = subset(df,select=-Id)
head(df)

Name,Year,Gender,Count
<fct>,<int>,<fct>,<int>
Elizabeth,1880,F,1939
Alice,1880,F,1414
Clara,1880,F,1226
Ella,1880,F,1156
Nellie,1880,F,995
Maude,1880,F,858
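
Since there are several ways to remove a column, here is a minimal sketch of three common ones. The `toy` data frame and its values are made up for illustration and only stand in for the baby-names data.

```r
# Toy data frame standing in for the baby-names data (values are made up)
toy <- data.frame(Id = 1:3,
                  Name = c("Ann", "Bob", "Cy"),
                  Count = c(10L, 20L, 30L))

# 1. subset() with a negated column name
by_subset <- subset(toy, select = -Id)

# 2. Assign NULL to the column on a copy of the data frame
by_null <- toy
by_null$Id <- NULL

# 3. Keep every column whose name is not "Id"
by_index <- toy[, names(toy) != "Id"]

# Each approach leaves only the Name and Count columns
names(by_subset)
names(by_null)
names(by_index)
```

Which one to use is mostly a matter of taste; `subset()` reads well interactively, while the indexing form is easy to adapt when the column name is stored in a variable.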


In [29]:
#' Code for Activity 2 goes here -----------------------
dim(df) #Number of rows followed by number of columns
nrow(df) #Number of rows
ncol(df) #Number of columns
length(df) #Number of columns

In [30]:
#' Code for Activity 3 goes here -----------------------
tail(df,10)

,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
608883,Zaymere,2014,M,5
608884,Zekeriah,2014,M,5
608885,Zenas,2014,M,5
608886,Ziion,2014,M,5
608887,Zijun,2014,M,5
608888,Zirui,2014,M,5
608889,Zo,2014,M,5
608890,Zyel,2014,M,5
608891,Zyran,2014,M,5
608892,Zyrin,2014,M,5


Below is one of the ways that you can subset a data frame. Run the code below to see what it returns.

In [31]:
head(df[df$Count > 2000,])

,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
330,Charles,1880,M,5348
331,Thomas,1880,M,2534
985,James,1881,M,5442
986,Frank,1881,M,2834
987,Thomas,1881,M,2282
988,Robert,1881,M,2140


This bit of code returns only the first few rows of data but, in reality, there are a lot more rows.

**Activity 4**: *Find out how many rows have a `Count` greater than 2000.*

In [33]:
#' Code for Activity 4 goes here -----------------------
nrow(df[df$Count > 2000,])
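
As an alternative way to count matches, the logical comparison can be summed directly without building the subset first. This is a small sketch on a made-up vector standing in for `df$Count`; it assumes the column has no missing values.

```r
# Toy counts standing in for df$Count (values are made up)
counts <- c(5348L, 2534L, 99L, 1200L, 2140L)

# The comparison yields one TRUE/FALSE per element
above <- counts > 2000

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches
n_above <- sum(above)
n_above
```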

Now let's say that we were interested only in male names, particularly those that are either very popular or not so popular.

**Activity 5**: Subset the data set for rows where the gender is male and the count is either above 2000 or below 100.

In [43]:
#' Code for Activity 5 goes here -----------------------
subset(df,Gender=='M' & (Count>2000 | Count<100))


,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
330,Charles,1880,M,5348
331,Thomas,1880,M,2534
367,Gus,1880,M,99
368,Jake,1880,M,96
369,Adolph,1880,M,93
370,Felix,1880,M,92
371,Wallace,1880,M,91
372,Claud,1880,M,90
373,Roscoe,1880,M,90
374,Hiram,1880,M,88


Imagine we were interested in names during a certain time period and didn't need any of the data for the years outside it. Let's say we were interested in names from the year 1950 through 1965. Below is one way to subset based on a specific year.

In [44]:
df1950 <- head(df[df$Year == 1950,])
df1965 <- head(df[df$Year == 1965,])

comb <- rbind(df1950,df1965)

head(comb);tail(comb)

,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
153876,Mary,1950,F,65460
153877,Nancy,1950,F,29621
153878,Deborah,1950,F,29071
153879,Sandra,1950,F,28893
153880,Karen,1950,F,24139
153881,Pamela,1950,F,16200


,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
211248,Patricia,1965,F,23554
211249,Linda,1965,F,19339
211250,Michelle,1965,F,16215
211251,Lori,1965,F,15698
211252,Teresa,1965,F,14578
211253,Barbara,1965,F,14026


**Questions**
1. What does the `rbind` function do?
2. Describe what the code above is doing.
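
If it helps when thinking about the first question, here is a tiny sketch of `rbind` on two made-up data frames (the names `first`, `second`, and `stacked` are only for illustration). Note that both inputs share the same column names.

```r
# Two small data frames with identical column names (made-up values)
first  <- data.frame(Name = c("Mary", "Nancy"), Year = c(1950L, 1950L))
second <- data.frame(Name = c("Patricia", "Linda"), Year = c(1965L, 1965L))

# rbind() stacks the rows of `second` underneath the rows of `first`
stacked <- rbind(first, second)

nrow(stacked)   # 2 rows from each input
stacked$Year
```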

The approach used for `df1950` and `df1965` would be extremely inefficient for collecting all of the names from every year between 1950 and 1965. How might you subset the data frame for the years 1950 through 1965 without subsetting each year separately?

**Activity 6**: *Subset the data frame for names from the years 1950 through 1965 without subsetting each year separately.*

In [63]:
#' Code for Activity 6 goes here -----------------------
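# Note: the strict inequalities exclude 1950 and 1965 themselves;
# use Year >= 1950 & Year <= 1965 to include both endpoints.
a = subset(df, Year < 1965 & Year > 1950)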
a = a[order(a$Name),]
head(a)

,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
162578,Aaron,1952,F,6
166780,Aaron,1953,M,897
174104,Aaron,1955,M,1097
191797,Aaron,1960,F,20
197356,Aaron,1961,M,1882
200149,Aaron,1962,F,12
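
Because "1950 through 1965" reads as an inclusive range, it can be worth checking how the boundary years are treated. The sketch below uses a made-up `toy` data frame to show two ways of keeping both endpoints.

```r
# Toy data frame with years on and around the boundaries (values are made up)
toy <- data.frame(Name = c("A", "B", "C", "D", "E"),
                  Year = c(1949L, 1950L, 1957L, 1965L, 1966L))

# Explicit inclusive bounds
inclusive <- subset(toy, Year >= 1950 & Year <= 1965)

# Membership test against the full sequence of years
via_in <- toy[toy$Year %in% 1950:1965, ]

inclusive$Year
via_in$Year
```

Both forms keep 1950 and 1965; the `%in%` version is handy when the years of interest are not contiguous.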


## A Bit of Sorting

The code below sorts the data frame based on `Count`, but we can also sort the data alphabetically by a column of strings.


In [64]:
head(df[order(df$Count),])

,Name,Year,Gender,Count
,<fct>,<int>,<fct>,<int>
295,Adrienne,1880,F,5
296,Albertine,1880,F,5
297,Alys,1880,F,5
298,Celie,1880,F,5
299,Cordella,1880,F,5
300,Corrine,1880,F,5


**Activity 7**: Sort the data frame by `Name` so that the names starting with "Z" are at the top of the frame.

In [71]:
#' Code for Activity 7 goes here -----------------------
library(dplyr)  # arrange() and desc() come from dplyr
head(arrange(df, desc(Name)))

Name,Year,Gender,Count
<fct>,<int>,<fct>,<int>
Zzyzx,2010,M,5
Zyvion,2009,M,5
Zytavious,2005,M,5
Zytaveon,2011,M,8
Zyshonne,2001,M,12
Zyshon,2005,M,5
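
The same reverse-alphabetical sort can also be sketched in base `R` with `order()`, in the style of the `Count` example earlier in this section. The `toy` data frame below is made up, and its `Name` column is stored as character strings rather than a factor.

```r
# Toy data frame (made-up rows); Name is kept as character, not factor
toy <- data.frame(Name = c("Alice", "Zo", "Maude", "Zenas"),
                  Count = c(1414L, 5L, 858L, 5L),
                  stringsAsFactors = FALSE)

# order() returns the row positions that would sort the column;
# decreasing = TRUE reverses the alphabetical order
sorted <- toy[order(toy$Name, decreasing = TRUE), ]

sorted$Name
```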


## Manipulating data with dplyr

`dplyr` makes manipulating data a bit simpler. Take a look at the code below.

In [72]:
library(dplyr)
df %>%
    filter(Name == "Linda") %>%
    arrange(-Count) %>%
    head()

Name,Year,Gender,Count
<fct>,<int>,<fct>,<int>
Linda,1952,F,67096
Linda,1953,F,61244
Linda,1946,F,52708
Linda,1955,F,51275
Linda,1957,F,44496
Linda,1945,F,41465


**Questions**
1. Describe what the code above is doing.

There is a lot more that you can do with `dplyr`, such as renaming columns. Let's try some.

**Activity 8**: *Copy the code above, but rename each of the columns so that the headers begin with lowercase letters. Name this data frame `df_lower`.*

**Activity 9**: *Now, on the `df_lower` frame, remove the `gender` column.*

In [84]:
#' Code for Activity 8 goes here -----------------------
library(dplyr)
df_lower = rename(df,name = Name,year = Year, gender = Gender, count = Count)
df_lower %>%
    filter(name == "Linda") %>%
    arrange(-count) %>%
    head()

name,year,gender,count
<fct>,<int>,<fct>,<int>
Linda,1952,F,67096
Linda,1953,F,61244
Linda,1946,F,52708
Linda,1955,F,51275
Linda,1957,F,44496
Linda,1945,F,41465


In [86]:
#' Code for Activity 9 goes here -----------------------
df_lower = subset(df_lower,select = -gender)
head(df_lower)

name,year,count
<fct>,<int>,<int>
Elizabeth,1880,1939
Alice,1880,1414
Clara,1880,1226
Ella,1880,1156
Nellie,1880,995
Maude,1880,858


**Activity 10**: *Using `dplyr`, select the rows where the name is either "Linda" or "John".*

In [95]:
#' Code for Activity 10 goes here -----------------------
df = filter(df,Name == 'Linda' | Name == 'John')
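
When matching against several values, chaining `|` comparisons gets long quickly. A possible alternative, sketched here on a made-up `toy` data frame, is the `%in%` operator.

```r
library(dplyr)

# Toy data frame (made-up rows)
toy <- data.frame(Name = c("Linda", "John", "Mary", "John"),
                  Count = c(27L, 8894L, 7065L, 40L))

# Keep rows whose Name appears in the vector of wanted names
wanted <- c("Linda", "John")
picked <- filter(toy, Name %in% wanted)

picked$Name
```

Adding a third name then only requires extending the `wanted` vector.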

**Challenge Activity 1**: *Use `dplyr` to find the maximum `Count` per year per name (for both Linda and John).*

In [106]:
#' Code for Challenge Activity 1 goes here -----------------------
df %>%
    group_by(Year,Name) %>%
    filter(Count == max(Count)) %>%
    head()

Name,Year,Gender,Count
<fct>,<int>,<fct>,<int>
Linda,1880,F,27
Linda,1883,F,49
John,1883,M,8894
John,1884,F,40
Linda,1885,F,60
John,1885,M,8756
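
The `filter(Count == max(Count))` approach keeps whole rows, so the `Gender` column survives and ties would produce several rows per group. If only the maximum itself is needed, a sketch using `summarise()` gives one row per year and name. The `toy` data frame and the column name `max_count` are made up for illustration.

```r
library(dplyr)

# Toy data frame with two rows per (Year, Name) group (values are made up)
toy <- data.frame(
  Name   = c("John", "John", "Linda", "Linda"),
  Year   = c(1884L, 1884L, 1885L, 1885L),
  Gender = c("M", "F", "F", "M"),
  Count  = c(8000L, 40L, 60L, 5L)
)

# Collapse each (Year, Name) group to a single row holding its largest Count
max_counts <- toy %>%
  group_by(Year, Name) %>%
  summarise(max_count = max(Count)) %>%
  ungroup()

max_counts
```

Recent versions of `dplyr` may print a message about the grouping of the result; the final `ungroup()` drops any remaining grouping either way.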


# Save your notebook, then `File > Close and Halt`