# Sampling and Data Manipulation

## Chapter 2.8-2.11: How to Work with Data Frames


In [None]:
# run this cell to set up the notebook
suppressMessages(library(coursekata))
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))


# This code will import the data frame StudentData
StudentData <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTnfW0lx6Wo1qMAsVYMht2zyd3q75uRR1wldcN7JP8EdL8uEgqWd0oHOrNoA1O6IzRU_5B7QsUiL7JA/pub?gid=291642978&single=true&output=csv")

# This makes font sizes on graphs bigger
theme_set(theme(
  text = element_text(size = 15),
  axis.text = element_text(size = 10)
))

In this Jupyter notebook, we'll practice sampling and manipulating data using the [StudentData](https://docs.google.com/spreadsheets/d/17I0WDazOh6mOoS5SdXlkkkEMf3_HYA0Mh5zVjnCj2kQ/edit?resourcekey#gid=291642978) data frame, which was collected by students in a high school statistics course.

## Practice 1:

Suppose our population is all the students in this class who took the survey. We're going to take a sample from that population and calculate the mean height of the students in the sample. If we did this twice, do you think we would get the same mean both times? Make a prediction, then run the code twice to check.

In [None]:
# the function sample() will take a random sample of 5 students (rows) from StudentData
# and save it in class_sample
class_sample <- sample(StudentData, 5)

# the function mean() will calculate the mean height of the students in our sample
mean(class_sample$height)


> Answer the questions here.

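To see why two samples rarely agree, here is a minimal sketch that uses a made-up population instead of `StudentData`. The data frame `toy_population`, its heights, and the helper `draw_sample()` are all hypothetical names invented for this illustration, and the sketch uses base R row indexing rather than the data-frame version of `sample()` used in the cell above.

```r
# Hypothetical toy population: 20 made-up heights in inches (not the real StudentData)
toy_population <- data.frame(
  student = 1:20,
  height  = c(60, 61, 62, 62, 63, 64, 64, 65, 65, 66,
              66, 67, 67, 68, 68, 69, 70, 71, 72, 74)
)

# Hypothetical helper: pick n row numbers at random (without replacement)
# and keep only those rows
draw_sample <- function(data, n) {
  data[sample(nrow(data), n), , drop = FALSE]
}

sample_1 <- draw_sample(toy_population, 5)
sample_2 <- draw_sample(toy_population, 5)

# The two sample means will usually differ from each other
# and from the population mean
mean(sample_1$height)
mean(sample_2$height)
mean(toy_population$height)
```

Because the rows are drawn at random, rerunning the block gives different samples, so the sample means change from run to run while the population mean stays fixed.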
## Practice 2:

We can focus on just some of the variables in our data frame using the function `select()`. Let's select a couple of variables of your choice. What do the rows represent? What do the columns represent?

In [None]:
# modify the code to pick two variables in StudentData you want to view
select(StudentData, first_variable_here, second_variable_here)

> Write your answers here.

In [None]:
# you could also combine the functions head() and select()
head(select(StudentData, first_variable_here, second_variable_here))

# try modifying the code to get just the first 3 rows

> How was this different from just using `select()`?

## Practice 3:

The function `filter()` lets us find certain rows that meet conditions we specify.  Suppose you wanted to find students who were born in your birth month.  What do you see?

In [None]:
# modify the code to filter for your birth month
filter(StudentData, birth_month == "Put_month_name_here")

> Write your response here.

In [None]:
# You could also filter using quantitative variables.
filter(StudentData, height == 62)

# Feel free to change the height and try other comparisons, such as >, <, >=, <=, or !=.

> Describe what you observed when using the function `filter()` here.

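The comparison operators behave as follows on a small made-up data frame. This is only a sketch: `toy_students` and its values are hypothetical, and it assumes the `dplyr` package (which provides `filter()` and is loaded by `coursekata`) is installed.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data (not the real StudentData)
toy_students <- data.frame(
  name   = c("Ana", "Ben", "Cam", "Dee", "Eli"),
  height = c(60, 62, 62, 65, 70)
)

exactly_62 <- filter(toy_students, height == 62)  # keeps Ben and Cam
taller     <- filter(toy_students, height > 62)   # keeps Dee and Eli
at_most_62 <- filter(toy_students, height <= 62)  # keeps Ana, Ben, and Cam
not_62     <- filter(toy_students, height != 62)  # keeps Ana, Dee, and Eli

exactly_62
taller
```

Note that `==` (two equals signs) tests for equality, while a single `=` would be read as an argument name inside `filter()`.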
## Practice 4:

Missing data may be recorded as `NA` or simply left blank. In `StudentData`, some students don't have a favorite superhero and left that question blank. Let's **filter in** students who have a favorite superhero and save them into a new data frame.

In [None]:
# First see that some students left favorite superhero blank
StudentData

In [None]:
# Now filter in students with a favorite superhero
# and save them in a new data frame called StudentData_subset
StudentData_subset <- filter(StudentData, fav_superhero != "")
StudentData_subset

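Blank strings and `NA` values are not the same thing in R, so it can help to see how `filter()` treats each. The sketch below uses a hypothetical data frame, `toy_heroes`, with made-up values, and assumes the `dplyr` package is installed.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data: one blank answer and one NA (not the real StudentData)
toy_heroes <- data.frame(
  name          = c("Ana", "Ben", "Cam", "Dee"),
  fav_superhero = c("Batman", "", NA, "Storm"),
  stringsAsFactors = FALSE
)

# The condition is FALSE for the blank answer and NA for the missing one.
# filter() keeps only rows where the condition is TRUE, so both are dropped.
has_hero <- filter(toy_heroes, fav_superhero != "")
has_hero

# is.na() identifies which values are NA (a blank string is not NA)
is.na(toy_heroes$fav_superhero)
```

In this toy example only Ana and Dee remain in `has_hero`, and `is.na()` flags only Cam's answer.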
## Practice 5:

There is a ratio called the [Ape Index](https://en.wikipedia.org/wiki/Ape_index) that compares our arm span to our height.  Having an index of 1 means your arm span and height are equal in length.  

Suppose we want to study the Ape Index for students without having to look at both columns for `height` and `arm_span`.  What could we do?  

We could create a new variable that displays the ratio of arm span to height.  Let's try it out.

In [None]:
# This code will take each student's arm span and divide it by their height.
# This ratio will be stored in a new variable that we'll add to the data frame.
StudentData$ape_index <- StudentData$arm_span/StudentData$height

# This will let us see the first six rows with the new variable.
# Scroll all the way to the right to see it.
head(StudentData)

> What did you notice for the first 6 Ape Indices?

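As a quick check on how to read the ratio, here is a small worked sketch with made-up measurements. The data frame `toy_body` and its numbers are hypothetical; only the formula (arm span divided by height) comes from the text above.

```r
# Hypothetical measurements in inches (not the real StudentData)
toy_body <- data.frame(
  arm_span = c(66, 64, 70),
  height   = c(66, 68, 68)
)

# Divide each arm span by the matching height, row by row
toy_body$ape_index <- toy_body$arm_span / toy_body$height
toy_body
```

The first row has equal arm span and height, so its index is exactly 1. The second row's arm span is shorter than the height, so its index is below 1, and the third row's index is above 1.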
## Practice 6:

How would we create that same variable using the `mutate()` function? Try creating the Ape Index using `mutate()` and save it back into the data frame as `ape_index_2`.

In [None]:
# replace new_variable_name, variable_1, and variable_2 with the names you need
StudentData <- StudentData %>%
  mutate(new_variable_name = variable_1/variable_2)

# Check that the new variable was added
head(StudentData)

## Practice 7:

Right now, the values of `sport` are `No` or `Yes`.  We want to change them to 0 and 1.  To do this, we'll use the function `recode()` then take a look at the data frame to see if `recode()` worked.

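Before running it on the real data, here is a minimal sketch of how `recode()` maps old values to new ones, using a hypothetical data frame `toy_sport` with made-up answers. It assumes the `dplyr` package is installed; the cross-tabulation with `table()` is one possible way to check the result, not the only one.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data (not the real StudentData)
toy_sport <- data.frame(
  sport = c("No", "Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Each old value on the left is replaced by the new value on the right
toy_sport$sport_new <- recode(toy_sport$sport, "No" = 0, "Yes" = 1)

# Cross-tabulate old against new values to confirm every "No" became 0
# and every "Yes" became 1
table(toy_sport$sport, toy_sport$sport_new)
```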
In [None]:
StudentData$sport_new <- recode(StudentData$sport, "No" = 0, "Yes" = 1)

# write code to view the data frame 