# Data Foundations -- Problem Set 2: R and Data 

This is your second problem set, which tests your ability to code in R while working with data. As always, feel free to use Google, documentation, and sites like Stack Overflow for help in completing the assignment. For now, continue to avoid using AI so that you can practice developing your own solution strategies. 


This is an auto-graded assignment: you will edit some code chunks, and those chunks will then be tested. There are also chunks of code already provided for you, but be sure to run them! Otherwise, you may be attempting to work with data objects that are not yet loaded in the workspace. 

This problem set should take 60 to 90 minutes. Remember not to get frustrated with yourself. Struggling through errors and trying to come up with the right answer is an important part of learning how to code! 

## Introduction

First, let's run some basic setup. 

In [None]:
# load libraries 
library(dplyr)
library(ggplot2)

In this problem set, we are going to use the `mpg` dataset that is available from the `ggplot2` package. Load the dataset by running the code chunk below. 

In [None]:
# load mpg dataset
data(mpg)

Run the following code chunks to do some data exploration. This will help you orient yourself to the dataset before you need to answer the problem set questions. 

In [None]:
# get a quick snapshot of what's in the dataset
glimpse(mpg)

In [None]:
# get a summary of the variables, particularly the numeric variables  
summary(mpg)

In [None]:
# look at the first few rows 
head(mpg)

## Question 1: Filtering (1 Point)
Filter the `mpg` dataset to include only cars that are manufactured by Toyota (*i.e.,* `manufacturer == "toyota"`). Use the pipe, `%>%`, and the `filter()` function. **Store the filtered data in a new data frame named `df_toy`.**
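
As a reminder of the general pattern, here is a minimal sketch of `filter()` with a pipe. It uses a made-up `pets` data frame (hypothetical, not part of the assignment), so you will still need to adapt the idea to `mpg` yourself.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
pets <- data.frame(
  name    = c("Rex", "Tom", "Fido", "Kit"),
  species = c("dog", "cat", "dog", "cat"),
  age     = c(3, 5, 7, 2)
)

# Keep only the rows where species equals "dog"
dogs <- pets %>%
  filter(species == "dog")

# Count the rows that survived the filter
nrow(dogs)
```

Note that the comparison uses `==` (two equals signs) and that the text being matched is case-sensitive.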

In [None]:
# filter mpg to only toyotas 
# uncomment the following line of code to get started 
# df_toy <- mpg %>%

# your code here
.NotYetImplemented()


In [None]:
# glimpse your new dataframe 
glimpse(df_toy)

In [None]:
# your grade is based on the following tests: 

# Question 1 test 
if (nrow(df_toy) != 34) {
   stop("Your new dataframe isn't the proper length after filtering to only toyotas. Expected: nrow(df_toy) == 34")
}else{
    print("Nice work, you filtered correctly.")
}


## Question 2: Unit Conversion (3 Points)
When working with data, you may need to convert the units. For example, in the `mpg` dataset, we can convert highway miles per gallon to kilometers per liter. Two necessary facts are: 

- 1 mile = 1.61 kilometers
- 1 gallon = 3.79 liters 

The unit transformation equation is 

$$\text{kpl} = \text{hwy} / 3.79 \times 1.61.$$
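
To see the equation in action on a single number, here is a small worked example. The value of 30 miles per gallon is made up for illustration, and the variable names are assumptions rather than anything the grader expects.

```r
# Conversion factors taken from the two facts listed above
km_per_mile       <- 1.61
liters_per_gallon <- 3.79

# Hypothetical fuel economy value, used only for illustration
example_mpg <- 30

# Apply the unit transformation equation: divide by 3.79, multiply by 1.61
example_kpl <- example_mpg / liters_per_gallon * km_per_mile

# Roughly 12.74 kilometers per liter
round(example_kpl, 2)
```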


### Part A

Convert the variable `hwy`, which measures highway miles per gallon, to highway kilometers per liter. Use the `mutate()` function to create a new variable called `hwy_kpl` that measures highway kilometers per liter using the unit transformation equation above. Start with the original `mpg` dataset, but store the new dataset (with the new variable `hwy_kpl`) in a data frame called `df_metric`. 
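
If you need a refresher on `mutate()`, the sketch below adds a converted column to a made-up `parcels` data frame. The data, the column names, and the pounds-to-kilograms factor are illustrative assumptions and are unrelated to the graded answer.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
parcels <- data.frame(
  id        = c("a", "b", "c"),
  weight_lb = c(2, 10, 5.5)
)

# mutate() keeps every existing column and appends the new one
# (1 pound is approximately 0.4536 kilograms)
parcels_metric <- parcels %>%
  mutate(weight_kg = weight_lb * 0.4536)

# The result has one more column than the original
ncol(parcels_metric)
```

Because the result is assigned to a new name, the original `parcels` data frame is left unchanged.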


In [None]:
# Create a new var called hwy_kpl in a new data frame called df_metric 
df_metric <- mpg %>%

# your code here
.NotYetImplemented()
    

In [None]:
# investigate your new df 
summary(df_metric)
glimpse(df_metric)

In [None]:
# testing your answer for Q2 Part A
if (ncol(df_metric) < 12) {
   stop("Your new dataframe doesn't have a new variable.")
}else{
    print("Nice work, you created your new variable in your new df.")
}

### Part B 

Find the average highway kilometers per liter for cars in your `df_metric` dataset. Store the value in a variable called `mean_kpl`. 
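
One common way to average a single column is to pull it out with `$` and pass it to `mean()`. The sketch below does this on a made-up `scores` data frame (hypothetical names and values).

```r
# Hypothetical toy data, invented for illustration only
scores <- data.frame(
  student = c("a", "b", "c", "d"),
  score   = c(80, 90, 70, 100)
)

# Extract the column as a vector with $ and average it
mean_score <- mean(scores$score)

mean_score
```

If a column contains missing values, `mean()` returns `NA` unless you add `na.rm = TRUE`.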

In [None]:
# your code here
.NotYetImplemented()

In [None]:
# testing your answer for Q2 Part B
if (round(mean_kpl) != 10) {
   stop("Your mean_kpl doesn't equal the correct answer.")
}else{
    print("Nice work, you calculated the average highway kilometers per liter correctly.")
}

## Question 3: Summarizing Data (3 Points)

### Part A 
Go back to using your `df_toy` dataset (the data frame that you filtered to Toyota cars only). Group the `df_toy` data frame by the variable `class` and summarize the average highway miles per gallon (variable named `hwy`) for each class. Store your summary data frame in a new data frame called `df_sum`. 
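
The group-then-summarize pattern looks like the sketch below, shown on a made-up `sales` data frame. The data and the names `sales_sum` and `mean_revenue` are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
sales <- data.frame(
  region  = c("north", "north", "south", "south", "south"),
  revenue = c(10, 20, 5, 15, 40)
)

# One output row per group, with one column holding the group mean
sales_sum <- sales %>%
  group_by(region) %>%
  summarise(mean_revenue = mean(revenue))

sales_sum
```

The summary has one row per distinct value of the grouping variable and two columns: the grouping variable and the new summary column.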

In [None]:
# create summary df 
df_sum <- df_toy %>%

# your code here
.NotYetImplemented()


# investigate your summary df 
df_sum

In [None]:
# Testing Q3 Part A 
if (ncol(df_sum) != 2 || nrow(df_sum) != 4) {
   stop("You have not created a summary dataframe. Remember to use the summarise() and group_by() functions.")
}else{
    print("Nice work, you created a summary df.")
}

### Part B 

Which class of Toyota car gets the best gas mileage? Store the class of car that gets the best gas mileage as a string (character value) in a variable named `best_class`.
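
Reading the answer off a printed summary is enough here, but sorting can make the top row easier to spot. The sketch below uses a made-up `region_sum` data frame (hypothetical) and assumes dplyr 1.0.0 or later for `slice_max()`.

```r
library(dplyr)

# Hypothetical summary table, invented for illustration only
region_sum <- data.frame(
  region       = c("north", "south", "east"),
  mean_revenue = c(15, 20, 12)
)

# Sort from largest to smallest so the best group is in the first row
region_sum %>%
  arrange(desc(mean_revenue))

# Alternatively, keep only the top row and extract the group name
top_region <- region_sum %>%
  slice_max(mean_revenue, n = 1) %>%
  pull(region)

top_region
```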


In [None]:
# store your answer here 
best_class <- ""

# your code here
.NotYetImplemented()


In [None]:
# Testing Q3 Part B 
if (best_class != "compact") {
   stop("You haven't identified the class of car correctly. Be sure you're not using capital letters. Use your summary dataframe to inform your answer.")
}else{
    print("Nice work, you interpreted your summary dataframe correctly.")
}

## Question 4: Data Visualization (3 Points) 
Let's finish off with some basic data visualization. 


### Part A
Create a histogram of the `hwy` variable in the `mpg` dataset. Be sure to store your plot in an object called `plot_1`. Use ggplot. Note: R may print a message about `binwidth`, but you can ignore it. 
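
For reference, a ggplot histogram is built by adding a histogram layer to a base plot with `+`. The sketch below uses simulated data in a made-up `heights` data frame (hypothetical); setting `bins` explicitly is optional and simply avoids the message mentioned above.

```r
library(ggplot2)

# Simulated toy data, invented for illustration only
set.seed(1)
heights <- data.frame(height_cm = rnorm(200, mean = 170, sd = 10))

# Base plot maps the variable to the x axis; the layer draws the bars
hist_plot <- ggplot(heights, aes(x = height_cm)) +
  geom_histogram(bins = 20)

# Print the stored plot
hist_plot
```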


In [None]:
# create ggplot histogram 
plot_1 <- ggplot(mpg, aes(x = hwy)) +
# your code here
.NotYetImplemented()

# print plot
plot_1

In [None]:
# test Q4 Part A 
if(!any(grepl("count", names(ggplot_build(plot_1)$data[[1]])))){
    stop("You haven't made a histogram and stored it in plot_1")
}else{
    print("Nice work! You made a histogram plot.")
}


### Part B
Practice some data visualization best practices. Create a new plot called `plot_2` that is a histogram of the highway miles per gallon in the `mpg` dataset (like Part A), but do some of the following: 

- write clear labels and a title 
- make the theme as simple as possible
- make sure the text is legible
- consider changing colors (is it legible for people who are color blind?)
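
The sketch below shows one way these ideas can be combined, again on simulated data in a made-up `heights` data frame (hypothetical). The specific title, font size, and fill color are illustrative choices, not requirements; the blue used here comes from the Okabe-Ito colorblind-friendly palette.

```r
library(ggplot2)

# Simulated toy data, invented for illustration only
set.seed(1)
heights <- data.frame(height_cm = rnorm(200, mean = 170, sd = 10))

styled_plot <- ggplot(heights, aes(x = height_cm)) +
  # A single fill color with white outlines keeps the bars distinguishable
  geom_histogram(bins = 20, fill = "#0072B2", color = "white") +
  # Clear title and axis labels
  labs(
    title = "Distribution of simulated heights",
    x     = "Height (cm)",
    y     = "Number of observations"
  ) +
  # A simple theme with a larger base font size for legibility
  theme_minimal(base_size = 14)

styled_plot
```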


In [None]:
# create ggplot histogram (don't forget your + signs)
plot_2 <- ggplot(mpg, aes(x = hwy)) +
    geom_histogram() +

# your code here
.NotYetImplemented()


# view plot 
plot_2


In [None]:
# test Q4 Part B 
if(isTRUE(all.equal(plot_2, plot_1))){
    stop("You haven't added anything to plot_2 beyond what we did in part A.")
}else{
    print("Nice work! You made a histogram plot and did some data vis best practices.")
}