# Lab 1 Goals

The goals of this lab are:

* To refamiliarize yourself with Jupyter notebooks.
* To output a PDF document from a Jupyter notebook.
* To practice data cleaning skills, including selecting columns, identifying and removing unusual values, and maintaining a tidy dataset. 
* To practice calculating summary statistics.
* To practice making data visualizations, such as histograms and scatterplots. 

For this lab, it may be helpful to install and load the following modules: 

* `numpy`
* `pandas` 

In [None]:
import numpy as np
import pandas as pd

We will be using a dataset scraped from the website Epicurious, which curates different recipes. The `epi_r.csv` file can be found in the Week 1 module on Canvas. Load the dataset and name it `epicurious`.

In [None]:
epicurious = pd.read_csv("epi_r.csv")

# Data Cleaning

1. Print the dimensions of the dataset. How many observations are there? How many variables are there?

2. View the head of the dataset. What kinds of variables do you see?

3. Since so many of the variables are actually binary variables indicating different facts about the recipes, let's discard most of them for now and focus on the numeric variables. Identify which of the variables are numeric (if you get stuck, check out the [data dictionary](https://www.kaggle.com/datasets/hugodarwood/epirecipes/code)). Save a version of the dataset with only those variables as `epicurious_num`. You should also keep the recipe titles to use later.

In [None]:
epicurious_num = 
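
As a rough illustration of column selection, here is a minimal sketch on a made-up data frame (the `toy` frame and its column names are invented for this example, not taken from `epi_r.csv`). It shows why selecting by dtype alone may not be enough: 0/1 indicator columns are stored as numbers too, so an explicit list of column names is often the safer choice.

```python
import pandas as pd

# Hypothetical toy data: one text column, two continuous columns,
# and one 0/1 indicator column.
toy = pd.DataFrame({
    "title": ["Soup", "Cake", "Salad"],
    "rating": [4.4, 3.8, 5.0],
    "calories": [210.0, 540.0, 95.0],
    "vegan": [1.0, 0.0, 1.0],
})

# Selecting by dtype keeps every numeric column, including the indicator.
by_dtype = toy.select_dtypes(include="number")
print(list(by_dtype.columns))

# An explicit list keeps exactly the columns you name, in that order.
keep = ["title", "rating", "calories"]
toy_num = toy[keep]
print(toy_num.shape)
```

The same pattern should carry over to the real dataset once you have decided which columns to keep.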

4. As you can see, many of these variables have to do with the nutritional information. Compute a table of summary statistics for these variables, including the minimum, first quartile, median, third quartile, maximum, mean, and standard deviation. 
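
One possible approach is the `describe()` method in `pandas`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column. A small sketch on made-up numbers (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical values, chosen only so the statistics are easy to check by hand.
toy = pd.DataFrame({
    "calories": [100.0, 200.0, 300.0, 400.0, 500.0],
    "fat": [1.0, 2.0, 3.0, 4.0, 10.0],
})

# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max.
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
median_calories = summary.loc["50%", "calories"]
mean_fat = summary.loc["mean", "fat"]
```

Notice that in the toy `fat` column the mean sits above the median because of the single large value, which is the kind of pattern to watch for in the real data.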

5. It is also helpful to discuss how many missing values the dataset has. Identify the percentage of missing values in each column. 
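
A minimal sketch of one way to do this, assuming missing values are stored as `NaN`: `isna()` marks each cell as `True` or `False`, and the column-wise mean of those flags is the fraction missing. The `toy` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
toy = pd.DataFrame({
    "calories": [100.0, np.nan, 300.0, 400.0],
    "sodium": [np.nan, np.nan, 30.0, 40.0],
})

# isna() gives True/False per cell; the mean of booleans is the share of True.
pct_missing = toy.isna().mean() * 100
print(pct_missing)
```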

6. It depends on how many observations you have to begin with, but a general rule of thumb is that if the observations are missing at random (i.e., the reason that they are missing is not tied to any other confounding variable) and less than 10% of the values are missing, you can simply discard the values. Is it appropriate to discard the values in this case? Either way, drop the rows with missing values and save the remaining values in a new data frame called `epicurious_num_2`. 

In [None]:
epicurious_num_2 = 
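
When checking the rule of thumb, keep in mind that dropping rows removes a row if *any* of its columns is missing, so the share of rows lost can be larger than the missing percentage of any single column. A small sketch on invented data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is missing in a different row.
toy = pd.DataFrame({
    "calories": [np.nan, 200.0, 300.0, 400.0],
    "sodium": [10.0, np.nan, 30.0, 40.0],
})

# Each column on its own is 25% missing.
pct_missing = toy.isna().mean() * 100

# dropna() removes every row that has at least one missing value.
complete = toy.dropna()

# Share of rows lost: here 2 of 4 rows.
share_dropped = 1 - len(complete) / len(toy)
print(pct_missing)
print(share_dropped)
```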

7. Now, take a closer look at the summary statistics. What are the means for calories, protein, fat, and sodium? Do a quick Google search to try and identify what reasonable values for those variables might be. 

# Making Plots

## `plotnine`

8. For this class, I want to try using the `plotnine` module for creating visualizations. Install and load the `plotnine` module (you may want to use the alias `p9`).

In [None]:
import plotnine as p9

`plotnine` is based on `ggplot2`, part of Hadley Wickham's work on the `tidyverse` in R (in which tidy data plays a large role). `ggplot2` is a "system for declaratively creating graphics". Here is the basic idea: You, as the user, tell `ggplot2` or `plotnine` what data to use, how to map the variables to the different aesthetics (encoding channels) of the graph, and what type of graph you need--the library takes care of the rest!

First, we start by providing the data and mapping the variables to the graph's aesthetics. This means that we are defining things like what's on the $x$-axis or what color the graph is, among many others. 

A sample line of code for investigating the average ratings of the different recipes appears inside the box below. Note that the functions `ggplot()` and `aes()` come from `plotnine`: the data frame comes first, then the aesthetics of the graph are defined with `aes()`.

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating')))

## Histograms

If you run the chunk above, a big, nearly blank box appears--there is no actual graph, but we can see that the `rating` variable is now located on the $x$-axis. There's nothing there because we haven't added a **geom** yet. A geom is a command representing the type of plot we want. To add a histogram, we use `+ geom_histogram()` from `plotnine`.

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating')) 
 + p9.geom_histogram())

9. Do you see a warning message similar to "`stat_bin() using bins = 30. Pick better value with binwidth`" (the number of bins it reports may differ)? This message is specific to `geom_histogram()`, and it is a warning rather than an error: the plot is still drawn. To silence it, we can set the number of bins ourselves by adding a new argument, `bins = 10`, to the `geom_histogram()` function.

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating')) 
 + p9.geom_histogram())

10. What happens to the histogram? Would you describe it in the same way as you described your first histogram? Are there any differences?



You can also avoid the message by specifying the `binwidth` instead of `bins`. There is a direct relationship between the bin width and the number of bins, so setting one also fixes the other. In general, increasing the number of bins leads to narrower bins, and decreasing the number of bins leads to wider bins.
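
To make that relationship concrete, here is a small arithmetic sketch using `numpy` only. The range of 0 to 5 is an assumption made for the example, and `plotnine` may place its bin edges somewhat differently (so the number of bins it draws could be off by one), but the trade-off is the same.

```python
import numpy as np

# Assumed range of the data for this illustration.
low, high = 0.0, 5.0

# Fixing the number of bins determines the bin width.
n_bins = 10
edges = np.linspace(low, high, n_bins + 1)  # 10 bins need 11 edges
width = edges[1] - edges[0]
print(width)

# Fixing the bin width determines the number of bins.
binwidth = 0.25
n_from_width = int(np.ceil((high - low) / binwidth))
print(n_from_width)
```

Halving the bin width doubles the number of bins needed to cover the same range.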

11. Instead of creating a histogram with 10 bins, create a histogram where the bin width is 0.5. 

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating')) 
 + p9.geom_histogram(bins = 10))

Notice that on the $y$-axis, we are displaying the counts of the observations in each bin. We can instead display the density (bar heights rescaled so that the total area of the bars is 1) by adding another argument, `y = p9.after_stat('density')`, inside `aes()`.

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating', y=p9.after_stat('density'))) 
 + p9.geom_histogram(binwidth = 0.5))
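
If it helps to see what "density" means numerically, here is a minimal sketch with `numpy` on a handful of made-up values (not the recipe data). It assumes equal-width bins, in which case each bar's density is its count divided by (number of observations × bin width).

```python
import numpy as np

# Hypothetical values for illustration.
values = np.array([0.2, 0.4, 1.1, 1.3, 1.4, 2.6, 3.1, 3.3])

# Bin edges 0.0, 0.5, ..., 4.0, i.e. a bin width of 0.5.
binwidth = 0.5
edges = np.arange(0.0, 4.5, binwidth)

# Counts per bin, and the same histogram on the density scale.
counts, _ = np.histogram(values, bins=edges)
density, _ = np.histogram(values, bins=edges, density=True)

# Density by hand: count / (n * binwidth).
manual = counts / (counts.sum() * binwidth)

# On the density scale, the bar areas add up to 1.
total_area = (density * binwidth).sum()
print(counts)
print(density)
print(total_area)
```

The shape of the histogram is unchanged; only the scale of the $y$-axis differs.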

Now, let's add a density curve on top of the histogram:

In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = 'rating', y=p9.after_stat('density'))) 
 + p9.geom_histogram(binwidth = 0.5) 
 + p9.geom_density())

We can see two things from this latest graph--

1. There is a relationship between a histogram and a density plot! When created from the same dataset, they should show roughly the same shape. 
2. You can add multiple geoms to a graph! In fact, this is how we build the graphs--by supplying the dataset and variable mapping in the first `ggplot()` command, then by adding different things using different geoms. In this case, we added both a histogram (`geom_histogram()`) and a density (`geom_density()`). 

12. Now, try creating your own combination histogram/density plot for the variable `calories`. Describe the distribution. 


In [None]:
(p9.ggplot(epicurious_num_2, p9.aes(x = , y = )) 
 + p9.geom 
 + p9.geom)

13. Hopefully you have arrived at the conclusion that some of these values seem very unreasonable. Let's take a closer look--create histograms for each of the other numeric variables. Make sure that you are able to write reasonable captions for each plot. What do you see?

14. Between the mean and the histograms, you should be able to see that the distributions are being skewed by a handful of extreme values. Find the top ten largest values for `calories`, `protein`, `fat`, and `sodium`. Which recipes do they correspond to? Does the calorie count make sense knowing what the recipes were for?
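
One way to pull out the largest values of a column together with their titles is `DataFrame.nlargest`. A small sketch on a made-up data frame (the recipe names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical recipes; one calorie value is deliberately implausible.
toy = pd.DataFrame({
    "title": ["Lentil Soup", "Mystery Stew", "Pound Cake", "Green Salad"],
    "calories": [210.0, 30000.0, 540.0, 95.0],
})

# The two rows with the largest calories, largest first.
top2 = toy.nlargest(2, "calories")[["title", "calories"]]
print(top2)
```

Replacing `2` with `10` and repeating for each variable should give the tables this question asks about.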

15. Now, remove the rows containing those values. Save the remaining values in a new data frame called `epicurious_num_3`. Remake the histograms and note the changes in the distributions. 

In [None]:
epicurious_num_3 = 
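
A possible pattern for removing those rows, sketched on invented data: collect the row labels of the extreme values for each variable, combine them, and drop them in one step. The `toy` frame and the choice of one extreme row per column are assumptions for the example.

```python
import pandas as pd

# Hypothetical recipes with one implausible value in each numeric column.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "calories": [150.0, 30000.0, 420.0, 310.0, 95.0],
    "sodium": [200.0, 500.0, 90000.0, 350.0, 120.0],
})

# Row labels of the largest value in each column.
idx_calories = toy.nlargest(1, "calories").index
idx_sodium = toy.nlargest(1, "sodium").index

# Combine the labels so that each row is listed only once.
extreme_idx = idx_calories.union(idx_sodium)

# drop() returns a new data frame and leaves the original untouched.
trimmed = toy.drop(index=extreme_idx)
print(trimmed)
```

Because the same recipe can be extreme in several variables, the number of rows removed may be smaller than the number of values you listed.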

## Scatterplots

16. Let's continue to the last type of graph we will review for this lab, the scatterplot. This time, try using the [ggplot2 cheat sheet](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) to look up the appropriate geom you need (the cheat sheet is written for R, but you should be able to convert to Python relatively easily). Create a plot displaying the relationship between `calories` and `fat`. Notice that unlike the other plots we have created, such as histograms and densities, you *have* to supply both an `x` and a `y` aesthetic. Remember that $x$ is traditionally the explanatory variable and $y$ is traditionally the response--which one makes sense to use as the explanatory variable here?


*Adding fatty ingredients (or more of a particular fatty ingredient) increases the calories, so it makes sense to have `fat` on the $x$-axis.*

In [None]:
(p9.ggplot(epicurious_num_3, p9.aes(x = , y = )) 
 + p9.geom())


17. Describe this relationship in terms of its form, direction, strength, and unusual values. Google the relationship between calories and fat. Does this plot make sense in light of what we know about those two variables?

18. Let's get a little fancier. Now try adding a `color = "red"` argument to the `geom_point()` statement. 

In [None]:
(p9.ggplot(epicurious_num_3, p9.aes(x = 'fat', y = 'calories')) 
 + p9.geom_point())

19. This looks kind of messy, but you should have managed to change the color! Now try changing the color to something different. You might be able to use this [R color guide](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) to look up the available colors--but I haven't tested all of them to see if they work in Python, so no promises.


In [None]:
(p9.ggplot(epicurious_num_3, p9.aes(x = 'fat', y = 'calories')) 
 + p9.geom_point(color = "red"))

On your assignments, you can pick whatever color you want! But in addition, you can change the color according to another variable for an additional encoding channel. To do this, you can change the `color` argument to be the name of a variable inside the `aes()` statement (rather than inside the geom).

20. Based on the following chunk, color the points according to their average rating.


In [None]:
(p9.ggplot(epicurious_num_3, p9.aes(x = 'fat', y = 'calories')) 
 + p9.geom_point(color = "red"))

21. Does knowing the `rating` of the recipe add any additional information to the relationship?

22. Now create a scatterplot between `calories` and `protein`. Describe this relationship in terms of its form, direction, strength, and unusual values. Does *this* plot make sense in light of what we know about those two variables?

In [None]:
(p9.ggplot(epicurious_num_3, p9.aes(x = , y = )) 
 + p9.geom())

23. Finally, create a scatterplot between `calories` and the average rating (`rating`). Describe this relationship in terms of its form, direction, strength, and unusual values. If time allows, try experimenting with various aesthetics like color, size, etc. You can also try changing the title, labels, and font size. 

In [None]:
(p9.ggplot())