In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
%matplotlib inline

# Lab 4: Intro to Visualization

### Learning Objectives

In this notebook, you will learn about:

- How to explore datasets
- How to prepare data to be visualized
- The purpose of different visualizations 
- How to create and code visualizations
- How to analyze and draw insights from visualizations

#### Helpful Data Science Resources 
Here are some resources you can check out while doing this notebook and to explore data visualization further!
- [DATA 8 Textbook](https://inferentialthinking.com/chapters/07/Visualization.html) - Visualization chapter
- [Cool Data Visualizations](https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples)
- [Statista: Find Data on Interesting Topics](https://www.statista.com/)

**A note on the autograder for this lab:** The test cases in the autograder are not comprehensive: a completely incorrect graph can still pass the autograder. To help you check that you are making the correct visualizations, we have provided the expected output with each question. Your score for this lab will still depend solely on the autograder provided.

---
## Part 1: Exploratory Data Analysis (EDA)

### Unemployment rate and NaN values 

Let's start by loading the dataset. We will be using unemployment rate data from FRED.

**Question 1.1:** Load in the dataset `data/unemployment_rate.csv` and read it into a Pandas dataframe. Name it `unemployment_df`.

In [None]:
unemployment_df = ...
unemployment_df.head()

In [None]:
grader.check("q1_1")

**Question 1.2:** Besides the date, the dataset contains three columns: the overall unemployment rate, the unemployment rate for males, and the unemployment rate for females. Referencing [FRED's website](https://fred.stlouisfed.org/graph/?g=jXvf), rename the columns so that they appear in the following order: `Date`, `Male Unemployment`, `Female Unemployment`, and `Overall Unemployment`.

In [None]:
unemployment_df = ...
unemployment_df.head()

In [None]:
grader.check("q1_2")

Before plotting the data, it is important to determine whether there are any NaN values in the dataset.
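
As a minimal sketch of how such a check can look, the cell below uses a small made-up dataframe (the names `toy_df`, `year`, and `rate` are hypothetical and unrelated to the lab's dataset). It only illustrates how a boolean mask of missing entries can be summarized.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the lab's dataset) with one missing value.
toy_df = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "rate": [4.7, np.nan, 6.0],
})

# isnull() returns a boolean dataframe with the same shape as toy_df.
missing_mask = toy_df.isnull()

# Summing the booleans counts the missing entries in each column.
missing_per_column = missing_mask.sum()

# Calling any() twice collapses the mask to one True/False for the whole table.
has_missing = bool(missing_mask.any().any())

print(missing_per_column)
print("Any missing values?", has_missing)
```

The per-column counts show where the gaps are, while the single boolean answers the yes/no question directly.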


<!-- BEGIN QUESTION -->

**Question 1.3:**  Does the dataset contain any NaN values? Explain what you did to reach your conclusion.

*Hint*: `df.isnull()` could be useful; see [here](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) for reference.

_Type your answer here, replacing this text._

In [None]:
# OPTIONAL FOR TESTING


<!-- END QUESTION -->

---
## Part 2: Data Visualization

### Line Plots

A line plot is used to display data as a series of points connected by a line. It's generally used to visualize how a variable changes over time (also known as [time series data](https://www.investopedia.com/terms/t/timeseries.asp)), often with a time-related variable on the x-axis (minutes, days, months, years, etc.) and a numerical variable on the y-axis.

Let's create a line plot to see how the overall unemployment rate changes over the years.


**Question 2.1:** Use *matplotlib* to make a line plot of the overall unemployment rate over the entire sample period. Specifically, we want the dates on the x-axis to be [datetime](https://docs.python.org/3/library/datetime.html) values. Label your plot properly (both axes and title). There is no need to include a legend in your plot. It's OK if your plot looks slightly different (e.g. different colors, different line width, etc.) as long as it is still readable and contains all the pertinent information. We have included an image of what your line plot should look like.

*Hint:* you might need to convert the date column into a pandas [datetime data type](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) object first.
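
The sketch below shows the general pattern on made-up data: the column names `when` and `value` are hypothetical, and the date strings are assumed to be in a format that `pd.to_datetime` parses without extra arguments.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical): dates stored as plain strings.
toy_df = pd.DataFrame({
    "when": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "value": [3.5, 3.6, 4.4],
})

# Convert the strings into pandas datetime values.
toy_df["when"] = pd.to_datetime(toy_df["when"])

plt.figure()
# With datetime values on the x-axis, matplotlib spaces and labels the ticks as dates.
plt.plot(toy_df["when"], toy_df["value"])
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Toy time series")
ax = plt.gca()
```

If the dates were left as strings, matplotlib would treat them as unordered category labels rather than points in time.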

<img src="assets/q2_1.png" width="700">

In [None]:
def q2_1():
    ...
    plt.ylabel(...)
    plt.xlabel(...)
    plt.title(...)
    return plt.gca() # DO NOT edit this line, it's necessary for the autograder
q2_1(); # Once you have created the plot, try removing the semi-colon at the end. What happens?

In [None]:
grader.check("q2_1")

<!-- BEGIN QUESTION -->

**Question 2.2:** What kind of trend in the unemployment rate can you find in the graph above? Suggest possible reasons for the trend you see.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.3:** Create a line plot similar to the one in question 2.1, but show the unemployment rate separately for each sex. Label your plot properly (both axes and title) and include an appropriate legend. We have included an image of what the plot should look like.

*Hint*: To draw multiple lines on the same axes, you can simply call `plt.plot(...)` several times with the same x argument.

*Hint*: Plot males before females; specific test cases depend on this order.
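
Here is a rough illustration of drawing several lines on one set of axes, using made-up numbers; `series_a` and `series_b` are hypothetical names, not columns from the lab's data.

```python
import matplotlib.pyplot as plt

# Toy data (hypothetical): two series measured at the same x values.
x = [1, 2, 3, 4]
series_a = [2.0, 2.5, 3.0, 2.8]
series_b = [1.5, 2.0, 2.2, 2.9]

plt.figure()
# Each plt.plot call adds another line to the current axes.
plt.plot(x, series_a, label="Series A")
plt.plot(x, series_b, label="Series B")
# The legend picks up the text passed through `label`.
plt.legend()
ax = plt.gca()
```

The axes store the lines in the order they were plotted, which is why the plotting order can matter to an autograder.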

<img src="assets/q2_3.png" width="800">

In [None]:
def q2_3():
    ...
    plt.ylabel(...)
    plt.xlabel(...)
    plt.title(...)
    plt.legend()
    return plt.gca() # DO NOT edit this line

q2_3();

In [None]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

**Question 2.4:** What kind of differences between the sexes do you see in the plot above? Why might they have occurred?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Inflation and Unemployment Rate

Now, we'll be plotting both inflation and unemployment data. What's the famous curve that attempts to connect these quantities?

**Question 2.5**: Load in the dataset `data/inflation_rate.csv` and read it into a pandas dataframe named `inflation_df`. Then, rename the columns to `Date` and `Inflation`. 

In [None]:
inflation_df = ...
...
inflation_df.head()

In [None]:
grader.check("q2_5")

#### Merging datasets

It's far easier to plot with a single dataframe; several plotting libraries require you to pass in the dataframe you're intending to plot. 

**Question 2.6**: Merge `unemployment_df` and `inflation_df`, keeping only the rows that are present in both dataframes.
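
To make "present in both" concrete, here is a small sketch on two made-up tables. The names `left`, `right`, and `key` are hypothetical, and `how="inner"` is just one way to express this kind of merge in pandas.

```python
import pandas as pd

# Toy tables (hypothetical) that share only the keys "b" and "c".
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# how="inner" keeps only the keys that appear in both tables.
both = pd.merge(left, right, on="key", how="inner")
print(both)
```

The rows for `"a"` and `"d"` are dropped because each appears in only one of the two tables.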


In [None]:
merged_df = ...
merged_df.head()

In [None]:
grader.check("q2_6")

**Question 2.7**: Using the merged dataframe, plot line graphs for both inflation and unemployment over time. Label your plot properly (both axes and title), and use a legend. We have included an image of what the plot should look like.

*Hint:* The code for this question should look fairly similar to the code for question 2.3.

*Hint:* Plot inflation before unemployment; specific test cases depend on this order.

<img src="assets/q2_7.png" width="700">

In [None]:
def q2_7():
    ... # Make your plot
    ... # Label your plot
    return plt.gca() # DO NOT edit this line
q2_7();

In [None]:
grader.check("q2_7")

<!-- BEGIN QUESTION -->

**Question 2.8**: We briefly spoke about the Phillips curve in lab 2; read more [here](https://data88e.org/textbook/content/09-macro/phillips_curve.html). Does the graph above roughly match what you would expect from the curve? Why or why not?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.9**: For the rest of our analysis, it will be helpful to have a `Decade` column that represents which decade a measurement was taken in. Add this column to `merged_df`; each decade should be represented by the first year in that decade (so all years from 1950-1959 fall under the decade of 1950).

*Hint:* This may be tricky. `//` is floor division; see [here](https://www.geeksforgeeks.org/floor-division-in-python/) for reference.
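
As a quick, generic illustration of floor division (the numbers and the name `toy_values` are made up for this sketch):

```python
import pandas as pd

# Floor division drops the remainder: 47 divided by 10 is 4 with remainder 7.
print(47 // 10)        # 4
# Multiplying back by 10 rounds down to the nearest multiple of 10.
print(47 // 10 * 10)   # 40

# The same arithmetic works element-wise on a pandas Series.
toy_values = pd.Series([12, 47, 50, 99])
rounded_down = toy_values // 10 * 10
print(rounded_down.tolist())  # [10, 40, 50, 90]
```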

In [None]:
merged_df['Decade'] = ...
merged_df.head()

In [None]:
grader.check("q2_9")

### Bar Charts

A bar chart is a familiar way of visualizing categorical distributions. It displays a bar for each category, and the length of each bar is proportional to the frequency of the corresponding category. While not necessary, most bar charts have equally spaced and equally wide columns.
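
The following sketch draws a bar chart from a handful of made-up category counts; `toy_counts` and its categories are hypothetical and only meant to show the basic `plt.bar` call.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy counts (hypothetical): how often each category appears.
toy_counts = pd.Series({"red": 4, "green": 7, "blue": 2})

plt.figure()
# One bar per category; the height of each bar encodes its count.
plt.bar(toy_counts.index, toy_counts.values)
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Toy bar chart")
ax = plt.gca()
```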

**Question 2.10**: `Decade` is a categorical variable in our analysis. Make a bar chart representing the number of years from each decade in our analysis. We have included an image of what the plot should look like.

*Hint:* You will first need to get the number of years from each decade in `merged_df`. Consider making a series and storing it in `counts`.

*Hint:* Don't use `.value_counts()` as it will fail specific test cases; think of other ways to find the sizes. 

<img src="assets/q2_10.png" width="700">

In [None]:
def q2_10():
    counts = ...
    ... # Make your plot; consider using the parameters `color='skyblue', edgecolor='black', width=5`
    ... # Label your plot
    return plt.gca() # DO NOT edit this line
q2_10();


In [None]:
grader.check("q2_10")

### Histograms

A histogram allows you to visualize the distribution of a numerical variable, helping you understand how spread out the values in your data are. It looks quite similar to a bar chart, with a few important differences.

Histograms follow the *area principle* and have two defining properties:

1. As the values on the horizontal axis are numerical and therefore have fixed positions on the number line, the bins are drawn to scale and are contiguous (though some might be empty).
1. The area of each bar is proportional to the number of entries (or percent of data values) in the corresponding bin. The histogram is said to be drawn on a *density scale*.
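
The area principle can be checked numerically. The sketch below uses made-up values and deliberately unequal bins (`values` and `bin_edges` are hypothetical), and relies on `np.histogram` with `density=True` to return bar heights on the density scale.

```python
import numpy as np

# Toy data (hypothetical) and bins of unequal width.
values = np.array([1, 2, 2, 3, 3, 3, 7, 8])
bin_edges = np.array([0, 2, 4, 10])

# density=True returns bar heights on the density scale, not raw counts.
heights, _ = np.histogram(values, bins=bin_edges, density=True)
widths = np.diff(bin_edges)

# Area = height * width gives the proportion of the data in each bin.
areas = heights * widths
print(areas)        # proportion of values falling in each bin
print(areas.sum())  # the areas add up to 1
```

Here 1 of the 8 values falls in the first bin, 5 in the second, and 2 in the third, so the areas are 0.125, 0.625, and 0.25 even though the bins have different widths.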


**Question 2.11**: Plot a histogram of overall unemployment, using all the data available in `merged_df`. The `bins` of the histogram should range from 3 to 11 (inclusive) with a step size of 0.5. Label your plot appropriately (including axes and title). We have included an image of what the plot should look like.

*Hint:* NumPy's `np.arange` function excludes the endpoint.
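
A small sketch of this endpoint behavior, using made-up bounds (0 to 2 in steps of 0.5) rather than the ones from the question:

```python
import numpy as np

# The stop value itself is left out: this yields 0, 0.5, 1, 1.5.
without_endpoint = np.arange(0, 2, 0.5)
print(without_endpoint)

# Extending the stop by one step includes the intended endpoint: 0, 0.5, 1, 1.5, 2.
edges = np.arange(0, 2 + 0.5, 0.5)
print(edges)
```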

<img src="assets/q2_11.png" width="700">

In [None]:
def q2_11():
    bins = ...
    ... # Make your plot
    ... # Label your plot
    return plt.gca() # DO NOT edit this line
q2_11();

In [None]:
grader.check("q2_11")

### Scatter Plots

Scatter plots are used to visualize the relationship between two numerical variables. They help us infer the association between two variables. The association between two variables refers to how one variable changes with respect to the other.  We can describe the association between two variables based on two factors:

1. *Magnitude:* Is the association strong or weak? If the points on the scatter plot all line up along a straight line (in any direction), it means that the association between the variables is strong. On the other hand, if the points are all spread out and scattered (no pun intended), it means that the association is weak.

2. *Direction (or sign):* Is the association positive or negative? If one variable increases as the other variable increases, the association between the two variables is positive. If one decreases as the other increases, the association is negative.
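
This lab judges association by eye, but for reference, one common numeric summary of a *linear* association is the correlation coefficient. The sketch below computes it with `np.corrcoef` on made-up arrays (`x`, `y_pos`, and `y_neg` are hypothetical): the sign reflects the direction, and values near 1 or -1 indicate a strong linear association.

```python
import numpy as np

# Toy data (hypothetical).
x = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 6, 8, 10])   # rises exactly in step with x
y_neg = np.array([9, 7, 6, 4, 1])    # tends to fall as x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print(r_pos)  # a perfect positive linear association gives a value of 1 (up to rounding)
print(r_neg)  # negative, because y_neg decreases as x increases
```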

In lab 2, we created a scatter plot of US unemployment vs. inflation. That scatter plot is replicated below for your convenience.


In [None]:
plt.figure(figsize=(8,6))
plt.scatter(merged_df['Overall Unemployment'], merged_df['Inflation']);
plt.xlabel("Unemployment")
plt.ylabel("Inflation")
plt.title("The Phillips Curve from 1958 to 2022");

If you look [here](https://data88e.org/textbook/content/09-macro/phillips_curve.html?highlight=phillip), you can see that economists have looked at the Phillips curve decade by decade. They do this because the full timeframe may be too broad to show the pattern of the curve clearly. Let's help remedy this issue by considering the relationship separately for each decade.

**Question 2.12**: Use the [seaborn](https://seaborn.pydata.org/index.html) package to create a scatter plot using the same data as above, but with colored scatter points based on the `Decade` column. Keep unemployment on the x-axis and inflation on the y-axis. Include a legend and label your plot properly. We have included an image of what the plot should look like.

*Hint:* Read the [documentation](https://seaborn.pydata.org/generated/seaborn.scatterplot.html).

<img src="assets/q2_12.png" width="700">

In [None]:
def q2_12():
    sns....(data=..., ... , palette="bright", alpha=0.75)
    ... # Label your plot
    return plt.gca()  # DO NOT edit this line
q2_12();

In [None]:
grader.check("q2_12")

<!-- BEGIN QUESTION -->

**Question 2.13**: We defined an `alpha` parameter for you in the function call above. What is that parameter, what does it do and which problem does it help avoid?

*Hint:* Try removing/changing the alpha value and see how that changes the plot.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Seaborn's `sns.lmplot` is great for drawing linear patterns on scatter plots. Not only can it draw one linear pattern, but it can plot and color multiple lines for different values of a categorical variable.

**Question 2.14**: Firstly, as we don't have complete data from the decades `1950` and `2020`, please drop all rows from those decades and store the resulting dataframe in `relevant_decades`. Then, create your `lmplot`. Keep unemployment on the x-axis and inflation on the y-axis. Include a legend and label your plot properly. We have included an image of what the plot should look like.

*Hint:* Read the [documentation](https://seaborn.pydata.org/generated/seaborn.lmplot.html); the parameters you pass in are extremely similar to what you used for question 2.12.

<img src="assets/q2_14.png" width="700">

In [None]:
def q2_14():
    relevant_decades = ...
    ag_test = sns.lmplot(..., # DO NOT edit ag_test = sns.lmplot
            ci = None, lowess = True, line_kws={'lw': 2}, palette="bright") 
    ...
    return relevant_decades, ag_test.axes.flatten(), plt.gca() # DO NOT edit this line
q2_14();

In [None]:
grader.check("q2_14")

---
## Part 3. More Plots with Seaborn

### Boxplot

A boxplot is similar to a histogram in that it also visualizes the distribution of a numerical variable, but it gives you more specific statistics about the distribution: the minimum, lower quartile (value at the 25th percentile), median (value at the 50th percentile), upper quartile (value at the 75th percentile) and maximum.
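
These five statistics can be computed directly. The sketch below uses `np.percentile` with its default interpolation on a made-up array (`values` is hypothetical).

```python
import numpy as np

# Toy data (hypothetical): the integers 1 through 9.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The 0th and 100th percentiles are the minimum and maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)  # 1.0 3.0 5.0 7.0 9.0
```

By default, `sns.boxplot` draws whiskers only out to 1.5 times the interquartile range and marks points beyond that individually, so the whisker ends may not coincide with the minimum and maximum.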


**Question 3.1**: Use [`sns.boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html) and `unemployment_df` to create a box plot of the three unemployment rates: `Male Unemployment`, `Female Unemployment`, and `Overall Unemployment`. Label your plot properly (although there should be no need to manually label your x-axis). We have included an image of what the plot should look like.

<img src="assets/q3_1.png" width="700">

In [None]:
def q3_1():
    plt.figure(figsize=(9, 6))
    ag_test = sns....(...) # DO NOT edit ag_test = sns...
    ...
    return ag_test.lines, plt.gca() # DO NOT edit this line
q3_1();

In [None]:
grader.check("q3_1")

### Violin Plot

A violin plot is a combination of a histogram and boxplot. It shows you the general distribution of the data (by creating a histogram and drawing a line to capture its general shape) as well as specific statistics (same as the boxplot). 

**Question 3.2**: Use [`sns.violinplot()`](https://seaborn.pydata.org/generated/seaborn.violinplot.html) on `merged_df` to create a violin plot of just the `Overall Unemployment` and `Inflation`. Label your plot properly (although there should be no need to manually label your x-axis). We have included an image of what the plot should look like.

<img src="assets/q3_2.png" width="700">

In [None]:
def q3_2():
    ag_test = sns...(..., palette="prism") # DO NOT edit ag_test = sns...
    return ag_test.lines, plt.gca() # DO NOT edit this line
q3_2();

In [None]:
grader.check("q3_2")

<!-- BEGIN QUESTION -->

**Question 3.3**: What do you notice in the violin plot above (distribution, skewness, etc.)? Explain the reasoning behind your answer.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.4**: For your final plot, recreate the plot from question 2.11 in seaborn, but add a kernel density estimate. Label your plot properly. We have included an image of what the plot should look like.

<img src="assets/q3_4.png" width="700">

In [None]:
def q3_4():
    bins = ...
    ag_test = sns...(..., edgecolor='black', linewidth=1.2) # DO NOT edit ag_test = sns......
    return ag_test.get_lines(), plt.gca() # DO NOT edit this line
q3_4();

In [None]:
grader.check("q3_4")

<!-- BEGIN QUESTION -->

**Question 3.5**: What does a kernel density estimate try to do? How does it work?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Conclusion

**Congratulations!** You have finished lab 4! We hope you enjoyed the lab!

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)