In [None]:
### Don't push "Run notebook"!!!

assert False, 'Please don\'t press the "Run notebook" button!'

# Welcome to the Data Science Kit!

<center><img src='http://www.sintetia.com/wp-content/uploads/2014/05/Data-Scientist-What-I-really-do.png' width=80%></center>

In this starter kit, you will learn to use data science to analyze **2016 US primary election** data and uncover sociological, geographical, and economic voting patterns. Data science is an important application of computer science that helps distill big data into insightful and interpretable stories. You can find data science being applied in a huge diversity of fields including medicine, finance, biology, manufacturing, social media, and, as you will see in this kit, political science.

The US primary election dataset you will work with contains voting information for over 2,000 counties and spans demographic, spatial, and economic data! This is an overwhelming amount of information, but data science will give you the tools to find patterns amongst the chaos. You will answer important questions such as... **(WIP)**

By the time you finish this starter kit, you will be ready to expand your analysis on your own and uncover novel phenomena within the voting data. What discoveries will you make?

>**Pro Tip**:  
You can quickly navigate this document using the table of contents tab in the "Files" menu.
![](.images/TOC_files.png)
![](.images/TOC_location.png)
<!-- <img src=".images/TOC_files.png" width=350>
<img src='.images/TOC_location.png' width=350> -->

# Before We Begin...
Let's take note of the coding environment we're in: a Deepnote **notebook** running Python.

## What is Deepnote?
Deepnote is a web-based data science platform for real-time collaboration in coding notebooks. Notebooks allow us to work with data interactively and see the output from each operation we apply to our data. In your file browser, you can distinguish between notebooks and other file types by their file extension. Notebooks end with `.ipynb`, short for "IPython notebook." The fundamental units of a notebook are **cells**. In Deepnote, cells come in two flavors: **markdown** cells and **code** cells.

### Markdown Cells
Markdown cells allow us to write, edit, and display formatted text. They are written in Markdown, a markup language that provides a minimal formatting syntax. All of the text in this kit is in Markdown cells, and if you double click on a cell you can see (and edit) the underlying Markdown source. Deepnote also provides a Markdown cheat sheet on the right sidebar when you have a Markdown cell selected. **You won't need Markdown until you finish this kit but feel free to check it out.**

### Code Cells
Code cells in Deepnote allow us to write and execute Python code. You can run cells by selecting them and pressing **Ctrl+Enter (or Cmd+Return on Mac) or by clicking the Run button in the right sidebar**. Go ahead and run the example below (you might have to wait a few seconds for your environment to boot up).

In [None]:
# This is a code cell
print('This is a print statement.')
my_var = 'I am a variable :)'
my_var2 = 'I am a variable too!'
my_var # this will not be output because it's not last
my_var2 # this will be output

You'll know that a cell has been executed by the green check mark in the bottom left.

>**Quirky Output:**  
Code cells show the output of the last line of code if it is not explicitly assigned to a variable. Notice the difference between the `print` output and implicit output. Even though `my_var2` was not in a `print` statement it was output because it was written on a line by itself. You can also see that `my_var` was not output despite being written by itself because it wasn't the last line. Feel free to edit the cell above to get a feel for running code cells and displaying output.

**Global Variables**

When you run a cell, any values you assigned to variables will be stored for the entire notebook and not just that cell. The `my_var` variable you created in the previous cell can be called from any other cell after you have run that original cell.

So, if you have properly run the code cell above, `my_var` should print 'I am a variable :)' after you **run the cell below**.

In [None]:
# Run this cell
my_var

>**Warning:**  
If you happen to change the value of `my_var` at any point and run the cell you changed it in, the value for that variable will update for future use. If you want to restore its previous value, you will need to re-run the cell that assigned that previous value.

In [None]:
# This cell will change the value of my_var to something different
# If you want the original value back, you will need to run the first cell again

my_var = 'my_var has been changed! :('

**Semicolons**

Adding a semicolon to the end of the last line in a cell suppresses the cell's implicit output (the value of that last line). Explicit output, like a plot or a `print()` statement, still appears. Try **running the cell below** to see for yourself!

You should see that `my_var` is not output from the cell.

In [None]:
# Run this cell to see how ; suppressed output
my_var;

### Adding Cells
You can also add your own cells by clicking the "+ Block" or "+ Code" buttons that pop up when you hover your cursor over the bottom or top left corners of an existing cell. Alternatively, you can use the hotkeys "Ctrl+J" and "Ctrl+K" to add code cells below and above the selected cell, respectively. There are many more Deepnote shortcuts available too; check them out from the "?" menu in the bottom left of Deepnote.

# 0. `import data_science`
Before we start working with our data we have to get our coding environment ready to do some work.

>**Run the cell below** to `import` the Python packages we need, set a plotting theme, and load our data.

![](.images/import.jpeg)

In [None]:
# Run this cell with Ctrl+Enter (or Cmd+Return on a Mac)

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from utils import *

# Set a plotting theme with seaborn
sns.set_theme() 

# Load our data
print('Loading data from Datasets folder...')

(geo_cnty,
republican_primaries_county_level,
democrat_primaries_county_level,
republican_primaries_state_level,
democrat_primaries_state_level) = load_data()

print('Finished loading data!')

Now we're ready to start working with our data and explore what we just loaded!

>**Field Notes:**  
Usually, getting your data isn't as easy as using a `load_data()` function. Often, the data you want is scattered across different websites, filled with missing values (NaNs), or in an unstructured form. These irregularities need to be "cleaned" before you can start working with your data. Data cleaning is beyond the scope of this project but you can read more [here](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d) if you're interested.
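
To give a flavor of what cleaning can look like, here is a minimal sketch on a small invented table. The dataframe `raw`, its column values, and the cleaning steps chosen are assumptions for illustration only; they are not how `load_data()` prepares the election data.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: invented values, NOT taken from the election dataset
raw = pd.DataFrame({
    'county': ['Adams', 'Baker', 'Baker', 'Clark', 'Davis'],
    'income': ['52000', '48000', '48000', None, '61000'],  # numbers stored as text, one missing
    'college': [21.5, 18.0, 18.0, 30.1, np.nan],           # one missing value (NaN)
})

clean = (
    raw.drop_duplicates()                                    # remove the repeated Baker row
       .assign(income=lambda d: pd.to_numeric(d['income']))  # convert text to numbers
       .dropna(subset=['income', 'college'])                 # drop rows missing either value
       .reset_index(drop=True)                               # renumber the remaining rows
)
clean
```

Only the Adams and Baker rows survive here. Whether to drop rows with missing values or fill them in some other way is a judgment call that depends on the dataset.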

# 1. What's In Here?
<p style='font-size:30px'>Initial Data Exploration</p>

![](.images/dataexploration.jpeg)

Let's take a first look at the primary election data. In the code cell above we loaded in four **dataframes** of interest:
```python
republican_primaries_county_level
democrat_primaries_county_level
republican_primaries_state_level
democrat_primaries_state_level
```
However, these variable names are really long! Let's assign them to some more conveniently named variables:

In [None]:
# Run this cell with Ctrl+Enter (or Cmd+Return on a Mac)

# Assign:
#  rep as republican_primaries_county_level
#  dem as democrat_primaries_county_level
#  rep_st as republican_primaries_state_level
#  dem_st as democrat_primaries_state_level

# For example, assign rep as republican_primaries_county_level
rep = republican_primaries_county_level
print('rep has been assigned! When there is a green check mark on the bottom left, it means the cell has run successfully.')

When we use `=`, the variable on the left side of the `=` is assigned the value on the right side of the `=`. We can reassign existing variables or create new ones.
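
One detail worth knowing: assigning a dataframe to a new variable gives it a second name rather than making a copy. The sketch below uses an invented toy dataframe (the names `a_very_long_dataframe_name`, `short`, and `independent` are hypothetical) to show the difference between `=` and `.copy()`.

```python
import pandas as pd

# Toy dataframe with a deliberately long name (values invented for illustration)
a_very_long_dataframe_name = pd.DataFrame({'votes': [10, 20, 30]})

short = a_very_long_dataframe_name            # a second name for the SAME dataframe
print(short is a_very_long_dataframe_name)    # True: no data was copied

independent = a_very_long_dataframe_name.copy()    # .copy() makes a separate dataframe
print(independent is a_very_long_dataframe_name)   # False: a different object, same values
```
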
>**Your turn**:  
Assign the long-named variables to the shorter names! We've already set up the new variable names, so there's no need to come up with your own. We will be using these variable names for the rest of the notebook.

>*Hint: use the right side of the `=` to assign values to the new variables.*

In [None]:
### Finish and run this code to reassign the other dataframe variables ###

dem = 
rep_st = 
dem_st = 

##########################################################################

Throughout this notebook we will provide answers in hidden code cells. You can see the answer to this problem by clicking the blue "Show it.":

In [None]:
dem = democrat_primaries_county_level
rep_st = republican_primaries_state_level
dem_st = democrat_primaries_state_level

The first two, `rep` and `dem`, are the county-level primary election results for the Republican and Democratic primaries, respectively. We also have their state-level counterparts, `rep_st` and `dem_st`, which hold the same data summarized at the state level. Let's start by looking at `rep`. We can simply type in `rep` and run the cell to see the first and last 5 rows of data.

In [None]:
# Run this cell with Ctrl+Enter (or Cmd+Return on a Mac) to see the output
rep

If you ran the cell above you should see a dataframe with 2092 rows and 22 columns. **Dataframes** are tables of data. Each row is an observation (for `rep` this is a county) and each column, or **feature**, is a property of the observations.
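
If you want the row and column counts without scrolling, a dataframe's `shape` and `columns` attributes report them directly. Here is a small sketch on an invented toy dataframe (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Toy dataframe (invented values): 3 observations (rows) and 2 features (columns)
toy = pd.DataFrame({
    'population': [1200, 56000, 830],
    'income': [41000, 58500, 39750],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(list(toy.columns))     # ['population', 'income']
```

Running `rep.shape` in a new cell should report the same row and column counts described above.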

We can also use `head()` to view just the first 5 rows of any dataframe. Note that `head()` is a method for dataframes.

In [None]:
# Run this cell!
rep.head()

![](.images/NumbersEverywhere.png)
<!-- <img src='.images/NumbersEverywhere.png' style="width:80%; margin-left: auto; margin-right: auto;"> -->

The `head()` method is useful when you just want to get a quick idea of what is in your data. Let's take stock of what features are in `rep`, the Republican primary election data.

The 22 features of `rep` are:
- `st_abbrev`: State abbreviation
- `fips`: US Census ID (unique identifier for every census region e.g. state, county, etc.)
- `population`: Total population
- `income`: Median household income
- `hispanic`, `asian`, `black`, `white`, `foreign`, `college`, `female`, `senior`, `children`: Percentage of individuals that are in these demographics.
    - Note that `foreign` refers to foreign born individuals and `college` refers to individuals with Bachelor's degrees.
- `density`: Population per square mile
- `vets`: Population of veterans
- `st_cnty`: Concatenation of state abbreviation with county name
- `state`: Full state name
- `winner`: Name of the candidate that won the election for the county
- `votes`: Number of votes for the winning candidate
- `fraction_votes`: Fraction of the votes that the winning candidate received
- `total_votes`: Total number of votes cast in the county
- `voter_turnout`: Proportion of eligible voters that cast votes in the election

>**Your turn:**  
Compare the output from `head()` on `dem`, `rep_st`, and `dem_st`. What are the differences? What can you tell about the data from just the first 5 rows?


In [None]:
### Enter your code for dem below: ###



######################################

In [None]:
### Enter your code for rep_st below: ###



#########################################

In [None]:
### Enter your code for dem_st below: ###



#########################################

The answer is hidden in the cell below:

In [None]:
# You should have entered:
# dem.head()
# rep_st.head()
# dem_st.head()
# into separate cells.

# Notice that both `rep` and `dem` have one additional column, `st_cnty`
# compared to `rep_st` and `dem_st`. This is because they contain county
# level data.

# 2. Show Don't Tell!
<p style='font-size:30px'>Introduction to Data Visualization</p>

![](.images/dataviz2.jpeg)

<!-- <img src=".images/dataviz2.jpeg" width=100 /> -->

## What is Data Visualization?
Data visualization is the process of presenting the patterns, trends, or outliers in our data through pictures and graphs. It is an important aspect of Data Science because it makes it easier for people to understand our findings from the data. Good visualizations can have a strong impact on the information and messages you are trying to convey to your audience, and are a powerful tool in Data Science and Analytics.

<table><tr><td><img src='https://imgs.xkcd.com/comics/fuck_grapefruit.png' width=1000></td><td><img src='https://imgs.xkcd.com/comics/scary_names_2x.png' width=1000></td></tr></table>

### Plotting Libraries

Throughout this notebook we will use **`seaborn`** as our primary plotting library. Seaborn is built upon Matplotlib and allows us to easily create visually appealing data visualizations.

We imported seaborn earlier in part **0. `import data_science`**:

```python
import seaborn as sns
```
where we aliased `seaborn` to `sns`. Now we can access all of seaborn's plotting functionalities using `sns` instead.

>**Field Notes:**  
There are many (many!) other plotting libraries in the data science toolbox and seaborn is just a popular and convenient choice for this project. You can check out some common ones in the graph visualization below. Note that these are *just plotting libraries for Python* and there are even more if you consider other languages and standalone software!
<img src='https://rougier.github.io/python-visualization-landscape/landscape-colors.png' width=100%>
<p style='font-size:13px'><em>Source: <a href='https://pyviz.org/overviews/index.html'>PyViz</a></em></p>

## Types of Visualizations

![](.images/dataviz.jpeg)

### Scatter Plots

Scatter plots convey the relationship between two variables $x, y$ by plotting a scatter of all your data points at locations $(x_i, y_i)$ corresponding to the values of the two variables. Chances are you've already seen some before. They help answer questions about whether two variables are related in some way.

For example, do you think there is a relationship between the proportion of people with college degrees and the median household income of a county?

**Run the cell below** to plot the variables `college` and `income` to find out. We'll use the `scatterplot()` function from `seaborn` (aliased as `sns`).

In [None]:
# Run this cell!

sns.scatterplot(data=rep, x='college',  y='income');

We used the **`sns.scatterplot()`** function to create this scatter plot using three arguments: `x`, `y`, and `data`.

- `x`, `y`: features in the dataframe that correspond to the respective x and y axes.
- `data`: the dataframe you want to plot

From this scatter plot we can see that `college` and `income` are indeed related! Recall that the `college` feature is the percentage of individuals with college degrees. This plot shows us that counties with a higher percentage of college graduates tend to have higher median household incomes.

>**Your turn**:  
We'd like to see if there is a relationship between the percentage of blacks and the percentage of whites in counties from the **Democratic** primary. Can you create a scatter plot to show this relationship?

>*Hint: use the `white` and `black` features from **`dem`**.*

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution:

# sns.scatterplot(data=dem, x='white', y='black');

From this plot we can see that `white` and `black` are also related, but in a different way than `college` and `income` are. We'll unpack this more later but for now we'll just note that there seem to be **two distinct "arms"** in the plot that have different relationships.

>**Food for thought:**  
Can you think about why that is? What does this mean? What is this plot telling us?

#### Hue
Another parameter for many seaborn functions you should know is called **`hue`**. As you might guess, this parameter lets us specify a unique hue for each data point or plot element. For `scatterplot()`, this parameter allows us to **color each data point based on a third variable**.

**Run the cell below** where we've added `hue='winner'` to color each data point based on which candidate won the county.

In [None]:
# Run this cell!

sns.scatterplot(data=rep, x='college',  y='income', hue='winner');

From this we can glean some more information from our `income` vs. `college` scatterplot and see that Marco Rubio predominantly won counties that had high incomes and high percentages of college graduates.

>**Your turn:**  
Can you use the **`hue`** parameter to color each data point in the `black` vs. `white` scatter plot by the `winner`? What new conclusion can we draw about which demographics Hillary and Bernie won?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# sns.scatterplot(data=dem, x='white', y='black', hue='winner');

# Bernie tends to win counties with low percentages of black individuals
# and high percentages of white individuals.
# Vice versa for Hillary.

Great job so far! Let's come back to the `income` vs. `college` scatter plot briefly. **Run the cell below** so we can take a look at it again.

In [None]:
# Run this cell!

sns.scatterplot(data=rep, x='college',  y='income');

Earlier we said that based on this scatter plot, we know that `income` and `college` are related. But how exactly? As `college` increases, so does `income`. We would say that these variables have a **positive correlation**.

### Correlations

A scatter plot can have one of three types of linear relationships (aka correlations):

- **Positive correlation**: As the x variable increases, the y variable also increases
- **Negative correlation**: As the x variable increases, the y variable decreases
- **No correlation**: No relationship or trend between the x and y variable. 

![](.images/correlation.jpeg)
<!-- <img src='.images/correlation.jpeg' width=800 style="margin:auto"> -->

More specifically, correlations can take on values between -1 and 1 with the strongest correlations being -1 and 1, and the weakest being 0. We want to take note of strong correlations in our data because they can indicate important relationships between variables.

Here are some rules of thumb for correlation values $r$:
- $|r| \approx 0.1$: **weak**
- $|r| \approx 0.3$: **moderate**
- $|r| \geq 0.5$: **strong**

The stronger the correlation, the more closely the two variables are linked.

>**Field notes:**  
There are actually many different kinds of correlation coefficients! The one we will be using, the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) developed by Karl Pearson, is one of the most common. It is formulated to quantify the strength of **linear** relationships. Other correlation coefficients can quantify the strength of relationships that are not necessarily linear.
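
For the curious, the Pearson coefficient can be computed by hand: multiply each point's deviations from the two means, sum them, and divide by the square root of the product of the summed squared deviations. The sketch below does this on five invented data points (`x_demo` and `y_demo` are made-up values) and checks the result against pandas.

```python
import numpy as np
import pandas as pd

# Five invented data points, just to make the arithmetic easy to follow
x_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x_demo - x_demo.mean()   # deviations of x from its mean
dy = y_demo - y_demo.mean()   # deviations of y from its mean

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x_demo.corr(y_demo)   # pandas uses Pearson by default

print(round(r_manual, 3), round(r_pandas, 3))   # 0.775 0.775
```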

The `corr()` method from `pandas` calculates the correlation between two features of our data. First, we need to pick out the two features, in this case `college` and `income`. We can select a feature using `get()` like so:

In [None]:
# Run this cell!
college = rep.get('college') # save college to a new variable

# or
# college = rep['college']

# or
# college = rep.college

college.head() # and view it

Even though we've selected just one column, `head()` is showing us what looks like two. The one on the left is the **index**, which gives a unique ID to each observation. The one on the right holds the actual values for `college`.

We can get any column (feature) from a dataframe with any of the following syntaxes:
```python
df.get('my_column')
df['my_column']
df.my_column
```
where `df` represents the dataframe we want a column from, and `'my_column'` is the name of the column. All three lines are equivalent: `get()`, the bracket syntax, and the dot syntax each return the same column. Note that the dot syntax only works when the column name is a valid Python name (for example, no spaces) and does not clash with an existing dataframe attribute or method.
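
Here is a short sketch comparing the three ways of selecting a column on an invented toy dataframe. The dataframe `toy` and the column name `'median income'` are hypothetical and chosen only to show where the approaches differ.

```python
import pandas as pd

# Toy dataframe (invented values); note the space in the second column name
toy = pd.DataFrame({'college': [21.5, 30.1], 'median income': [52000, 61000]})

a = toy.get('college')
b = toy['college']
c = toy.college
print(a.equals(b) and b.equals(c))         # True: all three give the same column

print(toy.get('median income').tolist())   # get() and brackets handle spaces in names
print(toy.get('not_a_column'))             # None: get() returns None for a missing column
# toy['not_a_column'] would raise a KeyError instead,
# and toy.median income is not valid Python at all.
```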

>**Your turn**:  
`get()` the income feature and save it to a new variable called `income`. Remember to use `=` for variable assignment.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
income = rep.get('income')

Now we can **compute the correlation** between `college` and `income` using the `corr()` function:

In [None]:
college.corr(income)

The **correlation coefficient is ~0.7**, so college and income are quite **strongly related**!

**Note:** the Pearson correlation coefficient is symmetric, so whether we compute `college.corr(income)` or `income.corr(college)` we will get the same value. Go ahead and try it if you're curious!

The `corr()` method lets us compute the correlation between any two columns of data with the following syntax:
```python
col1 = df.get('col1')
col2 = df.get('col2')
col1.corr(col2)
```
or in one line:
```python
df.col1.corr(df.col2)
```

where `col1` and `col2` are the columns or features we want to compute a correlation between and `df` is the dataframe we're using.

>**Your turn**:  
Let's come back to your `black` vs. `white` scatter plot using data from `dem`. Regenerate that plot below and compute the correlation so we can quantify the relationship between `black` and `white`.

>What is the value of the correlation coefficient? What does this tell you about the relationship between the percentages of black and white individuals in counties?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution:

# sns.scatterplot(data=dem, x='white', y='black', hue='winner');

# black = dem.black
# white = dem.white
# black.corr(white)

# r = -0.654. This is a strong negative correlation!
# As the percentage of blacks increases the percentage of whites decreases
# and vice versa.

### Correlation $\ne$ Causation!

In the earlier scatter plot we saw that `college` and `income` were positively correlated. This might suggest that being college educated causes an increase in median income. However, whether this is true or not **CANNOT** be determined from this data alone. If a set of data has variables with an existing correlation, it does NOT immediately mean one variable is the cause for the other. There could be other confounding variables (ones outside the scope of this data set) that influence both college education and income that explain the relationship between them.

Consider the situation where individuals from families with high socioeconomic status are more likely to end up in high-income jobs regardless of their education level. However, they are also more likely to attain a college education. Then we would see college and income correlated but both are to some degree caused by familial socioeconomic status. In fact, there is a [body of literature](http://www.shirleymohr.com/JHU/Sample_Articles_JHUP/RHE_2003_27_1.pdf) that suggests this is true. So, we'll say it again, correlation does not equal causation!
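
A quick simulation can make the idea of a confounder concrete. In this sketch all numbers are randomly generated and the variable names (`status_sim`, `college_sim`, `income_sim`) are hypothetical. By construction, the simulated college and income values each depend on status but never on each other, yet they still end up correlated.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the result is reproducible
n = 5000

# Simulated data, NOT real measurements
status_sim = rng.normal(size=n)                  # the confounder
college_sim = status_sim + rng.normal(size=n)    # driven by status plus noise
income_sim = status_sim + rng.normal(size=n)     # driven by status plus noise, never by college

r_observed = np.corrcoef(college_sim, income_sim)[0, 1]

# We built the data, so we can subtract out the confounder's contribution exactly
r_without_status = np.corrcoef(college_sim - status_sim,
                               income_sim - status_sim)[0, 1]

print(round(r_observed, 2))        # roughly 0.5: a strong-looking correlation
print(round(r_without_status, 2))  # roughly 0: nothing left once status is removed
```

With real data we rarely know the confounder this precisely, which is exactly why a correlation alone cannot settle questions of cause.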

To really sell you on this, [here](https://www.tylervigen.com/spurious-correlations) are many other obviously non-causal examples demonstrating this.

![](.images/SpuriousCorr.png)
![](.images/CausationIthinkNotMeme.png)
<!-- <img src='.images/SpuriousCorr.png' style="float:left;" width=700>

<img src=".images/CausationIthinkNotMeme.png" width=400> -->

### Linear Regression Plots

Sometimes plain scatter plots can be hard to interpret if the relationship between two variables is ambiguous. We can use a regression plot in this case to include a **regression line** that will clearly convey the trend between two variables. Note that regressions are not symmetric and show the effect of an independent variable `x` on a dependent variable `y`.

Let's start by revisiting the first scatter plot we made showing a positive correlation between `college` and `income`. **Run the cell below** to plot the effect of `college` on `income` using the `regplot()` function from `seaborn`.

In [None]:
# Run this cell!

sns.regplot(data=rep, x="college", y="income")

# Compute the correlation coefficient between the two plotted features
college = rep.get("college")
income = rep.get("income")
college.corr(income)

# or
# rep.college.corr(rep.income)

# recall:
# abs(r) ~= 0.1 small effect
# abs(r) ~= 0.3 medium effect
# abs(r) >= 0.5 large effect


We used the **`sns.regplot()`** function to create a scatter plot with a line of best fit. Its main arguments match those of `sns.scatterplot()`:

- `x`, `y`: features in the dataframe that correspond to the respective x and y axes.
- `data`: the dataframe you want to plot

Unlike `sns.scatterplot()`, `sns.regplot()` does not accept a `hue` argument. To color a regression plot by a third feature, seaborn provides `sns.lmplot()`, which does accept `hue`.

As we saw earlier, there is a clear positive relationship between college and income shown by the regression line. Also notice that the data points sit fairly close to the regression line.
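
A straight regression line is described by a slope and an intercept. As a rough sketch of what a least-squares line fit computes, here is the same kind of fit done with NumPy on five invented points (`x_demo` and `y_demo` are made up for illustration; this is not seaborn's internal code).

```python
import numpy as np

# Five invented data points
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit y = slope * x + intercept by least squares (a degree-1 polynomial)
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
print(round(slope, 2), round(intercept, 2))   # 0.6 2.2

# Use the fitted line to predict y for a new x value
print(round(slope * 6.0 + intercept, 2))      # 5.8
```

A positive slope goes with a positive correlation and a negative slope with a negative one.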

>**Your turn:**  
Can you create a regression line for a new scatter plot to show the effect of `white` on `voter_turnout` in the **Republican** primary?

>*Hint: use the `white` and `voter_turnout` features from **`rep`**.*

>Don't forget to compute the **correlation**!

>What does the regression plot tell you about the relationship between `white` and `voter_turnout`? What is the strength of this relationship?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# sns.regplot(data=rep, x='white', y='voter_turnout');
# rep.white.corr(rep.voter_turnout)

# r = 0.38. Moderate, positive correlation.
# The slope of regression is positive as well and sits pretty nicely in the middle
# of the data. This gives us some confidence in saying that as the percentage of
# whites increases, so does voter turnout in the Republican primaries.

**TODO?**

In [None]:
sns.regplot(data=dem, x='white', y='voter_turnout');

This time let's consider a different question: do you think there is a correlation between the percentage of a county that is `white` and the percentage that have `college` degrees?

>**Your turn:**  
Create a regression plot showing the effect of `white` on `college` in the **Republican** primary. Don't forget to **compute the correlation**!

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# sns.regplot(data=rep, x='white', y='college')  # college = m*(white) + b
# rep.college.corr(rep.white)

In the plot above, we can see that the regression line is essentially flat, and likewise there is very little correlation between `white` and `college`.

Now let's see an example where the data appear to be correlated, but the regression is actually invalid.

>**Your turn:**  
Create a regression plot showing the effect of `female`, the percentage of females in a county, on `college` in the **Republican** primary. As always, **compute the correlation coefficient** too.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# sns.regplot(data=rep, x='female', y='college')
# rep.female.corr(rep.college)

As you can see, there is a moderate ($r \approx 0.2$) positive correlation between the two features. However, you can also see that most of the data are concentrated in one portion of the x-axis and do not follow the line very well. In more advanced settings, you would need to perform a number of statistical tests to ensure the validity of the regression. In short, **the regression above is a poor fit and its result is invalid**. Be careful when making inferences from regression lines and always check the plots and correlation coefficients!

<!-- In many of the cases, if we had just looked at the plain scatter plot we might have thought there is a negative correlation between `college` and `black`. However, the scatter plot somewhat obscures the density of data points in the left of the plot. There are many counties that have a very low proportion of blacks and in these cases it appears that college attainment is independent of the proportion of blacks in the population.

We can try excluding these counties to better answer our question. It might not make sense to include counties with no blacks to see if being black has an effect on college attainment. We'll revisit this problem in the next part on data manipulation and learn how we can subset the data to exclude these counties. -->

Earlier we also mentioned that regression plots are not symmetric. They estimate the effect of the `x` variable on `y`, and not the other way around.
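
One way to see this asymmetry numerically: if regression were symmetric, the slope of `x` on `y` would simply be one divided by the slope of `y` on `x`. The sketch below uses randomly generated data (all values simulated, names hypothetical) to show that this is not the case. Instead, the two slopes multiply to $r^2$.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the result is reproducible

# Simulated data: y depends on x plus noise
x_sim = rng.normal(size=1000)
y_sim = 0.5 * x_sim + rng.normal(size=1000)

slope_y_on_x = np.polyfit(x_sim, y_sim, deg=1)[0]   # x as the independent variable
slope_x_on_y = np.polyfit(y_sim, x_sim, deg=1)[0]   # roles swapped
r = np.corrcoef(x_sim, y_sim)[0, 1]

print(slope_x_on_y, 1 / slope_y_on_x)        # not the same number
print(slope_y_on_x * slope_x_on_y, r ** 2)   # these two match
```

The correlation coefficient, by contrast, is the same whichever variable comes first.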

>**Your turn:**  
Let's revisit our last plot and try swapping the `x` and `y` variables. Can you plot the effect of `college` on `female`?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# sns.regplot(data=rep, x='college', y='female');

Before we even try to interpret this plot, let's think about the question it's trying to answer. Does it make sense to ask whether college attainment has an effect on being female?

This question seems ill-posed since someone's biological sex is determined long before they graduate college. When creating regression plots it's important to use your prior knowledge to justify the x vs. y relationship you're investigating. Otherwise, you might end up with nonsensical results.

### Box Plots

![](.images/boxandwhisk.png)
<!-- <img src=".images/boxandwhisk.png" style="width:70%; margin: auto;"> -->

Box plots (aka box & whisker plots) show a high-level summary of the values in your data. They include information about where the "middle" of your data is, where the central 50% is, and how spread out the values are. All of this provides a quantitative description of the distribution of your data. Take a look at the diagram of a box plot below.

<img src='https://miro.medium.com/max/18000/1*2c21SkzJMf3frPXPAR_gZA.png' width=80% style="margin:auto">

In the center you can see the median of the data as a yellow line. The box surrounding the median is the interquartile range (IQR), which contains the central 50% of the data. The "whiskers" extend from the edges of the box to the most extreme data points that lie within 1.5 times the IQR of the box, so they stop at the true minimum/maximum when no points lie farther out. Finally, any data points beyond the whiskers are considered outliers and plotted individually.
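
The pieces of a box plot can also be computed directly. The following sketch works through them on ten invented numbers (`values` is made up for illustration), using the common 1.5 × IQR rule for the whisker limits.

```python
import numpy as np

# Ten invented values; 50 is deliberately far from the rest
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, median, q3 = np.percentile(values, [25, 50, 75])   # edges of the box and the median
iqr = q3 - q1                                          # interquartile range

lower_limit = q1 - 1.5 * iqr   # whiskers may not extend past these limits
upper_limit = q3 + 1.5 * iqr

inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()   # whiskers end at real data points
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)         # 3.25 5.5 7.75 4.5
print(whisker_low, whisker_high)   # 1 9
print(outliers)                    # [50]
```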

For a more in-depth explanation check out this [article](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).

To start, let's use a box plot to get a summary of one feature in our data. **In the cell below** we use `boxplot()` from `seaborn` to generate a box plot of the feature `children`, the percentage of the population under 18.

In [None]:
# Run this cell!

sns.boxplot(data=rep, x='children');

We can see that the median county is about 22.5% children, and 50% of counties are between 20% and 25% children. Furthermore, almost all counties have a proportion of children between 15% and 30% with a few more outliers with more than 30% and just a handful less than 15%.

With just one feature plotted, box plots don't provide much more information than that. However, we can also split up or group a feature by another. Check out the plot below.

In [None]:
# Run this cell!

# Horizontal box plot
sns.boxplot(x = 'income', y = 'winner', data = rep);

We used the **`sns.boxplot()`** function to create the box plots above. It has similar arguments to the scatter and regression plot functions.

- `x`, `y`: variables in the dataframe that correspond to the respective `x` and `y` axes. When grouping one feature by another, one axis must be numeric and the other must be categorical.
- `data`: name of the dataframe

The `x` and `y` axes can be interchanged depending on if you want a horizontally or vertically oriented box plot.

The second plot has grouped the `income` feature by the winning candidates. In other words, we've created box plots of the `income` feature for counties won by each Republican candidate. The box plot in blue is the income of counties Donald Trump won, the orange box plot is the income of counties Ted Cruz won, and so on. Now we can tell that there is a difference between the income of counties that each candidate won. Marco Rubio's counties stand out in particular: he won counties with higher incomes than the other candidates.
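
The grouping a box plot performs visually can also be done numerically with `groupby()`. This sketch uses an invented toy dataframe (the candidates `'A'` and `'B'` and all income values are made up) to compute the median income for each winner, which corresponds to the center line of each box.

```python
import pandas as pd

# Toy dataframe (invented values): which candidate won each county, and its income
toy = pd.DataFrame({
    'winner': ['A', 'A', 'B', 'B', 'B'],
    'income': [40000, 50000, 60000, 70000, 80000],
})

# Split the income column by winner, then take the median of each group
medians = toy.groupby('winner')['income'].median()
print(medians)
```

Replacing `toy` with `rep` should give the medians behind the grouped box plot above.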

>**Your turn:**  
Change the plot above to be vertical instead.  
Hint: try swapping the `x` and `y` arguments.

In [None]:
# Vertical box plot

### Enter your code below: ###



##############################

In [None]:
# sns.boxplot(x = 'winner', y = 'income', data = rep);

Let's create one last box plot before we move on. Notice that we've been plotting the `income` feature in the plots above. If you'll recall the scatter plots we made earlier, remember that `income` had a positive correlation with `college`. Create a box plot showing `college` for each winning candidate. What pattern do you expect to see?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# sns.boxplot(data=rep, x='college', y='winner')

# Since income and college are correlated, we should expect to see the same pattern in the income
# and college box plots. Specifically, the pattern where Marco Rubio won higher income
# counties should also appear with college attainment.

### Choropleths

Choropleths show data on maps. Since we're working with United States voting data, it will be very useful to see our data in a geographical context. Choropleths will let us see patterns in the data that might be due to geography and answer the question of **"where?"**

For example, where do most people in the United States live? We can use a choropleth to answer this by plotting population across the United States.

Run the cell below to visualize the populations of all states in the Republican primary dataset using the `choropleth()` function.

In [None]:
# Run this cell!

choropleth(rep_st, 'population')

In the code above we used the choropleth function to plot the population of each state in `rep_st` onto a map of the United States. The choropleth function takes just two parameters: the dataset we want to plot and the name of the feature we want to show.

In this example we gave it `rep_st` as the dataset, so it plotted state-level features. What would happen if we passed a county level dataset instead?

**Interpretation:** What are the most populous states? Where are they? Why are they the most populous? What else might you infer about these states just based on their population?

**Your turn:** Can you use a choropleth to plot the populations of each county?

>**Field Notes:**  
You may have noticed that there is no data for several states (they're greyed out). This is because the voting data we are working with was only collected roughly up until a winner could be determined. Thus, some counties and states that voted later in the election were excluded. If you'd like access to the full census data let us know and we can provide you the additional files.

In [None]:
### Enter your code below: ###



# This cell might take a few seconds to finish!
#############################

In [None]:
# choropleth(rep, 'population')

# 3. Cleaning the Data Panda-Monium
<p style='font-size:30px'>Data Manipulation</p>

![](.images/clean.jpeg)

To help with data manipulation we're going to take a closer look at another Python library, `pandas`. Luckily you've already been using `pandas` this whole time! In part **0. `import data_science`** we loaded all of our data in as `pandas` DataFrame objects and have been passing them as arguments to `seaborn` and `plotly` for visualization.

We imported pandas like this:
```python
import pandas as pd
```
so now we can reference `pandas` by its alias `pd`.

`pandas` is a Python library that makes it easier for us to clean and organize data for analysis and visualization. There are several functions that `pandas` has for dataframes that are extremely useful for this project.

![](.images/sad_panda.jpg)

<!-- <img src='.images/sad_panda.jpg' style="margin-left:auto; margin-right:auto;" width=50%> -->

## Selecting Features

Earlier, we briefly touched on how to select features (or columns) from a dataframe, but here's a quick refresher.

The following lines of code are all equivalent and select a single feature called `column_name` from a dataframe, `df`:

```python
df.get('column_name')
df['column_name']
df.column_name
```

Let's say we want to look at individual columns. For example, if we use `rep.get('income')` or `rep['income']`, we will see the income column of the `rep` dataframe.

>**Your Turn:**  
Let's try it with another column. Can you view the `'population'` column of `rep`?

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution:

# rep.get('population')

# or
# rep['population']

# or
# rep.population

The `get()` function and bracket notation are useful when you want to see some data of a specified column. When you call this function on a dataframe, you can see the first and last 5 rows of the specified column.
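
The three ways of selecting a column can be checked against each other. This is a small sketch on a made-up dataframe (`df` stands in for `rep`; the numbers are invented), and it also shows one place where the approaches behave differently.

```python
import pandas as pd

# Tiny made-up dataframe standing in for `rep`
df = pd.DataFrame({'income': [41000, 52000, 38000], 'college': [18.5, 27.0, 12.3]})

by_get = df.get('income')
by_brackets = df['income']
by_attribute = df.income

# All three return the same pandas Series
print(by_get.equals(by_brackets) and by_brackets.equals(by_attribute))

# They differ when the column does not exist:
# get() quietly returns None, while brackets raise a KeyError
print(df.get('no_such_column'))
```

Dot notation only works when the column name is a valid Python name (no spaces, for example) and does not clash with an existing dataframe method, so brackets are the safest general choice.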

## Selecting Observations
What if we only wanted to look at one specific row or number of rows?

In [None]:
# Run this cell!
# Output the first row (index 0) of the dataframe
rep.iloc[0]

In [None]:
# You can also access a number of rows!
# 1st and 6th row (index 0 and 5, respectively)
rep.iloc[[0,5]]

In [None]:
# 1st three rows
rep.iloc[:3]

The `iloc` indexer accesses and returns groups of rows and/or columns by integer position. Remember, Python is 0-indexed, so to access the nth row you use index n-1.
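
Here is a short sketch of the different things you can put inside the brackets of `iloc`. It runs on a made-up four-row dataframe (not `rep`), so you can see exactly which rows come back.

```python
import pandas as pd

# Made-up dataframe for illustration
df = pd.DataFrame({'county': ['A', 'B', 'C', 'D'], 'income': [41000, 52000, 38000, 47000]})

first_row = df.iloc[0]        # single integer -> a Series (one row, shown vertically)
first_row_df = df.iloc[[0]]   # list of integers -> a DataFrame (shown horizontally)
first_two = df.iloc[:2]       # slice -> rows at positions 0 and 1 (the end is excluded)
one_cell = df.iloc[2, 1]      # row position 2, column position 1

print(type(first_row).__name__, type(first_row_df).__name__)
print(first_two)
print(one_cell)
```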

>**Your turn:**  
Get the 10th row of the `rep` dataframe.


In [None]:
### Enter your code below: ###



##############################

In [None]:
# rep.iloc[9]
# or use double brackets to view horizontally
# rep.iloc[[9]]

## Selecting Observations based on Conditions

<p style='font-size:20px'>...and more plotting options!</p>

We can select rows based on a condition. Let's say we only wanted to view counties with an income of less than $30,000. We can do this by utilizing `get()` or brackets to
filter rows that meet this requirement.
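
Filtering like this really happens in two steps, which the sketch below pulls apart. It uses a made-up three-row dataframe (`df`, with invented incomes) instead of `rep`.

```python
import pandas as pd

# Made-up dataframe for illustration
df = pd.DataFrame({'county': ['A', 'B', 'C'], 'income': [28000, 45000, 29500]})

# Step 1: the comparison yields one True/False value per row
mask = df['income'] < 30000
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
low_income = df[mask]
print(low_income)
```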

In [None]:
# Run this cell!
rep.get(rep.get('income') < 30000)

# or
# rep[rep['income'] < 30000]

# or
# rep[rep.income < 30000]

Now that we can filter rows, we can plot subsets of the data, for example to highlight a specific group of counties in a plot.

In the cells below, we're going to do exactly this by stacking **two plots together**. Namely,

1. A scatterplot of `college` vs. `income` of all counties in `rep`
2. A scatterplot of `college` vs. `income` of just counties with income < 30,000 in `rep`

Finally, we'll add a legend. Let's take it one plot at a time and when we're done we'll have something like this:

![](.images/low_income.png)

>**Your turn**:  
Create a scatter plot with `x=income` and `y=college` for just counties with income < 30,000 in `rep`. It should look like this:
![](.images/just_low_income.png)

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
low_income = rep[rep.income < 30000]
sns.scatterplot(data=low_income, x='income', y='college');

Now, we can add a plot to this figure by calling a plotting function like `scatterplot()` **in the same cell**.

>**Your turn**:  
Add your old scatter plot with `income` on the x-axis and `college` on the y-axis for counties in `rep`. The one that looked like this:
![](.images/college_v_income.png)


In [None]:
# Copy and paste your code from the last plot here:


### Now add the new plot below: ###



###################################

In [None]:
# Solution
# low_income = rep[rep.income < 30000]
# sns.scatterplot(data=low_income, x='income', y='college')
# sns.scatterplot(data=rep, x='income', y='college');

**_Wait a minute, why is everything orange?! And where's the low income data?_**

It turns out that **order matters** when combining consecutive plots. New plots will be stacked on top of old ones and cover up anything from older plots.

>**Your turn:**  
Change the plotting order so that you show the low income data last!

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# sns.scatterplot(data=rep, x='income', y='college')
# low_income = rep[rep.income < 30000]
# sns.scatterplot(data=low_income, x='income', y='college');

**Much better!**

![](.images/hidethedata_small.png)

Notice that **data from the new plot has a different color** than the first plot. The first plot is blue and the second is orange. Seaborn and Matplotlib automatically change the colors of consecutive plots using a default color cycle.

![](.images/MPL_default_colors.png)

You can directly use the default colors by their names `'C0'`, `'C1'`, `'C2'`, and so on. Most `seaborn` plotting functions accept a `color` argument that you can specify like this:
```python
sns.scatterplot(..., color='C3')
sns.scatterplot(..., color='C4');
# The ellipsis stands for the other required arguments like data=rep, x='income', etc.
```

We've changed the colors of the plots below. Go ahead and edit the cell to try changing the colors yourself too!

In [None]:
# Run this cell!

# Notice the new argument for scatterplot: color

sns.scatterplot(data=rep, x='income', y='college', color='C8')
low_income = rep[rep.income < 30000]
sns.scatterplot(data=low_income, x='income', y='college', color='C9');

We should **add a legend** to indicate what the colors mean. The `scatterplot()` function and many other seaborn functions (including every plot in this kit) have a `label` parameter to help with this. The syntax looks like this:
```python
sns.scatterplot(..., label='label 1')
sns.scatterplot(..., label='label 2');
# The ellipsis stands for the other usual arguments like data=rep, x='income', etc.
```
This would create a scatter plot with a legend showing points from the first plot as 'label 1' and points from the second plot as 'label 2'.

>**Your turn**:  
Use the `label` parameter to add an informative legend to the figure you've made so far.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# sns.scatterplot(data=rep, x='income', y='college', label='All')
# low_income = rep[rep.income < 30000]
# sns.scatterplot(data=low_income, x='income', y='college', label='Income < 30,000');

Great job! Hopefully it's clear now how you can subset your data for more targeted visualizations and analysis.

There are also several other comparison operators you can use to subset your data based on conditions.

**Comparison Operators:** used to compare values, producing a boolean (`True` or `False`) for each row.
* `<`: less than
    - Example: `rep['income'] < 30000`
* `>`: greater than
    - Example: `rep['income'] > 30000`
* `==`: equal to
    - Example: `rep['income'] == 30000`
* `<=`: less than or equal to
    - Example: `rep['income'] <= 30000`
* `>=`: greater than or equal to
    - Example: `rep['income'] >= 30000`

**Combining operators/conditions:**
* `|`: OR, used to match either condition or both
    - Example: `(rep['income'] < 30000) | (rep['income'] > 60000)`
    - Will match counties where the median income is either less than \$30,000 or greater than \$60,000.
* `&`: AND, used to match both conditions
    - Example: `(rep['income'] < 30000) & (rep['college'] > 15)`
    - Will match counties where the median income is less than \$30,000 and more than 15% of population has a college degree.

>**Important Note!** When combining conditions each condition must be wrapped with parentheses `()`!
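
The sketch below tries out both ways of combining conditions on a made-up dataframe (`df`, with invented values) and shows what goes wrong when the parentheses are left out.

```python
import pandas as pd

# Made-up dataframe for illustration
df = pd.DataFrame({'income': [25000, 28000, 45000, 70000], 'college': [20.0, 10.0, 30.0, 40.0]})

either = (df['income'] < 30000) | (df['income'] > 60000)   # OR
both = (df['income'] < 30000) & (df['college'] > 15)       # AND

print(either.tolist())
print(both.tolist())

# Without parentheses, `&` is evaluated before `<` and `>`, so Python reads
# df['income'] < (30000 & df['college']) > 15, which ends in an error.
try:
    df['income'] < 30000 & df['college'] > 15
except (TypeError, ValueError) as err:
    print('Error without parentheses:', type(err).__name__)
```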

Now let's try creating a data visualization using more complex boolean operators and conditional statements. When we're done we'll have a plot like this:

![](.images/black_v_white_filtered.png)

>**Your turn**:  
We'll work with a plot from earlier. Create a `scatterplot()` comparing `black` vs. `white` in counties from the **Democratic** primary. Color the points by the `winner` of each county. It should look like this:
![](.images/black_v_white_dem.png)

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# sns.scatterplot(data=dem, x='white', y='black', hue='winner');

As we noticed earlier, this plot is interesting because it has two "arms" stretching out from the bottom right with almost no counties in between. Furthermore, a large proportion of Hillary's counties are in the upper arm.

**We'd like to pick out counties in the upper arm that Hillary won for further analysis.**

How can we do this? We could imagine drawing a line to cut off the bottom arm and select it that way, but this would also select the counties falling in between the arms. Something like this:

![](.images/black_v_white_dem_simple_cut.png)

Instead, we can draw a line through the plot of the form $ y > mx + b$ to pick out everything above that line. Like so:

![](.images/black_v_white_dem_diag_cut.png)

Let's get an expression to select the upper arm. 

We'll start with the inequality describing the region above a line:

$$ y > mx + b $$

Based on our plot's axes, we know `black` is the y variable and `white` is the x variable.

$$ \text{black} > m\cdot\text{white} + b $$

Now we just need to figure out what $m$ and $b$ should be. We can pick two points we want the line to connect and get an expression. We'll use $(0, 50)$ and $(50, 0)$.

$$ \text{black} - 50 = \frac{0 - 50}{50 - 0}(\text{white} - 0) $$

$$ \text{black} - 50 = \frac{-50}{50}\cdot \text{white} $$

$$ \text{black} = -1 \cdot \text{white} + 50 $$

Counties above this line satisfy $\text{black} > -\text{white} + 50$, which rearranges to

$$ \text{black} + \text{white} > 50 $$

and more generally,

$$ \text{black} + \text{white} > \text{threshold} $$

Intuitively, this filter lets us select counties that are almost exclusively white and black. If $\text{threshold} = 50$, then at most 50% of the population can be some other race; with $\text{threshold} = 80$, at most 20%. As we increase the threshold, this will select counties that are more exclusively white and black.

In code this would look like
```python
filter = df.white + df.black > threshold
```
where `df` is any of the four dataframes used in this project, `rep`, `dem`, `rep_st`, or `dem_st`. And now we could use
```python
df[filter]
```
to select the counties (or states if you use a state-level dataframe) that satisfy the filter.
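
To see how the threshold changes the selection, here is a small sketch on five hypothetical counties. The percentages are made up for illustration and the variable names (`total`, `mask`, `selected`) are our own choices.

```python
import pandas as pd

# Made-up percentages for five hypothetical counties
df = pd.DataFrame({
    'white': [90.0, 40.0, 55.0, 30.0, 45.0],
    'black': [5.0, 55.0, 10.0, 15.0, 40.0],
})

total = df['white'] + df['black']   # 95, 95, 65, 45, 85

for threshold in (50, 80):
    mask = total > threshold
    # At most (100 - threshold) percent of a selected county can be another race
    print(threshold, '->', mask.sum(), 'counties selected')

selected = df[total > 80]
print(selected)
```

Naming the variable `mask` rather than `filter` is a small design choice: it avoids hiding Python's built-in `filter()` function.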

>**Your turn:**  
Start by adding the white percentage and black percentage together and compare this sum to a threshold, like 50%. Assign this to a new variable as shown in the example above and use this filter to pick out these counties from `dem`. Finally, plot the result of subsetting.  
*Hint: use the same code for the original scatter plot but instead pass the subset dataframe as the argument to `data`. Something like this: `data=df[filter]`*

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# bnw_threshold = 50
# bnw_mask = dem.white + dem.black > bnw_threshold
# bnw = dem[bnw_mask]
# sns.scatterplot(data=bnw, x='white', y='black', hue='winner');

With the threshold at 50% we're still getting a fair number of counties that aren't in the upper "arm". Let's try changing the threshold to something more stringent.

>**Your turn:**  
Instead of 50, try changing the threshold to something else and create a new plot using the updated threshold.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# bnw_threshold = 80
# bnw_mask = dem.white + dem.black > bnw_threshold
# bnw = dem[bnw_mask]
# sns.scatterplot(data=bnw, x='white', y='black', hue='winner');

Great job picking out the upper arm! However, we're not done yet. There are still some counties in this arm that Hillary didn't win.

Let's try applying an additional filter. In addition to our condition above, we can add **another** condition that selects counties where the black population is greater than some percentage.

For example:
```python
original_filter = df.white + df.black > 80

additional_filter = df.black > 50 # selects counties that are >50% black

new_filter = original_filter & additional_filter # selects counties with black + white > 80% AND black > 50%

filter = (df.white + df.black > 80) & (df.black > 50) # or in one line (need the parentheses!)
```

>**Your turn:**  
Add to your original filter to select only the counties that Hillary won in the upper arm. Then, plot `x='white'` and `y='black'` again after applying your final filter. Your final plot should look something like this:
![](.images/black_v_white_filtered.png)

In [None]:
# Solution
# 
# bnw_mask = (dem.white + dem.black > 80) & (dem.black > 16)
# bnw = dem[bnw_mask]
# sns.scatterplot(data=bnw, x='white', y='black', hue='winner');

Great job! We can see that the graph is entirely blue now, showing that Hillary Clinton won all of the counties that satisfied our filter. From this process, we've learned that Hillary dominated in counties with even a modest black population. Furthermore, this was especially true in counties without much of an Asian or Hispanic population.

We could also look at other variables within the subset of data and try to understand if there are any other reasons why these counties all voted for Hillary. For example, **where** are these counties? What are their **population sizes**? Since Bernie is more liberal than Hillary, **does this mean these counties are more moderate?**

### `isin()`

Sometimes you will need to compare a feature to several possibilities at the same time. For example, suppose we wanted to select counties from California, New York, and Texas. We could use the following filter:
```python
filter = (df.state == 'California') | (df.state == 'New York') | (df.state == 'Texas')
```
but you can imagine that if we wanted to include more states the filter would become very verbose.

`pandas` has a function to help! The `isin()` function checks, for each element of a Series (or DataFrame), whether that element appears in the collection of values you pass in. It returns `True` for elements that are found in the collection and `False` for those that are not.

We could replace our filter above with the following code instead:
```python
filter = df.state.isin(['California', 'New York', 'Texas'])
```
Much better! In the example above we gave `isin()` a list, but it also accepts other list-like objects such as sets and `pandas` Series (i.e. columns from DataFrames). A single string is not accepted, so wrap it in a list instead. Check out the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) for more information.
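
You can confirm that the two filters agree with a quick sketch. The dataframe below is made up for illustration (the `county` labels are placeholders).

```python
import pandas as pd

# Made-up dataframe for illustration
df = pd.DataFrame({
    'state': ['California', 'Ohio', 'Texas', 'New York', 'Iowa'],
    'county': ['A', 'B', 'C', 'D', 'E'],
})

wanted = ['California', 'New York', 'Texas']

long_form = (df.state == 'California') | (df.state == 'New York') | (df.state == 'Texas')
short_form = df.state.isin(wanted)

# Both approaches give the same True/False value for every row
print(long_form.equals(short_form))
print(df[short_form])
```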

>**Your turn:**  
Create a choropleth showing the percentage of whites in counties from the **Republican** primary within the following states:  
**Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, District of Columbia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, and Texas**.  
These are all the states in the 'South' US Census region.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution

# southern_states = ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'District of Columbia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas']
# southern_states_filter = rep.state.isin(southern_states)
# choropleth(rep[southern_states_filter], 'white')

### `subset()`
Now that you've had practice with the `isin()` function and filtering, we want to introduce a function we have created for you to help subset the data by US Census regions and historical voting patterns.  

What if we wanted to subset the dataset by the western census region?

In [None]:
# Run this cell!
subset(rep_st, 'west')

The function returns a new dataframe containing only the rows of `rep_st` for states in the `west` region. The `subset()` function allows you to easily filter the data with our preset filters. If you want to create your own filter, you can use a comparison like the race filter we built earlier, or the `isin()` function to subset by geographical location.
  
We have several filters you can use:

- `solid_blue`, `solid_red`, `swing`: states grouped by how they have traditionally voted, as of 2016
- `northeast`, `midwest`, `south`, `west`: states grouped by US Census region
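
If you're curious how a helper like this could be built from `isin()`, here is one possible sketch. It is only a guess at the general idea, not the actual code behind `subset()`: the names `subset_sketch` and `PRESETS` are hypothetical, and the state lists are deliberately shortened.

```python
import pandas as pd

# Hypothetical preset filters; the state lists are shortened for illustration
PRESETS = {
    'west': ['California', 'Oregon', 'Washington', 'Nevada'],
    'south': ['Texas', 'Florida', 'Georgia', 'Alabama'],
}

def subset_sketch(df, preset):
    """Return the rows of `df` whose `state` belongs to the named preset."""
    if preset not in PRESETS:
        raise ValueError(f'Unknown preset: {preset!r}')
    return df[df['state'].isin(PRESETS[preset])]

# Made-up dataframe for illustration
toy = pd.DataFrame({
    'state': ['California', 'Texas', 'Oregon', 'Ohio'],
    'population': [100, 200, 300, 400],   # invented numbers
})

print(subset_sketch(toy, 'west'))
```

For the real list of presets and their exact behavior, rely on `help(subset)` rather than this sketch.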

You can also run the cell below to view some helpful information about the `subset()` function.

In [None]:
# Run this cell if you want more information!
help(subset)

> **Your turn:**  
Use `subset()` to subset the `rep` dataframe by the `south` filter.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# subset(rep, 'south')

We can also combine `subset()` with our plotting functions to explore differences between groups of states.

>**Your turn:**  
Create a choropleth to show the percentage of white people in historically red states. You should first subset to select `'solid_red'` states and pass the resulting dataframe to `choropleth` to show the `white` feature.

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# solid_red_df = subset(rep, 'solid_red')
# choropleth(solid_red_df, 'white')

>**Your turn:**  
1. Make a scatterplot from `rep` with `income` on the x-axis and `college` on the y-axis.  
2. Add a second scatterplot from a `solid_blue` subset of `rep` with `income` on the x-axis and `college` on the y-axis.  
3. Use the `label` argument to create an informative legend.  
4. Use the `color` argument to make the counties from the solid blue states blue and the rest orange. You can use 'blue' and 'orange' (or use 'C0' and 'C1' to access the default color cycle).

>Your finished plot should look like this:
![](.images/solid_blue_college_v_income.png)

In [None]:
### Enter your code below: ###



##############################

In [None]:
# Solution
# sns.scatterplot(data=rep, x='income', y='college', color='C1', label='All States')
# sns.scatterplot(data=subset(rep,'solid_blue'), x='income', y='college', color='C0', label='Solid Blue States');
