In [1]:
### Don't push "Run notebook"!!!

# assert False, 'Please don\'t press the "Run notebook" button!'

# `import data_science`

Welcome to Chapter 1, **What's In Here?** Before we start working with our data we have to get our coding environment ready to do some work.

**Reminder:** you can use the table of contents in the bottom left to quickly navigate notebooks!

>**Run the cell below** to `import` the Python packages we need, set a plotting theme, and load our data.

![](.images/import.jpeg)

In [2]:
# Run this cell with Ctrl+Enter (or Cmd+Return on a Mac)

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from utils import *

# Set a plotting theme with seaborn
sns.set_theme() 

# Load our data
print('Loading data from Datasets folder...')

(geo_cnty,
republican_primaries_county_level,
democrat_primaries_county_level,
republican_primaries_state_level,
democrat_primaries_state_level) = load_data()

print('Finished loading data!')

Loading data from Datasets folder...
Finished loading data!


>**Field Notes:**  
Usually, getting your data isn't as easy as calling a `load_data()` function. Often, the data you want is scattered across different websites, filled with NaNs (missing values), or in an unstructured form. These irregularities need to be "cleaned" before you can start working with your data. Data cleaning is beyond the scope of this project, but you can read more [here](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d) if you're interested.
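
As a small taste of what cleaning can look like, here is a minimal sketch on a made-up table (the county labels and values are hypothetical and do not come from our datasets). It counts missing values, drops a row that has no usable numbers, and fills the remaining gaps with each column's median. Real cleaning usually takes more thought than this.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: NaN marks a missing value
messy = pd.DataFrame({
    'county': ['A', 'B', 'C', 'D'],
    'population': [1000, np.nan, 3000, np.nan],
    'income': [50000, 42000, np.nan, np.nan],
})

# Count the missing values in each column
missing_counts = messy.isna().sum()
print(missing_counts)

# Drop rows where both numeric features are missing (county D)
cleaned = messy.dropna(subset=['population', 'income'], how='all')

# Fill the remaining gaps with each column's median
cleaned = cleaned.fillna({
    'population': cleaned['population'].median(),
    'income': cleaned['income'].median(),
})
print(cleaned)
```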

## Restarting a Notebook

If you take a break from working on the kit, you will find that the notebook's machine turns off. You can tell because the "Machine" icon in the left sidebar will say "OFF".
<center>

![](.images/machine_off.png)
</center>

This happens when the machine hasn't received any input for 15 minutes or more. You'll have to restart the machine, which clears all variables. To restart the machine, simply **re-run the `import data_science` cell in the respective Notebook**.

What about the new variables that you created while working in the particular notebook?

For this kit, you need to **manually re-run every cell that assigns a new variable**, just like the `import data_science` cell. Like the imported libraries, variables are cleared from memory on restart. If you only re-run the imports, you will see the error message **"NameError: name 'xxx' is not defined"**, which means a downstream cell is trying to use a variable that has no value. This happens because the cell that defines the variable was not re-run after the restart.
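
To see what this error looks like without breaking anything, the sketch below deliberately uses a name that has not been assigned yet (`my_new_variable` is a made-up name) and catches the resulting `NameError`. It then assigns the variable, which is what re-running the defining cell does for you.

```python
# `my_new_variable` is a hypothetical name that nothing has assigned yet,
# which mimics the state of a notebook right after a restart.
try:
    print(my_new_variable)
except NameError as err:
    message = str(err)
    print('NameError:', message)

# Running the cell that assigns the variable fixes the problem
my_new_variable = 42
print(my_new_variable)
```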

*Hint: When you're working on your project (outside of this kit) simply push the "Run notebook" button, and all cells will be run.*

**Now we're ready to start working with our data and explore what we just loaded!**

# 1. What's In Here?
<p style='font-size:30px'>Initial Data Exploration</p>

![](.images/dataexploration.jpeg)

Let's take a first look at the primary election data. In the code cell above we loaded in four **dataframes** of interest:
```python
republican_primaries_county_level
democrat_primaries_county_level
republican_primaries_state_level
democrat_primaries_state_level
```
However, these variable names are really long! Let's assign them to some more conveniently named variables:

In [3]:
### Finish and run this code to reassign the other dataframe variables ###

dem = democrat_primaries_county_level

# Uncomment the lines below and assign them to their corresponding dataframes
# rep = 
# rep_st = 
# dem_st = 

##########################################################################

Throughout this notebook we will provide answers in hidden code cells. You can see the answer to this problem by clicking the blue "Show it.":

In [4]:
# County-level dataframes
dem = democrat_primaries_county_level
rep = republican_primaries_county_level

# State-level dataframes
rep_st = republican_primaries_state_level
dem_st = democrat_primaries_state_level

The first two, `rep` and `dem`, are the county-level results for the Republican and Democratic primaries, respectively. We also have their state-level counterparts, `rep_st` and `dem_st`, which hold the same data summarized at the state level. Let's start by looking at `rep`. We can simply type in `rep` and run the cell to see the first and last 5 rows of data.

In [5]:
# Run this cell with Ctrl+Enter (or Cmd+Return on a Mac) to see the output
rep

Unnamed: 0,st_abbrev,fips,population,income,hispanic,asian,black,white,foreign,college,...,female,senior,children,st_cnty,state,winner,votes,fraction_votes,total_votes,voter_turnout
0,AL,01001,55395,53682,2.7,1.1,18.7,75.6,1.6,20.9,...,51.4,13.8,25.2,AL_Autauga,Alabama,Donald Trump,5387,0.445,11839,0.285721
1,AL,01003,200111,50221,4.6,0.9,9.6,83.0,3.6,27.7,...,51.2,18.7,22.2,AL_Baldwin,Alabama,Donald Trump,23618,0.469,49100,0.315378
2,AL,01005,26887,32911,4.5,0.5,47.6,46.6,2.9,13.4,...,46.6,16.5,21.2,AL_Barbour,Alabama,Donald Trump,1710,0.501,3357,0.158447
3,AL,01007,22506,36447,2.1,0.2,22.1,74.5,1.2,12.1,...,45.9,14.8,21.0,AL_Bibb,Alabama,Donald Trump,1959,0.494,3891,0.218845
4,AL,01009,57719,44145,8.7,0.3,1.8,87.8,4.3,12.1,...,50.5,17.0,23.6,AL_Blount,Alabama,Donald Trump,7390,0.487,14791,0.335417
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2087,WI,55133,395118,75850,4.6,3.3,1.4,89.4,4.6,40.2,...,50.9,16.5,22.4,WI_Waukesha,Wisconsin,Ted Cruz,75123,0.610,120830,0.394082
2088,WI,55135,52066,50822,3.0,0.5,0.4,94.8,1.6,16.6,...,49.7,19.5,21.2,WI_Waupaca,Wisconsin,Ted Cruz,5194,0.471,10748,0.261967
2089,WI,55137,24178,43070,6.4,0.5,2.1,89.8,2.4,14.3,...,47.3,22.5,18.7,WI_Waushara,Wisconsin,Donald Trump,2391,0.451,5152,0.262099
2090,WI,55139,169511,51010,3.9,2.7,2.0,89.6,3.2,25.5,...,49.7,14.7,20.8,WI_Winnebago,Wisconsin,Ted Cruz,16049,0.469,33294,0.247995


<center>

![](.images/NumbersEverywhere.png)

</center>

If you ran the cell above you should see a dataframe with 2092 rows and 22 columns. **Dataframes** are tables of data. Each row is an observation (for `rep` this is a county) and each column, or **feature**, is a property of the observations.
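
If you only want a few rows from the top or the bottom, pandas provides `head()` and `tail()`. This is a minimal sketch on a toy dataframe with made-up values, not on `rep` itself.

```python
import pandas as pd

# Toy dataframe with 12 observations (rows) and 2 features (columns)
toy = pd.DataFrame({
    'county_id': range(12),
    'votes': [v * 100 for v in range(12)],
})

first_rows = toy.head()    # first 5 rows by default
last_rows = toy.tail(3)    # last 3 rows
print(first_rows)
print(last_rows)
```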

You can also use `df.shape` and `df.columns` to start to get a handle on your data, where `df` is a Dataframe. **Run the cell below** to see the `shape` of `rep`.

In [6]:
# Run this cell!
rep.shape

(2092, 22)

`df.shape` returns the number of rows and columns in the dataframe `df` as a tuple. The first value is the number of rows, and the second value is the number of columns. From the output above we see again that `rep` contains 2092 counties and 22 columns, or features, describing each county.
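
Because `df.shape` is an ordinary tuple, you can unpack it into two variables. The sketch below uses a tiny stand-in dataframe built from three of the rows of `rep` printed earlier, so its numbers are much smaller than those of `rep`.

```python
import pandas as pd

# Small stand-in built from three rows of the output shown earlier
toy = pd.DataFrame({
    'st_cnty': ['AL_Autauga', 'AL_Baldwin', 'AL_Barbour'],
    'population': [55395, 200111, 26887],
    'winner': ['Donald Trump', 'Donald Trump', 'Donald Trump'],
    'votes': [5387, 23618, 1710],
})

n_rows, n_cols = toy.shape   # unpack the (rows, columns) tuple
print(f'{n_rows} rows and {n_cols} columns')

# len(df) is another way to get the number of rows
print(len(toy) == n_rows)
```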

Now **run the next cell** to see the `columns` or features of `rep`.

In [7]:
# Run this cell!
rep.columns

Index(['st_abbrev', 'fips', 'population', 'income', 'hispanic', 'asian',
       'black', 'white', 'foreign', 'college', 'density', 'vets', 'female',
       'senior', 'children', 'st_cnty', 'state', 'winner', 'votes',
       'fraction_votes', 'total_votes', 'voter_turnout'],
      dtype='object')

`df.columns` returns an `Index`, a list-like object holding the name of each column in the dataframe `df`. We can see that `rep` has a number of features.
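
Here are a few common things to do with `df.columns`, sketched on a small stand-in dataframe (the values come from the first two rows of `rep` shown earlier): convert it to a plain Python list, check whether a feature exists, and select a subset of columns by name.

```python
import pandas as pd

# Stand-in dataframe with three of the features of `rep`
toy = pd.DataFrame({
    'st_abbrev': ['AL', 'AL'],
    'population': [55395, 200111],
    'winner': ['Donald Trump', 'Donald Trump'],
})

column_names = list(toy.columns)        # plain Python list of names
print(column_names)

has_winner = 'winner' in toy.columns    # membership test
has_county = 'st_cnty' in toy.columns   # this toy table has no such column
print(has_winner, has_county)

# Select a subset of features by passing a list of column names
subset = toy[['st_abbrev', 'winner']]
print(subset)
```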

>**Food for thought:**  
With just these names, what do you think these features are? In practice, you may or may not be given a **data dictionary** that describes what each feature of your data is. If not, you often have to do some investigating to figure out what exactly you have! This can mean simply asking whoever gave you the data what the features are or otherwise digging into the process that generated the data.

## What are the county-level features?
Let's take a look at what features are in `rep`, the Republican primary election data. There's a lot here so just skim it for now. **You can come back to this if you are ever confused about what the columns of `rep`, `dem`, `rep_st`, and `dem_st` represent!**

- `st_abbrev`: State abbreviation
- `fips`: US Census ID (unique identifier for every census region e.g. state, county, etc.)
- `population`: Total population
- `income`: Median household income
- `hispanic`, `asian`, `black`, `white`, `foreign`, `college`, `female`, `senior`, `children`: Percentage of individuals that are in these demographics.
    - Note that `foreign` refers to foreign born individuals and `college` refers to individuals with Bachelor's degrees.
- `density`: Population per square mile
- `vets`: Population of veterans
- `st_cnty`: Concatenation of state abbreviation with county name
- `state`: Full state name
- `winner`: Name of the candidate that won the election for the county
- `votes`: Number of votes for the winning candidate
- `fraction_votes`: Fraction of the votes that the winning candidate received
- `total_votes`: Total number of votes cast in the county
- `voter_turnout`: Proportion of eligible voters that cast votes in the election
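
As a sanity check on these definitions, the first two rows of `rep` printed earlier appear consistent with `voter_turnout` being `total_votes` divided by the non-child population. That formula is an inference from the printed numbers, not something stated by the data's source, so treat it as an assumption.

```python
import pandas as pd

# First two rows of `rep`, copied from the output shown earlier
sample = pd.DataFrame({
    'st_cnty': ['AL_Autauga', 'AL_Baldwin'],
    'population': [55395, 200111],
    'children': [25.2, 22.2],            # percent of the population
    'total_votes': [11839, 49100],
    'voter_turnout': [0.285721, 0.315378],
})

# Assumed formula: eligible voters = population minus children
eligible = sample['population'] * (1 - sample['children'] / 100)
estimated_turnout = sample['total_votes'] / eligible

# Compare the estimate with the reported column
difference = (estimated_turnout - sample['voter_turnout']).abs()
print(estimated_turnout.round(6))
print(difference.max() < 1e-4)
```

If the check fails on other rows of the real data, the assumed formula is wrong or incomplete for those counties.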

>**Your turn:**  
Compare the output from `df.shape` and `df.columns` for `rep`, `dem`, `rep_st`, and `dem_st`.

>How many counties are in `rep` and `dem`? Are they different? Why might that be the case?

>Likewise, how many states are in `rep_st` and `dem_st`? Are they also different and why?

>What about the number of features in a county-level dataframe, like `dem`, compared with a state-level dataframe, like `dem_st`?

>We've gotten you started with `rep` below.


In [8]:
# Run this cell!

print('Shape of rep:')
print(rep.shape)
print('Columns of rep:')
print(rep.columns)

Shape of rep:
(2092, 22)
Columns of rep:
Index(['st_abbrev', 'fips', 'population', 'income', 'hispanic', 'asian',
       'black', 'white', 'foreign', 'college', 'density', 'vets', 'female',
       'senior', 'children', 'st_cnty', 'state', 'winner', 'votes',
       'fraction_votes', 'total_votes', 'voter_turnout'],
      dtype='object')


In [9]:
### Enter your code for dem below: ###



######################################

In [10]:
### Enter your code for rep_st below: ###



#########################################

In [11]:
### Enter your code for dem_st below: ###



#########################################

The answer is hidden in the cell below:

In [12]:
# You should have entered something like:
# print(dem.shape)
# print(dem.columns)
# for dem, rep_st, and dem_st into separate cells.

# Notice that both `rep` and `dem` have one additional column, `st_cnty`
# compared to `rep_st` and `dem_st`. This is because they contain county
# level data and need an extra column for their county name.

# Additionally, each dataframe has a different number of rows and furthermore
# the state-level dataframes, rep_st and dem_st, do not have 50 observations!
# Why might this be? See the section below for an explanation.

## What is a U.S. Presidential Primary Election?
Primary elections allow parties to select their presidential candidates for the upcoming general election. The dates of these elections vary from state to state. Voters go to the polls to vote for their preferred candidate.

The more votes a candidate receives during the primaries, the more delegates they are awarded. Delegates are individuals who represent their state at the party's national convention, where they vote to choose the party's nominee for the general election. Thus, candidates aim to win a majority of delegates. If you want to learn more, you can read about it [here](https://en.wikipedia.org/wiki/United_States_presidential_primary).

As you go through this starter kit, you'll notice that some states and counties have no data. One potential reason is that they did not hold a primary election and instead held a caucus (a meeting where party members discuss and choose candidates, rather than a vote by the general population). Another potential reason is that some states held their caucus or primary later, after these datasets were collected.
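
One way to find which states are absent is a set difference between a reference list of states and the `state` column. The sketch below uses a made-up four-state reference list and a toy state-level table. With the real data you would use all 50 state names and a column such as `rep_st['state']`, assuming that column is present in the state-level dataframes.

```python
import pandas as pd

# Hypothetical reference list (a real one would hold all 50 states)
reference_states = ['Alabama', 'Colorado', 'Wisconsin', 'Wyoming']

# Toy stand-in for a state-level dataframe such as rep_st
toy_st = pd.DataFrame({'state': ['Alabama', 'Wisconsin']})

# States in the reference list that never appear in the data
missing_states = sorted(set(reference_states) - set(toy_st['state']))
print(missing_states)
```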

# End of Chapter 1

<center><img src='https://i.imgflip.com/nnvsc.jpg' width=300></center>

Great job so far! By now you should have started to get familiar with the 2016 primary election data. When you're ready, navigate to `2 Show Don't Tell.ipynb` to continue.
