In [1]:
import numpy as np
from datascience import *
from math import *

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Iteration: making a set of code run many times

In Python, there are 2 ways we can make a set of code run multiple times: `while` loops and `for` loops.

Let's start off with `while` loops since they aren't covered in Data 8, but then let's move on to `for` loops.

The general structure of a `while` loop is as follows:

`while <condition>:
    <code>`
 
We can read it as: "While this condition is True, keep running the indented code in a loop. Once the condition becomes False, exit the loop and move on to the code after it."
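
As a minimal sketch of this pattern, the loop below keeps doubling a number until it reaches at least 100. The names `value` and `steps` are made up for illustration and don't appear elsewhere in this notebook.

```python
# Hypothetical example: double a value until it reaches at least 100
value = 1
steps = 0

while value < 100:       # the condition is checked before every pass
    value = value * 2    # the body changes something the condition depends on
    steps = steps + 1    # count how many passes we made

print(value, steps)
```

Because `value` changes on every pass, the condition eventually becomes False and the loop ends on its own.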

In [None]:
# A toy example: printing half of the counter while the counter is below 5

counter = 0

while counter < 5:
    print(counter / 2)
    counter = counter + 1
    
# What will happen when I run this code? Why?

In [None]:
# Be careful: don't make an infinite loop

x = 0

while x < 10:
    print("School Spirit Counter: " + str(x))
    print("Go Bears!")

# If you get stuck, interrupt the kernel before it crashes
# How do we fix this so we accurately count the number of "Go Bears"?

In [None]:
# The Hailstone sequence from last week
# If even, divide by 2, if odd multiply by 3 and add 1
# stop at 1

hailstone = 3

while ...: 
    print(hailstone)
    if ...:
        ...
    else:
        ...

Although `while` loops are a good way to make a block of code run many times, they are easy to get wrong: if the condition never becomes False, the loop runs forever and can hang your kernel. They're most useful when we don't know in advance how many times we need to loop and just want to keep going until a condition stops being true.

In data science, we much more often use a `for` loop, which is better for our purposes because it lets us control the **number of loops** we perform.

The general structure is as follows:

`for <dummy variable> in <collection>:
    <code>
`

where `<dummy variable>` is any name, and `<collection>` is a collection like a list or array.
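
A small sketch of this structure, assuming a made-up list called `words`: the dummy variable `word` takes each value in the list in order, and the loop body runs once per value.

```python
# Hypothetical data: a short list of words to loop over
words = ["alpha", "beta", "gamma"]
total_letters = 0

for word in words:                             # `word` is the dummy variable
    total_letters = total_letters + len(word)  # runs once for each item in `words`

print(total_letters)
```

The loop runs exactly three times because `words` has three items, so the number of iterations is fixed by the collection rather than by a condition.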

In [None]:
## For loops work by "iterating" through a collection

colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

for ... in ...:
    ...

In [None]:
# in each iteration, the dummy variable "color" is equal to a specific value (in order) in colors
# that means we can do modifications to each value
new_colors = make_array()

for ... in ...:
    upper_case_color = ...
    new_colors = ...
    print(upper_case_color)
    
new_colors

In [None]:
# That's good if we want to do many of the same modifications, 
# but we don't /need/ to reference the values in the collection

die = np.arange(6) + 1
rolls = np.arange(...)
turn_results = make_array()

for ... in ...:
    one_roll = ...
    turn_results = ...
    
turn_results

In [None]:
# Another way of thinking about this:
num_rolls = ...

for ... in ...:
    one_roll = ...
    print("On roll #" + str(...) + " we rolled a " + str(...))

In [None]:
# From slides: Temp Check 1 
# What will be the output? What is the code doing?

stars = make_array(4, 3.5, 4.9, 4.7, 4.4, 4.5)
counter = 0
for rating in stars:
    if rating >= 4.5:
        counter = counter + 1
counter

In [None]:
# From slides: Temp Check 2
# Why doesn't this code work? Fix the loop so it correctly calculates these salaries post-tax

salaries = make_array(25, 50, 100, 25, 100)
post_tax = make_array()
for i in salaries:
    tax = i * 0.2
    np.append(post_tax, i - tax)
post_tax 

We'll get a lot more practice working with these loops when we start doing empirical simulations for statistics & hypothesis tests in 8.2X!


Utilizing this computational power is going to be a key part of data science. 

## An application of everything we've learned so far: Analyzing the 2020 Presidential Election in Pennsylvania

As we know, Joe Biden beat Donald Trump for the US Presidency in 2020. Pennsylvania is considered a battleground state that traditionally votes Democratic (part of the "Blue Wall"). However, in 2016, it flipped to a Republican candidate for the first time since 1988. For today's class, we won't look at 2016 data (ask me in office hours for that analysis!). Instead, we're going to focus on how the votes were distributed in 2020.

Let's practice doing some data cleaning and exploratory data analysis using the more advanced table methods: `group`, `join`, and `pivot`!

In [None]:
# Let's start off with data cleaning!
pa_2020 = Table.read_table("2020_presidential.CSV")
pa_2020.show(5)

In [None]:
# Let's only take the columns we want
cols_of_interest = ["County Name", "Party Name", "Candidate Name", "Votes", "Election Day Votes", "Mail Votes", "Provisional Votes"]

pa_2020 = ...
pa_2020

In [None]:
# How many people voted in Pennsylvania in 2020? Try calculating it.
total_votes = ...
total_votes

In [None]:
# What's the problem? Let's check the data types.
...

In [None]:
# It looks like we need to do some data cleaning
# How do we convert this into a number we can work with?
adamsctybiden = pa_2020.column("Votes").item(0)

...

In [None]:
# Now, how do we do this to all of the numbers in the dataset? We need a function.
def vote_to_int(votes):
    """..."""
    ...

vote_to_int(adamsctybiden)

In [None]:
# Now, we can apply this to the data in the table!
...

In [None]:
# Let's go fix the whole dataset
# Just run this cell, but figure out what's happening

cols_to_convert = ["Votes", "Election Day Votes", "Mail Votes", "Provisional Votes"]

for col in cols_to_convert:
    votes_as_ints = pa_2020.apply(vote_to_int, col)
    pa_2020 = pa_2020.with_column(col, votes_as_ints)
    
pa_2020

In [None]:
# Checking data types
...

In [None]:
# Looks good! Now, as someone interested in media, let's add in some info about TV media markets
pa_media_markets = Table.read_table("media_markets.csv")
pa_media_markets

In [None]:
# I want this information attached to pa_2020
# what happens if we try to join?

...

In [None]:
# What went wrong? How do we fix it?

mm_counties = ...
cleaned_counties = make_array()

for ...:
    ...

cleaned_counties

In [None]:
# Cleaning up the media markets table with the new names
pa_media_markets = ...
pa_media_markets

In [None]:
# Finally, combining all of our data with a join
pa_2020_mm = ...
pa_2020_mm

Now that we have a workable dataset, let's do some short EDA! Let's learn more about where voters voted and how they voted. We're going to use `group` and `pivot` for this.

Recall: `tbl.group("col", func)` groups the rows by the unique values in "col". If `func` is not specified, it counts the number of rows for each unique value. Otherwise, it applies `func` to the grouped values in every other column.

`tbl.pivot("col", "row", "vals", func)` cross-classifies a dataset: the unique values in "col" become the new column labels, and the unique values in "row" become the new rows. Each cell then holds the result of applying `func` to the values of "vals" for that row and column combination.

For example:
http://data8.org/interactive_table_functions/ 
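
To make the two descriptions concrete, here is a rough plain-Python sketch of what a count-style `group` and a sum-style `pivot` compute. It deliberately does not use the `datascience` library, and the rows, market names, and vote counts are invented for illustration only.

```python
from collections import Counter, defaultdict

# Invented rows: (media market, party, votes)
rows = [
    ("Market A", "Democratic", 10),
    ("Market A", "Republican", 20),
    ("Market B", "Democratic", 30),
    ("Market B", "Republican", 5),
    ("Market B", "Democratic", 7),
]

# Similar in spirit to group on the market column with no function:
# count how many rows share each unique market
group_counts = Counter(market for market, party, votes in rows)

# Similar in spirit to pivot with party as columns, market as rows,
# votes as values, and sum as the function
pivot_sums = defaultdict(lambda: defaultdict(int))
for market, party, votes in rows:
    pivot_sums[market][party] += votes

print(dict(group_counts))
for market in sorted(pivot_sums):
    print(market, dict(pivot_sums[market]))
```

The grouped result has one entry per unique market, while the pivoted result has one row per market and one cell per party, which is the "cross-classified" shape described above.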

In [None]:
# Now we can finally begin a quick analysis using interesting table methods
# First, we can use a group to quickly quantify the # of counties in each media market
pa_2020_mm...

In [None]:
# How many voters are in each media market? 
# Notice that we need to clean the resultant table a bit
pa_2020_mm...

In [None]:
# We can also use a double group to quantify by both media market and party
pa_2020_mm...

In [None]:
# In general, how did each party vote? (Election Day, Mail Votes, Provisional)
# Let's figure it out. 
party_by_votes = ...
party_by_votes

In [None]:
# Let's convert all of the columns into a proportion using array arithmetic; run this cell
# This makes it a bit easier to compare across parties by controlling for the # of votes

def vote_prop(col_str):
    return party_by_votes.column(col_str) / party_by_votes.column("Votes sum")

party_vote_props = party_by_votes.select("Party Name").with_columns("Election Day Votes", vote_prop("Election Day Votes sum"),
                                                                   "Mail Votes", vote_prop("Mail Votes sum"),
                                                                   "Provisional Votes", vote_prop("Provisional Votes sum"))

# Bar chart to see the breakdown! 
party_vote_props.barh(0)

In [None]:
# Reminder of the table setup
pa_2020_mm.show(5)

In [None]:
# Now let's try "cross classifying"; similar to a 2 column group, but let's focus on a specific value
# How were the votes broken down by Media Market and party? 
# or: what media markets did most of the raw votes come from for each party?
market_vs_party = ...
market_vs_party

In [None]:
# Let's look at the data with a bar chart
...

In [None]:
# Just run this cell; it's a lot of array arithmetic
# We're converting each party's votes to a proportion of the total votes in each media market

# Total votes per media market across the three parties
market_totals = (market_vs_party.column("Democratic")
                 + market_vs_party.column("Libertarian")
                 + market_vs_party.column("Republican"))

mvp_props = market_vs_party.select("Media Market").with_columns(
    "Democratic", market_vs_party.column("Democratic") / market_totals,
    "Libertarian", market_vs_party.column("Libertarian") / market_totals,
    "Republican", market_vs_party.column("Republican") / market_totals)

# Now that we've controlled for number of votes in media market, this shows us who "won" each media market
mvp_props.barh(0)


## Now, that was a lot of work!

As useful as this was, the `datascience` library has real limitations: even this level of cleaning and EDA takes a lot of code.

However, it'll get easier as we get more comfortable working with and manipulating tables! Once we have a good handle on working with tabular data, we'll go over how to use `pandas` in discussion. It's a lot more powerful, concise, and lets us do even more interesting analysis with data tables. More on that soon!