# Homework 3: Distributions and Tables

Please complete this notebook by filling in the cells provided. Before you begin, run the setup cell below.

In [None]:
# Don't change this cell; just run it.
from datascience import *
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
Table.interactive_plots()

import otter
grader = otter.Notebook()

**Reading:** 
- Textbook chapter [5](https://data-8r.gitbooks.io/textbook/chapters/05/visualizing-distributions.html)

Deadline:

This assignment is due **Wednesday, July 29 at 11:59 PM** on Gradescope. You will receive an early submission bonus point if you turn in your final submission by **Tuesday, July 28 at 11:59 PM**. Late work will not be accepted unless you have made special arrangements with your (u)GSI or the instructor.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck. Drop-in office hours will be held at various times in the week; check the bCourses page for the latest schedule.

## 1. Political Campaigning

The modern political campaign is built on data, such as voter histories and geographic and demographic information. Campaign managers tend to base their strategies on past elections; save for a few exceptions, an area's past voting history is generally a strong predictor for how that area will vote in a future election. Using a data-based strategy is especially important when a campaign decides how to allocate its resources: staffing, advertising dollars, get-out-the-vote efforts, and more.

The Commonwealth of Pennsylvania was considered a "Blue Wall" state from 1992 to 2012, meaning that the state consistently voted for Democrats for President. This changed in 2016, when the Republican candidate, Donald Trump, won the state by a narrow margin. With this change, Pennsylvania is now considered a significant swing state for the 2020 election. If you ran a campaign in this state, where would you focus your efforts based on the data?

*Note:* To analyze the state of Pennsylvania, we will use Media Markets to determine geographic locations instead of counties or other regions. Media Markets are how broadcast television and media are distributed within a state, which makes it easier for us to target larger audiences. If you are interested, there are ways to target smaller groups (called "microtargeting") via social media, which involve much more complex data science techniques that you may learn in the future.

In [None]:
# The following table contains voting data by media market for the 2012 and 2016
# Presidential elections in the Commonwealth of Pennsylvania, using data from the state.
# You only need to run the next three cells; do not change any code.
pa_votes = Table.read_table("pa_voting.csv")
pa_votes.show(5)

In [None]:
### This cell will print the graph for the 2012 Presidential campaign with Barack Obama (D) versus Mitt Romney (R).
pa_votes.where("Election", "2012 Presidential").select(0, 2, 3).barh("Media Market")

In [None]:
### This cell will print the graph for the 2016 Presidential campaign with Hillary Clinton (D) versus Donald Trump (R).
pa_votes.where("Election", "2016 Presidential").select(0, 2, 3).barh("Media Market")

**Question 1:** Which Media Market did the Democratic Party win with the largest proportion in both elections? Which Media Market did the Republican Party win with the largest proportion in both elections?

**Type your answer here.**

**Question 2:** What Media Market has had the tightest race (i.e. smallest difference in proportions between the Republican and Democratic candidates) for both 2012 and 2016?

**Type your answer here.**

**Question 3:** One Media Market "flipped" (voted for different parties) in the 2012 and 2016 elections. What Media Market did so? Describe a possible reason why voters in this market changed their minds. 

*Hint:* If you're struggling to find a possible reason, read [this article](https://www.nytimes.com/2016/11/13/us/politics/pennsylvania-trump-votes.html) from The New York Times.

**Type your answer here.**

**Question 4:** Imagine you are now a campaign media manager for a political party (Democrat or Republican) in the State of Pennsylvania. Given the data in these graphs, what Media Market would you allocate the most resources to? Why?

**Type your answer here.**

## 2. Presidential Tables I: California Votes

*Note:* If you ever need to reference the documentation for a method or function we use in this course, check out the Data 8 Python Reference Sheet [here](http://data8.org/su20/python-reference.html). We will not use all of the tools listed on the page this term, but we will use many of them.

**Question 1.** In 2016, there were 5 main Presidential candidates:  Donald Trump, Hillary Clinton, Jill Stein, Gloria La Riva, and Gary Johnson. In California, Trump received 4483810 votes, Clinton received 8753788 votes, Stein received 278657 votes, La Riva received 66101 votes, and Johnson received 478500 votes.

Using `datascience`, create a table named `ca_presidential` that contains all of this information. It should have two columns, "candidates" and "votes", and look like this table (but with all of the candidates from above, in the same order).

| candidates      | votes   |
| --------------- | ------- |
| donald trump    | 4483810 |
| hillary clinton | 8753788 |

**Note:** Use lower-case for the name for each candidate, like "hillary clinton."

In [None]:
ca_presidential = 
ca_presidential

In [None]:
grader.check('q2_1')

**Question 2.** Using table methods and the `ca_presidential` table, create a new table named `ca_candidates`. This new table should contain only the 3 candidates who received the most votes in California (Trump, Clinton, and Johnson, in that order).

*Hint:* You may need to use a method that references these rows by their index or filters a table by a certain condition.

In [None]:
ca_candidates = 
ca_candidates

In [None]:
grader.check('q2_2')

**Question 3.** Let's do a bit more reformatting to make the table more readable. Using `ca_candidates`, create a table called `ca_candidates_final` that renames the "candidates" column to "Electoral Candidates" and the "votes" column to "Total Votes".

In [None]:
ca_candidates_final = 
ca_candidates_final

In [None]:
grader.check('q2_3')

## 3. Presidential Tables II: 1976-2016

**Question 1.** The file `1976-2016-president.csv` contains information about the presidential candidates from 1976 to 2016.  Load it as a table named `president`. 

Confused about some of the column names or values within the table? Read the data dictionary/codebook [here](https://drive.google.com/file/d/1zXKyX_LT4O3w6SpsiPMVPcruahXw8il7/view?usp=sharing).

In [None]:
president = 
president.show(10)

**Question 2.** Does each row contain a different candidate? Set `all_different` to the boolean value True if each row contains a different candidate or to False if multiple rows contain the same candidate.

In [None]:
all_different = 

In [None]:
grader.check('q3_2')

**Question 3.** There is a column called "state_cen" that contains some cryptic numbers. What do these values represent, according to the [codebook](https://drive.google.com/file/d/1zXKyX_LT4O3w6SpsiPMVPcruahXw8il7/view?usp=sharing)?

*Note:* Whenever you begin analysis on a dataset, it's always good practice to read the documentation associated with the dataset to confirm what each column and value represents.

**Type your answer here.**

**Question 4.** Using the `president` table you created in Question 1, make a new table called `hrc_votes` that only contains rows for the candidate "Clinton, Hillary".

*Hint:* What `tbl.where` predicate (are.___ ) should you use?

In [None]:
hrc_votes = 
hrc_votes

In [None]:
grader.check('q3_4')

**Question 5.** How many votes did Hillary Clinton receive, from all states, in total? Set `votes_for_clinton` to an integer equal to the total number of votes Hillary Clinton received.

*Hint:* Your answer should involve an array calculation.

In [None]:
votes_for_clinton = 
votes_for_clinton

In [None]:
grader.check('q3_5')

**Question 6.** What is the proportion of votes Hillary Clinton received as a candidate over the total number of votes in each specific state and territory? Set `prop_votes` to an array of proportions.
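
Proportions like these are usually computed with element-wise array arithmetic. The sketch below illustrates the idea on made-up numbers; `part` and `whole` are hypothetical arrays, not columns from the `president` table.

```python
import numpy as np

# Hypothetical counts for three regions (not real election data).
part = np.array([30, 45, 10])
whole = np.array([100, 90, 40])

# Dividing two arrays of equal length divides them element by element,
# producing one proportion per position.
proportions = part / whole
print(proportions)
```

Each entry of the result pairs the value at one position in `part` with the value at the same position in `whole`, so both arrays need to describe the same rows in the same order.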

In [None]:
prop_votes = 
prop_votes

Notice that the proportions have many digits to the right of the decimal point. They are currently displayed in scientific notation (`e-01` means ×10^-1, so move the decimal point one place to the left). To make the values more readable, we will round them to two decimal places. The function `np.round` goes through each element of the `prop_votes` array and rounds it to the number of decimal places specified, which in this case is 2 (the hundredths place).
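
To see scientific notation and rounding on a small scale, here is a short sketch using made-up values (the `values` array is hypothetical and not taken from the homework data).

```python
import numpy as np

# Hypothetical proportions, not taken from the homework data.
values = np.array([3.4126e-01, 6.0589e-01, 4.991e-02])

# 3.4126e-01 is just another way of writing 0.34126.
print(values[0] == 0.34126)

# np.round(array, 2) rounds every element to two decimal places.
rounded = np.round(values, 2)
print(rounded)
```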

Just run the cell below.

In [None]:
## Just run this cell. 
prop_votes_rounded = np.round(prop_votes, 2)
prop_votes_rounded

In [None]:
grader.check('q3_6')

**Question 7.** Now, let's learn more about the dataset as a whole. You may have noticed that each state varies in the number of candidates the residents have voted for. 

How many total candidates did each state vote for in the elections from 1976 to 2016? Using the `president` table, make a new table called `cand_per_state` with one row per state and 2 columns: "state" and "number of candidates", which contain the state name and the number of candidates the state voted for from 1976 to 2016, respectively.


In [None]:
cand_per_state = 
cand_per_state

In [None]:
grader.check('q3_7')

**Question 8.** Using `cand_per_state`, make a bar chart comparing each state and the number of presidential candidates who received votes from that state. Sort the bars from shortest to longest.

Since we are graphing 50 states + DC, the resulting chart will be large and may take a second to load.

In [None]:
...

## 4. Presidential Tables III: Political Parties

In this section, we will continue to use the data from the 1976-2016 presidential elections (the `president` table), this time to learn more about the political parties.

**Question 1.** Make a table of all unique political parties that presidential candidates associate with. Call it `parties`. It should have one row per party and 2 columns: "party" (the name of the political party) and "# of times" (the number of times that party is associated with a unique candidate).

In [None]:
parties = 
parties

In [None]:
grader.check('q4_1')

**Question 2:** It wouldn't be a good idea to make a bar chart of that data (don't try it!). Why would this not make a good visualization, given the data we have now?

**Type your answer here.**

**Question 3:** Let's improve the table to make it easier to visualize. Make a bar chart of only the 10 parties that were most often associated with a unique candidate from 1976 to 2016.

*Hint:* This is a multi-step problem. What steps do we need to take, and in what order, to create this visualization?

*Note:* The "nan" value stands for "Not a Number." It is generally a placeholder for entries that do not have a value in the dataset, also known as a null value.

In [None]:
...

**Question 4:** The following line plot is generated using the `president` table. We've left it intentionally vague. You don't need to understand all of the code in the cell below, but you should be able to understand what is being shown based on the information provided. If you're having trouble, try to break the table down into its separate methods and view the table before it is plotted.

Write an appropriate title and labels for the axes, based on the information being shown.

In [None]:
## Just run this cell.
president.pivot("party", "year", "candidatevotes", sum).select("republican", "democrat")\
.with_column("Year", np.arange(1976, 2020, 4)).plot("Year")

**Title:** ...

**X-Axis:** ...

**Y-Axis:** ...

## 5. Ages of Congress

For the following section, we will analyze the ages of all of the members of Congress from 1977 to 2014. The dataset comes from [FiveThirtyEight](https://datahub.io/five-thirty-eight/congress-age).

One current hypothesis is that age is becoming increasingly related to party affiliation: younger voters, especially millennials, tend to lean much more toward the Democratic Party than the Republican Party. However, is this trend reflected in Congress? We'll use histograms to find out.

In [None]:
## Just run this cell.
congress_ages = Table.read_table("congress-age.csv")
congress_ages

In [None]:
## Just run this cell.
print("The average age of a Democratic Member of Congress is", np.mean(congress_ages.where("party", "d").column("age")), "years.")
congress_ages.where("party", "d").hist("age", bins = np.arange(25, 100, 5))

In [None]:
## Just run this cell. 
print("The average age of a Republican Member of Congress is", np.mean(congress_ages.where("party", "r").column("age")), "years.")
congress_ages.where("party", "r").hist("age", bins = np.arange(25, 100, 5))

**Question 1.** Do these charts use frequency (count) scale or density scales? How do you know?

**Type your answer here.**

For the next set of questions, note that each bin is 5 years wide, with bin edges running from 25 to 95, as set by `np.arange(25, 100, 5)` in the cells above. For example, the first bin is [25, 30), the second is [30, 35), and so forth.

There are also 5568 Democratic members of Congress and 4763 Republican members of Congress represented in this dataset. 

You should estimate the height of each bar to one decimal place (e.g. 1.0, 1.5, 2.0).
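
For reference, when a histogram is drawn on a density scale, the area of a bar (height times width) gives the percent of entries in that bin, and multiplying that percent by the group size gives an approximate count. The sketch below walks through that arithmetic with made-up numbers; `bar_height`, `bin_width`, and `group_size` are hypothetical values, not readings from the charts above.

```python
# Hypothetical values for illustration only.
bar_height = 2.0    # percent per unit (here, percent per year of age)
bin_width = 5       # width of the bin, in years
group_size = 1000   # number of entries represented in the histogram

# Area of the bar = percent of entries that fall in the bin.
percent_in_bin = bar_height * bin_width

# Convert the percent to an approximate count of entries.
count_in_bin = percent_in_bin / 100 * group_size

print(percent_in_bin)
print(count_in_bin)
```

This reasoning only works for ranges that line up with whole bins; a histogram does not show how entries are spread within a single bin.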

**Question 2.** What percent of Democratic members of Congress in this dataset are aged between 45 and 50 years old? Show your work, and if there is not enough information to answer this question, explain why below. 

**Type your answer here.**

**Question 3.** How many Republican members of Congress are aged 60 to 70 years old? Show your work, and if there is not enough information to answer this question, explain why below. 

**Type your answer here.**

**Question 4.** What percent of Republican members of Congress are aged 50 to 54 years old? Show your work, and if there is not enough information to answer this question, explain why below. 

**Type your answer here.**

## 6. Submission

To submit your homework, please download your notebook as a .ipynb file and submit it to Gradescope. You can do so by navigating to the toolbar at the top of this page and clicking File > Download as... > Notebook (.ipynb). Then, go to our class's Gradescope page [here](https://www.gradescope.com/courses/136698) and upload your file under "Homework 3".

To check your work, you may run the cell below. Remember that for homework assignments, passing the tests does not necessarily mean your answer is correct.

In [None]:
grader.check_all()