In [None]:
# Initialize Otter
import otter
grader = otter.Notebook()

## Sampling Demo

### Adapted from Homework 2 in Summer 2020

**Note for NWDSE:** This "lab" is a shortened version of what is typically a homework assignment. You may see that some of the test names don't match up with the question number in markdown (for instance, the test says `q6a` for Question 1); this is a consequence of the fact that the questions here are a subset of the original assignment.

In addition, this uses our new (Summer 2020 and onwards) autograding platform, `otter`.

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

# Default plot configurations
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,8)
plt.rcParams['figure.dpi'] = 150
sns.set()

from IPython.display import display, Latex, Markdown

The outcome of the US presidential election in 2016 took many people and many pollsters by surprise. In this assignment we will carry out a simulation study / post mortem in an attempt to understand what happened. Doing such an analysis is especially important given that the 2020 federal elections are right around the corner.




### How might the sampling frame differ from the population?

After the fact, many experts have studied the 2016 election results. For example, according to the American Association for Public Opinion Research (AAPOR), predictions made before the election were flawed for three key reasons:

1. voters changed their preferences a few days before the election
2. those sampled were not representative of the voting population, e.g., some said that there was an overrepresentation of college graduates in some poll samples 
3. voters kept their support for Trump to themselves (hidden from the pollsters)

In the next two problems on this homework, we will do two things:

+ We will carry out a study of the sampling error when there is no bias. In other words, we will try to compute the chance that we get the election result wrong even if we collect our sample in a manner that is completely correct. In this case, any failure of our prediction is due entirely to random chance.
+ We will carry out a study of the sampling error when there is bias of the second type from the list above. In other words, we will try to compute the chance that we get the election result wrong if we have a small systematic bias. In this case, any failure of our prediction is due to a combination of random chance and our bias.


## The Electoral College

The US president is chosen by the Electoral College, not by the
popular vote. Each state is allotted a certain number of
electoral college votes, as a function of its population.
Whoever wins in the state gets all of the electoral college votes for that state.

There are 538 electoral college votes (hence the name of Nate Silver's site, FiveThirtyEight).

Pollsters correctly predicted the election outcome in 46 of the 50 states. 
For these 46 states Trump received 231 and Clinton received 232 electoral college votes.

The remaining 4 states accounted for a total of 75 votes, and 
whichever candidate received the majority of the electoral college votes in these states would win the election. 

These states were Florida, Michigan, Pennsylvania, and Wisconsin.

|State |Electoral College Votes|
| --- | --- |
|Florida | 29 |
|Michigan | 16 |
|Pennsylvania | 20 |
|Wisconsin | 10|

For Donald Trump to win the election, he had to win either:
* Florida + one (or more) other states
* Michigan, Pennsylvania, and Wisconsin
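
The two winning paths above follow directly from the electoral vote counts. As a quick check, the sketch below enumerates every subset of the four states and keeps those that lift Trump from 231 to at least 270 votes, the smallest majority of 538. The variable names are illustrative and are not part of the assignment.

```python
from itertools import combinations

# Electoral votes of the four pivotal states (from the table above).
votes = {"florida": 29, "michigan": 16, "pennsylvania": 20, "wisconsin": 10}

TRUMP_BASE = 231   # votes Trump received in the 46 correctly predicted states
MAJORITY = 270     # smallest majority of the 538 electoral votes

# Every subset of the four states that gives Trump a majority.
winning_sets = [
    combo
    for r in range(1, len(votes) + 1)
    for combo in combinations(votes, r)
    if TRUMP_BASE + sum(votes[s] for s in combo) >= MAJORITY
]

# Each winning subset either contains Florida plus at least one other state,
# or consists of Michigan, Pennsylvania, and Wisconsin.
for combo in winning_sets:
    print(combo, TRUMP_BASE + sum(votes[s] for s in combo))
```

Florida alone leaves Trump at 260 votes, which is why at least one more state is needed on that path.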


The electoral margins were very narrow in these four states, as seen below:


|State | Trump |   Clinton | Total Voters |
| --- | --- |  --- |  --- |
|Florida | 49.02 | 47.82 | 9,419,886  | 
|Michigan | 47.50 | 47.27  |  4,799,284|
|Pennsylvania | 48.18 | 47.46 |  6,165,478|
|Wisconsin | 47.22 | 46.45  |  2,976,150|

Those narrow electoral margins can make it hard to predict the outcome given the sample sizes that the polls used. 
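
To make this concrete, here is a rough back-of-the-envelope sketch for Pennsylvania. It assumes a simple random sample of 1,500 voters (the sample size used in the simulations later in this notebook) and uses the standard variance formula for the difference of two multinomial proportions. The variable names are illustrative.

```python
import math

# Pennsylvania vote shares from the table above.
p_trump, p_clinton = 0.4818, 0.4746
n = 1500  # assumed poll size, matching the simulations later in this notebook

true_margin = p_trump - p_clinton

# Standard error of (sample Trump share - sample Clinton share)
# for a multinomial sample of size n.
se_margin = math.sqrt((p_trump + p_clinton - true_margin ** 2) / n)

print(f"true margin:    {true_margin:.4f}")
print(f"standard error: {se_margin:.4f}")
```

Under these assumptions the standard error (about 2.5 percentage points) is roughly three and a half times the true margin (0.72 percentage points), so a single poll can easily land on the wrong side of zero.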

---
## Simulation Study of the Sampling Error

Now that we know how people actually voted, we can carry
out a simulation study that imitates the polling.

Our ultimate goal in this problem is to understand the chance that we will incorrectly call the election for Hillary Clinton even if our sample was collected with absolutely no bias.

### Question 1

#### Part 1 

For your convenience, the results of the vote in the four pivotal states are repeated below:

|State | Trump |   Clinton | Total Voters |
| --- | --- |  --- |  --- |
|Florida | 49.02 | 47.82 | 9,419,886  | 
|Michigan | 47.50 | 47.27  |  4,799,284|
|Pennsylvania | 48.18 | 47.46 |  6,165,478|
|Wisconsin | 47.22 | 46.45  |  2,976,150|


Using the table above, write a function `draw_state_sample(N, state)` that returns a sample with replacement of N voters from the given state. Your result should be returned as a list, where the first element is the number of Trump votes, the second element is the number of Clinton votes, and the third is the number of Other votes. For example, `draw_state_sample(1500, "florida")` could return `[727, 692, 81]`. You may assume that the state name is given in all lower case.

You might find `np.random.multinomial` useful.
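
As a reminder of how `np.random.multinomial` behaves, the sketch below draws one sample of 10 items from three categories. The probabilities here are made up for illustration and are unrelated to the election data.

```python
import numpy as np

# Hypothetical category probabilities; they must sum to 1.
probs = [0.5, 0.3, 0.2]

# One draw of 10 items: returns an array of counts, one per category.
counts = np.random.multinomial(10, probs)

print(counts)           # counts vary from run to run
print(counts.sum())     # always 10, the number of items drawn
print(counts.tolist())  # convert the NumPy array to a plain Python list
```

Note that the result is a NumPy array rather than a Python list; `tolist()` converts it if a plain list is needed.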

<!--
BEGIN QUESTION
name: q6a
points: 2
-->

In [None]:
def draw_state_sample(N, state):
    # BEGIN SOLUTION 
    if state == "florida":
        return np.random.multinomial(N, [0.4902, 0.4782, 1 - (0.4902 + 0.4782)])
    
    if state == "michigan": 
        return np.random.multinomial(N, [0.475, 0.4727, 1 - (0.475 + 0.4727)])

    if state == "pennsylvania":
        return np.random.multinomial(N, [0.4818, 0.4746, 1 - (0.4818 + 0.4746)])
  
    if state == "wisconsin":
        return np.random.multinomial(N, [0.4722, 0.4645, 1 - (0.4722 + 0.4645)])

    raise ValueError("invalid state")
    # END SOLUTION

In [None]:
grader.check("q6a")

#### Part 2

Now, create a function `trump_advantage` that takes in a sample of votes (like the one returned by `draw_state_sample`) and returns the difference in the proportion of votes between Trump and Clinton. For example `trump_advantage([100, 60, 40])` would return `0.2`, since Trump had 50% of the votes in this sample and Clinton had 30%.

<!--
BEGIN QUESTION
name: q6b
points: 1
-->

In [None]:
def trump_advantage(voter_sample):
    # YOUR CODE HERE
    ...

In [None]:
grader.check("q6b")

#### Part 3

Below, we have simulated Trump's advantage across 100,000 simple random samples of 1500 voters for the state of Pennsylvania, and we stored the results of each simulation in a list called `simulations`. 

That is, `simulations[i]` is Trump's proportion advantage for the `i+1`th simple random sample.

In [None]:
simulations = [trump_advantage(draw_state_sample(1500, "pennsylvania")) for i in range(100000)]

Now, make a histogram of the sampling distribution of Trump's proportion advantage in Pennsylvania. Make sure to give your plot a title and axis labels; you can do this using `plt.title`, `plt.xlabel`, and `plt.ylabel`.

Hint: You should use the [`plt.hist`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html) function in your code.

In [None]:
# YOUR CODE HERE
...

#### Part 4


Below, we define the function `trump_wins(N)` that creates a sample of N voters for each of the four crucial states (Florida, Michigan, Pennsylvania, and Wisconsin) and returns 1 if Trump is predicted to win based on these samples and 0 if Trump is predicted to lose.

Recall that for Trump to win the election, he must either:
* Win the state of Florida and 1 or more other states
* Win Michigan, Pennsylvania, and Wisconsin

In [None]:
def trump_wins(N):
    wins_florida = trump_advantage(draw_state_sample(N, "florida")) > 0
    wins_michigan = trump_advantage(draw_state_sample(N, "michigan")) > 0
    wins_pennsylvania = trump_advantage(draw_state_sample(N, "pennsylvania")) > 0
    wins_wisconsin = trump_advantage(draw_state_sample(N, "wisconsin")) > 0
    if wins_michigan and wins_pennsylvania and wins_wisconsin:
        return 1
    if wins_florida and (wins_michigan or wins_pennsylvania or wins_wisconsin):
        return 1
    return 0

If we repeat 100,000 simulations of the election, i.e. we call `trump_wins(1500)` 100,000 times, what proportion of these simulations predict a Trump victory? Give your answer as `proportion_trump`.

This number represents the chance that a given sample will correctly predict Trump's victory *even if the sample was collected with absolutely no bias*.

**Note: Many laypeople, even well educated ones, assume that this number should be 1. After all, how could a non-biased sample be wrong? This is the type of incredibly important intuition we hope to develop in you throughout this class and your future data science coursework.**
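
As a reminder of how a probability can be estimated from repeated 0/1 simulations, the sketch below uses a made-up event that is unrelated to the election: whether a fair six-sided die shows a 5 or a 6. The function name `roll_is_high` is hypothetical and exists only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def roll_is_high():
    """Return 1 if a fair six-sided die shows 5 or 6, else 0."""
    return 1 if random.randint(1, 6) >= 5 else 0

num_trials = 100_000
outcomes = [roll_is_high() for _ in range(num_trials)]

# The mean of the 0/1 outcomes estimates the probability of the event.
estimate = sum(outcomes) / num_trials
print(estimate)  # close to the true probability of 1/3
```

The estimate is close to, but generally not exactly equal to, the true probability; that gap is itself a form of sampling error.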

<!--
BEGIN QUESTION
name: q6f
manual: false
points: 1
-->

In [None]:
proportion_trump = ... # YOUR CODE HERE
proportion_trump

In [None]:
grader.check("q6f")

We have just studied the sampling error, and found how 
our predictions might look if there was no bias in our 
sampling process. 
Essentially, we assumed that the people surveyed didn't change their minds, 
didn't hide who they voted for, and were representative
of those who voted on election day.

---
## Simulation Study of Selection Bias

According to an article by Grotenhuis, Subramanian, Nieuwenhuis, Pelzer and Eisinga (https://blogs.lse.ac.uk/usappblog/2018/02/01/better-poll-sampling-would-have-cast-more-doubt-on-the-potential-for-hillary-clinton-to-win-the-2016-election/#Author):

"In a perfect world, polls sample from the population of voters, who would state their political preference perfectly clearly and then vote accordingly."

That's the simulation study that we just performed.

In practice, it is difficult to control for every source of selection bias, and some other sources of bias cannot be controlled for at all.

Next we investigate the effect of small sampling bias on the polling results in these four battleground states.  

Throughout this problem, we'll examine the impacts of a 0.5 percent bias in favor of Clinton in each state. Such a bias has been suggested because highly educated voters tend to be more willing to participate in polls.

### Question 2

Throughout this problem, adjust the selection of voters so that there is a 0.5% bias in favor of Clinton in each of these states. 

For example, in Pennsylvania, Clinton received 47.46 percent of the votes and Trump 48.18 percent. Increase the proportion of Clinton voters to 47.46 + 0.5 percent and correspondingly decrease the proportion of Trump voters.
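
The adjustment for Pennsylvania can be worked out as follows. This is only a sketch of the arithmetic, assuming the 0.5 percentage points are moved directly from Trump's share to Clinton's while the share of other votes stays fixed. The variable names are illustrative.

```python
import math

# Pennsylvania vote shares from the table above.
p_trump, p_clinton = 0.4818, 0.4746
p_other = 1 - (p_trump + p_clinton)

BIAS = 0.005  # 0.5 percentage points, expressed as a proportion

# Shift the bias from Trump to Clinton; "Other" is left unchanged.
biased = [p_trump - BIAS, p_clinton + BIAS, p_other]

print([round(p, 4) for p in biased])

# The adjusted shares must still form a valid probability distribution.
assert math.isclose(sum(biased), 1.0)
```

Under this reading, the biased sampling frame shows Clinton narrowly ahead in Pennsylvania (47.96 percent to 47.68 percent), even though Trump won the actual vote there.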

#### Part 1

Simulate Trump's advantage across 100,000 simple random samples of 1500 voters for the state of Pennsylvania and store the results of each simulation in a list called `biased_simulations`.

That is, `biased_simulations[i]` should hold the result of the `i+1`th simulation.

Your answer to this problem should be just like the simulation in Q1.3, but now using samples that are biased as described above.

<!--
BEGIN QUESTION
name: q7a
points: 1
-->

In [None]:
def draw_biased_state_sample(N, state):
    # YOUR CODE HERE
    ...
    
biased_simulations = [trump_advantage(draw_biased_state_sample(1500, "pennsylvania")) for i in range(100000)]

In [None]:
grader.check("q7a")

<!-- BEGIN QUESTION -->

#### Part 2

Make a histogram of the new sampling distribution of Trump's proportion advantage now using these biased samples. That is, your histogram should be the same as in Q1.3, but now using the biased samples.

Make sure to give your plot a title and add labels where appropriate.


<!--
BEGIN QUESTION
name: q7b
manual: true
points: 1
-->

In [None]:
# BEGIN SOLUTION
plt.hist(biased_simulations) ;
plt.title('Biased Sampling of Pennsylvania') ;
plt.ylabel('# of Simulations');
plt.xlabel('Sampling Distribution Advantage');
# END SOLUTION

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 3

Compare the histogram you created in Q2.2 to that in Q1.3. 

<!--
BEGIN QUESTION
name: q7c
manual: true
points: 2
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Part 4

Now perform 100,000 simulations of all four states and return the proportion of these simulations that result in a Trump victory. This is the same fraction that you computed in Q1.4, but now using your biased samples.

Give your answer as `proportion_trump_biased`.

This number represents the chance that a sample biased 0.5% in Hillary Clinton's favor will correctly predict Trump's victory. You should observe that the chance is significantly lower than with an unbiased sample, i.e. your answer in Q1.4.

<!--
BEGIN QUESTION
name: q7d
manual: false
points: 1
-->

In [None]:
# This typically isn't given to students
def trump_wins_biased(N): 
    wins_florida = trump_advantage(draw_biased_state_sample(N, "florida")) > 0
    wins_michigan = trump_advantage(draw_biased_state_sample(N, "michigan")) > 0
    wins_pennsylvania = trump_advantage(draw_biased_state_sample(N, "pennsylvania")) > 0
    wins_wisconsin = trump_advantage(draw_biased_state_sample(N, "wisconsin")) > 0
    if wins_michigan and wins_pennsylvania and wins_wisconsin:
        return 1
    if wins_florida and (wins_michigan or wins_pennsylvania or wins_wisconsin):
        return 1
    return 0

proportion_trump_biased = ... # YOUR CODE HERE
proportion_trump_biased

In [None]:
grader.check("q7d")

## Further Study


### Question 3

Would increasing the sample size have helped?

#### Part 1

Try a sample size of 5,000 and run 100,000 simulations of a sample with replacement. What proportion of the 100,000 times is Trump predicted to win the election in the unbiased setting? In the biased setting?

Give your answers as `high_sample_size_unbiased_percent_trump` and `high_sample_size_biased_percent_trump`.

<!--
BEGIN QUESTION
name: q8a
manual: false
points: 1
-->

In [None]:
high_sample_size_unbiased_percent_trump = ... # YOUR CODE HERE
high_sample_size_biased_percent_trump = ... # YOUR CODE HERE
print(high_sample_size_unbiased_percent_trump, high_sample_size_biased_percent_trump)

In [None]:
grader.check("q8a")

#### Part 2

What do your observations from part 1 say about the impact of sample size
on the sampling error and on the bias?   

Extra question for those who are curious: Just for fun, you might find it interesting to see what happens with even larger sample sizes (> 5000 voters) for both the unbiased and biased cases. Can you get them up to 99% success with sufficiently large samples? How many? Why or why not? If you do this, include your observations in your answer.

In [None]:
# Feel free to use this cell for any scratch work (creating visualizations, examining data, etc.)

<!-- BEGIN QUESTION -->

Write your answer in the cell below.

<!--
BEGIN QUESTION
name: q8b
manual: true
points: 2
-->

_Type your answer here, replacing this text._

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zipfile for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)