# Installation & Setup

Install some Python packages

In [None]:
%pip install us
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install plotnine

Import those packages

In [None]:
# Inline Chart Parameters
%matplotlib inline

from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 9)

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# Python Imports
import pandas as pd
import numpy as np
import us
from datetime import datetime, timedelta
from plotnine import *

# Display all columns on tables
pd.set_option('display.max_columns', None)

In [None]:
# Download the raw polls data and save a local copy
pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/raw-polls.csv")\
  .to_csv('raw-polls.csv', index=False)

# Download the pollster ratings data and save a local copy
pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/pollster-ratings.csv")\
  .to_csv('pollster-ratings.csv', index=False)

# Thinking About An Upcoming Election

Can we trust the polls? And if so...how much?

Let's see how polls have been doing so far!

## Getting some data

Today we'll be working with data from [FiveThirtyEight's Pollster Ratings project](http://projects.fivethirtyeight.com/pollster-ratings/). 

This data contains:
* Every poll that FiveThirtyEight has collected from the final 21 days before a general election (presidential, Senate, House, governor) or a presidential primary.
* And... results for those elections!
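
To make the column names used below concrete, here is a minimal sketch on made-up numbers. It assumes that `bias` is the poll margin minus the actual margin and that `error` is its absolute value, which matches how the adjusted columns are computed later in this notebook. The toy rows and pollster names are hypothetical, not real polls.

```python
import pandas as pd

# Hypothetical polls of one race. Margins are in percentage points:
# negative = Democratic lead, positive = Republican lead
# (the sign convention this notebook adopts in the next cell).
toy = pd.DataFrame({
    'pollster': ['Pollster A', 'Pollster B', 'Pollster C'],
    'margin_poll': [-4.0, 1.0, -2.0],
    'margin_actual': [-2.0, -2.0, -2.0],
})

# Assumed definitions: signed miss and absolute miss
toy['bias'] = toy['margin_poll'] - toy['margin_actual']
toy['error'] = toy['bias'].abs()

print(toy)
```

Under this convention, a negative `bias` means the poll overestimated the Democrat, and `error` ignores direction entirely.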

In [None]:
# Load some data into variables
polls = pd.read_csv('raw-polls.csv')

# Reverses some values so that Democratic is on the left (-) and Republican is on the right (+)
polls['margin_poll'] = -polls['margin_poll']
polls['margin_actual'] = -polls['margin_actual']
polls['bias'] = -polls['bias']
polls['bias_overestimate'] = polls.bias.apply(lambda x: 'overestimates democrat' if x < 0 else 'overestimates republican')
polls['bias_overestimate'] = pd.Categorical(polls['bias_overestimate'], categories=['overestimates republican','overestimates democrat'])

# Create variables to distinguish national vs. state polls and to record the winning party
polls['national'] = polls.location.apply(lambda x: 'national' if x == 'US' else 'state')
polls['winner_party'] = polls.margin_actual.apply(lambda x: 'D' if x < 0 else 'R')
polls['winner_party'] = pd.Categorical(polls['winner_party'], categories=['R','D'])

polls.tail(2)

## Nationwide Presidential Polls

Let's look at polls of the nationwide popular vote. First, an exploratory visualization.

In [None]:
# Get all NATIONAL-level presidential polls
polls_to_analyze = polls.query("type_detail=='Pres-G' and location=='US'")

# Display 3 random polls
polls_to_analyze.sample(3)

In [None]:
(
    ggplot(polls_to_analyze, aes(x='margin_poll', y='year'))
     + geom_point(size=4, alpha=.2)
     + geom_point(aes(x='margin_actual', color="winner_party"), size=8)
     + geom_vline(aes(xintercept=0))
     + scale_y_continuous(breaks=list(range(2000,2024,4)), 
                          labels=list(reversed(['Biden (2020)', 'Trump (2016)', 'Obama (2012)', 'Obama (2008)', 'Bush (2004)', 'Bush (2000)'])))
     + theme_xkcd()
     + theme(figure_size=(16, 4)) 
     + labs(title='Presidential Polling (NATIONAL level)')
)

In [None]:
# Plotting "bias" rather than margin, since we care about how far off the poll was from the actual result
(
    ggplot(polls_to_analyze, aes(x='bias', y='year'))
     + geom_point(size=4, alpha=.2)
     + geom_point(aes(x=0, color="winner_party"), size=8)
     + geom_vline(aes(xintercept=0))
     + scale_y_continuous(breaks=list(range(2000,2024,4)), 
                          labels=list(reversed(['Biden (2020)', 'Trump (2016)', 'Obama (2012)', 'Obama (2008)', 'Bush (2004)', 'Bush (2000)'])))
     + theme_xkcd()
     + theme(figure_size=(16, 4)) 
     + labs(title='Presidential Polling "Bias" (NATIONAL level)\n <---- Overestimates Democrat                                                                                                       Overestimates Republican---->')

)

### What do you notice about this chart?

In [None]:
# Here is another view
(
    ggplot(polls_to_analyze, aes(x='bias', fill='bias_overestimate'))
     + geom_histogram()
     + geom_vline(aes(xintercept=0))
     + theme_minimal()
     + facet_wrap('~year')
     + theme(figure_size=(16, 8)) 
     + labs(title='Presidential Polling "Bias" (NATIONAL level)\n <---- Overestimates Democrat                                                                                                       Overestimates Republican---->')

)

### What have we noticed about the presidential polls at the national level?

<details><summary> ---> DON'T CLICK ME </summary>
<p>

* Even if the stated margin of error (MoE) is +/-3 points, for example, in practice there are other sources of error. Historically, polls have been off by more like +/-5 points on average.
* Polls sometimes miss in the same direction in any given year.

</p>
</details>
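
The margin of error a pollster reports covers sampling error only. As a rough sketch, the textbook formula for a simple random sample gives about +/-3 points for one candidate's share at 1,000 respondents. The function name `sampling_margin_of_error` is hypothetical, and the calculation ignores weighting and design effects that real pollsters have to deal with.

```python
import math

def sampling_margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points, for one candidate's share.

    Assumes a simple random sample of n respondents; p=0.5 is the worst case.
    """
    return 100 * z * math.sqrt(p * (1 - p) / n)

moe = sampling_margin_of_error(1000)
print(round(moe, 1))  # about 3.1 points
```

Anything beyond this figure (nonresponse, turnout modeling, late deciders) is not captured by the stated MoE.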

### How might this impact how you report on a new poll that comes out?



<details><summary> ---> DON'T CLICK ME </summary>
<p>

* Place individual polls in their aggregate context
* Convey uncertainty appropriately

</p>
</details>
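
One way to put a new poll in context is to report it next to the average and spread of other recent polls of the same race. The sketch below uses made-up margins (negative means a Democratic lead); the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical recent polls of the same race, newest last
recent = pd.Series([-5.0, -1.0, -3.0, -3.0, 2.0], name='margin_poll')

new_poll = recent.iloc[-1]   # the poll that just came out
average = recent.mean()      # simple, unweighted polling average
spread = recent.std()        # sample standard deviation of the margins

print(f"New poll: {new_poll:+.1f}")
print(f"Polling average: {average:+.1f} (spread: {spread:.1f} points)")
```

Here the new poll shows a 2-point Republican lead, while the average still shows a 2-point Democratic lead. Reporting both is more informative than the new poll alone.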

## But, we don't have one Presidential election in the U.S. ... 
...we have 50 separate ones (plus DC and some quirks in Maine and Nebraska). And the nationwide polls can only tell us so much about who might win the election. So what about state polling? Has it been getting better or worse over the years? Can we still rely on it this coming election cycle? 


In [None]:
# A quick look at the polls dataframe
presidential_state_level_polls = polls.query("type_detail=='Pres-G' and location!='US'")
presidential_state_level_polls.head(2)

In [None]:
(
    ggplot(presidential_state_level_polls, aes(x='bias', y='year', color='bias<0'))
     + geom_point(size=4, alpha=.2)
     + geom_vline(aes(xintercept=0))
     + theme_minimal()
     + scale_y_continuous(breaks=list(range(2000,2024,4)))
     + facet_wrap('~location')
     + theme(figure_size=(16, 16)) 
         + labs(title='Presidential Polling "Bias" (STATE level)\n <---- Overestimates Democrat                                                                                                       Overestimates Republican---->')

)

### What have we noticed about the presidential polls at the state level compared to the national level?

<details><summary> ---> DON'T CLICK ME </summary>
<p>

* State level polls seem less accurate than national polls
* Some states have a lot more polling than others
* Polling in Hawaii is historically super inaccurate
* State level polling errors are correlated in any given year

</p>
</details>
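
A quick way to check whether state-level misses move together is to average the bias per state and year, then correlate the states with each other. This is a minimal sketch on made-up numbers: the state labels and bias values are hypothetical, and only the column names (`year`, `location`, `bias`) come from the polls dataframe.

```python
import pandas as pd

# Hypothetical average bias for three states over three election years
toy = pd.DataFrame({
    'year':     [2012, 2012, 2012, 2016, 2016, 2016, 2020, 2020, 2020],
    'location': ['State A', 'State B', 'State C'] * 3,
    'bias':     [2.0, 3.0, 1.0, -5.0, -4.0, -3.0, -4.0, -2.0, -3.0],
})

# One row per year, one column per state
by_state = toy.pivot_table(index='year', columns='location', values='bias', aggfunc='mean')

# Pairwise correlation of bias between states, across years
state_corr = by_state.corr()
print(state_corr.round(2))
```

Positive correlations mean that when polls miss in one state, they tend to miss in the same direction in the others.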


In [None]:
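# numeric_only=True skips the text columns (location, pollster), which cannot be averaged
presidential_state_level_polls[['year', 'location', 'pollster', 'bias', 'error']]\
    .groupby('year').mean(numeric_only=True).round(2)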

In [None]:
# All general-election presidential polls (national and state) over the years
polls_to_analyze = polls.query("type_detail=='Pres-G'")

(
    ggplot(polls_to_analyze, aes(x='bias', fill='bias<0'))
     + geom_histogram()
     + geom_vline(aes(xintercept=0))
     + theme_minimal()
     + facet_grid('national~year', scales='free_y')
     + theme(figure_size=(16, 4)) 
     + labs(title='Presidential Polling "Bias" (national vs state level)')

)

## What about polling in primary elections?

In [None]:
# Primary and general elections
polls_to_analyze = polls.query("type_detail.isin(['Pres-G', 'Pres-R', 'Pres-D']) and error.notna()")
polls_to_analyze = polls_to_analyze.query("national=='state'")

# polls_to_analyze = polls_to_analyze.query("location.isin(@swing_states_2020)")
display(
    ggplot(polls_to_analyze, aes(x='error'))
     + geom_histogram()
     + theme_minimal()
     + facet_grid('type_detail~year', scales='free_y')
     + theme(figure_size=(16, 4)) 
     + labs(title='Presidential Polling Error (STATE level)')

)

display(
    polls_to_analyze.pivot_table(index='type_detail', values='error', columns='year',aggfunc='mean').T.round(1).fillna('')
)

## And how about Senate, House, Governor, etc...?

In [None]:
(
    polls
    .pivot_table(index=['type_simple'], values='error', columns='national',aggfunc='mean')
    .round(1)
    .fillna('')
)


# Are polls becoming less accurate over time? 

In [None]:
(
    polls.query('year%2==0')
    .pivot_table(index=['type_simple', 'national'], values='error', columns='year',aggfunc='mean')
    .round(1)
    .fillna('')
    .T
)


# Part 2: Statistical Treatment

Statistical treatment can help you get more out of a dataset! Sometimes the polls miss, but they miss with a consistent bias in one direction. If we can detect these patterns, we can correct for them and get more out of the data. Here is one example of how. Let's take a look at what we know about each pollster:

- Pollster Ratings: https://projects.fivethirtyeight.com/pollster-ratings/
- Methodology: https://fivethirtyeight.com/features/how-fivethirtyeight-calculates-pollster-ratings/
- Latest Update: https://fivethirtyeight.com/features/the-state-of-the-polls-2019/
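
The idea of detecting a consistent bias and correcting for it can be sketched as follows. This is a simplified illustration on made-up numbers, not FiveThirtyEight's method: it takes each pollster's plain average historical bias (called `house_effect` here, a hypothetical name) and subtracts it from that pollster's new poll.

```python
import pandas as pd

# Hypothetical historical misses (negative = overestimated the Democrat)
history = pd.DataFrame({
    'pollster': ['Pollster A'] * 3 + ['Pollster B'] * 3,
    'bias': [2.0, 3.0, 1.0, -1.0, 0.0, -2.0],
})

# Each pollster's average historical bias
house_effect = history.groupby('pollster')['bias'].mean()

# Hypothetical new polls from the same pollsters
new_polls = pd.DataFrame({
    'pollster': ['Pollster A', 'Pollster B'],
    'margin_poll': [1.0, -4.0],
})

# Subtract the pollster's usual lean from its new result
new_polls['margin_poll_adjusted'] = (
    new_polls['margin_poll'] - new_polls['pollster'].map(house_effect)
)
print(new_polls)
```

Pollster A has leaned Republican by 2 points on average, so its R+1 poll is adjusted to D+1.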

In [None]:
pollster_ratings = pd.read_csv('pollster-ratings.csv').set_index('Pollster Rating ID')
pollster_ratings.head()

> **Mean-Reverted Bias** - A pollster's historical average statistical bias toward Democratic or Republican candidates, reverted to a mean of zero based on the number of polls in the database. A score of "R +1.5", for example, indicates that the pollster has historically overrated the performance of the Republican candidate.
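
Two pieces of this definition can be sketched in code. The first helper turns a label such as "R +1.5" into a signed number, with Republican lean positive to match the charts in this notebook. The second shows one generic way to revert an average toward zero when there are few polls. Both function names and the constant `k` are assumptions made for illustration; this is not FiveThirtyEight's actual formula.

```python
def parse_bias_label(label):
    """Turn a label like 'R +1.5' or 'D +0.3' into a signed number.

    Republican lean is positive and Democratic lean is negative.
    """
    party, value = label.split('+')
    sign = 1.0 if party.strip() == 'R' else -1.0
    return sign * float(value)

def mean_revert(raw_bias, n_polls, k=20):
    """Shrink a raw average bias toward zero; fewer polls means more shrinkage.

    ASSUMPTION: k is a made-up tuning constant for this sketch.
    """
    return raw_bias * n_polls / (n_polls + k)

print(parse_bias_label('R +1.5'))
print(parse_bias_label('D +0.3'))
print(mean_revert(2.0, n_polls=5))    # few polls: pulled strongly toward zero
print(mean_revert(2.0, n_polls=180))  # many polls: stays close to the raw value
```

The design choice is that a pollster with only a handful of polls has not yet shown a reliable lean, so its estimate is pulled most of the way back to zero.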

In [None]:
# Append grade
polls['grade'] = polls.pollster_rating_id.apply(lambda x: pollster_ratings.loc[x]['538 Grade'] if x in pollster_ratings.index else None)

# Append MRB (Mean-Reverted Bias) and change it to a numeric value
polls['mrb'] = polls.pollster_rating_id.apply(lambda x: pollster_ratings.loc[x]['Mean-Reverted Bias'] if x in pollster_ratings.index else None)
# reverse polarity to match above charts where D is left (-) and R is right (+)
# polls['mrb'] = -pd.to_numeric(polls['mrb'].str.replace('D +', '', regex=False).str.replace('R +', '-', regex=False))

# Adjust poll, bias, and error by MRB
polls['margin_poll_adjusted'] = polls['margin_poll'] - polls['mrb']
polls['bias_adjusted'] = polls.margin_poll_adjusted - polls.margin_actual
polls['error_adjusted'] = np.abs(polls.margin_poll_adjusted - polls.margin_actual)

In [None]:
# Select all state and national polls
pres_polls_national = polls.query("type_detail=='Pres-G' and location=='US'")
pres_polls_by_state = polls.query("type_detail=='Pres-G' and location!='US'")

Let's look at some polls now! What do they look like with this insight applied? 

In [None]:
pres_polls_national.query('year==2016').sample(3)

And in the aggregate for 2016? What about for other races? Other years?

In [None]:
pres_polls_national.query('year==2016')\
    [['year', 'location', 'pollster', 'error', 'error_adjusted', 'bias', 'bias_adjusted']]\
    .groupby('year').mean(numeric_only=True).round(1)

What about at the state level?

In [None]:
pres_polls_by_state.query('year==2016')\
    [['year', 'location', 'pollster', 'error', 'error_adjusted', 'bias', 'bias_adjusted']]\
    .groupby('year').mean(numeric_only=True).round(1)

Let's see what this means for national polls by pollster

In [None]:
pres_polls_national.query("year==2016")\
    [['year', 'pollster', 'error', 'error_adjusted']]\
    .groupby('pollster').agg(['mean', 'count'])\
    .sort_values(by=('error_adjusted', 'count'), ascending=False)\
    .drop([('year', 'count'), ('error', 'count')], axis=1)

Let's look at a few particular cases: Hawaii or Washington, D.C., for example.

In [None]:
pres_polls_by_state.query("year==2016 and location=='HI'")\
    [['pollster', 'grade', 'margin_actual', 'margin_poll', 'margin_poll_adjusted', 'mrb', 'bias', 'bias_adjusted']]

Uh oh, what happened here? 

Beware of "unskewing" polls. Statistical treatments don't necessarily improve each individual poll result. But on average, when incorporated into your analysis, they will help you get more information from the aggregate of your data.

Still, in the aggregate, you're better off looking at an adjusted average of the polls. Also, this is only one statistical treatment! Remember how Hawaii polls tend to underestimate how well Democrats will do? We have adjusted for the pollster's average bias, but have not accounted for factors like that. And there are so many other things to consider!
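
The difference between individual and aggregate improvement can be shown with a small made-up example. It reuses the column names and formulas from the adjustment cell above (`mrb`, `margin_poll_adjusted`, `error_adjusted`); the numbers themselves are hypothetical.

```python
import numpy as np
import pandas as pd

# Four hypothetical polls of one race by pollsters with the same D-leaning MRB
toy = pd.DataFrame({
    'margin_poll':   [-6.0, -5.0, -1.0, -4.0],
    'margin_actual': [-3.0, -3.0, -3.0, -3.0],
    'mrb':           [-2.0, -2.0, -2.0, -2.0],
})

toy['error'] = np.abs(toy.margin_poll - toy.margin_actual)
toy['margin_poll_adjusted'] = toy.margin_poll - toy.mrb
toy['error_adjusted'] = np.abs(toy.margin_poll_adjusted - toy.margin_actual)

# Flag polls where the adjustment made things worse
toy['got_worse'] = toy.error_adjusted > toy.error

print(toy)
print("Mean error:", toy.error.mean(), "-> adjusted:", toy.error_adjusted.mean())
```

The third poll gets worse after adjustment, yet the average error across all four polls drops.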

# Election Forecast Models

Let's talk about our election forecasts, which apply a lot of statistical treatments based on what we know about the nature of politics and political data in the U.S.


## Forecasts

### 2020 Forecast

- Forecast: https://projects.fivethirtyeight.com/2020-election-forecast/
- Methodology: https://fivethirtyeight.com/features/how-fivethirtyeights-2020-presidential-forecast-works-and-whats-different-because-of-covid-19/


### 2016 Forecast

- Forecast: https://projects.fivethirtyeight.com/2016-election-forecast/
- Methodology: https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
- Analysis: https://projects.fivethirtyeight.com/2016-election-forecast/articles/?ex_cid=2016-forecast

### 2018 Forecast

- https://projects.fivethirtyeight.com/2018-midterm-election-forecast/senate
- https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house


## Poll Stories from 2016

These help elucidate how we turn analysis, like what you just did above, into insights for our readers.

- https://fivethirtyeight.com/features/how-much-the-polls-missed-by-in-every-state/
- https://fivethirtyeight.com/features/pollsters-probably-didnt-talk-to-enough-white-voters-without-college-degrees/
- https://fivethirtyeight.com/features/what-a-difference-2-percentage-points-makes/
- https://fivethirtyeight.com/features/shy-voters-probably-arent-why-the-polls-missed-trump/
- https://fivethirtyeight.com/features/the-polls-missed-trump-we-asked-pollsters-why/
- https://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/
- https://fivethirtyeight.com/features/the-polls-are-all-right/
- https://fivethirtyeight.com/features/trump-is-just-a-normal-polling-error-behind-clinton/



## Some other folks

- [CNN](https://www.cnn.com/election/2018/forecast)
- [Daily Kos](https://elections.dailykos.com/)
- [New York Times - Real Time Polling!](https://www.nytimes.com/interactive/2018/upshot/elections-polls.html)


# Visualizing Uncertainty

- FiveThirtyEight in [2010](https://www.nytimes.com/elections/2010/forecasts/senate.html), [2014](https://fivethirtyeight.com/interactives/senate-forecast/), [2016](https://projects.fivethirtyeight.com/2016-election-forecast/), [2018](https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house/)
    * I think 2010 still works in Safari...
- New York Times
    * The Spinners https://www.nytimes.com/2014/11/01/upshot/how-confirmation-bias-can-lead-to-a-spinning-of-wheels.html
    * The Needle https://www.youtube.com/watch?v=iq5rW6zYeP4
- [Huffpost's](http://elections.huffingtonpost.com/pollster) custom charts.

