# Election 2016: An Exploratory Data Analysis

#### Table of Contents
1. Environment Setup
2. Loading Data
3. Scatter Plots
4. Boxplots
5. Line Graph with Error Bars
6. Bubble Charts
7. Choropleth Maps

## Environment Setup
Information regarding environment setup can be found under Prerequisites on the [README](../master/README.md).

## Loading Data
We start off by loading the packages that we want to use.

In [1]:
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
from plotly import tools
import plotly.plotly as py
import plotly.graph_objs as go
pd.set_option('display.max_columns', 100) #overrides default to display up to 100 columns in dataframes

Let's bring in the dataset. We load it into a [Pandas](http://pandas.pydata.org/) dataframe (the preferred tool when working with data in Python) and run some basic commands.

In [3]:
df = pd.read_csv('http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv')
df.head() #display the first five rows of dataframe

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,samplesize,population,poll_wt,rawpoll_clinton,rawpoll_trump,rawpoll_johnson,rawpoll_mcmullin,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
0,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,8.720654,47.0,43.0,4.0,,45.20163,41.7243,4.626221,,,https://www.washingtonpost.com/news/the-fix/wp...,48630,76192,11/7/16,09:35:33 8 Nov 2016
1,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/1/2016,11/7/2016,Google Consumer Surveys,B,26574.0,lv,7.628472,38.03,35.69,5.46,,43.34557,41.21439,5.175792,,,https://datastudio.google.com/u/0/#/org//repor...,48847,76443,11/7/16,09:35:33 8 Nov 2016
2,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/2/2016,11/6/2016,Ipsos,A-,2195.0,lv,6.424334,42.0,39.0,6.0,,42.02638,38.8162,6.844734,,,http://projects.fivethirtyeight.com/polls/2016...,48922,76636,11/8/16,09:35:33 8 Nov 2016
3,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/4/2016,11/7/2016,YouGov,B,3677.0,lv,6.087135,45.0,41.0,5.0,,45.65676,40.92004,6.069454,,,https://d25d2506sfb94s.cloudfront.net/cumulus_...,48687,76262,11/7/16,09:35:33 8 Nov 2016
4,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Gravis Marketing,B-,16639.0,rv,5.316449,47.0,43.0,3.0,,46.84089,42.33184,3.726098,,,http://www.gravispolls.com/2016/11/final-natio...,48848,76444,11/7/16,09:35:33 8 Nov 2016


In [4]:
print("Number of rows (polls): " + str(df.shape[0]))
print("Number of columns (data categories): " + str(df.shape[1]))

print("\nNumber of empty values for each column:")
print(df.isnull().sum())

Number of rows (polls): 12624
Number of columns (data categories): 27

Number of empty values for each column:
cycle                   0
branch                  0
type                    0
matchup                 0
forecastdate            0
state                   0
startdate               0
enddate                 0
pollster                0
grade                1287
samplesize              3
population              0
poll_wt                 0
rawpoll_clinton         0
rawpoll_trump           0
rawpoll_johnson      4227
rawpoll_mcmullin    12534
adjpoll_clinton         0
adjpoll_trump           0
adjpoll_johnson      4227
adjpoll_mcmullin    12534
multiversions       12588
url                     3
poll_id                 0
question_id             0
createddate             0
timestamp               0
dtype: int64
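
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show how `isnull().mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the poll data; all values are made up for illustration.
toy = pd.DataFrame({
    'grade': ['A', np.nan, 'B', np.nan],
    'samplesize': [500.0, 800.0, np.nan, 1200.0],
    'adjpoll_clinton': [44.1, 45.0, 43.2, 46.3],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real data, the same idea shows that 12534 of 12624 `adjpoll_mcmullin` values are empty, roughly 99%.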


We see that there are 12624 polls and 27 categories of data. Of these, we can subset the dataframe to select only the categories that we're interested in. Let's go ahead and do that:

In [5]:
categories = [ 'type', 'state', 'enddate', 'pollster', 'grade', 'samplesize', 'population',
             'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin', 'poll_id']
df2 = df.loc[:, categories]
df2.head()

Unnamed: 0,type,state,enddate,pollster,grade,samplesize,population,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,poll_id
0,polls-plus,U.S.,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,45.20163,41.7243,4.626221,,48630
1,polls-plus,U.S.,11/7/2016,Google Consumer Surveys,B,26574.0,lv,43.34557,41.21439,5.175792,,48847
2,polls-plus,U.S.,11/6/2016,Ipsos,A-,2195.0,lv,42.02638,38.8162,6.844734,,48922
3,polls-plus,U.S.,11/7/2016,YouGov,B,3677.0,lv,45.65676,40.92004,6.069454,,48687
4,polls-plus,U.S.,11/6/2016,Gravis Marketing,B-,16639.0,rv,46.84089,42.33184,3.726098,,48848


*Note: We use the adjusted poll data (`adjpoll_*`) instead of the raw poll data (`rawpoll_*`); the adjusted figures include FiveThirtyEight's corrections to the raw numbers.*

Awesome! But what is this "type" variable? We can tell from `df2.head()` that there's a type called "polls-plus", but we can't tell much else.

In [6]:
print(df2.loc[:,'type'].unique()) #display unique values of the 'type' factor

['polls-plus' 'now-cast' 'polls-only']


We can see three unique types of polls. According to the source of the dataset on [FiveThirtyEight](https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/):
+ **Polls-plus**: Combines polls with an economic index. Since the economic index implies that this election should be a tossup, it assumes the race will tighten somewhat.
+ **Polls-only**: A simpler, what-you-see-is-what-you-get version of the model. It assumes current polls reflect the best forecast for November, although with a lot of uncertainty.
+ **Now-cast**: A projection of what would happen in a hypothetical election held today. Much more aggressive than the other models.

We want to work with the simple adjusted poll data, not combined with other data. So we're going to take out all the polls that have been adjusted to "polls-plus" and "now-cast".

In [7]:
df_po = df2[df2.loc[:,'type']=='polls-only'] #create df_po containing only the polls of type 'polls-only'
df_po = df_po.reset_index(drop=True) #reset the dataframe indices, and drop the original indices from memory
df_po.head()

Unnamed: 0,type,state,enddate,pollster,grade,samplesize,population,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,poll_id
0,polls-only,U.S.,11/6/2016,ABC News/Washington Post,A+,2220.0,lv,45.21947,41.70754,4.606925,,48630
1,polls-only,U.S.,11/7/2016,Google Consumer Surveys,B,26574.0,lv,43.40083,41.14659,5.164047,,48847
2,polls-only,U.S.,11/6/2016,Ipsos,A-,2195.0,lv,42.01984,38.74365,6.816055,,48922
3,polls-only,U.S.,11/7/2016,YouGov,B,3677.0,lv,45.68214,40.90047,6.118311,,48687
4,polls-only,U.S.,11/6/2016,Gravis Marketing,B-,16639.0,rv,46.83107,42.27754,3.749071,,48848


In [8]:
df_po.describe() #display summary statistics for numerical variables

| | samplesize | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin | poll_id |
|---|---|---|---|---|---|---|
| count | 4207.0 | 4208.0 | 4208.0 | 2799.0 | 30.0 | 4208.0 |
| mean | 1148.216068 | 43.322517 | 42.654425 | 4.651088 | 24.508827 | 45910.899477 |
| std | 2630.856265 | 7.097772 | 6.948612 | 2.47239 | 5.235812 | 2864.763228 |
| min | 35.0 | 17.11589 | 4.488276 | -3.677883 | 11.02832 | 35362.0 |
| 25% | 447.5 | 40.22023 | 38.449348 | 3.130344 | 23.108497 | 45151.75 |
| 50% | 772.0 | 44.142125 | 42.70472 | 4.36681 | 25.135225 | 46384.5 |
| 75% | 1236.5 | 46.901398 | 46.315503 | 5.763004 | 27.976062 | 47741.25 |
| max | 84292.0 | 86.7132 | 72.37661 | 20.357 | 31.57469 | 48922.0 |


## Scatter Plots
We start with a simple scatter plot, with date plotted on the x-axis and voter percentage plotted on the y-axis. The original dataset contained `startdate`, `enddate`, and `forecastdate`; of these three, we've subsetted only the `enddate` because it's the most accurate representation of the timeframe of each poll.

In [9]:
df_po.loc[:,'enddate'].head() #view first 5 'enddate' values

0    11/6/2016
1    11/7/2016
2    11/6/2016
3    11/7/2016
4    11/6/2016
Name: enddate, dtype: object

Each date is stored as an `object` (here, a string), which means Pandas treats the values as discrete labels rather than points on a continuous timeline. To fix this, we use the `to_datetime` function from Pandas to convert the column.

In [10]:
df_po.loc[:,'enddate'] = pd.to_datetime(df_po.loc[:,'enddate']) #convert 'enddate' into 'datetime' variables
df_po.loc[:, 'enddate'].head()

0   2016-11-06
1   2016-11-07
2   2016-11-06
3   2016-11-07
4   2016-11-06
Name: enddate, dtype: datetime64[ns]
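
To see why the conversion matters, here is a minimal sketch with three made-up dates in the same `m/d/yyyy` text format as `enddate`. It compares how the values sort as strings and as datetimes; the explicit `format` argument is an assumption about the column's layout.

```python
import pandas as pd

# Made-up dates in the same m/d/yyyy text format as 'enddate'.
raw = pd.Series(['11/6/2016', '2/15/2016', '9/30/2016'])

# As strings, sorting goes character by character, so '11/...' lands before '2/...'.
sorted_as_text = raw.sort_values().tolist()

# After conversion, sorting follows the calendar.
parsed = pd.to_datetime(raw, format='%m/%d/%Y')
sorted_as_dates = parsed.sort_values().dt.strftime('%Y-%m-%d').tolist()

print(sorted_as_text)
print(sorted_as_dates)
```

A plot axis built from the string version would place November before February, which is why the datetime conversion comes first.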

In [11]:
df_po.loc[:, ['enddate', 'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin']].head(10)

| | enddate | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin |
|---|---|---|---|---|---|
| 0 | 2016-11-06 | 45.21947 | 41.70754 | 4.606925 | NaN |
| 1 | 2016-11-07 | 43.40083 | 41.14659 | 5.164047 | NaN |
| 2 | 2016-11-06 | 42.01984 | 38.74365 | 6.816055 | NaN |
| 3 | 2016-11-07 | 45.68214 | 40.90047 | 6.118311 | NaN |
| 4 | 2016-11-06 | 46.83107 | 42.27754 | 3.749071 | NaN |
| 5 | 2016-11-06 | 49.05626 | 43.87898 | 3.018706 | NaN |
| 6 | 2016-11-06 | 45.31196 | 40.80614 | 4.230162 | NaN |
| 7 | 2016-11-05 | 43.68695 | 40.80897 | 5.381917 | NaN |
| 8 | 2016-11-06 | 45.03026 | 41.83415 | 8.034579 | NaN |
| 9 | 2016-11-07 | 42.88452 | 42.18602 | 6.367243 | NaN |


Let's plot! We start off with a basic scatter plot; the API reference can be found on the [Plotly website](https://plot.ly/python/).

In [12]:
#PLOT ALL DATA USING PLOTLY

fig = {
    'data':[
        #Clinton data
        {
            'name':"Clinton", 
            'x':df_po.loc[:,'enddate'], 
            'y':df_po.loc[:,'adjpoll_clinton'],
            'mode':'markers', 'marker':{'color':'blue', 'opacity':0.1}},
        #Trump data
        {   
            'name':"Trump",
            'x':df_po.loc[:,'enddate'], 
            'y':df_po.loc[:,'adjpoll_trump'],
            'mode':'markers', 'marker':{'color':'red', 'opacity':0.1}},
        #Johnson data
        {
            'name':"Johnson", 
            'x':df_po.loc[:,'enddate'], 
            'y':df_po.loc[:,'adjpoll_johnson'],
            'mode':'markers', 'marker':{'color':'gold', 'opacity':0.1}},
        #McMullin data
        {
            'name':"McMullin", 
            'x':df_po.loc[:, 'enddate'], 
            'y':df_po.loc[:,'adjpoll_mcmullin'],
            'mode':'markers', 'marker':{'color':'green', 'opacity':0.1}}
    ],
    #set graph layout
    'layout':{
        'title':"Adjusted Poll Data",
        #set x-axis default range from one month before the first observation, until one month after the last observation
        'xaxis':{'title':"Date", 
                 'range':[min(df_po.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                                max(df_po.loc[:, 'enddate']) + pd.DateOffset(months=1)]},
        #set y-axis default range from 0 to 100, with tick marks every 10 percentage points
        'yaxis':{'title':"Percentage", 
                 'range':[0, 100], 'tick0':0, 'dtick':10},
        #set background color
        'plot_bgcolor':'ghostwhite', 'paper_bgcolor':'ghostwhite'}
}

#plot the data
py.iplot(fig, filename='Election Poll Data')



Graph can be seen [here](https://plot.ly/~junseopark/14/adjusted-poll-data/). 

It seems like Trump generally began the campaign with lower figures, but there are too many data points toward the end to determine anything conclusive. Let's see what kind of insights we can gain from the `grade` variable.

In [13]:
df_po.loc[:,'grade'].unique() #display unique values of the 'grade' factor

array(['A+', 'B', 'A-', 'B-', 'A', nan, 'B+', 'C+', 'C-', 'C', 'D'], dtype=object)

We see that there are 11 different `grade` values (ten letter grades plus missing). That's a lot to work with, so we'll whittle it down to six: A+, A, B, C, D, and NA. With the exception of A+, we drop the +/- from all the grades, then we'll plot scatterplots for each grade.

In [14]:
df_grade = df_po #note: df_grade is another name for df_po, not a copy
for index in range(len(df_grade)):
    grade = df_grade.loc[index, 'grade']
    if (grade=='A-'):                     #change A- grades to A
        df_grade.loc[index, 'grade']='A'
    elif (grade=='B+' or  grade=='B-'):   #change B+ and B- grades to B
        df_grade.loc[index, 'grade']='B'
    elif (grade=='C+' or grade=='C-'):    #change C+ and C- grades to C
        df_grade.loc[index, 'grade']='C'
#change empty grades ('nan') to string 'NA'; a single .loc call avoids chained assignment
df_grade.loc[df_grade.loc[:, 'grade'].isnull(), 'grade'] = 'NA'

df_grade.grade.unique() #display unique values of 'grade'

array(['A+', 'B', 'A', 'NA', 'C', 'D'], dtype=object)
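
The same collapsing can be done without a row-by-row loop. The sketch below is one possible alternative, not the notebook's method: it applies a mapping with `Series.replace` and fills the missing grades with `fillna`. The input series simply lists the eleven grade values shown by `unique()` earlier.

```python
import numpy as np
import pandas as pd

# The eleven distinct grade values, in the order unique() reported them.
grades_raw = pd.Series(['A+', 'B', 'A-', 'B-', 'A', np.nan, 'B+', 'C+', 'C-', 'C', 'D'])

# Map every +/- variant to its base letter; A+ is deliberately left alone.
collapse = {'A-': 'A', 'B+': 'B', 'B-': 'B', 'C+': 'C', 'C-': 'C'}

# replace() swaps the mapped values, fillna() turns missing grades into the string 'NA'.
grades_clean = grades_raw.replace(collapse).fillna('NA')
print(grades_clean.unique())
```

On a frame with thousands of rows, this kind of vectorized call is usually much faster than assigning cell by cell with `.loc`.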

In [15]:
#PLOTTING CLINTON, TRUMP BY GRADE

#create traces for each grade, for each candidate
trace_CAp = go.Scatter(
    name="Clinton A+", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='A+'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='A+'],
    mode='markers', marker=dict(color='blue', opacity=0.1))
trace_CA = go.Scatter(
    name="Clinton A", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='A'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='A'],
    mode='markers', marker=dict(color='blue', opacity=0.1))
trace_CB = go.Scatter(
    name="Clinton B", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='B'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='B'],
    mode='markers', marker=dict(color='blue', opacity=0.1))
trace_CC = go.Scatter(
    name="Clinton C", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='C'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='C'],
    mode='markers', marker=dict(color='blue', opacity=0.1))
trace_CD = go.Scatter(
    name="Clinton D", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='D'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='D'],
    mode='markers', marker=dict(color='blue', opacity=0.1))
trace_CNA = go.Scatter(
    name="Clinton NA", 
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='NA'], 
    y=df_grade.loc[:, 'adjpoll_clinton'][df_grade.loc[:, 'grade']=='NA'],
    mode='markers', marker=dict(color='blue', opacity=0.1))

trace_TAp = go.Scatter( 
    name="Trump A+",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='A+'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='A+'], 
    mode='markers', marker=dict(color='red', opacity=0.1))
trace_TA = go.Scatter( 
    name="Trump A",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='A'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='A'], 
    mode='markers', marker=dict(color='red', opacity=0.1))
trace_TB = go.Scatter( 
    name="Trump B",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='B'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='B'], 
    mode='markers', marker=dict(color='red', opacity=0.1))
trace_TC = go.Scatter( 
    name="Trump C",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='C'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='C'], 
    mode='markers', marker=dict(color='red', opacity=0.1))
trace_TD = go.Scatter( 
    name="Trump D",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='D'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='D'], 
    mode='markers', marker=dict(color='red', opacity=0.1))
trace_TNA = go.Scatter( 
    name="Trump NA",
    x=df_grade.loc[:, 'enddate'][df_grade.loc[:, 'grade']=='NA'], 
    y=df_grade.loc[:, 'adjpoll_trump'][df_grade.loc[:, 'grade']=='NA'], 
    mode='markers', marker=dict(color='red', opacity=0.1))

#create a figure with 3 rows and 2 columns
fig = tools.make_subplots(rows=3, cols=2, subplot_titles=('A+', 'A', 'B',
                                                          'C', 'D', 'NA'))
#apply traces to matching figure in subplot
fig.append_trace(trace_CAp, 1, 1)
fig.append_trace(trace_TAp, 1, 1)
fig.append_trace(trace_CA, 1, 2)
fig.append_trace(trace_TA, 1, 2)
fig.append_trace(trace_CB, 2, 1)
fig.append_trace(trace_TB, 2, 1)
fig.append_trace(trace_CC, 2, 2)
fig.append_trace(trace_TC, 2, 2)
fig.append_trace(trace_CD, 3, 1)
fig.append_trace(trace_TD, 3, 1)
fig.append_trace(trace_CNA,  3, 2)
fig.append_trace(trace_TNA, 3, 2)

#create layout: x-axis range over course of data, y-axis from 0 to 100
layout = go.Layout(
    title="Clinton and Trump, By Pollster Grade",
    xaxis1=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    xaxis2=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    xaxis3=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    xaxis4=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    xaxis5=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    xaxis6=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                       max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    yaxis1=dict(range=[0,100], tick0=0, dtick=25), 
    yaxis2=dict(range=[0,100], tick0=0, dtick=25), 
    yaxis3=dict(range=[0,100], tick0=0, dtick=25), 
    yaxis4=dict(range=[0,100], tick0=0, dtick=25),
    yaxis5=dict(range=[0,100], tick0=0, dtick=25),
    yaxis6=dict(range=[0,100], tick0=0, dtick=25),
    plot_bgcolor='ghostwhite', paper_bgcolor='ghostwhite')
fig['layout'].update(layout)

py.iplot(fig, filename='Election Poll Data, By Grade')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]



[Link to graph](https://plot.ly/~junseopark/10/clinton-and-trump-by-pollster-grade/)
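
The twelve traces in In [15] differ only in candidate and grade, so they can also be generated in a loop, which leaves less room for a mismatched filter between `x` and `y`. This is a minimal sketch on a made-up frame; `build_traces` is a hypothetical helper, and the traces are plain dicts in the same shape as the In [12] figure rather than `go.Scatter` objects.

```python
import pandas as pd

# Small made-up frame with the columns the traces need.
polls = pd.DataFrame({
    'enddate': pd.to_datetime(['2016-10-01', '2016-10-05', '2016-10-09', '2016-10-12']),
    'grade': ['A+', 'A', 'A+', 'D'],
    'adjpoll_clinton': [45.0, 44.0, 46.0, 41.0],
    'adjpoll_trump': [41.0, 42.0, 40.0, 43.0],
})

candidates = {'Clinton': ('adjpoll_clinton', 'blue'), 'Trump': ('adjpoll_trump', 'red')}
grade_order = ['A+', 'A', 'B', 'C', 'D', 'NA']

def build_traces(frame):
    """Return one scatter-trace dict per (candidate, grade) pair."""
    traces = {}
    for grade in grade_order:
        subset = frame[frame['grade'] == grade]  # one filter, reused for both x and y
        for name, (column, color) in candidates.items():
            traces[(name, grade)] = {
                'name': name + ' ' + grade,
                'x': subset['enddate'],
                'y': subset[column],
                'mode': 'markers',
                'marker': {'color': color, 'opacity': 0.1},
            }
    return traces

traces = build_traces(polls)
print(len(traces))
```

Because each trace takes `x` and `y` from the same `subset`, the two always have the same length and refer to the same polls.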

Better... We can see that the D grade doesn't have many points, and that the A+/A polls generally ranked Clinton higher until August 2016, after which there are still too many points to make anything out. Boxplots, perhaps?

## Boxplots
By plotting boxplots of grades, we'll be able to get a quick view of the distribution of the dataset for each grade. Let's take a look.

In [16]:
#BOXPLOTS BY GRADE

trace_Clinton = go.Box(
    name="Clinton",
    x=df_grade.loc[:,'grade'], 
    y=df_grade.loc[:, 'adjpoll_clinton'],
    marker=dict(color='blue'))
trace_Trump = go.Box(
    name="Trump",
    x=df_grade.loc[:, 'grade'], 
    y=df_grade.loc[:, 'adjpoll_trump'], 
    marker=dict(color='red'))

data = [trace_Clinton, trace_Trump]
layout = go.Layout(
    title="Boxplots of Election Polls, By Pollster Grade",
    yaxis=dict(title="Percentage", range=[0, 100], tick0=0, dtick=10, zeroline=False), 
    boxmode='group', #place boxes in groups by grade
    plot_bgcolor='ghostwhite', paper_bgcolor='ghostwhite')
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename="Election Poll Boxplots, by Grade")

[Link to graph](https://plot.ly/~junseopark/16/boxplots-of-election-polls-by-pollster-grade/)

The grades on the x-axis aren't in order; that's because they appear in order of first occurrence: the first unique grade entry was an A+, second was B+/B/B-, third was A/A-, fourth was NA, etc. We can reorder by creating a dictionary of grades assigned to numbers, then creating a new variable called `graderank` to sort these values. Then we re-plot:

In [17]:
#reordering data in order of grade
grades = {'A+':1, 'A':2, 'B':3, 'C':4, 'D':5, 'NA':6}
df_grade.insert(loc=len(df_grade.iloc[0]), column='graderank', value=0)
df_grade.loc[:,'graderank'] = df_grade.loc[:, 'grade'].map(grades)
df_grade.sort_values(by='graderank', inplace=True)
df_grade = df_grade.drop(labels='graderank', axis=1)

trace_Clinton = go.Box(
    name="Clinton",
    x=df_grade.loc[:,'grade'], 
    y=df_grade.loc[:, 'adjpoll_clinton'],
    marker=dict(color='blue'))
trace_Trump = go.Box(
    name="Trump",
    x=df_grade.loc[:, 'grade'], 
    y=df_grade.loc[:, 'adjpoll_trump'], 
    marker=dict(color='red'))

data = [trace_Clinton, trace_Trump]
layout = go.Layout(
    title="Boxplots of Election Polls, By Ordered Pollster Grade",
    yaxis=dict(title="Percentage", range=[0, 100], tick0=0, dtick=10, zeroline=False), 
    boxmode='group', #place boxes in groups by grade
    plot_bgcolor='ghostwhite', paper_bgcolor='ghostwhite')
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename="Election Poll Boxplots, by Ordered Pollster Grade")

[Link to graph](https://plot.ly/~junseopark/28/boxplots-of-election-polls-by-ordered-pollster-grade/)
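
As an alternative to the temporary `graderank` column, Pandas can carry the ordering on the column itself through an ordered `Categorical`. The following is a sketch on made-up values, not the approach used in the cell above.

```python
import pandas as pd

# Made-up rows, deliberately out of grade order.
toy = pd.DataFrame({
    'grade': ['B', 'NA', 'A+', 'D', 'A', 'C'],
    'adjpoll_clinton': [44.0, 43.0, 46.0, 40.0, 45.0, 42.0],
})

# Declare the intended order once; sort_values then follows it instead of the alphabet.
grade_order = ['A+', 'A', 'B', 'C', 'D', 'NA']
toy['grade'] = pd.Categorical(toy['grade'], categories=grade_order, ordered=True)

toy_sorted = toy.sort_values(by='grade')
print(toy_sorted['grade'].tolist())
```

If a plotting call expects plain strings, the column can be converted back afterwards with `astype(str)`; the row order from the sort is kept.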

D doesn't have enough data points to provide accurate insight, and by nature of NA we don't know the quality of those pollsters. Besides those two grades, we see that as the grade decreases, the distribution widens and there are more outliers. Even so, all the grade letter distributions rank Clinton above Trump. Assuming that predictions were based on poll data, then it makes sense that we were predicting Clinton to win the election.
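
The claim that the distribution widens can be checked numerically with the interquartile range (the height of each box). This sketch uses made-up numbers for two grades; `iqr` is a small helper defined here, not a Pandas built-in.

```python
import pandas as pd

# Made-up polls: grade A is tightly clustered, grade C is spread out.
toy = pd.DataFrame({
    'grade': ['A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'],
    'adjpoll_clinton': [44.0, 45.0, 46.0, 47.0, 38.0, 42.0, 46.0, 50.0],
})

def iqr(values):
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return values.quantile(0.75) - values.quantile(0.25)

# One IQR per grade; a larger number means a taller box in the boxplot.
spread = toy.groupby('grade')['adjpoll_clinton'].agg(iqr)
print(spread)
```

Running the same aggregation on `df_grade` would give one spread value per grade to set beside the boxplots.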


## Line Graph with Error Bars
Next, let's take a look at something called a "Continuous Error Graph." Note that there are two different ways to find the standard deviation: with Pandas or with NumPy. What's the difference?

In [18]:
print("Pandas SD: " + str(df_po.loc[:,'adjpoll_clinton'].std()))
print("Numpy SD : " + str(np.std(df_po.loc[:, 'adjpoll_clinton'])))

Pandas SD: 7.097772011313118
Numpy SD : 7.096928594756007
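
To make the difference between the two calls concrete, here is a tiny hand-checkable example (the eight values are made up). It computes the standard deviation once dividing by n and once by n-1, and compares each with the library defaults.

```python
import numpy as np
import pandas as pd

# Eight made-up values with mean 5 and a sum of squared deviations of 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)
squared_dev = ((values - values.mean()) ** 2).sum()

population_sd = np.sqrt(squared_dev / n)        # divide by n
sample_sd = np.sqrt(squared_dev / (n - 1))      # divide by n-1

# np.std defaults to ddof=0 (divide by n); Series.std defaults to ddof=1 (divide by n-1).
print(population_sd, np.std(values.to_numpy()))
print(sample_sd, values.std())
```

Passing `ddof=0` to `Series.std` makes Pandas return the same value as NumPy's default.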


By default, Pandas employs [Bessel's Correction](https://en.wikipedia.org/wiki/Bessel's_correction), which divides the sum of squared deviations by n-1 instead of n to correct the bias that arises when a population's variance is estimated from a sample. NumPy divides by n by default. Since we treat our dataset as a population (a compilation of every single poll regarding the 2016 elections), there's no need to make this accommodation. So we'll proceed with NumPy's standard deviation formula:

In [19]:
df_po.sort_values(by='enddate', inplace=True)
clinton_upper = go.Scatter(
    name="Clinton Upper Bound", 
    x=df_po.loc[:,'enddate'], 
    y=df_po.loc[:,'adjpoll_clinton'] + np.std(df_po.loc[:,'adjpoll_clinton']), 
    mode='lines', marker=dict(color='blue'),
    line=dict(width=0),
    fillcolor='rgba(0, 0, 255, 0.2)',
    fill='tonexty')
clinton = go.Scatter(
    name="Clinton Mean", 
    x=df_po.loc[:,'enddate'], 
    y=np.mean(df_po.loc[:,'adjpoll_clinton']), 
    mode='lines', marker=dict(color='blue'),
    line=dict(color='blue'),
    fillcolor='rgba(0,0,255,0.2)',
    fill='tonexty')
clinton_lower = go.Scatter(
    name="Clinton Lower Bound",
    x=df_po.loc[:,'enddate'],
    y=df_po.loc[:,'adjpoll_clinton'] - np.std(df_po.loc[:,'adjpoll_clinton']),
    mode='lines', marker=dict(color='blue'),
    line=dict(width=0),
    fillcolor='rgba(0,0,255,0.2)',
    fill='tonexty')

trump_upper = go.Scatter(
    name="Trump Upper Bound", 
    x=df_po.loc[:,'enddate'], 
    y=df_po.loc[:,'adjpoll_trump'] + np.std(df_po.loc[:,'adjpoll_trump']), 
    mode='lines', marker=dict(color='red'),
    line=dict(width=0),
    fillcolor='rgba(255, 0, 0, 0.2)',
    fill='tonexty')
trump = go.Scatter(
    name="Trump Mean", 
    x=df_po.loc[:,'enddate'], 
    y=np.mean(df_po.loc[:,'adjpoll_trump']), 
    mode='lines', marker=dict(color='red'),
    line=dict(color='red'),
    fillcolor='rgba(255, 0, 0, 0.2)',
    fill='tonexty')
trump_lower = go.Scatter(
    name="Trump Lower Bound",
    x=df_po.loc[:,'enddate'],
    y=df_po.loc[:,'adjpoll_trump'] - np.std(df_po.loc[:,'adjpoll_trump']),
    mode='lines', marker=dict(color='red'),
    line=dict(width=0),
    fillcolor='rgba(255, 0, 0, 0.2)')

data = [clinton_lower, clinton, clinton_upper]
layout = go.Layout(
    title="Election Poll Means with Standard Deviation",
    xaxis=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                            max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    yaxis=dict(title="Percentage", range=[0,100], tick0=0, dtick=10),
    plot_bgcolor='ghostwhite', paper_bgcolor='ghostwhite')

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Election Poll Data, Continuous Error Chart')

[Link to graph](https://plot.ly/~junseopark/22/election-poll-means-with-standard-deviation/)

Clearly something's not right here. Let's take a look at the sample code provided in the Plotly documentation:

In [20]:
#EXAMPLE CODE FOR STANDARD DEVIATION GRAPH
import plotly.plotly as py
import plotly.graph_objs as go

import pandas as pd

dftest = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/wind_speed_laurel_nebraska.csv')

upper_bound = go.Scatter(
    name='Upper Bound',
    x=dftest['Time'],
    y=dftest['10 Min Sampled Avg']+dftest['10 Min Std Dev'],
    mode='lines',
    marker=dict(color="444"),
    line=dict(width=0),
    fillcolor='rgba(68, 68, 68, 0.3)',
    fill='tonexty')

trace = go.Scatter(
    name='Measurement',
    x=dftest['Time'],
    y=dftest['10 Min Sampled Avg'],
    mode='lines',
    line=dict(color='rgb(31, 119, 180)'),
    fillcolor='rgba(68, 68, 68, 0.3)',
    fill='tonexty')

lower_bound = go.Scatter(
    name='Lower Bound',
    x=dftest['Time'],
    y=dftest['10 Min Sampled Avg']-dftest['10 Min Std Dev'],
    marker=dict(color="444"),
    line=dict(width=0),
    mode='lines')

# Trace order can be important
# with continuous error bars
data = [lower_bound, trace, upper_bound]

layout = go.Layout(
    yaxis=dict(title='Wind speed (m/s)'),
    title='Continuous, variable value error bars.<br>Notice the hover text!')

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='pandas-continuous-error-bars')

[Link to graph](https://plot.ly/~junseopark/20/continuous-variable-value-error-bars-notice-the-hover-text/)

In [21]:
dftest.head()

| | 10 Min Std Dev | Time | 10 Min Sampled Avg |
|---|---|---|---|
| 0 | 2.73 | 2001-06-11 11:00 | 22.3 |
| 1 | 1.98 | 2001-06-11 11:10 | 23.0 |
| 2 | 1.87 | 2001-06-11 11:20 | 23.3 |
| 3 | 2.03 | 2001-06-11 11:30 | 22.0 |
| 4 | 3.1 | 2001-06-11 11:40 | 20.5 |


Instead of individual data points, the sample data is split into 10-minute bins, and each bin has its own mean and standard deviation. Of course! A single data point has no spread to measure, so we'll have to group our polls into bins of roughly 5 days.

In [22]:
#creating bins for dates
df_date = df_po.loc[:,['enddate','adjpoll_clinton','adjpoll_trump']]

for index in range(len(df_date)):     #snap each date to the start of its bin: the 1st, 6th, 11th, 16th, 21st, or 26th
    d = df_date.loc[index, 'enddate']
    if (d.strftime("%d") <= '05'):
        df_date.loc[index, 'enddate']=d.replace(day=1)
    elif (d.strftime("%d") <= '10'):
        df_date.loc[index, 'enddate']=d.replace(day=6)
    elif (d.strftime("%d") <= '15'):
        df_date.loc[index, 'enddate']=d.replace(day=11)
    elif (d.strftime("%d") <= '20'):
        df_date.loc[index, 'enddate']=d.replace(day=16)
    elif (d.strftime("%d") <= '25'):
        df_date.loc[index, 'enddate']=d.replace(day=21)
    elif (d.strftime("%d") <= '31'):
        df_date.loc[index, 'enddate']=d.replace(day=26)

#adding rows to new dataframe df_bins
df_bins = pd.DataFrame(columns=('date', 'mean_c', 'mean_t', 'sd_c', 'sd_t'))
while (len(df_date) != 0):
    row0 = df_date.iloc[0,:]
    d = row0.loc['enddate']
    mean_c = np.mean(df_date.loc[:,'adjpoll_clinton'][df_date.loc[:,'enddate']==d])
    mean_t = np.mean(df_date.loc[:,'adjpoll_trump'][df_date.loc[:,'enddate']==d])
    sd_c = np.std(df_date.loc[:,'adjpoll_clinton'][df_date.loc[:,'enddate']==d])
    sd_t = np.std(df_date.loc[:,'adjpoll_trump'][df_date.loc[:,'enddate']==d])
    df_bins.loc[df_bins.shape[0]] = [d, mean_c, mean_t, sd_c, sd_t]
    df_date = df_date[df_date.loc[:,'enddate']!=d]
    df_po = df_po.reset_index(drop=True)
    
#reordering rows by date
df_bins.sort_values(by='date', inplace=True)
df_bins.head()

| | date | mean_c | mean_t | sd_c | sd_t |
|---|---|---|---|---|---|
| 0 | 2015-11-06 | 40.86641 | 45.80082 | 0.0 | 0.0 |
| 1 | 2015-11-11 | 44.608547 | 41.087447 | 5.778527 | 3.89799 |
| 2 | 2015-11-16 | 43.626064 | 43.885286 | 1.88581 | 3.093642 |
| 3 | 2015-11-21 | 29.19415 | 49.73265 | 0.0 | 0.0 |
| 4 | 2015-11-26 | 47.27088 | 40.88484 | 0.0 | 0.0 |
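
The binning and the per-bin statistics can also be written with vectorized date arithmetic and `groupby`. The sketch below is an alternative under the same bin boundaries (1st, 6th, 11th, 16th, 21st, 26th), run on four made-up polls; it uses `ddof=0` so that a bin with a single poll gets a standard deviation of 0, as in the output above.

```python
import numpy as np
import pandas as pd

# Four made-up polls; the first two fall in the same bin.
toy = pd.DataFrame({
    'enddate': pd.to_datetime(['2016-10-03', '2016-10-05', '2016-10-06', '2016-10-31']),
    'adjpoll_clinton': [44.0, 46.0, 45.0, 47.0],
})

# Day of month -> first day of its bin; days 26 to 31 all share the bin starting on the 26th.
day = toy['enddate'].dt.day
bin_start_day = np.minimum(1 + 5 * ((day - 1) // 5), 26)
toy['bin'] = toy['enddate'] - pd.to_timedelta(day - bin_start_day, unit='D')

# Mean and population standard deviation (ddof=0) per bin.
grouped = toy.groupby('bin')['adjpoll_clinton']
bins = pd.DataFrame({'mean_c': grouped.mean(), 'sd_c': grouped.std(ddof=0)})
print(bins)
```

The result is indexed by bin start date and already sorted, so no separate `sort_values` step is needed.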


Our data now resembles the sample Plotly dataset, so we re-plot:

In [23]:
#replotting continuous error charts
clinton_upper = go.Scatter(
    name="Clinton Upper Bound", 
    x=df_bins.loc[:,'date'], 
    y=df_bins.loc[:,'mean_c'] + df_bins.loc[:,'sd_c'], 
    mode='lines', marker=dict(color='blue'),
    line=dict(width=0),
    fillcolor='rgba(0, 0, 255, 0.2)',
    fill='tonexty')
clinton = go.Scatter(
    name="Clinton Mean", 
    x=df_bins.loc[:,'date'], 
    y=df_bins.loc[:,'mean_c'], 
    mode='lines', marker=dict(color='blue'),
    line=dict(color='blue'),
    fillcolor='rgba(0,0,255,0.2)',
    fill='tonexty')
clinton_lower = go.Scatter(
    name="Clinton Lower Bound",
    x=df_bins.loc[:,'date'],
    y=df_bins.loc[:,'mean_c'] - df_bins.loc[:,'sd_c'],
    mode='lines', marker=dict(color='blue'),
    line=dict(width=0),
    fillcolor='rgba(0,0,255,0.2)')

trump_upper = go.Scatter(
    name="Trump Upper Bound", 
    x=df_bins.loc[:,'date'], 
    y=df_bins.loc[:,'mean_t'] + df_bins.loc[:,'sd_t'], 
    mode='lines', marker=dict(color='red'),
    line=dict(width=0),
    fillcolor='rgba(255, 0, 0, 0.2)',
    fill='tonexty')
trump = go.Scatter(
    name="Trump Mean", 
    x=df_bins.loc[:,'date'], 
    y=df_bins.loc[:,'mean_t'], 
    mode='lines', marker=dict(color='red'),
    line=dict(color='red'),
    fillcolor='rgba(255, 0, 0, 0.2)',
    fill='tonexty')
trump_lower = go.Scatter(
    name="Trump Lower Bound",
    x=df_bins.loc[:,'date'],
    y=df_bins.loc[:,'mean_t'] - df_bins.loc[:,'sd_t'],
    mode='lines', marker=dict(color='red'),
    line=dict(width=0),
    fillcolor='rgba(255, 0, 0, 0.2)')

data = [clinton_lower, clinton, clinton_upper, trump_lower, trump, trump_upper]
layout = go.Layout(
    title="Election Poll Means with Standard Deviation (Adjusted)",
    xaxis=dict(range=(min(df_grade.loc[:, 'enddate']) - pd.DateOffset(months=1), 
                            max(df_grade.loc[:, 'enddate']) + pd.DateOffset(months=1))),
    yaxis=dict(title="Percentage", range=[0,100], tick0=0, dtick=10),
    plot_bgcolor='ghostwhite', paper_bgcolor='ghostwhite')

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Election Poll Data, Continuous Error Chart (Adjusted)')

[Link to graph](https://plot.ly/~junseopark/26/election-poll-means-with-standard-deviation-adjusted/)

We observe that the standard deviation seems to increase as time goes on. But one more interesting thing: the means and standard deviations essentially overlap towards the end of the election campaign, suggesting that the election might have been tighter than we originally thought.
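
Whether the two bands overlap in a given bin can be tested directly: two intervals overlap when each one starts before the other ends. The sketch below applies that test to two made-up bins that use the same column names as `df_bins`.

```python
import pandas as pd

# Two made-up bins: a clear Clinton lead, then a close race.
toy_bins = pd.DataFrame({
    'mean_c': [48.0, 44.0], 'sd_c': [1.0, 3.0],
    'mean_t': [40.0, 42.0], 'sd_t': [2.0, 2.5],
})

# Lower and upper edge of each candidate's mean +/- one standard deviation band.
lo_c = toy_bins['mean_c'] - toy_bins['sd_c']
hi_c = toy_bins['mean_c'] + toy_bins['sd_c']
lo_t = toy_bins['mean_t'] - toy_bins['sd_t']
hi_t = toy_bins['mean_t'] + toy_bins['sd_t']

# Intervals overlap when each one starts before the other ends.
overlap = (lo_c <= hi_t) & (lo_t <= hi_c)
print(overlap.tolist())
```

Applied to `df_bins`, the share of overlapping bins over time would put a number on the visual impression from the chart.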

## Bubble Charts

First, we look at the relation between sample size and poll results.

In [39]:
sizemode = 'area'

sizeref = df_po.loc[:,'samplesize'].max()/1e2**2

hover_c = []
for index, row in df_grade.iterrows():
    hover_c.append(('{state}<br>'+
                      'Sample Size: {samplesize}<br>'+
                        'Percentage: {adjpoll_clinton}'
                      ).format(state=row['state'],
                               samplesize=row['samplesize'],
                                adjpoll_clinton=row['adjpoll_clinton']))
df_grade['text'] = hover_c

trace0 = go.Scatter(
        x=df_grade.loc[:,'enddate'],
        y=df_grade.loc[:,'adjpoll_clinton'], 
        name = 'Clinton',
        mode='markers',
        text = df_grade.loc[:, 'text'],
        marker= go.Marker(
            color = 'blue',
            size=df_grade.loc[:,'samplesize'], 
            sizeref=sizeref, 
            sizemode=sizemode, 
            opacity=0.1, 
            line=go.Line(width=1)
        )     
    )

hover_t = []
for index, row in df_grade.iterrows():
    hover_t.append(('{state}<br>'+
                      'Sample Size: {samplesize}<br>'+
                        'Percentage: {adjpoll_trump}'
                      ).format(state=row['state'],
                               samplesize=row['samplesize'],
                                adjpoll_trump=row['adjpoll_trump']))
df_grade['text'] = hover_t

trace1 = go.Scatter(
        x=df_grade.loc[:,'enddate'],
        y=df_grade.loc[:,'adjpoll_trump'], 
        name = 'Trump',
        mode='markers',
        text = df_grade['text'],
        marker= go.Marker(
            color = 'red',
            size=df_grade.loc[:,'samplesize'], 
            sizeref=sizeref, 
            sizemode=sizemode, 
            opacity=0.1, 
            line=go.Line(width=2)
        )     
    )

data = go.Data([trace0, trace1])

title = "Bubble Chart: Sample Size - Poll Results"
x_title = "Date"
y_title = "Percentage"

layout1 = go.Layout(
    title=title,
    xaxis=go.XAxis( title=x_title),
    yaxis=go.YAxis(title=y_title),
    hovermode = 'closest'
)

fig = go.Figure(data=data, layout=layout1)
py.iplot(fig)

[Link to graph](https://plot.ly/~emmanduncan/260)

By isolating each candidate, we can see that as sample size increases, the poll results tend to converge. Very high and very low percentages correspond to smaller polls conducted at the state level. This further suggests that the outcome of the election was difficult to predict, even with larger poll samples. In addition, we see that sample size tends to increase as we get closer to the date of the election. This makes sense, since political awareness increases as the election date approaches.
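The convergence claim can also be checked numerically. The sketch below bins polls by sample size and compares the spread of the adjusted percentages in each bin. The `polls` frame, the bin edges, and the labels are all assumptions chosen for illustration; on the real data one would pass `df_grade` and pick edges that suit its sample-size distribution.

```python
import pandas as pd

# Toy data; sample sizes and percentages are invented for illustration.
polls = pd.DataFrame({
    'samplesize': [300, 400, 500, 600, 2000, 2500, 3000, 20000, 25000, 30000],
    'adjpoll_clinton': [30.0, 60.0, 35.0, 55.0, 42.0, 48.0, 45.0, 44.0, 46.0, 45.0],
})

# Assumed bin edges: small, medium, and large polls.
bins = [0, 1000, 10000, float('inf')]
labels = ['<1k', '1k-10k', '>10k']
polls['size_bin'] = pd.cut(polls['samplesize'], bins=bins, labels=labels)

# Standard deviation of the adjusted percentage within each sample-size bin.
spread = polls.groupby('size_bin', observed=True)['adjpoll_clinton'].std()
print(spread)
```

If the spread shrinks from the smallest bin to the largest, that supports the visual impression that larger polls cluster more tightly.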

Next, we look at sample size as it relates to the grade of each pollster.

In [42]:
sizemode = 'area'

sizeref = df_grade.loc[:,'samplesize'].max()/1e2**2

samplesize = df_grade.loc[:,'samplesize']

hover_text = []
for index, row in df_grade.iterrows():
    hover_text.append(('{state}<br>'+
                      'Sample Size: {samplesize}<br>'
                      ).format(state=row['state'],
                               samplesize=row['samplesize']))
df_grade['text'] = hover_text

trace0 = go.Scatter(
        x=df_grade.loc[:,'enddate'][df_grade.loc[:,'state'] == 'U.S.'],
        y=df_grade.loc[:,'grade'][df_grade.loc[:,'state'] == 'U.S.'], 
        name = 'U.S.',
        mode='markers',
        text = df_grade.loc[:,'text'][df_grade.loc[:,'state'] == 'U.S.'],
        marker= go.Marker(
            color = 'lightblue',
            size=df_grade.loc[:,'samplesize'][df_grade.loc[:,'state'] == 'U.S.'], 
            sizeref=sizeref, 
            sizemode=sizemode, 
            opacity=0.2, 
            line=go.Line(width=1)
        )     
    )

trace1 = go.Scatter(
        x=df_grade.loc[:,'enddate'][df_grade.loc[:,'state'] != 'U.S.'],
        y=df_grade.loc[:,'grade'][df_grade.loc[:,'state'] != 'U.S.'], 
        mode='markers',
        name = 'State',
        text = df_grade.loc[:,'text'][df_grade.loc[:,'state'] != 'U.S.'],
        marker= go.Marker(
            color = 'green',
            size=df_grade.loc[:,'samplesize'][df_grade.loc[:,'state'] != 'U.S.'], 
            sizeref=sizeref, 
            sizemode=sizemode, 
            opacity=0.2, 
            line=go.Line(width=1)
        )     
    )

title = "Bubble Chart: Sample Size - Poll Grade"
x_title = "Date"
y_title = "Grade"

layout = go.Layout(
    title=title,
    xaxis=go.XAxis( title=x_title),
    yaxis=go.YAxis(title=y_title),
    hovermode = 'closest'
)


data = go.Data([trace0,trace1])

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)



[Link to graph](https://plot.ly/~emmanduncan/262)

As seen above, a larger sample size does not necessarily indicate a more accurate poll. Looking just at the U.S. polling data, polls graded 'B' or 'C' tend to have the largest sample sizes, and these sizes grow as the election date approaches. This indicates that sample size may not have a direct influence on the accuracy of the poll. The polls graded 'A' and 'A+' have smaller and fairly consistent sample sizes, no larger than 2,500. According to FiveThirtyEight, these poll grades take into account the "type of election surveyed, a poll’s sample size, and the number of days separating the poll from the election" ([source](https://fivethirtyeight.com/features/how-fivethirtyeight-calculates-pollster-ratings/)). Although sample size plays a role in predicting accuracy, perhaps more consistent sampling is better than a larger sample. For state polling data, sample sizes are relatively smaller and depend on the population of each state.
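A small table can back up the reading of the chart. The sketch below summarizes sample size by grade for national polls only. The `polls` frame and its values are invented for illustration, and `grade_order` is an assumed ordering that covers only the grades present in the toy data; the real dataset contains more grades.

```python
import pandas as pd

# Toy data standing in for df_grade; the values are invented for illustration.
polls = pd.DataFrame({
    'state': ['U.S.', 'U.S.', 'U.S.', 'U.S.', 'Ohio', 'Ohio'],
    'grade': ['A+', 'A', 'B', 'C', 'A', 'B'],
    'samplesize': [2220.0, 1500.0, 26574.0, 18000.0, 800.0, 900.0],
})

# Keep national polls only, then summarize sample size per pollster grade.
national = polls[polls['state'] == 'U.S.']
summary = national.groupby('grade')['samplesize'].agg(['count', 'median', 'max'])

# Assumed display order, best grade first.
grade_order = ['A+', 'A', 'B', 'C']
summary = summary.reindex(grade_order)
print(summary)
```

The median is used instead of the mean so that a handful of very large online polls does not dominate a grade's summary.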

In [80]:
df_states = df_po.sort_values(by='state')
trace0 = go.Scatter(
        x = df_states.loc[:,'state'],
        y = df_states.loc[:,'adjpoll_clinton'],
        mode = 'markers',
        name = 'Clinton',
        marker = dict(color = 'blue', opacity = 0.1))

trace1 = go.Scatter(
        x = df_states.loc[:,'state'],
        y = df_states.loc[:,'adjpoll_trump'],
        mode = 'markers',
        name = 'Trump',
        marker = dict(color = 'red', opacity = 0.1))

title = "Adjusted Poll Data by State"
x_title = "State"
y_title = "Percentage"

layout = go.Layout(
    title=title,
    xaxis=go.XAxis( title=x_title),
    yaxis=go.YAxis(title=y_title),
    hovermode = 'closest'
)


data = go.Data([trace0,trace1])

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


[Link to graph](https://plot.ly/~emmanduncan/278/adjusted-poll-data-by-state/)

In this plot, it is easy to locate consistently red and blue states (data points grouped by color and spread apart), as well as the swing states (data points of both colors grouped towards the center). It is interesting to note that even in the most polarized states, such as D.C., there are a few points where Trump's and Clinton's percentages are almost equal.
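The same split between safe states and swing states can be sketched without the plot by averaging the Clinton-minus-Trump margin per state. The `polls` frame is toy data, and the 5-point `threshold` is an arbitrary assumption for illustration, not a definition used anywhere in this analysis.

```python
import pandas as pd

# Toy data standing in for df_po; the values are invented for illustration.
polls = pd.DataFrame({
    'state': ['District of Columbia'] * 2 + ['Florida'] * 2 + ['Wyoming'] * 2,
    'adjpoll_clinton': [85.0, 80.0, 46.0, 45.0, 25.0, 27.0],
    'adjpoll_trump': [8.0, 10.0, 45.0, 47.0, 60.0, 62.0],
})

# Positive margin favors Clinton, negative favors Trump.
polls['margin'] = polls['adjpoll_clinton'] - polls['adjpoll_trump']
mean_margin = polls.groupby('state')['margin'].mean()

# Assumed cutoff: states whose average margin is within 5 points count as swing states.
threshold = 5.0
swing = mean_margin[mean_margin.abs() < threshold].index.tolist()
print(mean_margin)
print(swing)
```

Averaging hides the individual close polls mentioned above, so a per-poll count of margins under the threshold would be a natural complement.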