# Election 2016: An Exploratory Data Analysis

#### Table of Contents
1. Environment Setup
2. Loading Data
3. Scatter Plots
4. Box Plot
5. Line Graph with Error Bars
6. Bubble Chart
7. Choropleth Maps

## Environment Setup
Environment setup instructions are under Prerequisites in the [README](../master/README.md).

## Loading Data
We start off by loading the packages that we want to use.

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100) #overrides default to display up to 100 columns in dataframes
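import plotly.plotly as py #online plotting; in plotly 4+ this module lives in the separate chart_studio package (chart_studio.plotly)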
import plotly.graph_objs as go
from plotly import tools

It's time to bring in the dataset. We load it into a [Pandas](http://pandas.pydata.org/) dataframe, a standard structure for tabular data in Python, then look at its first rows, its size, and its missing values.

In [7]:
df = pd.read_csv('http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv')
df.head() #display the first five rows of dataframe

| | cycle | branch | type | matchup | forecastdate | state | startdate | enddate | pollster | grade | samplesize | population | poll_wt | rawpoll_clinton | rawpoll_trump | rawpoll_johnson | rawpoll_mcmullin | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin | multiversions | url | poll_id | question_id | createddate | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/8/16 | U.S. | 11/3/2016 | 11/6/2016 | ABC News/Washington Post | A+ | 2220.0 | lv | 8.720654 | 47.0 | 43.0 | 4.0 | | 45.20163 | 41.7243 | 4.626221 | | | https://www.washingtonpost.com/news/the-fix/wp... | 48630 | 76192 | 11/7/16 | 09:35:33 8 Nov 2016 |
| 1 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/8/16 | U.S. | 11/1/2016 | 11/7/2016 | Google Consumer Surveys | B | 26574.0 | lv | 7.628472 | 38.03 | 35.69 | 5.46 | | 43.34557 | 41.21439 | 5.175792 | | | https://datastudio.google.com/u/0/#/org//repor... | 48847 | 76443 | 11/7/16 | 09:35:33 8 Nov 2016 |
| 2 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/8/16 | U.S. | 11/2/2016 | 11/6/2016 | Ipsos | A- | 2195.0 | lv | 6.424334 | 42.0 | 39.0 | 6.0 | | 42.02638 | 38.8162 | 6.844734 | | | http://projects.fivethirtyeight.com/polls/2016... | 48922 | 76636 | 11/8/16 | 09:35:33 8 Nov 2016 |
| 3 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/8/16 | U.S. | 11/4/2016 | 11/7/2016 | YouGov | B | 3677.0 | lv | 6.087135 | 45.0 | 41.0 | 5.0 | | 45.65676 | 40.92004 | 6.069454 | | | https://d25d2506sfb94s.cloudfront.net/cumulus_... | 48687 | 76262 | 11/7/16 | 09:35:33 8 Nov 2016 |
| 4 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/8/16 | U.S. | 11/3/2016 | 11/6/2016 | Gravis Marketing | B- | 16639.0 | rv | 5.316449 | 47.0 | 43.0 | 3.0 | | 46.84089 | 42.33184 | 3.726098 | | | http://www.gravispolls.com/2016/11/final-natio... | 48848 | 76444 | 11/7/16 | 09:35:33 8 Nov 2016 |
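
The date columns (`startdate`, `enddate`) are read in as plain strings such as `11/3/2016`. If date arithmetic is needed later, they can be converted explicitly. The following is a minimal sketch on a three-row inline sample (values copied from the output above) rather than the full download; the names `sample_csv`, `sample`, and `field_days` are made up for illustration.

```python
import io
import pandas as pd

# Small inline sample with a few of the dataset's columns (illustrative only).
sample_csv = io.StringIO(
    "state,startdate,enddate,pollster\n"
    "U.S.,11/3/2016,11/6/2016,ABC News/Washington Post\n"
    "U.S.,11/1/2016,11/7/2016,Google Consumer Surveys\n"
    "U.S.,11/2/2016,11/6/2016,Ipsos\n"
)
sample = pd.read_csv(sample_csv)

# Convert the month/day/year strings into real datetime values.
for col in ["startdate", "enddate"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

# With datetimes we can subtract dates, e.g. end date minus start date in days.
sample["field_days"] = (sample["enddate"] - sample["startdate"]).dt.days
print(sample[["pollster", "field_days"]])
```

The same loop could be applied to `df` itself once it is loaded.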


In [9]:
print("Number of rows (polls): " + str(df.shape[0]))
print("Number of columns (data categories): " + str(df.shape[1]))

print("\nNumber of empty values for each column:")
print(df.isnull().sum())

Number of rows (polls): 12624
Number of columns (data categories): 27

Number of empty values for each column:
cycle                   0
branch                  0
type                    0
matchup                 0
forecastdate            0
state                   0
startdate               0
enddate                 0
pollster                0
grade                1287
samplesize              3
population              0
poll_wt                 0
rawpoll_clinton         0
rawpoll_trump           0
rawpoll_johnson      4227
rawpoll_mcmullin    12534
adjpoll_clinton         0
adjpoll_trump           0
adjpoll_johnson      4227
adjpoll_mcmullin    12534
multiversions       12588
url                     3
poll_id                 0
question_id             0
createddate             0
timestamp               0
dtype: int64
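
Raw counts of missing values are easier to judge as a share of all rows. As a rough sketch on a made-up toy dataframe (the values are hypothetical; only the column names are borrowed from the dataset), the mean of the boolean mask returned by `isnull()` gives the fraction of missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names borrowed from the dataset.
toy = pd.DataFrame({
    "adjpoll_clinton": [45.2, 43.3, 42.0, 45.7],
    "adjpoll_johnson": [4.6, np.nan, 6.8, np.nan],
    "adjpoll_mcmullin": [np.nan, np.nan, np.nan, np.nan],
})

# isnull() yields True/False per cell; the mean of booleans is the share of True.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, `rawpoll_mcmullin` is missing in 12534 of 12624 rows (about 99%), and `rawpoll_johnson` in 4227 rows (about a third).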


The dataframe has 12,624 rows (poll entries) and 27 columns. We only need some of those columns, so let's subset the dataframe to keep the ones we're interested in:

In [10]:
df2 = df.loc[:, ['type', 'state', 'enddate', 'pollster', 'grade', 'samplesize', 'population', 'poll_wt',
             'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin', 'poll_id']]
df2.head()

| | type | state | enddate | pollster | grade | samplesize | population | poll_wt | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin | poll_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | polls-plus | U.S. | 11/6/2016 | ABC News/Washington Post | A+ | 2220.0 | lv | 8.720654 | 45.20163 | 41.7243 | 4.626221 | | 48630 |
| 1 | polls-plus | U.S. | 11/7/2016 | Google Consumer Surveys | B | 26574.0 | lv | 7.628472 | 43.34557 | 41.21439 | 5.175792 | | 48847 |
| 2 | polls-plus | U.S. | 11/6/2016 | Ipsos | A- | 2195.0 | lv | 6.424334 | 42.02638 | 38.8162 | 6.844734 | | 48922 |
| 3 | polls-plus | U.S. | 11/7/2016 | YouGov | B | 3677.0 | lv | 6.087135 | 45.65676 | 40.92004 | 6.069454 | | 48687 |
| 4 | polls-plus | U.S. | 11/6/2016 | Gravis Marketing | B- | 16639.0 | rv | 5.316449 | 46.84089 | 42.33184 | 3.726098 | | 48848 |


Note: We use the adjusted poll numbers (`adjpoll_*`) instead of the raw ones (`rawpoll_*`). The adjusted values are FiveThirtyEight's corrections to each poll for factors such as pollster house effects and movement in the race since the poll was taken.

Awesome! But what is this "type" variable? We can tell from `df2.head()` that there's a type called "polls-plus", but we can't tell much else.

In [12]:
print(df2.loc[:,'type'].unique()) #display the unique values in the 'type' column

['polls-plus' 'now-cast' 'polls-only']
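
`unique()` lists the distinct values but not how many rows carry each one. A quick sketch of counting them with `value_counts()`, shown here on a made-up six-element series (the name `types` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the 'type' column.
types = pd.Series(
    ["polls-plus", "now-cast", "polls-only", "polls-only", "now-cast", "polls-plus"],
    name="type",
)

# value_counts() returns the number of rows for each distinct value.
counts = types.value_counts()
print(counts)
```

On the real data this would be `df2['type'].value_counts()`.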


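The column takes three values. These are not kinds of polls but versions of FiveThirtyEight's forecast model, each of which adjusts the same polls slightly differently. According to the [user's guide on FiveThirtyEight](https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/):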
+ **Polls-plus**: Combines polls with an economic index. Since the economic index implies that this election should be a tossup, it assumes the race will tighten somewhat.
+ **Polls-only**: A simpler, what-you-see-is-what-you-get version of the model. It assumes current polls reflect the best forecast for November, although with a lot of uncertainty.
+ **Now-cast**: A projection of what would happen in a hypothetical election held today. Much more aggressive than the other models.

We want to work with the simple adjusted poll data, not combined with other data. So we keep only the rows whose type is "polls-only" and drop the "polls-plus" and "now-cast" rows.

In [13]:
df_po = df2[df2.loc[:,'type']=='polls-only'] #keep only the rows where 'type' is 'polls-only'
df_po = df_po.reset_index(drop=True) #renumber the rows from 0; drop=True discards the old index instead of keeping it as a column
df_po.head()

| | type | state | enddate | pollster | grade | samplesize | population | poll_wt | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin | poll_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | polls-only | U.S. | 11/6/2016 | ABC News/Washington Post | A+ | 2220.0 | lv | 8.720654 | 45.21947 | 41.70754 | 4.606925 | | 48630 |
| 1 | polls-only | U.S. | 11/7/2016 | Google Consumer Surveys | B | 26574.0 | lv | 7.628472 | 43.40083 | 41.14659 | 5.164047 | | 48847 |
| 2 | polls-only | U.S. | 11/6/2016 | Ipsos | A- | 2195.0 | lv | 6.424334 | 42.01984 | 38.74365 | 6.816055 | | 48922 |
| 3 | polls-only | U.S. | 11/7/2016 | YouGov | B | 3677.0 | lv | 6.087135 | 45.68214 | 40.90047 | 6.118311 | | 48687 |
| 4 | polls-only | U.S. | 11/6/2016 | Gravis Marketing | B- | 16639.0 | rv | 5.316449 | 46.83107 | 42.27754 | 3.749071 | | 48848 |
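
The filter works through a boolean mask: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. The sketch below walks through it on a made-up four-row dataframe (the name `toy` and its contents are hypothetical) and shows why the index is reset afterwards.

```python
import pandas as pd

# Hypothetical toy frame with the same 'type' labels as the dataset.
toy = pd.DataFrame({
    "type": ["polls-plus", "now-cast", "polls-only", "polls-only"],
    "poll_id": [48630, 48630, 48630, 48847],
})

# Comparison yields a boolean Series with one entry per row.
mask = toy["type"] == "polls-only"

# Indexing with the mask keeps the matching rows and their original labels.
filtered = toy[mask]
print(filtered.index.tolist())

# reset_index(drop=True) renumbers from 0 and discards the old labels.
filtered = filtered.reset_index(drop=True)
print(filtered.index.tolist())
```

Without the reset, the surviving rows would keep their old, gappy labels, which can be confusing when selecting rows by label later.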
