In [50]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pyplot
from matplotlib import style

In [8]:
us_2016_primary_results = pd.read_csv('./data/us-2016-primary-results.csv', sep=';')
usa_2016_presidential_election_by_county = pd.read_csv('./data/usa-2016-presidential-election-by-county.csv', sep=';')

In [22]:
us_2016_primary_results.head(3)

Unnamed: 0,state,state_abbreviation,county,fips,party,candidate,votes,fraction_votes
0,Vermont,VT,Sutton,95000197.0,Republican,John Kasich,20,0.227
1,Vermont,VT,Tunbridge,95000204.0,Republican,John Kasich,36,0.319
2,Vermont,VT,Weathersfield,95000220.0,Republican,Ted Cruz,46,0.111


We first inspect the columns of the dataset and drop the ones we consider redundant: fips (Federal Information Processing Standards codes, which are numeric identifiers for states and counties) and state_abbreviation, since the state and county columns already carry the same information.

In [23]:
us_2016_primary_results.drop(columns=['state_abbreviation', 'fips'], inplace=True)
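
Before dropping a column as redundant, it can be worth confirming that it really is determined by another column. The sketch below is one possible check, not part of the original analysis: the helper name `is_redundant` and the miniature `toy` table are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical miniature table; the rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "state": ["Vermont", "Vermont", "Alabama"],
    "state_abbreviation": ["VT", "VT", "AL"],
    "county": ["Sutton", "Tunbridge", "Autauga"],
})

def is_redundant(df, key, candidate):
    """Return True if every value of `key` maps to exactly one value of `candidate`."""
    return bool(df.groupby(key)[candidate].nunique().eq(1).all())

# Each state has a single abbreviation, so the abbreviation adds no information.
print(is_redundant(toy, "state", "state_abbreviation"))
# A state contains several counties, so county is not determined by state.
print(is_redundant(toy, "state", "county"))
```

On the real data, the same call would have to run on `us_2016_primary_results` before the drop, while the state_abbreviation column is still present.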

We now check for missing values in the dataset.

In [25]:
us_2016_primary_results.isnull().sum()

state             0
county            0
party             0
candidate         0
votes             0
fraction_votes    0
dtype: int64

In [26]:
us_2016_primary_results.describe()

Unnamed: 0,votes,fraction_votes
count,24611.0,24611.0
mean,2306.252773,0.304524
std,9861.183572,0.231401
min,0.0,0.0
25%,68.0,0.094
50%,358.0,0.273
75%,1375.0,0.479
max,590502.0,1.0


Notice that the maximum vote count (590,502) is far larger than the quartiles: the 25th to 75th percentile range spans only 68 to 1,375 votes. The mean (about 2,306) also exceeds the 75th percentile, and the standard deviation (about 9,861) is several times the mean, so the distribution is strongly right-skewed. A plausible cause is that US elections often feature heavily favoured candidates and parties, which capture the majority of the votes in numerous states and counties. Differences in county size likely contribute as well, since each row holds one candidate's vote count in one county.

This phenomenon is well known: the terms 'Red States' and 'Blue States' are commonly used to characterize regions of the United States politically. Such observations of dominant candidates and parties, divided by geographic location, are often critical in assessing and predicting the outcome of an election.
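
The skew visible in the summary statistics can be quantified rather than judged by eye. The following is a minimal sketch, assuming a numeric pandas Series such as the votes column. The helper name `skew_summary` and the sample values are hypothetical and chosen only to make the example self-contained.

```python
import pandas as pd

def skew_summary(values):
    """Summarize how far a numeric Series departs from a symmetric distribution."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    return {
        "iqr": q3 - q1,                            # spread of the middle 50% of the data
        "mean_over_median": values.mean() / median,
        "max_over_q3": values.max() / q3,          # how far the largest value sits above the bulk
        "skewness": values.skew(),                 # positive values indicate a long right tail
    }

# Illustrative values only: four small counts and one very large one.
sample = pd.Series([10, 20, 30, 40, 1000])
summary = skew_summary(sample)
print(summary)
```

Applied to the notebook's data, the call would be `skew_summary(us_2016_primary_results["votes"])`. A mean well above the median, together with a positive skewness, would be consistent with the right-skew described above.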

Let us continue with this intuition by analyzing the candidates, the parties, and their associated vote counts.

In [27]:
print(us_2016_primary_results.candidate.unique())

['John Kasich' 'Ted Cruz' 'Ben Carson' 'Donald Trump' 'Marco Rubio'
 'Hillary Clinton' 'Bernie Sanders' "Martin O'Malley" 'Uncommitted'
 'Carly Fiorina' 'Chris Christie' 'Mike Huckabee' 'Rick Santorum'
 'Jeb Bush' 'Rand Paul' 'No Preference']


In [28]:
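# Select the numeric columns explicitly so that text columns (state, county, party) are not summed.
candidate_votes = us_2016_primary_results.groupby("candidate")[["votes", "fraction_votes"]].sum()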
print(candidate_votes)

                    votes  fraction_votes
candidate                                
Ben Carson         564553       98.373066
Bernie Sanders   11959102     2074.393879
Carly Fiorina       15191        2.408571
Chris Christie      24353        1.937211
Donald Trump     13302541     1671.854969
Hillary Clinton  15692452     1939.776121
Jeb Bush            94411        6.901265
John Kasich       4159949      440.609220
Marco Rubio       3321076      375.479603
Martin O'Malley       752        0.822000
Mike Huckabee        3345        2.395000
No Preference        8152        2.276000
Rand Paul            8479        3.372000
Rick Santorum        1782        0.986000
Ted Cruz          7603006      873.022095
Uncommitted            43        0.045000


At first glance, Donald Trump, Bernie Sanders, and Hillary Clinton are the dominant candidates, while Ben Carson, John Kasich, Marco Rubio, and Ted Cruz also hold a "meaningful" share of the votes cast in the primaries.

In [30]:
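# Select the numeric columns explicitly so that text columns (state, county, candidate) are not summed.
party_votes = us_2016_primary_results.groupby("party")[["votes", "fraction_votes"]].sum()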
print(party_votes)

               votes  fraction_votes
party                               
Democrat    27660501        4017.313
Republican  29098686        3477.339


It is interesting to notice that the total vote counts of the two parties are relatively balanced. Neither party significantly outweighs the other.

With Hillary Clinton and Bernie Sanders belonging to the Democratic Party and Donald Trump to the Republican Party, one might infer that the Democrats are favoured. However, Ben Carson, John Kasich, Marco Rubio, and Ted Cruz also belong to the Republican Party, which suggests that Republican voters are spread across a wider field of candidates, each with a recognizable number of supporters. (No one or two candidates capture nearly all of the votes in the Republican Party.)
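
One way to make this comparison concrete is to compute each candidate's share of their own party's total vote. The sketch below assumes a table with party, candidate, and votes columns, as in the dataset above. The helper name `share_within_party` and the toy rows are hypothetical and not taken from the data.

```python
import pandas as pd

def share_within_party(df):
    """Total votes per candidate, plus each candidate's share of their party's total."""
    totals = df.groupby(["party", "candidate"], as_index=False)["votes"].sum()
    party_totals = totals.groupby("party")["votes"].transform("sum")
    totals["party_share"] = totals["votes"] / party_totals
    return totals.sort_values(["party", "party_share"], ascending=[True, False])

# Made-up rows for illustration only: party X is concentrated, party Y is spread out.
toy = pd.DataFrame({
    "party": ["X", "X", "X", "Y", "Y", "Y"],
    "candidate": ["x1", "x2", "x1", "y1", "y2", "y3"],
    "votes": [60, 40, 100, 50, 30, 20],
})
shares = share_within_party(toy)
print(shares)
```

Calling `share_within_party(us_2016_primary_results)` would show whether the Republican vote is indeed divided among more candidates than the Democratic vote.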

In [126]:
# pd.set_option('display.max_rows', None) expands the output to show the full list of rows;
# pd.reset_option('display.max_rows') restores the default, truncated display.
pd.reset_option('display.max_rows')
# Note: "count" tallies the number of rows (county-candidate records) per state and party, not the votes cast.
state_votes = us_2016_primary_results.groupby(['state', 'party']).agg({'votes': "count"})
print(state_votes)

                          votes
state         party            
Alabama       Democrat      134
              Republican    335
Alaska        Democrat       80
              Republican    200
Arizona       Democrat       30
...                         ...
West Virginia Republican    165
Wisconsin     Democrat      144
              Republican    216
Wyoming       Democrat       46
              Republican     48

[95 rows x 1 columns]


In [137]:
print(state_votes.reset_index())

            state       party  votes
0         Alabama    Democrat    134
1         Alabama  Republican    335
2          Alaska    Democrat     80
3          Alaska  Republican    200
4         Arizona    Democrat     30
..            ...         ...    ...
90  West Virginia  Republican    165
91      Wisconsin    Democrat    144
92      Wisconsin  Republican    216
93        Wyoming    Democrat     46
94        Wyoming  Republican     48

[95 rows x 3 columns]
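
Because the "count" aggregation returns the number of records rather than the number of votes, it can help to compute both side by side. The sketch below uses pandas named aggregation for this. The helper name `rows_and_votes_by_state_party`, the output column names records and total_votes, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

def rows_and_votes_by_state_party(df):
    """Per (state, party): the number of records and the total votes cast."""
    return df.groupby(["state", "party"]).agg(
        records=("votes", "count"),    # how many rows fall in the group
        total_votes=("votes", "sum"),  # how many votes those rows add up to
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state": ["S1", "S1", "S1", "S2"],
    "party": ["P", "P", "Q", "P"],
    "votes": [10, 5, 7, 3],
})
summary = rows_and_votes_by_state_party(toy)
print(summary)

# Pivot the parties into columns for a side-by-side comparison within each state.
print(summary["total_votes"].unstack("party"))
```

A state and party pair that is absent from the data appears as NaN after `unstack`. That would be expected for any state with records for only one party, which the odd total of 95 rows above suggests is the case here.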
