# Milestone 2 Assignment - Capstone Check-in

## Author - Elizabeth Lunsford

### Capstone Project Instructions
Select a problem and data sets of particular interest and apply the analytics process to find and report on a solution.

Students will construct a simple dashboard to allow a non-technical user to explore their solution. The data should be read from suitable persistent data storage, such as an Internet URL or a SQL database.

The process followed by the students and the grading criteria include:
<ol style="list-style-type: lower-alpha;">
<li>Understand the business problem <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Evaluate and explore the available data <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Proper data preparation <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Exploration of data and understand relationships <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Perform basic analytics and machine learning, within the scope of the course, on the data.  <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span> <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span> <BR/>For example, classification to predict which employees are most likely to leave the company.</li>
<li>Create a written and/or oral report on the results suitable for a non-technical audience. <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span></li>
</ol>



## Tasks
<img src="https://library.startlearninglabs.uw.edu/DATASCI420/img/Milestone2Sample.PNG" style="float: right; width: 400px;">
For this check-in, you are to:

1). Explicitly state the problem, list sources, and define the methodology: classification, regression, other

2). List data processing steps (pseudocode), including steps from data source collection & preparation, feature engineering & selection, modeling, performance evaluation.

3). Read in the previously generated data file of cleaned up data

4). Perform feature engineering and selection

5). Conduct some preliminary modeling 

6). Identify potential machine learning model(s) to improve performance


## Project Goal

### Project Goal
The goal of this project is to predict whether a state votes for a Republican or Democratic candidate.

### Problem Statement
How much do a state's social demographics influence how the state votes?  
 
### Problem Definition
How the United States of America is governed is a complex topic. Describing it simply as a democracy in which people vote for elected officials overgeneralizes. The country is a constitutional federal republic governed by three branches: executive, legislative, and judicial. The judicial branch ensures that laws put in place by the executive or legislative branches are in accordance with the Constitution.

Each state has its own constitution and a government that is also structured into executive, legislative, and judicial branches. A republic is a form of government in which the people elect representatives. In the United States, people elect officials to the legislative branches by popular vote. The federal executive branch, which includes the President and Vice President of the United States, is elected by the Electoral College. Electors in the Electoral College come mainly from two political parties: the Democratic Party and the Republican Party.

Citizens are not required to vote or to align themselves with a political party. Voter turnout therefore varies, and a state's political majority can change at every election. To help understand how the United States is governed, we will study whether social demographics influence how people vote. Using machine learning algorithms, we will create a model to predict each state's range of voter turnout and which party will win in federal elections.

### List of Raw Data Sources
* United States Presidential Election Results Data Set: https://uselectionatlas.org/RESULTS/
* United States Legislative Branch Party Composition: https://www.kaggle.com/kiwiphrases/partystrengthbystate#states_party_strength_cleaned.csv
* Employment / unemployment data: https://data.bls.gov/PDQWeb/la
* Poverty and income estimates data: https://www.census.gov/data-tools/demo/saipe/saipe.html?s_appName=saipe&menu=grid_proxy&s_USStOnly=y&map_yearSelector=2016&map_geoSelector=aa_c&s_measures=aa_snc
* Education Spending per student, 2016 dollars, by state, 2007-2016, (Pre-K through 12): http://www.governing.com/gov-data/education-data/state-education-spending-per-pupil-data.html
* National Public Education Financial Survey Data (Pre-K through 12): https://nces.ed.gov/ccd/stfis.asp

### List of Data Sources from Milestone 1 to use for Milestone 2
* political_data.csv - Election data
* final_dataset.csv - Demographic data
* state_edu_spending.csv - Educational Finance data

### Methodology
* Classification: predict whether a state votes Republican or Democratic.

## Data Processing Steps
### List data processing steps (pseudocode), including steps from data source collection & preparation, feature engineering & selection, modeling, performance evaluation.
1. Data Source Collection and Preparation
   1. Load the three data files into dataframes
      1. Process each dataframe so the state and year columns have the same names
      2. Merge the dataframes on state and year
   2. Analyze and process the combined dataframe
      1. Drop duplicate entries, create key fields, and create flag field noting combine participation
   3. Clean the target value
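
The collection and preparation steps above can be sketched as follows. This is a minimal sketch on tiny in-memory frames with made-up values rather than the three CSV files; the helper name `normalize_keys`, the inner join, and the one-to-one check are assumptions, not necessarily the choices made later in this notebook.

```python
import pandas as pd


def normalize_keys(df, state_col="state", year_col="year"):
    """Return a copy with lowercase 'state' and integer 'year' key columns (hypothetical helper)."""
    out = df.rename(columns={state_col: "state", year_col: "year"}).copy()
    out["state"] = out["state"].str.strip().str.lower()
    out["year"] = out["year"].astype(int)
    return out


# Toy stand-ins (made-up values) for the political, demographic, and education files
political = pd.DataFrame({"state": ["Ohio", "Ohio", "Texas"], "year": [2016, 2016, 2016], "d_won": [0, 0, 0]})
demographic = pd.DataFrame({"State": ["ohio", "texas"], "Year": [2016, 2016], "unemployment rate": [5.0, 4.6]})
education = pd.DataFrame({"state": ["OHIO", "TEXAS"], "year": [2016, 2016], "per_pupil_expense": [12000.0, 9000.0]})

frames = [
    normalize_keys(political),
    normalize_keys(demographic, "State", "Year"),
    normalize_keys(education),
]

# Drop duplicate (state, year) rows before merging so the join stays one-to-one
frames = [f.drop_duplicates(subset=["state", "year"]) for f in frames]

# Inner join keeps only state/year pairs present in all three sources (an assumption)
merged = frames[0]
for f in frames[1:]:
    merged = merged.merge(f, on=["state", "year"], how="inner", validate="one_to_one")

print(merged)
```

Passing `validate="one_to_one"` makes the merge fail loudly if duplicate keys slip through, which is easier to debug than silently multiplied rows.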

### Read in the previously generated data file of cleaned up data

In [129]:
# Import libraries
import pandas as pd
import numpy as np
import pandas_profiling

# Read data files into dataframes
df_p_a = pd.read_csv("political_data.csv", header=0)
df_d = pd.read_csv("final_dataset.csv", header=0)
df_e = pd.read_csv("state_edu_spending.csv", header=0)

### View political data

In [130]:
# Most political attributes leak the target, since they are the results of
# other elections for political seats, so keep only the following columns.
# .copy() makes df_p an independent DataFrame, so the assignments below
# do not raise SettingWithCopyWarning.
df_p = df_p_a[['year', 'state', 'population', 'total_vap', 'perc_vap', 'perc_reg', 'd_placed', 'r_placed']].copy()

# Keep first place as 1 and convert second place (2) to 0, so 1 means won and 0 means lost.
# A value of 3 means a third party placed first or second, so this party did not win: also 0.
df_p['d_placed'] = df_p['d_placed'].replace({2: 0, 3: 0})
df_p['r_placed'] = df_p['r_placed'].replace({2: 0, 3: 0})

# Target: 1 if the Democratic candidate won the state, otherwise 0
df_p['d_won'] = df_p['d_placed']
df_p = df_p.drop(['d_placed', 'r_placed'], axis=1)

# View political data
pandas_profiling.ProfileReport(df_p)

Dataset overview:

| Statistic | Value |
|---|---|
| Number of variables | 7 |
| Number of observations | 535 |
| Total missing (%) | 0.0% |
| Total size in memory | 29.3 KiB |
| Average record size in memory | 56.1 B |

Variable types:

| Numeric | Categorical | Boolean | Date | Text (unique) | Rejected | Unsupported |
|---|---|---|---|---|---|---|
| 4 | 1 | 1 | 0 | 0 | 1 | 0 |

Variables (in the report's alphabetical order):

| Variable | Type | Distinct count | Unique (%) | Missing (%) | Zeros (%) | Notes |
|---|---|---|---|---|---|---|
| d_won | Boolean | 2 | 0.4% | 0.0% | | Mean 0.39813; 0: 322 rows (60.2%), 1: 213 rows (39.8%) |
| perc_reg | Numeric | 240 | 44.9% | 0.0% | 0.0% | |
| perc_vap | Numeric | 271 | 50.7% | 0.0% | 0.0% | |
| population | Numeric | 491 | 91.8% | 0.0% | 0.0% | |
| state | Categorical | 51 | 9.5% | 0.0% | | Most frequent: louisiana, oregon, california (14 rows each) |
| total_vap | Rejected | | | | | Correlation 0.98262 |
| year | Numeric | 10 | 1.9% | 0.0% | 0.0% | |

Numeric variable statistics:

| Variable | Mean | Std dev | Min | 5% | Q1 | Median | Q3 | 95% | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| perc_reg | 70.914 | 7.5809 | 51.4 | 58.87 | 66.1 | 70.8 | 75.7 | 82.2 | 104.7 | 0.43101 | 1.503 |
| perc_vap | 55.451 | 7.4094 | 35.1 | 43.4 | 50.05 | 55.6 | 61.05 | 67.3 | 74.7 | -0.006587 | -0.50573 |
| population | 5685100 | 6490600 | 401850 | 608750 | 1444800 | 3695400 | 6436400 | 19324000 | 39250000 | 2.5564 | 7.7341 |
| year | 1998.4 | 11.292 | 1980 | 1980 | 1988 | 2000 | 2008 | 2016 | 2016 | -0.086958 | -1.1894 |

Rows per year:

| Year | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| Count | 50 | 50 | 50 | 51 | 53 | 56 | 57 | 66 | 53 | 49 |
| Frequency (%) | 9.3% | 9.3% | 9.3% | 9.5% | 9.9% | 10.5% | 10.7% | 12.3% | 9.9% | 9.2% |

Sample rows:

Unnamed: 0,year,state,population,total_vap,perc_vap,perc_reg,d_won
0,2016,arizona,6931071.0,2650448.0,47.136364,63.7,0
1,2016,arkansas,2988248.0,1716395.0,50.254545,64.2,0
2,2016,california,39250017.0,20203290.0,50.263636,73.3,1
3,2016,colorado,5540545.0,2483406.0,58.781818,72.4,1
4,2016,connecticut,3576452.0,2381272.0,61.490909,69.8,1


### View educational data

In [131]:
pandas_profiling.ProfileReport(df_e)

Dataset overview:

| Statistic | Value |
|---|---|
| Number of variables | 7 |
| Number of observations | 946 |
| Total missing (%) | 0.0% |
| Total size in memory | 51.8 KiB |
| Average record size in memory | 56.1 B |

Variable types:

| Numeric | Categorical | Boolean | Date | Text (unique) | Rejected | Unsupported |
|---|---|---|---|---|---|---|
| 4 | 1 | 0 | 0 | 0 | 2 | 0 |

Variables (in the report's alphabetical order):

| Variable | Type | Distinct count | Unique (%) | Missing (%) | Zeros (%) | Notes |
|---|---|---|---|---|---|---|
| instruction_expense | Rejected | | | | | Correlation 0.99477 |
| per_pupil_expense | Numeric | 941 | 99.5% | 0.0% | 0.0% | |
| property_expense | Numeric | 894 | 94.5% | 0.0% | 0.2% | |
| state | Categorical | 59 | 6.2% | 0.0% | | Most frequent values have 17 rows each (e.g. Colorado, Oklahoma, Georgia) |
| total_edu_expense | Rejected | | | | | Correlation 0.99397 |
| total_revenue | Numeric | 895 | 94.6% | 0.0% | 0.0% | |
| year | Numeric | 17 | 1.8% | 0.0% | 0.0% | 2000: 55 rows; 2016: 51 rows; other listed years: 56 rows each |

Numeric variable statistics:

| Variable | Mean | Std dev | Min | 5% | Q1 | Median | Q3 | 95% | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| per_pupil_expense | 11055 | 3462.9 | -1.3341 | 6580.0 | 8907.6 | 10374.0 | 12571.0 | 18059.0 | 22520.0 | 0.71741 | 1.0978 |
| property_expense | 284640000 | 349890000 | -1.3552 | 4083100.0 | 60172000.0 | 181850000.0 | 358790000.0 | 992950000.0 | 2833400000.0 | 2.9815 | 12.632 |
| total_revenue | 11097000000 | 13987000000 | -1.3552 | 238700000.0 | 2609900000.0 | 6312900000.0 | 12582000000.0 | 44509000000.0 | 80515000000.0 | 2.5997 | 7.4715 |
| year | 2008 | 4.8755 | 2000 | 2000 | 2004 | 2008 | 2012 | 2016 | 2016 | 0.002119 | -1.2044 |

Sample rows:

Unnamed: 0,year,state,total_revenue,instruction_expense,property_expense,total_edu_expense,per_pupil_expense
0,2000,Alabama,6734880000.0,3592552000.0,224927800.0,6879115000.0,7314.254756
1,2000,Alaska,1895197000.0,923974400.0,72375200.0,1916757000.0,12751.12796
2,2000,Arizona,7670325000.0,3631073000.0,698406000.0,7788936000.0,7102.916696
3,2000,Arkansas,3805996000.0,2017782000.0,134297400.0,3663107000.0,7403.965385
4,2000,California,62828740000.0,33217650000.0,2263272000.0,61966320000.0,8563.135731


In [132]:
# instruction_expense is highly correlated with total_edu_expense, so it can be dropped
df_e = df_e.drop(['instruction_expense'], axis =1 )
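
Dropping one column of a highly correlated pair can also be done programmatically. The following is a minimal sketch on made-up toy data; the helper name `highly_correlated_columns` and the 0.9 threshold are assumptions for illustration, and this is not necessarily the rule pandas_profiling applies when it marks a variable as rejected.

```python
import numpy as np
import pandas as pd


def highly_correlated_columns(df, threshold=0.9):
    """List numeric columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (made-up values): b is roughly 2 * a, c is only weakly related to either
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

to_drop = highly_correlated_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Because only the later column of each pair is flagged, column order decides which of two correlated features survives, so it is worth ordering columns with the preferred feature first.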

### View demographic data

In [133]:
pandas_profiling.ProfileReport(df_d)

Dataset overview:

| Statistic | Value |
|---|---|
| Number of variables | 18 |
| Number of observations | 2184 |
| Total missing (%) | 0.2% |
| Total size in memory | 307.2 KiB |
| Average record size in memory | 144.0 B |

Variable types:

| Numeric | Categorical | Boolean | Date | Text (unique) | Rejected | Unsupported |
|---|---|---|---|---|---|---|
| 6 | 1 | 0 | 0 | 0 | 11 | 0 |

Rejected variables (in the report's alphabetical order):

| Variable | Correlation |
|---|---|
| Ages 5 to 17 in Families SAIPE Poverty Universe | 0.9794 |
| Ages 5 to 17 in Families in Poverty Count | 0.97833 |
| Ages 5 to 17 in Families in Poverty Percent | 0.99431 |
| All Ages SAIPE Poverty Universe | 0.97284 |
| All Ages in Poverty Count | 0.97932 |
| Under Age 18 SAIPE Poverty Universe | 0.9774 |
| Under Age 18 in Poverty Count | 0.9802 |
| Under Age 18 in Poverty Percent | 0.97703 |
| Under Age 5 SAIPE Poverty Universe | 0.97919 |
| Under Age 5 in Poverty Count | 0.97639 |
| Under Age 5 in Poverty Percent | 0.94626 |

Remaining variables:

| Variable | Type | Distinct count | Unique (%) | Missing (%) | Missing (n) | Zeros (%) | Notes |
|---|---|---|---|---|---|---|---|
| All Ages in Poverty Percent | Numeric | 274 | 12.5% | 1.9% | 42 | 0.0% | |
| Median Household Income in Dollars | Numeric | 1414 | 64.7% | 1.9% | 42 | 0.0% | |
| State | Categorical | 52 | 2.4% | 0.0% | 0 | | Most frequent values have 42 rows each (e.g. Colorado, Massachusetts, Delaware) |
| Unnamed: 0 | Numeric | 2184 | 100.0% | 0.0% | 0 | 0.0% | Row counter, 0 to 2183 |
| Year | Numeric | 42 | 1.9% | 0.0% | 0 | 0.0% | Listed years have 52 rows each |
| labor force | Numeric | 2184 | 100.0% | 0.0% | 0 | 0.0% | |
| unemployment rate | Numeric | 138 | 6.3% | 0.0% | 0 | 0.0% | |

Numeric variable statistics:

| Variable | Mean | Std dev | Min | 5% | Q1 | Median | Q3 | 95% | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| All Ages in Poverty Percent | 13.269 | 3.3582 | 5.6 | 8.75 | 10.7 | 12.6 | 15.7 | 19.1 | 24.6 | 0.57197 | -0.25561 |
| Median Household Income in Dollars | 43265 | 10234 | 19874 | 27518 | 36131 | 42568 | 49247 | 62070 | 78787 | 0.51238 | 0.32294 |
| Unnamed: 0 | 1091.5 | 630.61 | 0.0 | 109.15 | 545.75 | 1091.5 | 1637.2 | 2073.8 | 2183.0 | 0 | -1.2 |
| Year | 1996.5 | 12.124 | 1976.0 | 1978.0 | 1986.0 | 1996.5 | 2007.0 | 2015.0 | 2017.0 | 0 | -1.2014 |
| labor force | 2594600 | 2884700 | 163570 | 312670 | 697840 | 1716100 | 3125600 | 8675900 | 19311958 | 2.6214 | 8.8 |
| unemployment rate | 6.1943 | 2.444 | 2.3 | 3.2 | 4.6 | 5.7 | 7.3 | 10.5 | 23.4 | 1.8073 | 6.3415 |

Sample rows:

Unnamed: 0.1,Unnamed: 0,State,Year,labor force,unemployment rate,All Ages SAIPE Poverty Universe,All Ages in Poverty Count,All Ages in Poverty Percent,Under Age 18 SAIPE Poverty Universe,Under Age 18 in Poverty Count,Under Age 18 in Poverty Percent,Ages 5 to 17 in Families SAIPE Poverty Universe,Ages 5 to 17 in Families in Poverty Count,Ages 5 to 17 in Families in Poverty Percent,Under Age 5 SAIPE Poverty Universe,Under Age 5 in Poverty Count,Under Age 5 in Poverty Percent,Median Household Income in Dollars
0,0,Alabama,1976,1501284,6.8,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
1,1,Alabama,1977,1568504,7.3,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
2,2,Alabama,1978,1621710,6.4,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
3,3,Alabama,1979,1656358,7.2,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
4,4,Alabama,1980,1669289,8.9,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0


In [134]:
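# Keep the key columns and the demographic features that were not rejected as highly correlated.
# .copy() makes df_d an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning.
df_d = df_d[['All Ages in Poverty Percent', 'Median Household Income in Dollars', 'State', 'Year', 'labor force', 'unemployment rate']].copy()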

In [138]:
df_d['state'] = df_d['State']
df_d['year']  = df_d['Year']

df_d = df_d.drop('State', axis=1)
df_d = df_d.drop('Year', axis =1)
df_d.dtypes

All Ages in Poverty Percent           float64
Median Household Income in Dollars    float64
labor force                             int64
unemployment rate                     float64
state                                  object
year                                    int64
dtype: object

### Clean the state values

In [141]:
# Make all state values lowercase
df_p['state'] = df_p.state.str.lower()
df_d['state'] = df_d.state.str.lower()
df_e['state'] = df_e.state.str.lower()

# Check state data in political dataset
sorted(df_p.state.unique())

['alabama',
 'alaska',
 'arizona',
 'arkansas',
 'california',
 'colorado',
 'connecticut',
 'd.c.',
 'delaware',
 'florida',
 'georgia',
 'hawaii',
 'idaho',
 'illinois',
 'indiana',
 'iowa',
 'kansas',
 'kentucky',
 'louisiana',
 'maine',
 'maryland',
 'massachusetts',
 'michigan',
 'minnesota',
 'mississippi',
 'missouri',
 'montana',
 'nebraska',
 'nevada',
 'new hampshire',
 'new jersey',
 'new mexico',
 'new york',
 'north carolina',
 'north dakota',
 'ohio',
 'oklahoma',
 'oregon',
 'pennsylvania',
 'rhode island',
 'south carolina',
 'south dakota',
 'tennessee',
 'texas',
 'utah',
 'vermont',
 'virginia',
 'washington',
 'west virginia',
 'wisconsin',
 'wyoming']

In [143]:
# States in the demographic dataset that are missing from the political dataset
print(np.setdiff1d(df_d.state.unique(), df_p.state.unique()))

# States in the educational dataset that are missing from the political dataset
print(np.setdiff1d(df_e.state.unique(), df_p.state.unique()))

['district of columbia' 'puerto rico']
['american samoa' 'american somoa' 'district of columbia' 'guam'
 'northern mariana islands' 'northern marianas' 'puerto rico'
 'virgin islands']


In [144]:
# Make the d.c. value the same in all three datasets
df_p['state'] = df_p.state.replace('d.c.', 'district of columbia')
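
The comparison output also shows two spelling variants in the educational data ('american somoa', 'northern marianas') and several territories with no election results. One way to handle both is sketched below on toy Series; the `STATE_ALIASES` table and the `harmonize_states` helper are hypothetical, and dropping the territories is an assumption about what the merge should keep.

```python
import numpy as np
import pandas as pd

# Hypothetical alias table: d.c. plus the two spelling variants seen in the education data
STATE_ALIASES = {
    "d.c.": "district of columbia",
    "american somoa": "american samoa",
    "northern marianas": "northern mariana islands",
}


def harmonize_states(states):
    """Lowercase, trim, and map known aliases to one canonical spelling."""
    cleaned = states.str.strip().str.lower()
    return cleaned.replace(STATE_ALIASES)


# Toy stand-ins for the state columns of the political and educational data
political_states = pd.Series(["Ohio", "D.C.", "Texas"])
education_states = pd.Series(["ohio", "District of Columbia", "American Somoa", "Guam", "texas"])

p = harmonize_states(political_states)
e = harmonize_states(education_states)

# Territories with no election results remain after harmonizing
extra = np.setdiff1d(e.unique(), p.unique())
print(extra)

# Keep only rows whose state also appears in the political data
e_kept = e[e.isin(p)]
print(sorted(e_kept))
```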

### Clean the year data up

In [146]:
print(
'  political data', df_p.year.max(), df_p.year.min(), '\n',
'educational data', df_e.year.max(), df_e.year.min(), '\n',
'demographic data', df_d.year.max(), df_d.year.min())

  political data 2016 1980 
 educational data 2016 2000 
 demographic data 2017 1976
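
The printed ranges determine which years can survive a merge on year. As a short worked example, assuming only years present in all three datasets are kept and that the political data contains presidential election years only (every four years from 1980), the usable window and election years work out as follows.

```python
# Year ranges printed above, as (min, max) for each dataset
ranges = {
    "political": (1980, 2016),
    "educational": (2000, 2016),
    "demographic": (1976, 2017),
}

# The shared window starts at the latest minimum and ends at the earliest maximum
start = max(lo for lo, _ in ranges.values())
end = min(hi for _, hi in ranges.values())

# Presidential elections fall every four years; 1980 is the first year in the political data
election_years = [y for y in range(1980, end + 1, 4) if y >= start]

print(start, end)
print(election_years)
```

Under these assumptions the educational data is the limiting source, leaving five elections per state for modeling.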


## Feature Engineering and Selection

In [None]:
# Data file location


In [None]:
# Import libraries


## Preliminary Data Model



### Model Performance Evaluation

## Improved Machine Learning Model(s)