# Computational Social Science Project #2 

*Group number:* 

*Group members:*   

*Semester:* Fall 2021


Below we fill in some of the code you might use to answer some of the questions. Here are some additional resources for when you get stuck:
* Code and documentation provided in the course notebooks  
* [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) to help with formatting the Jupyter notebook
* Try Googling any errors you get and consult Stack Overflow, etc. Someone has probably had your question before!
* Send KQ a pull request on GitHub flagging the syntax that's tripping you up 

## 1. Introduction/Setup

#### a) Import relevant libraries
Add the other libraries you need for your code below and/or as you go. 

In [31]:
# import libraries you might need here 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# use random seed for consistent results 
np.random.seed(273)

#### b) Read in and inspect data frame 
Read in the data frame and look at some of its attributes. 

In [32]:
diabetes = pd.read_csv('Diabetes with Population Info by County 2017.csv', 
                       # CountyFIPS needs to be read as a string so the leading 0 isn't dropped (this only matters if you want to make a choropleth map):
                       dtype={"CountyFIPS": str}) 
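To see why the `dtype` argument matters, here is a minimal sketch on a made-up two-row CSV (the `toy_csv` buffer is a hypothetical stand-in for the real file, not part of the assignment data). Without the argument, pandas infers an integer column and the leading zero disappears.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
toy_csv = io.StringIO(
    "CountyFIPS,County\n01001,Autauga County\n56043,Washakie County\n"
)

as_default = pd.read_csv(toy_csv)
toy_csv.seek(0)  # rewind the buffer so it can be read a second time
as_string = pd.read_csv(toy_csv, dtype={"CountyFIPS": str})

print(as_default["CountyFIPS"].tolist())  # parsed as integers, leading zero lost
print(as_string["CountyFIPS"].tolist())   # kept as 5-character strings
```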

In [33]:
# look at the dimensions of the diabetes data frame
print('shape: ', diabetes.shape) 

shape:  (3220, 95)


In [34]:
pd.set_option('display.max_rows', 100) # tells pandas how many rows to display when printing so results don't get truncated

# look at the data types for each column in diabetes df 
print('data types:', diabetes.dtypes)

data types: County                                                                                                                                                   object
State                                                                                                                                                    object
CountyFIPS                                                                                                                                               object
Diabetes_Number                                                                                                                                          object
Obesity_Number                                                                                                                                           object
Physical_Inactivity_Number                                                                                                                               object
sex and age_total population

Immediately, we see that some of the features that should be numeric (e.g., Diabetes_Number, Obesity_Number, and Physical_Inactivity_Number) are stored as objects instead. To see which non-numeric values a column contains, we can combine `str.isnumeric()` with `unique()`.

In [35]:
# Return rows where the column "Diabetes_Number" is non-numeric and get the unique values of these rows
# the "~" below in front of diabetes negates the str.isnumeric() so it only takes non-numeric values
print(diabetes[~diabetes['Diabetes_Number'].str.isnumeric()]['Diabetes_Number'].unique()) 

['Suppressed']


In [36]:
# Now do the same as above, but for "Obesity_Number" :
print(diabetes[~diabetes['Obesity_Number'].str.isnumeric()]['Obesity_Number'].unique()) 

['No Data']


In [37]:
print(diabetes[~diabetes['Physical_Inactivity_Number'].str.isnumeric()]['Physical_Inactivity_Number'].unique()) 

['No Data']


The values that make these three columns objects (rather than integers) are placeholder strings such as "No Data" and "Suppressed". Let's deal with those rows in the next section and recode the columns as numeric.

#### c) Recode variables

Convert 'Diabetes_Number', 'Obesity_Number', and 'Physical_Inactivity_Number' to integers below so we can use them in our analysis. Also fill in the data type we want to recode 'sex and age_total population_65 years and over_sex ratio (males per 100 females)' to. 

In [38]:
# Diabetes
# keep only useful info about our target feature, i.e., where diabetes_number not = 'Suppressed'
diabetes = diabetes[diabetes['Diabetes_Number']!="Suppressed"]  # note that the inside reference to the diabetes df identifies the column, and the outer calls specific rows according to a condition 

# use the astype method on Diabetes_Number to convert it to an integer...if you are not sure, what does the astype() documentation tell you are possible arguments? 
diabetes['Diabetes_Number'] = diabetes['Diabetes_Number'].astype('int64') 

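# Obesity
# note: assigning the filtered, converted values back aligns on the index, so rows with "No Data" are kept and become NaN (the column ends up float, not int)
diabetes['Obesity_Number'] = diabetes[diabetes['Obesity_Number']!= "No Data"]['Obesity_Number'].dropna().astype('int64') 

# Physical Inactivity (same pattern, same caveat as Obesity)
diabetes['Physical_Inactivity_Number'] = diabetes[diabetes['Physical_Inactivity_Number']!= "No Data"]['Physical_Inactivity_Number'].dropna().astype('int64') 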

# 65+ sex ratio had one "-" in it so let's drop that row first
diabetes = diabetes[diabetes['sex and age_total population_65 years and over_sex ratio (males per 100 females)']!= "-"]
# change to numeric (specifically, integer or float?) from string (because originally included the "-" )
diabetes['sex and age_total population_65 years and over_sex ratio (males per 100 females)'] = diabetes['sex and age_total population_65 years and over_sex ratio (males per 100 females)'].dropna().astype(float)
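The recoding cell mixes two patterns: filtering the whole data frame (used for 'Diabetes_Number') and assigning a filtered column back (used for 'Obesity_Number'). The sketch below, on a hypothetical three-row frame, illustrates how the two behave differently; it is not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy column with one placeholder string
toy = pd.DataFrame({"Obesity_Number": ["10", "No Data", "30"]})

# Pattern 1: convert only the valid entries, then assign back.
# Index alignment leaves NaN where "No Data" was, so the column becomes float.
valid = toy[toy["Obesity_Number"] != "No Data"]["Obesity_Number"].astype("int64")
aligned = toy.copy()
aligned["Obesity_Number"] = valid

# Pattern 2: drop the rows first, then convert; the column stays integer.
dropped = toy[toy["Obesity_Number"] != "No Data"].copy()
dropped["Obesity_Number"] = dropped["Obesity_Number"].astype("int64")

print(aligned["Obesity_Number"].dtype, len(aligned))
print(dropped["Obesity_Number"].dtype, len(dropped))
```

Which pattern is preferable depends on whether the rows with missing values should stay in the analysis.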


We should probably scale our count variables to be proportional to county population. We create the list 'rc_cols' to select all the features we want to rescale, and then use the `.div()` method to avoid typing out every single column we want to recode. 

In [40]:
diabetes

Unnamed: 0,County,State,CountyFIPS,Diabetes_Number,Obesity_Number,Physical_Inactivity_Number,sex and age_total population_65 years and over_sex ratio (males per 100 females),race_total population,race_total population_one race,race_total population_two or more races,...,sex and age_total population_21 years and over,sex and age_total population_62 years and over,sex and age_total population_65 years and over,sex and age_total population_18 years and over_1,sex and age_total population_18 years and over_male,sex and age_total population_18 years and over_female,sex and age_total population_18 years and over_sex ratio (males per 100 females),sex and age_total population_65 years and over_1,sex and age_total population_65 years and over_male,sex and age_total population_65 years and over_female
0,Autauga County,Alabama,01001,5178,13489.0,12525.0,76.5,55200,54170,1030,...,39614,9403,8050,41831,20054,21777,92.1,8050,3488,4562
1,Baldwin County,Alabama,01003,16707,48542.0,40024.0,86.6,208107,204535,3572,...,155502,49811,40665,162430,78104,84326,92.6,40665,18870,21795
2,Barbour County,Alabama,01005,3400,8009.0,5431.0,76.2,25782,25429,353,...,19394,5658,4634,20346,10923,9423,115.9,4634,2004,2630
3,Bibb County,Alabama,01007,2371,6534.0,5827.0,80.5,22527,22340,187,...,16864,4323,3661,17868,9578,8290,115.5,3661,1633,2028
4,Blount County,Alabama,01009,6282,14227.0,14365.0,79.7,57645,56710,935,...,42232,12400,10233,44177,21595,22582,95.6,10233,4540,5693
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,Sweetwater County,Wyoming,56037,2560,9362.0,7518.0,92.0,44117,43226,891,...,30491,6159,4721,32308,16800,15508,108.3,4721,2262,2459
3216,Teton County,Wyoming,56039,439,2051.0,2005.0,94.8,23059,22875,184,...,18169,3843,3135,18625,9619,9006,106.8,3135,1526,1609
3217,Uinta County,Wyoming,56041,1440,4925.0,3519.0,97.6,20609,19891,718,...,13939,3251,2498,14538,7280,7258,100.3,2498,1234,1264
3218,Washakie County,Wyoming,56043,637,1738.0,1708.0,89.0,8129,7791,338,...,5951,2041,1686,6187,3142,3045,103.2,1686,794,892


In [50]:
# select count variables to rc to percentages; make sure we leave out ratios and our population variable b/c these don't make sense to scale by population
rc_cols = [col for col in diabetes.columns if col not in ['County', 'State', 'CountyFIPS', 
                                                        'sex and age_total population_65 years and over_sex ratio (males per 100 females)', 'sex and age_total population_sex ratio (males per 100 females)', 'sex and age_total population_18 years and over_sex ratio (males per 100 females)',  
                                                        'race_total population']]
           
diabetes[rc_cols] = diabetes[rc_cols].apply(pd.to_numeric, errors='coerce') # recode all selected columns to numeric

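# divide all columns but those listed above by total population to calculate rates
# note: run this cell only once; re-running it divides the already-computed rates by population again
diabetes[rc_cols] = diabetes[rc_cols].div(diabetes['race_total population'], axis=0)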
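Because the rescaling overwrites the counts in place, running it more than once changes the result. A minimal sketch of the issue, and of one way around it, using hypothetical column names (`total_pop`, `cases`) rather than the real ones:

```python
import pandas as pd

toy = pd.DataFrame({
    "total_pop": [1000, 2000],  # hypothetical stand-in for 'race_total population'
    "cases": [100, 300],
})
count_cols = ["cases"]

# Writing the rates to a new frame keeps the raw counts intact,
# so repeating this step always gives the same answer.
rates = toy.copy()
rates[count_cols] = toy[count_cols].div(toy["total_pop"], axis=0)

# Overwriting in place and running twice divides by population twice.
twice = toy.copy()
for _ in range(2):
    twice[count_cols] = twice[count_cols].div(twice["total_pop"], axis=0)

print(rates["cases"].tolist())
print(twice["cases"].tolist())
```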

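Let's check our work. Are all rates bounded by 0 and 1 as expected? Note that the values displayed below are on the order of 1e-06 (e.g., Autauga County's diabetes rate shows as 1.699354e-06 rather than 5178/55200 ≈ 0.094), which is what results when the rescaling cell is run twice. 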

In [51]:
diabetes

Unnamed: 0,County,State,CountyFIPS,Diabetes_Number,Obesity_Number,Physical_Inactivity_Number,sex and age_total population_65 years and over_sex ratio (males per 100 females),race_total population,race_total population_one race,race_total population_two or more races,race_total population_one race_white,race_total population_one race_black or african american,race_total population_one race_american indian and alaska native,race_total population_one race_american indian and alaska native_cherokee tribal grouping,race_total population_one race_american indian and alaska native_chippewa tribal grouping,race_total population_one race_american indian and alaska native_navajo tribal grouping,race_total population_one race_american indian and alaska native_sioux tribal grouping,race_total population_one race_asian,race_total population_one race_asian_asian indian,race_total population_one race_asian_chinese,race_total population_one race_asian_filipino,race_total population_one race_asian_japanese,race_total population_one race_asian_korean,race_total population_one race_asian_vietnamese,race_total population_one race_asian_other asian,race_total population_one race_native hawaiian and other pacific islander,race_total population_one race_native hawaiian and other pacific islander_native hawaiian,race_total population_one race_native hawaiian and other pacific islander_guamanian or chamorro,race_total population_one race_native hawaiian and other pacific islander_samoan,race_total population_one race_native hawaiian and other pacific islander_other pacific islander,race_total population_one race_some other race,race_total population_two or more races_white and black or african american,race_total population_two or more races_white and american indian and alaska native,race_total population_two or more races_white and asian,race_total population_two or more races_black or african american and american indian and alaska native,race alone or in combination with one or more 
other races_total population,race alone or in combination with one or more other races_total population_white,race alone or in combination with one or more other races_total population_black or african american,race alone or in combination with one or more other races_total population_american indian and alaska native,race alone or in combination with one or more other races_total population_asian,race alone or in combination with one or more other races_total population_native hawaiian and other pacific islander,race alone or in combination with one or more other races_total population_some other race,hispanic or latino and race_total population_hispanic or latino (of any race),hispanic or latino and race_total population_hispanic or latino (of any race)_mexican,hispanic or latino and race_total population_hispanic or latino (of any race)_puerto rican,hispanic or latino and race_total population_hispanic or latino (of any race)_cuban,hispanic or latino and race_total population_hispanic or latino (of any race)_other hispanic or latino,hispanic or latino and race_total population_not hispanic or latino,hispanic or latino and race_total population_not hispanic or latino_white alone,hispanic or latino and race_total population_not hispanic or latino_black or african american alone,hispanic or latino and race_total population_not hispanic or latino_american indian and alaska native alone,hispanic or latino and race_total population_not hispanic or latino_asian alone,hispanic or latino and race_total population_not hispanic or latino_native hawaiian and other pacific islander alone,hispanic or latino and race_total population_not hispanic or latino_some other race alone,hispanic or latino and race_total population_not hispanic or latino_two or more races,hispanic or latino and race_total population_not hispanic or latino_two or more races_two races including some other race,hispanic or latino and race_total population_not hispanic or latino_two or more races_two races 
excluding some other race - and three or more races,total housing units,citizen - voting age population_citizen - 18 and over population,citizen - voting age population_citizen - 18 and over population_male,citizen - voting age population_citizen - 18 and over population_female,sex and age_total population_male,sex and age_total population_female,sex and age_total population_sex ratio (males per 100 females),sex and age_total population_under 5 years,sex and age_total population_5 to 9 years,sex and age_total population_10 to 14 years,sex and age_total population_15 to 19 years,sex and age_total population_20 to 24 years,sex and age_total population_25 to 34 years,sex and age_total population_35 to 44 years,sex and age_total population_45 to 54 years,sex and age_total population_55 to 59 years,sex and age_total population_60 to 64 years,sex and age_total population_65 to 74 years,sex and age_total population_75 to 84 years,sex and age_total population_85 years and over,sex and age_total population_median age (years),sex and age_total population_under 18 years,sex and age_total population_16 years and over,sex and age_total population_18 years and over,sex and age_total population_21 years and over,sex and age_total population_62 years and over,sex and age_total population_65 years and over,sex and age_total population_18 years and over_male,sex and age_total population_18 years and over_female,sex and age_total population_18 years and over_sex ratio (males per 100 females),sex and age_total population_65 years and over_male,sex and age_total population_65 years and over_female
0,Autauga County,Alabama,01001,1.699354e-06,0.000004,4.110547e-06,76.5,55200,0.000018,3.380330e-07,0.000014,3.467299e-06,5.218179e-08,2.691136e-08,0.0,1.444024e-08,0.000000e+00,1.864104e-07,0.000000e+00,3.511605e-08,3.938248e-08,8.861059e-09,2.888049e-08,7.023209e-08,3.938248e-09,1.050200e-08,0.000000e+00,1.050200e-08,0.000000e+00,0.000000e+00,1.342286e-07,1.115837e-07,9.025152e-08,8.434415e-08,3.938248e-09,0.000018,0.000014,3.597590e-06,1.522789e-07,2.937277e-07,3.774155e-08,1.401360e-07,5.014703e-07,2.418741e-07,9.418977e-08,6.793478e-08,9.747164e-08,0.000018,0.000014,3.437763e-06,5.218179e-08,1.864104e-07,1.640937e-09,1.345568e-08,3.321256e-07,0.000000e+00,3.321256e-07,0.000008,0.000014,0.000006,0.000007,0.000009,0.000009,94.9,1.070875e-06,1.315703e-06,1.171629e-06,1.265162e-06,1.095161e-06,2.318315e-06,2.420054e-06,2.577255e-06,1.355742e-06,8.841367e-07,1.546091e-06,8.283449e-07,2.674727e-07,1.240548e-08,0.000004,0.000014,0.000014,0.000013,0.000003,2.641908e-06,0.000007,0.000007,92.1,1.144717e-06,1.497191e-06
1,Baldwin County,Alabama,01003,3.857670e-07,0.000001,9.241599e-07,86.6,208107,0.000005,8.247799e-08,0.000004,4.563536e-07,3.514320e-08,6.580691e-09,0.0,1.616310e-09,3.463521e-10,3.879144e-08,9.928761e-10,1.526258e-08,4.710389e-09,1.103709e-08,3.163350e-09,2.955538e-09,6.696141e-10,2.078113e-10,0.000000e+00,2.078113e-10,0.000000e+00,0.000000e+00,4.696535e-08,9.420778e-09,3.574354e-08,1.974207e-08,1.431589e-09,0.000005,0.000004,4.689839e-07,7.354211e-08,6.520656e-08,1.824121e-09,6.114270e-08,2.159621e-07,1.184293e-07,3.350380e-08,6.303609e-09,5.772536e-08,0.000005,0.000004,4.509274e-07,3.228002e-08,3.851436e-08,2.078113e-10,9.466959e-09,6.862391e-08,8.127730e-09,6.049617e-08,0.000003,0.000004,0.000002,0.000002,0.000002,0.000002,94.6,2.680535e-07,2.699007e-07,3.307201e-07,2.934064e-07,2.491426e-07,5.386007e-07,5.859586e-07,6.541438e-07,3.334217e-07,3.429117e-07,5.668861e-07,2.808916e-07,9.118297e-08,9.882581e-10,0.000001,0.000004,0.000004,0.000004,0.000001,9.389607e-07,0.000002,0.000002,92.6,4.357110e-07,5.032497e-07
2,Barbour County,Alabama,01005,5.115001e-06,0.000012,8.170462e-06,76.2,25782,0.000038,5.310574e-07,0.000018,1.845312e-05,1.083177e-07,3.911471e-08,0.0,1.353971e-08,0.000000e+00,1.444236e-07,1.654853e-08,0.000000e+00,7.522060e-09,0.000000e+00,7.221178e-08,1.955736e-08,2.858383e-08,1.504412e-09,1.504412e-09,0.000000e+00,0.000000e+00,0.000000e+00,1.170433e-06,6.468972e-08,1.098221e-07,8.123825e-08,1.218574e-07,0.000039,0.000019,1.879311e-05,4.528280e-07,2.331839e-07,1.504412e-09,1.203530e-06,1.663880e-06,1.361493e-06,1.338927e-07,4.513236e-09,1.639809e-07,0.000037,0.000018,1.835232e-05,9.477796e-08,1.278750e-07,1.504412e-09,1.293794e-07,5.175177e-07,1.955736e-08,4.979604e-07,0.000018,0.000030,0.000016,0.000014,0.000021,0.000018,113.3,2.091133e-06,2.181397e-06,2.522899e-06,2.157327e-06,2.494315e-06,5.528714e-06,4.669695e-06,5.193230e-06,2.492811e-06,2.483784e-06,4.242442e-06,2.094142e-06,6.348619e-07,6.002604e-08,0.000008,0.000032,0.000031,0.000029,0.000009,6.971445e-06,0.000016,0.000014,115.9,3.014842e-06,3.956604e-06
3,Bibb County,Alabama,01007,4.672237e-06,0.000013,1.148255e-05,80.5,22527,0.000044,3.684978e-07,0.000034,9.888352e-06,1.576461e-08,0.000000e+00,0.0,0.000000e+00,0.000000e+00,7.291133e-08,7.291133e-08,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.773519e-08,1.793225e-07,1.615873e-07,1.182346e-08,0.000000e+00,0.000044,0.000034,1.008344e-05,1.773519e-07,1.004994e-07,0.000000e+00,1.773519e-08,1.077905e-06,3.941153e-07,3.349980e-08,5.517614e-08,5.951141e-07,0.000043,0.000033,9.801647e-06,1.576461e-08,7.291133e-08,0.000000e+00,0.000000e+00,3.152922e-07,0.000000e+00,3.152922e-07,0.000018,0.000035,0.000019,0.000016,0.000024,0.000020,117.1,2.512485e-06,2.321339e-06,2.540073e-06,2.983453e-06,2.938129e-06,6.059522e-06,5.704819e-06,6.776812e-06,2.944041e-06,2.396221e-06,4.181563e-06,2.191281e-06,8.414361e-07,7.862600e-08,0.000009,0.000036,0.000035,0.000033,0.000009,7.214280e-06,0.000019,0.000016,115.5,3.217951e-06,3.996329e-06
4,Blount County,Alabama,01009,1.890491e-06,0.000004,4.322971e-06,79.7,57645,0.000017,2.813768e-07,0.000017,2.594084e-07,4.243223e-08,6.018756e-09,0.0,0.000000e+00,0.000000e+00,5.958568e-08,1.263939e-08,1.414408e-08,1.594970e-08,2.708440e-09,1.023188e-08,0.000000e+00,3.912191e-09,5.416880e-09,0.000000e+00,5.416880e-09,0.000000e+00,0.000000e+00,1.315098e-07,7.102132e-08,1.597980e-07,1.354220e-08,0.000000e+00,0.000017,0.000017,3.382541e-07,2.148696e-07,8.787383e-08,1.594970e-08,1.441492e-07,1.583234e-06,1.430959e-06,3.220034e-08,5.717818e-09,1.143564e-07,0.000016,0.000015,2.467690e-07,3.731628e-08,5.958568e-08,5.416880e-09,5.236317e-08,2.461671e-07,0.000000e+00,2.461671e-07,0.000007,0.000013,0.000006,0.000007,0.000009,0.000009,97.3,1.048768e-06,1.093006e-06,1.202246e-06,1.118586e-06,9.596906e-07,2.042164e-06,2.152608e-06,2.391553e-06,1.172153e-06,1.087288e-06,1.845350e-06,9.735337e-07,2.606121e-07,1.227826e-08,0.000004,0.000014,0.000013,0.000013,0.000004,3.079496e-06,0.000006,0.000007,95.6,1.366258e-06,1.713239e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,Sweetwater County,Wyoming,56037,1.315310e-06,0.000005,3.862695e-06,92.0,44117,0.000022,4.577894e-07,0.000021,1.818827e-07,3.812343e-07,0.000000e+00,0.0,1.238241e-07,7.706893e-09,1.433482e-07,0.000000e+00,2.517585e-08,3.236895e-08,1.541379e-09,1.798275e-08,1.130344e-08,5.497583e-08,1.048137e-07,1.489999e-08,0.000000e+00,8.991375e-08,0.000000e+00,2.892654e-07,1.197137e-07,1.094379e-07,1.510551e-07,0.000000e+00,0.000023,0.000022,3.093033e-07,5.261239e-07,3.267723e-07,1.335861e-07,3.057067e-07,3.618129e-06,2.940950e-06,1.592758e-08,7.706893e-09,6.535445e-07,0.000019,0.000018,1.726344e-07,2.425102e-07,1.433482e-07,9.967581e-08,5.651721e-09,3.473240e-07,0.000000e+00,3.473240e-07,0.000010,0.000016,0.000008,0.000008,0.000012,0.000011,107.8,1.621016e-06,1.825506e-06,1.645678e-06,1.622044e-06,1.395461e-06,3.365857e-06,3.094061e-06,2.709743e-06,1.587106e-06,1.374910e-06,1.572720e-06,6.242583e-07,2.286378e-07,1.777723e-08,0.000006,0.000017,0.000017,0.000016,0.000003,2.425616e-06,0.000009,0.000008,108.3,1.162199e-06,1.263417e-06
3216,Teton County,Wyoming,56039,8.256264e-07,0.000004,3.770799e-06,94.8,23059,0.000043,3.460484e-07,0.000039,5.153113e-07,1.448137e-07,5.642094e-08,0.0,5.642094e-09,0.000000e+00,5.303568e-07,1.128419e-08,3.140766e-07,1.880698e-08,1.034384e-07,0.000000e+00,0.000000e+00,8.275071e-08,6.582443e-08,0.000000e+00,0.000000e+00,0.000000e+00,6.582443e-08,2.585960e-06,0.000000e+00,1.824277e-07,1.147226e-07,0.000000e+00,0.000043,0.000040,5.153113e-07,3.272415e-07,6.526022e-07,7.334722e-08,2.627335e-06,6.456436e-06,5.668424e-06,1.767856e-07,0.000000e+00,6.112268e-07,0.000037,0.000035,4.607710e-07,1.448137e-07,5.303568e-07,6.582443e-08,2.256838e-08,3.046731e-07,0.000000e+00,3.046731e-07,0.000026,0.000031,0.000016,0.000015,0.000022,0.000021,106.8,2.174087e-06,2.557749e-06,2.542704e-06,1.749049e-06,2.063126e-06,7.972279e-06,7.020646e-06,5.738010e-06,2.792837e-06,2.860542e-06,3.925017e-06,1.436853e-06,5.341182e-07,7.391143e-08,0.000008,0.000036,0.000035,0.000034,0.000007,5.895988e-06,0.000018,0.000017,106.8,2.869945e-06,3.026043e-06
3217,Uinta County,Wyoming,56041,3.390382e-06,0.000012,8.285246e-06,97.6,20609,0.000047,1.690482e-06,0.000045,4.708864e-08,3.767091e-07,0.000000e+00,0.0,6.121523e-08,1.389115e-07,5.179751e-08,0.000000e+00,0.000000e+00,1.883546e-08,0.000000e+00,2.354432e-08,9.417728e-09,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.028887e-06,7.298739e-08,1.480938e-06,0.000000e+00,0.000000e+00,0.000049,0.000047,2.260255e-07,1.963596e-06,8.240512e-08,3.060762e-08,1.028887e-06,4.435750e-06,4.056686e-06,4.708864e-08,0.000000e+00,3.319749e-07,0.000044,0.000042,4.708864e-08,3.531648e-07,5.179751e-08,0.000000e+00,0.000000e+00,1.156026e-06,0.000000e+00,1.156026e-06,0.000021,0.000033,0.000017,0.000016,0.000025,0.000024,104.0,3.536357e-06,4.306256e-06,4.113193e-06,3.477496e-06,2.681698e-06,5.777776e-06,6.352258e-06,5.497599e-06,4.148509e-06,2.749977e-06,3.778863e-06,1.426786e-06,6.757220e-07,8.358234e-08,0.000014,0.000036,0.000034,0.000033,0.000008,5.881371e-06,0.000017,0.000017,100.3,2.905369e-06,2.976002e-06
3218,Washakie County,Wyoming,56043,9.639737e-06,0.000026,2.584721e-05,89.0,8129,0.000118,5.114962e-06,0.000110,1.513303e-08,7.869173e-07,6.355871e-07,0.0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,6.734196e-06,1.846229e-06,2.481816e-06,6.053210e-08,0.000000e+00,0.000123,0.000115,1.937027e-06,3.359532e-06,1.361972e-07,9.079815e-08,7.384916e-06,1.744838e-05,1.601074e-05,1.513303e-08,0.000000e+00,1.422504e-06,0.000106,0.000101,1.513303e-08,4.388577e-07,0.000000e+00,0.000000e+00,1.967293e-07,4.040518e-06,3.177935e-07,3.722724e-06,0.000059,0.000093,0.000047,0.000046,0.000063,0.000060,103.6,6.567733e-06,7.324384e-06,8.489627e-06,8.777155e-06,6.174274e-06,1.184916e-05,1.489090e-05,1.545082e-05,8.792288e-06,9.185746e-06,1.463364e-05,7.990237e-06,2.890408e-06,6.582866e-07,0.000029,0.000099,0.000094,0.000090,0.000031,2.551428e-05,0.000048,0.000046,103.2,1.201562e-05,1.349866e-05


In [54]:
pd.set_option('display.max_columns', None)
# inspect recoded values
diabetes_summary = diabetes.describe().transpose() # .transpose() flips rows and columns; on a DataFrame it is equivalent to the .T attribute
  
# check recoding 
with pd.option_context('display.max_rows', 100, 'display.max_columns', None): 
    display(diabetes_summary.iloc[ : ,[0,1,3,7]]) # select which columns in the summary table we want to present

Unnamed: 0,count,mean,min,max
Diabetes_Number,3190.0,8.269962e-06,6.584316e-09,0.0003847338
Obesity_Number,3112.0,2.22357e-05,1.654307e-08,0.0007887042
Physical_Inactivity_Number,3112.0,1.876256e-05,1.238349e-08,0.0008258932
sex and age_total population_65 years and over_sex ratio (males per 100 females),3190.0,84.22382,34.7,220.0
race_total population,3190.0,102209.2,228.0,10098050.0
race_total population_one race,3190.0,8.967346e-05,9.512906e-08,0.004385965
race_total population_two or more races,3190.0,1.961564e-06,0.0,0.0002211825
race_total population_one race_white,3190.0,7.94675e-05,5.086619e-08,0.004347492
race_total population_one race_black or african american,3190.0,4.807644e-06,0.0,0.0004530547
race_total population_one race_american indian and alaska native,3190.0,2.68859e-06,0.0,0.0004897851
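
Reading the summary table by eye works, but the bounds check can also be done programmatically. A small sketch, assuming a hypothetical frame whose rate columns are listed in `rate_cols` (both the frame and the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical rates table; 'sex_ratio' stands in for columns left unscaled on purpose
toy = pd.DataFrame({
    "diabetes_rate": [0.09, 0.12],
    "obesity_rate": [0.25, 0.31],
    "sex_ratio": [92.1, 108.3],
})
rate_cols = ["diabetes_rate", "obesity_rate"]

# between() is inclusive on both ends by default;
# the first all() reduces each column, the second reduces across columns
in_bounds = toy[rate_cols].apply(lambda s: s.between(0, 1)).all().all()
print(in_bounds)
```

`between()` returns False for missing values, so a column containing NaN fails this check. Passing it also does not by itself show that the rates are on the right scale, so a glance at the means is still worthwhile.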


#### d) Check for duplicate columns

There are a lot of columns in this data frame. Let's see if any of them are duplicates. 

In [55]:
# I used Google to figure this out, and adapted this example for our purposes:  
# source: https://thispointer.com/how-to-find-drop-duplicate-columns-in-a-dataframe-python-pandas/ 
def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It will iterate over all the columns in dataframe and find the columns whose contents are duplicate.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns in dataframe
    for x in range(df.shape[1]):
        # Select column at xth index.
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index.
            otherCol = df.iloc[:, y]
            # Check if two columns at x and y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

duplicateColumnNames = list(getDuplicateColumns(diabetes))
print('Duplicate Columns are as follows: ')
duplicateColumnNames

Duplicate Columns are as follows: 


[]

In [56]:
# now drop list of duplicate features from our df using the .drop() method
diabetes = diabetes.drop(columns=duplicateColumnNames) 
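
The nested loop above compares every pair of columns. A shorter alternative, sketched here on a hypothetical toy frame, is to transpose the data frame and use `duplicated()`, which flags rows that repeat an earlier row. Note this is a swapped-in technique, not the method from the linked source, and it can behave differently from `.equals()` when two columns hold the same values with different dtypes.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],  # same contents as 'a'
})

# Transpose so columns become rows, then flag rows that repeat an earlier row
dup_mask = toy.T.duplicated()
dup_cols = dup_mask[dup_mask].index.tolist()
deduped = toy.drop(columns=dup_cols)

print(dup_cols)
print(deduped.columns.tolist())
```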

## 2. Exploratory Data Analysis

In [57]:
# insert your EDAs and interpretations in this section 

## 3. Prepare to Fit Models

### 3.1 Finalize Data Set

We've already cleaned up the data, but we can make a few more adjustments before partitioning the data and training models. Let's recode 'State' to be a categorical variable using `pd.get_dummies` and drop 'County' using `.drop()` because 'CountyFIPS' is already a unique identifier for the county. 

In [58]:
# create dummy features out of 'State' , which might be related to diabetes rates 
diabetes_clean = pd.get_dummies(______, 
                               columns = [______],  
                               drop_first = True) # drop the first category, so k categories yield k-1 dummies

# drop 'County' variable
diabetes_clean = diabetes_clean.drop(labels = ['County'],
                               axis = ______) # which axis tells python we want to drop columns rather than index rows?

# look at first 10 rows of new data frame 
diabetes_clean.______ 

NameError: name '______' is not defined
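
If the behavior of `pd.get_dummies` with `drop_first` is unclear, the following sketch shows it on a hypothetical toy frame (the `region` and `rate` columns are invented for illustration and are not in the diabetes data).

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "South", "West", "South"],  # hypothetical categorical column
    "rate": [0.10, 0.12, 0.08, 0.11],
})

# Three categories produce two dummy columns; "North" becomes the reference level
dummies = pd.get_dummies(toy, columns=["region"], drop_first=True)
print(dummies.columns.tolist())
```

A row belonging to the dropped category is identified by having every dummy equal to zero (or False, depending on the pandas version).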

### 3.2/3.3 Partition Data and Feature Selection

Now, we will partition our data to prepare it for the training process. We will use a 60% train, 20% validation, 20% test split in this case. A larger training set generally gives the models more to learn from, but it leaves smaller validation and test sets, whose performance estimates are then noisier (higher variance). Choosing the set sizes is a balance: we want enough data to train on to get good predictions, but we also want hold-out sets large enough to be representative.

In [None]:
from sklearn.model_selection import train_test_split

# Set y 
y = ______

# X (everything except diabetes, our target)
X = ______

We should also preprocess our data. Using the `preprocessing` module from sklearn, let's standardize our features so that each one is mean-centered and scaled to unit variance.

In [None]:
from sklearn import preprocessing

X = preprocessing.scale(X)
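
As a quick illustration of what `preprocessing.scale` does, here is a sketch on a small made-up array (`X_toy` is hypothetical): each column ends up with mean approximately 0 and standard deviation approximately 1, regardless of its original units.

```python
import numpy as np
from sklearn import preprocessing

# Two hypothetical features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

X_scaled = preprocessing.scale(X_toy)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```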

We can also get rid of the zero-variance features using the `VarianceThreshold()` class from `feature_selection`. 

In [None]:
from sklearn import feature_selection

selector = feature_selection.VarianceThreshold(0)
X = selector.fit_transform(X)
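
To see which columns a `VarianceThreshold` selector removes, a minimal sketch on a hypothetical array with two constant columns:

```python
import numpy as np
from sklearn import feature_selection

# Columns 0 and 2 are constant (zero variance); only column 1 varies
X_toy = np.array([[0.0, 1.0, 5.0],
                  [0.0, 2.0, 5.0],
                  [0.0, 3.0, 5.0]])

selector = feature_selection.VarianceThreshold(threshold=0)
X_reduced = selector.fit_transform(X_toy)

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

The mask from `get_support()` can be used to recover the names of the surviving features if the original column names were saved before conversion to an array.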

And finally, let's split our data:

In [None]:
# split the data
# train_test_split returns 4 values: X_train, X_test, y_train, y_test, so how do we create a 60-20-20 train-validate-test split? 

X_train, X_test, y_train, y_test = ______

X_train, X_validate, y_train, y_validate = ______
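
The arithmetic behind a 60-20-20 split with two calls to `train_test_split` is worth spelling out: after holding out 20% for testing, the validation set must be 25% of the remaining 80% to equal 20% of the total. The sketch below shows this on hypothetical toy arrays; the variable names and `random_state` value are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0)

# 20% of the total is 25% of the remaining 80%
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```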

## 4. Train Models

In [None]:
# train your five models in this section 

## 5. Validate and Refine Models

In [None]:
# use the X_validate and y_validate data sets to evaluate and refine your models

## 6. Discussion Questions

In [None]:
# insert responses for discussion Qs here