# Cleaning and Imputing Demographics Data


* [Imports and Setting Data Types](#Imports-and-Setting-Data-Types)
* [Cleaning and Imputing Data](#Cleaning-and-Imputing-Data)




## Imports and Setting Data Types

In [97]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [98]:
# defining dtypes upon reading in data to keep FIPS as 5 digit code
dtypes_dict = {
 'FIPS': 'object',
 'State': 'object',
 'County': 'object',
 'Deaths': 'float64',
 'Years of Potential Life Lost Rate': 'float64',
 '% Fair or Poor Health': 'float64',
 'Average Number of Physically Unhealthy Days': 'float64',
 'Average Number of Mentally Unhealthy Days': 'float64',
 '% Low birthweight': 'float64',
 '% Smokers': 'float64',
 '% Adults with Obesity': 'float64',
 'Food Environment Index': 'float64',
 '% Physically Inactive': 'float64',
 '% With Access to Exercise Opportunities': 'float64',
 '% Excessive Drinking': 'float64',
 '# Alcohol-Impaired Driving Deaths': 'float64',
 '# Driving Deaths': 'float64',
 '% Driving Deaths with Alcohol Involvement': 'float64',
 '# Chlamydia Cases': 'float64',
 'Chlamydia Rate': 'float64',
 'Teen Birth Rate': 'float64',
 '# Uninsured': 'float64',
 '% Uninsured': 'float64',
 '# Primary Care Physicians': 'float64',
 'Primary Care Physicians Rate': 'float64',
 'Primary Care Physicians Ratio': 'object',
 '# Dentists': 'float64',
 'Dentist Rate': 'float64',
 'Dentist Ratio': 'object',
 '# Mental Health Providers': 'float64',
 'Mental Health Provider Rate': 'float64',
 'Mental Health Provider Ratio': 'object',
 'Preventable Hospitalization Rate': 'float64',
 '% With Annual Mammogram': 'float64',
 '% Vaccinated': 'float64',
 '# Completed High School': 'int64',
 'Population': 'int64',
 '% Completed High School': 'float64',
 '# Some College': 'int64',
 'Population.1': 'int64',
 '% Some College': 'float64',
 '# Unemployed': 'float64',
 'Labor Force': 'float64',
 '% Unemployed': 'float64',
 '% Children in Poverty': 'float64',
 'Income Ratio': 'float64',
 '# Children in Single-Parent Households': 'float64',
 '# Children in Households': 'float64',
 '% Children in Single-Parent Households': 'float64',
 '# Associations': 'int64',
 'Social Association Rate': 'float64',
 'Annual Average Violent Crimes': 'float64',
 'Violent Crime Rate': 'float64',
 '# Injury Deaths': 'float64',
 'Injury Death Rate': 'float64',
 'Average Daily PM2.5': 'float64',
 '% Severe Housing Problems': 'float64',
 'Severe Housing Cost Burden': 'float64',
 'Overcrowding': 'float64',
 'Inadequate Facilities': 'float64',
 '% Drive Alone to Work': 'float64',
 '# Workers who Drive Alone': 'int64',
 '% Long Commute - Drives Alone': 'float64',
 'Water_Violation': 'int64'}

In [99]:
df = pd.read_csv('../data/chr_clean.csv', dtype=dtypes_dict)

In [100]:
df.head()

Unnamed: 0,FIPS,State,County,Deaths,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,# Alcohol-Impaired Driving Deaths,# Driving Deaths,% Driving Deaths with Alcohol Involvement,# Chlamydia Cases,Chlamydia Rate,Teen Birth Rate,# Uninsured,% Uninsured,# Primary Care Physicians,Primary Care Physicians Rate,Primary Care Physicians Ratio,# Dentists,Dentist Rate,Dentist Ratio,# Mental Health Providers,Mental Health Provider Rate,Mental Health Provider Ratio,Preventable Hospitalization Rate,% With Annual Mammogram,% Vaccinated,# Completed High School,Population,% Completed High School,# Some College,Population.1,% Some College,# Unemployed,Labor Force,% Unemployed,% Children in Poverty,Income Ratio,# Children in Single-Parent Households,# Children in Households,% Children in Single-Parent Households,# Associations,Social Association Rate,Annual Average Violent Crimes,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,Inadequate Facilities,% Drive Alone to Work,# Workers who Drive Alone,% Long Commute - Drives Alone,Water_Violation
0,1001,Alabama,Autauga,787.0,7830.053484,19.839179,4.501499,4.850214,9.201774,19.810766,33.0,6.7,30.6,69.130124,14.473886,19.0,57.0,33.333333,280.0,504.5,23.764041,4693.0,10.047314,26.0,46.76175,2139:1,19.0,34.00813,2940:1,16.0,28.63842,3492:1,6650.0,39.0,42.0,33076,37367,88.516606,8689,14362,60.49993,714.0,26172.0,2.728106,15.9,5.090408,3267.0,13205.0,24.740629,68,12.229996,148.5,272.28222,190.0,68.410043,10.4,13.630967,12.032598,1.116124,1.543576,85.788512,24635,38.3,0
1,1003,Alabama,Baldwin,3147.0,7680.47727,16.460675,3.647978,4.768294,8.292009,18.513318,30.0,7.8,24.7,73.713549,18.736958,53.0,179.0,29.608939,632.0,297.2,26.006904,22660.0,13.168983,153.0,70.1764,1425:1,110.0,49.27565,2029:1,220.0,98.5513,1015:1,3471.0,43.0,46.0,137219,151112,90.806157,33404,49776,67.108647,2653.0,97328.0,2.725834,13.5,4.385135,10321.0,46133.0,22.372271,229,10.503527,408.0,203.660396,777.0,72.878641,7.2,12.996651,11.987507,1.017929,0.334931,83.760106,93141,40.4,0
2,1005,Alabama,Barbour,515.0,11476.629416,29.8415,5.569267,5.590494,11.355311,25.530768,41.2,5.5,28.0,53.16677,12.466974,13.0,33.0,39.393939,191.0,755.8,37.114262,2310.0,13.504034,8.0,32.15305,3110:1,9.0,36.45791,2743:1,3.0,12.15264,8229:1,5314.0,44.0,39.0,13152,17964,73.213093,2379,6680,35.613772,324.0,8537.0,3.795244,41.0,5.981993,2707.0,5307.0,51.008103,19,7.636349,105.5,414.277861,98.0,76.988947,9.4,13.710555,12.59426,1.958651,0.652884,82.116389,8231,30.9,0
3,1007,Alabama,Bibb,476.0,12172.562382,23.853284,4.894377,5.271114,10.220994,23.084813,37.4,7.6,33.4,16.251364,15.420603,8.0,27.0,29.62963,139.0,613.2,37.786446,1764.0,10.599051,12.0,53.57143,1867:1,5.0,22.32741,4479:1,6.0,26.79289,3732:1,6690.0,33.0,40.0,12782,16168,79.057397,2483,6186,40.139024,266.0,8685.0,3.062752,25.9,5.002608,1404.0,4586.0,30.614915,19,8.482143,19.5,89.349126,119.0,105.601306,10.0,9.255242,8.263736,0.563991,0.563991,87.057671,8167,52.0,0
4,1009,Alabama,Blount,1100.0,11096.050649,21.98561,4.986622,5.357809,7.430541,22.954959,33.0,7.9,33.3,15.634486,15.833284,15.0,82.0,18.292683,176.0,303.4,31.239778,6642.0,14.115997,12.0,20.74689,4820:1,11.0,19.02258,5257:1,10.0,17.29326,5783:1,4440.0,37.0,40.0,32028,39791,80.490563,7862,14102,55.750957,676.0,25331.0,2.668667,21.0,4.427838,3003.0,13372.0,22.457374,45,7.780083,279.0,482.690611,311.0,107.591609,10.6,10.125665,7.468859,1.81247,1.159981,87.519594,21690,57.9,0
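The preview above shows `1001` for Autauga County, whose standard five-digit FIPS code is `01001`. This suggests the leading zero may already have been missing from the CSV before it was read in. If that is the case, a minimal sketch of restoring it could look like the following. The `toy` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: FIPS read as text, but one code has lost its leading zero
toy = pd.DataFrame({"FIPS": ["1001", "51610"]})

# Left-pad every code with zeros to a width of 5 characters;
# codes that already have 5 digits are left unchanged
toy["FIPS"] = toy["FIPS"].str.zfill(5)
```

This only works because FIPS is read as `object`; an integer column would need to be cast to a string first.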


## Cleaning and Imputing Data

I need to fix three ratio columns, because their current format (text such as `2139:1`) is not workable. After the fix, each value in these columns represents how many people there are for every single provider (be it a primary care physician, dentist, or mental health provider). For example, the number $561$ in the `Primary Care Physicians Ratio` column would represent 561 people in the county for every 1 primary care physician.


In [101]:
cols_to_fix = ['Primary Care Physicians Ratio', 'Dentist Ratio', 'Mental Health Provider Ratio']

for col in cols_to_fix:
    df[col] = df[col].str.replace(':1', '')
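Stripping the suffix leaves the values as text. If numeric values were needed later, one possible follow-up step is sketched below on a toy column. The two ratios are taken from the preview above, the missing entry is a made-up addition, and `people_per_provider` is a hypothetical name.

```python
import pandas as pd

# Toy column: two ratios from the preview plus one made-up missing entry
toy = pd.DataFrame({"Dentist Ratio": ["2940:1", "2029:1", None]})

# Remove the literal ":1" suffix (regex=False treats it as a plain substring),
# then convert the remaining text to numbers; unparseable entries become NaN
people_per_provider = pd.to_numeric(
    toy["Dentist Ratio"].str.replace(":1", "", regex=False),
    errors="coerce",
)
```

With `errors="coerce"`, any malformed entry turns into a null instead of raising, so it is worth checking the null count before and after.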


In the Commonwealth of Virginia, all municipalities incorporated as "cities" have been "independent cities", also called "free cities", since 1871. Virginia's thirty-eight independent cities are not politically part of a county, even though geographically they may be completely surrounded by one. [Source](https://www.wikiwand.com/en/Independent_city_(United_States))

For statistical purposes, several independent cities in Virginia get lumped in with a particular county. For that reason, I imputed the null values for those cities with the values from their respective counties. [Source](https://en.wikipedia.org/wiki/Falls_Church,_Virginia)

In [102]:
# Falls Church City gets combined with Fairfax County
cols_with_nans_fallschurch = ['Food Environment Index', 'Income Ratio', 
                              '% Driving Deaths with Alcohol Involvement', 
                              '# Alcohol-Impaired Driving Deaths', 
                              '# Driving Deaths']

for col in cols_with_nans_fallschurch:
    df.loc[2924, col] = df.loc[2847, col]
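The hard-coded row labels above only work for this exact version of the data. A sketch of the same idea keyed on FIPS codes instead is shown below. The helper `impute_from_parent` is hypothetical, the toy values are made up, and unlike the cell above it only fills cells that are currently null.

```python
import numpy as np
import pandas as pd

def impute_from_parent(frame, child_fips, parent_fips, cols):
    """Fill null cells in the child's row with the parent county's values."""
    child = frame.index[frame["FIPS"] == child_fips]
    parent = frame.index[frame["FIPS"] == parent_fips]
    if len(child) != 1 or len(parent) != 1:
        raise ValueError("expected exactly one row for each FIPS code")
    for col in cols:
        # Leave existing values alone; copy only where the child is missing data
        if pd.isna(frame.loc[child[0], col]):
            frame.loc[child[0], col] = frame.loc[parent[0], col]
    return frame

# Toy data with made-up values: Fairfax County (51059) and Falls Church City (51610)
toy = pd.DataFrame({
    "FIPS": ["51059", "51610"],
    "County": ["Fairfax", "Falls Church City"],
    "Income Ratio": [4.0, np.nan],
    "# Driving Deaths": [100.0, 3.0],
})
toy = impute_from_parent(toy, "51610", "51059", ["Income Ratio", "# Driving Deaths"])
```

Looking rows up by FIPS keeps the imputation valid even if rows are dropped or the index is reset earlier in the notebook.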

There were 309 rows with 5 or more null values, which constitutes less than 10% of the data. The mean population of those counties is under 2,800 people (my assumption is that reliable data is harder to collect in smaller counties). I dropped these rows from the data set.

In [103]:
# population average for the rows I intend to drop
df[(df.isnull().sum(axis=1) >= 5)]['Population'].mean()

2767.5728155339807

In [104]:
# labels of the rows with 5 or more null values
rows_with_many_nulls = df.index[df.isnull().sum(axis=1) >= 5]
df.drop(rows_with_many_nulls, inplace=True)
# reset_index keeps the old row labels in a new 'index' column
df.reset_index(inplace=True)
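The row-level null count behind this step can be sketched on a toy frame. The column names `a`, `b`, `c`, the values, and the threshold of 2 are all made up for illustration (the notebook uses 5).

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Population": [50000, 1200, 900],
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [3.0, np.nan, np.nan],
})

threshold = 2  # hypothetical; the notebook uses 5
nulls_per_row = toy.isnull().sum(axis=1)   # count of nulls in each row
sparse = nulls_per_row >= threshold        # boolean mask of rows to drop

share_dropped = sparse.mean()                            # fraction of rows affected
mean_pop_dropped = toy.loc[sparse, "Population"].mean()  # average size of dropped counties
kept = toy.loc[~sparse].reset_index(drop=True)
```

Note that `drop=True` discards the old row labels, whereas the notebook keeps them in an `index` column.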

In [105]:
df[df['# Driving Deaths'].isnull()]

Unnamed: 0,index,FIPS,State,County,Deaths,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,# Alcohol-Impaired Driving Deaths,# Driving Deaths,% Driving Deaths with Alcohol Involvement,# Chlamydia Cases,Chlamydia Rate,Teen Birth Rate,# Uninsured,% Uninsured,# Primary Care Physicians,Primary Care Physicians Rate,Primary Care Physicians Ratio,# Dentists,Dentist Rate,Dentist Ratio,# Mental Health Providers,Mental Health Provider Rate,Mental Health Provider Ratio,Preventable Hospitalization Rate,% With Annual Mammogram,% Vaccinated,# Completed High School,Population,% Completed High School,# Some College,Population.1,% Some College,# Unemployed,Labor Force,% Unemployed,% Children in Poverty,Income Ratio,# Children in Single-Parent Households,# Children in Households,% Children in Single-Parent Households,# Associations,Social Association Rate,Annual Average Violent Crimes,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,Inadequate Facilities,% Drive Alone to Work,# Workers who Drive Alone,% Long Commute - Drives Alone,Water_Violation
74,89,2220,Alaska,Sitka,94.0,5733.049236,14.237609,3.755396,4.167775,4.802744,17.594305,25.7,8.3,18.1,100.0,19.274869,,,,19.0,218.7,11.384062,1119.0,15.477178,17.0,196.59998,509,12.0,141.29283,708,178.0,2095.84364,48,4090.0,33.0,39.0,5821,6146,94.712008,1704,2484,68.599034,181.0,4336.0,4.174354,9.3,3.712332,453.0,1835.0,24.686649,15,17.347057,9.0,100.631744,35.0,80.419098,,15.51481,12.450369,2.679831,2.397743,61.845494,4660,1.1,0
2614,2916,51530,Virginia,Buena Vista City,102.0,9033.823703,24.282477,5.227031,5.418573,7.065217,25.88404,21.4,3.6,24.0,97.24812,17.451487,,,,12.0,189.7,6.616257,448.0,9.805209,2.0,32.0667,3119,5.0,77.18432,1296,5.0,77.18432,1296,4980.0,45.0,52.0,3344,4123,81.105991,666,1257,52.983294,94.0,3436.0,2.735739,26.0,4.662931,577.0,1338.0,43.124066,10,16.033349,6.5,97.972718,23.0,71.624315,7.1,20.345489,19.111969,0.0,1.727447,85.095813,2818,24.5,0
2618,2920,51580,Virginia,Covington City,156.0,20646.853405,20.799693,4.626371,4.895934,13.745704,23.426513,24.2,7.4,20.9,100.0,18.325595,,,,23.0,415.8,,441.0,10.512515,3.0,54.94505,1820,3.0,54.17118,1846,16.0,288.91296,346,4687.0,51.0,50.0,3479,3984,87.324297,684,1393,49.102656,100.0,2393.0,4.178855,23.4,4.022686,302.0,1147.0,26.329555,22,40.29304,14.0,246.565692,42.0,151.597185,7.0,17.5,13.006397,3.333333,1.25,89.872915,2518,19.7,0
2633,2938,51720,Virginia,Norton City,76.0,10701.514493,23.434953,5.032985,5.220639,9.126984,24.809197,26.7,6.9,31.4,99.621021,17.274502,,,,13.0,330.3,33.692722,307.0,9.540087,5.0,126.00806,794,4.0,100.47727,995,13.0,326.55112,306,6767.0,46.0,44.0,2308,2747,84.01893,699,1086,64.364641,62.0,1687.0,3.675163,30.7,6.664658,454.0,941.0,48.246546,3,7.560484,14.5,365.193301,18.0,91.426249,8.3,20.448179,19.263456,1.62465,0.22409,82.45614,1596,10.6,0


In [106]:
    
# Buena Vista City is combined with Rockbridge County for stats purposes
cols_with_nans_other_3_VA_cities = ['# Alcohol-Impaired Driving Deaths', '# Driving Deaths', 
                                    '% Driving Deaths with Alcohol Involvement']

for col in cols_with_nans_other_3_VA_cities:
    df.loc[2614, col] = df.loc[2594, col]

# Covington City can be combined with Alleghany County for stats purposes
for col in cols_with_nans_other_3_VA_cities:
    df.loc[2618, col] = df.loc[2522, col]
    
# Norton City can be combined with Wise County for stats purposes
for col in cols_with_nans_other_3_VA_cities:
    df.loc[2633, col] = df.loc[2609, col]

Given that my main interest was in the values for each county's Police Scorecard, it seemed that `Violent Crime Rate` might end up being important to my ability to cluster counties. For that reason, combined with the fact that I had one reliable source from which to impute all of the missing values, I spent a significant amount of time researching the violent crime rates (per 100k people) and imputing them.

In [107]:
df[df['Violent Crime Rate'].isnull()]

Unnamed: 0,index,FIPS,State,County,Deaths,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,# Alcohol-Impaired Driving Deaths,# Driving Deaths,% Driving Deaths with Alcohol Involvement,# Chlamydia Cases,Chlamydia Rate,Teen Birth Rate,# Uninsured,% Uninsured,# Primary Care Physicians,Primary Care Physicians Rate,Primary Care Physicians Ratio,# Dentists,Dentist Rate,Dentist Ratio,# Mental Health Providers,Mental Health Provider Rate,Mental Health Provider Ratio,Preventable Hospitalization Rate,% With Annual Mammogram,% Vaccinated,# Completed High School,Population,% Completed High School,# Some College,Population.1,% Some College,# Unemployed,Labor Force,% Unemployed,% Children in Poverty,Income Ratio,# Children in Single-Parent Households,# Children in Households,% Children in Single-Parent Households,# Associations,Social Association Rate,Annual Average Violent Crimes,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,Inadequate Facilities,% Drive Alone to Work,# Workers who Drive Alone,% Long Commute - Drives Alone,Water_Violation
27,28,1057,Alabama,Fayette,327.0,10520.535419,23.827758,5.115708,5.548231,10.501193,24.326008,37.6,7.1,30.0,34.162752,14.819336,4.0,18.0,22.222222,76.0,461.5,37.4613,1575.0,12.319124,17.0,103.45037,967,5.0,30.67108,3260,2.0,12.26843,8151,4667.0,38.0,47.0,9669,11706,82.598667,1946,3802,51.183588,207.0,6579.0,3.146375,23.0,5.152367,984.0,3480.0,28.275862,14,8.519443,,,84.0,101.808309,9.2,11.646298,9.724238,1.509705,1.150252,85.800797,6275,44.1,0
66,70,2050,Alaska,Bethel,266.0,15220.487811,32.261816,6.835134,5.874776,6.471183,35.008004,34.1,2.9,27.7,100.0,16.594949,6.0,9.0,66.666667,590.0,3264.0,60.419473,2971.0,17.94624,19.0,104.30391,959,26.0,141.41194,707,427.0,2322.41923,43,5053.0,4.0,7.0,7930,9801,80.910111,1605,4823,33.278043,905.0,7025.0,12.882562,31.6,5.697356,1643.0,6292.0,26.112524,8,4.391744,,,185.0,204.212292,,59.226519,7.947753,34.917127,36.022099,23.283333,6000,0.0,0
67,73,2070,Alaska,Dillingham,79.0,14346.400603,26.020902,5.916087,5.337797,5.865103,30.182684,30.9,4.8,25.5,100.0,19.143546,1.0,2.0,50.0,104.0,2108.7,41.347626,845.0,18.740297,13.0,260.0,385,6.0,122.05045,819,39.0,793.32791,126,3661.0,31.0,19.0,2499,2895,86.321244,661,1316,50.227964,157.0,1938.0,8.101135,27.1,4.477378,463.0,1523.0,30.400525,0,0.0,,,37.0,149.199565,,38.434164,8.518253,22.419929,13.879004,33.786611,1912,5.4,0
68,74,2090,Alaska,Fairbanks North Star,917.0,6820.353852,14.035618,3.963047,4.081826,5.914924,17.411357,31.0,7.9,18.2,100.0,18.649155,18.0,44.0,40.909091,918.0,920.7,24.30821,10417.0,12.328686,81.0,81.84216,1222,139.0,143.52239,697,613.0,632.94407,158,2002.0,35.0,36.0,58440,61846,94.492772,22264,30579,72.808136,2404.0,44652.0,5.383857,7.9,3.724222,4868.0,23868.0,20.395509,92,9.295652,,,415.0,83.710028,13.9,23.286166,13.654563,5.715071,6.717956,74.280392,51000,17.3,0
71,83,2170,Alaska,Matanuska-Susitna,1150.0,7002.196176,15.911607,4.118388,4.357659,5.62184,20.168045,34.6,7.7,20.2,100.0,19.052251,37.0,100.0,37.0,388.0,364.2,16.901288,14078.0,15.142193,53.0,49.25193,2030,54.0,49.85367,2006,213.0,196.64503,509,2398.0,35.0,33.0,63378,68186,92.948699,16905,29063,58.166741,3240.0,46901.0,6.908168,12.8,4.459951,4755.0,28336.0,16.780774,84,7.805966,,,464.0,87.892271,5.7,19.89221,12.281569,5.209864,4.197289,73.590669,43212,41.7,0
72,84,2180,Alaska,Nome,157.0,15393.495761,28.030815,6.111515,5.559889,5.994358,33.851021,40.3,3.7,19.7,100.0,17.645421,7.0,8.0,87.5,298.0,3003.7,57.065217,1667.0,18.373195,10.0,99.92006,1001,10.0,99.96002,1000,91.0,909.63615,110,6041.0,32.0,4.0,4668,5520,84.565217,1089,2666,40.847712,440.0,4036.0,10.901883,26.1,4.591726,940.0,3402.0,27.630805,3,2.997602,,,88.0,177.076626,,52.331606,12.773343,29.70639,25.388601,25.426058,3638,0.1,0
75,91,2240,Alaska,Southeast Fairbanks,89.0,8095.600011,16.090055,3.952492,4.247463,3.537415,22.334423,32.6,6.2,19.4,100.0,20.256297,2.0,6.0,33.333333,15.0,217.8,31.548056,1182.0,20.201675,6.0,86.73027,1153,4.0,58.02989,1723,11.0,159.58218,627,,26.0,25.0,4060,4518,89.862771,880,1751,50.256996,252.0,2810.0,8.967972,15.8,4.635207,255.0,1842.0,13.843648,3,4.336513,,,33.0,95.891207,,22.196262,8.895131,8.17757,7.943925,67.56071,2759,10.5,0
77,95,2290,Alaska,Yukon-Koyukuk,100.0,15406.858678,27.538435,6.090424,5.550066,4.868914,32.509575,30.7,2.1,25.1,100.0,18.661788,2.0,7.0,28.571429,66.0,1230.2,34.711965,985.0,21.908363,1.0,18.77229,5327,2.0,38.24092,2615,6.0,114.72275,872,1161.0,33.0,27.0,2959,3442,85.967461,633,1317,48.063781,345.0,2458.0,14.035801,30.8,5.991208,497.0,1490.0,33.355705,0,0.0,,,67.0,248.322894,,52.295918,9.170082,16.326531,38.77551,22.735849,2120,9.2,0
156,178,5135,Arkansas,Sharp,387.0,9896.321503,28.318474,5.844936,5.580724,9.836066,28.539328,40.7,6.0,39.0,45.093837,16.250798,9.0,33.0,27.272727,61.0,350.7,47.919294,1343.0,10.634254,4.0,23.03351,4342,4.0,22.93315,4361,37.0,212.13164,471,4942.0,33.0,38.0,10354,12406,83.459616,1615,3300,48.939394,285.0,5742.0,4.963427,35.9,4.236374,992.0,3621.0,27.395747,28,16.12346,,,94.0,108.960241,8.9,15.695682,11.527778,3.975326,1.576422,78.562895,5859,30.5,0
223,245,8003,Colorado,Alamosa,203.0,9501.440498,21.952996,4.515349,4.363536,10.572988,19.460692,31.0,7.7,20.8,83.127226,18.179169,7.0,16.0,43.75,95.0,574.0,17.850703,1795.0,13.87386,13.0,77.92363,1283,21.0,129.36611,773,236.0,1453.82862,69,3275.0,31.0,43.0,8355,9619,86.859341,2469,4079,60.529542,313.0,8726.0,3.586981,24.9,4.544391,632.0,3913.0,16.151291,16,9.590601,,,68.0,82.307515,4.6,19.008264,13.458582,5.041322,1.322314,78.700197,6601,13.4,0


In [108]:
# All values looked up on https://crimegrade.org/
imputing_violent_crime_rate = [
    (27, 234), 
    (66, 1974), 
    (67, 259), 
    (68, 812), 
    (71, 234), 
    (72, 466), 
    (75, 812), 
    (77, 944), 
    (156, 633), 
    (223, 627), 
    (280, 151), 
    (281, 224), 
    (353, 504), 
    (370, 537), 
    (383, 219), 
    (573, 341), 
    (603, 279), 
    (632, 399), 
    (638, 320), 
    (641, 315), 
    (645, 277), 
    (646, 309), 
    (651, 386), 
    (653, 397), 
    (667, 222), 
    (669, 273), 
    (680, 252), 
    (681, 447), 
    (685, 273), 
    (688, 271), 
    (689, 234), 
    (691, 278), 
    (695, 265), 
    (696, 322), 
    (697, 400), 
    (699, 274), 
    (703, 211), 
    (710, 472), 
    (714, 247), 
    (715, 351), 
    (718, 337), 
    (724, 174), 
    (745, 262), 
    (753, 225), 
    (771, 565), 
    (800, 293), 
    (1116, 255), 
    (1148, 251), 
    (1202, 434), 
    (1291, 228), 
    (1292, 166), 
    (1293, 201), 
    (1294, 176), 
    (1295, 424), 
    (1296, 202), 
    (1297, 190), 
    (1300, 167), 
    (1302, 438), 
    (1303, 232), 
    (1304, 238), 
    (1307, 144), 
    (1310, 153), 
    (1314, 603), 
    (1317, 165), 
    (1319, 198), 
    (1322, 878), 
    (1325, 116), 
    (1326, 182), 
    (1327, 163), 
    (1330, 234), 
    (1334, 122), 
    (1335, 162), 
    (1336, 255), 
    (1337, 209), 
    (1340, 145), 
    (1341, 216), 
    (1342, 236), 
    (1343, 133), 
    (1344, 226), 
    (1345, 433), 
    (1348, 153), 
    (1349, 228), 
    (1351, 405), 
    (1352, 384), 
    (1353, 128), 
    (1354, 188), 
    (1356, 192), 
    (1357, 148), 
    (1360, 221), 
    (1361, 168), 
    (1362, 303), 
    (1363, 197), 
    (1364, 200), 
    (1365, 484), 
    (1389, 253), 
    (1516, 167), 
    (1552, 192), 
    (1571, 150), 
    (1627, 500), 
    (1725, 371), 
    (1748, 416), 
    (1767, 423), 
    (1840, 145), 
    (1860, 167), 
    (1869, 180), 
    (1916, 234), 
    (2007, 441), 
    (2155, 195), 
    (2156, 295), 
    (2162, 1293), 
    (2172, 1604), 
    (2303, 1693), 
    (2488, 367), 
    (2494, 169), 
    (2691, 363), 
    (2705, 474), 
    (2719, 209), 
    (2729, 282), 
    (2730, 194), 
    (2734, 385), 
    (2780, 332), 
    (2828, 206)
]

In [109]:
for index, crime in imputing_violent_crime_rate:
    df.loc[index, 'Violent Crime Rate'] = crime
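Because the list is keyed on row labels, a typo in a label would silently overwrite a real value or create a new row. A possible safeguard is sketched below on toy data. The pairs in `looked_up` and the existing rates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with two missing rates (values made up)
toy = pd.DataFrame({"Violent Crime Rate": [272.3, np.nan, np.nan, 414.3]})
looked_up = [(1, 234), (2, 1974)]  # (row label, rate per 100k), hypothetical pairs

# Turn the pairs into a Series indexed by row label
fill_values = pd.Series(dict(looked_up), dtype="float64")

# Guard: every label must exist in the frame and must currently be null
assert fill_values.index.isin(toy.index).all()
assert toy.loc[fill_values.index, "Violent Crime Rate"].isna().all()

# fillna aligns on the index, so only the listed rows are filled
toy["Violent Crime Rate"] = toy["Violent Crime Rate"].fillna(fill_values)
```

Since `fillna` never touches non-null cells, existing rates stay intact even if a label is listed by mistake.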

At this point in my cleaning process, I had a few rows that were only missing 1 item. I either dropped the row or imputed as noted below. 

In [110]:
df[df['# Driving Deaths'].isnull()]

Unnamed: 0,index,FIPS,State,County,Deaths,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,# Alcohol-Impaired Driving Deaths,# Driving Deaths,% Driving Deaths with Alcohol Involvement,# Chlamydia Cases,Chlamydia Rate,Teen Birth Rate,# Uninsured,% Uninsured,# Primary Care Physicians,Primary Care Physicians Rate,Primary Care Physicians Ratio,# Dentists,Dentist Rate,Dentist Ratio,# Mental Health Providers,Mental Health Provider Rate,Mental Health Provider Ratio,Preventable Hospitalization Rate,% With Annual Mammogram,% Vaccinated,# Completed High School,Population,% Completed High School,# Some College,Population.1,% Some College,# Unemployed,Labor Force,% Unemployed,% Children in Poverty,Income Ratio,# Children in Single-Parent Households,# Children in Households,% Children in Single-Parent Households,# Associations,Social Association Rate,Annual Average Violent Crimes,Violent Crime Rate,# Injury Deaths,Injury Death Rate,Average Daily PM2.5,% Severe Housing Problems,Severe Housing Cost Burden,Overcrowding,Inadequate Facilities,% Drive Alone to Work,# Workers who Drive Alone,% Long Commute - Drives Alone,Water_Violation
74,89,2220,Alaska,Sitka,94.0,5733.049236,14.237609,3.755396,4.167775,4.802744,17.594305,25.7,8.3,18.1,100.0,19.274869,,,,19.0,218.7,11.384062,1119.0,15.477178,17.0,196.59998,509,12.0,141.29283,708,178.0,2095.84364,48,4090.0,33.0,39.0,5821,6146,94.712008,1704,2484,68.599034,181.0,4336.0,4.174354,9.3,3.712332,453.0,1835.0,24.686649,15,17.347057,9.0,100.631744,35.0,80.419098,,15.51481,12.450369,2.679831,2.397743,61.845494,4660,1.1,0


In [111]:
# dropping Sitka, AK - only one with nans in alcohol impaired driving deaths now
# also no reputable source to use for imputation
df.drop(index=74, inplace=True)

In [112]:
# dropping rows (4) without preventable hosp rate
df.drop(index=df[df['Preventable Hospitalization Rate'].isnull()].index, inplace=True)

In [113]:
# dropping rows without low birthweight (4)
df.drop(index=df[df['% Low birthweight'].isnull()].index, inplace=True)

In [114]:
# imputing average daily PM2.5 with mean of Alaska values
# note: this fills every remaining null in the column, whatever the state
mean_value = df[df['State'] == 'Alaska']['Average Daily PM2.5'].mean()
# assigning the result back avoids relying on an inplace fill of a column selection
df['Average Daily PM2.5'] = df['Average Daily PM2.5'].fillna(mean_value)
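If nulls in this column ever came from more than one state, an alternative would be to fill each gap with the mean of its own state. This is a sketch of that variant, not what the notebook does. The toy states and values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Alaska", "Hawaii", "Hawaii"],
    "Average Daily PM2.5": [5.0, 7.0, np.nan, 4.0, np.nan],
})
col = "Average Daily PM2.5"

# transform("mean") returns one value per row: the mean of that row's state
state_means = toy.groupby("State")[col].transform("mean")

# Fill each null with the mean of its own state
toy[col] = toy[col].fillna(state_means)
```

A state with no observed values at all would still be left null, so a final null check remains useful.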


The columns dropped below had too many null values to impute in the time available. I chose to drop those columns and move forward with my clustering model.

In [115]:
cols_to_drop = ['Annual Average Violent Crimes', 'Mental Health Provider Ratio', 
                '# Mental Health Providers', 'Mental Health Provider Rate', 'Deaths', 
                'Years of Potential Life Lost Rate', 
                '# Primary Care Physicians', 'Primary Care Physicians Ratio', 
                'Primary Care Physicians Rate', 'Dentist Ratio', 'Dentist Rate', 
                '# Dentists', 'Chlamydia Rate', '# Chlamydia Cases', 'Food Environment Index', 
                'Teen Birth Rate']
df.drop(columns=cols_to_drop, inplace=True)

In [116]:
df.isnull().sum().sum()

0

Exporting Clean Data for use in my modeling notebook

In [117]:
df.to_csv('../data/demographic_clean.csv', index=False)
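A CSV file does not store dtypes, so the modeling notebook would need to pass a `dtype` for `FIPS` again when reading this file. The round trip below illustrates the difference using an in-memory buffer in place of the real file. The toy codes and the `value` column are made up.

```python
import io
import pandas as pd

# Toy frame with FIPS stored as text
toy = pd.DataFrame({"FIPS": ["01001", "51610"], "value": [1.0, 2.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers integers and the leading zero is lost
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Read with an explicit dtype: the codes stay as text
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"FIPS": str})
```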

In [119]:
df.dtypes

index                                            int64
FIPS                                            object
State                                           object
County                                          object
% Fair or Poor Health                          float64
Average Number of Physically Unhealthy Days    float64
Average Number of Mentally Unhealthy Days      float64
% Low birthweight                              float64
% Smokers                                      float64
% Adults with Obesity                          float64
% Physically Inactive                          float64
% With Access to Exercise Opportunities        float64
% Excessive Drinking                           float64
# Alcohol-Impaired Driving Deaths              float64
# Driving Deaths                               float64
% Driving Deaths with Alcohol Involvement      float64
# Uninsured                                    float64
% Uninsured                                    float64
Preventabl