## County Health Rankings

This data set is intended to demonstrate how various factors affect the health of counties in the United States, as well as highlight the dramatic range in health from county to county. These factors include access to affordable housing, access to well-paying jobs, education opportunities, and many others. This data is provided as a resource to help identify and address injustices and inequities in the health of counties in the United States.

The data set includes statistics on premature death rates, physical health, mental health, low birthweight, adult smoking, and many other measures. The data is primarily numeric.

Documentation can be found at: https://www.countyhealthrankings.org/sites/default/files/media/document/DataDictionary_2021.pdf


In [None]:
import pandas as pd
import requests
import io

from sklearn.impute import SimpleImputer 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from pylab import cm

import matplotlib.pyplot as plt
%matplotlib inline

#### Read in the data 

In [None]:
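url = 'http://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

file_object = io.StringIO(response.content.decode('utf-8'))
# skip the first row so that the second row is used as the column names
df = pd.read_csv(file_object, skiprows=[0])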
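
To illustrate what `skiprows=[0]` does, here is a minimal sketch on a made-up CSV. It assumes a layout with a row of descriptive names followed by a row of short variable names; the toy rows and values below are hypothetical and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV with two header rows:
# row 0 holds descriptive names, row 1 holds the short variable names.
toy_csv = (
    "State FIPS Code,County FIPS Code,Premature death raw value\n"
    "statecode,countycode,v001_rawvalue\n"
    "23,0,6500.0\n"
    "23,1,7100.5\n"
)

# Skipping row 0 makes the second row the header
toy = pd.read_csv(io.StringIO(toy_csv), skiprows=[0])
print(list(toy.columns))
```

With the first row skipped, the short variable names become the column labels, which is what the later `rawvalue` selection relies on.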

In [None]:
df.head()

The data contains aggregate rows for the USA as a whole and for each state. Let's focus only on counties.

In [None]:
df = df[df['countycode']!=0]

There are 16 Maine counties that we will be taking a close look at later on 

In [None]:
maine_fipscode = df[(df['state']=='ME')].fipscode
maine_county_labels = [' Andr',' Aroo',' Cumb', ' Fran', ' Hanc',' Kenn', ' Knox', ' Linc', ' Oxfo', 
                       ' Peno', ' Pisc', ' Saga', ' Some', ' Waldo', ' Wash', ' York']

There are a lot of columns we do not need. In particular, the numerator, denominator, confidence interval, and similar columns used to form the raw values are included. We get rid of them.

In [None]:
# select all the columns from CHR with raw values;
# these columns contain the major health-related variables
all_cols = df.columns
col_names = [i for i in all_cols if 'rawvalue' in i]
print('Number of CHR variables: ', len(col_names))
# we include the fipscode column because we want to pick out only the Maine counties later on
col_names.insert(0, "fipscode")
df_sub = df[col_names]
df_sub = df_sub.set_index('fipscode')
df_sub.head()
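
As an aside, the same column selection can be sketched with `DataFrame.filter`. This is an alternative idiom, not the approach used in this notebook, and the toy frame below is hypothetical (the `_numerator` and `_cilow` suffixes are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame that mixes raw-value columns with helper columns
toy = pd.DataFrame({
    "fipscode": [23001, 23003],
    "v001_rawvalue": [1.0, 2.0],
    "v001_numerator": [10, 20],
    "v036_rawvalue": [3.0, 4.0],
    "v036_cilow": [2.5, 3.5],
})

# Index by fipscode first, then keep only columns whose name contains 'rawvalue'
toy_sub = toy.set_index("fipscode").filter(like="rawvalue")
print(list(toy_sub.columns))
```

Setting the index before filtering means `fipscode` does not have to be added back to the list of column names by hand.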

We drop every column in which 30% or more of the rows are missing a value, so that each remaining column has values in more than 70% of the rows.

In [None]:
# count the null values, compute their percentage, and concatenate the results
missing = pd.concat([df_sub.isnull().sum(), 100*df_sub.isnull().mean()], axis=1)
missing.columns = ['count', 'percentage']
smissing = missing.sort_values(by='count', ascending=False)
print(smissing)
good_cols = smissing[smissing['percentage'] < 30].index
good_cols = good_cols.sort_values()
df_sub2 = df_sub[good_cols]
df_sub2.head()
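
The same rule can also be expressed with `DataFrame.dropna` and its `thresh` argument. The sketch below uses a hypothetical toy frame. Unlike the cell above, it keeps the original column order rather than sorting the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: columns with 0%, 20%, 30% and 50% missing values
toy = pd.DataFrame({
    "a": np.arange(10, dtype=float),
    "b": [np.nan] * 2 + [1.0] * 8,
    "c": [np.nan] * 3 + [1.0] * 7,
    "d": [np.nan] * 5 + [1.0] * 5,
})

# "Less than 30% missing" means strictly more than 70% non-null,
# so the smallest acceptable non-null count is floor(0.7 * n) + 1
n_rows = len(toy)
min_non_null = (7 * n_rows) // 10 + 1
kept = toy.dropna(axis=1, thresh=min_non_null)

# Cross-check against the percentage-based rule
pct_missing = 100 * toy.isnull().mean()
same = list(kept.columns) == list(pct_missing[pct_missing < 30].index)
print(list(kept.columns), same)
```

Integer arithmetic is used for the threshold so that a column sitting exactly at the 30% boundary is dropped, matching the strict `< 30` comparison.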

The following demonstration plot should be helpful for the Assignment 3 work.

In [None]:
fig, ax = plt.subplots(figsize=(8,8))

# Boolean mask that is True for the rows belonging to Maine counties
maine_counties = df_sub2.index.isin(maine_fipscode)
df_maine = df_sub2[maine_counties]

x_axis = df_maine['v036_rawvalue']
y_axis = df_maine['v001_rawvalue']
ax.scatter(x_axis, y_axis)

ax.set_xlabel('Poor physical health days', fontsize=16)
ax.set_ylabel('Premature death', fontsize=16)

# The labels are matched by position, so this assumes the Maine rows
# appear in alphabetical order by county name
for i, label in enumerate(maine_county_labels):
    ax.annotate(label, (x_axis.iloc[i], y_axis.iloc[i]))
    
plt.show()
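
The annotation loop pairs each label with a point by position. A more defensive variant looks the label up by FIPS code instead. The sketch below uses a hypothetical toy frame and assumes the data has a `county` column holding the county name; that column name is an assumption and should be checked against the data dictionary.

```python
import pandas as pd

# Hypothetical toy frame, deliberately not in alphabetical order.
# The 'county' column name is an assumption, not confirmed by this notebook.
toy = pd.DataFrame({
    "fipscode": [23031, 23001, 23005],
    "state": ["ME", "ME", "ME"],
    "county": ["York County", "Androscoggin County", "Cumberland County"],
})

# Map each FIPS code to a short label: the first four letters of the name
labels_by_fips = toy.set_index("fipscode")["county"].str.slice(0, 4)

for fips, label in labels_by_fips.items():
    print(fips, label)
```

In the plot, each point could then be annotated with `ax.annotate(labels_by_fips.loc[fips], (x_axis.loc[fips], y_axis.loc[fips]))` while looping over the Maine FIPS codes, so the labels stay correct whatever the row order is.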