# Compile and Analyse Vital Signs


After some manual cleaning, I got the historic data (2000-2010) for many indicators into a common format, stored in several Excel files with several sheets each. I also saved the modern (2010 onward) indicators as CSVs, to avoid re-pulling from the APIs. In this notebook, I'll compile everything and set up some analysis.

### Colab-Specific Steps

In [6]:
# clone the GitHub repository, so that we have all the necessary files
!git clone https://github.com/ejf78/cdc_vitalsigns.git

Cloning into 'cdc_vitalsigns'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 37 (delta 5), reused 37 (delta 5), pack-reused 0
Unpacking objects: 100% (37/37), done.
Checking out files: 100% (33/33), done.


In [2]:
# install geopandas in google colab 
!pip install geopandas

Collecting geopandas
  Downloading geopandas-0.10.2-py2.py3-none-any.whl (1.0 MB)
[?25l[K     |▎                               | 10 kB 15.9 MB/s eta 0:00:01[K     |▋                               | 20 kB 10.3 MB/s eta 0:00:01[K     |█                               | 30 kB 8.0 MB/s eta 0:00:01[K     |█▎                              | 40 kB 7.3 MB/s eta 0:00:01[K     |█▋                              | 51 kB 4.3 MB/s eta 0:00:01[K     |██                              | 61 kB 4.5 MB/s eta 0:00:01[K     |██▎                             | 71 kB 4.3 MB/s eta 0:00:01[K     |██▌                             | 81 kB 4.8 MB/s eta 0:00:01[K     |██▉                             | 92 kB 5.1 MB/s eta 0:00:01[K     |███▏                            | 102 kB 4.2 MB/s eta 0:00:01[K     |███▌                            | 112 kB 4.2 MB/s eta 0:00:01[K     |███▉                            | 122 kB 4.2 MB/s eta 0:00:01[K     |████▏                           | 133 kB 4.2 MB/s eta 0:0

In [8]:
# load packages
import pandas as pd
import os # for navigating directories
import requests
import geopandas as gpd
from geopandas import GeoDataFrame

In [15]:
# navigate into the directory
os.chdir("cdc_vitalsigns")

## Load and Compile Data 

In [16]:
# api info 
# read list of indicators 
api_df = pd.read_csv("VS-Indicator-APIs_EF.csv") # new version - I've labeled which API calls to make under 'pull'
api_df.set_index("ShortName", inplace=True, drop = False) # drop = False because I want to keep ShortName as a column too
api_df

ShortName,Indicator Number,Indicator,ShortName,Section,API,pull
tpopXX,1,Total Population,tpopXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
maleXX,2,Total Male Population,maleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
femaleXX,3,Total Female Population,femaleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
paaXX,4,Percent of Residents - Black/African-American ...,paaXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
pwhiteXX,5,Percent of Residents - White/Caucasian (Non-Hi...,pwhiteXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
...,...,...,...,...,...,...
pread8XX,211,Percentage of 8th Grade Students who Met or Ex...,pread8XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
palg1XX,212,Percentage of Students who Met or Exceeded PAR...,palg1XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
palg2XX,213,Percentage of Students who Met or Exceeded PAR...,palg2XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
kraXX,214,Kindergarten Readiness,kraXX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1


#### Historic Data (2000 - 2006) 

Already cleaned 

In [None]:
### historic data 
# read from CSV 
historic_indicators = pd.read_csv("precompiled_historic_indicators.csv")
# pivot longer for join 
hvs = historic_indicators.melt(id_vars = ["CSA", "indicator", "indicator_category"], 
                        var_name = "year")
# drop indicator category (will add it later so that it's uniform)
hvs = hvs.drop(["indicator_category"], axis = 1)

#### Clean data from 2010 onward

In [None]:
### modern data 
# read in data files for modern indicators (saved from previous API calls)
mvs1 = pd.read_csv("modern_vital_signs_raw_1.csv")
mvs2 = pd.read_csv("modern_vital_signs_raw_2.csv")
mvs3 = pd.read_csv("modern_vital_signs_raw_3.csv")



## reformat / melt 
# mvs1
objectid_cols = [col for col in mvs1.columns if "OBJECTID" in col] # remove any columns called OBJECTID
mvs1 = mvs1.drop(objectid_cols, axis = 1) 
# drop geometry as well; it's causing some problems
mvs1 = mvs1.drop(['Shape__Area', 'Shape__Length', "geometry"], axis = 1)
index_cols = ["CSA2010"]
mvs_df_1 = mvs1.melt(id_vars = index_cols,
                     var_name = "year-indicator",
                     value_name = "value")
# mvs2
objectid_cols = [col for col in mvs2.columns if "OBJECTID" in col]
mvs2 = mvs2.drop(objectid_cols, axis = 1)
mvs2 = mvs2.drop(['Shape__Area', 'Shape__Length', "geometry"], axis = 1)
mvs_df_2 = mvs2.melt(id_vars = index_cols,
                     var_name = "year-indicator",
                     value_name = "value")
# mvs3
objectid_cols = [col for col in mvs3.columns if "OBJECTID" in col]
mvs3 = mvs3.drop(objectid_cols, axis = 1)
mvs3 = mvs3.drop(['Shape__Area', 'Shape__Length', "geometry"], axis = 1)
mvs_df_3 = mvs3.melt(id_vars = index_cols,
                     var_name = "year-indicator",
                     value_name = "value")

# rename
mvs_df_1 = mvs_df_1.rename(columns = {"CSA2010":"CSA"})
mvs_df_2 = mvs_df_2.rename(columns = {"CSA2010":"CSA"})
mvs_df_3 = mvs_df_3.rename(columns = {"CSA2010":"CSA"})
## add column for year, based on indicator/year field 
mvs_df_1["year"] = ['20' + i[-2:] for i in mvs_df_1["year-indicator"]]
mvs_df_2["year"] = ['20' + i[-2:] for i in mvs_df_2["year-indicator"]]
mvs_df_3["year"] = ['20' + i[-2:] for i in mvs_df_3["year-indicator"]]
## add column for indicator, based on indicator/year field 
mvs_df_1["indicator"] = [i[:-2] for i in mvs_df_1["year-indicator"]]
mvs_df_2["indicator"] = [i[:-2] for i in mvs_df_2["year-indicator"]]
mvs_df_3["indicator"] = [i[:-2] for i in mvs_df_3["year-indicator"]]
# drop year-indicator field 
mvs_df_1 = mvs_df_1.drop(["year-indicator"], axis = 1)
mvs_df_2 = mvs_df_2.drop(["year-indicator"], axis = 1)
mvs_df_3 = mvs_df_3.drop(["year-indicator"], axis = 1)
# ended up with duplicates because of NA values. Drop those 
mvs_df_1.dropna(subset = ["value"], inplace = True)
mvs_df_2.dropna(subset = ["value"], inplace = True)
mvs_df_3.dropna(subset = ["value"], inplace = True)
# it also seems like there are some indicators where the API failed to pull data, resulting in NAs in CSA2010 
mvs_df_1.dropna(subset = ["CSA"], inplace = True)
mvs_df_2.dropna(subset = ["CSA"], inplace = True)
mvs_df_3.dropna(subset = ["CSA"], inplace = True)
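
The three blocks above repeat the same steps for `mvs1`, `mvs2`, and `mvs3`. As a minimal sketch, they could be folded into one helper. The name `tidy_modern` is hypothetical, and unlike the cell above it skips the shape/geometry columns quietly when they are absent instead of raising an error.

```python
import pandas as pd

def tidy_modern(raw: pd.DataFrame) -> pd.DataFrame:
    """Melt one wide 'modern' table into long CSA / value / year / indicator rows."""
    # columns that are not indicators: OBJECTID variants plus shape/geometry fields
    drop_cols = [c for c in raw.columns if "OBJECTID" in c]
    drop_cols += [c for c in ("Shape__Area", "Shape__Length", "geometry") if c in raw.columns]
    long = (
        raw.drop(columns=drop_cols)
           .melt(id_vars=["CSA2010"], var_name="year-indicator", value_name="value")
           .rename(columns={"CSA2010": "CSA"})
    )
    # the last two characters of each column name are the year, the rest is the indicator
    long["year"] = "20" + long["year-indicator"].str[-2:]
    long["indicator"] = long["year-indicator"].str[:-2]
    # drop rows with a missing value or a missing CSA, as in the cell above
    return long.drop(columns=["year-indicator"]).dropna(subset=["value", "CSA"])

# toy input with made-up numbers, only to show the shape of the result
toy = pd.DataFrame({
    "OBJECTID": [1, 2],
    "CSA2010": ["A", "B"],
    "tpop10": [10.0, None],
    "tpop11": [11.0, 12.0],
    "Shape__Area": [1.0, 2.0],
})
toy_long = tidy_modern(toy)
print(toy_long)
```

With the real files this would be called once per table, for example `mvs_df_1 = tidy_modern(mvs1)`.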





In [None]:
# UNUSED 
##### pivot wider again, so that each year becomes its own column (not currently in use)
index_cols_pivotlonger = ["CSA", "indicator"]
mvs_pivot_1 = mvs_df_1.pivot(index = index_cols_pivotlonger,columns = "year", values = "value").reset_index()
mvs_pivot_2 = mvs_df_2.pivot(index = index_cols_pivotlonger,columns = "year", values = "value").reset_index()
mvs_pivot_3 = mvs_df_3.pivot(index = index_cols_pivotlonger,columns = "year", values = "value").reset_index()

### small data fix - remove some placeholder rows from mvs3 
mvs_df_3 = mvs_df_3.query("indicator != 'CSA2010'")
# export, for posterity
mvs_pivot_1.to_csv("modern_vital_signs_pivot_1.csv", index = False)
mvs_pivot_2.to_csv("modern_vital_signs_pivot_2.csv", index = False)
mvs_pivot_3.to_csv("modern_vital_signs_pivot_3.csv", index = False)

#### Compile into one dataframe

In [None]:
### concatenate into one big dataframe of vital signs 
vs = pd.concat([hvs, mvs_df_1,mvs_df_2,mvs_df_3])

In [None]:
## clean some names 
# remove asterisk from CSA names
# correct some spellings 
# unify some names that may be abbreviated
vs["CSA"] = vs.CSA.str.replace("*","", regex = False)
vs["CSA"] = vs.CSA.str.replace("Edmonson","Edmondson", regex = False)
vs["CSA"] = vs.CSA.str.replace("Falstaff","Fallstaff", regex = False) # really not sure which is right, but BNIA uses Fallstaff in modern communications
vs["CSA"] = vs.CSA.str.replace("Mt. Washington","Mount Washington", regex = False)
vs["CSA"] = vs.CSA.str.replace("Mt. Winans","Mount Winans", regex = False)

Things that need to get cleaned up: 
- anything with a `*`
- Edmonson Village vs. Edmondson Village
- Glen-Fallstaff vs. Glen-Falstaff
- Jonestown/Oldtown vs. Oldtown/Middle East
- Washington Village vs. Washington Village/Pigtown
- Westport/Mount Winans/Lakeland vs. Westport/Mt. Winans/Lakeland
- Perkins/Middle East vs. Oldtown/Middle East
- Medfield/Hampden/Woodberry vs. Medfield/Hampden/Woodberry/Remington
- Mount Washington/Coldspring vs. Mt. Washington/Coldspring
In [None]:
# export for posterity 
vs.to_csv("full_vital_signs.csv", index = False)

In [None]:
# create an info dataframe of indicator, years available, description, and category 
info = vs[["indicator","year"]].groupby(["indicator"])["year"].apply(set).reset_index()
# grab the info from the api DF
indicator_desc = api_df.rename(columns = {"Indicator":"indicator_description","ShortName":"indicator","Section":"category"})[["indicator_description","indicator","category"]]
info = info.merge(indicator_desc, how = "left")
info

Unnamed: 0,indicator,year,indicator_description,category
0,Fastfd,"{2018, 2013, 2011}",,
1,Hhsi,{20ze},,
2,aastud,"{2020, 2015, 2017, 2019, 2013, 2016, 2014, 201...",,
3,aastudXX,"{2002, 2009, 2007, 2000, 2003, 2005, 2006 - 20...",Percent of Students that are African American ...,Education and Youth
4,abse,"{2015, 2017, 2019, 2013, 2016, 2014, 2010, 201...",,
...,...,...,...,...
233,voted,"{2018, 2016, 2014, 2010, 2012}",,
234,walked,"{2018, 2015, 2017, 2019, 2013, 2016, 2014, 201...",,
235,wlksc,"{2017, 2011}",,
236,wstud,"{2020, 2015, 2017, 2019, 2013, 2016, 2014, 201...",,
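
The `Hhsi` / `{20ze}` row in the table above comes from slicing the last two characters off a column name that has no year suffix: `"Hhsize"[:-2]` is `"Hhsi"` and `"20" + "Hhsize"[-2:]` is `"20ze"`. A hedged alternative is to split off the year only when the name really ends in two digits. The function name `split_year_indicator` is hypothetical, and the sketch assumes every genuine year suffix is exactly two digits.

```python
import pandas as pd

def split_year_indicator(labels: pd.Series) -> pd.DataFrame:
    """Split labels like 'tpop10' into indicator 'tpop' and year '2010'.

    Labels without a two-digit suffix keep their full name and get a missing year.
    """
    # lazy prefix, then an optional two-digit group anchored at the end of the label
    parts = labels.str.extract(r"^(?P<indicator>.*?)(?P<year>\d{2})?$")
    # prepend the century only where a year was actually found
    parts["year"] = parts["year"].map(lambda yy: "20" + yy, na_action="ignore")
    return parts

demo = split_year_indicator(pd.Series(["tpop10", "age65_19", "Hhsize"]))
print(demo)
```

An indicator whose own name ends in two digits would still be split incorrectly, so the result is worth checking against the list of known short names.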


In [None]:
# export 
info.to_csv("indicator_info.csv", index = False)

In [None]:
# how many indicators should I have? 
all_indicators = [hvs.indicator,mvs_df_1.indicator,mvs_df_2.indicator, mvs_df_3.indicator]
indicator_set = set().union(*all_indicators)
len(indicator_set) # a total of 238 unique identifiers (this seems too high...)

238

In [None]:
# how many did I end up with? 
len(set(vs.indicator))

238
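
Part of the reason the count looks high may be that some names differ only by case or by a trailing `XX` (for example `Fastfd` / `fastfd` and `aastud` / `aastudXX` both appear in the outputs of this notebook). The sketch below groups such variants. Treating them as the same indicator is an assumption, not something the data confirms, and `group_name_variants` is a hypothetical helper name.

```python
from collections import defaultdict

def group_name_variants(names):
    """Group names that are identical after lower-casing and dropping a trailing 'XX'."""
    groups = defaultdict(set)
    for name in names:
        key = name.lower()
        if key.endswith("xx"):
            key = key[:-2]
        groups[key].add(name)
    # keep only the keys that collect more than one spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

example_names = ["Fastfd", "fastfd", "aastud", "aastudXX", "arrest", "compXX", "compl"]
variants = group_name_variants(example_names)
print(variants)
```

Running it on `indicator_set` would show how many of the 238 identifiers collapse under this rule.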

In [None]:
indicator_list = list(indicator_set)
indicator_list.sort()

In [None]:
indicator_list

['Fastfd',
 'Hhsi',
 'aastud',
 'aastudXX',
 'abse',
 'abseXX',
 'abshs',
 'abshsXX',
 'absmd',
 'absmdXX',
 'affordm',
 'affordmXX',
 'affordr',
 'affordrXX',
 'age0-18_XX',
 'age18_',
 'age24_',
 'age24_XX',
 'age45-64_XX',
 'age5_',
 'age64_',
 'age64_XX',
 'age65_',
 'age65_XX',
 'arrest',
 'artevnt',
 'bahigher',
 'baltvac',
 'banks',
 'birthwt',
 'birthwtXX',
 'biz1_',
 'biz2_',
 'biz4_',
 'biz4_XX',
 'bkln',
 'business_50-99_emp',
 'busload',
 'caracc',
 'caslt',
 'clogged',
 'cloggedXX',
 'cmos',
 'community_dev_corporations',
 'community_gardens',
 'compXX',
 'compl',
 'comprop',
 'constper',
 'crehab',
 'crehabXX',
 'crime',
 'crimeXX',
 'dirtyst',
 'dirtystXX',
 'dom',
 'domXX',
 'domvio',
 'domvioXX',
 'drop',
 'dropXX',
 'eattendXX',
 'ebll',
 'ebllXX',
 'elheat',
 'empl',
 'emplXX',
 'fam',
 'familiesrelatedkids',
 'farms',
 'farmsXX',
 'fastfd',
 'female',
 'femaleXX',
 'femhhs',
 'fore',
 'foreXX',
 'gunhom',
 'hazardous_waste_sites_count',
 'hcvhouse',
 'heatgas',
 'hf

In [None]:
## TO DO: check the api call for "Hhsize" 
# modern data 
# does not seem to deliver a year, which causes some weirdness in the data frame 
# making use of previously created functions
def getGDFfromURL(url, layer=0):
    # GDF stands for GeoDataFrame; this is the inner function called by getGDF
    # query string: all rows (where=1=1), all fields, coordinates in WGS84 (outSR=4326), returned as JSON
    tail = "/"+str(layer)+"/query?where=1%3D1&outFields=*&outSR=4326&f=json"
    url += tail
    print(url)
    gdf = gpd.read_file(url) # GeoPandas can read straight from the API, given the right URL
    return gdf

def getGDF(shortname, layer=0):
    # This is the outer function called by the user; it looks up the API URL by ShortName and calls getGDFfromURL
    url = api_df.loc[shortname, "API"]
    return getGDFfromURL(url, layer)

hhsizeXX = getGDF("hhsizeXX")

https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Hhsize/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json


ImportError: the 'read_file' function requires the 'fiona' package, but it is not installed or does not import correctly.
Importing fiona resulted in: DLL load failed while importing ogrext: The specified procedure could not be found.
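
Since `gpd.read_file` depends on `fiona`, one possible workaround when only the attribute columns are needed is to fetch the JSON with `requests` and build a plain DataFrame from each feature's `attributes`. This is a sketch under assumptions: it relies on the reply having the usual `features` / `attributes` layout, it discards geometry, and the function names and sample payload (including its field names and values) are made up for illustration.

```python
import pandas as pd

def attributes_to_frame(payload: dict) -> pd.DataFrame:
    """Build a plain DataFrame from the 'attributes' of each feature in the reply."""
    rows = [feature.get("attributes", {}) for feature in payload.get("features", [])]
    return pd.DataFrame(rows)

def fetch_attributes(url: str, layer: int = 0, timeout: float = 30.0) -> pd.DataFrame:
    """Query one layer of a feature service and return its attribute table (no geometry)."""
    import requests  # imported here so the parsing helper can be used without it

    params = {"where": "1=1", "outFields": "*", "returnGeometry": "false", "f": "json"}
    response = requests.get(f"{url}/{layer}/query", params=params, timeout=timeout)
    response.raise_for_status()
    return attributes_to_frame(response.json())

# made-up payload, only to show the expected shape of the parsed reply
sample_payload = {
    "features": [
        {"attributes": {"CSA2010": "Canton", "hhsize": 2.1}, "geometry": {"rings": []}},
        {"attributes": {"CSA2010": "Fells Point", "hhsize": 1.9}},
    ]
}
sample_frame = attributes_to_frame(sample_payload)
print(sample_frame)
```

With the real service this would be called as `fetch_attributes(api_df.loc["hhsizeXX", "API"])`, which also makes it easy to inspect which column names the Hhsize layer actually returns.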

In [None]:
api_df

Unnamed: 0,Indicator Number,Indicator,ShortName,Section,API,pull
0,1,Total Population,tpopXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
1,2,Total Male Population,maleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
2,3,Total Female Population,femaleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
3,4,Percent of Residents - Black/African-American ...,paaXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
4,5,Percent of Residents - White/Caucasian (Non-Hi...,pwhiteXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
...,...,...,...,...,...,...
210,211,Percentage of 8th Grade Students who Met or Ex...,pread8XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
211,212,Percentage of Students who Met or Exceeded PAR...,palg1XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
212,213,Percentage of Students who Met or Exceeded PAR...,palg2XX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
213,214,Kindergarten Readiness,kraXX,Education and Youth,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1
