
# Vital Signs Data Supplement 

This notebook pulls additional data from BNIA's APIs. The plan is to pull every available indicator that I didn't pull previously, plus the Baltimore City values for all indicators.

## Set Up

In [170]:
# clone the GitHub repository, so that we have all the necessary files
!git clone https://github.com/ejf78/cdc_vitalsigns.git

Cloning into 'cdc_vitalsigns'...
remote: Enumerating objects: 192, done.
remote: Counting objects: 100% (192/192), done.
remote: Compressing objects: 100% (171/171), done.
remote: Total 192 (delta 98), reused 69 (delta 21), pack-reused 0
Receiving objects: 100% (192/192), 75.65 MiB | 9.39 MiB/s, done.
Resolving deltas: 100% (98/98), done.
Checking out files: 100% (43/43), done.


In [171]:
!pip install geopandas



In [172]:
# load packages
import pandas as pd
import numpy as np
import os # for navigating directories
import requests # for API pull 
import geopandas as gpd

In [173]:
# navigate into the directory
os.chdir("cdc_vitalsigns")

## Pull from APIs

In [174]:
# api info 
# read list of indicators 
api_df = pd.read_csv("archive/VS-Indicator-APIs_EF.csv") # new version - I've labeled which API calls to make under 'pull'
api_df.set_index("ShortName", inplace=True, drop = False) # drop = False I want ShortName as a column 
# add column for indicator name (used in my own data)
api_df["indicator"] = [string.replace("XX", "") if isinstance(string, str) else None for string in api_df.ShortName]
# get full list of indicators we intend to pull
full_indicator_list = set(api_df[api_df.pull == 1].indicator)
api_df.head()

ShortName (index),Indicator Number,Indicator,ShortName,Section,API,pull,indicator
tpopXX,1,Total Population,tpopXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1,tpop
maleXX,2,Total Male Population,maleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1,male
femaleXX,3,Total Female Population,femaleXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1,female
paaXX,4,Percent of Residents - Black/African-American ...,paaXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1,paa
pwhiteXX,5,Percent of Residents - White/Caucasian (Non-Hi...,pwhiteXX,Census Demographics,https://services1.arcgis.com/mVFRs7NF4iFitgbY/...,1,pwhite


In [175]:
# making use of previously created functions
def getGDFfromURL(url, layer=0, shortname=None):
    # GDF stands for GeoDataFrame; this is the innermost function, called by getGDF
    # query tail: every feature (where 1=1), all fields, WGS 84 coordinates, JSON output
    tail = "/" + str(layer) + "/query?where=1%3D1&outFields=*&outSR=4326&f=json"
    url += tail
    print(url)
    # EF edits - for error handling in large batches
    try:
        gdf = gpd.read_file(url)  # GeoPandas can read the API response directly, given the right URL
    except Exception:
        # fall back to an empty frame so one failed pull does not stop the batch
        gdf = pd.DataFrame()
        print(f"Could not find results for {shortname}")
    return gdf

def getGDF(shortname, level=0):
    #This is outermost function called by user; it calls getGDFfromURL
    url = api_df.loc[shortname, "API"]
    return getGDFfromURL(url, level, shortname)

def getCollect(check_list, level = 0): # slight edit: I added level to this function 
    #This function collects all the target GDFs and puts into collection
    collect=[]
    for shortname in check_list:
        gdf=getGDF(shortname, level)
        collect.append(gdf)    
    return collect
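
For reference, here is a minimal sketch of how the hand-written query tail in `getGDFfromURL` could be built from named parameters instead. The helper name `build_query_url` is hypothetical (it is not part of this notebook), and the base URL is taken from the printed output further down.

```python
from urllib.parse import urlencode

def build_query_url(base_url: str, layer: int = 0) -> str:
    """Return the FeatureServer query URL for one layer of a service (hypothetical helper)."""
    params = {
        "where": "1=1",    # no filter: return every feature
        "outFields": "*",  # return every attribute column
        "outSR": 4326,     # coordinates in WGS 84 (longitude/latitude)
        "f": "json",       # response format
    }
    # safe="*" keeps the asterisk unescaped so the result matches the hand-written tail
    return f"{base_url}/{layer}/query?{urlencode(params, safe='*')}"

example = build_query_url(
    "https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Tpop/FeatureServer",
    layer=1,
)
print(example)
```

Layer 0 is what the notebook uses for CSA-level values and layer 1 for the Baltimore City values.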

#### Pulling new CSA-level values 

In [176]:
### which indicators are new to pull? 

# get list of indicators that already exist in the data 
existing_df = pd.read_csv("full_vital_signs.csv")
# identify the new ones
new_indicators = full_indicator_list - set(existing_df.indicator)
# but now we need the shortnames again 
new_indicator_shortnames = list(api_df[api_df.indicator.isin(new_indicators)].ShortName)
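
The logic in this cell is a set difference followed by a lookup back to the API short names. A small sketch with toy data (the three rows and the `already_have` set are made up for illustration):

```python
import pandas as pd

# toy stand-in for api_df: two indicators flagged for pulling, one not
api_toy = pd.DataFrame({
    "ShortName": ["tpopXX", "cashsaXX", "lightsXX"],
    "pull": [1, 1, 0],
})
api_toy["indicator"] = api_toy["ShortName"].str.replace("XX", "", regex=False)

# indicators we intend to pull, minus the ones already in the existing data
wanted = set(api_toy.loc[api_toy["pull"] == 1, "indicator"])
already_have = {"tpop"}  # toy stand-in for set(existing_df.indicator)
new_names = wanted - already_have

# map the remaining indicator names back to their API short names
new_shortnames_toy = api_toy.loc[api_toy["indicator"].isin(new_names), "ShortName"].tolist()
print(new_shortnames_toy)
```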

In [177]:
# % driving alone (drvaloneXX) is broken: opening the API URL shows an error 400
new_indicator_shortnames[21]
api_df.loc['drvaloneXX', "API"]
getGDF('drvaloneXX', level=0)

https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Dralone/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
Could not find results for drvaloneXX
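
Since `getGDFfromURL` swallows the exception, the reason for a failure is lost. One way to surface it would be to fetch the JSON separately and look for an error object. The sketch below assumes the service reports failures as an `"error"` entry with `code` and `message` inside the JSON body; that payload shape is an assumption, not something this notebook confirms, and `describe_payload` is a hypothetical helper.

```python
def describe_payload(payload: dict) -> str:
    """Summarize a parsed JSON response (assumed shape: 'error' or 'features' key)."""
    error = payload.get("error")
    if error is not None:
        return f"error {error.get('code')}: {error.get('message')}"
    return f"ok: {len(payload.get('features', []))} features"

# toy payloads, made up for illustration
print(describe_payload({"error": {"code": 400, "message": "Invalid URL"}}))
print(describe_payload({"features": [{}, {}]}))
```

In the notebook, the payload could come from `requests.get(url).json()`, since `requests` is already imported.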


In [178]:
### make API pull
collect_new = getCollect(new_indicator_shortnames)

https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Cashsa/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Taxlien/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Demper/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Histax/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Homtax/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Owntax/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Nomail/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFR

In [179]:
# turn the collections into a dataframe
new_indicator_df = pd.concat(collect_new)
# drop geometry and objectID
new_indicator_df = new_indicator_df.drop(['OBJECTID','OBJECTID_1','Shape__Area', 'Shape__Length', "geometry"], axis = 1)
new_indicator_df.head()

Unnamed: 0,CSA2010,cashsa11,cashsa12,cashsa13,cashsa14,cashsa15,cashsa16,cashsa17,cashsa18,cashsa19,cashsa20,taxlien15,taxlien16,taxlien17,taxlien18,taxlien19,demper11,demper12,demper13,demper14,demper15,demper16,demper17,demper18,demper19,demper20,histax12,histax13,histax14,histax15,histax16,histax17,histax18,histax19,homtax11,homtax12,homtax13,homtax14,homtax15,homtax16,...,treeplnt19,cebus11,cebus12,cebus13,cebus14,cebus15,cebus16,cebus17,cebus18,cebus19,ceemp11,ceemp12,ceemp13,ceemp14,ceemp15,ceemp16,ceemp17,Ceemp18,ceemp19,murals14,murals15,murals16,murals17,murals18,murals19,murals20,totjobs10,totjobs11,totjobs12,totjobs13,totjobs14,totjobs15,totjobs16,totjobs17,totjobs18,lights16,lights17,lights18,lights19,lights20
0,Allendale/Irvington/S. Hilton,78.22,76.086957,78.787879,76.5823,78.26087,71.038251,64.197531,57.471264,53.475936,49.565217,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Beechfield/Ten Hills/West Hills,32.05,25.373134,29.032258,34.7458,27.777778,30.120482,25.925926,15.568862,20.261438,13.496933,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Belair-Edison,66.67,67.391304,67.741935,69.1542,68.468468,59.745763,53.623188,50.482315,47.457627,40.15748,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Brooklyn/Curtis Bay/Hawkins Point,73.4,72.033898,76.859504,75.4237,74.814815,73.248408,69.306931,53.846154,60.427807,56.321839,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Canton,26.64,20.064725,15.460526,18.2836,18.360656,15.064103,14.438503,17.013889,12.759644,9.895833,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [180]:
### reformat data 

# melt (pivot longer)
new_indicator_melted = new_indicator_df.melt(id_vars = ["CSA2010"], 
                                  var_name = "year-indicator", 
                                  value_name = "value")
# drop NAs (a result of simply appending everything together)
new_indicator_melted.dropna(subset = ["value"], inplace = True)
# drop a strange value (indicator = 'City', value = 'Baltimore City')
new_indicator_melted = new_indicator_melted[new_indicator_melted.value != "Baltimore City"].copy()
# add year column 
new_indicator_melted["year"] = ['20' + i[-2:] for i in new_indicator_melted['year-indicator']]
new_indicator_melted["year_numeric"] = [int(y) for y in new_indicator_melted.year]
# add column for indicator 
new_indicator_melted["indicator"] = [i[:-2] for i in new_indicator_melted["year-indicator"]]
# drop indicator-year field 
new_indicator_melted = new_indicator_melted.drop(["year-indicator"], axis = 1)

# inspect the long-format result
new_indicator_melted

Unnamed: 0,CSA2010,value,year,year_numeric,indicator
0,Allendale/Irvington/S. Hilton,78.22,2011,2011,cashsa
1,Beechfield/Ten Hills/West Hills,32.05,2011,2011,cashsa
2,Belair-Edison,66.67,2011,2011,cashsa
3,Brooklyn/Curtis Bay/Hawkins Point,73.4,2011,2011,cashsa
4,Canton,26.64,2011,2011,cashsa
...,...,...,...,...,...
612124,Southwest Baltimore,24.322058,2020,2020,lights
612125,The Waverlies,23.990713,2020,2020,lights
612126,Upton/Druid Heights,13.923806,2020,2020,lights
612127,Washington Village/Pigtown,32.164274,2020,2020,lights
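
Slicing the last two characters works for columns like `cashsa11`, but a column such as `OBJECTID_1` or `City` yields a malformed year (`20_1`, `20ty`). As a sketch of an alternative, a regular expression can split the name only when it really ends in two digits. The toy frame below is made up, and prefixing `20` is the same assumption the notebook already makes.

```python
import pandas as pd

wide = pd.DataFrame({
    "CSA2010": ["Canton", "Belair-Edison"],
    "cashsa11": [26.64, 66.67],
    "cashsa12": [20.06, 67.39],
    "OBJECTID_1": [5, 3],  # bookkeeping column that should not become an indicator
})

long = wide.melt(id_vars=["CSA2010"], var_name="year-indicator", value_name="value")

# split "cashsa11" into indicator="cashsa", yy="11"; names without a two-digit suffix give NaN
parts = long["year-indicator"].str.extract(r"^(?P<indicator>.*?)(?P<yy>\d{2})$")

long = (
    long.join(parts)
    .dropna(subset=["yy"])                      # drop columns that did not match
    .assign(year=lambda d: "20" + d["yy"])      # assumes every year is 20xx
    .drop(columns=["year-indicator", "yy"])
)
print(long)
```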


In [None]:
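# troubleshooting: which columns produced the malformed year '20ty'?
# (only works before the 'year-indicator' column is dropped in the cell above)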
set(new_indicator_melted[new_indicator_melted.year == '20ty']["year-indicator"])
new_indicator_melted[(new_indicator_melted.year == '20ty')]

In [182]:
### combining existing and new DF 
# PLACEHOLDER FOR NOW 

#### Pulling Baltimore City values

In [183]:
## using the full list of indicators, run the APIs to collect baltimore data 

# get shortnames 
full_indicator_shortnames = list(api_df[api_df.indicator.isin(full_indicator_list)].ShortName)

# pull from API
collect_balt = getCollect(full_indicator_shortnames, level = 1)

https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Tpop/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Male/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Female/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Paa/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Pwhite/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/Pasi/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitgbY/arcgis/rest/services/P2more/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
https://services1.arcgis.com/mVFRs7NF4iFitg

In [188]:
# turn the collections into a dataframe
baltimore_df = pd.concat(collect_balt)
# save geometry separately, then drop geometry and objectID
balt_geo = baltimore_df[['OBJECTID','Shape__Area', 'Shape__Length', "geometry"]].copy()
baltimore_df = baltimore_df.drop(['OBJECTID','Shape__Area', 'Shape__Length', "geometry"], axis = 1)
#new_indicator_df = new_indicator_df.drop(['OBJECTID','Shape__Area', 'Shape__Length', "geometry"], axis = 1)
#new_indicator_df.head()
baltimore_df

Unnamed: 0,City_1,tpop10,tpop20,City,male10,female10,paa10,paa15,paa16,paa17,paa18,paa19,paa20,pwhite10,pwhite15,pwhite16,pwhite17,pwhite18,pwhite19,pwhite20,pasi10,pasi15,pasi16,pasi17,pasi18,pasi19,pasi20,p2more10,p2more15,p2more16,p2more17,p2more18,p2more19,p2more20,ppac10,ppac15,ppac16,ppac17,ppac18,ppac19,...,hcvhouse18,hcvhouse19,cebus11,cebus12,cebus13,cebus14,cebus15,cebus16,cebus17,cebus18,cebus19,ceemp11,ceemp12,ceemp13,ceemp14,ceemp15,ceemp16,ceemp17,ceemp18,ceemp19,murals14,murals15,murals16,murals17,murals18,murals19,murals20,totjobs10,totjobs11,totjobs12,totjobs13,totjobs14,totjobs15,totjobs16,totjobs17,totjobs18,lights15,lights16,lights17,lights18
0,Baltimore City,620961.0,585708.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,292249.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,,328712.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,,,63.81723,62.264039,62.420451,62.251289,61.922238,61.770646,57.300737,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,,,,,,,,,,28.278904,28.079987,27.682931,27.57617,27.53945,27.491166,26.855703,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,,,,Baltimore City,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,1.48641,1.53311,1.69093,1.64584,1.568537,1.404275,1.262559,1.322144,1.260949,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,11662.0,13151.0,14369.0,12619.0,15477.0,16090.0,15477.0,15125.0,13403.0,,,,,,,,,,,,,,,,,,,,
0,,,,Baltimore City,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,312.0,339.0,347.0,350.0,350.0,378.0,379.0,,,,,,,,,,,,,
0,,,,Baltimore City,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,320010.0,325799.0,334349.0,335497.0,344588.0,350797.0,337454.0,337057.0,323556.0,,,,


In [189]:
# examine 'City_1' vs. 'City'
city_cols = [col for col in baltimore_df.columns if 'City' in col]
print(city_cols)
print(baltimore_df.City_1.unique())
print(baltimore_df.City.unique())

['City_1', 'City']
['Baltimore City' nan]
[nan 'Baltimore City']
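
The two columns are complementary: each row has the label in one or the other. The next cell simply overwrites `City` with a constant, which is fine here because there is only one value. A slightly more general sketch, on made-up rows, coalesces the two columns with `combine_first` so that no label is hard-coded:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "City_1": ["Baltimore City", np.nan, np.nan],
    "City": [np.nan, "Baltimore City", "Baltimore City"],
    "tpop10": [620961.0, np.nan, np.nan],
})

# take City where present, otherwise fall back to City_1
toy["City"] = toy["City"].combine_first(toy["City_1"])
toy = toy.drop(columns=["City_1"])
print(toy)
```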


In [190]:
### reformat data 

# make 'City' = 'Baltimore City' for all values 
baltimore_df.City = 'Baltimore City'
# drop City_1 column 
baltimore_df = baltimore_df.drop(["City_1"], axis = 1)
# melt (pivot longer)
balt_melted = baltimore_df.melt(id_vars = ["City"],
                                var_name = "year-indicator",
                                value_name = "value")
# drop NAs
balt_melted.dropna(subset = ["value"], inplace = True)
# add year column 
balt_melted["year"] = ['20' + i[-2:] for i in balt_melted['year-indicator']]
balt_melted['year_numeric'] = [int(y) for y in balt_melted.year]
# add col for indicator 
balt_melted["indicator"] = [i[:-2] for i in balt_melted['year-indicator']]
# drop year-indicator field 
balt_melted = balt_melted.drop(['year-indicator'], axis = 1)
# remove some strange values 
balt_melted = balt_melted[balt_melted.indicator != "CSA20"]
balt_melted

Unnamed: 0,City,value,year,year_numeric,indicator
0,Baltimore City,620961.0,2010,2010,tpop
237,Baltimore City,585708.0,2020,2020,tpop
475,Baltimore City,292249.0,2010,2010,male
713,Baltimore City,328712.0,2010,2010,female
951,Baltimore City,63.81723,2010,2010,paa
...,...,...,...,...,...
246952,Baltimore City,323556.0,2018,2018,totjobs
247190,Baltimore City,6.134416,2015,2015,lights
247427,Baltimore City,21.387817,2016,2016,lights
247664,Baltimore City,30.61706,2017,2017,lights


In [191]:
# how many indicators have data for Baltimore City? 
len(set(balt_melted.indicator))

130

## Producing full dataset

In [192]:
# grab column order 
existing_cols = existing_df.columns
existing_df.head(10)

Unnamed: 0,CSA,indicator,year,value,year_numeric
0,Allendale/Irvington/S. Hilton,female,2000,10640.0,2000.0
1,Beechfield/Ten Hills/West Hills,female,2000,7110.0,2000.0
2,Belair-Edison,female,2000,9516.0,2000.0
3,Brooklyn/Curtis Bay/Hawkins Point,female,2000,6972.0,2000.0
4,Canton,female,2000,3546.0,2000.0
5,Cedonia/Frankford,female,2000,12404.0,2000.0
6,Cherry Hill,female,2000,4485.0,2000.0
7,Chinquapin Park/Belvedere,female,2000,4551.0,2000.0
8,Claremont/Armistead,female,2000,4612.0,2000.0
9,Clifton-Berea,female,2000,6804.0,2000.0


In [193]:
# rename and reorder columns 
new_indicator_melted = new_indicator_melted.rename(columns = {"CSA2010": "CSA"})
new_indicator_melted = new_indicator_melted[existing_cols]
new_indicator_melted

Unnamed: 0,CSA,indicator,year,value,year_numeric
0,Allendale/Irvington/S. Hilton,cashsa,2011,78.22,2011
1,Beechfield/Ten Hills/West Hills,cashsa,2011,32.05,2011
2,Belair-Edison,cashsa,2011,66.67,2011
3,Brooklyn/Curtis Bay/Hawkins Point,cashsa,2011,73.4,2011
4,Canton,cashsa,2011,26.64,2011
...,...,...,...,...,...
612124,Southwest Baltimore,lights,2020,24.322058,2020
612125,The Waverlies,lights,2020,23.990713,2020
612126,Upton/Druid Heights,lights,2020,13.923806,2020
612127,Washington Village/Pigtown,lights,2020,32.164274,2020


In [194]:
# rename and reorder columns 
balt_melted = balt_melted.rename(columns = {"City": "CSA"})
balt_melted = balt_melted[existing_cols]
balt_melted

Unnamed: 0,CSA,indicator,year,value,year_numeric
0,Baltimore City,tpop,2010,620961.0,2010
237,Baltimore City,tpop,2020,585708.0,2020
475,Baltimore City,male,2010,292249.0,2010
713,Baltimore City,female,2010,328712.0,2010
951,Baltimore City,paa,2010,63.81723,2010
...,...,...,...,...,...
246952,Baltimore City,totjobs,2018,323556.0,2018
247190,Baltimore City,lights,2015,6.134416,2015
247427,Baltimore City,lights,2016,21.387817,2016
247664,Baltimore City,lights,2017,30.61706,2017


In [195]:
# create one new dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
full_data_new = pd.concat([existing_df, new_indicator_melted, balt_melted])

In [196]:
# correcting formatting / typos 
full_data_new["indicator"] = full_data_new.indicator.str.lower()
full_data_new = full_data_new.replace("phsip", "phisp")
full_data_new = full_data_new.replace("artevent", "artevnt")
full_data_new = full_data_new.replace("age0-18_", "age18_")
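
`DataFrame.replace` with a bare string looks at every column, so a CSA name or value that happened to equal one of these strings would be changed too. A minimal sketch that limits the fix to the `indicator` column, using a mapping (the toy rows and values are made up):

```python
import pandas as pd

# typo -> corrected indicator name (same three fixes as in the cell above)
fixes = {"phsip": "phisp", "artevent": "artevnt", "age0-18_": "age18_"}

toy = pd.DataFrame({
    "CSA": ["Canton", "Canton", "Canton"],
    "indicator": ["PHSIP", "artevent", "tpop"],
    "value": [1.0, 2.0, 3.0],
})

# lower-case first, then replace exact matches in this one column only
toy["indicator"] = toy["indicator"].str.lower().replace(fixes)
print(toy["indicator"].tolist())
```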

In [197]:
# check that everything adds up as expected 
print(f"There should be {existing_df.shape[0] + balt_melted.shape[0] + new_indicator_melted.shape[0]} rows.")
print(f"There are {full_data_new.shape[0]} rows")

There should be 95739 rows.
There are 95739 rows
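
The row count confirms that nothing was lost in the append, but it cannot catch duplicated rows. Because this notebook reads `full_vital_signs.csv` and later overwrites it, a duplicate check may be worth adding before export. A sketch on three toy rows (values borrowed from the tables above), treating CSA, indicator, and year as the key, which is my assumption about what should be unique:

```python
import pandas as pd

cols = ["CSA", "indicator", "year", "value", "year_numeric"]
existing_toy = pd.DataFrame([["Canton", "female", "2000", 3546.0, 2000]], columns=cols)
new_toy = pd.DataFrame([["Canton", "cashsa", "2011", 26.64, 2011]], columns=cols)
city_toy = pd.DataFrame([["Baltimore City", "tpop", "2010", 620961.0, 2010]], columns=cols)

parts = [existing_toy, new_toy, city_toy]
combined = pd.concat(parts, ignore_index=True)

# same check as above, but failing loudly instead of printing
assert len(combined) == sum(len(p) for p in parts)

# rows sharing the same (CSA, indicator, year) key would signal a repeated pull
duplicate_rows = combined.duplicated(subset=["CSA", "indicator", "year"]).sum()
print(f"{duplicate_rows} duplicated rows")
```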


In [198]:
# awesome!
# export 
full_data_new.to_csv("full_vital_signs.csv", index = False)

In [199]:
set(balt_melted.indicator)

{'aastud',
 'abse',
 'abshs',
 'absmd',
 'affordm',
 'affordr',
 'age18_',
 'age24_',
 'age5_',
 'age64_',
 'age65_',
 'artbus',
 'artemp',
 'artevent',
 'artevnt',
 'bahigher',
 'baltvac',
 'birthwt',
 'bkln',
 'busload',
 'caracc',
 'carpool',
 'cashsa',
 'caslt',
 'cebus',
 'ceemp',
 'clogged',
 'cmos',
 'comp',
 'compl',
 'constper',
 'crehab',
 'crime',
 'demper',
 'dirtyst',
 'dom',
 'drop',
 'eattend',
 'ebll',
 'eenrol',
 'elheat',
 'empl',
 'fastfd',
 'female',
 'femhhs',
 'fore',
 'gunhom',
 'hcvhouse',
 'heatgas',
 'hh25inc',
 'hh40inc',
 'hh60inc',
 'hh75inc',
 'hhchpov',
 'hhm75',
 'hhpov',
 'hhs',
 'hhsize',
 'histax',
 'homtax',
 'hsattend',
 'hsdipl',
 'hsenrol',
 'hstud',
 'leadtest',
 'lesshs',
 'libcard',
 'lifexp',
 'lights',
 'liquor',
 'male',
 'mattend',
 'menrol',
 'mhhi',
 'murals',
 'narc',
 'neibus',
 'nilf',
 'nohhint',
 'nomail',
 'novhcl',
 'othrcom',
 'overd',
 'ownroc',
 'owntax',
 'p2more',
 'paa',
 'pasi',
 'phisp',
 'ppac',
 'prenatal',
 'prop',
 'pub

## Create New Version of Indicator Info

In [None]:
# info = vs[["indicator","year"]].groupby(["indicator"])["year"].apply(set).reset_index()

In [200]:
## get a list of years available for each indicator 
# function for a sorted set 
def sort_set(series):
  x = set(series)
  return sorted(x)
# add list of years available (sorted)
info = full_data_new[["indicator","year"]].groupby(['indicator'])['year'].apply(sort_set).reset_index()
info

Unnamed: 0,indicator,year
0,aastud,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
1,abse,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
2,abshs,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
3,absmd,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
4,affordm,"[2000, 2006 - 2010, 2010, 2011, 2012, 2013, 20..."
...,...,...
193,waterc,[2011]
194,weather,"[2010, 2011, 2012, 2013, 2014, 2015, 2016]"
195,wlksc,"[2011, 2017]"
196,wrkout,"[2010, 2012, 2013, 2014, 2015, 2016, 2017, 2018]"
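
One thing to keep in mind: `year` is stored as text, so the sort is alphabetical and a range such as `2006 - 2010` counts as a single entry. A small sketch on made-up rows shows how that plays out:

```python
import pandas as pd

toy = pd.DataFrame({
    "indicator": ["affordm", "affordm", "affordm", "waterc"],
    "year": ["2010", "2006 - 2010", "2000", "2011"],
})

# same pattern as sort_set: unique years per indicator, sorted as strings
years = (
    toy.groupby("indicator")["year"]
    .apply(lambda s: sorted(set(s)))
    .reset_index()
)
# a range like "2006 - 2010" is one entry, not five years
years["count_years_available"] = years["year"].map(len)
print(years)
```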


In [201]:
# add a description and category
api_df_subset = api_df.rename(columns = {"Indicator":"indicator_description","Section":"category"})[["indicator_description","indicator","category"]]
info = info.merge(api_df_subset, how = "left")
# add count of years available 
info["count_years_available"] = [len(yrs) for yrs in info.year]
info = info.sort_values(["category", "count_years_available"], ascending = [True, False])
info.head()

Unnamed: 0,indicator,year,indicator_description,category,count_years_available
13,artbus,"[2010, 2011, 2012, 2013, 2014, 2015, 2016, 201...",Number of Businesses that are Arts-Related per...,Arts and Culture,10
14,artemp,"[2010, 2011, 2012, 2013, 2014, 2015, 2016, 201...",Total Employment in Arts-Related Businesses,Arts and Culture,10
92,libcard,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...","Number of Persons with Library Cards per 1,000...",Arts and Culture,10
30,cebus,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...",Rate of Businesses in the Creative Economy per...,Arts and Culture,9
31,ceemp,"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 201...",Number of Employees in the Creative Economy,Arts and Culture,9


In [204]:
# write data
info.to_csv("indicator_info.csv", index = False)

In [202]:
# which indicators don't have descriptions? 
info[info.indicator_description.isnull()]

Unnamed: 0,indicator,year,indicator_description,category,count_years_available
36,comp,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]",,,8
74,historical_buildings,"[2003, 2004, 2005, 2007, 2008, 2009, 2010]",,,7
149,publart,"[2015, 2016, 2017, 2018, 2019, 2020]",,,6
157,registered-voters_voted,"[2000, 2002, 2004, 2006, 2008, 2010]",,,6
158,registered_voters_18plus,"[2000, 2002, 2004, 2006, 2008, 2010]",,,6
132,pct18-25_registeredvote,"[2002, 2004, 2006, 2008, 2010]",,,5
133,pct18-25_voted,"[2002, 2004, 2006, 2008, 2010]",,,5
114,neighborhood-associations,"[2003, 2008, 2009, 2010]",,,4
185,umbrella_nonprofits,"[2003, 2008, 2009, 2010]",,,4
24,business_50-99_emp,"[2008, 2009, 2010]",,,3


In [203]:
set(full_data_new.indicator)

{'aastud',
 'abse',
 'abshs',
 'absmd',
 'affordm',
 'affordr',
 'age18_',
 'age24_',
 'age45-64_',
 'age5_',
 'age64_',
 'age65_',
 'arrest',
 'artbus',
 'artemp',
 'artevnt',
 'bahigher',
 'baltvac',
 'banks',
 'birthwt',
 'biz1_',
 'biz2_',
 'biz4_',
 'bkln',
 'business_50-99_emp',
 'busload',
 'caracc',
 'carpool',
 'cashsa',
 'caslt',
 'cebus',
 'ceemp',
 'clogged',
 'cmos',
 'community_dev_corporations',
 'community_gardens',
 'comp',
 'compl',
 'comprop',
 'constper',
 'crehab',
 'crime',
 'demper',
 'dirtyst',
 'dom',
 'domvio',
 'drop',
 'eattend',
 'ebll',
 'eenrol',
 'elheat',
 'empl',
 'fam',
 'familiesrelatedkids',
 'farms',
 'fastfd',
 'female',
 'femhhs',
 'fore',
 'gunhom',
 'hazardous_waste_sites_count',
 'hcvhouse',
 'heatgas',
 'hfai',
 'hh25inc',
 'hh40inc',
 'hh60inc',
 'hh75inc',
 'hhchpov',
 'hhm75',
 'hhpov',
 'hhs',
 'hhsize',
 'histax',
 'historical_buildings',
 'homtax',
 'hs_degree_only_25plus',
 'hsaalg',
 'hsabio',
 'hsaeng',
 'hsagov',
 'hsattend',
 'hsdi

## Odds and Ends (troubleshooting)

In [None]:
full_data_new[full_data_new.indicator == "OBJECTID"]

Unnamed: 0,CSA,indicator,year,value,year_numeric
65593,Allendale/Irvington/S. Hilton,OBJECTID,20_1,1.0,201.0
65594,Beechfield/Ten Hills/West Hills,OBJECTID,20_1,2.0,201.0
65595,Belair-Edison,OBJECTID,20_1,3.0,201.0
65596,Brooklyn/Curtis Bay/Hawkins Point,OBJECTID,20_1,4.0,201.0
65597,Canton,OBJECTID,20_1,5.0,201.0
...,...,...,...,...,...
67184,Southwest Baltimore,OBJECTID,20_1,51.0,201.0
67185,The Waverlies,OBJECTID,20_1,52.0,201.0
67186,Upton/Druid Heights,OBJECTID,20_1,53.0,201.0
67187,Washington Village/Pigtown,OBJECTID,20_1,54.0,201.0


In [None]:
# where is that coming from? 
print([col for col in new_indicator_df.columns if 'OBJ' in col or 'obj' in col])
print([col for col in baltimore_df.columns if 'OBJ' in col or 'obj' in col])
print([col for col in existing_df.columns if 'OBJ' in col or 'obj' in col])

['OBJECTID_1']
[]
[]
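
So the stray rows come from `OBJECTID_1`, a variant of the bookkeeping column that a fixed drop list can miss. A sketch of dropping by pattern rather than by exact name (the toy frame is made up, and the prefixes are my reading of the column names seen in this notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "CSA2010": ["Canton"],
    "OBJECTID": [5],
    "OBJECTID_1": [5],
    "Shape__Area": [1.0],
    "cashsa11": [26.64],
})

# collect every bookkeeping column, whatever suffix the service added
bookkeeping = [
    c for c in toy.columns
    if c.upper().startswith("OBJECTID") or c.startswith("Shape__") or c == "geometry"
]
clean = toy.drop(columns=bookkeeping)
print(list(clean.columns))
```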


In [None]:
# possible typo?
full_data_new[full_data_new.indicator == 'phsip']

Unnamed: 0,CSA,indicator,year,value,year_numeric
29239,Allendale/Irvington/S. Hilton,phsip,2017,2.0627,2017.0
29240,Beechfield/Ten Hills/West Hills,phsip,2017,1.7848,2017.0
29241,Belair-Edison,phsip,2017,1.2428,2017.0
29242,Brooklyn/Curtis Bay/Hawkins Point,phsip,2017,14.907,2017.0
29243,Canton,phsip,2017,3.0867,2017.0
29244,Cedonia/Frankford,phsip,2017,3.3246,2017.0
29245,Cherry Hill,phsip,2017,5.8131,2017.0
29246,Chinquapin Park/Belvedere,phsip,2017,5.9408,2017.0
29247,Claremont/Armistead,phsip,2017,15.2484,2017.0
29248,Clifton-Berea,phsip,2017,2.035,2017.0


In [None]:
# do we have 2017 data for phisp? 
set(full_data_new[full_data_new.indicator == 'phisp'].year)
# we do... 

{'2000', '2010', '2015', '2016', '2017', '2018', '2019', '2020'}

In [None]:
full_data_new[(full_data_new.indicator == 'phisp') & (full_data_new.year == '2017')]
# but it's only for Baltimore City. Safe to assume that phsip was a typo

Unnamed: 0,CSA,indicator,year,value,year_numeric
9962,Baltimore City,phisp,2017,4.957922,2017.0
