In [1]:
import pandas as pd
import pickle
import numpy as np
from json import loads
import requests
import re

This notebook includes the work to clean the food access data and merge it with the cleaned education dataset prior to analysis. 

In [2]:
food_df = pd.read_csv('food_access_2019.csv')

The food access atlas has a huge number of features to work from. As a start, let's see if any are missing too many values to be useful. I'll pull a list of every column that has more than half of its data points missing.

In [3]:
food_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72531 entries, 0 to 72530
Columns: 147 entries, CensusTract to TractSNAP
dtypes: float64(126), int64(19), object(2)
memory usage: 81.3+ MB


In [4]:
drop_cols = list(food_df.isnull().sum()[food_df.isnull().sum()> (len(food_df)//2)].index)

Looking through the list of columns and comparing to the data description, it looks like all of these columns are population counts and percentages for various groups (e.g. lapop10 is the population count of individuals in the tract who live more than 10 miles from a supermarket). For this analysis, I'm not focused on populations, just whether the tract as a whole is low access. And since there are too many missing values to impute anything, I will just drop all of the population and share columns. Note that not all pop/share columns showed up in this list, but since the majority are unavailable, we'll just drop all of them.
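As a rough sketch of the same idea, the share of missing values per column can be computed directly with `isnull().mean()`. The toy frame below is made up for illustration (only the column names are borrowed from the atlas), and the 0.5 cutoff is close to, but not exactly the same as, the count-based threshold used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for food_df; the values are invented for illustration.
toy = pd.DataFrame({
    "CensusTract": [1, 2, 3, 4],
    "lapop10": [np.nan, np.nan, np.nan, 5.0],
    "PovertyRate": [10.0, np.nan, 12.5, 8.0],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Columns where more than half of the rows are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

Working with a share rather than a raw count makes the threshold easier to read and to change later.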

In [5]:
drop_cols

['lapop10',
 'lapop10share',
 'lalowi10',
 'lalowi10share',
 'lakids10',
 'lakids10share',
 'laseniors10',
 'laseniors10share',
 'lawhite10',
 'lawhite10share',
 'lablack10',
 'lablack10share',
 'laasian10',
 'laasian10share',
 'lanhopi10',
 'lanhopi10share',
 'laaian10',
 'laaian10share',
 'laomultir10',
 'laomultir10share',
 'lahisp10',
 'lahisp10share',
 'lahunv10',
 'lahunv10share',
 'lasnap10',
 'lasnap10share',
 'lapop20',
 'lapop20share',
 'lalowi20',
 'lalowi20share',
 'lakids20',
 'lakids20share',
 'laseniors20',
 'laseniors20share',
 'lawhite20',
 'lawhite20share',
 'lablack20',
 'lablack20share',
 'laasian20',
 'laasian20share',
 'lanhopi20',
 'lanhopi20share',
 'laaian20',
 'laaian20share',
 'laomultir20',
 'laomultir20share',
 'lahisp20',
 'lahisp20share',
 'lahunv20',
 'lahunv20share',
 'lasnap20',
 'lasnap20share']

In [6]:
# The columns in question are all next to each other, so drop the section that contains them

food_df.drop(columns=food_df.iloc[:,31:-12].columns, inplace=True)

food_df

Unnamed: 0,CensusTract,State,County,Urban,Pop2010,OHU2010,GroupQuartersFlag,NUMGQTRS,PCTGQTRS,LILATracts_1And10,...,TractSeniors,TractWhite,TractBlack,TractAsian,TractNHOPI,TractAIAN,TractOMultir,TractHispanic,TractHUNV,TractSNAP
0,1001020100,Alabama,Autauga County,1,1912,693,0,0.0,0.00,0,...,221.0,1622.0,217.0,14.0,0.0,14.0,45.0,44.0,6.0,102.0
1,1001020200,Alabama,Autauga County,1,2170,743,0,181.0,8.34,1,...,214.0,888.0,1217.0,5.0,0.0,5.0,55.0,75.0,89.0,156.0
2,1001020300,Alabama,Autauga County,1,3373,1256,0,0.0,0.00,0,...,439.0,2576.0,647.0,17.0,5.0,11.0,117.0,87.0,99.0,172.0
3,1001020400,Alabama,Autauga County,1,4386,1722,0,0.0,0.00,0,...,904.0,4086.0,193.0,18.0,4.0,11.0,74.0,85.0,21.0,98.0
4,1001020500,Alabama,Autauga County,1,10766,4082,0,181.0,1.68,0,...,1126.0,8666.0,1437.0,296.0,9.0,48.0,310.0,355.0,230.0,339.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72526,56043000200,Wyoming,Washakie County,0,3326,1317,0,57.0,1.71,0,...,593.0,3106.0,6.0,15.0,0.0,27.0,172.0,309.0,61.0,64.0
72527,56043000301,Wyoming,Washakie County,1,2665,1154,0,10.0,0.38,0,...,399.0,2377.0,5.0,23.0,0.0,40.0,220.0,446.0,88.0,41.0
72528,56043000302,Wyoming,Washakie County,1,2542,1021,0,73.0,2.87,0,...,516.0,2312.0,11.0,10.0,1.0,26.0,182.0,407.0,23.0,64.0
72529,56045951100,Wyoming,Weston County,0,3314,1322,0,252.0,7.60,0,...,499.0,3179.0,15.0,10.0,1.0,47.0,62.0,91.0,47.0,34.0


Looking at the remaining issue areas, it looks like the MedianFamilyIncome and LAPOP/LALOWI fields are the biggest concerns. Median income isn't relevant for my analysis, and the LAPOP/LALOWI fields are the same kind of measure as the previously dropped columns: a population count/share of the population that lives X miles from the closest supermarket. We can drop these columns as well, which leaves us with a much more manageable level of null values.

In [7]:
# Set display output
pd.options.display.max_rows = None

# Look at null values
food_df.isnull().sum()[food_df.isnull().sum() > 0]

NUMGQTRS                 25
PCTGQTRS                 25
PovertyRate               3
MedianFamilyIncome      748
LAPOP1_10             29957
LAPOP05_10            14540
LAPOP1_20             35914
LALOWI1_10            29957
LALOWI05_10           14540
LALOWI1_20            35914
TractLOWI                 4
TractKids                 4
TractSeniors              4
TractWhite                4
TractBlack                4
TractAsian                4
TractNHOPI                4
TractAIAN                 4
TractOMultir              4
TractHispanic             4
TractHUNV                 4
TractSNAP                 4
dtype: int64

In [8]:
# Reset display parameters
pd.options.display.max_rows = 20

# Drop columns
food_df.drop(columns=['MedianFamilyIncome', 'LAPOP1_10', 'LAPOP05_10', 'LAPOP1_20', 'LALOWI1_10', 'LALOWI05_10', 'LALOWI1_20'], inplace=True)

In [9]:
food_df.head(3)

Unnamed: 0,CensusTract,State,County,Urban,Pop2010,OHU2010,GroupQuartersFlag,NUMGQTRS,PCTGQTRS,LILATracts_1And10,...,TractSeniors,TractWhite,TractBlack,TractAsian,TractNHOPI,TractAIAN,TractOMultir,TractHispanic,TractHUNV,TractSNAP
0,1001020100,Alabama,Autauga County,1,1912,693,0,0.0,0.0,0,...,221.0,1622.0,217.0,14.0,0.0,14.0,45.0,44.0,6.0,102.0
1,1001020200,Alabama,Autauga County,1,2170,743,0,181.0,8.34,1,...,214.0,888.0,1217.0,5.0,0.0,5.0,55.0,75.0,89.0,156.0
2,1001020300,Alabama,Autauga County,1,3373,1256,0,0.0,0.0,0,...,439.0,2576.0,647.0,17.0,5.0,11.0,117.0,87.0,99.0,172.0


We've now managed to get our features down from 147 to 36, a much more manageable number. There are a few more columns that we can drop since they will be unnecessary:

OHU2010 (number of occupied housing units), NUMGQTRS and PCTGQTRS (number/share of population residing in group quarters)

In [10]:
food_df.columns

Index(['CensusTract', 'State', 'County', 'Urban', 'Pop2010', 'OHU2010',
       'GroupQuartersFlag', 'NUMGQTRS', 'PCTGQTRS', 'LILATracts_1And10',
       'LILATracts_halfAnd10', 'LILATracts_1And20', 'LILATracts_Vehicle',
       'HUNVFlag', 'LowIncomeTracts', 'PovertyRate', 'LA1and10', 'LAhalfand10',
       'LA1and20', 'LATracts_half', 'LATracts1', 'LATracts10', 'LATracts20',
       'LATractsVehicle_20', 'TractLOWI', 'TractKids', 'TractSeniors',
       'TractWhite', 'TractBlack', 'TractAsian', 'TractNHOPI', 'TractAIAN',
       'TractOMultir', 'TractHispanic', 'TractHUNV', 'TractSNAP'],
      dtype='object')

In [11]:
food_df.drop(columns=['OHU2010', 'NUMGQTRS', 'PCTGQTRS'], inplace=True)

Let's do one more check of the null values and see if we can resolve those. 

In [12]:
food_df.isnull().sum()[food_df.isnull().sum() > 0]

PovertyRate      3
TractLOWI        4
TractKids        4
TractSeniors     4
TractWhite       4
TractBlack       4
TractAsian       4
TractNHOPI       4
TractAIAN        4
TractOMultir     4
TractHispanic    4
TractHUNV        4
TractSNAP        4
dtype: int64

Confirming that the null values for all of these come from the same 4 census tracts, 3 in South Dakota and 1 in Alaska. Before we deal with these null values, let's match this to our edu dataframe and see if there are any schools in that area that we'll need to worry about. 
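One way to confirm that the nulls line up on the same rows is to filter for rows with any missing value and look at which tracts come back. This is a minimal sketch on an invented toy frame, not the real atlas data.

```python
import numpy as np
import pandas as pd

# Toy frame: tract 2 is missing both fields, tract 3 only one of them.
toy = pd.DataFrame({
    "CensusTract": [1, 2, 3],
    "PovertyRate": [12.0, np.nan, 30.0],
    "TractLOWI": [500.0, np.nan, np.nan],
})

# Keep only the rows where at least one column is null
rows_with_nulls = toy[toy.isnull().any(axis=1)]
null_tracts = rows_with_nulls["CensusTract"].tolist()
print(null_tracts)
```

If the number of tracts returned equals the largest per-column null count, every null sits in that same small set of rows.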

In [13]:
pd.options.display.max_rows = 35

food_df[food_df['TractLOWI'].isnull()].T

Unnamed: 0,1293,59643,59644,59645
CensusTract,2158000100,46102940500,46102940800,46102940900
State,Alaska,South Dakota,South Dakota,South Dakota
County,Kusilvak Census Area,Oglala Lakota County,Oglala Lakota County,Oglala Lakota County
Urban,0,0,0,0
Pop2010,7459,4419,4745,4422
GroupQuartersFlag,0,0,0,0
LILATracts_1And10,0,0,0,0
LILATracts_halfAnd10,0,0,0,0
LILATracts_1And20,0,0,0,0
LILATracts_Vehicle,1,0,0,0


In [14]:
# Load edu df dataset
with open('edu_df.pkl', 'rb') as f:
    edu_df = pickle.load(f)

Census tract codes are made up of 3 components: a 2 digit state code, a 3 digit county code, and a 6 digit tract. The edu dataframe 'tract' column only includes the 6 digit tract id, but the 'county_code' column contains the other digits required. 

Taking a look at the different ids, we can see that the length of the tract ids runs from 3 to 6, the length of county_code is 4 to 5, and the length of CensusTract runs from 10 to 11. That looks like all the codes have had their leading 0's removed, so we'll need to add those back on to the 'tract' id to ensure a length of at least 10.

source: https://transition.fcc.gov/form477/Geo/more_about_census_tracts.pdf
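To make the padding concrete, here is a small sketch that rebuilds the full 11-character ID as a string with `str.zfill`. The two rows reuse values that appear in the sample output further down, and the `geoid` column name is my own. The notebook itself keeps the IDs as integers, so the leading state zero is dropped again there.

```python
import pandas as pd

# Two example rows with the same shape as edu_df's 'county_code' and 'tract' columns.
toy = pd.DataFrame({
    "county_code": [6039, 36081],
    "tract": [108.0, 6900.0],
})

# county_code holds state + county digits (5 once padded); the tract needs 6 digits.
county = toy["county_code"].astype(int).astype(str).str.zfill(5)
tract = toy["tract"].astype(int).astype(str).str.zfill(6)

toy["geoid"] = county + tract
print(toy["geoid"].tolist())
```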

In [15]:
min_len = np.inf
max_len = 0
for i in edu_df['tract'].unique():
    item = str(int(i))
    if len(item) < min_len:
        min_len = len(item)
    if len(item) > max_len:
        max_len = len(item)

print(min_len, max_len)

min_len = np.inf
max_len = 0
for i in edu_df['county_code'].unique():
    item = str(int(i))
    if len(item) < min_len:
        min_len = len(item)
    if len(item) > max_len:
        max_len = len(item)

print(min_len, max_len)

print(len(str(food_df.CensusTract.unique().min())),
len(str(food_df.CensusTract.unique().max())))

for i in food_df['CensusTract'].unique():
    # Flag any 11-digit ID that still starts with a zero
    if len(str(i)) == 11 and str(i)[0] == '0':
        print('True')

3 6
4 5
10 11


In [16]:
# Note that tracts can have sub-categories marked as a decimal point, which is why the tract id is listed as a float. But a quick check verifies that there are no subcategories included

for i in edu_df['tract']:
    if i > int(i):
        print('True')

In [17]:
def census_tract(row):
    # Pad the tract to 6 digits and prepend the state + county code
    tract = str(int(row['tract'])).zfill(6)
    code = str(row['county_code'])
    return code + tract

edu_df['CensusTract'] = edu_df.apply(census_tract, axis=1)
    

In [18]:
edu_df.sample(10)

Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,tract,school_level,charter,lunch_program,urban_locale_cat,grad_rate,CensusTract
43188,Pomperaug Regional High School,90353700749,disability,50,234 Judd Rd.,Southbury,CT,6488,9009,9,348122.0,3,0,1.0,Rural,90,9009348122
13658,Chawanakee Academy Charter,60011607170,api,1,45077 Rd. 200,ONeals,CA,93645,6039,6,108.0,4,1,0.0,Rural,69,6039000108
81588,Sharon High,251062001705,black,24,181 Pond Street,Sharon,MA,2067,25021,25,414300.0,3,0,1.0,Suburban,80,25021414300
61525,Eastern High School,180300000395,white,107,1100 N Eastern School RD E-3,Pekin,IN,47165,18175,18,967700.0,3,0,1.0,Rural,87,18175967700
116979,YOUNG WOMENS LEADERSHIP SCHOOL - ASTORIA,360010205948,disability,7,23-15 NEWTOWN AVE,LONG ISLAND CITY,NY,11105,36081,36,6900.0,3,0,2.0,Urban,50,36081006900
158223,Nashville Big Picture High School,470318002136,disability,4,160 Rural AVE,Nashville,TN,37209,47037,47,18101.0,3,0,2.0,Urban,64,47037018101
133410,Walnut Ridge High School,390438000728,total,184,4841 E Livingston Ave,Columbus,OH,43227,39049,39,9312.0,3,0,2.0,Urban,72,39049009312
139798,CARNEY HS,400669000262,econ_disadvantaged,5,204 South Carney Street,Carney,OK,74832,40081,40,961200.0,3,0,1.0,Rural,68,40081961200
30492,Citrus Valley High,63207012254,disability,34,800 W. Pioneer Ave.,Redlands,CA,92374,6071,6,8001.0,3,0,2.0,Urban,74,6071008001
113105,RUIDOSO HIGH,350231000529,homeless,12,125 WARRIOR DR,RUIDOSO,NM,88345,35027,35,960600.0,3,0,1.0,Rural,50,35027960600


In [19]:
edu_df['CensusTract'] = edu_df['CensusTract'].astype('int64')

In [20]:
df = edu_df.merge(food_df, on= 'CensusTract', how='left')

After joining the datasets, we can see that there are 918 rows (associated with 189 schools) that didn't match with the food access dataset. Digging more into this, it appears that the census tracts are actually 2020 tracts, not 2010, even though the file states that they are 2010. Since the school data we're using is 2019, it makes sense to match up to the 2020 data for those that didn't match properly. To do this, we'll need to use the Census Geocoder to look up each address and find the current census tract ID.

source: https://geocoding.geo.census.gov/geocoder/
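A quick way to count unmatched rows and schools after a left join is the `indicator` option of `merge`, which adds a `_merge` column. The sketch below uses two invented frames with the same key columns as the notebook. It is only an illustration, not the check that was actually run here.

```python
import pandas as pd

# Invented stand-ins for edu_df and food_df
left = pd.DataFrame({"ncessch": [1, 1, 2, 3], "CensusTract": [10, 10, 20, 30]})
right = pd.DataFrame({"CensusTract": [10, 30], "Urban": [1, 0]})

# indicator=True adds a '_merge' column with 'both' or 'left_only'
merged = left.merge(right, on="CensusTract", how="left", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
n_rows = len(unmatched)
n_schools = unmatched["ncessch"].nunique()
print(n_rows, n_schools)
```

This avoids relying on a null in one particular column as the signal for a failed match.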

In [21]:
df.isnull().sum()[df.isnull().sum() > 0]

State                   918
County                  918
Urban                   918
Pop2010                 918
GroupQuartersFlag       918
LILATracts_1And10       918
LILATracts_halfAnd10    918
LILATracts_1And20       918
LILATracts_Vehicle      918
HUNVFlag                918
LowIncomeTracts         918
PovertyRate             926
LA1and10                918
LAhalfand10             918
LA1and20                918
LATracts_half           918
LATracts1               918
LATracts10              918
LATracts20              918
LATractsVehicle_20      918
TractLOWI               987
TractKids               987
TractSeniors            987
TractWhite              987
TractBlack              987
TractAsian              987
TractNHOPI              987
TractAIAN               987
TractOMultir            987
TractHispanic           987
TractHUNV               987
TractSNAP               987
dtype: int64

In [22]:
df[df['Pop2010'].isnull()]['ncessch'].nunique()

189

In [23]:
def census_lookup(school):
    '''Input is a row from edu_df. Function parses the address and looks up the census tract ID from census.gov'''

    # Parse address into correct format
    old_tract = school['CensusTract']
    street = re.sub(pattern=r'["#"]', repl='', string=school['street'])
    geoStreet = '+'.join(street.split())
    geoCity = '+'.join(school['city'].split())
    geoState = school['state']
    geoZip= school['zip']

    # Make Request
    url = f'https://geocoding.geo.census.gov/geocoder/geographies/address?street={geoStreet}&city={geoCity}&state={geoState}&zip={geoZip}&benchmark=Public_AR_Current&vintage=Current_Current&layers=11&format=json'
    response = requests.get(url)

    # Load results from request
    results = loads(response.text)

    # Check if search returned results or not
    if len(results['result']['addressMatches']) > 0:
        # Get census tract ID
        new_tract = results['result']['addressMatches'][0]['geographies']['Census Tracts'][0]['GEOID']
    else:
        new_tract = 'No Match'
    
    return new_tract
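The lookup above builds the query string by hand. As a hedged alternative, the standard library's `urlencode` can escape the address for us, and the response parsing can be made tolerant of missing keys. The helper names `build_geocoder_url` and `extract_geoid` are hypothetical, the query parameters are copied from the URL above, and the response below is hand-written to mirror the nesting that the function reads. No request is sent. Note that `urlencode` escapes characters such as `#` instead of stripping them, so it is not a drop-in replacement for the regex cleanup.

```python
from urllib.parse import urlencode

BASE_URL = "https://geocoding.geo.census.gov/geocoder/geographies/address"

def build_geocoder_url(street, city, state, zip_code):
    """Build the request URL with the same query parameters used above."""
    params = {
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "benchmark": "Public_AR_Current",
        "vintage": "Current_Current",
        "layers": "11",
        "format": "json",
    }
    # urlencode turns spaces into '+' and escapes special characters
    return f"{BASE_URL}?{urlencode(params)}"

def extract_geoid(results):
    """Return the first matched tract GEOID, or 'No Match' if anything is missing."""
    matches = results.get("result", {}).get("addressMatches", [])
    if not matches:
        return "No Match"
    tracts = matches[0].get("geographies", {}).get("Census Tracts", [])
    return tracts[0]["GEOID"] if tracts else "No Match"

# Hand-written response with the same nesting the lookup function reads
fake_response = {"result": {"addressMatches": [
    {"geographies": {"Census Tracts": [{"GEOID": "04019005200"}]}}
]}}

url = build_geocoder_url("23-15 NEWTOWN AVE", "LONG ISLAND CITY", "NY", "11105")
geoid = extract_geoid(fake_response)
empty = extract_geoid({"result": {"addressMatches": []}})
print(url)
print(geoid, empty)
```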

In [24]:
# List of school IDs for schools that didn't have a match
missing_schools = list(df[df['Pop2010'].isnull()]['ncessch'].unique())
temp_dict = {}

for idx, school_id in enumerate(missing_schools):
    # Loop through each school and look up its 2020 census tract ID if available
    school = edu_df[edu_df['ncessch'] == school_id].iloc[0]
    old_tract = school['CensusTract']
    new_tract = census_lookup(school)

    temp_dict[school_id] = [old_tract, new_tract]

    # Print to verify code is running properly
    print(f'Index {idx} complete')

Index 0 complete
Index 1 complete
Index 2 complete
Index 3 complete
Index 4 complete
Index 5 complete
Index 6 complete
Index 7 complete
Index 8 complete
Index 9 complete
Index 10 complete
Index 11 complete
Index 12 complete
Index 13 complete
Index 14 complete
Index 15 complete
Index 16 complete
Index 17 complete
Index 18 complete
Index 19 complete
Index 20 complete
Index 21 complete
Index 22 complete
Index 23 complete
Index 24 complete
Index 25 complete
Index 26 complete
Index 27 complete
Index 28 complete
Index 29 complete
Index 30 complete
Index 31 complete
Index 32 complete
Index 33 complete
Index 34 complete
Index 35 complete
Index 36 complete
Index 37 complete
Index 38 complete
Index 39 complete
Index 40 complete
Index 41 complete
Index 42 complete
Index 43 complete
Index 44 complete
Index 45 complete
Index 46 complete
Index 47 complete
Index 48 complete
Index 49 complete
Index 50 complete
Index 51 complete
Index 52 complete
Index 53 complete
Index 54 complete
Index 55 complete

In [25]:
# Check how many values did not have a match
count = 0
for item in temp_dict.keys():
    if temp_dict[item][1] == 'No Match':
        count += 1

print(f'Out of {len(missing_schools)} schools, {count} were not found in the lookup')

Out of 189 schools, 159 were not found in the lookup


It looks like we were only able to find a 2020 census tract for 30 of the 189 schools. A spot check suggests that most of the unmatched schools share the same state code (72), which represents Puerto Rico. A quick loop confirms that all but 1 of the schools without a match are in Puerto Rico. The one school that is not in Puerto Rico is a school in Eatonville that I couldn't find through manual searching either. We'll give up on these schools.

In [26]:
missing_vals = []
count = 0
for item in temp_dict.keys():
    if temp_dict[item][1] == 'No Match':
        missing_vals.append(item)

for item in missing_vals:
    if str(item)[:2] == '72':
        count+= 1

print(count)

158


In [27]:
edu_df[edu_df['ncessch'] == 361992001819]

Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,tract,school_level,charter,lunch_program,urban_locale_cat,grad_rate,CensusTract
121682,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,total,54,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,84,36053940600
121683,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,disability,12,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,50,36053940600
121684,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,econ_disadvantaged,34,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,84,36053940600
121685,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,api,3,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,57,36053940600
121686,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,black,2,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,63,36053940600
121687,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,two_or_more,1,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,51,36053940600
121688,MORRISVILLE MIDDLE SCHOOL HIGH SCHOOL,361992001819,white,48,5061 FEARON RD,MORRISVILLE,NY,13408,36053,36,940600.0,3,0,1.0,Rural,84,36053940600


In [28]:
tract_changes = pd.DataFrame.from_dict(temp_dict, columns=['old_tract', 'new_tract'], orient='index')

len(tract_changes)

tract_changes

Unnamed: 0,old_tract,new_tract
40052003063,4019470400,04019005200
40089803393,4013980600,04013422647
60211511108,6071980100,06071980100
80195000010,8001988700,08001988700
80336006527,8031980100,08031980100
...,...,...
720003002064,72059740400,No Match
720003002066,72123952800,No Match
720003002069,72031050600,No Match
720003002073,72031051102,No Match


In [29]:
# Update all tracts

def update_tract(x):
    if x in list(tract_changes['old_tract']):
        tract = tract_changes[tract_changes['old_tract'] == x].new_tract.values[0]
        return tract
    else:
        return x
    

edu_df['CensusTract2'] = edu_df['CensusTract'].apply(update_tract)

edu_df[edu_df['CensusTract'] == 72031051102]



Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,tract,school_level,charter,lunch_program,urban_locale_cat,grad_rate,CensusTract,CensusTract2
193636,ANGEL P. MILLAN ROHENA,720003002073,total,142,CAR 857 KM 0.3 BO CANOVANILLAS,CAROLINA,PR,986,72031,72,51102.0,3,0,1.0,Suburban,77,72031051102,No Match
193637,ANGEL P. MILLAN ROHENA,720003002073,disability,32,CAR 857 KM 0.3 BO CANOVANILLAS,CAROLINA,PR,986,72031,72,51102.0,3,0,1.0,Suburban,54,72031051102,No Match
193638,ANGEL P. MILLAN ROHENA,720003002073,econ_disadvantaged,120,CAR 857 KM 0.3 BO CANOVANILLAS,CAROLINA,PR,986,72031,72,51102.0,3,0,1.0,Suburban,77,72031051102,No Match
193639,ANGEL P. MILLAN ROHENA,720003002073,homeless,6,CAR 857 KM 0.3 BO CANOVANILLAS,CAROLINA,PR,986,72031,72,51102.0,3,0,1.0,Suburban,50,72031051102,No Match
193641,ANGEL P. MILLAN ROHENA,720003002073,hispanic,141,CAR 857 KM 0.3 BO CANOVANILLAS,CAROLINA,PR,986,72031,72,51102.0,3,0,1.0,Suburban,77,72031051102,No Match
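The tract update rebuilds a list and filters the lookup frame for every row, which gets slow on a large frame. A dictionary lookup gives the same result when each old tract appears only once in the lookup table. This is a sketch on toy data with names of my own choosing. If an old tract can appear more than once, the dictionary keeps the last value rather than the first, so that case would need handling.

```python
import pandas as pd

# Toy versions of tract_changes and edu_df; two rows reuse values shown above.
tract_changes_toy = pd.DataFrame({
    "old_tract": [4019470400, 72031051102],
    "new_tract": ["04019005200", "No Match"],
})
edu_toy = pd.DataFrame({"CensusTract": [4019470400, 72031051102, 36081006900]})

# Build the mapping once, then fall back to the original ID when there is no entry
lookup = dict(zip(tract_changes_toy["old_tract"], tract_changes_toy["new_tract"]))
edu_toy["CensusTract2"] = edu_toy["CensusTract"].map(lambda t: lookup.get(t, t))
print(edu_toy["CensusTract2"].tolist())
```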


In [30]:
# Dropping the items with no match

# .copy() so the in-place edits below act on a new frame rather than a slice
edu_df = edu_df[edu_df['CensusTract2'] != 'No Match'].copy()

edu_df.drop(columns=['CensusTract'], inplace=True)

edu_df.rename(columns={'CensusTract2': 'CensusTract'}, inplace=True)

In [31]:
edu_df['CensusTract'] = edu_df['CensusTract'].astype('int64')

In [32]:
df = edu_df.merge(food_df, on= 'CensusTract', how='left')

This has now gotten our missing matches down to 134 rows and 23 schools.

In [33]:
print(df.isnull().sum()[df.isnull().sum()> 0])

print(df[df['State'].isnull()]['ncessch'].nunique())

State                   134
County                  134
Urban                   134
Pop2010                 134
GroupQuartersFlag       134
LILATracts_1And10       134
LILATracts_halfAnd10    134
LILATracts_1And20       134
LILATracts_Vehicle      134
HUNVFlag                134
LowIncomeTracts         134
PovertyRate             142
LA1and10                134
LAhalfand10             134
LA1and20                134
LATracts_half           134
LATracts1               134
LATracts10              134
LATracts20              134
LATractsVehicle_20      134
TractLOWI               203
TractKids               203
TractSeniors            203
TractWhite              203
TractBlack              203
TractAsian              203
TractNHOPI              203
TractAIAN               203
TractOMultir            203
TractHispanic           203
TractHUNV               203
TractSNAP               203
dtype: int64
23


Again, many of the schools that still have no match are in Puerto Rico, and it is probably easiest just to drop those schools entirely. For the remaining 7, there's not much more I can do without a better understanding of the census rules. Since these schools are relatively spread out across the country, dropping them shouldn't cause any serious gaps in our data (i.e. one state not being represented or a coast being underrepresented), so we'll just drop these schools as well.

In [34]:
df[df['State'].isnull()]['ncessch'].unique()

array([ 40089803393,  60211511108,  80195000010,  80336006527,
       330329800717, 360010005297, 450390101614, 720003000033,
       720003000097, 720003000296, 720003000445, 720003000494,
       720003000698, 720003000732, 720003000764, 720003001045,
       720003001133, 720003001282, 720003001286, 720003001349,
       720003001849, 720003001855, 720003002007], dtype=int64)

In [35]:
for i in list(df[df['State'].isnull()]['ncessch'].unique())[:7]:
    print(df[df['ncessch'] == i]['school_name'].unique(),df[df['ncessch'] == i]['street'].unique(), df[df['ncessch'] == i]['city'].unique(), df[df['ncessch'] == i]['state'].unique(), df[df['ncessch'] == i]['CensusTract'].unique())

['BASIS Mesa'] ['5010 S EASTMARK PKWY'] ['MESA'] ['AZ'] [4013422647]
['Public Safety Academy'] ['1482 E. Enterprise Dr.'] ['San Bernardino'] ['CA'] [6071980100]
['ADAMS CITY HIGH SCHOOL'] ['7200 QUEBEC PARKWAY'] ['COMMERCE CITY'] ['CO'] [8001988700]
['HIGH TECH EARLY COLLEGE'] ['11200 E 45TH AVE'] ['DENVER'] ['CO'] [8031980100]
['The Founders Academy Charter School (H)'] ['5 Perimeter Rd'] ['Manchester'] ['NH'] [33011980101]
['QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK COLLEGE'] ['94-50 159TH ST'] ['JAMAICA'] ['NY'] [36081024600]
['Midlands Middle College'] ['1260 Lexington Drive'] ['West Columbia'] ['SC'] [45063980100]


In [36]:
# Keep only rows that matched a food access tract; copy so later assignments are safe
df = df[df['State'].notnull()].copy()

df.isnull().sum()[df.isnull().sum() > 0]

PovertyRate       8
TractLOWI        69
TractKids        69
TractSeniors     69
TractWhite       69
TractBlack       69
TractAsian       69
TractNHOPI       69
TractAIAN        69
TractOMultir     69
TractHispanic    69
TractHUNV        69
TractSNAP        69
dtype: int64

In [37]:
#Looking up the 

A quick look at the remaining null values shows us that only 3 unique census tracts are missing data. Using census.gov, I located the actual numbers for each of these wherever possible; when I couldn't, I used the median for the state. This is not going to be a particularly accurate number, but since it's only 3 schools it shouldn't cause much of an issue for the model.
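For the state-median fallback, a compact pattern is `groupby(...).transform('median')` followed by `fillna`. The sketch below uses invented numbers. One caveat I'd flag: medians taken over the merged frame are computed over school/cohort rows rather than unique tracts, so tracts with many rows count more.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from the notebook.
toy = pd.DataFrame({
    "state": ["SD", "SD", "SD", "AK", "AK"],
    "TractSNAP": [80.0, 100.0, np.nan, 40.0, np.nan],
})

# Median per state, broadcast back to every row (NaNs are skipped)
state_median = toy.groupby("state")["TractSNAP"].transform("median")

# Fill only the missing entries with their state's median
toy["TractSNAP"] = toy["TractSNAP"].fillna(state_median)
print(toy["TractSNAP"].tolist())
```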

In [38]:
print(df[df['TractLOWI'].isnull()].CensusTract.unique())
print(df[df.PovertyRate.isnull()].CensusTract.unique())

df[df['CensusTract'] == 46102940900]


[ 2158000100 46102940500 46102940900]
[46102940500 46102940900]


Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,...,TractSeniors,TractWhite,TractBlack,TractAsian,TractNHOPI,TractAIAN,TractOMultir,TractHispanic,TractHUNV,TractSNAP
145292,Little Wound School,590017300115,total,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145293,Little Wound School,590017300115,disability,18,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145294,Little Wound School,590017300115,econ_disadvantaged,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145295,Little Wound School,590017300115,am_indian/ak_native,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,


In [39]:
demo_dicts = {'PovertyRate' : {46102940500: 53.75, 46102940900: 58.33},
'TractLOWI' : {2158000100: df.loc[df['state'] == 'AK', 'TractLOWI'].median(),  46102940500: df.loc[df['state'] == 'SD', 'TractLOWI'].median(), 46102940900: df.loc[df['state'] == 'SD', 'TractLOWI'].median()},
'TractKids': {2158000100: 3411,  46102940500: 1191, 46102940900: 1858},
'TractSeniors': {2158000100: 490,  46102940500: 271, 46102940900: 399},
'TractWhite' : {2158000100: 173,  46102940500: 178, 46102940900: 341},
'TractBlack' : {2158000100: 16,  46102940500: 11, 46102940900: 0},
'TractAsian': {2158000100: 23,  46102940500: 0, 46102940900: 0},
'TractNHOPI' : {2158000100: 0,  46102940500: 0, 46102940900: 0},
'TractAIAN' : {2158000100: 7946,  46102940500: 5233, 46102940900: 4299},
'TractOMultir' : {2158000100: 192,  46102940500: 0, 46102940900: 182},
'TractHispanic' : {2158000100: 86,  46102940500: 239, 46102940900: 283},
'TractHUNV' : {2158000100: df.loc[df['state'] == 'AK', 'TractHUNV'].median(),  46102940500: df.loc[df['state'] == 'SD', 'TractHUNV'].median(), 46102940900: df.loc[df['state'] == 'SD', 'TractHUNV'].median()},
'TractSNAP': {2158000100: df.loc[df['state'] == 'AK', 'TractSNAP'].median(),  46102940500: df.loc[df['state'] == 'SD', 'TractSNAP'].median(), 46102940900: df.loc[df['state'] == 'SD', 'TractSNAP'].median()}}

In [40]:
df[df['CensusTract'] == 46102940900]

#[ 2158000100 46102940500 46102940900]

Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,...,TractSeniors,TractWhite,TractBlack,TractAsian,TractNHOPI,TractAIAN,TractOMultir,TractHispanic,TractHUNV,TractSNAP
145292,Little Wound School,590017300115,total,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145293,Little Wound School,590017300115,disability,18,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145294,Little Wound School,590017300115,econ_disadvantaged,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,
145295,Little Wound School,590017300115,am_indian/ak_native,103,1 Main Street,Kyle,SD,57752,46102,46,...,,,,,,,,,,


In [41]:
for col in list(demo_dicts.keys()):
    for tract in list(demo_dicts[col].keys()):
        df.loc[df['CensusTract'] == tract, col] = demo_dicts[col][tract]
        
df[df['CensusTract'] == 46102940900]

Unnamed: 0,school_name,ncessch,cohort_cat,cohort_num,street,city,state,zip,county_code,state_code,...,TractSeniors,TractWhite,TractBlack,TractAsian,TractNHOPI,TractAIAN,TractOMultir,TractHispanic,TractHUNV,TractSNAP
145292,Little Wound School,590017300115,total,103,1 Main Street,Kyle,SD,57752,46102,46,...,399.0,341.0,0.0,0.0,0.0,4299.0,182.0,283.0,47.0,91.0
145293,Little Wound School,590017300115,disability,18,1 Main Street,Kyle,SD,57752,46102,46,...,399.0,341.0,0.0,0.0,0.0,4299.0,182.0,283.0,47.0,91.0
145294,Little Wound School,590017300115,econ_disadvantaged,103,1 Main Street,Kyle,SD,57752,46102,46,...,399.0,341.0,0.0,0.0,0.0,4299.0,182.0,283.0,47.0,91.0
145295,Little Wound School,590017300115,am_indian/ak_native,103,1 Main Street,Kyle,SD,57752,46102,46,...,399.0,341.0,0.0,0.0,0.0,4299.0,182.0,283.0,47.0,91.0


In [42]:
df.isnull().sum()[df.isnull().sum() > 0]

Series([], dtype: int64)

Finally, we have joined all of our data together and removed all missing values. There is a little more cleaning that we can do before we start our analysis. 

1) The GroupQuartersFlag indicates that more than 67% of the population lives in 'group quarters', which are communal living situations like military barracks, employee housing, etc. Since these situations are less common and introduce additional factors that may skew our results (for instance, military barracks will provide food for those living there, making the proximity of a grocery store irrelevant), we'll drop any rows where this is true.

2) Several columns, such as the address of the school and the school name, were only needed for matching and joining purposes. We can drop these columns to make our analysis simpler.

3) We can reorganize the columns to read a little easier.

4) We can set the correct types for each of the features.
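The steps above can be sketched as a single chained expression. This is a toy illustration with a handful of invented rows and only a few of the real column names, not the exact cleanup performed below.

```python
import pandas as pd

# Invented rows; column names are a small subset of the merged frame.
toy = pd.DataFrame({
    "school_name": ["A", "B", "C"],
    "GroupQuartersFlag": [0, 1, 0],
    "LILATracts_1And10": [1, 0, 0],
    "grad_rate": [90, 75, 82],
})

cleaned = (
    toy[toy["GroupQuartersFlag"] == 0]                   # 1) drop group-quarters tracts
    .drop(columns=["school_name", "GroupQuartersFlag"])  # 2) drop join-only columns
    .astype({"LILATracts_1And10": "category"})           # 4) set feature types
    .reset_index(drop=True)
)
print(cleaned.dtypes)
```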

In [43]:
# Remove group quarters observations
df = df[df.GroupQuartersFlag == 0]

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144951 entries, 0 to 145323
Data columns (total 49 columns):
 #   Column                Non-Null Count   Dtype   
---  ------                --------------   -----   
 0   school_name           144951 non-null  object  
 1   ncessch               144951 non-null  int64   
 2   cohort_cat            144951 non-null  object  
 3   cohort_num            144951 non-null  int64   
 4   street                144951 non-null  object  
 5   city                  144951 non-null  object  
 6   state                 144951 non-null  object  
 7   zip                   144951 non-null  object  
 8   county_code           144951 non-null  int64   
 9   state_code            144951 non-null  int64   
 10  tract                 144951 non-null  float64 
 11  school_level          144951 non-null  int64   
 12  charter               144951 non-null  int64   
 13  lunch_program         144951 non-null  float64 
 14  urban_locale_cat      144951 non-nul

The USDA has created 8 categories to measure food access. The categories are all identified as low access and are further subdivided by distance (measured separately for urban and rural), low income status, and vehicle accessibility. Given that they are the SMEs for this, I will use their categories for my work instead of attempting to create my own. So, I will drop all of the food access columns other than those flags.

In [45]:
# Drop unnecessary columns
df = df[['ncessch', 'CensusTract', 'cohort_num','cohort_cat', 'school_level', 'charter', 'lunch_program', 'LILATracts_1And10', 'LILATracts_halfAnd10', 'LILATracts_1And20', 'LILATracts_Vehicle',
         'LA1and10', 'LAhalfand10', 'LA1and20', 'LATractsVehicle_20','grad_rate']]

In [46]:
# Reorganize columns

df.columns

Index(['ncessch', 'CensusTract', 'cohort_num', 'cohort_cat', 'school_level',
       'charter', 'lunch_program', 'LILATracts_1And10', 'LILATracts_halfAnd10',
       'LILATracts_1And20', 'LILATracts_Vehicle', 'LA1and10', 'LAhalfand10',
       'LA1and20', 'LATractsVehicle_20', 'grad_rate'],
      dtype='object')

In [47]:
# Update columns to appropriate type

# Everything but the target is a category, so we'll change everything to category and then just update the target
df = df.astype('category')

df[['grad_rate', 'cohort_num']] = df[['grad_rate', 'cohort_num']].astype('int64')

df.dtypes


ncessch                 category
CensusTract             category
cohort_num                 int64
cohort_cat              category
school_level            category
charter                 category
lunch_program           category
LILATracts_1And10       category
LILATracts_halfAnd10    category
LILATracts_1And20       category
LILATracts_Vehicle      category
LA1and10                category
LAhalfand10             category
LA1and20                category
LATractsVehicle_20      category
grad_rate                  int64
dtype: object

In [48]:
df.to_pickle('grad_and_access_df.pkl')