# Processing & Pruning Our Data

## Import our dependencies

In [33]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import os

## Load our datasets

In [2]:
atlas_fifteen_path = '../data/raw/Raw_Atlas_2015/Raw_Atlas_Data_2015.xlsx'
fifteen_xl = pd.ExcelFile(atlas_fifteen_path)
atlas_ten_path = '../data/raw/Raw_Atlas_2010/Raw_Atlas_Data_2010.xlsx'
ten_xl = pd.ExcelFile(atlas_ten_path)

atlas_nineteen = pd.read_csv('../data/raw/Raw_Atlas_2019/Raw_Atlas_Data_2019.csv')
atlas_fifteen = fifteen_xl.parse('Food Access Research Atlas')
atlas_ten = ten_xl.parse('Food Access Research Atlas')

## Processing 2019 data for our present-day classification model

Many of the 100+ columns in our table are redundant or uninformative for training our model.

Let's keep the fundamentals, along with a set of buffer-zone share columns for identifying relative isolation and outliers.

In [5]:
keep_cols = [
    'Urban', 'Pop2010', 'PovertyRate', 'MedianFamilyIncome',
    'TractLOWI', 'TractKids', 'TractSeniors', 'TractHUNV', 'TractSNAP'
]

buffer_cols = [
    'lapop1share', 'lalowi1share', 'lakids1share', 'laseniors1share',
    'lahunv1share', 'lasnap1share'
]

In [27]:
processed_atlas_nineteen = atlas_nineteen[keep_cols + buffer_cols].copy()
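Selecting with a list of column names raises a KeyError if any name is missing, which can happen when Atlas releases rename fields. Below is a minimal sketch of a pre-check, run on a hypothetical toy frame (`toy_atlas`, with made-up values) rather than the real 2019 data:

```python
import pandas as pd

# Hypothetical stand-in for atlas_nineteen; the values are invented for illustration.
toy_atlas = pd.DataFrame({
    'Urban': [1, 0],
    'Pop2010': [1200, 800],
    'PovertyRate': [11.3, 17.9],
})

wanted_cols = ['Urban', 'Pop2010', 'PovertyRate', 'TractSNAP']

# Split the requested names into those the frame has and those it lacks.
missing_cols = [col for col in wanted_cols if col not in toy_atlas.columns]
present_cols = [col for col in wanted_cols if col in toy_atlas.columns]

# Select only what exists, and report the rest instead of failing.
subset = toy_atlas[present_cols].copy()
print("Missing columns:", missing_cols)
```

On the real data, an empty `missing_cols` list would confirm that `keep_cols + buffer_cols` is safe to select.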

### Additional features for our 2019 model

#### Simple ratios:

In [35]:
processed_atlas_nineteen['LOWIRatio'] = processed_atlas_nineteen['TractLOWI'] / processed_atlas_nineteen['Pop2010'] # Low-income count as a fraction of tract population
processed_atlas_nineteen['SNAPRatio'] = processed_atlas_nineteen['TractSNAP'] / processed_atlas_nineteen['Pop2010'] # SNAP count relative to tract population
processed_atlas_nineteen['HUNVRatio'] = processed_atlas_nineteen['TractHUNV'] / processed_atlas_nineteen['Pop2010'] # No-vehicle count relative to tract population (important for food deserts!)
processed_atlas_nineteen['FoodInsecurityIndex'] = (
    processed_atlas_nineteen['LOWIRatio'] +
    processed_atlas_nineteen['SNAPRatio'] +
    processed_atlas_nineteen['HUNVRatio']
)
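If any tract reports a Pop2010 of zero, these divisions produce NaN or inf rather than a usable ratio. As far as I know, sklearn's scalers skip NaN while fitting but raise an error on infinite values, so it may be worth guarding the denominator. A minimal sketch on a hypothetical toy frame (values invented, not taken from the Atlas):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second and third tracts have zero population.
toy = pd.DataFrame({'TractLOWI': [50, 0, 10], 'Pop2010': [200, 0, 0]})

# Plain division: 0/0 becomes NaN and 10/0 becomes inf.
raw_ratio = toy['TractLOWI'] / toy['Pop2010']

# Guarded division: treat zero population as missing so the result is NaN, never inf.
safe_ratio = toy['TractLOWI'] / toy['Pop2010'].replace(0, np.nan)

print(raw_ratio.tolist())
print(safe_ratio.tolist())
```

Whether NaN rows should then be dropped or imputed is a separate modeling choice that this sketch does not settle.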

#### More complex features:

In [36]:
processed_atlas_nineteen['SNAPDisparity'] = processed_atlas_nineteen['SNAPRatio'] - processed_atlas_nineteen['lasnap1share']

Our calculated SNAP disparity measures how much higher or lower the tract's SNAP ratio is relative to its surrounding area.

A high positive disparity would indicate isolated food insecurity, while a negative disparity would indicate a privileged pocket inside a worse-off area.
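To make the sign convention concrete, here is a small sketch on a hypothetical three-tract frame (all numbers invented). It only illustrates how the subtraction behaves, not what real tracts look like:

```python
import numpy as np
import pandas as pd

# Hypothetical tracts: one above its surroundings, one below, one equal.
toy = pd.DataFrame({
    'SNAPRatio':    [0.30, 0.05, 0.10],
    'lasnap1share': [0.10, 0.20, 0.10],
})

toy['SNAPDisparity'] = toy['SNAPRatio'] - toy['lasnap1share']

# np.sign gives +1 for a positive disparity, -1 for a negative one, 0 for no difference.
toy['DisparitySign'] = np.sign(toy['SNAPDisparity'])
print(toy)
```

The first tract gets a positive disparity, the second a negative one, and the third sits at zero.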

In [41]:
processed_atlas_nineteen['LOWIWeighted'] = processed_atlas_nineteen['TractLOWI'] * processed_atlas_nineteen['PovertyRate']

Our weighted low income emphasizes tracts with both a high number of low-income people and a high poverty rate.

This helps create more granular prioritization for our model.
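A quick sketch of how the weighting separates tracts, using a hypothetical toy frame with invented numbers:

```python
import pandas as pd

# Hypothetical tracts: same low-income count with different poverty rates,
# and same poverty rate with different low-income counts.
toy = pd.DataFrame({
    'TractLOWI':   [1000, 1000, 100],
    'PovertyRate': [5.0, 30.0, 30.0],
})

toy['LOWIWeighted'] = toy['TractLOWI'] * toy['PovertyRate']

# Highest weighted value first.
ranked = toy.sort_values('LOWIWeighted', ascending=False)
print(ranked)
```

The tract that is high on both inputs ranks first, while a tract that is high on only one of them falls behind it.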

### Scaling our 2019 data before saving to be processed later

In [42]:
scaler = MinMaxScaler()
scaled_processed_atlas_nineteen = scaler.fit_transform(processed_atlas_nineteen)

I chose to use sklearn's MinMaxScaler over StandardScaler for this project.

Because our present-day model will use ReLU as the hidden-layer activation function, MinMax scaling is a good match: it keeps every input in the non-negative [0, 1] range.

Our K-Means clustering would work fine with either choice. Given that the feature values in this dataset (proportions, demographic shares, poverty rates) are naturally bounded and positive, MinMaxScaler could also be seen as a natural fit.
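The difference between the two scalers is easy to see on a single made-up column. This is only an illustrative sketch on invented values, not a benchmark on our data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical bounded, positive feature (one column, four samples).
values = np.array([[0.0], [2.0], [4.0], [8.0]])

# MinMaxScaler maps the column's min to 0 and its max to 1.
minmax_scaled = MinMaxScaler().fit_transform(values)

# StandardScaler centers on the mean, so below-average samples become negative.
standard_scaled = StandardScaler().fit_transform(values)

print(minmax_scaled.ravel())
print(standard_scaled.ravel())
```

The MinMax output stays within [0, 1], while the standardized output is centered on zero and includes negative values.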

### Save our processed data to be run through our autoencoder/K-means model

In [47]:
final_atlas_nineteen = pd.DataFrame(scaled_processed_atlas_nineteen, columns=processed_atlas_nineteen.columns)
final_atlas_nineteen.head()

| | Urban | Pop2010 | PovertyRate | MedianFamilyIncome | TractLOWI | TractKids | TractSeniors | TractHUNV | TractSNAP | lapop1share | lalowi1share | lakids1share | laseniors1share | lahunv1share | lasnap1share | LOWIRatio | SNAPRatio | HUNVRatio | FoodInsecurityIndex | SNAPDisparity | LOWIWeighted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.051027 | 0.113 | 0.318183 | 0.03622 | 0.042803 | 0.012796 | 0.00099 | 0.046897 | 0.9919 | 0.2411 | 0.289978 | 0.1144 | 0.0079 | 0.1322 | 0.007051 | 0.02284 | 0.000292 | 0.008725 | 0.849223 | 0.006762 |
| 1 | 1.0 | 0.057916 | 0.179 | 0.187881 | 0.063843 | 0.051161 | 0.012391 | 0.014689 | 0.071724 | 0.5811 | 0.2783 | 0.205837 | 0.0583 | 0.09 | 0.1295 | 0.010951 | 0.030778 | 0.003815 | 0.014296 | 0.852045 | 0.018882 |
| 2 | 1.0 | 0.090038 | 0.15 | 0.242867 | 0.103964 | 0.075475 | 0.025418 | 0.016339 | 0.07908 | 0.46 | 0.1418 | 0.135903 | 0.0596 | 0.0 | 0.0587 | 0.011472 | 0.021832 | 0.00273 | 0.013853 | 0.921084 | 0.025766 |
| 3 | 1.0 | 0.117086 | 0.028 | 0.275182 | 0.073396 | 0.08569 | 0.052342 | 0.003466 | 0.045057 | 0.3109 | 0.0783 | 0.086894 | 0.0539 | 0.0046 | 0.0176 | 0.006229 | 0.009566 | 0.000445 | 0.007032 | 0.961 | 0.003395 |
| 4 | 1.0 | 0.287442 | 0.152 | 0.379128 | 0.178475 | 0.266948 | 0.065196 | 0.03796 | 0.155862 | 0.2455 | 0.0545 | 0.073128 | 0.0336 | 0.0135 | 0.0204 | 0.00617 | 0.013481 | 0.001987 | 0.007736 | 0.958351 | 0.044822 |


In [48]:
final_atlas_nineteen.to_csv("../data/processed/processed_atlas_nineteen.csv")
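One thing to keep in mind when reloading this file: to_csv writes the row index by default, so read_csv will bring it back as an extra "Unnamed: 0" column unless it is handled. The sketch below shows the round trip on a hypothetical toy frame, using in-memory buffers instead of our real output path:

```python
import io
import pandas as pd

# Hypothetical stand-in for final_atlas_nineteen (invented values).
toy = pd.DataFrame({'Urban': [1.0, 0.0], 'LOWIRatio': [0.25, 0.5]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# With index=False, only the feature columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_default.columns.tolist())
print(reloaded_clean.columns.tolist())
```

Either passing index=False when saving or index_col=0 when loading should keep the index out of the feature matrix; which one fits depends on how the downstream notebook reads the file.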