# DTSA 5509 Supervised Learning Final Project
## Decision Tree Classification Tasks Applied to Food Access Research Atlas
### Project Topic
Food deserts are areas with low access to supermarkets and other sources of healthy food. They are often reported in communities that are already underserved, including low-income and Black communities. Access to affordable, healthy food is critical to a community's health and wellbeing, and areas struggling with poverty and inequality often also struggle with basic access to healthy food.

I took data summarizing socio-economic factors and used machine learning to examine how those factors relate to low access to food. Using this data, I explored a variety of decision tree classification models that attempt to classify whether a community has low access to food, using measures such as racial composition and poverty levels as features.

**For more information**

Annie E. Casey Foundation: Food Deserts in the United States: https://www.aecf.org/blog/exploring-americas-food-deserts

Wikipedia: Food Desert: https://en.wikipedia.org/wiki/Food_desert

### Imports

In [48]:
import pandas as pd
import altair as alt
import dtale
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder
from imblearn.under_sampling import RandomUnderSampler
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

### Data
The data used in this report is the Food Access Research Atlas, published by the Economic Research Service, Department of Agriculture, via data.gov.

This dataset has 72,531 records, one for each U.S. Census tract. It includes 147 fields per tract, describing the tract itself, its population, demographic breakdowns (Black population, seniors, children, etc.), and numerous measures of food access.

For my task, I wanted to focus on how socio-economic factors contribute to whether a tract is likely to be classified as having low access to food. I limited the fields used to the following:

| Field              | Description                                                                                                 |
|:-------------------|:------------------------------------------------------------------------------------------------------------|
| LA1and20           | Flag for low access tract at 1 mile for urban areas or 20 miles for rural areas                             |
| Urban              | Flag for urban tract                                                                                        |
| LowIncomeTracts    | Flag for low income tract                                                                                   |
| PovertyRate        | Share of the tract population living with income at or below the Federal poverty thresholds for family size |
| MedianFamilyIncome | Tract median family income                                                                                  |
| TractLOWI          | Total count of low-income population in tract                                                               |
| TractKids          | Total count of children age 0-17 in tract                                                                   |
| TractSeniors       | Total count of seniors age 65+ in tract                                                                     |
| TractWhite         | Total count of White population in tract                                                                    |
| TractBlack         | Total count of Black or African American population in tract                                                |
| TractAsian         | Total count of Asian population in tract                                                                    |
| TractNHOPI         | Total count of Native Hawaiian and Other Pacific Islander population in tract                               |
| TractAIAN          | Total count of American Indian and Alaska Native population in tract                                        |
| TractOMultir       | Total count of Other/Multiple race population in tract                                                      |
| TractHispanic      | Total count of Hispanic or Latino population in tract                                                       |
| TractHUNV          | Total count of housing units without a vehicle in tract                                                     |
| TractSNAP          | Total count of housing units receiving SNAP benefits in tract                                               |

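Field `LA1and20` is the label I'm attempting to predict. The remaining fields are the features. `Pop2010`, the tract's total population, is also loaded, but only as the denominator for the percentage calculations described under Data Cleaning.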


Data: https://catalog.data.gov/dataset/food-access-research-atlas

About the Atlas: https://www.ers.usda.gov/data-products/food-access-research-atlas/about-the-atlas.aspx

**Data Citations**

Economic Research Service, Department of Agriculture (2021, February 24). Data.Gov. Food Access Research Atlas. Retrieved April 23, 2023, from https://catalog.data.gov/dataset/food-access-research-atlas

In [2]:
# Create variable table and copy to clipboard for above
used_columns = ['LA1and20','Pop2010','Urban','PovertyRate','MedianFamilyIncome','LowIncomeTracts','TractBlack','TractLOWI','TractKids','TractSeniors','TractWhite','TractAsian','TractNHOPI','TractAIAN','TractOMultir','TractHispanic','TractHUNV','TractSNAP']
var_df = pd.read_excel('../data/FoodAccessResearchAtlasData2019.xlsx', sheet_name='Variable Lookup')
var_df = var_df[var_df['Field'].isin(used_columns)][['Field','Description']]
pd.io.clipboards.to_clipboard(var_df.to_markdown(index=False), excel=False)

### Data Cleaning

As outlined above, I've limited the dataset to the fields most relevant to the classification task. Of the 147 columns in the original file, I kept 18: the label, the total population, and 16 candidate features describing the demographic makeup of an area. Many of the other columns describe low food access in detail for specific populations. I left these out because they are derived from the same access measurements as the label I'm attempting to predict, so including them would leak the label into the features.

In addition, while reviewing the number of NA values present in the data, I found that `MedianFamilyIncome` had a high count of NAs. I have dropped this feature from the dataset for this reason.

The remaining features have a low number of NA values.  For these, I have dropped the rows (tracts) that contained NA values.

In the original dataset, the demographic data is reported as total counts of residents in that demographic. Because I want to compare tracts, and each tract may have a different total population, it makes most sense to convert these to percentages.  Below, I calculate the percentage of residents in each demographic using the demographic count and the total population count `Pop2010`.  After I've done these calculations, I drop the original count columns as they are no longer needed.

I also checked whether the labels in the dataset are balanced. They are not, so I account for the imbalance later while building the model.

#### Read Data

In [15]:
# Read Data
df = pd.read_excel('../data/FoodAccessResearchAtlasData2019.xlsx', sheet_name='Food Access Research Atlas')
df = df[used_columns]

#### Check for NAs, Visualize, and Clean

In [49]:
# Check for NAs and visualize
na_counts = pd.DataFrame(df.isna().sum()).reset_index()
na_counts.columns = ['Field', 'NA Count']
na_counts = na_counts.sort_values('NA Count', ascending=False)
base = alt.Chart(na_counts).encode(
    x=alt.X('Field:N', sort='-y'),
    y='NA Count',
    text = 'NA Count'
).properties(width=800)
base.mark_bar() + base.mark_text(align='center', dy=-10)

In [5]:
# Drop high NA column Median Family Income
df = df.drop('MedianFamilyIncome', axis=1)
# Drop records with NAs in values
df = df.dropna()
# Check for no NAs
df.isna().sum()

LA1and20           0
Pop2010            0
Urban              0
PovertyRate        0
LowIncomeTracts    0
TractBlack         0
TractLOWI          0
TractKids          0
TractSeniors       0
TractWhite         0
TractAsian         0
TractNHOPI         0
TractAIAN          0
TractOMultir       0
TractHispanic      0
TractHUNV          0
TractSNAP          0
dtype: int64

#### Calculate Percentage Share for Each Demographic

In [51]:
# Calculate percentage share for each demographic population
df['BlackPopShare'] = df['TractBlack'] / df['Pop2010']
df['LOWIPopShare'] = df['TractLOWI'] / df['Pop2010']
df['KidsPopShare'] = df['TractKids'] / df['Pop2010']
df['SeniorsPopShare'] = df['TractSeniors'] / df['Pop2010']
df['WhitePopShare'] = df['TractWhite'] / df['Pop2010']
df['AsianPopShare'] = df['TractAsian'] / df['Pop2010']
df['NHOPIPopShare'] = df['TractNHOPI'] / df['Pop2010']
df['AIANPopShare'] = df['TractAIAN'] / df['Pop2010']
df['OMultirPopShare'] = df['TractOMultir'] / df['Pop2010']
df['HispanicPopShare'] = df['TractHispanic'] / df['Pop2010']
df['HUNVPopShare'] = df['TractHUNV'] / df['Pop2010']
df['SNAPPopShare'] = df['TractSNAP'] / df['Pop2010']
df = df.drop(['TractBlack','TractLOWI','TractKids','TractSeniors','TractWhite','TractAsian','TractNHOPI','TractAIAN','TractOMultir','TractHispanic','TractHUNV','TractSNAP'], axis=1)
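
The share calculation above divides by `Pop2010`, which yields `NaN` or `inf` for any tract that reports a population of zero. The following is a minimal sketch of a guarded version, not the method used in this notebook: the helper name `population_share` and the toy values are assumptions for illustration. Note also that `TractHUNV` and `TractSNAP` count housing units, so their ratios are units per resident rather than true population shares.

```python
import pandas as pd


def population_share(counts: pd.Series, population: pd.Series) -> pd.Series:
    """Return counts / population, with NaN wherever population is not positive.

    Hypothetical helper for illustration; it is not part of the atlas or pandas.
    """
    # Series.where keeps values where the condition holds and inserts NaN elsewhere
    safe_population = population.where(population > 0)
    return counts / safe_population


# Toy data standing in for the atlas columns (values are made up)
toy = pd.DataFrame({
    'Pop2010': [1000, 0, 250],
    'TractBlack': [100, 0, 50],
})
toy['BlackPopShare'] = population_share(toy['TractBlack'], toy['Pop2010'])
```

Rows left as `NaN` can then be dropped with `dropna()` or passed through as missing values, depending on how the model should treat unpopulated tracts.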

#### Check for Label Imbalance

In [46]:
# Check for imbalanced labels
val_counts = pd.DataFrame(df['LA1and20'].value_counts()).reset_index()
base = alt.Chart(val_counts).encode(
    x=alt.X('LA1and20:N', sort='-y'),
    y='count',
    text = 'count'
).properties(width=100)
base.mark_bar(size=30) + base.mark_text(align='center', dy=-10)
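
One common way to handle imbalanced labels is random undersampling of the majority class. The sketch below does this with plain pandas as a stand-in for `RandomUnderSampler` from `imblearn`, which is imported above. The helper name `undersample`, the seed, and the toy data are assumptions for illustration.

```python
import pandas as pd


def undersample(frame: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class.

    Hypothetical helper for illustration only.
    """
    smallest = frame[label_col].value_counts().min()
    # Draw the same number of rows from each label group
    return (
        frame.groupby(label_col, group_keys=False)
             .sample(n=smallest, random_state=seed)
    )


# Toy data: 8 negatives and 2 positives (values are made up)
toy = pd.DataFrame({'LA1and20': [0] * 8 + [1] * 2, 'PovertyRate': range(10)})
balanced = undersample(toy, 'LA1and20')
```

If this approach is used, it is usually applied to the training split only, so that the test split keeps the label proportions found in the real data.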

### Exploratory Data Analysis

### Models

In [50]:
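# TODO: Rebalance

# Feature columns; MedianFamilyIncome was dropped during cleaning, so it is not listed here
feature_columns = ['PovertyRate','LowIncomeTracts','BlackPopShare','LOWIPopShare','KidsPopShare','SeniorsPopShare','WhitePopShare','AsianPopShare','NHOPIPopShare','AIANPopShare','OMultirPopShare','HispanicPopShare','HUNVPopShare','SNAPPopShare']
data = df[feature_columns]
# Prefix the dummy columns so every feature name is a string rather than a bare 0 or 1
data = pd.concat([data, pd.get_dummies(df['Urban'], prefix='Urban')], axis=1)
label = df['LA1and20']
# Stratify so both splits keep the original label proportions
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, stratify=label)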

In [None]:
data

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
evallist = [(dtrain, 'train'), (dtest, 'eval')]

In [None]:
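# scale_pos_weight follows the XGBoost guidance of negatives / positives, computed on the training labels
param = {'max_depth': 7, 'eta': 0.3, 'objective': 'binary:logistic', 'scale_pos_weight': (y_train == 0).sum() / (y_train == 1).sum()}
param['nthread'] = 4
param['eval_metric'] = 'auc'
num_round = 100
# Early stopping monitors the last entry in evallist, which here is the test split
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=5)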
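
The XGBoost parameter documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value for `scale_pos_weight`. Below is a small worked example of that ratio on made-up labels; the helper name `negative_to_positive_ratio` is an assumption for illustration.

```python
import numpy as np


def negative_to_positive_ratio(labels) -> float:
    """Return (# of 0 labels) / (# of 1 labels) for a binary label vector.

    Hypothetical helper for illustration only.
    """
    labels = np.asarray(labels)
    positives = int((labels == 1).sum())
    negatives = int((labels == 0).sum())
    return negatives / positives


# Toy labels: three negatives and one positive
toy_labels = [0, 0, 0, 1]
ratio = negative_to_positive_ratio(toy_labels)
```

With three negatives per positive, each positive example is weighted three times as heavily, so both classes contribute roughly equally to the loss.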

In [None]:
xgb.plot_importance(bst)

In [None]:
image_format = 'svg'  # Vector output; the name avoids shadowing the built-in format()
image = xgb.to_graphviz(bst, num_trees=0)
image.graph_attr = {'dpi':'400'}
image.render('tree', format=image_format)

In [None]:
# Predict on the held-out features; with the binary:logistic objective, predict returns positive-class probabilities
dtest = xgb.DMatrix(X_test)
ypred = bst.predict(dtest)
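
With the `binary:logistic` objective, `bst.predict` returns probabilities rather than class labels, so a threshold is needed before comparing predictions to the true labels. The following is a minimal sketch of that step using made-up values; the helper name `confusion_counts` and the 0.5 threshold are assumptions, not choices made in this notebook.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and count true/false positives and negatives.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        'tp': int(((y_pred == 1) & (y_true == 1)).sum()),
        'fp': int(((y_pred == 1) & (y_true == 0)).sum()),
        'fn': int(((y_pred == 0) & (y_true == 1)).sum()),
        'tn': int(((y_pred == 0) & (y_true == 0)).sum()),
    }


# Toy labels and predicted probabilities (values are made up)
toy_true = [1, 0, 1, 0]
toy_prob = [0.9, 0.4, 0.3, 0.6]
counts = confusion_counts(toy_true, toy_prob)
```

On imbalanced labels, these counts (and the precision and recall derived from them) are generally more informative than plain accuracy.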