# [Modeling] Spatial Contexts for Los Angeles

In [2]:
import pandas as pd
import geopandas as gpd
import numpy as np
import statsmodels
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster
from sklearn.model_selection import train_test_split

%matplotlib inline
plt.style.use('fivethirtyeight')

## Predictive Modeling

Here we use CalEPA data for Los Angeles to draw inferences and make predictions with several models, both unsupervised and supervised.

In [38]:
ces_la = pd.read_csv("../data/CalEPA/ces_losangeles.csv", index_col=0).dropna()
# Zero-pad the lowest ranges and strip the parenthetical notes from the labels.
ces_la["CES 4.0 Percentile Range"] = ces_la["CES 4.0 Percentile Range"].replace({
    "1-5% (lowest scores)": "00-05%",
    "5-10%": "05-10%",
    "95-100% (highest scores)": "95-100%",
})

In [39]:
ces_la

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Approximate Location,Longitude,Latitude,CES 4.0 Score,CES 4.0 Percentile Range,Ozone,...,CES 4.0 Percentile,Children < 10 years (%),Pop 10-64 years (%),Elderly > 64 years (%),Hispanic (%),White (%),African American (%),Native American (%),Asian American (%),Other/Multiple (%)
0,6037204920,2751,Los Angeles,90023,Los Angeles,-118.197497,34.017500,82.39,95-100%,0.048,...,99.97,13.34,72.59,14.07,97.27,1.71,0.84,0.00,0.00,0.18
2,6037543202,5124,Los Angeles,90220,Compton,-118.230032,33.879862,79.29,95-100%,0.042,...,99.91,18.60,72.48,8.92,78.14,1.09,15.67,0.00,4.84,0.25
3,6037203300,2000,Los Angeles,90033,Los Angeles,-118.207788,34.058872,77.35,95-100%,0.049,...,99.87,7.70,84.50,7.80,75.55,2.85,10.45,0.00,6.95,4.20
4,6037291220,3640,Los Angeles,90247,Los Angeles,-118.286709,33.877139,77.25,95-100%,0.041,...,99.86,12.77,73.16,14.07,69.34,3.98,8.43,0.00,16.32,1.92
5,6037433501,1949,Los Angeles,91733,South El Monte,-118.065122,34.057255,76.91,95-100%,0.055,...,99.85,10.98,75.42,13.60,93.89,0.72,0.00,0.00,5.39,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2055,6037262604,5523,Los Angeles,90272,Los Angeles,-118.548578,34.051108,3.85,00-05%,0.051,...,1.31,14.01,58.34,27.65,3.31,86.24,0.00,0.00,5.63,4.82
2056,6037670702,5649,Los Angeles,90275,Rancho Palos Verdes,-118.328443,33.753600,3.53,00-05%,0.039,...,1.02,7.97,65.59,26.45,8.43,68.53,1.12,0.42,15.29,6.21
2057,6037620904,2897,Los Angeles,90266,Manhattan Beach,-118.410369,33.880731,3.08,00-05%,0.043,...,0.72,12.60,70.80,16.60,4.42,85.71,0.00,0.00,2.35,7.53
2058,6037262802,3424,Los Angeles,90272,Los Angeles,-118.502456,34.045865,2.23,00-05%,0.050,...,0.34,11.42,61.89,26.69,10.72,80.96,0.32,0.00,2.98,5.02


## 1. Unsupervised Modeling

Without telling our models about the CES 4.0 Scores or their corresponding percentile groups, can they learn about them?

Can we use specific features to predict a score, population characteristic, or pollution indicator?

### Principal Component Analysis

Why PCA? Two reasons:

1. We want to visually identify clusters of similar observations in high dimensions.
2. We have reason to believe that the data are inherently low rank: there are many attributes, but a few of them largely determine the rest through linear associations.

Are there features, whether linearly or non-linearly related, that should guide our understanding of the CES 4.0 Scores?


In [30]:
from sklearn.decomposition import PCA

In [42]:
df_pca = ces_la.iloc[:, 7:].drop(columns=['CES 4.0 Score', 'CES 4.0 Percentile Range', 'CES 4.0 Percentile'])
df_pca

Unnamed: 0,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Lead,Lead Pctl,...,Pop. Char. Pctl,Children < 10 years (%),Pop 10-64 years (%),Elderly > 64 years (%),Hispanic (%),White (%),African American (%),Native American (%),Asian American (%),Other/Multiple (%)
0,0.048,53.73,12.251640,89.21,0.781,96.55,787.94,92.53,92.56,98.40,...,95.79,13.34,72.59,14.07,97.27,1.71,0.84,0.00,0.00,0.18
2,0.042,26.70,12.216660,88.64,0.376,83.17,459.20,55.16,91.35,97.71,...,96.90,18.60,72.48,8.92,78.14,1.09,15.67,0.00,4.84,0.25
3,0.049,59.69,12.576875,91.57,1.053,98.41,798.87,93.57,74.68,84.31,...,94.49,7.70,84.50,7.80,75.55,2.85,10.45,0.00,6.95,4.20
4,0.041,24.88,12.066061,82.35,0.637,94.08,805.57,94.04,66.07,74.22,...,86.01,12.77,73.16,14.07,69.34,3.98,8.43,0.00,16.32,1.92
5,0.055,71.66,12.004168,78.44,0.551,91.56,821.08,95.28,91.11,97.54,...,95.54,10.98,75.42,13.60,93.89,0.72,0.00,0.00,5.39,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2055,0.051,64.73,11.249036,57.42,0.068,25.09,439.39,51.48,40.31,37.61,...,0.23,14.01,58.34,27.65,3.31,86.24,0.00,0.00,5.63,4.82
2056,0.039,17.65,11.404007,59.30,0.019,6.76,277.20,18.36,41.13,38.71,...,0.69,7.97,65.59,26.45,8.43,68.53,1.12,0.42,15.29,6.21
2057,0.043,29.89,11.879107,72.43,0.073,26.98,237.39,9.33,39.36,36.19,...,0.64,12.60,70.80,16.60,4.42,85.71,0.00,0.00,2.35,7.53
2058,0.050,60.93,11.339568,58.37,0.104,38.54,540.97,63.99,48.87,49.94,...,0.05,11.42,61.89,26.69,10.72,80.96,0.32,0.00,2.98,5.02


In [43]:
pca = PCA(n_components=2)

In [44]:
pca.fit(df_pca)

PCA(n_components=2)
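
The fit above runs on the raw columns, which mix very different scales (Ozone is near 0.05 while Drinking Water is in the hundreds). One common option, shown here as a minimal sketch and not as the notebook's own method, is to standardize each column before PCA. The helper name `project_to_two_components` and the synthetic `demo_features` array are assumptions made for illustration; in the notebook, `df_pca` would take the place of `demo_features`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def project_to_two_components(features):
    """Standardize each column, then project onto the first two principal components."""
    # Hypothetical helper: StandardScaler gives every column zero mean and unit
    # variance so that large-valued columns do not dominate the components.
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
    return pipeline.fit_transform(features)


# Synthetic stand-in for df_pca: three columns on very different scales.
rng = np.random.default_rng(0)
demo_features = np.column_stack([
    rng.normal(0.05, 0.005, size=100),   # small values, like a concentration
    rng.normal(500.0, 200.0, size=100),  # large values, like a contaminant index
    rng.normal(50.0, 25.0, size=100),    # mid-range values, like a percentile
])

projected = project_to_two_components(demo_features)
print(projected.shape)
```

The returned array has one row per observation and two columns, so it can be passed directly to a scatter plot colored by percentile range.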

How much information do we lose by keeping only two components? A scree plot of the variance explained by each component helps answer this.
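
A scree plot can be sketched as follows. This is an illustration under stated assumptions, not the notebook's final analysis: the helpers `explained_variance_profile` and `plot_scree` are hypothetical names, the columns are standardized first (a choice not made in the cells above), and the `demo` array is synthetic data built to be approximately rank 2. In the notebook, `df_pca` would replace `demo`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_profile(features):
    """Return per-component and cumulative explained-variance ratios."""
    # Assumption: standardize first so no single column dominates the variance.
    scaled = StandardScaler().fit_transform(features)
    pca_full = PCA().fit(scaled)  # no n_components, so every component is kept
    ratios = pca_full.explained_variance_ratio_
    return ratios, np.cumsum(ratios)


def plot_scree(ratios):
    """Draw a scree plot: explained-variance ratio against component number."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    components = np.arange(1, len(ratios) + 1)
    fig, ax = plt.subplots()
    ax.plot(components, ratios, marker="o")
    ax.set_xlabel("Principal component")
    ax.set_ylabel("Explained variance ratio")
    ax.set_title("Scree plot")
    return ax


# Synthetic stand-in: six columns driven by two latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

ratios, cumulative = explained_variance_profile(demo)
print(np.round(cumulative, 3))
```

Calling `plot_scree(ratios)` draws the plot. The share of variance lost by keeping two components is `1 - cumulative[1]`; a sharp elbow after the second component would support the two-component choice.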

### Clustering

We will use affinity propagation because we expect many clusters of unequal size.

In [None]:
from sklearn.cluster import AffinityPropagation
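
A minimal sketch of how the clustering step might look is given below. The helper name `cluster_with_affinity_propagation`, the `damping=0.9` setting, the standardization step, and the synthetic blobs are all assumptions chosen for illustration; none of them come from the notebook, and `df_pca` would replace `demo_features` in practice.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def cluster_with_affinity_propagation(features, random_state=0):
    """Fit affinity propagation and return (labels, number of clusters)."""
    # Assumption: standardize so distances are not dominated by one column.
    scaled = StandardScaler().fit_transform(features)
    # damping=0.9 is an illustrative choice that tends to stabilize the updates.
    model = AffinityPropagation(damping=0.9, random_state=random_state).fit(scaled)
    # The number of clusters is not set in advance; it is the number of exemplars found.
    n_clusters = len(model.cluster_centers_indices_)
    return model.labels_, n_clusters


# Synthetic stand-in for the CES features: three blobs in two dimensions.
demo_features, _ = make_blobs(
    n_samples=300,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.5,
    random_state=0,
)

labels, n_clusters = cluster_with_affinity_propagation(demo_features)
print("clusters found:", n_clusters)
```

Because the algorithm chooses the number of clusters itself, the result is sensitive to the `preference` parameter, which is left at its default here. Comparing the resulting labels against the CES 4.0 percentile ranges would be one way to see what the clusters capture.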

What can we learn?

## 2. Supervised Modeling

### Classification

### Regression