# Week 09 (?!) LA Neighborhood Transit: Spatial Statistics Overview

**By:** Andrew Williams and Ben Brassette

**Description:** The purpose of this notebook is to use the tools from the Week 08 Spatial Stats lesson to provide a better analysis of our chosen neighborhoods for this project.

**Neighborhoods:**
* Downtown (Central LA)
* Pico-Union (Central LA)
* Panorama City (San Fernando Valley)
* North Hollywood (San Fernando Valley)
* Mid-City (Central LA, Car Dominant)

# Library 

In [None]:
# to read and wrangle data
import pandas as pd

# to create spatial data
import geopandas as gpd

# for basemaps
import contextily as ctx

# For spatial statistics
import esda
from esda.moran import Moran, Moran_Local

import splot
from splot.esda import moran_scatterplot, plot_moran, lisa_cluster, plot_moran_simulation

import libpysal as lps

# Graphics
import matplotlib.pyplot as plt
import plotly.express as px

# Trimming Data

## Data Check

I'm going to download my dataset that features mode of transportation to work. I'm using a dataset that also has neighborhoods, income, and racial breakdown in case I need to explore other variables (time permitting). I will do a typical check of the data to make sure it's ready for some exploration.

In [None]:
gdf= gpd.read_file('m2w_income_race_new.geojson')

In [None]:
type(gdf)

In [None]:
gdf.shape

In [None]:
gdf.head(4)

In [None]:
gdf.tail(4)

I'll need to rename my columns

In [None]:
gdf.columns.to_list()

In [None]:
gdf.columns=['Geoid',
 'Name',
 'Neighborhood',
 'Median Inc',
 'Total Work',
 'Car Total',
 'Drove alone',
 'Carpooled',
 'Public transportation',
 'Bus',
 'Subway',
 'Long-distance rail',
 'Light rail',
 'Worked from home',
 '%Car Total',
 '%Drove alone',
 '%Carpooled',
 '%Public transportation',
 '%Bus',
 '%Subway',
 '%Long-distance rail',
 '%Light rail',
 '%Worked from home',
 'Total Pop',
 'White',
 'Black',
 'Native',
 'Asian',
 'Native H',
 'Hispanic or Latino',
 '%White',
 '%Black',
 '%Native',
 '%Asian',
 '%Hawaiian',
 '%Hispanic or Latino',
 'geometry']

In [None]:
gdf.head(3)

All is right with the world and the dataset is good to go!

# Normalizing: Our Data per 1000 people

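Following the example from class, I'm normalizing a couple of variables to get rates per 1000. Note that the denominator is `Total Work` (total workers), so these are rates per 1000 workers rather than per 1000 residents.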

In [None]:
gdf['car_per_1000'] = gdf['Car Total']/gdf['Total Work']*1000
gdf['transit_per_1000'] = gdf['Public transportation']/gdf['Total Work']*1000
gdf['bus_per_1000'] = gdf['Bus']/gdf['Total Work']*1000
gdf['subway_per_1000'] = gdf['Subway']/gdf['Total Work']*1000
gdf['disrail_per_1000'] = gdf['Long-distance rail']/gdf['Total Work']*1000
gdf['lightrail_per_1000'] = gdf['Light rail']/gdf['Total Work']*1000

Will we use all of these? No. Likely just car and transit. But it's nice to have options.
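Here is a small sketch of the rate calculation on made-up numbers (the `toy` frame is hypothetical, not our tract data). It also shows that a tract with zero workers comes out as NaN, because 0 divided by 0 is undefined. That may be what is going on with any tracts that show no values.

```python
import pandas as pd

# Hypothetical toy data: three tracts, one with no workers at all
toy = pd.DataFrame({
    'Total Work': [2000, 500, 0],
    'Public transportation': [300, 20, 0],
})

# Rate per 1,000 workers: share of workers using transit, times 1,000
toy['transit_per_1000'] = toy['Public transportation'] / toy['Total Work'] * 1000

# Tracts with zero workers divide 0 by 0, which pandas reports as NaN
empty_tracts = toy[toy['Total Work'] == 0]

print(toy)
print(len(empty_tracts), 'tract(s) with no workers')
```

Filtering on `Total Work == 0` in the real data would be one way to check whether the empty tracts are simply tracts without workers.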

In [None]:
gdf.sample(3)

Also note, I should really stop adding spaces to my variable names.

In [None]:
gdf.sort_values(by="transit_per_1000").tail(10)

So I'm following Yoh's notebook until I get my feet settled with this data, but I did not realize 5 tracts have no data. They are not in our selected neighborhoods, but I would like to explore why these show up with no values. I know one of the tracts consists of the beach on the Westside. I imagine the others are similar in nature.

I know people are using points for their data with their polygons, but I'm going to continue to use my polygon tracts. HOWEVER, it would be interesting to map bus stops or transit stops in each tract. I initially tried bus stops but was having trouble uploading my data to Jupyter. May try again later. I could use rail stops, but given the dismal rail ridership I've seen, I'm not sure if that will be terribly helpful. Will forge on for better or worse now.

In [None]:
fig,ax = plt.subplots(figsize=(20,18))
gdf.sort_values(by='transit_per_1000',ascending=False)[:30].plot(ax=ax,
                                                                 color='blue',
                                                                 edgecolor='white',
                                                                 alpha=0.5,legend=True)


# title
ax.set_title('Top 30 Tracts of Transit Ridership per 1000 people')

# no axis
ax.axis('off')

# add a basemap
ctx.add_basemap(ax, crs=gdf.crs.to_string())

I changed the contextily input that we were using in class, as that wasn't showing anything. I think there may be an issue with projecting my data to a CRS, but I'm not entirely sure.

Top 30 tracts are in Central LA. It looks like mostly Westlake with some scattering around the edges. It's noticeable that these areas are presenting themselves as clusters, with one mega-cluster in Westlake.

In [None]:
fig,ax = plt.subplots(figsize=(20,20))

gdf.plot(ax=ax,
        column='transit_per_1000',
        legend=True,
        alpha=0.8,
        cmap='cividis',
        scheme='quantiles')

ax.axis('off')
ax.set_title('Transit Ridership Per 1000 People',fontsize=22) #font size! Well hot dog. Going to be using this for the next week
ctx.add_basemap(ax, crs=gdf.crs.to_string())

Okay, really want to get my bus stop data to work now, I think that would be helpful. Still having trouble with the data itself.

But the story: high transit use in Central LA and South LA, and moderate usage on the Westside and in the San Fernando Valley. Normalizing the data per 1000 people adds a new perspective, though in effect it is a different way to present percentages. Still interesting to see. I'm curious what the lag data will show.

# Global Spatial Autocorrelation or Something Like That

So I'm using k to set the number of nearest neighbors. When we eventually get down to some of the mapping and charts, I think seeing clusters of transit will provide some insights, but I'm still worried about using just the one combined dataset I have, as it feels "flat."

In [None]:
# calculate spatial weight
wq =  lps.weights.KNN.from_dataframe(gdf,k=8)

# Row-standardization
wq.transform = 'r'

Woo! Something happened. 

Doing stuff with spatial lag. Kind of exciting to see what happens with this. 

Moved down to just the one variable to make sure I get this right. But creating a new variable

In [None]:
gdf['transit_per_1000_lag'] = lps.weights.lag_spatial(wq, gdf['transit_per_1000'])

In [None]:
gdf.sample(10)[['Total Work','Neighborhood','Public transportation','transit_per_1000','transit_per_1000_lag']]

Oh! This is what I was expecting and I'm also surprised. There will be a couple layers to unpack here in a bit. Excited to move on.
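To make the lag column less of a black box, here is a by-hand sketch of what I understand KNN weights plus row-standardization to be doing. The coordinates and rates are made up, and this is plain NumPy rather than libpysal's own implementation, so treat it as an illustration only.

```python
import numpy as np

# Hypothetical toy data: five tract centroids on a line and a transit rate for each
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
rate = np.array([100.0, 200.0, 300.0, 400.0, 0.0])
k = 2  # neighbors per tract (the notebook uses k=8)

# Pairwise distances between centroids; a tract is never its own neighbor
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# Indices of the k nearest neighbors of each tract
neighbors = np.argsort(dist, axis=1)[:, :k]

# Row-standardized weights: each of the k neighbors gets weight 1/k
W = np.zeros((len(rate), len(rate)))
for i, nbrs in enumerate(neighbors):
    W[i, nbrs] = 1.0 / k

# The spatial lag is the weighted average of the neighbors' values
lag = W @ rate
diff = rate - lag
print(lag)
print(diff)
```

So the lag is just the average rate of a tract's k nearest neighbors, and a tract's difference from its lag says whether it sits above or below its surroundings.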

## DONUT and DONUT HOLE TIME (down with diamonds!)

Going to try and identify some donuts and donut holes. 

In [None]:
gdf['transit_lag_diff'] = gdf['transit_per_1000'] - gdf['transit_per_1000_lag']

In [None]:
gdf.sort_values(by='transit_lag_diff')

Well, that query wasn't too helpful and I'm definitely not going to check out the whole dataset. Though maybe it's time to check out what this means for our selected neighborhoods.

* Downtown (Central LA)
* Pico-Union (Central LA)
* Panorama City (San Fernando Valley)
* North Hollywood (San Fernando Valley)
* Mid-City (Central LA, Car Dominant)

In [None]:
gdf.query("Neighborhood== 'Downtown'").sort_values(by='transit_lag_diff')

Obviously these are all in one neighborhood, but I will be interested to see how they spatially relate to each other. There is a pretty significant range in transit lag differences.

In [None]:
gdf.query("Neighborhood== 'Pico-Union'").sort_values(by='transit_lag_diff')

Range is not quite as large as Downtown, but still significant.

In [None]:
gdf.query("Neighborhood== 'Panorama City'").sort_values(by='transit_lag_diff')

Panorama City's range is actually similar to Pico-Union's, which I do find surprising. Excited to plot these soon and see their spatial relation.

In [None]:
gdf.query("Neighborhood== 'North Hollywood'").sort_values(by='transit_lag_diff')

Fewer of the transit lag differences are positive, but that can be expected given this neighborhood is in the San Fernando Valley.

In [None]:
gdf.query("Neighborhood== 'Mid-City'").sort_values(by='transit_lag_diff')

Again, seems similar to Pico-Union and Panorama City. Maybe these patterns are indicative of neighborhoods in general, or at least neighborhoods with marginally more transit ridership.
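Rather than eyeballing the range in each query, the comparison could be put in one table. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical; the column names match the ones used in this notebook):

```python
import pandas as pd

# Hypothetical toy data: a few tracts per neighborhood with made-up differences
toy = pd.DataFrame({
    'Neighborhood': ['Downtown', 'Downtown', 'Downtown', 'Mid-City', 'Mid-City'],
    'transit_lag_diff': [-120.0, 15.0, 90.0, -30.0, 10.0],
})

# Summarize each neighborhood: lowest and highest difference,
# plus how many tracts sit above their neighbors' average
summary = toy.groupby('Neighborhood')['transit_lag_diff'].agg(
    low='min',
    high='max',
    n_positive=lambda s: int((s > 0).sum()),
)
summary['range'] = summary['high'] - summary['low']
print(summary)
```

Running the same `groupby` on `gdf`, filtered to our five neighborhoods, would show side by side whether Downtown really has the widest range.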

In [None]:
gdf_donut = gdf.sort_values(by='transit_lag_diff').head(5)
gdf_donut

In [None]:
# hashtag-donut holes for the win
gdf_donuthole = gdf.sort_values(by='transit_lag_diff').tail(28)
gdf_donuthole

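So the last 28 tracts have NaN values. I wonder if that's because transit ridership in these areas is so small. I thought they would show up as negative values, so I'm a little confused about what's happening here. One thing to note: `sort_values` puts NaN rows last by default, so `tail(28)` returns the missing rows rather than the largest differences.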
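One possible explanation for getting 28 NaN rows out of only 5 empty tracts: if an empty tract is one of a tract's nearest neighbors, that tract's lag (an average that includes a NaN) becomes NaN too. The sketch below shows the effect with made-up rates and hypothetical neighbor lists. It is an assumption about how the lag handles missing values, so it is worth checking against the real data.

```python
import numpy as np

# Hypothetical toy data: six tracts, the last one has no workers (NaN rate)
rate = np.array([100.0, 120.0, 80.0, 90.0, 110.0, np.nan])

# Hypothetical neighbor lists (k=2): tracts 3 and 4 count the empty tract 5 as a neighbor
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# Spatial lag as the plain average of each tract's neighbors
lag = np.array([rate[nbrs].mean() for nbrs in neighbors.values()])
diff = rate - lag

# One empty tract becomes three NaN differences:
# itself plus the two tracts that have it as a neighbor
print(int(np.isnan(rate).sum()), 'NaN rate(s)')
print(int(np.isnan(diff).sum()), 'NaN difference(s)')
```

If this is what is happening, dropping the empty tracts before building the weights should clear up the extra NaN rows.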

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

So, I definitely spend too much time exploring different color options. It's JUST SO FUN.

These images are pretty startling. Definitely not as uniform as I thought it would be. Transit ridership is more differentiated in the right-hand map, which shows each tract's rate minus the average of its neighbors (the spatial lag). Subtracting the lag reduces the numbers significantly. None of which should be surprising given LA's inherent driving nature.

Mapping the Neighborhoods
* Downtown (Central LA)
* Pico-Union (Central LA)
* Panorama City (San Fernando Valley)
* North Hollywood (San Fernando Valley)
* Mid-City (Central LA, Car Dominant)

Downtown

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.query("Neighborhood== 'Downtown'").plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.query("Neighborhood== 'Downtown'").plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

* Numbers significantly reduced
* Other tracts outside this neighborhood are likely influencing the tracts along its boundaries.
* Auto travel in neighboring tracts is likely influencing these numbers
* Only 1 interval is positive
* Ultimately, surprising to see how the transit lag numbers look in this "high transit" area, though since it's LA, maybe not so surprising
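A note on reading "only 1 interval is positive": with `scheme='quantiles'` and `k=5`, the legend intervals come from where the quantile breaks fall, not from how many tracts are positive. Below is a rough sketch with made-up differences, using plain percentiles as the breaks. The exact breaks the map legend reports may differ slightly, so this is only meant to show the idea.

```python
import numpy as np

# Hypothetical differences (tract rate minus neighbors' average) for one neighborhood
diff = np.array([-200.0, -150.0, -90.0, -60.0, -40.0, -25.0, -10.0, 10.0, 20.0, 80.0])
k = 5  # number of classes, matching k=5 in the maps

# Upper bound of each quantile class: the 20th, 40th, ..., 100th percentiles
upper = np.percentile(diff, np.linspace(100 / k, 100, k))
lower = np.concatenate(([diff.min()], upper[:-1]))

# Print each class as "lower to upper"
for lo, hi in zip(lower, upper):
    print(f'{lo:8.1f} to {hi:8.1f}')

# A class is entirely positive only if its lower bound is above zero
n_positive_classes = int((lower > 0).sum())
n_positive_tracts = int((diff > 0).sum())
print(n_positive_classes, 'fully positive class(es),', n_positive_tracts, 'positive tract(s)')
```

In this toy case three tracts are positive but only one interval is fully positive, because one class straddles zero. So the count of positive intervals on a map can understate how many tracts are above their neighbors.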

Pico-Union

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.query("Neighborhood== 'Pico-Union'").plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.query("Neighborhood== 'Pico-Union'").plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

So be warned, I'm imagining all of these maps are likely going to say the same thing, more or less. These other neighborhoods will have significantly less transit compared to Downtown.
* Numbers significantly reduced, more so than Downtown.
* It would be helpful to see what neighborhoods surround Pico-Union.
* Auto travel in neighboring tracts is likely influencing these numbers
* Only 1 interval is positive

Panorama City

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.query("Neighborhood== 'Panorama City'").plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.query("Neighborhood== 'Panorama City'").plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

* Numbers significantly reduced, but more tracts are in the "positive". I should look at their total populations. This is surprising given this neighborhood's location in the San Fernando Valley
* It would be helpful to see what neighborhoods surround Panorama City.
* Auto travel in neighboring tracts is likely influencing these numbers
* Only 1 interval is positive

North Hollywood

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.query("Neighborhood== 'North Hollywood'").plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.query("Neighborhood== 'North Hollywood'").plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

* So these maps are practically identical, with some slight changes
* Again, more tracts in the positive
* Has three interval ranges in the positive. That is unique from what I've seen here. There are some nuances to this, but something I was not expecting.
* Interval ranges are also much more condensed compared to the others

Mid-City

In [None]:
# create the 1x2 subplots
fig, ax = plt.subplots(1, 2, figsize=(20, 15))

# two subplots produces ax[0] (left) and ax[1] (right)

# regular count map on the left
gdf.query("Neighborhood== 'Mid-City'").plot(ax=ax[0], # this assigns the map to the left subplot
         column='transit_per_1000', 
         scheme='quantiles',
         k=5, 
         edgecolor='white', 
         linewidth=0, 
         alpha=0.8, 
         legend=True,)


ax[0].axis("off")
ax[0].set_title("Transit per 1000",fontsize=22)

# spatial lag map on the right
gdf.query("Neighborhood== 'Mid-City'").plot(ax=ax[1],
         column='transit_lag_diff',
         scheme='quantiles',
         k=5, 
         edgecolor='white',
         linewidth=0, 
         alpha=0.8,
         legend=True,)


ax[1].axis("off")
ax[1].set_title('Transit Minus Spatial Lag, Per 1000 People',fontsize=22)

plt.show()

* Again, almost identical.
* Only one interval in the positive, but multiple tracts in this interval.
* Numbers are compressed. Downtown appears to have the biggest range of transit numbers.

# Moran

This part I'm again not confident in. Will follow Yoh's notebook for guidance and see what we can find.

Restarting some steps removing NaN values from the dataset. 

In [None]:
gdf2=gdf

In [None]:
gdf2.sample(5)

In [None]:
gdf2=gdf.drop([1003, 1001, 998, 997, 995])

In [None]:
gdf2.sort_values(by='transit_per_1000').head()

In [None]:
gdf2.sort_values(by='transit_per_1000').tail()

In [None]:
gdf2.reset_index()

In [None]:
gdf2=gdf2.reset_index()

In [None]:
gdf2.sample()

In [None]:
gdf3=gdf2

So we dropped the NaN tracts. I believe that was the crux of our problem, but I'm not entirely sure.
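Dropping rows by hard-coded index labels works, but it breaks if the data is reloaded in a different order. A possible alternative is to drop on the missing value itself. A minimal sketch with a hypothetical `toy` frame standing in for the tract table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tract table; two tracts have no rate
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D', 'E'],
    'transit_per_1000': [150.0, np.nan, 40.0, np.nan, 75.0],
})

# Drop every row whose rate is missing, then renumber the rows 0..n-1
clean = toy.dropna(subset=['transit_per_1000']).reset_index(drop=True)

print(len(toy), '->', len(clean), 'rows')
print(clean)
```

Either way, any spatial weights built on the original table no longer match the trimmed one row for row, so they would need to be rebuilt from the cleaned data.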

In [None]:
gdf3[gdf3['transit_per_1000']>1]

In [None]:
gdf3_trimmed= gdf3[gdf3['transit_per_1000']>1]

In [None]:
gdf3_trimmed.sample()

In [None]:
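# rebuild the weights from gdf2 so they line up with y (wq was built on gdf, before the drop)
wq2 = lps.weights.KNN.from_dataframe(gdf2, k=8)
wq2.transform = 'r'
y = gdf2.transit_per_1000
moran = Moran(y, wq2)
moran.I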

In [None]:
fig, ax = moran_scatterplot(moran, aspect_equal=True)
plt.show()
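For intuition about what `moran.I` is reporting, here is the textbook Moran's I formula worked by hand on four made-up tracts. This is not how esda computes it internally, just the formula written out in NumPy with a hypothetical weights matrix.

```python
import numpy as np

# Hypothetical toy data: four tracts in a row, each neighboring the adjacent tract(s)
y = np.array([10.0, 20.0, 30.0, 40.0])
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])  # row-standardized: every row sums to 1

n = len(y)
z = y - y.mean()   # deviations from the mean
s0 = W.sum()       # sum of all weights (equals n when row-standardized)

# Moran's I: weighted cross-products of neighboring deviations over total variation
morans_i = (n / s0) * (z @ W @ z) / (z @ z)
print(round(morans_i, 3))
```

A positive value means tracts tend to sit next to tracts with similar values (high near high, low near low), which is the clustering we think we see in Central LA. A value near zero would suggest no spatial pattern.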

# Work Division--need to come back to

Both discussed the process and communicated analysis of data.

**Andrew:** Prepped the data, ran some initial tests to practice skills, and relayed info to Ben. Gave feedback.

**Ben:** Using the same data, also ran tests to explore data and practice skills. Provided analysis. Gave feedback. 