## Module 4 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Spatial joins
In the Data Wrangling lecture, we used the [dataset on California traffic collisions](https://tims.berkeley.edu/help/SWITRS.php). Let's revisit that dataset, but make use of the spatial information this time.

Below is the code that we used in lecture to load in the data. It's just one month from Ventura County; if you want more, you'll need to register.

An aside: Note that the paths are a little different because the data files are under `Lectures/data`, not `Classes/data`. The `..` directory means "up one level."

The `os` module has some useful functions for directory and file operations. 

In [None]:
import os

# see what directory we are in
os.getcwd()

In [None]:
# list the current directory contents
os.listdir()

In [None]:
# list the parent directory contents
os.listdir('..')
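
As a small sketch of how these pieces fit together, `os.path.join()` can build the relative path used in the next cell, and `os.path.exists()` can confirm that the file is there before you try to load it. The folder names below simply mirror the path used in this notebook; adjust them if your repository is laid out differently.

```python
import os

# build the relative path: up one level, then into Lectures/data
# (folder names are assumed from the read_csv call in this notebook)
data_path = os.path.join('..', 'Lectures', 'data', 'Collisions.csv')

# resolve '..' against the current working directory to see the full path
full_path = os.path.abspath(data_path)
print(full_path)

# True if the file is where we expect it, False otherwise
print(os.path.exists(data_path))
```

If the last line prints `False`, check the output of `os.getcwd()` and `os.listdir('..')` to see where the notebook is actually running.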

In [None]:
# load in the data
import pandas as pd
collisionDf = pd.read_csv('../Lectures/data/Collisions.csv')

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> What columns provide the spatial coordinates? What problems might there be with each one?
</div>

*Hint*: Look at the [codebook](https://tims.berkeley.edu/help/SWITRS.php) to see the column definitions. You have two choices - there are minor differences.
    
*Hint*: Use `head()` to look at the first rows of the relevant columns. What problems are there with each of them?

In [None]:
# your code here

collisionDf[['LATITUDE','LONGITUDE', 'POINT_X', 'POINT_Y']].head()

You'll notice that there is some missing data. pandas has a helpful function, `fillna()`, that will fill in missing values, for example from another column. Take a look at the documentation.

In [None]:
collisionDf.fillna?

Note that the `value` argument can be a scalar (e.g. you can replace all NaNs with 0), or another column (e.g. you can replace all NaNs in the `LONGITUDE` column with values from `POINT_X`.) [See the example here](https://stackoverflow.com/questions/30357276/how-to-pass-another-entire-column-as-argument-to-pandas-fillna).

Also note that there is an `inplace` keyword argument, which we've seen before with the `set_index()` function. It works the same way.
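
To make the two kinds of `value` argument concrete, here is a minimal sketch on a toy DataFrame. The numbers are made up for illustration; only the column names are borrowed from the collisions data.

```python
import numpy as np
import pandas as pd

# toy data (made-up values) with gaps in LONGITUDE
toy = pd.DataFrame({'LONGITUDE': [-119.2, np.nan, np.nan],
                    'POINT_X':   [-119.2, -119.0, np.nan]})

# scalar: every NaN in the column becomes 0
filled_scalar = toy['LONGITUDE'].fillna(0)

# another column: each NaN takes the value from the same row of POINT_X
filled_from_col = toy['LONGITUDE'].fillna(toy['POINT_X'])

print(filled_scalar.tolist())
print(filled_from_col.tolist())
```

Note that a row where both columns are missing stays `NaN` after filling from the other column, so it is worth checking for leftovers.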

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Fill in the missing values in the latitude and longitude columns.
</div>

In [None]:
# your code here
collisionDf.fillna({'LONGITUDE': collisionDf.POINT_X}, inplace=True)
collisionDf.fillna({'LATITUDE': collisionDf.POINT_Y}, inplace=True)
collisionDf[['LATITUDE','LONGITUDE', 'POINT_X', 'POINT_Y']].head()

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Convert your dataframe to a GeoDataFrame. Call it <strong>collisionGdf</strong>. 
    
Do a quick-and-dirty plot of the points to satisfy yourself that it worked.
</div>

*Hint*: The geopandas `points_from_xy()` function will be helpful.



In [None]:
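# your code here
import geopandas as gpd

# build point geometries from the longitude (x) and latitude (y) columns
# EPSG:4326 is the standard lat/lon coordinate reference system
collisionGdf = gpd.GeoDataFrame(
    collisionDf,
    geometry=gpd.points_from_xy(collisionDf.LONGITUDE, collisionDf.LATITUDE, crs='EPSG:4326'))

# quick-and-dirty plot to check it worked
collisionGdf.plot()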

What do we join the collision data to?

Let's do two separate analyses:
* Look at the transportation justice aspects of road safety, through joining the collision data to the CalEnviroScreen data
* Look at school safety, through joining the collision data to school locations

## Collisions and neighborhood characteristics

Let's start with CalEnviroScreen. We already used this dataset, so let's load it into `geopandas`.

In [None]:
enviroscreen = gpd.read_file('../Lectures/data/CalEnviroScreen/CES4 Final Shapefile.shp')

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Drop all the rows from <strong>enviroscreen</strong> except for those in Ventura County.
</div>

*Hint*: The `df=df[...]` syntax is the easiest way to do this. It will just keep the rows where the condition inside the `[ ]` is `True`.

For example, this will only keep the census tracts with population greater than 5000.

`enviroscreen = enviroscreen[enviroscreen.TotPop19>5000]`

In [None]:
# This returns a boolean Series
enviroscreen.TotPop19>5000

In [None]:
# Then we pass that series to only return values from the DataFrame where the condition evaluated to True
# Note that rows with index 4581, 4583, etc. have been filtered out
enviroscreen[enviroscreen.TotPop19>5000]

In [None]:
# your code here

enviroscreen = enviroscreen[enviroscreen.County=='Ventura']

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Add the number of collisions to each census tract in the EnviroScreen data frame.
</div>

*Hints*:
- Think about projections!
- I suggest a multistep process
  - What census tract is the collision in? Do a spatial join to add the tract (which is in `enviroscreen`) to the collisions dataframe.
  - How many collisions are there in each tract? Use `groupby`! Create a new dataframe with the tract-level counts.
  - Then you can join these counts back to `enviroscreen` using the `Tract` column.
  
  
If you get an error in the final join, `Other Series must have a name`, you can add a name to a pandas `Series` as follows (remember that a Series is like a one-column DataFrame):

    `your_series_name.name = 'n_collisions'`
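
The group-name-join pattern from the hints can be sketched on plain pandas data, leaving the spatial join aside. The tract ids below are made up for illustration.

```python
import pandas as pd

# toy data: one row per collision, labelled with a made-up tract id
collisions = pd.DataFrame({'Tract': [101, 101, 102]})
tracts = pd.DataFrame({'Tract': [101, 102, 103]}).set_index('Tract')

# groupby(...).size() returns a Series of counts, indexed by tract
counts = collisions.groupby('Tract').size()

# give it a name so that join() knows what to call the new column
counts.name = 'n_collisions'
tracts = tracts.join(counts)

# tracts with no collisions get NaN, so replace those with 0
tracts['n_collisions'] = tracts['n_collisions'].fillna(0)
print(tracts)
```

Here tract 103 has no collisions, so it only gets a count after the final `fillna` step.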

In [None]:
# your code here

# get the census tract of each collision through joining
# (reproject the collisions to match the CRS of the EnviroScreen data first)
collisionGdf2 = gpd.sjoin(collisionGdf.to_crs(enviroscreen.crs), enviroscreen)

# check there are the same number of rows
print(len(collisionGdf), len(collisionGdf2))

# looks like we lost a few, which we should investigate later on. 
# Maybe there is missing lat/lon? Or spatial imprecision?

# get the tract-level counts
tractcounts = collisionGdf2.groupby('Tract').size()

# we need to give it a name
tractcounts.name = 'n_collisions'

# join back to enviroscreen
enviroscreen = enviroscreen.set_index('Tract').join(tractcounts)

# replace the missing values with zeros (no collisions)
enviroscreen.fillna({'n_collisions': 0}, inplace=True)

# we should do our standard checks here for number of rows, describing the column, a quick-and-dirty plot

In [None]:
enviroscreen.head()

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Plot the relationship between traffic collisions and the EnviroScreen score, and/or some of the demographic indicators.
</div>

*Hints*:
- The `CIscoreP` column gives the percentile of each census tract. The higher the percentile, the greater the pollution burden and/or vulnerability as measured via demographic characteristics. Disadvantaged communities are defined as those with a percentile of 75 or greater.
- Try boxplots, scatterplots, or `seaborn.regplot` (a scatter plot with the line of best fit).
- You can also map the results.
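
Alongside the plots, a quick numeric comparison can be a useful sanity check. This is a minimal sketch on made-up numbers, using the 75th-percentile threshold from the first hint; it is not meant as the only way to summarize the relationship.

```python
import pandas as pd

# toy data (made-up values): percentile score and collision count per tract
toy = pd.DataFrame({'CIscoreP': [10, 80, 90, 40],
                    'n_collisions': [1, 4, 6, 3]})

# flag tracts at or above the 75th percentile
toy['disadv'] = toy['CIscoreP'] >= 75

# mean number of collisions in each group
mean_by_group = toy.groupby('disadv')['n_collisions'].mean()
print(mean_by_group)
```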

In [None]:
# your code here

# create a binary variable for Disadvantaged Community (in the top quartile)
enviroscreen['disadv'] = enviroscreen.CIscoreP>=75
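# share of tracts that are disadvantaged (print it, as only the last line of a cell is displayed)
print(enviroscreen.disadv.mean())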

# box plot
#enviroscreen.boxplot('n_collisions', by='disadv')

# scatter plot
# need to drop the negative values first
enviroscreen[enviroscreen.CIscoreP>0].plot(y='n_collisions', x='CIscoreP', kind='scatter')

# with regression line
import seaborn as sns
sns.regplot(x='CIscoreP', y='n_collisions', data=enviroscreen[enviroscreen.CIscoreP>0])

## Schools
Now let's do a join to the schools dataset. It's in your GitHub repository, `data/California_Schools_2019-20/`.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Load the schools data into a geodataframe called <strong>schools</strong>. Drop all the schools that are not in Ventura County. (You can use the <strong>CountyName</strong> column.)
</div>

In [None]:
# your code here
schools = gpd.read_file('../Classes/data/California_Schools_2019-20/SchoolSites1920.shp')
# take a copy so that we can safely add columns to the filtered data later
schools = schools[schools.CountyName=='Ventura'].copy()

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> In my version of the data, it looks like there is an errant school in the far north of California that only purports to be in Ventura County. Identify and drop it.
</div>

*Hint*: There are several ways to approach this. My approach would be to:

* Create a new column with the `y` coordinate: `schools['y'] = schools.geometry.y`
* Sort by this column to find the row with the highest value of `y`
* Drop that row (e.g. `schools = schools[schools.y<some_value_of_y]`)


In [None]:
# your code here

# diagnose the problem
schools.plot()

# add a new column
schools['y'] = schools.geometry.y

# look for highest value
print(schools.sort_values(by='y').y.tail())

# drop the outlier
schools = schools[schools.y<4.5e6]

# check it worked
schools.plot()

Now, how do we join the schools to the collision data? Both are point geometries.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Think conceptually about different options to do the join. It can help to do some sketches.</div>

There are several ways to do this, but let's look at the number of collisions within a 1km radius of each school. Then, we can follow a five-step process:
* Make sure we are working in a suitable projection
* Create a 1km buffer around each school
* Do a spatial join between collisions and (buffered) schools, attaching school ids to the collision geodataframe
* Group by the school id to get the counts
* Join back to the school data


*NOTE*: Buffering a geometry isn't usually the most efficient way to get this count, because creating new geometries takes time and memory. Instead, we could get the distances between each school and each collision, and count the number within a given distance (like we did in the video lecture). That's a little more complicated, and for a small dataset the speed penalty of buffering is going to be minimal. But for large datasets, try to avoid creating buffer geometries.
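
The distance-based alternative mentioned in the note can be sketched in plain Python. The coordinates below are made up and assumed to be in a projected CRS measured in meters; this is a brute-force illustration, not necessarily the approach from the video lecture.

```python
import math

# made-up projected coordinates in meters (e.g. from a State Plane CRS)
schools = {'A': (0.0, 0.0), 'B': (5000.0, 0.0)}
collisions = [(300.0, 400.0), (900.0, 0.0), (5000.0, 1200.0)]

radius = 1000  # 1km, in the same units as the coordinates

# for each school, count the collisions whose straight-line distance
# to the school is within the radius
counts = {
    name: sum(math.dist(xy, c) <= radius for c in collisions)
    for name, xy in schools.items()
}
print(counts)
```

This compares every school with every collision and creates no new geometries, but the number of comparisons grows with the product of the two dataset sizes.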

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Find the relevant State Plane coordinate reference system (CRS) for Ventura County (choose the one in meters, not feet). Convert both <strong>schools</strong> and <strong>collisionGdf</strong> to that CRS.</div>

In [None]:
# your code here

# I Googled Ventura State Plane meters and found it was California Zone 5
# Then I used spatialreference.org to get the EPSG code: 3497

collisionGdf.to_crs('EPSG:3497', inplace=True)
schools.to_crs('EPSG:3497', inplace=True)


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Convert the school geometry into a 1km buffer.
</div>

*Hint*: The `buffer()` function will work here. It will create a new geometry, which you can use to overwrite the old one. You can buffer lines and polygons as well as points.

For example: `gdf.geometry = gdf.geometry.buffer(100)`. 
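
To see what `buffer()` does to a single geometry, here is a minimal sketch using `shapely` directly (the library that geopandas uses for its geometries). The point and its units are assumptions for illustration.

```python
import math
from shapely.geometry import Point

# a point in projected coordinates (units assumed to be meters)
pt = Point(0, 0)

# buffer() returns a polygon approximating a circle of the given radius
circle = pt.buffer(1000)

# the point has become a polygon
print(circle.geom_type)

# ratio of the buffer's area to a true circle's area: slightly below 1,
# because the buffer is a many-sided polygon rather than a perfect circle
ratio = circle.area / (math.pi * 1000**2)
print(ratio)
```

The buffer distance is in the units of the coordinates, which is why the data needs to be in a meter-based projection before buffering by 1000.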

In [None]:
# your code here

schools.geometry = schools.geometry.buffer(1000)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Add a column to the schools data with the number of collisions within 1km.
</div>

*Hint*: You should now be able to follow steps 3-5 using the same procedure as with the EnviroScreen data. 

In [None]:
# your code here

# The SCode column looks promising as an id column. Let's check that it is unique
print(schools.SCode.is_unique)

# spatial join
tmpgdf = gpd.sjoin(schools, collisionGdf, predicate='intersects')
print(tmpgdf.head()) # check it worked

# group by
collision_counts = tmpgdf.groupby('SCode').size()
collision_counts.name = 'n_collisions'
print(collision_counts.head()) # check it worked

# join back
schools = schools.set_index('SCode').join(collision_counts)
schools.fillna({'n_collisions':0}, inplace=True)
schools.n_collisions.describe()

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Map the number of collisions near each school.
</div>

*Hints*: 
* There are several ways to do this. You could do proportional markers (you might need to scale the `n_collisions` column). Add a basemap too!
* Note that you can't do proportional circles for the schools polygon geometry - just for points. You'll need to convert the geometry back to the centroids (`gdf.geometry = gdf.geometry.centroid`). (Or you could have saved a copy of the old geodataframe and joined the collision counts back to that.)

In [None]:
# your code here

import matplotlib.pyplot as plt
import contextily as ctx

schools.geometry = schools.geometry.centroid
fig, ax = plt.subplots(figsize=(10,10))

# my first try
# looks like we need to make the markers bigger
#schools.plot(markersize='n_collisions', ax=ax)

schools['n_collisions_scaled'] = schools.n_collisions*10
# plot all the schools first, so we don't ignore those ones with zero collisions
schools.to_crs('EPSG:3857').plot(ax=ax, markersize=0.5, color='k')
# plot the proportional markers
schools.to_crs('EPSG:3857').plot(markersize='n_collisions_scaled', color='b', ax=ax)

ctx.add_basemap(ax=ax, alpha=0.5, zoom=13)
ax.set_title('Number of collisions near each school')
ax.set_xticks([])
ax.set_yticks([])

## Joins gone wrong

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Fix the errors in each cell below. (Not all of them generate Python exceptions, but the join might not be what you expect.)
</div>

In [None]:
# first, reload a clean copy of the data
collisionDf = pd.read_csv('../Lectures/data/Collisions.csv')
enviroscreen = gpd.read_file('../Lectures/data/CalEnviroScreen/CES4 Final Shapefile.shp')

In [None]:
# spatial join between collisions and EnviroScreen
joined = gpd.sjoin(collisionDf, enviroscreen, predicate='intersects')

In [None]:
# get number of collisions in each census tract
collisionGdf = gpd.GeoDataFrame(collisionDf, 
                                geometry=gpd.points_from_xy(collisionDf.LONGITUDE.fillna(collisionDf.POINT_X), 
                                                            collisionDf.LATITUDE.fillna(collisionDf.POINT_Y), 
                                          crs='EPSG:4326'))

joined = gpd.sjoin(collisionGdf, enviroscreen, predicate='contains')
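
When a join misbehaves, a few quick checks can help narrow down the cause. The sketch below uses toy geometries with made-up coordinates rather than the collision data, so treat it as a checklist to adapt, not as the solution to the cells above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# toy layers (made-up coordinates), both declared in the same CRS
points = gpd.GeoDataFrame({'id': [1, 2]},
                          geometry=[Point(0.5, 0.5), Point(5, 5)],
                          crs='EPSG:4326')
polys = gpd.GeoDataFrame({'tract': ['A']},
                         geometry=[box(0, 0, 1, 1)],
                         crs='EPSG:4326')

# check 1: both inputs are GeoDataFrames (a plain DataFrame has no geometry)
print(isinstance(points, gpd.GeoDataFrame), isinstance(polys, gpd.GeoDataFrame))

# check 2: both layers share a CRS
print(points.crs == polys.crs)

# check 3: the predicate is read from the left layer to the right layer
within = gpd.sjoin(points, polys, predicate='within')      # point within polygon
contains = gpd.sjoin(points, polys, predicate='contains')  # point contains polygon
print(len(within), len(contains))
```

A point can sit within a polygon but cannot contain one, so the direction of the predicate matters even when no exception is raised.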

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain more practice with spatial joins.</li>
  <li>Understand how to buffer geometries.</li>
  <li>Get practice with troubleshooting spatial joins.</li>
</ul>
</div>