# 3. Spatial Joins and Calculations
With the data, spatial data, and mapping skills we've learned so far, we're ready to add more tools to our arsenal! We're going to learn new methods for linking and transforming our data.

- [3.1 Introduction](#section1)
- [3.2 Permit and Parcel Data](#section2)
  - transforming polygons to points
- [3.3 Mapping Permit data with Census Tracts](#section3)
- [3.4 Spatial Join](#section4)
- [3.5 Analyzing the results of our Spatial Joins](#section5)
    - Dissolving
    - Clipping

**INSTRUCTOR NOTES**:
- Datasets used:
    - "../notebook_data/outdata/Permit_HousingApp_Parcel_Merge_Oakland.geojson"
    - "../notebook_data/outdata/Permit_ActivityReport_Parcel_Merge_Oakland.geojson" 
    - "../notebook_data/outdata/tracts_acs_gdf_ac.json"
    - "../notebook_data/notebook_data/census/Places/cb_2018_06_place_500k.zip"
 
 
- Expected time to complete:
    - Lecture + Questions: 1 hour
    - Homework: 45 minutes

### Set-Up

Let's import the packages we need before we get started.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib # base python plotting library
%matplotlib inline  
import matplotlib.pyplot as plt # more plotting stuff 

# We are getting futurewarning errors about the syntax of CRS definitions, ie "init=epsg:4269" vs "epsg:4269"
# so suppress as these are minor
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

<a id="section0"></a>
## 3.0 Warmup Exercise

Looking at the table of contents shown above, what types of questions do you think you will be able to answer by learning the introduced data and new techniques?

In [None]:
# Write your thoughts here

Awesome! We'll also be reviewing attribute joins here, but the goal is to be able to **join data in a different way - by location - and begin to learn how to spatially summarize data**.

<a id="section1"></a>
## 3.1 Introduction

Geospatial data *is* special. Because all geospatial data is referenced to the Earth, geospatial datasets can be joined dynamically by location even if they share no common attributes. This is called a **spatial join**. In contrast, an `attribute join` requires that the join attribute already exist in both datasets. For example, we were able to merge the permit and parcel data because both contain an APN column. 
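
To make the contrast concrete, here is a minimal sketch on made-up toy data (not the workshop datasets). All names, coordinates, and values below are hypothetical. The attribute join relies on a shared `APN` column, while the spatial join matches rows purely by location.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# --- Attribute join: both tables must share a key column (here 'APN') ---
# Hypothetical toy tables, not the real permit or parcel data
permits_df = pd.DataFrame({"APN": ["A1", "B2"], "units_permit": [3, 0]})
parcels_df = pd.DataFrame({"APN": ["A1", "B2"], "zoning": ["R1", "C2"]})
attribute_join = permits_df.merge(parcels_df, on="APN")

# --- Spatial join: no shared column, rows are matched by location ---
points_gdf = gpd.GeoDataFrame(
    {"units_permit": [3, 0]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5)],
    crs="EPSG:4326",
)
tracts_gdf = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T2"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),  # square covering x from 0 to 1
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),  # square covering x from 1 to 2
    ],
    crs="EPSG:4326",
)

# Each point picks up the GEOID of the polygon it falls in
spatial_join = gpd.sjoin(points_gdf, tracts_gdf, how="left")
print(spatial_join[["units_permit", "GEOID"]])
```

Note that both geodataframes are given the same CRS; a spatial join only makes sense when the coordinates of the two layers are comparable.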

In this lesson we will explore `spatial joins` with our permit and census tract data. 

We will also take a look at a few useful and common spatial transformations. First, we will transform the polygon parcel geometries to point geometries. Second, we will use the Geopandas **dissolve** operation to aggregate geometries.

<a id="section2"></a>
## 3.2 Permit and Parcel data

Let's create an initial map of our joined permit and parcel point data, and check how it looks in combination with our ACS and tracts data.

First let's import our `permit_parcel_gdf` and then map it. Remember from `notebook s2_3` that this is the dataset where you joined together the Oakland parcels and permit activity report.

In [None]:
permit_parcel_gdf = gpd.read_file("../notebook_data/outdata/Permit_ActivityReport_Parcel_Merge_Oakland.geojson")

In [None]:
# Check the shape of the gdf
permit_parcel_gdf.shape

In [None]:
# Check columns available
permit_parcel_gdf.columns

In [None]:
# Double check and make sure no APN values are empty
permit_parcel_gdf['APN'].isnull().value_counts()

Let's try mapping the permit data by `units_permit` which is the sum of building permits issued for very-low, low, moderate, and above moderate income units.

In [None]:
fig, ax = plt.subplots(figsize = (12,8)) 
permit_parcel_gdf.plot(column='units_permit',legend=True, cmap="winter",ax=ax)
plt.tight_layout()
plt.show()

Let's zoom in to get a better look at the data. We can do this by setting the x and y limits of the plot.

In [None]:
fig, ax = plt.subplots(figsize = (10,8)) 
permit_parcel_gdf.plot(column='units_permit',cmap="winter",ax=ax)

# Set x and y limits to zoom in
# We can get the values from the coordinates on the x and y axis of the previous map.
ax.set_xlim([-122.275,-122.250])
ax.set_ylim([37.80,37.82])

plt.tight_layout()
plt.show()

The map above shows the permit locations as parcel polygons. Since we are unlikely to zoom in that closely on the permit locations, let's `transform` the parcel polygons to points. Relative to the area we are mapping, the polygons are tiny and difficult to visualize, and they take up more memory and disk space, which makes them slower to work with. Point geometries will work at any scale for these data. 

To do this we create a point dataset similar to what we did for some of our tracts and ACS data.

Let's start by reminding ourselves of how to find the centroid of a polygon shape. We can do this by finding x and y coordinates of the polygon centroid separately...
> A `centroid` is the geometric center (center of mass) of a polygon feature. Given an irregularly shaped polygon, e.g. Florida, the centroid may not be within the polygon. If you want a point that is sure to be within the polygon you can use the `gdf.representative_point()` method.

In [None]:
# Get the X coordinates of the parcel centroids (a pandas Series), showing first 5 values
permit_parcel_gdf.centroid.x[0:5]

In [None]:
# Get the Y coordinates of the parcel centroids (a pandas Series), showing first 5 values
permit_parcel_gdf.centroid.y[0:5]

Or, you can find both with one line of code.

In [None]:
# Get the centroids of the permit parcel polygon data
permit_parcel_gdf.centroid
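
The earlier note about centroids falling outside irregular polygons can be checked with a small sketch. The C-shaped polygon below is made up for illustration and is not part of the workshop data.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical C-shaped polygon: a 3x3 square with a notch cut into its right side
c_shape = Polygon(
    [(0, 0), (3, 0), (3, 1), (1, 1), (1, 2), (3, 2), (3, 3), (0, 3)]
)
shapes = gpd.GeoSeries([c_shape])

centroids = shapes.centroid                  # geometric center of each polygon
rep_points = shapes.representative_point()   # a point guaranteed to lie within each polygon

# Row-by-row check of whether each point lies inside its own polygon
centroid_inside = shapes.contains(centroids)
rep_point_inside = shapes.contains(rep_points)

print(centroid_inside)
print(rep_point_inside)
```

For compact shapes like most parcels the two methods give similar results, so the centroid is a reasonable choice here.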

We can use the centroid to create a new point data set called `permit_parcel_point_gdf`.
To do this we use the geopandas `GeoDataFrame` constructor, which returns a new geodataframe by combining a pandas dataframe and a geometry column. 
- the pandas dataframe is the `permit_parcel_gdf` with the geometry column dropped
- the geometry column is the centroid of the `permit_parcel_gdf` polygons

In [None]:
permit_parcel_point_gdf = gpd.GeoDataFrame(permit_parcel_gdf.drop('geometry',axis=1), 
                            geometry=permit_parcel_gdf.centroid)

In [None]:
permit_parcel_point_gdf.head()

Let's map those points now to see how they compare to the polygon data.

In [None]:
fig, ax = plt.subplots(figsize = (10,8)) 
permit_parcel_gdf.plot(color="white", edgecolor="black",ax=ax)
permit_parcel_point_gdf.plot(column='units_permit',cmap="winter",ax=ax)

# Set x and y limits to zoom in
# We can get the values from the coordinates on the x and y axis of the previous map.
ax.set_xlim([-122.275,-122.250])
ax.set_ylim([37.80,37.82])

plt.tight_layout()
plt.show()

<a id="section3"></a>
## 3.3 Mapping Permit data with Census Tracts

The first method of spatial analysis is mapping. We overlay our data in a map to visualize and explore locations and relationships. 

In this section we will consider the permit data in the context of our ACS 2018 5-year data, aggregated to census tracts. So let's load that data for Alameda County and create an overlay map.

In [None]:
tracts_acs_gdf = gpd.read_file("../notebook_data/outdata/tracts_acs_gdf_ac.json")

Now let's map the permit points overlaid on top of the census tracts, visualizing the points by the sum of units that were permitted for very-low, low, moderate, and above moderate income (`units_permit`).

In [None]:
# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 
tracts_acs_gdf.plot(ax=ax,color='lightgrey',edgecolor='white')  # Add the census tracts
permit_parcel_point_gdf.plot(column='units_permit', legend=True, cmap="winter",ax=ax)  # Add the permit points

# Set x and y limits to Zoom map into Oakland
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()

In the map above the variation in `units_permit` is hard to see because so many of the permit applications have zero units that have been issued a building permit so far.

We can make the data convey the variable of interest more clearly by setting the color of the locations with no approved building permits to grey.  Take a look at how this is done.

In [None]:
my_cmap = plt.get_cmap('winter') # matplotlib.cm.get_cmap was removed in newer matplotlib releases
my_cmap.set_under('grey') # Set color to be used for low out-of-range values.

# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 

# Add the census tracts
tracts_acs_gdf.plot(ax=ax,color='lightgrey',edgecolor='white')  

# Add the permit data
permit_parcel_point_gdf.sort_values(by="units_permit").plot(ax=ax, 
                                                            column='units_permit', 
                                                            legend=True, 
                                                            cmap=my_cmap,  # Use the colormap with the grey "under" color
                                                            alpha=0.75,
                                                            markersize=50,
                                                            vmin=1)  # Values below vmin get the "under" color

# Set x and y limits to Zoom map into Oakland
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()
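
To see what `set_under` and `vmin` do together, here is a minimal sketch that looks up colors directly instead of drawing a map. The `vmax=10` value is an arbitrary assumption for illustration; in the map above the upper limit comes from the data.

```python
import copy

import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, to_rgba

# Copy the colormap so the globally registered 'winter' colormap is not modified
demo_cmap = copy.copy(plt.get_cmap("winter"))
demo_cmap.set_under("grey")  # color for values below the lower limit

# vmin=1 means anything below 1 is "low out-of-range"; vmax=10 is an assumed upper limit
norm = Normalize(vmin=1, vmax=10)

color_for_zero = demo_cmap(norm(0))  # below vmin, so it gets the "under" color
color_for_five = demo_cmap(norm(5))  # in range, so it gets a color from 'winter'

print(color_for_zero)
print(to_rgba("grey"))
print(color_for_five)
```

The same mechanism applies when a colormap object is passed to the `cmap` argument of `.plot()` along with `vmin`.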

 
<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">
 
 

#### Question
</div>

Why do you think we added the `sort_values(by="units_permit")` to the plot command above?

In [None]:
# Your thoughts here

In the map above we can better see areas where a lot of units have received building permits. But the map is still too busy. There are too many overlapping permit points and the cumulative number of permitted units is not obvious.

A common workflow is to aggregate point data to an area unit and then create a choropleth map. For example, if we had a geodataframe of city neighborhoods we could sum permitted units by neighborhood.

We don't have a neighborhood geodataframe but we do have census tract polygons to which we can aggregate the points and create a choropleth map. Moreover, by aggregating a permit variable by census tract we can also link that variable to the wealth of demographic, social and economic data!
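
As a rough preview of that workflow, here is a minimal sketch on made-up toy data. The tract squares, point locations, and unit counts are all hypothetical. It joins points to polygons with `gpd.sjoin` and then sums a column per polygon with `dissolve`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical "tracts": two adjacent unit squares
toy_tracts_gdf = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T2"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Hypothetical "permit" points: two fall in T1, one falls in T2
toy_permits_gdf = gpd.GeoDataFrame(
    {"units_permit": [3, 5, 2]},
    geometry=[Point(0.2, 0.2), Point(0.8, 0.5), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

# Spatial join: one output row per (tract, point) pair, keeping the tract geometry
toy_joined_gdf = gpd.sjoin(toy_tracts_gdf, toy_permits_gdf, how="inner")

# Dissolve: collapse the rows back to one per tract, summing the permitted units
toy_units_by_tract_gdf = toy_joined_gdf[["GEOID", "geometry", "units_permit"]].dissolve(
    by="GEOID", aggfunc="sum", as_index=False
)
print(toy_units_by_tract_gdf[["GEOID", "units_permit"]])
```

The output has one polygon per tract with the summed value attached, which is the shape of data a choropleth map needs.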

Let's get started.

## 3.5 Recap
In this lesson we introduced some important ways to spatially join and transform geospatial data.

In the process we learned how to use the following:
- Spatial Join: `gpd.sjoin()`
- Dissolve: `gdf.dissolve()`
- Clip: `gpd.clip()`
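
As a quick reminder of what clipping does, here is a minimal sketch on made-up toy data. The "city" square and point locations are hypothetical. `gpd.clip` keeps only the parts of the first layer that fall within the mask layer.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical mask layer: a single unit square standing in for a city boundary
toy_city_gdf = gpd.GeoDataFrame(
    {"name": ["toy city"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)

# Hypothetical points: ids 1 and 3 are inside the square, id 2 is outside
toy_points_gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(2, 2), Point(0.9, 0.1)],
    crs="EPSG:4326",
)

# Keep only the points that fall within the mask
toy_clipped_gdf = gpd.clip(toy_points_gdf, toy_city_gdf)
print(toy_clipped_gdf)
```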


<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Quite_the_happy_dog.jpg/640px-Quite_the_happy_dog.jpg"></img>


---

## 3.6 Homework

#### Exercise 1

Let's pull in the other permit data table we have: `Permit_HousingApp_Parcel_Merge_Oakland.geojson`
Try the following:
1. Import the geojson as `housingapp_parcel_gdf`. Check your geodataframe’s columns.
2. Convert to a point dataset
3. Create a map that colors the points by the values in the column `approved` (number of approved units) 
4. Overlay these points on the Alameda County tracts data
5. Zoom to Oakland


In [None]:
# Your code here

*Click here for answers*

<!--- cut and paste below---
 
# SOLUTION
# 1. Check your geodataframe’s columns
housingapp_parcel_gdf = gpd.read_file("../notebook_data/outdata/Permit_HousingApp_Parcel_Merge_Oakland.geojson")
housingapp_parcel_gdf.columns

housingapp_parcel_gdf[['APN2','proposed','approved']]

# SOLUTION
housingapp_parcel_point_gdf =  gpd.GeoDataFrame(housingapp_parcel_gdf.drop('geometry',axis=1), 
                            geometry=housingapp_parcel_gdf.centroid)


# SOLUTION

my_cmap = matplotlib.cm.get_cmap('summer')
my_cmap.set_under('grey')

# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 

tracts_acs_gdf.plot(ax=ax,color='lightgrey', edgecolor='white')
housingapp_parcel_point_gdf.plot(ax=ax,
                                 column='approved', 
                                 legend=True, 
                                 cmap=my_cmap,
                                 vmin=1
                                 )

# Set x and y limits
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()

--- cut and paste above code --->

#### Exercise 2

In the code cell below:

1. Spatially join the `housingapp_parcel_point_gdf` data to the census ACS data (`tracts_acs_gdf`). Make sure the CRSs match first! Save the output to a geodataframe called `tracts_and_apps_gdf`.
  
  
2. Use the `dissolve` operation on the output of the spatial join (`tracts_and_apps_gdf`) and sum the number of **proposed** permits by census tract. Name the output geodataframe `tract_proposed_counts_gdf`.


3. Create a map of the total number of proposed units per census tract (the `proposed` column) in `tract_proposed_counts_gdf`.


4. Save your work
  1. your sjoin output
  1. your dissolve output
  1. your map

In [None]:
## YOUR CODE HERE
 
# Drop rows with no geometry - don't change this next line - it is needed bc not all rows have geometry
# housingapp_parcel_point_gdf=housingapp_parcel_point_gdf[~housingapp_parcel_point_gdf.geometry.isna()]

# Transform CRSs so they match

# sjoin housing apps to tracts

# Dissolve and sum proposed units

# Map it
 
# Save your work

*Click here for answers*

<!--- Cut and paste below ---

# SOLUTION

# Drop rows with no geometry
housingapp_parcel_point_gdf = housingapp_parcel_point_gdf[~housingapp_parcel_point_gdf.geometry.isna()]
housingapp_parcel_point_gdf.head()

# Transform CRSs so they match
housingapp_parcel_point_gdf = housingapp_parcel_point_gdf.to_crs(tracts_acs_gdf.crs)

# sjoin housing apps to tracts
tracts_and_apps_gdf = gpd.sjoin(tracts_acs_gdf, housingapp_parcel_point_gdf)

# Dissolve and sum proposed units
tract_proposed_counts_gdf=tracts_and_apps_gdf[['GEOID','geometry','proposed']].dissolve(by='GEOID', aggfunc="sum", as_index=False)
#tract_proposed_counts_gdf

# Map it
fig, ax = plt.subplots(figsize = (14,8)) 

# Display the output of our spatial join
tract_proposed_counts_gdf.plot(ax=ax,column='proposed', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, )

plt.show()

--- cut and paste above --->

## Congrats you're done with part 3!
<br>

---
<div style="display:inline-block;vertical-align:middle;">
<a href="https://dataforhousing.org/" target="_blank"><img src ="https://media-exp1.licdn.com/dms/image/C560BAQELkt35AxeIeA/company-logo_200_200/0?e=1597881600&v=beta&t=irZ1tYCA9A2biVzCguvCXzsfzanSYDFuF22IUFNY5Sg" width="75" align="left">
</a>
</div>

<div style="display:inline-block;vertical-align:middle;">
    <div style="font-size:larger">&nbsp;Data Science for Housing Workshop, University of California Berkeley</div>
    <div>&nbsp;Tim Thomas, Patty Frontiera, Emmanuel Lopez, Ethan Ebinger, Hikari Murayama, Karen Chapple, Claudia von Vacano</div>
    <div>&copy; UC Regents, 2019-2020</div>
</div>