# 3. Spatial Joins and Calculations
With the data, spatial data, and mapping skills we've learned so far, we're ready to add more data to our arsenal! We're going to learn new methods for linking and transforming our data.

- [3.1 Introduction ](#section1)
- [3.2  Permit and Parcel Data](#section2)
  - transforming polygons to points
- [3.3 Mapping Permit data with Census Tracts](#section3)
- [3.4 Spatial Join](#section4)
- [3.5 Analyzing the results of our Spatial Joins](#section5)
    - Dissolving
    - Clipping

**INSTRUCTOR NOTES**:
- Datasets used:
    - "../notebook_data/outdata/Permit_HousingApp_Parcel_Merge_Oakland.geojson"
    - "../notebook_data/outdata/Permit_ActivityReport_Parcel_Merge_Oakland.geojson" 
    - "../notebook_data/outdata/tracts_acs_gdf_ac.json"
    - "../notebook_data/census/Places/cb_2018_06_place_500k.zip"
 
 
- Expected time to complete:
    - Lecture + Questions: 1 hour
    - Homework: 45 minutes

### Set-Up

Let's import the packages we need before we get started.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd

import matplotlib # base python plotting library
%matplotlib inline  
import matplotlib.pyplot as plt # more plotting stuff 

# We are getting futurewarning errors about the syntax of CRS definitions, ie "init=epsg:4269" vs "epsg:4269"
# so suppress as these are minor
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

<a id="section0"></a>
## 3.0 Warmup Exercise

Looking at the table of contents shown above, what types of questions do you think you will be able to answer by learning the introduced data and new techniques?

In [None]:
# Write your thoughts here

Awesome! We'll also be reviewing attribute joins here, but the goal is to be able to **join data in a different way - by location - and begin to learn how to spatially summarize data**.

<a id="section1"></a>
## 3.1 Introduction

Geospatial data *is* special. Because all geospatial data is referenced to the Earth, geospatial datasets can be joined dynamically by location even if they share no common attributes. This is called a **spatial join**. It contrasts with an `attribute join`, where the join attribute must already exist in both datasets. For example, we were able to merge permit and parcel data because both contain an APN column. 

In this lesson we will explore `spatial joins` with our permit and census tract data. 

We will also take a look at a few useful and common spatial transformations. First, we will transform the polygon parcel geometries to point geometries. Second, we will use the Geopandas **dissolve** operation to aggregate geometries.

<a id="section2"></a>
## 3.2 Permit and Parcel data

Let's create an initial map of our joined permit and parcel point data, and check how it looks in combination with our ACS and tracts data.

First let's import our `permit_parcel_gdf` and then map it. Remember from `notebook s2_3` that this is the dataset where you joined together the Oakland parcels and permit activity report.

In [None]:
permit_parcel_gdf = gpd.read_file("../notebook_data/outdata/Permit_ActivityReport_Parcel_Merge_Oakland.geojson")

In [None]:
# Check the shape of the gdf
permit_parcel_gdf.shape

In [None]:
# Check columns available
permit_parcel_gdf.columns

In [None]:
# Double check and make sure no APN values are empty
permit_parcel_gdf['APN'].isnull().value_counts()

Let's try mapping the permit data by `units_permit` which is the sum of building permits issued for very-low, low, moderate, and above moderate income units.

In [None]:
fig, ax = plt.subplots(figsize = (12,8)) 
permit_parcel_gdf.plot(column='units_permit',legend=True, cmap="winter",ax=ax)
plt.tight_layout()
plt.show()

Let's zoom in to get a better look at the data. We can do this by setting the x and y limits of the plot.

In [None]:
fig, ax = plt.subplots(figsize = (10,8)) 
permit_parcel_gdf.plot(column='units_permit',cmap="winter",ax=ax)

# Set x and y limits to zoom in
# We can get the values from the coordinates on the x and y axis of the previous map.
ax.set_xlim([-122.275,-122.250])
ax.set_ylim([37.80,37.82])  # lower limit first so the map is not flipped vertically

plt.tight_layout()
plt.show()

The map above shows the permit locations as parcel polygons. Since we are unlikely to zoom in that closely on the permit locations, let's `transform` the parcel polygons to points. The polygons are tiny relative to the mapped area, so they are difficult to visualize; they also take up more memory and disk space and are thus slower to work with. Point geometry will work at any scale for these data. 

To do this we create a point dataset similar to what we did for some of our tracts and ACS data.

Let's start by reminding ourselves of how to find the centroid of a polygon shape. We can do this by finding x and y coordinates of the polygon centroid separately...
> A `centroid` is the geometric center (center of mass) of a polygon feature. For an irregularly shaped polygon, e.g. a crescent or U shape, the centroid may not fall within the polygon. If you want a point that is sure to be within the polygon you can use the `gdf.representative_point()` method.
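
To check the note above for yourself, here is a minimal sketch with a made-up U-shaped polygon (the coordinates are illustrative only, not from our data). It uses shapely directly, the geometry library that geopandas relies on for these operations.

```python
from shapely.geometry import Polygon

# A hypothetical U-shaped polygon: a 3x3 square with a notch cut out of the top
u_shape = Polygon([(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)])

# The centroid is the center of mass, which here lands inside the notch
centroid = u_shape.centroid

# The representative point is guaranteed to fall within the polygon
inside_point = u_shape.representative_point()

print("Centroid:", centroid.x, centroid.y)
print("Centroid inside polygon?", u_shape.contains(centroid))
print("Representative point inside polygon?", u_shape.contains(inside_point))
```

For compact shapes like most parcels the two points will usually both be inside the polygon, so the centroid is a reasonable choice for our permit data.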

In [None]:
# Get X coordinate of the parcel centroids as a Series, showing first 5 values
permit_parcel_gdf.centroid.x[0:5]

In [None]:
# Get Y coordinate of the parcel centroids as a Series, showing first 5 values
permit_parcel_gdf.centroid.y[0:5]

Or, you can find both with one line of code.

In [None]:
# Get the centroids of the permit parcel polygon data
permit_parcel_gdf.centroid

We can use the centroid to create a new point data set called `permit_parcel_point_gdf`.
To do this we use the geopandas function `GeoDataFrame`, which returns a new geodataframe by combining a pandas dataframe and a geometry column. 
- the pandas dataframe is the `permit_parcel_gdf` with the geometry column dropped
- the geometry column is the centroid of the `permit_parcel_gdf` polygons

In [None]:
permit_parcel_point_gdf = gpd.GeoDataFrame(permit_parcel_gdf.drop('geometry',axis=1), 
                            geometry=permit_parcel_gdf.centroid)

In [None]:
permit_parcel_point_gdf.head()

Let's map those points now to see how they compare to the polygon data.

In [None]:
fig, ax = plt.subplots(figsize = (10,8)) 
permit_parcel_gdf.plot(color="white", edgecolor="black",ax=ax)
permit_parcel_point_gdf.plot(column='units_permit',cmap="winter",ax=ax)

# Set x and y limits to zoom in
# We can get the values from the coordinates on the x and y axis of the previous map.
ax.set_xlim([-122.275,-122.250])
ax.set_ylim([37.80,37.82])  # lower limit first so the map is not flipped vertically

plt.tight_layout()
plt.show()

<a id="section3"></a>
## 3.3 Mapping Permit data with Census Tracts

The first method of spatial analysis is mapping. We overlay our data in a map to visualize and explore locations and relationships. 

In this section we will consider the permit data in the context of our ACS 2018 5-year data, aggregated to census tracts. So let's load that data for Alameda County and create an overlay map.

In [None]:
tracts_acs_gdf = gpd.read_file("../notebook_data/outdata/tracts_acs_gdf_ac.json")

Now let's map the permit points overlaid on top of the census tracts, visualizing the points by the sum of units that were permitted for very-low, low, moderate, and above moderate income (`units_permit`).

In [None]:
# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 
tracts_acs_gdf.plot(ax=ax,color='lightgrey',edgecolor='white')  # Add the census tracts
permit_parcel_point_gdf.plot(column='units_permit', legend=True, cmap="winter",ax=ax)  # Add the permit points

# Set x and y limits to Zoom map into Oakland
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()

In the above map the number of `units_permit` is not obvious because so many of the permit records have zero units with an issued building permit.

We can make the data convey the variable of interest more clearly by setting the color of the locations with no approved building permits to grey.  Take a look at how this is done.
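
Before looking at the map code, here is a small sketch of the underlying matplotlib mechanism, independent of our data. The `vmin=1`, `vmax=10` range and the variable names are assumptions chosen for illustration: any value below `vmin` is "out of range" on the low side, and `set_under` decides which color such values get.

```python
import copy

import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, to_rgba

# Work on a copy so the shared 'winter' colormap is left untouched
cmap = copy.copy(plt.get_cmap("winter"))
cmap.set_under("grey")  # color for values below the bottom of the range

# Hypothetical data range: values from 1 to 10 use the regular colormap
norm = Normalize(vmin=1, vmax=10)

color_for_zero = cmap(float(norm(0)))  # below vmin, so it gets the "under" color
color_for_five = cmap(float(norm(5)))  # inside the range, so it gets a 'winter' color

print("Color for 0 units:", color_for_zero)
print("Color for 5 units:", color_for_five)
print("Grey as RGBA:     ", to_rgba("grey"))
```

When geopandas plots a column with `vmin=1` and a colormap prepared this way, the zero-unit points fall into that "under" bucket.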

In [None]:
my_cmap = plt.get_cmap('winter') # plt.get_cmap is available across matplotlib versions
my_cmap.set_under('grey') # Set color to be used for low out-of-range values.

# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 

# Add the census tracts
tracts_acs_gdf.plot(ax=ax,color='lightgrey',edgecolor='white')  

# Add the permit data
permit_parcel_point_gdf.sort_values(by="units_permit").plot(ax=ax, 
                                                            column='units_permit', 
                                                            legend=True, 
                                                            cmap=my_cmap,  # use the colormap with the grey "under" color
                                                            alpha=0.75,
                                                            markersize=50,
                                                            vmin=1)  # Values below vmin (zero units) get the "under" color

# Set x and y limits to Zoom map into Oakland
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()

 
<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">
 
 

#### Question
</div>

Why do you think we added the `sort_values(by="units_permit")` to the plot command above?

In [None]:
# Your thoughts here

In the map above we can better see areas where a lot of units have received building permits. But the map is still too busy. There are too many overlapping permit points and the cumulative number of permitted units is not obvious.

A common workflow is to aggregate point data to an area unit and then create a choropleth map. For example, if we had a geodataframe of city neighborhoods we could sum permitted units by neighborhood.

We don't have a neighborhood geodataframe but we do have census tract polygons to which we can aggregate the points and create a choropleth map. Moreover, by aggregating a permit variable by census tract we can also link that variable to the wealth of demographic, social and economic data!

Let's get started.

<a id="section4"></a>
## 3.4 Spatial join

Previously, we joined permits and parcels based on a shared attribute - APN - which is included in both datasets. This is called an `attribute join`.

- *What other datasets did we join by attribute?*

But what if our datasets do not have a shared attribute?

If two geodataframes have the same CRS we can do a **spatial join** to join attributes by shared location. 

We do this with the Geopandas [**sjoin**](https://geopandas.org/reference/geopandas.sjoin.html) function.

Before we begin make sure your two input geodataframes have the same CRS.

In [None]:
permit_parcel_point_gdf.crs == tracts_acs_gdf.crs

If they do not have the same CRS, transform one gdf to match the other.

In [None]:
print(permit_parcel_point_gdf.crs)
print(tracts_acs_gdf.crs)

In [None]:
# Transform the permit points to match the CRS of the Census data (4269)
permit_parcel_point_gdf.to_crs(tracts_acs_gdf.crs, inplace=True)
permit_parcel_point_gdf.crs == tracts_acs_gdf.crs

In [None]:
# And for good measure...
print(permit_parcel_point_gdf.crs)
print(tracts_acs_gdf.crs)

Next, read the help documentation for `sjoin` and let's take a few minutes to discuss it.

In [None]:
# Take a look at the function documentation
help(gpd.sjoin)

Now, let's spatially join the permit data to the census data so that we can sum the number of permitted units in each census tract.  

In [None]:
# Spatially join permit data to census tract data
tracts_and_permits_gdf = gpd.sjoin(tracts_acs_gdf, permit_parcel_point_gdf)

Now let's take a look at the output of the spatial join.

In [None]:
tracts_and_permits_gdf.head(2)

By default `sjoin` is an inner join. It keeps the data from both geodataframes only where the locations spatially intersect. For our join, this means we maintain only those Alameda County census tracts that contain Oakland permit points.

By default `sjoin` maintains the geometry of the first geodataframe input to the operation, here census tracts. 

The output of our `sjoin` operation is the geodataframe `tracts_and_permits_gdf` which has 
- a row for each permit application that is located within a census tract (all of which are)
- the **polygon geometry** of the census tract in which the permit is located
- all of the attribute data columns (non-geometry columns) from both input geodataframes.
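
The behavior described above can be sketched on a tiny made-up dataset. The three square "tracts" and three "permit" points below are hypothetical stand-ins for our data, with no CRS set, so that the row counts are easy to check by hand.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three hypothetical square tracts laid side by side
tracts = gpd.GeoDataFrame(
    {"GEOID": ["A", "B", "C"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Three hypothetical permits: two in tract A, one in tract B, none in tract C
permits = gpd.GeoDataFrame(
    {"APN": ["p1", "p2", "p3"], "units_permit": [2, 5, 0]},
    geometry=[Point(0.2, 0.5), Point(0.7, 0.5), Point(1.5, 0.5)],
)

# Default (inner) join: one row per tract/permit match, tract C is dropped
inner = gpd.sjoin(tracts, permits)

# Left join: every tract is kept, tract C gets missing permit values
left = gpd.sjoin(tracts, permits, how="left")

print(inner[["GEOID", "APN", "units_permit"]])
print("Inner rows:", len(inner), "| Left rows:", len(left))
print("Output geometry types:", inner.geometry.geom_type.unique())
```

Tract A appears twice in the inner join output, once per permit, and both rows carry the same polygon geometry because the tracts were listed first.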


To confirm this, let's map the output geodataframe.

In [None]:
tracts_and_permits_gdf.plot()

We are missing some census tracts because not all census tracts contain permit activity **AND** we did an `inner join`
- Where was the inner join specified?

Take a look at the input and output geometry types:

In [None]:
print("Permits input geometry:", permit_parcel_point_gdf.geometry.type.unique())
print("Tracts input geometry:" , tracts_acs_gdf.geometry.type.unique())
print("Tracts and permits join output geometry:", tracts_and_permits_gdf.geometry.type.unique())

We joined point data to polygon/multipolygon data and our join output has polygon geometry. Why? Because of the order of the inputs to `sjoin`: the polygon geodataframe (census tracts) was listed first.

Now, check out the shape of the input geodataframes and output geodataframes.

In [None]:
print("Permits:", permit_parcel_point_gdf.shape)
print("Tracts:" , tracts_acs_gdf.shape)
print("Tracts and permits:", tracts_and_permits_gdf.shape)

Our output geodataframe has the same number of rows as the permits geodataframe. This is because all permits fall within an Oakland census tract and thus are included in the join output.

However, the output geodataframe  has duplicate census tract data because there are tracts that contain more than one permit application. 

For example, let's look at the rows for one census tract:

In [None]:
tracts_and_permits_gdf[tracts_and_permits_gdf['GEOID'] == '06001400600'][['GEOID','APN','units_permit']]

The rows above list the permitted units for each permit record in this one census tract, so the tract data are repeated on every row.

So, our `sjoin` output is not map ready. We first need to aggregate the number of permitted units by census tract, grouping the data by `GEOID` which is a unique identifier.

Ok, let's sum `units_permit` in each census tract.  We can do this using a pandas `groupby` operation.

In [None]:
tract_permit_counts_df = tracts_and_permits_gdf[['GEOID','units_permit']].groupby('GEOID', as_index=False).sum()
print("Rows and columns:", tract_permit_counts_df.shape)

# take a look at the data
tract_permit_counts_df.head(7)

Now we can verify that the sum of permitted units for census tract `06001400600` (at row index 5) is seven.

The above `groupby` and sum operations give us the counts that we are looking for:
- We have identified the 105 census tracts that contain permit application locations.
- We have the number of `units_permit` within those census tracts. 

But the output of `groupby` is a dataframe not a geodataframe.

If we want to output a spatial geodataframe we can do one of two things:
1. join the `groupby` output to the tracts_acs_gdf by the attribute `GEOID`
or
2. use the geodataframe [**dissolve**](https://geopandas.org/aggregation_with_dissolve.html) method, which you can think of as a spatial `groupby`. 
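
For reference, option 1 might look something like the sketch below. The toy tracts and the small table standing in for the `sjoin` output are made up for illustration, and `how="left"` with `fillna(0)` is one possible choice, not something the lesson requires.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical tract polygons, standing in for tracts_acs_gdf
tracts = gpd.GeoDataFrame(
    {"GEOID": ["A", "B", "C"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Hypothetical attribute columns of an sjoin output: one row per permit
joined_rows = pd.DataFrame(
    {"GEOID": ["A", "A", "B"], "units_permit": [2, 5, 0]}
)

# Step 1: sum permitted units per tract (plain dataframe, no geometry)
counts = joined_rows.groupby("GEOID", as_index=False).sum()

# Step 2: attribute join back to the tract polygons on GEOID
tract_counts = tracts.merge(counts, on="GEOID", how="left")

# Tracts with no permits have missing values after a left join; treat them as zero
tract_counts["units_permit"] = tract_counts["units_permit"].fillna(0)

print(type(tract_counts).__name__)
print(tract_counts[["GEOID", "units_permit"]])
```

Because the merge starts from the tract geodataframe, the result is still a geodataframe, and the left join keeps tracts that contain no permits.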

Since we already know how to do an attribute join, let's do the `dissolve`.

In [None]:
tract_permit_counts_gdf=tracts_and_permits_gdf[['GEOID','geometry','units_permit']].dissolve(by='GEOID', aggfunc="sum", as_index=False)
print("Rows and columns: ", tract_permit_counts_gdf.shape)

# take a look
tract_permit_counts_gdf.head(7)

Let's break that down.

- The `dissolve` operation requires a geometry column and a grouping column, which above is GEOID. All geometries within the **same group** are merged into a single geometry, so duplicate or nested geometries collapse into one shape. 
 
- The `aggfunc`, or aggregation function, of the dissolve operation will be applied to all numeric columns in the input geodataframe (unless the function is `count` in which case it will count rows.)  
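
Here is the same idea on a toy geodataframe, as a minimal sketch. The two rows for tract "A" repeat the same made-up square, mimicking the duplicate tract geometries in our `sjoin` output.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical sjoin-style output: tract A appears twice with identical geometry
joined = gpd.GeoDataFrame(
    {"GEOID": ["A", "A", "B"], "units_permit": [2, 5, 0]},
    geometry=[box(0, 0, 1, 1), box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Group by GEOID: geometries in each group are merged, numeric columns are summed
dissolved = joined.dissolve(by="GEOID", aggfunc="sum", as_index=False)

print(dissolved[["GEOID", "units_permit"]])
print("Rows before:", len(joined), "| Rows after:", len(dissolved))
```

The two identical squares for tract "A" collapse into a single square, and its `units_permit` value is the sum of the two original rows.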

Check out the Geopandas documentation on [dissolve](https://geopandas.org/aggregation_with_dissolve.html?highlight=dissolve) for more information.

Above we selected three columns from the input geodataframe to create a subset as input to the dissolve operation. Can you think of why we did that?

### Mapping our Spatial Join Output

Because our `sjoin` plus `dissolve` operation outputs a geodataframe, we can map the count of `units_permit` by census tract.

In [None]:
fig, ax = plt.subplots(figsize = (14,8)) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, 
                             legend_kwds={'title':'Permitted units'})

plt.show()


<div style="display:inline-block;vertical-align:top;">
    <img src="https://image.flaticon.com/icons/svg/87/87705.svg" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Questions
</div>

- How does the above map convey the distribution of `units_permit` in Oakland differently from the point map of `units_permit` we created previously? 
- What does the above map tell you about the spatial distribution of permits in Oakland?
- Does the output geodataframe include all census tracts in Oakland?

- Why does the output include census tracts with zero permitted units?

- What addition(s) could improve the above map?

<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/f/ff/Cat_on_laptop_-_Just_Browsing.jpg"></img>


#### Exercise

1. Use `dissolve` with the `tracts_and_permits_gdf` geodataframe to **count** the number of permit applications by census tract.
  - Hint, group by the column that is unique to permit apps (APN). This will also be the name of the output column


2. Use `dissolve` with the `tracts_and_permits_gdf` geodataframe to calculate the **mean** of median household income (`med_hhinc`) by census tract, for the tracts that contain permit applications.


3. Make choropleth maps of the two variables `applications by tract` and `median household income`.

In [None]:
# Your code here

# 1. Count of Permit Applications by census tract
app_count_gdf=...
#app_count_gdf.head()

In [None]:
# 2. Mean median household income in tracts with more than one permit application by census tract
mean_hhinc_gdf=...
#mean_hhinc_gdf.head()

In [None]:
# 3. Maps (uncomment and complete)
#app_count_gdf.plot(...)
#mean_hhinc_gdf.plot(...)

*Click here for answers*

<!---
# SOLUTION 1
# Count of Permit Applications by census tract
app_count_gdf=tracts_and_permits_gdf[['GEOID','geometry','APN']].dissolve(by='GEOID', aggfunc="count", as_index=False)
app_count_gdf.head()

# SOLUTION 2
# Mean median household income in tracts with more than one permit application by census tract
mean_hhinc_gdf=tracts_and_permits_gdf[['GEOID','geometry','med_hhinc']].dissolve(by='GEOID', aggfunc="mean", as_index=False)
mean_hhinc_gdf.head()

# SOLUTION 3
app_count_gdf.plot(column='APN', legend=True, legend_kwds={'label': 'Count of apps'});
mean_hhinc_gdf.plot(column='med_hhinc', legend=True, legend_kwds={'label': 'Avg med hhinc'});
--->


### `sjoin` recap

Thus far we have spatially joined the permit data to the census tract data. The result is a polygon geodataframe that includes:
- a row for each permit application that spatially intersects with our census tract gdf (all permits do),
- with the geometry of the census tracts in which they reside

This means :
- an `sjoin` can output duplicate geometries,
- we can `dissolve` duplicate geometries and
- summarize the permit data by the census geometries

### Order matters!

By default, `sjoin` output will have the geometry of the first input geodataframe.

Thus, if we change the order of our sjoin we output point geometries instead of polygon geometries.

For example, let's see what we get if we spatially join tracts to permits.

In [None]:
# Spatially join census tract data to permit data
permits_and_tracts_gdf = gpd.sjoin(permit_parcel_point_gdf,tracts_acs_gdf)

The shape of our output geodataframe is the same.

In [None]:
print("Permits:", permit_parcel_point_gdf.shape)
print("Tracts:" , tracts_acs_gdf.shape)
print("Tracts and permits:", tracts_and_permits_gdf.shape)
print("Permits and tracts:", permits_and_tracts_gdf.shape)

But, when we plot it we now see points.

In [None]:
permits_and_tracts_gdf.plot()

If we look at a few rows of the geodataframe we see that it has columns from both input geodataframes

In [None]:
permits_and_tracts_gdf.head()


With this spatial join we have augmented our permit data with census ACS data. 
- We have a row for each permit activity report and we have point geometry to map the permit locations. 
- And we have the ACS data associated with each permit location. 

We could use this output as the input to further analysis if we were interested in exploring the relationship between the permit data and one or more ACS variables.

As we did in the prior `sjoin` example, we could also use this data to aggregate the permit data by census tract, but we would not have a direct way to map the aggregated data since our output has point and not census tract geometry. So if the goal is to summarize permit data by census tract, then an `sjoin` that outputs census tract geometries is a better way to go about it.


### Dissolve revisited

We just saw how `dissolve` is used to remove duplicate geometries. It is also commonly used to merge adjacent geometries. Let's check this out.

First let's take a look at the Alameda County census tracts.

In [None]:
tracts_acs_gdf.plot()

If we take a look at the data we see that all rows have the same value for `COUNTYFP` or the county FIPS code, which is `001`.

In [None]:
tracts_acs_gdf.head(2)

Let's dissolve the Alameda County census tracts by COUNTYFP.

In [None]:
tracts_acs_gdf_dissolved = tracts_acs_gdf.dissolve(by='COUNTYFP')
tracts_acs_gdf_dissolved

In [None]:
tracts_acs_gdf_dissolved.plot()

Because the rows for each census tract share the same county FIPS code and the census tract geometries are all nested within the county boundary, all interior polygons were dissolved leaving one boundary polygon.  Pretty cool, huh!  The `dissolve` operation is commonly used for these kinds of geometric aggregation tasks because geographic data often has more detail than you need.
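
The same merging of adjacent geometries can be seen on a toy example. This is a minimal sketch with three made-up rectangles that share edges and a hypothetical `population` column; none of it comes from the ACS data.

```python
import geopandas as gpd
from shapely.geometry import box

# Three hypothetical adjacent "tracts" that together tile a 2x2 square
toy_tracts = gpd.GeoDataFrame(
    {"COUNTYFP": ["001", "001", "001"], "population": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 2, 2)],
)

# All rows share one COUNTYFP, so they dissolve into a single outline
toy_county = toy_tracts.dissolve(by="COUNTYFP", aggfunc="sum")

print("Rows after dissolve:", len(toy_county))
print("Geometry type:", toy_county.geometry.iloc[0].geom_type)
print("Area:", toy_county.geometry.iloc[0].area)
print("Total population:", toy_county.loc["001", "population"])
```

The shared interior edges disappear, leaving one polygon whose area equals the sum of the pieces, while `aggfunc="sum"` adds up the numeric column.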

## Improving our Spatial Join output Maps

In some of the maps we made above we have had to address the issue that the census tract data are for all of Alameda County while the permit data are for the City of Oakland.  We have worked around this by **zooming** into Oakland in our maps. However, the data for locations outside of Oakland are still displayed.

Another way to address this is by reading in a boundary file for the city of Oakland and then mapping our data on top of that.

### City of Oakland data

To do this we will load the boundary file for all census places in California.

In [None]:
places_gdf =  gpd.read_file("zip://../notebook_data/census/Places/cb_2018_06_place_500k.zip")
places_gdf.head(3)

Subset the data to Oakland...

In [None]:
oakland_gdf = places_gdf.loc[places_gdf['NAME']=='Oakland'].copy().reset_index(drop=True) #subset


And plot the data

In [None]:
oakland_gdf.plot();

Now we can recreate our map of tracts with permits and display these on top of the city boundary. This will remove any gaps in the city where we do not have census tracts that contain permit locations.

In [None]:
fig, ax = plt.subplots(figsize = (14,8)) 

# add city boundary
oakland_gdf.plot(ax=ax, color="grey", alpha=0.6) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, )

ax.set_title("Count of Permitted Units in Oakland by Census Tract")
ax.set_axis_off() 
plt.show()

Now that we have our permit data aggregated to census tracts, let's see how we can explore the relationship between the ACS data and the permit data.

For example, let's see if there is any spatial relationship between the count of permitted units and the percent homeowners (`p_owners`) in the census tract.

First, let's create a point dataset of our census tracts.

In [None]:
tracts_acs_gdf_point = gpd.GeoDataFrame(tracts_acs_gdf.loc[:,tracts_acs_gdf.columns!='geometry'], 
                            geometry=tracts_acs_gdf.centroid)

Now map the census tract points on top of our tract polygons symbolized by our variables of interest.

In [None]:
fig, ax = plt.subplots(figsize = (14,8)) 

# add city boundary
oakland_gdf.plot(ax=ax, color="grey", alpha=0.6) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, )

# Display percent home owners
tracts_acs_gdf_point.plot(ax=ax,column='p_owners', 
                             cmap="hot",
                             edgecolor="grey",
                             markersize=60,
                             legend=True, )

ax.set_title("Count of Permitted Units in Oakland by Census Tract")
ax.set_axis_off() 
plt.show()

Well that's not as good as it could be!

The census tract points are for the entire county but our tract polygons, output from `sjoin`, are only in Oakland.

Let's **clip** the census tract points to the boundary of Oakland.

### Clipping GeoDataFrames

Clipping involves cutting out the features (or rows) in one geospatial dataset that spatially intersect the features of a polygon geospatial dataset. It is often called a cookie cutter operation. This is useful if we want to limit the data to a certain region. For example, if we want the census tracts for the city of Oakland we can clip the census tracts for the state to the boundary of that city.

First, take a look at the Geopandas `clip` function documentation.
- Clip requires both datasets to be in the same CRS. 
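
As a quick illustration of the cookie cutter idea, here is a sketch with made-up geometries: a unit square plays the role of the city boundary, and a few points and one long rectangle play the role of the data being clipped. No CRS is set on any of them, so they trivially match.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical "city boundary": a unit square
boundary = gpd.GeoDataFrame({"NAME": ["city"]}, geometry=[box(0, 0, 1, 1)])

# Hypothetical points: two inside the boundary, one outside
points = gpd.GeoDataFrame(
    {"name": ["in_1", "in_2", "out"]},
    geometry=[Point(0.2, 0.2), Point(0.8, 0.5), Point(1.5, 0.5)],
)

# Hypothetical polygon that extends past the boundary
strip = gpd.GeoDataFrame({"name": ["strip"]}, geometry=[box(0, 0, 2, 1)])

# Points outside the boundary are dropped
clipped_points = gpd.clip(points, boundary)

# Polygons are cut at the boundary edge, like dough under a cookie cutter
clipped_strip = gpd.clip(strip, boundary)

print("Points kept:", sorted(clipped_points["name"]))
print("Strip area before:", strip.geometry.area.iloc[0])
print("Strip area after: ", clipped_strip.geometry.area.iloc[0])
```

Note the difference from a spatial join: clipping changes which features (and which parts of features) are kept, but it does not attach any attributes from the boundary dataset.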

In [None]:
# Uncomment to read
#help(gpd.clip)

Clip the census tract points to the boundary of Oakland.

In [None]:
tracts_acs_gdf_point_clipped = gpd.clip(tracts_acs_gdf_point, oakland_gdf).reset_index(drop=True)

Now, let's try that map again.

In [None]:
fig, ax = plt.subplots(figsize = (14,8)) 

# add city boundary
oakland_gdf.plot(ax=ax, color="grey", alpha=0.6) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,column='units_permit', 
                             scheme="quantiles", 
                             cmap="Blues",
                             edgecolor="grey",
                             legend=True, 
                             legend_kwds={'title':'Permitted Units'})

# Display percent home owners
tracts_acs_gdf_point_clipped.plot(ax=ax,column='p_owners', 
                             cmap="Reds",
                             edgecolor="grey",
                             markersize=60,
                             legend=True, 
                             legend_kwds={'label': 'Proportion of Home Owners'})

ax.set_title("Count of Permitted Units in Oakland by Census Tract")
ax.set_axis_off() 
plt.show()

Now that's better! This map seems to indicate that a larger number of permitted units can be found in areas with lower rates of home ownership.

> `Clip` is a very common geometric data transformation. Check out the optional `Spatial Interpolation notebook` if you want to learn more.

### Any Questions?

### Save your work!
Save the files we created so we can reuse in subsequent notebooks.

In [None]:
# Permit data joined to census tract ACS data
tracts_and_permits_gdf.to_file("../outdata/tracts_and_permits_gdf.json", driver="GeoJSON")

In [None]:
# Tract ACS data joined to Permit data
permits_and_tracts_gdf.to_file("../outdata/permits_and_tracts_gdf.json",driver="GeoJSON")

In [None]:
# City of Oakland boundary file
oakland_gdf.to_file("../outdata/oakland_gdf.json", driver="GeoJSON")

## 3.5 Recap
In this lesson we introduced some important ways to spatially join and transform geospatial data.

In the process we learned how to use the following:
- Spatial join: `gpd.sjoin()`
- Dissolve: `gdf.dissolve()`
- Clip: `gpd.clip()`


<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Quite_the_happy_dog.jpg/640px-Quite_the_happy_dog.jpg"></img>


---

## 3.6 Homework

#### Exercise 1

Let's pull in the other permit data table we have: `Permit_HousingApp_Parcel_Merge_Oakland.geojson`
Try the following:
1. Import the geojson as `housingapp_parcel_gdf`. Check your geodataframe’s columns.
2. Convert to a point dataset
3. Create a map that colors the points by the values in the column `approved` (number of approved units) 
4. Overlay these points on the Alameda County tracts data
5. Zoom to Oakland


In [None]:
# Your code here

*Click here for answers*

<!--- cut and paste below---
 
# SOLUTION
# 1. Check your geodataframe’s columns
housingapp_parcel_gdf = gpd.read_file("../notebook_data/outdata/Permit_HousingApp_Parcel_Merge_Oakland.geojson")
housingapp_parcel_gdf.columns

housingapp_parcel_gdf[['APN2','proposed','approved']]

# SOLUTION
housingapp_parcel_point_gdf =  gpd.GeoDataFrame(housingapp_parcel_gdf.drop('geometry',axis=1), 
                            geometry=housingapp_parcel_gdf.centroid)


# SOLUTION

my_cmap = plt.get_cmap('summer') # plt.get_cmap is available across matplotlib versions
my_cmap.set_under('grey')

# Map counts of permits within tracts
fig, ax = plt.subplots(figsize = (18,8)) 

tracts_acs_gdf.plot(ax=ax,color='lightgrey', edgecolor='white')
housingapp_parcel_point_gdf.plot(ax=ax,
                                 column='approved', 
                                 legend=True, 
                                 cmap=my_cmap,
                                 vmin=1
                                 )

# Set x and y limits
ax.set_xlim([-122.35,-122.1])
ax.set_ylim([37.7,37.87])

plt.show()

--- cut and paste above code --->

#### Exercise 2

In the code cell below:

1. Spatially join the `housingapp_parcel_point_gdf` data to the census ACS data (`tracts_acs_gdf`). Make sure the CRSs match first! Save the output to a geodataframe called `tracts_and_apps_gdf`.
  
  
2. Use the `dissolve` operation on the output of the spatial join (`tracts_and_apps_gdf`) and sum the number of **proposed** permits by census tract. Name the output geodataframe `tract_proposed_counts_gdf`.


3. Create a map of the `count` of proposed units per census tract in  `tract_proposed_counts_gdf`.


4. Save your work
  1. your sjoin output
  1. your dissolve output
  1. your map

In [None]:
## YOUR CODE HERE
 
# Drop rows with no geometry - don't change this next line - it is needed bc not all rows have geometry
# housingapp_parcel_point_gdf=housingapp_parcel_point_gdf[~housingapp_parcel_point_gdf.geometry.isna()]

# Transform CRSs so they match

# sjoin housing apps to tracts

# Dissolve and sum proposed units

# Map it
 
# Save your work

*Click here for answers*

<!--- Cut and paste below ---

# SOLUTION

# Drop rows with no geometry
housingapp_parcel_point_gdf = housingapp_parcel_point_gdf[~housingapp_parcel_point_gdf.geometry.isna()]
housingapp_parcel_point_gdf.head()

# Transform CRSs so they match
housingapp_parcel_point_gdf = housingapp_parcel_point_gdf.to_crs(tracts_acs_gdf.crs)

# sjoin housing apps to tracts
tracts_and_apps_gdf = gpd.sjoin(tracts_acs_gdf, housingapp_parcel_point_gdf)

# Dissolve and sum proposed units
tract_proposed_counts_gdf=tracts_and_apps_gdf[['GEOID','geometry','proposed']].dissolve(by='GEOID', aggfunc="sum", as_index=False)
#tract_proposed_counts_gdf

# Map it
fig, ax = plt.subplots(figsize = (14,8)) 

# Display the output of our spatial join
tract_proposed_counts_gdf.plot(ax=ax,column='proposed', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, )

plt.show()

--- cut and paste above --->

## Congrats, you're done with part 3!
<br>

---
<div style="display:inline-block;vertical-align:middle;">
<a href="https://dataforhousing.org/" target="_blank"><img src ="https://media-exp1.licdn.com/dms/image/C560BAQELkt35AxeIeA/company-logo_200_200/0?e=1597881600&v=beta&t=irZ1tYCA9A2biVzCguvCXzsfzanSYDFuF22IUFNY5Sg" width="75" align="left">
</a>
</div>

<div style="display:inline-block;vertical-align:middle;">
    <div style="font-size:larger">&nbsp;Data Science for Housing Workshop, University of California Berkeley</div>
    <div>&nbsp;Tim Thomas, Patty Frontiera, Emmanuel Lopez, Ethan Ebinger, Hikari Murayama, Karen Chapple, Claudia von Vacano</div>
    <div>&copy; UC Regents, 2019-2020</div>
</div>