<a id="section4"></a>
## 3.4 Spatial join

Previously, we joined permits and parcels based on a shared attribute - APN - which is included in both datasets. This is called an `attribute join`.

- *What other datasets did we join by attribute?*

But what if our datasets do not have a shared attribute?

If two geodataframes have the same CRS we can do a **spatial join** to join attributes by shared location. 

We do this with the Geopandas [**sjoin**](https://geopandas.org/reference/geopandas.sjoin.html) function.

Before we begin, make sure your two input geodataframes have the same CRS.

In [None]:
permit_parcel_point_gdf.crs == tracts_acs_gdf.crs

If they do not have the same CRS, transform one gdf to match the other.
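
As a minimal sketch of that transform, the following uses small made-up geodataframes (`points_gdf` and `polys_gdf` are hypothetical names and coordinates, not the workshop data) to show `to_crs` bringing one geodataframe into the CRS of the other:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: one point in WGS84 (EPSG:4326)
# and one polygon in NAD83 (EPSG:4269)
points_gdf = gpd.GeoDataFrame(
    {"permit_id": [1]},
    geometry=[Point(-122.27, 37.80)],
    crs="EPSG:4326",
)
polys_gdf = gpd.GeoDataFrame(
    {"tract_id": ["A"]},
    geometry=[Polygon([(-122.3, 37.7), (-122.2, 37.7),
                       (-122.2, 37.9), (-122.3, 37.9)])],
    crs="EPSG:4269",
)

# Compare the CRSs before the transform
print(points_gdf.crs == polys_gdf.crs)

# to_crs returns a new geodataframe unless inplace=True is passed
points_nad83_gdf = points_gdf.to_crs(polys_gdf.crs)
print(points_nad83_gdf.crs == polys_gdf.crs)
```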

In [None]:
print(permit_parcel_point_gdf.crs)
print(tracts_acs_gdf.crs)

In [None]:
# Transform the permit points to match the CRS of the Census data (4269)
permit_parcel_point_gdf.to_crs(tracts_acs_gdf.crs, inplace=True)
permit_parcel_point_gdf.crs == tracts_acs_gdf.crs

In [None]:
# And for good measure...
print(permit_parcel_point_gdf.crs)
print(tracts_acs_gdf.crs)

Next, read the help documentation for `sjoin` and let's take a few minutes to discuss it.

In [None]:
# Take a look at the function documentation
help(gpd.sjoin)

Now, let's spatially join the permit data to the census data so that we can sum the number of permitted units in each census tract.  

In [None]:
# Spatially join permit data to census tract data
tracts_and_permits_gdf = gpd.sjoin(tracts_acs_gdf, permit_parcel_point_gdf)

Now let's take a look at the output of the spatial join.

In [None]:
tracts_and_permits_gdf.head(2)

By default `sjoin` is an inner join. It keeps the data from both geodataframes only where the locations spatially intersect. For our join, this means we maintain only those Alameda County census tracts that contain Oakland permit points.

By default, `sjoin` keeps the geometry of the first geodataframe passed to the operation, here the census tracts.

The output of our `sjoin` operation is the geodataframe `tracts_and_permits_gdf`, which has:
- a row for each permit application that is located within a census tract (all of which are)
- the **polygon geometry** of the census tract in which the permit is located
- all of the attribute data columns (non-geometry columns) from both input geodataframes.
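
To make these defaults concrete, here is a small sketch on made-up data (the `tracts` and `permits` geodataframes below are hypothetical stand-ins, not the workshop datasets). It compares the default inner join with `how="left"`, which keeps every row of the first geodataframe:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: two square "tracts" and three "permit" points,
# all of which fall inside the first tract
tracts = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T2"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4269",
)
permits = gpd.GeoDataFrame(
    {"APN": ["a", "b", "c"], "units_permit": [2, 5, 1]},
    geometry=[Point(0.2, 0.2), Point(0.8, 0.5), Point(0.5, 0.9)],
    crs="EPSG:4269",
)

# Default (inner) join: only tracts that intersect a point are kept,
# with one output row per tract/point pair
inner = gpd.sjoin(tracts, permits)
print(inner[["GEOID", "APN", "units_permit"]])

# Left join: every tract is kept; tracts without points get missing values
left = gpd.sjoin(tracts, permits, how="left")
print(left[["GEOID", "APN", "units_permit"]])

# The output geometry comes from the first (left) input
print(inner.geometry.type.unique())
```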


To confirm this, let's map the output geodataframe.

In [None]:
tracts_and_permits_gdf.plot()

We are missing some census tracts because not all census tracts contain permit activity **AND** we did an `inner join`.
- Where was the inner join specified?

Take a look at the input and output geometry types:

In [None]:
print("Permits input geometry:", permit_parcel_point_gdf.geometry.type.unique())
print("Tracts input geometry:" , tracts_acs_gdf.geometry.type.unique())
print("Tracts and permits join output geometry:", tracts_and_permits_gdf.geometry.type.unique())

We joined point data to polygon/multipolygon data and our join output has polygon geometry. Why? Because of the order of the inputs to `sjoin`: the polygon geodataframe (census tracts) was listed first.
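
The effect of input order can be sketched with made-up data (the `tracts` and `permits` objects below are hypothetical, not the workshop datasets). Swapping the arguments changes which geometry the output carries:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: one square "tract" and two "permit" points,
# one inside the tract and one far outside it
tracts = gpd.GeoDataFrame(
    {"GEOID": ["T1"]},
    geometry=[box(0, 0, 1, 1)],
    crs="EPSG:4269",
)
permits = gpd.GeoDataFrame(
    {"APN": ["a", "b"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
    crs="EPSG:4269",
)

# Polygons first: the output keeps the tract polygons
polys_out = gpd.sjoin(tracts, permits)
print(polys_out.geometry.type.unique())

# Points first: the output keeps the permit points
# (the point outside the tract is dropped by the default inner join)
points_out = gpd.sjoin(permits, tracts)
print(points_out.geometry.type.unique())
print(points_out[["APN", "GEOID"]])
```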

Now, check out the shape of the input geodataframes and output geodataframes.

In [None]:
print("Permits:", permit_parcel_point_gdf.shape)
print("Tracts:" , tracts_acs_gdf.shape)
print("Tracts and permits:", tracts_and_permits_gdf.shape)

Our output geodataframe has the same number of rows as the permits geodataframe. This is because all permits fall within an Oakland census tract and thus are included in the join output.

However, the output geodataframe has duplicate census tract data because there are tracts that contain more than one permit application.

For example, let's look at the rows for one census tract:

In [None]:
tracts_and_permits_gdf[tracts_and_permits_gdf['GEOID'] == '06001400600'][['GEOID','APN','units_permit']]

Each of these rows repeats the same census tract, paired with the number of permitted units for a different permit application.

So, our `sjoin` output is not map ready. We first need to aggregate the number of permitted units by census tract, grouping the data by `GEOID`, which is a unique identifier for each tract.

Ok, let's sum `units_permit` in each census tract.  We can do this using a pandas `groupby` operation.

In [None]:
tract_permit_counts_df = tracts_and_permits_gdf[['GEOID','units_permit']].groupby('GEOID', as_index=False).sum()
print("Rows and columns:", tract_permit_counts_df.shape)

# take a look at the data
tract_permit_counts_df.head(7)

Now we can verify that the sum of permitted units for census tract `06001400600` (at row index 5) is seven.

The above `groupby` and sum operations give us the counts that we are looking for:
- We have identified the 105 census tracts that contain permit application locations.
- We have the number of `units_permit` within those census tracts. 

But the output of `groupby` is a dataframe, not a geodataframe.
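
One way to check this is sketched below on made-up data (`joined` is a hypothetical stand-in for a spatial join output). Selecting only non-geometry columns before grouping leaves no geometry to carry through:

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical stand-in for a spatial join output: tract T1 appears twice
joined = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T1", "T2"], "units_permit": [2, 5, 1]},
    geometry=[box(0, 0, 1, 1), box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4269",
)

# Sum the permitted units per tract using only attribute columns
counts = joined[["GEOID", "units_permit"]].groupby("GEOID", as_index=False).sum()

# Inspect the type of the result and whether it has a geometry column
print(type(counts).__name__)
print("geometry" in counts.columns)
```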

If we want to output a spatial geodataframe, we can do one of two things:
1. join the `groupby` output to `tracts_acs_gdf` by the attribute `GEOID`, or
2. use the geodataframe [**dissolve**](https://geopandas.org/aggregation_with_dissolve.html) method, which you can think of as a spatial `groupby`.
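
For reference, the first option might look something like the sketch below. It uses made-up data (`tracts` and `counts_df` are hypothetical stand-ins for the tract geodataframe and the `groupby` output) and assumes the merge is called on the geodataframe so that the result stays spatial:

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Hypothetical stand-ins: three square "tracts" and per-tract permit sums
tracts = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T2", "T3"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs="EPSG:4269",
)
counts_df = pd.DataFrame({"GEOID": ["T1", "T3"], "units_permit": [7, 2]})

# Attribute join on GEOID; calling merge on the geodataframe (left side)
# returns a geodataframe with the tract geometry attached to each sum
tract_counts = tracts.merge(counts_df, on="GEOID", how="inner")
print(type(tract_counts).__name__)
print(tract_counts[["GEOID", "units_permit"]])
```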

Since we already know how to do an attribute join, let's do the `dissolve`.

In [None]:
tract_permit_counts_gdf=tracts_and_permits_gdf[['GEOID','geometry','units_permit']].dissolve(by='GEOID', aggfunc="sum", as_index=False)
print("Rows and columns: ", tract_permit_counts_gdf.shape)

# take a look
tract_permit_counts_gdf.head(7)

Let's break that down.

- The `dissolve` operation requires a geometry column and a grouping column, which above is GEOID. Any geometries within the **same group** will be dissolved if they have the same geometry or nested geometries. 
 
- The `aggfunc`, or aggregation function, of the dissolve operation will be applied to all numeric columns in the input geodataframe (unless the function is `count` in which case it will count rows.)  
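
The same behavior can be seen on a tiny made-up example (`joined` below is a hypothetical stand-in for a spatial join output, with tract `T1` repeated). The duplicate polygons collapse into one row per `GEOID` and the numeric column is summed:

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical stand-in for a spatial join output:
# tract T1 appears twice with identical geometry
joined = gpd.GeoDataFrame(
    {"GEOID": ["T1", "T1", "T2"], "units_permit": [2, 5, 1]},
    geometry=[box(0, 0, 1, 1), box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4269",
)

# Group by GEOID, merge the geometries within each group,
# and sum the numeric columns
dissolved = joined.dissolve(by="GEOID", aggfunc="sum", as_index=False)
print(dissolved[["GEOID", "units_permit"]])
print(dissolved.geometry.type.unique())
```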

Check out the Geopandas documentation on [dissolve](https://geopandas.org/aggregation_with_dissolve.html?highlight=dissolve) for more information.

Above we selected three columns from the input geodataframe to create a subset as input to the dissolve operation. Can you think of why we did that?

### Mapping our Spatial Join Output

Because our `sjoin` plus `dissolve` operation outputs a geodataframe, we can map the count of `units_permit` by census tract.

In [None]:
fig, ax = plt.subplots(figsize = (14,8)) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True, 
                             legend_kwds={'title':'Permitted units'})

plt.show()


<div style="display:inline-block;vertical-align:top;">
    <img src="https://image.flaticon.com/icons/svg/87/87705.svg" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Questions
</div>

- How does the above map convey the distribution of `units_permit` in Oakland differently compared to the point map of `units_permit` we created previously?
- What does the above map tell you about the spatial distribution of permits in Oakland?
- Does the output geodataframe include all census tracts in Oakland?

- Why does the output include census tracts with zero permitted units?

- What addition(s) could improve the above map?

<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/f/ff/Cat_on_laptop_-_Just_Browsing.jpg"></img>


#### Exercise

1. Use `dissolve` with the `tracts_and_permits_gdf` geodataframe to **count** the number of permit applications by census tract.
  - Hint: count the column that is unique to permit apps (`APN`). This will also be the name of the output column.


2. Use `dissolve` with the `tracts_and_permits_gdf` geodataframe to calculate the **mean** median household income (`med_hhinc`) in census tracts with more than one permitted unit by census tract.


3. Make choropleth maps of the two variables `applications by tract` and `median household income`.

In [None]:
# Your code here

# 1. Count of Permit Applications by census tract
app_count_gdf=...
#app_count_gdf.head()

In [None]:
# 2. Mean median household income in tracts with more than one permit application by census tract
mean_hhinc_gdf=...
#mean_hhinc_gdf.head()

In [None]:
# 3. Maps (uncomment and complete)
#app_count_gdf.plot(...)
#mean_hhinc_gdf.plot(...)

*Click here for answers*

<!---
# SOLUTION 1
# Count of Permit Applications by census tract
app_count_gdf=tracts_and_permits_gdf[['GEOID','geometry','APN']].dissolve(by='GEOID', aggfunc="count", as_index=False)
app_count_gdf.head()

# SOLUTION 2
# Mean median household income in tracts with more than one permit application by census tract
mean_hhinc_gdf=tracts_and_permits_gdf[['GEOID','geometry','med_hhinc']].dissolve(by='GEOID', aggfunc="mean", as_index=False)
mean_hhinc_gdf.head()

# SOLUTION 3
app_count_gdf.plot(column='APN', legend=True, legend_kwds={'label': 'Count of apps'});
mean_hhinc_gdf.plot(column='med_hhinc', legend=True, legend_kwds={'label': 'Avg med hhinc'});
--->
