# Optional: Spatial Clipping and Areal Interpolation
In this optional notebook we will cover clipping and areal interpolation.

- [6.1 Introduction](#section1)
- [6.2 Load the Data](#section2)
- [6.3 Clipping](#section3)
- [6.4 Areal Interpolation](#section4)

**Instructor Notes**:
- Datasets used:
    - notebook_data/census/Tracts/cb_2018_06_tract_500k.zip
    - notebook_data/census/ACS5yr/census_variables_CA_2018.csv
    - notebook_data/census/Places/cb_2018_06_place_500k.zip


- Expected time to complete:
    - 1 hour

<a id="section1"></a>
## 6.1 Introduction

**Clipping** and **areal interpolation** are both useful and important skills for combining different datasets. 

`Clipping` involves cutting out the features (or rows) in one geospatial dataset that spatially intersect the features of a polygon geospatial dataset. This is useful if we want to limit the information to a certain region. For example, if we want the census tracts for the city of Oakland, we can clip the census tracts for the state to the boundary of that city.

<img src = "./img/oak_tracts_clip.png">

Clipping will cut input geometries that cross the boundary of the clip geometry. For example, if a census tract crosses the Oakland boundary, it will be clipped to the boundary, cookie-cutter style. However, clipping does not alter the input attribute data. For example, if a census tract is clipped in half, it will still maintain the same value for all attributes, e.g., total population.

  
`Areal Interpolation`, on the other hand, uses **area weighting** to reaggregate data from one geometry to another. Using the example above, if a census tract in Oakland were clipped in half, then areal interpolation would assign half the total population to it.
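
As a rough illustration of the difference, the sketch below uses made-up axis-aligned rectangles in place of real tract and city polygons. The coordinates, the population of 1,000, and the helper names `rect_area` and `rect_intersection` are all hypothetical. Clipping keeps the tract's original population, while area weighting scales it by the share of the tract's area that falls inside the city.

```python
# Toy rectangles as (xmin, ymin, xmax, ymax); every value here is made up.
def rect_area(r):
    """Area of an axis-aligned rectangle (0 if the rectangle is empty)."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def rect_intersection(a, b):
    """Overlap of two axis-aligned rectangles, in the same tuple format."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


tract = (0.0, 0.0, 2.0, 2.0)      # a 2 x 2 "census tract"
tract_pop = 1000                  # hypothetical total population
city = (-5.0, -5.0, 1.0, 5.0)     # a "city" covering the left half of the tract

clipped = rect_intersection(tract, city)
share = rect_area(clipped) / rect_area(tract)

clip_pop = tract_pop              # clipping changes geometry only
interp_pop = tract_pop * share    # area weighting scales the count

print(clipped, share, clip_pop, interp_pop)
```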

Great, now that you have a sense of these two methods, let's cover what our goal is in this notebook. We will **clip our census tract data to a city boundary and then interpolate the ACS5 values for those clipped tracts**.

### Set-Up
Let's import the packages we need before we get started.

In [1]:
import math
import numpy as np
import pandas as pd
 
import geopandas as gpd

# Ignore warning about missing/empty geometries
import warnings
warnings.filterwarnings('ignore', 'GeoSeries.notna', UserWarning)

import matplotlib # base python plotting library
%matplotlib inline  
import matplotlib.pyplot as plt # more plotting stuff 

We'll also be using the following from `tobler`, which is a library for areal interpolation and dasymetric mapping. You can find out more here: https://pysal.org/tobler/index.html

In [2]:
# For area weighted interpolation
from tobler import area_weighted
# area_tables could not be imported from tobler.area_weighted here and is not used below
from tobler.area_weighted import area_interpolate

<a id="section2"></a>
## 6.2 Load the Data


First, let's read in the census geographic data and census ACS5 data for the state.

In [None]:
# Our data
tract_data = "zip://../notebook_data/census/Tracts/cb_2018_06_tract_500k.zip"
acs5_data = "../notebook_data/census/ACS5yr/census_variables_CA_2018.csv"
places_data = "zip://../notebook_data/census/Places/cb_2018_06_place_500k.zip"


In [None]:
# Read in the census tracts for all of California 
# setting Census Tract Identifier GEOID to a string so as not to lose leading zeros
tracts_gdf = gpd.read_file(tract_data, dtype={"GEOID":str})

In [None]:
tracts_gdf.head(2)

In [None]:
# Read in our ACS5 data for CA
acs5_df = pd.read_csv(acs5_data, dtype={"FIPS_11_digit": str})

In [None]:
# Take a look at the rows
acs5_df.head(2)

### Join the ACS data to the census tracts

In [None]:
acs5_tracts_gdf = tracts_gdf.merge(acs5_df, how="left", left_on="GEOID", right_on="FIPS_11_digit")

In [None]:
acs5_tracts_gdf.head(2)
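
One way to sanity-check a left join like this is to count the rows that found no match. The sketch below uses small made-up pandas tables in place of the real tract and ACS data (the IDs and counts are hypothetical), and relies on the `indicator=True` option of `merge`. It also shows why the identifiers are read as strings.

```python
import pandas as pd

# Toy stand-ins for the tract table and the ACS table (values are made up)
toy_tracts = pd.DataFrame({"GEOID": ["06001400100", "06001400200", "06001400300"]})
toy_acs = pd.DataFrame({
    "FIPS_11_digit": ["06001400100", "06001400200"],
    "c_race": [3000, 2000],
})

# indicator=True adds a "_merge" column recording where each row came from
merged = toy_tracts.merge(
    toy_acs, how="left", left_on="GEOID", right_on="FIPS_11_digit", indicator=True
)

# Rows marked "left_only" are tracts with no matching ACS record
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(merged[["GEOID", "c_race", "_merge"]])
print("tracts without ACS data:", n_unmatched)

# Why the IDs are kept as strings: an integer round trip drops the leading zero
as_int = str(int("06001400100"))
print(as_int, len(as_int))
```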

Plot the output census tract data, creating a choropleth map of one ACS5 variable.

In [None]:
acs5_tracts_gdf.plot(column="p_white", legend=True)

We now have our 2018 ACS 5-year data for all of California. Our goal is to subset this so that we only have the data for our city of interest.


### Census Places

Census places identify, in general, populated communities. This includes incorporated cities, towns and villages (legal entities) and Census Designated Places (populated areas that lack separate government, but are useful for statistical purposes). So census places are statistical areas that may not completely align with our administrative/legal city boundaries. But they are a useful proxy.

Read in the census place data.

In [None]:
places_gdf = gpd.read_file(places_data)
places_gdf.head()

Subset to select our city of interest which is Oakland, CA for this example.

In [None]:
city_name = 'Oakland'
oakland_gdf = places_gdf[places_gdf['NAME']==city_name].reset_index(drop=True)
oakland_gdf.plot()

Take a look at the geodataframe. It has only one row which makes sense for this data.

In [None]:
oakland_gdf.head()

<a id="section3"></a>
## 6.3 Clipping

Clipping allows us to clip one geometry by another. For example, we can clip the `acs5_tracts_gdf` geodataframe to a city boundary.

First, take a look at the function documentation or check the GeoPandas web page.

In [None]:
# Uncomment to view
#gpd.clip?

**Clip Order matters**: clip the first geometry by the second geometry!

In [None]:
# Clip CA Census tracts to the boundary of our city
oakland_clip_gdf = gpd.clip(acs5_tracts_gdf, oakland_gdf).reset_index(drop=True)
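
To see why the argument order matters, here is a small sketch with made-up squares standing in for tracts and a city (the names `toy_tracts`, `toy_city`, and `Toyville` are hypothetical). The first argument supplies the rows and columns of the result; the second only supplies the cutting shape.

```python
import geopandas as gpd
from shapely.geometry import box

# Two toy "tracts" side by side, each a 2 x 2 square with a made-up population
toy_tracts = gpd.GeoDataFrame(
    {"pop": [1000, 2000]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
)
# A toy "city" covering the right half of the first tract and the left half of the second
toy_city = gpd.GeoDataFrame({"name": ["Toyville"]}, geometry=[box(1, 0, 3, 2)])

# Tracts clipped by the city: one row per intersecting tract, tract columns kept
tracts_by_city = gpd.clip(toy_tracts, toy_city)
# City clipped by the tracts: the city's row and columns, cut to the tracts' combined extent
city_by_tracts = gpd.clip(toy_city, toy_tracts)

print(tracts_by_city[["pop"]].assign(area=tracts_by_city.area))
print(city_by_tracts[["name"]].assign(area=city_by_tracts.area))
```

Note that the `pop` values in the first result are unchanged even though each toy tract was cut in half.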

In [None]:
# Take a look at output
oakland_clip_gdf.head()

Interesting! Note that our clip output geometry as shown above includes LINESTRINGS and POLYGONS.

Let's take a close look at different geometry types in the `clip` output.

In [None]:
print(oakland_clip_gdf.geometry.type.unique())

The `unique` method gives us the unique types of geometries in the dataframe.

The `value_counts` method will give the count of each unique type.

In [None]:
oakland_clip_gdf.geometry.type.value_counts()

We can print those different geometry types with different colors to see what is going on in the clip output.

In [None]:
fig, ax = plt.subplots(figsize = (12,12)) 

tracts_gdf.plot(ax=ax, color="white", edgecolor="black", linewidth=0.6)
oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'Polygon'].plot(ax=ax,color='green', alpha=0.5)
oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'MultiLineString'].plot(ax=ax,color='red', linewidth=4)
oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'LineString'].plot(ax=ax,color='black', linewidth=4)

# Set x and y limits to Zoom map in on our city of interest
#Use the output from the total_bounds attribute to zoom to the city of interest
ax.set_xlim([oakland_clip_gdf.total_bounds[0]-0.01, oakland_clip_gdf.total_bounds[2]+0.01])
ax.set_ylim([oakland_clip_gdf.total_bounds[1]-0.01, oakland_clip_gdf.total_bounds[3]+0.01])

plt.show()

So you can see that the `clip` operation returned LineStrings and MultiLineStrings along with the polygons of the census tracts. This happens at the intersection of tract and place polygons.

Since census tracts are polygons, we will keep only the polygon and multipolygon geometries returned by the clip. Here, we only have type Polygon.

In [None]:
oakland_clip_gdf = oakland_clip_gdf[oakland_clip_gdf.geometry.type.isin(['Polygon', 'MultiPolygon'])].reset_index(drop=True)
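
As an alternative to filtering by hand, `gpd.clip` also accepts a `keep_geom_type` argument. The sketch below compares the two approaches on made-up squares (all names and coordinates are hypothetical); the third square only touches the mask along an edge, which is the kind of situation that can leave line artifacts in the output. Whether the two approaches agree on real data is worth checking rather than assuming.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy data: the third square only touches the mask along an edge
toy_tracts = gpd.GeoDataFrame(
    {"tract": ["a", "b", "c"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2), box(4, 0, 6, 2)],
)
toy_mask = gpd.GeoDataFrame(geometry=[box(1, 0, 4, 2)])

# Default clip: inspect which geometry types come back
raw = gpd.clip(toy_tracts, toy_mask)
print(raw.geometry.type.value_counts())

# Option 1: filter by geometry type after clipping
polys_only = raw[raw.geometry.type.isin(["Polygon", "MultiPolygon"])].reset_index(drop=True)

# Option 2: ask clip to keep only the input geometry type
kept = gpd.clip(toy_tracts, toy_mask, keep_geom_type=True).reset_index(drop=True)

print(sorted(polys_only["tract"]), sorted(kept["tract"]))
```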

and repeat that plot...

In [None]:
fig, ax = plt.subplots(figsize = (12,12)) 

#tracts_gdf.plot(ax=ax, color="white", edgecolor="grey")

oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'Polygon'].plot(ax=ax,color='green',alpha=0.5)
oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'MultiLineString'].plot(ax=ax,color='red', linewidth=4)
oakland_clip_gdf[oakland_clip_gdf['geometry'].type== 'LineString'].plot(ax=ax,color='orange', linewidth=4)
tracts_gdf.plot(ax=ax,facecolor='none',edgecolor="black",linewidth=0.5)

# Set x and y limits to Zoom map in on our city of interest
#Use the output from the total_bounds attribute to zoom to the city of interest
ax.set_xlim([oakland_clip_gdf.total_bounds[0]-0.01, oakland_clip_gdf.total_bounds[2]+0.01])
ax.set_ylim([oakland_clip_gdf.total_bounds[1]-0.01, oakland_clip_gdf.total_bounds[3]+0.01])

plt.show()

Now we only see the polygon census tracts in the map. We get warnings for trying to add LINE types because they no longer exist in the geodataframe.

We can also see from the above plot that the census tracts nest within the city boundary. A tract is either in the city or not; there are no partial overlaps.

Now, for good measure, let's plot the clip input and output geodataframes.

In [None]:
# plot 3 maps in one row
fig = plt.figure(figsize=(15,8))
# map 1
ax1 = plt.subplot(131)
ax1.set_aspect('equal')
ax1.set_title("CA Census Tracts (input data)")
acs5_tracts_gdf.plot(ax=ax1)
# map 2
ax2 = plt.subplot(132)
ax2.set_aspect('equal')
ax2.set_title("Oakland City Boundary (clip data)")
oakland_gdf.plot(ax=ax2, edgecolor="black")
# map 3
ax3 = plt.subplot(133)
ax3.set_aspect('equal')
ax3.set_title("Tracts clipped to Oakland (output data)")
oakland_clip_gdf.plot(ax=ax3, edgecolor="black")

# remove grid lines & labels
ax1.set_axis_off()
ax2.set_axis_off()
ax3.set_axis_off()

# display plot
plt.show()

### Clipping and attribute data

So, we have clipped the statewide census tract data to the Oakland city boundary.

Now let's see what happened to the census tract ACS attribute data.

In [None]:
oakland_clip_gdf.head()

Nothing happened to the attribute data - the columns and column values do not change. Only the geometry will change with a `clip` operation. We maintain the column values from the input data for all rows in the clip output. 

We can now make a choropleth map of total population (`c_race`) or of median household income (`med_hhinc`) within the city of Oakland.

In [None]:
oakland_clip_gdf.plot(column='c_race', legend=True, legend_kwds={'label':"total population"});
oakland_clip_gdf.plot(column='med_hhinc', legend=True, legend_kwds={'label':"median household income"});

We can make histograms of the data values for the city.

In [None]:
oakland_clip_gdf['med_hhinc'].hist()

And sum values for the city.

In [None]:
## total population (c_race) within the city
oakland_clip_gdf.c_race.sum()

According to [Oakland's Wikipedia page](https://en.wikipedia.org/wiki/Oakland,_California), the city's population in 2019 was 433,031.

So we are close, differing by 11,989. The totals are not the same because, although the Wikipedia population value is also from census data, it is for a different year (2019). Moreover, the census continually revises its sample estimates to improve them.

# BUT BUT BUT

## Important caution about clipping

Clipping is a geometric operation. As we just noted, it only changes the geometry column. Clipping does not reapportion values where census tracts straddle the clip geometry (e.g., city boundaries).

In cases where the tracts are nested completely within a city, the clip method is sufficient. It may also be sufficient if the tracts are almost completely nested, depending on your data analysis.

When census tracts or other geographies only partially overlay the geometry of interest, you need to use a different method to reaggregate the data. One popular method is **areal interpolation**.



<a id="section4"></a>
## 6.4 Areal Interpolation

`Areal interpolation` uses area weighting to reapportion data values aggregated by one geometry to another geometry. For example, if only half a census tract is within the target area, only half the total population would be aggregated to the new geometry.

There are two types of numeric variables that can be interpolated using this approach:

- `intensive`: averages, medians, percents, ratios
  - When intensive variables are reaggregated, the weighted values are `averaged`.

- `extensive`: counts
  - When extensive variables are reaggregated, the weighted counts are `summed`.
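
The two rules can be sketched with plain arithmetic. The numbers below are made up, and this is only an illustration of the general area-weighting idea under the stated assumptions (counts weighted by the share of each source area inside the target; averages weighted by each source's share of the covered target area). It is not a reproduction of any library's internals.

```python
# Made-up source tracts overlapping one target area.
# "overlap" is the area of the tract that falls inside the target.
sources = [
    {"area": 4.0, "overlap": 2.0, "pop": 1000, "med_income": 50000},  # half inside
    {"area": 6.0, "overlap": 6.0, "pop": 600, "med_income": 80000},   # fully inside
]

# Extensive (counts): each tract contributes the share of its own area inside the target
extensive_total = sum(s["pop"] * s["overlap"] / s["area"] for s in sources)

# Intensive (medians, rates): average weighted by each tract's share of the covered target area
total_overlap = sum(s["overlap"] for s in sources)
intensive_value = sum(s["med_income"] * s["overlap"] / total_overlap for s in sources)

print(extensive_total)  # 1100.0
print(intensive_value)  # 72500.0
```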

### Areal Interpolation time!

We can use the tobler `area_interpolate` function to reaggregate the census tract ACS5 data to the census tracts clipped to the boundary of our city of interest.

First read the documentation!

In [None]:
# Uncomment to read
# area_interpolate?

Make sure both geodataframes have the same CRS.
- If not, you will need to transform one CRS to match the other.

In [None]:
acs5_tracts_gdf.crs == oakland_clip_gdf.crs
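
If the check above returned `False`, one of the layers would need to be reprojected with `to_crs`. A minimal sketch on made-up data follows; the two EPSG codes and the toy box are assumptions chosen only to produce a mismatch.

```python
import geopandas as gpd
from shapely.geometry import box

# Two toy layers covering the same made-up box but declared in different CRSs
toy_source = gpd.GeoDataFrame(
    {"pop": [1000]}, geometry=[box(-122.3, 37.7, -122.2, 37.8)], crs="EPSG:4269"
)
toy_target = gpd.GeoDataFrame(
    geometry=[box(-122.3, 37.7, -122.2, 37.8)], crs="EPSG:4326"
)

# Reproject the target to the source CRS only when they differ
if toy_source.crs != toy_target.crs:
    toy_target = toy_target.to_crs(toy_source.crs)

crs_match = toy_source.crs == toy_target.crs
print(crs_match)
```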

In [None]:
# Areal interpolate the tract data to the clipped Oakland tracts
oakland_ai_gdf = area_interpolate(acs5_tracts_gdf, 
                                   oakland_clip_gdf, 
                                   intensive_variables = ['med_rent','med_hhinc'],
                                   extensive_variables = ['c_race','c_white'],
                                   allocate_total=False
                                  )

In [None]:
oakland_ai_gdf.head()

*How many rows do we expect in the output?*

In [None]:
oakland_clip_gdf.shape

We have now run the function. Note that this is an interesting case of areal interpolation in that most of the target geometries are the same as the source geometries. They may differ only along the border, where the tracts are not nested within the city boundary.

Take a look at the output of the `area_interpolate` function.

In [None]:
oakland_ai_gdf.shape

In [None]:
oakland_ai_gdf.head(3)

In [None]:
oakland_ai_gdf.tail(3)

### Understanding the output

(1) Does the number of tracts in the input data match the output?
- If yes, great.
- If not, what is the relationship?

(2) Does the output geodataframe have data for both the intensive and extensive variables?


<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/f/ff/Cat_on_laptop_-_Just_Browsing.jpg"></img> 

If we have twice as many rows in the output, that indicates one set of features for the intensive variables and one for the extensive. We can use subsetting if we want to isolate those.

 - We also do not have the census identifier in the output (`GEOID`)

We do have the output for both intensive and extensive variables.

### Check the results with a few plots.

In [None]:
# Plot an intensive variable - this plots only the rows that have values for med_rent
oakland_ai_gdf.plot(column='med_rent', legend=True)

In [None]:
# Plot an extensive variable - this plots only the rows that have values for c_race
oakland_ai_gdf.plot(column='c_race', legend=True)

Compare the total pop (sum of c_race) from clip and areal interpolation operations.

In [None]:
oakland_ai_gdf.c_race.sum()

In [None]:
oakland_clip_gdf.c_race.sum()

The summary values from the clip and areal interpolation operations are the same. This is expected because the census tracts are nested within the city.

## Part 2 - When Areal Interpolation is needed

The boundary of Oakland is defined by a fairly simple polygon containing one hole (the City of Piedmont). All of the census tracts in Oakland are completely contained within the city boundary so areal interpolation isn't necessary.

Let's take a look at the city of `San Jose, CA`, which presents a more complex case.

To push the limit of the complexities you will encounter, let's try clipping and interpolating data for San Jose, whose boundary is a complex multi-part polygon.


First, grab the boundary of San Jose from the census places data.

In [None]:
sj_gdf = places_gdf[places_gdf['NAME']=='San Jose'].reset_index(drop=True)
sj_gdf.plot()

Take a look at the geodataframe - only one row!

In [None]:
sj_gdf.head()

Clip the census tracts

Order matters: clip the first geometry by the second geometry!

In [None]:
sj_clip_gdf = gpd.clip(acs5_tracts_gdf,sj_gdf).reset_index(drop=True)

Now, take a look at the output.

In [None]:
sj_clip_gdf.head()

Let's check to see what geometry types are in the clip output.

In [None]:
sj_clip_gdf.geometry.type.unique()


And plot them as we did above for Oakland.

In [None]:
fig, ax = plt.subplots(figsize = (15,15)) 

sj_clip_gdf[sj_clip_gdf['geometry'].type== 'GeometryCollection'].plot(ax=ax,color='purple', alpha=0.5)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'Polygon'].plot(ax=ax,color='green', alpha=0.5)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'MultiPolygon'].plot(ax=ax,color='yellow', alpha=0.5)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'MultiLineString'].plot(ax=ax,color='red', linewidth=4)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'LineString'].plot(ax=ax,color='black', linewidth=4)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'Point'].plot(ax=ax,color='black')
tracts_gdf.plot(ax=ax,facecolor='none',edgecolor="black",linewidth=0.6)

# Set x and y limits to Zoom map into city
ax.set_xlim([sj_clip_gdf.total_bounds[0]-0.01, sj_clip_gdf.total_bounds[2]+0.01])
ax.set_ylim([sj_clip_gdf.total_bounds[1]-0.01, sj_clip_gdf.total_bounds[3]+0.01])

# Then show plot
plt.show()

We can see from the map above that we want to keep the polygon features, which are in the geometry types `GeometryCollection` and `Polygon`. The line and point features are artifacts of the clip operation and exist at the intersection of multiple polygon boundaries.

In [None]:
# Drop the non-polygon data! Keep Polygon and GeometryCollection
sj_clip_gdf = sj_clip_gdf[sj_clip_gdf.geometry.type.isin(['Polygon','GeometryCollection'])].reset_index(drop=True)


Let's plot that again to make sure it looks good.

In [None]:
fig, ax = plt.subplots(figsize = (15,15)) 

sj_clip_gdf[sj_clip_gdf['geometry'].type== 'GeometryCollection'].plot(ax=ax,color='purple', alpha=0.5)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'Polygon'].plot(ax=ax,color='green', alpha=0.5)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'MultiPolygon'].plot(ax=ax,color='green', alpha=0.5)

sj_clip_gdf[sj_clip_gdf['geometry'].type== 'MultiLineString'].plot(ax=ax,color='red', linewidth=4)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'LineString'].plot(ax=ax,color='orange', linewidth=4)
sj_clip_gdf[sj_clip_gdf['geometry'].type== 'Point'].plot(ax=ax,color='black')
tracts_gdf.plot(ax=ax,facecolor='none',edgecolor="black",linewidth=0.6)

# Set x and y limits to Zoom map into city
ax.set_xlim([sj_clip_gdf.total_bounds[0]-0.01, sj_clip_gdf.total_bounds[2]+0.01])
ax.set_ylim([sj_clip_gdf.total_bounds[1]-0.01, sj_clip_gdf.total_bounds[3]+0.01])

# Then show plot
plt.show()

We can see above that there are a number of tracts that are not completely within the city boundary (see the bottom right, for example). This indicates that clipping will not be appropriate for San Jose.

To check that, let's use this clipped data to sum the total population in San Jose.

In [None]:
sj_clip_gdf['c_race'].sum()

How closely does this total match the 2019 figure reported on the [Wikipedia page for San Jose](https://en.wikipedia.org/wiki/San_Jose,_California) (1,021,795)? It overestimates by over 150,000, which is likely not acceptable for any data analysis!

The difference is due primarily to the clip operation including the total population from tracts only partially within San Jose.

Let's see what we get if we use `areal interpolation` instead.

In [None]:
# Area interpolate the ACS5 GeoDataFrame to the clipped San Jose tracts (acs5_tracts_gdf, sj_clip_gdf)

In [None]:
#sj_tracts_only = sj_clip_gdf[['GEOID','geometry']].reset_index(drop=True)
sj_ai_gdf = area_interpolate(acs5_tracts_gdf, 
                              sj_clip_gdf, 
                              intensive_variables = ['med_hhinc'],
                              extensive_variables = ['c_race'],
                              allocate_total=False
                            )

In [None]:
print(sj_ai_gdf.shape)  # how many tracts?
print(sj_clip_gdf.shape)

In [None]:
sj_ai_gdf.head()

In [None]:
sj_ai_gdf.plot(column='c_race', legend=True)

How closely does the area-interpolated population total match the one reported in [Wikipedia for San Jose](https://en.wikipedia.org/wiki/San_Jose,_California) (1,021,795)?

In [None]:
print(sj_ai_gdf['c_race'].sum())


That's much better. Our total population of San Jose derived from areal interpolation is much closer to the 2019 value reported for San Jose, differing by 38,521. That is still a bit of a difference that would need to be examined.

Why the difference? Well, we are comparing 2018 and 2019 data, so that is one issue.

Another is that areal interpolation is not a perfect approach for reallocating data from one unit to another. There is no perfect approach!
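
To put the absolute differences quoted in this notebook on a common scale, the short sketch below expresses each one as a percentage of the reported 2019 population. It uses only the figures stated in the text; the dictionary layout and labels are just for illustration.

```python
# Figures quoted in the text: (reported 2019 population, absolute difference from our estimate)
comparisons = {
    "Oakland (clip)": (433_031, 11_989),
    "San Jose (areal interpolation)": (1_021_795, 38_521),
}

# Express each absolute difference as a percentage of the reported population
relative_diff = {
    name: 100 * diff / reported for name, (reported, diff) in comparisons.items()
}

for name, pct in relative_diff.items():
    print(f"{name}: {pct:.2f}% of the reported population")
```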

### A word of Caution about Areal Interpolation

The main shortcoming of areal interpolation is that area weighting assumes that the variable of interest, say population, is uniformly distributed throughout the source areas (here census tracts). If this were true, then an area-weighted reallocation would be a consistently reliable approach. However, we know that this is not the case with most area data that we wish to reallocate. The analyst must decide if the errors are tolerable for the application at hand. They usually are for exploratory data analysis and when no more reliable approach is available. Alternative approaches to areal interpolation include dasymetric and model-based approaches, both of which are much more challenging to implement and to explain. For this reason, simple areal interpolation is often used for this task.
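
A toy calculation shows how the uniformity assumption can mislead. All numbers below are invented: a tract has half its area inside the target boundary, but only 10% of its residents actually live in that half (for instance, because the rest is parkland).

```python
# One made-up tract, half of whose area lies inside the target boundary
tract_pop = 1000
area_share_inside = 0.5

# Hypothetical "truth": only 10% of residents actually live in the inside half
true_share_inside = 0.1

estimated = tract_pop * area_share_inside   # what area weighting assigns
actual = tract_pop * true_share_inside      # what is really there
error = estimated - actual

print(estimated, actual, error)
```

In this invented case area weighting assigns five times the true count, which is the kind of error the analyst has to judge as tolerable or not.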