# Exercise 7: Deeper dive into `geopandas`

Skills:

* Attributes to get from `geometry`
* Buffering
* Multiple `geometry` columns
* Dissolve
* Clipping
* Spatial joins - more discussion on `how` and `predicate` arguments
* Overlay - more discussion on `how` arguments

References:
* spatial join vs overlay [explanation](https://groups.google.com/g/geopandas/c/H_qzH2T5cCE)
* `geopandas` tutorials by package creator: https://github.com/jorisvandenbossche/geopandas-tutorial
* `geopandas` concepts, explanations, but datasets use Hebrew: https://geobgu.xyz/py/geopandas2.html
* Advanced spatial modeling concepts: https://geographicdata.science/book/notebooks/01_geo_thinking.html
* [PyGIS](https://pygis.io/docs/a_intro.html) Geospatial Tutorials (focus on ch 1-3)
* 7 crucial geoprocessing [tools](https://gisgeography.com/geoprocessing-tools/)

In [None]:
import geopandas as gpd
import intake
import pandas as pd

catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

In [None]:
districts = catalog.caltrans_districts.read()
stops = catalog.ca_transit_stops.read()

## Attributes to get from `geometry`

A lot of information is stored in the `geometry` column. 
Take a look at each of the attributes for each of the datasets.

For each gdf, look at:
* its coordinate reference system (`gdf.crs`) 
* which column is its geometry (`gdf.geometry.name`)
* for a point, get the x, y of the first row (`gdf.geometry.x.iloc[0]`, `gdf.geometry.y.iloc[0]`)
* for a line, get the length of the first row (`gdf.geometry.length.iloc[0]`)
* for a polygon, get the area of the first row (`gdf.geometry.area.iloc[0]`)
* note: length and area are only meaningful when the CRS uses linear units, such as meters or feet, and not decimal degrees. Reproject with `to_crs()` first if needed.
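
The attributes listed above can be tried on a toy gdf first. This is a minimal sketch: the coordinates and the `kind` column are made up for illustration, and the CRS is assumed to be EPSG:3310 so the units are meters.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, Polygon

# Toy geometries in a projected CRS (EPSG:3310, units are meters).
# The coordinates are made up for illustration.
toy = gpd.GeoDataFrame(
    {"kind": ["point", "line", "polygon"]},
    geometry=[
        Point(0, 0),
        LineString([(0, 0), (3, 4)]),
        Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    ],
    crs="EPSG:3310",
)

print(toy.crs.axis_info[0].unit_name)  # unit of the first CRS axis
print(toy.geometry.name)               # name of the active geometry column

# x and y only exist for points, so pull out the single point first
first_point = toy.geometry.iloc[0]
print(first_point.x, first_point.y)

line_length = toy.geometry.length.iloc[1]  # in meters
polygon_area = toy.geometry.area.iloc[2]   # in square meters
print(line_length, polygon_area)
```

Because the CRS is projected, the length and area come back in meters and square meters rather than degrees.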

In [None]:
districts.crs

In [None]:
districts[districts.DISTRICT==7].geometry.iloc[0]

In [None]:
districts = districts.to_crs("EPSG:2229") #unit in feet
one_geom = districts[districts.DISTRICT==7].geometry.iloc[0]
print(type(one_geom))

In [None]:
one_geom.area

## Buffering

Typically, you want to draw some radius around a geometry. This is buffering. It is most often used for points and lines, but occasionally, you'll use it for polygons too. The result of a buffer is always a polygon. 

Examples of questions you're asking:
* how many destinations can I reach within 5 miles from my location? (my location is a point, and a 5 mile buffer should be drawn)
* how many bus stops are on this road? Well, it's highly unlikely you'll have a bus stop (point) fall exactly on the road (line) in your dataset. Instead, you can draw a small buffer (maybe 20 meters) around the road and see how many bus stops fall within it.
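
The bus stop question can be sketched with toy data. The road, the stops, and the `stop_id` values here are hypothetical, and the coordinates are assumed to be in meters (EPSG:3310).

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Hypothetical road and bus stops; coordinates are in meters (EPSG:3310)
road = gpd.GeoSeries([LineString([(0, 0), (100, 0)])], crs="EPSG:3310")
toy_stops = gpd.GeoDataFrame(
    {"stop_id": ["a", "b", "c"]},
    geometry=[Point(10, 5), Point(50, 15), Point(90, 30)],
    crs="EPSG:3310",
)

# No stop sits exactly on the line
on_road = toy_stops.geometry.intersects(road.iloc[0])
print(on_road.sum())

# Buffer the line by 20 meters (the result is a polygon) and test again
road_buffer = road.buffer(20).iloc[0]
near_road = toy_stops[toy_stops.geometry.intersects(road_buffer)]
print(road_buffer.geom_type, list(near_road.stop_id))
```

Stops `a` and `b` are 5 and 15 meters from the road, so they fall inside the buffer; stop `c` is 30 meters away and does not.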

Draw a buffer of 50 meters around the stop and set the `geometry` column to be the buffered geometry.

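gdfs can hold multiple geometry columns, but only one of them is the active geometry at a time. GeoParquet files can store multiple geometry columns, while GeoJSON can only store one, so drop or convert the extra geometry columns before exporting to GeoJSON.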

In [None]:
stops = stops.to_crs("EPSG:3310") # unit in meters
stops = stops.assign(
    geometry_buffered = stops.geometry.buffer(50)
)

stops[["agency", "stop_id", "stop_name", 
       "geometry", "geometry_buffered"]].head(2)

## Multiple `geometry` columns

By default, the `geometry` column is used. But, if you have another `geometry` column you'd like to use, you can set it.

In [None]:
stops2 = stops[stops.agency.str.contains("Big Blue")
].set_geometry("geometry_buffered")

print(f"stops geometry: {stops.geometry.name}")
print(f"stops2 geometry: {stops2.geometry.name}")

In [None]:
stops2.head(10).explore(tiles="CartoDB Positron")

In [None]:
# you can reset the geometry column on-the-fly just for mapping
stops2.head(10).set_geometry("geometry").explore(
    tiles='CartoDB Positron')

## Dissolve

Dissolve is a way to aggregate in the geospatial world. It's a way to combine multiple rows, and their geometries, into 1 row. You can also calculate statistics in the dissolve, such as `count`, `sum`, etc. 
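
A minimal sketch of a grouped dissolve on toy data. The `region`, `district`, and `population` columns are hypothetical, and the squares are made up so the resulting areas are easy to check.

```python
import geopandas as gpd
from shapely.geometry import box

# Three hypothetical unit squares; two of them share a region
toy = gpd.GeoDataFrame(
    {
        "region": ["north", "north", "south"],
        "district": [1, 2, 3],
        "population": [10, 20, 30],
    },
    geometry=[box(0, 1, 1, 2), box(1, 1, 2, 2), box(0, 0, 1, 1)],
    crs="EPSG:3310",
)

# One row per region: geometries are merged, attributes are aggregated
by_region = toy.dissolve(
    by="region",
    aggfunc={"district": "count", "population": "sum"},
).reset_index()

print(by_region[["region", "district", "population"]])
print(by_region.geometry.area.tolist())
```

The two `north` squares are merged into a single polygon, and their attributes are counted and summed according to `aggfunc`.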

In [None]:
# There are 12 Caltrans districts
# Use dissolve to combine the 12 district polygons into 
# 1 large CA polygon 
districts.plot()

In [None]:
ca = districts.dissolve()
ca.plot()

In [None]:
ca

Look at the gdf returned. Why does it say `DISTRICT==1`, yet the entire CA boundary is shown?

[Docs explanation](https://geopandas.org/en/stable/docs/user_guide/aggregation_with_dissolve.html). By default, the first value is kept.

Instead, let's count how many districts there are and sum up the area.

In [None]:
districts2 = districts.assign(
    state = "CA"
) 
districts2.head(2)

In [None]:
districts2[["state", "Shape__Area", "geometry"]].head(2)

In [None]:
ca2 = (districts2[["state", "DISTRICT", "Shape__Area", "geometry"]]
 .dissolve(
    by=["state"], 
     aggfunc={
        "DISTRICT": "count",
        "Shape__Area": "sum"})
 .reset_index()
)

ca2

In [None]:
ca2.plot()

## Clipping

Clipping is a technique to narrow down one gdf by the boundaries of another gdf. The other gdf is called the mask. 

[Docs](https://geopandas.org/en/stable/gallery/plot_clip.html) here and [here](https://geopandas.org/en/stable/docs/reference/api/geopandas.clip.html).

Examples:
* which transit stops fall within District 7?
* find Amtrak routes within CA (cut away the lines that fall outside of CA, but keep the lines that fall within CA)

| current gdf | current geometry type | mask gdf    | 
|-------------|-----------------------|-------------|
| stops       | point                 | district    | 
| routes      | line                  | state       |
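
Both rows of the table can be sketched with toy data. The mask, stops, and route below are hypothetical, with made-up coordinates in meters.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, box

# Hypothetical mask: a 10 x 10 square
mask = gpd.GeoDataFrame(geometry=[box(0, 0, 10, 10)], crs="EPSG:3310")

toy_stops = gpd.GeoDataFrame(
    {"stop_id": ["inside", "outside"]},
    geometry=[Point(5, 5), Point(15, 5)],
    crs="EPSG:3310",
)
toy_routes = gpd.GeoDataFrame(
    {"route": ["r1"]},
    geometry=[LineString([(-10, 5), (20, 5)])],
    crs="EPSG:3310",
)

# Points: rows that fall outside the mask are dropped
stops_clipped = toy_stops.clip(mask)
print(list(stops_clipped.stop_id))

# Lines: the geometry itself is cut at the mask boundary
routes_clipped = toy_routes.clip(mask)
length_before = toy_routes.geometry.length.iloc[0]
length_after = routes_clipped.geometry.length.iloc[0]
print(length_before, length_after)
```

For points, clipping only filters rows. For lines (and polygons), clipping also changes the geometry: the 30 meter route is trimmed to the 10 meters that lie inside the mask.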


In [None]:
amtrak_stops = (stops[
    (stops.route_type.isin(['0', '1', '2'])) & 
    (stops.agency=="Amtrak")]    
    [["agency", "stop_id", "stop_name", "geometry"]]
    .reset_index(drop=True)
)

In [None]:
amtrak_stops.to_crs("EPSG:4326").clip(
    ca.to_crs("EPSG:4326")
).plot()

## Spatial Join

You are asking questions about your current gdf, but require information from another gdf. A spatial join allows you to attach columns from another gdf to the current gdf.

Spatial joins do not change the **values** in the `geometry` column. The `how=` and `predicate=` arguments determine **which rows** are kept and **which geometry** column is kept. Unlike `gpd.overlay()`, a spatial join does not change the contents of the geometry.

* Which county does this stop belong in?
* Which bus routes run in District 7?
* Which state does this district belong in? 

| current gdf | current geometry type | another gdf | concept            |
|-------------|-----------------------|-------------|--------------------|
| stops       | point                 | county      | point-in-polygon   |
| highways    | line                  | district    | line-in-polygon    |
| districts   | polygon               | state       | polygon-in-polygon |

In [None]:
# There are 789 stops!
amtrak_stops.shape

In [None]:
amtrak_stops.plot()

It looks like Amtrak stops across the US are shown.

To find which ones are located in CA, we are asking a point-in-polygon question. Which Amtrak stop (point) falls in CA (polygon)?

A spatial join can tell us this.

### Explore the various `how=` and `predicate=` arguments
* Read [docs](https://geopandas.org/en/stable/docs/user_guide/set_operations.html)
* [predicate and how explanation](https://geopandas.org/en/stable/docs/user_guide/mergingdata.html)


#### predicate = intersects / within 
* `predicate` specifies the spatial relationship that must hold between a geometry in the left gdf and a geometry in the right gdf for the rows to be joined. Is the point **within** the polygon? Does the point **intersect** the polygon?
* `intersects` is the most common predicate. 
* For lines, `intersects` and `within` can give different results. 
* A line can intersect a polygon even if it does not fall completely within that polygon.
* For the other predicates, `contains`, `touches`, `crosses`, `overlaps`, there is a lot more nuance about how much of the interior and exterior interact. This would matter when comparing lines and polygons or polygons and polygons.
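
A small sketch of how `intersects` and `within` differ for lines. The zone and the three routes are hypothetical toy geometries.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

# Hypothetical polygon and lines; coordinates are made up
polygons = gpd.GeoDataFrame(
    {"zone": ["z1"]}, geometry=[box(0, 0, 10, 10)], crs="EPSG:3310"
)
lines = gpd.GeoDataFrame(
    {"route": ["inside", "crossing", "outside"]},
    geometry=[
        LineString([(2, 2), (8, 8)]),      # entirely inside the polygon
        LineString([(5, 5), (15, 5)]),     # starts inside, ends outside
        LineString([(20, 20), (30, 30)]),  # never touches the polygon
    ],
    crs="EPSG:3310",
)

hits_intersects = gpd.sjoin(
    lines, polygons, how="inner", predicate="intersects")
hits_within = gpd.sjoin(
    lines, polygons, how="inner", predicate="within")

print(sorted(hits_intersects.route))
print(sorted(hits_within.route))
```

The `crossing` route is kept by `intersects` but dropped by `within`, because part of it lies outside the polygon.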

#### how = inner / left / right
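* `inner`: inner join keeps only the rows of the left gdf that meet the predicate requirements (ex: point does intersect with the polygon. points that do not intersect are dropped in the resulting gdf). The geometry of the left gdf is kept.
* `left`: left join keeps all the rows of the left gdf. Rows that do not meet the predicate requirements hold missing values in the columns that come from the right gdf. The geometry of the left gdf is kept.
* `right`: right join keeps all the rows of the right gdf, along with the attributes of any matching rows from the left gdf. The geometry of the right gdf is kept.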
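
The three `how=` options can be compared side by side on toy data. This is a sketch with hypothetical points and polygons: one point falls in zone `a`, one point falls in no zone, and zone `b` contains no point.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical data; coordinates are made up
points = gpd.GeoDataFrame(
    {"stop_id": ["in_a", "nowhere"]},
    geometry=[Point(1, 1), Point(50, 50)],
    crs="EPSG:3310",
)
polygons = gpd.GeoDataFrame(
    {"zone": ["a", "b"]},
    geometry=[box(0, 0, 2, 2), box(10, 10, 12, 12)],
    crs="EPSG:3310",
)

results = {
    how: gpd.sjoin(points, polygons, how=how, predicate="intersects")
    for how in ["inner", "left", "right"]
}

for how, joined in results.items():
    # number of rows kept, and which geometry type the result carries
    print(how, len(joined), set(joined.geometry.geom_type))
```

`inner` keeps only the matching point, `left` keeps both points, and `right` keeps both polygons and carries polygon geometry instead of point geometry.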

In [None]:
# a spatial join can tell you which stops fall in CA
# make sure the CRS is the same for both gdfs
s1 = gpd.sjoin(
    amtrak_stops.to_crs("EPSG:2229"),
    ca2.to_crs("EPSG:2229"),
    how = "inner",
    predicate = "intersects"
)

s1.shape

In [None]:
s1.plot()

In [None]:
# a spatial join can tell you which stops fall in CA
# if we do a left join, then we keep all the points even if they
# do not intersect with CA
s2 = gpd.sjoin(
    amtrak_stops.to_crs("EPSG:2229"),
    ca2.to_crs("EPSG:2229"),
    how = "left",
    predicate = "intersects"
)

s2.shape

In [None]:
# Rows (stops) that do not intersect with CA hold missing values
# for `state` and `DISTRICT`,
# while rows that do intersect with CA hold non-missing values
s2.tail()

In [None]:
s2.plot()

In [None]:
# A right join keeps all the rows of the right gdf (ca2) and uses 
# the geometry from the right. Here, each stop that intersects with CA
# becomes a row that holds the CA polygon as its geometry

s3 = gpd.sjoin(
    amtrak_stops.to_crs("EPSG:2229"),
    ca2.to_crs("EPSG:2229"),
    how = "right",
    predicate = "intersects"
)

s3.shape

In [None]:
s3.plot()

## Overlay

**Recall**

Spatial joins do not change the **values** in the `geometry` column. The `how=` and `predicate=` arguments determine **which rows** are kept and **which geometry** column is kept. Unlike `gpd.overlay()`, a spatial join does not change the contents of the geometry.

Overlays **change the values** in the `geometry` column.
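
The contrast can be seen on two overlapping toy squares. This is a sketch with hypothetical data: the `name` and `zone` columns and the coordinates are made up.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical 4 x 4 squares that overlap in a 2 x 2 corner
big = gpd.GeoDataFrame(
    {"name": ["big"]}, geometry=[box(0, 0, 4, 4)], crs="EPSG:3310"
)
zone = gpd.GeoDataFrame(
    {"zone": ["z"]}, geometry=[box(2, 2, 6, 6)], crs="EPSG:3310"
)

# Spatial join: attaches the `zone` column, geometry stays the full square
joined = gpd.sjoin(big, zone, how="inner", predicate="intersects")
joined_area = joined.geometry.area.iloc[0]

# Overlay: attaches the `zone` column AND cuts the geometry to the overlap
overlaid = gpd.overlay(big, zone, how="intersection")
overlaid_area = overlaid.geometry.area.iloc[0]

print(joined_area, overlaid_area)
```

Both results hold the `name` and `zone` columns, but only the overlay result has a smaller geometry than the original square.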

Polygons are the easiest to demonstrate these concepts, but typically, `overlay` can be used with lines or polygons.

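#### how = intersection / union / symmetric_difference / difference / identity
* `intersection` is the most common. It keeps only the area shared by both gdfs.
* `difference` might be of interest. It keeps the part of the first gdf that does not overlap with the second gdf, so the order of the gdfs matters.
* `symmetric_difference` is rare. This removes the middle intersection in the Venn diagram and keeps the non-overlapping parts of both gdfs.
* `union` keeps all the areas from both gdfs, split where they overlap.
* `identity` is rare. It keeps the first gdf, split where it overlaps with the second gdf.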
* Look carefully at what columns are kept. If there are columns that aren't necessary, remove those columns.
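
One way to get a feel for the `how=` options is to compare total areas on two overlapping toy squares. This is a sketch with hypothetical data; the `left_id` and `right_id` columns are made up.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical 4 x 4 squares that overlap in a 2 x 2 corner
a = gpd.GeoDataFrame(
    {"left_id": [1]}, geometry=[box(0, 0, 4, 4)], crs="EPSG:3310"
)
b = gpd.GeoDataFrame(
    {"right_id": [1]}, geometry=[box(2, 2, 6, 6)], crs="EPSG:3310"
)

options = ["intersection", "union", "identity",
           "symmetric_difference", "difference"]

# Total area of the result for each option
areas = {
    how: gpd.overlay(a, b, how=how).geometry.area.sum()
    for how in options
}

for how, area in areas.items():
    print(how, area)
```

Each square has an area of 16 and the shared corner has an area of 4, so `intersection` gives 4, `difference` gives 12 (the first square minus the corner), `symmetric_difference` gives 24, `identity` gives 16, and `union` gives 28.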

In [None]:
d7 = districts[districts.DISTRICT==7]

intersection_overlay = gpd.overlay(
    ca2,
    d7,
    how = "intersection", 
    keep_geom_type=True
)

display(intersection_overlay)
intersection_overlay.plot()

In [None]:
intersection_overlay2 = gpd.overlay(
    ca2,
    d7[["DISTRICT", "geometry"]],
    how = "intersection",
    keep_geom_type=True
)

display(intersection_overlay2)
intersection_overlay2.plot()

In [None]:
print(f"area before overlay: {ca.geometry.iloc[0].area}")
print(f"area after overlay: {intersection_overlay2.geometry.iloc[0].area}")

In [None]:
difference_overlay = gpd.overlay(
    ca2,
    d7[["DISTRICT", "geometry"]],
    how = "difference",
    keep_geom_type=True
)

display(difference_overlay)
difference_overlay.plot()

In [None]:
# From the D7 polygon, remove the part that is in CA.
# D7 is entirely within CA, which is why it plots basically nothing
difference_overlay2 = gpd.overlay(
    d7[["DISTRICT", "geometry"]],
    ca2,
    how = "difference",
    keep_geom_type=True
)

display(difference_overlay2)
difference_overlay2.plot()