# Spatial Queries

When working with geospatial data, our goal is not just to make nice maps, but to run analyses that leverage the explicitly spatial nature of our data. This process is known as **spatial analysis**.

To construct spatial analyses, we string together series of spatial operations in such a way that the end result answers our question of interest. There are many such spatial operations. These are known as **spatial queries**.

These queries can be divided into:

- **Measurement queries**
    - What is feature A's **length**?
    - What is feature A's **area**?
    - What is feature A's **perimeter**?
    - What is feature A's **distance** from feature B?
- **Relationship queries**
    - Is feature A **within** feature B?
    - Does feature A **intersect** with feature B?
    - Does feature A **cross** feature B?
    
Spatial queries are not limited to the examples we've shown here.

We'll work through examples of each of those types of queries. Then, we'll see an example of a very common spatial analysis that is a conceptual amalgam of those two types: **proximity analysis**.
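Before turning to real data, here is a minimal sketch of both query types on toy geometries. It uses Shapely (the geometry library underneath GeoPandas) directly, and the names `square`, `road`, `inside_pt`, and `outside_pt` are made up for illustration. Coordinates are unitless here.

```python
from shapely.geometry import LineString, Point, Polygon

# Toy geometries (hypothetical, unitless coordinates)
square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
road = LineString([(-5, 5), (15, 5)])
inside_pt = Point(5, 5)
outside_pt = Point(13, 14)

# Measurement queries
square_area = square.area                       # area of the polygon
square_perimeter = square.length                # for polygons, length is the perimeter
road_length = road.length                       # length of the line
gap = square.distance(outside_pt)               # shortest distance between the two

# Relationship queries (each returns True or False)
pt_within = inside_pt.within(square)            # point lies in the polygon's interior
road_intersects = road.intersects(square)       # geometries share at least one point
road_crosses = road.crosses(square)             # line passes through the polygon
far_within = outside_pt.within(square)

print(square_area, square_perimeter, road_length, gap)
print(pt_within, road_intersects, road_crosses, far_within)
```

The same attribute and method names (`area`, `length`, `distance`, `within`, `intersects`, `crosses`) are available on a GeoPandas `GeoSeries`, where they operate on every row at once.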

<!--
- Expected time to complete
    - Lecture + Questions: 45 minutes
    - Exercises: 20 minutes
-->

In [None]:
import pandas as pd
import geopandas as gpd

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline  

## Load and Prep Some Data

Let's read in our Census tracts data again:

In [None]:
census_tracts = gpd.read_file("zip://../data/census/Tracts/cb_2013_06_tract_500k.zip")
census_tracts.plot()

In [None]:
census_tracts.head()

Then, we'll grab just the Alameda County tracts.

In [None]:
census_tracts_ac = census_tracts.loc[census_tracts['COUNTYFP'] == '001'].reset_index(drop=True)
census_tracts_ac.plot()

## Measurement Queries

We'll start off with some simple measurement queries.

For example, here's how we can get the areas of each of our Census tracts:

In [None]:
census_tracts_ac.area

Okay! We got...numbers?

What do those numbers mean? What are our units? And if we're not sure, how might we find out?

Let's take a look at our CRS:

In [None]:
census_tracts_ac.crs

Ah-ha! We're working in an unprojected CRS, with units of decimal degrees.

**When doing spatial analysis, we will almost always want to work in a projected CRS that has natural distance units, such as meters!**
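To make the difference concrete, here is a small sketch that measures the same shape in both kinds of CRS. The box coordinates are a hypothetical 0.01 by 0.01 degree cell near Berkeley, chosen only for illustration. GeoPandas will likely emit a warning on the first `.area` call, which is exactly the point.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical 0.01 x 0.01 degree cell near Berkeley (lon/lat order)
cell = gpd.GeoSeries([box(-122.28, 37.86, -122.27, 37.87)], crs="EPSG:4326")

# In an unprojected CRS the "area" is in square degrees, not a physical unit
area_deg2 = cell.area.iloc[0]

# After projecting to UTM Zone 10N (NAD83), the area is in square meters
area_m2 = cell.to_crs("EPSG:26910").area.iloc[0]

print(f"{area_deg2:.6f} square degrees")
print(f"{area_m2:,.0f} square meters")
```

The first number is tiny and has no physical interpretation, while the second is on the order of one square kilometer.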

Time to project!

As previously, we'll use UTM Zone 10N with a NAD83 datum. This is a good choice for our region of interest.

In [None]:
census_tracts_ac_utm10 = census_tracts_ac.to_crs("epsg:26910")

In [None]:
census_tracts_ac_utm10.crs

Now, let's try our area calculation again.

In [None]:
census_tracts_ac_utm10.area

That looks much more reasonable! What are our units, now?

You may have noticed that our Census tracts already have an area column in them.

Let's do a confidence check on our results.

In [None]:
# Calculate the area for the 0th feature
census_tracts_ac_utm10.area[0]

In [None]:
# Get the area for the 0th feature according to its 'ALAND' attribute
census_tracts_ac_utm10['ALAND'][0]

In [None]:
# Check equivalence of the calculated areas and the 'ALAND' column
census_tracts_ac_utm10['ALAND'].values == census_tracts_ac_utm10.area

What explains this disagreement? Are the calculated areas incorrect?
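One thing to keep in mind is that an exact `==` test between calculated floating-point areas and a stored attribute will almost never return `True`, even when the values are practically the same. Other likely contributors are that `ALAND` counts land area only, and that cartographic boundary files have simplified outlines. The sketch below uses made-up numbers (not the real tract values) to show a tolerance-based comparison with NumPy; the 1% tolerance is an arbitrary choice.

```python
import numpy as np

# Hypothetical areas in square meters, for illustration only
calculated = np.array([1_105_329.6, 2_451_870.2, 980_112.9])
reported = np.array([1_105_329, 2_440_000, 1_250_000])

# Exact equality fails even for the nearly identical first pair
exact = calculated == reported

# Relative difference shows how far apart each pair really is
rel_diff = np.abs(calculated - reported) / reported

# Treat values within 1% of the reported figure as "close enough"
close = np.isclose(calculated, reported, rtol=0.01)

print(exact)
print(rel_diff.round(4))
print(close)
```

With real data, the same pattern would apply to `census_tracts_ac_utm10.area` and the `ALAND` column, and the rows flagged as not close are the ones worth inspecting on a map.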

We can also sum the area for Alameda county by adding `.sum()` to the end of our area calculation:

In [None]:
census_tracts_ac_utm10.area.sum()

We can actually look up how large Alameda County is to check our work. The county is 739 miles<sup>2</sup>, which is around 1,914,001,213 meters<sup>2</sup>. I'd say we're pretty close!
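The conversion from square miles to square meters can be checked with a few lines of plain Python. The helper name `sq_miles_to_sq_meters` is made up for this sketch; the only fact it relies on is that one international mile is defined as exactly 1609.344 meters.

```python
METERS_PER_MILE = 1609.344  # exact, by definition of the international mile


def sq_miles_to_sq_meters(sq_miles):
    """Convert an area in square miles to square meters."""
    return sq_miles * METERS_PER_MILE ** 2


county_m2 = sq_miles_to_sq_meters(739)
print(f"{county_m2:,.0f} square meters")
```

Dividing a summed area by `METERS_PER_MILE ** 2` does the reverse conversion, which is handy for comparing against published figures.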

As it turns out, we can similarly use another attribute to get the features' lengths.

**NOTE**: In this case, given we're dealing with polygons, this is equivalent to getting the features' perimeters.

In [None]:
census_tracts_ac_utm10.length

## Relationship Queries

[Spatial relationship queries](https://en.wikipedia.org/wiki/Spatial_relation) consider how two geometries or sets of geometries relate to one another in space. 

<img src="https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" height="300px"></img>

Here is a list of the most commonly used GeoPandas methods to test spatial relationships:

- [within](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.within)
- [contains](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.contains) (the inverse of `within`)
- [intersects](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.intersects)

There are several other GeoPandas spatial relationship predicates, but they are more complex to properly employ. For example, the following two operations only work with geometries that are completely aligned.

- [touches](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.touches)
- [equals](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.equals)

All of these methods take the form:

    GeoSeries.<predicate>(geometry)
    
For example:

    GeoSeries.contains(geometry)
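As a small sketch of this pattern on toy data (the names `zone` and `pts` are hypothetical), each predicate returns a boolean Series with one value per row, which can then be used to subset. The third point sits exactly on the boundary to show how `within` and `intersects` differ.

```python
import geopandas as gpd
from shapely.geometry import Point, box

zone = box(0, 0, 10, 10)  # hypothetical study area
pts = gpd.GeoSeries([Point(5, 5), Point(20, 20), Point(10, 5)])

# Element-wise tests: one True/False per point
inside = pts.within(zone)          # boundary points do not count as within
touching = pts.intersects(zone)    # boundary points do count as intersecting

# contains is the inverse of within: ask the polygon about the point
zone_contains = gpd.GeoSeries([zone]).contains(Point(5, 5))

# Boolean Series work as masks for subsetting
subset = pts[inside]

print(inside.tolist())
print(touching.tolist())
print(zone_contains.tolist())
```

Choosing between `within` and `intersects` matters most for features that sit on or straddle a boundary.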

Let's load a new dataset to demonstrate these queries.

This is a dataset containing all the protected areas (parks and the like) in California.

In [None]:
parks = gpd.read_file('../data/protected_areas/CPAD_2020a_Units.shp')

Does this need to be reprojected too?

In [None]:
parks.crs

Yes it does!

Let's reproject it.

In [None]:
parks_utm10 = parks.to_crs("epsg:26910")

One common use for spatial queries is for spatial subsetting of data.

In our case, let's use `intersects` to find all of the parks that have land in Alameda County.

But before we do that, let's take another look at our geometries.

In [None]:
census_tracts_ac_utm10.geometry.type.unique()

In [None]:
census_tracts_ac_utm10.plot()

Because we have Census tracts, each of these rows is either a Polygon or a MultiPolygon. For our relationship query, we can actually simplify our geometry to be one polygon by using `unary_union`:

In [None]:
census_tracts_ac_utm10.geometry.unary_union

In [None]:
print(census_tracts_ac_utm10.geometry.unary_union)
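To see what a unary union does on a small scale, here is a sketch with three toy squares. It calls Shapely's `shapely.ops.unary_union` function directly rather than the GeoPandas attribute, and the names `left`, `right`, and `apart` are made up for illustration.

```python
from shapely.geometry import box
from shapely.ops import unary_union

left = box(0, 0, 1, 1)
right = box(1, 0, 2, 1)   # shares an edge with `left`
apart = box(5, 5, 6, 6)   # does not touch the others

# Adjacent shapes dissolve into a single Polygon
merged = unary_union([left, right])

# Disjoint shapes are returned together as one MultiPolygon
scattered = unary_union([left, apart])

print(merged.geom_type, merged.area)
print(scattered.geom_type, len(scattered.geoms))
```

Either way the result is a single geometry object, which is what a predicate such as `intersects` expects as its argument. Recent GeoPandas releases also offer a `union_all()` method for the same purpose, so check the documentation for your installed version.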

Now, we can go ahead and conduct our operation `intersects`:

In [None]:
parks_in_ac = parks_utm10.intersects(census_tracts_ac_utm10.geometry.unary_union)

`intersects` returns a boolean Series with one value per park. If we use it to subset `parks_utm10` and scroll the resulting GeoDataFrame to the right, we'll see that the `COUNTY` column gives us a good confidence check on our results.

In [None]:
parks_in_ac

In [None]:
parks_utm10[parks_in_ac].head()

So does this overlay plot!

In [None]:
# Plot Census tracts
ax = census_tracts_ac_utm10.plot(color='gray', figsize=(12, 16))
# Plot parks
parks_utm10[parks_in_ac].plot(ax=ax,
                              column='ACRES',
                              cmap='summer',
                              legend=True,
                              edgecolor='black',
                              linewidth=0.4, 
                              alpha=0.8,
                              legend_kwds={'label': "acres", 'orientation': "horizontal"})
ax.set_title('Protected areas in Alameda County, colored by area', size=18)

---

### Challenge 1: Spatial Relationship Query

Let's use a spatial relationship query to create a new dataset containing Berkeley schools!

Run the next two cells to load datasets containing Berkeley's city boundary and Alameda County's schools and to reproject them to EPSG:26910.

Then in the following cell, write your own code to:

1. Subset the schools for only those `within` Berkeley, and save the result as `berkeley_schools` (we'll use it later in this notebook).
2. Plot the Berkeley boundary and then the schools as an overlay map.

---

In [None]:
# Load the Berkeley boundary
berkeley = gpd.read_file("../data/berkeley/BerkeleyCityLimits.shp")
# Transform to EPSG:26910
berkeley_utm10 = berkeley.to_crs("epsg:26910")
# Look at GeoDataFrame
berkeley_utm10.head()

In [None]:
# Load the Alameda County schools CSV
schools_df = pd.read_csv('../data/alco_schools.csv')
# Convert it to a GeoDataFrame
schools_gdf = gpd.GeoDataFrame(schools_df, 
                               geometry=gpd.points_from_xy(schools_df.X, schools_df.Y))
# Define its unprojected (EPSG:4326) CRS
schools_gdf.crs = "epsg:4326"
# Transform to EPSG:26910
schools_gdf_utm10 = schools_gdf.to_crs("epsg:26910")
# Look at GeoDataFrame
schools_gdf_utm10.head()

In [None]:
# YOUR CODE HERE


## Proximity Analysis

Now that we've seen the basic idea of spatial measurement and relationship queries, let's take a look at a common analysis that combines those concepts: **proximity analysis**.

Proximity analysis seeks to identify all features in a focal feature set that are within some maximum distance of features in a reference feature set.

A common workflow for this analysis is:

1. Buffer (i.e. add a margin around) the reference dataset, out to the maximum distance.
2. Run a spatial relationship query to find all focal features that intersect (or are within) the buffer.
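The two steps above can be sketched on toy data as follows. The geometries and the names `reference` and `focal` are hypothetical, coordinates are assumed to be in meters, and the buffers are merged with Shapely's `unary_union` function.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point
from shapely.ops import unary_union

# Reference features: two hypothetical streets, 1 km long and 1 km apart
reference = gpd.GeoSeries([
    LineString([(0, 0), (1000, 0)]),
    LineString([(0, 1000), (1000, 1000)]),
])

# Focal features: three hypothetical schools
focal = gpd.GeoSeries([Point(500, 150), Point(500, 500), Point(1100, 1000)])

# Step 1: buffer the reference features out to the maximum distance
buffers = reference.buffer(200)

# Merge the per-feature buffers into a single geometry
buffer_zone = unary_union(buffers.tolist())

# Step 2: relationship query, then subset with the boolean result
near = focal.within(buffer_zone)
nearby = focal[near]

print(near.tolist())
```

The middle school is 500 m from either street, so it falls outside both buffers, while the other two are kept.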

Let's read in our bike boulevard data again. We'll find out which of our Berkeley schools are within a block's distance (200 m) of the boulevards.

In [None]:
bike_blvds = gpd.read_file('../data/transportation/BerkeleyBikeBlvds.geojson')
bike_blvds.plot()

Of course, we need to reproject the boulevards to our projected CRS.

In [None]:
bike_blvds_utm10 = bike_blvds.to_crs("epsg:26910")

Now we can create our 200 meter bike boulevard buffers.

In [None]:
bike_blvds_utm10.crs

In [None]:
bike_blvds_buf = bike_blvds_utm10.buffer(distance=200)

In [None]:
bike_blvds_buf.head()

Now, let's overlay everything.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
# Plot Berkeley city boundary
berkeley_utm10.plot(color='lightgrey', ax=ax)
# Plot buffer
bike_blvds_buf.plot(color='pink', ax=ax, alpha=0.5)
# Plot bicycle boulevards
bike_blvds_utm10.plot(ax=ax)
# Plot Berkeley schools
berkeley_schools.plot(color='purple',ax=ax)

Great! Looks like we're all ready to run our intersection to complete the proximity analysis.

**NOTE**: In order to subset with our buffers, we need to call the `unary_union` attribute of the buffer object. This gives us a single unified polygon, rather than a series of separate polygons, one buffering each boulevard feature.

In [None]:
schools_near_blvds = berkeley_schools.within(bike_blvds_buf.unary_union)
blvd_schools = berkeley_schools[schools_near_blvds]

Now let's overlay again, to see if the schools we subsetted make sense.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
# Plot Berkeley city boundary
berkeley_utm10.plot(color='lightgrey', ax=ax)
# Plot buffer
bike_blvds_buf.plot(color='pink', ax=ax, alpha=0.5)
# Plot bicycle boulevards
bike_blvds_utm10.plot(ax=ax)
# Plot Berkeley schools
berkeley_schools.plot(color='purple',ax=ax)
# Plot schools within buffer 
blvd_schools.plot(color='yellow', markersize=50, ax=ax)

If we want to find the shortest distance from each school to the bike boulevards, we can use the `distance` method.

In [None]:
berkeley_schools.distance(bike_blvds_utm10.unary_union)
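Because `distance` returns one value per feature, it also offers a second route to a proximity subset: compare the distances to a threshold instead of building a buffer. A minimal sketch on toy data, with hypothetical names and coordinates assumed to be in meters:

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

line = LineString([(0, 0), (1000, 0)])  # hypothetical boulevard
pts = gpd.GeoSeries([Point(500, 150), Point(500, 250), Point(1300, 400)])

# Shortest distance from each point to the line
dist = pts.distance(line)

# Boolean mask of points no farther than 200 m from the line
near = dist <= 200
nearby = pts[near]

print(dist.tolist())
print(near.tolist())
```

This keeps the actual distances around for later use, such as sorting or plotting, whereas the buffer approach only answers the yes/no question. For features very close to the threshold the two approaches may disagree slightly, since buffers approximate curved edges with short straight segments.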

---

### Challenge 2: Proximity Analysis

Now it's your turn to try out a proximity analysis!

Run the next cell to load BART-system data, reproject it to EPSG:26910, and subset it to Berkeley.

Then in the following cell, write your own code to find all schools within walking distance (1 km) of a BART station.

As a reminder, let's break this into steps:

1. Buffer your Berkeley BART stations to 1 km (**HINT**: remember your units!).
2. Use the schools' `within` method to check whether or not they're within the buffers (**HINT**: don't forget the `unary_union`!).
3. Subset the Berkeley schools using the object returned by your spatial relationship query.
4. As always, plot your results for a good visual check!

---

In [None]:
# Load the BART stations from CSV
bart_stations = pd.read_csv('../data/transportation/bart.csv')
# Convert to a GeoDataFrame
bart_stations_gdf = gpd.GeoDataFrame(bart_stations, 
                                     geometry=gpd.points_from_xy(bart_stations.lon, bart_stations.lat))
# Define its unprojected (EPSG:4326) CRS
bart_stations_gdf.crs = "epsg:4326"
# Transform to UTM Zone 10 N (EPSG:26910)
bart_stations_gdf_utm10 = bart_stations_gdf.to_crs("epsg:26910")
# Subset to Berkeley
berkeley_bart = bart_stations_gdf_utm10[bart_stations_gdf_utm10.within(berkeley_utm10.unary_union)]

In [None]:
# YOUR CODE HERE
