# Attribute and Spatial Joins

Now that we understand the logic of spatial relationship queries, let's take a look at another fundamental spatial operation that relies on them.

This operation, called a **spatial join**, is the process by which we can leverage the spatial relationships between distinct datasets to merge their information into a new, synthetic dataset.

This operation can be thought of as the spatial equivalent of an **attribute join**, in which multiple tabular datasets are merged by aligning matching values in a column that they have in common. If you've done data wrangling in Python with pandas, you've probably performed an attribute join at some point!

Thus, we'll start by developing an understanding of this operation first!

<!-- 
- Expected time to complete
    - Lecture + Questions: 45 minutes
    - Exercises: 20 minutes
-->

In [None]:
import pandas as pd
import geopandas as gpd

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline  

## Data Input and Preparation

Let's read in a table of data from the US Census' 5-year American Community Survey (ACS5).

In [None]:
# Read in the ACS5 data for CA into a pandas DataFrame.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5_df = pd.read_csv("../data/census/ACS5yr/census_variables_CA.csv", dtype={'FIPS_11_digit': str})
acs5_df.head()
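To see why the `dtype` argument matters, here is a minimal sketch on a made-up two-row CSV (the values are hypothetical and only mimic the shape of the FIPS column). Without the `dtype` hint, pandas parses the codes as integers and the leading zero disappears, so they would no longer match string IDs in another table.

```python
import io
import pandas as pd

# Hypothetical CSV text that mimics the FIPS column in the ACS file
csv_text = "FIPS_11_digit,c_race\n06001400100,3000\n06001400200,2000\n"

# Default parsing: the codes are read as integers
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps all 11 characters
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS_11_digit": str})

print(as_default["FIPS_11_digit"].tolist())  # integers, leading zero lost
print(as_string["FIPS_11_digit"].tolist())   # strings, leading zero kept
```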

**Brief summary of the data**:

The table below describes the variables in this dataset. They were combined from different ACS 5-year tables.

A few things to note:
- Variables that start with `c_` are counts.
- Variables that start with `med_` are medians.
- Variables that end in `_moe` are margin of error estimates.
- Variables that start with `p_` are proportions calculated from the counts divided by the table denominator (the total count for whom that variable was assessed).


| Variable        | Description                                 |
|-----------------|---------------------------------------------|
| `c_race`        | Total population                            |
| `c_white`       | Total white non-Latinx                      |
| `c_black`       | Total black and African American non-Latinx |
| `c_asian`       | Total Asian non-Latinx                      |
| `c_latinx`      | Total Latinx                                |
| `state_fips`    | State level FIPS code                       |
| `county_fips`   | County level FIPS code                      |
| `tract_fips`    | Tract level FIPS code                       |
| `med_rent`      | Median rent                                 |
| `med_hhinc`     | Median household income                     |
| `c_tenants`     | Total tenants                               |
| `c_owners`      | Total owners                                |
| `c_renters`     | Total renters                               |
| `c_movers`      | Total number of people who moved            |
| `c_stay`        | Total number of people who stayed           |
| `c_movelocal`   | Number of people who moved locally          |
| `c_movecounty`  | Number of people who moved counties         |
| `c_movestate`   | Number of people who moved states           |
| `c_moveabroad`  | Number of people who moved abroad           |
| `c_commute`     | Total number of commuters                   |
| `c_car`         | Number of commuters who use a car           |
| `c_carpool`     | Number of commuters who carpool             |
| `c_transit`     | Number of commuters who use public transit  |
| `c_bike`        | Number of commuters who bike                |
| `c_walk`        | Number of commuters who walk                |
| `year`          | ACS data year                               |
| `FIPS_11_digit` | 11-digit FIPS code                          |


We're going to drop all of our margin-of-error columns, which are the ones that end with `_moe`. We can do that in two steps: first we use `filter` to identify the columns whose names contain the string `_moe`, then we `drop` them.

In [None]:
moe_cols = acs5_df.filter(like='_moe', axis=1).columns
moe_cols

In [None]:
acs5_df.drop(moe_cols, axis=1, inplace=True)
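One detail worth knowing: `filter(like=...)` matches the string anywhere in a column name, not only at the end. The sketch below uses invented column names (`c_moe_flag` is hypothetical) to show the difference, and uses a regular expression as one possible way to match only names that end in `_moe`.

```python
import pandas as pd

# Hypothetical column names; 'c_moe_flag' is invented to show the difference
toy = pd.DataFrame(columns=["c_race", "c_race_moe", "med_rent_moe", "c_moe_flag"])

# like= matches '_moe' anywhere in the name
contains_moe = list(toy.filter(like="_moe", axis=1).columns)

# regex= with '$' matches only names that end in '_moe'
ends_with_moe = list(toy.filter(regex="_moe$", axis=1).columns)

# Dropping without inplace=True returns a new DataFrame instead
trimmed = toy.drop(columns=ends_with_moe)

print(contains_moe)
print(ends_with_moe)
print(list(trimmed.columns))
```

In our ACS table the two approaches should select the same columns, as long as no other column name happens to contain `_moe` in the middle.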

And lastly, let's grab only the rows for year 2018 and county FIPS code 1 (Alameda County).

In [None]:
acs5_df_ac = acs5_df[(acs5_df['year'] == 2018) & (acs5_df['county_fips'] == 1)]

Now, let's read in the Census tracts again!

In [None]:
tracts_gdf = gpd.read_file("zip://../data/census/Tracts/cb_2013_06_tract_500k.zip")

In [None]:
tracts_gdf.head()

In [None]:
# Pull out the tracts within Alameda county
tracts_gdf_ac = tracts_gdf[tracts_gdf['COUNTYFP'] == '001']
tracts_gdf_ac.plot()
plt.show()

## Attribute Joins

We just mapped the Census tracts. But what makes a map powerful is when you map the data associated with the locations within the map. So, let's take a stab at performing an attribute join between the GeoDataFrame of tracts, and the DataFrame of ACS data.

Why do we need to do this? Let's reflect on the current DataFrames we're working with:

- `tracts_gdf_ac` contains polygon data in a GeoDataFrame. However, as we saw in the `head` of that dataset, there are no attributes of interest!
- `acs5_df_ac` contains 2018 ACS data from a CSV file (`census_variables_CA.csv`), imported and read in as a pandas DataFrame. However, it has no geometries!

In order to map the ACS data we need to associate it with the tracts. Let's do that now, by joining the columns from `acs5_df_ac` to the columns of `tracts_gdf_ac` using a common column as the key for matching rows. This process is called an **attribute join**. There are several ways we can go about performing this join.

<img src="https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2017/03/join-types-merge-names.jpg">

The image above gives us a nice conceptual summary of the types of joins we could run.
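To make the join types concrete, here is a small sketch on two made-up tables (all names and values are hypothetical). Each table has one key the other lacks, so every join type returns a different number of rows.

```python
import pandas as pd

# Hypothetical tables: 'A' appears only on the left, 'D' only on the right
tracts = pd.DataFrame({"GEOID": ["A", "B", "C"], "area": [1.0, 2.0, 3.0]})
acs = pd.DataFrame({"FIPS": ["B", "C", "D"], "med_rent": [1500, 1800, 2100]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = tracts.merge(acs, left_on="GEOID", right_on="FIPS", how=how)
    row_counts[how] = len(joined)
    print(f"{how:>5}: {len(joined)} rows")
```

The inner join keeps only the two shared keys, the left and right joins each keep three rows, and the outer join keeps all four distinct keys.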

Before we begin, let's reflect on a couple points:

1. In general, why might we choose one type of join over another?
2. In our case, do we want an inner, left, right, or outer (AKA 'full') join? 

> **NOTE**: You can read more about merging in `geopandas` [here](http://geopandas.org/mergingdata.html#attribute-joins).

Now, let's perform the join! Let's take a look at the common column in both our DataFrames.

In [None]:
tracts_gdf_ac['GEOID'].head()

In [None]:
acs5_df_ac['FIPS_11_digit'].head()

Note that they are **not named the same thing**. 
        
That's okay! We just need to know that they contain the same information.

Also note that they are **not in the same order**. 
        
That's not only okay... that's the point! If they were in the same order already then we could just join them side by side, without having Python find and line up the matching rows from each!

Let's do a `left` join to keep all of the Census tracts in Alameda County and only the ACS data for those tracts.

> **NOTE**: To figure out how to do this we could always take a peek at the documentation by calling
`?tracts_gdf_ac.merge`, or `help(tracts_gdf_ac.merge)`.

In [None]:
# Left join keeps all tracts and the ACS data for those tracts
tracts_acs_gdf_ac = tracts_gdf_ac.merge(acs5_df_ac,
                                        left_on='GEOID',
                                        right_on='FIPS_11_digit',
                                        how='left')
tracts_acs_gdf_ac.head(2)

Let's check that our merged dataset now contains all the variables we expect.

In [None]:
list(tracts_acs_gdf_ac.columns)

Confidence check: in this case, how many rows and columns should we have?

In [None]:
print(f"Rows and columns in the Alameda County Census tract GeoDataFrame: {tracts_gdf_ac.shape}")
print(f"Rows and columns in the ACS5 2018 data: {acs5_df_ac.shape}")
print(f"Rows and columns in the Alameda County Census tract GeoDataFrame joined to the ACS data: {tracts_acs_gdf_ac.shape}")
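Beyond comparing shapes, pandas can help with this kind of check directly. The sketch below, on made-up tables, uses the `indicator` and `validate` arguments of `merge`: `validate="one_to_one"` raises an error if a key is duplicated on either side, and `indicator=True` adds a `_merge` column that flags rows with no match. The tables and values are hypothetical.

```python
import pandas as pd

# Hypothetical tables: tract '002' has no ACS row
tracts = pd.DataFrame({"GEOID": ["001", "002", "003"]})
acs = pd.DataFrame({"FIPS_11_digit": ["003", "001"],
                    "med_hhinc": [90000, 60000]})

checked = tracts.merge(acs,
                       left_on="GEOID",
                       right_on="FIPS_11_digit",
                       how="left",
                       validate="one_to_one",  # error if keys repeat on either side
                       indicator=True)         # adds the '_merge' column

# Tracts that found no matching ACS row
unmatched = checked.loc[checked["_merge"] == "left_only", "GEOID"].tolist()
print(checked["_merge"].value_counts())
print(unmatched)
```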

Let's save out our merged data so we can use it in the final notebook.

In [None]:
tracts_acs_gdf_ac.to_file('../data/outdata/tracts_acs_gdf_ac.json', driver='GeoJSON')

---

### Challenge 1: Choropleth Map

We can now make choropleth maps using our attribute-joined GeoDataFrame. Go ahead and pick one variable to color the map, then map it. You can go back to lesson 5 if you need a refresher on how to make this!

---

In [None]:
# YOUR CODE HERE


## Spatial Joins

Great! We've wrapped our heads around the concept of an attribute join. Now, let's extend that concept to its spatially explicit equivalent: the **spatial join**!

To start, we'll read in some other data: The Alameda County schools data. Then, we'll work with that data and our `tracts_acs_gdf_ac` data together.

In [None]:
# Import schools data
schools_df = pd.read_csv('../data/alco_schools.csv')
# Convert to GeoDataFrame
schools_gdf = gpd.GeoDataFrame(schools_df, 
                               geometry=gpd.points_from_xy(schools_df.X, schools_df.Y))
# Define the CRS (this labels the existing coordinates; it does not transform them)
schools_gdf.crs = "epsg:4326"

Let's check if we have to transform the schools to match the `tracts_acs_gdf_ac` CRS.

In [None]:
print(f'schools_gdf CRS: {schools_gdf.crs}')
print(f'tracts_acs_gdf_ac CRS: {tracts_acs_gdf_ac.crs}')

Yes, we do! Let's do that.

Note that we didn't necessarily have to check whether they differ. The syntax below works in all cases, and saves us from typing out the EPSG code ourselves!

In [None]:
schools_gdf = schools_gdf.to_crs(tracts_acs_gdf_ac.crs)

print(f'schools_gdf CRS: {schools_gdf.crs}')
print(f'tracts_acs_gdf_ac CRS: {tracts_acs_gdf_ac.crs}')
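It's easy to mix up *defining* a CRS with *transforming* to one. The sketch below uses a single made-up point to contrast the two: `to_crs` recomputes the coordinates, while `set_crs(..., allow_override=True)` only changes the label and leaves the numbers untouched. The point and the choice of EPSG:3857 are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import Point

# One hypothetical point, given as longitude/latitude
pts = gpd.GeoDataFrame({"name": ["a"]},
                       geometry=[Point(-122.27, 37.87)],
                       crs="EPSG:4326")

# to_crs re-projects: the coordinate values change
pts_3857 = pts.to_crs("EPSG:3857")

# set_crs with allow_override only relabels: the coordinate values stay the same
relabeled = pts.set_crs("EPSG:3857", allow_override=True)

print(pts_3857.geometry.x.iloc[0])
print(relabeled.geometry.x.iloc[0])
```

Both results report the same CRS, but only the first one places the point in the right location. Relabeling data that is really in another CRS silently puts it in the wrong place.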

We're ready to combine the datasets in an analysis. In this case, we want to get data from the Census tract within which each school is located.

How can we do that? The two datasets don't share a common column to use for a join.

In [None]:
tracts_acs_gdf_ac.columns

In [None]:
schools_gdf.columns

However, they do have a shared relationship by way of space! 

So, we'll use a spatial relationship query to figure out the Census tract that each school is in, then associate the tract's data with that school (as additional data in the school's row). This is a **spatial join**!

### Census Tract Data Associated with Each School

In this case, let's say we're interested in the relationship between the median household income in a Census tract (`tracts_acs_gdf_ac['med_hhinc']`) and a school's Academic Performance Index (`schools_gdf['API']`).

To start, let's take a look at the distributions of our two variables of interest.

In [None]:
tracts_acs_gdf_ac.hist('med_hhinc')

In [None]:
schools_gdf.hist('API')

Oh, right! There are schools with no reported APIs (i.e. API == 0)! Let's drop those. We'll do this in the interest of pedagogy for this workshop, but it's also worth keeping in mind: what do we lose by dropping those schools? How might that impact the results of our analysis?

In [None]:
schools_gdf_api = schools_gdf[schools_gdf['API'] > 0]

In [None]:
schools_gdf_api.hist('API')

Now, maybe we think there ought to be some correlation between the two variables? As a first pass at this possibility, let's overlay the two datasets, coloring each one by its variable of interest. This should give us a sense of whether or not similar values co-occur.

In [None]:
ax = tracts_acs_gdf_ac.plot(column='med_hhinc',
                            cmap='cividis',
                            figsize=(18, 18),
                            legend=True,
                            legend_kwds={'label': "median household income ($)",
                                         'orientation': "horizontal"})
schools_gdf_api.plot(column='API',
                     cmap='cividis',
                     edgecolor='black',
                     alpha=1,
                     ax=ax,
                     legend=True,
                     legend_kwds={'label': "API",
                                  'orientation': "horizontal"})
plt.show()

### Spatially Joining the Schools and Census Tracts

Though it's hard to say for sure, it certainly looks possible. It would be ideal to scatter the variables! But in order to do that, we need to know the median household income in each school's tract, which means we definitely need our **spatial join**!

We'll first take a look at the documentation for the spatial join function, `gpd.sjoin`.

In [None]:
help(gpd.sjoin)

Looks like the key arguments to consider are:

- The two GeoDataFrames (**`left_df`** and **`right_df`**);
- The type of join to run (**`how`**), which can take the values `left`, `right`, or `inner`;
- The spatial relationship query to use (**`op`**; this argument is named `predicate` in geopandas 0.10 and later).

A couple things to note:

- By default, `sjoin` is an inner join. It keeps the data from both GeoDataFrames only where the locations spatially intersect.
- By default, `sjoin` keeps the geometry of the first (left) GeoDataFrame passed to it. With `how='right'`, it keeps the geometry of the right one instead.
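Here is a toy sketch of those two defaults, using made-up points and squares rather than our real data. One point falls outside every polygon, so the inner and left joins return different numbers of rows, and in both cases the result keeps the point geometries.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical polygons and points; 's3' lies outside both squares
polys = gpd.GeoDataFrame({"GEOID": ["A", "B"]},
                         geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)])
pts = gpd.GeoDataFrame({"Site": ["s1", "s2", "s3"]},
                       geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)])

inner = gpd.sjoin(pts, polys)              # default: how='inner'
left = gpd.sjoin(pts, polys, how="left")   # keeps every point

print(len(inner), len(left))
print(left[["Site", "GEOID"]])
print(set(left.geom_type))
```

In the left join, the unmatched point is kept with a missing `GEOID`.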

So, before we move on, let's think about how we'll conduct this analysis.

1. Which GeoDataFrame are we joining onto which (i.e. which one is getting the other one's data added to it)?
2. What happened to 'outer' as a join type?
3. Thus, in our operation, which GeoDataFrame should be the `left_df`, which should be the `right_df`, and `how` do we want our join to run?

Alright! Let's run our join!

In [None]:
schools_jointracts = gpd.sjoin(left_df=schools_gdf_api,
                               right_df=tracts_acs_gdf_ac,
                               how='left')

In [None]:
schools_jointracts.head()

---

### Challenge 2: Confidence Checks

As always, we want to perform a confidence check on our intermediate result before we rush ahead.

One way to do that is to introspect the structure of the result object a bit.

1. What type of object should that have given us?
2. What should the dimensions of that object be, and why?
3. If we wanted a visual check of our results (i.e. a plot or map), what could we do?

---

In [None]:
# YOUR CODE HERE


Confirmed! The output of the `sjoin` operation is a GeoDataFrame (`schools_jointracts`) with:
- A row for each school that is located inside a census tract (all of them are).
- The **point geometry** of that school.
- All of the attribute data columns (non-geometry columns) from both input GeoDataFrames.
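One reason the row count is worth checking: a left join can return *more* rows than the left GeoDataFrame has. With the default intersects relationship, a point lying exactly on the boundary shared by two polygons matches both. The sketch below shows this on invented geometries. It is not a claim about our schools data.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical squares that share the edge x = 1
polys = gpd.GeoDataFrame({"GEOID": ["A", "B"]},
                         geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)])

# One point inside square A, one exactly on the shared edge
pts = gpd.GeoDataFrame({"Site": ["inside", "on_edge"]},
                       geometry=[Point(0.5, 0.5), Point(1.0, 0.5)])

joined = gpd.sjoin(pts, polys, how="left")

print(len(pts), len(joined))
print(joined["Site"].duplicated().any())
```

Comparing `len(joined)` to the length of the left input is a quick way to catch this kind of duplication.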

Let's also take a look at an overlay map of the schools on the tracts. If we color the schools categorically by their tract IDs, then we should see that all schools within a given tract polygon are the same color.

We're only going to plot a few of the schools, because we don't have enough colors on the color wheel for each unique tract:

In [None]:
ax = tracts_acs_gdf_ac.plot(color='white',
                            edgecolor='black',
                            figsize=(18, 18))
schools_jointracts.iloc[:16].plot(column='GEOID', ax=ax, legend=True)

### Assessing the Relationship between Median Household Income and API

Fantastic! That looks right!

Now, we can create that scatter plot we were thinking about!

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(schools_jointracts.med_hhinc, schools_jointracts.API)
ax.set_xlabel('Median Household Income (Dollars)')
ax.set_ylabel('Academic Performance Index')

Wow! Just as we suspected based on our overlay map,
there's a pretty obvious, strong, and positive correlation
between median household income in a school's tract
and the school's API.
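If we wanted to put a number on that visual impression, one option is the Pearson correlation coefficient, available through the pandas `Series.corr` method. The sketch below uses invented values purely to show the call. They are not the workshop data, and the resulting number says nothing about real schools.

```python
import pandas as pd

# Hypothetical values, invented only to demonstrate the method call
toy = pd.DataFrame({"med_hhinc": [40000, 55000, 70000, 90000, 120000],
                    "API": [700, 760, 780, 850, 900]})

# Series.corr computes Pearson's r by default and skips missing pairs
r = toy["med_hhinc"].corr(toy["API"])
print(f"Pearson r = {r:.2f}")
```

On the joined data, the equivalent call would be `schools_jointracts['med_hhinc'].corr(schools_jointracts['API'])`.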

## Aggregation

We just saw that a spatial join is one way to leverage the spatial relationship between two datasets in order to create a new dataset.

An **aggregation** is another way we can generate new data from this relationship. In this case, for each feature in one dataset we find all the features in another dataset that satisfy our chosen spatial relationship query with it (e.g. within, intersects), then aggregate them using some summary function (e.g. count, mean).

### Calculating Aggregated School Counts

Let's take this for a spin with our data. We'll count all the schools within each Census tract.

Note that we've already done the first step of spatially joining the data from the aggregating features
(the tracts) onto the data to be aggregated (our schools).

The next step is to group our GeoDataFrame by Census tract, and then summarize our data by group. We do this using the DataFrame method `groupby`.

To get the correct count, let's rejoin our schools onto our tracts, this time keeping all schools (not just those with positive APIs, as before).

In [None]:
schools_jointracts = gpd.sjoin(schools_gdf, tracts_acs_gdf_ac, how='left')

Let's perform the `groupby` operation.

When aggregating by count, we get the number of non-null values in every column. For columns with no missing values these counts are identical, so we'll just select the `GEOID` and `Site` columns at the end.

In [None]:
schools_countsbytract = schools_jointracts.groupby('GEOID', as_index=False).count()[['GEOID','Site']]
print(f"Counts, rows and columns: {schools_countsbytract.shape}")
print(f"Tracts, rows and columns: {tracts_acs_gdf_ac.shape}")

# Take a look at the data
schools_countsbytract.head()
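A small caveat about `count`: it tallies non-null values per column, so the column we pick matters when data are missing. The sketch below uses a made-up table (hypothetical values) to contrast `count` with `size`, which counts rows per group regardless of missing values.

```python
import pandas as pd

# Hypothetical joined table: one school in tract 'A' has a missing API
joined = pd.DataFrame({"GEOID": ["A", "A", "B"],
                       "Site": ["s1", "s2", "s3"],
                       "API": [800.0, None, 750.0]})

# count: non-null values in each column, per group
per_column = joined.groupby("GEOID", as_index=False).count()

# size: number of rows per group, ignoring missingness
per_group = joined.groupby("GEOID").size()

print(per_column)
print(per_group)
```

Here tract `A` has two schools according to `Site` and `size`, but only one according to `API`.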

### Obtaining Tract Polygons with School Counts

The above `groupby` and `count` operations give us the counts we wanted.

- We have 263 (of 361) Census tracts that contain at least one school.
- We have the number of schools within each of those tracts.

But, the output of `groupby` is a plain DataFrame, and not a GeoDataFrame.

If we want a GeoDataFrame, then we have two options:

1. We could join the `groupby` output to `tracts_acs_gdf_ac` by the attribute `GEOID`.
2. We could start over, using the GeoDataFrame `dissolve` method, which we can think of as a spatial `groupby`.
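For reference, option 1 might look something like the sketch below. It uses made-up tracts and counts (all hypothetical) in place of our real objects: a left attribute join keeps every tract, and tracts with no schools get a missing count that we fill with zero.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical tracts and a hypothetical groupby-style counts table
tracts = gpd.GeoDataFrame({"GEOID": ["A", "B", "C"]},
                          geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)])
counts = pd.DataFrame({"GEOID": ["A", "B"], "Site": [2, 1]})

# Left join keeps all tracts; tract 'C' has no count yet
tracts_counts = tracts.merge(counts, on="GEOID", how="left")

# Tracts without schools get a count of zero
tracts_counts["Site"] = tracts_counts["Site"].fillna(0).astype(int)

print(type(tracts_counts).__name__)
print(tracts_counts[["GEOID", "Site"]])
```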

Since we already know how to do an attribute join, we'll do the `dissolve`!

First, let's run a new spatial join.

In [None]:
tracts_joinschools = gpd.sjoin(left_df=schools_gdf,
                               right_df=tracts_acs_gdf_ac,
                               how='right')

In [None]:
tracts_joinschools.head()

Now, let's run the dissolve!

In [None]:
tracts_schoolcounts = tracts_joinschools[['GEOID', 'Site', 'geometry']].dissolve(by='GEOID', aggfunc='count')
print(f"Counts, rows and columns: {tracts_schoolcounts.shape}")

tracts_schoolcounts.head()

Nice! Let's break that down.

- The `dissolve` operation requires a geometry column and a grouping column (in our case, `'GEOID'`). All geometries within the **same group** are merged (unioned) into a single geometry, so identical or nested geometries collapse into one shape.
 
- The `aggfunc`, or aggregation function, of the dissolve operation will be applied to all numeric columns in the input geodataframe (unless the function is `count` in which case it will count rows).  

Check out the Geopandas documentation on [dissolve](https://geopandas.org/aggregation_with_dissolve.html?highlight=dissolve) for more information.
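To watch the `sjoin` plus `dissolve` pipeline work end to end, here is a toy version with three made-up tracts and three made-up schools (all hypothetical). The right join repeats a tract's polygon once per matching school, and `dissolve` collapses those repeats back into one row per tract while counting the non-null `Site` values.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical tracts: 'C' contains no schools
tracts = gpd.GeoDataFrame({"GEOID": ["A", "B", "C"]},
                          geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1), box(4, 0, 5, 1)])

# Hypothetical schools: two in 'A', one in 'B'
schools = gpd.GeoDataFrame({"Site": ["s1", "s2", "s3"]},
                           geometry=[Point(0.2, 0.5), Point(0.8, 0.5), Point(2.5, 0.5)])

# Right join: one row per (tract, school) match, plus one row for the empty tract
joined = gpd.sjoin(schools, tracts, how="right")

# Dissolve: one row per tract, counting non-null 'Site' values
counts = joined[["GEOID", "Site", "geometry"]].dissolve(by="GEOID", aggfunc="count")

print(len(joined))
print(counts["Site"])
print(counts.shape)
```

The empty tract survives with a count of zero because its `Site` value is missing after the right join, and `count` ignores missing values.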

Now, let's reflect:

1. Above, we selected three columns from the input GeoDataFrame to create a subset as input to the dissolve operation. Why?
2. Why did we run a new spatial join? What would have happened if we had used the `schools_jointracts` object instead?
3. What explains the dimensions of the new object (361, 2)?

### Mapping the Spatial Join Output

Because our `sjoin` plus `dissolve` pipeline outputs a GeoDataFrame, we can now easily map the school count by Census tract!

In [None]:
fig, ax = plt.subplots(figsize = (14, 8)) 

# Display the output of our spatial join
tracts_schoolcounts.plot(ax=ax,
                         column='Site', 
                         scheme="user_defined",
                         classification_kwds={'bins': list(range(9))},
                         cmap="PuRd_r",
                         edgecolor="grey",
                         legend=True, 
                         legend_kwds={'title': 'Number of schools'})
schools_gdf.plot(ax=ax,
                 color='cyan',
                 markersize=2)

---

### Challenge 3: Aggregation

What is the mean API of each Census tract?

As we mentioned, the spatial aggregation workflow that we just put together can be used not only to generate a new count variable, but also to generate any other new variable that results from calling an aggregation function on an attribute column.

In this case, we want to calculate and map the mean API of the schools in each Census tract.

Copy and paste code from above where useful, then tweak and/or add to that code. Do the following:

1. Join the schools onto the tracts (**HINT**: make sure to decide whether or not you want to include schools with API = 0!).
2. Dissolve that joined object by the tract IDs, giving you a new GeoDataFrame with each tract's mean API (**HINT**: because this is now a different calculation, different problems may arise and need handling!).
3. Plot the tracts, colored by API scores (**HINT**: overlay the schools points again, visualizing them in a way that will help you visually check your results!).

---

In [None]:
# YOUR CODE HERE
