# 5. Proximity Calculations
Phew! We're going to continue building off what we've learned!

- [5.1 Introduction  ](#section1)
- [5.2 Proximity Analysis ](#section2)
    - Load and Prepare the Permit data
    - Load and Prepare the BART stations Data
    - Create Buffer Polygons around BART Stations
    - Spatial Join
    - Count the number of BART stations within walking distance
- [5.3 Exploratory Analysis ](#section3)
   - Map overlays
   - Interactive mapping
- [5.4 Recap](#section4)
- [5.5 Homework](#section5)

**INSTRUCTOR NOTES**:
- Datasets used:
    - "../notebook_data/outdata/tracts_and_permits_gdf.json"
    - "../notebook_data/outdata/permits_and_tracts_gdf.json"
    

- Expected time to complete:
    - Lecture + Questions: 45 minutes
    - Homework: 45 minutes

<a id="section1"></a>
## 5.1 Introduction

In `s3_3` we explored how to enhance a data set with spatial joins with the Geopandas `sjoin` operation.

Specifically, we joined the permit application data to census tract ACS data by spatial location so that we could summarize the permit data by census tract and the ACS data for tracts that contain approved permits.
  
- We did the sjoin twice, outputting:
  - `tracts_and_permits_gdf`, a polygon geodataframe  
  - `permits_and_tracts_gdf`, a point geodataframe
  
We also output a geodataframe of the City of Oakland boundary
  - `oakland_gdf`
  
In this notebook we build on that effort to further enhance our permit data.

- First, we introduce buffers as a way to identify permit application locations within walking distance to BART. 
- Then, we will create some maps from this data.


### Set-Up
Let's import the packages we need before we get started.

In [None]:
import math
import numpy as np
import pandas as pd
import collections
import requests 
from urllib.request import urlopen, Request

import json # for working with JSON data
import geojson # ditto for GeoJSON data - an extension of JSON with support for geographic data
import geopandas as gpd
import mapclassify # to classify data values

import matplotlib # base python plotting library
%matplotlib inline  
import matplotlib.pyplot as plt # more plotting stuff

import folium # popular python web mapping tool for creating Leaflet maps
import folium.plugins
from folium.plugins import MeasureControl

In [None]:
# We are getting futurewarning errors about the syntax of CRS definitions, ie "init=epsg:4269" vs "epsg:4269"
# so suppress as these are minor
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

<a id="section2"></a>
## 5.2 Proximity Analysis

Proximity analysis is a key part of spatial analysis. It considers what is nearby, in accordance with [Tobler's first law of geography](https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography) which we paraphrase as "*Everything is related but nearby things are more related*."

In practice, distance-based buffer polygons around geospatial features are often used to examine proximal relationships. For example, one may want to consider how many parks are within walking distance of schools in order to identify underserved schools. This could be implemented using the following "recipe":

1. define walking distance, e.g. 500 meters or about 1/3 mile
2. create buffer polygons around the school features with a radius of `walking distance`
3. use a spatial join to associate parks with the school buffers
4. count the number of parks within each school buffer.
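
The recipe above can be sketched without any GIS library, since a point lies inside a circular buffer when its distance to the buffer's center is at most the radius. This is only an illustration: the coordinates are made-up values in meters (a projected CRS is assumed), and the names `schools`, `parks` and `parks_within_walking_distance` are hypothetical.

```python
import math

# Hypothetical school and park locations as (x, y) in meters
schools = {"School A": (0, 0), "School B": (2000, 0)}
parks = [(300, 300), (100, 0), (2000, 450), (5000, 5000)]

walking_distance = 500  # meters

def parks_within_walking_distance(school_xy, park_points, radius):
    """Count the parks whose straight-line distance to the school is <= radius."""
    sx, sy = school_xy
    return sum(1 for px, py in park_points
               if math.hypot(px - sx, py - sy) <= radius)

# One count per school: the equivalent of steps 2 to 4 of the recipe
park_counts = {name: parks_within_walking_distance(xy, parks, walking_distance)
               for name, xy in schools.items()}
print(park_counts)
```

Geopandas gets to the same kind of count with buffer polygons and a spatial join, which also works when the inputs are lines or polygons rather than points.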

Buffers take on different shapes depending on your original geometries (the "input" row of the image below). Putting a buffer around each input results in the shapes in the second row. The third row is what you get if you merge the buffers into the entire region they cover.

<img src = "https://pro.arcgis.com/en/pro-app/tool-reference/analysis/GUID-267CF0D1-DB92-456F-A8FE-F819981F5467-web.png" height="500" width="500">


 
In this section we will use buffers to enhance our permit data as we ask *"how many BART stations are within walking distance of each permit location?"*


<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Questions
</div>

- What is the geometry of our permit data? 
- What will the buffers look like?
- What do we need to do to the geodataframes before we can spatial join them?


In [None]:
# Write your thoughts here

### Step 1. Prepare the Data

#### Load the Permit data
First up, we will read in the permit data from a previous lesson. If you recall, a spatial join enhanced the permit data with census tract information and other ACS data. 

In [None]:
permits_gdf = gpd.read_file("../notebook_data/outdata/permits_and_tracts_gdf.json", driver="GeoJSON")
permits_gdf.plot()

If we look at the permit data we will see that it is enhanced with ACS data for the tract within which it resides.

In [None]:
permits_gdf.head()

In [None]:
permits_gdf.shape

#### Load the BART Station Data

If we look inside our notebook_data transportation folder we see a `bart_stations.csv` file. Let's check it out.

In [None]:
!ls ../notebook_data/transportation

Since this is a CSV file and not a shapefile or another geographic file format, we will read it in with pandas to a dataframe.

In [None]:
# Read in bart stations
# Read in CSV file
df = pd.read_csv("../notebook_data/transportation/bart_stations.csv")
df.head()

Oops! That didn't work well. Let's specify the delimiter (column separator) character.

In [None]:
# Read in bart stations
# Read in CSV file
df = pd.read_csv("../notebook_data/transportation/bart_stations.csv", sep=";")
df.head()

It's a common workflow to get point data in a CSV file. 

Then we convert it to a geodataframe by identifying the columns that contain the point geometry.

In [None]:
#Convert the DataFrame to a GeoDataFrame. 
bart_gdf = gpd.GeoDataFrame( df, geometry=gpd.points_from_xy(df.lon, df.lat)) 

# and take a look
bart_gdf.plot();

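Hmm... what's odd about that? Note that `points_from_xy` expects the x values (longitude) first and the y values (latitude) second. If the plot looks wrong, the `lat` and `lon` columns in this file may be swapped, so let's try reversing them.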

In [None]:
#Convert the DataFrame to a GeoDataFrame. 
bart_gdf = gpd.GeoDataFrame( df, geometry=gpd.points_from_xy(df.lat, df.lon)) 

# and take a look
bart_gdf.plot()

In [None]:
# Take a look 
bart_gdf.head()

Since a CSV file doesn't have a CRS we need to define it.

In [None]:
# Check it out
print("Here is our CRS after reading in the CSV file: ", bart_gdf.crs)

# Define the CRS
bart_gdf.crs = 'epsg:4326'

# Check it out
print("Here is our CRS now: ", bart_gdf.crs)
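
Defining a CRS is not the same as transforming to one. The sketch below uses a single made-up point (not the BART data) to show the difference: `set_crs`, the method equivalent of assigning to `.crs`, only labels the existing coordinates, while `to_crs` computes new coordinates. The `demo_` names are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point

# A single made-up point near downtown Oakland, given as (lon, lat)
demo_gdf = gpd.GeoDataFrame({"name": ["demo"]}, geometry=[Point(-122.27, 37.80)])
print(demo_gdf.crs)  # no CRS has been defined yet

# set_crs only labels the coordinates; the numbers stay the same
demo_4326 = demo_gdf.set_crs("EPSG:4326")

# to_crs reprojects: the coordinates become meters in California Albers
demo_3310 = demo_4326.to_crs("EPSG:3310")

print(demo_4326.geometry.iloc[0])
print(demo_3310.geometry.iloc[0])
```

Use the first operation when the data has no CRS but you know what it should be, and the second when you need the same locations expressed in a different coordinate system.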


### Step 2. Define Walking distance

Our goal is to add to the `permits_gdf` geodataframe a column with the number of BART stations within walking distance.

The first step in doing this is to define walking distance. We can read the planning literature for ideas but let's assume for this exercise that it is 500 meters, which is about 1/3 mile.

In [None]:
walking_distance_meters = 500;  # setting walking distance initially to 500 meters
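
As a quick check of the "about 1/3 mile" figure, the conversion can be done by hand. This is plain arithmetic; the `demo_` names are hypothetical so they do not overwrite the variable defined above.

```python
METERS_PER_MILE = 1609.344  # international mile, exact by definition

demo_distance_meters = 500
demo_distance_miles = demo_distance_meters / METERS_PER_MILE
print(round(demo_distance_miles, 2))  # 0.31
```

So 500 meters is slightly less than a third of a mile.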

### Step 3. Prepare data for Buffer Analysis

In order to create buffer polygons, we need to transform our data to a projected (2D) CRS whose units are meters. If you recall from our first Geopandas lesson, there are a number of these CRSs for California.

Let's use the California Albers CRS, NAD83 (`epsg:3310`), since that can be used for any city in CA.
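
To see why the units matter: in a geographic CRS such as `epsg:4326` the coordinates are degrees, so a buffer distance of 500 would mean 500 degrees, not 500 meters. A degree is also not a fixed length on the ground. The rough sketch below assumes a spherical Earth and an approximate latitude for Oakland, so the numbers are only indicative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, a spherical approximation
oakland_lat = 37.8          # approximate latitude of Oakland, in degrees

# Length of one degree on the sphere, in meters
meters_per_deg_lat = math.pi * EARTH_RADIUS_M / 180
meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(oakland_lat))

print(round(meters_per_deg_lat / 1000, 1), "km per degree of latitude")
print(round(meters_per_deg_lon / 1000, 1), "km per degree of longitude")
```

Because the two values differ, a "circle" drawn in degrees would not even be round on the ground, which is one more reason to buffer in a projected CRS.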


Now check the CRS of the permits geodataframe.

In [None]:
permits_gdf.crs

Create a new permits geodataframe that has the CRS 3310.

In [None]:
permits_3310 = permits_gdf.to_crs('epsg:3310')

Now that we've transformed, or reprojected, the permit data, let's plot it. Notice the different coordinate values.

In [None]:
permits_3310.plot(figsize=(8,10)) # note the different coordinate values - no longer lat/lon!
plt.show()

Ok, now create a new version of the bart data with the CAL Albers CRS (3310)

In [None]:
# transform the CRS
bart_3310 = ...

In [None]:
# plot it


In [None]:
# Take a look at the geodataframe


### Step 4.  Create Buffer Polygons

With that we can go on to actually making our buffers around the BART Stations that have the `walking distance` as the radius. We do this with the Geopandas geodataframe `.buffer()` method.

In [None]:
# Make sure the BART data is in the California Albers CRS (3310)
bart_3310 = bart_gdf.to_crs('epsg:3310')

bart_buf = bart_3310.buffer(distance=walking_distance_meters)

Now let's map the output.

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
bart_buf.plot(ax=ax, color="pink", edgecolor="green")
bart_3310.plot(ax=ax, color='black')
plt.show()

Let's add the permit points to the map

In [None]:
# Map it
fig, ax = plt.subplots(figsize = (20,20)) 

# Display the buffer output
bart_buf.plot(ax=ax,color="pink", edgecolor="green")

# Overlay the permit points
permits_3310.plot(ax=ax, color="blue", alpha=0.5, markersize=5)

plt.show()

You can see from the map above that each BART station now has a buffer polygon.

You also get a sense that some but not all permit locations are near BART stations.

Let's take a quick look at the data in the buffer output `bart_buf`.

In [None]:
bart_buf.head()

What type of data is that?

In [None]:
type(bart_buf)

The output of the `buffer` operation is a Geopandas `GeoSeries` NOT a geodataframe. Before we can proceed we need to create a geodataframe.

We can create a geodataframe by combining a few key columns from the BART data with the buffer geometry.

In [None]:
# Create a bart_buf geodataframe
bart_buf_gdf = gpd.GeoDataFrame(data=bart_3310[['station_name']],
                                  geometry=bart_buf)

In [None]:
# Take a look at our BART buffer geodataframe
bart_buf_gdf.head()
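
The same pattern can be tried on a tiny example. The sketch below uses two made-up stations (the `demo_` names and coordinates are hypothetical) to show that `.buffer()` returns a `GeoSeries` without attribute columns, and that the attributes can be re-attached because both objects share the same index.

```python
import math
import geopandas as gpd
from shapely.geometry import Point

# Two made-up stations in a projected CRS with meter units
demo_stations = gpd.GeoDataFrame(
    {"station_name": ["Demo North", "Demo South"]},
    geometry=[Point(0, 0), Point(3000, 0)],
    crs="EPSG:3310",
)

# GeoSeries of polygons: geometry only, no attribute columns
demo_buf = demo_stations.buffer(500)
print(type(demo_buf).__name__)

# Re-attach the attributes; the shared index keeps names and polygons aligned
demo_buf_gdf = gpd.GeoDataFrame(demo_stations[["station_name"]], geometry=demo_buf)

# Each buffer is a polygon approximating a circle, so its area is close to pi * r**2
print(demo_buf_gdf.area / (math.pi * 500**2))
```

The area ratio comes out just under 1 because the buffer is a many-sided polygon, not a true circle.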

Nice work! We now have a geodataframe with all of our 500m BART buffer polygons, along with our permit geodataframe.

<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>
How would you go from a buffer polygon geodataframe back to a point geodataframe?

In [None]:
# Write your thoughts

Next we want to join them so that we can identify the permits that are within walking distance of BART.

In order to be able to identify those permits after our `spatial join` we first want to create a new variable called `bart_count`. Since each row is for one BART station, we're going to set our variable to 1 for every entry.  

> This type of variable is often called a `dichotomous variable`, `binary variable` or `dummy variable`.

In [None]:
bart_buf_gdf['bart_count'] = 1
bart_buf_gdf.head(5)

### Step 5. Spatially join the Permit points and BART buffers

Great, now that we have our data in the right CRS with our new BART count variable, we're going to go ahead and identify the permit point locations within walking distance of a BART station.

To do that, we're going to do a **spatial join** using the geopandas **sjoin** function. 

In [None]:
help(gpd.sjoin)

Before proceeding, consider this:

- What geodataframe should be listed first in the spatial join as the `left_df`? Why does this matter?
- Do we want to do a default inner join or a left join?
- What will the output geometry type be? What do we want it to be?



<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/f/ff/Cat_on_laptop_-_Just_Browsing.jpg"></img>


OK, spatial join time!

In [None]:
# Join the bart data to the permit data to identify permit locations near bart
permits_near_bart_gdf = gpd.sjoin(permits_3310,bart_buf_gdf)
permits_near_bart_gdf.head()

<img align="left" width=500 src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Sad_Lucy.jpg/640px-Sad_Lucy.jpg"></img>

Our permit geodataframe has a little artifact left over from a previous spatial join - the `index_right` column. This needs to be dropped.

In [None]:
#list(permits_3310.columns)
permits_3310.drop(columns=['index_right'], inplace=True)

Now try that spatial join again!

In [None]:
permits_near_bart_gdf = gpd.sjoin(permits_3310, bart_buf_gdf)
permits_near_bart_gdf.head()
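
To get a feel for what the join returns, here is a minimal sketch on made-up data (all `demo_` names and coordinates are hypothetical). It uses the default inner join, so permits outside every buffer are dropped, and a permit that falls inside two overlapping buffers appears twice.

```python
import geopandas as gpd
from shapely.geometry import Point

demo_crs = "EPSG:3310"  # projected CRS with meter units

# Three made-up permit points
demo_permits = gpd.GeoDataFrame(
    {"permit_id": ["A", "B", "C"]},
    geometry=[Point(0, 0), Point(400, 0), Point(5000, 0)],
    crs=demo_crs,
)

# Two made-up stations 800 m apart, so their 500 m buffers overlap
demo_stops = gpd.GeoDataFrame(
    {"station_name": ["S1", "S2"]},
    geometry=[Point(0, 0), Point(800, 0)],
    crs=demo_crs,
)
demo_buffers = demo_stops.copy()
demo_buffers["geometry"] = demo_stops.buffer(500)

# Points on the left, polygons on the right: the output keeps the point geometry
demo_join = gpd.sjoin(demo_permits, demo_buffers)
print(demo_join[["permit_id", "station_name", "index_right"]])
```

Permit B is listed once per buffer it falls in, which is why the next step aggregates the rows by permit ID. The `index_right` column added here is the same artifact we had to drop before joining.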

Before diving too deep into our results, let's double check the sizes of our input and output geodataframes to make sure they seem reasonable.

In [None]:
print("Number of permits:", len(permits_3310))
print("Number of BART station buffers:", len(bart_buf_gdf))
print("Number of permit / BART buffer matches:", len(permits_near_bart_gdf))


### Step 6. Count the number of BART Stations within Walking Distance of Permit Locations

Now that we have done our spatial join, we can sum the count of BART stations within walking distance of permit locations. We will dissolve duplicate geometries that share the same `jurisdiction_id`, assuming this to be a unique ID for the permit applications.

In [None]:
permit_bart_counts_gdf = permits_near_bart_gdf[['jurisdiction_id','geometry','bart_count']].dissolve(by='jurisdiction_id', aggfunc="sum", as_index=False)
permit_bart_counts_gdf

We can combine this output with our sjoin input permit geodataframe (`permits_3310`) to enhance the permit information.

In [None]:
permits_gdf_enhanced = permits_3310.merge(permit_bart_counts_gdf[['jurisdiction_id','bart_count']], on="jurisdiction_id", how="left")

In [None]:
# Take a look
permits_gdf_enhanced.sort_values(by="bart_count", ascending=False).head()

In [None]:
permits_gdf_enhanced.shape

Now let's use `value_counts` to check the distribution of values in the `bart_count` column.

In [None]:
permits_gdf_enhanced.bart_count.value_counts(dropna=False)

You can see above that after the merge the `bart_count` column is NaN (not a number/null) for permit locations that were not within walking distance to BART.

We can use the `fillna()` method to set those values to zero.

In [None]:
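# Assign the result back rather than filling in place on a selected column
permits_gdf_enhanced['bart_count'] = permits_gdf_enhanced['bart_count'].fillna(0)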

# And check the counts again
permits_gdf_enhanced.bart_count.value_counts(dropna=False)
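
The count, merge and fill steps above can be replayed on a toy table to see what each one does. Note that this sketch swaps in a plain pandas `groupby` for `dissolve`, which is an alternative when only the counts, and not merged geometries, are needed. The `toy_` names and values are made up.

```python
import pandas as pd

# Toy join result: one row per (permit, station) match
toy_matches = pd.DataFrame({
    "jurisdiction_id": ["P1", "P2", "P2"],
    "bart_count": [1, 1, 1],
})

# Toy permit table, including a permit (P3) with no station nearby
toy_permits = pd.DataFrame({"jurisdiction_id": ["P1", "P2", "P3"]})

# Sum the indicator per permit (groupby instead of dissolve)
toy_counts = toy_matches.groupby("jurisdiction_id", as_index=False)["bart_count"].sum()

# A left merge keeps every permit; unmatched permits get NaN, which we set to 0
toy_enhanced = toy_permits.merge(toy_counts, on="jurisdiction_id", how="left")
toy_enhanced["bart_count"] = toy_enhanced["bart_count"].fillna(0).astype(int)
print(toy_enhanced)
```

The left merge is what keeps permits like P3 in the table; an inner merge would silently drop every permit that is not near a station.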

Phew! Now let's map our output, the ultimate sanity check.

In [None]:
# Map it
fig, ax = plt.subplots(figsize = (10,10)) 

# Display the buffer output in PINK
bart_buf_gdf.plot(ax=ax, edgecolor="black",color="pink", alpha=0.5)

# Overlay the permit points, colored by the number of BART stations within walking distance
permits_gdf_enhanced.sort_values(by="bart_count").plot(ax=ax, column='bart_count', categorical=True, legend=True)

# Set x and y limits to zoom into Oakland
ax.set_xlim([-203000,-185000])
ax.set_ylim([-31000,-14000])

ax.set_title('Oakland Permit Application locations by Number of BART Stations within Walking Distance')
plt.show()

<a id="section3"></a>
## 5.3 Exploratory Analysis

Once you have enhanced your spatial data, the next step is to explore relationships and to build and test hypotheses about the data.

For example, let's explore whether or not these locations are in census tracts with a high number of approved units.

First, let's read in the tract polygon with permit data file that we created in a previous lesson.

In [None]:
# Read in census tract ACS data with joined permit data
tracts_with_permits = gpd.read_file("../notebook_data/outdata/tracts_and_permits_gdf.json", driver="GeoJSON")

As we did in a previous lesson, let's sum the number of approved permit units by census tract

In [None]:
tract_permit_counts_gdf=tracts_with_permits[['GEOID','geometry','units_permit']].dissolve(by='GEOID', aggfunc="sum", as_index=False)
tract_permit_counts_gdf

And let's plot it to see what we have

In [None]:
fig, ax = plt.subplots(figsize = (24,12)) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,
                             column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True,
                             legend_kwds={'title': "Permitted units by Tract"}
                            )



plt.show()

Now we can create a map that allows us to explore the relationship between BART stations and permitted units.

> Take a close look at how we add the permits data. What's new here?

In [None]:
fig, ax = plt.subplots(figsize = (24,12)) 

# Display the output of our spatial join
tract_permit_counts_gdf.plot(ax=ax,
                             column='units_permit', 
                             scheme="quantiles", 
                             cmap="YlGnBu",
                             edgecolor="grey",
                             legend=True,
                             legend_kwds={'title': "Permitted units by Tract"}
                             )

# Add permit locs within walking distance to bart
permits_gdf_enhanced.to_crs(tract_permit_counts_gdf.crs).sort_values(by="bart_count").plot(ax=ax, 
                                                            column='bart_count', 
                                                            edgecolor='grey', 
                                                            legend=True, 
                                                            cmap='Reds',
                                                            markersize=25,
                                                            legend_kwds={'label': "Count of BART Stations w/in Walking Distance"}
                                                            )


ax.set_title("Oakland Permit Application Data")
plt.show()

<div style="display:inline-block;vertical-align:top;">
    <img src="https://image.flaticon.com/icons/svg/87/87705.svg" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>

Does there appear to be a relationship between the number of nearby BART stations and the number of permitted units?


### Interactive Map Review

We just did a lot of complex spatial joins, dissolves and aggregations. Let's create an interactive map to check our work and do a sanity check.

We will add the BART stations, buffers, and the permit points with the count of BART stations within walking distance (500 meters).

Finally we will add a `MeasureControl` from `folium.plugins` so that we can measure the buffers and check the BART counts for the permit locations.


In [None]:
# Define the basemap
buf_map = folium.Map(location=[37.809142, -122.268228],   # lat, lon around which to center the map
                 tiles='CartoDB Positron',
                 width=900,                        # the width & height of the output map
                 height=600,                       # in pixels
                 zoom_start=15)  

# Add BART Stations buffers
folium.GeoJson(bart_buf.to_crs('epsg:4326')).add_to(buf_map)  # web maps expect lat/lon (WGS84)
   


# Add Bart stations as Markers (default with GeoJson when data are points)
folium.GeoJson(bart_gdf,
              tooltip=folium.GeoJsonTooltip(fields=['station_name' ], 
                   aliases=['Station name'],
                   labels=True,
                   localize=True
               ),
              ).add_to(buf_map)

# Add permit locations
permits_gdf_enhanced.to_crs('epsg:4326').apply(lambda row: folium.Circle(location=[row['geometry'].y,row['geometry'].x],
                                  tooltip= row['bart_count'],
                                  radius=20,
                                  color='purple',
                                  fill=True,
                                  fill_color='purple'
                                 ).add_to(buf_map),
                             axis=1)

buf_map.add_child(MeasureControl())

buf_map # wait for it...

<a id="section4"></a>
## 5.4 Recap
In this notebook we answered the questions "How many BART stations are within walking distance from a permit?" and "What is the relationship between the number of approved permits and walkable BART stations for a tract?" We learned how to create buffers and overlay points on a choropleth map. We also revisited how to create an interactive map.

Below you'll find a list of key functionalities we learned and practiced:
- Create a buffer of specified size
    - `.buffer()`
- Spatial joins
    - `.sjoin()`
- CRS transformations
    - `.to_crs()`
- Adding a measurement widget to an interactive Folium map
    - `folium.plugins.MeasureControl`

---
<a id="section5"></a>
## 5.5 Homework

####  Exercise

Do another buffer analysis, this time using any of the following data that you find in the folders:

>`notebook_data/transportation`
> - `sfmta_stations.zip` - SF MTA station locations
> - `regional_bike_facilities.zip` - Off-street shared use path, bike lanes, and on-street bike routes
> - `baywheels_stations.zip` - Baywheel station locations

> `notebook_data/other`
> - `ca_grocery_stores_2019_wgs84.zip` - Grocery store locations

You'll need to execute the following steps:
1. Load the data and check the columns, geometry type and CRS
2. Check and update the CRS if needed
3. Create buffer polygons around the permit points
4. Spatially join your dataset with the buffer polygons of the permits data
5. Dissolve and aggregate the values of interest
6. Join the data back to the permits dataset
7. Replace null values with zero
8. Map the results
9. Create an interactive map with your new data as a layer and check your results


In [None]:
# Your code here

*Click here for answers*

<!---
    # SOLUTION
    # Load the data and check the columns, geometry type and CRS
    baywheels_stations_gdf = gpd.read_file("zip://../notebook_data/transportation/baywheels_stations.zip")
    # Check and update the crs if needed
    baywheels_3310 = baywheels_stations_gdf.to_crs('epsg:3310')
    baywheels_3310['bike_count_dv'] = 1
    baywheels_3310.head()

    # SOLUTION

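    # Create buffer polygons around the permit points (not created earlier in this notebook)
    permit_buf_gdf = permits_3310[['jurisdiction_id', 'geometry']].copy()
    permit_buf_gdf['geometry'] = permits_3310.buffer(walking_distance_meters)

    # Spatially join your dataset with the buffer polygons of the permits data
    permit_buf_bike_gdf = gpd.sjoin(permit_buf_gdf, baywheels_3310)
    permit_buf_bike_gdf.head()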

    # Dissolve and aggregate the values of interest
    permit_bike_counts_gdf=permit_buf_bike_gdf[['jurisdiction_id','geometry','bike_count_dv']].dissolve(by='jurisdiction_id', aggfunc="sum", as_index=False)
    permit_bike_counts_gdf.head()
    permit_bike_counts_gdf.shape

    # Join the data back to the permits dataset
    permits_gdf_enhanced = permits_gdf_enhanced.merge(permit_bike_counts_gdf[['jurisdiction_id','bike_count_dv']], on="jurisdiction_id", how="left")
    permits_gdf_enhanced.head()

    # SOLUTION

    # Replace null values with zero
    permits_gdf_enhanced['bike_count_dv'] = permits_gdf_enhanced['bike_count_dv'].fillna(0)
    permits_gdf_enhanced.head()

    # Map the results
    # Plot
    fig, ax = plt.subplots(figsize = (24,12)) 

    # Add permit locs colored by the number of bike stations within walking distance
    permits_gdf_enhanced.sort_values(by="bike_count_dv").plot(ax=ax, 
                                                                column="bike_count_dv", 
                                                                edgecolor='grey', 
                                                                legend=True, 
                                                                cmap='Greens',
                                                                markersize=25)

    ax.set_title('Oakland Permit locations by Number of Bike Stations within Walking Distance')
    plt.show()

    # SOLUTION

    # Create an interactive map with your new data as a layer and check your results
    # Define the basemap
    buf_map = folium.Map(location=[37.809142, -122.268228],   # lat, lon around which to center the map
                     tiles='CartoDB Positron',
                     width=900,                        # the width & height of the output map
                     height=600,                       # in pixels
                     zoom_start=15)  

    # Add Bike Stations as Circle Markers - you can set radius
    for i in baywheels_stations_gdf.index:
        folium.Circle(
            location=[baywheels_stations_gdf['geometry'].y[i], baywheels_stations_gdf['geometry'].x[i]],
            radius= 500,
            popup= baywheels_stations_gdf['name'][i],
            color='green',
            fill=True,
            fill_color='green'
    ).add_to(buf_map)

    # Add Bike stations as Markers (default with GeoJson when data are points)
    folium.GeoJson(baywheels_stations_gdf,
                  tooltip=folium.GeoJsonTooltip(fields=['name' ], 
                       aliases=['Location'],
                       labels=True,
                       localize=True
                   ),
                  ).add_to(buf_map)

    # Add permit locations
    permits_gdf_enhanced.to_crs('epsg:4326').apply(lambda row: folium.Circle(location=[row['geometry'].y,row['geometry'].x],
                                      tooltip= row['bike_count_dv'],
                                      radius=5,
                                     ).add_to(buf_map),
                                 axis=1)

    buf_map.add_child(MeasureControl())

    buf_map # wait for it...
--->


## Congrats you're done with part 5!



<br>

---
<div style="display:inline-block;vertical-align:middle;">
<a href="https://dataforhousing.org/" target="_blank"><img src ="https://media-exp1.licdn.com/dms/image/C560BAQELkt35AxeIeA/company-logo_200_200/0?e=1597881600&v=beta&t=irZ1tYCA9A2biVzCguvCXzsfzanSYDFuF22IUFNY5Sg" width="75" align="left">
</a>
</div>

<div style="display:inline-block;vertical-align:middle;">
    <div style="font-size:larger">&nbsp;Data Science for Housing Workshop, University of California Berkeley</div>
    <div>&nbsp;Tim Thomas, Patty Frontiera, Emmanuel Lopez, Ethan Ebinger, Hikari Murayama, Karen Chapple, Claudia von Vacano</div>
    <div>&copy; UC Regents, 2019-2020</div>
</div>