# Distances and nearest neighbors

## Lecture objectives

1. Introduce distance and nearest neighbor calculations

A common task is computing the distances between one geometry and a set of other geometries, or finding its nearest neighbor. For example, you might want to find the school or grocery store closest to a particular census tract.

First, let's load in the same dataset we used in the previous lectures.

In [None]:
import geopandas as gpd
import pandas as pd
import numpy as np
import requests
import pygris

pantryDf = pd.read_csv('../data/Food_Resources_in_California.csv')
pantryDf = pantryDf[pantryDf.County=='Los Angeles']

# convert to a GeoDataFrame
pantrygdf = gpd.GeoDataFrame(
    pantryDf, geometry=gpd.points_from_xy(pantryDf.Longitude, pantryDf.Latitude, 
                                          crs='EPSG:4326'))

# get the census data for Los Angeles County (state 06, county 037)
# B19019_001E is median household income
r = requests.get('https://api.census.gov/data/2019/acs/acs5?get=B19019_001E&for=tract:*&in=state:06%20county:037')
censusdata = r.json()
incomeDf = pd.DataFrame(censusdata[1:], columns=censusdata[0])
incomeDf.rename(columns={'B19019_001E':'median_HH_income'}, inplace=True)
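# convert to float (not int) so that missing values can be stored as NaN
incomeDf.median_HH_income = incomeDf.median_HH_income.astype(float)
# the Census API codes missing estimates as large negative numbers
incomeDf.loc[incomeDf.median_HH_income<0, 'median_HH_income'] = np.nan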

# Add the tract boundaries. For this, we'll use the pygris package
tracts = pygris.tracts(state='06',county='037', year=2019)
tracts.set_index(['STATEFP','COUNTYFP','TRACTCE'], inplace=True)
tracts.index.names=['state','county','tract']
# and join the tract boundaries to the census data
incomeDf = tracts[['geometry']].join(incomeDf.set_index(['state','county','tract'])).reset_index()

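Distances are computed in the units of the coordinate reference system, so the projection matters. Our data are in latitude/longitude (EPSG:4326), where distances would come out in degrees, so let's convert to State Plane first.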

In [None]:
pantrygdf.to_crs('EPSG:3497', inplace=True)
incomeDf.to_crs('EPSG:3497', inplace=True)
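To see why reprojecting matters, here is a minimal sketch that measures the same pair of points in both coordinate systems. The two coordinates are made up for illustration (roughly in central Los Angeles) and are not taken from the pantry dataset. The unit name is read from the CRS itself rather than assumed.

```python
import geopandas as gpd

# Two hypothetical points, 0.1 degrees of longitude apart
pts_latlon = gpd.GeoSeries(
    gpd.points_from_xy([-118.25, -118.15], [34.05, 34.05]), crs='EPSG:4326')
pts_projected = pts_latlon.to_crs('EPSG:3497')

# Distance between the two points in each coordinate system
dist_degrees = pts_latlon.iloc[0].distance(pts_latlon.iloc[1])
dist_projected = pts_projected.iloc[0].distance(pts_projected.iloc[1])

# Ask the projected CRS which unit its axes use
unit = pts_projected.crs.axis_info[0].unit_name

print(f'unprojected: {dist_degrees:.3f} degrees')
print(f'projected:   {dist_projected:.0f} {unit}')
```

The unprojected number is a difference in degrees, which does not correspond to a fixed distance on the ground. The projected number is in linear units that you can interpret directly.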

The nearest neighbor can be found with `sjoin_nearest`. The optional `distance_col` argument names a new column that holds the distance to that neighbor. If two pantries are equally close, the tract appears once for each.

In [None]:
incomeDf.sjoin_nearest(pantrygdf, distance_col='dist_to_pantry')
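The output above is large, so it can help to check the behavior on a tiny example first. The sketch below uses hypothetical toy data (`toy_tracts` and `toy_pantries` are made-up names, with points in an arbitrary planar coordinate system) and also shows the optional `max_distance` argument.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical toy data: two "tracts" and three "pantries"
toy_tracts = gpd.GeoDataFrame(
    {'tract': ['A', 'B']}, geometry=[Point(0, 0), Point(10, 0)])
toy_pantries = gpd.GeoDataFrame(
    {'pantry': ['p1', 'p2', 'p3']},
    geometry=[Point(3, 4), Point(10, 1), Point(20, 0)])

# Each tract is matched to its nearest pantry; 'dist' holds the distance
nearest = toy_tracts.sjoin_nearest(toy_pantries, distance_col='dist')
print(nearest[['tract', 'pantry', 'dist']])

# max_distance limits the search radius; with the default inner join,
# tracts with no pantry within that radius are dropped from the result
nearby = toy_tracts.sjoin_nearest(
    toy_pantries, distance_col='dist', max_distance=2)
print(nearby[['tract', 'pantry', 'dist']])
```

The joined result also includes an `index_right` column with the row label of the matched pantry, which is handy for looking up its other attributes.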

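Note that the distance is measured from the closest part of each census tract polygon, so a tract that contains a pantry has a distance of zero. If we want distances from the centroid instead, we can create a new GeoDataFrame and replace its polygons with their centroids.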

In [None]:
import matplotlib.pyplot as plt

incomeDf_centroids = incomeDf.copy()
incomeDf_centroids.geometry = incomeDf.geometry.centroid

# map to show the centroids 
fig, ax=plt.subplots(figsize=(5,5))
incomeDf_centroids.plot(markersize=1, ax=ax)
incomeDf.plot(ax=ax, lw=4, alpha=0.5)

And let's do the nearest neighbor with these centroids.

In [None]:
incomeDf_centroids.sjoin_nearest(pantrygdf, distance_col='dist_to_pantry')

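Notice that the distances are a little larger than before, since a centroid is usually farther from a pantry than the nearest edge of its tract. Comparing the means makes this clear.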

In [None]:
incomeDf.sjoin_nearest(pantrygdf, 
        distance_col='dist_to_pantry').dist_to_pantry.mean()

In [None]:
incomeDf_centroids.sjoin_nearest(pantrygdf, 
        distance_col='dist_to_pantry').dist_to_pantry.mean()
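The difference between the two means comes from how the distance is measured. As a small illustration with made-up shapes (a hypothetical square tract and two hypothetical pantries), compare the polygon distance with the centroid distance:

```python
from shapely.geometry import Point, box

tract = box(0, 0, 10, 10)      # hypothetical square tract, centroid at (5, 5)
pantry_inside = Point(9, 9)    # lies within the tract
pantry_outside = Point(13, 5)  # lies 3 units east of the tract edge

for name, pantry in [('inside', pantry_inside), ('outside', pantry_outside)]:
    d_polygon = tract.distance(pantry)            # from nearest part of polygon
    d_centroid = tract.centroid.distance(pantry)  # from the center point
    print(f'{name}: polygon={d_polygon:.2f}, centroid={d_centroid:.2f}')
```

A pantry inside the tract has a polygon distance of zero but a positive centroid distance. Which measure is more appropriate depends on the question you are asking.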

What if you don't just care about the closest one, but want to get the distances from a census tract to a larger number of pantries, or even all of them? For example, some accessibility measures look at the distance to the 2nd or 3rd closest destination (e.g. a grocery store), in order to capture the number of choices that people have.

To start with, let's look at the distances from a single tract to every pantry. Note that `sort_values` will sort the results, so it's easier to see the smallest and largest distances.

In [None]:
# as an example, take the first census tract, and get its geometry
tractgeom = incomeDf.iloc[0].geometry

# get the distances from this tract to all the food pantries
distances = pantrygdf.distance(tractgeom)
distances.sort_values(inplace=True)
distances

So how do we know which one is the 3rd closest? We can use `iloc` to get the 3rd row, which is at position 2 because Python counts from zero.

In [None]:
distances.iloc[2]
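`iloc` returns the distance itself. If you also want to know which pantry it belongs to, the sorted Series keeps the original row labels in its index. Here is a minimal sketch with made-up distances and labels (`toy_distances` is a hypothetical stand-in for the `distances` Series):

```python
import pandas as pd

# Hypothetical distances, indexed by the row label of each pantry
toy_distances = pd.Series([420.0, 95.0, 310.0, 150.0], index=[10, 11, 12, 13])
toy_sorted = toy_distances.sort_values()

third_dist = toy_sorted.iloc[2]    # value at sorted position 2 (third closest)
third_label = toy_sorted.index[2]  # row label of that pantry

print(third_dist, third_label)
```

With the real data, that label could then be passed to `pantrygdf.loc[...]` to look up the pantry's attributes.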

If we want to calculate the distance to the 3rd closest pantry for each census tract, we can put this in a function.

The argument of the function will be the geometry of the tract. It will return the distance.

Once we have that function, we can use our old friend `apply` to apply it to every tract in Los Angeles County.

In [None]:
def get_3rd_closest_dist(geom):
    # get distance from every pantry to a single census tract (geom)
    distances = pantrygdf.distance(geom)
    third_closest = distances.sort_values().iloc[2]
    return third_closest

incomeDf['dist_third_closest'] = incomeDf.geometry.apply(get_3rd_closest_dist)

In [None]:
incomeDf
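The same idea extends to any rank, not just the third. The following is a sketch of one possible generalization, shown on hypothetical toy data. The function name `get_kth_closest_dist` and its parameters `destinations` and `k` are made up for this illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

def get_kth_closest_dist(geom, destinations, k=3):
    """Distance from geom to its k-th closest geometry in destinations."""
    distances = destinations.distance(geom)
    # nsmallest returns the k smallest values in ascending order,
    # so the last of them is the k-th closest
    return distances.nsmallest(k).iloc[-1]

# Hypothetical toy data along a line
toy_pantries = gpd.GeoSeries([Point(1, 0), Point(2, 0), Point(4, 0), Point(8, 0)])
toy_tracts = gpd.GeoDataFrame(geometry=[Point(0, 0), Point(10, 0)])

# One column per rank: closest, 2nd closest, 3rd closest
for k in (1, 2, 3):
    toy_tracts[f'dist_k{k}'] = toy_tracts.geometry.apply(
        lambda g: get_kth_closest_dist(g, toy_pantries, k))

print(toy_tracts)
```

One design note: if `k` is larger than the number of destinations, this sketch returns the distance to the farthest one rather than raising an error, which may or may not be what you want.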

Finally, let's plot median household income against the distance to the third-closest pantry, using the `seaborn.regplot()` function that we saw before.

In [None]:
import seaborn as sns
ax = sns.regplot(x="median_HH_income", y="dist_third_closest", data=incomeDf)

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Nearest neighbors and distances are simple to calculate in geopandas.</li>
  <li>Watch your projection!</li>
</ul>
</div>