# Point Patterns

> [`IPYNB`](../content/part2/06_points.ipynb)

> **NOTE**: some of this material has been ported and adapted from "Lab 9" in [Arribas-Bel (2016)](http://darribas.org/gds15/labs/Lab_09.html).

This notebook gives a brief introduction to visualizing and analyzing point patterns. To demonstrate, we will use a dataset of all the Airbnb listings in the city of Austin (see the Data section for more information about the dataset).

Before anything, let us load up the libraries we will use:

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import mplleaflet as mpll

## Data preparation

Let us first set the paths to the datasets we will be using:

In [None]:
# Adjust this to point to the right file in your computer
listings_link = 'data/listings.csv.gz'

The core dataset we will use is `listings.csv`, which contains a lot of information about each individual location listed at AirBnb within Austin:

In [None]:
lst = pd.read_csv(listings_link)
lst.info()

It turns out that one record displays a very odd location and, for the sake of the illustration, we will remove it:

In [None]:
odd = lst.loc[lst.longitude>-80, ['longitude', 'latitude']]
odd

In [None]:
lst = lst.drop(odd.index)

## Point Visualization

The most straightforward way to get a first glimpse of the distribution of the data is to plot their latitude and longitude:

In [None]:
sns.jointplot

In [None]:
sns.jointplot(x="longitude", y="latitude", data=lst);

This plot does not necessarily tell us much about the dataset or the distribution of locations within Austin. There are two main challenges in interpreting it. First, there is a lack of context, so the points cannot be placed in space (unless you are so familiar with lon/lat pairs that they have a clear meaning to you). Second, in the center of the plot there are so many points that it is hard to discern any pattern other than a big blob of blue.

Let us focus on the first problem, geographical context. The quickest and easiest way to provide context for this set of points is to overlay them on a general map. If we had an image of the map, or a set of data sources that we could combine to create one, we could build it from scratch. But in the 21st century, the easiest option is to overlay our point dataset on top of a web map. In this case, we will use [Leaflet](http://leafletjs.com/), and we will convert our underlying `matplotlib` points with `mplleaflet`. The full dataset (more than 5,000 observations) is a bit too much for Leaflet to plot directly on screen, so we will draw a random sample of 100 points:

In [None]:
# NOTE: comment out the last line of this cell to skip rendering the
#       Leaflet map (e.g. when compiling the website).
rids = np.arange(lst.shape[0])
np.random.shuffle(rids)
f, ax = plt.subplots(1, figsize=(6, 6))
lst.iloc[rids[:100], :].plot(kind='scatter', x='longitude', y='latitude',
                             s=30, linewidth=0, ax=ax);
mpll.display(fig=f,)

This map gives us a much better sense of where the points are and what type of location they might be in. For example, we can now see that the big blue blob in the earlier scatter plot corresponds to the urbanized core of Austin.
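The shuffle-and-slice sampling used above can also be written with the `sample` method in `pandas`. The following is a minimal sketch that assumes a hypothetical stand-in table `toy` with random coordinates rather than the real `lst` data; `random_state` is added so the draw is repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `lst`: 5,000 random lon/lat pairs
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-98.0, -97.5, 5000),
    "latitude": rng.uniform(30.1, 30.5, 5000),
})

# Draw 100 rows without replacement; `random_state` fixes the draw
subset = toy.sample(n=100, random_state=42)
print(subset.shape)
```

With the real data, the equivalent call would presumably be `lst.sample(n=100)`, whose result can be passed to `.plot` in the same way as the sliced table.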

### `bokeh` alternative

Leaflet is not the only technology for displaying data on maps, although it is probably the default option in many cases. When the dataset is too large for Leaflet to render comfortably, we need to resort to more technically sophisticated alternatives. One option is provided by `bokeh`, together with the companion `datashader` library (see [here](https://anaconda.org/jbednar/nyc_taxi/notebook) for a very nice introduction to the library, from which this example has been adapted).

Before we delve into `bokeh`, let us reproject our original data (lon/lat coordinates) into Web Mercator, as `bokeh` will expect them. To do that, we turn the coordinates into a `GeoSeries`:

In [None]:
# Build point geometries from the lon/lat columns (WGS84, EPSG:4326)
xys_wb = gpd.GeoSeries(gpd.points_from_xy(lst['longitude'], lst['latitude']),
                       index=lst.index, crs="EPSG:4326")
# Reproject to Web Mercator (EPSG:3857) and extract the coordinates
xys_wb = xys_wb.to_crs(epsg=3857)
x_wb = xys_wb.x
y_wb = xys_wb.y
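To make the reprojection less of a black box, here is a minimal sketch of the spherical Web Mercator formulas that a transformation to EPSG:3857 applies. The function name `lonlat_to_web_mercator` and the sample coordinates for central Austin are illustrative assumptions, not part of the original notebook:

```python
import math

# WGS84 semi-major axis in metres, used as the sphere radius by EPSG:3857
R = 6378137.0

def lonlat_to_web_mercator(lon, lat):
    """Project one lon/lat pair (in degrees) to Web Mercator metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Approximate coordinates for central Austin (illustrative values)
x, y = lonlat_to_web_mercator(-97.74, 30.27)
print(x, y)
```

In practice it is safer to let `geopandas` handle the projection, as done above; the sketch is only meant to show that the result is expressed in metres on a plane rather than in degrees.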

Now we are ready to set up the plot in `bokeh`:

In [None]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.tile_providers import get_provider, Vendors

output_notebook()

minx, miny, maxx, maxy = xys_wb.total_bounds
y_range = miny, maxy
x_range = minx, maxx

def base_plot(tools='pan,wheel_zoom,reset',plot_width=600, plot_height=400, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)
    
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p
    
options = dict(line_color=None, fill_color='#800080', size=4)

And good to go for mapping!

In [None]:
# NOTE: comment out the last line of this cell to skip rendering
#       (e.g. when compiling the website).
p = base_plot()
#p.add_tile(get_provider(Vendors.STAMEN_TERRAIN))
p.circle(x=x_wb, y=y_wb, **options)
show(p)

## Kernel Density Estimation

A common alternative when the number of points grows is to replace plotting every single point with an estimate of the continuous probability distribution that generated them. In this case, we will not be visualizing the points themselves, but an abstracted surface that models the density of points over space. The most commonly used method to do this is the so-called kernel density estimate (KDE). The idea behind KDEs is to count the number of points in a continuous way. Instead of using discrete counting, where you include a point in the count if it is inside a certain boundary and ignore it otherwise, KDEs use functions (kernels) that include every point but give each one a different weight depending on how far it is from the location where we are counting.
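The difference between discrete and kernel-weighted counting can be illustrated with a small sketch. It assumes a Gaussian kernel and four made-up points; the function names are hypothetical and the weighted sum is left unnormalized, so it is not yet a proper density:

```python
import numpy as np

def discrete_count(points, loc, radius):
    """Count points within `radius` of `loc`: each point counts 0 or 1."""
    dists = np.linalg.norm(points - loc, axis=1)
    return int((dists <= radius).sum())

def gaussian_kernel_sum(points, loc, bandwidth):
    """Sum Gaussian weights: near points add close to 1, far ones close to 0."""
    dists = np.linalg.norm(points - loc, axis=1)
    return float(np.exp(-0.5 * (dists / bandwidth) ** 2).sum())

# Made-up points: two near the origin, one just outside radius 1, one far away
pts = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 1.1], [3.0, 3.0]])
origin = np.array([0.0, 0.0])

hard = discrete_count(pts, origin, radius=1.0)
soft = gaussian_kernel_sum(pts, origin, bandwidth=1.0)
print(hard, soft)
```

The point at distance 1.1 is ignored entirely by the discrete count but still contributes about half a unit to the kernel-weighted sum, which is what makes the resulting surface smooth.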

Creating a KDE is very straightforward in Python. In its simplest form, we can run the following single line of code:

In [None]:
# Keyword `x`/`y` and `fill` are the seaborn >= 0.11 spelling (`shade` is deprecated)
sns.kdeplot(x=lst['longitude'], y=lst['latitude'], fill=True, cmap='viridis');

Now, if we want to include additional layers of data to provide context, we can do so in the same way we would layer up different elements in `matplotlib`. Let us first load the ZIP codes in Austin, for example:

In [None]:
zc = gpd.read_file('data/Zipcodes.geojson')
zc.plot()

And, to overlay both layers:

In [None]:
f, ax = plt.subplots(1, figsize=(9, 9))

zc.plot(color='white', linewidth=0.1, ax=ax)

# Draw the KDE on the same axis, on top of the ZIP code polygons
sns.kdeplot(x=lst['longitude'], y=lst['latitude'],
            fill=True, cmap='Purples',
            ax=ax);

ax.set_axis_off()
plt.axis('equal')
plt.show()

<!--
## `bokeh` alternative

pts.head()

from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Setup kernel
kde = KernelDensity(metric='euclidean',
                    kernel='gaussian', algorithm='ball_tree')
# Bandwidth selection
gs = GridSearchCV(kde, \
                {'bandwidth': np.linspace(0.1, 1.0, 30)}, \
                cv=3)
%time cv = gs.fit(pts[['x', 'y']].values)
bw = cv.best_params_['bandwidth']
kde.bandwidth = bw
# Fit the KDE
kde.fit(pts[['x', 'y']].values)

# Build a mesh
minX, minY = pts[['x', 'y']].values.min(axis=0)
maxX, maxY = pts[['x', 'y']].values.max(axis=0)
bbox = [minX, minY, maxX, maxY]
mn = 100
mx = np.linspace(minX, maxX, mn)
my = np.linspace(minY, maxY, mn)
mxx, myy = np.meshgrid(mx, my)
mxxyy = np.hstack((mxx.reshape(-1, 1), myy.reshape(-1, 1)))
# Fit to the KDE
d = kde.score_samples(mxxyy).reshape(mn, mn)

print(pts.min()['x'], pts.min()['y'])

print(pts.max()['x'], pts.max()['y'])

mxxyy.max(axis=0)

from bokeh.plotting import figure, show

p = base_plot()
p.add_tile(STAMEN_TERRAIN)

p.image(image=[d], x=minX, y=minY, dw=maxX-minX, dh=maxY-minY, \
        alpha=0.001, palette="Blues9")

show(p)
-->