# Advanced Spatial Analysis with [PySAL](https://pysal.org)

Examples used here are based on the [excellent tutorials](http://darribas.org/materials.html) by Dani Arribas-Bel.

### Let's start with the PySAL *viz* component

In [None]:
%matplotlib inline

import pandas as pd
import geopandas as gpd
import seaborn as sns
import contextily as ctx
import matplotlib.pyplot as plt
import numpy as np
from shapely.geometry import Polygon
import libpysal 
from libpysal import weights
from pysal.explore import esda
from pysal.viz import mapclassify
#from pysal.viz.splot.mapping import vba_choropleth

In [None]:
plt.rcParams['figure.figsize'] = [10, 10] # change standard figure size

Let's load a [geopackage](http://www.geopackage.org) with districts in [Belo Horizonte](https://en.wikipedia.org/wiki/Belo_Horizonte), reproject it to [EPSG:3857](https://epsg.io/3857), and plot it:

In [None]:
db = gpd.read_file('https://github.com/darribas/gds_ufmg19/raw/master/data/bh.gpkg').to_crs(epsg=3857)
db.plot();

In [None]:
db.head()

### Seaborn [color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html)

Sequential

In [None]:
sns.palplot(sns.color_palette('viridis', 7))

Divergent

In [None]:
sns.palplot(sns.color_palette('coolwarm', 7))

Categorical

In [None]:
sns.palplot(sns.color_palette('Set2', 7))

### Assigning colors to values: Classification 

In the raster lecture, we already used a non-linear mapping from data values to colors. There are more elaborate ways to do this, though, particularly if we want to classify the data so that we can reduce the number of color values on the map to preserve readability. We'll go through a few ways to do this (giving you some cartography skills about [Choropleth Maps](https://onlinelibrary-wiley-com.zorac.aub.aau.dk/doi/abs/10.1002/9781118786352.wbieg0951) along the way), starting with

**Equal intervals**

In [None]:
classi = mapclassify.EqualInterval(db['Average Monthly Wage'], k=7)
classi

In [None]:
classi.bins
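To see what those bins mean, here is a minimal NumPy sketch of equal-interval classification. The helper name `equal_interval_bins` and the toy wage values are made up for illustration; the result should correspond to the upper class bounds that `classi.bins` reports, though mapclassify's own implementation may differ in detail.

```python
import numpy as np

def equal_interval_bins(values, k):
    """Upper bounds of k equally wide classes spanning [min, max]."""
    values = np.asarray(values, dtype=float)
    # k + 1 evenly spaced edges; drop the first one (the minimum) to keep upper bounds
    return np.linspace(values.min(), values.max(), k + 1)[1:]

# Hypothetical wages, not taken from the Belo Horizonte data
toy_wages = [800, 950, 1200, 1500, 4300]
print(equal_interval_bins(toy_wages, k=7))
```

Note that a single high value stretches the whole range, so most observations can end up in the lowest classes.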

We'll make ourselves a little function that will nicely show where the boundaries between the classes are when we want to compare the different classification methods:

In [None]:
def plotClassification(classi):
    # Set up the figure
    f, ax = plt.subplots(1, figsize=(9, 6))
    # Plot the kernel density estimation (KDE)
    sns.kdeplot(db['Average Monthly Wage'], fill=True)
    # Add a blue tick for every value at the bottom of the plot (rugs)
    sns.rugplot(db['Average Monthly Wage'], alpha=0.5)
    # Loop over each break point and plot a vertical red line
    for cut in classi.bins:
        plt.axvline(cut, color='red', linewidth=0.75)
    # Title
    ax.set_title(classi.name)
    # Display image
    plt.show()

In [None]:
plotClassification(classi)

**Quantiles**

The equal intervals method splits the data range into, *ahem*, equal intervals. Quantiles instead places the class boundaries so that each class contains (roughly) the same number of data points:

In [None]:
classi = mapclassify.Quantiles(db['Average Monthly Wage'], k=7)
classi

In [None]:
plotClassification(classi)
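The difference between the two methods becomes obvious when we count the observations per class. The following is a rough NumPy sketch on synthetic, right-skewed data; the helper names and the log-normal "wages" are assumptions for illustration, and mapclassify may compute its quantiles slightly differently.

```python
import numpy as np

def equal_interval_bins(values, k):
    """Upper bounds of k equally wide classes."""
    values = np.asarray(values, dtype=float)
    return np.linspace(values.min(), values.max(), k + 1)[1:]

def quantile_bins(values, k):
    """Upper bounds placed at the 1/k, 2/k, ..., k/k quantiles."""
    return np.quantile(values, np.linspace(0, 1, k + 1)[1:])

def class_counts(values, bins):
    """Count how many values fall into each class, given upper class bounds."""
    values = np.asarray(values, dtype=float)
    # Index of the first upper bound that is >= each value
    idx = np.searchsorted(bins, values, side="left")
    return np.bincount(idx, minlength=len(bins))

# Hypothetical right-skewed values standing in for the wage column
rng = np.random.default_rng(0)
toy_wages = rng.lognormal(mean=7, sigma=0.5, size=700)

print("equal interval:", class_counts(toy_wages, equal_interval_bins(toy_wages, 7)))
print("quantiles:     ", class_counts(toy_wages, quantile_bins(toy_wages, 7)))
```

With skewed data, the equal-interval counts are very uneven, while the quantile counts are balanced by construction.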

**[Fisher-Jenks](https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization)** is a data clustering method designed to determine the "best" arrangement of values, such that natural breaks are identified and used.

In [None]:
classi = mapclassify.FisherJenks(db['Average Monthly Wage'], k=7)
classi

In [None]:
plotClassification(classi)
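What "best" means here is that the values within each class are as similar to each other as possible. The sketch below illustrates that objective with an exhaustive search over all possible break positions. It is *not* the dynamic-programming algorithm behind Fisher-Jenks, it is only feasible for a handful of values, and the function name and toy data are invented for this example.

```python
from itertools import combinations

import numpy as np

def natural_breaks_bruteforce(values, k):
    """Upper class bounds minimising the within-class sum of squared deviations."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    best_cost, best_edges = np.inf, None
    # Try every way of cutting the sorted values into k contiguous groups
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        cost = sum(((v[a:b] - v[a:b].mean()) ** 2).sum()
                   for a, b in zip(edges[:-1], edges[1:]))
        if cost < best_cost:
            best_cost, best_edges = cost, edges
    # The upper bound of each class is its largest member
    return np.array([v[b - 1] for b in best_edges[1:]])

# Hypothetical values with three visually obvious groups
toy = [1, 2, 3, 10, 11, 12, 50, 51]
print(natural_breaks_bruteforce(toy, k=3))
```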

**Boxplot** should sound familiar (if not, go back to the exploratory data analysis notebook). This method uses the same breaks used for the box plot to classify the data.

In [None]:
classi = mapclassify.BoxPlot(db['Average Monthly Wage'])
classi

How does the classification relate to the box plot?

In [None]:
# Set up the figure
f, axs = plt.subplots(2, figsize=(10, 10), gridspec_kw = {'height_ratios':[3, 1]})
# Plot the kernel density estimation (KDE)
sns.kdeplot(db['Average Monthly Wage'], fill=True, ax=axs[0])
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(db['Average Monthly Wage'], alpha=0.5, ax=axs[0])
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    axs[0].axvline(cut, color='red', linewidth=0.75)
# Box-Plot
axs[1].boxplot(db['Average Monthly Wage'], vert=False)
# Set X axis manually
axs[1].set_xlim(axs[0].get_xlim())
# Title
axs[0].set_title(classi.name)
# Display image
plt.show()
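The red lines sit at the familiar box plot landmarks: the quartiles plus the two "fences" at 1.5 times the interquartile range beyond the box. Here is a small NumPy sketch of those break points; the function name and the `hinge` parameter name are assumptions for this example, and mapclassify may treat the last class (everything above the upper fence) differently.

```python
import numpy as np

def boxplot_breaks(values, hinge=1.5):
    """Lower fence, Q1, median, Q3 and upper fence of a box plot."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1  # interquartile range: the width of the box
    return np.array([q1 - hinge * iqr, q1, median, q3, q3 + hinge * iqr])

# Hypothetical values, not the wage data
print(boxplot_breaks(np.arange(1, 10)))
```

Values below the lower fence or above the upper fence are what a box plot draws as outliers.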

Let's make some

## Choropleth maps

Use the classification methods to assign colors to features on the map.

🏋 Which of the following maps is the best? Which one is correct? Let's discuss during the Q&A!

In [None]:
for classification in ['equal_interval', 'quantiles', 'fisher_jenks']:
    f, ax = plt.subplots(1, figsize=(14, 14))
    db.plot(column='Average Monthly Wage', scheme=classification, ax=ax, legend=True)
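    # 'source' replaces the older 'url' argument; CartoDB.Positron stands in for
    # Stamen.Toner, which recent provider lists may no longer include
    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)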
    ax.set_axis_off()
    plt.axis('equal')
    plt.title(classification)
    plt.show()
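Conceptually, a classified choropleth is a two-step lookup: each value is mapped to a class index, and each class index to a color. The sketch below shows this with plain NumPy; the function name, break values, and three-color palette are invented for illustration and say nothing about how geopandas implements it internally.

```python
import numpy as np

def assign_colors(values, bins, palette):
    """Map each value to the color of the class it falls into."""
    # Index of the first upper class bound that is >= the value
    idx = np.searchsorted(bins, values, side="left")
    return [palette[i] for i in idx]

# Hypothetical upper class bounds and a made-up sequential palette
bins = [3, 6, 9]
palette = ["lightyellow", "orange", "red"]
print(assign_colors([1, 5, 9], bins, palette))
```

Changing the classification scheme only changes `bins`, which is why the same data can produce such different-looking maps.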

# [Spatial Weights](http://darribas.org/gds_scipy16/ipynb_md/03_spatial_weights.html)

Spatial weights are central components of many areas of spatial analysis. In general terms, for a spatial data set composed of *n* locations (points, areal units, network edges, etc.), the spatial weights matrix expresses the potential for interaction between observations at each pair *i,j* of locations. There is a rich variety of ways to specify the structure of these weights, and PySAL supports the creation, manipulation and analysis of spatial weights matrices across three different general types:

- Contiguity Based Weights
- Distance Based Weights
- Kernel Weights
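This notebook focuses on contiguity, so here is a quick NumPy sketch of the other two ideas on three made-up points. The function names, the threshold, and the bandwidth are assumptions for illustration; PySAL's own classes offer more options (other kernel shapes, k-nearest neighbours, and so on).

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_band_weights(points, threshold):
    """Binary weights: 1 if two distinct points lie within the threshold."""
    w = (pairwise_distances(points) <= threshold).astype(float)
    np.fill_diagonal(w, 0.0)  # a point is not its own neighbour
    return w

def triangular_kernel_weights(points, bandwidth):
    """Weights that decay linearly with distance and reach 0 at the bandwidth."""
    z = pairwise_distances(points) / bandwidth
    return np.clip(1.0 - z, 0.0, None)

# Hypothetical point locations along a line
pts = [(0, 0), (1, 0), (3, 0)]
print(distance_band_weights(pts, threshold=1.5))
print(triangular_kernel_weights(pts, bandwidth=2))
```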

Let's take a look at a lattice example with fake data:

In [None]:
# Build the weights for a regular 3x3 lattice and draw it
w = weights.lat2W(3, 3, rook=True)
# Get points in a grid
l = np.arange(3)
xs, ys = np.meshgrid(l, l)
# Set up store
polys = []
# Generate polygons
for x, y in zip(xs.flatten(), ys.flatten()):
    poly = Polygon([(x, y), (x+1, y), (x+1, y+1), (x, y+1)])
    polys.append(poly)
# Convert to GeoSeries
polys = gpd.GeoSeries(polys)
gdf = gpd.GeoDataFrame({'geometry': polys, 
                        'id': ['P-%s'%str(i).zfill(2) for i in range(len(polys))]})
w.remap_ids(gdf['id'].values)
# Annotate & Visualise
ax = polys.plot(edgecolor='k', facecolor='w')
for poly, label in zip(polys, gdf['id']):
    plt.text(poly.centroid.x, poly.centroid.y, label,
             verticalalignment='center',
             horizontalalignment='center')
ax.set_axis_off()

Now we can generate a contiguity matrix:

In [None]:
pd.DataFrame(w.full()[0], 
             index=gdf['id'],
             columns=gdf['id'],
            ).astype(int)

### Real-world data: Belo Horizonte

In [None]:
w_queen = weights.Queen.from_dataframe(db)
pd.DataFrame(w_queen.full()[0], 
             index=db['CD_GEOCMU'],
             columns=db['CD_GEOCMU'],
            ).astype(int)

☝️ Rook's case vs. Queen's case: 

- In Rook's case, only shared **edges** lead to a connection
- In Queen's case, shared **vertices** (points) also lead to a connection
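On a regular lattice, the two cases are easy to write down by hand. The following NumPy sketch builds both matrices for a 3x3 grid with cells numbered row by row; the function name is made up for this example, and it is only meant to mirror the idea, not libpysal's implementation.

```python
import numpy as np

def lattice_contiguity(nrows, ncols, queen=False):
    """Binary contiguity matrix for a regular grid, cells numbered row by row."""
    n = nrows * ncols
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        ri, ci = divmod(i, ncols)
        for j in range(n):
            rj, cj = divmod(j, ncols)
            dr, dc = abs(ri - rj), abs(ci - cj)
            if queen:
                # Shared edge or shared corner
                connected = max(dr, dc) == 1
            else:
                # Shared edge only
                connected = dr + dc == 1
            W[i, j] = int(connected)
    return W

rook = lattice_contiguity(3, 3)
queen = lattice_contiguity(3, 3, queen=True)
print("neighbours of the centre cell:", rook[4].sum(), "(rook) vs.", queen[4].sum(), "(queen)")
```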

# Spatial Autocorrelation

For more methods to analyse spatial autocorrelation, check out https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html

We'll just do a quick analysis using **[Moran's I](https://en.wikipedia.org/wiki/Moran%27s_I)**.

We want to analyse the spatial autocorrelation of the industry diversity in Belo Horizonte. For comparison, let's create a fake column by randomly shuffling the industry diversity values across the districts:

In [None]:
np.random.seed(1234)
db['Random Industry Diversity'] = db['Industry Diversity'].sample(frac=1).values
db

In [None]:
moran = esda.Moran(db['Industry Diversity'], w_queen)
moran.I

In [None]:
moran.p_sim

In [None]:
moran_shuffled = esda.Moran(db['Random Industry Diversity'], w_queen)
moran_shuffled.I

In [None]:
moran_shuffled.p_sim
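To demystify these numbers: Moran's I is $I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}$, where $z_i$ is the deviation of observation $i$ from the mean and $S_0$ is the sum of all weights. The pseudo p-value comes from recomputing I many times on shuffled data. Below is a minimal NumPy sketch on a made-up chain of four cells; the function names are invented, the p-value is a simplified one-sided version, and `esda.Moran` applies its own weight transformation and p-value conventions, so its numbers need not match exactly.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for values y and a dense spatial weights matrix W."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    z = y - y.mean()  # deviations from the mean
    return len(y) / W.sum() * (z @ W @ z) / (z @ z)

def permutation_p_value(y, W, permutations=999, seed=0):
    """Share of random shuffles with an I at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)
    sims = np.array([morans_i(rng.permutation(y), W) for _ in range(permutations)])
    return (np.sum(sims >= observed) + 1) / (permutations + 1)

# Hypothetical chain of four cells: 0 - 1 - 2 - 3
W = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)

print(morans_i([1, 1, -1, -1], W))  # similar values next to each other: positive I
print(morans_i([1, -1, 1, -1], W))  # alternating values: negative I
print(permutation_p_value([1, 1, -1, -1], W))
```

With only four cells, the p-value is not meaningful; the point is merely to show where `I` and `p_sim` come from.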

# 💪 Exercise

Write some code to:

1. Download the countries dataset from https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-countries/
2. Make a choropleth map of the population density (column `POP_EST` divided by country area)
3. Calculate Moran's I to understand whether there is spatial autocorrelation of the gross domestic product per capita (`GDP_MD` / `POP_EST`).