# Intro to Geo-Spatial Analysis


## Introduction

In this chapter, you'll learn the basics of geospatial analysis with code.

You should be aware when following this chapter that installing geographic analysis packages isn't always the easiest and things can and do go wrong! (Some geospatial analysis courses recommend running everything in a Docker container.)

There are two types of spatial data in geographic information systems (GIS): vector and raster. Vector spatial data are made up of vertices and paths. These are represented computationally by points (zero-dimensional objects, if you like), lines (1D objects), and polygons (2D objects, aka an area). Vector data are analogous to vector image formats, like .eps and .pdf: they are smooth and well-defined regardless of the level of zoom you look at them (on a map). Raster data are composed of pixels, also known as grid cells. Each cell has its own value, and the grid of cells is overlaid on a map. Raster data are better for continuous variables such as temperature or rainfall, while vector data are better suited to political boundaries and features like roads. In this book, we'll only cover vector geospatial data analysis (if you need to work with rasters, check out [rasterio](https://rasterio.readthedocs.io/en/latest/)).

Common file formats for vector data include Shapefile (.shp...), GeoJSON/JSON, KML and GeoPackage.

### Coordinate Reference Systems

A Coordinate Reference System (CRS) associates numerical coordinates with a position on the surface of the Earth. Latitude and longitude are an example (and have units of degrees) but there are many CRSs. They are important because they define how your data will look on a map. Think about it like this: in the case of the usual charts you plot, you usually take it as given that you are working in a space that is 2D with an X- and Y-basis vector. But the most basic object in mapping is a *sphere*, which is fundamentally different to a 2D plane. This means you have to choose whether to show part of a globe or all of a 2D representation of a globe (a *projection*), which inevitably introduces distortions.

The two main classes of CRS are geographic and projected. A geographic CRS uses a 3D model of the Earth and 'drapes' the data over it.

A projected CRS has the geographic CRS plus a map projection that has co-ordinates on a 2D plane. It is well known that there is no distortion-free projection from a globe to a plane; you cannot preserve areas, shapes, distances, and angles simultaneously. Different map projections introduce different distortions, as lovingly shown in this [xkcd](https://xkcd.com/977/) cartoon.

One example of a map projection is the Mercator projection, which is a *conformal mapping*, i.e. a mapping that locally preserves angles, but not necessarily lengths. In fact, Mercator distorts areas, especially the further away an area is from the equator. Some projections are better for some purposes than others.

![XKCD Bad Map Projection: South America](https://imgs.xkcd.com/comics/bad_map_projection_south_america.png)

XKCD: Bad Map Projection: South America
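To put a rough number on the area distortion of Mercator mentioned above, here is a minimal sketch. It assumes a spherical Earth, for which the local scale factor of the Mercator projection is the secant of the latitude, so small areas are inflated by the square of that factor. The function name `mercator_area_inflation` is made up for this illustration.

```python
import math


def mercator_area_inflation(lat_deg):
    """Factor by which spherical Mercator inflates small areas at a given latitude."""
    # The local linear scale is sec(latitude); areas scale with its square
    k = 1.0 / math.cos(math.radians(lat_deg))
    return k ** 2


for lat in (0, 30, 60, 75):
    factor = mercator_area_inflation(lat)
    print(f"{lat:>2} degrees: small areas inflated by a factor of {factor:.2f}")
```

At the equator there is no inflation, while at 60 degrees of latitude a small region appears about four times larger in area than it would at the equator.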

Some analysis tools expect geospatial data to be in a projected CRS, and this is true of the main package we'll use, **geopandas**. This is usually not a problem for economic data; it's rare that the curvature of the Earth becomes a factor (though distances might in some rare situations). Most spatial libraries expect that all of the data that are being combined be in the same CRS.

CRSs are usually referenced by an [EPSG code](https://epsg.io/). Two widely used CRSs are WGS84 (aka EPSG: 4326), which provides a good representation of most places on Earth, and NAD83 (aka EPSG: 4269), which provides a good representation of the USA.

Why are we banging on about this? Because maps and geometry objects come in different CRS and it's worth being aware of that so that you can ensure they are in the same format, and that you have the right CRS for your purposes.
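As a small sketch of how the geographic versus projected distinction shows up in code, the snippet below looks up a few EPSG codes with **pyproj** (the package that **geopandas** relies on for CRS handling, so it is assumed to be installed). EPSG:3395 is the World Mercator CRS that appears later in this chapter.

```python
from pyproj import CRS

# Look up each EPSG code and report its name and whether it is geographic or projected
for code in (4326, 4269, 3395):
    crs = CRS.from_epsg(code)
    kind = "geographic" if crs.is_geographic else "projected"
    print(f"EPSG:{code} | {crs.name} | {kind}")
```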

### Imports and packages

We'll be using [**geopandas**](https://geopandas.org/index.html), the go-to package for vector spatial analysis in Python. The easiest way to install this package is using `conda install geopandas`; if you want to install it via pip then look at the [install instructions](https://geopandas.org/install.html). 

Let's import some of the packages we'll be using:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import geopandas as gpd

In [None]:
# TODO hide cell
# Set max rows displayed for readability
pd.set_option('display.max_rows', 6)
# Plot settings
plot_style = {'xtick.labelsize': 20,
                  'ytick.labelsize': 20,
                  'font.size': 22,
                  'figure.autolayout': True,
                  'figure.figsize': (10, 5.5),
                  'axes.titlesize': 22,
                  'axes.labelsize': 20,
                  'lines.linewidth': 4,
                  'lines.markersize': 6,
                  'legend.fontsize': 16,
                  'mathtext.fontset': 'stix',
                  'font.family': 'STIXGeneral',
                  'legend.frameon': False}
plt.style.use(plot_style)

# TODO: for this chapter, turn off spines etc by default?

## Geopandas dataframes

Quite literally, **GeoPandas** is a combination of geo and pandas so the good news is that everything you know about using **pandas** dataframes can be re-used here for geospatial data. The geo part adds functionality for geo-spatial data.

### Input and Output

So, first, we need some geo-spatial data to analyse. There are several different file formats for geo-spatial data, such as Shapefile (.shp), GeoJSON/JSON, KML, and GeoPackage.

We'll use a Shapefile of the countries of the world from [Natural Earth](https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/). It comes as a zip file; unzip it and one of the files ends in .shp, which is the one we load with **geopandas**.

Let's load the data and look at the first few rows:

In [None]:
df = gpd.read_file(os.path.join('data', 'geo', 'world', 'ne_50m_admin_0_countries.shp'))
df.head(3)

There's a lot of information here, but much of it is just different ways of labelling the same country. The dataset has one country per row.

To save the data frame, use `df.to_file("output_name.shp")` for shapefiles. For other output formats use, for example, `df.to_file("output_name.geojson", driver="GeoJSON")`.

### Basics

Let's see what we get when we call the humble plot function on the data we already read in!

In [None]:
df.plot(color='green');

I think it's glorious just how easy this is to use. 

Because **geopandas** builds on **pandas**, we can do all of the usual pandas-like operations such as filtering based on values. Here's how to filter to an individual country (one with a recognisable shape!):

In [None]:
df[df['ADMIN'] == 'Italy'].plot();

Note that it is the last column in the dataframe, the `geometry` column, that makes this sorcery possible.

### Working with Coordinate Reference Systems

We can check what the CRS of this entire **geopandas** dataframe is:

In [None]:
df.crs

We can switch between CRS using the `to_crs` function. Let's see the world map we plotted earlier using the WGS84 projection and also using some other projections, including the dreaded Mercator projection. Mercator looks completely ridiculous if we don't drop Antarctica first, so let's get shot of it for all the projections. (You can find a list of projections [here](https://proj.org/operations/projections/index.html).)

In [None]:
exclude_admins = ['Antarctica', 'French Southern and Antarctic Lands']
proj_names = ['WGS 84', 'Mercator', 'Winkel Tripel', 'Quartic Authalic']
crs_names = ['EPSG:4326', 'EPSG:3395', '+proj=wintri', '+proj=qua_aut']

world_no_antrtc = df[~df['ADMIN'].isin(exclude_admins)]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
for i, ax in enumerate(axes.flat):
    world_no_antrtc.to_crs(crs_names[i]).plot(ax=ax)
    ax.set_title(proj_names[i])
    ax.set_frame_on(False)
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout();
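It's worth separating two operations that are easy to confuse: *labelling* coordinates with a CRS (`set_crs`) and *transforming* them into another CRS (`to_crs`). The sketch below uses a single made-up point rather than the world map; the variable names are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

# A made-up point on the equator at 10 degrees east, with no CRS attached yet
pts = gpd.GeoSeries([Point(10, 0)])

# set_crs only labels the numbers as longitude/latitude; the coordinates do not change
pts_wgs84 = pts.set_crs("EPSG:4326")

# to_crs transforms the coordinates into the target CRS (World Mercator, in metres)
pts_merc = pts_wgs84.to_crs("EPSG:3395")

print(pts_wgs84.iloc[0])
print(pts_merc.iloc[0])
```

If data arrive without any CRS information, `set_crs` is the tool for declaring what the coordinates already are; `to_crs` should only be used once the data carry a correct CRS.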

## Manipulating Space

### Basics

Let's look more closely at the objects that encode the shapes of the countries on our world map:

In [None]:
df.loc[df['ADMIN'] == 'Italy', 'geometry']

The object is a multipolygon. Remember that we have points (0D), lines (1D), and polygons (aka areas, 2D) that we can embed in a projection. A line is at least 2 vertices that are connected; a polygon is the area contained within 3 or more vertices. Multipolygons are the union of two or more non-contiguous polygons: in this case, the Italian mainland, Sicily, and Sardinia.
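To make the idea of a multipolygon concrete, here is a toy sketch using **shapely** directly. The two unit squares are made up; they simply stand in for a 'mainland' and an 'island' that do not touch.

```python
from shapely.geometry import MultiPolygon, Polygon

# Two made-up, non-touching unit squares
mainland = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
island = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])

# A multipolygon bundles them into a single geometry object
country = MultiPolygon([mainland, island])

print(country.geom_type)
print(f"Number of parts: {len(country.geoms)}")
print(f"Total area: {country.area}")
```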

The `plot` function works just as happily if our basic objects are points rather than polygons though. In the below example, we'll grab the centroid (the spatial midpoint) of each country as a point and plot them:

In [None]:
df['centroid'] = df.centroid

fig, ax = plt.subplots()
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
df.boundary.plot(ax=ax, lw=0.5, color='k')
df['centroid'].plot(ax=ax, color='red')
plt.show()

Let's explore those basic building blocks a bit more. A point at position (1, 2) is defined as follows (**shapely** is used by **geopandas**):

In [None]:
from shapely.geometry import Point
point = Point(1, 2)
point

A point doesn't have much other than a single position in 2D space. But lines have length, and polygons have area.

There are different kinds of lines but the simplest is the `LineString` which can be constructed from a sequence of points.

In [None]:
from shapely.geometry import LineString

line = LineString([Point(0.0, 1.0), Point(2.0, 2.0),
            Point(2.5, 5.0), Point(4, 5),
            Point(4, 0)])
print(f'Length of line is {line.length:.2f}')
line

We already saw Polygons in the shape of Italy. But here's a much simpler one:

In [None]:
from shapely.geometry import Polygon

poly = Polygon([(0, 0), (1, 1), (1, 0)])
print(f'The area of the poly is {poly.area}')
poly

### Spatial Operations

We've seen the basic building blocks of geometries: points, lines, and polygons. We've also seen how to retrieve some basic properties of these such as length, area, and centroid. In this section, we'll see some slightly more advanced spatial operations that you can perform with **geopandas**.

#### Point-in-polygon

This does pretty much what you'd expect! It's useful in practice because we might want to know if a particular place falls within one area or another. As a test, let's see if the centroid for Germany actually falls within the Germany polygon.

In [None]:
df_row = df.loc[df['ADMIN']=='Germany', :]
df_row['geometry'].iloc[0].contains(df_row['centroid'].iloc[0])

But be careful with this! Countries can have complex multi-polygon geometries for which a centroid is not guaranteed to be within any of the polygons. France is a great example as French Guiana is so far away that it pulls the centroid just out of the mainland:

In [None]:
fig, ax = plt.subplots()
df_row = df.loc[df['ADMIN']=='France', :]
df_row['geometry'].plot(ax=ax)
df_row['centroid'].plot(ax=ax, color='red');
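The same effect can occur with a single polygon if it is not convex. The sketch below uses a made-up U-shaped polygon whose centroid lands in the gap of the 'U'. If you need a point that is guaranteed to lie inside the shape, `representative_point()` is one option.

```python
from shapely.geometry import Polygon

# A made-up U-shaped polygon: a 3x3 square with a notch cut out of the top
u_shape = Polygon([(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)])

centroid = u_shape.centroid
rep_point = u_shape.representative_point()

print(f"Centroid inside polygon? {u_shape.contains(centroid)}")
print(f"Representative point inside polygon? {u_shape.contains(rep_point)}")
```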

#### Buffers

Buffers are just an area drawn around a particular geometry, given a specific radius. They have a great practical use in computing catchment areas and so on. To create a buffer around the geometry you're currently using, use the `df.buffer(number)` command to return a column with the buffered version of your geometry in. Be aware that the units of the CRS you're using will determine the units of the buffer.
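Here is a minimal sketch of buffering, using two made-up points in a CRS whose units are metres (World Mercator, EPSG:3395, is assumed purely for illustration). Because the buffer around a point is a polygon approximating a circle, its area is close to, but slightly less than, pi times the radius squared.

```python
import math

import geopandas as gpd
from shapely.geometry import Point

# Two made-up sites; the CRS units are metres, so the buffer distance is in metres too
sites = gpd.GeoSeries([Point(0, 0), Point(5000, 0)], crs="EPSG:3395")
catchments = sites.buffer(1000)

print(catchments.area)
print(f"Area of a true circle of radius 1000: {math.pi * 1000**2:.0f}")
```

Had the same points been in a geographic CRS such as EPSG:4326, `buffer(1000)` would have meant 1000 *degrees*, which is almost never what you want.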

### Spatial set operations

More complex spatial manipulation can be achieved through spatial set operations. The figure below shows some examples for polygons (but the same principles apply for lines and points too):

<img src="https://geopandas.org/_images/overlay_operations.png" alt="Spatial operations">

Different set operations with two polygons. Source: QGIS documentation.

In addition to these, there are 'crosses' (for two lines) and 'touches'. You can find information about all of the set operations that are available [here](https://geopandas.org/set_operations.html).
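The set operations and predicates above can be tried out on toy shapes with **shapely**. In this sketch, the two overlapping squares and the two line segments are made up for illustration.

```python
from shapely.geometry import LineString, box

# Two made-up 2x2 squares that overlap in a 1x1 square
a = box(0, 0, 2, 2)
b = box(1, 1, 3, 3)

print(f"intersection area:         {a.intersection(b).area}")
print(f"union area:                {a.union(b).area}")
print(f"difference (a - b) area:   {a.difference(b).area}")
print(f"symmetric difference area: {a.symmetric_difference(b).area}")

# 'touches': the shapes share boundary points but their interiors do not overlap
left, right = box(0, 0, 1, 1), box(1, 0, 2, 1)
print(f"left touches right: {left.touches(right)}")

# 'crosses': two lines that pass through each other at a point
line_1 = LineString([(0, 0), (2, 2)])
line_2 = LineString([(0, 2), (2, 0)])
print(f"line_1 crosses line_2: {line_1.crosses(line_2)}")
```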

To demonstrate one of these, let's see if we can pick out a few regions that are *intersected* by a river. We'll try out the river Trent, which passes through England, and we'll see which *Local Authority Districts* (LADs) it passes through. First, we load the UK data, which is from the [ONS Open Geography Portal](https://geoportal.statistics.gov.uk/datasets/local-authority-districts-may-2020-boundaries-uk-buc):

In [None]:
dfuk = gpd.read_file(os.path.join('data', 'geo', 'uk_lad', 'Local_Authority_Districts__May_2020__UK_BUC.shp'))
dfuk.plot(color='dimgrey');

Next we bring in data on major rivers and convert it to the same CRS that our map of the UK is using:

In [None]:
river = gpd.read_file(os.path.join('data', 'geo', 'rivers', 'rivers.shp'))
river = river.to_crs(dfuk.crs)
river.plot(color='mediumblue');

We can now combine these for a view on how these rivers run through the UK. We will use `buffer` to make the rivers easier to see by asking for everything within 1.5 km of them.

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
dfuk.plot(color='grey', ax=ax)
river.buffer(1.5e3).plot(ax=ax, color='lightblue');

Let's turn our focus to one river: the Trent. We will ask which of the regions are intersected by that river and plot only those parts. We'll also add some annotations for the names of the regions.

In [None]:
# Restrict river to just one
river_name = 'Trent'
river = river[river['name'] == river_name]

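# Find the subset of regions intersected by the river (buffered by 1 km) by creating a new True/False column
# .any() collapses the per-river-row results into a single True/False for each region
river_buffer = river['geometry'].buffer(1e3)
dfuk['river'] = dfuk['geometry'].apply(lambda x: river_buffer.intersects(x).any())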

# The rest of the code is just related to plotting:

# Create a cut of dfuk for convenience
df_th = dfuk[dfuk['river']].copy()

# Get a representative point for each region to annotate
df_th['coords'] = df_th['geometry'].representative_point().apply(lambda x: x.coords[:][0])

# Plotting bits
fig, ax = plt.subplots(figsize=(10, 10))
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
df_th.plot(color='grey', ax=ax, edgecolor='0.6')
river.plot(color='lightblue', ax=ax)

# Add text annotation to the largest polygons
for idx, row in df_th.iterrows():
    if(row['geometry'].area>np.quantile(df_th.area, q=0.4)):
        ax.annotate(text=row['LAD20NM'], xy=row['coords'],
                    horizontalalignment='center', weight='bold',
                    fontsize=10, color='black')
ax.set_title(f'Local Authority Districts that are intersected by the {river_name}', loc='left', fontsize=20)
plt.show()
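As an aside, when the thing you are testing against is a *single* geometry, the `intersects` method can be called on the whole dataframe at once rather than via `apply` with a lambda. The sketch below shows this on made-up data: three square 'regions' and a straight 'river' that passes through two of them.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

# Three made-up square regions laid out from west to east
regions = gpd.GeoDataFrame(
    {"name": ["west", "middle", "east"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1), box(4, 0, 5, 1)],
)

# A made-up river that runs through the first two regions only
river_line = LineString([(0.5, 0.5), (2.5, 0.5)])

# Vectorised test of every region against one (buffered) geometry
regions["river"] = regions.intersects(river_line.buffer(0.1))

print(regions.loc[regions["river"], "name"].tolist())
```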

Note that the way we set this up to use `river_name` as an input variable means we could oh so easily wrap everything up in a function that did this for other rivers. Oh go on then, let's see how that works *because* it shows how scalable operations are once you do them in *code*:



In [None]:
def river_intersect_plot(river_name):
    """
    Given the name of a river, shows a plot of the LADs that it intersects.
    """
    river = gpd.read_file(os.path.join('data', 'geo', 'rivers', 'rivers.shp'))
    river = river[river['name'] == river_name]
    river = river.to_crs(dfuk.crs)
    # .any() collapses the per-river-row results into a single True/False for each region
    dfuk['river'] = dfuk['geometry'].apply(lambda x: river['geometry'].buffer(2e3).intersects(x).any())
    df_th = dfuk[dfuk['river']].copy()
    df_th['coords'] = df_th['geometry'].representative_point().apply(lambda x: x.coords[:][0])
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.set_frame_on(False)
    ax.set_xticks([])
    ax.set_yticks([])
    df_th.plot(color='grey', ax=ax, edgecolor='0.6')
    river.plot(color='lightblue', ax=ax)
    for idx, row in df_th.iterrows():
        if(row['geometry'].area>np.quantile(df_th.area, q=0.6)):
            ax.annotate(text=row['LAD20NM'], xy=row['coords'],
                        horizontalalignment='center', weight='bold',
                        fontsize=10, color='black')
    ax.set_title(f'Local Authority Districts that are intersected by the {river_name}', loc='left', fontsize=20)
    plt.show()

With our function defined, we can do the whole thing for a completely different river:

In [None]:
river_intersect_plot('Thames')

### Computing distances

Let's now find out how far it is between two regions of interest in our cut of regions around the Trent. We'll pick the East Riding of Yorkshire and Stafford, and then plot them on the map. In some ways, it's probably *less* fussy to compute the distances between all pairs in a dataframe because of the need to use `.iloc` to isolate the individual values within a row in the below snippets, but the below at least shows you what's going on in gory detail.

In [None]:
# Get the rows we're interested in out of the dataframe:
name_a = 'East Riding of Yorkshire'
name_b = 'Stafford'
# This selects *all rows* that match these conditions, which is why we have to use .iloc[0] thereafter
# to make sure we're only passing a single row.
place_a = df_th.loc[df_th['LAD20NM'] == name_a, :]
place_b = df_th.loc[df_th['LAD20NM'] == name_b, :]
# Compute the distance using representative points
dist_a_b = (place_a['geometry'].representative_point().iloc[0]
            .distance(place_b['geometry'].representative_point().iloc[0]))

# Plot the map
fig, ax = plt.subplots(figsize=(10, 10))
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
df_th.plot(color='grey', ax=ax, edgecolor='0.6')
river.plot(color='lightblue', ax=ax)
place_a.plot(ax=ax, color='red')
place_b.plot(ax=ax, color='green')
for i, place in enumerate([place_a, place_b]):
    ax.annotate(text=place['LAD20NM'].iloc[0], xy=place['coords'].iloc[0],
                horizontalalignment='center', weight='bold',
                fontsize=15, color='black')
# Uncomment below to add a connecting line
## Create a line between two rep points in a and b
# connector = LineString([place_a['geometry'].representative_point().iloc[0],
#                        place_b['geometry'].representative_point().iloc[0]])
## Convert it to a geopandas dataframe for easy plotting
# gpd.GeoDataFrame([connector],columns=['line'], geometry='line').plot(ax=ax, linestyle='-.', color='black')
ax.set_title(f'Distance between {name_a} and {name_b} is {dist_a_b/1e3:1.0f} km.', loc='left')
plt.show()
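For comparison, here is a rough sketch of the 'all pairs' approach mentioned above. It uses three made-up points with labels `A`, `B`, and `C` in place of real regions; with real data you might use representative points in a projected CRS instead.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Three made-up points, labelled via the index
names = ["A", "B", "C"]
pts = gpd.GeoSeries([Point(0, 0), Point(3, 4), Point(6, 8)], index=names)

# For each point, compute the distance from every point to it (one column per point)
dist = pd.DataFrame({name: pts.distance(geom) for name, geom in pts.items()})

print(dist)
```

The result is a symmetric matrix with zeros on the diagonal, from which any individual distance can be read off by label, for example `dist.loc["A", "B"]`.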

### Nearest Neighbour

## Aggregation and Dissolve


## Plotting

Compositions are possible