# Introduction to Geo-spatial Analysis


## Introduction

In this chapter, you'll learn the basics of geospatial analysis with code.

You should be aware when following this chapter that installing geographic analysis packages isn't always easy, and things can and do go wrong! (Some geospatial analysis courses recommend running everything in a Docker container.)

There are two types of spatial data in geographic information systems (GIS): vector and raster. Vector spatial data are made up of vertices and paths. These are represented computationally by points (zero-dimensional objects, if you like), lines (1D objects), and polygons (2D objects, aka areas). Vector data are analogous to vector image formats, like .eps and .pdf: they are smooth and well-defined regardless of the level of zoom you look at them (on a map). Raster data are composed of pixels, also known as grid cells. Each cell holds its own value, and the raster grid is overlaid on a map. Raster data are better for continuous variables such as temperature or rainfall, while vector data are better suited to political boundaries and features like roads. In this book, we'll only cover vector geospatial data analysis (if you need to work with rasters, check out [rasterio](https://rasterio.readthedocs.io/en/latest/)).

Common file formats for vector data include Shapefile (.shp...), GeoJSON/JSON, KML and GeoPackage.

### Coordinate Reference Systems

A Coordinate Reference System (CRS) associates numerical coordinates with a position on the surface of the Earth. Latitude and longitude are an example (and have units of degrees) but there are many CRSs. They are important because they define how your data will look on a map. Think about it like this: in the case of the usual charts you plot, you usually take it as given that you are working in a space that is 2D with an X- and Y-basis vector. But the most basic object in mapping is a *sphere*, which is fundamentally different to a 2D plane. This means you have to choose whether to show part of a globe or all of a 2D representation of a globe (a *projection*), which inevitably introduces distortions.

The two main classes of CRS are geographic and projected. A geographic CRS uses a 3D model of the Earth and 'drapes' the data over it.

A projected CRS has the geographic CRS plus a map projection that has co-ordinates on a 2D plane. It is well known that there is no distortion-free projection from a globe to a plane; you cannot preserve areas, shapes, distances, and angles simultaneously. Different map projections introduce different distortions, as lovingly shown in this [xkcd](https://xkcd.com/977/) cartoon.

One example of a map projection is the Mercator projection, which is a *conformal mapping*, i.e. a mapping that locally preserves angles, but not necessarily lengths. In fact, Mercator distorts areas, especially the further away an area is from the equator. Some projections are better for some purposes than others.
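
To put rough numbers on that distortion, here is a minimal sketch that assumes a perfectly spherical Earth (real CRSs use an ellipsoid, so treat the figures as approximate). On a sphere, Mercator stretches lengths locally by a factor of 1/cos(latitude), so areas are inflated by the square of that factor. The function name `mercator_scale_factor` is just an illustrative choice.

```python
import math


def mercator_scale_factor(lat_deg: float) -> float:
    """Local length scale factor of the spherical Mercator projection.

    Assumes a spherical Earth; the factor is 1 / cos(latitude).
    """
    return 1.0 / math.cos(math.radians(lat_deg))


# Lengths are stretched by k and areas by k squared as we move away from the equator
for lat in (0, 30, 60, 80):
    k = mercator_scale_factor(lat)
    print(f"latitude {lat:>2} degrees: lengths x{k:.2f}, areas x{k**2:.2f}")
```

At 60 degrees of latitude the length factor is 2, so areas there are drawn roughly four times too large relative to the equator.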

![XKCD Bad Map Projection: South America](https://imgs.xkcd.com/comics/bad_map_projection_south_america.png)

XKCD: Bad Map Projection: South America

Some analysis tools expect geospatial data to be in a projected CRS; this is true of the main package we'll use, **geopandas**. This is usually not a problem for economic data; it's rare that the curvature of the Earth becomes a factor (though it can matter for distances in some rare situations). Most spatial libraries expect all of the data being combined to be in the same CRS.

CRSs are usually referenced by an [EPSG code](https://epsg.io/). Two widely used CRSs are WGS84 (aka EPSG:4326), which provides a good representation of most places on Earth, and NAD83 (aka EPSG:4269), which provides a good representation of the USA.

Why are we banging on about this? Because maps and geometry objects come in different CRSs, and it's worth being aware of that so that you can ensure they are all in the same CRS, and that you have the right CRS for your purposes.

### Imports and packages

We'll be using [**geopandas**](https://geopandas.org/index.html), the go-to package for vector spatial analysis in Python. The easiest way to install this package is using `conda install geopandas`; if you want to install it via pip then look at the [install instructions](https://geopandas.org/install.html). 

Let's import some of the packages we'll be using:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os

In [None]:
# TODO hide cell
# Set max rows displayed for readability
pd.set_option('display.max_rows', 6)
# Plot settings
plot_style = {'xtick.labelsize': 20,
              'ytick.labelsize': 20,
              'font.size': 22,
              'figure.autolayout': True,
              'figure.figsize': (10, 5.5),
              'axes.titlesize': 22,
              'axes.labelsize': 20,
              'lines.linewidth': 4,
              'lines.markersize': 6,
              'legend.fontsize': 16,
              'mathtext.fontset': 'stix',
              'font.family': 'STIXGeneral',
              'legend.frameon': False}
plt.style.use(plot_style)

# TODO: for this chapter, turn off spines etc by default?

## Geopandas dataframes

Quite literally, **GeoPandas** is a combination of geo and pandas so the good news is that everything you know about using **pandas** dataframes can be re-used here for geospatial data. The geo part adds functionality for geo-spatial data.

### Basics

So, first, we need some geo-spatial data to analyse. There are several different file formats for geo-spatial data, such as Shapefile (.shp), GeoJSON/JSON, KML, and GeoPackage.

We'll use a Shapefile of the countries of the world from [Natural Earth](https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/). It comes as a zip file; unzip it and one of the files ends in .shp, which is the one we load with **geopandas**.

Let's load the data and look at the first few rows:

In [None]:
import geopandas as gpd
df = gpd.read_file('/Users/aet/Downloads/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp')
df.head(3)

There's a lot of info here, but a lot of it is different labelling. The dataset has one country per row.

Let's see what we get when we call the humble plot function!

In [None]:
df.plot(color='green');

I think it's glorious just how easy this is to use. 

Because **geopandas** builds on **pandas**, we can do all of the usual pandas-like operations such as filtering based on values. Here's how to filter to an individual country (one with a recognisable shape!):

In [None]:
df[df['ADMIN'] == 'Italy'].plot();

By the way, you can pass an `Axes` object to the `plot()` function and do all of the usual manipulations that you would do with **Matplotlib**. For example:

In [None]:
fig, ax = plt.subplots()
df[df['ADMIN'] == 'United Kingdom'].plot(ax=ax, color='green', edgecolor='k')
fig.set_facecolor("lightslategray")
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
plt.show()

The secret behind this `.plot()` sorcery is the last column of the dataframe, `geometry`, as this encodes the shape of individual countries in the dataset. Let's look at it for Italy:

In [None]:
df.loc[df['ADMIN'] == 'Italy', 'geometry']

The object is a multipolygon. Remember that we have points (0D), lines (1D), and polygons (aka areas, 2D) that we can embed in a projection. A line is at least 2 vertices that are connected; a polygon is the area contained within 3 or more vertices. Multipolygons are the union of two or more non-contiguous polygons: in this case, the Italian mainland, Sicily, and Sardinia.
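
To make the idea of a multipolygon concrete, here is a small sketch using **shapely** directly. The shapes are made up for illustration (a rectangular "mainland" and a square "island"), not taken from the Natural Earth data.

```python
from shapely.geometry import MultiPolygon, Polygon

# Two made-up, non-touching shapes: a "mainland" and an "island"
mainland = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])
island = Polygon([(6, 0), (7, 0), (7, 1), (6, 1)])

# A multipolygon bundles them into a single geometry object
country = MultiPolygon([mainland, island])

print(country.geom_type)
print(f"Number of parts: {len(country.geoms)}")
print(f"Total area: {country.area}")
```

The total area is simply the sum of the areas of the parts, which is why a single row in the dataframe can represent a country made of several separate pieces.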

The `plot` function works just as happily if our basic objects are points rather than polygons though. In the below example, we'll grab the centroid (the spatial midpoint) of each country as a point and plot them:

In [None]:
df['centroid'] = df.centroid
df.set_geometry('centroid').plot();
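
One caveat: recent versions of **geopandas** typically warn if you compute centroids while the data are in a geographic CRS, because the calculation treats degrees as if they were planar units. A possible workaround, sketched below on a toy rectangle rather than the countries data, is to reproject to a projected CRS, take the centroid there, and convert back. The choice of EPSG:6933 (a global equal-area projection) is an assumption for illustration; a projection suited to your region will usually be a better choice.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Toy rectangle in longitude/latitude (illustrative data only)
shapes = gpd.GeoSeries(
    [Polygon([(0, 40), (10, 40), (10, 60), (0, 60)])],
    crs="EPSG:4326",
)

# Reproject to a projected CRS (EPSG:6933 is an assumed, illustrative choice),
# compute the centroid there, then convert the points back to longitude/latitude
projected = shapes.to_crs("EPSG:6933")
centroids = projected.centroid.to_crs("EPSG:4326")

print(centroids.iloc[0])
```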

Let's explore those basic building blocks a bit more. A point at position (1, 2) is defined as follows (**shapely** is used by **geopandas**):

In [None]:
from shapely.geometry import Point
point = Point(1, 2)
point

There are different kinds of lines but the simplest is the `LineString` which can be constructed from a sequence of points.

In [None]:
from shapely.geometry import LineString

line = LineString([Point(0.0, 1.0), Point(2.0, 2.0),
            Point(2.5, 5.0), Point(4, 5),
            Point(4, 0)])
print(f'Length of line is {line.length:.2f}')
line

We already saw Polygons in the shape of Italy. But here's a much simpler one:

In [None]:
from shapely.geometry import Polygon

poly = Polygon([(0, 0), (1, 1), (1, 0)])
poly
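
Geometry objects carry useful properties and methods of their own. The sketch below rebuilds the same triangle (under the illustrative name `triangle`) and asks for its area, its bounding box, and whether a couple of test points fall inside it.

```python
from shapely.geometry import Point, Polygon

# The same triangle as above, rebuilt so that this example is self-contained
triangle = Polygon([(0, 0), (1, 1), (1, 0)])

print(triangle.area)    # half of the unit square: 0.5
print(triangle.bounds)  # (minx, miny, maxx, maxy)

# Point-in-polygon checks: the triangle covers the region below the line y = x
inside = triangle.contains(Point(0.75, 0.25))
outside = triangle.contains(Point(0.25, 0.75))
print(inside, outside)
```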

## Spatial set operations

Use Binary logical operations from https://raw.githack.com/uo-ec607/lectures/master/09-spatial/09-spatial.html#BONUS_1:_US_Census_data_with_tidycensus_and_tigris as an example, ie where does river go.
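
As a rough sketch of the "where does the river go" idea, the example below uses **shapely** predicates and set operations on made-up shapes: three rectangular regions and a straight line standing in for a river. None of these geometries come from the linked lecture; they are purely illustrative.

```python
from shapely.geometry import LineString, Polygon

# Made-up regions (illustrative only)
regions = {
    "west": Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    "east": Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    "north": Polygon([(0, 2), (4, 2), (4, 4), (0, 4)]),
}

# A made-up "river" flowing from west to east along y = 1
river = LineString([(-1, 1), (5, 1)])

# Binary predicate: which regions does the river pass through?
crossed = [name for name, shape in regions.items() if river.intersects(shape)]

# Set operation: how much of the river lies within each crossed region?
lengths = {name: river.intersection(regions[name]).length for name in crossed}

# Another set operation: merge two neighbouring regions into one shape
merged = regions["west"].union(regions["east"])

print(crossed)
print(lengths)
print(merged.area)
```

The predicates (`intersects`, and relatives such as `contains` and `touches`) return booleans, while the set operations (`intersection`, `union`, `difference`) return new geometry objects.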

## Aggregation and Dissolve
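
A minimal sketch of what a dissolve does, using a toy dataframe rather than the countries data: rows that share a value in a grouping column have their geometries merged, and the remaining columns are aggregated. The column names `group` and `population` are invented for this example.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Three toy squares; the first two share a group and an edge
squares = gpd.GeoDataFrame(
    {
        "group": ["a", "a", "b"],
        "population": [10, 20, 5],
    },
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        Polygon([(5, 0), (6, 0), (6, 1), (5, 1)]),
    ],
)

# Merge geometries within each group and sum the numeric column
merged = squares.dissolve(by="group", aggfunc="sum")

print(merged[["population"]])
print(merged.geometry.area)
```

The result has one row per group, indexed by the grouping column, much like a **pandas** `groupby` followed by an aggregation.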


## Calculating distances

Use https://raw.githack.com/uo-ec607/lectures/master/09-spatial/09-spatial.html#BONUS_1:_US_Census_data_with_tidycensus_and_tigris
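
In the meantime, here is one way to get a feel for distances from latitude and longitude alone: the haversine formula, which gives the great-circle distance on a sphere. This is a swapped-in illustration rather than the approach in the linked lecture; it assumes a spherical Earth with a mean radius of 6,371 km, and the city coordinates are approximate.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees.

    Assumes a spherical Earth of the given radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate (latitude, longitude) coordinates for two cities
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)

print(f"London to Paris: roughly {haversine_km(*london, *paris):.0f} km")
```

For data already held in a **geopandas** dataframe, the more usual route is to reproject to a suitable projected CRS and measure distances there.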

### Working with Co-ordinate Reference Systems


We can check what the CRS of this entire **geopandas** dataframe is:

In [None]:
df.crs

We can switch between CRSs using the `to_crs` function. Let's redraw the world map we plotted earlier in WGS84 using the dreaded Mercator projection. This looks completely ridiculous if we don't drop Antarctica first, so let's do that!

In [None]:
df_mercator = df[~df['ADMIN'].isin(['Antarctica', 'French Southern and Antarctic Lands'])].to_crs("EPSG:3395")
df_mercator.plot();
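
The same function is how you bring two datasets into a common CRS before combining them. Here's a small sketch with two toy layers; the names `cities` and `sites` and their contents are invented for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two toy layers with different CRSs (illustrative data, not from Natural Earth)
cities = gpd.GeoDataFrame({"name": ["a"]}, geometry=[Point(10, 0)], crs="EPSG:4326")
sites = gpd.GeoDataFrame({"name": ["b"]}, geometry=[Point(0, 0)], crs="EPSG:3395")

# Inspect what kind of CRS each layer uses
print(cities.crs.to_epsg(), cities.crs.is_geographic)
print(sites.crs.to_epsg(), sites.crs.is_projected)

# Reproject one layer so that both share a CRS before combining them
if sites.crs != cities.crs:
    sites = sites.to_crs(cities.crs)

print(sites.crs.to_epsg())
```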

## Plotting

You can compose several layers on the same axes:

In [None]:
fig, ax = plt.subplots()
df.boundary.plot(ax=ax, lw=0.9)
df['centroid'].plot(ax=ax, color='green', edgecolor='k', alpha=0.6)
plt.show()

### Choropleth Maps

See https://geopandas.org/mapping.html
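
As a minimal sketch of the idea, a choropleth colours each shape according to the value in one of the dataframe's columns, which you select with the `column` argument of `plot()`. The toy regions and the `value` column below are invented for illustration.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# Three toy regions, each with a made-up value to colour by
regions = gpd.GeoDataFrame(
    {"name": ["left", "middle", "right"], "value": [1.0, 2.5, 4.0]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
    ],
)

fig, ax = plt.subplots()
# `column` picks the variable to map to colour; `legend=True` adds a colour bar
regions.plot(column="value", cmap="viridis", legend=True, edgecolor="k", ax=ax)
ax.set_axis_off()
```

The same call should work on the countries dataframe with any numeric column in place of `value`.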

Heatmaps: https://nbviewer.jupyter.org/gist/perrygeo/c426355e40037c452434 

df['centroid']