# Session 2-4: Introduction to GeoPandas 🛰️🌍

![ntl](./assets/ntl.jpg)

Data science for Climate & Health is a ***broad*** topic. But generally, to figure out how we might create a more sustainable future, we need two types of data: human and environmental. These types of data are inherently geospatial because they **map** human and environmental phenomena on planet Earth.

[<span class="codeb">Geographic Information Systems</span>](https://en.wikipedia.org/wiki/Geographic_information_system) (GIS) allow for visualizing, manipulating, and analyzing human and environmental geographic data. But GIS platforms have limited utility because (1) it can be difficult to reproduce workflows with a GIS and (2) processing large volumes of data is inefficient with a GIS graphical user interface (GUI). Further, GIS platforms tend to be black boxes that do not allow you to fully understand how your data is being processed.

Thankfully, open-source data science evangelists have developed a suite of geospatial data science packages – such as [<span class="codeb">GeoPandas</span>](https://geopandas.org) and [<span class="codeb">Rasterio</span>](https://rasterio.readthedocs.io) – in Python that build upon [NumPy](https://numpy.org), [<span class="codeb">Pandas</span>](https://pandas.pydata.org), and other commonly used Python packages. As such, many of the data structures and functions in packages like <span class="code">GeoPandas</span> are similar to those in Pandas.

In this session, we will give an overview of how geospatial data can be analyzed in Python.
 
<p style="height:1pt"> </p>

<div class="boxhead2">
    Session Topics
</div>

<div class="boxtext2">
<ul class="a">
    <li> 📌 Introduction to <span class="codeb">GeoPandas</span> </li>
    <ul class="b">
        <li> Anatomy of a Geometry </li>
        <li> Importing Shape Files </li>
        <li> Concatenating and Merging Data </li>
        <li> Coordinate reference systems and projections </li>
    </ul>

</ul>
</div>

<hr style="border-top: 0.2px solid gray; margin-top: 12pt; margin-bottom: 0pt"></hr>

### Instructions
We will work through this notebook together. To run a cell, click on the cell and press "Shift" + "Enter" or click the "Run" button in the toolbar at the top. 

<p style="color:#408000; font-weight: bold"> 🐍 &nbsp; &nbsp; This symbol designates an important note about Python structure, syntax, or another quirk.  </p>

<p style="color:#008C96; font-weight: bold"> ▶️ &nbsp; &nbsp; This symbol designates a cell with code to be run.  </p>

<p style="color:#008C96; font-weight: bold"> ✏️ &nbsp; &nbsp; This symbol designates a partially coded cell with an example.  </p>

<hr style="border-top: 1px solid gray; margin-top: 24px; margin-bottom: 1px"></hr>

## Introduction to GeoPandas

<img src="./assets/geopandas.png">

GeoPandas is an open-source Python library that ascribes geographic information to Pandas Series and Pandas DataFrame objects. In other words, GeoPandas enables a Pandas Series/DataFrame to have a spatial dimension, akin to a `.shp` file in a GIS platform. Importantly, GeoPandas can perform geometric operations. To do this, GeoPandas objects use **[Shapely](https://pypi.org/project/shapely/)** geometry objects.

### GeoSpatial Data 
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Geospatial data is either `raster` (e.g. a grid) or `vector` (e.g. 2-d Cartesian points, lines, or polygons). We will come back to raster data later in this lesson.

<img src="./assets/raster-vector.png" alt="rastervector" width="500"/>

Because GeoPandas ascribes spatial information to tabular data, GeoPandas objects are `vector` spatial data. Each row in a GeoPandas DataFrame must have spatial information (a point, line, or polygon) that corresponds to the geographic location or area to which the data should be mapped.

To add spatial information to a Pandas DataFrame, we add a `geometry` column that holds `shapely` objects containing the Cartesian coordinates of each row's location. Let's look at an example.

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
import pandas as pd 
# Create a Pandas DataFrame from a dictionary of lists
df = pd.DataFrame({'location' : ['p1','p2','p3','p4','p5','p6','p7'],
                   'data1' : [1,2,3,4,5,6,7],
                   'data2' : [10,22,55,67,70,1,87]})
df

The DataFrame has three columns - location, data1, and data2 - but it does not have any geographic information from which it can be mapped. To add it, we first need to create x and y coordinates (usually longitude and latitude).

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
df['x'] = [0,2,3,4,20,4,10]
df['y'] = [1,0,5,2,6,4,11]
df

Now we will turn our `x` and `y` columns into `Shapely POINT` objects and build a `GeoPandas DataFrame`.

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
from shapely.geometry import Point
import geopandas as gpd

df['geometry'] = [Point(xy) for xy in zip(df.x, df.y)] 
gdf = gpd.GeoDataFrame(df)
gdf

In [None]:
# check the type of object for the first geometry
type(gdf['geometry'][0])
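
Points are only one of the geometry types a `geometry` column can hold. As a minimal sketch of the anatomy of a geometry, the cell below builds a point, a line, and a polygon directly with `shapely`. The coordinates are made up for illustration and are not taken from any dataset in this session.

```python
from shapely.geometry import Point, LineString, Polygon

# A point is a single (x, y) coordinate pair
pt = Point(2, 0)
print(pt.geom_type, pt.x, pt.y)

# A line is an ordered sequence of two or more coordinate pairs
line = LineString([(0, 0), (3, 4)])
print(line.geom_type, line.length)   # length is in coordinate units

# A polygon is a closed ring of coordinates that encloses an area
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
print(square.geom_type, square.area)  # area is in squared coordinate units

# Geometries can be tested against each other
print(square.contains(Point(1, 1)))
```

Lengths and areas are reported in whatever units the coordinates use, which is why the coordinate reference system matters later in this session.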

#### Now, let's map our data!
Notice that the points are plotted at the x and y coordinates we provided, while the color of each point corresponds to the `data1` field. If you change the column to `data2`, the colors will change to represent `data2`.

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
ax = gdf.plot(column = 'data1', legend = True)

##### Let's do the same for our DHS data!

Similarly, our Ghana DHS 2014 dataset (.csv) has two columns, `'lat'` and `'long'`, which record the locations of the surveyed households. But to convert it from a *Pandas DataFrame* to a *GeoPandas GeoDataFrame*, we need to create a geometry variable.

In [None]:
# First let's load the Ghana survey dataset
df = pd.read_csv('./data/Ghana-2014-DHS-Household-Filtered.csv')

# Identify records where both longitude and latitude are 0
index_to_drop = df[(df['long'] == 0) & (df['lat'] == 0)].index

# Drop these records using the drop function
df = df.drop(index_to_drop)
df.head(1)

In [None]:
# Create geometry (vector point) from the long and lat coordinates
geometry = [Point(xy) for xy in zip(df['long'], df['lat'])]

# Convert pandas dataframe to geopandas dataframe
geo_df = gpd.GeoDataFrame(df, geometry=geometry)

# Set the CRS to WGS 84 (EPSG:4326)
# This step is important as later you may want to reproject the vector files
geo_df.set_crs("EPSG:4326", inplace=True)

geo_df.head(1)
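
The same three steps (drop the placeholder coordinates, build points, set the CRS) can also be written more compactly. The sketch below uses `gpd.points_from_xy` and the `crs` argument of `GeoDataFrame` on a small made-up table; the column names `long` and `lat` mirror the survey file, but the rows are hypothetical.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical stand-in for the survey table (values are made up)
toy = pd.DataFrame({
    'household': ['a', 'b', 'c'],
    'long': [-0.19, -1.62, 0.0],
    'lat': [5.60, 6.69, 0.0],
})

# Keep every row except those where both coordinates are exactly 0
toy = toy[~((toy['long'] == 0) & (toy['lat'] == 0))]

# Build the point geometries and set the CRS in one step
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy['long'], toy['lat']),
    crs='EPSG:4326',
)
print(toy_gdf)
```

Note that x is longitude and y is latitude, so `long` must come first in `points_from_xy`.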

In [None]:
geo_df[['lat', 'long', 'geometry']]

See! Our table has been geospatialized.
<br>You can use the [`gdf.plot()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.plot.html) method to visualize vector data in Python. GeoPandas provides a high-level interface to Matplotlib for making maps. For more detail, see the GeoPandas User Guide page [Mapping and plotting tools](https://geopandas.org/en/stable/docs/user_guide/mapping.html).

In [None]:
# Import matplotlib package 
import matplotlib.pyplot as plt

# Initialize empty figure
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Plot the locations of the household survey
geo_df.plot(color='grey', markersize=5, ax=ax)

plt.title('Geographic Distribution of 2014 DHS Survey Responses in Ghana')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.show()

You can also visualize different variables on the survey data.

In [None]:
# PLOT THE RURAL/URBAN DISTRIBUTION ON THE SURVEY LOCATIONS

# Initialize the plot
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Plot the locations of the household survey with the rural/urban column defining the colors
geo_df.plot(column='rural_urban', markersize=8, legend=True,
            legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)}, ax=ax)

# Set plot title and labels
plt.title('Locations (Rural vs Urban) of 2014 DHS Survey Responses in Ghana')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

# Adjust the layout to make room for the legend
plt.tight_layout()

# Show the plot
plt.show()

### Importing Shape files
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

GeoPandas can easily load vector data from `.shp` files, as well as `.csv`, `.json`, and other common geospatial vector file formats. For the following parts of the tutorial, we will use two datasets to further examine how to use GeoPandas and how to process vector datasets in Python.

Let's start with a `.shp` that contains <ins>the boundaries of every country on the planet.</ins>

In [None]:
# file & path
import os

# Base directory where all notebooks/codes/data are stored directly 
base_dir = './'

# Filename relative to the base directory
filename = 'data/ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp' 

# Joining paths to get the full path to the file
fn = os.path.join(base_dir, filename)

In [None]:
gdf = gpd.read_file(fn)
gdf.head()

This `GeoPandas DataFrame` looks quite similar to a `Pandas DataFrame`. You can see that it contains a bunch of different columns for each country, but it also contains a `geometry` column that `GeoPandas` reads in as `shapely` objects.

But functionally, `gdf` has many of the same methods and attributes as a `Pandas DataFrame`. Let's take a look at some of the similarities and differences. 
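
As a quick illustration of that overlap, here is a minimal sketch on a made-up three-point `GeoDataFrame` (the name `pts` and its values are hypothetical). The first two calls are ordinary Pandas; the last three only exist because of the `geometry` column.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical three-point GeoDataFrame
pts = gpd.GeoDataFrame(
    {'name': ['p1', 'p2', 'p3']},
    geometry=[Point(0, 1), Point(2, 0), Point(3, 5)],
)

# Pandas-style attributes still work
print(pts.shape)
print(list(pts.columns))

# GeoPandas-only attributes
print(pts.geometry.name)        # name of the active geometry column
print(pts.geom_type.unique())   # geometry types present in the table
print(pts.total_bounds)         # [minx, miny, maxx, maxy] of all geometries
```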

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
print(type(gdf))
gdf.columns

In [None]:
# What is the size of our gdf?
gdf.shape

<div class="example">
    ✏️ <b> Try it. </b> 
   Try ordering the countries by population estimate.
</div>

In [None]:
# Let's try to sort the column by the total population estimate
gdf.sort_values(by = 'POP_EST', ascending = False)

<div class="example">
    ✏️ <b> Try it. </b> 
   Try showing only the country name and population for the top ten most populated countries.
</div>

In [None]:
# Sort the table by population in descending order and print the top 10 countries
gdf.sort_values(by = 'POP_EST', ascending = False)[['NAME', 'POP_EST']].head(10)

✏️ Try making a [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) of per capita GDP estimate by country.

In [None]:
import numpy as np

# Create a new variable/column and calculate the GDP per capita 
gdf['GDP_percapita'] = gdf['GDP_MD_EST'] / gdf['POP_EST'] * 10**6

# print the first 5 elements and check
gdf['GDP_percapita'].head()

In [None]:
# For a quick look, calling gdf.plot() on its own is enough.
# For finer control, pass a matplotlib axis and adjust the plot (see the next cell).
ax = gdf.plot(column = 'GDP_percapita', legend = True)

In [None]:
# Initialize the plot
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Plot the country polygons with the GDP per capita column defining the colors
gdf.plot(column='GDP_percapita', legend=True, legend_kwds={'shrink': 0.4}, ax=ax, vmin = 300, vmax = 100000)

# Set plot title and labels
plt.title('GDP per capita estimate by country')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

# Adjust the layout to make room for the legend
plt.tight_layout()

# Show the plot
plt.show()

# ✏️ On your own

In [None]:
# Make a choropleth map of Population Rank for ONLY Africa 
# Hint - you will need to subset the data for CONTINENT

### Concatenating and Merging Data
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Often you will have tabular data, like a `.csv` file, that does not contain geographic information but does have categorical spatial information, such as a country name. In this case you may have to [concatenate](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) or [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) two datasets to join the `.csv` file to spatial boundaries contained in a `.shp` file. Both concatenate and merge are quite useful. But with tabular data, you will most likely need to join two (or more) datasets based on a common set of values. As such, `pd.merge()` is likely your friend.

For example, you may have a dataset with a column called `country` but no geographic boundaries for the countries. You will then have to merge that dataset with a second dataset that has the geographic boundaries. This is often the case with a `.csv` file that you want to attach to a `.shp` file to make a map.

To merge two datasets with the `on` argument, both datasets must have a column with the same name and (hopefully) the same values on which to join the datasets.

<ins>Let's look at another `.csv` file, which contains multi-year DHS survey data aggregated at country level.</ins>
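
The "hopefully" matters: rows whose key values are spelled differently in the two tables are silently dropped by an inner merge. The sketch below uses two small made-up tables (the names `survey` and `bounds`, and the placeholder strings in `geometry`, are hypothetical) to show how `how` and `indicator` can reveal which rows failed to match.

```python
import pandas as pd

# Hypothetical survey table and boundary table
survey = pd.DataFrame({'country': ['Ghana', 'Kenya', 'Ivory Coast'],
                       'value': [1.0, 2.0, 3.0]})
bounds = pd.DataFrame({'country': ['Ghana', 'Kenya', "Côte d'Ivoire"],
                       'geometry': ['poly_gh', 'poly_ke', 'poly_ci']})  # placeholders

# how='inner' keeps only the names found in both tables
inner = pd.merge(survey, bounds, on='country', how='inner')
print(len(inner))

# how='left' with indicator=True flags the rows that found no partner
check = pd.merge(survey, bounds, on='country', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Running a check like this before an inner merge is a cheap way to catch spelling differences between country lists.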

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
fn = os.path.join(base_dir, 'data/DhsPrevalenceWCovar.csv')
df = pd.read_csv(fn)
df.head()

**Cool!** Our dataset has a column - `country` - that has categorical geographic information that we can likely use to merge in the geometry data from our world countries `.shp` file. But we first need to look at which columns we will need from the world countries `.shp` file, which are the name and the geometry.

<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
gdf.columns

Let's make a new `GeoDataFrame` just with the columns we need and then let's merge.
<div class="run">
    ▶️ <b> Run the cell below. </b>
</div>

In [None]:
# Create a subset from the original GeodataFrame with only name and the geometry
gdf_sub = gdf[['NAME','geometry']].copy()

# check the subset
gdf_sub.head()

Before we can merge the two datasets, we have to make sure they both have the same column name. In this case, we need to rename `NAME` to `country`. Take a moment to read about [pd.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Rename the column
# inplace = True modifies gdf_sub directly and returns None, so do not assign the result
gdf_sub.rename(columns = {'NAME' : 'country'}, inplace = True)
gdf_sub.head()

In [None]:
# Now merge them
df_m = pd.merge(df, gdf_sub, on = 'country', how = 'inner')

In [None]:
# Check the columns of our DHS df_m
df_m.columns

In [None]:
df_m.head(1)

Because the left-hand table we passed to `pd.merge()` is a normal `Pandas DataFrame`, the result is also a normal `Pandas DataFrame` rather than a `GeoDataFrame`. The geometry column still contains the `Shapely POLYGON` objects, but the table will not have the same attributes or methods as a `GeoPandas GeoDataFrame`.

To fix this, we need to cast the `Pandas DataFrame` to a `GeoPandas GeoDataFrame`. GeoPandas sees the geometry column and knows it can build a GeoDataFrame object. In other words, a normal DataFrame must have a `geometry` column with `Shapely` objects to be cast into a `GeoDataFrame`. In general, it is best to have only one type of `Shapely` geometry (POINT, LINE, POLYGON, etc.) in a single `GeoDataFrame`.
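
The order of the two tables decides which type comes out of a merge. The sketch below, on two tiny made-up tables (`table` and `shapes` are hypothetical names), compares a merge with the plain DataFrame on the left against calling `.merge()` on the GeoDataFrame itself.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical attribute table and geometry table
table = pd.DataFrame({'country': ['A', 'B'], 'value': [10, 20]})
shapes = gpd.GeoDataFrame({'country': ['A', 'B']},
                          geometry=[Point(0, 0), Point(1, 1)],
                          crs='EPSG:4326')

# Plain DataFrame on the left: the result is a plain DataFrame
left_is_df = pd.merge(table, shapes, on='country')
print(isinstance(left_is_df, gpd.GeoDataFrame))

# GeoDataFrame on the left: the result stays a GeoDataFrame
left_is_gdf = shapes.merge(table, on='country')
print(isinstance(left_is_gdf, gpd.GeoDataFrame))
print(left_is_gdf.crs)
```

Either route works. Casting afterwards, as this notebook does, keeps the survey table's column order, while merging from the GeoDataFrame saves the extra step.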


<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
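print(type(df_m))
# Name the geometry column explicitly and carry over the CRS from the boundaries
gdf_m = gpd.GeoDataFrame(df_m, geometry = 'geometry', crs = gdf_sub.crs)
print(type(gdf_m))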

# ✏️ On your own

In [None]:
# plot stunting rates for the surveys taken in 2015 with boundaries

### Coordinate reference systems (CRS) and spatial Projections
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

The Earth is a three-dimensional spheroid, but maps are generally two-dimensional representations of phenomena on the Earth. As such, cartographers have developed many methods to project 3-D data onto a 2-D surface. To map points, lines, and polygons from 3-D to 2-D, projections have a coordinate reference system (CRS) that provides information on the units and type of geographic transform performed on the data to map it.

<img src="./assets/crs.jpg" alt="crs" width="500"/>

A full overview of geographic projections and coordinate reference systems is out of scope for this course. But I suggest [reading up on them](https://pro.arcgis.com/en/pro-app/3.1/help/mapping/properties/coordinate-systems-and-projections.htm) if you are unfamiliar with these terms.

For this class, what you need to know is: 

1. Always check the `crs` of your data and make sure that two or more datasets are in the same `crs` and `projection` before you perform analysis. This will become clearer in future labs.

2. Know that **reprojecting** data will fundamentally alter your data. This is okay. But it will add spatial uncertainty to analysis. More on this later.

Let's explore the `.crs` information and reproject the global country data. [EPSG](https://epsg.io) codes are useful to keep track of spatial metadata. Global data is often in [EPSG:4326](https://epsg.io/4326), the World Geodetic System 1984.
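
A minimal sketch of the first rule, using two made-up point layers (`a` and `b` are hypothetical): compare the `.crs` attributes and, if they differ, reproject one layer into the other's CRS with `to_crs()` before combining them. EPSG:3857 (Web Mercator) is used here only as an example of a CRS with units in metres.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical layers: the same two places stored in different CRSs
a = gpd.GeoDataFrame({'id': [1, 2]},
                     geometry=[Point(0, 0), Point(1, 0)],
                     crs='EPSG:4326')   # degrees
b = a.to_crs('EPSG:3857')               # metres

# Check whether the two layers share a CRS
print(a.crs == b.crs)

# Bring b into a's CRS before combining or comparing the layers
b_matched = b.to_crs(a.crs)
print(a.crs == b_matched.crs)

# The same location has very different coordinate values in each CRS
print(a.geometry.iloc[1].x, b.geometry.iloc[1].x)
```

The coordinates of the second point change from 1 degree to roughly 111 km, which is why mixing layers with different CRSs gives nonsense results.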
<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Access the .crs attribute of a GeoDataFrame
print(type(gdf_sub.crs))
gdf_sub.crs

In [None]:
# Notice that .crs has many methods and attributes
dir(gdf_sub.crs)

In [None]:
# Let's plot the data
gdf_sub.plot()

Now reproject it and plot it in a different crs, [ESRI:54009](https://epsg.io/54009), another commonly used global CRS. Note, this CRS was created by ESRI, not EPSG. 
<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
gdf_sub.to_crs('esri:54009').plot()

Notice that `esri:54009` distorts the edges of the Earth and produces a more oval map. All projections distort one or more of the following: area, direction, shape, and distance. But some do a better job of preserving this information than others.
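
To make the trade-off concrete, the sketch below compares two made-up 1° x 1° cells, one at the equator and one at 60°N. In degrees they look identical, but ESRI:54009 is the Mollweide projection, which is equal-area, so after reprojecting the areas reflect the fact that the northern cell covers much less ground. The cell locations are chosen purely for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical 1-degree by 1-degree cells
cells = gpd.GeoDataFrame(
    {'where': ['equator', '60N']},
    geometry=[box(0, 0, 1, 1), box(0, 60, 1, 61)],
    crs='EPSG:4326',
)

# In degrees, both cells have the same "area" (square degrees)
deg_area = [geom.area for geom in cells.geometry]
print(deg_area)

# Reproject to Mollweide (equal-area); areas are now in square metres
moll = cells.to_crs('ESRI:54009')
km2 = moll.area / 1e6
print(km2)

# Ratio of the northern cell's area to the equatorial cell's area
ratio = km2.iloc[1] / km2.iloc[0]
print(ratio)
```

The ratio comes out at roughly one half, so any area calculation should be done in an equal-area CRS rather than in degrees.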

### Writing a shape file
It is very easy to write a `GeoDataFrame` to disk as a `.shp` file or a GeoJSON file. Let's write our new DHS GeoDataFrame to disk.
<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# first check the crs
gdf_m.crs

In [None]:
# Create a File Name
fn = os.path.join(base_dir, 'Day2/data/new_dhs_data.shp')

You can use the [`gdf.to_file()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html) method to export your GeoDataFrame as a vector file.

In [None]:
# Export the updated GeoDataFrame as an ESRI shapefile (uncomment to run)
# gdf_m.to_file(fn, driver='ESRI Shapefile')