<div class="frontmatter text-center">
<h1>Geospatial Data Science</h1>
<h2>Exercise 6: Clustering</h2>
<h3>IT University of Copenhagen, Spring 2022</h3>
<h3>Instructors: Anastassia Vybornova &amp; Ane Rahbek Vierø</h3>
</div>

# Source
This notebook was adapted from:
* A course on geographic data science: https://darribas.org/gds_course/content/bG/diy_G.html

In [None]:
# import packages

import geopandas as gpd
import pandas as pd
import numpy as np

from sklearn import cluster
from pysal.lib import weights
from pysal.lib import examples

import matplotlib.pyplot as plt
import seaborn as sns
import contextily as cx

## Task I: NYC Geodemographics

We are going to try to get at the (geographic) essence of New York City. For that, we will rely on the same set of Census tracts for New York City that we used in exercise 4 (from the [PySAL examples](https://pysal.org/libpysal/tutorial.html?highlight=examples#example-datasets)). Once you have the `nyc` object loaded, create a geodemographic classification using the following variables:

- `european`: Population White
- `asian`: Population Asian American
- `american`: Population American Indian
- `african`: Population African American
- `hispanic`: Population Hispanic
- `mixed`: Population Mixed race
- `pacific`: Population Pacific Islander
- `otherethni`: Population of other ethnicity

Before performing the clustering, make sure to standardize these variables by converting the absolute counts into proportions of each tract's total population. NB: Compute the total population for each tract yourself from the variables above (do *not* use the `poptot` column from the data set, as this will lead to discrepancies). The resulting values should range from 0 (no population of a given ethnic group in that tract) to 1 (all population of that tract is in the given ethnic group). Once this is ready, get to work with the following tasks:

1. Pick a number of clusters, N
1. Run K-Means for N clusters
1. Plot the different clusters on a map
1. Analyse the results:
    - *What do you find?*
    - *What are the main characteristics of each cluster?*
    - *How are clusters distributed geographically?*
    - *Can you identify some groups concentrated in particular areas (e.g. Chinatown, Little Italy)?*
1. Redo with a different N; how do results depend on the number of chosen clusters?
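
The standardisation and K-Means steps above can be sketched on toy data. This is only an illustration under stated assumptions, not the solution: the column names `group_a`, `group_b`, `group_c` and the counts are hypothetical and are not taken from the `nyc` table. Note that a tract with a total of zero would produce `NaN` shares, which K-Means cannot handle, so such rows would need to be dealt with first.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy counts for three hypothetical groups in six hypothetical tracts
counts = pd.DataFrame(
    {
        "group_a": [90, 80, 10, 5, 40, 45],
        "group_b": [5, 10, 85, 90, 30, 25],
        "group_c": [5, 10, 5, 5, 30, 30],
    }
)

# Total per tract, computed from the group columns themselves
total = counts.sum(axis=1)

# Row-wise shares: every value lies in [0, 1] and every row sums to 1
shares = counts.div(total, axis=0)

# K-Means with a fixed seed so that the run is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(shares)

# Average share of each group within each cluster
profile = shares.groupby(labels).mean()
print(profile)
```

The `profile` table is one way to read off the main characteristics of each cluster; the cluster numbers themselves are arbitrary and carry no order.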

In [None]:
examples.explain("NYC Socio-Demographics")

In [None]:
# Load data
nyc_data = examples.load_example("NYC Socio-Demographics")

nyc = gpd.read_file(nyc_data.get_path("NYC_Tract_ACS2008_12.shp"))

nyc.plot(figsize=(9, 9));

In [None]:
# Your solution for Task I

## Task II: Regionalisation of Dar Es Salaam

For this task we will travel to Tanzania's Dar Es Salaam. We are using a dataset assembled to describe the built environment of the city centre. Let's load up the dataset first.

**Make sure you are connected to the internet when you run this cell.**


In [None]:
# Read the file in
db = gpd.read_file("http://darribas.org/gds_course/content/data/dar_es_salaam.geojson")

In [None]:
db.head()

Geographically, this is what we are looking at:

In [None]:
ax = db.plot(
    facecolor="none", 
    edgecolor="red",
    linewidth=0.5,
    figsize=(20, 20)
)
cx.add_basemap(
    ax, 
    crs=db.crs, 
    source=cx.providers.Esri.WorldImagery
)


We can inspect the table:

In [None]:
db.info()

Two main aspects of the built environment are considered: the street network and buildings. To capture those, the following variables are calculated for the H3 hexagonal grid system, zoom level 8:

- Building density: number of buildings per hexagon
- Building coverage: proportion of the hexagon covered by buildings
- Street length: total length of streets within the hexagon
- Street linearity: a measure of how regular the street network is

With these at hand, your task is the following:

- **Develop a regionalisation that partitions Dar Es Salaam based on its built environment**

*These are only guidelines, feel free to improvise and go beyond what's set. The sky is the limit!*


For that, you can follow these suggestions:

- Create a spatial weights matrix to capture spatial relationships between hexagons.
    - If you use nearest neighbours: think about whether the geometry of the polygons should be accounted for;
    - If you use a distance band: think about an appropriate threshold and appropriate units (possibly change the CRS for that; hint: [this map of UTM zones](https://www.arcgis.com/home/item.html?id=b294795270aa4fb3bd25286bf09edc51) and a search on the https://epsg.io/ website can help!) 
- Set up a regionalisation algorithm with a given number of clusters (e.g. five or seven)
- Generate a geometry that contains only the boundaries of each region and visualise it (ideally with a satellite image as basemap for context); for an appropriate cmap browse [matplotlib's cmap palettes](https://matplotlib.org/stable/tutorials/colors/colormaps.html)
- Rinse and repeat with several combinations of variables and number of clusters
- Pick your best. *Why have you selected it? What does it show? What are the main groups of areas based on the built environment?*
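
To illustrate what a regionalisation does differently from plain clustering, here is a minimal sketch of agglomerative clustering with a contiguity constraint in scikit-learn. It makes several assumptions: the data is a toy 4x4 square grid with a single made-up feature, not the Dar Es Salaam hexagons, and the connectivity matrix is built with `grid_to_graph` rather than from a PySAL weights object. With real polygons, the sparse matrix of a spatial weights object (its `sparse` attribute) could plausibly play the same role.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

n_rows, n_cols = 4, 4

# One toy feature per cell: low values in the two left columns,
# high values in the two right columns, plus a small deterministic variation
feature = np.zeros((n_rows, n_cols))
feature[:, 2:] = 10.0
feature += 0.01 * np.arange(n_rows * n_cols).reshape(n_rows, n_cols)
X = feature.reshape(-1, 1)

# Sparse adjacency between cells that share an edge on the grid
connectivity = grid_to_graph(n_rows, n_cols)

# Ward clustering in which only neighbouring cells/clusters may be merged
model = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=connectivity
)
region = model.fit_predict(X).reshape(n_rows, n_cols)
print(region)
```

Because merges are restricted to neighbours, each resulting region is spatially contiguous; without the `connectivity` argument, cells with similar values could be grouped together regardless of where they are.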

In [None]:
# Your solution for Task II