**Original code**: [Babak Khavari](https://github.com/babakkhavari)<br>
**Conceptualization & Methodological review** : [Babak Khavari](https://github.com/babakkhavari)<br>
**Updates, Modifications**: [Babak Khavari](https://github.com/babakkhavari)<br>

# Clustering Notebook

This notebook can be used to replicate the population clusters developed and published at https://data.mendeley.com/datasets/z9zfhzk8cr/. See **Assessing the urban-rural split and electrification in Sub-Saharan Africa - a cluster-based methodology** for more information. For a more thorough description of each function, refer to funcs.ipynb.

## Datasets
The clustering makes use of three GIS datasets:
* **Administrative units (vector polygon)** - This should be disaggregated. It will be used 1) to delimit the population layer to the area of interest and 2) to limit the maximum size of the clusters.
* **Population (raster)**
* **Nighttime lights (raster)** - This will be used in order to estimate electrified population in each cluster.


## Pre-processing
Before using this notebook, please ensure that all of your datasets are in the WGS84 coordinate reference system (EPSG:4326) and that your raster datasets are clipped to the administrative boundaries of the area of interest.


## Output
The final clusters will include 7 columns.

1. **id** – The IDs are given as a unique number for each cluster. This enables the user to process the data contained in the clusters outside of a GIS software and then merge the data with the clusters.


2. **Country** – Name of the country. 


3. **Population** – This is the population in each cluster obtained from the population dataset.


4. **NightLight** – This value is obtained from the nighttime light map and represents the maximum luminance detected in each cluster (best results are obtained with stable lights).


5. **ElecPop** – The number of people in each cluster who live in areas where light sources are detected.


6. **Area** – The area of each cluster given in square kilometres.


7. **IsUrban** – Urban/Peri-urban/Rural classification (urban = 2, peri-urban = 1, rural = 0).
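
As a rough illustration of how the *id* column can be used outside a GIS, the sketch below derives an electrification rate from *ElecPop* and *Population* and attaches external data by *id*. The rows, the `demand_kwh` table and the `ElecRate` and `Demand` fields are hypothetical and are not produced by this notebook.

```python
# Hypothetical rows standing in for the exported cluster attribute table.
clusters = [
    {"id": 0, "Population": 1200.0, "ElecPop": 900.0, "IsUrban": 2},
    {"id": 1, "Population": 300.0, "ElecPop": 0.0, "IsUrban": 0},
]

# Hypothetical external data keyed by the same cluster id.
demand_kwh = {0: 5400.0, 1: 210.0}


def electrification_rate(row):
    """Share of the cluster population living in lit areas (0 if empty)."""
    if row["Population"] <= 0:
        return 0.0
    return row["ElecPop"] / row["Population"]


for row in clusters:
    row["ElecRate"] = electrification_rate(row)
    row["Demand"] = demand_kwh.get(row["id"])  # None if the id is missing
```

Because every cluster keeps its *id*, the extended rows can be joined back to the cluster geometries afterwards.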
    

## Cell 1 - Importing packages

In [1]:
from ipynb.fs.full.funcs import *

## Cell 2 - Selecting Datasets

Select the workspace. This is the folder that will be used for the outputs.

**NOTE** Select an empty folder, as all files will be deleted from the workspace once the clusters are generated.

You will also have to select the three datasets used in the analysis: administrative boundaries (.shp), population (.tif) and nighttime lights (.tif).
 


In [2]:
messagebox.showinfo('OnSSET extraction', 'Output folder')
workspace = filedialog.askdirectory()

messagebox.showinfo('OnSSET', 'Select the population map')
filename_pop = filedialog.askopenfilename(filetypes = (("rasters","*.tif"),("all files","*.*")))
pop=gdal.Open(filename_pop)
poprasterio=rasterio.open(filename_pop)

messagebox.showinfo('OnSSET', 'Select the nightlights map')
filename_NTL = filedialog.askopenfilename(filetypes = (("rasters","*.tif"),("all files","*.*")))
NTL = gdal.Open(filename_NTL)
NTLrasterio = rasterio.open(filename_NTL)

messagebox.showinfo('OnSSET', 'Select the admin map')
filename_admin = (filedialog.askopenfilename(filetypes = (("shapefile","*.shp"),("all files","*.*"))))
admin=gpd.read_file(filename_admin)

## Cell 3 - Setting study area name

This sets the name displayed in the *Country* column of the final clusters.

In [3]:
study_area_name = "Benin"

## Cell 4 - Setting the target coordinate system
When calculating distances and areas, it is important to choose a coordinate system that represents distances and areas correctly in your area of interest.

To select your own coordinate system, go to [epsg.io](http://epsg.io/) and type in your area of interest. This will give you a list of coordinate systems to choose from. Once you have selected your coordinate system, replace the numbers below with the numbers from your coordinate system **(keep the "epsg:" prefix)**.

**NOTE** When selecting your coordinate system, make sure that you select a system with meters as its unit (or another linear length unit). This is indicated for all systems on [epsg.io](http://epsg.io/).

In [4]:
crs = 'epsg:3395'
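
The sketch below illustrates why a metric coordinate system is needed. In EPSG:4326 coordinates are given in degrees, and the ground length of one degree of longitude shrinks with latitude. It assumes a spherical Earth with a mean radius of 6,371 km, so the numbers are approximate, and the function name is hypothetical.

```python
import math


def metres_per_degree_longitude(latitude_deg):
    """Approximate ground length of one degree of longitude on a sphere."""
    earth_radius_m = 6_371_000  # mean Earth radius, spherical approximation
    return math.radians(1) * earth_radius_m * math.cos(math.radians(latitude_deg))


at_equator = metres_per_degree_longitude(0)    # roughly 111 km
at_60_north = metres_per_degree_longitude(60)  # about half of the equator value
```

A 500 m buffer therefore cannot be expressed as a fixed number of degrees, which is why distances and areas are computed in a projected system.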

## Cell 5 - Urban ratio, actual population and thresholds

1) Enter the urban ratio in the study area. This will be used to calibrate the urban and rural clusters (0 = everyone is rural, 1 = everyone is urban).<br><br>
2) Enter the total population in your study area for the year of interest. This will be used to calibrate the GIS-population.<br><br>
3) Enter thresholds for population and nighttime light. All values under the threshold in the population and/or NTL maps will be removed. This will have an effect on the electrification proxy and can be used in order to remove noise in the NTL maps.

In [5]:
urban_ratio = 0.48
total_population = 12123000
population_threshold = 0
NTL_threshold = 0

## Cell 6 - Clipping raster layers

Clipping the nighttime lights and population maps to the extent of the study area. The first parameter is the raster to clip and the second parameter is the polygon to clip the raster by. **Do not change the parameters given here**.

In [6]:
clipped_NTL = clipRasterByExtent(NTL, admin)
clipped_Pop = clipRasterByExtent(pop, admin)

## Cell 7 - Reclassifying rasters

Reclassifies the clipped nighttime lights and population layers. The function sets everything under the thresholds to zero. The first parameter is the clipped raster from cell 6 and the second parameter the thresholds from cell 5. **Do not change the parameters given here**.

In [7]:
reclassified_NTL = reclassifyRasters(clipped_NTL, NTL_threshold)
reclassified_Pop = reclassifyRasters(clipped_Pop, population_threshold)
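
As a minimal sketch of the thresholding idea, the snippet below sets every value under a threshold to zero. It uses plain nested lists in place of raster bands. The function `reclassify` is a hypothetical stand-in and not the implementation of `reclassifyRasters` in funcs.ipynb.

```python
def reclassify(grid, threshold):
    """Return a copy of grid with every value below threshold set to zero."""
    return [[value if value >= threshold else 0 for value in row] for row in grid]


ntl = [[0.2, 1.5], [3.0, 0.4]]

cleaned = reclassify(ntl, 0.5)   # low values are treated as noise
unchanged = reclassify(ntl, 0)   # a threshold of 0 keeps non-negative data as is
```

With the default thresholds of 0 given in Cell 5, non-negative rasters pass through unchanged.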

## Cell 8 - Resample population raster

Resamples the population layer and saves the resampled map to disk.

In **resampleRaster** the first parameter is the raster to be resampled. The second parameter is the factor used in the resampling (e.g. for a raster with a cell size of 30 m, a factor of 3 creates an output raster with a cell size of 90 m). It is recommended to change this value to 1 if the cell size of your population layer is larger than 100 m. The output is a memory layer.

In **saveRaster** the memory layer from resampleRaster is saved to disk.

In [8]:
resampled_Pop = resampleRaster(reclassified_Pop, 3)
saveRaster(resampled_Pop, workspace + r"/rasterBase.tif")
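
The effect of the resampling factor can be sketched on a toy grid. The example assumes that cells are merged in blocks of factor x factor and that population counts are summed within each block. `resampleRaster` may use a different resampling rule, and `aggregate_by_factor` is a hypothetical helper.

```python
def aggregate_by_factor(grid, factor):
    """Merge factor x factor blocks of cells into one cell by summing them."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect the block, truncating at the grid edge if needed.
            block = [grid[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            out_row.append(sum(block))
        out.append(out_row)
    return out


pop_30m = [[1] * 6 for _ in range(6)]      # 6 x 6 cells, one person each
pop_90m = aggregate_by_factor(pop_30m, 3)  # 2 x 2 cells, nine people each
new_cell_size = 30 * 3                     # 30 m cells become 90 m cells
```

Under this assumption the total population is preserved while the number of cells drops by the square of the factor.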

## Cell 9 - Convert rasters to polygon

Converts the areas where nighttime lights are greater than the threshold to polygons. The last parameter (the output path) ensures that the polygons are saved with the correct name. **Do not change the parameters given here**.

In [9]:
NTL_pol = toPolygon(reclassified_NTL, 2, workspace + r"/NTLArea")

## Cell 10 - Creating the cluster base

**rasterize** converts the administrative areas to a raster.

**rasterMultiplication** multiplies the resampled population with the rasterized admin layer and overwrites the resampled raster with the result.

**toPolygon** converts the result of rasterMultiplication to polygons. The last parameter (the output path) ensures that the polygons are saved with the correct name.

This ensures that the clusters do not spill over administrative borders. If you do not wish to add such a restriction, use the national administrative unit (level 0) as your administrative layer.

**Do not change the parameters given here**.

In [10]:
rasterize(admin, filename_admin,resampled_Pop, workspace + r'/raster_admin.tif')
rasterMultiplication(workspace + r"/rasterBase.tif", workspace + r"/raster_admin.tif", workspace + r"/rasterBase.tif")
clusters = toPolygon(workspace + r"/rasterBase.tif", 1, workspace + r"/clusters")

#### Optional step - adding a buffer zone, dissolve and convert to single part polygons
The buffer will reduce the number of final clusters. You may skip this step if you want to have a more "clean" result.

In [11]:
#Define coordinate system of the geodataframe
clusters = gpd.GeoDataFrame(clusters, geometry='geometry', crs={'init': 'epsg:4326'})

# Projecting geodataframe to coordinate system (so buffer can be added in meters)
clusters_prj = clusters.to_crs({'init': "epsg:3395"})

In [12]:
# Adding buffer (in meters)
clusters_prj["geometry"] = clusters_prj.geometry.buffer(500)

# Adding supporting column for dissolving 
clusters_prj["DISS"] = 0

# Dissolve buffered polygons
clusters_prj = clusters_prj.dissolve(by="DISS")

# Convert one multipart polygon (product of dissolve) to many single part polygons 
clusters_prj = multi2single(clusters_prj)

# Reproject data to the reference coordinate system 
clusters = clusters_prj.to_crs({'init': 'epsg:4326'})    

## Cell 11 - Adding attributes to clusters

Generates the *id*, *Country* and *Area* columns. **Do not change the parameters given here**.

In [13]:
clusters = addAttributes(clusters, crs, study_area_name)

## Cell 12 - Generate ElecPop 

Clips the population data with the nighttime light polygons to generate a layer with the population in lit areas. The spatial distribution in this layer depends on the thresholds given in Cell 5. **Do not change the parameters given here**.

In [14]:
elecPop = clipRasterByMask(poprasterio.name, workspace + r"/NTLArea.shp", "EPSG:4326", workspace + r"/rasterBase.tif")

## Cell 13 - Populating clusters

Adding the *Population*, *NightLight* and *ElecPop* columns. **Do not change the parameters given here**.

In [15]:
clusters = populatingClusters(clusters, poprasterio, "Pop", "sum")
clusters = populatingClusters(clusters, NTLrasterio, "NTL", "max")
clusters = populatingClusters(clusters, elecPop, "ElecPop", "sum")
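
The three columns can be thought of as zonal statistics: a sum of population, a maximum of nighttime light and a sum of population in lit cells. The sketch below shows that idea on toy grids where each cell carries a cluster id. It is an illustration under that assumption, not the implementation of `populatingClusters`, and `zonal_stat` is a hypothetical helper.

```python
def zonal_stat(zones, values, stat):
    """Aggregate values per zone id; cells with zone id None are ignored."""
    collected = {}
    for zone_row, value_row in zip(zones, values):
        for zone, value in zip(zone_row, value_row):
            if zone is not None:
                collected.setdefault(zone, []).append(value)
    reducer = {"sum": sum, "max": max}[stat]
    return {zone: reducer(cells) for zone, cells in collected.items()}


zones = [[1, 1, None], [2, 2, 2]]         # cluster id per cell
pop = [[10, 20, 5], [1, 2, 3]]            # people per cell
ntl = [[0.0, 4.0, 9.0], [0.0, 0.0, 0.0]]  # luminance per cell

# Population that lives in cells where light is detected.
lit_pop = [[p if n > 0 else 0 for p, n in zip(pop_row, ntl_row)]
           for pop_row, ntl_row in zip(pop, ntl)]

population = zonal_stat(zones, pop, "sum")
night_light = zonal_stat(zones, ntl, "max")
elec_pop = zonal_stat(zones, lit_pop, "sum")
```

Cells outside every cluster, marked `None` here, contribute to none of the columns.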

## Cell 14 - Calibrate population 

Calibrating population based on the population value given in Cell 5. **Do not change the parameters given here**.

In [16]:
elecPop=None
clusters = calibratePop(clusters, workspace, total_population)
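
One simple way to calibrate a GIS population to a known total is proportional scaling, sketched below. This assumes that every cluster is scaled by the same factor. `calibratePop` may work differently, and `scale_to_total` is a hypothetical helper.

```python
def scale_to_total(populations, target_total):
    """Scale values proportionally so that they sum to target_total."""
    current_total = sum(populations)
    if current_total == 0:
        raise ValueError("cannot calibrate: GIS population sums to zero")
    factor = target_total / current_total
    return [value * factor for value in populations]


gis_pop = [4000.0, 1000.0, 5000.0]             # population per cluster from the raster
calibrated = scale_to_total(gis_pop, 12000.0)  # known total for the year of interest
```

Proportional scaling keeps the relative size of the clusters while matching the total entered in Cell 5.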

## Cell 15 - Calibrate Urban and rural split 

Calibrating the urban/rural split based on the urban ratio given in Cell 5. **Do not change the parameters given here**.

In [17]:
urbanSplit = calibrateUrban(clusters, urban_ratio, workspace)

Modelled urban ratio is 0.419% in comparision to the actual ratio of 0.48% after 501 iterations.
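
The modelled ratio does not have to match the target exactly, since whole clusters are assigned to a class. The sketch below shows one simple way such a split could be made: flag the most populous clusters as urban until their share of the total reaches the target. This is an assumption made for illustration only. It ignores the peri-urban class and is not the algorithm used by `calibrateUrban`.

```python
def split_urban(populations, urban_ratio):
    """Flag the largest clusters as urban until their share reaches urban_ratio."""
    total = sum(populations)
    # Indices of the clusters, most populous first.
    order = sorted(range(len(populations)), key=lambda i: populations[i], reverse=True)
    is_urban = [False] * len(populations)
    urban_pop = 0
    for i in order:
        if total == 0 or urban_pop / total >= urban_ratio:
            break
        is_urban[i] = True
        urban_pop += populations[i]
    modelled = urban_pop / total if total else 0.0
    return is_urban, modelled


flags, modelled_ratio = split_urban([50, 500, 150, 300], 0.48)
```

In this toy case a single cluster holds half of the population, so the modelled ratio lands at 0.5 rather than exactly on the 0.48 target.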
