# Practice exercise: Geospatial Machine Learning

This is a short practice exercise in developing a machine learning model with `scikit-learn`. The task is to predict leaf area index (LAI) from topographic variables and predictor variables derived from a <a href="https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED" target="_blank">Sentinel-2 satellite image</a>. The LAI measurements were derived from LiDAR data with a 30 cm spatial resolution and a 15 cm vertical resolution, then averaged to a 10 m spatial resolution to match the size of Sentinel-2 pixels.

LAI is a measure of the total area of leaves relative to the ground area and is an important biophysical variable for studying vegetation growth and functioning. The data we are using here were collected over the Marburg Forest in Germany and are from the paper by <a href="https://www.sciencedirect.com/science/article/abs/pii/S0304380019303230" target="_blank">Meyer et al. (2019)</a>.

## Setup

### Download data

In [None]:
import os
import subprocess

if "data-geoml" not in os.listdir(os.getcwd()):
    subprocess.run('wget "https://github.com/envt-5566/geo-ml/raw/main/data/data-geoml.zip"', shell=True, capture_output=True, text=True)
    subprocess.run('unzip "data-geoml.zip"', shell=True, capture_output=True, text=True)
    if "data-geoml.zip" not in os.listdir(os.getcwd()):
        print("Has a directory called data-geoml been downloaded and placed in your working directory? If not, try re-executing this code chunk")
    else:
        print("Data download OK")

# the data files are read relative to the current working directory
DATA_PATH = os.getcwd()

### Load packages

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install mapclassify
    !pip install contextily
    !pip install pysal

import os
import math

import numpy as np
import pandas as pd
import geopandas as gpd

# spatial analysis libraries
import pysal

# plotting
import seaborn as sns
import contextily as cx
import matplotlib.pyplot as plt

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# models
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

# metrics
from sklearn.metrics import mean_squared_error

## Load data

In [None]:
gdf = gpd.read_file(os.path.join(DATA_PATH, "lai_meyer_et_al_2019_marburg.geojson"))
gdf = gdf.drop(columns=["field_1", "ID", "x_utm_25832", "y_utm_25832"])
gdf = gdf.to_crs("EPSG:4326")

In [None]:
gdf.head()

In [None]:
gdf.explore(column="LAI")

## Activity

Predicting LAI is a machine learning regression task because LAI is a continuous numeric value. **Can you adapt the examples from previous notebooks to develop and evaluate a model that predicts LAI from spectral reflectance and topographic predictors?**

You will need to consider:

* Which variables to drop before model training.
* Whether you need to standardise the training and test data.
* Which metric you will use to evaluate the model.
* How you will create training and test splits to evaluate the model (think about the spatial structure of your data).
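
The considerations above can be sketched on synthetic data, as one possible approach rather than the expected solution. The sketch assumes a spatial block split: points are clustered on their coordinates with `KMeans`, and whole clusters are held out with `GroupKFold`, so that no block contributes to both training and testing. The synthetic arrays `coords`, `X` and `y`, the number of blocks, the network size and the choice of RMSE as the metric are all illustrative assumptions. With the real data, and assuming point geometries, the coordinates could be taken from `gdf.geometry.x` and `gdf.geometry.y`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data (assumption, not the real dataset):
# 300 points with x/y coordinates, 4 predictor columns and a continuous target.
n = 300
coords = rng.uniform(0, 1000, size=(n, 2))
X = rng.normal(size=(n, 4))
y = 3.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# 1. Assign each point to a spatial block by clustering the coordinates.
blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)

# 2. Put the scaler and the model in one pipeline so that the scaler is
#    fitted on the training fold only and then applied to the test fold.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)

# 3. Hold out whole blocks: no block appears in both training and test data.
rmse_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=blocks):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmse_per_fold.append(float(np.sqrt(mean_squared_error(y[test_idx], pred))))

print("RMSE per held-out block:", np.round(rmse_per_fold, 3))
```

Comparing these scores with those from a purely random `train_test_split` on the same data is one way to see how much spatial autocorrelation may be inflating a random-split estimate.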

In [None]:
## ADD CODE HERE

## Activity

Consider the following questions and write a response of roughly one paragraph.

**Outline the rationale behind your strategy for creating training and test splits for model evaluation.**

**Based on your current evaluation strategy, could you be confident in deploying your model to generate accurate LAI predictions for all of Germany? For all of Europe?**

**You are tasked with generating a Germany-wide LAI map. Outline a strategy for generating training and test data to support this task and describe why this strategy is suitable.**