# Week 10: Revision session

This session recaps the important learning outcomes from the whole module.

# Connecting to Google Drive from Colab

In [None]:
# Load the Drive helper and mount your Google Drive as a drive in the virtual machine
from google.colab import drive
drive.mount('/content/drive')

# Import all necessary libraries

In [None]:
#import required libraries
!pip install rasterio
!pip install sentinelsat
!pip install geopandas
!pip install rasterstats

import csv 
import geopandas as gpd
import math
from math import floor, ceil
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from osgeo import ogr  # recent GDAL releases no longer support the bare "import ogr"
import os
from os import listdir
from os.path import isfile, isdir, join
from osgeo import gdal
import pickle
from pyproj import Proj
from pprint import pprint
import rasterio
from rasterio.windows import Window
from rasterio import features, plot
from rasterio.plot import show_hist, reshape_as_raster, reshape_as_image
from rasterio.warp import calculate_default_transform, reproject, Resampling
from rasterstats import zonal_stats
import skimage.io as io
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
import joblib
import shutil
import sys

# make sure that this path points to the location of the pygge module on your Google Drive
libdir = '/content/drive/MyDrive/practicals21-22' # this is where pygge.py needs to be saved
if libdir not in sys.path:
    sys.path.append(libdir)

# import the pygge module
import pygge

%matplotlib inline

# Week 1: Introduction to Python

Functions, Control Flow, Lists, Loops and Strings

Google Drive, iPython notebooks, Google Colab


# Week 2: Principles of Image Data

Data types and type conversions, data input and output

Modules, dictionaries, files, classes and function arguments

NumPy


# Week 3: Getting started with raster data processing with RasterIO

Raster datasets and their attributes

Coordinate reference systems

Reading and writing image data

Image metadata

GDAL and RasterIO libraries

# Week 4: Image analysis and visualisation with RasterIO

Reprojecting (warping) and reshaping a Sentinel-2 image

Image visualisation with Matplotlib, displaying RGB channels with image band data, image enhancement




# Week 5: Processing Sentinel-2 Image Composites

Searching for and downloading Sentinel-2 image composites from Google Earth Engine via the Python API

Creating an animated movie from a time series of images


# Week 6: Normalised Difference Vegetation Index (NDVI) Time-Series Analysis

Downloading time-series of Sentinel-2 reflectance for selected locations

Calculating NDVI

Visualisation and plotting


# Week 7: Accessing Sentinel-2 Images from the Copernicus Open Access Hub

Searching for and downloading Sentinel-2 data from the Copernicus Open Access Hub API using the SentinelSat library

Sentinel-2 file and directory structure

Image pre-processing, including unzipping, warping and region clipping


# Week 8: Vegetation Index analysis with Zonal Statistics of Polygons

Calculating an NDVI image from Sentinel-2 image bands

Zonal statistics: Extracting statistics of pixel values within polygons of a shapefile


# Week 9: Machine Learning Applications

Using QGIS to collect classification training data in a polygon vector layer

Land cover classification and model training with random forests using the SciKit-Learn library



# -----------------------------

Make sure that all your files are in the right places before running the next cell.

Edit the directory paths if they do not match your own directory structure.

In [None]:
# set up your working directory with the satellite data
root = '/content/drive/MyDrive/practicals21-22' 
wd = join(root, 'rf')

# path to your download directory that contains several full Sentinel-2 images (granules)
s2path = join(root,'download')

# path to our shapefile to get the extent of the clipped image area
shapefile = join(root, 'oakham', 'Polygons_small.shp') # ESRI Shapefile of the study area

# names of bands to be included in the merged GeoTiff file
bandnames = ["B02", "B03", "B04", "B08"]

'''
* We call the merged raster output file 'S2_stack.tif' because it contains several bands
  from several images acquired at different dates
'''
# path to the new merged file with the selected bands from different acquisition dates in GeoTiff format
s2merged = join(wd, 'S2_stack.tif')

# make a filename for the classified map
outfile = join(wd, "LandCoverMap.tif")

# define the name of the shapefile containing the training polygons
trainshapefile = join(wd, 'training_areas.shp')

# define the name of the output raster file that will contain the class numbers of our
#   training areas as pixel values
trainraster = join(wd, 'training_areas.tif')
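
Before running the later cells, it can help to confirm that the input paths actually exist. The following is a minimal sketch using only the standard library; `find_missing_paths` is a hypothetical helper name and is not part of pygge.

```python
import os
import tempfile

def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Demonstration with a temporary directory and a file name that does not exist
demo_dir = tempfile.mkdtemp()
demo_missing = find_missing_paths([demo_dir, os.path.join(demo_dir, "Polygons_small.shp")])
print("Missing paths:", demo_missing)
```

In this notebook, the same helper could be called with the inputs defined above, for example `find_missing_paths([wd, s2path, shapefile, trainshapefile])`, and any path it returns needs to be corrected first.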

# Merge all bands into a single file
The Sentinel-2 image bands are all in separate .jp2 files. First, we need to find the file names of all image bands we want to include in the classification and merge the band rasters into a single GeoTiff file.

In [None]:
'''
* We want to use all four 10 m resolution bands from all Sentinel-2 images in the download directory
* We stack all the band rasters into one array
* Use s2path pointing to the download directory (like two weeks ago)
'''

# how many .SAFE directories are in the download directory?
# get the list of all directories in the download directory
dirlist = [d for d in listdir(s2path) if isdir(join(s2path, d))]

# make an empty list of all Sentinel-2 granule IDs we have downloaded
s2IDs = [] 

# iterate over all Sentinel-2 .SAFE image directories
for dirname in dirlist:
  # the directory names have the following structure, for example:
  # S2A_MSIL2A_20190919T110721_N0213_R137_T30UXD_20190919T140654.SAFE
  # the first part of the directory name is the granule ID
  # so we split off the ".SAFE" as follows:
  sceneID = dirname.split(".")[0]
  s2IDs.append(sceneID) # append the unique identifier to the list

print(len(s2IDs), " Sentinel-2 images found.")
print("List of all Granule IDs:")
pprint(s2IDs)

# make an empty list of all band file names for all images we want to merge into one raster for the classification
files_selected = []

# iterate over all Sentinel-2 image directories again
for d in range(len(dirlist)):

  # get all file names of the 10 m resolution band files from all Sentinel-2 images
  files_10m = sorted(pygge.get_filenames(s2path, "_10m", "10m"))
  print("All bands in the image directory:")
  pprint(files_10m)
  print("\n")

  # Each file name consists of components separated by underscores _
  # e.g. "T30UXD_20190919T110721_B04_10m.jp2"
  # has the components ["T30UXD", "20190919T110721", "B04", "10m.jp2"]
  # so the component indexed 2 contains the band name.
  # Below, we select the first file whose name contains the band name.
  for b in bandnames:
    '''
    below, we join the directory path to the 10 m resolution imagery for the Sentinel-2
    granule we are just iterating over to the names of the 10 m band files,
    so we can later find them again
    '''
    files_selected.append([files_10m[index] for index, content in enumerate(files_10m) if b in content][0])

print("\nList of all selected band files from all acquisition dates for merging into one raster file:")
for i in files_selected:
  print(i)
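
The directory and file names used in the cell above follow the Sentinel-2 naming convention, so their components can be pulled apart with plain string operations. The sketch below illustrates this; the function names `parse_safe_name` and `band_of` are hypothetical and not part of pygge.

```python
import os
from datetime import datetime

def parse_safe_name(dirname):
    """Split a Sentinel-2 .SAFE directory name into its named components."""
    scene_id = dirname.split(".")[0]  # drop the ".SAFE" extension
    mission, product, sensing, baseline, orbit, tile, discriminator = scene_id.split("_")
    return {
        "scene_id": scene_id,
        "mission": mission,
        "product_level": product,
        "sensing_start": datetime.strptime(sensing, "%Y%m%dT%H%M%S"),
        "processing_baseline": baseline,
        "relative_orbit": orbit,
        "tile": tile,
        "discriminator": discriminator,
    }

def band_of(filename):
    """Return the band name (e.g. 'B04') from the path of a 10 m band file."""
    return os.path.basename(filename).split("_")[2]

info = parse_safe_name("S2A_MSIL2A_20190919T110721_N0213_R137_T30UXD_20190919T140654.SAFE")
print(info["tile"], info["sensing_start"].date())     # T30UXD 2019-09-19
print(band_of("T30UXD_20190919T110721_B04_10m.jp2"))  # B04
```

Sorting the scenes by `sensing_start` is one way to make sure the bands are stacked in chronological order.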

In the cell below, we iterate over all band files from the images acquired on different dates and write each one to the next consecutive band of the merged file. For example, if 4 bands are included from each image acquisition, then the first band of the second image becomes output band 4+1=5.
The output raster file therefore needs 'bands * images' output bands.
We use len(files_selected) rather than len(bandnames) as the band count to create enough output bands.
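
The band numbering described above can be written as a small calculation. This is an illustrative sketch only; `output_band` and `source_of` are hypothetical helper names, and the image count of 3 is an example value.

```python
def output_band(image_index, band_index, bands_per_image):
    """1-based output band number for 0-based image and band positions."""
    return image_index * bands_per_image + band_index + 1

def source_of(band_number, bands_per_image):
    """Inverse lookup: 0-based (image_index, band_index) of a 1-based output band."""
    return divmod(band_number - 1, bands_per_image)

bandnames = ["B02", "B03", "B04", "B08"]
n_images = 3  # example value: three acquisition dates

print(output_band(1, 0, len(bandnames)))  # first band of the second image: 5
print(n_images * len(bandnames))          # total number of output bands: 12
print(source_of(5, len(bandnames)))       # (1, 0)
```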

In [None]:
print("Creating clipped, merged band file: " + s2merged)

# Get the shapefile layer's extent
extent, crs, epsg = pygge.get_shp_extent(shapefile)

# now iterate over all band files we want to include in the merged file
for index, f in enumerate(files_selected):

  with rasterio.open(f, 'r') as thisfile:
    # clip the input file
    clipfile = f.split(".")[0] + "_clip.tif"
    print("Creating temporary clipped image file: " + clipfile)
    pygge.easy_clip(f, clipfile, extent)
    
    # in the first iteration, open the new file with the clipped, merged band data for write access
    if index == 0:
      # open one of the band files to get metadata
      bandfile = rasterio.open(clipfile, 'r')
      dt = bandfile.read(1).dtype
      # open the output file
      s2merged_file = rasterio.open(s2merged, 'w', driver='GTiff', width=bandfile.width, 
                                    height=bandfile.height, count=len(files_selected), 
                                    crs=bandfile.crs, transform=bandfile.transform, dtype=dt)
      # close the file containing the metadata we copied
      bandfile.close()

    # now write the clipped raster band from the input file to the output file
    with rasterio.open(clipfile, 'r') as bandfile:
      print("Writing clipped raster to band: " + str(index + 1))
      s2merged_file.write(bandfile.read(1), index + 1)
    
    # the clipped input raster file was closed automatically at the end of the 'with' block
    # remove the temporary clipped input raster file
    os.remove(clipfile)
  
  # the full input raster file is closed automatically at the end of the 'with' block

# close the output file
s2merged_file.close()

# Warp the stacked file

Now we will reproject (warp) the stacked raster file to the same projection as our shapefile or geojson file.


In [None]:
# make a file name for our new file
warpfile = s2merged.split(sep='.')[0] + '_warped.tif'

# warp the raster file to the same projection as the shapefile
pygge.easy_warp(s2merged, warpfile, epsg)

# Visualise our image
At this point, we may want to check whether all processing steps have worked. We need to look at our image to see whether anything went wrong.

In [None]:
# create a figure with subplots
fig, ax = plt.subplots(figsize=(10,10))
fig.patch.set_facecolor('white')

'''
We use the pygge easy_plot function here to check whether our images taken on 
different dates are coregistered well, i.e. the pixels are in the same place.
Hence, we select three bands that correspond to different acquisition dates
but have the same wavelength.
'''

# plot the bands acquired on the first, second and third acquisition date
#    as red, green and blue on screen
# The number of timesteps is the number of band files divided by the number of bands per image acquisition
ntimes = round(len(files_selected)/len(bandnames))
if ntimes == 1:
  timesteps=[3, 2, 1] # if only one timestep is used, then use three bands from that image
if ntimes == 2:
  timesteps=[1, 1 + len(bandnames), 1 + len(bandnames)] # if only two timesteps are
      # used, then use red as timestep one and green and blue as timestep two 
if ntimes > 2:
  timesteps = [1, 1 + len(bandnames), 1 + len(bandnames) * 2] # else, display the 
      # first three timesteps

print("Image acquisition timesteps found: "+str(ntimes))

pygge.easy_plot(warpfile, ax=ax, bands = timesteps, percentiles=[0,98])

You will notice that the colour scheme looks really odd and a bit psychedelic. This is because we display bands of the same wavelength but acquired on a different date as red, green and blue channels on screen. Check out the last block of code to see how this is done.
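
To illustrate the idea, the sketch below builds such a multi-temporal composite from a synthetic 12-band array with NumPy. It assumes a linear stretch between the 0th and 98th percentiles, mirroring the `percentiles=[0,98]` argument used above; how `pygge.easy_plot` works internally is not shown in this notebook, so `stretch` and `multitemporal_composite` are hypothetical stand-ins.

```python
import numpy as np

def stretch(band, lower=0, upper=98):
    """Linearly rescale a 2-D band to the range 0..1 between two percentiles."""
    lo, hi = np.percentile(band, [lower, upper])
    if hi == lo:
        return np.zeros(band.shape, dtype="float32")
    return np.clip((band.astype("float32") - lo) / (hi - lo), 0, 1)

def multitemporal_composite(stack, band_numbers):
    """Build a rows x cols x 3 RGB array from three 1-based band numbers."""
    return np.dstack([stretch(stack[b - 1]) for b in band_numbers])

# Synthetic stack: 3 acquisition dates x 4 bands = 12 bands of 50 x 60 pixels
rng = np.random.default_rng(0)
stack = rng.integers(0, 5000, size=(12, 50, 60)).astype("uint16")

# The first band of each date (bands 1, 5 and 9) becomes red, green and blue
rgb = multitemporal_composite(stack, [1, 5, 9])
print(rgb.shape)
```

Pixels that look grey in such a composite have similar reflectance on all three dates, while strongly coloured pixels changed between the acquisitions.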

# Training a random forest model with QGIS and SciKit-Learn

Now we will train a random forest model with all available input bands in the warped and stacked raster - there will be all selected bands from all available acquisition dates. For example, if we selected 4 bands from each downloaded image, and we downloaded 3 Sentinel-2 images acquired at different dates, then that makes 3*4=12 bands in the stacked raster file.

The land cover classification scheme uses the following classes, which I defined when I collected the training vectors in QGIS.

LandCover:

1 = Water

2 = Residential

3 = Industrial

4 = Pasture

5 = Crops

6 = Bare soil

7 = Forest


# Read in the shapefile with the training polygons

In pygge, there is a function that reads in the shapefile with the training polygons collected in QGIS and rasterises them to the same extent as the image file that we will classify.


In [None]:
pygge.training_shapefile_to_raster(trainshapefile, warpfile, trainraster)
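
Once the training polygons are available as a raster, the training samples are the pixels that carry a class number. The sketch below shows one way to extract them with NumPy. It assumes that unlabelled pixels have the value 0, which fits the class scheme starting at 1 but is not confirmed for pygge; `extract_training_samples` is a hypothetical name.

```python
import numpy as np

def extract_training_samples(stack, labels, nodata=0):
    """Return (X, y) for all labelled pixels.

    stack:  array of shape (bands, rows, cols)
    labels: array of shape (rows, cols), equal to `nodata` where unlabelled
    """
    mask = labels != nodata
    X = stack[:, mask].T  # shape (n_samples, n_bands)
    y = labels[mask]      # shape (n_samples,)
    return X, y

# Tiny example: 2 bands, 3 x 4 pixels, two labelled pixels
stack = np.arange(2 * 3 * 4).reshape(2, 3, 4)
labels = np.zeros((3, 4), dtype="uint8")
labels[0, 1] = 4  # Pasture
labels[2, 3] = 7  # Forest

X, y = extract_training_samples(stack, labels)
print(X)  # one row per labelled pixel, one column per band
print(y)
```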

# Train the random forest model

Now we will train the model with the rasterised training layer and the warped and stacked raster file of Sentinel-2 data. We will choose 61 decision trees in the random forest model, but this can be modified. The script saves the model to a pickle file, so we can re-use it later.

In [None]:
# the name of our model file we want to save
modelfile = join(wd,"model.pkl")

print("Input files to the random forest model training:")
print(warpfile)
print(trainraster)
print(modelfile)

# call the training function
model = pygge.train_rf_model(warpfile, trainraster, modelfile=modelfile, ntrees=61)

So far, we have fitted the random forest classification model, assessed which Sentinel-2 bands contribute most to the classification, and looked at how the number of decision trees in the random forest influences the out-of-bag (OOB) error rate. The OOB error shows whether the selected number of trees was too low: if the error still decreases a lot when more trees are added, the forest needs more trees.
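
These diagnostics can be reproduced directly with SciKit-Learn. The following is a minimal sketch on synthetic data, not the pygge implementation: it fits forests of different sizes, records the OOB error of each and ranks the bands by importance. The sample data and the tree counts 21 and 41 are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: 300 pixels, 12 bands, 3 classes.
# Only the first band carries class information.
rng = np.random.default_rng(42)
y = rng.integers(1, 4, size=300)
X = rng.normal(size=(300, 12))
X[:, 0] += y * 3.0

# OOB error for an increasing number of trees
oob_errors = {}
for ntrees in (21, 41, 61):
    model = RandomForestClassifier(n_estimators=ntrees, oob_score=True, random_state=0)
    model.fit(X, y)
    oob_errors[ntrees] = 1.0 - model.oob_score_
    print(ntrees, "trees, OOB error:", round(oob_errors[ntrees], 3))

# Rank the bands of the last model (61 trees) by importance, most important first
ranking = np.argsort(model.feature_importances_)[::-1]
print("Most important band (1-based):", ranking[0] + 1)
```

If the OOB error is already flat between the smaller and the larger forests, adding more trees is unlikely to improve the classification.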


# Run the classification

The next step is to classify the Sentinel-2 image. Following the same approach as above, we use a function from the pygge module to do the classification.

In [None]:
# call our classification function
pygge.classify_rf(warpfile, modelfile, outfile)
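
Conceptually, classifying an image means turning the raster into a table with one row per pixel, predicting a class for each row and reshaping the result back into a map. The sketch below shows this with SciKit-Learn on synthetic data. It is an assumption about the general technique, not the code of `pygge.classify_rf`, and `classify_stack` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_stack(model, stack):
    """Predict a class for every pixel of a (bands, rows, cols) array."""
    bands, rows, cols = stack.shape
    pixels = stack.reshape(bands, rows * cols).T  # shape (n_pixels, n_bands)
    return model.predict(pixels).reshape(rows, cols).astype("uint8")

# Train a small model on two well-separated synthetic classes
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(10, 1, size=(50, 2))])
y_train = np.array([1] * 50 + [2] * 50)
model = RandomForestClassifier(n_estimators=61, random_state=0).fit(X_train, y_train)

# Synthetic 2-band image: left half resembles class 1, right half class 2
stack = np.zeros((2, 4, 6), dtype="float32")
stack[:, :, 3:] = 10

classified = classify_stack(model, stack)
print(classified)
```

For large images, the prediction is often run block by block to limit memory use.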

# Visualise the classified image
We need to check the results to make sure it worked. Let's take a look.

What we do differently here is that we visualise a classified raster dataset. This means we can define our own class labels (that match our training data labels) and associate them with a colour scheme.


In [None]:
# inspired by https://www.earthdatascience.org/courses/use-data-open-source-python/intro-raster-data-python/raster-data-processing/classify-plot-raster-data-in-python/

# Create a list of labels to use for your legend
class_labels = ["Water", 
                "Residential", 
                "Industrial", 
                "Pasture", 
                "Crops", 
                "Bare soil", 
                "Forest"]

# Create a colormap from a list of colours
# see this chart for information on available colours:
# https://matplotlib.org/2.0.0/examples/color/named_colors.html 
colours = ['mediumblue', 
          'firebrick', 
          'red', 
          'yellowgreen', 
          'gold', 
          'saddlebrown', 
          'green']


# Plot our classified land cover raster
fig, ax = plt.subplots(figsize=(20, 10))
fig.patch.set_facecolor('white')

pygge.easy_plot_landcovermap(outfile, ax, shapefile=None, band=1, xlim=None, ylim=None,
                           labels=class_labels, colours=colours, verbose=False)
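
Linking class numbers to fixed colours can also be done directly in Matplotlib. The sketch below uses `ListedColormap` and `BoundaryNorm` on a synthetic class raster. It shows the general technique and is not the implementation of `pygge.easy_plot_landcovermap`.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import BoundaryNorm, ListedColormap
from matplotlib.patches import Patch

class_labels = ["Water", "Residential", "Industrial", "Pasture",
                "Crops", "Bare soil", "Forest"]
colours = ["mediumblue", "firebrick", "red", "yellowgreen",
           "gold", "saddlebrown", "green"]

# One colour per class; boundaries at 0.5, 1.5, ..., 7.5 separate classes 1 to 7
cmap = ListedColormap(colours)
norm = BoundaryNorm(np.arange(0.5, len(colours) + 1), cmap.N)

# Synthetic classified raster with class numbers 1 to 7
rng = np.random.default_rng(0)
classes = rng.integers(1, 8, size=(40, 60))

fig, ax = plt.subplots(figsize=(10, 5))
ax.imshow(classes, cmap=cmap, norm=norm, interpolation="nearest")

# Build the legend from coloured patches, one per class
handles = [Patch(facecolor=c, label=l) for c, l in zip(colours, class_labels)]
ax.legend(handles=handles, loc="center left", bbox_to_anchor=(1.02, 0.5))
```

Using `interpolation="nearest"` avoids blending neighbouring class numbers into values that do not correspond to any class.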

We have produced a new land cover map from multi-temporal imagery. We could now evaluate which map is more accurate using accuracy assessment techniques.
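
One common accuracy assessment technique is a confusion matrix computed from independent validation pixels, from which the overall accuracy follows. The sketch below uses made-up reference and predicted class numbers purely for illustration; no validation data are part of this notebook.

```python
import numpy as np

def confusion_matrix(reference, predicted, n_classes):
    """Rows = reference class, columns = predicted class (classes numbered 1..n)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for r, p in zip(reference, predicted):
        cm[r - 1, p - 1] += 1
    return cm

# Hypothetical validation pixels: reference class and class on the map
reference = np.array([1, 1, 4, 4, 5, 5, 7, 7])
predicted = np.array([1, 1, 4, 5, 5, 5, 7, 4])

cm = confusion_matrix(reference, predicted, n_classes=7)
overall_accuracy = np.trace(cm) / cm.sum()
print(cm)
print("Overall accuracy:", overall_accuracy)  # 0.75
```

The off-diagonal entries show which classes are confused with each other, here Pasture with Crops and Forest with Pasture.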

The classification methodology and the workflow in these practicals can be a basis for your own satellite image processing application.

# Formative assignment of the week

Your assignment for this week is to:
1. Use QGIS to collect training data in a vector layer with polygons. See the 
tutorial on Blackboard / Panopto for how to do this.
2. Download 3-10 Sentinel-2 images for your test area from different acquisition dates.
3. Train a random forest model.
4. Apply the model to the Sentinel-2 data.