# Week 6: Sentinel-2 time-series analysis

Individual learning outcomes: At the end of this week, all students should be able to access Sentinel-2 time series data of a region of interest from Google Earth Engine via the Python API and visualise the results.

# Sentinel-2 time-series

Workflow for this practical:
* Define an area of interest based on an ESRI shapefile
* Define a time window for our data search
* Set a maximum acceptable cloud cover for our search
* Use Google Earth Engine to download a time series of NDVI values for the pixels in the area of interest
* Visualise the time series and fit a simple model


Connect to our Google Drive from Colab.

In [None]:
# Load the Drive helper and mount your Google Drive as a drive in the virtual machine
from google.colab import drive
drive.mount('/content/drive')

Import required libraries

In [None]:
# install some libraries that are not on Colab by default
!pip install rasterio
!pip install geopandas
!pip install rasterstats
!pip install earthengine-api
!pip install requests
!pip install sentinelsat


# import libraries
import json
import geopandas as gpd
import matplotlib.pyplot as plt
import math
import numpy as np
from osgeo import gdal, ogr
import os
from os import listdir
from os.path import isfile, isdir, join
import pandas as pd
from pprint import pprint
import rasterio
from rasterio import plot
from rasterio.plot import show_hist
from scipy import optimize
import shutil
import sys
import zipfile
import requests
import io
import webbrowser
import ee

# make sure that this path points to the location of the pygge module on your Google Drive
libdir = '/content/drive/MyDrive/practicals21-22' # this is where pygge.py needs to be saved
if libdir not in sys.path:
    sys.path.append(libdir)

# import the pygge module
import pygge

%matplotlib inline

# Authenticate to the Google Earth Engine API.

API stands for 'application programming interface'. An API defines interactions between multiple software intermediaries, in this case between our Jupyter Notebook and Google Earth Engine. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow etc. (text modified after Wikipedia)

In [None]:
# Connect to Google Earth Engine API
# This will open a web page where you enter your account information and receive a code. Paste the code into the prompt shown in the cell output.
!earthengine authenticate

ee.Initialize()

# Set up the directory paths on Google Drive
Modify these string variables to match your data directory structure.

BEFORE YOU RUN THIS CELL, EDIT THE VARIABLE wd BELOW TO POINT TO YOUR DIRECTORY ON GOOGLE DRIVE

IMPORTANT: You must upload a shapefile of your area of interest to your Google Drive before running the next cell. Set the variable 'shapefile' below to point to this file. You can draw a polygon and save it as a shapefile on http://www.geojson.io.

In [None]:
# path to your permanent Google Drive 
# (not so much space but will be kept after the session)
# EDIT THIS LINE (/content/drive/MyDrive is the top directory on Google Drive):
wd = "/content/drive/MyDrive/practicals21-22"
print("Connected to data directory: " + wd)

# path to your temporary drive on the Colab Virtual Machine 
# (more disk space but will be deleted when Colab is closed)
cd = "/content/work"

# directory for downloading data
downloaddir = join(cd, 'download') # where we save the downloaded images

# CAREFUL: This code removes the named directories and everything inside them to free up space
# Note: shutil provides a lot of useful functions for file and directory management
try:
  shutil.rmtree(downloaddir)
except FileNotFoundError:
  # only ignore the case where the directory does not exist yet
  print(downloaddir + " not found.")

# create the new directories, unless they already exist
os.makedirs(cd, exist_ok=True)
os.makedirs(downloaddir, exist_ok=True)

print("Connected to Colab temporary data directory: " + cd)

print("\nList of contents of " + wd)
for f in sorted(os.listdir(wd)):
  print(f)

# Define our search parameters

You can modify some of the parameters and upload your own shapefile.

In [None]:
# EDIT THE SEARCH OPTIONS BELOW

# YOU CAN PLACE A DIFFERENT SHAPEFILE ONTO YOUR GOOGLE DRIVE BUT MAKE SURE THAT
#    THE VARIABLE shapefile POINTS TO THE CORRECT FILE:
shapefile = join(wd, 'oakham', 'Polygons_small.shp') # ESRI Shapefile of the study area

# Define a date range for our search
datefrom = '2021-01-01' # start date for imagery search
dateto   = '2021-08-31' # end date for imagery search
time_range = [datefrom, dateto] # format as a list

# Define which cloud cover we accept in the images
clouds = 10 # maximum acceptable cloud cover in %

Get some information about our shapefile.

In [None]:
# Get the shapefile layer's extent, CRS and EPSG code
extent, outSpatialRef, epsg = pygge.get_shp_extent(shapefile)
print("Extent of the area of interest (shapefile):\n", extent)
print(type(extent))
print("\nCoordinate referencing system (CRS) of the shapefile:\n", outSpatialRef)
print('EPSG code: ', epsg)

Get the extent of the shapefile into a format that Google Earth Engine understands.

Look at the printed outputs of the type conversions. The code will make more sense then.

In [None]:
# GEE needs a special format for defining an area of interest. 
# It has to be a GeoJSON Polygon and the coordinates should be first defined in a list and then converted using ee.Geometry. 
extent_list = list(extent)
print(extent_list)
print(type(extent_list))
# close the list of polygon coordinates by adding the starting node at the end again
# and make list elements in the form of coordinate pairs (x,y), i.e. (longitude, latitude)
area_list = list([(extent[0], extent[2]),(extent[1], extent[2]),(extent[1], extent[3]),(extent[0], extent[3]),(extent[0], extent[2])])
print(area_list)
print(type(area_list))

search_area = ee.Geometry.Polygon(area_list)
print(search_area)
print(type(search_area))
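
The conversion from an extent to a closed polygon ring can be sketched without Earth Engine. This is a minimal illustration that assumes the extent is ordered (xmin, xmax, ymin, ymax), which matches the indexing used in the cell above; whether `pygge.get_shp_extent` always returns this order is an assumption. The helper name `extent_to_ring` and the example coordinates are made up for illustration.

```python
def extent_to_ring(extent):
    """Return a closed ring of (x, y) pairs from (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = extent
    # walk around the rectangle and repeat the first corner to close the ring
    return [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]

# made-up extent in degrees, for illustration only
example_extent = (-0.75, -0.70, 52.65, 52.68)
ring = extent_to_ring(example_extent)
print(ring)
```

A list like `ring` is what gets passed to `ee.Geometry.Polygon`, with longitude (x) first and latitude (y) second in each pair.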

Now we can access the Sentinel-2 collection on Google Earth Engine and run our search. This will return a URL (web link) from which we can download the data.

In [None]:
# Obtain download links for image composites from an image collection on Google Earth Engine
# All products available are detailed on this page https://developers.google.com/earth-engine/datasets/.

# Name of the Sentinel 2 image collection
s2collection = 'COPERNICUS/S2'
print("Image collection: ", s2collection)

# select bands
bands = ['B4', 'B8']
print(bands)

# spatial resolution of the downloaded data
resolution = 320 # in units of metres
print("resolution: ", resolution)
# get the Sentinel-2 image collection within the time range
s2collect = pygge.obtain_image_collection_sentinel(s2collection, time_range, search_area, clouds).select(bands)

# 'region' is derived from the search area, but its format has to be adjusted using the get_region(geom) function
search_region = pygge.get_region(search_area)

# Get the time series data over the area of interest

Now we extract the reflectances of the selected bands for every pixel in the area of interest, for each Sentinel-2 image in the image collection.

In [None]:
# Get the data for all pixels intersecting the area of interest
# .getRegion outputs an array of values for each [pixel, band, image] tuple in an ImageCollection
# The output contains rows of id, lon, lat, time, and all bands for each image that 
# intersects each pixel in the given region.
# Note that the getRegion function is limited to about 1 million values.
s2aoi = s2collect.getRegion(search_area, resolution).getInfo()

# Preview the result
s2aoi[:4]
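
To make the structure of the `getRegion` result easier to picture, here is a small sketch that turns a made-up list in the same layout (a header row followed by one row per pixel and image) into a pandas dataframe. The values and the names `fake_region` and `region_df` are invented for illustration, and this is not necessarily how `pygge.ee_array_to_df` works internally.

```python
import pandas as pd

# Made-up rows in the layout returned by getRegion(...).getInfo():
# a header row, then one row per pixel and image.
fake_region = [
    ['id', 'longitude', 'latitude', 'time', 'B4', 'B8'],
    ['img_a', -0.72, 52.66, 1609459200000, 800, 2400],
    ['img_a', -0.71, 52.66, 1609459200000, 900, None],
    ['img_b', -0.72, 52.66, 1609891200000, 700, 2800],
]

header, rows = fake_region[0], fake_region[1:]
region_df = pd.DataFrame(rows, columns=header)

# 'time' is given in milliseconds since 1970-01-01 (the Unix epoch)
region_df['datetime'] = pd.to_datetime(region_df['time'], unit='ms')
print(region_df)
```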

In [None]:
# We apply a function to get the two time series for each of the downloaded bands into a pandas dataframe
df = pygge.ee_array_to_df(s2aoi,['longitude', 'latitude', 'B4', 'B8'])

# Calculate the vegetation index from B4 and B8 in the pandas data frame
df['ndvi'] =  (df['B8'] - df['B4']) / (df['B8'] + df['B4'])

# drop all rows with NaN values in the NDVI column
df = df.dropna(subset=['ndvi'])

print(df)
print(df.iloc[0])
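
As a worked example of the NDVI formula, NDVI = (B8 - B4) / (B8 + B4), the sketch below uses three made-up pixels. The dataframe `demo` and its reflectance values are hypothetical; they only show how missing or zero reflectances end up as NaN and are then removed.

```python
import pandas as pd

demo = pd.DataFrame({
    'B4': [800.0, 900.0, 0.0],    # red band, made-up values
    'B8': [2400.0, None, 0.0],    # near-infrared band, made-up values
})

demo['ndvi'] = (demo['B8'] - demo['B4']) / (demo['B8'] + demo['B4'])
# row 0: (2400 - 800) / (2400 + 800) = 0.5
# row 1: B8 is missing, so NDVI is NaN
# row 2: 0 / 0, so NDVI is NaN

# keep only the rows where NDVI could be calculated
demo_clean = demo.dropna(subset=['ndvi'])
print(demo_clean)
```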

Let's find out some properties of the data frame and the values within it for the date and ndvi columns.


In [None]:
print(type(df['date']))
print(type(df['date'].iloc[0]))
print(df['date'])

print(type(df['ndvi']))
print(type(df['ndvi'].iloc[0]))
print(df['ndvi'])

We can see from the output that the 'date' series in the dataframe is a pandas.core.series.Series type object. Its individual values are of type str (string).

The 'ndvi' series in the dataframe is of the same type but its individual values are of type float64 (floating point values encoded with 64 bits per value).

Work out the Julian day from the 'date' series. See https://quasar.as.utexas.edu/BillInfo/JulianDatesG.html for how it is done.

In [None]:
# express the date as Y M D, where Y is the year, M is the month number 
#   (Jan = 1, Feb = 2, etc.), and D is the day of the month

jd = [] # empty list to store the results in

# iterate over all time index entries in the 'date' series
for t in range(len(df['date'].iloc[:])): # time index

  # get year, month and date as integer values from the 'date' series
  y,m,d = pygge.split_YYYYMMDD(df['date'].iloc[t])

  # add result to the new list
  jd.append(pygge.julian_date(y,m,d))

# after the time loop has finished, add the Julian date list to the dataframe as a new column
df.insert(0, "JD", jd, True)

print(df.head())
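
For reference, the Julian date calculation described on the page linked above can be sketched as a standalone function. The name `julian_date_sketch` is hypothetical, and `pygge.julian_date` may be implemented differently; this version returns the Julian date at 00:00 UT for a date in the Gregorian calendar.

```python
def julian_date_sketch(year, month, day):
    """Julian date at 00:00 UT for a Gregorian calendar date."""
    # January and February count as months 13 and 14 of the previous year
    if month <= 2:
        year -= 1
        month += 12
    a = year // 100
    b = a // 4
    c = 2 - a + b                       # Gregorian calendar correction
    e = int(365.25 * (year + 4716))     # days contributed by whole years
    f = int(30.6001 * (month + 1))      # days contributed by whole months
    return c + day + e + f - 1524.5

print(julian_date_sketch(2021, 1, 1))
```

Differences between two Julian dates are simply numbers of days, which is why they are convenient as the time axis of a plot or a model fit.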

In [None]:
# Because we are likely to have time series of NDVI over many pixels,
#   we want to plot a time series for each pixel location. These are
#   defined by the latitude and longitude values in the dataframe.
print(df.columns)
print(df['longitude'])
print(df['latitude'])
print(type(df['latitude'].iloc[0]))

# get unique combinations of latitude and longitude for each pixel and store them in the dataframe
pixel_ids = []

# iterate over all time index entries in the 'date' series
for t in range(len(df['date'].iloc[:])): # time index
  lon = df['longitude'].iloc[t]
  lat = df['latitude'].iloc[t]
  pixel_ids.append(str(lon)+'_'+str(lat))

df.insert(1, "lon_lat", pixel_ids, True)

print(df.columns)
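
An alternative to building a text label in a loop is to let pandas number the unique coordinate pairs. The sketch below uses a made-up dataframe `pixels`; the column name `pixel_id` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

# two made-up pixels, each observed on two dates
pixels = pd.DataFrame({
    'longitude': [-0.72, -0.71, -0.72, -0.71],
    'latitude':  [52.66, 52.66, 52.66, 52.66],
    'ndvi':      [0.41, 0.38, 0.55, 0.52],
})

# ngroup() gives every unique (longitude, latitude) pair its own integer
pixels['pixel_id'] = pixels.groupby(['longitude', 'latitude']).ngroup()
print(pixels)
```

Integer identifiers avoid any dependence on how floating point coordinates are converted to text, while the string labels used above are easier to read in printed output.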


In [None]:
# Now that we have our data in a good shape, we can make plots

# split the dataset into groups by the column lon_lat (one group per pixel)
df_grouped = df.groupby(df.lon_lat)

print("Number of pixels: ", df_grouped.ngroups)

# select maximum number of pixels to plot
npixels = 100
if npixels > df_grouped.ngroups:
  npixels = df_grouped.ngroups

# define the model fitting function with parameters
def fit_model(t, a, b):
    return a + b * t

# Set starting parameters for curve fitting
a = 0.
b = 0.1

# Subplots.
fig, ax = plt.subplots(1, figsize=(6, 6))

print("Making a plot with the following pixel locations:")
pixels_for_plot = pd.unique(df['lon_lat'])[0:npixels]
print(pixels_for_plot)

# iterate over each pixel based on its lon_lat location index
for p in pixels_for_plot:

  # get group of NDVI values in the time series for pixel with index p
  this_pixel_series = df_grouped.get_group(p)

  # extract the Julian dates available for that pixel from the dataframe into an array
  x = np.array(this_pixel_series['JD']).astype(float)

  # extract the NDVI values from the dataframe as well
  y = np.array(this_pixel_series['ndvi']).astype(float)

  # Add scatter plots
  #ax.scatter(this_pixel_series['JD'], this_pixel_series['ndvi'],
  #          c='green', alpha=0.2, label='NDVI')
  ax.scatter(x, y, c='green', alpha=0.2, label='NDVI')

  # check that we have enough values in the time series for that pixel
  if len(x) < 5:
    print("pixel " + p + " has only " + str(len(x)) + " time points. Omitted from analysis.")
  else:
    # fit the model to the NDVI time series
    params_u, params_covariance_u = optimize.curve_fit(fit_model, x, y, p0=[a,b])

    # Add fitted curves
    ax.plot(this_pixel_series['JD'],
            fit_model(x, params_u[0], params_u[1]),
            label='fitted model', color='black', lw=1)

# Add some parameters.
ax.set_title('NDVI near Rutland Water', fontsize=16)
ax.set_xlabel('Julian date', fontsize=14)
ax.set_ylabel('NDVI', fontsize=14)
ax.grid(lw=0.2)
#ax.legend(fontsize=14, loc='lower right')

fig.show()
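
The curve fitting step can be tried in isolation on synthetic data. In this sketch the Julian dates and NDVI values are made up, and time is expressed as days since the first observation. That shift is an assumption added here, not something the cell above does; it keeps the numbers small, so the intercept `a` becomes the NDVI on the first date rather than at Julian date zero.

```python
import numpy as np
from scipy import optimize

def linear_model(t, a, b):
    """Straight line: intercept a plus slope b times time t."""
    return a + b * t

# made-up Julian dates, 30 days apart
jd_demo = np.array([2459215.5, 2459245.5, 2459275.5, 2459305.5, 2459335.5])

# days since the first observation
t_demo = jd_demo - jd_demo[0]

# noise-free synthetic NDVI: starts at 0.2 and rises by 0.001 per day
ndvi_demo = 0.2 + 0.001 * t_demo

params, cov = optimize.curve_fit(linear_model, t_demo, ndvi_demo, p0=[0.0, 0.1])
print(params)
```

Because the synthetic data lie exactly on a line, the fitted parameters should reproduce the intercept and slope used to generate them.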

# Formative Assignment for this week
Write a new code cell that annotates the x axis with the date in the notation of "Day Month Year", e.g. "4/12/2021" or "04122021" whilst retaining the plotting position of the points on the x axis as the Julian date. You can do this by extracting a string from the 'date' column of the dataframe, splitting it into individual characters and putting them together again in the right order. Then you can define x axis labels using the following help page: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html

If you can do it, make the time series a bit longer and change the model from a linear model to a second order polynomial.