# Week 7: Accessing the Copernicus Open Access Hub with the SentinelSat library

Individual learning outcomes: At the end of this week, all students should be able to access the Copernicus Open Access Hub via the API, set up and submit a data query and automatically download individual Sentinel-2 images to Google Drive and Colab for further analysis. Students should understand the Sentinel-2 file structure and be able to pre-process the images automatically, including unzipping, reprojecting, and clipping.

The Copernicus Open Access Hub lets you search for single images. However, there are limits on how many and which images you can request from the Long-Term Archive.

https://sentinel.esa.int/web/sentinel/-/activation-of-long-term-archive-lta-access-for-copernicus-sentinel-2-and-3


# Get a user account on the Sentinel Open Access Data Hub

Before we begin, make sure to register for an account in the Copernicus Open Access Hub.

For registration follow the link to the Open Access Hub and register: https://scihub.copernicus.eu/dhus/#/home

In previous weeks, we manually uploaded a Sentinel-2 image to our Google Drive directory. We also used Google Earth Engine to process multi-temporal Sentinel-2 images into image composites for us.

Today, we want to access the Sentinel Data Hub and search for available images over an area of interest of our choice.

Connect to our Google Drive from Colab.

In [None]:
# Load the Drive helper and mount your Google Drive as a drive in the virtual machine
from google.colab import drive
drive.mount('/content/drive')

## Sentinelsat API

We'll be using an API designed by Wille and Clauss (2016) called sentinelsat. This API was designed to query and download Copernicus product imagery from the Copernicus Open Access Hub API.

Follow the link to the API, and see if you can understand how it works: https://sentinelsat.readthedocs.io/en/stable/
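To get a feel for what a sentinelsat search looks like, here is a minimal sketch that only assembles the keyword arguments for `SentinelAPI.query()`. The helper name `build_query_keywords` is made up for this illustration, and the commented-out lines at the end show how the keywords would be passed on once you are logged in. It is an outline under those assumptions, not the way the `pygge` module necessarily does it.

```python
from datetime import datetime


def build_query_keywords(datefrom, dateto, max_cloud):
    """Collect keyword arguments for sentinelsat's SentinelAPI.query().

    build_query_keywords is a hypothetical helper for illustration only.
    """
    # fail early if a date string cannot be read as YYYYMMDD
    for d in (datefrom, dateto):
        datetime.strptime(d, "%Y%m%d")
    if datefrom > dateto:
        raise ValueError("datefrom must not be later than dateto")
    return {
        "date": (datefrom, dateto),
        "platformname": "Sentinel-2",
        "processinglevel": "Level-2A",
        "cloudcoverpercentage": (0, max_cloud),
    }


keywords = build_query_keywords("20210401", "20210930", 10)
print(keywords)

# With a logged-in API object and a footprint in WKT format, the search would be:
#   api = SentinelAPI(username, password, "https://scihub.copernicus.eu/dhus")
#   products = api.query(footprint_wkt, **keywords)
```

The tuple for `cloudcoverpercentage` is a (minimum, maximum) range in percent.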

In [None]:
#import required libraries, including the sentinelsat library this time
!pip install rasterio
!pip install sentinelsat
!pip install geopandas
!pip install rasterstats

import geopandas as gpd
import rasterio
from rasterio import plot
from rasterio.plot import show_hist
from rasterio.windows import Window
import matplotlib.pyplot as plt
import numpy as np
from sentinelsat.sentinel import SentinelAPI, read_geojson, geojson_to_wkt, geojson
from collections import OrderedDict
import json
import math
from osgeo import ogr
import os
from os import listdir
from os.path import isfile, isdir, join
from pyproj import Proj
from pprint import pprint
import shutil
import sys
import zipfile
from math import floor, ceil

# make sure that this path points to the location of the pygge module on your Google Drive
libdir = '/content/drive/MyDrive/practicals21-22' # this is where pygge.py needs to be saved
if libdir not in sys.path:
    sys.path.append(libdir)

# import the pygge module
import pygge

%matplotlib inline

# Accessing Sentinel-2 images

The workflow for this practical is similar to the previous one:
* Define an area of interest based on an ESRI shapefile
* Define a time window for our data search
* Set a maximum acceptable cloud cover for our search
* Search the ESA Copernicus Open Access Hub for all available images
* Select the individual images with the least cloud cover and download them to Google Drive
* Reproject (warp) the TCI images and crop them to our area of interest

Before proceeding, go to Google Drive and create a text file called "sencredentials.txt" with your personal login details for the Copernicus Open Access Hub. Do not share your login details with anyone.

The file has two lines of text.

Line 1: Your username

Line 2: Your password
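A minimal sketch of how such a two-line file can be read safely is shown here. The function name `read_credentials` and the login details are invented for the demonstration, which writes a throwaway file to a temporary directory rather than touching your Google Drive.

```python
import os
import tempfile


def read_credentials(path):
    """Return (username, password) from a two-line text file.

    read_credentials is a hypothetical helper for illustration only.
    """
    # utf-8-sig silently drops a byte-order mark that some editors add
    with open(path, encoding="utf-8-sig") as f:
        # drop empty lines and surrounding whitespace
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) < 2:
        raise ValueError("Expected two lines: username, then password")
    return lines[0], lines[1]


# demonstration with a throwaway file and made-up login details
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "sencredentials.txt")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("my_username\nmy_password\n")
    username, password = read_credentials(demo)

print(username)
```

Raising an error when a line is missing makes a badly formatted file obvious before any login attempt is made.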

# Set the directory paths on Google Drive

In [None]:
# BEFORE YOU RUN THIS BLOCK, YOU NEED A USER ACCOUNT ON THE ESA SENTINEL HUB
# In a browser, go to https://scihub.copernicus.eu/dhus/#/home
# Click on the user symbol in the top right and then on 'sign up'
# Follow the instructions.
# When you have your account, create a plain-text .txt file (in a text editor, not as a Word document) that contains two lines:
#   line 1 - your username
#   line 2 - your password
# save it under the name "sencredentials.txt"
# upload it to the same directory as the Jupyter Notebook on your Google Drive.

# path to your Google Drive
# EDIT THIS LINE (/content/drive/MyDrive is the top directory on Google Drive):
wd = "/content/drive/MyDrive/practicals21-22"
print("Connected to data directory: " + wd)
print("\nList of contents of " + wd)
for f in sorted(os.listdir(wd)):
  print(f)

# path to your temporary drive on the Colab Virtual Machine
cd = "/content/work"
print("Temporary work directory: ", cd)

# This file will allow the notebook to connect to your account on the ESA Data Archive.
credentials = join(wd, 'sencredentials.txt')  # contains two lines of text with username and password

# directory for downloading the Sentinel-2 granules
downloaddir = join(cd, 'download') # where we save the downloaded images
quickdir = join(cd, 'quicklooks')  # where we save the quicklooks
outdir = join(cd, 'out')           # where we save any other outputs

# CAREFUL: This code removes the named directories and everything inside them to free up space
# Note: shutil provides a lot of useful functions for file and directory management
for d in [downloaddir, quickdir, outdir]:
  try:
    shutil.rmtree(d)
  except FileNotFoundError:
    # only ignore the case where the directory does not exist yet
    print(d + " not found.")

# create the new directories, unless they already exist
os.makedirs(cd, exist_ok=True)
os.makedirs(downloaddir, exist_ok=True)
os.makedirs(quickdir, exist_ok=True)
os.makedirs(outdir, exist_ok=True)

# check whether the file with the login details exists
if "sencredentials.txt" not in os.listdir(wd):
  print("\nERROR: File sencredentials.txt not found. Cannot log in to Data Hub.\n")

Set up the search and download options.

Modify these variables to match your data directory structure, date range and acceptable cloud cover.

IMPORTANT: You must upload a shapefile of your area of interest to your Google Drive before running the next cell. Set the variable 'shapefile' below to point to this file. You can draw a polygon and save it as a shapefile on http://www.geojson.io.
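A shapefile is really a set of files with the same name and different extensions, so uploading only the `.shp` file is a common mistake. The sketch below checks for the companion files. The helper name `missing_shapefile_parts` is made up for this example, and the demonstration uses empty dummy files in a temporary directory.

```python
import os
import tempfile

# .shp, .shx and .dbf are the mandatory parts of a shapefile;
# .prj is included here because it stores the map projection
EXPECTED_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def missing_shapefile_parts(shp_path):
    """Return the extensions of companion files that are missing next to shp_path.

    missing_shapefile_parts is a hypothetical helper for illustration only.
    """
    stem = os.path.splitext(shp_path)[0]
    return [ext for ext in EXPECTED_EXTENSIONS if not os.path.isfile(stem + ext)]


# demonstration with empty dummy files: the .prj file is deliberately left out
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".shx", ".dbf"):
        open(os.path.join(tmp, "Polygons_small" + ext), "w").close()
    missing = missing_shapefile_parts(os.path.join(tmp, "Polygons_small.shp"))

print(missing)  # ['.prj']
```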

In [None]:
# EDIT THE SEARCH AND DOWNLOAD OPTIONS BELOW
ndown = 3 # number of scenes to be downloaded (in order of least cloud cover)
shapefile = join(wd, 'oakham', 'Polygons_small.shp') # ESRI Shapefile of the study area

# Define a date range for our search
# WARNING: You need to check whether your images are available for download or
#          have been moved to the Long-Term Archive: 
#          https://scihub.copernicus.eu/userguide/LongTermArchive
datefrom = '20210401' # start date for imagery search
dateto   = '20210930' # end date for imagery search

# Define which cloud cover we accept in the images
clouds = '[0 TO 10]' # range of acceptable cloud cover % for imagery search

# Search for available image files on the ESA data server

We begin by reading in the user name and password we have saved in our text file 'sencredentials.txt'.

In [None]:
# go to working directory
os.chdir(wd)

# load user credentials for Sentinel Data Hub at ESA, i.e. read two lines of text with username and password
# ('credentials' already holds the full path; the file is closed automatically at the end of the with block)
with open(credentials) as f:
    lines = f.readlines()
username = lines[0].strip()
password = lines[1].strip()

# do the search
api, products = pygge.search_Cop_Hub(api='https://scihub.copernicus.eu/dhus', 
                                     username=username, password=password,
                                     shapefile=shapefile, datefrom=datefrom, 
                                     dateto=dateto, platformname='Sentinel-2', 
                                     processinglevel='Level-2A', clouds=clouds)

# convert the ordered dictionary to a dataframe
products_df = api.to_dataframe(products)
print(products_df.head())

Let's look at the query results, save them in a format that can be read into Excel and select the images we want to download.
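The selection logic is simply "sort by cloud cover, break ties by ingestion date, keep the first few". Here is a small sketch of that idea in plain Python, using made-up records instead of a real search result. The function name `select_clearest` is an assumption for illustration; the notebook itself does the same thing with a pandas dataframe.

```python
# made-up search results for illustration only
results = [
    {"uuid": "a", "cloudcoverpercentage": 7.5, "ingestiondate": "2021-06-02"},
    {"uuid": "b", "cloudcoverpercentage": 0.3, "ingestiondate": "2021-07-20"},
    {"uuid": "c", "cloudcoverpercentage": 0.3, "ingestiondate": "2021-04-25"},
    {"uuid": "d", "cloudcoverpercentage": 2.1, "ingestiondate": "2021-09-01"},
]


def select_clearest(records, n):
    """Return the n records with the lowest cloud cover (ties: earliest ingestion).

    select_clearest is a hypothetical helper for illustration only.
    """
    ranked = sorted(
        records, key=lambda r: (r["cloudcoverpercentage"], r["ingestiondate"])
    )
    return ranked[:n]


selected = [r["uuid"] for r in select_clearest(results, 3)]
print(selected)  # ['c', 'b', 'd']
```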

In [None]:
print('Search resulted in '+str(products_df.shape[0])+' satellite images with '+
      str(products_df.shape[1])+' attributes.')

os.chdir(outdir) # set working directory for output files

# sort the search results
products_df_sorted = products_df.sort_values(['cloudcoverpercentage', 'ingestiondate'], ascending=[True, True])
print("Sorted search results:")
print(products_df_sorted)

# save the full search results to a .csv file that you can read into Excel
outfile = 'searchresults_full.csv'
products_df_sorted.to_csv(outfile)
print("Search results saved: " + outfile)

# limit the download to the first 'ndown' images 
#   sorted by lowest cloud cover and earliest ingestion date
products_df_n = products_df_sorted.head(ndown)
print("Download list of selected images:")
print(products_df_n)

# save the list of data to be downloaded to a .csv file that you can read into Excel
outfile = 'searchresults4download.csv'
products_df_n.to_csv(outfile)
print("Download list saved: " + outfile)

# Download the individual Sentinel-2 granules to the Colab work directory
This takes a long time if many images are selected. Each granule covers about 100 x 100 km and contains bands at 10 m, 20 m and 60 m resolution.
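Products that have been moved to the Long-Term Archive cannot be downloaded straight away, so it helps to separate them from the online ones first. The sketch below does this with the `is_online` method that sentinelsat provides. `FakeAPI` and `split_by_availability` are invented names, and `FakeAPI` is only a stand-in so that the example runs without a network connection.

```python
class FakeAPI:
    """Hypothetical stand-in for sentinelsat's SentinelAPI (no network needed)."""

    def __init__(self, online_ids):
        self._online = set(online_ids)

    def is_online(self, uuid):
        # the real method asks the server; here we just look the ID up
        return uuid in self._online


def split_by_availability(api, uuids):
    """Sort product IDs into (online, archived) lists, keeping their order."""
    online, archived = [], []
    for uuid in uuids:
        if api.is_online(uuid):
            online.append(uuid)
        else:
            archived.append(uuid)
    return online, archived


api_demo = FakeAPI(online_ids=["id-1", "id-3"])
online, archived = split_by_availability(api_demo, ["id-1", "id-2", "id-3"])
print("online:", online)
print("archived:", archived)
```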


In [None]:
# Download all selected images into a data directory
os.chdir(downloaddir) # set working directory to download directory

# print the unique image IDs and download the images from the API
for i in products_df_n['uuid']:
  # check which images are in the Long-Term Archive
  is_online = api.is_online(i)
  if is_online:
    print('Product ',i , ' is online. Starting download.')
    api.download(i)
  else:
    print('Product ',i , ' is not online and would have to be retrieved from the Long-Term Archive.')

We can save the footprints of the selected images as a GeoJSON file. First, we extract the images selected for download from the complete ordered dictionary of search results and store them in a new dictionary. Then we write that selection to a GeoJSON file and plot the footprints.
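The filtering step can be sketched on its own with a toy dictionary. The product IDs and titles are made up, and `filter_by_keys` is a hypothetical name used only in this illustration.

```python
from collections import OrderedDict


def filter_by_keys(mapping, keys):
    """Return the items of mapping whose key is in keys, in the original order.

    filter_by_keys is a hypothetical helper for illustration only.
    """
    wanted = set(keys)  # a set makes each membership test fast
    return OrderedDict((k, v) for k, v in mapping.items() if k in wanted)


# made-up product metadata for illustration only
products_demo = OrderedDict(
    [
        ("id-1", {"title": "scene A"}),
        ("id-2", {"title": "scene B"}),
        ("id-3", {"title": "scene C"}),
    ]
)

selection_demo = filter_by_keys(products_demo, ["id-3", "id-1"])
print(list(selection_demo))  # ['id-1', 'id-3']
```

Note that the result follows the order of the original dictionary, not the order of the key list.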

In [None]:
# get the footprints of the selected scenes for use in Excel
s2footprints = products_df_n.footprint
# save them as a .csv file that you can read into Excel
outfile = join(outdir, 'footprints.csv')
s2footprints.to_csv(outfile, header = False)
print("Granule footprints saved as csv: " + outfile)

# define a function that filters a dictionary by several keys
# adapted from https://www.codegrepper.com/code-examples/python/python+select+multiple+keys+from+dict
def dict_filter(x, y):
  wanted = set(y)  # build the lookup set only once
  return {k: v for k, v in x.items() if k in wanted}

# define the keys (unique image IDs) of the downloaded products
keys = list(products_df_n['uuid'])

# filter the ordered dictionary with all our product metadata and save those items 
#   that match the keys in a new ordered dictionary called 'selection'
selection = dict_filter(products, keys)

# print the keys
pprint(keys)

# print the selected metadata
pprint(selection)

# save the footprints of the scenes marked for download together with their metadata in a Geojson file
outfile = join(outdir, 'footprints.geojson')
with open(outfile, 'w') as f:
  json.dump(api.to_geojson(selection), f)
print("Granule footprints saved as GeoJson: " + outfile)

# plot the footprints and some background layers
# NOTE: gpd.datasets was removed in GeoPandas 1.0, so this line needs an older GeoPandas version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# these are the columns in the world geodataframe
print("Columns in world geodataframe:")
print([i for i in world.columns])
# zoom into the country
uk = world.loc[world.name == "United Kingdom"]
print(uk)
# read the GeoJSON file directly from its path, so that no file handle is left open
foot = gpd.read_file(outfile)
base = uk.plot(color='white', edgecolor='black')
foot.plot(ax=base)

# Explore the data directory structure of our downloaded files


In [None]:
# where we stored the text files and csv files
os.chdir(outdir)
print("contents of ", outdir, ":")
!ls -l

# where we stored the downloaded Sentinel-2 images
os.chdir(downloaddir)
print("contents of ", downloaddir, ":")
!ls -l

Remember that we have saved the downloaded images to a temporary directory that will be deleted when we close the virtual machine. If you want to save your images to your local directory, this is how it goes.

Go to your Google Colab folder in the panel on the left-hand side.

Find the download directory and click on a Sentinel-2 image folder.

Right-click on it and select 'download' to save it.

# Iterate over all downloaded images and show the TCI file


The downloaded Sentinel-2 granules (or single images) are zipped. We need to unzip them first. At this stage, we remove the zipped file to free up disk space.

In [None]:
# set working directory to download directory
os.chdir(downloaddir)

# get list of all zip files in the data directory
allfiles = [f for f in listdir(downloaddir) if isfile(join(downloaddir, f))]

# unzip all downloaded Sentinel-2 files
for x in range(len(allfiles)):

  # check whether the file name ends with '.zip'
  if allfiles[x].endswith(".zip"):
    print("Unzipping file ", x+1, ": ", allfiles[x])

    with zipfile.ZipFile(allfiles[x], "r") as zipf:
      # first extract the files
      zipf.extractall(downloaddir)

    # then, once the zip file is closed, remove it to save disk space
    os.remove(join(downloaddir, allfiles[x]))

Before we can create a quick visualisation of the TCI files of all downloaded images, we need to find all the files. They are located in the 20m subdirectories of our downloaded Sentinel-2 directories for each image.
How can we do that? We can iterate over all directories and search for the right files.
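The search can be sketched with `os.walk`, which visits every subdirectory below a starting directory. This is only an illustration of the idea and not necessarily how `pygge.get_filenames` is implemented. The function name `find_files` is invented, and the folder layout in the demonstration is a simplified mock of a Sentinel-2 product.

```python
import os
import tempfile


def find_files(rootdir, filepattern, dirpattern):
    """Return sorted paths of files whose name contains filepattern and
    whose parent directory name contains dirpattern.

    find_files is a hypothetical helper for illustration only.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(rootdir):
        # skip directories that do not match, e.g. R10m or R60m
        if dirpattern not in os.path.basename(dirpath):
            continue
        for name in filenames:
            if filepattern in name:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# demonstration with a simplified mock directory tree and empty dummy files
with tempfile.TemporaryDirectory() as tmp:
    for res in ("R10m", "R20m"):
        d = os.path.join(tmp, "demo.SAFE", "IMG_DATA", res)
        os.makedirs(d)
        for band in ("TCI", "B04"):
            open(os.path.join(d, "demo_" + band + "_" + res[1:] + ".jp2"), "w").close()
    found = [os.path.relpath(p, tmp) for p in find_files(tmp, "TCI", "R20m")]

print(found)
```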

In [None]:
tcifiles = pygge.get_filenames(downloaddir, filepattern="TCI", dirpattern="R20m")

print("\nList of all TCI files:")
pprint(tcifiles)

Now that we know which image files we want to show on screen, the rest is easy, just like last week.

In [None]:
# how many files are in the file list?
nfiles = len(tcifiles)

# arrange our subplots
cols = max(1, min(nfiles, 4)) # maximum of 4 plots in one row (at least 1 to avoid dividing by zero)
rows = max(1, math.ceil(nfiles / cols)) # round up to nearest integer

# create a figure with subplots
fig, ax = plt.subplots(rows, cols, figsize=(21,7))
fig.patch.set_facecolor('white')

# flatten the axes into a 1D array, so that ax[x] works for any number of rows and columns
ax = np.atleast_1d(ax).ravel()

# iterate over all TCI files and show each one to check the image quality on screen
for x in range(nfiles):

  # full path of the TCI file
  tcifile = tcifiles[x]
  print(tcifile)

  # open the True Colour Image (three bands in uint8 data format) from the 20 m resolution directory
  with rasterio.open(tcifile, driver='JP2OpenJPEG') as bandTCI:
    # plot the image using RasterIO
    plot.show(bandTCI, ax=ax[x])

  # set a title for the subplot
  ax[x].set_title(tcifile.split("/")[-1], fontsize=8)

To zoom to our shapefile area, let's warp all TCI images to the same projection as the shapefile. 

Remember we did this before with the Google Earth Engine Sentinel-2 image composites.

In [None]:
# get extent, coordinate referencing system and EPSG code of the map projection from the shapefile
extent, crs, epsg = pygge.get_shp_extent(shapefile)

print("Reprojecting all TCI images to projection with EPSG = ", epsg)

warpfiles = [] # make an empty list where we can remember all the warped output file names

# iterate over all Sentinel-2 image directories and warp the image
for tcifile in tcifiles:

  # make a directory path and file name for the warped output file
  warpfile = join(quickdir, tcifile.split("/")[-1].split(".")[0] + "_warped.jp2")
  warpfiles.append(warpfile) # add it to our list

  # call the easy_warp function
  tmp = pygge.easy_warp(tcifile, warpfile, epsg)

# print the list of new files we have created
print("List of warped files:")
pprint(warpfiles)

# Plot the shapefile on top of the warped TCI files

We will use the Geopandas library for plotting the shapefile on top of the raster image.


In [None]:
# create a figure with subplots
fig, ax = plt.subplots(rows, cols, figsize=(21,7))
fig.patch.set_facecolor('white')

# flatten the axes into a 1D array, so that ax[i] works for one or many subplots
ax = np.atleast_1d(ax).ravel()

# read the shapefile for plotting
shp = gpd.read_file(shapefile)

# iterate over all warped TCI files and show each one with the shapefile boundary on top
for i, warpfile in enumerate(warpfiles):
  pygge.easy_plot(warpfile, ax[i], bands=[1,2,3])
  # plot the shapefile using Geopandas
  shp.boundary.plot(ax=ax[i], edgecolor="yellow")
  # set a title for the subplot
  ax[i].set_title(warpfile.split("/")[-1].split(".")[0], fontsize=8)

# Clip the raster

Let's zoom into our shapefile area and clip the raster files to that area. 
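Clipping to an extent boils down to working out which rows and columns of the raster fall inside the bounding box. The sketch below shows that arithmetic for a north-up raster with square pixels. It assumes the extent is ordered as (xmin, ymin, xmax, ymax), which may differ from the order `pygge.get_shp_extent` returns, and `extent_to_window` is a made-up name; `pygge.easy_clip` may well work differently. All coordinates are invented.

```python
from math import floor, ceil


def extent_to_window(extent, x_origin, y_origin, pixel_size):
    """Convert a map extent into a pixel window (col_off, row_off, width, height).

    extent_to_window is a hypothetical helper for illustration only.
    extent = (xmin, ymin, xmax, ymax) in map units (assumed order);
    x_origin, y_origin = map coordinates of the raster's top-left corner.
    """
    xmin, ymin, xmax, ymax = extent
    # round outwards so that the window fully covers the extent
    col_off = floor((xmin - x_origin) / pixel_size)
    row_off = floor((y_origin - ymax) / pixel_size)  # rows count downwards from the top
    col_end = ceil((xmax - x_origin) / pixel_size)
    row_end = ceil((y_origin - ymin) / pixel_size)
    return col_off, row_off, col_end - col_off, row_end - row_off


# made-up example: 20 m pixels, top-left corner at (600000, 5900000)
window = extent_to_window((600105, 5899595, 600395, 5899905), 600000, 5900000, 20)
print(window)  # (5, 4, 15, 17)
```

The four numbers are in the same order as the arguments of `rasterio.windows.Window`, which was imported at the top of this notebook.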


In [None]:
# make an empty list where we will remember all clipped file names
clipfiles = [] 

# create a figure with subplots
fig, ax = plt.subplots(rows, cols, figsize=(21,7))
fig.patch.set_facecolor('white')

# iterate over all warped Sentinel-2 TCI files and clip each one to the shapefile extent
for warpfile in warpfiles:
  print(warpfile)

  # make the filename of the new clipped image file
  clipfile = os.path.splitext(warpfile)[0] + "_clip.tif"
  clipfiles.append(clipfile) # remember the clipped file name in our list
  print("Clipped file: ", clipfile)

  # clip it to the shapefile extent
  pygge.easy_clip(warpfile, clipfile, extent)

# flatten the axes into a 1D array, so that ax[i] works for one or many subplots
ax = np.atleast_1d(ax).ravel()

# make maps
for i, clipfile in enumerate(clipfiles):
  pygge.easy_plot(clipfile, ax[i], bands=[1,2,3], percentiles=[0,100],
                  shapefile=shapefile, linecolor="yellow", 
                  title = clipfile.split("/")[-1].split(".")[0], fontsize=8)
    
os.chdir(quickdir)
!ls -l

Before we end this session, we want to copy our downloaded Sentinel-2 data from Colab to Google Drive where the data will not be lost when this virtual machine terminates.

You could do it via the Linux command line like this:

```
!cp -r '/content/work/download' '/content/drive/MyDrive/practicals21-22'
```

Or in a more pythonic way using the shutil library. The copytree function copies a directory and all its contents to a destination folder.
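As a side note, from Python 3.8 onwards `shutil.copytree` also accepts `dirs_exist_ok=True`, which copies into an existing target directory instead of raising an error. Unlike deleting the target first, this keeps files that are already there. The sketch below shows the behaviour on throwaway directories with made-up file names.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "download")
    dst = os.path.join(tmp, "drive", "download")
    os.makedirs(src)
    os.makedirs(dst)

    # one file in the source and a different one already in the target
    with open(os.path.join(src, "new.txt"), "w") as f:
        f.write("new")
    with open(os.path.join(dst, "old.txt"), "w") as f:
        f.write("old")

    # merge the source into the existing target directory (Python 3.8+)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    merged = sorted(os.listdir(dst))

print(merged)  # ['new.txt', 'old.txt']
```

Whether merging or overwriting is the better choice depends on whether old downloads in the target folder should survive.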


In [None]:
# Copy the content of the download directory we created in Colab on a temporary drive
#    to the Google Drive partition, where it is not deleted when this session ends

# name of our new directory on Google Drive
# NOTE THAT THIS WILL BE DELETED AND OVERWRITTEN!
target_dir = "/content/drive/MyDrive/practicals21-22/download"

def copy_and_overwrite(from_path, to_path, delete_target_dir=True):
  '''
  Modified from: https://stackoverflow.com/questions/12683834/how-to-copy-directory-recursively-in-python-and-overwrite-all
  '''
  if os.path.exists(to_path):
    if delete_target_dir:
      shutil.rmtree(to_path)
      shutil.copytree(from_path, to_path)
    else:
      print("Error: Target directory exists and delete_target_dir is set to False.")
  else:
    shutil.copytree(from_path, to_path)
  # nothing to return: the function only copies files

copy_and_overwrite(downloaddir, target_dir, delete_target_dir=True)


# Formative assignment of this week

In a new code cell below, download a Sentinel-2 image of the town you were born in. Produce a true colour image (TCI) map of the image, zooming into the area so you can see sufficient detail in the map. You can use the pygge.easy_plot function to do this.