# Sentinel-2 Cloudless Mosaic

This tutorial constructs a *cloudless mosaic* (also known as a composite) from a time series of [Sentinel-2 Level-2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) images and is modified from the example notebook provided by Microsoft. This notebook performs the following steps:

* Find a time series of images within a bounding box
* Stack those images together into a single array
* Mask clouds and cloud shadows
* Synthesize a panchromatic band by averaging the Red, Green, Blue, and NIR bands
* Compute the cloudless mosaic by taking a median
* Save the result to a GeoTIFF

This notebook is designed for processing large areas, so tasks such as plotting, which are useful but resource-intensive, are omitted.

## Setup

In [57]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import xarray as xr
import pandas as pd

import rasterio.features
import rioxarray
import stackstac
import pystac_client
import planetary_computer

import pyproj
from shapely.ops import transform
from shapely.geometry import Polygon

import xrspatial.multispectral as ms

import dask
from dask_gateway import GatewayCluster
from dask import visualize

import itertools
from datetime import datetime
from tqdm.notebook import tqdm

import geopandas as gpd
from pathlib import Path

## Create a Dask cluster

We're going to process a large amount of data. To cut down on the execution time, we'll use a Dask cluster to do the computation in parallel, adaptively scaling to add and remove workers as needed. See [Scale With Dask](../quickstarts/scale-with-dask.ipynb) for more on using Dask.

In [2]:
# Set up the cluster
cluster = GatewayCluster()  # Creates the Dask Scheduler. Might take a minute.
client = cluster.get_client()
cluster.adapt(minimum=4, maximum=32)

In [3]:
print(cluster.dashboard_link)

https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.745dee1dd9114fc4886b880402056a77/status


## Discover data

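In this step we define our bounding box. The cell below reads a polygon from a GeoJSON file with GeoPandas and takes its total bounds as `(minx, miny, maxx, maxy)`. The polygon is built from coordinate pairs in **longitude and latitude** (EPSG:4326, decimal degrees), which is what the STAC search expects; EPSG:3857 is the projected Web Mercator system and is measured in meters. A simple way of getting the coordinate pairs is to draw a bounding box in Google Earth, save it as a KML file, then open that file as text and copy the coordinates.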

At this point, you'll have to decide whether to process multiple years at once or each year separately. This decision comes down to how much compute power you have access to. For the Dask cluster configured above, `cluster.adapt(minimum=4, maximum=32)`, the number of images used should stay below 200. You can change the number of images by adjusting `date_range`, `max_cloud_image`, `max_cloud_bbox`, and the bounding box.

In [4]:
#proroa: R129, T60GUA
#andrew: R129, T60GVA

# setting options
date_range = '2016-01-01/2022-01-01'
frame = 'T60GVA'
orbit = 'R129'
max_cloud_image = 33
max_cloud_bbox = 5

# get bounding box (minx, miny, maxx, maxy)
gdf = gpd.read_file('andrew.geojson')
bbox = tuple(gdf.total_bounds)
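
`total_bounds` gives the extent in `(minx, miny, maxx, maxy)` order. As a minimal sketch of the same reduction in plain Python, the example below computes a bounding box from a ring of longitude/latitude pairs. The coordinates are hypothetical placeholders, not the study area, and `bounds` is an illustrative helper rather than part of any library.

```python
# Hypothetical polygon ring as (longitude, latitude) pairs in decimal degrees.
ring = [
    (174.10, -41.30),
    (174.40, -41.30),
    (174.40, -41.05),
    (174.10, -41.05),
    (174.10, -41.30),  # closing vertex repeats the first
]


def bounds(coords):
    """Return (minx, miny, maxx, maxy) for an iterable of (x, y) pairs."""
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


bbox_example = bounds(ring)
print(bbox_example)
```

Longitude is the x value and latitude is the y value, so the tuple reads west, south, east, north.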

Using `pystac_client` we can search the Planetary Computer's STAC endpoint for items matching our query parameters.

In [5]:
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bbox,
    datetime=date_range,
    collections=["sentinel-2-l2a"],
    limit=500,  # fetch items in batches of 500
    query={"eo:cloud_cover": {"gte":0,"lte": max_cloud_image}},
)

# Fetch all matching items; filtering by orbit and tile happens in the next cell
items = list(search.get_items())

Now we restrict the results to the relative orbit and tile set in `orbit` and `frame`.

In [6]:
# grab ids
ids = np.array([x.id.split('_') for x in items])

# get correct orbit
ids = ids[ids[:,3] == orbit]

# get correct frame
ids = ids[ids[:,4] == frame]

# grab valid ids
valid_ids = ['_'.join(x) for x in ids]

# subset items
items = [x for x in items if x.id in valid_ids]

print(f'Number of images before cloud masking is {len(items)}')

Number of images before cloud masking is 117
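
The filter above relies on the item id splitting on underscores so that field 3 is the relative orbit and field 4 is the tile. A minimal sketch of that logic in plain Python follows; the ids are hypothetical examples written to follow the same underscore-separated pattern, and `matches` is an illustrative helper.

```python
# Hypothetical item ids with the layout assumed by the cell above:
# <platform>_<product>_<sensing time>_<relative orbit>_<tile>_<processing time>
example_ids = [
    "S2A_MSIL2A_20200101T221601_R129_T60GVA_20200102T010203",
    "S2B_MSIL2A_20200106T221559_R129_T60GUA_20200107T010203",
    "S2A_MSIL2A_20200111T222601_R029_T60GVA_20200112T010203",
]


def matches(item_id, orbit, frame):
    """True when the id carries the requested relative orbit and tile."""
    parts = item_id.split("_")
    return parts[3] == orbit and parts[4] == frame


selected = [i for i in example_ids if matches(i, "R129", "T60GVA")]
print(selected)
```

Only the first id passes: the second has a different tile and the third a different orbit.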


Depending on the date range, this should return about 100-150 images that match our study area, time window, and cloud-cover limit. Those items will still have *some* clouds over portions of the scenes, though. To create our cloudless mosaic, we'll load the data into an [xarray](https://xarray.pydata.org/en/stable/) DataArray using [stackstac](https://stackstac.readthedocs.io/) and then reduce the time series of images down to a single image.

In [7]:
signed_items = []
for item in items:
    item.clear_links()
    signed_items.append(planetary_computer.sign(item).to_dict())

## Load Data

In this step we load the data and perform some initial cleaning that includes:
* subsetting to our exact bounding box
* removing pixels that correspond to clouds and cloud shadows

To perform our cloud masking, we use Sentinel-2's Scene Classification Layer ([SCL](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)). The cloud-filtering cells below flag classes 3 (cloud shadow), 8 (cloud, medium probability), and 9 (cloud, high probability). Class 10 (thin cirrus) is also cloud-related but is not part of the `isin([3,8,9])` test as written.
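
As a minimal sketch of SCL-based masking, the example below applies a set of invalid class codes to a tiny hypothetical grid. The grids, `mask_pixels`, and `valid_fraction` are illustrative assumptions; the set of codes is passed in as a parameter so that thin cirrus (10) can be included or left out.

```python
import math

# Cloud-related SCL class codes and their meanings.
CLOUD_CLASSES = {
    3: "cloud shadow",
    8: "cloud, medium probability",
    9: "cloud, high probability",
    10: "thin cirrus",
}

# Hypothetical 3x3 SCL grid and matching NIR values.
scl = [
    [4, 4, 8],
    [5, 3, 9],
    [6, 10, 4],
]
nir = [
    [0.31, 0.29, 0.80],
    [0.25, 0.05, 0.85],
    [0.02, 0.40, 0.33],
]


def mask_pixels(values, classes, invalid):
    """Replace values whose SCL class is in `invalid` with NaN."""
    nan = float("nan")
    return [
        [nan if c in invalid else v for v, c in zip(vrow, crow)]
        for vrow, crow in zip(values, classes)
    ]


def valid_fraction(classes, invalid):
    """Fraction of pixels whose SCL class is not in `invalid`."""
    flat = [c for row in classes for c in row]
    return sum(c not in invalid for c in flat) / len(flat)


masked = mask_pixels(nir, scl, invalid={3, 8, 9})
print(valid_fraction(scl, {3, 8, 9}))
print(valid_fraction(scl, {3, 8, 9, 10}))
```

Adding class 10 to the invalid set lowers the valid fraction, because the one thin-cirrus pixel is then excluded as well.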

In [33]:
data = (
    stackstac.stack(
        signed_items,
        assets=["B08","SCL"],
        chunksize=4096,
        resolution=10
    )
    .where(lambda x: x > 0, other=np.nan)  # sentinel-2 uses 0 as nodata
)

# Get bounding box in projection of data
minx, miny, maxx, maxy = tuple(gdf.to_crs(data.crs).total_bounds)

# Subset data to the bounding box (y is sliced from maxy down to miny)
data = data.sel(x=slice(minx, maxx), y=slice(maxy,miny))

## Cloud filtering

In [18]:
first = data.groupby('time').first(skipna=False)
valid = xr.where(first.sel(band='SCL',drop=True).isin([3,8,9]),x=0,y=1)

In [19]:
pct_valid = valid.sum(dim=['x','y']).compute().to_numpy() / (data.shape[2] * data.shape[3])
pct_valid = pct_valid.squeeze()
dates = valid.time.to_numpy()

In [20]:
clouds = pd.DataFrame({'date':dates,'pct_valid':pct_valid[:]})

In [21]:
best = clouds.loc[clouds.pct_valid > (100-max_cloud_bbox)/100].copy()

In [22]:
best['year'] = best.date.dt.year
counts = best.groupby('year').count().reset_index()

In [23]:
for i,row in counts.iterrows():
    print(f'Year {row["year"]} contains {row["date"]} images')

Year 2016 contains 6 images
Year 2017 contains 15 images
Year 2018 contains 11 images
Year 2019 contains 13 images
Year 2020 contains 15 images
Year 2021 contains 12 images
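
The cells above compute, for each acquisition, the fraction of pixels in the bounding box whose SCL class is not 3, 8, or 9, keep the acquisitions where that fraction exceeds `(100 - max_cloud_bbox) / 100`, and count the survivors per year. A minimal sketch of the thresholding and counting steps in plain Python, using hypothetical dates and fractions:

```python
from collections import Counter
from datetime import date

max_cloud_bbox = 5  # maximum percentage of invalid pixels allowed in the box
threshold = (100 - max_cloud_bbox) / 100

# Hypothetical (acquisition date, valid fraction) pairs.
scenes = [
    (date(2016, 11, 22), 0.99),
    (date(2016, 12, 2), 0.42),
    (date(2017, 1, 11), 0.97),
    (date(2017, 3, 2), 0.95),  # exactly at the threshold, dropped by the strict ">"
    (date(2017, 11, 22), 1.00),
]

# Keep scenes that are clear enough, then count them per year.
best_scenes = [(d, f) for d, f in scenes if f > threshold]
per_year = Counter(d.year for d, _ in best_scenes)

for year in sorted(per_year):
    print(f"Year {year} contains {per_year[year]} images")
```

Because the comparison is strict, a scene with exactly 95% valid pixels is excluded when `max_cloud_bbox` is 5.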


In [34]:
best_dates = data.sel(band=["B08"],time=list(best.date)).squeeze()

## Max NCC summarizing

In [47]:
date1,date2 = zip(*[sorted(x) for x in itertools.combinations(best_dates.time.to_numpy(),2)])
df_ncc = pd.DataFrame({'date1':date1,'date2':date2})
df_ncc['date_diff'] = (df_ncc.date2 - df_ncc.date1).dt.days
df_ncc = df_ncc.loc[(df_ncc.date_diff >= 345) & (df_ncc.date_diff <= 380)].copy()

df_ncc.shape

(104, 3)
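
The cell above builds every pair of dates and keeps only pairs separated by 345 to 380 days, that is, images taken roughly one year apart. A minimal sketch of the same pairing with the standard library, using hypothetical acquisition dates:

```python
from datetime import datetime
from itertools import combinations

# Hypothetical acquisition dates.
acquisitions = [
    datetime(2016, 11, 22),
    datetime(2017, 3, 2),
    datetime(2017, 11, 22),
    datetime(2018, 10, 23),
    datetime(2018, 11, 27),
]

pairs = []
for earlier, later in combinations(sorted(acquisitions), 2):
    gap = (later - earlier).days
    if 345 <= gap <= 380:  # roughly one year apart
        pairs.append((earlier, later, gap))

for earlier, later, gap in pairs:
    print(earlier.date(), later.date(), gap)
```

Sorting before pairing guarantees that the first element of each pair is the earlier date, so the gap is never negative.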

In [48]:
# keep only the dimension coordinates; drop_vars replaces the deprecated drop
for_ncc = best_dates.drop_vars(list(set(best_dates.coords) - set(best_dates.dims)))

In [50]:
lazy_output = []

for i,row in df_ncc.iterrows():
    data1 = for_ncc.sel(time=row['date1'],drop=True)
    data2 = for_ncc.sel(time=row['date2'],drop=True)
    lazy_output.append([row['date1'],row['date2'],xr.corr(data1,data2)])
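
`xr.corr` called without a `dim` argument reduces over all dimensions, so each pair of images yields a single Pearson correlation coefficient, used here as the normalized cross-correlation (NCC). A minimal plain-Python sketch of that quantity on flattened hypothetical images follows; it assumes pixel pairs where either value is NaN are skipped, and `ncc` is an illustrative helper.

```python
import math


def ncc(a, b):
    """Pearson correlation of two equal-length sequences, skipping NaN pairs."""
    pairs = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical flattened images; NaN marks a nodata pixel.
img1 = [1.0, 2.0, 3.0, float("nan"), 5.0]
img2 = [2.0, 4.0, 6.0, 8.0, 10.0]  # img1 scaled by 2
img3 = [5.0, 4.0, 3.0, 2.0, 1.0]  # img1 reversed in brightness

print(ncc(img1, img2))
print(ncc(img1, img3))
```

Values near 1 mean the two images share the same spatial pattern regardless of overall brightness or contrast, while values near -1 mean the pattern is inverted.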

In [52]:
%%time

df_ncc['ncc'] = np.nan  # float placeholder, filled in below
for date1,date2,arr in tqdm(lazy_output):
    df_ncc.loc[(df_ncc.date1==date1)&(df_ncc.date2==date2),'ncc'] = arr.values.item()

CPU times: user 1min 3s, sys: 16.8 s, total: 1min 20s
Wall time: 1min 48s


In [53]:
dates = list(set(df_ncc.date1) | set(df_ncc.date2))
dates.sort()

box_ncc = pd.DataFrame(index=dates,columns=dates,dtype=np.float64)
for i,row in df_ncc.iterrows():
    date1 = row['date1']
    date2 = row['date2']
    box_ncc.loc[box_ncc.index==date1,date2] = row['ncc']
    box_ncc.loc[box_ncc.index==date2,date1] = row['ncc']

In [54]:
date_mean = box_ncc.mean(axis=0).reset_index().rename(columns={'index':'date',0:'ncc'})
date_mean['year'] = date_mean.date.dt.year
max_ncc = date_mean.groupby('year').max()
max_ncc

| year | date | ncc |
|---|---|---|
| 2016 | 2016-11-22 22:16:02.026 | 0.865657 |
| 2017 | 2017-11-22 22:15:49.027 | 0.884074 |
| 2018 | 2018-10-23 22:16:01.024 | 0.86448 |
| 2019 | 2019-11-02 22:16:09.024 | 0.900378 |
| 2020 | 2020-11-11 22:16:11.024 | 0.911931 |
| 2021 | 2021-09-02 22:15:59.024 | 0.909412 |
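
Note that `groupby('year').max()` reduces each column independently: the `date` column holds the latest date in each year and the `ncc` column holds the highest mean NCC in that year, so the two values need not come from the same row. A minimal sketch of selecting, for each year, the date whose mean NCC is highest is shown below in plain Python with hypothetical values. In pandas, the equivalent row selection would be `date_mean.loc[date_mean.groupby('year')['ncc'].idxmax()]`.

```python
from datetime import date

# Hypothetical (date, mean NCC) rows.
rows = [
    (date(2016, 3, 7), 0.91),
    (date(2016, 11, 22), 0.87),
    (date(2017, 1, 11), 0.80),
    (date(2017, 11, 22), 0.88),
]

# Column-wise maximum per year: date and score are reduced separately.
columnwise = {}
for d, score in rows:
    latest, top = columnwise.get(d.year, (d, score))
    columnwise[d.year] = (max(latest, d), max(top, score))

# Row-wise selection: keep the whole row with the highest score per year.
best_per_year = {}
for d, score in rows:
    current = best_per_year.get(d.year)
    if current is None or score > current[1]:
        best_per_year[d.year] = (d, score)

print(columnwise[2016])
print(best_per_year[2016])
```

For 2016 the column-wise result pairs the latest date with a score that belongs to a different date, while the row-wise result keeps the date and its own score together.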


In [63]:
for t in max_ncc.date:
    time_name = datetime.strftime(t,'%Y%m%d')
    name = f's2_l2_{orbit}_{frame}_{time_name}.tif'
    print(f'Writing {name}...')
    best_dates.sel(time=t).rio.to_raster(name)

Writing s2_l2_R129_T60GVA_20161122.tif...
Writing s2_l2_R129_T60GVA_20171122.tif...
Writing s2_l2_R129_T60GVA_20181023.tif...
Writing s2_l2_R129_T60GVA_20191102.tif...
Writing s2_l2_R129_T60GVA_20201111.tif...
Writing s2_l2_R129_T60GVA_20210902.tif...


## Median summarizing

In [26]:
median = best_dates.groupby('time.year').median().fillna(0).persist()

In [27]:
median = median.rio.write_crs(pyproj.CRS(data.crs).to_string()).drop('proj:bbox')
median = median.rename('Median_NIR')
print('Median image processing complete')

Median image processing complete
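
The median step groups the images by year and takes the median of each pixel across that year's acquisitions. A minimal sketch in plain Python on hypothetical flattened images follows; it assumes NaN values are skipped and that pixels with no valid observation are set to 0, mirroring `.fillna(0)`. `pixel_median` is an illustrative helper.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical (year, flattened image) acquisitions; NaN marks nodata.
acquisitions = [
    (2016, [0.2, float("nan"), 0.5]),
    (2016, [0.4, float("nan"), 0.1]),
    (2016, [0.9, float("nan"), 0.3]),
    (2017, [0.6, 0.7, float("nan")]),
    (2017, [0.2, 0.3, 0.8]),
]

# Group the images by year.
by_year = defaultdict(list)
for year, pixels in acquisitions:
    by_year[year].append(pixels)


def pixel_median(stack):
    """Median of each pixel across a list of images, skipping NaN values."""
    out = []
    for values in zip(*stack):
        valid = [v for v in values if not math.isnan(v)]
        out.append(median(valid) if valid else 0.0)  # 0.0 when nothing is valid
    return out


yearly = {year: pixel_median(stack) for year, stack in by_year.items()}
print(yearly)
```

The median is less sensitive than the mean to a single bright outlier, such as the 0.9 value in the first pixel of 2016.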


In [30]:
for y in median.year.to_numpy():
    name = f's2_l2_{orbit}_{frame}_{y}0601.tif'
    print(f'Writing {name}...')
    median.sel(year=y).rio.to_raster(name)

Writing s2_l2_R129_T60GVA_20160601.tif...
Writing s2_l2_R129_T60GVA_20170601.tif...
Writing s2_l2_R129_T60GVA_20180601.tif...
Writing s2_l2_R129_T60GVA_20190601.tif...
Writing s2_l2_R129_T60GVA_20200601.tif...
Writing s2_l2_R129_T60GVA_20210601.tif...


In [31]:
print('Completed successfully!')

Completed successfully!


## Close cluster
Once we're done with our processing, let's be good stewards of our resources and close our cluster.

In [64]:
cluster.close()

## Download Data
And you're done! The completed GeoTIFF files should be in the same directory as this notebook and can be downloaded via Jupyter's GUI.