# <u>Measuring sediments in Lake Tempe</u>

**Contents**

  - [Background](#Background)
    - [Notebook overview](#Notebook-overview)
    - [Suspended sediments](#Suspended-sediments)
  - [Notebook setup](#Notebook-setup)
  - [Analysis parameters](#Analysis-parameters)
    - [Spatial and temporal window](#Spatial-and-temporal-window)
    - [Datacube query](#Datacube-query)
  - [Sentinel-2 dataset](#Sentinel-2-dataset)
    - [Data load](#Data-load)
    - [Data clean-up](#Data-clean-up)
  - [Lake boundary](#Lake-boundary)
    - [Loading up the shapefile](#Loading-up-the-shapefile)
    - [Raster mask](#Raster-mask)
    - [Masking the data](#Masking-the-data)
  - [Filtering out land &#40;i.e. non-water&#41; pixels](#Filtering-out-land-&#40;i.e.-non-water&#41;-pixels)
    - [Water index](#Water-index)
    - [Removing non-water pixels](#Removing-non-water-pixels)
  - [TSS analysis](#TSS-analysis)
    - [TSS calculation](#TSS-calculation)
    - [Selected displays](#Selected-displays)
      - [Individual time slices](#Individual-time-slices)
      - [Temporal aggregation &#40;1&#41;](#Temporal-aggregation-&#40;1&#41;)
      - [Temporal aggregation &#40;2&#41;](#Temporal-aggregation-&#40;2&#41;)
    - [Temporal statistics &#40;standard&#41;](#Temporal-statistics-&#40;standard&#41;)
    - [Robust statistics &#40;using Dask&#41;](#Robust-statistics-&#40;using-Dask&#41;)
      - [Approach](#Approach)
      - [Robust functions](#Robust-functions)
      - [Applying parallelised custom functions](#Applying-parallelised-custom-functions)
      - [Discussion](#Discussion)
  - [Extracting TSS data for further analysis &#40;pixel drills&#41;](#Extracting-TSS-data-for-further-analysis-&#40;pixel-drills&#41;)
    - [Time series indexing](#Time-series-indexing)
    - [Extracting to Pandas data frame](#Extracting-to-Pandas-data-frame)
  

# Background 

## Notebook overview

In this notebook, we use a dataset of Sentinel-2 data to extract a time series of total suspended sediments (TSS) for the purpose of monitoring water quality. The region of interest here is centred over Lake Tempe in South Sulawesi, Indonesia. 

In addition to providing a general overview of TSS estimation, this notebook demonstrates a number of technical and computational aspects of working in an Open Data Cube (ODC) environment such as EASI. In particular, it touches on the following aspects:

 - basic handling and processing of remote sensing datasets
 - deriving a raster mask of lake pixels from a shapefile
 - filtering out non-water pixels using a normalised water index
 - calculation of standard temporal statistics as well as robust statistics
 - parallelised application of custom functions to a Dask array
 - extracting a pixel-based time series of TSS values (pixel drill) and saving it to disk for further analysis.

## Suspended sediments

In order to derive an estimate of TSS concentration (in mg/L) from the remote sensing data, we need a specific formula (algorithm) that characterises the relationship between the Sentinel-2 spectral band values and the true TSS measurements. Such a formula needs to be derived and calibrated on the basis of, among others, TSS measurements sampled during one or more field campaigns on the lake of interest.

In this demonstration notebook, we rely on pre-existing research published in the following manuscript:

 > E. Pandhadha <i>et al.</i>, 2020. Total Suspended Solid (TSS) Estimation in Lake Tempe, South Sulawesi Using Sentinel-2B Imagery. 
Journal of Engineering Technology and Applied Physics, Special Issue on Remote Sensing for Sustainable Environment, no. 1 (2020). [DOI:10.33093/jetap.2020.x1.4](http://dx.doi.org/10.33093/jetap.2020.x1.4)

In this paper, the remote-sensing-based formula used to derive values for the $\text{TSS}$ parameter of interest is as follows: 

$$
\text{TSS} = \alpha \cdot \text{NSMI} + \beta \\
\text{NSMI} = \frac{ \text{red}+\text{green}-\text{blue} }{ \text{red}+\text{green}+\text{blue} } \\ 
\alpha = 775.98 \\
\beta = -93.606
$$

where $\text{red}$, $\text{green}$ and $\text{blue}$ correspond to the respective Sentinel-2 bands, and $\text{NSMI}$ represents the Normalised Suspended Material Index.
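As a quick sanity check of the formula, the following sketch evaluates it for a single pixel. The reflectance values are made up for illustration, and the function names `nsmi` and `tss_from_nsmi` are hypothetical helpers that are not used elsewhere in this notebook; only the two coefficients are taken from the formula above.

```python
import numpy as np

ALPHA = 775.98    # slope, from the formula above
BETA = -93.606    # intercept, from the formula above

def nsmi(red, green, blue):
    """Normalised Suspended Material Index from surface reflectances."""
    red, green, blue = (np.asarray(b, dtype=float) for b in (red, green, blue))
    return (red + green - blue) / (red + green + blue)

def tss_from_nsmi(index):
    """Linear NSMI-to-TSS relationship (result in mg/L)."""
    return ALPHA * index + BETA

# Illustrative (made-up) reflectances for a single water pixel
example_nsmi = nsmi(red=0.06, green=0.08, blue=0.07)
example_tss = tss_from_nsmi(example_nsmi)
print(f"NSMI = {example_nsmi:.4f}, TSS = {example_tss:.1f} mg/L")
```

Because the relationship is linear, the estimated TSS becomes negative whenever NSMI falls below $-\beta/\alpha \approx 0.12$, so pixels with such low index values may deserve closer inspection.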

<div class="alert alert-info"><font color="black"><b>Caution &ndash;</b> No attempt is made here to double-check or validate the above TSS algorithm in the context of this notebook and the corresponding Sentinel-2 dataset available on the current EASI / ODC deployment. The results provided in this notebook should thus only be considered as an overview of a possible approach to TSS measurement from remote sensing data, which would need to be further scrutinised and validated.</font></div>

Note also that in the above paper, the authors implement a pre-processing step aiming to remove pixels affected by sun-glint in the time series of Sentinel-2 data. This step is _not_ implemented in this notebook.

# Notebook setup

First, let's import the key Python packages and supporting functions required in this notebook.

In [None]:
### System
import sys

### Datacube 
import datacube
from datacube.utils import masking
from odc.algo import enum_to_bool

### Data tools
import numpy as np
import xarray as xr
import pandas as pd
from astropy.stats import sigma_clip
import geopandas as gpd
import rasterio.features

### Plotting
%matplotlib inline
import matplotlib.pyplot as plt

### EASI tools
sys.path.append('../tools/')
from datacube_utils import display_map, mostcommon_crs, xarray_object_size

And let's now also connect to the EASI database:

In [None]:
dc = datacube.Datacube(app="Lake_Tempe_TSS")

# Analysis parameters

## Spatial and temporal window

The region of interest for this demonstration notebook is centred over Lake Tempe, South Sulawesi. The utility function `display_map` provides a convenient overview of the selected latitude / longitude extents.

In [None]:
### Region of interest
min_longitude, max_longitude = (119.87, 120.04)   # Lake Tempe
min_latitude, max_latitude = (-4.198, -4.03)

In [None]:
display_map( x = (min_longitude,max_longitude), 
             y = (min_latitude,max_latitude) )

A quick look at the Sentinel-2 dataset on the [ODC Explorer](https://explorer.sg-dev.easi-eo.solutions/products/s2_l2a/extents) for the current EASI deployment indicates that data is available from 2017 onwards. For reasonable loading times, we will here only use three years' worth of satellite data over the region of interest.

In [None]:
### Dates of interest:
min_date = '2018-01-01'
max_date = '2021-01-01'

## Datacube query

The Sentinel-2 product used in this notebook to calculate TSS is labelled `s2_l2a`.

In [None]:
product = 's2_l2a'     # Sentinel-2 product

We can now initialise the main parameters of a datacube query, which we can then use to check the dataset's native projection (`mostcommon_crs`) &ndash; as we don't need to re-project the dataset to another coordinate reference system, we will simply load up the data in its native projection. 

We also set the `dask_chunks` query parameter to ensure that the loading process makes use of Dask, which will return a (lazy-loaded) dataset that can be processed in a parallelised manner. The use of the `.persist()` directive throughout this notebook essentially "forces" the loading / computation of the data contained in these Dask arrays.

In [None]:
# This code cell may generate some 'SQLAlchemy' warnings, but they can be safely ignored.

query = { 'product': product,
          'lat': (min_latitude, max_latitude),
          'lon': (min_longitude, max_longitude),
          'time': (min_date, max_date) }

### Dataset's native projection
native_crs = mostcommon_crs(dc, query)
print(f"The dataset's native CRS is \"{native_crs}\".")

query.update({ 'output_crs': native_crs,
               'resolution': (30, 30),
               'group_by': 'solar_day',
               'dask_chunks': {'x': 1024, 'y': 1024} })

Finally, loading up all Sentinel-2 bands would lead to excessive memory requirements and computational overheads. We will thus only select the bands relevant to this analysis.

The list of measurements (i.e. satellite bands and derived products) for the current product of interest can be displayed as follows.

In [None]:
dc.list_measurements().loc[product]

According to the TSS formula provided at the beginning of this notebook, we can select only those bands needed for the analysis. In addition, we also load the layer `SCL` (or its known alias `qa`) of pixel QA data, which will allow us to clean up the dataset, as well as the `swir_2` band, which will be used further below to filter out non-water pixels.

In [None]:
query.update( {'measurements': ['red', 'green', 'blue', 'swir_2', 'qa']} )
query

# Sentinel-2 dataset

## Data load

In the next cell, we load up the Sentinel-2 dataset as directed by the `query` parameters. 

In [None]:
%%time

data = dc.load(**query)
data = data.persist()   # Dask data processing
data

## Data clean-up

As usual, we need to filter out various pixels from the remote sensing dataset. This includes the removal of invalid (`nodata`) pixels, as well as those affected by various pixel quality issues. In the next cell, we create the various masks required for this clean-up operation.

In [None]:
### Valid mask (i.e. not 'nodata'), for each data layer
valid_mask = masking.valid_data_mask(data).persist()

### Mask of clear pixels
bad_pixel_flags = {'no data', 'saturated or defective', 'cloud shadows', 'cloud high probability', 'cloud medium probability'}
good_pixel_mask = ~enum_to_bool(data['qa'], bad_pixel_flags).persist()

<div class="alert alert-info"><font color="black"><b>Caution &ndash;</b> Further work is required to investigate the resulting "cleaned-up" dataset in more detail. During the development of this notebook, various plots and results (not shown here) pointed to various issues related to the pixel QA information, with, among others, cloud shadows not being identified and filtered out properly, water pixels in the lake being mis-classified as <code>cloud medium probability</code> or <code>thin cirrus</code>, etc. A more in-depth analysis of the pixel QA information should be performed to ensure that such issues are fixed, and/or do not substantially bias the results further below.</font></div>

The Sentinel-2 masking and scaling operations are subsequently applied on a band-by-band basis, as done in the next cell.

In [None]:
### Scaling factor for Sentinel-2 data
scale = 0.0001  # divide by 10000
offset = 0.0

### Apply valid mask, good pixel mask, and scaling to each layer
data['red'] = ( data['red'].where(valid_mask['red'] & good_pixel_mask) * scale + offset ).persist()
data['blue'] = ( data['blue'].where(valid_mask['blue'] & good_pixel_mask) * scale + offset ).persist()
data['green'] = ( data['green'].where(valid_mask['green'] & good_pixel_mask) * scale + offset ).persist()
data['swir_2'] = ( data['swir_2'].where(valid_mask['swir_2'] & good_pixel_mask) * scale + offset ).persist()
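To illustrate what the masking and scaling steps do to individual pixels, here is a minimal NumPy sketch on a 2-by-2 array. It assumes a `nodata` value of 0 and the standard Sentinel-2 L2A scene classification (SCL) class numbers; both are assumptions that should be checked against the product definition on the current deployment. The helper `clean_and_scale` is hypothetical and only mimics the combined effect of `valid_data_mask`, `enum_to_bool` and `.where()`.

```python
import numpy as np

# Assumed SCL class numbers for the flags listed above:
# 0 = no data, 1 = saturated or defective, 3 = cloud shadows,
# 8 = cloud medium probability, 9 = cloud high probability
BAD_SCL_CODES = [0, 1, 3, 8, 9]

def clean_and_scale(band, scl, nodata=0, scale=0.0001, offset=0.0):
    """Return reflectances, with invalid or flagged pixels set to NaN."""
    band = np.asarray(band)
    valid = band != nodata                   # 'valid data' mask
    good = ~np.isin(scl, BAD_SCL_CODES)      # 'good pixel' mask from the QA layer
    return np.where(valid & good, band * scale + offset, np.nan)

raw_red = np.array([[0, 850], [1200, 4300]], dtype="uint16")   # made-up digital numbers
scl = np.array([[0, 6], [3, 9]], dtype="uint8")                # 6 = water (assumed)
cleaned = clean_and_scale(raw_red, scl)
print(cleaned)
```

Only the pixel classified as water survives here; the other three are replaced by `NaN` because they are either `nodata` or carry one of the unwanted QA flags.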

In [None]:
### Dimensions of dataset
print( xarray_object_size(data) )

Finally, we can remove from the dataset any time slice that, as a result of the above operations, no longer contains at least one valid (non-`NaN`) pixel.

In [None]:
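data = data.dropna('time', how='all', subset=['red', 'green', 'blue', 'swir_2'])   # ignore the integer 'qa' layer, which never contains NaN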
data = data.persist()   # Dask data processing

We can now inspect the resulting data object further, e.g. by displaying the `xarray.DataArray` for one of the bands:

In [None]:
data.red

From this, we can gather that the pre-processed Sentinel-2 data is available as a Dask array over a region of about 600-by-600 pixels, with about 200 time steps. In this data object, each Dask chunk has a size of `(1,623,632)` in the `time`, `y` and `x` dimensions.
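These dimensions can be cross-checked with some rough arithmetic. The sketch below converts the latitude / longitude extents into an approximate pixel count at 30 m resolution, and estimates the memory footprint of one band. It assumes a spherical approximation of about 111.32 km per degree, roughly 200 time steps, and 32-bit floating-point values after masking and scaling; all three are approximations rather than values read from the dataset.

```python
import math

min_lon, max_lon = 119.87, 120.04
min_lat, max_lat = -4.198, -4.03
pixel_size = 30.0        # metres, as set in the query
n_times = 200            # approximate number of time steps (assumption)
bytes_per_value = 4      # assumes 32-bit floats after masking and scaling

# Rough spherical approximation: ~111.32 km per degree,
# scaled by cos(latitude) in the longitude direction
km_per_degree = 111.32
mid_lat = math.radians(0.5 * (min_lat + max_lat))
width_m = (max_lon - min_lon) * km_per_degree * math.cos(mid_lat) * 1000.0
height_m = (max_lat - min_lat) * km_per_degree * 1000.0

n_x = round(width_m / pixel_size)
n_y = round(height_m / pixel_size)
size_mib = n_times * n_y * n_x * bytes_per_value / 2**20
print(f"approx. {n_x} x {n_y} pixels, {size_mib:.0f} MiB per band")
```

The estimate lands close to the chunk dimensions reported above; small differences are expected since the dataset is gridded in a projected (UTM) coordinate system rather than in degrees.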

# Lake boundary

To improve the plots further below in this notebook, we will use a mask of the lake area in the region of interest. This mask can be derived, for instance, from an existing shapefile.

## Loading up the shapefile

For this example, we use a polygon of the Lake Tempe boundary line, which can be accessed from the shapefile provided in the `ancillary_data` folder in this repository.

In [None]:
shape_file = './ancillary_data/Base Map/Boundary_administration.shp'

### Load the shapefile
shp = gpd.read_file(shape_file)
display(shp)
shp.crs

We can see here that the vector data within that shapefile is in the projection `EPSG:4326`, which differs from that of our main Sentinel-2 dataset (`EPSG:32750`). For compatibility, we re-project the shapefile data to the CRS of the Sentinel-2 dataset.

Subsequently, we will also filter the shapefile contents to only select those polygons associated with Lake Tempe.

In [None]:
### Reproject to current coordinate reference system
shp = shp.to_crs(native_crs)

### Remove unwanted polygons
print("Selected polygons are:")
drop_list = []
for ff in shp.iterrows():
    tmp = ff[1].Village.lower()
    if 'tempe' in tmp and 'danau' in tmp: 
        print(ff[0], ff[1].Village)
    else: 
        drop_list.append(ff[0])
        
shp.drop(drop_list, inplace=True)

### Plot
shp.boundary.plot(figsize=(8,8))
plt.xlabel("x [metre]"); plt.ylabel("y [metre]")
plt.title("Lake Tempe boundary");

## Raster mask

We can now create a raster mask from the vector data. The code below iterates over the polygons in the shapefile (in case multiple polygons are available), setting the raster mask values to `1` for all the pixels located within the footprint of each polygon, and `0` otherwise.

In [None]:
### Rasterise
mask = rasterio.features.rasterize( ((feature['geometry'], 1) for feature in shp.iterfeatures()),
                                    out_shape = (data.sizes['y'], data.sizes['x']),
                                    transform = data.affine )

### Convert the mask (numpy array) to an Xarray DataArray
mask = xr.DataArray(mask, coords=(data.y, data.x))
mask

In [None]:
### Plot
mask.plot(size=8).axes.set_aspect('equal')

## Masking the data

Finally, we can use the mask we just created, apply it to the time series of Sentinel-2 data, and plot the result.

In [None]:
### Masking
data = data.where(mask).persist()

In [None]:
time_ind3 = np.linspace(1, data.sizes['time'], 3, dtype='int') - 1   # select some time slices to display

### Plot the selected time slices (true-colour display)
image_array = data[['red', 'green', 'blue']].isel(time=time_ind3).to_array()
tmp = image_array.plot.imshow(robust=True, col='time', col_wrap=3, size=5)
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

We now have a cropped data time series containing only the pixels of interest over Lake Tempe. 

However, due to the varying extents of the lake over time (during wet / dry conditions), some time slices in the time series will contain a certain number of non-water (i.e. land) pixels. Further below, the TSS algorithm would thus also be applied to these land / vegetation pixels, thereby leading to some bias in the results.

# Filtering out land &#40;i.e. non-water&#41; pixels

In order to address this issue, we could try to use the modified normalised difference water index (MNDWI) in order to filter out the non-water pixels. 

## Water index

The MNDWI is calculated on the basis of the Sentinel-2 bands as per the following equation, with MNDWI values greater than 0 indicating water pixels:

$$
\text{MNDWI} = \frac{ \text{green}-\text{SWIR} }{ \text{green}+\text{SWIR} }.
$$

So let's apply this formula to the Sentinel-2 dataset, and save the resulting MNDWI data back into the `data` object as an additional band.

In [None]:
data['MNDWI'] = ( (data.green - data.swir_2) / (data.green + data.swir_2) ).persist()
data

As shown above, we now have the MNDWI band integrated as part of the `data` (Dask) array. For insight, we can also plot a few MNDWI time slices to investigate the results further.

In [None]:
time_ind9 = np.linspace(1, data.sizes['time'], 9, dtype='int') - 1   # select some time slices to display
tmp = data.MNDWI[time_ind9].plot(col='time', col_wrap=3, size=4)
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

This approach appears to provide good results for the current Sentinel-2 dataset, with various regions on the edge of the lake clearly identified as being non-water (MNDWI values below 0.0).
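For reference, the behaviour of the index and of the zero threshold can be illustrated on a few made-up reflectance pairs. The values below are purely illustrative and are not taken from the Sentinel-2 dataset; the helper function `mndwi` is hypothetical.

```python
import numpy as np

def mndwi(green, swir):
    """Modified normalised difference water index."""
    green = np.asarray(green, dtype=float)
    swir = np.asarray(swir, dtype=float)
    return (green - swir) / (green + swir)

# Made-up reflectances: open water, ambiguous shoreline pixel, vegetated land
green = np.array([0.08, 0.07, 0.06])
swir = np.array([0.01, 0.07, 0.20])

index = mndwi(green, swir)
is_water = index > 0.0
print(index.round(3), is_water)
```

Water absorbs strongly in the short-wave infrared, which pushes the index towards positive values, whereas land and vegetation yield negative values. Note that with a strict `> 0` threshold, a pixel with an index of exactly zero is treated as non-water.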

## Removing non-water pixels

We can now use the MNDWI information to remove the non-water pixels from the time series.

In [None]:
data = data.where(data.MNDWI>0.0).persist()

In [None]:
### Plot some selected time slices (true-colour display)
image_array = data[['red', 'green', 'blue']].isel(time=time_ind3).to_array()
tmp = image_array.plot.imshow(robust=True, col='time', col_wrap=3, size=5);
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

# TSS analysis

## TSS calculation

As per the formula provided at the start of this notebook, TSS values can be calculated for each pixel in the time series on the basis of the selected Sentinel-2 bands.

In [None]:
### TSS calculation
tmp = data.red + data.green
nsmi = (tmp - data.blue) / (tmp + data.blue)
data_tss = (775.98 * nsmi - 93.606).persist()
data_tss

## Selected displays

### Individual time slices

For insight, let's take a look at the TSS data at a few selected time points.

In [None]:
tmp = data_tss[time_ind9].plot( col='time', col_wrap=3, cmap='rainbow', size=4, robust=True,
                                cbar_kwargs = dict(label="TSS [mg/L]") )
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

We can here clearly see that some pixel quality issues are still affecting the datasets, e.g. with residual areas of cloud shadow not having been removed successfully during the data clean-up process.

### Temporal aggregation &#40;1&#41;

While these "daily" TSS plots can be insightful, the abundance of missing (and corrupt) data, as well as the many time slices in the time series, make for a difficult assessment of the results. One approach to circumvent this is to first aggregate the data over coarser time spans, and then display the average TSS over these periods.

`Xarray` allows for a straightforward aggregation of the data using [Pandas date offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects). For instance, we could calculate the temporal mean of the data over successive quarters, using the `QS-JAN` offset (each quarter is labelled by its start date, with the first quarter of each year starting in January):

In [None]:
data_tss_quarter = data_tss.resample(time="QS-JAN").mean()   # aggregation over each successive quarter
data_tss_quarter = data_tss_quarter.persist()
data_tss_quarter

As we can see from the resulting Xarray object, there is a total of 12 quarters in the current time series, leading to 12 "time slices" in the resulting array.

Then, as done before for the daily time series, we can select a few of the resulting quarters and plot the respective (temporally averaged) TSS maps.

In [None]:
time_ind = np.linspace(1, data_tss_quarter.sizes['time'], 9, dtype='int') - 1   # some selected time slices to display

### Main plot
tmp = data_tss_quarter[time_ind].plot( col='time', col_wrap=3, cmap='rainbow', size=4, robust=True,
                                       cbar_kwargs = dict(label="quarterly-averaged TSS [mg/L]") )
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

Here we can see that a clearer picture of the temporal TSS dynamics / characteristics is starting to emerge for each averaging period (quarter) along the selected time series.
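To make the `QS-JAN` binning used above more concrete, the following sketch applies the same offset alias to a synthetic daily Pandas series covering three calendar years. The series values are arbitrary, and the sketch assumes that resampling in `Xarray` follows the same Pandas offset conventions.

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering three calendar years (values are arbitrary)
days = pd.date_range("2018-01-01", "2020-12-31", freq="D")
series = pd.Series(np.arange(len(days), dtype=float), index=days)

quarterly = series.resample("QS-JAN").mean()
print(len(quarterly))          # number of quarterly bins
print(quarterly.index[:4])     # each label is the first day of a quarter
```

Each bin is labelled with the first day of its quarter (1 January, 1 April, 1 July, 1 October). An observation dated 1 January 2021 would open a 13th bin, which is worth keeping in mind when the upper bound of the time window falls exactly on a quarter boundary.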

### Temporal aggregation &#40;2&#41;

Another potential way to temporally aggregate the time series data is to calculate the average TSS values for _all_ time slices in specific periods such as months, seasons, quarters, etc. In other words, while the previous plots present averaged results over successive quarters (resulting in a total of 12 quarters for the current time series), we could now average the data from _all_ the time slices within, e.g., a given month or season (regardless of the year). This would provide an overview of the average TSS concentration in various months or seasons in any given year.

For illustration, let's apply this approach to calculate the average TSS maps for each individual month in our time series &ndash; this is done by essentially collecting _all_ the January time slices to calculate the January average, and so forth for all other months.


In [None]:
data_tss_group = data_tss.groupby("time.month").mean()   # aggregation over each month
data_tss_group = data_tss_group.persist()
data_tss_group

As can be seen here, the result of this operation is a DataArray with a new dimension (`month`) and 12 coordinates along it &ndash; one "slice" for each individual month.

In [None]:
tmp = data_tss_group.plot( col='month', col_wrap=4, cmap='rainbow', size=4, robust=True,
                           cbar_kwargs = dict(label="monthly average TSS [mg/L]") )
for ax in tmp.axes.flatten(): 
    ax.set_aspect('equal')

This plot seems to indicate that, over the whole time series under current consideration, the months between January and April generally exhibit elevated levels of TSS over most of Lake Tempe. In contrast, lower values of TSS are recorded, on average, during the months of May, June and July.

<div class="alert alert-info"><font color="black"><b>Caution &ndash;</b> Once again, this result should be treated with caution: further analyses should be carried out to ensure that the time series leading to it is free of pixel quality issues and artefacts.</font></div>
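The grouping by calendar month used in this section can be illustrated with a small Pandas sketch. The series below is an arbitrary synthetic signal with an annual cycle (it says nothing about the actual seasonality of Lake Tempe); the point is only that three years of daily data collapse into 12 values, one per calendar month.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2018-01-01", "2020-12-31", freq="D")

# Arbitrary synthetic signal peaking around day 45 of each year
day_of_year = days.dayofyear.to_numpy()
values = 150.0 + 50.0 * np.cos(2.0 * np.pi * (day_of_year - 45) / 365.25)
series = pd.Series(values, index=days)

# Pool all Januaries, all Februaries, etc., regardless of the year
monthly_climatology = series.groupby(series.index.month).mean()
print(monthly_climatology.round(1))
```

In contrast to resampling, which keeps successive periods separate, this grouping pools the same month from every year, so year-to-year differences are averaged out.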
    
## Temporal statistics &#40;standard&#41;

The overall TSS average across the _entire_ time series, for each pixel, can be easily calculated as follows.

In [None]:
data_tss_mean = data_tss.mean('time')   # standard mean over entire time series
data_tss_mean = data_tss_mean.persist()

In [None]:
### Plot
fig = plt.figure(figsize=(11,7))
data_tss_mean.plot(robust=True, cmap='rainbow', cbar_kwargs={'label':'mean TSS [mg/L]'})
plt.gca().set_aspect('equal','box');

This map provides an overview of the overall TSS concentrations for the region of interest and over the entire specified time span.

From the previous results in this notebook, however, we should expect that the above plot integrates a number of pixels for which the pixel QA clean-up process has failed. For instance, some pixels affected by cloud shadows may not have been successfully masked out, leading to erroneously low TSS values in some time slices. These problematic TSS values have subsequently been used in the computation of the above average plot, leading to potentially biased results.
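The size of such a bias can be illustrated with a synthetic example. The sketch below builds a made-up TSS time series for a single pixel and adds a handful of artificially low values, standing in for observations affected by unmasked cloud shadow; none of these numbers come from the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic TSS time series for one pixel: 50 "clean" observations around 150 mg/L
clean = rng.normal(loc=150.0, scale=5.0, size=50)

# Same series with 5 hypothetical contaminated observations added
contaminated = np.concatenate([clean, np.full(5, 20.0)])

print(f"mean of clean samples:        {clean.mean():.1f} mg/L")
print(f"mean of contaminated samples: {contaminated.mean():.1f} mg/L")
```

Even though fewer than 10% of the samples are affected in this example, the mean is pulled down by more than 10 mg/L.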

## Robust statistics &#40;using Dask&#41;

### Approach

In order to minimise the impact from such outliers, we could make use of a more _robust_ metric of temporal averaging instead of the simple `.mean()` operation used earlier. For instance, one such metric would be to calculate the _median_ TSS instead.

Here, we will use another approach, which is to define our own (custom) function and apply it to the Xarray / Dask array of TSS data (`data_tss`). The aim is to instruct `Xarray` to take each pixel, apply the function to the corresponding TSS time series (along the `time` dimension), and aggregate the results into a two-dimensional map over the `x` and `y` dimensions. Given the Dask arrays at hand, we would also like this process to occur in a parallelised fashion on all CPUs available in this JupyterLab environment.

### Robust functions

In the next cell, we start by defining a `robust_mean()` function, which operates as follows on a NumPy vector `z` of input data:

1. remove `NaN`s from the data
1. if fewer than 10 data points remain, simply return `NaN`
1. remove outliers from the remaining data using the `sigma_clip` function
1. calculate and return the (standard) mean of the filtered data (`NaN`s and outliers removed).

Further below, this function will be applied, in a parallelised fashion, to the Dask array of TSS values &ndash; essentially, `robust_mean()` will receive as input `z` the time series of TSS values for each pixel in turn. The returned value of the pixel's (robust) average TSS will then be used to build the resulting map of mean TSS, calculated in a robust way.

In addition, we here also define another function (`robust_cv()`) to calculate the _coefficient of variation_ (CV) for the same time series of TSS data (at each pixel). As shown below, the CV is simply defined as the standard deviation of the input values, divided by the mean &ndash; here again, calculated in a robust manner on the basis of the filtered vector of data.

In [None]:
def robust_mean(z):
    zf = z[~np.isnan(z)]   # filter out NaNs
    if len(zf)<10: return np.nan   # use at least 10 values to compute the mean
    zf = sigma_clip( zf, masked=False )   # remove outliers
    return np.mean(zf)   # mean of data without outliers

def robust_cv(z):
    zf = z[~np.isnan(z)]   # filter out NaNs
    if len(zf)<10: return np.nan   # use at least 10 values to compute the CV
    zf = sigma_clip( zf, masked=False )   # remove outliers
    return np.std(zf) / np.mean(zf)   # CV of data without outliers
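For readers unfamiliar with sigma clipping, the following NumPy sketch shows the general idea on a made-up time series. It is a simplified re-implementation for illustration only, not the code used by `astropy`; the settings (3 standard deviations around the median, at most 5 iterations) are believed to mirror the defaults of `sigma_clip`, but this should be checked against the `astropy` documentation. The function name `simple_sigma_clip` is hypothetical.

```python
import numpy as np

def simple_sigma_clip(values, sigma=3.0, max_iters=5):
    """Iteratively drop values further than `sigma` standard deviations from the median."""
    kept = np.asarray(values, dtype=float)
    for _ in range(max_iters):
        centre, spread = np.median(kept), np.std(kept)
        inside = np.abs(kept - centre) <= sigma * spread
        if inside.all():
            break                # converged: nothing left to remove
        kept = kept[inside]
    return kept

# Made-up pixel time series: 30 plausible values plus two very low outliers
series = np.concatenate([np.linspace(140.0, 160.0, 30), [20.0, 25.0]])
clipped = simple_sigma_clip(series)

print(f"plain mean:   {series.mean():.2f}")
print(f"clipped mean: {clipped.mean():.2f} ({series.size - clipped.size} values removed)")
```

In this example, the two low values are rejected in the first iteration, after which the remaining values all lie within the clipping bounds and the procedure stops.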

### Applying parallelised custom functions

In the following cell, the `robust_mean` function is applied in a parallelised fashion to the dataset by making use of the `apply_ufunc()` function in `Xarray` &ndash; another way to carry out this operation would be to use `xr.map_blocks()`.

In [None]:
### Re-chunk Dask array for efficient time-series processing
data_tss = data_tss.chunk({'time':-1, 'x':32, 'y':32}).persist()

### Parallelised processing
data_tss_robMean = xr.apply_ufunc( robust_mean, data_tss, input_core_dims=[["time"]], 
                                   dask='parallelized', vectorize=True )   # robust mean, whole time series
data_tss_robMean = data_tss_robMean.persist()

Our second custom function can now be applied in the same way to the TSS data.

In [None]:
data_tss_robCV = xr.apply_ufunc( robust_cv, data_tss, input_core_dims=[["time"]], 
                                 dask='parallelized', vectorize=True )   # robust CV, whole time series
data_tss_robCV = data_tss_robCV.persist()
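Conceptually, the two `apply_ufunc()` calls above reduce the `time` dimension by calling a function on the time series of each pixel. A serial NumPy analogue of this pattern (without any of the Dask parallelisation) is `np.apply_along_axis`, sketched below on a tiny synthetic cube. The helper `mean_if_enough` is a hypothetical, simplified stand-in for the robust functions defined earlier.

```python
import numpy as np

def mean_if_enough(z, min_count=3):
    """Mean of the non-NaN values in `z`, or NaN if fewer than `min_count` remain."""
    zf = z[~np.isnan(z)]
    return np.nan if zf.size < min_count else zf.mean()

# Tiny synthetic cube with dimensions (time, y, x)
cube = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
cube[:, 0, 0] = np.nan     # a pixel with no valid observations
cube[0, 1, 2] = np.nan     # a pixel with one missing observation

# Apply the function to the time series (axis 0) of every pixel
result = np.apply_along_axis(mean_if_enough, 0, cube)
print(result.shape)        # one value per (y, x) pixel
print(result)
```

The output has the spatial dimensions only, with `NaN` wherever a pixel does not have enough valid observations.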

In [None]:
### Plots
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(20,7))

data_tss_robMean.plot(robust=True, cmap='rainbow', cbar_kwargs={'label':'TSS mean'}, ax=ax1)
ax1.set_title("Robust mean of TSS")
ax1.set_aspect('equal','box')

data_tss_robCV.plot(robust=True, cmap="rainbow", cbar_kwargs={'label':'TSS C.V.'}, ax=ax2)
ax2.set_title("Robust C.V. of TSS")
ax2.set_aspect('equal','box');

### Discussion

The plot of robust mean values (left-hand side) has not changed significantly compared to the previous plot of the (standard) average values. However, the fact that we have used a robust metric for the temporal averaging operation gives us more confidence in the validity of the plotted results, in particular with respect to the residual pixel quality issues affecting the considered TSS dataset.

On the right-hand side, the CV map provides further insight into the temporal dynamics in the TSS dataset, highlighting a number of "hot spots" with elevated TSS variability near the edge of the lake (perhaps as a result of regular sediment contribution from specific rivers / tributaries to the lake waters).

A final note concerns the range of (average) TSS values displayed in the above plot (left-hand side). In the Pandhadha _et al._ paper cited at the beginning of this notebook, the authors report field measurements of TSS in Lake Tempe ranging between 115 and 203 mg/L, which is (roughly) of the same order of magnitude as the values displayed in the plot. We can thus have some confidence that the TSS algorithm used in this notebook provides data that appears sensible, though further validation work is required to determine whether it provides accurate results for the range of values encountered in this notebook.

# Extracting TSS data for further analysis &#40;pixel drills&#41;

## Time series indexing

At this point, a practitioner might want to extract the time series of TSS values at selected locations, display them, and potentially write them to file to be used as input to further processing or modelling work.

We demonstrate this by first selecting a few points of interest, e.g. along a transect line across the region of interest.

In [None]:
### Some pixels along a transect
n_points = 5
pixloc_y = np.linspace(9547000, 9545000, n_points)
pixloc_x = np.linspace(825000, 832000, n_points)

In [None]:
### Plot
fig = plt.figure(figsize=(8,8))
plt.plot(pixloc_x, pixloc_y, marker='o', color='black', linestyle='none')
shp.boundary.plot(ax=fig.axes[0], color='black');
for p, (y, x) in enumerate(zip(pixloc_y, pixloc_x)):
    plt.text(x, y, f"{p:4d}")   # label each point with its index

We can now extract the pixels' TSS data using `Xarray`'s vectorised indexing functionality, which will retrieve data at the grid cells nearest to the target `x` and `y` coordinates. For illustration purposes, here we make use of the time series of quarterly averaged TSS values.

In [None]:
points_x = xr.DataArray(pixloc_x, dims="points")
points_y = xr.DataArray(pixloc_y, dims="points")

### Extract data (quarterly dataset)
points_dat = data_tss_quarter.sel(x=points_x, y=points_y, method="nearest")
points_dat = points_dat.dropna('time', how='all')
points_dat = points_dat.persist()

### Plot
points_dat.plot.line(x='time', marker='.', figsize=(15,5));
plt.gca().set_title("Quarterly TSS values, selected pixels");

From this plot, we can clearly identify a period of decreased TSS levels at all points of interest, preceded by a series of higher TSS concentrations.
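For insight, the point-wise nearest-neighbour look-up used in this section can be sketched in plain NumPy. The grid coordinates and target locations below are arbitrary and do not correspond to the Sentinel-2 dataset; the helper `nearest_index` is hypothetical and only mimics what `.sel(..., method="nearest")` achieves when given two arrays sharing the `points` dimension.

```python
import numpy as np

def nearest_index(coords, targets):
    """Index of the grid coordinate closest to each target value."""
    coords = np.asarray(coords, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.abs(coords[None, :] - targets[:, None]).argmin(axis=1)

# Synthetic 30 m grid (arbitrary coordinates)
x = 825000.0 + 30.0 * np.arange(5)
y = 9547000.0 - 30.0 * np.arange(4)
grid = np.arange(y.size * x.size).reshape(y.size, x.size)   # dimensions (y, x)

target_x = np.array([825010.0, 825100.0])
target_y = np.array([9546995.0, 9546920.0])

ix = nearest_index(x, target_x)
iy = nearest_index(y, target_y)
values = grid[iy, ix]      # one value per (x, y) pair, not a 2-by-2 block
print(values)
```

The key design point is that the `x` and `y` targets are paired element by element, so `n` target locations yield `n` extracted values rather than an `n`-by-`n` sub-grid.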

## Extracting to Pandas data frame

The following code cell will save the sampled (quarterly) TSS data into a Pandas data frame:

In [None]:
### Extract to Pandas
tss_df = pd.DataFrame( data = points_dat.values, 
                       index = points_dat.time.values, 
                       columns = points_dat.points.values )

### (Re)set DF index
tss_df['month'] = tss_df.index.month
tss_df['year'] = tss_df.index.year
tss_df = tss_df.set_index(['year','month'])

tss_df

If desired, we can also "re-format" the data frame into a long (as opposed to wide) format:

In [None]:
tss_df = tss_df.stack(dropna=False).rename_axis(['year','month','point'])
tss_df = pd.DataFrame(tss_df).rename(columns={0:'TSS'})
tss_df

And finally, saving the data (e.g. to `.csv` or `.pkl` file) can be achieved with the following code if desired.

In [None]:
### Uncomment if needed:
# tss_df.to_csv('./TSS_pixel_data.csv')
# tss_df.to_pickle(path='./TSS_pixel_data.pkl')
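When reading such a `.csv` file back in, the three index levels need to be restored explicitly. The sketch below demonstrates this round trip on a small made-up data frame with the same layout as the long-format `tss_df`, using an in-memory buffer in place of a file on disk; the TSS values are arbitrary.

```python
import io
import pandas as pd

# Small stand-in for the long-format data frame (values are made up)
index = pd.MultiIndex.from_product([[2018], [1, 4], [0, 1]],
                                   names=["year", "month", "point"])
demo_df = pd.DataFrame({"TSS": [150.0, 162.5, None, 140.25]}, index=index)

buffer = io.StringIO()     # in-memory stand-in for a .csv file
demo_df.to_csv(buffer)
buffer.seek(0)

# Restore the (year, month, point) index when reading the data back
restored = pd.read_csv(buffer, index_col=["year", "month", "point"])
print(restored)
```

Missing TSS values are written as empty fields and read back as `NaN`. The pickle format, in contrast, preserves the index structure without any extra arguments.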

In [None]:
### End notebook