# TS-1: Data preparation

*****

This notebook loads and pre-processes an SDC dataset, which you can then save to a NetCDF (.nc) file and reload quickly in the other notebooks where you do your analysis.

Things you should change:

* The config_cell variables
* The output filename of the NetCDF file (see the last cell).

The notebook has two sections, one for each kind of dataset that you may want to pre-process:

* Landsat
* Land use statistics

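Only execute the section that corresponds to the product you specified in the config_cell, unless you want to add land use statistics to Landsat data that you have already loaded (see option 2 of the land use statistics section).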

*****


In [None]:
# Import modules

# reload module before executing code
%load_ext autoreload
%autoreload 2

# define modules locations (you might have to adapt define_mod_locs.py)
%run ../sdc-notebooks/Tools/define_mod_locs.py

import os
import shutil

import numpy as np
import xarray as xr
    
from datetime import datetime

from sdc_tools.sdc_utilities import lsc2_loadcleanscale

import datacube
dc = datacube.Datacube()

ds_clean = None
ds_astat = None

The next cell contains the dataset configuration information:
- product
- geographical extent
- time period
- bands

You can generate it in three ways:
1. manually from scratch,
2. by manually copy/pasting the final cell content of the [config_tool](config_tool.ipynb) notebook,
3. by loading the final cell content of the [config_tool](config_tool.ipynb) notebook using the magic `%load config_cell.txt`.
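
For the first option, a manually written configuration cell might look like the sketch below. The variable names are the ones the later cells of this notebook expect; every value is a hypothetical placeholder (the extent, the dates, the band list and both CRS codes are assumptions), so replace them with your own.

```python
# Hypothetical config cell: all values are illustrative placeholders.
product = 'landsat_ot_c2_l2'                 # product name known to the datacube
measurements = ['red', 'green', 'nir', 'pixel_qa']

# Geographical extent as (min, max) pairs, expressed in `crs` (assumed values).
longitude = (7.40, 7.50)
latitude = (46.90, 46.95)
crs = 'epsg:4326'

# Time period as (start, end) dates (assumed values).
time = ('2020-06-01', '2020-08-31')

# Output grid: assumed projected CRS, pixel size in its units as (y, x).
output_crs = 'epsg:2056'
resolution = (-30, 30)
```

The cell generated by the config_tool notebook remains the safer choice, because it only offers products and bands that actually exist in the datacube.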

In [None]:
%load "config_cell.txt"

# Choose your path now ...
## (1) Optical Landsat satellite data

In [None]:
# If you like, you can load a longer Landsat time series by requesting data from each satellite.
# Be aware that this will take quite a long time to load,
# so only do this for an area a few kilometres to a few tens of kilometres in extent (otherwise you risk requesting too much data!)
#products = ['landsat_ot_c2_l2', 'landsat_etm_c2_l2', 'landsat_tm_c2_l2']

ds_clean, mask = lsc2_loadcleanscale(dc = dc,
                                     products = product,
                                     longitude = longitude,
                                     latitude = latitude,
                                     crs = crs,
                                     time = time,
                                     measurements = measurements,
                                     output_crs = output_crs,
                                     resolution = resolution)

In [None]:
ds_clean = ds_clean.where(ds_clean >= 0) # keep only non-negative values (others become NaN)
ds_clean = ds_clean.dropna('time', how='all') # drop scenes without data
ds_clean.time.attrs = {}

In [None]:
## Some necessary small changes so that we can save this dataset to a NetCDF (.nc) file.

# Remove quality info attributes
if 'pixel_qa' in measurements:
    ds_clean.pixel_qa.attrs['flags_definition'] = []
elif 'slc' in measurements:
    ds_clean.slc.attrs['flags_definition'] = []

### Optional: add normalised difference index

In [None]:
# OPTIONAL CELL TO CALCULATE NDIs
# You can already calculate normalised difference indices here to be saved with the measurements.
# To do this, keep the relevant line(s) below, comment out the ones you do not need, and/or add your own, e.g.:
# ds_clean['ndbi'] = (ds_clean.swir2 - ds_clean.nir) / (ds_clean.swir2 + ds_clean.nir)

ds_clean['ndvi'] = (ds_clean.nir - ds_clean.red) / (ds_clean.nir + ds_clean.red)
ds_clean['ndwi'] = (ds_clean.green - ds_clean.nir) / (ds_clean.green + ds_clean.nir)

# Remove time attributes from each of the indices that you define above.
ds_clean.ndvi.time.attrs = {}
ds_clean.ndwi.time.attrs = {}
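
All the index formulas in the optional cell above share the same shape, (a - b) / (a + b). As a minimal sketch, they could be wrapped in one small helper; `normalised_difference` is a hypothetical name, not part of `sdc_tools`. It is shown here on plain numbers for a single pixel, and because it only uses arithmetic operators it should behave the same way on xarray bands.

```python
def normalised_difference(band_a, band_b):
    """Return (band_a - band_b) / (band_a + band_b).

    Works on anything that supports arithmetic: plain numbers,
    NumPy arrays or xarray DataArrays. With plain numbers a zero
    denominator raises ZeroDivisionError.
    """
    return (band_a - band_b) / (band_a + band_b)


# Single-pixel illustration with made-up reflectances.
nir, red = 0.5, 0.1
ndvi_pixel = normalised_difference(nir, red)
```

With the dataset loaded in this notebook, the same call would read `ds_clean['ndbi'] = normalised_difference(ds_clean.swir2, ds_clean.nir)`, assuming `swir2` is among your measurements.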

### Take a quick look at the summary of the data

In [None]:
ds_clean

## And/or (2) Land use statistics

Here, you can either:

1. Load land use statistics directly using the information from the `config_cell` that you already loaded above
2. Or, if you already loaded Landsat data above, load the arealstatistik data at the same resolution for the same area. This second option requires some choices from you in the cell below ...

In [None]:
## STUFF FOR OPTION 2
# Here, we manually change the variables `product` and `measurements` to specify what we want to load from arealstatistik.
# We leave longitude, latitude, resolution, output_crs exactly as they were for Landsat. 
# This ensures that the data from arealstatistik will match the spatial coordinates of Landsat perfectly.

# TO PROCEED WITH THIS OPTION, UNCOMMENT AND EDIT 2 CODE LINES BELOW!

# Specify the arealstatistik product
# product = ['arealstatistik']

# Here, the measurements are not individual colour bands, 
# but instead are the different surveys with the desired number of classes.
# In this example, we are loading the 27-class measurements for two survey periods: the one ending in 1985 and the one ending in 2018.
# measurements = ['AS85_27','AS18_27']

In [None]:
# Time is not relevant for the arealstatistik products, so we don't include it as a keyword here.
ds_astat = dc.load(product = product,
                   measurements = measurements,
                   longitude = longitude,
                   latitude = latitude,
                   output_crs = output_crs,
                   resolution = resolution)

### Take a quick look at the summary of these data

In [None]:
ds_astat

## Saving the data

In [None]:
## First, figure out if we need to combine Landsat data with arealstatistik.

if (ds_clean is not None) and (ds_astat is not None):
    # In this case, you have loaded both Landsat and arealstatistik.
    # So, let's combine them into a single Dataset, allowing them to be saved together.
    ds_save = xr.merge([ds_clean, ds_astat])
elif (ds_clean is not None):
    # We are saving only the Landsat dataset
    ds_save = ds_clean
elif (ds_astat is not None):
    # We are saving only the arealstatistik dataset
    ds_save = ds_astat
else:
    raise ValueError('No data loaded: run the Landsat and/or the land use statistics section first. Ask a teacher for help if needed.')

### This is what will be saved...

In [None]:
ds_save

### Save the file.

In [None]:
# Save the file. Change the output filename to something useful!
output_filename = 'myfile.nc'
ds_save.to_netcdf(output_filename)
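
One way to get a useful output filename is to build it from what the file contains. The sketch below is only a suggestion: `build_output_filename` is a hypothetical helper, and the label and dates are placeholder values.

```python
from datetime import datetime


def build_output_filename(label, time_range, created=None):
    """Build a name like '<label>_<start>_<end>_created-YYYYMMDD.nc'."""
    created = created or datetime.now()  # default to today's date
    start, end = time_range
    return f"{label}_{start}_{end}_created-{created:%Y%m%d}.nc"


# Placeholder label and period; use your own product and dates.
output_filename = build_output_filename('landsat', ('2020-06-01', '2020-08-31'))
```

You could then pass the result to `ds_save.to_netcdf(output_filename)` as in the cell above.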
