# EOCSI EASI training session 1: Introduction, datasets and xarray

## 1. Accessing data through EASI ODC API

View available products and data coverage at the EASI Explorer: https://explorer.asia.easi-eo.solutions

### Determine parameters for accessing data

#### Where and when?
e.g. Singapore, recent


#### What type of data? 

e.g. reflectance, temperature, elevation

#### What resolution and projection?

e.g. 10 m resolution and EPSG:32648 (UTM zone 48N): https://explorer.asia.easi-eo.solutions/product/s2_l2a/regions/48NUG

or latitude/longitude on the native grid of the dataset
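
As a rough aid for picking a UTM projection, the sketch below derives the WGS 84 / UTM EPSG code from a longitude/latitude pair. The helper name `utm_epsg` is hypothetical (it is not part of ODC or EASI), and the formula ignores the special zones around Norway and Svalbard.

```python
import math


def utm_epsg(lon: float, lat: float) -> str:
    """Return the WGS 84 / UTM EPSG code for a longitude/latitude pair."""
    # UTM zones are 6 degrees wide and numbered eastwards from 180 degrees west
    zone = int(math.floor((lon + 180) / 6)) + 1
    # EPSG 326xx is used for the northern hemisphere, 327xx for the southern
    base = 32600 if lat >= 0 else 32700
    return f"EPSG:{base + zone}"


print(utm_epsg(103.85, 1.41))   # Singapore -> EPSG:32648
print(utm_epsg(120.52, -1.86))  # area loaded later in this notebook -> EPSG:32751
```

Both results match the codes used elsewhere in this notebook, which is a quick sanity check on the formula.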

### Explore datasets through ODC API

A good example for Sentinel-2: https://github.com/csiro-easi/eocsi-hackathon-2022/blob/main/case-studies/Chlorophyll_monitoring.ipynb

In [None]:
%matplotlib inline

import datacube
from datacube.utils import masking
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.insert(1, '../Tools/')
from dea_tools.plotting import rgb, display_map
from dea_tools.bandindices import calculate_indices

In [None]:
#import datacube

dc = datacube.Datacube(app="data_avail")
dc.list_products()

In [None]:
product = "s2_l2a"

In [None]:
#dc.list_measurements().loc["nasa_aqua_l2_oc"]
dc.list_measurements().loc[product]

Useful figure for Sentinel-2 spectral bands: https://www.usgs.gov/faqs/how-does-data-sentinel-2as-multispectral-instrument-compare-landsat-data

In [None]:
dc.list_measurements().loc[product].loc["SCL"]["flags_definition"]

### Load data

In [None]:
# Define the area of interest  
#latitude = (1.4300, 1.3950)
#longitude = (103.82300, 103.87000)

latitude = (-1.87, -1.85)
longitude = (120.51, 120.53)

time = "2022"  # a single string selects the whole year
display_map(x=longitude, y=latitude)

In [None]:
# Specify the parameters to pass to the load query
query = {
    "x": longitude,
    "y": latitude,
    "time": time,
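    "group_by": "solar_day", # merge observations from the same solar day into one time step
    "cloud_cover": [0, 30], # only use scenes whose metadata cloud cover is between 0 and 30 %
    "measurements": ["red", "green", "blue", "mask"], # bands to load; "mask" holds the flags used for masking below
    "output_crs": "EPSG:32751", # UTM zone 51S, which covers this area of interest
    "resolution": (-10, 10), # pixel size (y, x) in CRS units (metres); y is negative so rows run north to south
    #"dask_chunks": {} # or {"time": 1, "x":400, "y":400}; uncomment to load lazily with dask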
}

# Load the data
ds_s2 = dc.load(product=product, **query)

In [None]:
print(ds_s2)

### Plot data

Some plotting examples: https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Beginners_guide/05_Plotting.ipynb

In [None]:
ds_s2.isel(time=0)[["red","green","blue"]].to_array().plot.imshow(robust=True);

### Mask data

In [None]:
masking.describe_variable_flags(ds_s2.mask)

In [None]:
masking.describe_variable_flags(ds_s2.mask).loc["qa", "values"]

In [None]:
# Multiple flags are combined as logical OR using the | symbol
cloud_free_mask = (
    masking.make_mask(ds_s2.mask, qa="vegetation") | 
    masking.make_mask(ds_s2.mask, qa="bare soils") |
    masking.make_mask(ds_s2.mask, qa="water") |
    masking.make_mask(ds_s2.mask, qa="snow or ice")
)
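
The flag names above stand for integer class codes in the scene classification layer. The sketch below shows the same OR logic on a small synthetic array, using plain xarray rather than `masking.make_mask`. It assumes the standard Sentinel-2 SCL codes (4 vegetation, 5 not vegetated / bare soils, 6 water, 11 snow or ice); check the flags definition of your product before relying on them.

```python
import numpy as np
import xarray as xr

# Synthetic 1 x 3 x 4 scene classification layer (hypothetical values, not real data)
scl = xr.DataArray(
    np.array([[[4, 5, 6, 11],
               [8, 9, 10, 3],
               [4, 4, 0, 7]]]),
    dims=("time", "y", "x"),
)

# Assumed standard Sentinel-2 SCL codes for the four "clear" classes
CLEAR_CLASSES = [4, 5, 6, 11]

# OR-ing one comparison per class ...
mask_or = (scl == 4) | (scl == 5) | (scl == 6) | (scl == 11)
# ... gives the same result as a single membership test
mask_isin = scl.isin(CLEAR_CLASSES)

print(bool((mask_or == mask_isin).all()))  # True
print(int(mask_isin.sum()))                # 6 clear pixels out of 12
```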

In [None]:
# Calculate proportion of good pixels
valid_pixel_proportion = cloud_free_mask.sum(dim=("x", "y"))/(cloud_free_mask.shape[1] * cloud_free_mask.shape[2])

valid_threshold = 0.5
observations_to_keep = (valid_pixel_proportion >= valid_threshold)
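
Because the mean of a boolean array is the fraction of `True` values, the proportion can also be written with `.mean()`, which avoids relying on the order of the dimensions in `.shape`. A minimal sketch on a synthetic mask (hypothetical values):

```python
import numpy as np
import xarray as xr

# Synthetic boolean mask: 3 time steps of 2 x 2 pixels
mask = xr.DataArray(
    np.array([
        [[True, True], [True, False]],    # 3 of 4 pixels clear
        [[False, False], [True, False]],  # 1 of 4 pixels clear
        [[True, True], [False, False]],   # 2 of 4 pixels clear
    ]),
    dims=("time", "y", "x"),
    coords={"time": np.array(["2022-01-01", "2022-01-06", "2022-01-11"],
                             dtype="datetime64[ns]")},
)

# Fraction of clear pixels per time step
proportion = mask.mean(dim=("x", "y"))

# Keep only time steps with at least half of the pixels clear
keep = proportion >= 0.5
kept = mask.isel(time=keep.values)

print(proportion.values)   # [0.75 0.25 0.5 ]
print(kept.time.values)    # first and third time steps
```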

In [None]:
# Only keep observations above the good pixel proportion threshold
# If the data was loaded lazily with dask, uncomment .compute() to load the values into memory. This step may take some time
ds_s2 = ds_s2.sel(time=observations_to_keep)#.compute()

In [None]:
ds_s2.isel(time=0)[["red","green","blue"]].to_array().plot.imshow(robust=True);

In [None]:
# Mask the data
ds_s2_masked = ds_s2.where(cloud_free_mask)
ds_s2_masked.isel(time=0)[["red","green","blue"]].to_array().plot.imshow(robust=True);

In [None]:
print(ds_s2)

## 2. Working with xarray

Resources from https://github.com/csiro-easi/eocsi-hackathon-2022/blob/main/01-welcome-to-easi.ipynb

Blog article on Xarray: https://towardsdatascience.com/basic-data-structures-of-xarray-80bab8094efa

Xarray documentation: http://xarray.pydata.org/en/stable/user-guide/data-structures.html

### Data structure

>Xarray allows us to work with **labeled multi-dimensional arrays**


A `Dataset` can be seen as a dictionary structure packing up the data, dimensions and attributes. Variables in a `Dataset` object are called `DataArrays` and they share dimensions with the higher level `Dataset`. 


<img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" alt="drawing"/>


See also the terminology: https://docs.xarray.dev/en/stable/user-guide/terminology.html

* Data variables are stored as numpy or dask arrays
* Labels take the form of dimensions, coordinates and attributes
* xarray uses matplotlib for plotting
* The ODC API (`dc.load()`, i.e. `datacube.Datacube.load()`) loads data into an xarray `Dataset` with added geospatial information
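
To make these points concrete, here is a minimal sketch that builds a small `Dataset` by hand. All names and values are synthetic and only mimic the layout of data returned by the ODC API.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2022-01-01", periods=2, freq="D")

ds = xr.Dataset(
    data_vars={
        # each variable is given as (dimension names, array of values)
        "red": (("time", "y", "x"), np.arange(12, dtype="float32").reshape(2, 2, 3)),
        "green": (("time", "y", "x"), np.ones((2, 2, 3), dtype="float32")),
    },
    coords={"time": times, "y": [10.0, 0.0], "x": [0.0, 10.0, 20.0]},
    attrs={"title": "synthetic example"},
)

print(dict(ds.sizes))                       # {'time': 2, 'y': 2, 'x': 3}
print(type(ds["red"]))                      # each variable is a DataArray
print(type(ds["red"].data))                 # backed here by a numpy array
print(ds["red"].dims == ds["green"].dims)   # variables share the Dataset dimensions
```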

See also an intro notebook (including how to construct an xarray dataset): https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Beginners_guide/08_Intro_to_xarray.ipynb

And a more advanced notebook: https://rabernat.github.io/research_computing/xarray.html

In [None]:
print(ds_s2)

In [None]:
print(type(ds_s2.red.data))

In [None]:
print(ds_s2.time)

In [None]:
print(ds_s2.attrs)

In [None]:
print(ds_s2.crs)

In [None]:
print(ds_s2.geobox)

### Indexing and selecting

In [None]:
ds_s2.isel(time=0)

In [None]:
ds_s2.sel(time='2022-04')

In [None]:
ds_s2.isel(time=(ds_s2.time > np.datetime64('2022-03-01')))
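
The three selection styles above can be compared on a small synthetic time series (hypothetical dates and values). `.isel()` selects by integer position or boolean mask, while `.sel()` selects by coordinate label.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2022-02-10", "2022-03-15", "2022-04-05", "2022-04-20"])
da = xr.DataArray(np.arange(4.0), dims="time", coords={"time": times}, name="green")

# By integer position
first = da.isel(time=0)

# By label: a partial date string selects every time step in that month
april = da.sel(time="2022-04")

# By boolean mask built from the time coordinate
after_march = da.isel(time=(da.time > np.datetime64("2022-03-01")).values)

# By label with nearest-neighbour lookup when there is no exact match
nearest = da.sel(time="2022-03-01", method="nearest")

print(float(first))               # 0.0
print(april.sizes["time"])        # 2
print(after_march.sizes["time"])  # 3
print(float(nearest))             # 1.0 (2022-03-15 is the closest date)
```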

### Xarray calculations (reduction)

In [None]:
ds_s2.mean(dim="time")[["red","green","blue"]].to_array().plot.imshow(robust=True);

In [None]:
ds_s2.median(dim="time")[["red","green","blue"]].to_array().plot.imshow(robust=True);

In [None]:
ds_s2.mean(dim=["x","y"])

In [None]:
ds_s2.mean(dim=["x","y"]).green.plot();
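
A reduction removes the dimensions it is applied over. The sketch below, on synthetic values, shows which dimensions remain after each reduction and that missing values (for example from masking with `.where()`) are skipped by default for floating-point data.

```python
import numpy as np
import xarray as xr

# Synthetic data: 2 time steps of 3 x 4 pixels
da = xr.DataArray(
    np.arange(24, dtype="float64").reshape(2, 3, 4),
    dims=("time", "y", "x"),
)

per_pixel = da.mean(dim="time")      # one value per pixel, dims (y, x)
per_time = da.mean(dim=["x", "y"])   # one value per time step, dims (time,)

# Masked pixels become NaN and are ignored by the mean
masked = da.where(da > 0)
masked_mean = masked.mean(dim=["x", "y"])

print(per_pixel.dims)          # ('y', 'x')
print(per_time.values)         # [ 5.5 17.5]
print(float(masked_mean[0]))   # 6.0, the mean of 1..11
```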

### Timeseries

In [None]:
ds_s2.resample(time='2W').nearest().mean(dim=["x","y"]).green.plot();

In [None]:
ds_s2.rolling(time=2, min_periods=1).mean().mean(dim=["x","y"]).green.plot();
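
To see what resampling and rolling windows do to the values, here is a sketch on a short synthetic series. Note that it aggregates each resampling bin with `.mean()`, whereas the cell above uses `.nearest()` to pick the closest observation for each new time step.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Six synthetic observations, 15 days apart:
# Jan 1, Jan 16, Jan 31, Feb 15, Mar 2, Mar 17
times = pd.date_range("2022-01-01", periods=6, freq="15D")
da = xr.DataArray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dims="time", coords={"time": times})

# Rolling mean over the current and previous observation;
# min_periods=1 keeps the first value instead of returning NaN
rolled = da.rolling(time=2, min_periods=1).mean()

# Monthly mean, with each bin labelled by the start of the month ("MS")
monthly = da.resample(time="MS").mean()

print(rolled.values)    # [1.  1.5 2.5 3.5 4.5 5.5]
print(monthly.values)   # [2.  4.  5.5]
```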

### Xarray and Pandas

In [None]:
df = ds_s2.mean(dim=["x","y"]).green.to_dataframe()

In [None]:
type(df)

In [None]:
df.to_csv('test.csv')
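
The conversion also works in the other direction. The sketch below uses a synthetic series and an in-memory buffer instead of a file: it writes a `DataArray` to CSV via pandas, reads it back, and converts it to xarray again.

```python
import io

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2022-01-01", periods=3, freq="D")
da = xr.DataArray([0.1, 0.2, 0.3], dims="time", coords={"time": times}, name="green")

# The DataArray name becomes the column, the time coordinate becomes the index
df = da.to_dataframe()

# Write to CSV (an in-memory buffer stands in for a file on disk)
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read back with pandas and convert to xarray
df_back = pd.read_csv(buffer, index_col="time", parse_dates=True)
da_back = df_back["green"].to_xarray()

print(list(df.columns))   # ['green']
print(da_back.dims)       # ('time',)
```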

In [None]:
dc.list_products()

#type(dc.list_products())

#### Learn more about pandas and geopandas

pandas: https://pandas.pydata.org/docs/user_guide/10min.html

geopandas: https://geopandas.org/en/stable/docs/user_guide.html

## Practice now

### Pick a dataset you are interested in. 

If unsure, try Sentinel-2 for where you live or recently visited. If you have used Sentinel-2 through EASI or ODC, try another dataset.

### Explore loading the data and plotting.

### Try xarray operations

e.g.
* Select a timestamp to plot. Try using `.isel()` and `.sel()`.
* Calculate mean values over time for each pixel and plot the result.
* Try a different calculation (e.g. sum, median), or apply the calculation over a different dimension, and plot the results.
* Resample the data to a monthly (or daily, quarterly) frequency and plot monthly mean values as a line plot
* Save the result


### Think about

* What did you try to achieve, and what have you accomplished or learned?
* What type of data did you access? Why? E.g. what does this data measure?
* What else would you like to do with this dataset?