# Data Preprocessing

This notebook handles the preprocessing of raw data that is used for training and validating models.

## Preprocess Sentinel-1 Data

Query the Sentinel-1 data buckets:
 - Query S3 storage for the general area and time of interest
 - Clip to the regions of interest (wetlands)
 - Clamp values to reduce noise; data investigation determined a range of [0, 200]
 - Coarsen the raster data so our models don't get too big, since no GPUs are available at the moment
 - Save as int16, which is sufficient precision because the raw data is stored in int16 as well and we don't add information
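The clamp, coarsen and int16 steps above can be sketched on a plain NumPy array. This is only an illustration under assumptions: `reduce_raster` is a hypothetical helper, the input is synthetic random data, and the real `rattlinbog` transforms may handle edges, missing values and rounding differently.

```python
import numpy as np


def reduce_raster(raster: np.ndarray, vmin: float = 0, vmax: float = 200,
                  stride: int = 10) -> np.ndarray:
    """Clamp, block-average and round a 2D raster (illustrative helper, not rattlinbog API)."""
    # Clamp values into [vmin, vmax] to suppress outliers.
    clamped = np.clip(raster, vmin, vmax)
    # Trim trailing rows/columns so both dimensions are divisible by the stride.
    rows = (clamped.shape[0] // stride) * stride
    cols = (clamped.shape[1] // stride) * stride
    trimmed = clamped[:rows, :cols]
    # Average each stride x stride block of pixels.
    blocks = trimmed.reshape(rows // stride, stride, cols // stride, stride)
    coarse = blocks.mean(axis=(1, 3))
    # Round to the nearest integer and store as int16.
    return np.round(coarse).astype(np.int16)


# Synthetic stand-in for a Sentinel-1 tile, including out-of-range values.
rng = np.random.default_rng(seed=0)
raw = rng.integers(-50, 400, size=(45, 32)).astype(np.int16)

reduced = reduce_raster(raw)
print(reduced.shape, reduced.dtype, reduced.min(), reduced.max())
```

Because the clamped values lie in [0, 200], every block average also lies in that range, so the result fits comfortably into int16 and rounding changes each value by at most 0.5.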

### Load S1-Datasets and Ramsar Geometry

Load the S1 datasets for the time and region of interest, as well as the Ramsar shapefiles containing the geometry of the wetlands.

**NOTE:** We use VH for now, as the reduced noise should help with training machine learning models.

In [1]:
from pathlib import Path

from rattlinbog.loaders import load_s1_datasets_from_file_list, load_rois, DATE_FORMAT

S1_2021_AT_FILE_LIST = Path("/shared/sentinel-1/paths-west-AT-2021.txt")
RAMSAR_SHAPE_FILE = Path("/shared/ramsar/RAMSAR_AT_01.shp")

vh_datasets = load_s1_datasets_from_file_list(S1_2021_AT_FILE_LIST, bands={'VH'})
ramsar_rois = load_rois(RAMSAR_SHAPE_FILE)

### Sentinel-1 data reduction

To apply our models efficiently, we want to reduce the data:
- Clip to our specific Ramsar regions of interest
- Clamp data to reduce noise
- Coarsen the resolution, which speeds up training of machine learning models and reduces noise
- Round to int16, as this precision should be sufficient
- Stream to shared disk for further usage
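Streaming here means working through the data in bounded batches instead of loading a whole year at once. A minimal sketch of sorting by time and splitting into fixed-size batches follows; `batched` is a hypothetical helper, the dates are made up, and how `rattlinbog` actually chunks its groups is an assumption not confirmed by this notebook.

```python
from datetime import date
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (illustrative helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical acquisition dates standing in for real datasets.
acquisitions = [date(2021, 3, 1), date(2021, 1, 15), date(2021, 2, 7),
                date(2021, 1, 3), date(2021, 2, 20)]

# Sort by time first, then split into batches of two.
batches = list(batched(sorted(acquisitions), size=2))
for batch in batches:
    # A real pipeline would clip, reduce and write this batch to disk here.
    print([d.isoformat() for d in batch])
```

Sorting before batching keeps each batch contiguous in time, so every written file covers one unbroken date range.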

In [2]:
from xarray import Dataset
from rattlinbog.data_group import group_datasets, GroupByRois
from rattlinbog.transforms import (Compose, ClipRoi, ConcatTimeSeries, ClipValues, CoarsenAvgSpatially, RoundToInt16,
                                   StoreAsNetCDF, NameDatasets, EatMyData, SortByTime, ChunkGroup)

S1_2021_100m_OUT = Path("/shared/sentinel-1/roi/100m")


def ds_namer(ds: Dataset) -> str:
    from_ts = ds.attrs['from_ts'].strftime(DATE_FORMAT)
    to_ts = ds.attrs['to_ts'].strftime(DATE_FORMAT)
    roi_name = ds.attrs['roi'].name.replace(' ', '_')
    return f"{roi_name}_{from_ts}_to_{to_ts}"


group = group_datasets(vh_datasets, by_rule=GroupByRois(ramsar_rois))
chunk_pipeline = Compose([SortByTime(), ChunkGroup(16)])
chunked_groups = chunk_pipeline(group)

stream_roi_pipeline = Compose([ClipRoi(),
                               ClipValues(vmin=0, vmax=200),
                               CoarsenAvgSpatially(stride=10),
                               ConcatTimeSeries(),
                               RoundToInt16(),
                               NameDatasets(ds_namer),
                               StoreAsNetCDF(S1_2021_100m_OUT),
                               EatMyData()])

for grp in chunked_groups:
    stream_roi_pipeline(grp)

print("transform successful")

transform successful


In [None]:
print("transform successful")
group