## Extracting pre-processed inputs

This notebook demonstrates how to extract, for one or multiple small patches, the pre-processed satellite time series required to run a cropland or crop type model locally on your machine.

### Why?

Having a set of small patches available comes in handy when developing a custom crop type model. It lets you quickly test different model set-ups: each time you have trained a new model, you can immediately apply it to the same set of patches and check for improvements. Because you do not have to deploy and run the model on CDSE, the time required to get to your ideal crop model drops drastically!

### How does it work?

All you need to specify is:
- the geometry of one or multiple small patches (< 20 x 20 km)
- start and end date of the time series

The notebook will then launch, for each of the specified geometries, an openEO processing job on the Copernicus Data Space Ecosystem (CDSE). Each job extracts all relevant Sentinel-1, Sentinel-2, meteorological and digital elevation data used by the WorldCereal classification algorithms to predict cropland and crop types.

<div class="alert alert-block alert-warning">
<b>PREREQUISITE:</b> <br>
This means you need a <a href="https://dataspace.copernicus.eu/" target="_blank">CDSE account</a> in order to proceed!
</div>

### Step 1: specify your area(s) of interest

**Option 1: draw a small patch on the map**

Use the rectangle button in the interactive widget below to draw a small area of interest.

In [None]:
from worldcereal.utils.map import ui_map

map = ui_map(area_limit=400)  # area limit in km2

Now save your area of interest for future reference. You will be asked to provide a short descriptive name.

In [None]:
from notebook_utils.production import bbox_extent_to_gdf
from pathlib import Path

bbox_name = input('Enter the name for the output bbox file (without extension): ')
patches_file = Path(f'./bbox/{bbox_name}.gpkg')
processing_extent = map.get_extent(projection='latlon')
bbox_extent_to_gdf(processing_extent, patches_file)
id_source_attribute = None  # Not needed for drawn bbox

**Option 2: provide the path to a vector file (.shp/.gpkg/.geoparquet)**

The file should contain one or multiple small polygons defining your areas of interest. Besides the geometry, the file should contain an attribute holding a unique identifier for each geometry. Pass the name of this attribute to `id_source_attribute`.

In [None]:
from pathlib import Path

patches_file = Path(...)
id_source_attribute = ...  # e.g. 'id' or 'patch_id'
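
Before launching any jobs, it can help to check the two requirements stated above: every geometry has a unique identifier, and every patch stays below 20 x 20 km (400 km2). The block below is a minimal sketch using only the standard library. The helper names `bbox_area_km2` and `check_patches` are hypothetical and not part of WorldCereal. It assumes bounding boxes in lat/lon degrees and approximates areas on a sphere.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
MAX_AREA_KM2 = 400.0  # 20 x 20 km, the patch size limit mentioned earlier


def bbox_area_km2(west, south, east, north):
    """Approximate area of a lat/lon bounding box on a sphere, in km2."""
    width_rad = math.radians(east - west)
    band = math.sin(math.radians(north)) - math.sin(math.radians(south))
    return EARTH_RADIUS_KM ** 2 * width_rad * band


def check_patches(patches, max_area_km2=MAX_AREA_KM2):
    """Return a list of human-readable problems; an empty list means OK.

    `patches` is a list of (patch_id, (west, south, east, north)) tuples.
    """
    problems = []
    counts = Counter(patch_id for patch_id, _ in patches)
    for patch_id, n in counts.items():
        if n > 1:
            problems.append(f"id {patch_id!r} occurs {n} times")
    for patch_id, bounds in patches:
        area = bbox_area_km2(*bounds)
        if area > max_area_km2:
            problems.append(f"patch {patch_id!r} covers about {area:.0f} km2")
    return problems


# Made-up example patches, for illustration only
example = [
    ("a", (5.00, 51.00, 5.10, 51.10)),  # roughly 7 x 11 km
    ("a", (5.20, 51.00, 5.30, 51.10)),  # duplicate id
    ("b", (5.00, 51.00, 5.50, 51.50)),  # roughly 35 x 56 km, too large
]
for problem in check_patches(example):
    print(problem)
```

If you load your file with geopandas, the identifiers would come from the column named by `id_source_attribute` and the bounds from `gdf.to_crs(epsg=4326).bounds`. The bounding box is only an upper bound on the area of an irregular polygon.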

### Step 2: Select your processing period

Keep in mind that WorldCereal models always use a processing period of 12 months.<br>
Use the slider below to define your processing period.

In [None]:
from notebook_utils.dateslider import date_slider

processing_slider = date_slider()
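
To make the 12-month rule concrete, the sketch below derives an end date from a chosen start date. It assumes that the period is aligned to whole calendar months. This is an illustration only: how `date_slider` itself computes the period is not described here, and `twelve_month_period` is a hypothetical helper.

```python
import calendar
from datetime import date


def twelve_month_period(start):
    """Return (start, end) covering 12 full calendar months.

    `start` is snapped to the first day of its month; `end` is the last
    day of the month eleven months later.
    """
    start = start.replace(day=1)
    months_ahead = start.month - 1 + 11  # zero-based month index of the end
    end_year = start.year + months_ahead // 12
    end_month = months_ahead % 12 + 1
    last_day = calendar.monthrange(end_year, end_month)[1]
    return start, date(end_year, end_month, last_day)


start, end = twelve_month_period(date(2020, 11, 15))
print(start.isoformat(), end.isoformat())  # 2020-11-01 2021-10-31
```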

### Step 3: Launch the processing job(s)

You will be asked to provide a descriptive name for the output directory.<br>
Results will be automatically saved in a folder `./preprocessed_inputs/<your_name>`.

If desired, you can also specify a preferred orbit state for the Sentinel-1 data. If not provided, the orbit state will be automatically determined based on the availability of data.

Extracting inputs for a small area takes around 10 minutes. Hang in there!

In [None]:
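from pathlib import Path

from notebook_utils.preprocessed_inputs import collect_worldcereal_inputs_patches

# Start and end date selected with the slider in Step 2
processing_period = processing_slider.get_selected_dates()

# Results are written to ./preprocessed_inputs/<name_output>
name_output = input('Enter name for the output directory: ')
outdir = Path(f'./preprocessed_inputs/{name_output}')

# Preferred Sentinel-1 orbit state; None lets it be determined automatically
s1_orbit_state = None  # or 'ASCENDING' / 'DESCENDING'

# Launch one processing job per patch
collect_worldcereal_inputs_patches(
    patches_file,
    outdir,
    processing_period.start_date,
    processing_period.end_date,
    id_source=id_source_attribute,
    s1_orbit_state=s1_orbit_state,
)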

### Step 4: Check results!

Let's first have a look at the status of your processing job(s).

In [None]:
import pandas as pd
job_status_file = outdir / 'job_tracking.csv'
job_status = pd.read_csv(job_status_file)
print(job_status['status'].value_counts())
job_status.head()
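
If some jobs did not complete, it helps to list exactly which rows need attention. The sketch below does this with the standard `csv` module on a made-up table. Only the `status` column is taken from the cell above. The `patch` column and the status values `finished` and `error` are assumptions for illustration, and `summarize_jobs` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def summarize_jobs(csv_file, done_status="finished"):
    """Count jobs per status and collect the rows that are not done."""
    rows = list(csv.DictReader(csv_file))
    counts = Counter(row["status"] for row in rows)
    pending = [row for row in rows if row["status"] != done_status]
    return counts, pending


# Made-up tracking table, for illustration only
sample = io.StringIO(
    "patch,status\n"
    "p1,finished\n"
    "p2,error\n"
    "p3,finished\n"
)
counts, pending = summarize_jobs(sample)
print(dict(counts))
print([row["patch"] for row in pending])
```

For the real tracking file you would pass an open file handle instead, for example `open(job_status_file, newline="")`.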

As a result of each processing job, you should have received one NetCDF file.<br>
Let's inspect the first one:

In [None]:
import glob
import xarray as xr

outfiles = sorted(glob.glob(str(outdir / '*' / 'preprocessed-inputs_*.nc')))
outfile = outfiles[0]
ds = xr.open_dataset(outfile)
ds
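
To verify that every job delivered its NetCDF file, you can group the outputs per subfolder. The sketch below assumes one subfolder per job inside the output directory, which is inferred from the glob pattern used above rather than documented. `outputs_per_job` is a hypothetical helper, and the demonstration runs on a throwaway directory with made-up names.

```python
import tempfile
from pathlib import Path


def outputs_per_job(outdir, pattern="preprocessed-inputs_*.nc"):
    """Map each subfolder of `outdir` to the NetCDF files found inside it."""
    return {
        sub.name: sorted(p.name for p in sub.glob(pattern))
        for sub in sorted(Path(outdir).iterdir())
        if sub.is_dir()
    }


# Demonstration on a temporary directory with made-up folder and file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "patch_a").mkdir()
    (root / "patch_a" / "preprocessed-inputs_patch_a.nc").touch()
    (root / "patch_b").mkdir()  # a job that produced no file
    found = outputs_per_job(root)

missing = [name for name, files in found.items() if not files]
print(found)
print("folders without output:", missing)
```

In the notebook you would call `outputs_per_job(outdir)` and inspect any folder that comes back with an empty list.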

We can do a quick quality check on the extracted data:

In [None]:
from notebook_utils.preprocessed_inputs import get_band_statistics_netcdf, visualize_timeseries_netcdf

stats = get_band_statistics_netcdf(ds)
visualize_timeseries_netcdf(ds, band="NDVI", npixels=6, random_seed=42)

All done!