# General functioning

The algorithm is divided into two main parts:

1. data_preparation.py -> Selecting the gaps: cut the requested temporal and spatial extent of a given variable out of an xcube data cube and, if requested, create artificial gaps.
2. gapfilling.py -> Filling in the gaps: estimate the values in these gaps by creating a model for each gap pixel.

Each time 'EarthSystemDataCubeS3(ds_name, variable, dimensions, artificial_gaps, actual_matrix).get_data()' is executed, a new directory is created in 'application_results/'.
Inside this directory the class "GapDataset" creates two subdirectories.
'History/' stores every array from the specified time period and area, holding the values of the specified variable for each pixel, with one exception.
The exception is the actual array, selected randomly or by date, in which the artificial gaps are created.
If artificial gaps are requested, the arrays with the artificial gaps are stored in 'GapImitation/', one array per gap size.
Two further arrays are stored directly in the dataset directory 'application_results/.../'.
The first is the actual array mentioned above; the second holds the extra data, e.g. the land cover class of each pixel in the array.
The extra array can be used as a predictor configuration for the gap filling process.
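
The directory layout described above can be pictured with a small sketch. It only recreates the directory names mentioned in the text inside a temporary folder; the helper `expected_layout` is hypothetical and not part of mltools, and the file names inside the directories are not shown because the text does not state them.

```python
from pathlib import Path
import tempfile


def expected_layout(root: Path, ds_name: str, with_artificial_gaps: bool) -> list[Path]:
    """Hypothetical helper: directories the text describes after data preparation."""
    base = root / "application_results" / ds_name
    dirs = [base, base / "History"]
    if with_artificial_gaps:
        # Only filled when a list of artificial gap sizes is passed.
        dirs.append(base / "GapImitation")
    return dirs


with tempfile.TemporaryDirectory() as tmp:
    for directory in expected_layout(Path(tmp), "Test123", with_artificial_gaps=True):
        directory.mkdir(parents=True, exist_ok=True)
    dataset_dir = Path(tmp) / "application_results" / "Test123"
    created = sorted(p.name for p in dataset_dir.iterdir())

print(created)  # ['GapImitation', 'History']
```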

In the second step, executing 'SupportVectorRegressionGapfill(ds_name, hyperparameters, predictor).gapfill()' creates the subdirectory 'Results/'.
The values in each gap are estimated.
Two error measures are calculated: the mean absolute difference between the estimated and the actual values (only available for artificial gaps) and the mean absolute error based on cross validation.
The filled arrays are stored in the 'Results/' subdirectory.
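
As an illustration of the first error measure, the sketch below computes a mean absolute error only over pixels that were removed artificially, since those are the only ones with a known true value. The function name `mae_on_gaps` and the toy arrays are assumptions made for this example, not the mltools implementation.

```python
import numpy as np


def mae_on_gaps(original, with_gaps, filled):
    """Mean absolute error over pixels that were known originally but masked artificially."""
    # Real gaps (NaN in the original) have no reference value and are excluded.
    gap_mask = ~np.isnan(original) & np.isnan(with_gaps)
    return float(np.mean(np.abs(filled[gap_mask] - original[gap_mask])))


original = np.array([[280.0, 281.0], [np.nan, 283.0]])  # one real gap at [1, 0]
with_gaps = original.copy()
with_gaps[0, 1] = np.nan  # artificial gap
with_gaps[1, 1] = np.nan  # artificial gap

filled = with_gaps.copy()
filled[0, 1] = 282.0  # estimate is off by 1
filled[1, 1] = 280.0  # estimate is off by 3
filled[1, 0] = 279.0  # real gap: filled, but not scored

mae_actual = mae_on_gaps(original, with_gaps, filled)
print(mae_actual)  # 2.0
```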


## Data

The data currently used is an Earth System Data Cube opened with [xcube](https://xcube.readthedocs.io/en/latest/installation.html).
The gapfilling algorithm currently works for the Earth System Data Cube, but it can be adapted to other gapfilling use cases by inheriting from the GapDataset class.
For more information about xcube, follow the link.

## Parameters

### GapDataset class

#### Subclasses
- a subclass needs to be selected to perform the data preparation algorithm -> it selects the origin of the dataset; other data cubes can be added as inherited classes as well
- EarthSystemDataCubeS3

#### ds_name
- specify the name of the dataset -> a new directory with this name will be created in 'application_results/' - if a directory with this name already exists, it will be overwritten
- DEFAULT: 'Test123'
- other options: 
    - free choice - no restriction for the naming convention

#### variable
- variable that will be estimated. More possible variables from the Earth System Data Cube can be found [here](https://deepesdl.readthedocs.io/en/latest/datasets/ESDC/#variable-list)
- DEFAULT: 'land_surface_temperature'
- other options:
    - 'air_temperature_2m'

#### dimensions
- dimensions of the data cube that will be sliced e.g. lat, lon, times
- DEFAULT: dimensions = {'lat': (54, 48), 'lon': (6, 15), 'times': (datetime.date(2008, 11, 1), datetime.date(2008, 12, 31))}
- other options:
    - free choice - global range: lat = (90, -90), lon = (-180, 180), times: total range from 1979-2018, but most values are recorded from 2002-2011
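
A quick way to anticipate the size of the sliced arrays is to divide the lat/lon extent by the grid resolution. The sketch assumes a resolution of 1/12 degree, which is how the "0.083deg" in the cube name used in the examples is read here; `cells_along` is a hypothetical helper.

```python
def cells_along(axis_range, resolution_deg=1 / 12):
    """Number of grid cells covered by a (start, stop) range in degrees."""
    start, stop = axis_range
    return round(abs(stop - start) / resolution_deg)


# Default spatial extent from the parameter description above.
dimensions = {'lat': (54, 48), 'lon': (6, 15)}
shape = {name: cells_along(axis_range) for name, axis_range in dimensions.items()}
print(shape)  # {'lat': 72, 'lon': 108}
```

With the default extent this gives 72 x 108 pixels per array, which is the shape reported in the example output further down.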

#### artificial_gaps
- list of artificial gap sizes to create -> if a list is given, the gapfilling algorithm runs on the artificial gaps; if this parameter is None, it estimates the real gaps in the array
- DEFAULT: None
- other options:
    - free choice - each element is a fraction between 0 and 1, e.g. [0.001, 0.01, 0.1, 0.25, 0.5, 0.75]
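
To make the gap sizes concrete, here is a minimal sketch of how artificial gaps of a given fraction could be punched into an array. It assumes the fraction refers to the total number of pixels (rounded down) and that a gap size is rejected when there are not enough known values left to remove; both assumptions agree with the pixel counts and exception messages in the example output below, but `imitate_gaps` is not the mltools function.

```python
import numpy as np


def imitate_gaps(matrix, fraction, seed=0):
    """Return a copy of matrix with `fraction` of all pixels set to NaN (assumed behaviour)."""
    rng = np.random.default_rng(seed)
    n_gaps = int(fraction * matrix.size)
    known = np.flatnonzero(~np.isnan(matrix))  # flat indices of pixels that hold a value
    if n_gaps > known.size:
        raise ValueError(f"gap size {fraction * 100} % -> not enough non-NaN values")
    chosen = rng.choice(known, size=n_gaps, replace=False)
    gapped = matrix.copy()
    gapped.flat[chosen] = np.nan
    return gapped


matrix = np.full((72, 108), 285.0)  # toy array with the shape from the examples
for fraction in (0.001, 0.05, 0.25):
    gapped = imitate_gaps(matrix, fraction)
    print(fraction, int(np.isnan(gapped).sum()))  # 7, 388 and 1944 pixels

sparse = matrix.copy()
sparse.flat[:4899] = np.nan  # about 63 % real gaps
try:
    imitate_gaps(sparse, 0.5)
except ValueError as err:
    print(err)
```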

#### actual_matrix
- specify whether the actual array, i.e. the array whose gaps are filled, is chosen randomly or taken from a specific date
- options:
    - 'Random'
    - datetime.date of a specific array, e.g. datetime.date(2008, 12, 14)

### Gapfiller class

#### Subclasses
- a subclass needs to be selected to perform the gapfilling algorithm - other learning algorithms such as Random Forest or other regressions can be added as inherited classes as well
- SupportVectorRegressionGapfill

#### ds_name
- same as the name for the GapDataset class -> otherwise no directory will be found and the gapfilling algorithm cannot perform
- DEFAULT: 'Test123'
- other options: 
    - free choice - but it should have the name of an existing directory in 'application_results/'

#### hyperparameters
- strategies for configuring hyperparameters
- DEFAULT: 'RandomGridSearch' - random grid search
- other options:
    - 'FullGridSearch' - full grid search
    - 'Custom' - custom settings according to the scikit-learn syntax that can be changed in the 'learning_function'-method - current settings: **params = {'kernel': 'linear', 'gamma': 'scale', 'C': 1000, 'epsilon': 1}
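
The difference between the 'Custom' and 'RandomGridSearch' strategies can be sketched with scikit-learn directly. The example assumes that, for one gap pixel, each training array contributes one sample whose features are the values of the predictor pixels and whose target is the value of the gap pixel; the toy data, the search grid and the number of sampled combinations are assumptions, only the 'Custom' settings are taken from the text.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_arrays, n_predictors = 45, 10  # toy sizes: 45 training arrays, 10 predictor pixels
X = rng.normal(0.0, 5.0, size=(n_arrays, n_predictors))   # predictor pixel values per date
y = X.mean(axis=1) + rng.normal(0.0, 0.5, size=n_arrays)  # gap pixel values per date

# 'Custom': the fixed settings quoted above, scored by cross-validated MAE.
custom = SVR(kernel='linear', gamma='scale', C=1000, epsilon=1)
scores = cross_val_score(custom, X, y, cv=5, scoring='neg_mean_absolute_error')
cv_mae = -scores.mean()

# 'RandomGridSearch': try a few randomly sampled combinations (grid values are assumptions).
grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 100, 1000], 'epsilon': [0.1, 0.5, 1]}
search = RandomizedSearchCV(SVR(gamma='scale'), grid, n_iter=5, cv=5,
                            scoring='neg_mean_absolute_error', random_state=0)
search.fit(X, y)

print(round(cv_mae, 3), search.best_params_, round(-search.best_score_, 3))
```

A full grid search would evaluate all 24 combinations of this grid instead of 5, which is where the runtime difference between the two strategies comes from.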

#### predictor
- strategies for selecting predictors
- DEFAULT: 'RandomPoints' - 100 randomly selected points in the array - if fewer than 100 points have values, all non-gap values are used as predictors; if fewer than 50 pixels have known values, interpolation is used to estimate the gaps
- other options:
    - 'AllPoints' - all known values -> runtime can be very long
    - 'LCC' - the 40 closest pixels from the same land cover class (e.g. mixed forest) as the gap are used as predictors - if there are fewer than 40 pixels from the same land cover class, the strategy changes to 'RandomPoints'
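
The predictor strategies can be sketched as two small selection functions. This is an illustration under assumptions: distance is measured as Euclidean distance in pixel indices, the function names are hypothetical, and the interpolation fallback for arrays with fewer than 50 known pixels is left out.

```python
import numpy as np


def random_points(matrix, n_points=100, seed=0):
    """Indices of up to n_points known pixels; all known pixels if fewer exist."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(matrix))
    if len(known) <= n_points:
        return known
    return known[rng.choice(len(known), size=n_points, replace=False)]


def lcc_points(matrix, lcc, gap, n_points=40):
    """The n_points known pixels of the gap's land cover class closest to the gap, else None."""
    same_class = np.argwhere(~np.isnan(matrix) & (lcc == lcc[gap]))
    if len(same_class) < n_points:
        return None  # caller falls back to random_points
    distance = np.linalg.norm(same_class - np.array(gap), axis=1)
    return same_class[np.argsort(distance)[:n_points]]


rng = np.random.default_rng(1)
matrix = rng.normal(285.0, 3.0, size=(20, 20))
gap = (10, 10)
matrix[gap] = np.nan
lcc = np.ones((20, 20), dtype=int)
lcc[:, 10:] = 2  # right half belongs to a second land cover class

predictors = lcc_points(matrix, lcc, gap)
if predictors is None:
    predictors = random_points(matrix)
print(predictors.shape)  # (40, 2)
```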

## Observations and recommendations (so far)

In terms of runtime and accuracy, 'RandomGridSearch' works best for configuring hyperparameters and 'LCC' for selecting predictors; 'RandomPoints', or categorical variables other than the land cover class, might also be suitable for estimating the gaps, depending on the use case.
Since a model is created for each gap pixel, the runtime increases linearly with the number of gaps, while the accuracy deteriorates only slightly.
For the number of training samples, no more than 40-100 arrays need to be cut from the respective area. Since the data cube holds one recording every 8 days and the runtime increases with the number of training samples, a time period of 1-2 years could be sufficient.
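
The link between the time period and the number of training arrays is simple arithmetic. The sketch assumes that the 8-day steps restart at the beginning of each year, so a year always holds 46 arrays; `arrays_per_year` is a hypothetical helper.

```python
import math


def arrays_per_year(days_in_year, step_days=8):
    """Number of 8-day recordings in one year, assuming the steps restart every year."""
    return math.ceil(days_in_year / step_days)


one_year = arrays_per_year(366)               # 2008 is a leap year
two_years = one_year + arrays_per_year(365)   # 2008 plus 2009
print(one_year, two_years)  # 46 92
```

Both values fall into the 40-100 range recommended above.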

## Example with added artificial gaps

In [1]:
import datetime
from xcube.core.store import new_data_store

# add path, if mltools not installed
import sys
sys.path.append('../mltools')

from mltools.gap_dataset import EarthSystemDataCubeS3
from mltools.gap_filling import SupportVectorRegressionGapfill


# Directory name
ds_name = 'GermanyNB_artificial_gaps'
# Variable that will be estimated e.g. 'land_surface_temperature' or 'air_temperature_2m'
variable = 'land_surface_temperature'
# Dimension values of the data cube: latitude and longitude of the area, and the time range.
# Global range: lat = (90, -90), lon = (-180, 180)
dimensions = {
    'lat': (54, 48),
    'lon': (6, 15),
    'times': (datetime.date(2008, 1, 1), datetime.date(2008, 12, 31))
}
# List of artificial gaps that will be created
# if no artificial gaps should be created, the gapfilling algorithms will perform on real gaps
# options: None or list of artificial gap sizes e.g. [0.001, 0.01, 0.1, 0.25, 0.5, 0.75]
artificial_gaps = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.25, 0.5, 0.75]
# Specify whether the actual matrix is chosen randomly or taken from a specific file.
# Options: 'Random' or datetime.date of the specific file e.g. datetime.date(2008, 12, 14)
actual_matrix = 'Random'

data_store = new_data_store("s3", root="esdl-esdc-v2.1.1", storage_options=dict(anon=True))
dataset = data_store.open_data('esdc-8d-0.083deg-184x270x270-2.1.1.zarr')
# Select the variable of interest from the dataset
ds = dataset[variable]

EarthSystemDataCubeS3(ds, ds_name, dimensions, artificial_gaps, actual_matrix).get_data()

GermanyNB_artificial_gaps {'time': 46, 'lat': 72, 'lon': 108}
No of files: 46
1 / 46
...
46 / 46
Structure {'complete': 0, 'empty': 0, 'gaps': 46}
date: 2008-04-10
real gap size:  63 %
Exception: gap size 50.0 % -> contains not enough non-NaN values. No array with imitated gaps was created.
Exception: gap size 75.0 % -> contains not enough non-NaN values. No array with imitated gaps was created.
7 arrays with gaps were created!
runtime: 41.09


In [2]:
# Directory name based on the input name in 'GapDataset'
ds_name = 'GermanyNB_artificial_gaps'
# Choose hyperparameter settings. Options: 'RandomGridSearch' | 'FullGridSearch' | 'Custom'
hyperparameters = "RandomGridSearch"
# Choose the predictor type. Options: 'AllPoints' | 'LCC' | 'RandomPoints'
predictor = "LCC"
# Create an instance of the SupportVectorRegressionGapfill class with the specified settings
SVR_Gapfiller = SupportVectorRegressionGapfill(ds_name=ds_name, hyperparameters=hyperparameters, predictor=predictor)
# Perform the gap filling using the chosen settings
SVR_Gapfiller.gapfill()

date: 2008-04-10 
gap size: 0.1 % -> 7 pixel 
training pictures: 45
MAE actual: 1.65
MAE cross validation: 1.121
runtime: 0.15 seconds 

date: 2008-04-10 
gap size: 0.5 % -> 38 pixel 
training pictures: 45
MAE actual: 1.503
MAE cross validation: 0.97
runtime: 0.65 seconds 

date: 2008-04-10 
gap size: 1.0 % -> 77 pixel 
training pictures: 45
MAE actual: 1.612
MAE cross validation: 0.99
runtime: 1.38 seconds 

date: 2008-04-10 
gap size: 5.0 % -> 388 pixel 
training pictures: 45
MAE actual: 1.486
MAE cross validation: 1.08
runtime: 6.60 seconds 

date: 2008-04-10 
gap size: 10.0 % -> 777 pixel 
training pictures: 45
MAE actual: 1.586
MAE cross validation: 1.152
runtime: 13.52 seconds 

date: 2008-04-10 
gap size: 20.0 % -> 1555 pixel 
training pictures: 45
MAE actual: 1.652
MAE cross validation: 1.272
runtime: 27.69 seconds 

date: 2008-04-10 
gap size: 25.0 % -> 1944 pixel 
training pictures: 45
MAE actual: 1.823
MAE cross validation: 1.284
runtime: 35.66 seconds 



## Example without artificial gaps where the algorithm estimates real gaps

In [3]:
# Directory name
ds_name = 'GermanyNB_with_real_gaps'
# Variable that will be estimated e.g. 'land_surface_temperature' or 'air_temperature_2m'
variable = 'land_surface_temperature'
# Dimension values of the data cube: latitude and longitude of the area, and the time range.
# Global range: lat = (90, -90), lon = (-180, 180)
dimensions = {
    'lat': (54, 48),
    'lon': (6, 15),
    'times': (datetime.date(2008, 1, 1), datetime.date(2008, 12, 31))
}
# List of artificial gaps that will be created
# if no artificial gaps should be created, the gapfilling algorithms will perform on real gaps
# options: None or list of artificial gap sizes e.g. [0.001, 0.01, 0.1, 0.25, 0.5, 0.75]
artificial_gaps = None
# Specify whether the actual matrix is chosen randomly or taken from a specific file.
# Options: 'Random' or datetime.date of the specific file e.g. datetime.date(2008, 12, 14)
actual_matrix = datetime.date(2008, 12, 14)

data_store = new_data_store("s3", root="esdl-esdc-v2.1.1", storage_options=dict(anon=True))
dataset = data_store.open_data('esdc-8d-0.083deg-184x270x270-2.1.1.zarr')
# Select the variable of interest from the dataset
ds = dataset[variable]

EarthSystemDataCubeS3(ds, ds_name, dimensions, artificial_gaps, actual_matrix).get_data()

GermanyNB_with_real_gaps {'time': 46, 'lat': 72, 'lon': 108}
No of files: 46
1 / 46
...
46 / 46
Structure {'complete': 0, 'empty': 0, 'gaps': 46}
date: 2008-12-14
real gap size:  21 %
runtime: 43.48


In [5]:
# Directory name based on the input name in 'GapDataset'
ds_name = 'GermanyNB_with_real_gaps'
# Choose hyperparameter settings. Options: 'RandomGridSearch' | 'FullGridSearch' | 'Custom'
hyperparameters = "RandomGridSearch"
# Choose the predictor type. Options: 'AllPoints' | 'LCC' | 'RandomPoints'
predictor = "LCC"
# Create an instance of the SupportVectorRegressionGapfill class with the specified settings
SVR_Gapfiller = SupportVectorRegressionGapfill(ds_name=ds_name, hyperparameters=hyperparameters, predictor=predictor)
# Perform the gap filling using the chosen settings
SVR_Gapfiller.gapfill()

date: 2008-12-14 
gap size: 20.8 % -> 1617 pixel 
training pictures: 45


IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
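
The cause of this error inside mltools is not visible from the output above. As general background, NumPy raises this exact kind of IndexError when a 0-dimensional array, for example a single value wrapped in an array, is indexed like a 1-dimensional one. The sketch below reproduces the message and shows one common guard, `np.atleast_1d`; whether this applies to the failing call is an assumption.

```python
import numpy as np

scalar_like = np.array(3.5)  # 0-dimensional array holding a single value
try:
    scalar_like[0]
except IndexError as err:
    message = str(err)
    print(message)

# Guard: promote 0-dimensional input to 1-dimensional before indexing.
safe = np.atleast_1d(scalar_like)
print(safe[0])  # 3.5
```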