# Generation of learning database from a stochastic reactors simulation

This notebook builds the training database for the ML algorithm. Before running it, a stochastic reactors simulation must have been run with *build_ML_dtb=True*. That simulation produces the files *X.csv* and *Y.csv*, which contain the raw states $T$, $Y_k$ and the information needed to cluster the data (e.g. the progress variable).

The current script generates the final database and offers several options:

+ Prediction of $Y_k(t+dt)$ or $Y_k(t+dt)-Y_k(t)$

+ Application of a transform such as logarithm or Box-Cox

+ Possibility to apply a temperature threshold to the data to exclude non-reacting zones

+ Possibility to cluster the data based on (i) the k-means algorithm or (ii) progress variable values.

Files *X_train*, *Y_train*, *X_val* and *Y_val* are created for each cluster. Note that if no clustering is applied, the default single cluster is cluster 0.

In [None]:
from ai_reacting_flows.stochastic_reactors_data_gen.database_processing import LearningDatabase

The parameters of the database processing are first set:

In [None]:
# Dictionary to store data processing parameters
dtb_processing_parameters = {}

dtb_processing_parameters["dtb_folder"] = "../scripts/STOCH_DTB_hotspot_H2_DEV"  # Stochastic reactors simulation folder
dtb_processing_parameters["database_name"] = "database_1"  # Name of the resulting database
dtb_processing_parameters["log_transform"] = 1             # 0: no transform, 1: logarithm transform, 2: Box-Cox transform
dtb_processing_parameters["threshold"] = 1.0e-10           # Threshold applied when the logarithm transform is used
dtb_processing_parameters["output_omegas"] = True          # True: output differences Y_k(t+dt)-Y_k(t), False: output mass fractions Y_k(t+dt)
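
To make these options more concrete, here is a minimal NumPy sketch of what a thresholded logarithm transform, a Box-Cox transform and the two target choices could look like. It is not the code of *LearningDatabase*: the helper names `log_transform` and `box_cox`, the Box-Cox exponent `lmbda=0.1`, the sample values, and the choice to clip values at the threshold before taking the logarithm are all assumptions made for illustration.

```python
import numpy as np


def log_transform(y, threshold=1.0e-10):
    """Clip values at `threshold`, then take the natural logarithm (assumed behaviour)."""
    return np.log(np.maximum(y, threshold))


def box_cox(y, lmbda=0.1):
    """One-parameter Box-Cox transform; lmbda=0.1 is an arbitrary illustrative value."""
    return (np.power(y, lmbda) - 1.0) / lmbda


# Hypothetical mass fractions of three species at t and t+dt
y_t = np.array([0.0, 1.0e-4, 0.5])
y_t_dt = np.array([1.0e-6, 2.0e-4, 0.4])

log_y_t = log_transform(y_t)      # the zero entry is clipped to 1e-10 before the log
bc_y_t = box_cox(y_t)             # Box-Cox stays finite at 0 when lmbda > 0
target_difference = y_t_dt - y_t  # analogue of output_omegas = True
target_value = y_t_dt             # analogue of output_omegas = False
```

The threshold matters because mass fractions can be exactly zero, where the logarithm is undefined.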

The database is then created as a *LearningDatabase* object:

In [None]:
database = LearningDatabase(dtb_processing_parameters)

We can apply a temperature threshold if needed, here $600$ K for instance:

In [None]:
database.apply_temperature_threshold(600.0)
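
Conceptually, a temperature threshold amounts to a row filter on the raw states. The sketch below illustrates the idea on a small hypothetical array; the column layout, the species and the convention of keeping samples at or above the threshold are assumptions, not a description of how *apply_temperature_threshold* is implemented.

```python
import numpy as np

# Hypothetical raw states: one row per sample, columns = [T, Y_H2, Y_O2]
states = np.array([
    [300.0, 0.028, 0.226],
    [650.0, 0.020, 0.180],
    [1500.0, 0.001, 0.050],
])

T_threshold = 600.0

# Keep samples at or above the threshold (assumed convention)
mask = states[:, 0] >= T_threshold
reacting_states = states[mask]
```

Here the cold sample at 300 K is dropped and the two hotter samples are kept.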

We can cluster the dataset based on the progress variable if needed:

In [None]:
database.clusterize_dataset("progvar", 2, c_bounds=[0,0.95,1.0])
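
As an illustration of clustering by progress variable, the sketch below bins hypothetical values of $c$ using the same bounds as above. It assumes that the bounds define consecutive intervals and that a value lying exactly on an interior bound falls in the upper interval; the actual convention used by *clusterize_dataset* may differ.

```python
import numpy as np

c_bounds = [0.0, 0.95, 1.0]
c = np.array([0.0, 0.3, 0.94, 0.95, 1.0])  # hypothetical progress variable values

# The interior bounds split [0, 1] into len(c_bounds) - 1 intervals
cluster_id = np.digitize(c, c_bounds[1:-1])
n_clusters = len(c_bounds) - 1
```

With these bounds, samples with $c < 0.95$ get index 0 and the remaining samples get index 1.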

Alternatively, we could have used k-means (commented out because the dataset cannot be clustered twice):

In [None]:
# database.clusterize_dataset("kmeans", 3)
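
For readers unfamiliar with k-means, the following is a minimal NumPy sketch of the standard Lloyd iteration on hypothetical two-dimensional data. The function `kmeans`, its parameters and the synthetic points are illustrative assumptions; the library may rely on a different implementation, initialization or feature scaling.

```python
import numpy as np


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd iteration: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with distinct, randomly chosen samples
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid: shape (n_points, n_clusters)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):  # leave an empty cluster's centroid unchanged
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids


# Hypothetical data: three groups of 50 points in a 2D feature space
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.vstack([center + 0.3 * rng.standard_normal((50, 2)) for center in centers])

labels, centroids = kmeans(points, n_clusters=3)
```

Because the result depends on the random initialization, k-means is usually run several times or with a more careful seeding strategy.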

Finally, the database is processed so that it can be used in the ML pipeline (unneeded dataframe columns are removed and the data transformation is applied):

In [None]:
database.process_database()
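
The *X_train*, *Y_train*, *X_val* and *Y_val* files mentioned at the top of this notebook imply a training/validation split for each cluster. The sketch below shows one common way to perform such a split with a random permutation. The 20% validation fraction, the array shapes and the random data are assumptions chosen for illustration; the split actually performed by *process_database* is not described here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical processed inputs and targets for one cluster
X = rng.random((100, 4))
Y = rng.random((100, 3))

val_fraction = 0.2  # illustrative value
indices = rng.permutation(len(X))
n_val = int(val_fraction * len(X))

# The first n_val shuffled indices form the validation set, the rest the training set
val_idx, train_idx = indices[:n_val], indices[n_val:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_val, Y_val = X[val_idx], Y[val_idx]
```

Shuffling before splitting keeps consecutive samples of the same trajectory from all ending up in one set.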