# Generation of learning database from a stochastic reactors simulation

This notebook builds the training database for the ML algorithm. Before running it, a stochastic reactors simulation with *build_ML_dtb=True* must have been run. That simulation produces the files *X.csv* and *Y.csv*, which contain the raw states $T$, $Y_k$ and the information needed to cluster the data (e.g. the progress variable).

This notebook generates the final database and offers several options:

+ Prediction of $Y_k(t+dt)$ or $Y_k(t+dt)-Y_k(t)$

+ Application of a transform such as logarithm or Box-Cox

+ Optional temperature threshold on the data, to exclude non-reacting zones

+ Optional clustering of the data based on (i) the k-means algorithm or (ii) progress variable values.

Files *X_train*, *Y_train*, *X_val* and *Y_val* are created for each cluster. Note that if no clustering is applied, the default single cluster is cluster 0.
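
The first two options can be illustrated with a minimal, standalone sketch. It is not the code used by *LearningDatabase*: the helper names, the clipping convention for the threshold, and the Box-Cox parameter `lam` are assumptions made for illustration only.

```python
import math

def log_transform(y, threshold=1.0e-10):
    """Clip a mass fraction at `threshold`, then take its natural logarithm."""
    return math.log(max(y, threshold))

def box_cox(y, lam, threshold=1.0e-10):
    """One-parameter Box-Cox transform; it tends to the logarithm as lam -> 0."""
    y = max(y, threshold)
    if lam == 0.0:
        return math.log(y)
    return (y**lam - 1.0) / lam

# Hypothetical mass fractions of one species at t and t + dt
y_t, y_t_dt = 0.020, 0.018

target_value = y_t_dt             # option 1: predict Y_k(t+dt)
target_difference = y_t_dt - y_t  # option 2: predict Y_k(t+dt) - Y_k(t)

print(target_difference)
print(log_transform(0.0))         # a zero value is clipped before taking the log
print(box_cox(y_t, lam=0.1))      # lam=0.1 is an arbitrary illustrative value
```

Clipping at a small threshold keeps the logarithm finite for species whose mass fraction is exactly zero, which is the role the *threshold* parameter plays in the settings below.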

In [None]:
from ai_reacting_flows.stochastic_reactors_data_gen.database_processing import LearningDatabase

In [None]:
%load_ext autoreload
%autoreload 2

The parameters of the database processing are first set:

In [None]:
# Dictionary to store data processing parameters
dtb_processing_parameters = {}

dtb_processing_parameters["dtb_folder"] = "../scripts/STOCH_DTB_PREMIXED_CH4_TEST"       # Stochastic reactors simulation folder
dtb_processing_parameters["database_name"] = "database_1"                   # Resulting database name
dtb_processing_parameters["log_transform"] = 0              # 0: no transform, 1: Logarithm transform, 2: Box-Cox transform
dtb_processing_parameters["threshold"] = 1.0e-10            # Threshold to be applied in case of logarithm transform
dtb_processing_parameters["output_omegas"] = True           # True: output differences, False: output mass fractions
dtb_processing_parameters["detailed_mechanism"] = "/work/mehlc/2_IA_KINETICS/ai_reacting_flows/data/chemical_mechanisms/mech_H2.yaml"        # Mechanism used for the database generation (/!\ YAML format)
dtb_processing_parameters["fuel"] = "H2"           # Fuel name
dtb_processing_parameters["with_N_chemistry"] = False        # Considering Nitrogen chemistry or not (if not, N not considered in atom balance for reduction). In MLP, it will change treatment of N2.

The database is then created as a *LearningDatabase* object:

In [None]:
database = LearningDatabase(dtb_processing_parameters)

We can apply a temperature threshold if needed, here $600$ K for instance:

In [None]:
database.apply_temperature_threshold(600.0)
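
Conceptually, the threshold removes every sample whose temperature is too low. The sketch below shows one way to do this on paired input/output samples. The data, the helper name, and the choice to filter on the input temperature with `>=` are assumptions, not the behaviour of *apply_temperature_threshold* itself.

```python
# Hypothetical paired samples: X holds states at t, Y the states at t + dt
X = [{"T": 300.0, "H2": 0.028}, {"T": 900.0, "H2": 0.020}, {"T": 1800.0, "H2": 0.002}]
Y = [{"T": 300.0, "H2": 0.028}, {"T": 950.0, "H2": 0.018}, {"T": 1805.0, "H2": 0.001}]

def filter_by_temperature(X, Y, T_min):
    """Keep only the (x, y) pairs whose input temperature is at least T_min."""
    kept = [(x, y) for x, y in zip(X, Y) if x["T"] >= T_min]
    return [x for x, _ in kept], [y for _, y in kept]

X_kept, Y_kept = filter_by_temperature(X, Y, 600.0)
print(len(X), "->", len(X_kept), "samples")
```

Filtering the pairs together keeps each input row aligned with its output row.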

The ANN can also be applied to a reduced subset of species only. To preserve atomic masses and enthalpy, a set of fictive species is then added to the database, and their mass fractions are computed so that these quantities are conserved. For this problem to have a solution, the following rules must be obeyed when selecting the fictive species:

+ The number of fictive species must be *number of atoms + 1* (the extra one accounts for the enthalpy). At the moment, the number of atoms is 4 ($C$, $H$, $O$, $N$), except for $H_2$, where carbon is not considered. $N$ can also be discarded by setting the *with_N_chemistry* parameter above to *False*.

+ Each atom must be represented at least once.
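
The two rules above can be checked before calling the reduction routine. The helper below is a hypothetical sketch, not part of the package. It parses simple formulas such as `H2O` with a regular expression, so it would not handle species names with suffixes or parentheses, and it only tests the stated rules, which may not be sufficient for solvability in every case.

```python
import re

def elements_of(species):
    """Return the set of element symbols in a simple formula such as 'H2O'."""
    return set(re.findall(r"[A-Z][a-z]?", species))

def check_fictive_species(fictive_species, elements):
    """Return a list of rule violations (an empty list means both rules hold)."""
    problems = []
    expected = len(elements) + 1  # one per element, plus one for the enthalpy
    if len(fictive_species) != expected:
        problems.append(
            f"expected {expected} fictive species, got {len(fictive_species)}"
        )
    covered = set()
    for species in fictive_species:
        covered |= elements_of(species)
    missing = sorted(set(elements) - covered)
    if missing:
        problems.append(f"elements not represented: {missing}")
    return problems

# H2 case without nitrogen chemistry: only H and O enter the balance
print(check_fictive_species(["O2", "H2O", "H2"], {"H", "O"}))
print(check_fictive_species(["O2", "H2O"], {"H", "O"}))
```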

The reduction operation can be done with the following lines:

In [None]:
fictive_species = ["O2", "H2O", "H2"]
subset_species = ["H2", "O2", "N2", "H2O"]
database.reduce_species_set(subset_species, fictive_species)

We can for instance check that the sum of species mass fractions is 1:

In [None]:
print(database.X[subset_species + [spec+"_F" for spec in fictive_species]].sum(axis=1))
print("")
print(database.Y[subset_species + [spec+"_F" for spec in fictive_species]].sum(axis=1))

Note that this sum-to-one check is only illustrative: a more thorough verification of the individual atomic mass fractions and of the enthalpy is performed inside the *reduce_species_set* routine.
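
To make the idea of an atomic mass fraction check concrete, the sketch below computes the elemental mass fractions $Z_e = \sum_k n_{e,k} \, W_e / W_k \, Y_k$ for a hypothetical $H_2$/air mixture before and after complete combustion, and shows that they are unchanged. The compositions, atomic weights, and function names are illustrative assumptions; this is not the verification code used by *reduce_species_set*.

```python
ATOMIC_WEIGHTS = {"H": 1.008, "O": 15.999, "N": 14.007}  # g/mol
COMPOSITION = {
    "H2": {"H": 2},
    "O2": {"O": 2},
    "H2O": {"H": 2, "O": 1},
    "N2": {"N": 2},
}

def molar_mass(species):
    """Molar mass of a species from its elemental composition."""
    return sum(n * ATOMIC_WEIGHTS[e] for e, n in COMPOSITION[species].items())

def element_mass_fractions(Y):
    """Elemental mass fractions Z_e from species mass fractions Y_k."""
    Z = {e: 0.0 for e in ATOMIC_WEIGHTS}
    for species, y in Y.items():
        W = molar_mass(species)
        for e, n in COMPOSITION[species].items():
            Z[e] += n * ATOMIC_WEIGHTS[e] / W * y
    return Z

# Hypothetical unburnt mixture (mass fractions sum to 1)
unburnt = {"H2": 0.0285, "O2": 0.2264, "N2": 0.7451}

# Burn all the H2 through 2 H2 + O2 -> 2 H2O
h2 = unburnt["H2"]
burnt = {
    "H2": 0.0,
    "O2": unburnt["O2"] - h2 * 0.5 * molar_mass("O2") / molar_mass("H2"),
    "N2": unburnt["N2"],
    "H2O": h2 * molar_mass("H2O") / molar_mass("H2"),
}

Z_u = element_mass_fractions(unburnt)
Z_b = element_mass_fractions(burnt)
for e in Z_u:
    print(e, Z_u[e], Z_b[e])
```

Species mass fractions change during the reaction, but each $Z_e$ stays constant, which is the property the fictive species are meant to preserve.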

We can cluster the dataset based on a progress variable if needed:

In [None]:
# database.clusterize_dataset("progvar", 2, c_bounds=[0,0.95,1.0])
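
Clustering on the progress variable amounts to binning each sample according to the interval of `c_bounds` that contains its progress variable $c$. A minimal sketch follows; the function name and the treatment of values lying exactly on a bound are assumptions, not the package's documented behaviour.

```python
from bisect import bisect_right

def progvar_cluster(c, c_bounds):
    """Index i of the interval [c_bounds[i], c_bounds[i+1]) that contains c."""
    n_clusters = len(c_bounds) - 1
    i = bisect_right(c_bounds, c) - 1
    # Clamp so that out-of-range values and c == c_bounds[-1] stay in a valid cluster
    return min(max(i, 0), n_clusters - 1)

c_bounds = [0, 0.95, 1.0]  # two clusters: c < 0.95 and c >= 0.95
for c in (0.0, 0.5, 0.95, 1.0):
    print(c, "-> cluster", progvar_cluster(c, c_bounds))
```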

Alternatively, we can use k-means, as done here (the progress-variable call above is commented out because the dataset can only be clustered once):

In [None]:
database.clusterize_dataset("kmeans", 3)
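
For readers unfamiliar with k-means, the sketch below implements the basic Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. How the package implements its `"kmeans"` option is not shown in this notebook, so the function, the explicit initial centroids, and the sample points are all hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, n_iter=20):
    """Basic Lloyd iteration starting from the given centroids."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical normalized states (e.g. scaled temperature and fuel mass fraction)
points = [(0.0, 1.0), (0.05, 0.95), (0.5, 0.5), (0.55, 0.45), (1.0, 0.0), (0.95, 0.05)]
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
print(labels)
print(centroids)
```

The result of k-means depends on the initial centroids, which is why they are passed explicitly here rather than drawn at random.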

We can print the size of the database (count made for each cluster):

In [None]:
database.print_data_size()

We can under-sample a given cluster (for example, if the burnt gas cluster contains too many states, we can reduce it). The number of samples kept is *ratio_to_keep* times the size of the cluster.

In [None]:
database.undersample_cluster(1, ratio_to_keep=0.5)
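
The effect of under-sampling can be sketched as a random draw without replacement inside the chosen cluster, leaving the other clusters untouched. The function below is a hypothetical illustration; the rounding of the kept count and the seeding are assumptions and may differ from *undersample_cluster*.

```python
import random

def undersample(rows, labels, cluster, ratio_to_keep, seed=0):
    """Randomly keep a fraction of the rows of one cluster; keep all other rows."""
    rng = random.Random(seed)
    in_cluster = [i for i, label in enumerate(labels) if label == cluster]
    n_keep = int(ratio_to_keep * len(in_cluster))
    kept = set(rng.sample(in_cluster, n_keep))
    return [
        (row, label)
        for i, (row, label) in enumerate(zip(rows, labels))
        if label != cluster or i in kept
    ]

rows = list(range(10))                    # stand-ins for database rows
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 4 rows in cluster 0, 6 in cluster 1
reduced = undersample(rows, labels, cluster=1, ratio_to_keep=0.5)
print(len(rows), "->", len(reduced), "rows")
```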

We can check that the under-sampling has been applied correctly:

In [None]:
database.print_data_size()

Finally, the database is processed so that it can be used in the ML pipeline (unneeded dataframe columns are dropped and the data transformation is applied):

In [None]:
database.process_database(plot_distributions=True, distribution_species=["CH4"])