<a id='SimulatedDataEnsembleModelsTop'></a>
# Pretrain an LSTM Network using Simulated Data

This notebook investigates whether starting from a model pretrained on simulated data improves performance.

- Architecture, informed by limited [hyperparameter tuning](SimulatedDataPretrainedModelTuner.ipynb#SimulatedDataPretrainedModelTunerTop) with Keras Tuner, is a stack of LSTM layers:
    - input layer - 50 units
    - hidden layer - 200 units
    - hidden layer - 50 units
    - hidden layer - 10 units
    - output layer - 2 units (softplus activation, to ensure variance predictions are positive)
- batch size = 64
- Number of base learners in ensemble = 10
- 3:1:1 train:validation:test dataset split
- 2000 epochs
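
A 3:1:1 train:validation:test split can be obtained with two successive calls to `train_test_split`. The sketch below is illustrative only: the arrays `X` and `labels` are hypothetical stand-ins for the simulated sequences and their concentration labels, and the two-step approach is an assumption about how such a split might be done, not a description of `LSTMutils`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples in 5 equally sized classes
X = np.arange(100).reshape(100, 1)
labels = np.repeat(np.arange(5), 20)

# Step 1: hold out 1/5 of the data as the test set, stratified by label
X_rest, X_test, labels_rest, labels_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, shuffle=True, stratify=labels)

# Step 2: split the remaining 4/5 into 3 parts train and 1 part validation
X_train, X_val, labels_train, labels_val = train_test_split(
    X_rest, labels_rest, test_size=0.25, random_state=42, shuffle=True,
    stratify=labels_rest)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)  # (60, 20, 20)
```

Holding out the test set first keeps it untouched by any later choice made on the validation set.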

The model from the epoch with the lowest validation loss is saved for later use in transfer learning on the experimental dataset.

Minimum loss is ln(minimum_variance)/2 = -6.91 (for a minimum variance chosen to be 1e-6).
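
The -6.91 figure can be checked numerically. The sketch below assumes the loss is the standard Gaussian negative log-likelihood with the constant term dropped and a floor on the variance. The implementation of `MeanVarianceLogLikelyhoodLoss` in `LSTMutils` is not shown here, so the function `gaussian_nll` and its arguments are hypothetical.

```python
import math

def gaussian_nll(y_true, mean, variance, minimum_variance=1e-6):
    """Per-sample Gaussian negative log-likelihood without the ln(2*pi)/2 constant."""
    # Floor the variance so the logarithm and the division stay finite
    variance = max(variance, minimum_variance)
    return 0.5 * math.log(variance) + (y_true - mean) ** 2 / (2.0 * variance)

# The loss is smallest when the mean is exact and the variance sits at its floor
minimum_loss = gaussian_nll(y_true=0.3, mean=0.3, variance=0.0)
print(round(minimum_loss, 2))  # -6.91
```

Under this assumed form the squared-error term vanishes at the minimum, leaving only half the log of the variance floor.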

In [None]:
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import math
from LSTMutils import MeanVarianceLogLikelyhoodLoss
from sklearn.model_selection import train_test_split
import LSTMutils

# input parameters
SequenceLength = 250
validation_split = 0.1
batch_size = 64
NumEpochs = 2

# set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# read simulated dataset
ExperimentalData = LSTMutils.ExperimentalData(ExperimentalDataFilePath="../TrainingData/SimulatedTrainingSet10000.csv",SequenceLength=SequenceLength)
unused, concentrations, df_data, unused = ExperimentalData.ReadData()
concentrations=concentrations.apply(pd.to_numeric)

# split data into stratified train and validation sets, size defined by the validation_split variable
# the split will always be the same provided the data is in the same order, the same random_state is used,
# and the labels used for stratification are always the same type (str is used here)
df_train, df_val = train_test_split(df_data, test_size=validation_split, train_size=1-validation_split, random_state=42, shuffle=True, stratify=concentrations)
  
# normalise time series data
df_norm_train, df_norm_val, unused = ExperimentalData.NormalizeData(df_train,df_val)
    
# Define y as the last element in X, and ensure X and y are the correct shape
X_train, y_train = ExperimentalData.Shape(df_norm_train)
X_val, y_val = ExperimentalData.Shape(df_norm_val)

# define network architecture
model = keras.models.Sequential([
    keras.layers.LSTM(50, input_shape=(SequenceLength, 1), return_sequences=True, stateful=False),
    keras.layers.LSTM(200, return_sequences=True, stateful=False),
    keras.layers.LSTM(50, return_sequences=True, stateful=False),
    keras.layers.LSTM(10, return_sequences=True, stateful=False),
    # 2 outputs per time step; softplus keeps them positive
    keras.layers.LSTM(2, activation='softplus', return_sequences=True, stateful=False),
])

# save the model at the epoch which gives the lowest loss on the validation dataset
checkpoint_filepath = r"../Models/SimulatedDataPretrainedModel"
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

model.compile(optimizer="adam",loss = MeanVarianceLogLikelyhoodLoss)

history = model.fit(X_train, y_train, batch_size=batch_size, validation_data=(X_val,y_val), epochs=NumEpochs, callbacks=[model_checkpoint_callback,keras.callbacks.TerminateOnNaN()])

# plot loss vs epochs
Evaluation = LSTMutils.ModelTrainingEvaluation()
Evaluation.PlotLossHistory(history)

# load and evaluate the best model, in terms of validation loss
bestModel = keras.models.load_model(checkpoint_filepath, custom_objects={"MeanVarianceLogLikelyhoodLoss": MeanVarianceLogLikelyhoodLoss})
bestModel.evaluate(X_train, y_train, batch_size=batch_size)