## Glucose Prediction 


This assignment's focus is on predicting blood glucose. There are four parts to the assignment:


1. Data cleaning
2. Population level model
3. Improving model training
4. Transfer learning


## Pipeline task overview: Forecasting glucose

Recall from your previous courses that a machine learning task such as glucose forecasting can typically be described by the following components:

 1. Data collection - <font color='green'>Done</font>
 2. Data cleaning / transformation - <font color='magenta'>You will do this in 2a</font>
 3. Dataset splitting - <font color='green'>Done</font>
 4. Model training - <font color='magenta'>You will do this in 2b, 2c and 2d</font>
 5. Model evaluation - <font color='magenta'>You will do this in 2b, 2c and 2d</font>
 6. Repeat 1-5 to perform model selection - <font color='magenta'>You will do this in 2b, 2c and 2d</font>

## Population level models

In this notebook you will use our cleaned and transformed data to train and evaluate an LSTM model for glucose prediction. This dataset was created following the same steps you completed in notebook 2a. Here you will train the model on the entire dataset, which gives a population level model. In the next notebook you will train each model N times, once per person in the dataset.

In [2]:
import data_cleaners as dc
import find_best_hyperparameters as fbh
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [3]:
data = dc.Diabetes()

X = data.X
y = data.y

# Map each patient ID to the row indices of that patient's samples
pid_to_indices = {}
for pid in X['patient_id'].unique().tolist():
    X_df = X[X['patient_id'] == pid]
    pid_to_indices[pid] = X_df.index

splits = dc.TrainSplits(X, y, combine=True, pid_to_indices=pid_to_indices)

## <font color='magenta'>Task One</font>


##  LSTM

You will implement the LSTM model from the following paper: [A Deep Learning Approach for Blood Glucose Prediction of Type 1 Diabetes](http://ceur-ws.org/Vol-2675/paper23.pdf). However, you can see from the function ```best_lstm_parameters_training_data``` that there is a fixed set of parameters we will evaluate your model with. You are encouraged to experiment with additional complexity, but to pass the assignment you will need to stick to the structure provided.


**You will not implement the Delta-LSTM mentioned in the paper. For the LSTM step, simply use PyTorch's LSTM module: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html**

The function ```best_lstm_parameters_training_data``` in the ```find_best_hyperparameters.py``` file searches for the best hyperparameters. Your task is to fill in the missing pieces of the function by writing a custom training loop in PyTorch. For each hyperparameter combination you will train the model for a fixed number of epochs, then evaluate it on the validation set and save the result. After evaluating every combination, choose the one that gives the best performance, i.e. the lowest validation loss.
You will need a basic understanding of PyTorch and of how to train a model in order to complete the task. **Good luck!**
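The search described above can be sketched roughly as follows. This is only an illustration on synthetic tensors, not the assignment's reference solution: the class `GlucoseLSTM`, the helper `search_hidden_size`, the learning rate, the epoch count, the tensor shapes, and the use of `nn.MSELoss` are all assumptions made for the example. The real function receives DataFrames and uses the fixed parameters defined in ```find_best_hyperparameters.py```.

```python
import torch
from torch import nn

torch.manual_seed(0)


class GlucoseLSTM(nn.Module):
    """Hypothetical model: one LSTM layer followed by a linear read-out."""

    def __init__(self, n_features: int, hidden_size: int, n_outputs: int):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_outputs)

    def forward(self, x):
        # x has shape (batch, seq_len, n_features); use the last time step only.
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])


def search_hidden_size(X_train, y_train, X_val, y_val, hidden_sizes, epochs=5):
    """Train one model per hidden size and keep the lowest validation loss."""
    loss_fn = nn.MSELoss()  # assumed loss; the assignment may use another
    best_model, best_size, min_val_loss = None, None, float("inf")
    for hidden_size in hidden_sizes:
        model = GlucoseLSTM(X_train.shape[-1], hidden_size, y_train.shape[-1])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(epochs):
            model.train()
            optimizer.zero_grad()
            loss = loss_fn(model(X_train), y_train)  # full-batch step
            loss.backward()
            optimizer.step()
        # Evaluate on the validation set without tracking gradients.
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), y_val).item()
        if val_loss < min_val_loss:
            best_model, best_size, min_val_loss = model, hidden_size, val_loss
    return best_model, best_size, min_val_loss


# Synthetic stand-in data: windows of 12 readings, 1 feature, 12 target horizons.
X_train, y_train = torch.randn(64, 12, 1), torch.randn(64, 12)
X_val, y_val = torch.randn(16, 12, 1), torch.randn(16, 12)
best_model, best_hidden_size, min_val_loss = search_hidden_size(
    X_train, y_train, X_val, y_val, hidden_sizes=[4, 8, 16]
)
```

The sketch trains on the full batch for brevity. With real data you would usually iterate over mini-batches inside the epoch loop.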

In [None]:
# Your solution goes in best_lstm_parameters_training_data() in find_best_hyperparameters.py
best_model, best_hidden_size, min_val_loss = fbh.best_lstm_parameters_training_data(splits.train_df_X,
                                                    splits.train_df_y, splits.validate_df_X, splits.validate_df_y)


Hidden layer size: 4
Evaluating the performance ...
Hidden Size: 4 Val Loss: 73.677139
Hidden layer size: 8
Evaluating the performance ...
Hidden Size: 8 Val Loss: 43.040031
Hidden layer size: 16
Evaluating the performance ...
Hidden Size: 16 Val Loss: 63.192327
Hidden layer size: 32
Evaluating the performance ...
Hidden Size: 32 Val Loss: 52.619152
Hidden layer size: 64
Evaluating the performance ...
Hidden Size: 64 Val Loss: 34.066485
Hidden layer size: 128
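
Given the validation losses printed so far in the log above, picking the winner amounts to taking the minimum. A small sketch, using only the values that appear in the log (the result for hidden size 128 is not shown there, so it is left out):

```python
# Validation losses copied from the log above (hidden size -> validation loss).
val_losses = {
    4: 73.677139,
    8: 43.040031,
    16: 63.192327,
    32: 52.619152,
    64: 34.066485,
}

# The best hidden size is the key with the smallest validation loss.
best_hidden_size = min(val_losses, key=val_losses.get)
min_val_loss = val_losses[best_hidden_size]
print(best_hidden_size, min_val_loss)  # 64 34.066485
```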


In [4]:
#hidden tests are within this cell

## <font color='magenta'>Task Two</font>

Across these notebooks we want to see how each model is performing for each patient.

You will implement code in `fbh.rmse_from_training_population_testing_individual_best_lstm(best_model, splits.data_dicts)` which will return a dictionary, `results_dict`, of the form: 

```python
{
 patient_id : (rmse for 30 min interval, list containing rmse for 5,10,...,60 min intervals)
}

```
For example, results should look like the following (note that the numbers below are placeholders and may not be correct):
```python
{1.0:(27.81,[26.07,26.74, ... , 29.17]),
 2.0:(27.81,[26.07,26.74, ... , 29.17]),
 ...
 10.0:(27.81,[26.07,26.74, ... , 29.17])}
```


The function takes your best model from above (`best_model`) as its first argument.
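
To make the expected dictionary format concrete, here is a minimal sketch that builds it from plain Python lists. The helper names `rmse` and `per_patient_rmse` and the toy inputs are hypothetical. In the assignment the predictions would come from `best_model`, and the layout of `splits.data_dicts` is not reproduced here.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def per_patient_rmse(preds_by_patient, targets_by_patient, horizon_index=5):
    """Build {patient_id: (rmse at 30 min, [rmse at 5, 10, ..., 60 min])}.

    Each dictionary value is a list of rows, one row per sample and one
    column per horizon. Column 5 is the 30 min horizon (5, 10, ..., 30).
    """
    results = {}
    for pid, preds in preds_by_patient.items():
        targets = targets_by_patient[pid]
        n_horizons = len(preds[0])
        per_horizon = [
            rmse([row[h] for row in preds], [row[h] for row in targets])
            for h in range(n_horizons)
        ]
        results[pid] = (per_horizon[horizon_index], per_horizon)
    return results


# Toy data: 2 samples, 12 horizons; each prediction is off by (h + 1) mg/dL.
targets = {1.0: [[100.0] * 12, [100.0] * 12]}
preds = {1.0: [[100.0 + h + 1 for h in range(12)] for _ in range(2)]}
results_dict = per_patient_rmse(preds, targets)
print(results_dict[1.0][0])  # 6.0
```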

In [1]:
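import copy

# Requires `splits` from the data-loading cell above; run that cell first.
# print(best_model.forward(splits.data_dicts[1.0]))
test = copy.deepcopy(splits.data_dicts)

# Inspect patient 1.0: drop the ID column from the first element, then preview the third.
test_df = test[1.0][0]
test_df = test_df.drop(columns=['patient_id'])
test[1.0][2].head()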

NameError: name 'splits' is not defined

In [6]:
# your code goes in rmse_from_training_population_testing_individual_best_lstm() in find_best_hyperparameters.py
results_dict = fbh.rmse_from_training_population_testing_individual_best_lstm(best_model, splits.data_dicts)

print(results_dict)

AttributeError: 'DataFrame' object has no attribute 'permute'

In [None]:
#hidden tests are within this cell