## Overview

This week you will classify whether a given text snippet refers to an individual with diabetes. Each snippet contains the notes from an electronic health record. You will consider deep learning models built with two types of data:

- A bag of embeddings representation of the text data
- A pre-trained vector embedding of the notes

First, you will develop a model with each of these sources independently, and then you will develop a model that incorporates both.

The assignment is split into three notebooks:

 1. You will use a bag of embeddings text representation.
 2. You will use a vector representation of each text snippet.
 3. You will combine these two types of data in a single deep learning model.
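To make the bag of embeddings idea concrete, here is a minimal sketch. It assumes the representation is built with PyTorch's `nn.EmbeddingBag`, which the `embedding_bag` attribute and the `offsets` argument in the hint for Task One suggest. The vocabulary size, embedding dimension, and token ids are made-up toy values, not values from the assignment.

```python
import torch
from torch import nn

# Hypothetical toy setup: 10 tokens in the vocabulary, 4-dimensional embeddings.
embedding_bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two notes flattened into one tensor of token ids:
# note 0 -> [1, 2, 3], note 1 -> [4, 5]
token_ids = torch.tensor([1, 2, 3, 4, 5], dtype=torch.long)

# offsets marks where each note starts inside token_ids.
offsets = torch.tensor([0, 3], dtype=torch.long)

# One averaged embedding per note, regardless of note length.
bag_vectors = embedding_bag(token_ids, offsets)
print(bag_vectors.shape)  # torch.Size([2, 4])
```

Each note is reduced to a single fixed-size vector, which is why notes of different lengths can share one batch.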



## Pipeline task overview: Predicting presence of diabetes from text



Recall from your previous courses that these tasks can typically be described by the following components: 

 1. Data collection - <font color='green'>Done</font>
 2. Data cleaning / transformation - <font color='magenta'>You will do some in assignment 3c</font>
 3. Dataset splitting - <font color='green'>Done</font>
 4. Model training - <font color='magenta'>You will do</font>
 5. Model evaluation - <font color='magenta'>You will do</font>
 6. Repeat 1-5 to perform model selection - <font color='magenta'>You will do</font>
 7. Presentation of findings (Visualization) - <font color='green'>Not required</font>



## <font color='magenta'>Task One</font>

In this notebook you will find the best hyperparameters for a deep learning model that is trained on both word embeddings and text data. You will implement the following:

- `MixedTypeDataset` in `data_cleaners`
    - `__init__`, `__getitem__`, `collate_batch`
- `get_mixed_data` in `data_cleaners`
- `MixedModel` in `find_best_hyperparameters`

You will then run the following code and match the shown output. 

Make sure to refer to the video on this assignment for information on how to merge the two layers. 

HINT: In the forward function, pass each data type through the same structure as in 3a and 3b. Before calculating the final output, concatenate the outputs of the two structures. Here is an example of how to perform the concatenation in PyTorch.


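```python
import torch
from torch import nn


class MixedModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, pre_embed_size, layer, num_class):
        super().__init__()

        # Declare the two layers that you used in BoEModel (in notebook 3a).
        self.embedding_bag = ...
        self.fc_layer = ...

        # Declare the two layers that you used in VecModel (in notebook 3b).
        self.step_one_vector = ...
        self.out_vector = ...

        # Declare an additional fully connected layer that takes the output of the
        # final layer of the BoEModel and the output of the VecModel and produces
        # the num_class output. The sizes (2, 1) assume each branch outputs one
        # value per sample, as is the case when num_class is 1.
        self.out = nn.Linear(2, 1)

    def forward(self, text_embeddings, offsets, pre_embeddings):
        # Text branch: bag of embeddings followed by its fully connected layer.
        xtext = self.fc_layer(self.embedding_bag(text_embeddings, offsets))

        # Vector branch: the two layers applied to the pre-trained embeddings.
        xvec = self.out_vector(self.step_one_vector(pre_embeddings))

        # Concatenate the two branch outputs along the feature dimension.
        x = torch.cat((xtext, xvec), dim=1)

        return self.out(x)
```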
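If the shapes in the concatenation step are unclear, the following standalone sketch may help. It uses random tensors as stand-ins for the two branch outputs and assumes each branch yields one value per sample; the batch size of 4 is arbitrary.

```python
import torch
from torch import nn

batch_size = 4  # arbitrary toy batch size

# Stand-ins for the outputs of the text branch and the vector branch.
xtext = torch.randn(batch_size, 1)
xvec = torch.randn(batch_size, 1)

# dim=1 joins the tensors along the feature dimension, keeping one row per sample.
x = torch.cat((xtext, xvec), dim=1)
print(x.shape)  # torch.Size([4, 2])

# A final linear layer then maps the 2 concatenated features to 1 output.
out = nn.Linear(2, 1)
y = out(x)
print(y.shape)  # torch.Size([4, 1])
```

Concatenating with `dim=0` instead would stack the samples and produce a tensor of shape `(8, 1)`, which would not match the final layer.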

As you have seen, the dataset in `mt_dia_labelled.csv` is highly imbalanced. For this exercise, we have given you the training and validation data, but the test data is hidden.

The following pickle files contain the training and validation data. Each row is a tuple of length 3: label, input features, embedding vectors.

- `../../assets/assignment3/train.pkl`
- `../../assets/assignment3/valid.pkl`

The vocab data for the reduced dataset has been provided at `../../assets/assignment3/vocab.pkl`.
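As a starting point for `collate_batch`, here is one possible sketch of how a list of such rows could be turned into the tensors that the model's forward function expects. It assumes the input features are already lists of integer token ids and that each embedding vector is a list of floats; the actual contents of the pickle files may differ (for example, they may need to be mapped through the vocab first). The function name `collate_batch_sketch` and the toy rows are hypothetical.

```python
import torch


def collate_batch_sketch(batch):
    """Collate (label, token_ids, vector) rows into batched tensors."""
    labels, token_chunks, lengths, vectors = [], [], [0], []
    for label, token_ids, vector in batch:
        labels.append(label)
        ids = torch.tensor(token_ids, dtype=torch.long)
        token_chunks.append(ids)
        lengths.append(ids.size(0))
        vectors.append(torch.tensor(vector, dtype=torch.float32))

    labels = torch.tensor(labels, dtype=torch.float32)
    # The cumulative sum of the lengths gives the start index of each note.
    offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)
    text = torch.cat(token_chunks)
    pre_embeddings = torch.stack(vectors)
    return labels, text, offsets, pre_embeddings


# Toy rows with 2-dimensional vectors (the assignment uses larger vectors).
toy_batch = [
    (1, [3, 7, 2], [0.1, 0.2]),
    (0, [5, 1], [0.3, 0.4]),
]
labels, text, offsets, pre_embeddings = collate_batch_sketch(toy_batch)
print(text)     # tensor([3, 7, 2, 5, 1])
print(offsets)  # tensor([0, 3])
```

A function like this is typically passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.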

In [None]:
import data_cleaners as dc
import find_best_hyperparameters as fbh
import torch

In [None]:
VEC_SIZE = 200
NUM_CLASS = 1
d = 100
c = 50

# Uncomment the following lines after making the required changes to:
# 1. get_mixed_data in data_cleaners.py
# 2. MixedModel class in find_best_hyperparameters.py

# train_dataloader, val_dataloader, vocab = dc.get_mixed_data()

# mixed_model = fbh.MixedModel(len(vocab), d, VEC_SIZE, c, NUM_CLASS)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Two</font>

Now that we know you are using a deep learning model similar to ours, let us search for the optimal set of hyperparameters.

This time, we've run the grid search for you. To pass this task, sort the `results_dict` below and enter the d and c values that achieved the highest average precision.

***Make sure that you take note of the best average precision in this notebook as we will ask about it later.***
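The layout of `results_dict` is not described here, so inspect it before sorting. The sketch below assumes, purely for illustration, a dictionary that maps `(d, c)` tuples to an average precision score. The name `toy_results` and every number in it are invented and do not come from the real grid search.

```python
# Invented toy values: (d, c) -> average precision. Not real results.
toy_results = {
    (10, 10): 0.41,
    (50, 25): 0.58,
    (100, 50): 0.52,
}

# Sort the (key, score) pairs from highest to lowest score.
ranked = sorted(toy_results.items(), key=lambda item: item[1], reverse=True)

# The first entry holds the best hyperparameters and their score.
(best_d, best_c), best_ap = ranked[0]
print(best_d, best_c, best_ap)  # 50 25 0.58
```

If the real dictionary is nested or keyed differently, the same `sorted(..., key=...)` pattern applies once you adapt the key function to its structure.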

In [None]:
import pickle

# results dictionary
with open('../../assets/assignment3/task3c.pkl', 'rb') as f:
    results_dict = pickle.load(f)

# Use results_dict to identify the best hyperparameters and change d and c below.

# d = None
# c = None

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Three OPTIONAL</font>
Using the values of c and d above, our implementation of `MixedModel` and `get_mixed_model_parameters` achieves an average precision of **0.65**. In this task, you can try to beat this performance.

Your goal is to get an average precision score higher than **0.65** on the hidden test set.
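As a reminder of what the metric measures, here is a small pure-Python sketch of average precision: the mean of the precision values taken at the rank of each positive example, after sorting by predicted score. It is an illustration only; it assumes binary 0/1 labels and no tied scores, and a library implementation may treat ties differently. The function name and the toy labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive, ranking by score descending."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


# Imbalanced toy example: 2 positives out of 8 samples.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 0]
toy_scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.05, 0.6, 0.15]
print(average_precision(toy_labels, toy_scores))  # 0.75
```

In this example the positives land at ranks 1 and 4, so the score is (1/1 + 2/4) / 2 = 0.75. Because the metric focuses on how the positives are ranked, it is more informative than accuracy on an imbalanced dataset.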

To do so, you can use `fbh.get_mixed_model_parameters` to search for the best values of c and d on your version of `MixedModel`. You should find the same values as above, but slight differences in your implementation could lead to different results.

`fbh.get_mixed_model_parameters` will return the values of c and d which maximize the performance, as well as the best model. To check how well your best model does on the held-out test set, follow the instructions below.  

In [None]:
# Once you have made changes to the MixedModel and get_mixed_data, uncomment the next two lines

# train_dataloader, val_dataloader, vocab = dc.get_mixed_data()
# results_dict, best_model = fbh.get_mixed_model_parameters(train_dataloader, val_dataloader, vocab);

# Save the PyTorch model
# torch.save(best_model.state_dict(), 'best_model.pt')

# Uncomment the following 4 lines; ensure that the best d and c are updated and mixed_model loads
# correctly. Leave the lines uncommented before submitting.

# d = 10
# c = 10
# mixed_model = fbh.MixedModel(len(vocab), d, VEC_SIZE, c, NUM_CLASS)
# mixed_model.load_state_dict(torch.load('best_model.pt'));

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell