## Train an LSTM model on our Dataset

This notebook trains an LSTM model on our training set. It uses the dataloaders generated by the generate_dataloaders notebook in the data_prep folder, along with our custom functions in generate_dataloaders.py and evaluation.py.

In [4]:
import numpy as np
import matplotlib.pyplot as plt

import torch
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
import torch.nn.functional as F

import pickle as pkl
import os
import datetime as dt
import pandas as pd
import random

from generate_dataloaders import *

from tqdm import tqdm_notebook as tqdm

import evaluation
import importlib
importlib.reload(evaluation)

<module 'evaluation' from '/content/drive/My Drive/Capstone_Hyperparam_Tuning/evaluation.py'>

## Get Dataloaders

In [0]:
seed = 1029
random.seed(seed)  # Python random module
np.random.seed(seed)  # NumPy
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored when CUDA is unavailable
torch.backends.cudnn.enabled = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

def _init_fn(worker_id):
    np.random.seed(int(seed))

In [0]:
data_dir = '../data/' 

#### *Verify filenames are consistent*

In [0]:
train_loader_labelled = pkl.load(open(data_dir + 'train_labeled_dataloader_lstm.p','rb'))
train_loader_unlabelled = pkl.load(open(data_dir + 'train_unlabeled_dataloader_lstm.p','rb'))
val_loader = pkl.load(open(data_dir + 'val_dataloader_lstm.p','rb'))

In [0]:
review_dict = pkl.load(open(data_dir + 'dictionary.p','rb'))

In [0]:
#%conda install pytorch torchvision -c pytorch
## if torch.__version__ is not 1.3.1, run this cell then restart kernel

In [10]:
print(torch.__version__)

1.3.1


## Pre-trained Word Embeddings

We use pretrained GloVe embeddings (the 200-dimensional Twitter vectors), which are already present in the data folder in txt format. We keep only the embeddings for words present in our dataset. Words without a pretrained embedding (including typos) are treated as unknown words and keep an all-zero vector.

In [0]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float16')

In [0]:
def load_embeddings(path):
    with open(path) as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

In [0]:
def build_matrix(review_dict, embedding_index ,dim = 200):
#     embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(review_dict.tokens), dim))
    unknown_words = []
    
    for word, i in review_dict.ids.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            unknown_words.append(word)
    return embedding_matrix, unknown_words

In [0]:
glove_twitter = data_dir + 'glove.twitter.27B.200d.txt'

In [0]:
#os.listdir(data_dir)

In [0]:
embedding_index = load_embeddings(glove_twitter)

In [0]:
glove_embedding_index,unknown_words = build_matrix(review_dict, embedding_index)
del embedding_index

Instead of running the cells above, we could save the processed embedding matrix in pickle format the first time and load it with the following cell. That way we do not have to read the entire GloVe txt file every time.


In [None]:
# glove_embedding_index = pkl.load(open(data_dir + 'glove_embedding_index.p','rb'))
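
The caching idea described above can be sketched as a small load-or-build helper. This is a minimal sketch, not the notebook's own code: `load_or_build` and `build_toy_matrix` are hypothetical names, and the toy builder stands in for the expensive `load_embeddings` / `build_matrix` step.

```python
import os
import pickle as pkl
import tempfile

def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists, else build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pkl.load(f)
    obj = build_fn()
    with open(cache_path, 'wb') as f:
        pkl.dump(obj, f)
    return obj

# Hypothetical stand-in for the expensive GloVe parsing step.
calls = []
def build_toy_matrix():
    calls.append(1)
    return [[0.0, 0.0], [0.1, 0.2]]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'glove_embedding_index.p')
    first = load_or_build(cache_path, build_toy_matrix)   # builds and writes the cache
    second = load_or_build(cache_path, build_toy_matrix)  # reads the cache, no rebuild
```

Opening the file in a `with` block also closes the handle, which the bare `pkl.load(open(...))` pattern does not do explicitly.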

In [18]:
len(review_dict.tokens)

16256

In [0]:
#len(unknown_words)

In [0]:
# for word in unknown_words:
#     print(word)

In [21]:
review_dict.get_id('great')

34

## Neural Network LSTM Class

Our model consists of an embedding layer followed by a bidirectional LSTM layer. The LSTM output at the index of the flagged word is used as the vector representation of the review. The model is trained in two phases:
- Supervised phase: we add a linear layer after the LSTM layer and train a classifier on the labelled training set so that the model learns meaningful vector representations.
- Unsupervised phase: we freeze the model, replace the final linear layer with an identity layer, and run k-means clustering on the complete training set (labelled and unlabelled combined).

NOTE: Each batch from a data loader is unpacked in the code below as:
- tuple: (tokens, labels, flagged_indices)
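
Picking one vector per review at the flagged position comes down to integer-array indexing. The sketch below uses NumPy arrays as a stand-in for the torch tensors (torch supports the same indexing pattern); the shapes and values are toy assumptions, not the notebook's data.

```python
import numpy as np

batch_size, num_tokens, feat = 2, 4, 3
# Stand-in for per-token LSTM outputs: batch_size x num_tokens x feat.
outputs = np.arange(batch_size * num_tokens * feat).reshape(batch_size, num_tokens, feat)
flagged_index = np.array([1, 3])  # one flagged position per review

# Pair row b with position flagged_index[b]; result is batch_size x feat.
selected = outputs[np.arange(batch_size), flagged_index]
```

Here `selected[0]` is the vector at token 1 of the first review and `selected[1]` is the vector at token 3 of the second.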

In [0]:
def freeze_model(model):
    for param in model.parameters():
        param.requires_grad = False
        
def unfreeze_model(model):
    for param in model.parameters():
        param.requires_grad = True

In [0]:
class LSTM_model(nn.Module):
    """
    LSTM classification model using pretrained glove embeddings
    """
    # NOTE: we can't use linear layer until we take weighted average, otherwise it will
    # remember certain positions incorrectly (ie, 4th word has bigger weights vs 7th word)
    def __init__(self, opts):
        super(LSTM_model, self).__init__()
        self.embedding_matrix = opts['embedding_matrix']
        self.vocab_size = self.embedding_matrix.shape[0]
        self.embed_size = self.embedding_matrix.shape[1]

        self.num_hidden_layers = opts['num_hidden_layers']
        self.hidden_size = opts['hidden_size']
        self.dropout = opts['dropout']
        self.num_classes = 2
        self.lambda_loss = opts['lambda_loss']
        
        self.embed = nn.Embedding(self.vocab_size, self.embed_size, padding_idx=0)    
        self.embed.weight = nn.Parameter(torch.tensor(self.embedding_matrix, dtype=torch.float32))
        self.embed.weight.requires_grad = False

        self.lstm = nn.LSTM(self.embed_size, self.hidden_size, self.num_hidden_layers, batch_first=True, dropout=self.dropout, bidirectional=True, bias=True)
        
        self.projection = nn.Linear(2*self.hidden_size, self.num_classes, bias=True)

    
    def forward(self, tokens, flagged_index):
        batch_size, num_tokens = tokens.shape
        embedding = self.embed(tokens)
        # embedding: batch_size x num_tokens x embed_size

        lstm_output = self.lstm(embedding)
        # lstm_output is a tuple: (output, (hidden_state, cell_state)).
        # output: batch_size x num_tokens x 2*hidden_size (bidirectional, batch_first=True)

        logits = self.projection(lstm_output[0])
        # logits: batch_size x num_tokens x num_classes (2) while self.projection is the
        # linear layer; once it is swapped for nn.Identity(), the LSTM output passes through

        # select the row at the flagged index for each item in the batch
        relevant_logits = logits[list(range(batch_size)), flagged_index]
        # relevant_logits: batch_size x num_classes (2)

        return relevant_logits

## 1. Perform Fully-Supervised Learning with Labelled Set 

In [1]:
def train_supervised_model(model, criterion, train_loader_labelled, valid_loader, num_frozen_epochs=10, num_unfrozen_epochs=0, path_to_save=None, print_every=1000, debug_mode=False):

    train_losses=[]
    val_losses=[]
    num_gpus = torch.cuda.device_count()
    if num_gpus > 0:
        current_device = 'cuda'
    else:
        current_device = 'cpu'
    
    empty_centroids = torch.tensor([])
    # freeze part    
    optimizer = torch.optim.Adam(model.parameters(), 0.01, amsgrad=True)
    
    for epoch in range(num_frozen_epochs):
        print('{} | Epoch {}'.format(dt.datetime.now(), epoch))
        model.train()
        total_epoch_loss = 0
        
        for i,(tokens_labelled, labels, flagged_indices_labelled) in tqdm(enumerate(train_loader_labelled)):
            
            tokens_labelled = tokens_labelled.to(current_device)
            flagged_indices_labelled = flagged_indices_labelled.to(current_device)
            labels = labels.to(current_device)

            # forward pass and compute loss
            logits = model(tokens_labelled,flagged_indices_labelled)
            
            loss = criterion(logits, labels)
        
            # run update step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            #Add loss to the epoch loss
            total_epoch_loss += loss.detach()

            if i % print_every == 0:
                losses = loss/len(tokens_labelled)
                print('Average training loss at batch ',i,': %.3f' % losses)
            
        total_epoch_loss /= len(train_loader_labelled.dataset)
        total_epoch_loss = total_epoch_loss.detach()
        train_losses.append(total_epoch_loss)
        print('Average training loss after epoch ',epoch,': %.3f' % total_epoch_loss)
        
        # calculate validation loss after every epoch
        total_validation_loss = 0
        model.eval()
        with torch.no_grad():  # no autograd graph is kept during validation
            for i, (tokens, labels, flagged_indices) in enumerate(valid_loader):
                tokens = tokens.to(current_device)
                labels = labels.to(current_device)
                flagged_indices = flagged_indices.to(current_device)

                # forward pass and compute loss
                logits = model(tokens, flagged_indices)
                loss = criterion(logits, labels)

                # add loss to the validation loss
                total_validation_loss += loss

        total_validation_loss /= len(valid_loader.dataset)
        val_losses.append(total_validation_loss)
        print('Average validation loss after epoch ',epoch,': %.3f' % total_validation_loss)
        if debug_mode:
            print('Train result:')
            TP_cluster, FP_cluster, _ =evaluation.main(model, empty_centroids, train_loader_labelled, criterion, data_dir, current_device)
            print()
            print('Validation result:')
            TP_cluster, FP_cluster, _ =evaluation.main(model, empty_centroids, valid_loader, criterion, data_dir, current_device)
        
        if path_to_save is not None:
            opts = {"embedding_matrix":model.embedding_matrix,\
                    "num_hidden_layers":model.num_hidden_layers,\
                    "hidden_size":model.hidden_size,\
                    "num_classes":model.num_classes}
            torch.save(model.state_dict(), path_to_save + 'model_dict_labelled.pt')
            torch.save(train_losses, path_to_save + 'train_losses_labelled')
            torch.save(val_losses, path_to_save + 'val_losses_labelled')
            torch.save(opts, path_to_save + 'opts_labelled')

    # unfreeze part
    unfreeze_model(model)
    print("*** UNFREEZING ***")    

    optimizer = torch.optim.Adam(model.parameters(), 0.01, amsgrad=True)
    
    for epoch in range(num_unfrozen_epochs):
        print('{} | Epoch {}'.format(dt.datetime.now(), epoch))
        model.train()
        total_epoch_loss = 0

        for i,(tokens_labelled, labels, flagged_indices_labelled) in tqdm(enumerate(train_loader_labelled)):
            
            tokens_labelled = tokens_labelled.to(current_device)
            flagged_indices_labelled = flagged_indices_labelled.to(current_device)
            labels = labels.to(current_device)

            # forward pass and compute loss
            logits = model(tokens_labelled,flagged_indices_labelled)
            
            loss = criterion(logits, labels)
        
            # run update step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            #Add loss to the epoch loss
            total_epoch_loss += loss.detach()

            if i % print_every == 0:
                losses = loss/len(tokens_labelled)
                print('Average training loss at batch ',i,': %.3f' % losses)
            
        total_epoch_loss /= len(train_loader_labelled.dataset)
        total_epoch_loss = total_epoch_loss.detach()
        train_losses.append(total_epoch_loss)
        print('Average training loss after epoch ',epoch,': %.3f' % total_epoch_loss)
        
        # calculate validation loss after every epoch
        total_validation_loss = 0
        model.eval()
        with torch.no_grad():  # no autograd graph is kept during validation
            for i, (tokens, labels, flagged_indices) in enumerate(valid_loader):
                tokens = tokens.to(current_device)
                labels = labels.to(current_device)
                flagged_indices = flagged_indices.to(current_device)

                # forward pass and compute loss
                logits = model(tokens, flagged_indices)
                loss = criterion(logits, labels)

                # add loss to the validation loss
                total_validation_loss += loss

        total_validation_loss /= len(valid_loader.dataset)
        val_losses.append(total_validation_loss)
        print('Average validation loss after epoch ',epoch,': %.3f' % total_validation_loss)
        if debug_mode:
            print('Train result:')
            TP_cluster, FP_cluster, _ =evaluation.main(model, empty_centroids, train_loader_labelled, criterion, data_dir, current_device)
            print()
            print('Validation result:')
            TP_cluster, FP_cluster, _ =evaluation.main(model, empty_centroids, valid_loader, criterion, data_dir, current_device)
        
        
        if path_to_save is not None:
            opts = {"embedding_matrix":model.embedding_matrix,\
                    "num_hidden_layers":model.num_hidden_layers,\
                    "hidden_size":model.hidden_size,\
                    "num_classes":model.num_classes}
            torch.save(model.state_dict(), path_to_save + 'model_dict_labelled.pt')
            torch.save(train_losses, path_to_save + 'train_losses_labelled')
            torch.save(val_losses, path_to_save + 'val_losses_labelled')
            torch.save(opts, path_to_save + 'opts_labelled')

    return model, train_losses, val_losses

## 2. Unsupervised Learning (Clustering)

### Define important functions that will be used during clustering

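The KMeansCriterion class computes the clustering loss: the sum of squared distances between each embedding and its cluster centroid (the nearest centroid for unlabelled data, the centroid of the given label for labelled data). The centroid_init function initializes one centroid per class as the mean embedding of the labelled examples of that class. The update_clusters function accumulates, per cluster, the sum of the embeddings assigned to it and their count; these are later used to compute the new centroid locations.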

In [0]:
class KMeansCriterion(nn.Module):
    
    def __init__(self):
        super().__init__()
    
    def forward(self, embeddings, centroids, labelled = False,  cluster_assignments = None):
        if labelled:
            num_reviews = len(cluster_assignments)
            distances = torch.sum((embeddings[:, None, :] - centroids)**2, 2)
            cluster_distances = distances[list(range(num_reviews)),cluster_assignments]
            loss = cluster_distances.sum()
        else:
            distances = torch.sum((embeddings[:, None, :] - centroids)**2, 2)
            cluster_distances, cluster_assignments = distances.min(1)
            loss = cluster_distances.sum()
        return loss, cluster_assignments

In [0]:
def centroid_init(k, d, dataloader, model, current_device):
    ## Here we ideally don't want to do randomized/zero initialization
    centroid_sums = torch.zeros(k, d).to(current_device)
    centroid_counts = torch.zeros(k).to(current_device)
    for (tokens, labels, flagged_indices) in dataloader:
        # cluster_assignments = torch.LongTensor(tokens.size(0)).random_(k)
        cluster_assignments = labels.to(current_device)
        
        model.eval()
        sentence_embed = model(tokens.to(current_device),flagged_indices.to(current_device))
    
        update_clusters(centroid_sums.detach(), centroid_counts.detach(),
                        cluster_assignments.detach(), sentence_embed.to(current_device).detach())
    
    centroid_means = centroid_sums / centroid_counts[:, None].to(current_device)
    return centroid_means.clone()

def update_clusters(centroid_sums, centroid_counts,
                    cluster_assignments, embeddings):
    k = centroid_sums.size(0)

    centroid_sums.index_add_(0, cluster_assignments, embeddings)
    # keep the counts on the same device as centroid_counts instead of relying on a global
    bin_counts = torch.bincount(cluster_assignments, minlength=k).float().to(centroid_counts.device)
    centroid_counts.add_(bin_counts)
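
One assignment-and-update step of the clustering above can be sketched on toy data. This is an illustrative NumPy version under stated assumptions, not the notebook's torch code: `assign` and `accumulate` are hypothetical names that mirror the unlabelled branch of `KMeansCriterion.forward` and `update_clusters`, and the final division uses plain counts as in `centroid_init` (the training loop divides by counts + 1).

```python
import numpy as np

def assign(embeddings, centroids):
    # Squared Euclidean distance from every point to every centroid: shape (n, k).
    distances = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = distances.argmin(axis=1)
    loss = distances[np.arange(len(embeddings)), assignments].sum()
    return loss, assignments

def accumulate(centroid_sums, centroid_counts, assignments, embeddings):
    # In-place running sums and counts per cluster.
    np.add.at(centroid_sums, assignments, embeddings)
    centroid_counts += np.bincount(assignments, minlength=len(centroid_counts))

embeddings = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
centroids = np.array([[0.0, 1.0], [9.0, 9.0]])

loss, assignments = assign(embeddings, centroids)
sums = np.zeros_like(centroids)
counts = np.zeros(len(centroids))
accumulate(sums, counts, assignments, embeddings)
new_centroids = sums / counts[:, None]
```

Accumulating sums and counts batch by batch lets the centroids be recomputed once per epoch without holding every embedding in memory.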

### Dataloader utility methods

In [0]:
def loadLabelledBatch(train_loader_labelled_iter, train_loader_labelled):
    try:
        tokens, labels, flagged_indices = next(train_loader_labelled_iter)
    except StopIteration:
        train_loader_labelled_iter = iter(train_loader_labelled)
        tokens, labels, flagged_indices = next(train_loader_labelled_iter)

    return tokens, labels, flagged_indices, train_loader_labelled_iter


def loadUnlabelledBatch(train_loader_unlabelled_iter, train_loader_unlabelled):
    try:
        tokens, labels, flagged_indices = next(train_loader_unlabelled_iter)
    except StopIteration:
        train_loader_unlabelled_iter = iter(train_loader_unlabelled)
        tokens, labels, flagged_indices = next(train_loader_unlabelled_iter)

    return tokens, labels, flagged_indices, train_loader_unlabelled_iter
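
Both helpers above follow the same restart-on-exhaustion pattern, which lets the shorter loader cycle while the longer one is consumed. A minimal generic sketch, where `next_batch` is a hypothetical name and a plain list stands in for a DataLoader:

```python
def next_batch(iterator, loader):
    """Return (batch, iterator), restarting the iterator when it is exhausted."""
    try:
        batch = next(iterator)
    except StopIteration:
        iterator = iter(loader)
        batch = next(iterator)
    return batch, iterator

loader = [('a',), ('b',)]  # toy stand-in for a DataLoader with two batches
iterator = iter(loader)
seen = []
for _ in range(5):
    batch, iterator = next_batch(iterator, loader)
    seen.append(batch[0])
```

The caller must keep the returned iterator, since a fresh one is created whenever the old one runs out.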

In [0]:
def train_clusters(model, centroids, criterion, train_loader_labelled, train_loader_unlabelled, valid_loader, num_epochs=15, num_batches = 1000, path_to_save=None, print_every = 1000):

    train_loader_labelled_iter = iter(train_loader_labelled)
    train_loader_unlabelled_iter = iter(train_loader_unlabelled)
    lambda_loss = model.lambda_loss

    train_losses=[]
    val_losses=[]
    num_gpus = torch.cuda.device_count()
    if num_gpus > 0:
        current_device = 'cuda'
    else:
        current_device = 'cpu'
    
    optimizer = torch.optim.Adam(model.parameters(), 0.01, amsgrad=True)
    
    for epoch in range(num_epochs):
        print('{} | Epoch {}'.format(dt.datetime.now(), epoch))
        model.eval() # we're only clustering, not training model
        k, d = centroids.size()
        centroid_sums = torch.zeros_like(centroids).to(current_device)
        centroid_counts = torch.zeros(k).to(current_device)
        total_epoch_loss = 0
        
        for i in tqdm(range(int(num_batches))):
            tokens_labelled, labels, flagged_indices_labelled, train_loader_labelled_iter = loadLabelledBatch(train_loader_labelled_iter, train_loader_labelled)
            tokens_unlabelled, _, flagged_indices_unlabelled, train_loader_unlabelled_iter = loadUnlabelledBatch(train_loader_unlabelled_iter, train_loader_unlabelled)

            tokens_labelled = tokens_labelled.to(current_device)
            labels = labels.to(current_device)
            flagged_indices_labelled = flagged_indices_labelled.to(current_device)
            
            tokens_unlabelled = tokens_unlabelled.to(current_device)
            flagged_indices_unlabelled = flagged_indices_unlabelled.to(current_device)

            # forward pass and compute loss
            sentence_embed_labelled = model(tokens_labelled,flagged_indices_labelled)
            sentence_embed_unlabelled = model(tokens_unlabelled,flagged_indices_unlabelled)
            
            cluster_loss_unlabelled, cluster_assignments_unlabelled = criterion(sentence_embed_unlabelled, centroids.detach())
            cluster_loss_labelled, cluster_assignments_labelled = criterion(sentence_embed_labelled, centroids.detach(), labelled = True, cluster_assignments = labels)
    
            # lambda_loss weights the labelled term of the logged loss; the centroid
            # update below depends only on the cluster assignments
            if i % print_every == 0:
                print('lambda_loss:', lambda_loss)
            total_batch_loss = cluster_loss_unlabelled.data + lambda_loss * cluster_loss_labelled.data
            
#             #Add loss to the epoch loss
            total_epoch_loss += total_batch_loss.data

#             # store centroid sums and counts in memory for later centering
            update_clusters(centroid_sums.detach(), centroid_counts.detach(),
                            cluster_assignments_labelled.detach(), sentence_embed_labelled.detach())
    
            update_clusters(centroid_sums.detach(), centroid_counts.detach(),
                            cluster_assignments_unlabelled.detach(), sentence_embed_unlabelled.detach())

            if i % print_every == 0:
                losses = total_batch_loss/(len(tokens_labelled)+ len(tokens_unlabelled))
                print('Average training loss at batch ',i,': %.3f' % losses)
            
        total_epoch_loss /= (len(train_loader_labelled.dataset)+len(train_loader_unlabelled.dataset))
        train_losses.append(total_epoch_loss)
        print('Average training loss after epoch ',epoch,': %.3f' % total_epoch_loss)
        
        # update centroids from the accumulated sums and counts (+1 avoids division by zero)
        centroids = centroid_sums / (centroid_counts[:, None] + 1).to(current_device)
        
        # calculate validation loss after every epoch
        total_validation_loss = 0
        for i, (tokens, labels, flagged_indices) in enumerate(valid_loader):
            model.eval()
            tokens = tokens.to(current_device)
            labels = labels.to(current_device)
            flagged_indices = flagged_indices.to(current_device)
            
            # forward pass and compute loss
            sentence_embed = model(tokens,flagged_indices)
            cluster_loss, cluster_assignments = criterion(sentence_embed, centroids)
            
            #Add loss to the validation loss
            total_validation_loss += cluster_loss.data

        total_validation_loss /= len(valid_loader.dataset)
        val_losses.append(total_validation_loss)
        print('Average validation loss after epoch ',epoch,': %.3f' % total_validation_loss)
        
        if path_to_save is not None:
            opts = {"embedding_matrix":model.embedding_matrix,\
                    "num_hidden_layers":model.num_hidden_layers,\
                    "hidden_size":model.hidden_size,\
                    "num_classes":model.num_classes}
            torch.save(model.state_dict(), path_to_save+'model_dict_unlabelled.pt')
            torch.save(centroids, path_to_save+'centroids_unlabelled')
            torch.save(train_losses, path_to_save+'train_losses_unlabelled')
            torch.save(val_losses, path_to_save+'val_losses_unlabelled')
            torch.save(opts, path_to_save+'opts_unlabelled')
        
    return model, centroids, train_losses, val_losses

# Hyperparameter Tuning

In [0]:
num_gpus = torch.cuda.device_count()
if num_gpus > 0:
    current_device = 'cuda'
else:
    current_device = 'cpu'

In [0]:
def get_save_directory(opts):
    model_folder = 'lstm_model/'
    model_dir = '../model_outputs/' + model_folder
    
    # subfolder for each hyperparam config
    num_unfrozen_epochs = opts['num_unfrozen_epochs']
    num_hidden_layers = opts['num_hidden_layers']
    hidden_size = opts['hidden_size']
    dropout = opts['dropout']
    lambda_loss = opts['lambda_loss']
    subfolder = "num_unfrozen_epochs="+str(num_unfrozen_epochs) \
                + ",num_hidden_layers="+str(num_hidden_layers) \
                + ",hidden_size="+str(hidden_size) \
                + ",dropout="+str(dropout) \
                + ",lambda="+str(lambda_loss) + '/'
    
    # create the directory if needed; an existing directory is not an error
    os.makedirs(model_dir + subfolder, exist_ok=True)
    
    return model_dir + subfolder
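
The directory naming scheme can be exercised in isolation. The sketch below is an assumption-labelled variant, not the function above: `config_subfolder` is a hypothetical helper, the key list omits `lambda_loss`, and a temporary directory replaces `../model_outputs/`.

```python
import os
import tempfile

def config_subfolder(opts, keys):
    """Join 'key=value' pairs in a fixed order to name a run directory."""
    return ','.join('{}={}'.format(k, opts[k]) for k in keys)

keys = ['num_unfrozen_epochs', 'num_hidden_layers', 'hidden_size', 'dropout']
opts = {'num_unfrozen_epochs': 2, 'num_hidden_layers': 3, 'hidden_size': 256, 'dropout': 0}
name = config_subfolder(opts, keys)

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    os.makedirs(path, exist_ok=True)  # second call is a no-op, no exception
    created = os.path.isdir(path)
```

Because the name is derived only from the hyperparameters, rerunning a configuration overwrites the files saved by the earlier run.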

## Phase 1: Supervised Model

In [0]:
def train_config_supervised(opts):
    path_to_save = get_save_directory(opts)
    print(path_to_save)
    
    # supervised part -- embeddings
    model = LSTM_model(opts).to(current_device)
    criterion = nn.CrossEntropyLoss(reduction='sum')
    num_unfrozen_epochs = opts['num_unfrozen_epochs']
    train_supervised_model(model, criterion, train_loader_labelled, val_loader, num_unfrozen_epochs=num_unfrozen_epochs, path_to_save=path_to_save)
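
The criterion uses `reduction='sum'`, so the training loop can add up batch losses and divide once by the dataset size to get a per-example average, even when the last batch is smaller. A pure-Python sketch of that arithmetic, where `cross_entropy_sum` is a hypothetical helper written from the standard softmax cross-entropy definition and the logits are toy values:

```python
import math

def cross_entropy_sum(logits_batch, labels):
    """Sum of per-example softmax cross-entropy losses for a batch of logit rows."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        log_norm = math.log(sum(math.exp(z) for z in logits))
        total += log_norm - logits[label]
    return total

# Two batches of unequal size.
batches = [
    ([[2.0, 0.0], [0.0, 1.0]], [0, 1]),
    ([[0.5, 0.5]], [1]),
]
num_examples = sum(len(labels) for _, labels in batches)
epoch_loss = sum(cross_entropy_sum(x, y) for x, y in batches) / num_examples
```

With a mean-reduced criterion, averaging the batch means would weight the smaller final batch too heavily; summing first avoids that.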

In [33]:
num_hidden_layers_list = [3]
hidden_sizes = [256]
dropouts = [0]
num_unfrozen_epochs_list = [2]
lambda_loss = None  # NOT TRAINING THIS YET

## NEXT: 3 hidden layers of 256 w/ 2 unfrozen epochs... dropouts = [.25, .5]

for num_hidden_layers in num_hidden_layers_list:
    for hidden_size in hidden_sizes:
        for dropout in dropouts:
            for num_unfrozen_epochs in num_unfrozen_epochs_list:
                opts = {
                    'embedding_matrix': glove_embedding_index,
                    'num_hidden_layers': num_hidden_layers,
                    'hidden_size': hidden_size,
                    'dropout': dropout,
                    'num_unfrozen_epochs': num_unfrozen_epochs,
                    'lambda_loss': lambda_loss
                }
                train_config_supervised(opts)

/content/drive/My Drive/Capstone_Hyperparam_Tuning/models/lstm_unfrozen_model/num_unfrozen_epochs=2,num_hidden_layers=3,hidden_size=256,dropout=0,lambda=None/
2019-12-10 04:44:52.842209 | Epoch 0
Average training loss at batch  0 : 0.691
Average training loss after epoch  0 : 0.454
Average validation loss after epoch  0 : 0.349
2019-12-10 04:45:10.401473 | Epoch 1
Average training loss at batch  0 : 0.325
Average training loss after epoch  1 : 0.263
Average validation loss after epoch  1 : 0.408
2019-12-10 04:45:26.189199 | Epoch 2
Average training loss at batch  0 : 0.089
Average training loss after epoch  2 : 0.214
Average validation loss after epoch  2 : 0.316
2019-12-10 04:45:41.655765 | Epoch 3
Average training loss at batch  0 : 0.226
Average training loss after epoch  3 : 0.163
Average validation loss after epoch  3 : 0.299
2019-12-10 04:45:57.382531 | Epoch 4
Average training loss at batch  0 : 0.174
Average training loss after epoch  4 : 0.128
Average validation loss after epoch  4 : 0.373
2019-12-10 04:46:13.180964 | Epoch 5
Average training loss at batch  0 : 0.201
Average training loss after epoch  5 : 0.100
Average validation loss after epoch  5 : 0.431
2019-12-10 04:46:28.590347 | Epoch 6
Average training loss at batch  0 : 0.152
Average training loss after epoch  6 : 0.111
Average validation loss after epoch  6 : 0.586
2019-12-10 04:46:44.236803 | Epoch 7
Average training loss at batch  0 : 0.007
Average training loss after epoch  7 : 0.109
Average validation loss after epoch  7 : 0.398
2019-12-10 04:47:00.133263 | Epoch 8
Average training loss at batch  0 : 0.176
Average training loss after epoch  8 : 0.089
Average validation loss after epoch  8 : 0.403
2019-12-10 04:47:15.978988 | Epoch 9
Average training loss at batch  0 : 0.010
Average training loss after epoch  9 : 0.076
Average validation loss after epoch  9 : 0.367
*** UNFREEZING ***
2019-12-10 04:47:32.010351 | Epoch 0
Average training loss at batch  0 : 0.045
Average training loss after epoch  0 : 0.097
Average validation loss after epoch  0 : 0.403
2019-12-10 04:47:48.407328 | Epoch 1
Average training loss at batch  0 : 0.020
Average training loss after epoch  1 : 0.080


RuntimeError: ignored

In [0]:
#sorted(os.listdir('models/lstm_unfrozen_model/'))

## Phase 2: Unsupervised / Clustering Model

In [0]:
def train_config_unsupervised(opts):    
    # get load directory
    opts_load = opts.copy()
    opts_load['lambda_loss'] = None
    path_to_load = get_save_directory(opts_load)
    print(path_to_load)
    
    # load 
    model = LSTM_model(opts)
    print(model.lambda_loss)
    model.load_state_dict(torch.load(path_to_load+'model_dict_labelled.pt',map_location=lambda storage, loc: storage))
    print(model.lambda_loss)
    model.lambda_loss = opts['lambda_loss']
    print(model.lambda_loss)
    model = model.to(current_device)
    
    # get save directory
    path_to_save = get_save_directory(opts)
    print(path_to_save)
    
    # unsupervised part -- assign clusters to unlabelled data
    model.projection = nn.Identity()
    centroids = centroid_init(2, 2*model.hidden_size, train_loader_labelled, model, current_device)
    criterion = KMeansCriterion().to(current_device)    
    num_batches = int(len(train_loader_unlabelled.dataset)/train_loader_unlabelled.batch_size)+1
    train_clusters(model, centroids, criterion, train_loader_labelled, train_loader_unlabelled, val_loader, num_epochs=4, num_batches=num_batches, path_to_save=path_to_save)

In [0]:
num_hidden_layers = 3  # BEST PARAM
hidden_size = 128  # BEST PARAM
dropout = 0  # BEST PARAM
num_unfrozen_epochs = 1 # BEST PARAM
lambda_losses = [.1, .5, 1, 5, 10, 25]

for lambda_loss in lambda_losses:
  opts = {
      'embedding_matrix': glove_embedding_index,
      'num_hidden_layers': num_hidden_layers,
      'hidden_size': hidden_size,
      'dropout': dropout,
      'num_unfrozen_epochs': num_unfrozen_epochs,
      'lambda_loss': lambda_loss
  }  
  train_config_unsupervised(opts)

# Evaluate Model

In [0]:
num_gpus = torch.cuda.device_count()
if num_gpus > 0:
    current_device = 'cuda'
else:
    current_device = 'cpu'

## Phase 1: Supervised Model

In [0]:
def evaluate_config_supervised(opts,verbose=True):
    path_to_save = get_save_directory(opts)
    #print(path_to_save)
    
    model = LSTM_model(opts) #change here depending on model
    model.load_state_dict(torch.load(path_to_save+'model_dict_labelled.pt',map_location=lambda storage, loc: storage))
    model = model.to(current_device)
    criterion = nn.CrossEntropyLoss(reduction='sum')
    criterion = criterion.to(current_device)
    
    empty_centroids = torch.tensor([])
    TP_cluster, FP_cluster, results_dict = evaluation.main(model, empty_centroids, val_loader, criterion, data_dir, current_device, verbose)
    results_dict.update(opts)
    return TP_cluster, FP_cluster, results_dict


In [0]:
num_hidden_layers_list = [1, 2, 3]
hidden_sizes = [128, 256]
dropouts = [0, .25, .5]
num_unfrozen_epochs_list = [0, 1, 2]
lambda_loss = None  # NOT TRAINING THIS YET

results_df = pd.DataFrame()
for num_hidden_layers in num_hidden_layers_list:
    for hidden_size in hidden_sizes:
        for dropout in dropouts:
            for num_unfrozen_epochs in num_unfrozen_epochs_list:
                if num_hidden_layers == 1 and dropout > 0:
                  continue
                if num_hidden_layers == 3 and num_unfrozen_epochs == 2:
                  continue
                opts = {
                    'embedding_matrix': glove_embedding_index,
                    'num_hidden_layers': num_hidden_layers,
                    'hidden_size': hidden_size,
                    'dropout': dropout,
                    'num_unfrozen_epochs': num_unfrozen_epochs,
                    'lambda_loss': lambda_loss
                }
                _, _, results_dict = evaluate_config_supervised(opts,False)
                results_df = results_df.append(results_dict,ignore_index=True)
                
results_df = results_df[['num_unfrozen_epochs','num_hidden_layers','hidden_size','dropout','Accuracy','F1 score','Precision','Recall',
                        'TP_rate','FP_rate','FN_rate','TN_rate']].sort_values(['num_unfrozen_epochs','num_hidden_layers'])
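
The nested loops above enumerate a filtered grid. The same enumeration can be sketched with `itertools.product`, which makes the number and order of evaluated configurations easy to check; `is_skipped` is a hypothetical helper that restates the two `continue` conditions. The first exclusion is presumably there because `nn.LSTM` applies dropout only between stacked layers, so it has no effect with a single layer.

```python
from itertools import product

num_hidden_layers_list = [1, 2, 3]
hidden_sizes = [128, 256]
dropouts = [0, .25, .5]
num_unfrozen_epochs_list = [0, 1, 2]

def is_skipped(num_hidden_layers, hidden_size, dropout, num_unfrozen_epochs):
    # Same two exclusions as the loop above.
    if num_hidden_layers == 1 and dropout > 0:
        return True
    if num_hidden_layers == 3 and num_unfrozen_epochs == 2:
        return True
    return False

# product iterates in the same order as the nested for-loops.
configs = [
    c for c in product(num_hidden_layers_list, hidden_sizes, dropouts, num_unfrozen_epochs_list)
    if not is_skipped(*c)
]
```

Each position in `configs` corresponds to the row index that `results_df.append(..., ignore_index=True)` assigns, so an index in the results table can be mapped back to its configuration.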

In [0]:
#results_df

In [40]:
results_df.sort_values(['F1 score'],ascending=False).head(10)

Unnamed: 0,num_unfrozen_epochs,num_hidden_layers,hidden_size,dropout,Accuracy,F1 score,Precision,Recall,TP_rate,FP_rate,FN_rate,TN_rate
25,1.0,3.0,128.0,0.0,0.802124,0.826593,0.943231,0.735627,0.943231,0.056769,0.338983,0.661017
29,1.0,3.0,128.0,0.5,0.798174,0.823425,0.941176,0.731861,0.941176,0.058824,0.344828,0.655172
21,0.0,2.0,256.0,0.5,0.788836,0.817658,0.946903,0.719458,0.946903,0.053097,0.369231,0.630769
12,0.0,2.0,128.0,0.5,0.787126,0.815501,0.940919,0.719585,0.940919,0.059081,0.366667,0.633333
7,1.0,2.0,128.0,0.0,0.788932,0.814325,0.92569,0.726877,0.92569,0.07431,0.347826,0.652174
6,0.0,2.0,128.0,0.0,0.783999,0.814238,0.946785,0.714246,0.946785,0.053215,0.378788,0.621212
2,2.0,1.0,128.0,0.0,0.786552,0.812004,0.921941,0.725493,0.921941,0.078059,0.348837,0.651163
8,2.0,2.0,128.0,0.0,0.784668,0.811883,0.929336,0.720787,0.929336,0.070664,0.36,0.64
16,1.0,2.0,256.0,0.0,0.78187,0.811782,0.940789,0.713888,0.940789,0.059211,0.377049,0.622951
14,2.0,2.0,128.0,0.5,0.780889,0.811411,0.942731,0.712203,0.942731,0.057269,0.380952,0.619048
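The table is ranked by F1 score, which is the harmonic mean of precision and recall. The sketch below recomputes it for the top row as a sanity check on the reported columns. The function name `f1_from_precision_recall` is hypothetical; the numbers are copied from the row with index 25.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the top row of the table above (index 25)
precision, recall, reported_f1 = 0.943231, 0.735627, 0.826593

f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 6), reported_f1)

# The recomputed value should match the reported one up to rounding
assert abs(f1 - reported_f1) < 1e-5
```

In the same table, `TP_rate` repeats the `Precision` column and `FP_rate` equals `1 - Precision`, which suggests these rates are normalized over predicted classes. How `evaluation.py` actually defines them is not shown in this section, so treat that reading as an assumption.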


## Phase 2: Clustering / Unsupervised

In [0]:
def evaluate_config_unsupervised(opts,verbose=True):
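    path_to_save = get_save_directory(opts)

    model = LSTM_model(opts)  # change here depending on model
    # Replace the projection layer with a no-op before loading the checkpoint
    model.projection = nn.Identity()
    # map_location keeps the loaded tensors on CPU; the model is moved to current_device below
    model.load_state_dict(torch.load(path_to_save+'model_dict_unlabelled.pt',map_location=lambda storage, loc: storage))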
    model = model.to(current_device)
    criterion = KMeansCriterion()
    criterion = criterion.to(current_device)
    centroids = torch.load(path_to_save+'centroids_unlabelled',map_location=lambda storage, loc: storage).to(current_device)
    
    TP_cluster, FP_cluster, results_dict = evaluation.main(model, centroids, val_loader, criterion, data_dir, current_device, verbose)
    results_dict.update(opts)
    return TP_cluster, FP_cluster, results_dict


In [0]:
num_hidden_layers = 3  # BEST PARAM
hidden_size = 128  # BEST PARAM
dropout = 0  # BEST PARAM
num_unfrozen_epochs = 1 # BEST PARAM
lambda_losses = [.1, .5, 1, 5, 10, 25]

#results_df2 = results_df.copy()
results_df = pd.DataFrame()
for lambda_loss in lambda_losses:
    opts = {
        'embedding_matrix': glove_embedding_index,
        'num_hidden_layers': num_hidden_layers,
        'hidden_size': hidden_size,
        'dropout': dropout,
        'num_unfrozen_epochs': num_unfrozen_epochs,
        'lambda_loss': lambda_loss
    }
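    # NOTE: this calls evaluate_config_supervised, which loads model_dict_labelled.pt
    # and passes empty centroids. evaluate_config_unsupervised (defined above) is
    # likely what was intended for this sweep; the identical rows in the output
    # below are consistent with lambda_loss having no effect on this code path.
    _, _, results_dict = evaluate_config_supervised(opts,False)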
    results_df = results_df.append(results_dict,ignore_index=True)
                
results_df = results_df[['lambda_loss','num_hidden_layers','hidden_size','dropout','num_unfrozen_epochs','Accuracy','F1 score','Precision','Recall',
                        'TP_rate','FP_rate','FN_rate','TN_rate']]

In [64]:
results_df

index,lambda_loss,num_hidden_layers,hidden_size,dropout,num_unfrozen_epochs,Accuracy,F1 score,Precision,Recall,TP_rate,FP_rate,FN_rate,TN_rate
0,0.1,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
1,0.5,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
2,1.0,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
3,5.0,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
4,10.0,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
5,25.0,3.0,128.0,0.0,1.0,0.774745,0.80777,0.946548,0.704483,0.946548,0.053452,0.397059,0.602941
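Every row above reports the same metrics regardless of `lambda_loss`. A small check like the following sketch makes a flat sweep of this kind visible before the results are interpreted. The helper name `constant_columns` is hypothetical, and the two rows are copied from the first and last lines of the table (metric columns only).

```python
def constant_columns(rows, ignore=()):
    """Return the keys whose value is identical in every row (hypothetical helper)."""
    if not rows:
        return []
    return [
        key for key in rows[0]
        if key not in ignore and len({row[key] for row in rows}) == 1
    ]


# First and last rows of the table above, metric columns only
rows = [
    {'lambda_loss': 0.1, 'Accuracy': 0.774745, 'F1 score': 0.80777,
     'Precision': 0.946548, 'Recall': 0.704483},
    {'lambda_loss': 25.0, 'Accuracy': 0.774745, 'F1 score': 0.80777,
     'Precision': 0.946548, 'Recall': 0.704483},
]

# Metrics that never change across the sweep deserve a second look
flat = constant_columns(rows, ignore=('lambda_loss',))
print(flat)
```

When the results are already in a DataFrame, `results_df.nunique()` gives the same information per column. If every metric is constant across the swept parameter, the parameter most likely never reached the evaluated model.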


## Save Embeddings for Plot

In [None]:
model_folder = 'lstm_model/'
save_dir = '../umap/' + model_folder

In [None]:
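# Embed the validation set, then append the centroids so they can be plotted too.
# `model` and `centroids` must already exist in the notebook's global scope
# (the evaluate_* helpers above only create them locally).
val_embed_labelled = []
val_labels_lst = []

model.eval()  # evaluation mode (disables dropout); only needs to be set once
with torch.no_grad():  # inference only, so no gradients are tracked
    for tokens, labels, flagged_indices in val_loader:
        tokens = tokens.to(current_device)
        flagged_indices = flagged_indices.to(current_device)

        # forward pass: one embedding per sentence (no loss is computed here)
        sentence_embed = model(tokens, flagged_indices)

        val_embed_labelled += sentence_embed.tolist()
        val_labels_lst += labels.tolist()

# The two labels appended here assume `centroids` has exactly two rows
val_embed_labelled += centroids.tolist()
val_labels_lst += [0, 1]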

In [None]:
# Embed the labelled training set
embed_labelled = []
labels_lst = []

model.eval()  # evaluation mode (disables dropout); only needs to be set once
with torch.no_grad():  # inference only, so no gradients are tracked
    for tokens, labels, flagged_indices in train_loader_labelled:
        tokens = tokens.to(current_device)
        flagged_indices = flagged_indices.to(current_device)

        # forward pass: one embedding per sentence (no loss is computed here)
        sentence_embed = model(tokens, flagged_indices)

        embed_labelled += sentence_embed.tolist()
        labels_lst += labels.tolist()

In [None]:
os.makedirs(save_dir, exist_ok=True)  # create the output folder if it does not exist

# pickle is imported as `pkl` at the top of the notebook
outputs = {
    "val_embed_labelled.pickle": val_embed_labelled,
    "val_labels_lst.pickle": val_labels_lst,
    "embed_labelled.pickle": embed_labelled,
    "labels.pickle": labels_lst,
}
for filename, obj in outputs.items():
    # the with-block closes each file even if dump() raises
    with open(save_dir + filename, "wb") as f:
        pkl.dump(obj, f)
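The saved files are meant to be read back by the plotting code. As a minimal sketch of that round trip, the example below writes two small stand-in lists to a temporary folder and loads them again. The helper `load_pickle`, the folder `demo_dir`, and the demo values are all hypothetical; only the file names `embed_labelled.pickle` and `labels.pickle` come from the cell above.

```python
import os
import pickle as pkl
import tempfile


def load_pickle(path):
    """Read one pickled object back from disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return pkl.load(f)


# Stand-in data and a temporary folder; the notebook writes real
# embeddings and labels to save_dir instead.
demo_dir = tempfile.mkdtemp()
demo_embed = [[0.1, 0.2], [0.3, 0.4]]
demo_labels = [0, 1]

with open(os.path.join(demo_dir, "embed_labelled.pickle"), "wb") as f:
    pkl.dump(demo_embed, f)
with open(os.path.join(demo_dir, "labels.pickle"), "wb") as f:
    pkl.dump(demo_labels, f)

embed = load_pickle(os.path.join(demo_dir, "embed_labelled.pickle"))
labels = load_pickle(os.path.join(demo_dir, "labels.pickle"))

# Each embedding row needs exactly one label for the plot to line up
assert len(embed) == len(labels)
```

The length check is worth keeping in the real plotting code as well, since the validation lists carry extra entries for the centroids.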