## Basic-ish LSTM Model

This script is an implementation of a fairly basic LSTM model.  This can be run with Grasser's duplicated data or the full data set (which includes the scraped data and does not contain review duplicates).

Here are the details of the model:
- It uses a 300-dimensional, trainable embedding initialized to the GloVe vectors
- It uses a 2-layer, bidirectional recurrent network
- The final layer is a linear layer that takes 300 inputs and outputs 3 dimensions

The model is trained for 30 epochs. The model with the highest Cohen's Kappa is saved and evaluated on the test data.  Additional implementation details are provided throughout the script.
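
Because model selection relies on Cohen's Kappa rather than accuracy, a small sketch of the statistic may help. The function name `cohens_kappa` and the toy labels are illustrative assumptions, not the project's `Evaluator` code, and the unweighted form is assumed because the script does not say whether a weighted variant is used.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where both labels match.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the two marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_expected == 1.0:
        # Degenerate case (a single label everywhere); kappa is undefined,
        # and this sketch chooses to report perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: an imbalanced label set and a majority-class predictor.
y_true = [2, 2, 2, 2, 0, 1]
always_majority = [2] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
kappa_majority = cohens_kappa(y_true, always_majority)

# A predictor that makes one mistake on balanced labels.
kappa_good = cohens_kappa([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```

In the toy example the majority-class predictor is right on four of six items yet scores a kappa of zero, which is why kappa is a stricter selection criterion than accuracy on imbalanced labels.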

##### Note: This script is meant to be run on Google Colab and assumes a particular drive file structure.  This may not have been perfectly maintained in moving the repository to git.

### Access Google Drive

In [None]:
from google.colab import drive
#drive may already be mounted. allows access to drive data
drive.mount('/content/drive')

Mounted at /content/drive


#### Import our scripts

In [None]:
import sys
sys.path.append('./drive/MyDrive/drugproject/2_scripts')

from create_dataloader import get_dataloader
from collate import make_collate_rnn, make_collate_rnn_lemmas, make_collate_plus
from Trainer import Trainer
from Evaluator import Evaluator
from ModelContext import ModelContext
from Plotter import Plotter
from Models import BasicLSTM
# from Evaluator import Evaluator

No module named 'boto3'
No module named 'boto3'


#### Import torch and other packages

In [None]:
import torch
from torchtext.vocab import Vocab, GloVe
from torch.utils.data.dataset import random_split
from torch import nn
from torch import optim
from torch.nn.utils.rnn import pad_sequence, pad_packed_sequence, pack_padded_sequence
import torch.nn.functional as F

from collections import Counter
from csv import reader
import json
import time
import random
import numpy as np


Set up seed and cuda.

In [None]:
SEED = 6624 # Specify a seed for reproducibility
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)  # use the same seed on the GPU
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


### Set parameters for run

data_set must be set to one of ("full", "dev", "grasser")

In [None]:
data_set = "dev"
lemmas = False
weights = True
additional_text_features = []

In [None]:
DATA = './drive/MyDrive/drugproject/3_data'

RESULTS_DIR = './drive/MyDrive/drugproject/4_results'

if data_set == "dev":
    TRAIN_FILE = f'{DATA}/grasser_data_for_lstm/train_small_df.csv'
    VALID_FILE = f'{DATA}/grasser_data_for_lstm/valid_small_df.csv'
    TEST_FILE = f'{DATA}/grasser_data_for_lstm/test_small_df.csv'
    RESULT_NAME = "basic_lstm_small"
    BATCH_SIZE = 2
    LOG_INTERVAL = 10
    VOCAB_PATH = './drive/MyDrive/drugproject/3_data/vocabs/full_vocab.json'
elif data_set == "grasser":
    TRAIN_FILE = f'{DATA}/grasser_data_for_lstm/train_full_df.csv'
    VALID_FILE = f'{DATA}/grasser_data_for_lstm/valid_full_df.csv'
    TEST_FILE = f'{DATA}/grasser_data_for_lstm/test_full_df.csv'  
    RESULT_NAME = 'basic_lstm_grasser'
    LOG_INTERVAL = 100
    BATCH_SIZE = 128
    VOCAB_PATH = './drive/MyDrive/drugproject/3_data/vocabs/grasser_vocab.json'
elif data_set == "full":
    TRAIN_FILE = f'{DATA}/grasser_data_for_lstm/train_fi.csv'
    VALID_FILE = f'{DATA}/grasser_data_for_lstm/valid_grasser_data.csv'
    TEST_FILE = f'{DATA}/grasser_data_for_lstm/test_grasser_data.csv'  
    RESULT_NAME = 'basic_lstm_full'
    LOG_INTERVAL = 100
    BATCH_SIZE = 128
    VOCAB_PATH = './drive/MyDrive/drugproject/3_data/vocabs/full_vocab.json'

if weights:
    RESULT_NAME += '_weights'

if lemmas:
    VOCAB_PATH = VOCAB_PATH[:-4] + '_lemmas.json'
    RESULT_NAME += '_lemmas'

for col in additional_text_features:
    RESULT_NAME += ('_' + col)

SHUFFLE = True


#### Load vocab

Our vocabs are already created and saved as JSON files. This is necessary because our tokenizer is very slow.

In [None]:
with open(VOCAB_PATH, "r") as f:
    counter = json.loads(f.read())

counter = Counter(counter)
review_vocab = Vocab(counter, min_freq=100)
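
To make the effect of `min_freq=100` concrete, the sketch below mimics the filtering step in plain Python. It approximates the legacy `torchtext` `Vocab` behaviour as I understand it (special tokens first, then tokens at or above the threshold in descending frequency order) and is not the library's code. The helper `build_itos` and the toy counts are hypothetical.

```python
from collections import Counter


def build_itos(counter, min_freq=1, specials=("<unk>", "<pad>")):
    """Return an index-to-string list, dropping tokens rarer than min_freq."""
    itos = list(specials)
    # Sort alphabetically first, then by descending frequency; the second
    # sort is stable, so ties stay in alphabetical order.
    ordered = sorted(counter.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    for token, freq in ordered:
        if freq < min_freq:
            break  # everything after this point is rarer still
        if token not in specials:
            itos.append(token)
    return itos


# Made-up token counts for illustration only.
toy_counts = Counter({"pain": 250, "relief": 120, "dizzy": 99, "zzz": 3})
itos = build_itos(toy_counts, min_freq=100)
stoi = {token: index for index, token in enumerate(itos)}

# Tokens below the threshold fall back to the <unk> index.
unk_index = stoi["<unk>"]
ids = [stoi.get(token, unk_index) for token in ["pain", "dizzy", "relief"]]
```

Here `dizzy` occurs 99 times, just under the threshold, so it maps to the `<unk>` index at lookup time.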

### Load Dataloader

In [None]:
if lemmas:
    collate_fn = make_collate_rnn_lemmas(review_vocab, device=device)
else:
    collate_fn = make_collate_rnn(review_vocab, device=device)

optional_cols = []
encoding_cols = {}

if additional_text_features:

    # we know these from creating the dataset
    # moving forward we'd need a better way to dynamically get this information
    encoding_lengths={
        "pos_encoding": 19,
        "dep_encoding": 45
    }
    encoding_cols = {'pos_encoding': 'pos_encoding_count',
                     'dep_encoding': 'dep_encoding_count'}
    # keep only the encodings that were actually requested
    encoding_cols = {k: v for k, v in encoding_cols.items() if k in additional_text_features}
    collate_fn = make_collate_plus(device=device, vocab=review_vocab,
                                   encoded_cols=additional_text_features,
                                   encoding_lengths=encoding_lengths)



train_dataloader = get_dataloader(TRAIN_FILE,  BATCH_SIZE, SHUFFLE,
                                  optional_cols=additional_text_features,
                                  encoded_cols=encoding_cols,
                                  collate=collate_fn)
valid_dataloader = get_dataloader(VALID_FILE,  BATCH_SIZE, SHUFFLE,
                                  optional_cols=additional_text_features,
                                  encoded_cols=encoding_cols,
                                  collate=collate_fn)
test_dataloader = get_dataloader(TEST_FILE,  BATCH_SIZE, SHUFFLE,
                                  optional_cols=additional_text_features,
                                  encoded_cols=encoding_cols,
                                  collate=collate_fn)


get dataloader called
reading file ./drive/MyDrive/drugproject/3_data/grasser_data_for_lstm/train_grasser_data.csv
loading line 0
loading line 10000
loading line 20000
loading line 30000
loading line 40000
loading line 50000
loading line 60000
loading line 70000
loading line 80000
loading line 90000
loading line 100000
loading line 110000
loading line 120000
loading line 130000
get dataloader called
reading file ./drive/MyDrive/drugproject/3_data/grasser_data_for_lstm/valid_grasser_data.csv
loading line 0
loading line 10000
loading line 20000
get dataloader called
reading file ./drive/MyDrive/drugproject/3_data/grasser_data_for_lstm/test_grasser_data.csv
loading line 0
loading line 10000
loading line 20000
loading line 30000
loading line 40000
loading line 50000


### Load Embedding
(GloVe for now)

In [None]:
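embedding_type = 'glove'  # only GloVe is supported for now

if embedding_type == 'glove':
    # compare the device type; `device` is a torch.device, not a string
    if device.type == 'cpu':
        VECTORS_CACHE_DIR = '.vector_cache'
        # Please change above to your cache
    else:
        VECTORS_CACHE_DIR = './.vector_cache'
        # This is the default cache on Colab. Caching may not work
        # as expected on Colab.

    glove = GloVe('6B', cache=VECTORS_CACHE_DIR)
    embedding_vectors = glove.get_vecs_by_tokens(review_vocab.itos)
elif embedding_type == 'twitter':
    # The Twitter-embedding branch is not implemented.
    raise NotImplementedError("twitter embeddings are not implemented")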

./.vector_cache/glove.6B.zip: 862MB [02:45, 5.21MB/s]                           
100%|█████████▉| 399602/400000 [00:35<00:00, 11670.76it/s]

## Making the model

In [None]:
NUM_EPOCHS = 30
INPUT_DIM = len(review_vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 300
OUTPUT_DIM = 3
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PADDING_IDX = review_vocab.stoi['<pad>']

model = BasicLSTM(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PADDING_IDX).to(device)

Print number of parameters because why not

In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')


## Train

The trainer and evaluator need an unpacking function that tells them how to turn a batch into the arguments passed directly to the model.  This is not redundant with the collate function because it excludes the label.

In [None]:
if additional_text_features:
    def unpack_batch(data):
        reviews = data['reviews']
        lengths = data['lengths']
        # torch.cat needs a list or tuple of tensors, not a generator
        tensors = [data[col] for col in additional_text_features]
        extras = torch.cat(tensors, dim=-1)
        return reviews, lengths, extras    

else:
    def unpack_batch(data):
        """
        data: output of dataloader object
        returns: tuple that can be piped directly into model
        """
        reviews = data['reviews']
        lengths = data['lengths']
        return reviews, lengths
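
As a rough illustration of why the label is excluded, a training or evaluation step can splat the unpacked tuple straight into the model and read the label separately. The `run_step` function, the stand-in model and the toy batch below are hypothetical; the real `Trainer` and `Evaluator` may be organised differently.

```python
def unpack_batch(data):
    """Return only the model inputs; the label is deliberately left out."""
    return data['reviews'], data['lengths']


def toy_model(reviews, lengths):
    """Stand-in for the LSTM: 'predicts' a class from the review length."""
    return [length % 3 for length in lengths]


def run_step(model, batch, unpack_batch_fn):
    """One hypothetical step: feed inputs to the model, compare to labels."""
    inputs = unpack_batch_fn(batch)      # whatever the model signature needs
    predictions = model(*inputs)         # the step never inspects the inputs
    labels = batch['labels']             # labels are read separately
    n_correct = sum(p == y for p, y in zip(predictions, labels))
    return predictions, n_correct


# A toy batch shaped like the dictionaries used in this script.
batch = {
    'reviews': [[4, 8, 1], [7, 2, 0]],
    'lengths': [3, 2],
    'labels': [0, 1],
}
predictions, n_correct = run_step(toy_model, batch, unpack_batch)
```

Keeping this logic in a function means the same training loop can drive models with different input signatures, such as the variant above that also returns `extras`.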


In [None]:
if weights:
    def get_weights(dataloader, num_labels):
        """
        dataloader: training dataloader object
        num_labels (int): number of labels
        returns tensor of weights that can be the weight parameter in loss functions
        """
        label_dict = {}
        count = 0
        for data in dataloader:
            for label in data['labels'].tolist():
                label_dict[label] = label_dict.get(label, 0) + 1
                count += 1

        weights = [0] * num_labels
        for k, v in label_dict.items():
            weights[k]  = count/v

        return torch.tensor(weights)
    weights = get_weights(train_dataloader, 3)

100%|█████████▉| 399602/400000 [00:50<00:00, 11670.76it/s]

here are the weights tensor([4.5955, 8.3299, 1.5098])
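
The weighting rule in `get_weights` is the total number of examples divided by the per-class count, so rarer classes receive larger weights. A small stand-alone sketch with made-up labels shows the arithmetic. `inverse_frequency_weights` is a hypothetical helper that mirrors the logic without a dataloader or `torch`.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_labels):
    """Weight for class k is total / count_k; unseen classes get 0."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[k] if counts[k] else 0.0 for k in range(num_labels)]


# Made-up labels: class 2 is common, class 1 is rare.
toy_labels = [2, 2, 2, 2, 2, 0, 0, 1]
toy_weights = inverse_frequency_weights(toy_labels, 3)

# Inverting a weight recovers that class's share of the data.
toy_shares = [1 / w for w in toy_weights]
```

Read the same way, the weights printed above imply class shares of roughly 22%, 12% and 66% for labels 0, 1 and 2.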


In [None]:
optimizer = optim.Adam(model.parameters())
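# `weights` is a tensor when class weighting is enabled; truth-testing a
# multi-element tensor raises an error, so check the type instead
if torch.is_tensor(weights):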
    loss_function = nn.CrossEntropyLoss(weight=weights).to(device)
else:
    loss_function = nn.CrossEntropyLoss().to(device)

evaluator = Evaluator(root_dir=RESULTS_DIR, run_name=RESULT_NAME, 
                      unpack_batch_fn=unpack_batch)
trainer = Trainer(loss_fn=loss_function, optimizer=optimizer, unpack_batch_fn=unpack_batch)
plotter = Plotter(f"{RESULTS_DIR}/{RESULT_NAME}")
context = ModelContext(model=model, trainer=trainer, evaluator=evaluator, 
                       plotter=plotter, train_dataloader=train_dataloader, 
                       valid_dataloader=valid_dataloader, test_dataloader=test_dataloader)

context.run(NUM_EPOCHS, LOG_INTERVAL)

The model has 4,597,803 trainable parameters
At iteration 100 the loss is 0.860.
At iteration 200 the loss is 0.793.
At iteration 300 the loss is 0.557.
At iteration 400 the loss is 0.699.
At iteration 500 the loss is 0.653.
At iteration 600 the loss is 0.628.
At iteration 700 the loss is 0.620.
At iteration 800 the loss is 0.425.
At iteration 900 the loss is 0.551.
At iteration 1000 the loss is 0.615.
Epoch: 0, time taken: 159.9s, validation accuracy: 0.779.
At iteration 100 the loss is 0.528.
At iteration 200 the loss is 0.437.
At iteration 300 the loss is 0.547.
At iteration 400 the loss is 0.578.
At iteration 500 the loss is 0.552.
At iteration 600 the loss is 0.590.
At iteration 700 the loss is 0.428.
At iteration 800 the loss is 0.543.
At iteration 900 the loss is 0.529.
At iteration 1000 the loss is 0.638.
Epoch: 1, time taken: 160.4s, validation accuracy: 0.812.
At iteration 100 the loss is 0.341.
At iteration 200 the loss is 0.483.
At iteration 300 the loss is 0.485.
At iterat