## Move to GPU mode if you are in Google Colab
Go to `Runtime` -> `Change runtime type` to activate GPU.

 The following Python libraries are required for this part, and have been tested on Python 3.9 and Python 3.7.
 If you use Google Colab, PyTorch and SciPy are already installed, so you probably just want to install PyTorch Lightning.
  - [PyTorch](https://pytorch.org/get-started/locally/) (tested with 1.10)
  - [PyTorch Lightning](https://pypi.org/project/pytorch-lightning/) (tested with 1.5.8)
  - [SciPy](https://scipy.org/install/) (tested with 1.7.3 and with 1.4.1)


In [None]:
# # Download dataset
# !pip install gdown
# !gdown --id 1-FwYkKmml5pMgpfKM_Sz_O1JqDW12QSe -O sst2.zip
# !mkdir data
# !unzip sst2.zip -d .

In [2]:
# You may prefer to upload the data to your google drive and mount your google drive to this colab, 
# because the data will be erased if you stop using this colab for a while.
# Uncomment the code below to do so. After mounting, navigate to the appropriate folder, right click, and "copy path".
# Assign DATA_DIR global variable to that path.
# Remember to copy data files to the google drive folder if you decide to use set `DATA_DIR` as a google-drive folder.
# /content/data
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# DATA_DIR = "./data"
DATA_DIR = "/content/drive/MyDrive/nlp/a3/data"  #  If you have mounted want to use the google-drive folder; modify it as appropriate

Mounted at /content/drive


## Build Vocabulary
Different from A2, we provide a vocabulary this time so that we can use pretrained GloVe word embeddings.

## Pytorch Lightning Module
The next cell is the same as A2. You only need to implment the LSTM model if you simply want to build the model.
However, it may be useful for you to understand the next cell to truly understand how pytorch-lightning works and get ready for your own project.

In [None]:
# you only need to install the packages if you have not already. On Google Colab you need to reinstall these every time.
!pip install pytorch-lightning=="1.5.8"

In [4]:
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
import numpy as np
import scipy
import torch
import torch.nn.functional as F
from torch.utils.data.dataset import Dataset
import argparse
import os
from pathlib import Path
from torch.optim import SGD, Adam
import pytorch_lightning as pl
from torchmetrics import Accuracy
from datetime import datetime 
from pathlib import Path
from pytorch_lightning import loggers as pl_loggers
import time
from argparse import Namespace
import json
import shutil
logger = logging.getLogger(__name__)

class BaseModel(pl.LightningModule):
    def __init__(
        self,
        **config_kwargs
    ):
        """Initialize a model, tokenizer and config."""
        logger.info("Initilazing BaseModel")
        super().__init__()
        self.save_hyperparameters() #save hyperparameters to checkpoint
        self.step_count = 0
        self.output_dir = Path(self.hparams.output_dir)
        self.model = self._load_model()

        self.accuracy = Accuracy()

    def _load_model(self):
        raise NotImplementedError

    def forward(self, **inputs):
        return self.model(**inputs)

    def batch2input(self, batch):
        raise NotImplementedError

    def training_step(self, batch, batch_idx):
        input = self.batch2input(batch)
        labels = input['labels']
        loss, pred_labels, _ = self(**input)

        self.log('train_loss', loss, prog_bar=True)
        self.log('train_acc', self.accuracy(pred_labels.view(-1), labels.view(-1).int()), prog_bar=True)
        
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        input = self.batch2input(batch)
        labels = input['labels']
        loss, pred_labels, _ = self(**input)

        self.log('val_loss', loss)
        self.log('val_acc', self.accuracy(pred_labels.view(-1), labels.view(-1).int()))

    def test_step(self, batch, batch_nb):
        input = self.batch2input(batch)
        labels = input['labels']
        loss, pred_labels, _ = self(**input)

        self.log('test_loss', loss)
        self.log('test_acc', self.accuracy(pred_labels.view(-1), labels.view(-1).int()))

    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        model = self.model
        # optimizer = SGD(model.parameters(), lr=self.hparams.learning_rate)
        optimizer = Adam(model.parameters(), lr=self.hparams.learning_rate)

        self.opt = optimizer
        return [optimizer]

    def setup(self, stage):
        if stage == "fit":
            self.train_loader = self.get_dataloader("train", self.hparams.train_batch_size, shuffle=True)

    def train_dataloader(self):
        return self.train_loader

    def val_dataloader(self):
        return self.get_dataloader("dev", self.hparams.eval_batch_size, shuffle=False)

    def test_dataloader(self):
        return self.get_dataloader("test", self.hparams.eval_batch_size, shuffle=False)

    @staticmethod
    def add_generic_args(parser, root_dir) -> None:
        parser.add_argument(
            "--max_epochs",
            default=10,
            type=int,
            help="The number of epochs to train your model.",
        )
        ############################################################
        ## WARNING: set --gpus 0 if you do not have access to GPUS #
        ############################################################
        parser.add_argument(
            "--gpus",
            default=1,
            type=int,
            help="The number of GPUs allocated for this, it is by default 1. Set to 0 for no GPU.",
        )
        parser.add_argument(
            "--output_dir",
            default=None,
            type=str,
            required=True,
            help="The output directory where the model predictions and checkpoints will be written.",
        )
        parser.add_argument("--do_train", action="store_true", default=True, help="Whether to run training.")
        parser.add_argument("--do_predict", action="store_true", help="Whether to run predictions on the test set.")
        parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
        parser.add_argument(
            "--data_dir",
            default="./",
            type=str,
            help="The input data dir. Should contain the training files.",
        )
        parser.add_argument("--learning_rate", default=1e-2, type=float, help="The initial learning rate for training.")
        parser.add_argument("--num_workers", default=16, type=int, help="kwarg passed to DataLoader")
        parser.add_argument("--num_train_epochs", dest="max_epochs", default=3, type=int)
        parser.add_argument("--train_batch_size", default=32, type=int)
        parser.add_argument("--eval_batch_size", default=32, type=int)
    
def generic_train(
    model: BaseModel,
    args: argparse.Namespace,
    early_stopping_callback=False,
    extra_callbacks=[],
    checkpoint_callback=None,
    logging_callback=None,
    **extra_train_kwargs
):
    
    # init model
    odir = Path(model.hparams.output_dir)
    odir.mkdir(exist_ok=True)
    log_dir = Path(os.path.join(model.hparams.output_dir, 'logs'))
    log_dir.mkdir(exist_ok=True)

    # Tensorboard logger
    pl_logger = pl_loggers.TensorBoardLogger(
        save_dir=log_dir,
        version="version_" + datetime.now().strftime("%d-%m-%Y--%H-%M-%S"),
        name="",
        default_hp_metric=True
    )

    # add custom checkpoints
    ckpt_path = os.path.join(
        args.output_dir, pl_logger.version, "checkpoints",
    )
    if checkpoint_callback is None:
        checkpoint_callback = pl.callbacks.ModelCheckpoint(
            dirpath=ckpt_path, filename="{epoch}-{val_acc:.2f}", monitor="val_acc", mode="max", save_top_k=1, verbose=True
        )

    train_params = {}

    train_params["max_epochs"] = args.max_epochs

    if args.gpus > 1:
        train_params["distributed_backend"] = "ddp"

    trainer = pl.Trainer.from_argparse_args(
        args,
        enable_model_summary=False,
        callbacks= [checkpoint_callback] + extra_callbacks,
        logger=pl_logger,
        **train_params,
    )

    if args.do_train:
        trainer.fit(model)
        # track model performance under differnt hparams settings in "Hparams" of TensorBoard
        pl_logger.log_hyperparams(params=model.hparams, metrics={'hp_metric': checkpoint_callback.best_model_score.item()})
        pl_logger.save()

        # save best model to `best_model.ckpt`
        target_path = os.path.join(ckpt_path, 'best_model.ckpt')
        logger.info(f"Copy best model from {checkpoint_callback.best_model_path} to {target_path}.")
        shutil.copy(checkpoint_callback.best_model_path, target_path)

    
    # Optionally, predict on test set and write to output_dir
    if args.do_predict:
        best_model_path = os.path.join(ckpt_path, "best_model.ckpt")
        model = model.load_from_checkpoint(best_model_path)
        return trainer.test(model)
    
    return trainer


# Long Short-term Memory Network (LSTM)

You need to finish two class `LSTM` and `LSTM-Attention` in the following cells. Try to run LSTM first!

For model architecture, you can start with: 
* word embedding dimension: 300
* intermediate layer dimension: 300
* output layer dimension: 1

Feel free to tune hyperparameters to see different results!

You may reuse code for computing loss and model predictions from logistic regression.

In [5]:
from nltk.tokenize import WordPunctTokenizer 
tokenizer = WordPunctTokenizer()

class SST2Dataset(Dataset):
    """
    Using dataset to process input text on-the-fly
    """
    def __init__(self, vocab, data):
        self.data = data
        self.vocab = vocab
        self.max_len = 50 # assigned based on length analysis of training set

    def __getitem__(self, index):
        note = []
        label, text = int(self.data[index][0]), self.data[index][1]
        tokens = tokenizer.tokenize(text.lower())
        assert self.vocab["<pad>"] == 0 # check vocab["<pad>"] == 0
        assert self.vocab["<unk>"] == 1 # check vocab["<unk>"] == 1
        token_ids = [self.vocab.get(t, 1) for t in tokens] # if word does not exist, give <unk> token id
        length = min(len(token_ids), self.max_len) # in case token length exceed max length
        padded_token_ids = token_ids[:50] + [0] * (self.max_len - length ) # truncate or pad to max length
        mask = [1 if id!=0 else 0 for id in padded_token_ids]
        return padded_token_ids, label, length, mask

    def collate_fn(self, batch_data):
        padded_token_ids, labels, lengths, masks = list(zip(*batch_data))
        return (torch.LongTensor(padded_token_ids).view(-1, self.max_len),
                torch.FloatTensor(labels).view(-1,1),
                torch.LongTensor(lengths).view(-1,1),
                torch.FloatTensor(masks).view(-1, self.max_len)
                )

    def __len__(self):
        return len(self.data)

class LSTM_PL(BaseModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
    def _load_model(self):
        self.hparams.vocab = json.load(
            open(
            os.path.join(self.hparams.data_dir, self.hparams.vocab_filename)
            )
        )
        self.hparams.vocab_size = len(self.hparams.vocab)
        if self.hparams.attention:
            return LSTM_Attention(self.hparams.vocab, self.hparams.vocab_size, self.hparams.word_embedding_size, self.hparams.use_glove)
        else:
            return LSTM(self.hparams.vocab, self.hparams.vocab_size, self.hparams.word_embedding_size, self.hparams.use_glove)

    def get_dataloader(self, type_path, batch_size, shuffle=False):
        # dataset path (change if necessary)
        datapath = os.path.join(self.hparams.data_dir, f"sst2.{type_path}")
        data = open(datapath).readlines()
        data = [d.strip().split(" ", maxsplit=1) for d in data] # list of [label, text] pair
        dataset = SST2Dataset(self.hparams.vocab, data)

        logger.info(f"Loading {type_path} data and labels from {datapath}")
        data_loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=self.hparams.num_workers,
            collate_fn=dataset.collate_fn
        )
        
        return data_loader    

    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        model = self.model
        optimizer = Adam(model.parameters(), lr=self.hparams.learning_rate)
        self.opt = optimizer
        return [optimizer]
    
    def batch2input(self, batch):
        return {"input_ids": batch[0], "labels": batch[1], "lengths": batch[2], "masks": batch[3]}

    @staticmethod
    def add_model_specific_args(parser, root_dir):
        parser.add_argument(
            "--vocab_filename",
            default=None,
            type=str,
            required=True,
            help="Pretrained tokenizer name or path",
        )
        parser.add_argument(
            "--optimizer",
            default="adam",
            type=str,
            required=True,
            help="Whether to use SGD or not",
        )
        parser.add_argument(
            "--word_embedding_size",
            default=300,
            type=int,
            help="Pretrained tokenizer name or path",
        )
        parser.add_argument(
            "--attention",
            action="store_true",
            help="Use attention or not",
        )
        parser.add_argument("--use_glove", action="store_true", help="Whether to use vector representaion from GloVe")

        return parser

In [6]:

class LSTM(torch.nn.Module):
    """
    LSTM Seq classification model
    """
    def __init__(self, vocab, vocab_size, word_embedding_size, use_glove=None):
        """
        # Paramters
          vocab_size: int
              size of the vocabulary.
        """
        super(LSTM, self).__init__()
        self.embedding = torch.nn.Embedding(vocab_size, word_embedding_size, padding_idx=0)
        if use_glove:
            self._load_glove(vocab, word_embedding_size)
        #######################################
        ## TODO: add LSTM and output layer(s) #
        #######################################
        self.lstm = torch.nn.LSTM(word_embedding_size, 300, 1, batch_first=True)
        self.sigmoid = torch.nn.Sigmoid()
        self.output = torch.nn.Linear(300, 1)

        self.criterion = torch.nn.BCELoss()

        
    def _load_glove(self, vocab, word_embedding_size):
        logger.info("Load glove pretrained word embeddings")
        vectors = {}
        with open(os.path.join(DATA_DIR, "glove.small.300d.txt")) as fin:
            for line in fin:
                parts = line.split()
                vectors[parts[0]] = np.array([float(v) for v in parts[1:]])
        weights = []
        id2word = {k: w for w, k in vocab.items()}
        for i in range(len(vocab)):
            word = id2word[i]
            if word in vectors:
                weights.append(torch.from_numpy(vectors[word]))
            elif word in ["<pad>"]:
                weights.append(torch.zeros((word_embedding_size,)))
            else:
                weights.append(torch.randn((word_embedding_size,)))
        weights = torch.stack(weights).float()
        self.embedding.load_state_dict({"weight":weights})


    def forward(self, input_ids, labels, lengths, masks):
        """
        # Parameters
        input_ids: 
            matrix of size (batch_size, feature_length). Each row in data represents a sequence of token ids coming from tokenzied input text and vocabulary. 
        label: matrix of size (batch_size,). 
            Ground truth labels.
        lengths: matrix of size (batch_size, 1). 
            Token length of input text. Help you to compute average word embedding
        mask: matrix of size (batch_size, feature_length). 
            Input mask that tells you whether the token is pad or not. If not masks = 1, else = 0. This helps you to compute attention weights
        # Returns
        loss: tensor
            loss should be a scalar averaged accross batches
        predicted_labels : model predictions. 
            Should be either 0 or 1 based on a threshold (usually 0.5).
        """
        #################################################################
        ## TODO: compute loss and predicted_labels based on model output#
        #################################################################
        
        # HINT: you can use lengths to retrieve the hidden state corresponding to the last word
        # you may find this link helpful: https://discuss.pytorch.org/t/selecting-element-on-dimension-from-list-of-indexes/36319

        
        
        embeds = self.embedding(input_ids)
        
        lengths = torch.as_tensor(lengths).view(-1) - 1
        lstm_out, _ = self.lstm(embeds)

        out = lstm_out[torch.arange(lstm_out.size(0)), lengths]
        out = self.output(out)
        out = self.sigmoid(out)
        
        loss = self.criterion(out, labels)
        predicted_labels = (out > .5).int()

        
        return loss, predicted_labels, [] # use empty list to keep number of return tensors consistant with lstm attention



In [7]:
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
import time
import argparse
import glob
import os
logger = logging.getLogger(__name__)

def main():
    ########################################################
    ## TODO: change args if needed according to your files #
    ########################################################
    mock_args = f"--word_embedding_size 300 --data_dir {DATA_DIR} --output_dir lstm --optimizer adam \
    --vocab_filename unigram_vocab.json --learning_rate 0.001 --max_epochs 5 --do_predict \
    --train_batch_size 16 --use_glove"

    # load hyperparameters
    parser = argparse.ArgumentParser()
    BaseModel.add_generic_args(parser, os.getcwd())
    parser = LSTM_PL.add_model_specific_args(parser, os.getcwd())
    args = parser.parse_args(mock_args.split())
    print(args)
    # fix random seed to make sure the result is reproducible
    pl.seed_everything(args.seed)

    # If output_dir not provided, a folder will be generated in pwd
    if args.output_dir is None:
        args.output_dir = os.path.join(
            "./results",
            f"{args.task}_{time.strftime('%Y%m%d_%H%M%S')}",
        )
        os.makedirs(args.output_dir)
    dict_args = vars(args)
    model = LSTM_PL(**dict_args)
    trainer = generic_train(model, args)


if __name__ == "__main__":
    main()

02/19/2022 01:33:10 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42
02/19/2022 01:33:10 - INFO - __main__ -   Initilazing BaseModel


Namespace(attention=False, data_dir='/content/drive/MyDrive/nlp/a3/data', do_predict=True, do_train=True, eval_batch_size=32, gpus=1, learning_rate=0.001, max_epochs=5, num_workers=16, optimizer='adam', output_dir='lstm', seed=42, train_batch_size=16, use_glove=True, vocab_filename='unigram_vocab.json', word_embedding_size=300)


02/19/2022 01:33:11 - INFO - __main__ -   Load glove pretrained word embeddings
02/19/2022 01:33:13 - INFO - pytorch_lightning.utilities.distributed -   GPU available: True, used: True
02/19/2022 01:33:13 - INFO - pytorch_lightning.utilities.distributed -   TPU available: False, using: 0 TPU cores
02/19/2022 01:33:13 - INFO - pytorch_lightning.utilities.distributed -   IPU available: False, using: 0 IPUs
02/19/2022 01:33:13 - INFO - __main__ -   Loading train data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.train
  cpuset_checked))
02/19/2022 01:33:13 - INFO - pytorch_lightning.accelerators.gpu -   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation sanity check: 0it [00:00, ?it/s]

02/19/2022 01:33:24 - INFO - __main__ -   Loading dev data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.dev
02/19/2022 01:33:27 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

02/19/2022 01:33:35 - INFO - pytorch_lightning.utilities.distributed -   Epoch 0, global step 432: val_acc reached 0.78440 (best 0.78440), saving model to "/content/lstm/version_19-02-2022--01-33-13/checkpoints/epoch=0-val_acc=0.78.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:33:44 - INFO - pytorch_lightning.utilities.distributed -   Epoch 1, global step 865: val_acc reached 0.82225 (best 0.82225), saving model to "/content/lstm/version_19-02-2022--01-33-13/checkpoints/epoch=1-val_acc=0.82.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:33:52 - INFO - pytorch_lightning.utilities.distributed -   Epoch 2, global step 1298: val_acc reached 0.82339 (best 0.82339), saving model to "/content/lstm/version_19-02-2022--01-33-13/checkpoints/epoch=2-val_acc=0.82.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:34:00 - INFO - pytorch_lightning.utilities.distributed -   Epoch 3, global step 1731: val_acc was not in top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:34:08 - INFO - pytorch_lightning.utilities.distributed -   Epoch 4, global step 2164: val_acc was not in top 1
02/19/2022 01:34:09 - INFO - __main__ -   Copy best model from /content/lstm/version_19-02-2022--01-33-13/checkpoints/epoch=2-val_acc=0.82.ckpt to lstm/version_19-02-2022--01-33-13/checkpoints/best_model.ckpt.
02/19/2022 01:34:09 - INFO - __main__ -   Initilazing BaseModel
02/19/2022 01:34:09 - INFO - __main__ -   Load glove pretrained word embeddings
02/19/2022 01:34:11 - INFO - pytorch_lightning.accelerators.gpu -   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
02/19/2022 01:34:11 - INFO - __main__ -   Loading test data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.test


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.8336079120635986, 'test_loss': 0.440121591091156}
--------------------------------------------------------------------------------


In [12]:

class LSTM_Attention(torch.nn.Module):
    """
    LSTM with Attention Seq classification model
    """
    def __init__(self, vocab, vocab_size, word_embedding_size, use_glove=None):
        """
        # Parameters
        vocab_size: int
            size of the vocabulary.
        """
        super(LSTM_Attention, self).__init__()
        self.embedding = torch.nn.Embedding(vocab_size, word_embedding_size, padding_idx=0)
        if use_glove:
            self._load_glove(vocab, word_embedding_size)
        #################################################
        ## TODO: add LSTM, attention, and output layers #
        #################################################
        self.lambda_ = 3

        self.lstm = torch.nn.LSTM(word_embedding_size, 300, 1, batch_first=True)
        self.attention = torch.nn.Linear(300, 1)
        self.output = torch.nn.Linear(300, 1)
        self.sigmoid = torch.nn.Sigmoid()
        self.softmax = torch.nn.Softmax()

        self.criterion = torch.nn.BCELoss()
        
    def _load_glove(self, vocab, word_embedding_size):
        logger.info("Load glove pretrained word embeddings")
        vectors = {}
        with open(os.path.join(DATA_DIR, "glove.small.300d.txt")) as fin:
            for line in fin:
                parts = line.split()
                vectors[parts[0]] = np.array([float(v) for v in parts[1:]])
        weights = []
        id2word = {k: w for w, k in vocab.items()}
        for i in range(len(vocab)):
            word = id2word[i]
            if word in vectors:
                weights.append(torch.from_numpy(vectors[word]))
            elif word in ["<pad>"]:
                weights.append(torch.zeros((word_embedding_size,)))
            else:
                weights.append(torch.randn((word_embedding_size,)))
        weights = torch.stack(weights).float()
        self.embedding.load_state_dict({"weight":weights})
        

    def forward(self, input_ids, labels, lengths, masks):
        """
        # Parameters
        input_ids: matrix of size (batch_size, feature_length). 
            Each row in data represents a sequence of token ids coming from tokenzied input text and vocabulary. 
        label: matrix of size (batch_size,).
            Ground truth labels.
        lengths: matrix of size (batch_size, 1). 
            Token length of input text. Help you to compute average word embedding
        mask: matrix of size (batch_size, feature_length). 
            Input mask that tells you whether the token is pad or not. If not masks = 1, else = 0. This helps you to compute attention weights
        # Returns
        loss: loss should be a scalar averaged accross batches
        predicted_labels : model predictions. Should be either 0 or 1 based on a threshold (usually 0.5).
        """
        #################################################################
        ## TODO: compute loss and predicted_labels based on model output#
        #################################################################
        
        # HINT: you can assign -1e9 to padded tokens based on masks so that after softmax, these tokens get zero attention
        embeds = self.embedding(input_ids)
        
        h_t, _ = self.lstm(embeds) 

        out = self.attention(h_t) / self.lambda_
        out[~masks.bool()] = -1e9
        a_t = self.softmax(out).view(-1, 1, 50) # size |X|

        h_att = a_t @ h_t
        out = self.output(h_att)
        out = self.sigmoid(out.view(-1, 1))
        loss = self.criterion(out, labels)
        predicted_labels = (out > .5).int()
        
        return loss, predicted_labels, a_t #weights

In [13]:
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
import time
import argparse
import glob
import os
logger = logging.getLogger(__name__)

def main():
    ########################################################
    ## TODO: change args if needed according to your files #
    ########################################################
    mock_args = f"--word_embedding_size 300 --data_dir {DATA_DIR} --output_dir lstm-att --optimizer adam \
    --vocab_filename unigram_vocab.json --learning_rate 0.001 --max_epochs 5 --do_predict --attention --use_glove \
    --train_batch_size 16" 
    
    # load hyperparameters
    parser = argparse.ArgumentParser()
    BaseModel.add_generic_args(parser, os.getcwd())
    parser = LSTM_PL.add_model_specific_args(parser, os.getcwd())
    args = parser.parse_args(mock_args.split())
    print(args)
    # fix random seed to make sure the result is reproducible
    pl.seed_everything(args.seed)

    # If output_dir not provided, a folder will be generated in pwd
    if args.output_dir is None:
        args.output_dir = os.path.join(
            "./results",
            f"{args.task}_{time.strftime('%Y%m%d_%H%M%S')}",
        )
        os.makedirs(args.output_dir)
    dict_args = vars(args)
    model = LSTM_PL(**dict_args)
    trainer = generic_train(model, args)


if __name__ == "__main__":
    main()


02/19/2022 01:38:34 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42
02/19/2022 01:38:34 - INFO - __main__ -   Initilazing BaseModel
02/19/2022 01:38:34 - INFO - __main__ -   Load glove pretrained word embeddings


Namespace(attention=True, data_dir='/content/drive/MyDrive/nlp/a3/data', do_predict=True, do_train=True, eval_batch_size=32, gpus=1, learning_rate=0.001, max_epochs=5, num_workers=16, optimizer='adam', output_dir='lstm-att', seed=42, train_batch_size=16, use_glove=True, vocab_filename='unigram_vocab.json', word_embedding_size=300)


02/19/2022 01:38:35 - INFO - pytorch_lightning.utilities.distributed -   GPU available: True, used: True
02/19/2022 01:38:35 - INFO - pytorch_lightning.utilities.distributed -   TPU available: False, using: 0 TPU cores
02/19/2022 01:38:35 - INFO - pytorch_lightning.utilities.distributed -   IPU available: False, using: 0 IPUs
02/19/2022 01:38:35 - INFO - __main__ -   Loading train data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.train
  cpuset_checked))
02/19/2022 01:38:35 - INFO - pytorch_lightning.accelerators.gpu -   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation sanity check: 0it [00:00, ?it/s]

02/19/2022 01:38:36 - INFO - __main__ -   Loading dev data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.dev
02/19/2022 01:38:36 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

02/19/2022 01:38:45 - INFO - pytorch_lightning.utilities.distributed -   Epoch 0, global step 432: val_acc reached 0.77867 (best 0.77867), saving model to "/content/lstm-att/version_19-02-2022--01-38-35/checkpoints/epoch=0-val_acc=0.78.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:38:54 - INFO - pytorch_lightning.utilities.distributed -   Epoch 1, global step 865: val_acc reached 0.78440 (best 0.78440), saving model to "/content/lstm-att/version_19-02-2022--01-38-35/checkpoints/epoch=1-val_acc=0.78.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:39:02 - INFO - pytorch_lightning.utilities.distributed -   Epoch 2, global step 1298: val_acc reached 0.81766 (best 0.81766), saving model to "/content/lstm-att/version_19-02-2022--01-38-35/checkpoints/epoch=2-val_acc=0.82.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:39:11 - INFO - pytorch_lightning.utilities.distributed -   Epoch 3, global step 1731: val_acc reached 0.81995 (best 0.81995), saving model to "/content/lstm-att/version_19-02-2022--01-38-35/checkpoints/epoch=3-val_acc=0.82.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:39:19 - INFO - pytorch_lightning.utilities.distributed -   Epoch 4, global step 2164: val_acc was not in top 1
02/19/2022 01:39:20 - INFO - __main__ -   Copy best model from /content/lstm-att/version_19-02-2022--01-38-35/checkpoints/epoch=3-val_acc=0.82.ckpt to lstm-att/version_19-02-2022--01-38-35/checkpoints/best_model.ckpt.
02/19/2022 01:39:20 - INFO - __main__ -   Initilazing BaseModel
02/19/2022 01:39:20 - INFO - __main__ -   Load glove pretrained word embeddings
02/19/2022 01:39:22 - INFO - pytorch_lightning.accelerators.gpu -   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
02/19/2022 01:39:22 - INFO - __main__ -   Loading test data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.test


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.8187808990478516, 'test_loss': 0.6311379075050354}
--------------------------------------------------------------------------------


In [None]:
%reload_ext  tensorboard
%tensorboard --logdir lstm/

# Visualize Attention Weights

In [14]:
# This is a helper function for you to visualize your attention weights
from IPython.display import HTML, display
def visualize_attention_weights(tokens, att_weights):
    """
    # Paramters
    tokens: list of strings
        tokenized words of a sentence. 
    att_weights: list of floats, each weight should be in [0, 1]
        att_weights gerneated by LSTM Attention model. 
    """
    html_template = """<span style="background-color:rgb(255, {}, {})">{}</span>"""
    out = []
    for t, w in zip(tokens, att_weights):
        rgb = 255 - w*255
        out.append(html_template.format(rgb,rgb,t))
    html = " ".join(out)
    display(HTML(html), metadata=dict(isolated=True))
    
############## YOU NEED TO USE ACTUAL EXAMPLES AND ATTENTION FROM YOUR MODEL ###
tokens = ["this", "is", "good"]
att_weights = [0.1, 0.2, 0.7]
visualize_attention_weights(tokens, att_weights)
############## 

In [195]:
# Get test labels
with open(os.path.join(DATA_DIR, 'sst2.test')) as f:
    labels = torch.tensor([int(i[0]) for i in f])

# Get vocab
with open(os.path.join(DATA_DIR, 'unigram_vocab.json')) as f:
    vocab = json.load(f)

tokens_list = []
all_atts = []
preds = torch.tensor([])

## HINT: If you want to run predictions on test data to get attention weights, you may adapt the following code:
## On CPU
model = LSTM_PL.load_from_checkpoint("/content/lstm-att/version_18-02-2022--22-33-11/checkpoints/best_model.ckpt")
test_loader = model.test_dataloader()
for batch in test_loader:
    input = model.batch2input(batch)
    loss, pred, att_weights = model(**input)
    
    all_atts += att_weights.view(-1, 50)
    preds = torch.cat((preds, pred.view(-1)))

    for l in input['input_ids']: 
        t = []
        for i in l:
            if i > 1:
                t.append(list(vocab.keys())[i])
        tokens_list.append(t)

# # On GPU
# model = LSTM_PL.load_from_checkpoint("/content/lstm-att/version_18-02-2022--22-33-11/checkpoints/best_model.ckpt").to('cuda')
# test_loader = model.test_dataloader()
# for i, batch in enumerate(test_loader):
#     model.transfer_batch_to_device(batch, 'cuda', i) # move data to gpu
#     input = model.batch2input(batch)
#     loss, pred, att_weights = model(**input)


02/19/2022 00:20:07 - INFO - __main__ -   Initilazing BaseModel
02/19/2022 00:20:07 - INFO - __main__ -   Load glove pretrained word embeddings
02/19/2022 00:20:10 - INFO - __main__ -   Loading test data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.test
  cpuset_checked))


In [234]:
correct_preds = []
false_preds = []
for i, j in enumerate(labels == preds): 
    if j and len(correct_preds) <= 10:
        correct_preds.append(i)
    if not j and len(false_preds) <= 10:
        false_preds.append(i)
    if len(correct_preds + false_preds) == 20:
        break

print("### CORRECT PREDICTIONS ###")
for i in correct_preds:
    print("Label:", int(labels[i]))
    visualize_attention_weights(tokens_list[i], all_atts[i] * 15)

print("\n### FALSE PREDICTIONS ####")
for i in false_preds: 
    print("Label:", int(labels[i]))
    visualize_attention_weights(tokens_list[i], all_atts[i] * 15)

### CORRECT PREDICTIONS ###
Label: 0


Label: 0


Label: 0


Label: 0


Label: 1


Label: 1


Label: 0


Label: 0


Label: 0


Label: 0


Label: 1



### FALSE PREDICTIONS ####
Label: 1


Label: 1


Label: 0


Label: 0


Label: 0


Label: 1


Label: 1


Label: 0


Label: 1


# BERT 
To reduce the computation, we use `distill BERT` which has much less parameters (66m) than `BERT base` model (~100m) https://github.com/huggingface/transformers/tree/master/examples/distillation .

In [None]:
!pip install transformers=="4.2.2"

In [17]:
from transformers import AutoModelForSequenceClassification, AutoConfig, AutoTokenizer

class BERTSST2Dataset(Dataset):
    """
    Using dataset to process input text on-the-fly
    """
    def __init__(self, tokenizer, data):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = 50 # assigned based on length analysis of training set

    def __getitem__(self, index):
        note = []
        label, text = int(self.data[index][0]), self.data[index][1]
        return text, label

    def collate_fn(self, batch_data):
        texts, labels = list(zip(*batch_data))
        # print(text)
        encodings = self.tokenizer(list(texts), padding=True, truncation=True, max_length=self.max_len, return_tensors= 'pt')
        return (
                encodings['input_ids'],
                encodings['attention_mask'],
                torch.LongTensor(labels).view(-1,1)
               )

    def __len__(self):
        return len(self.data)

class BERT_PL(BaseModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = AutoTokenizer.from_pretrained(self.hparams.model_name)
        
    def _load_model(self):
        model_config = AutoConfig.from_pretrained(
            self.hparams.model_name,
            num_labels=2,
        )
        return AutoModelForSequenceClassification.from_pretrained(self.hparams.model_name, config=model_config)

    def forward(self, **args):
        outputs = self.model(**args)
        loss, logits = outputs[0], outputs[1]
        predicted_labels = torch.argmax(logits, dim=1)
        return loss, predicted_labels, []

    def get_dataloader(self, type_path, batch_size, shuffle=False):
        # todo add dataset path
        datapath = os.path.join(self.hparams.data_dir, f"sst2.{type_path}")
        data = open(datapath).readlines()
        data = [d.strip().split(" ", maxsplit=1) for d in data] # list of [label, text] pair
        dataset = BERTSST2Dataset(self.tokenizer, data)

        logger.info(f"Loading {type_path} data and labels from {datapath}")
        data_loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=self.hparams.num_workers,
            collate_fn=dataset.collate_fn
        )
        
        return data_loader    

    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        model = self.model
        optimizer = Adam(model.parameters(), lr=self.hparams.learning_rate)
        self.opt = optimizer
        return [optimizer]
    
    def batch2input(self, batch):
        return {"input_ids": batch[0], "labels": batch[2], "attention_mask": batch[1]}

    @staticmethod
    def add_model_specific_args(parser, root_dir):
        parser.add_argument(
            "--model_name",
            default=None,
            type=str,
            required=True,
            help="Pretrained tokenizer name or path",
        )
        parser.add_argument(
            "--optimizer",
            default="adam",
            type=str,
            required=True,
            help="Whether to use SGD or not",
        )
        return parser

In [18]:
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
import time
import argparse
import glob
import os
logger = logging.getLogger(__name__)

def main():
    ########################################################
    ## TODO: change args if needed according to your files #
    ########################################################
    mock_args = f"--data_dir {DATA_DIR} --output_dir bert --optimizer adam \
    --model_name distilbert-base-uncased --learning_rate 0.00005 --max_epochs 3 --do_predict" # change model_name here

    # load hyperparameters
    parser = argparse.ArgumentParser()
    BaseModel.add_generic_args(parser, os.getcwd())
    parser = BERT_PL.add_model_specific_args(parser, os.getcwd())
    args = parser.parse_args(mock_args.split())
    print(args)
    # fix random seed to make sure the result is reproducible
    pl.seed_everything(args.seed)

    # If output_dir not provided, a folder will be generated in pwd
    if args.output_dir is None:
        args.output_dir = os.path.join(
            "./results",
            f"{args.task}_{time.strftime('%Y%m%d_%H%M%S')}",
        )
        os.makedirs(args.output_dir)
    dict_args = vars(args)
    model = BERT_PL(**dict_args)
    trainer = generic_train(model, args)


if __name__ == "__main__":
    main()

02/19/2022 01:45:07 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42
02/19/2022 01:45:07 - INFO - __main__ -   Initilazing BaseModel


Namespace(data_dir='/content/drive/MyDrive/nlp/a3/data', do_predict=True, do_train=True, eval_batch_size=32, gpus=1, learning_rate=5e-05, max_epochs=3, model_name='distilbert-base-uncased', num_workers=16, optimizer='adam', output_dir='bert', seed=42, train_batch_size=32)


Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

02/19/2022 01:45:17 - INFO - pytorch_lightning.utilities.distributed -   GPU available: True, used: True
02/19/2022 01:45:17 - INFO - pytorch_lightning.utilities.distributed -   TPU available: False, using: 0 TPU cores
02/19/2022 01:45:17 - INFO - pytorch_lightning.utilities.distributed -   IPU available: False, using: 0 IPUs
02/19/2022 01:45:17 - INFO - __main__ -   Loading train data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.train
  cpuset_checked))
02/19/2022 01:45:17 - INFO - pytorch_lightning.accelerators.gpu -   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation sanity check: 0it [00:00, ?it/s]

02/19/2022 01:45:17 - INFO - __main__ -   Loading dev data and labels from /content/drive/MyDrive/nlp/a3/data/sst2.dev
02/19/2022 01:45:18 - INFO - pytorch_lightning.utilities.seed -   Global seed set to 42


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

02/19/2022 01:45:42 - INFO - pytorch_lightning.utilities.distributed -   Epoch 0, global step 216: val_acc reached 0.89564 (best 0.89564), saving model to "/content/bert/version_19-02-2022--01-45-17/checkpoints/epoch=0-val_acc=0.90.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:46:12 - INFO - pytorch_lightning.utilities.distributed -   Epoch 1, global step 433: val_acc reached 0.89794 (best 0.89794), saving model to "/content/bert/version_19-02-2022--01-45-17/checkpoints/epoch=1-val_acc=0.90.ckpt" as top 1


Validating: 0it [00:00, ?it/s]

02/19/2022 01:46:41 - INFO - pytorch_lightning.utilities.distributed -   Epoch 2, global step 650: val_acc was not in top 1
02/19/2022 01:46:42 - INFO - __main__ -   Copy best model from /content/bert/version_19-02-2022--01-45-17/checkpoints/epoch=1-val_acc=0.90.ckpt to bert/version_19-02-2022--01-45-17/checkpoints/best_model.ckpt.
02/19/2022 01:46:45 - INFO - __main__ -   Initilazing BaseModel
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSe

Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9044480919837952, 'test_loss': 0.24474403262138367}
--------------------------------------------------------------------------------


In [None]:
%reload_ext  tensorboard
%tensorboard --logdir bert/