# Model Building

In this notebook we build different models for text summarization and evaluate their performance. In the spirit of quickly prototyping these models, we initially limit the data to 128 training examples and 16 validation examples, and we try to overfit the training data with as small a model as possible. Thereafter we scale up some of the better-performing models and train them on 5000 training examples. We also experiment with models covered in the advanced unit on NLP, such as LSTMs and Transformers. **Hence, this notebook serves as my submission for steps 8, 9, and 10 (i.e. units 20.5.1, 20.5.2, and 20.5.3) of the capstone project.**  

In particular, we will build various deep learning based models for text summarization. The common thread among all these models is that they are based on an encoder-decoder architecture, where the encoder "encodes" the text to be summarized into a compressed form representing the underlying meaning of the input text. The decoder then uses this representation to generate a succinct summary of the input text. Refer to the literature review notebook for further details on this architecture.  

At a high level, we will experiment with four different model types:
1. LSTM

2. Attention based LSTM  
While an LSTM is better than a simple RNN at capturing long term dependencies (because it addresses the vanishing gradient problem suffered by RNNs), it is still fairly limited because we are expecting too much of it. Specifically, we expect the internal state of an LSTM cell at step t to capture all the relevant information from steps 0 to t-1, which is a lot to ask as t gets very large. This is especially problematic for this dataset, where the input text is about 4000 words long. To address this limitation, we use a multiplicative global attention mechanism over the encoded input.

3. Transformers  
While the above attention mechanism allows the decoder to attend directly to the input text at any time step, it still has significant limitations. Specifically, it has no self-attention in the encoder and decoder modules, because it only performs cross attention from the decoder to the encoder. So the encoder and decoder modules still rely on the LSTM's ability to capture long term dependencies, which is fairly limited. To address these limitations, we use transformers because they can attend directly to any token, regardless of its position, in a single step. Moreover, due to their parallel nature (vs. the sequential nature of LSTMs), transformers are much faster to train on a GPU.

4. Memory Efficient Transformer  
A major drawback of vanilla transformers is their huge memory consumption. In particular, the self-attention matrix inside the encoder layer is of size input_length x input_length, which for a 4000-word input with a batch size of 32 and float32 representation takes about 2GB of memory (4000 x 4000 x 32 x 4 bytes). And this is just for a single attention head in a single layer inside a multi-layer, multi-head attention based encoder. So it becomes very challenging to train with any reasonable batch size without compromising on the number of layers or the hidden dimension size.  
To address this, we can take advantage of the fact that in most cases tokens don't need to attend to words that are thousands of time steps away. That is, the input text does not generally have such a long dependency structure. So we can split the input text into, for example, four chunks of 1100 tokens each, with an overlap of 100 tokens between neighboring chunks so that some context carries across the chunk boundaries. This leads to a roughly 3x to 4x reduction in memory consumption (about 4x if the overlap is ignored). The encoder outputs of these four input chunks are concatenated together before being fed to the decoder.  
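
The memory arithmetic and the chunking scheme described above can be checked with a small sketch. The function names `attention_matrix_bytes` and `split_into_chunks` and their parameters are hypothetical and are not taken from models.py; the sketch only illustrates the idea on a list of stand-in token ids.

```python
def attention_matrix_bytes(seq_len, batch_size=32, bytes_per_value=4):
    """Bytes needed for one seq_len x seq_len attention matrix per batch example."""
    return seq_len * seq_len * batch_size * bytes_per_value


def split_into_chunks(tokens, chunk_len=1100, overlap=100):
    """Split tokens into windows of chunk_len that share `overlap` tokens
    with the previous window (hypothetical helper, for illustration only)."""
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break
        start += stride
    return chunks


full = attention_matrix_bytes(4000)   # one head, one layer, full input
tokens = list(range(4000))            # stand-in for 4000 token ids
chunks = split_into_chunks(tokens)
chunked = sum(attention_matrix_bytes(len(c)) for c in chunks)

print(f"full attention:    {full / 1e9:.2f} GB")
print(f"chunked attention: {chunked / 1e9:.2f} GB over {len(chunks)} chunks")
print(f"reduction:         {full / chunked:.1f}x")
```

Note that with a stride of 1000 tokens the last chunk is shorter than the others, so the exact reduction factor depends on how the final chunk is handled (for example, padded or truncated).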

For each model type, in order to get the best performance, we'll experiment with various hyperparameters such as the number of hidden dimensions, the number of encoder and decoder layers, dropout, etc. We'll also experiment with limited "teacher forcing", i.e. teacher forcing only 70% of the time instead of 100%, in order to make the training process closer to the evaluation process. For the memory efficient version of transformers, we'll experiment with weight sharing between the decoder embeddings and the decoder output projection matrix, as per the "Attention Is All You Need" paper. Following the same paper, we'll further experiment with scaling up the embedding's output before adding it to the positional encoding vector, in order to give the word embeddings more influence relative to the positional encodings.  
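
As a rough illustration of limited teacher forcing, the sketch below decides, position by position, whether the decoder is fed the reference token or the model's own guess. It assumes that a threshold of 0.3 (mirroring the `--tfThresh 0.3` flag used later) means the model's own prediction is used with probability 0.3; this interpretation, the function name `choose_decoder_inputs`, and the pre-computed `predicted` list are assumptions for illustration. In real training the predictions are produced step by step by the decoder.

```python
import random


def choose_decoder_inputs(gold, predicted, tf_thresh=0.3, rng=None):
    """Return the token fed to the decoder at each position.

    gold[t] is the reference token and predicted[t] the model's guess at
    position t. The reference token is used with probability 1 - tf_thresh.
    """
    if rng is None:
        rng = random.Random(0)
    inputs = [gold[0]]  # the start token is always given
    for t in range(1, len(gold)):
        use_teacher = rng.random() >= tf_thresh
        inputs.append(gold[t] if use_teacher else predicted[t])
    return inputs


gold = ["<s>"] + ["gold"] * 10000
predicted = ["<s>"] + ["pred"] * 10000
inputs = choose_decoder_inputs(gold, predicted, tf_thresh=0.3, rng=random.Random(0))
teacher_fraction = inputs[1:].count("gold") / 10000
print(f"teacher-forced fraction: {teacher_fraction:.2f}")
```

Under this interpretation, a threshold of 0.0 reduces to ordinary (100%) teacher forcing.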

As can be seen below, the highest Rouge-1 score we can get using an LSTM based model is 29.5%. Overall, the best performing model (MODEL_7) uses transformers and achieves a Rouge-1 score of 38.4%, which is much better than the baseline Lead-3 model's Rouge-1 score of 31.3%. The Rouge scores of the transformer based model on the training data approach 99%, so the model has high variance (it overfits). I tried adding dropout and reducing model complexity to mitigate the high variance, but neither improved the validation score. I believe that adding a lot more training data, coupled with an even larger model, will help reduce variance and thus yield an even higher Rouge score. Unfortunately, I was not able to experiment much with more than 5000 training examples due to the long training time. For example, with 5000 training examples, it took about 3.5 hours to train a transformer based model on a P100 GPU.  

One point I'd like to make is that human text summarization is subjective, so using a fixed metric to measure it will not be perfect. It is important to note that the Rouge metric just compares the generated text to a reference text, and this can cause a couple of problems:
1. Intolerance to paraphrasing. Even a well paraphrased version of the reference text will lead to a low Rouge score. For example, if we replace a word with its synonym, the Rouge score will decrease. This is because Rouge measures lexical (surface-level word) overlap rather than semantic similarity between the reference and predicted summaries.
2. It tends to reward extractive summaries more than abstractive summaries, even if a human would evaluate both summaries as equally good. It is widely observed that just selecting the first 3 sentences from a text (i.e. the baseline Lead-3 model) will result in pretty good Rouge scores, even though a human evaluator may not score it as highly. See slides 25-26 of https://www.aclweb.org/anthology/P17-1099/ for details.  
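
The intolerance to paraphrasing is easy to see with a simplified unigram-overlap score. The function `rouge1_f1` below is a minimal sketch of a Rouge-1 F1 computation using whitespace tokenization; it is not the implementation in the `rouge` package used in this notebook, which may tokenize and aggregate differently. The example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the device reduces fuel consumption"
exact = "the device reduces fuel consumption"
paraphrase = "the apparatus lowers fuel usage"

print(f"exact copy: {rouge1_f1(reference, exact):.2f}")
print(f"paraphrase: {rouge1_f1(reference, paraphrase):.2f}")
```

The paraphrase shares only two of its five words with the reference, so its score drops sharply even though the meaning is nearly the same.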

For the decoder module, which is really a language generation model, I experimented with beam search in addition to the greedy approach of choosing, at each time step, the word with the highest probability among all the words in the vocabulary. In theory, beam search should be able to generate sequences with a higher overall probability, and thus possibly higher Rouge scores; but I did not observe any improvement. Perhaps this is due to a flaw in my implementation of beam search, so I will revisit it in the future.  
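
The difference between greedy decoding and beam search can be sketched on a toy example. `TOY_MODEL` below is a made-up table of next-token probabilities, not the notebook's decoder, and the `beam_search` function is a simplified illustration (no end-of-sequence handling and no length normalization), not the implementation in the src directory.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_log_probs(prefix):
    """Log-probabilities of each possible next token for the given prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}


def beam_search(beam_size, num_steps=2):
    """Keep the beam_size highest-scoring prefixes at every step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


# A beam of size 1 is exactly greedy decoding.
greedy_tokens, greedy_score = beam_search(beam_size=1)
beam_tokens, beam_score = beam_search(beam_size=2)
print("greedy:", greedy_tokens, round(math.exp(greedy_score), 2))
print("beam 2:", beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "a" at the first step because it is locally the most likely token, while the beam of size 2 keeps "b" alive and finds the sequence with the higher total probability.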

In terms of framework, PyTorch is used to train the models and TensorBoard (integrated with PyTorch) is used to visualize the training process.

All the code is in the [src directory](https://github.com/amitp-ai/Text_Summarization_UCSD/tree/main/ModelBuilding/src) inside the capstone project's GitHub repository, whereby models.py contains the code for building all the different models, train.py is used for training the models, evaluate.py is used for model evaluations, and utils.py contains various utility methods and classes.

Lastly, we will use cpc_codes 'de' from the BigPatent dataset for training and validating the models.


## Mount Google Drive and Import Libraries

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import sys
import os
import torch
# for auto-reloading external modules (automatically reloads before using an imported module)
# %load_ext autoreload
# %autoreload 2

# Change into the project directory so that the Colab Python interpreter can load Python files from within it
PATH_NAME = os.path.join('/', 'content', 'drive', 'My Drive', 'Colab Notebooks', 'UCSDX_MLE_Bootcamp', 'Text_Summarization_UCSD', 'ModelBuilding')
%cd $PATH_NAME
# sys.path.append(os.path.join(PATH_NAME, 'src'))
# print(sys.path)

print(f'Torch version {torch.__version__}') #1.8.1+cu101

/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding
Torch version 1.8.1+cu101


In [None]:
!nvidia-smi

Sat May 29 14:33:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [None]:
#testing
!python -m pytest -s ../tests/

platform linux -- Python 3.7.10, pytest-3.6.4, py-1.10.0, pluggy-0.7.1
rootdir: /content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD, inifile:
plugins: typeguard-2.7.1
collected 3 items

../tests/test_ModelBuilding1.py ...



### Speed Difference: Storing Data on Gdrive vs Locally on the GCP VM
Conclusion: No difference in speed was observed

In [None]:
!ls

images	ModelBuilding_step8.ipynb		  __pycache__  saved_models
logs	Model_Experimentation_step7_14-8-1.ipynb  runs	       src


In [None]:
%%timeit -r 1 -n 1
#From GDrive
''' MODEL_DELETE: 
'''
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 100 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_DELETE' --printEveryIters 200 --tbDescr 'MODEL_DELETE' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.2 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

In [None]:
# -p creates any missing parent directories and does not fail if a directory already exists
!mkdir -p '/content/Text_Summarization_UCSD/DataWrangling'
!mkdir -p '/content/Text_Summarization_UCSD/ModelBuilding/logs'
!mkdir -p '/content/Text_Summarization_UCSD/ModelBuilding/runs/seq2seqWithAtten'
!mkdir -p '/content/Text_Summarization_UCSD/ModelBuilding/saved_models'
!cp -r '../../Text_Summarization_UCSD/ModelBuilding/src' '/content/Text_Summarization_UCSD/ModelBuilding/'
!cp -r '../../Text_Summarization_UCSD/DataWrangling/bigPatentPreprocessedData' '/content/Text_Summarization_UCSD/DataWrangling/'
!cp -r ../../Text_Summarization_UCSD/DataWrangling/*.json /content/Text_Summarization_UCSD/DataWrangling/
!ls /content

drive  sample_data  Text_Summarization_UCSD


In [None]:
import sys #needed here because the earlier "import sys" is commented out
sys.path.pop() #remove the path in Gdrive
sys.path.append('/content/Text_Summarization_UCSD/ModelBuilding/src')
sys.path
%cd '/content/Text_Summarization_UCSD/ModelBuilding'
!ls

/content/Text_Summarization_UCSD/ModelBuilding
src


In [None]:
'''
Change input_path in utils.load_data_string() and load_data_numpy()
'''

In [None]:
%%timeit -r 1 -n 1
#From GCP VM
''' MODEL_DELETE: 
'''
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 100 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_DELETE' --printEveryIters 200 --tbDescr 'MODEL_DELETE' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.2 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

## 1. LSTM

For further details, see:

https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/

http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html

### Seq2Seq: lr=0.002, dropout=0.0, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

#### Training

In [None]:
%%timeit -r 1 -n 1
#without attention: train the Seq2Seq model from scratch
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 700 --lr 2e-3 --savedModelDir './saved_models/seq2seq_200hid_2lyrs' \
                        --printEveryIters 4800 --tbDescr 'dropout-0_hiddim-200_numlyrs-2_full-de-data' \
                        --modelType 'models.Seq2Seq' --loadBestModel False --toTrain True

#### Results

In [None]:
%%timeit -r 1 -n 1
#without attention
#test above trained model with beamsize=5 
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 700 --lr 2e-3 --savedModelDir './saved_models/seq2seq_200hid_2lyrs' \
                        --printEveryIters 4800 --tbDescr 'dropout-0_hiddim-200_numlyrs-2_full-de-data' \
                        --modelType 'models.Seq2Seq' --loadBestModel True --toTrain False
#but there is no improvement in rouge scores vs no beam search (greedy search seems to be ok for a well trained model)

Best model at checkpoint 67200: Rouge-1 is 0.2893, Rouge-2 is 0.0430, and Rouge-l is 0.2302


## 2. Attention Based LSTM
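
The multiplicative global attention used by this model family can be sketched as follows. This is a framework-free illustration that assumes a Luong-style "general" score, score_i = dec_state^T W enc_state_i; the exact scoring function in models.py is not shown in this notebook, the weight matrix `W` would be learned in practice, and the masking of padded positions is omitted. All names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multiplicative_attention(dec_state, enc_states, W):
    """Score every encoder state against the decoder state, then return the
    attention weights and the weighted sum of encoder states (the context)."""
    scores = [dot(dec_state, matvec(W, h)) for h in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context


enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder time steps
dec_state = [2.0, 0.0]                             # current decoder state
W = [[1.0, 0.0], [0.0, 1.0]]                       # identity: score is a plain dot product
weights, context = multiplicative_attention(dec_state, enc_states, W)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because the weights are computed over all encoder time steps at once, the decoder can draw on any part of the input at every step instead of relying only on the final LSTM state.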

### Seq2Seq with Atten: lr=0.004, dropout=0.1, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

#### Training

In [None]:
%%timeit -r 1 -n 1
#with attention
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 --savedModelDir './saved_models/seq2seq_withAtten_200hid_2lyrs' \
                        --printEveryIters 400 --tbDescr 'seq2seq_withAtten_dropout-0p1_hiddim-200_numlyrs-2_full-de-vocab' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True \
                        --dropout 0.1

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x558ffe044000 @  0x7fec669a51e7 0x558f8340ff48 0x7fec3d51753e 0x7fec3d517cd9 0x7fec3d517faf 0x7fec3d5154b4 0x558f833de0e4 0x558f833ddde0 0x558f834526f5 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f8344cb0e 0x558f8331ee2b 0x558f8344f1e6 0x558f8344ce0d 0x558f8331ee2b 0x558f8344f1e6 0x558f8344ce0d 0x558f833df77a 0x558f8344e86a 0x558f834d0858 0x558f8344dee2 0x558f8344cb0e 0x558f833df77a 0x558f8344e86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

#### Results

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=137QFqKnmrVTz9GGSByKKod2zn0SgYgYZ) | ![](https://drive.google.com/uc?export=view&id=12m-s2b_sxEsnHzigJGIC4eojyiF8I_tF) | ![](https://drive.google.com/uc?export=view&id=12iDd_RbDEdvKAcKNlhBDAd4HQh1cfC1e) |
| | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l |


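Note that the initial loss should be approximately log(abstract_vocab_size), i.e. -log(1/abstract_vocab_size), which is about 9.28 for an abstract vocabulary of 10769 words. This is because a randomly initialized model assigns roughly uniform probability to every word in the abstract vocabulary.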
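
As a quick sanity check of this value, the expected cross-entropy of a uniform prediction can be computed directly. The sketch assumes the loss is measured in nats (natural logarithm), which is the convention of PyTorch's cross-entropy loss; the vocabulary size is taken from the training log above.

```python
import math

abstract_vocab_size = 10769  # "abstract vocab" size reported in the training log

# A uniform prediction assigns probability 1/V to the correct word,
# so the cross-entropy is -log(1/V) = log(V).
expected_initial_loss = -math.log(1.0 / abstract_vocab_size)
print(f"expected initial loss: {expected_initial_loss:.2f}")
```

An initial training loss far from this value would hint at a problem with the initialization or with the loss computation (for example, padding tokens being counted).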

Best checkpoint at 1200: Rouge-1 is 0.2941, Rouge-2 is 0.0498, and Rouge-l is 0.2024


### Seq2Seq with Atten: lr=0.004, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

#### Training

In [None]:
%%timeit -r 1 -n 1
#with attention
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 --savedModelDir './saved_models/seq2seq_withAtten_200hid_2lyrs_0p4dropout' \
                        --printEveryIters 400 --tbDescr 'seq2seq_withAtten_dropout-0p4_hiddim-200_numlyrs-2_full-de-vocab' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True \
                        --dropout 0.4

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x55c6459b2000 @  0x7f66719d51e7 0x55c5caf76f48 0x7f664854753e 0x7f6648547cd9 0x7f6648547faf 0x7f66485454b4 0x55c5caf450e4 0x55c5caf44de0 0x55c5cafb96f5 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5cafb3b0e 0x55c5cae85e2b 0x55c5cafb61e6 0x55c5cafb3e0d 0x55c5cae85e2b 0x55c5cafb61e6 0x55c5cafb3e0d 0x55c5caf4677a 0x55c5cafb586a 0x55c5cb037858 0x55c5cafb4ee2 0x55c5cafb3b0e 0x55c5caf4677a 0x55c5cafb586a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

#### Results

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=12YouTMIN-kKqm9bkMZGwb6Hlb6fltIcv) | ![](https://drive.google.com/uc?export=view&id=12WpOKdQ3HMepjn2O_uSwUUp7cGtp_7mW) | ![](https://drive.google.com/uc?export=view&id=12ev-TtxbARCWZUC17v1jpyqO-LvdXtAW) |
| | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l |

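Note that the initial loss should be approximately log(abstract_vocab_size), i.e. -log(1/abstract_vocab_size), because a randomly initialized model assigns roughly uniform probability to every word in the abstract vocabulary.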

Best checkpoint at 1200: Rouge-1 is 0.2935, Rouge-2 is 0.0382, and Rouge-l is 0.2242

Didn't see much difference even with dropout of 0.75. 

### Seq2Seq with Atten: lr=0.004, dropout=0.4 and 0.0, hiddim=200, numlyrs=2, full-de-vocab, train_size=1024, val_size=16

Did not notice much improvement in R1/R2 scores

### Model_1: Seq2Seq with Atten: lr=0.004, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_1: 
--hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
--savedModelDir './saved_models/MODEL_1' --printEveryIters 400 --tbDescr 'MODEL_1' \
--modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''

!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
                        --savedModelDir './saved_models/MODEL_1' --printEveryIters 400 --tbDescr 'MODEL_1' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=64, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=False, lr=0.004, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_1', seed=0, tbDescr='MODEL_1', toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x561ecc4ea000 @  0x7f2554b691e7 0x561e51f65f48 0x7f252b6db53e 0x7f252b6dbcd9 0x7f252b6dbfaf 0x7f252b6d94b4 0x561e51f340e4 0x561e51f33de0 0x561e51fa86f5 0x561e51f3569a 0x561e51fa3c9e 0x561e51f3569a 0x561e51fa3c9e 0x561e51f3569a 0x561e51fa3c9e 0x561e51f3569a 0x561e51fa3c9e 0x561e51fa2b0e 0x561e51e74e2b 0x561e51fa51e6 0x561e51fa2e0d 0x561e51e74e2b 0x561e51fa51e6 0x561e51fa2e0d 0x561e51f3577a 0x561e51fa486a 0x561e52026858 0x561e51fa3ee2 0x561e51fa2b0e 0x561e51f3577a 0x561e51fa486a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147

#### Results

In [None]:
%%timeit -r 1 -n 1
#MODEL_1 evaluation using best model
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
                        --savedModelDir './saved_models/MODEL_1' --printEveryIters 400 --tbDescr 'MODEL_1' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain False --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=64, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=True, lr=0.004, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_1', seed=0, tbDescr='MODEL_1', tfThresh=0.0, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55779a532000 @  0x7f7ac65921e7 0x55771fb42f48 0x7f7a9d10453e 0x7f7a9d104cd9 0x7f7a9d104faf 0x7f7a9d1024b4 0x55771fb110e4 0x55771fb10de0 0x55771fb856f5 0x55771fb1269a 0x55771fb80c9e 0x55771fb1269a 0x55771fb80c9e 0x55771fb1269a 0x55771fb80c9e 0x55771fb1269a 0x55771fb80c9e 0x55771fb7fb0e 0x55771fa51e2b 0x55771fb821e6 0x55771fb7fe0d 0x55771fa51e2b 0x55771fb821e6 0x55771fb7fe0d 0x55771fb1277a 0x55771fb8186a 0x55771fc03858 0x55771fb80ee2 0x55771fb7fb0e 0x55771fb1277a 0x55771fb8186a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=11M1iZi9ifejRDriypcsbA8RCp-MGfJLF) | ![](https://drive.google.com/uc?export=view&id=11Gd1FT9rgPa1pnM1UPYY2oSvYfhpqgfB) | ![](https://drive.google.com/uc?export=view&id=11CL3pbL420yrplyc-wJu8NmteP6SCBV2) |
| | Pink: Rouge-1, Teal: Rouge-2, Gray: Rouge-l | Orange: Rouge-1, Blue: Rouge-2, Red: Rouge-l |

Best checkpoint at 1200: Rouge-1 is 0.2909, Rouge-2 is 0.0358, and Rouge-l is 0.1944

### Seq2Seq with Atten (Model_1B): lr=0.004, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab (same as Model1 but with attention layer properly implemented by fixing enc and dec mask)

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_1B: 
--hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
--savedModelDir './saved_models/MODEL_1B' --printEveryIters 400 --tbDescr 'MODEL_1B' \
--modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''

!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
                        --savedModelDir './saved_models/MODEL_1B' --printEveryIters 400 --tbDescr 'MODEL_1B' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=64, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=False, lr=0.004, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_1B', seed=0, tbDescr='MODEL_1B', tfThresh=0.0, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55b2d327e000 @  0x7ff816ec11e7 0x55b257caff48 0x7ff7eda3353e 0x7ff7eda33cd9 0x7ff7eda33faf 0x7ff7eda314b4 0x55b257c7e0e4 0x55b257c7dde0 0x55b257cf26f5 0x55b257c7f69a 0x55b257cedc9e 0x55b257c7f69a 0x55b257cedc9e 0x55b257c7f69a 0x55b257cedc9e 0x55b257c7f69a 0x55b257cedc9e 0x55b257cecb0e 0x55b257bbee2b 0x55b257cef1e6 0x55b257cece0d 0x55b257bbee2b 0x55b257cef1e6 0x55b257cece0d 0x55b257c7f77a 0x55b257cee86a 0x55b257d70858 0x55b257cedee2 0x55b257cecb0e 0x55b257c7f77a 0x55b257cee86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before addi


#### Results

In [None]:
%%timeit -r 1 -n 1
#Evaluation Model1B
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 \
                        --savedModelDir './saved_models/MODEL_1B' --printEveryIters 400 --tbDescr 'MODEL_1B' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain False --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=64, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=True, lr=0.004, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_1B', seed=0, tbDescr='MODEL_1B', tfThresh=0.0, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x5643c046a000 @  0x7f5a6bd9b1e7 0x56434517cf48 0x7f5a4290d53e 0x7f5a4290dcd9 0x7f5a4290dfaf 0x7f5a4290b4b4 0x56434514b0e4 0x56434514ade0 0x5643451bf6f5 0x56434514c69a 0x5643451bac9e 0x56434514c69a 0x5643451bac9e 0x56434514c69a 0x5643451bac9e 0x56434514c69a 0x5643451bac9e 0x5643451b9b0e 0x56434508be2b 0x5643451bc1e6 0x5643451b9e0d 0x56434508be2b 0x5643451bc1e6 0x5643451b9e0d 0x56434514c77a 0x5643451bb86a 0x56434523d858 0x5643451baee2 0x5643451b9b0e 0x56434514c77a 0x5643451bb86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before addi

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=116x9laTV0sJqkUqSjreHxU0q0kTH8Nhd) | ![](https://drive.google.com/uc?export=view&id=113fy4HUoOq-ik837fazeGcdwQX3R8BL2) | ![](https://drive.google.com/uc?export=view&id=113T5ISBqWppLv_wuax1yXtlAHKRCEVR6) |
| | Red: Rouge-1, Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l |

Best checkpoint at 2000: Rouge-1 is 0.2695, Rouge-2 is 0.0450, and Rouge-l is 0.1773

### Seq2Seq with Atten (Model_3): lr=0.003, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab and teacher forcing only 70% of the time during training

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_3: 
--hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 3000 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_3' --printEveryIters 500 --tbDescr 'MODEL_3' \
--modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0
'''

!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 3000 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_3' --printEveryIters 500 --tbDescr 'MODEL_3' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0

Namespace(batchSize=16, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=False, lr=0.003, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_3', seed=0, tbDescr='MODEL_3', tfThresh=0.3, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55fb23e28000 @  0x7fdc3547d1e7 0x55faa914ff48 0x7fdc0bfef53e 0x7fdc0bfefcd9 0x7fdc0bfeffaf 0x7fdc0bfed4b4 0x55faa911e0e4 0x55faa911dde0 0x55faa91926f5 0x55faa911f69a 0x55faa918dc9e 0x55faa911f69a 0x55faa918dc9e 0x55faa911f69a 0x55faa918dc9e 0x55faa911f69a 0x55faa918dc9e 0x55faa918cb0e 0x55faa905ee2b 0x55faa918f1e6 0x55faa918ce0d 0x55faa905ee2b 0x55faa918f1e6 0x55faa918ce0d 0x55faa911f77a 0x55faa918e86a 0x55faa9210858 0x55faa918dee2 0x55faa918cb0e 0x55faa911f77a 0x55faa918e86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding

#### Results

In [None]:
#Model3 best model's evaluation
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 3000 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_3' --printEveryIters 500 --tbDescr 'MODEL_3' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain False --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0

Namespace(batchSize=16, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=True, lr=0.003, modelType='models.Seq2SeqwithAttention', numEpochs=3000, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_3', seed=0, tbDescr='MODEL_3', tfThresh=0.3, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55fadc97c000 @  0x7f07536381e7 0x55fa62af9f48 0x7f072a1aa53e 0x7f072a1aacd9 0x7f072a1aafaf 0x7f072a1a84b4 0x55fa62ac80e4 0x55fa62ac7de0 0x55fa62b3c6f5 0x55fa62ac969a 0x55fa62b37c9e 0x55fa62ac969a 0x55fa62b37c9e 0x55fa62ac969a 0x55fa62b37c9e 0x55fa62ac969a 0x55fa62b37c9e 0x55fa62b36b0e 0x55fa62a08e2b 0x55fa62b391e6 0x55fa62b36e0d 0x55fa62a08e2b 0x55fa62b391e6 0x55fa62b36e0d 0x55fa62ac977a 0x55fa62b3886a 0x55fa62bba858 0x55fa62b37ee2 0x55fa62b36b0e 0x55fa62ac977a 0x55fa62b3886a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=10z049RMSC54qHjhNIrDGdCZt2QCV-Qfb) | ![](https://drive.google.com/uc?export=view&id=112M03Qj3_c4OG2R4eRl2rA4VWAeVOG3h) | ![](https://drive.google.com/uc?export=view&id=11qDo_oNdVXh3Oxy6H-z9W8084euc7dDQ) |
| | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l |

Best checkpoint at 6000: Rouge-1 is 0.2704, Rouge-2 is 0.0312, and Rouge-l is 0.1798

### Seq2Seq with Atten (Model_2): lr=0.006, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab and teacher forcing only 70% of the time (finetuning from Model_1)

Continue training from the best checkpoint of Model 1

#### Training

In [None]:
# !cp -r saved_models/MODEL_1 saved_models/MODEL_2

In [None]:
%%timeit -r 1 -n 1
''' MODEL_2: 
--hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 500 --lr 6e-3 \
--savedModelDir './saved_models/MODEL_2' --printEveryIters 100 --tbDescr 'MODEL_2' \
--modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain True --dropout 0.4 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0
'''

!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 500 --lr 6e-3 \
                        --savedModelDir './saved_models/MODEL_2' --printEveryIters 100 --tbDescr 'MODEL_2' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain True --dropout 0.4 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0

Namespace(batchSize=16, beamSize=0, dropout=0.4, fullVocab=True, hiddenDim=200, loadBestModel=True, lr=0.006, modelType='models.Seq2SeqwithAttention', numEpochs=500, numLayers=2, printEveryIters=100, savedModelDir='./saved_models/MODEL_2', seed=0, tbDescr='MODEL_2', tfThresh=0.3, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x563b7e012000 @  0x7f166d0f11e7 0x563b02c69f48 0x7f1643c6353e 0x7f1643c63cd9 0x7f1643c63faf 0x7f1643c614b4 0x563b02c380e4 0x563b02c37de0 0x563b02cac6f5 0x563b02c3969a 0x563b02ca7c9e 0x563b02c3969a 0x563b02ca7c9e 0x563b02c3969a 0x563b02ca7c9e 0x563b02c3969a 0x563b02ca7c9e 0x563b02ca6b0e 0x563b02b78e2b 0x563b02ca91e6 0x563b02ca6e0d 0x563b02b78e2b 0x563b02ca91e6 0x563b02ca6e0d 0x563b02c3977a 0x563b02ca886a 0x563b02d2a858 0x563b02ca7ee2 0x563b02ca6b0e 0x563b02c3977a 0x563b02ca886a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding s

#### Results

Continuing training from the best checkpoint of Model 1 with these settings does not seem to help much.

### Seq2Seq with Atten (Model_4): lr=0.003, dropout=0.6, hiddim=200, numlyrs=2, full-de-vocab and teacher forcing only 70% of the time during training

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_4: 
--hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 2000 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_4' --printEveryIters 400 --tbDescr 'MODEL_4' \
--modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.6 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0
'''

!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 2000 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_4' --printEveryIters 400 --tbDescr 'MODEL_4' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True --dropout 0.6 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0

Namespace(batchSize=16, beamSize=0, dropout=0.6, fullVocab=True, hiddenDim=200, loadBestModel=False, lr=0.003, modelType='models.Seq2SeqwithAttention', numEpochs=2000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_4', seed=0, tbDescr='MODEL_4', tfThresh=0.3, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x5652bc4f6000 @  0x7fd3cbb691e7 0x5652418c1f48 0x7fd3a26db53e 0x7fd3a26dbcd9 0x7fd3a26dbfaf 0x7fd3a26d94b4 0x5652418900e4 0x56524188fde0 0x5652419046f5 0x56524189169a 0x5652418ffc9e 0x56524189169a 0x5652418ffc9e 0x56524189169a 0x5652418ffc9e 0x56524189169a 0x5652418ffc9e 0x5652418feb0e 0x5652417d0e2b 0x5652419011e6 0x5652418fee0d 0x5652417d0e2b 0x5652419011e6 0x5652418fee0d 0x56524189177a 0x56524190086a 0x565241982858 0x5652418ffee2 0x5652418feb0e 0x56524189177a 0x56524190086a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding

#### Results

In [None]:
%%timeit -r 1 -n 1
#evaluation
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 16 --numEpochs 2000 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_4' --printEveryIters 400 --tbDescr 'MODEL_4' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel True --toTrain False --dropout 0.6 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.3 --beamSize 0

Namespace(batchSize=16, beamSize=0, dropout=0.6, fullVocab=True, hiddenDim=200, loadBestModel=True, lr=0.003, modelType='models.Seq2SeqwithAttention', numEpochs=2000, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_4', seed=0, tbDescr='MODEL_4', tfThresh=0.3, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55c4fdd94000 @  0x7f7c6d6b21e7 0x55c4843fbf48 0x7f7c4422453e 0x7f7c44224cd9 0x7f7c44224faf 0x7f7c442224b4 0x55c4843ca0e4 0x55c4843c9de0 0x55c48443e6f5 0x55c4843cb69a 0x55c484439c9e 0x55c4843cb69a 0x55c484439c9e 0x55c4843cb69a 0x55c484439c9e 0x55c4843cb69a 0x55c484439c9e 0x55c484438b0e 0x55c48430ae2b 0x55c48443b1e6 0x55c484438e0d 0x55c48430ae2b 0x55c48443b1e6 0x55c484438e0d 0x55c4843cb77a 0x55c48443a86a 0x55c4844bc858 0x55c484439ee2 0x55c484438b0e 0x55c4843cb77a 0x55c48443a86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=11oUDh5M1kaki69bL885XM7YdSXGBZ4xJ) | ![](https://drive.google.com/uc?export=view&id=11kCg75kma5vDQnEc9_TaEIGTqIlmqPRL) | ![](https://drive.google.com/uc?export=view&id=11k4bcHEExetdNyzpo3sRDAka-NdjuKAa)
| Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l

Best checkpoint at 14800: Rouge-1 is 0.2427, Rouge-2 is 0.0372, and Rouge-l is 0.1996
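
The checkpoint numbers quoted in this notebook appear to be training iteration counts rather than epochs; that reading is an assumption, since the training loop itself is not shown. Under the further assumption that each epoch runs `ceil(trainSize / batchSize)` mini-batches, the sketch below converts a checkpoint iteration into an epoch. Both helper names are illustrative.

```python
import math


def iters_per_epoch(train_size, batch_size):
    """Number of mini-batches per epoch, keeping the last partial batch."""
    return math.ceil(train_size / batch_size)


def epoch_of_iteration(iteration, train_size, batch_size):
    """Epoch (1-indexed) in which a given 1-indexed iteration falls."""
    return math.ceil(iteration / iters_per_epoch(train_size, batch_size))


# Model 4 settings from the command above: trainSize=128, batchSize=16, numEpochs=2000
per_epoch = iters_per_epoch(128, 16)
total_iters = per_epoch * 2000
best_epoch = epoch_of_iteration(14800, 128, 16)
print(per_epoch, total_iters, best_epoch)  # prints: 8 16000 1850
```

Under these assumptions the best Model 4 checkpoint falls late in the run, at epoch 1850 of 2000.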

## 3. Transformers
I was not able to use a hidden dimension larger than 48 due to the aforementioned memory constraints.
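
To get a feel for the memory pressure, the sketch below does some back-of-the-envelope arithmetic for a standard transformer layer on inputs of this length. It assumes float32 values, the maximum description length of 3943 tokens reported in the logs, the batch size of 6 used in the commands below, and a hypothetical 4 attention heads (the head count is not given in this notebook). It counts only the self-attention score matrices and the hidden states of a single layer, so it is a rough lower bound and does not by itself explain the limit on the hidden dimension.

```python
def attention_score_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    """Memory for one layer's self-attention scores (batch x heads x L x L)."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


def hidden_state_bytes(batch_size, seq_len, hidden_dim, bytes_per_value=4):
    """Memory for one layer's hidden states (batch x L x hidden_dim)."""
    return batch_size * seq_len * hidden_dim * bytes_per_value


GIB = 2 ** 30
scores = attention_score_bytes(batch_size=6, num_heads=4, seq_len=3943)  # heads=4 is an assumption
hidden = hidden_state_bytes(batch_size=6, seq_len=3943, hidden_dim=48)
print(f"attention scores per layer: {scores / GIB:.2f} GiB")
print(f"hidden states per layer:    {hidden / GIB:.4f} GiB")
```

The score matrices grow quadratically with the sequence length, so with roughly 4000-token inputs they take up a large share of GPU memory before any tensors that scale with the hidden dimension are counted.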

### Transformer based Model 5: Dropout 0.3

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_5: 
--hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_5' --printEveryIters 400 --tbDescr 'MODEL_5' \
--modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.3 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5' --printEveryIters 400 --tbDescr 'MODEL_5' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.3 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.3, fullVocab=True, hiddenDim=48, loadBestModel=False, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=500, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_DEL', seed=0, tbDescr='MODEL_DEL', tfThresh=0.0, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55a3a73ca000 @  0x7f626ed891e7 0x55a32c22ff48 0x7f62458fb53e 0x7f62458fbcd9 0x7f62458fbfaf 0x7f62458f94b4 0x55a32c1fe0e4 0x55a32c1fdde0 0x55a32c2726f5 0x55a32c1ff69a 0x55a32c26dc9e 0x55a32c1ff69a 0x55a32c26dc9e 0x55a32c1ff69a 0x55a32c26dc9e 0x55a32c1ff69a 0x55a32c26dc9e 0x55a32c26cb0e 0x55a32c13ee2b 0x55a32c26f1e6 0x55a32c26ce0d 0x55a32c13ee2b 0x55a32c26f1e6 0x55a32c26ce0d 0x55a32c1ff77a 0x55a32c26e86a 0x55a32c2f0858 0x55a32c26dee2 0x55a32c26cb0e 0x55a32c1ff77a 0x55a32c26e86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding sta

#### Results

In [None]:
%%timeit -r 1 -n 1

!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5' --printEveryIters 400 --tbDescr 'MODEL_5' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel True --toTrain False --dropout 0.3 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.3, fullVocab=True, hiddenDim=48, loadBestModel=True, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=500, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_5', seed=0, tbDescr='MODEL_5', tfThresh=0.0, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55cb0da8c000 @  0x7f4fd43cc1e7 0x55ca935b2f48 0x7f4faaf3e53e 0x7f4faaf3ecd9 0x7f4faaf3efaf 0x7f4faaf3c4b4 0x55ca935810e4 0x55ca93580de0 0x55ca935f56f5 0x55ca9358269a 0x55ca935f0c9e 0x55ca9358269a 0x55ca935f0c9e 0x55ca9358269a 0x55ca935f0c9e 0x55ca9358269a 0x55ca935f0c9e 0x55ca935efb0e 0x55ca934c1e2b 0x55ca935f21e6 0x55ca935efe0d 0x55ca934c1e2b 0x55ca935f21e6 0x55ca935efe0d 0x55ca9358277a 0x55ca935f186a 0x55ca93673858 0x55ca935f0ee2 0x55ca935efb0e 0x55ca9358277a 0x55ca935f186a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/s

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=11cPpBU1AI9kdm_FUgL3i53swyXDeDylk) | ![](https://drive.google.com/uc?export=view&id=12AWlADqD7pugYCQ83re445cCoqLgVVhK) | ![](https://drive.google.com/uc?export=view&id=125Lcei-yNY-udqwCbx8UmRLKjLvkmjVc)
| Red: Rouge-1, Light Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l

Best checkpoint at 11000: Rouge-1 is 0.2565, Rouge-2 is 0.0312, and Rouge-l is 0.1536

### Transformer based Model 6: Dropout 0.6

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_6: 
--hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_6' --printEveryIters 400 --tbDescr 'MODEL_6' \
--modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.6 \
--fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_6' --printEveryIters 400 --tbDescr 'MODEL_6' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.6 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.6, fullVocab=True, hiddenDim=48, loadBestModel=False, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=500, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_6', seed=0, tbDescr='MODEL_6', tfThresh=0.0, toTrain=True, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55d5aa196000 @  0x7f34f8cc61e7 0x55d52f910f48 0x7f34cf83853e 0x7f34cf838cd9 0x7f34cf838faf 0x7f34cf8364b4 0x55d52f8df0e4 0x55d52f8dede0 0x55d52f9536f5 0x55d52f8e069a 0x55d52f94ec9e 0x55d52f8e069a 0x55d52f94ec9e 0x55d52f8e069a 0x55d52f94ec9e 0x55d52f8e069a 0x55d52f94ec9e 0x55d52f94db0e 0x55d52f81fe2b 0x55d52f9501e6 0x55d52f94de0d 0x55d52f81fe2b 0x55d52f9501e6 0x55d52f94de0d 0x55d52f8e077a 0x55d52f94f86a 0x55d52f9d1858 0x55d52f94eee2 0x55d52f94db0e 0x55d52f8e077a 0x55d52f94f86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/s

#### Results

In [None]:
%%timeit -r 1 -n 1
''' MODEL_6: '''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 500 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_6' --printEveryIters 400 --tbDescr 'MODEL_6' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel True --toTrain False --dropout 0.6 \
                        --fullVocab True --trainSize 128 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.6, fullVocab=True, hiddenDim=48, loadBestModel=True, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=500, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_6', seed=0, tbDescr='MODEL_6', tfThresh=0.0, toTrain=False, trainSize=128, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55d5a67e0000 @  0x7ff2db5da1e7 0x55d52c272f48 0x7ff2b214c53e 0x7ff2b214ccd9 0x7ff2b214cfaf 0x7ff2b214a4b4 0x55d52c2410e4 0x55d52c240de0 0x55d52c2b56f5 0x55d52c24269a 0x55d52c2b0c9e 0x55d52c24269a 0x55d52c2b0c9e 0x55d52c24269a 0x55d52c2b0c9e 0x55d52c24269a 0x55d52c2b0c9e 0x55d52c2afb0e 0x55d52c181e2b 0x55d52c2b21e6 0x55d52c2afe0d 0x55d52c181e2b 0x55d52c2b21e6 0x55d52c2afe0d 0x55d52c24277a 0x55d52c2b186a 0x55d52c333858 0x55d52c2b0ee2 0x55d52c2afb0e 0x55d52c24277a 0x55d52c2b186a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/s

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=121unJTMckXqoxdb4GwnSYdB5M3GE3tyz) | ![](https://drive.google.com/uc?export=view&id=11sIh1MxAZviErSHzhhJxLX8WrsttWzzO) | ![](https://drive.google.com/uc?export=view&id=11qcdt1QRTbKRJErdfeQTfjrjTcSoEolu)
| Red: Rouge-1, Light Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l

Best checkpoint at 2800: Rouge-1 is 0.2441, Rouge-2 is 0.0293, and Rouge-l is 0.1564

### Transformer based Model 5B: Same as Model 5 but train size = 512

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_5B: 
--hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 400 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_5B' --printEveryIters 400 --tbDescr 'MODEL_5B' \
--modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.3 \
--fullVocab True --trainSize 512 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 400 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5B' --printEveryIters 400 --tbDescr 'MODEL_5B' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.3 \
                        --fullVocab True --trainSize 512 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.3, fullVocab=True, hiddenDim=48, loadBestModel=False, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=400, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_5B', seed=0, tbDescr='MODEL_5B', tfThresh=0.0, toTrain=True, trainSize=512, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x564f24cd0000 @  0x7f27a95451e7 0x564ea9d7cf48 0x7f27800b753e 0x7f27800b7cd9 0x7f27800b7faf 0x7f27800b54b4 0x564ea9d4b0e4 0x564ea9d4ade0 0x564ea9dbf6f5 0x564ea9d4c69a 0x564ea9dbac9e 0x564ea9d4c69a 0x564ea9dbac9e 0x564ea9d4c69a 0x564ea9dbac9e 0x564ea9d4c69a 0x564ea9dbac9e 0x564ea9db9b0e 0x564ea9c8be2b 0x564ea9dbc1e6 0x564ea9db9e0d 0x564ea9c8be2b 0x564ea9dbc1e6 0x564ea9db9e0d 0x564ea9d4c77a 0x564ea9dbb86a 0x564ea9e3d858 0x564ea9dbaee2 0x564ea9db9b0e 0x564ea9d4c77a 0x564ea9dbb86a
max length (before adding stop token) in mini_df.description is 3974 and in mini_df.abstract (before adding start

#### Results

In [None]:
%%timeit -r 1 -n 1
''' MODEL_5B: '''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 400 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5B' --printEveryIters 400 --tbDescr 'MODEL_5B' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel True --toTrain False --dropout 0.3 \
                        --fullVocab True --trainSize 512 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=6, beamSize=0, dropout=0.3, fullVocab=True, hiddenDim=48, loadBestModel=True, lr=0.003, modelType='models.Seq2SeqwithXfmr', numEpochs=400, numLayers=2, printEveryIters=400, savedModelDir='./saved_models/MODEL_5B', seed=0, tbDescr='MODEL_5B', tfThresh=0.0, toTrain=False, trainSize=512, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55d1c858c000 @  0x7f18b77431e7 0x55d14e8e2f48 0x7f188e2b553e 0x7f188e2b5cd9 0x7f188e2b5faf 0x7f188e2b34b4 0x55d14e8b10e4 0x55d14e8b0de0 0x55d14e9256f5 0x55d14e8b269a 0x55d14e920c9e 0x55d14e8b269a 0x55d14e920c9e 0x55d14e8b269a 0x55d14e920c9e 0x55d14e8b269a 0x55d14e920c9e 0x55d14e91fb0e 0x55d14e7f1e2b 0x55d14e9221e6 0x55d14e91fe0d 0x55d14e7f1e2b 0x55d14e9221e6 0x55d14e91fe0d 0x55d14e8b277a 0x55d14e92186a 0x55d14e9a3858 0x55d14e920ee2 0x55d14e91fb0e 0x55d14e8b277a 0x55d14e92186a
max length (before adding stop token) in mini_df.description is 3974 and in mini_df.abstract (before adding start

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=11ifqnH18KMUNBjZdS_0INuhHTTPCZ9jw) | ![](https://drive.google.com/uc?export=view&id=11fWnHXbv5iUWcXpEpUuhnO1qYOsuW3CB) | ![](https://drive.google.com/uc?export=view&id=11d3bT-wjqybq_n0qab3WsibB-MQeucxw)
| Red: Rouge-1, Light Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l

Best checkpoint at 6000: Rouge-1 is 0.2745, Rouge-2 is 0.0337, and Rouge-l is 0.2030

### Transformer based Model 5C: Same as Model 5 but train size = 1024

With training_size = 2048, I tried dropout rates of 0.3, 0.1, 0.1, but the model was not able to reduce the training loss very much.

With train_size=1024 and dropout=0, the training Rouge score gets to 90, but the validation Rouge-1 maxes out at 27.4.

#### Training

In [None]:
%%timeit -r 1 -n 1
''' MODEL_5C: 
--hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 400 --lr 3e-3 \
--savedModelDir './saved_models/MODEL_5C' --printEveryIters 400 --tbDescr 'MODEL_5C' \
--modelType 'models.Seq2SeqwithXfmr' --loadBestModel False --toTrain True --dropout 0.0 \
--fullVocab True --trainSize 1024 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 100 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5C' --printEveryIters 400 --tbDescr 'MODEL_5C' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel True --toTrain True --dropout 0.0 \
                        --fullVocab True --trainSize 1024 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

#### Results

In [None]:
%%timeit -r 1 -n 1
''' MODEL_5C: '''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 6 --numEpochs 40 --lr 3e-3 \
                        --savedModelDir './saved_models/MODEL_5C' --printEveryIters 400 --tbDescr 'MODEL_5C' \
                        --modelType 'models.Seq2SeqwithXfmr' --loadBestModel True --toTrain False --dropout 0.0 \
                        --fullVocab True --trainSize 1024 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

## 4. Memory Efficient Transformers
For the same model size, I did not observe any degradation in performance when using the memory efficient transformer instead of a standard transformer. The advantage is that we can use more complex models and still fit inside the GPU memory.

### Transformer based Model 7
This has been the best model so far, with a Rouge-1 of 0.3836.

#### Training

In [None]:
%%timeit -r 1 -n 1
'''
--hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
--savedModelDir './saved_models/MODEL_7' --printEveryIters 500 --tbDescr 'MODEL_7' \
--modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
--fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_7' --printEveryIters 500 --tbDescr 'MODEL_7' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
                        --fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=False, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_7', seed=0, tbDescr='MODEL_7', tfThresh=0.0, toTrain=True, trainSize=5000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55d29efc8000 @  0x7f376c62b1e7 0x55d2256eff48 0x7f374319d53e 0x7f374319dcd9 0x7f374319dfaf 0x7f374319b4b4 0x55d2256be0e4 0x55d2256bdde0 0x55d2257326f5 0x55d2256bf69a 0x55d22572dc9e 0x55d2256bf69a 0x55d22572dc9e 0x55d2256bf69a 0x55d22572dc9e 0x55d2256bf69a 0x55d22572dc9e 0x55d22572cb0e 0x55d2255fee2b 0x55d22572f1e6 0x55d22572ce0d 0x55d2255fee2b 0x55d22572f1e6 0x55d22572ce0d 0x55d2256bf77a 0x55d22572e86a 0x55d2257b0858 0x55d22572dee2 0x55d22572cb0e 0x55d2256bf77a 0x55d22572e86a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (before

#### Results

In [None]:
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_7' --printEveryIters 500 --tbDescr 'MODEL_7' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel True --toTrain False \
                        --fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=True, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_7', seed=0, tbDescr='MODEL_7', tfThresh=0.0, toTrain=False, trainSize=5000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x563c4221a000 @  0x7f8031cc81e7 0x563bc6c64e68 0x7f800883a53e 0x7f800883acd9 0x7f800883afaf 0x7f80088384b4 0x563bc6c32d54 0x563bc6c32a50 0x563bc6ca7105 0x563bc6c3430a 0x563bc6ca260e 0x563bc6c3430a 0x563bc6ca260e 0x563bc6c3430a 0x563bc6ca260e 0x563bc6c3430a 0x563bc6ca260e 0x563bc6ca14ae 0x563bc6b73e2c 0x563bc6ca3bb5 0x563bc6ca17ad 0x563bc6b73e2c 0x563bc6ca3bb5 0x563bc6ca17ad 0x563bc6c343ea 0x563bc6ca332a 0x563bc6d24ec8 0x563bc6ca2853 0x563bc6ca14ae 0x563bc6c343ea 0x563bc6ca332a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (before

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=12U3tao6kgLmYxp7zcP5KQKSkitVpw4GF) | ![](https://drive.google.com/uc?export=view&id=12JC6WPT-rqFl9mRByhBKhkpEo12QHOGJ) | ![](https://drive.google.com/uc?export=view&id=12FkLXmLfdvBpxpSIyM_Cul0QoXzBxNLU)
| Red: Rouge-1, Light Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l

Best checkpoint at 20,500: Rouge-1 is 0.3836, Rouge-2 is 0.1328, and Rouge-l is 0.2679

### Transformer based Model 7C
Same as Model 7 but with 6000 training examples

#### Training

In [None]:
%%timeit -r 1 -n 1
'''
--hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
--savedModelDir './saved_models/MODEL_7C' --printEveryIters 500 --tbDescr 'MODEL_7C' \
--modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
--fullVocab True --trainSize 6000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_7C' --printEveryIters 500 --tbDescr 'MODEL_7C' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
                        --fullVocab True --trainSize 6000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=False, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_7C', seed=0, tbDescr='MODEL_7C', tfThresh=0.0, toTrain=True, trainSize=6000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55aded82a000 @  0x7f777c5da1e7 0x55ad71fd3f48 0x7f775314c53e 0x7f775314ccd9 0x7f775314cfaf 0x7f775314a4b4 0x55ad71fa20e4 0x55ad71fa1de0 0x55ad720166f5 0x55ad71fa369a 0x55ad72011c9e 0x55ad71fa369a 0x55ad72011c9e 0x55ad71fa369a 0x55ad72011c9e 0x55ad71fa369a 0x55ad72011c9e 0x55ad72010b0e 0x55ad71ee2e2b 0x55ad720131e6 0x55ad72010e0d 0x55ad71ee2e2b 0x55ad720131e6 0x55ad72010e0d 0x55ad71fa377a 0x55ad7201286a 0x55ad72094858 0x55ad72011ee2 0x55ad72010b0e 0x55ad71fa377a 0x55ad7201286a
max length (before adding stop token) in mini_df.description is 3998 and in mini_df.abstract (befo

#### Results

In [None]:
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_7C' --printEveryIters 500 --tbDescr 'MODEL_7C' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel True --toTrain False \
                        --fullVocab True --trainSize 6000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=True, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_7C', seed=0, tbDescr='MODEL_7C', tfThresh=0.0, toTrain=False, trainSize=6000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x5555e977a000 @  0x7f9ebbaa01e7 0x55556e193e68 0x7f9e9261253e 0x7f9e92612cd9 0x7f9e92612faf 0x7f9e926104b4 0x55556e161d54 0x55556e161a50 0x55556e1d6105 0x55556e16330a 0x55556e1d160e 0x55556e16330a 0x55556e1d160e 0x55556e16330a 0x55556e1d160e 0x55556e16330a 0x55556e1d160e 0x55556e1d04ae 0x55556e0a2e2c 0x55556e1d2bb5 0x55556e1d07ad 0x55556e0a2e2c 0x55556e1d2bb5 0x55556e1d07ad 0x55556e1633ea 0x55556e1d232a 0x55556e253ec8 0x55556e1d1853 0x55556e1d04ae 0x55556e1633ea 0x55556e1d232a
max length (before adding stop token) in mini_df.description is 3998 and in mini_df.abstract (befo

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=11qKWCYg74V1R2MUsRTR4uQC6m73fnsjd) | ![](https://drive.google.com/uc?export=view&id=1-LuYqr5KyBAHPfPG4DZ2WZ244VE4XSrH) | ![](https://drive.google.com/uc?export=view&id=12ESjtqY8MV3qv2xGnlxB4zy-Vm2JBrXM)
| Red: Rouge-1, Light Blue: Rouge-2, Pink: Rouge-l | Teal: Rouge-1, Gray: Rouge-2, Orange: Rouge-l

Best checkpoint at 20000: Rouge-1 is 0.3642, Rouge-2 is 0.1243, and Rouge-l is 0.2890

### MODEL_7B

Same as Model 7 (5000 training examples) but with dropout. With a dropout of 0.1 the model performs poorly, i.e. it has high bias. The same holds for dropout rates of 0.03 and 0.01.

### Transformers Based Model_8 (same as model 7 but with 10000 training examples)

This does not work well; it performs worse than Model 7.

#### Training

In [None]:
%%timeit -r 1 -n 1
'''
--hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
--savedModelDir './saved_models/MODEL_8' --printEveryIters 500 --tbDescr 'MODEL_8' \
--modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
--fullVocab True --trainSize 10000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_8' --printEveryIters 500 --tbDescr 'MODEL_8' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
                        --fullVocab True --trainSize 10000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=False, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_8', seed=0, tbDescr='MODEL_8', tfThresh=0.0, toTrain=True, trainSize=10000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x56288f918000 @  0x7febe33ec1e7 0x562815881f48 0x7febb9f5e53e 0x7febb9f5ecd9 0x7febb9f5efaf 0x7febb9f5c4b4 0x5628158500e4 0x56281584fde0 0x5628158c46f5 0x56281585169a 0x5628158bfc9e 0x56281585169a 0x5628158bfc9e 0x56281585169a 0x5628158bfc9e 0x56281585169a 0x5628158bfc9e 0x5628158beb0e 0x562815790e2b 0x5628158c11e6 0x5628158bee0d 0x562815790e2b 0x5628158c11e6 0x5628158bee0d 0x56281585177a 0x5628158c086a 0x562815942858 0x5628158bfee2 0x5628158beb0e 0x56281585177a 0x5628158c086a
max length (before adding stop token) in mini_df.description is 3998 and in mini_df.abstract (befor

#### Results

In [None]:
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_8' --printEveryIters 500 --tbDescr 'MODEL_8' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel True --toTrain False \
                        --fullVocab True --trainSize 10000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=True, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_8', seed=0, tbDescr='MODEL_8', tfThresh=0.0, toTrain=False, trainSize=10000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55e63e56a000 @  0x7fc60d9911e7 0x55e5c3a52f48 0x7fc5e450353e 0x7fc5e4503cd9 0x7fc5e4503faf 0x7fc5e45014b4 0x55e5c3a210e4 0x55e5c3a20de0 0x55e5c3a956f5 0x55e5c3a2269a 0x55e5c3a90c9e 0x55e5c3a2269a 0x55e5c3a90c9e 0x55e5c3a2269a 0x55e5c3a90c9e 0x55e5c3a2269a 0x55e5c3a90c9e 0x55e5c3a8fb0e 0x55e5c3961e2b 0x55e5c3a921e6 0x55e5c3a8fe0d 0x55e5c3961e2b 0x55e5c3a921e6 0x55e5c3a8fe0d 0x55e5c3a2277a 0x55e5c3a9186a 0x55e5c3b13858 0x55e5c3a90ee2 0x55e5c3a8fb0e 0x55e5c3a2277a 0x55e5c3a9186a
max length (before adding stop token) in mini_df.description is 3998 and in mini_df.abstract (befor

### Transformer based Model 9

#### Training

In [None]:
%%timeit -r 1 -n 1
'''
--hiddenDim 128 --numLayers 3 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
--savedModelDir './saved_models/MODEL_9' --printEveryIters 500 --tbDescr 'MODEL_9' \
--modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
--fullVocab True --trainSize 10000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

weight_tying=False (by default it is False)
dec_numlayers = enc_numlayers+2 (by default it is enc_numlayers+2)
'''
!python ./src/train.py --hiddenDim 128 --numLayers 4 --batchSize 15 --numEpochs 100 --lr 1e-4 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_9' --printEveryIters 500 --tbDescr 'MODEL_9' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel True --toTrain True \
                        --fullVocab True --trainSize 1000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=15, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=True, lr=0.0001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=4, printEveryIters=500, savedModelDir='./saved_models/MODEL_9', seed=0, tbDescr='MODEL_9', tfThresh=0.0, toTrain=True, trainSize=1000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55960a394000 @  0x7ff4a50681e7 0x55958fb02f48 0x7ff47bbda53e 0x7ff47bbdacd9 0x7ff47bbdafaf 0x7ff47bbd84b4 0x55958fad10e4 0x55958fad0de0 0x55958fb456f5 0x55958fad269a 0x55958fb40c9e 0x55958fad269a 0x55958fb40c9e 0x55958fad269a 0x55958fb40c9e 0x55958fad269a 0x55958fb40c9e 0x55958fb3fb0e 0x55958fa11e2b 0x55958fb421e6 0x55958fb3fe0d 0x55958fa11e2b 0x55958fb421e6 0x55958fb3fe0d 0x55958fad277a 0x55958fb4186a 0x55958fbc3858 0x55958fb40ee2 0x55958fb3fb0e 0x55958fad277a 0x55958fb4186a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (before

#### Results

In [None]:
!python ./src/train.py --hiddenDim 128 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_7' --printEveryIters 500 --tbDescr 'MODEL_7' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel True --toTrain False \
                        --fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=128, loadBestModel=True, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_7', seed=0, tbDescr='MODEL_7', tfThresh=0.0, toTrain=False, trainSize=5000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x55f3e2d52000 @  0x7f06066481e7 0x55f368ff0f48 0x7f05dd1ba53e 0x7f05dd1bacd9 0x7f05dd1bafaf 0x7f05dd1b84b4 0x55f368fbf0e4 0x55f368fbede0 0x55f3690336f5 0x55f368fc069a 0x55f36902ec9e 0x55f368fc069a 0x55f36902ec9e 0x55f368fc069a 0x55f36902ec9e 0x55f368fc069a 0x55f36902ec9e 0x55f36902db0e 0x55f368effe2b 0x55f3690301e6 0x55f36902de0d 0x55f368effe2b 0x55f3690301e6 0x55f36902de0d 0x55f368fc077a 0x55f36902f86a 0x55f3690b1858 0x55f36902eee2 0x55f36902db0e 0x55f368fc077a 0x55f36902f86a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (before

### Transformer based Model 10

Same as Model 7 but with hiddenDim = 200.

However, I was not able to train it with lr=1e-3, and I had the same problem with 4 layers.

Note that with hiddenDim=48 or 128, training works fine with lr=1e-3; the model can even overfit the training data with 128 examples.

Finding the right learning rate appears to be harder for a more complex model (i.e. more layers or a larger hidden dimension), and I was not successful in training such models. Even with only 128 training examples I was not able to overfit the training data: the loss did not decrease much and, as far as I recall, stagnated around 4.

#### Training

In [None]:
%%timeit -r 1 -n 1
'''
--hiddenDim 200 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
--savedModelDir './saved_models/MODEL_10' --printEveryIters 500 --tbDescr 'MODEL_10' \
--modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
--fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0
'''
!python ./src/train.py --hiddenDim 48 --numLayers 2 --batchSize 24 --numEpochs 100 --lr 1e-3 --dropout 0.0 \
                        --savedModelDir './saved_models/MODEL_10' --printEveryIters 500 --tbDescr 'MODEL_10' \
                        --modelType 'models.Seq2SeqwithXfmrMemEfficient' --loadBestModel False --toTrain True \
                        --fullVocab True --trainSize 5000 --valSize 16 --seed 0 --tfThresh 0.0 --beamSize 0

Namespace(batchSize=24, beamSize=0, dropout=0.0, fullVocab=True, hiddenDim=48, loadBestModel=False, lr=0.001, modelType='models.Seq2SeqwithXfmrMemEfficient', numEpochs=100, numLayers=2, printEveryIters=500, savedModelDir='./saved_models/MODEL_10', seed=0, tbDescr='MODEL_10', tfThresh=0.0, toTrain=True, trainSize=5000, valSize=16)
Getting the training and validation data...
tcmalloc: large alloc 2406391808 bytes == 0x559a10f5a000 @  0x7f17fe52f1e7 0x5599968e0f48 0x7f17d50a153e 0x7f17d50a1cd9 0x7f17d50a1faf 0x7f17d509f4b4 0x5599968af0e4 0x5599968aede0 0x5599969236f5 0x5599968b069a 0x55999691ec9e 0x5599968b069a 0x55999691ec9e 0x5599968b069a 0x55999691ec9e 0x5599968b069a 0x55999691ec9e 0x55999691db0e 0x5599967efe2b 0x5599969201e6 0x55999691de0d 0x5599967efe2b 0x5599969201e6 0x55999691de0d 0x5599968b077a 0x55999691f86a 0x5599969a1858 0x55999691eee2 0x55999691db0e 0x5599968b077a 0x55999691f86a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (befor

#### Results

### Transformer based Model 11
I tried to share the weights between the decoder embeddings and the decoder output projection (into the vocabulary dimension), as in the "Attention Is All You Need" paper. However, it performed worse, and I am not sure why. Increasing the hidden dimension or the number of layers did not help either.  
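
For readers unfamiliar with weight tying, the idea can be sketched as follows. This is a minimal illustration that assumes PyTorch; the class name `TiedDecoderHead` and the sizes used are hypothetical and are not taken from `models.Seq2SeqwithXfmrMemEfficient`.

```python
import torch
import torch.nn as nn


class TiedDecoderHead(nn.Module):
    """Hypothetical decoder embedding + output projection sharing one weight matrix."""

    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        # Embedding weight has shape (vocab_size, hidden_dim).
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        # Linear weight has shape (out_features, in_features) = (vocab_size, hidden_dim),
        # so the two matrices can be shared directly without a transpose.
        self.out_proj = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.out_proj.weight = self.embedding.weight  # tie the weights

    def embed(self, token_ids):
        return self.embedding(token_ids)

    def logits(self, hidden_states):
        return self.out_proj(hidden_states)


# Assumed toy sizes, for illustration only.
head = TiedDecoderHead(vocab_size=1000, hidden_dim=128)
tokens = torch.randint(0, 1000, (2, 5))      # (batch, sequence length)
logits = head.logits(head.embed(tokens))     # (batch, sequence length, vocab_size)
num_params = sum(p.numel() for p in head.parameters())  # shared matrix is counted once
```

With tying, the embedding and the output projection are the same parameter, so a gradient step on one is a gradient step on the other. That coupling is one possible reason the tied model behaves differently from the untied one, although the experiments in this notebook do not establish the cause.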

I also tried scaling up the word embeddings before adding them to the positional embeddings, i.e. multiplying the word embeddings by sqrt(hiddenDim) as in the "Attention Is All You Need" paper, but this seemed to perform worse. So I settled on a scale factor of sqrt(4).
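
The scaling step can be sketched in plain Python as below. This sketch assumes sinusoidal positional encodings as defined in the paper; the notebook's model may instead use learned positional embeddings, and the function names here are hypothetical.

```python
import math


def sinusoidal_position(pos, hidden_dim):
    """Sinusoidal positional encoding for one position (assumes an even hidden_dim)."""
    enc = []
    for i in range(0, hidden_dim, 2):
        angle = pos / (10000 ** (i / hidden_dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def embed_with_position(word_emb, pos, scale):
    """Scale the word embedding, then add the positional encoding element-wise."""
    pos_enc = sinusoidal_position(pos, len(word_emb))
    return [scale * w + p for w, p in zip(word_emb, pos_enc)]


hidden_dim = 128
paper_scale = math.sqrt(hidden_dim)  # factor used in the paper, about 11.3 for hiddenDim=128
notebook_scale = math.sqrt(4)        # factor settled on in this notebook

# Toy 4-dimensional word embedding at position 0, for illustration only.
combined = embed_with_position([1.0, 1.0, 1.0, 1.0], pos=0, scale=notebook_scale)
```

The positional encoding values lie in [-1, 1], so the scale factor controls how strongly the word identity dominates the position signal in the summed vector.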



## Visualization Using Tensorboard

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir='runs/seq2seqWithAtten'

## Future Experimentation
- Beam search and other ideas that I tried above: they make sense theoretically but did not help, so I need to revisit them.
- Use pretrained word embeddings, e.g. GloVe or BERT.
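
As a reference point for revisiting beam search, here is a minimal, framework-free sketch. It is not the implementation behind the `--beamSize` flag in `train.py`; the toy distribution `TOY` and all function names are hypothetical and exist only to show a case where beam search finds a more probable sequence than greedy decoding.

```python
import math

# Toy next-token distributions keyed by the last token (illustrative assumption).
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(seq):
    """Return log-probabilities of the next token; unknown contexts emit the stop token."""
    dist = TOY.get(seq[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}


def beam_search(next_log_probs, start_token, stop_token, beam_size, max_len):
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == stop_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include unfinished beams if max_len was reached
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=1, max_len=5)
beam_seq, beam_score = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=2, max_len=5)
```

In this toy case, greedy decoding commits to the locally best first token `a` and ends with probability 0.3, while a beam of size 2 keeps `b` alive and recovers the sequence with probability 0.36. This sketch compares raw log-probabilities without length normalization, which is a simplification.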

Other ideas to try:
- http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/
- https://towardsdatascience.com/abstractive-summarization-of-dialogues-f530c7d290be
- https://www.analyticsvidhya.com/blog/2021/02/dialogue-summarization-a-deep-learning-approach/
- https://www.analyticsvidhya.com/blog/2020/11/summarize-twitter-live-data-using-pretrained-nlp-models/