# Model Experimentation

This notebook experiments with deep-learning models for text summarization. The architecture used here is an encoder-decoder network built from LSTM layers; refer to the literature review notebook for further details on this architecture.

We vary settings such as the hidden dimension size, dropout, the size of the training-data vocabulary, and the number of LSTM layers.

We will use PyTorch to train the models and TensorBoard (integrated with PyTorch) for visualization.

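Each combination of settings gets its own TensorBoard run description (the `tb_descr` argument used in the training cells). The sketch below shows one way such descriptions could be generated consistently. `make_run_tag` and the `grid` dictionary are hypothetical helpers written for illustration, not part of this project's `src` code; only the naming pattern (for example `dropout-0p2_hiddim-100_numlyrs-2`) is taken from the runs in this notebook.

```python
from itertools import product


def make_run_tag(dropout, hidden_dim, num_layers, suffix=""):
    """Build a run description such as 'dropout-0p2_hiddim-100_numlyrs-2'."""
    # Write 0.2 as '0p2' so the tag contains no '.'; 0.0 becomes '0'.
    dropout_str = f"{dropout:g}".replace(".", "p")
    tag = f"dropout-{dropout_str}_hiddim-{hidden_dim}_numlyrs-{num_layers}"
    return f"{tag}_{suffix}" if suffix else tag


# Hypothetical grid; the values mirror settings tried in this notebook.
grid = {"dropout": [0.0, 0.2, 0.3], "hidden_dim": [50, 100], "num_layers": [2]}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for cfg in configs:
    print(make_run_tag(**cfg, suffix="full-de-vocab"))
```

Deriving the description from the settings, rather than typing it by hand, keeps the TensorBoard run names in step with the hyperparameters that were actually used.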
## Mount Google Drive and Import Libraries

In [116]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [117]:
import sys
import os
import numpy as np
import pandas as pd
import torch
import torch.utils.data as data
import matplotlib.pyplot as plt
%matplotlib inline
# for auto-reloading external modules (automatically reloads before using an imported module)
%load_ext autoreload
%autoreload 2

# Ensure that the Colab Python interpreter can load Python files from within the project's src directory
PATH_NAME = os.path.join('/', 'content', 'drive', 'My Drive', 'Colab Notebooks', 'UCSDX_MLE_Bootcamp', 'Text_Summarization_UCSD', 'ModelBuilding')
sys.path.append(os.path.join(PATH_NAME, 'src'))
print(sys.path)
%cd $PATH_NAME

print(f'Torch version {torch.__version__}')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
['', '/content', '/env/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.7/dist-packages/IPython/extensions', '/root/.ipython', '/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding/src', '/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding/src', '/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding/src']
/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding
Torch version 1.8.1+cu101


In [118]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Load Data and Utility Functions
We will use cpc_codes 'de' from the BigPatent dataset

In [119]:
import utils

In [120]:
'''
data = utils.load_data_numpy(split_type='train', cpc_codes='de', fname='data0_np.npz')
for data_np in data:
    print(data_np['data'].shape, data_np['data'][0,0].shape[1], data_np['data'][0,1].shape[1])
    print(data_np['data'][0,1])
    break
del data, data_np
''';

### Mini Data: Generate vocabulary, word2idx, idx2word, and numpy array

This is needed because the vocabulary of the full dataset is too large for quick prototyping and debugging.

We still try both: the full vocabulary of the de dataset and the vocabulary created from the mini training set.

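As a rough illustration of what building a vocabulary, `word2idx`, `idx2word`, and index arrays from a mini dataset involves, here is a minimal sketch. It is not the code in `utils`: the function names, the whitespace tokenization, and the special tokens `<pad>`, `<unk>`, `<start>`, and `<stop>` are all assumptions made for the example.

```python
from collections import Counter


def build_vocab(texts, min_count=1, specials=("<pad>", "<unk>", "<start>", "<stop>")):
    """Return (vocab, word2idx, idx2word) built from whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Special tokens get the lowest indices; remaining words are ordered by
    # descending frequency, with ties broken alphabetically for determinism.
    words = sorted((w for w, c in counts.items() if c >= min_count),
                   key=lambda w: (-counts[w], w))
    vocab = list(specials) + words
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    return vocab, word2idx, idx2word


def encode(text, word2idx, max_len):
    """Map a text to max_len indices: known words, then a stop token, then padding."""
    unk, stop, pad = word2idx["<unk>"], word2idx["<stop>"], word2idx["<pad>"]
    ids = [word2idx.get(word, unk) for word in text.lower().split()][: max_len - 1]
    ids.append(stop)
    return ids + [pad] * (max_len - len(ids))


mini_abstracts = ["a valve assembly for a pump", "a pump housing"]  # toy stand-in data
vocab, word2idx, idx2word = build_vocab(mini_abstracts)
print(len(vocab), encode("a pump bracket", word2idx, max_len=6))
```

A vocabulary built this way from a small sample is much smaller than the full one, which shrinks the embedding and output layers and speeds up debugging, at the cost of more words mapping to the unknown token.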
## LSTM Based Encoder-Decoder

For further details, see:

https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/

http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html

In [121]:
!pip install rouge



In [123]:
# !touch ../__init__.py
# !touch __init__.py
# !touch src/__init__.py
# !touch ../tests/__init__.py

'''
then add 
from ..ModelBuilding.src import models in tests/test_ModelBuilding1.py

https://stackoverflow.com/questions/448271/what-is-init-py-for

see 5.7 package relative imports at: 
https://docs.python.org/3/reference/import.html#regular-packages
''';

In [124]:
#testing
!python -m pytest -s ../tests/

platform linux -- Python 3.7.10, pytest-3.6.4, py-1.10.0, pluggy-0.7.1
rootdir: /content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD, inifile:
plugins: typeguard-2.7.1
collected 3 items

../tests/test_ModelBuilding1.py ...



In [None]:
!python ./src/train.py --hidden_dim 50 --num_layers 2 --batch_size 2 --num_epochs 1 --lr 2e-3 \
                        --print_every_iters 50 --tb_descr 'zzdropout-0_hiddim-50_numlyrs-2_full-de-data'

In [None]:
import models
import train

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_data, val_data, lang_train = train.get_data(use_full_vocab=True, cpc_codes='de', fname='data0_str_json.gz',
                                                    train_size=128, val_size=16)

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size is: 2.125MB
Data shape is: torch.Size([16, 4000]), torch.Size([16, 150]), torch.Size([16])


#### Seq2Seq: lr=0.002, dropout=0.0, hiddim=50, numlyrs=2, full-de-vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
# Uses the full vocab/word2idx/idx2word from the de dataset (train_size=128, val_size=16)
# The encoder and decoder must have the same number of layers
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=50, num_layers=2, bidir=True)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=50, num_layers=2, bidir=False)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=300, lr=2e-3, print_every_iters=250, tb_descr='dropout-0_hiddim-50_numlyrs-2_full-de-data')

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=1Kf7zitCVM8_7V1GsmQLbYttLWU9hqgDv) | ![](https://drive.google.com/uc?export=view&id=1Ng2IWaG9rRoLaQELqWBVfqUiB3lkrVBD) | ![](https://drive.google.com/uc?export=view&id=1a4DN6uOJky9tyZ2WyOd1GCTUzIkFkj8F)
| Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l

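The plots in this notebook track Rouge-1, Rouge-2, and Rouge-l. As a reminder of what the first two measure, the sketch below computes ROUGE-N as clipped n-gram overlap between a generated and a reference summary. It is a simplified illustration written for this note, not the implementation in the `rouge` package installed above, and it omits Rouge-l, which is based on the longest common subsequence.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "a valve assembly for a pump"
hypothesis = "a valve for a pump"
print(rouge_n(hypothesis, reference, n=1))
print(rouge_n(hypothesis, reference, n=2))
```

Because Rouge-2 requires matching word pairs, it stays close to zero until the model starts producing locally fluent phrases, which is consistent with Rouge-2 lagging behind Rouge-1 in the training logs.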
#### Seq2Seq: lr=0.002, dropout=0.0, hiddim=100, numlyrs=2, vocab generated from 128 training examples, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
# The encoder and decoder must have the same number of layers (train_size=128, val_size=16)
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=100, num_layers=2, bidir=True)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=100, num_layers=2, bidir=False)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=2500, lr=2e-3, print_every_iters=250, tb_descr='dropout-0_hiddim-100_numlyrs-2')

After Iteration 0, Loss is: 7.773667
	Model eval on training data after iteration 0...
		Rouge-1 is 0.0004, Rouge-2 is 0.0000, and Rouge-l is 0.0013
	Model eval on validation data after iteration 0...
		Rouge-1 is 0.0000, Rouge-2 is 0.0000, and Rouge-l is 0.0000
After Iteration 250, Loss is: 5.875730
	Model eval on training data after iteration 250...
		Rouge-1 is 0.1582, Rouge-2 is 0.0106, and Rouge-l is 0.1518
	Model eval on validation data after iteration 250...
		Rouge-1 is 0.1653, Rouge-2 is 0.0091, and Rouge-l is 0.1676
After Iteration 500, Loss is: 4.404660
	Model eval on training data after iteration 500...
		Rouge-1 is 0.1582, Rouge-2 is 0.0213, and Rouge-l is 0.1611
	Model eval on validation data after iteration 500...
		Rouge-1 is 0.1539, Rouge-2 is 0.0208, and Rouge-l is 0.1720
After Iteration 750, Loss is: 2.953698
	Model eval on training data after iteration 750...
		Rouge-1 is 0.2504, Rouge-2 is 0.0370, and Rouge-l is 0.1456
	Model eval on validation data after iteration

#### Seq2Seq: lr=0.002, dropout=0.2, hiddim=100, numlyrs=2, vocab generated from 128 training examples, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
# The encoder and decoder must have the same number of layers (train_size=128, val_size=16)
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=100, num_layers=2, bidir=True, dropout=0.2)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=100, num_layers=2, bidir=False, dropout=0.2)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=3500, lr=2e-3, print_every_iters=250, tb_descr='dropout-0p2_hiddim-100_numlyrs-2')

After Iteration 0, Loss is: 7.825840
	Model eval on training data after iteration 0...
		Rouge-1 is 0.0555, Rouge-2 is 0.0001, and Rouge-l is 0.0408
	Model eval on validation data after iteration 0...
		Rouge-1 is 0.0557, Rouge-2 is 0.0000, and Rouge-l is 0.0470
After Iteration 250, Loss is: 5.781209
	Model eval on training data after iteration 250...
		Rouge-1 is 0.2227, Rouge-2 is 0.0280, and Rouge-l is 0.1415
	Model eval on validation data after iteration 250...
		Rouge-1 is 0.2102, Rouge-2 is 0.0289, and Rouge-l is 0.1569
After Iteration 500, Loss is: 4.535874
	Model eval on training data after iteration 500...
		Rouge-1 is 0.2484, Rouge-2 is 0.0415, and Rouge-l is 0.1804
	Model eval on validation data after iteration 500...
		Rouge-1 is 0.2547, Rouge-2 is 0.0407, and Rouge-l is 0.2007
After Iteration 750, Loss is: 3.241772
	Model eval on training data after iteration 750...
		Rouge-1 is 0.2728, Rouge-2 is 0.0481, and Rouge-l is 0.1823
	Model eval on validation data after iteration

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=1kAlMYvIaPsl0M8J2c-kasY8YfAW6Ho77) | ![](https://drive.google.com/uc?export=view&id=1LffsjVI7Yck9lXRBxa2p7YXAt6Ht2X6c) | ![](https://drive.google.com/uc?export=view&id=18LqZ-oSPAMJFBfuAKFNJ3nzn11LrX5NX)
| Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l


#### Seq2Seq: lr=0.002, dropout=0.2, hiddim=100, numlyrs=2, full de vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
# Uses the full vocab/word2idx/idx2word from the de dataset (train_size=128, val_size=16)
# The encoder and decoder must have the same number of layers
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=100, num_layers=2, bidir=True, dropout=0.2)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=100, num_layers=2, bidir=False, dropout=0.2)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=2500, lr=2e-3, print_every_iters=250, tb_descr='dropout-0p2_hiddim-100_numlyrs-2_full-de-vocab')

After Iteration 0, Loss is: 9.306280
	Model eval on training data after iteration 0...
		Rouge-1 is 0.0016, Rouge-2 is 0.0000, and Rouge-l is 0.0052
	Model eval on validation data after iteration 0...
		Rouge-1 is 0.0010, Rouge-2 is 0.0000, and Rouge-l is 0.0047
After Iteration 250, Loss is: 6.714604
	Model eval on training data after iteration 250...
		Rouge-1 is 0.0913, Rouge-2 is 0.0000, and Rouge-l is 0.0718
	Model eval on validation data after iteration 250...
		Rouge-1 is 0.0952, Rouge-2 is 0.0000, and Rouge-l is 0.0704
After Iteration 500, Loss is: 6.069162
	Model eval on training data after iteration 500...
		Rouge-1 is 0.1487, Rouge-2 is 0.0065, and Rouge-l is 0.1274
	Model eval on validation data after iteration 500...
		Rouge-1 is 0.1547, Rouge-2 is 0.0049, and Rouge-l is 0.1297
After Iteration 750, Loss is: 5.183101
	Model eval on training data after iteration 750...
		Rouge-1 is 0.2089, Rouge-2 is 0.0305, and Rouge-l is 0.1989
	Model eval on validation data after iteration

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=1AjN7dYoinJ7jdhGvcYXmnOyK9YvlKleA) | ![](https://drive.google.com/uc?export=view&id=1vDXrVreM9fzYs4qEdFWcF4YeQE1LZd02) | ![](https://drive.google.com/uc?export=view&id=1IbbRKXkcn2yus053zsx17stB6Grw_Op-)
| Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l


#### Seq2Seq: lr=0.002, dropout=0.3, hiddim=100, numlyrs=2, full-de-vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
# Uses the full vocab/word2idx/idx2word from the de dataset (train_size=128, val_size=16)
# The encoder and decoder must have the same number of layers
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=100, num_layers=2, bidir=True, dropout=0.3)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=100, num_layers=2, bidir=False, dropout=0.3)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=2500, lr=2e-3, print_every_iters=250, tb_descr='dropout-0p3_hiddim-100_numlyrs-2_full-de-vocab')

After Iteration 0, Loss is: 9.304827
	Model eval on training data after iteration 0...
		Rouge-1 is 0.0003, Rouge-2 is 0.0000, and Rouge-l is 0.0011
	Model eval on validation data after iteration 0...
		Rouge-1 is 0.0000, Rouge-2 is 0.0000, and Rouge-l is 0.0000
After Iteration 250, Loss is: 6.736654
	Model eval on training data after iteration 250...
		Rouge-1 is 0.0913, Rouge-2 is 0.0000, and Rouge-l is 0.0718
	Model eval on validation data after iteration 250...
		Rouge-1 is 0.0952, Rouge-2 is 0.0000, and Rouge-l is 0.0704
After Iteration 500, Loss is: 5.781800
	Model eval on training data after iteration 500...
		Rouge-1 is 0.1210, Rouge-2 is 0.0120, and Rouge-l is 0.1623
	Model eval on validation data after iteration 500...
		Rouge-1 is 0.1240, Rouge-2 is 0.0108, and Rouge-l is 0.1655
After Iteration 750, Loss is: 4.726680
	Model eval on training data after iteration 750...
		Rouge-1 is 0.1668, Rouge-2 is 0.0253, and Rouge-l is 0.1811
	Model eval on validation data after iteration

#### Seq2Seq: lr=0.002, dropout=0.2, hiddim=100, numlyrs=2, full-de-vocab, train_size=512, val_size=16

In [None]:
data_train = utils.load_data_string(split_type='train', cpc_codes='de', fname='data0_str_json.gz')
data_val = utils.load_data_string(split_type='val', cpc_codes='de', fname='data0_str_json.gz')
mini_df_train = get_mini_df(data_train, mini_df_size=512) 
mini_df_val = get_mini_df(data_val, mini_df_size=16) 
#--------------------------------------------------------------------------------------------
lang_train = utils.Mini_Data_Language_Info(mini_df_train, desc_word2idx=desc_word2idx,abs_word2idx=abs_word2idx,
                                           desc_idx2word=desc_idx2word, abs_idx2word=abs_idx2word,
                                           desc_vocab=desc_vocab, abs_vocab=abs_vocab)
# lang_train = utils.Mini_Data_Language_Info(mini_df_train) #generate vocab etc
lang_val = utils.Mini_Data_Language_Info(mini_df_val, desc_word2idx=lang_train.desc_word2idx,abs_word2idx=lang_train.abs_word2idx)
#---------------------------------------------------------------------------------------------
train_data = utils.bigPatentDataset(lang_train.mini_data, shuffle=True)
train_data.memory_size()
val_data = utils.bigPatentDataset(lang_val.mini_data, shuffle=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

max length (before adding stop token) in mini_df.description is 3974 and in mini_df.abstract (before adding start/stop tokens) is 149
(512, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([512, 4000]), torch.Size([512, 150]), torch.Size([512])
Total data size is: 8.500MB
Data shape is: torch.Size([16, 4000]), torch.Size([16, 150]), torch.Size([16])


In [None]:
%%timeit -r 1 -n 1
# Uses the full vocab/word2idx/idx2word from the de dataset (train_size=512, val_size=16)
# The encoder and decoder must have the same number of layers
encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=100, num_layers=2, bidir=True, dropout=0.2)
decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=100, num_layers=2, bidir=False, dropout=0.2)
model = models.Seq2Seq(encoder, decoder)

train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
            batch_size=128, num_epochs=2500, lr=2e-3, print_every_iters=250, tb_descr='dropout-0p2_hiddim-100_numlyrs-2_full-de-vocab-trainsize-512-valsize-16')

After Iteration 0, Loss is: 9.313465
	Model eval on training data after iteration 0...
		Rouge-1 is 0.0013, Rouge-2 is 0.0000, and Rouge-l is 0.0045
	Model eval on validation data after iteration 0...
		Rouge-1 is 0.0013, Rouge-2 is 0.0000, and Rouge-l is 0.0054
After Iteration 250, Loss is: 6.959686
	Model eval on training data after iteration 250...
		Rouge-1 is 0.0564, Rouge-2 is 0.0000, and Rouge-l is 0.0385
	Model eval on validation data after iteration 250...
		Rouge-1 is 0.0544, Rouge-2 is 0.0000, and Rouge-l is 0.0406
After Iteration 500, Loss is: 6.579335
	Model eval on training data after iteration 500...
		Rouge-1 is 0.0386, Rouge-2 is 0.0000, and Rouge-l is 0.0397
	Model eval on validation data after iteration 500...
		Rouge-1 is 0.0408, Rouge-2 is 0.0000, and Rouge-l is 0.0313
After Iteration 750, Loss is: 5.850194
	Model eval on training data after iteration 750...
		Rouge-1 is 0.1547, Rouge-2 is 0.0217, and Rouge-l is 0.1933
	Model eval on validation data after iteration

Training Loss |Training Data Rouge | Validation Data Rouge
--- | --- | ---
![](https://drive.google.com/uc?export=view&id=1Hy_uPwJqZeXkGS8yMop4hh8EPXouBBs8) | ![](https://drive.google.com/uc?export=view&id=16pGdDmWSNT20jhBXsYNomIdgQQetDPwT) | ![](https://drive.google.com/uc?export=view&id=1-Aq6-lTs0fJ6_FopLEQC-G1suURPtTZc)
| Orange: Rouge-1, Blue: Rouge-2, Dark Orange: Rouge-l | Light Blue: Rouge-1, Pink: Rouge-2, Teal: Rouge-l

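Note that the initial loss will be approximately log(abstract_vocab_size), i.e. -log(1/abstract_vocab_size), because a randomly initialized model assigns roughly equal probability to every word in the abstract vocabulary. For the full de abstract vocabulary of 10769 words this gives about 9.28, close to the initial losses of roughly 9.31 logged above.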

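The relation between the initial loss and the vocabulary size can be checked with a few lines of arithmetic. This sketch assumes the training loss is a mean cross-entropy measured in nats (PyTorch's default for cross-entropy); `train.py` is not shown here, so treat that as an assumption. The function name is made up for the example.

```python
import math


def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) when every token is assigned probability 1/vocab_size."""
    return -math.log(1.0 / vocab_size)


# Abstract vocabulary size reported above for the full 'de' vocabulary.
full_vocab = 10769
print(f"expected initial loss: {uniform_cross_entropy(full_vocab):.3f}")

# Inverting the relation gives a rough effective vocabulary size from an observed loss.
observed_mini_loss = 7.773667  # iteration-0 loss of a mini-vocabulary run above
print(f"implied vocab size: {math.exp(observed_mini_loss):.0f}")
```

An initial loss far from this value is a useful warning sign, for example of a mis-sized output layer or a loss that is summed rather than averaged over tokens.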
## Visualization Using Tensorboard

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir='runs'

## Future Experimentation
1. Use GloVe embeddings to initialize the embedding layer
2. For the decoder, initialize both h and c from the encoder output, or only h with c initialized to zero?
3. See whether the stop token can be removed from the description
4. Weight the cross-entropy loss in proportion to the word counts in the abstract vocabulary
5. Use beam search
6. Add attention and make the model larger (more LSTM layers and a larger hidden dimension)
7. Use transformers
8. Use teacher forcing only 50% of the time during training, rather than 100%
9. Use some of the ideas documented in my literature survey
10. Use command line arguments when calling the Python script
11. Other ideas to try:
- http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/
- https://towardsdatascience.com/abstractive-summarization-of-dialogues-f530c7d290be
- https://www.analyticsvidhya.com/blog/2021/02/dialogue-summarization-a-deep-learning-approach/
- https://www.analyticsvidhya.com/blog/2020/11/summarize-twitter-live-data-using-pretrained-nlp-models/
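
To make the beam search item above concrete, here is a minimal, framework-free sketch. It is not tied to the `Seq2Seq` model in this project: `next_log_probs` stands in for a decoder step, and `TOY_MODEL` is a made-up table of next-token probabilities used only to show beam search finding a more likely sequence than greedy decoding. No length normalization is applied, so shorter sequences are favored.

```python
import math


def beam_search(next_log_probs, start, stop, beam_width=3, max_len=10):
    """Return (tokens, log_prob) of the best sequence found with a fixed-width beam.

    next_log_probs(prefix) must return a dict {token: log-probability}.
    """
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        # Keep only the best `beam_width` expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == stop else beams).append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing emitted the stop token.
    return max(finished or beams, key=lambda item: item[1])


# Hypothetical next-token probabilities, conditioned only on the previous token.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"pump": 0.5, "valve": 0.3, "<stop>": 0.2},
    "the": {"pump": 0.9, "<stop>": 0.1},
    "pump": {"<stop>": 1.0},
    "valve": {"<stop>": 1.0},
}


def toy_next_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix[-1]].items()}


greedy = beam_search(toy_next_log_probs, "<start>", "<stop>", beam_width=1)
beam = beam_search(toy_next_log_probs, "<start>", "<stop>", beam_width=2)
print(greedy[0], math.exp(greedy[1]))
print(beam[0], math.exp(beam[1]))
```

With a beam width of 1 the search is greedy and commits to the locally best first word; a wider beam keeps the alternative alive long enough to discover that it leads to a more probable full sequence.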