# Model Building

In this notebook we build several deep-learning models for text summarization. The architecture used here is an encoder-decoder network built from LSTM layers. Refer to the literature review notebook for further details on this architecture.

We experiment with settings such as the hidden dimension size, dropout, the size of the training-data vocabulary, and the number of LSTM layers.

We use PyTorch to train the models and TensorBoard (integrated with PyTorch) for visualization.

## Mount Google Drive and Import Libraries

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import sys
import os
import torch
# for auto-reloading external modules (automatically reloads before using an imported module)
%load_ext autoreload
%autoreload 2

#To ensure that the Colab Python interpreter can load Python files from within the src directory
PATH_NAME = os.path.join('/', 'content', 'drive', 'My Drive', 'Colab Notebooks', 'UCSDX_MLE_Bootcamp', 'Text_Summarization_UCSD', 'ModelBuilding')
sys.path.append(os.path.join(PATH_NAME, 'src'))
print(sys.path)
%cd $PATH_NAME

print(f'Torch version {torch.__version__}') #1.8.1+cu101

['', '/content', '/env/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.7/dist-packages/IPython/extensions', '/root/.ipython', '/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding/src']
/content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD/ModelBuilding
Torch version 1.8.1+cu101


In [None]:
!git config --global user.name "Amit Patel"
!git config --global user.email "amitpatel.gt@gmail.com"
!git config --global color.ui auto
!git config -l

!git add .
!git commit -m "Added LSTM with attention model"
!git status

images	ModelBuilding_step8.ipynb		  __pycache__  saved_models
logs	Model_Experimentation_step7_14-8-1.ipynb  runs	       src


In [4]:
!nvidia-smi

Mon Apr 19 15:44:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Load Data and Utility Functions
We will use cpc_codes 'de' from the BigPatent dataset

In [5]:
'''
import utils
data = utils.load_data_numpy(split_type='train', cpc_codes='de', fname='data0_np.npz')
for data_np in data:
    print(data_np['data'].shape, data_np['data'][0,0].shape[1], data_np['data'][0,1].shape[1])
    print(data_np['data'][0,1])
    break
del data, data_np
''';

### Mini Data: Generate vocabulary, word2idx, idx2word, and numpy array

This step is needed because the vocabulary for the full dataset is too large for quick prototyping and debugging.

Still, try both: the full vocabulary for the de dataset and the vocabulary created from the mini training set.
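The vocabulary and index mappings mentioned above can be sketched as follows. This is only a minimal illustration, not the code in `src/`: the special-token names, the whitespace tokenizer, and the helpers `build_vocab` and `encode` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical special tokens; the notebook's src/ code may use different names.
PAD, UNK, START, STOP = "<pad>", "<unk>", "<start>", "<stop>"


def build_vocab(texts, min_count=1):
    """Return (word2idx, idx2word) built from whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.split())
    # Special tokens come first so that PAD maps to index 0.
    idx2word = [PAD, UNK, START, STOP]
    # Sort by descending count, then alphabetically, for a deterministic order.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            idx2word.append(word)
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return word2idx, idx2word


def encode(text, word2idx, max_len):
    """Map a text to max_len indices: tokens, then STOP, then PAD."""
    ids = [word2idx.get(word, word2idx[UNK]) for word in text.split()]
    ids = ids[: max_len - 1] + [word2idx[STOP]]  # truncate, keep room for STOP
    return ids + [word2idx[PAD]] * (max_len - len(ids))


mini_abstracts = ["a valve controls the flow", "the valve seals the pipe"]
word2idx, idx2word = build_vocab(mini_abstracts)
encoded = encode("the valve opens", word2idx, max_len=6)
print(encoded)                          # [4, 5, 1, 3, 0, 0]
print([idx2word[i] for i in encoded])   # "opens" is out of vocabulary -> <unk>
```

A vocabulary built from the mini training set is much smaller, so the embedding and output layers are smaller too, but more validation words fall back to the unknown token.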

## LSTM Based Encoder-Decoder

For further details, see:

https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/

http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html

In [6]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [9]:
#testing
!python -m pytest -s ../tests/

platform linux -- Python 3.7.10, pytest-3.6.4, py-1.10.0, pluggy-0.7.1
rootdir: /content/drive/My Drive/Colab Notebooks/UCSDX_MLE_Bootcamp/Text_Summarization_UCSD, inifile:
plugins: typeguard-2.7.1
collected 3 items

../tests/test_ModelBuilding1.py ...



#### Seq2Seq: lr=0.002, dropout=0.0, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
#without attention: train the model from scratch
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 700 --lr 2e-3 --savedModelDir './saved_models/seq2seq_200hid_2lyrs' \
                        --printEveryIters 4800 --tbDescr 'dropout-0_hiddim-200_numlyrs-2_full-de-data' \
                        --modelType 'models.Seq2Seq' --loadBestModel False --toTrain True

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x55912d3c0000 @  0x7fb2f175c1e7 0x5590b2ec0f48 0x7fb2c82ce53e 0x7fb2c82cecd9 0x7fb2c82cefaf 0x7fb2c82cc4b4 0x5590b2e8f0e4 0x5590b2e8ede0 0x5590b2f036f5 0x5590b2e9069a 0x5590b2efec9e 0x5590b2e9069a 0x5590b2efec9e 0x5590b2e9069a 0x5590b2efec9e 0x5590b2e9069a 0x5590b2efec9e 0x5590b2efdb0e 0x5590b2dcfe2b 0x5590b2f001e6 0x5590b2efde0d 0x5590b2dcfe2b 0x5590b2f001e6 0x5590b2efde0d 0x5590b2e9077a 0x5590b2eff86a 0x5590b2f81858 0x5590b2efeee2 0x5590b2efdb0e 0x5590b2e9077a 0x5590b2eff86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

In [None]:
%%timeit -r 1 -n 1
#without attention
#test above trained model with beamsize=5 
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 700 --lr 2e-3 --savedModelDir './saved_models/seq2seq_200hid_2lyrs' \
                        --printEveryIters 4800 --tbDescr 'dropout-0_hiddim-200_numlyrs-2_full-de-data' \
                        --modelType 'models.Seq2Seq' --loadBestModel True --toTrain False
#but there is no improvement in rouge scores vs no beam search (greedy search seems to be ok for a well trained model)

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x559e320c8000 @  0x7f5b4fc221e7 0x559db70f0f48 0x7f5b2679453e 0x7f5b26794cd9 0x7f5b26794faf 0x7f5b267924b4 0x559db70bf0e4 0x559db70bede0 0x559db71336f5 0x559db70c069a 0x559db712ec9e 0x559db70c069a 0x559db712ec9e 0x559db70c069a 0x559db712ec9e 0x559db70c069a 0x559db712ec9e 0x559db712db0e 0x559db6fffe2b 0x559db71301e6 0x559db712de0d 0x559db6fffe2b 0x559db71301e6 0x559db712de0d 0x559db70c077a 0x559db712f86a 0x559db71b1858 0x559db712eee2 0x559db712db0e 0x559db70c077a 0x559db712f86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

#### Seq2Seq with Atten: lr=0.004, dropout=0.1, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
#with attention
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 --savedModelDir './saved_models/seq2seq_withAtten_200hid_2lyrs' \
                        --printEveryIters 400 --tbDescr 'seq2seq_withAtten_dropout-0p1_hiddim-200_numlyrs-2_full-de-vocab' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True \
                        --dropout 0.1

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x558ffe044000 @  0x7fec669a51e7 0x558f8340ff48 0x7fec3d51753e 0x7fec3d517cd9 0x7fec3d517faf 0x7fec3d5154b4 0x558f833de0e4 0x558f833ddde0 0x558f834526f5 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f833df69a 0x558f8344dc9e 0x558f8344cb0e 0x558f8331ee2b 0x558f8344f1e6 0x558f8344ce0d 0x558f8331ee2b 0x558f8344f1e6 0x558f8344ce0d 0x558f833df77a 0x558f8344e86a 0x558f834d0858 0x558f8344dee2 0x558f8344cb0e 0x558f833df77a 0x558f8344e86a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=1d26RtIVDwygkh3A_Lq3v4eFR-RCK3ZbA) | ![](https://drive.google.com/uc?export=view&id=1xMBiCByW-57N7i-_Brd8o3Rmnc-pJFnA) | ![](https://drive.google.com/uc?export=view&id=1q4GQOM8M7fsxe4qaahFOaUEqDKqO6Xel) |
| | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l |


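Note that the initial loss will be approximately log(abstract_vocab_size), i.e. -log(1/abstract_vocab_size), because a randomly initialized model assigns roughly uniform probability to every token in the abstract vocabulary.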
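As a quick sanity check on the expected starting loss, here is a minimal sketch assuming a cross-entropy loss with natural logarithms (as PyTorch uses) and an exactly uniform prediction over the abstract vocabulary. The function name `uniform_cross_entropy` is made up for the illustration; the vocabulary size is the one printed in the training log.

```python
import math


def uniform_cross_entropy(vocab_size):
    """Cross-entropy of a uniform prediction over vocab_size tokens."""
    # Every token gets probability 1/V, so the loss is -log(1/V) = log(V).
    prob = 1.0 / vocab_size
    return -math.log(prob)


abstract_vocab_size = 10769  # "abstract vocab is 10769" in the training log
initial_loss = uniform_cross_entropy(abstract_vocab_size)
print(f"Expected initial loss: {initial_loss:.2f}")  # about 9.28
```

A first-iteration loss far from this value usually points to an initialization or loss-masking problem rather than to the data.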

Best checkpoint at 1200: Rouge-1 is 0.2941, Rouge-2 is 0.0498, and Rouge-l is 0.2024
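To make the reported numbers easier to interpret, the following is a simplified sketch of how ROUGE-N is computed from clipped n-gram overlap. It is not the implementation in the `rouge` package installed above: tokenization here is plain whitespace splitting, and whether the scores in this notebook are F1, precision, or recall is not stated, so the sketch returns all three. ROUGE-L, which is based on the longest common subsequence, is not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=1):
    """Return precision, recall, and F1 of the clipped n-gram overlap."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


reference = "the valve controls the flow of fluid"
hypothesis = "the valve controls fluid flow"
r1 = rouge_n(hypothesis, reference, n=1)
r2 = rouge_n(hypothesis, reference, n=2)
print(f"Rouge-1 F1: {r1['f']:.4f}")  # 0.8333
print(f"Rouge-2 F1: {r2['f']:.4f}")  # 0.4000
```

Rouge-2 is much lower than Rouge-1 in this toy case because word order matters for bigrams, which matches the gap between the two scores reported above.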


#### Seq2Seq with Atten: lr=0.004, dropout=0.4, hiddim=200, numlyrs=2, full-de-vocab, train_size=128, val_size=16

In [None]:
%%timeit -r 1 -n 1
#with attention
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 --savedModelDir './saved_models/seq2seq_withAtten_200hid_2lyrs_0p4dropout' \
                        --printEveryIters 400 --tbDescr 'seq2seq_withAtten_dropout-0p4_hiddim-200_numlyrs-2_full-de-vocab' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True \
                        --dropout 0.4

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x55c6459b2000 @  0x7f66719d51e7 0x55c5caf76f48 0x7f664854753e 0x7f6648547cd9 0x7f6648547faf 0x7f66485454b4 0x55c5caf450e4 0x55c5caf44de0 0x55c5cafb96f5 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5caf4669a 0x55c5cafb4c9e 0x55c5cafb3b0e 0x55c5cae85e2b 0x55c5cafb61e6 0x55c5cafb3e0d 0x55c5cae85e2b 0x55c5cafb61e6 0x55c5cafb3e0d 0x55c5caf4677a 0x55c5cafb586a 0x55c5cb037858 0x55c5cafb4ee2 0x55c5cafb3b0e 0x55c5caf4677a 0x55c5cafb586a
max length (before adding stop token) in mini_df.description is 3943 and in mini_df.abstract (before adding start/stop tokens) is 147
(128, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([128, 4000]), torch.Size([128, 150]), torch.Size([128])
Total data size 

| Training Loss | Training Data Rouge | Validation Data Rouge |
| --- | --- | --- |
| ![](https://drive.google.com/uc?export=view&id=1dX5GYaPXYXegue-OS67ak4sfSyFEV997) | ![](https://drive.google.com/uc?export=view&id=138SJejuTlbeqKcuNbjcek-bbwyILMSiq) | ![](https://drive.google.com/uc?export=view&id=1OplXxb3_m7rE-90CUwJ2w8wPQO6aDq9o) |
| | Dark Blue: Rouge-1, Red: Rouge-2, Light Blue: Rouge-l | Pink: Rouge-1, Green: Rouge-2, Gray: Rouge-l |

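Note that the initial loss will be approximately log(abstract_vocab_size), i.e. -log(1/abstract_vocab_size), because a randomly initialized model assigns roughly uniform probability to every token in the abstract vocabulary.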

Best checkpoint at 1200: Rouge-1 is 0.2935, Rouge-2 is 0.0382, and Rouge-l is 0.2242

There was not much difference even with a dropout of 0.75.

#### Seq2Seq with Atten: lr=0.004, dropout=0.4 and 0.0, hiddim=200, numlyrs=2, full-de-vocab, train_size=1024, val_size=16

Even with the larger training set of 1024 examples, there was not much improvement in the Rouge-1/Rouge-2 scores.

In [12]:
%%timeit -r 1 -n 1
#with attention
!python ./src/train.py --hiddenDim 200 --numLayers 2 --batchSize 64 --numEpochs 3000 --lr 4e-3 --savedModelDir './saved_models/seq2seq_withAtten_200hid_2lyrs_0p0dropout' \
                        --printEveryIters 400 --tbDescr 'seq2seq_withAtten_dropout-0p0_hiddim-200_numlyrs-2_full-de-vocab_train-1024' \
                        --modelType 'models.Seq2SeqwithAttention' --loadBestModel False --toTrain True \
                        --dropout 0.0

Getting the training data...
Size of description vocab is 36828 and abstract vocab is 10769
tcmalloc: large alloc 2406391808 bytes == 0x5629c0fd6000 @  0x7fbedfc1d1e7 0x56294632df48 0x7fbeb678f53e 0x7fbeb678fcd9 0x7fbeb678ffaf 0x7fbeb678d4b4 0x5629462fc0e4 0x5629462fbde0 0x5629463706f5 0x5629462fd69a 0x56294636bc9e 0x5629462fd69a 0x56294636bc9e 0x5629462fd69a 0x56294636bc9e 0x5629462fd69a 0x56294636bc9e 0x56294636ab0e 0x56294623ce2b 0x56294636d1e6 0x56294636ae0d 0x56294623ce2b 0x56294636d1e6 0x56294636ae0d 0x5629462fd77a 0x56294636c86a 0x5629463ee858 0x56294636bee2 0x56294636ab0e 0x5629462fd77a 0x56294636c86a
max length (before adding stop token) in mini_df.description is 3996 and in mini_df.abstract (before adding start/stop tokens) is 149
(1024, 4)
max length (before adding stop token) in mini_df.description is 3993 and in mini_df.abstract (before adding start/stop tokens) is 140
(16, 4)
Data shape is: torch.Size([1024, 4000]), torch.Size([1024, 150]), torch.Size([1024])
Total data s

In [None]:
# import models
# import train

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# train_data, val_data, lang_train = train.get_data(use_full_vocab=True, cpc_codes='de', fname='data0_str_json.gz',
#                                                     train_size=128, val_size=16)

# encoder = models.EncoderLSTM(vocab_size=len(lang_train.desc_vocab), hidden_dim=50, num_layers=2, bidir=True)
# decoder = models.DecoderLSTM(vocab_size=len(lang_train.abs_vocab), hidden_dim=50, num_layers=2, bidir=False)
# model = models.Seq2Seq(encoder, decoder)

# train_data.shuffle(2)
# val_data.shuffle(2)

# train.train(model=model, train_data=train_data, val_data=val_data, abs_idx2word=lang_train.abs_idx2word, device=device, 
#             batch_size=128, num_epochs=1, lr=2e-3, print_every_iters=250, tb_descr='zzzdropout-0_hiddim-50_numlyrs-2_full-de-data')

#only do this once you are done with this notebook
# utils.closeLoggerFileHandler(train.logger)
# utils.closeLoggerFileHandler(train.evaluate.logger)

## Visualization Using Tensorboard

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir='runs'

## Future Experimentation
1. Use GloVe embeddings to initialize the embedding layer.
2. For the decoder, should the initial values of both h and c come from the encoder output, or only h, with c initialized to zero?
3. See if the stop token can be removed from the description.
4. Use weights in the cross-entropy loss proportional to the word counts in the abstract vocabulary.
5. Use beam search.
   - It didn't make much difference, so revisit this by looking at the CS224n lecture slides on the topic (http://web.stanford.edu/class/cs224n/slides/cs224n-2021-lecture07-nmt.pdf).
6. Add attention and make the model larger (more LSTM layers and a larger hidden dimension).
7. Use transformers.
8. Use teacher forcing only 50% of the time during training instead of 100%.
9. Use some of the ideas documented in my literature survey.
10. Other ideas to try:
    - http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/
    - https://towardsdatascience.com/abstractive-summarization-of-dialogues-f530c7d290be
    - https://www.analyticsvidhya.com/blog/2021/02/dialogue-summarization-a-deep-learning-approach/
    - https://www.analyticsvidhya.com/blog/2020/11/summarize-twitter-live-data-using-pretrained-nlp-models/
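For the beam search item in the list above, a generic sketch may help when revisiting it. It is not the decoder in `src/`: `beam_search`, `toy_next_log_probs`, and the toy probability table are hypothetical stand-ins for one decoder step. Hypotheses are scored by raw cumulative log-probability with no length normalization, which is a simplifying assumption. Setting `beam_size=1` reduces to greedy search.

```python
import math


def beam_search(next_log_probs, start, stop, beam_size=5, max_len=10):
    """Return (score, tokens) for the best sequence found by beam search.

    next_log_probs(prefix) must return a dict {token: log_prob}.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, log_prob in next_log_probs(tokens).items():
                candidates.append((score + log_prob, tokens + [token]))
        # Keep the top beam_size; set aside those that emitted the stop token.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == stop:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])


# Toy next-token distributions, chosen so that the greedy choice is suboptimal.
TOY_MODEL = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(prefix):
    """Look up the toy distribution; unknown prefixes always emit the stop token."""
    dist = TOY_MODEL.get(tuple(prefix), {"</s>": 1.0})
    return {token: math.log(p) for token, p in dist.items()}


greedy_score, greedy_tokens = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=1)
beam_score, beam_tokens = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'x', '</s>'] 0.3
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'b', 'z', '</s>'] 0.36
```

In this toy case beam search wins only because the most likely first token leads to a flat second-step distribution. When a trained model's per-step distributions are sharply peaked, greedy and beam search tend to pick the same sequence, which would be consistent with beam search making little difference here.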