# Reproducing results from the Multi-News paper

In this notebook we try to reproduce the results of the [Multi-News](https://www.aclweb.org/anthology/P19-1102/) paper [(GitHub)](https://github.com/Alex-Fabbri/Multi-News).


The authors use the [OpenNMT](https://opennmt.net/) neural machine translation system, adapted in some places.

The main steps of an OpenNMT pipeline are:
- Preprocessing
- Model training
- Translation (here, generating the summaries)

You can find the documentation [here](https://opennmt.net/OpenNMT-py/quickstart.html#step-1-preprocess-the-data).


See also the Python implementation of OpenNMT [here](https://github.com/OpenNMT/OpenNMT-py).
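
The three steps are chained through files: preprocessing writes a dataset prefix, training reads that prefix and writes checkpoints, and translation loads one checkpoint. Below is a minimal sketch of that chain, assuming the flag names of the legacy OpenNMT-py command-line scripts. The helper `pipeline_commands`, the file names (`train.src`, `pred.txt`, ...) and the directory layout are hypothetical placeholders, and the commands are only built, not executed.

```python
import shlex


def pipeline_commands(data_dir, work_dir, train_steps=20000):
    """Build (but do not run) the three pipeline commands as argument lists."""
    dataset_prefix = f"{work_dir}/dataset"  # hypothetical -save_data prefix
    model_prefix = f"{work_dir}/model"      # hypothetical -save_model prefix
    return {
        "preprocess": [
            "python", "preprocess.py",
            "-train_src", f"{data_dir}/train.src",
            "-train_tgt", f"{data_dir}/train.tgt",
            "-valid_src", f"{data_dir}/val.src",
            "-valid_tgt", f"{data_dir}/val.tgt",
            "-save_data", dataset_prefix,
        ],
        "train": [
            "python", "train.py",
            "-data", dataset_prefix,  # same prefix that preprocessing wrote
            "-save_model", model_prefix,
            "-train_steps", str(train_steps),
        ],
        "translate": [
            "python", "translate.py",
            # Assumes a checkpoint was saved at the final training step.
            "-model", f"{model_prefix}_step_{train_steps}.pt",
            "-src", f"{data_dir}/test.src",
            "-output", f"{work_dir}/pred.txt",
        ],
    }


if __name__ == "__main__":
    for name, argv in pipeline_commands("data", "output").items():
        print(name, "->", " ".join(shlex.quote(arg) for arg in argv))
```

The cells in this notebook call the adapted scripts under `code/` instead of the stock ones, but the data flow between the steps is the same.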

## Preparing Data

The data consists of parallel source (src) and target (tgt) files containing one example per line, with tokens separated by a space.
The tokenized, cleaned, and truncated files are already provided, so we do not need to prepare the raw text ourselves; we only run the OpenNMT preprocessing script on them.
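
Before preprocessing, it can be worth confirming that a src/tgt pair really is line-aligned. The following is a minimal sketch of such a check; the helpers `read_examples` and `check_parallel` are illustrative names, not part of OpenNMT or the Multi-News code.

```python
def read_examples(path):
    """Read one example per line and split it into whitespace-separated tokens."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split() for line in handle]


def check_parallel(src_path, tgt_path):
    """Return the number of examples, raising if the files are not line-aligned."""
    src = read_examples(src_path)
    tgt = read_examples(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target lines"
        )
    return len(src)
```

Calling `check_parallel` on the training pair and on the validation pair should return the number of examples in each split; a `ValueError` means the two files are out of sync.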

In [1]:
!dir

code  data  output  README.md  Train_BERT_HiPMAP.ipynb	Train_HiPMAP.ipynb


In [1]:
!pip install torchtext nltk opencv-python transformers==3.1.0



In [1]:
!python code/HiPMAP/preprocess.py \
    -train_src ../news-opinion-summarization/data/multi_news/preprocessed_truncated/train.txt.src.tokenized.fixed.cleaned.final.truncated.txt \
    -train_tgt ../news-opinion-summarization/data/multi_news/preprocessed_truncated/train.txt.tgt.tokenized.fixed.cleaned.final.truncated.txt \
    -valid_src ../news-opinion-summarization/data/multi_news/preprocessed_truncated/val.txt.src.tokenized.fixed.cleaned.final.truncated.txt \
    -valid_tgt ../news-opinion-summarization/data/multi_news/preprocessed_truncated/val.txt.tgt.tokenized.fixed.cleaned.final.truncated.txt \
    -save_data ../news-opinion-summarization/data/multi_news/final_preprocessed/final \
    -src_seq_length 10000 \
    -tgt_seq_length 10000 \
    -src_seq_length_trunc 500 \
    -tgt_seq_length_trunc 300 \
    -dynamic_dict \
    -share_vocab \
    -max_shard_size 10000000


[2020-10-17 10:02:47,672 INFO] Extracting features...
[2020-10-17 10:02:47,673 INFO]  * number of source features: 0.
[2020-10-17 10:02:47,673 INFO]  * number of target features: 0.
[2020-10-17 10:02:47,673 INFO] Building `Fields` object...
[2020-10-17 10:02:47,673 INFO] Building & saving training data...
[2020-10-17 10:02:47,673 INFO]  * divide corpus into shards and build dataset separately (shard_size = 10000000 bytes).
[2020-10-17 10:02:54,124 INFO]  * saving train data shard to ../news-opinion-summarization/data/multi_news/final_preprocessed/.train.1.pt.
Traceback (most recent call last):
  File "code/HiPMAP/preprocess.py", line 290, in <module>
    main()
  File "code/HiPMAP/preprocess.py", line 280, in main
    train_dataset_files = build_save_dataset('train', fields, opt)
  File "code/HiPMAP/preprocess.py", line 207, in build_save_dataset
    corpus_type, opt)
  File "code/HiPMAP/preprocess.py", line 117, in build_save_in_shards
    torch.save(dataset, pt_file)
  File "/opt/con
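
The length options in the preprocessing call above do two different things: `-src_seq_length` / `-tgt_seq_length` drop whole examples that are too long, while `-src_seq_length_trunc` / `-tgt_seq_length_trunc` cut examples down to a fixed number of tokens. A rough sketch of that behaviour follows; `filter_and_truncate` is a hypothetical helper, and it assumes that truncation is applied before the length filter, which has not been verified against the adapted `preprocess.py`.

```python
def filter_and_truncate(pairs, src_max=10000, tgt_max=10000,
                        src_trunc=500, tgt_trunc=300):
    """Truncate each (src, tgt) token-list pair, then drop empty or over-long pairs."""
    kept = []
    for src, tgt in pairs:
        # Assumption: truncation happens first, as when reading the files.
        src, tgt = src[:src_trunc], tgt[:tgt_trunc]
        # Keep only non-empty examples within the maximum lengths.
        if 0 < len(src) <= src_max and 0 < len(tgt) <= tgt_max:
            kept.append((src, tgt))
    return kept
```

Under this assumption, with the values used above (truncation to 500 and 300 tokens, maximum length 10000), the filter only removes empty examples, because every truncated example is already far below the maximum.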

## Train the Model
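
The training command below resumes from a checkpoint via `-train_from`. Judging from the file name in that command, checkpoints are written as `<save_model prefix>_step_<N>.pt`, which is why the prefix `Okt17_` yields `Okt17__step_9000.pt`. Here is a small sketch for locating the newest checkpoint under that naming assumption; `latest_checkpoint` and `examples_per_update` are hypothetical helpers, not part of OpenNMT.

```python
import re
from pathlib import Path


def latest_checkpoint(save_model_prefix):
    """Return (step, path) of the highest-step checkpoint for a prefix, or None."""
    prefix = Path(save_model_prefix)
    if not prefix.parent.is_dir():
        return None
    # Assumed naming scheme: <prefix>_step_<N>.pt
    pattern = re.compile(re.escape(prefix.name) + r"_step_(\d+)\.pt")
    found = []
    for candidate in prefix.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            found.append((int(match.group(1)), candidate))
    return max(found, key=lambda item: item[0]) if found else None


def examples_per_update(batch_size, accum_count):
    """Examples contributing to one optimizer step on a single GPU process,
    assuming batch_size counts examples rather than tokens."""
    return batch_size * accum_count
```

With `-batch_size 2` and `-accum_count 5`, `examples_per_update(2, 5)` gives 10 examples per parameter update.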



In [5]:
%%time
!CUDA_VISIBLE_DEVICES=0,1,2,3 python code/PointerGen/train.py \
    -save_model output/summarisation/model_newser_without_mmr_polarity/Okt17_ \
    -data ../news-opinion-summarization/data/multi_news/final_preprocessed/final \
    -copy_attn -accum_count 5 \
    -global_attention mlp \
    -word_vec_size 128 \
    -rnn_size 512  -layers 1 \
    -encoder_type brnn \
    -train_steps 20000 \
    -max_grad_norm 4 \
    -dropout 0. \
    -batch_size 2 \
    -optim adagrad \
    -learning_rate 0.15 \
    -adagrad_accumulator_init 0.1 \
    -reuse_copy_attn \
    -copy_loss_by_seqlength \
    -bridge \
    -seed 777 \
    -world_size 1 \
    -gpu_ranks 0 \
    -save_checkpoint_steps 1000 \
    -train_from output/summarisation/model_newser_without_mmr_polarity/Okt17__step_9000.pt

[2020-10-23 18:05:46,814 INFO] Loading checkpoint from output/summarisation/model_newser_without_mmr_polarity/Okt17__step_9000.pt
[2020-10-23 18:05:49,867 INFO] Loading vocab from checkpoint at output/summarisation/model_newser_without_mmr_polarity/Okt17__step_9000.pt.
[2020-10-23 18:05:49,897 INFO]  * vocabulary size. source = 50004; target = 50004
[2020-10-23 18:05:49,897 INFO] Building model...
[2020-10-23 18:05:52,074 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 128, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(128, 256, bidirectional=True)
    (sent_rnn): LSTM(512, 256, bidirectional=True)
    (bridge): ModuleList(
      (0): Linear(in_features=256, out_features=256, bias=True)
      (1): Linear(in_features=256, out_features=256, bias=True)
    )
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequentia