<a href="https://colab.research.google.com/github/VincentK1991/2016-campaign/blob/master/Primer_to_GPT2_summarization_Apr19_2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Abstractive summarization is the task of shortening a text by paraphrasing it. Unlike extractive summarization, where "key" sentences are extracted from the text, abstractive summarization relies on language modeling, i.e. predicting the next words or sentences.

Re-framed this way, our task becomes: how do we use a language model to generate a summary given the main text?

$\text{text} \Rightarrow \text{summary}$



![alt text](https://www.topbots.com/wp-content/uploads/2019/04/language-model-comparison_web.jpg)

We will use OpenAI GPT2 for this task. Unlike BERT, which is a bidirectional encoder, GPT2 is auto-regressive: each generated token is appended to the input and fed back in to predict the next one, loosely like a recurrent neural network. Information flows in one direction only, since each token can attend only to the tokens on its left. This has **advantages and drawbacks**. A comparison with BERT is [here](https://www.topbots.com/generalized-language-models-bert-openai-gpt2/).

The advantage is that it is good at generating new text on its own, e.g. paraphrasing or conversing, whereas BERT is very good at classification or at finding highlights in a text.

The drawback is that feeding the output back in as input can create a vicious loop or lead the model off topic. For example, the output sometimes keeps repeating itself, or slowly drifts to a new topic.
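
To make the feedback loop concrete, here is a toy sketch. It is not GPT2: a hypothetical lookup table `NEXT_WORD` stands in for the model, and the made-up `generate` helper does greedy decoding by feeding each output back in as the next input. Because the table contains a cycle, the output ends up repeating itself.

```python
# Toy stand-in for a language model: maps the last word to the most likely next word.
# The table is hypothetical and exists only to illustrate the feedback loop.
NEXT_WORD = {
    "the": "city",
    "city": "was",
    "was": "founded",
    "founded": "in",
    "in": "the",  # closes a cycle: "the city was founded in the city ..."
}


def generate(prompt_word, max_new_words):
    """Greedy auto-regressive decoding: each output becomes the next input."""
    words = [prompt_word]
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(words[-1])
        if next_word is None:  # no known continuation, stop early
            break
        words.append(next_word)
    return words


print(" ".join(generate("the", 9)))
```

A real model conditions on the whole history rather than on the last word only, but the same failure mode can appear. Sampling options such as `do_sample`, `top_k`, and `repetition_penalty`, which show up in the generation call later in this notebook, are the usual knobs for reducing it.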

GPT2 uses a stack of decoders for language modeling. Each layer of the stack is called a "transformer block". The lite version of GPT2 that we use, "distilgpt2", has 6 blocks. The standard small version has 12 blocks.

Before the blocks comes the embedding, which maps our tokenized words to a high-dimensional space. There are two kinds: "WTE" for "word-token embedding" and "WPE" for "word-position embedding". The "WTE" maps the ~50K-token vocabulary (50,257 tokens) to 768 dimensions. The "WPE" maps each position in the text, up to 1024, to 768 dimensions.

            (wte): Embedding(50257, 768)
            (wpe): Embedding(1024, 768)
            (drop): Dropout(p=0.1, inplace=False)
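
As a rough illustration of how the two embeddings combine, the sketch below looks up one vector per token and one per position and adds them. The tiny sizes, the random tables, and the `embed` helper are all made up for readability; they are not the real weights or the library's code.

```python
import random

random.seed(0)

# Tiny, made-up sizes for illustration; distilgpt2 uses 50257, 1024 and 768.
VOCAB_SIZE, MAX_POSITIONS, HIDDEN = 10, 8, 4


def random_table(rows, cols):
    """Stand-in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


wte = random_table(VOCAB_SIZE, HIDDEN)     # one row per token id
wpe = random_table(MAX_POSITIONS, HIDDEN)  # one row per position


def embed(token_ids):
    """Look up each token and its position, then add the two vectors."""
    return [
        [t + p for t, p in zip(wte[token_id], wpe[position])]
        for position, token_id in enumerate(token_ids)
    ]


hidden_states = embed([3, 7, 3])
print(len(hidden_states), len(hidden_states[0]))  # 3 4
```

Token 3 appears twice but gets a different vector each time, because its position differs. At the real sizes, the two tables hold 50257 × 768 = 38,597,376 and 1024 × 768 = 786,432 parameters respectively.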

Each block contains these layers:

          (0): Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            (c_fc): Conv1D()
            (c_proj): Conv1D()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          )
Basically, we have 6 of these blocks on top of one another.

At the end, we have

          (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True))
          (lm_head): Linear(in_features=768, out_features=50257,
           bias=False)

This maps the high-dimensional space back to scores over the vocabulary tokens.

The key to its success seems to be the clever attention mechanism. Read more [here](http://jalammar.github.io/illustrated-gpt2/) or [here](https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb).

This kind of model is trained like most other neural networks. [Under the hood](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py), the training loop performs forward propagation, computes the loss, performs backward propagation, clips the gradient norm, takes an optimizer step, and repeats.
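
The order of those steps can be sketched on a toy problem. The example below fits a single weight `w` so that `y = w * x`, with the derivative written out by hand; the `train` function and its data are hypothetical and only mirror the sequence of operations, not the Huggingface script itself.

```python
def train(data, lr=0.1, max_grad_norm=1.0, epochs=50):
    """Fit y = w * x using the same step order as a typical training loop."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x                    # forward pass
            loss = (pred - y) ** 2          # compute the loss (shown for clarity)
            grad = 2 * (pred - y) * x       # backward pass: d(loss)/dw by hand
            if abs(grad) > max_grad_norm:   # clip the gradient norm
                grad *= max_grad_norm / abs(grad)
            w -= lr * grad                  # optimizer step (plain SGD)
    return w


w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(w, 3))
```

In PyTorch the same steps are usually spelled `loss.backward()`, `torch.nn.utils.clip_grad_norm_(...)`, and `optimizer.step()`, with autograd computing the gradients instead of the hand-written derivative used here.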

The loss is called the "causal language modeling" (CLM) loss: a cross-entropy loss in which the label at each position is the next token, i.e. the labels are the inputs shifted by one token.

The exponentiated loss is called "perplexity": how "perplexed" the model is when it sees the true next token. Less surprise = lower score = better.
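
A small worked example may help here. The sketch below computes the shifted cross-entropy loss and its perplexity for a three-token sequence. The probabilities are invented, and `causal_lm_loss` is a hypothetical helper written for this illustration, not the library's implementation.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Average cross entropy of predicting token i + 1 from tokens 0..i.

    next_token_probs[i] is the (hypothetical) model distribution over the
    vocabulary after reading token_ids[: i + 1].
    """
    labels = token_ids[1:]  # the label for position i is token i + 1
    losses = [
        -math.log(next_token_probs[i][label])
        for i, label in enumerate(labels)
    ]
    return sum(losses) / len(losses)


# Made-up example with a 3-token vocabulary {0, 1, 2}.
token_ids = [0, 2, 1]
next_token_probs = [
    {0: 0.1, 1: 0.1, 2: 0.8},    # after token 0, the true next token is 2
    {0: 0.25, 1: 0.5, 2: 0.25},  # after tokens 0, 2, the true next token is 1
]

loss = causal_lm_loss(token_ids, next_token_probs)
perplexity = math.exp(loss)
print(f"loss={loss:.4f} perplexity={perplexity:.4f}")
```

Here the model gives the true next tokens probabilities 0.8 and 0.5, so the perplexity is 1 / sqrt(0.8 × 0.5), about 1.58. A model that always gave the true token probability 1 would reach the minimum perplexity of 1.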

# set up packages

This work is PyTorch-based and uses the transformers library developed by [Huggingface](https://huggingface.co/).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!git clone https://github.com/huggingface/transformers

import os
os.chdir('/content/transformers')

!pip install .
!pip install -r ./examples/requirements.txt

os.chdir('/content/transformers/examples')

!pip install dict_to_obj

In [3]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [0]:
import torch
import run_language_modeling  # package from huggingface
import run_generation # package from huggingface
from dict_to_obj import DictToObj
import collections
import random
import numpy as np

In [5]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


# Fine-tuning the model

I use 100K samples (about 300MB of text) to train on a GPU. I planned to run for ~5 epochs, but one epoch already takes about 2 hours, so I ran just 1 epoch.

I also use 100K samples for evaluation, which is overkill.

In [0]:
!python run_language_modeling.py \
    --output_dir='/content/drive/My Drive/finetuned_models/Newsroom' \
    --model_type=gpt2 \
    --model_name_or_path=distilgpt2 \
    --save_total_limit=5 \
    --num_train_epochs=7.0 \
    --do_train \
    --overwrite_output_dir \
    --evaluate_during_training \
    --logging_steps=5000 \
    --save_steps=5000 \
    --train_data_file='/content/drive/My Drive/Colab Notebooks/GPT-2/Newsroom/trainFile_small_format_Apr19_2020.txt' \
    --do_eval \
    --eval_data_file='/content/drive/My Drive/Colab Notebooks/GPT-2/Newsroom/devFile_format_Apr19_2020.txt' \
    --per_gpu_train_batch_size=256 \
    --per_gpu_eval_batch_size=256 \
    --block_size=5 \
    --gradient_accumulation_steps=5

In [6]:
!ls '/content/drive/My Drive/finetuned_models/Newsroom'

checkpoint-10000  checkpoint-25000    test_eval_checkpoint-5000
checkpoint-15000  checkpoint-5000
checkpoint-20000  eval_results_1.txt


# Evaluate the Language Model

Run the evaluation by calling run_language_modeling from the command line directly. We will use the weights and model configuration saved in the checkpoint.

In [0]:
!python run_language_modeling.py \
    --train_data_file='/content/drive/My Drive/Colab Notebooks/GPT-2/Newsroom/trainFile_small_format_Apr19_2020.txt' \
    --output_dir='/content/drive/My Drive/finetuned_models/Newsroom/checkpoint-15000' \
    --model_type=gpt2 \
    --model_name_or_path='/content/drive/My Drive/finetuned_models/Newsroom/checkpoint-15000' \
    --overwrite_output_dir \
    --do_eval \
    --eval_data_file='/content/drive/My Drive/Colab Notebooks/GPT-2/Newsroom/devFile_small2_format_Apr19_2020.txt' \
    --per_gpu_eval_batch_size=2 \
    --block_size=128

# generate text

To do this, we load the fine-tuned model from a checkpoint and call its generate method, which runs repeated forward passes to produce new tokens.

In [0]:
CHECKPOINT_PATH = '/content/drive/My Drive/finetuned_models/Newsroom/checkpoint-25000/'

In [0]:
PROMPT = """<|startoftext|> the precolonial era, the area of present-day New York City was inhabited by Algonquian Native Americans, including the Lenape. Their homeland, known as Lenapehoking, included Staten Island, Manhattan, the Bronx, the western portion of Long Island (including the areas that would later become the boroughs of Brooklyn and Queens), and the Lower Hudson Valley.
The first documented visit into New York Harbor by a European was in 1524 by Giovanni da Verrazzano, a Florentine explorer in the service of the French crown. He claimed the area for France and named it Nouvelle Angoulême (New Angoulême). A Spanish expedition, led by the Portuguese captain Estêvão Gomes sailing for Emperor Charles V, arrived in New York Harbor in January 1525 and charted the mouth of the Hudson River, which he named Río de San Antonio (Saint Anthony's River). The Padrón Real of 1527, the first scientific map to show the East Coast of North America continuously, was informed by Gomes' expedition and labeled the northeastern United States as Tierra de Esteban Gómez in his honor. <|summarize|>"""

In [0]:
tokenizer = run_language_modeling.AutoTokenizer.from_pretrained(CHECKPOINT_PATH)

In [0]:
device = 'cpu'

In [0]:
encoded_prompt = tokenizer.encode(PROMPT, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)

In [0]:
model = run_language_modeling.AutoModelWithLMHead.from_pretrained(CHECKPOINT_PATH)

In [0]:
model = model.to(device)

In [54]:
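output_sequences = model.generate(
      input_ids=encoded_prompt,
      max_length=200 + len(encoded_prompt[0]),  # prompt length plus up to 200 new tokens
      temperature=0.5,  # values below 1 sharpen the sampling distribution
      # decoder_start_token_id only applies to encoder-decoder models and must be
      # an integer id, so it is dropped here; GPT2 continues directly from the prompt.
      top_k=50,  # sample only from the 50 most likely tokens
      top_p=1.0,  # 1.0 disables nucleus filtering
      repetition_penalty=None,
      do_sample=True,
      num_return_sequences=3)
# Remove the batch dimension when returning multiple sequences
if len(output_sequences.shape) > 2:
  output_sequences.squeeze_()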



In [0]:
stop_token = '<|endoftext|>'
generated_sequences = []  
for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
  generated_sequence = generated_sequence.tolist()

  # Decode text
  text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

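  # Remove all text after the stop token (find() returns -1 when it is absent,
  # which would otherwise silently drop the last character)
  stop_idx = text.find(stop_token) if stop_token else -1
  text = text[:stop_idx] if stop_idx != -1 else text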

  # Remove the excess text that was used for pre-processing
  text = text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]

  # Add the prompt at the beginning of the sequence.
  total_sequence = PROMPT + text

  generated_sequences.append(total_sequence)

In [0]:
import textwrap
wrapper = textwrap.TextWrapper(width=70)

In [58]:
wrapper.wrap(generated_sequences[0])

['<|startoftext|> the precolonial era, the area of present-day New York',
 'City was inhabited by Algonquian Native Americans, including the',
 'Lenape. Their homeland, known as Lenapehoking, included Staten Island,',
 'Manhattan, the Bronx, the western portion of Long Island (including',
 'the areas that would later become the boroughs of Brooklyn and',
 'Queens), and the Lower Hudson Valley. The first documented visit into',
 'New York Harbor by a European was in 1524 by Giovanni da Verrazzano, a',
 'Florentine explorer in the service of the French crown. He claimed the',
 'area for France and named it Nouvelle Angoulême (New Angoulême). A',
 'Spanish expedition, led by the Portuguese captain Estêvão Gomes',
 'sailing for Emperor Charles V, arrived in New York Harbor in January',
 '1525 and charted the mouth of the Hudson River, which he named Río de',
 "San Antonio (Saint Anthony's River). The Padrón Real of 1527, the",
 'first scientific map to show the East Coast of North America'

## future direction

At this stage, the generated summaries still look like gibberish.

There are many things to try. For example:
- Change the block_size. In the first trial I used block size = 5, which is very likely too short. Perhaps try a longer one like 512 or 1024, or use the default by leaving it unspecified or setting it to -1.

- Reduce the size of the dev file. I initially used 100K samples, which takes too long; 10K samples would have been fine, I think.

- Revisit gradient accumulation. Perhaps with a bigger batch size I would not need this step.
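
To see why a block size of 5 is probably too short, here is a rough sketch of fixed-size chunking. It assumes the training script cuts the tokenized file into consecutive blocks of `block_size` tokens and drops the incomplete tail. The `chunk` helper and the whitespace "tokenizer" are hypothetical simplifications, not the script's actual code.

```python
def chunk(tokens, block_size):
    """Split a token list into consecutive, non-overlapping blocks (tail dropped)."""
    return [
        tokens[i : i + block_size]
        for i in range(0, len(tokens) - block_size + 1, block_size)
    ]


sample = (
    "<|startoftext|> the city was first visited by a European in 1524 "
    "<|summarize|> a short history of the city <|endoftext|>"
)
tokens = sample.split()  # crude whitespace tokens, not GPT2's byte-pair tokens

for block_size in (5, 16):
    blocks = chunk(tokens, block_size)
    # Does any single training example contain both the article start and the separator?
    together = any("<|startoftext|>" in b and "<|summarize|>" in b for b in blocks)
    print(block_size, len(blocks), together)
```

Under these assumptions, with blocks of 5 tokens no training example ever contains both the article and the start of its summary, so the model cannot learn the mapping from one to the other. A block long enough to hold the whole pair avoids that.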

# Citation

[distilgpt2](https://huggingface.co/distilgpt2)
- a lite version of GPT2, making it more compact and faster to train

[example code for training GPT2](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py)
- An example by Huggingface on how to train GPT2. In this notebook, we simply call this script directly to train GPT2.

[example code for language generation](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
- Example by Huggingface on how to generate text from pre-trained or fine-tuned GPT2. 

[Dataset](https://summari.es/)
- We use the first 100K summary-text pairs in the training set to train the model.
- This dataset also has articles of various lengths, making training more challenging (!?). Other datasets that I have seen have articles that are shorter and more uniform in length.


[blog post on how to train GPT2](https://minimaxir.com/2019/09/howto-gpt2/)

[Annotated notebook on how attention works](https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb)

[Blog post on attention](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

[Illustration of how GPT2 works](http://jalammar.github.io/illustrated-gpt2/)
- I use this as a guide to strategize how to train GPT2 for abstractive summarization.

- From what I understand, the training set just contains pairs of article texts and their summaries, separated by a keyword token. In this case, I use '<|startoftext|>' to denote where the article text starts, '<|summarize|> ' to denote where the text ends and the summary starts, and ' <|endoftext|>' to denote where the summary ends. I am not sure which tokens I should use for this.

- I did not tell the model tokenizer that these are special tokens. Doing so might make a big difference, so this is something to play around with.
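
The data layout described above can be sketched as follows. The helper names `format_sample` and `format_prompt` and the example strings are hypothetical; the actual preprocessing used to build the training files is not shown in this notebook.

```python
START, SEP, END = "<|startoftext|>", "<|summarize|>", "<|endoftext|>"


def format_sample(article, summary):
    """Join one article/summary pair into a single training line."""
    return f"{START} {article.strip()} {SEP} {summary.strip()} {END}"


def format_prompt(article):
    """Prompt used at generation time: everything up to the separator."""
    return f"{START} {article.strip()} {SEP}"


line = format_sample("The city was first visited in 1524.", "A short history.")
print(line)
```

If the markers should be treated as single tokens rather than split into sub-words, the usual route in the transformers library is the tokenizer's `add_special_tokens` method followed by the model's `resize_token_embeddings`. That step was not done in this notebook.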