<a href="https://colab.research.google.com/github/VincentK1991/BERT_summarization_1/blob/master/Ignite_train_GPT2_abstractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Using Ignite to train GPT2 summarization

This notebook illustrates how to use the Ignite Engine to fine-tune GPT2 for abstractive summarization. The goal is to obtain fine-tuned GPT2 weights that we will later use for abstractive summarization of biomedical science publications. The dataset is processed from the [CORD-19 research challenge on Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

To get a sense of how GPT2 is trained, why it is done this way, and what the outcome looks like, read the [companion notebook](https://github.com/VincentK1991/BERT_summarization_1/blob/master/Copy_of_BERTandGPT2_abstractive_summarization_Apr28_2020.ipynb).



# 1. Installing PyTorch and Hugging Face Transformers, checking the GPU, etc.

In [1]:
%cd '/content/drive/My Drive/Colab Notebooks/GPT-2/Ignite_training_Apr29_2020'

/content/drive/My Drive/Colab Notebooks/GPT-2/Ignite_training_Apr29_2020


In [2]:
!nvidia-smi

Thu Apr 30 13:32:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import timeit
import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler


SEED = 1234
torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
!pip install transformers==2.6.0

Collecting transformers==2.6.0
  Downloading https://files.pythonhosted.org/packages/4c/a0/32e3a4501ef480f7ea01aac329a716132f32f7911ef1c2fac228acc57ca7/transformers-2.6.0-py3-none-any.whl (540kB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98dcac511c04f/tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/99/50/93509f906a40bffd7d175f97fd75ea328ad9bd91f48f59c4bd084c94a25e/sacremoses-0.0.41.tar.gz (883kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

In [5]:
import transformers
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel, AdamW
load_model = False
load_previous_weight = False
resize_model = False
print(transformers.__version__) # make sure it is 2.6.0

2.6.0


# 2. Test load the GPT2DoubleHeadsModel

In [0]:
model = GPT2DoubleHeadsModel.from_pretrained('Apr29_2020_epoch1')
load_model = True

In [0]:
tokenizer = GPT2Tokenizer.from_pretrained('Apr29_2020_epoch1')

In [8]:
print(len(tokenizer), 'total length of vocab') # expect 50257

50257 total length of vocab


In [0]:
# Register the special tokens used by the summarizer: start/end-of-text markers,
# a padding token, and two segment markers (<|keyword|> and <|summarize|>).
special_tokens = {'bos_token':'<|startoftext|>','eos_token':'<|endoftext|>','pad_token':'<pad>','additional_special_tokens':['<|keyword|>','<|summarize|>']}
#special_tokens2 = {'bos_token':'<|startoftext|>','eos_token':'<|endoftext|>','keyword_token':'<|keyword|>','summary_token':'<|summarize|>'}
tokenizer.add_special_tokens(special_tokens)
# Uncomment when starting from the stock 'gpt2' weights, whose embedding matrix still has 50257 rows.
#model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size
# The newly added tokens are appended to the end of the vocabulary.
resize_model = True

In [11]:
print(len(tokenizer), 'total length of vocab')
print(tokenizer.bos_token_id, 'bos_token')
print(tokenizer.eos_token_id, 'eos_token')
print(tokenizer.pad_token_id, 'pad_token')  # id of <pad>
print(tokenizer.additional_special_tokens_ids[0], 'keyword_token') #token for <|keyword|>
print(tokenizer.additional_special_tokens_ids[1], 'summary_token') #token for <|summarize|>

50261 total length of vocab
50257 bos_token
50256 eos_token
50258 pad_token
50259 keyword_token
50260 summary_token


Expected output:

- 50261 total length of vocab
- 50257 bos_token
- 50256 eos_token
- 50258 pad_token
- 50259 keyword_token
- 50260 summary_token
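
The ids above are consistent with new tokens being appended to the end of the vocabulary, while `<|endoftext|>` keeps its existing id. The toy sketch below reproduces that bookkeeping with a plain dictionary. `add_special_tokens_toy` and the placeholder `tok{i}` entries are hypothetical stand-ins for illustration, not the tokenizer's real implementation.

```python
def add_special_tokens_toy(vocab, new_tokens):
    """Append tokens that are not yet in `vocab`; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id at the end of the vocabulary
            added += 1
    return added


# Stand-in for the 50257-entry GPT2 vocabulary; <|endoftext|> already has id 50256.
vocab = {"tok{}".format(i): i for i in range(50256)}
vocab["<|endoftext|>"] = 50256

added = add_special_tokens_toy(
    vocab,
    ["<|startoftext|>", "<|endoftext|>", "<pad>", "<|keyword|>", "<|summarize|>"],
)
print(added, len(vocab))
print(vocab["<|startoftext|>"], vocab["<pad>"], vocab["<|keyword|>"], vocab["<|summarize|>"])
```

Only four of the five special tokens are new, which is why the vocabulary grows from 50257 to 50261 entries.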

# 3. Load dataset and make dataloader

The dataset is stored as torch tensors. Each example is a tuple of 5 items:
  1. the input tokens,
  2. the segment tokens (token type ids),
  3. the index of the last token of each candidate (used by the multiple-choice head),
  4. the expected language-model output tokens; positions set to -100 are masked out, so the model is not trained to produce them.

    - Items 1-4 each hold 4 candidates. Only one of the 4 is the correct keyword-summary pair; the other 3 are distractors.

  5. the multiple-choice label, which says which of the 4 candidates is the correct one.
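
To make the layout concrete, here is a minimal sketch that builds one such tuple from plain Python lists instead of tensors. The helper `build_candidate`, the token order inside a sequence (start token, keywords, `<|summarize|>`, summary, end token), and the choice to fully mask the labels of distractors are all assumptions made for illustration; the actual preprocessing lives in the companion notebook and may differ.

```python
BOS, EOS, PAD, KEYWORD, SUMMARIZE = 50257, 50256, 50258, 50259, 50260
IGNORE = -100  # label value that the language-modeling loss skips


def build_candidate(keyword_ids, summary_ids, max_len, is_correct):
    """Hypothetical helper: lay out one keyword/summary candidate."""
    input_ids = [BOS] + keyword_ids + [SUMMARIZE] + summary_ids + [EOS]
    n_prefix = len(keyword_ids) + 2  # BOS, keywords, <|summarize|>
    # Segment ids: keyword segment first, summary segment from <|summarize|> onwards.
    token_type_ids = [KEYWORD] * (n_prefix - 1) + [SUMMARIZE] * (len(input_ids) - n_prefix + 1)
    mc_token_id = len(input_ids) - 1  # index of the last real (non-pad) token
    if is_correct:
        # Assumption: only the summary and the end token contribute to the LM loss.
        lm_labels = [IGNORE] * n_prefix + summary_ids + [EOS]
    else:
        # Assumption: distractors are never used as language-modeling targets.
        lm_labels = [IGNORE] * len(input_ids)
    pad = max_len - len(input_ids)
    return (input_ids + [PAD] * pad, token_type_ids + [PAD] * pad,
            mc_token_id, lm_labels + [IGNORE] * pad)


keywords = [11, 12]
candidates = [
    build_candidate(keywords, [31, 32], max_len=10, is_correct=False),     # distractor
    build_candidate(keywords, [21, 22, 23], max_len=10, is_correct=True),  # correct pair
]
input_ids, token_type_ids, mc_token_ids, lm_labels = (list(x) for x in zip(*candidates))
mc_labels = [1]  # the candidate at index 1 is the correct one
example = (input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels)
print(example)
```

The notebook's data uses 4 candidates of length 1024 per example; the sketch uses 2 candidates of length 10 only to keep the printout readable.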

In [0]:
train_dataset_1 = torch.load('torch_trainFile_2_Apr29_2020.pt')

In [13]:
train_dataset_1[5]

(tensor([[50257,   370,  7456,  ..., 50258, 50258, 50258],
         [50257,   370,  7456,  ..., 50258, 50258, 50258],
         [50257,   370,  7456,  ..., 50258, 50258, 50258],
         [50257,   370,  7456,  ..., 50258, 50258, 50258]]),
 tensor([[50259, 50259, 50259,  ..., 50258, 50258, 50258],
         [50259, 50259, 50259,  ..., 50258, 50258, 50258],
         [50259, 50259, 50259,  ..., 50258, 50258, 50258],
         [50259, 50259, 50259,  ..., 50258, 50258, 50258]]),
 tensor([337,  86, 335, 290]),
 tensor([[-100, -100, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100, -100, -100]]),
 tensor([1]))

In [0]:
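# Decode the LM labels of the first candidate; -100 marks positions ignored by the loss.
for count, token_id in enumerate(train_dataset_1[5][3][0]):
  token_id = int(token_id)
  if token_id == -100:
    decoded = 'masked'
  else:
    decoded = tokenizer.decode([token_id])
  print(count, token_id, decoded)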

In [0]:
train1_sampler = RandomSampler(train_dataset_1)
train1_dataloader = DataLoader(train_dataset_1, sampler=train1_sampler, batch_size=1)

In [0]:
val_dataset_1 = torch.load('torch_valFile_1_Apr29_2020.pt')
val1_sampler = RandomSampler(val_dataset_1)
val1_dataloader = DataLoader(val_dataset_1, sampler=val1_sampler, batch_size=1)

# 4. Test run

In [15]:
input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels = train_dataset_1[0]
print(input_ids.shape)
print(mc_token_ids.shape)
print(lm_labels.shape)
print(mc_labels.shape)
print(token_type_ids.shape)

torch.Size([4, 1024])
torch.Size([4])
torch.Size([4, 1024])
torch.Size([1])
torch.Size([4, 1024])


In [0]:
model = model.to(device)
optimizer = AdamW(model.parameters(),lr=3e-5,eps=1e-8, correct_bias=True)
max_norm = 1.0

In [0]:
gradient_accumulation_steps = 10

In [18]:
total_steps = len(train1_dataloader)
print('total step for learning rate scheduler = ',total_steps)

total step for learning rate scheduler =  32146


In [0]:
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 50, num_training_steps = total_steps)
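
For intuition, the sketch below computes the multiplier that, to my understanding, a linear schedule with warmup applies to the base learning rate: it rises linearly from 0 to 1 over the warmup steps and then decays linearly to 0 at the last training step. `linear_warmup_factor` is a hypothetical name used here for illustration; it is a plain-Python approximation, not the library function itself.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
for step in (0, 25, 50, 16098, 32146):
    factor = linear_warmup_factor(step, num_warmup_steps=50, num_training_steps=32146)
    print(step, factor, base_lr * factor)
```

With 50 warmup steps out of 32146, the learning rate reaches its peak almost immediately and spends nearly the whole epoch decaying.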

In [20]:
!nvidia-smi

Thu Apr 30 13:35:30 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    31W / 250W |   1145MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [0]:
test_run = train_dataset_1[1]

In [22]:
# Forward pass
start = timeit.default_timer()
model.train()
optimizer.zero_grad()
test_run = (item.to(device) for item in test_run)
input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels = test_run
#input_ids, mc_token_ids, lm_labels, mc_labels, token_type_ids = input_ids.to(device), mc_token_ids.to(device), lm_labels.to(device), mc_labels.to(device), token_type_ids.to(device)
outputs = model(input_ids = input_ids, mc_token_ids = mc_token_ids, mc_labels = mc_labels,lm_labels = lm_labels, token_type_ids = token_type_ids)
lm_loss, mc_loss = outputs[0], outputs[1]
#del input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels
lm_coef = 2.0
mc_coef = 1.0
total_loss = lm_loss * lm_coef + mc_loss * mc_coef
print('lm_loss = ',lm_loss.item())
print('mc_loss = ',mc_loss.item())
print('total_loss = ',total_loss.item())
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
stop = timeit.default_timer()
# Note: this times a single forward/backward/update step on one example, not a full epoch.
print('1 epoch takes {:.3f}'.format(stop - start),' sec')

lm_loss =  1.9656041860580444
mc_loss =  0.0
total_loss =  3.931208372116089
1 epoch takes 0.868  sec
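
The reported total can be checked by hand from the two coefficients used in the cell. The snippet below simply redoes that arithmetic with the printed loss values; it does not touch the model.

```python
lm_coef, mc_coef = 2.0, 1.0
lm_loss, mc_loss = 1.9656041860580444, 0.0  # values printed by the test run

# Same weighted sum as in the training cell.
total_loss = lm_loss * lm_coef + mc_loss * mc_coef
print(total_loss)
```

Because the multiple-choice loss is 0.0 here, the total is exactly twice the language-modeling loss.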


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha)


# 5. Set up Ignite

In [23]:
!pip install pytorch-ignite

Collecting pytorch-ignite
  Downloading https://files.pythonhosted.org/packages/35/55/41e8a995876fd2ade29bdba0c3efefa38e7d605cb353c70f3173c04928b5/pytorch_ignite-0.3.0-py2.py3-none-any.whl (103kB)
Installing collected packages: pytorch-ignite
Successful

In [0]:
from ignite.engine import Engine, Events
from ignite.metrics import MeanSquaredError, Loss, RunningAverage
from ignite.handlers import ModelCheckpoint, EarlyStopping

In [0]:
def process_function(engine,batch):
  """Run one training iteration and return (lm_loss, mc_loss, total_loss)."""
  model.train()
  batch = (item.to(device) for item in batch)
  input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels = batch
  outputs = model(input_ids = input_ids, mc_token_ids = mc_token_ids, mc_labels = mc_labels,
                  lm_labels = lm_labels, token_type_ids = token_type_ids)
  lm_loss, mc_loss = outputs[0], outputs[1]
  # Weight the language-modeling loss twice as heavily as the multiple-choice loss.
  lm_coef = 2.0
  mc_coef = 1.0
  total_loss = lm_loss * lm_coef + mc_loss * mc_coef
  # Scale the loss so that gradients summed over the accumulation window form an average.
  total_loss = total_loss / gradient_accumulation_steps
  total_loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
  # Update the weights only once every `gradient_accumulation_steps` iterations.
  if engine.state.iteration % gradient_accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
  # The scheduler advances every iteration, matching total_steps = len(train1_dataloader).
  scheduler.step()
  # Undo the scaling so the reported total loss does not depend on the accumulation setting.
  return lm_loss.item(),mc_loss.item(),total_loss.item()*gradient_accumulation_steps
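
The gradient accumulation logic in `process_function` can be illustrated without a model. The sketch below counts how many optimizer updates an epoch produces and checks that dividing each loss by the window size turns the summed gradients into a mean. `count_updates` is a hypothetical helper, the gradient values are made up, and the sketch assumes the engine counts iterations starting from 1.

```python
def count_updates(num_iterations, accumulation_steps):
    """Return (optimizer updates, trailing iterations whose gradients were never applied)."""
    updates, pending = 0, 0
    for iteration in range(1, num_iterations + 1):  # assumption: iterations start at 1
        pending += 1
        if iteration % accumulation_steps == 0:
            updates += 1
            pending = 0
    return updates, pending


updates, leftover = count_updates(num_iterations=32146, accumulation_steps=10)
print(updates, leftover)

# Dividing each loss by the window size makes the summed gradients equal the mean gradient.
per_example_grads = [0.4, -0.2, 1.0, 0.8]  # made-up per-example gradients
accumulated = sum(g / len(per_example_grads) for g in per_example_grads)
mean_grad = sum(per_example_grads) / len(per_example_grads)
print(accumulated, mean_grad)
```

With 32146 iterations and a window of 10, this gives 3214 updates, plus 6 trailing iterations whose gradients are accumulated but not applied within the epoch.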

In [0]:
def evaluate_function(engine,batch):
  model.eval()
  with torch.no_grad():
    batch = (item.to(device) for item in batch)
    input_ids, token_type_ids, mc_token_ids, lm_labels, mc_labels = batch
    outputs = model(input_ids = input_ids, mc_token_ids = mc_token_ids, mc_labels = mc_labels,
                  lm_labels = lm_labels, token_type_ids = token_type_ids)
    lm_loss, mc_loss = outputs[0], outputs[1]
    lm_coef = 2.0
    mc_coef = 1.0
    total_loss = lm_loss * lm_coef + mc_loss * mc_coef
  return lm_loss.item(),mc_loss.item(),total_loss.item()

In [0]:
trainer = Engine(process_function)
evaluator = Engine(evaluate_function)

training_history = {'lm_loss': [], 'mc_loss': [], 'total_loss': []}
validation_history = {'lm_loss': [], 'mc_loss': [], 'total_loss': []}

In [0]:
RunningAverage(output_transform=lambda x: x[0]).attach(trainer, 'lm_loss')
RunningAverage(output_transform=lambda x: x[1]).attach(trainer, 'mc_loss')
RunningAverage(output_transform=lambda x: x[2]).attach(trainer, 'total_loss')

In [0]:
RunningAverage(output_transform=lambda x: x[0]).attach(evaluator, 'lm_loss')
RunningAverage(output_transform=lambda x: x[1]).attach(evaluator, 'mc_loss')
RunningAverage(output_transform=lambda x: x[2]).attach(evaluator, 'total_loss')
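
Note that `RunningAverage` reports a smoothed value rather than an arithmetic mean. To my knowledge it keeps an exponential moving average with a default `alpha` of 0.98, initialized with the first value it sees. The plain-Python sketch below mimics that update rule under those assumptions; `running_average` is a hypothetical stand-in, not Ignite's class.

```python
def running_average(values, alpha=0.98):
    """Exponential moving average; the first value initializes the average."""
    average = None
    for value in values:
        if average is None:
            average = value
        else:
            average = average * alpha + (1.0 - alpha) * value
    return average


losses = [4.0, 2.0, 2.0, 2.0]  # made-up per-iteration losses
print(running_average(losses))    # smoothed value, still dominated by the first loss
print(sum(losses) / len(losses))  # plain arithmetic mean for comparison
```

Over short runs the smoothed value can differ noticeably from the mean, which is worth keeping in mind when reading the "Avg" numbers in the logs.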

In [0]:
@trainer.on(Events.STARTED)
def record_start_time(engine):
    # Remember when training began so the log can report elapsed time.
    engine.state.start_timer = timeit.default_timer()

@trainer.on(Events.ITERATION_COMPLETED(every=100))
def print_trainer_logs(engine):
    loss_LM = engine.state.metrics['lm_loss']
    loss_MC = engine.state.metrics['mc_loss']
    combined_loss = engine.state.metrics['total_loss']
    # Seconds elapsed since the run started (the raw timer reading alone is not meaningful).
    elapsed = timeit.default_timer() - engine.state.start_timer
    print("Trainer Results - iteration {} - LM loss: {:.2f} MC loss: {:.2f} total loss: {:.2f} elapsed time: {:.1f} s"
    .format(engine.state.iteration, loss_LM, loss_MC, combined_loss, elapsed))

In [31]:
checkpointer = ModelCheckpoint('/content/drive/My Drive/Colab Notebooks/GPT-2/Ignite_training_Apr29_2020/GPT2_dir', 'GPT2_summarizer', n_saved=2, create_dir=True, save_as_state_dict=True,require_empty=False)
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=15000), checkpointer, {'epoch_2': model})
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'epoch_2_done': model})

<ignite.engine.engine.RemovableEventHandle at 0x7f5d9820feb8>

In [32]:
def print_logs(engine, dataloader, mode, history_dict):
    """Evaluate on `dataloader` and append the resulting metrics to `history_dict`."""
    evaluator.run(dataloader, max_epochs=1)
    metrics = evaluator.state.metrics
    # These are RunningAverage values, i.e. smoothed losses rather than exact means.
    avg_LM_loss = metrics['lm_loss']
    avg_MC_loss = metrics['mc_loss']
    avg_total_loss = metrics['total_loss']
    print(
        mode + " Results - Epoch {} - Avg lm_loss: {:.2f} Avg mc_loss: {:.2f} Avg total_loss: {:.2f}"
        .format(engine.state.epoch, avg_LM_loss, avg_MC_loss, avg_total_loss))
    for key in metrics.keys():
        history_dict[key].append(metrics[key])

trainer.add_event_handler(Events.EPOCH_COMPLETED, print_logs, val1_dataloader, 'Validation', validation_history)

<ignite.engine.engine.RemovableEventHandle at 0x7f5d9820fa20>

# Run Ignite Engine

In [0]:
e = trainer.run(train1_dataloader, max_epochs=1)

In [34]:
# save the model and tokenizer configuration
model.config.to_json_file('GPT2_dir/config.json')
tokenizer.save_vocabulary('GPT2_dir')

('Apr29_2020_epoch2/vocab.json', 'Apr29_2020_epoch2/merges.txt')

# 6. Result

epoch 1

- lm loss = 1.96-2.00
- mc loss = 0.0
- lr 3x10^-8, max_norm = 1.0, gradient accumulation = 5


---
epoch 2

- lm loss = 1.77
- mc loss = 0.0
- lr = 3x10^-5 (AdamW eps = 1x10^-8), max_norm = 1.0, gradient accumulation = 10

