<a href="https://colab.research.google.com/github/antahiap/dsr-nlp/blob/main/notebooks/05_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 training

- Hugging Face, [link](https://huggingface.co/)

In [None]:
!pip install transformers==4.28.0 tokenizers datasets accelerate

Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting tokenizers
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)


In [None]:
import tensorflow as tf
import glob
import os
import shutil
import tqdm
import random
import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast
from transformers import DataCollatorForLanguageModeling
from transformers import GPT2Config, GPT2LMHeadModel
from transformers import TrainingArguments, Trainer

tf.config.list_physical_devices("GPU")

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

# Dataset

In [None]:
dataset_file = "dataset.txt"

# How many files to load.
file_number = 100

# Clone the repo.
!git clone https://github.com/vilmibm/lovecraftcorpus

# Find all the files.
paths = glob.glob("lovecraftcorpus/*.txt")

# Do not use all.
paths = paths[:file_number]
print(sorted(paths))

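# Merge: write one stripped, non-empty line per row.
with open(dataset_file, "w") as output_file:
    for path in paths:
        # Use a context manager so each input file is closed after reading.
        with open(path, "r") as input_file:
            for line in input_file:
                line = line.strip()
                if line != "":
                    print(line, file=output_file)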

# Delete repo.
!rm -rf lovecraftcorpus

# Done.
print("Corpus downloaded.")

Cloning into 'lovecraftcorpus'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 74 (delta 0), reused 3 (delta 0), pack-reused 70
Receiving objects: 100% (74/74), 1.12 MiB | 4.99 MiB/s, done.
['lovecraftcorpus/alchemist.txt', 'lovecraftcorpus/arthur_jermyn.txt', 'lovecraftcorpus/azathoth.txt', 'lovecraftcorpus/beast.txt', 'lovecraftcorpus/beyond_wall_of_sleep.txt', 'lovecraftcorpus/book.txt', 'lovecraftcorpus/celephais.txt', 'lovecraftcorpus/charles_dexter_ward.txt', 'lovecraftcorpus/clergyman.txt', 'lovecraftcorpus/colour_out_of_space.txt', 'lovecraftcorpus/cool_air.txt', 'lovecraftcorpus/crawling_chaos.txt', 'lovecraftcorpus/cthulhu.txt', 'lovecraftcorpus/dagon.txt', 'lovecraftcorpus/descendent.txt', 'lovecraftcorpus/doorstep.txt', 'lovecraftcorpus/dreams_in_the_witch.txt', 'lovecraftcorpus/dunwich.txt', 'lovecraftcorpus/erich_zann.txt', 'lovecraftcorpus/ex_oblivione.txt', 'lo
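
The merge step above reduces every story to one stripped, non-empty line per row, which is what `load_dataset("text", ...)` later turns into examples. A minimal standard-library sketch of the same idea follows; the `merge_text_files` helper and the two tiny stand-in files are hypothetical and only illustrate the filtering, they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def merge_text_files(paths, output_file):
    """Write every non-blank, stripped line of each input file to output_file.

    Hypothetical helper for illustration; returns the number of lines kept.
    """
    kept = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        kept += 1
    return kept


# Tiny stand-in corpus instead of the cloned repository.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    (tmp / "a.txt").write_text("First line\n\n  Second line  \n", encoding="utf-8")
    (tmp / "b.txt").write_text("\nThird line\n", encoding="utf-8")

    # Collect the inputs before the output file exists, in a fixed order.
    input_paths = sorted(tmp.glob("*.txt"))
    merged_path = tmp / "merged_demo.txt"

    kept_lines = merge_text_files(input_paths, merged_path)
    merged_text = merged_path.read_text(encoding="utf-8")

print(kept_lines)
print(merged_text)
```

Sorting the paths before merging makes the row order reproducible, since `glob` does not guarantee an order.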

In [None]:
raw_datasets = load_dataset("text", data_files=[dataset_file])

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4371
    })
})

In [None]:
for index in range(10):
  token_sequence = raw_datasets["train"][index]
  print(token_sequence)

{'text': "PICKMAN'S MODEL"}
{'text': "I know I'm more nervous than I was when you saw me last year, but you don't need to hold a clinic over it. There's plenty of reason, God knows, and I fancy I'm lucky to be sane at all. Why the third degree? You didn't use to be so inquisitive."}
{'text': "Well, if you must hear it, I don't know why you shouldn't. Maybe you ought to, anyhow, for you kept writing me like a grieved parent when you heard I'd begun to cut the Art Club and keep away from Pickman. Now that he's disappeared I go round to the club once in a while, but my nerves aren't what they were."}
{'text': "No, I don't know what's become of Pickman, and I don't like to guess. You might have surmised I had some inside information when I dropped him--and that's why I don't want to think where he's gone. Let the police find what they can--it won't be much, judging from the fact that they don't know yet of the old North End place he hired under the name of Peters."}
{'text': "I'm not sure 

# Tokenizer

In [None]:
def batch_iterator(batch_size=1000):
  for i in range(0, len(raw_datasets["train"]), batch_size):
    yield raw_datasets["train"][i:i + batch_size]["text"]
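
As a rough check of what `batch_iterator` yields, the sketch below reproduces only its index arithmetic in plain Python. The `batch_spans` helper is hypothetical, and 4371 is the row count the notebook reports for the train split.

```python
def batch_spans(num_rows, batch_size=1000):
    """Return (start, stop) index pairs that cover num_rows in order."""
    return [
        (start, min(start + batch_size, num_rows))
        for start in range(0, num_rows, batch_size)
    ]


spans = batch_spans(4371)  # row count reported for the train split
batch_sizes = [stop - start for start, stop in spans]
print(batch_sizes)  # [1000, 1000, 1000, 1000, 371]
```

The last batch is simply shorter, so no rows are dropped.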

In [None]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))  # must match the special token given to the trainer
trainer = BpeTrainer(vocab_size=5_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(raw_datasets["train"])
    )
tokenizer.save("tokenizer.json")
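
To make the `BpeTrainer` step less opaque, here is a toy pure-Python version of the byte-pair-encoding merge loop. It is a simplified illustration and not the `tokenizers` implementation: it works on characters rather than bytes, ignores special tokens, and stops after a fixed number of merges instead of a target `vocab_size`. The helper names are made up for this sketch.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-split text."""
    words = dict(Counter(tuple(word) for word in text.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


toy_merges, toy_words = train_toy_bpe("low low low lower", num_merges=2)
print(toy_merges)
print(toy_words)
```

Each merge adds one new symbol to the vocabulary, which is why a larger `vocab_size` yields longer sub-word units and shorter token sequences.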

In [None]:
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

0

In [None]:
tokenizer.vocab_size  # tokenizer.vocab shows the full token-to-id mapping

5000

In [None]:
tokenizer.encode("Hi there!")

[34, 61, 267, 2]

# Train the model

In [None]:
sequence_length = 256

def tokenize_function(example):
  tokenized_example = tokenizer(
      example["text"],
      truncation=True,  # cut sequences longer than max_length
      padding=True,     # pad to the longest sequence in the batch
      max_length=sequence_length
  )
  return {
      "input_ids": tokenized_example['input_ids']
  }

token_sequence = raw_datasets["train"][666]
print(token_sequence)
tokenized = tokenize_function(token_sequence)
print(tokenized)

{'text': 'It was a paw, fully two feet and a half across, and equipped with formidable talons. After it came another paw, and after that a great black-furred arm to which both of the paws were attached by short forearms. Then two pink eyes shone, and the head of the awakened Gug sentry, large as a barrel, wabbled into view. The eyes jutted two inches from each side, shaded by bony protuberances overgrown with coarse hairs. But the head was chiefly terrible because of the mouth. That mouth had great yellow fangs and ran from the top to the bottom of the head, opening vertically instead of horizontally.'}
{'input_ids': [288, 127, 53, 68, 208, 9, 877, 542, 795, 102, 53, 630, 1119, 9, 102, 1820, 430, 439, 152, 378, 133, 274, 434, 955, 11, 1149, 113, 361, 814, 68, 208, 9, 102, 371, 128, 53, 348, 462, 10, 917, 166, 2212, 111, 182, 880, 103, 93, 4254, 194, 626, 3335, 213, 1634, 300, 2967, 11, 538, 542, 68, 443, 690, 3575, 9, 102, 93, 475, 103, 93, 208, 255, 485, 33, 698, 982, 508, 9, 949, 109

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets["train"][666].keys()

dict_keys(['input_ids'])

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer,
    mlm=False,  # causal (next-token) objective; masked language modelling is for BERT-style models
)
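
With `mlm=False` the collator pads each batch and copies `input_ids` into `labels`, setting the label to -100 at padding positions so they are ignored by the loss. The sketch below mimics that behaviour in plain Python lists. It is an approximation for illustration, not the library code: `collate_causal_lm` is a hypothetical helper and `pad_id=1` is an assumed id for `[PAD]`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(sequences, pad_id):
    """Pad token-id lists to equal length and build labels for causal LM training."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels equal the inputs; padded positions are excluded from the loss.
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# pad_id=1 is an assumption made for this example.
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_id=1)
print(batch)
```

The labels are not shifted here; `GPT2LMHeadModel` shifts them by one position internally when it computes the loss.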

# The Model

In [None]:
model_config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    pad_token_id=tokenizer.pad_token_id,
    n_ctx=sequence_length,        # context length in tokens
    n_positions=sequence_length,  # size of the position-embedding table (wpe)
    n_embd=512,
    n_head=8,
    n_layer=6
)

model = GPT2LMHeadModel(model_config)
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(5000, 512)
    (wpe): Embedding(256, 512)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=5000, bias=False)
)
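
The module tree above is enough to estimate the model size by hand. The sketch below counts trainable parameters for a GPT-2 style model under two assumptions: the MLP width is 4 × `n_embd` (the GPT-2 default) and `lm_head` shares its weights with `wte`, so it is not counted twice. `gpt2_parameter_count` is a hypothetical helper; the number of positions should be read off the `wpe` row of the printout.

```python
def gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a GPT-2 style model with tied lm_head."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    layer_norm = 2 * n_embd                                  # weight + bias
    attention = (
        n_embd * 3 * n_embd + 3 * n_embd  # c_attn: joint query/key/value projection
        + n_embd * n_embd + n_embd        # c_proj: output projection
    )
    mlp = (
        n_embd * 4 * n_embd + 4 * n_embd  # c_fc
        + 4 * n_embd * n_embd + n_embd    # c_proj
    )
    block = 2 * layer_norm + attention + mlp  # ln_1, attn, ln_2, mlp
    return embeddings + n_layer * block + layer_norm  # plus final ln_f


# Counts for two possible position-table sizes.
counts = {
    n_positions: gpt2_parameter_count(5000, n_positions, n_embd=512, n_layer=6)
    for n_positions in (256, 1024)
}
for n_positions, total in counts.items():
    print(n_positions, f"{total:,}")
```

The number of attention heads does not appear in the formula, because the heads split the same `n_embd`-wide projections. The result can be compared with `model.num_parameters()`.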

In [None]:
output_path = "output"

training_args = TrainingArguments(
    output_dir=output_path,
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    prediction_loss_only=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"]
)
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 6.4982 |
| 1000 | 5.9881 |
| 1500 | 5.7789 |
| 2000 | 5.6476 |
| 2500 | 5.5699 |


TrainOutput(global_step=2735, training_loss=5.865057540419333, metrics={'train_runtime': 366.4266, 'train_samples_per_second': 59.644, 'train_steps_per_second': 7.464, 'total_flos': 634973941923840.0, 'train_loss': 5.865057540419333, 'epoch': 5.0})
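
Two quick sanity checks on these numbers, as a small sketch. The step count assumes a single device and no gradient accumulation. The perplexity conversion assumes the logged loss is the mean cross-entropy per token in nats, which is the usual convention for this kind of training loop.

```python
import math

num_rows = 4371   # rows in the train split
batch_size = 8    # per_device_train_batch_size
num_epochs = 5    # num_train_epochs

# The last, partial batch of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 2735, matching global_step in TrainOutput

# Logged training losses, converted to perplexity = exp(loss).
logged_losses = {500: 6.4982, 1000: 5.9881, 1500: 5.7789, 2000: 5.6476, 2500: 5.5699}
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}
for step, ppl in perplexities.items():
    print(step, round(ppl, 1))
```

A loss that is still above 5.5 after five epochs corresponds to a perplexity in the hundreds, which fits the repetitive text the model generates further down.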

In [None]:
tokenizer.save_pretrained(output_path)
model.save_pretrained(output_path)

In [None]:
!mkdir outputJson

NotImplementedError: ignored

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!zip -r /content/output_GPT2_files.zip /content/output

NotImplementedError: ignored

In [None]:
from google.colab import files
files.download('/content/output_GPT2_files.zip')  # same path as the zip command above



In [None]:
# Encode the conditioning tokens.
prompt = "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents."
# Move the prompt to whichever device the model is on.
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
print(input_ids)

# Generate more tokens.
generated_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    temperature=0.5
)
generated_sequence = tokenizer.decode(generated_ids[0], clean_up_tokenization_spaces=True)
print(generated_sequence)

tensor([[ 184,  325, 3454,  205,   92,   93,  552,    9,   35,  678,    9,  114,
           93,   92, 3974,  103,   93,  577,  609,  111,  421,  695,  227,  156,
          282, 4911,   11]], device='cuda:0')
The most merciful thing in the world, I think, is the in ability of the human mind to cor rel ate all its contents. I have been a, and I had been a very ancient. I saw that I had a moment of a - t, and I found a small - like the old man, and I have been the old man had been a - like the great - like. It was a time, I had been, and a man could not be a moment I had been
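
The `temperature=0.5` argument divides the logits by 0.5 before sampling, which sharpens the next-token distribution. A minimal sketch of that effect in plain Python, using made-up logits for three tokens; `softmax_with_temperature` is a hypothetical helper, not a `transformers` function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by `temperature`."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, temperature=1.0)
p_sharp = softmax_with_temperature(toy_logits, temperature=0.5)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

With the lower temperature the top token takes a larger share of the probability mass, so sampling becomes more conservative and more prone to repetition.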
