# Basic GPT-2 Model

This notebook fine-tunes Hugging Face's pretrained GPT-2 model to generate book summaries. I am following [this article](https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/) and the [Hugging Face GPT-2 model page](https://huggingface.co/gpt2).

In [12]:
# uncomment for colab
#!pip install transformers datasets accelerate nvidia-ml-py3

# import hugging face
from transformers import GPT2Tokenizer, GPT2LMHeadModel


In [13]:
# for Colab: track GPU memory utilization
# (disabled by wrapping the code in a string; remove the triple quotes to enable it)

""" from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

# print GPU utilization
print_gpu_utilization()"""

In [14]:
# load the book-summary text into a DataFrame (pandas infers zip compression from the file extension)
from datasets import Dataset
import pandas as pd
filename = 'data/5000_booksummaries.zip'
tokens_df = pd.read_csv(filename)
tokens_df.head(5)

| Unnamed: 0 | Text |
|---|---|
| 0 | Generate a book summary with genres Science Fi... |
| 1 | Generate a book summary with genres Fantasy:\n... |
| 2 | Generate a book summary with genres Crime Fict... |
| 3 | Generate a book summary with genres Fiction, N... |
| 4 | Generate a book summary with genres War novel,... |


In [15]:
# split data into train and evaluation sets
from sklearn.model_selection import train_test_split

# with no test_size given, train_test_split holds out 25% of the rows
train_data, eval_set = train_test_split(tokens_df, random_state=8)

# create HuggingFace Dataset
train_ds = Dataset.from_pandas(train_data, split="train")
eval_ds = Dataset.from_pandas(eval_set, split="eval")
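
The split sizes can be checked by hand. The sketch below assumes the file holds 5,000 rows (suggested by the filename and consistent with the 3750 and 1250 example counts shown in the tokenization progress bars) and that `train_test_split` is called without `test_size`, in which case scikit-learn holds out 25% by default. `split_sizes` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) for a fractional hold-out, rounding the test set up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

# assumed row count: 5000
n_train, n_test = split_sizes(5000)
print(n_train, n_test)  # 3750 1250
```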

In [16]:
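# tokenize datasets (GPT2Tokenizer is already imported above)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # truncate each text to the tokenizer's maximum length (1024 tokens for GPT-2)
    return tokenizer(examples["Text"], truncation=True)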

train_tok_ds = train_ds.map(tokenize_function, batched=True).shuffle(seed=42)
eval_tok_ds = eval_ds.map(tokenize_function, batched=True).shuffle(seed=42)

Map:   0%|          | 0/3750 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

In [17]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

model = GPT2LMHeadModel.from_pretrained('gpt2')

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# output_dir holds Trainer checkpoints and run state; the final model is saved separately below
training_args = TrainingArguments(
        output_dir="temp_model",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        evaluation_strategy='no',  # no evaluation runs during training, despite do_eval=True
        per_device_train_batch_size=4,
        num_train_epochs=3,
        save_total_limit=1,
        # reduce GPU memory use: accumulate gradients over 4 batches (effective batch size 16),
        # recompute activations during the backward pass, and train in mixed precision
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        fp16=True)

trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_tok_ds,
        eval_dataset=eval_tok_ds,
    )
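
With `mlm=False`, the collator prepares batches for causal language modeling: it pads each batch to a common length and copies `input_ids` into `labels`, replacing pad positions with -100 so the loss ignores them. Since the pad token here is the EOS token (ID 50256 for GPT-2), any EOS token in the input would be masked the same way. The pure-Python sketch below mimics that behavior on made-up token IDs. `collate_causal_lm` is a hypothetical helper, not the library's implementation, and it assumes right-side padding.

```python
PAD_ID = 50256        # GPT-2's EOS token ID, reused as the pad token
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def collate_causal_lm(sequences, pad_id=PAD_ID):
    """Pad token-ID lists to equal length and build labels that ignore padding."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids = list(seq) + [pad_id] * n_pad
        batch["input_ids"].append(ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # labels are a copy of input_ids with every pad-token position masked out
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in ids])
    return batch

batch = collate_causal_lm([[10, 11, 12], [20]])
print(batch["labels"])  # [[10, 11, 12], [20, -100, -100]]
```

The shift between inputs and targets is not done here; the model applies it internally when computing the loss.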

In [18]:
trainer.train()



| Step | Training Loss |
|---|---|
| 500 | 3.4065 |


TrainOutput(global_step=702, training_loss=3.3774174964665686, metrics={'train_runtime': 2057.0658, 'train_samples_per_second': 5.469, 'train_steps_per_second': 0.341, 'total_flos': 4844178717696000.0, 'train_loss': 3.3774174964665686, 'epoch': 2.99})
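
The step count and throughput in this output follow from the training arguments. The sketch below assumes a single GPU, 3750 training examples (the size of the training split shown in the tokenization progress bar), and that an optimizer step is counted once per full group of accumulated batches. The variable names are illustrative.

```python
import math

n_train = 3750          # assumed size of the training split
per_device_batch = 4    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 3              # num_train_epochs
runtime_s = 2057.0658   # train_runtime reported above

batches_per_epoch = math.ceil(n_train / per_device_batch)  # 938 batches
steps_per_epoch = batches_per_epoch // grad_accum           # 234 optimizer steps
total_steps = steps_per_epoch * epochs                      # 702, matches global_step

effective_batch = per_device_batch * grad_accum             # 16 examples per optimizer step
samples_per_second = n_train * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(total_steps)                   # 702
print(round(samples_per_second, 3))  # 5.469
print(round(steps_per_second, 3))    # 0.341
```

702 steps of 16 examples cover 11,232 examples, slightly fewer than three full passes (11,250), which is consistent with the reported `epoch` of 2.99.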

In [19]:
# save the fine-tuned model (weights and config) and the tokenizer files locally
checkpoint = "./model_config"
model.save_pretrained(checkpoint)
tokenizer.save_pretrained(checkpoint)

('./model_config/tokenizer_config.json',
 './model_config/special_tokens_map.json',
 './model_config/vocab.json',
 './model_config/merges.txt',
 './model_config/added_tokens.json')

In [20]:
# reload the fine-tuned model and tokenizer from the saved directory
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

In [21]:
# tokenize the input prompt
input_prompt = "Generate a book summary with genre science fiction, mystery:\n"
inputs = tokenizer(input_prompt, return_tensors="pt")

# generate with the decoding settings from the pretrained-model experiments (see baseline file)
# do_sample=True combined with num_beams=2 samples within beam search, so output varies between runs
outputs = model.generate(**inputs,
    max_length=150,
    num_beams=2,
    no_repeat_ngram_size=2,
    do_sample=True,
    early_stopping=True)

# decode output and print out summary
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Generate a book summary with genre science fiction, mystery:\n The story begins with a series of events that take place in the year 2086, with the discovery of a large chunk of the Earth's crust. The crust is thought to have been formed by the impact of an asteroid which hit the planet in early 2087, and which is believed to be the most powerful asteroid ever discovered. In the aftermath of this event, the earth is devastated by a massive volcanic eruption that erupts into a huge crater in which thousands of people are killed and many more are left homeless. After the eruption, a group of scientists, led by Dr. Robert C. Jones, discover that the crust was formed as a result of two separate explosions, which are"]
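
`no_repeat_ngram_size=2` prevents any two-token sequence from appearing twice in the generated sequence. The constraint applies to token IDs rather than words, so the pure-Python sketch below only illustrates the idea on short lists of made-up word tokens. `repeated_ngrams` is a hypothetical helper, not part of transformers.

```python
from collections import Counter

def repeated_ngrams(tokens, n=2):
    """Return the set of n-grams that occur more than once in the token sequence."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram for gram, count in counts.items() if count > 1}

allowed = ["the", "crust", "was", "formed", "by", "the", "impact"]
blocked = ["the", "crust", "was", "the", "crust"]

print(repeated_ngrams(allowed))  # set()
print(repeated_ngrams(blocked))  # {('the', 'crust')}
```

A single token such as "the" may repeat freely; only a repeated pair is ruled out.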


In [23]:
# uncomment for colab downloading the model
#!zip -r /content/model_setup_5000_3.zip /content/model_config

  adding: content/model_config/ (stored 0%)
  adding: content/model_config/tokenizer_config.json (deflated 70%)
  adding: content/model_config/config.json (deflated 51%)
  adding: content/model_config/special_tokens_map.json (deflated 74%)
  adding: content/model_config/generation_config.json (deflated 24%)
  adding: content/model_config/merges.txt (deflated 53%)
  adding: content/model_config/pytorch_model.bin (deflated 9%)
  adding: content/model_config/vocab.json (deflated 68%)
