# Basic GPT-2 Model

We now switch to Hugging Face's pretrained GPT-2 model. I am following [this article](https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/) and the [Hugging Face GPT-2 model page](https://huggingface.co/gpt2).

In [2]:
!pip install transformers

# Hugging Face imports (only the names used in this notebook)
from transformers import GPT2Tokenizer, GPT2LMHeadModel, pipeline, set_seed
from transformers import AutoTokenizer, TextGenerationPipeline

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.26.1


In [3]:
# basic GPT-2 model (from the site)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [4]:
# test the model
text = "This is a comedy story:"
text_ids = tokenizer.encode(text, return_tensors = 'pt')

generated_text_samples = model.generate(text_ids)

# print each generated sequence
for i, beam in enumerate(generated_text_samples):
    print(f"{i}: {tokenizer.decode(beam, skip_special_tokens=True)}")
    print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: This is a comedy story: a young man who has been living in a small town for a few
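
The warning above appears because `generate` received only `input_ids`. Below is a minimal sketch of a helper that passes the attention mask and a pad token id explicitly. The name `generate_with_mask` and its `max_new_tokens` default are chosen here for illustration, and `model` and `tokenizer` are assumed to be the objects loaded in the previous cell. The very short continuation above is probably due to the default generation length, which appears to be 20 tokens in this transformers version.

```python
def generate_with_mask(model, tokenizer, prompt, max_new_tokens=40):
    """Generate continuations while passing the attention mask explicitly."""
    # calling the tokenizer (rather than .encode) returns input_ids and attention_mask
    encoded = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 ships without a pad token
        max_new_tokens=max_new_tokens,
    )
    # decode every returned sequence back to text
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```

With the objects from the cell above, this would be called as `generate_with_mask(model, tokenizer, text)`.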



In [5]:
# use hugging face documentation
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator(text, max_length=50, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "This is a comedy story: The girl who's been following everyone who happened to miss her for the entire book is called her 'Bubba' (a black-skinned girl who's not a fan of KATOR) at school. While"},
 {'generated_text': "This is a comedy story: What you'd see in this film. We're taking the audience by surprise. This person is no ordinary person who has nothing of her own, what he stands to gain by going there would be to the end of the"},
 {'generated_text': "This is a comedy story: the world is overrun by terrorists, and each time one comes out looking for more explosives, more money, more power, all the time. All the time—but most of the time it's bad. Some of the"},
 {'generated_text': 'This is a comedy story: a series of films that were both funny and memorable after a period of time in the UK, and about how it is possible to not only survive but thrive in this moment, and where there is nothing very good in it'},
 {'generated_text': 'This is a comedy story: a ch

In [7]:
!pip install datasets

# create smaller dataset from our subset data
from datasets import Dataset
import pandas as pd
filename = 'data/500_books.txt'
df = pd.read_csv(filename, sep="\t", 
                 names=['Wikipedia ID', 'Freebase ID', 'Title', 'Author', 'Publication Date', 'Genres', 'Summary'])

# clean data
import re

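def clean(text):
    """Lowercase a summary and keep only ASCII letters separated by single spaces."""
    # drop punctuation outright (so "near-future" becomes "nearfuture")
    punc_less_text = re.sub(r'[^\w\s]', '', text)
    # replace digits, underscores and non-ASCII letters with spaces
    alpha_only_text = re.sub(r'[^a-zA-Z]', ' ', punc_less_text)
    # collapse runs of whitespace
    return ' '.join(alpha_only_text.split()).lower()

# apply to dataframe col that contains the book summary
df['CleanSummary'] = df['Summary'].apply(clean)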
df.head(5)

# remove stop words
import nltk
from nltk.corpus import stopwords

# download stopwords list
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not in NLTK's English stop-word list
    return ' '.join(word for word in text.split() if word not in stop_words)

# apply stopword removal to dataframe col that contains the book summary
df['CleanSummary'] = df['CleanSummary'].apply(remove_stopwords)
df.head(5)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting responses<0.19
  Downl

Unnamed: 0,Wikipedia ID,Freebase ID,Title,Author,Publication Date,Genres,Summary,CleanSummary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca...",old major old boar manor farm calls animals fa...
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan...",alex teenager living nearfuture england leads ...
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...,text plague divided five parts town oran thous...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...,argument enquiry proceeds series incremental s...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...,novel posits space around milky way divided co...
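
The `CleanSummary` column shows `near-future` fused into `nearfuture`, because punctuation is deleted before non-letters are turned into spaces. The sketch below reproduces the cleaning steps on one string and compares them with a hypothetical variant, `clean_keep_boundaries`, which is not part of the notebook and is only one possible alternative.

```python
import re

def clean(text):
    # same steps as the notebook's clean(): strip punctuation, keep ASCII letters, lowercase
    no_punct = re.sub(r'[^\w\s]', '', text)
    letters_only = re.sub(r'[^a-zA-Z]', ' ', no_punct)
    return ' '.join(letters_only.split()).lower()

def clean_keep_boundaries(text):
    # hypothetical variant: turn every non-letter into a space in a single pass,
    # so hyphenated words split instead of fusing
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).split()).lower()

sample = "Alex, a teenager living in near-future England"
print(clean(sample))                  # alex a teenager living in nearfuture england
print(clean_keep_boundaries(sample))  # alex a teenager living in near future england
```

The variant has its own trade-off: contractions such as `don't` would be split into `don t`.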


In [8]:
import json

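# drop duplicate books and rows with missing fields, then turn each Genres JSON string into a list of genre names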
df = df.drop_duplicates(subset=['Wikipedia ID'])
df = df.dropna(subset=['Genres','CleanSummary', 'Summary'])
df['Genres'] = df['Genres'].map(lambda genre : list(json.loads(str(genre)).values()))
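
To illustrate what the `map` call does to one cell, here is a small sketch. The first key/value pair comes from the Animal Farm row shown earlier; the second key is a made-up placeholder. It also shows why the `dropna` has to come first: a missing cell becomes the string `nan`, which the JSON parser rejects.

```python
import json

# illustrative Genres cell: the first pair is from the Animal Farm row,
# the second key is a made-up placeholder
raw_genres = '{"/m/016lj8": "Roman \\u00e0 clef", "/m/placeholder": "Satire"}'

# keep only the human-readable genre names, dropping the Freebase ids
genre_names = list(json.loads(raw_genres).values())
print(genre_names)  # ['Roman à clef', 'Satire']

# a missing cell is NaN, and str(float("nan")) is "nan", which is not valid JSON
try:
    json.loads(str(float("nan")))
    nan_parses = True
except json.JSONDecodeError:
    nan_parses = False
print(nan_parses)  # False
```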

In [9]:
# special tokens used to build one condensed training string per book
BOS_TOKEN = '<BOS>'
EOS_TOKEN = '<EOS>'
SPECIAL_TOKENS = []

def transform_genres(genre_list):
    genre_token = ''
    for genre in genre_list:
        genre_token += ('<' + genre + '>')
        if genre_token not in SPECIAL_TOKENS:
            SPECIAL_TOKENS.append(genre)
        genre_token += ' '
    return genre_token
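
Note that `transform_genres` compares the accumulated `genre_token` string against `SPECIAL_TOKENS` and then appends the bare `genre` name, so the list collects duplicates without angle brackets (this is visible in the tokenizer log further down). A minimal sketch of a variant that records each bracketed tag once; `transform_genres_unique` and `special_tokens` are hypothetical names, not part of the notebook.

```python
special_tokens = []  # hypothetical replacement for SPECIAL_TOKENS

def transform_genres_unique(genre_list):
    """Return '<genre> ' tags for one book and record each tag exactly once."""
    tags = []
    for genre in genre_list:
        tag = '<' + genre + '>'
        # compare the single bracketed tag, not the accumulated string
        if tag not in special_tokens:
            special_tokens.append(tag)
        tags.append(tag)
    # same output shape as the original: every tag followed by one space
    return ''.join(tag + ' ' for tag in tags)

first = transform_genres_unique(['Satire', 'Fiction'])
second = transform_genres_unique(['Fiction', 'Novel'])
print(repr(first))     # '<Satire> <Fiction> '
print(special_tokens)  # ['<Satire>', '<Fiction>', '<Novel>']
```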

In [10]:
# build one training string per book: <BOS> <genre tags> summary <EOS>
new_data = []
df = df.reset_index(drop=True)  # renumber rows 0..n-1 after the drops above

for _, row in df.iterrows():
    stringified_row = BOS_TOKEN + ' ' + transform_genres(row['Genres']) + row['Summary'] + ' ' + EOS_TOKEN
    new_data.append(stringified_row)

print(new_data[0])

<BOS> <Roman à clef> <Satire> <Children's literature> <Speculative fiction> <Fiction>  Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a

In [11]:
# create new dataframe
tokens_df = pd.DataFrame(new_data, columns=['Text'])
tokens_df.head()

Unnamed: 0,Text
0,<BOS> <Roman à clef> <Satire> <Children's lite...
1,<BOS> <Science Fiction> <Novella> <Speculative...
2,<BOS> <Existentialism> <Fiction> <Absurdist fi...
3,<BOS> <Hard science fiction> <Science Fiction>...
4,<BOS> <War novel> <Roman à clef> The book tel...


In [30]:
# split data into train and test data
from sklearn.model_selection import train_test_split

# hold out 20% of the rows for evaluation (fixed random_state for reproducibility)
train_data, test_data = train_test_split(tokens_df, test_size=.2, random_state=8)

# create HuggingFace Dataset
train_ds = Dataset.from_pandas(train_data, split="train")
eval_ds = Dataset.from_pandas(test_data, split="eval")

In [31]:
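# tokenize datasets
from transformers import AutoTokenizer

# NOTE: "bert-base-cased" is BERT's WordPiece tokenizer, while `model` is GPT-2.
# The two vocabularies differ, so these ids do not line up with GPT-2's embeddings;
# the matching checkpoint name would be "gpt2".
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")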

def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True)

train_tok_ds = train_ds.map(tokenize_function, batched=True).shuffle(seed=42).select(range(50))
eval_tok_ds = eval_ds.map(tokenize_function, batched=True).shuffle(seed=42).select(range(50))

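# NOTE: the datasets above are already tokenized at this point, so <BOS>, <EOS>
# and the genre tags were encoded before being registered as special tokens
special_tokens_dict = {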
    "bos_token": BOS_TOKEN,
    "eos_token": EOS_TOKEN,
    "pad_token": "<PAD>",
    "additional_special_tokens": SPECIAL_TOKENS,
}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file tokenize

Map:   0%|          | 0/325 [00:00<?, ? examples/s]

Map:   0%|          | 0/82 [00:00<?, ? examples/s]

Assigning <BOS> to the bos_token key of the tokenizer
Assigning <EOS> to the eos_token key of the tokenizer
Assigning <PAD> to the pad_token key of the tokenizer
Assigning ['Roman à clef', 'Satire', "Children's literature", 'Speculative fiction', 'Fiction', 'Science Fiction', 'Novella', 'Speculative fiction', 'Utopian and dystopian fiction', 'Satire', 'Fiction', 'Existentialism', 'Fiction', 'Absurdist fiction', 'Novel', 'Hard science fiction', 'Science Fiction', 'Speculative fiction', 'Fantasy', 'Fiction', 'War novel', 'Roman à clef', "Children's literature", 'Fantasy', 'Speculative fiction', 'Bildungsroman', 'Fiction', 'Science Fiction', 'Speculative fiction', 'Science Fiction', 'Speculative fiction', 'Religious text', 'Speculative fiction', 'Fiction', 'Novel', 'Science Fiction', 'Speculative fiction', "Children's literature", 'Fiction', 'Satire', 'Bildungsroman', 'Picaresque novel', 'Science Fiction', 'Speculative fiction', "Children's literature", 'Fiction', 'Gothic fiction', 'Ficti

Embedding(29084, 768)
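
`resize_token_embeddings(len(tokenizer))` sized GPT-2's embedding matrix to the BERT tokenizer's length. The numbers in the logs above allow a quick arithmetic check of what that did; the GPT-2 vocabulary size of 50257 is not printed in this notebook and is taken from the stock `gpt2` checkpoint (ids 0 to 50256).

```python
BERT_VOCAB_SIZE = 28996   # "vocab_size" in the BertConfig log
RESIZED_ROWS = 29084      # rows in the Embedding(...) printed after resizing
GPT2_VOCAB_SIZE = 50257   # stock gpt2 vocabulary size (assumption, not shown in the logs)

# entries added by add_special_tokens on top of BERT's vocabulary
added_tokens = RESIZED_ROWS - BERT_VOCAB_SIZE
# pretrained GPT-2 embedding rows cut off by the resize
dropped_rows = GPT2_VOCAB_SIZE - RESIZED_ROWS

print(added_tokens)  # 88
print(dropped_rows)  # 21173
```

So the resize shrank the pretrained embedding table rather than growing it, and the remaining rows are indexed by ids from a different vocabulary.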

In [15]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# NOTE: mlm is left at its default (True), i.e. a masked-LM objective;
# causal-LM fine-tuning of GPT-2 would normally pass mlm=False.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
)

training_args = TrainingArguments(
    output_dir="test_trainer",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='no',  # no evaluation is run during training
    per_device_train_batch_size=4,
    num_train_epochs=1,
    save_total_limit=1,
    save_steps=1000)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_tok_ds,
    eval_dataset=eval_tok_ds,
)
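
For a causal model such as GPT-2, `DataCollatorForLanguageModeling` is usually created with `mlm=False`. In that mode the labels are a copy of `input_ids` with padding positions set to -100, and the model shifts them internally to predict the next token. The following is a plain-Python illustration of that labeling under those assumptions, not the library's code; `causal_lm_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(input_ids, pad_token_id):
    """Copy input ids into labels, masking padding so it adds nothing to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# hypothetical token ids; 0 plays the role of the pad token
example_ids = [101, 7, 42, 9, 0, 0]
print(causal_lm_labels(example_ids, pad_token_id=0))  # [101, 7, 42, 9, -100, -100]
```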



In [16]:
trainer.train()

***** Running training *****
  Num examples = 50
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 13
  Number of trainable parameters = 108178944
The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: Text, __index_level_0__. If Text, __index_level_0__ are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=13, training_loss=9.056475712702824, metrics={'train_runtime': 869.934, 'train_samples_per_second': 0.057, 'train_steps_per_second': 0.015, 'total_flos': 13064601600000.0, 'train_loss': 9.056475712702824, 'epoch': 1.0})
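
Two numbers in the training log can be checked by hand. The step count follows from 50 examples in batches of 4, and the trainable-parameter count matches stock GPT-2 minus the embedding rows removed by the earlier resize. The stock figures (124,439,808 parameters, vocabulary size 50257) are assumptions taken from the public `gpt2` checkpoint and are not printed in this notebook.

```python
import math

num_examples = 50   # "Num examples" in the log
batch_size = 4      # per_device_train_batch_size
steps_per_epoch = math.ceil(num_examples / batch_size)
print(steps_per_epoch)  # 13

GPT2_PARAMS = 124_439_808  # stock gpt2 parameter count (assumption)
HIDDEN_SIZE = 768          # n_embd
rows_removed = 50257 - 29084  # stock vocabulary size minus the resized embedding rows
params_after_resize = GPT2_PARAMS - rows_removed * HIDDEN_SIZE
print(params_after_resize)  # 108178944
```

The result agrees with "Number of trainable parameters = 108178944" in the log, which suggests the model was trained with the shrunken embedding table.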

In [28]:
# save the fine-tuned model and tokenizer, then reload them for generation
checkpoint = "./content/test_trainer"

model.save_pretrained(checkpoint)
tokenizer.save_pretrained(checkpoint)

model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
story_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)

Configuration saved in ./content/test_trainer/config.json
Configuration saved in ./content/test_trainer/generation_config.json
Model weights saved in ./content/test_trainer/pytorch_model.bin
tokenizer config file saved in ./content/test_trainer/tokenizer_config.json
Special tokens file saved in ./content/test_trainer/special_tokens_map.json
loading configuration file ./content/test_trainer/config.json
Model config GPT2Config {
  "_name_or_path": "./content/test_trainer",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation"

In [29]:
input_prompt = "<BOS> <Fiction>"
story = story_generator(input_prompt, max_length=75, do_sample=True,
               repetition_penalty=1.1, temperature=1.2, 
               top_p=0.95, top_k=50)
print(story)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '<BOS> <Fiction> - ữ ǔishing デ houses communism ț ə [unused20]ca 2005 cheekted large courts [unused13] 1856 [unused8] ₱ campaign Sanders cited [unused11] too last ― last 1945 [unused13] Have ї [unused41] 2000 sugar [unused35] succeeded ĩ snuck [unused42] Paris ς [unused13]vio ď Isles sights ś ð tears staff ė aircraft Đ Bristol [unused25] ë mutation completely against surveys [unused12] wasting Č 07 rugged [unused26] us ļ Dustin ƒ'}]
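
The `[unused20]`-style pieces in this output are entries from BERT's vocabulary, which fits a tokenizer/model mismatch rather than a problem with the sampling settings. A minimal sketch of a setup step that keeps tokenizer and model aligned is shown below; `register_tokens_and_resize` is a hypothetical helper, and it assumes the tokenizer passed in was loaded from the same checkpoint as the model (for example `"gpt2"`).

```python
def register_tokens_and_resize(tokenizer, model, genre_tokens,
                               bos="<BOS>", eos="<EOS>", pad="<PAD>"):
    """Add the special tokens to `tokenizer`, then resize `model` to match."""
    # de-duplicate the genre tags while keeping their order
    unique_tokens = list(dict.fromkeys(genre_tokens))
    num_added = tokenizer.add_special_tokens({
        "bos_token": bos,
        "eos_token": eos,
        "pad_token": pad,
        "additional_special_tokens": unique_tokens,
    })
    # grow the embedding matrix so every new id has a row
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

Calling this before `Dataset.map` would let the tokenizer encode `<BOS>`, `<EOS>` and each genre tag as a single id when the datasets are tokenized.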
