# Basic GPT-2 Model

This notebook fine-tunes HuggingFace's pretrained GPT-2 model to generate book summaries. I am following [this article](https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/) and the [HuggingFace GPT-2 model card](https://huggingface.co/gpt2).

In [21]:
#!pip install transformers

# import hugging face
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [22]:
# basic GPT-2 model (from the site)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [23]:
# test the model
text = "Generate a book summary with genre fiction:\n"
text_ids = tokenizer.encode(text, return_tensors='pt')

# generate() runs with its defaults here (greedy decoding, one short sequence);
# no attention_mask or pad_token_id is passed, which triggers the warnings below
generated_text_samples = model.generate(text_ids)

# print each generated sequence
for i, sequence in enumerate(generated_text_samples):
    print("{}: {}".format(i, tokenizer.decode(sequence, skip_special_tokens=True)))
    print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: Generate a book summary with genre fiction:

The book summary is a summary of the book



In [2]:
#!pip install datasets

# create smaller dataset from our subset data
from datasets import Dataset
import pandas as pd
filename = 'data/500_books.txt'
df = pd.read_csv(filename, sep="\t", 
                 names=['Wikipedia ID', 'Freebase ID', 'Title', 'Author', 'Publication Date', 'Genres', 'Summary'])

# clean data
import re

def clean(text):
    """Strip punctuation, keep letters only, collapse whitespace, and lowercase."""
    punc_less_text = re.sub(r'[^\w\s]', '', text)
    alpha_only_text = re.sub(r'[^a-zA-Z]', ' ', punc_less_text)
    return ' '.join(alpha_only_text.split()).lower()

# apply to dataframe col that contains the book summary
df['CleanSummary'] = df['Summary'].apply(clean)
df.head(5)

# remove stop words
import nltk
from nltk.corpus import stopwords

# download stopwords list
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Drop English stop words from an already-cleaned, lowercased string."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# apply stopword removal to dataframe col that contains the book summary
df['CleanSummary'] = df['CleanSummary'].apply(remove_stopwords)
df.head(5)

| | Wikipedia ID | Freebase ID | Title | Author | Publication Date | Genres | Summary | CleanSummary |
|---|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... | old major old boar manor farm calls animals fa... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... | alex teenager living nearfuture england leads ... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... | text plague divided five parts town oran thous... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... | argument enquiry proceeds series incremental s... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... | novel posits space around milky way divided co... |
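To see what the two cleaning steps do to a single sentence, here is a minimal sketch that reuses the same regular expressions. The tiny `SAMPLE_STOP_WORDS` set is my own stand-in (an assumption) for NLTK's English list, so the snippet runs without a download.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption: far smaller).
SAMPLE_STOP_WORDS = {"a", "in", "the", "of"}

def clean(text):
    """Same steps as the notebook: drop punctuation, keep letters, lowercase."""
    punc_less_text = re.sub(r'[^\w\s]', '', text)
    alpha_only_text = re.sub(r'[^a-zA-Z]', ' ', punc_less_text)
    return ' '.join(alpha_only_text.split()).lower()

def remove_stopwords(text, stop_words=SAMPLE_STOP_WORDS):
    """Drop every whitespace-separated word found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = "Alex, a teenager living in near-future England"
cleaned = clean(sample)
print(cleaned)                    # alex a teenager living in nearfuture england
print(remove_stopwords(cleaned))  # alex teenager living nearfuture england
```

Because punctuation is deleted rather than replaced by a space, hyphenated words fuse together ("near-future" becomes "nearfuture"), which matches the `CleanSummary` column above.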


In [3]:
import json

# drop data
df = df.drop_duplicates(subset=['Wikipedia ID'])
df = df.dropna(subset=['Genres','CleanSummary', 'Summary'])
df['Genres'] = df['Genres'].map(lambda genre : list(json.loads(str(genre)).values()))
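The `Genres` column stores a JSON object mapping an ID to a genre name, and the `map` call keeps only the names. A small sketch of that step on one value follows; the keys in `raw_genres` are made up for illustration. It also shows why the `dropna` has to come first: a missing value turns into the string `'nan'`, which is not valid JSON.

```python
import json

# Hypothetical raw cell value; the ID keys are invented for this example.
raw_genres = '{"/m/aaa": "Science Fiction", "/m/bbb": "Satire"}'

# Same expression as in the cell above: parse the JSON, keep only the genre names.
genre_list = list(json.loads(str(raw_genres)).values())
print(genre_list)  # ['Science Fiction', 'Satire']

# A NaN cell becomes 'nan' after str(), and json.loads rejects it.
try:
    json.loads(str(float('nan')))
    nan_parsed = True
except json.JSONDecodeError:
    nan_parsed = False
print(nan_parsed)  # False
```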

In [4]:
def get_genres_str(genre_list):
    # join the genre names into one comma-separated string
    return ', '.join(genre_list)

In [5]:
# build one training string per book: prompt prefix, genre list, then the raw summary
new_data = []
df = df.reset_index()  # renumber rows 0..n-1 after dropping duplicates and missing values
prefix = 'Generate book summary with genres '

for index, row in df.iterrows():
    stringified_row = prefix + get_genres_str(row['Genres']) + ':\n' + row['Summary']
    new_data.append(stringified_row)

print(new_data[0])

Generate book summary with genres Roman à clef, Satire, Children's literature, Speculative fiction, Fiction:
 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announc
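Each training text is therefore a one-line header followed by the summary. The sketch below isolates the header so the same function could also build a prompt at generation time; `build_prompt` and `build_training_text` are hypothetical helper names, not part of the notebook.

```python
PREFIX = 'Generate book summary with genres '

def build_prompt(genres):
    """Return the header line used in the training texts (hypothetical helper)."""
    return PREFIX + ', '.join(genres) + ':\n'

def build_training_text(genres, summary):
    """Header plus summary, the same layout as the entries of new_data."""
    return build_prompt(genres) + summary

header = build_prompt(['Satire', 'Fiction'])
print(repr(header))  # 'Generate book summary with genres Satire, Fiction:\n'
print(build_training_text(['Fiction'], 'A short placeholder summary.'))
```

Reusing one function for both training and generation keeps the prompt wording identical, which is usually what a fine-tuned model expects.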

In [6]:
# create new dataframe
tokens_df = pd.DataFrame(new_data, columns=['Text'])
tokens_df.head()

| | Text |
|---|---|
| 0 | Generate book summary with genres Roman à clef... |
| 1 | Generate book summary with genres Science Fict... |
| 2 | Generate book summary with genres Existentiali... |
| 3 | Generate book summary with genres Hard science... |
| 4 | Generate book summary with genres War novel, R... |


In [29]:
# split data into train and test data
from sklearn.model_selection import train_test_split

# split the data
train_data, eval_set = train_test_split(tokens_df, random_state=8)

# create HuggingFace Dataset
train_ds = Dataset.from_pandas(train_data, split="train")
eval_ds = Dataset.from_pandas(eval_set, split="eval")
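`train_test_split` is called without `test_size`, so it holds out 25% of the rows, rounding the held-out count up. The arithmetic can be sketched without scikit-learn; the row count of 407 is an assumption I picked because it reproduces the 305 training examples reported in the training log further down.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Mimic train_test_split's sizing for a float test_size: ceil for test, rest for train."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 407 is an assumed row count, not a value printed anywhere in the notebook.
n_train, n_test = split_sizes(407)
print(n_train, n_test)  # 305 102
```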

In [30]:
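# tokenize datasets
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 ships without a padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # truncation=True cuts each text to the tokenizer's maximum length (1024 tokens for GPT-2)
    return tokenizer(examples["Text"], truncation=True)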

train_tok_ds = train_ds.map(tokenize_function, batched=True).shuffle(seed=42)
eval_tok_ds = eval_ds.map(tokenize_function, batched=True).shuffle(seed=42)


In [31]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

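training_args = TrainingArguments(
        output_dir="test_trainer",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        # with 'no', nothing is evaluated during training, so eval_dataset goes unused by train()
        evaluation_strategy='no',
        per_device_train_batch_size=4,
        num_train_epochs=2,
        # keep at most one checkpoint on disk
        save_total_limit=1,
        # larger than the 154 total optimization steps, so no intermediate checkpoint is written
        save_steps=1000)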

trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_tok_ds,
        eval_dataset=eval_tok_ds,
    )
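With `mlm=False`, the collator prepares batches for causal language modeling: as far as I understand it, it pads each batch to its longest sequence and copies `input_ids` into `labels`, replacing pad positions with -100 so the loss ignores them. The plain-Python sketch below mimics that behavior on toy token IDs; it is an illustration, not the library's code, and `collate_causal_lm` is a hypothetical name.

```python
PAD_ID = 50256        # GPT-2's eos_token_id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def collate_causal_lm(batch_ids, pad_id=PAD_ID):
    """Right-pad a batch and build labels as a copy of input_ids with pads masked."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # every position equal to pad_id is masked, not only the padding added here
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal_lm([[10, 11, 12], [20]])  # toy token IDs
print(batch["input_ids"])       # [[10, 11, 12], [20, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[10, 11, 12], [20, -100, -100]]
```

One side effect of setting `pad_token = eos_token` is that a genuine end-of-text token inside a sequence is masked the same way, because masking is done by token ID.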

In [32]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: __index_level_0__, Text. If __index_level_0__, Text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 305
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 154
  Number of trainable parameters = 124439808
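The 154 optimization steps in this log follow from the other numbers. A quick check, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped:

```python
import math

num_examples = 305      # "Num examples" from the log
batch_size = 4          # per_device_train_batch_size
num_epochs = 2          # num_train_epochs

# 305 / 4 = 76.25, so the last batch of each epoch holds a single example
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 77 154
```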



In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextGenerationPipeline

# save local version
# (run this in the same session as training: `model` and `tokenizer` must already
# exist, otherwise the save calls raise a NameError like the one below)
model.save_pretrained('./content/test_trainer')
tokenizer.save_pretrained('./content/test_trainer')

checkpoint = "./content/test_trainer"

# load into model and tokenizer
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
summary_gen = TextGenerationPipeline(model=model, tokenizer=tokenizer)

NameError: name 'model' is not defined

In [None]:
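# note: the fine-tuning texts start with 'Generate book summary with genres <genres>:\n',
# so this prompt's wording differs slightly from the training prefix
input_prompt = "Generate a book summary with genre fiction:\n"

# generate new story summary (max_length counts the prompt tokens too)
story = summary_gen(input_prompt, max_length=50, do_sample=True, temperature=0.5)
print(story)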

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Generate a book summary with genre fiction:\n The first chapters focus mainly on how the United States and New Zealand government, following World War A, were struggling with national identity issues arising from economic insecurity in the South Pacific. Through various legal issues,'}]
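The `temperature=0.5` argument divides the model's logits by 0.5 before sampling, which sharpens the distribution toward the most likely tokens. A minimal sketch of that effect with plain Python follows; the three logits are made-up values, not taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.5)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

With these logits the top token's probability rises from roughly 0.67 at temperature 1.0 to roughly 0.87 at 0.5, so sampled text becomes more predictable and less varied.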
