In [10]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install evaluate



# Fine-Tuning a Pre-Trained Model

In [11]:
import transformers
import math
from google.colab import drive
from transformers import AutoTokenizer, AutoModelForCausalLM

We are going to fine-tune DistilGPT-2, a distilled version of GPT-2, on the Hugging Face OpenWebText dataset.

Select the pre-trained model to use.

In [12]:
checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)


Loading weights:   0%|          | 0/76 [00:00<?, ?it/s]

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


# Preparing the Data

Load the OpenWebText dataset for fine-tuning. The full train split has about 8 million documents, so we load only the first 500.

In [13]:
from datasets import load_dataset

In [17]:
raw_datasets = load_dataset("openwebtext", split="train[:500]")
raw_datasets

Resolving data files:   0%|          | 0/80 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/80 [00:00<?, ?files/s]

Generating train split:   0%|          | 0/8013769 [00:00<?, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 500
})

In [18]:
raw_datasets[12]

{'text': 'Attention! This news was published on the old version of the website. There may be some problems with news display in specific browser versions.\n\nThunder League Division Structure\n\nDear players! You have become acquainted with the Thunder League and watched some of the matches by the pro division teams. Your participation in the league events and the purchasing of League “Dog tags” has made it possible to increase the prize pool of the pro division. We thank you for your support with the eSports development in the game!\n\nThe time has come for our supporters to become participants - gather your teams and start to make your way to the top of the Thunder League - to the pro division!\n\nThe Qualifying tournament starts on February 2016\n\nWe announce two more divisions in the league:\n\nNovice Division – a division for the novice teams.\n\nSemi Pro Division – medium division.\n\nSemi Pro Division will be established after the qualifying Novice Division tournament, where al

In [19]:
raw_datasets.column_names

['text']

Tokenize the whole dataset.

In [20]:
def tokenize_function(example):
    return tokenizer(example["text"])


In [21]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=raw_datasets.column_names)
tokenized_datasets

Map (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1624 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1217 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2755 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1172 > 1024). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 500
})

In [22]:
tokenized_datasets[1]

{'input_ids': [14282,
  7705,
  286,
  1181,
  5073,
  2605,
  11185,
  4446,
  379,
  257,
  1923,
  7903,
  287,
  520,
  13,
  5593,
  319,
  3909,
  13,
  357,
  21102,
  1437,
  40124,
  14,
  464,
  2669,
  2947,
  8,
  198,
  198,
  33939,
  2166,
  12,
  16737,
  5073,
  2605,
  373,
  4058,
  416,
  257,
  18862,
  10330,
  287,
  11565,
  319,
  3583,
  11,
  475,
  262,
  3234,
  6150,
  287,
  46008,
  13310,
  1573,
  319,
  1771,
  8976,
  2311,
  13,
  10477,
  5831,
  286,
  16033,
  561,
  5380,
  257,
  16369,
  13,
  198,
  198,
  464,
  5711,
  33922,
  257,
  17347,
  3280,
  284,
  1771,
  2605,
  550,
  925,
  257,
  3424,
  16085,
  286,
  1936,
  1263,
  25430,
  319,
  3431,
  1755,
  13,
  3412,
  611,
  673,
  857,
  407,
  28615,
  287,
  11565,
  11,
  607,
  584,
  19017,
  4574,
  607,
  5699,
  284,
  262,
  4390,
  4787,
  11872,
  772,
  355,
  262,
  15394,
  24135,
  5831,
  19982,
  284,
  1803,
  319,
  351,
  465,
  46062,
  1923,
  13,
  198,
  

# Prepare the examples for input into the model

In [23]:
from itertools import chain

block_size = tokenizer.model_max_length  # 1024 for this checkpoint

def group_texts(examples):
    # Concatenate the token lists of every example in the batch, column by column.
    concatenated_examples = {
        k: list(chain.from_iterable(ex for ex in examples[k] if isinstance(ex, list)))
        for k in examples.keys()
    }
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. Padding the
    # last block is an alternative; customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are a copy of the inputs;
    # the model shifts them by one position internally.
    result["labels"] = result["input_ids"].copy()
    return result

The tokenized texts are concatenated and split into chunks of block_size tokens, which are ready to be fed into the model.
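
To make the grouping concrete, here is a minimal standalone sketch on made-up token IDs, with a block size of 4 instead of 1024. The name group_texts_toy and the toy batch are illustrative only and are not part of the notebook.

```python
from itertools import chain

def group_texts_toy(examples, block_size):
    """Concatenate each column of a batch and cut it into equal blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

# Three "documents" of 3, 5 and 2 tokens: 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries disappear in the blocks: the first block ends with the first token of the second document, and the last two tokens (9 and 10) are dropped because they do not fill a block.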

In [24]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

In [25]:
lm_datasets

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 545
})
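
The row count changes from 500 to 545 because rows are now blocks of 1024 tokens rather than documents, so the dataset holds 545 × 1024 = 558,080 tokens. With batched=True, map applies group_texts to one batch at a time, so a remainder is dropped at the end of every batch, not once for the whole dataset. The sketch below illustrates that effect with hypothetical token counts; count_blocks is an illustrative helper, not part of the notebook.

```python
def count_blocks(token_counts, block_size, batch_size):
    """Blocks produced when each batch is concatenated and chunked separately."""
    blocks = 0
    for start in range(0, len(token_counts), batch_size):
        batch_tokens = sum(token_counts[start : start + batch_size])
        blocks += batch_tokens // block_size  # the remainder of each batch is dropped
    return blocks

# Hypothetical token counts for six documents (7758 tokens in total).
token_counts = [1624, 900, 1217, 90, 2755, 1172]
print(count_blocks(token_counts, block_size=1024, batch_size=6))  # 7 (one batch)
print(count_blocks(token_counts, block_size=1024, batch_size=3))  # 6 (two batches)
```

Smaller batches, or more worker processes each handling their own shard, therefore tend to discard slightly more tokens.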

Convert the input token IDs back into text to check that the data was preprocessed correctly.

In [26]:
tokenizer.decode(lm_datasets[1]["input_ids"])

' of injured. The clinic, set up under several tents, was a godsend to the few who were lucky to have been brought there.\n\nRetired Army Lt. Gen. Russel Honore, who led relief efforts for Hurricane Katrina in 2005, said the evacuation of the clinic\'s medical staff was unforgivable.\n\n"Search and rescue must trump security," Honoré said. "I\'ve never seen anything like this before in my life. They need to man up and get back in there."\n\nHonoré drew parallels between the tragedy in New Orleans, Louisiana, and in Port-au-Prince. But even in the chaos of Katrina, he said, he had never seen medical staff walk away.\n\n"I find this astonishing these doctors left," he said. "People are scared of the poor."\n\nCNN\'s Justine Redman, Danielle Dellorto and John Bonifield contributed to this report.Former secretary of state Hillary Clinton meets voters at a campaign rally in St. Louis on Saturday. (Melina Mara/The Washington Post)\n\nDemocratic front-runner Hillary Clinton was ahead by a sli

# Fine-Tune the Model

Train the pre-trained model on the new dataset.

In [27]:
from transformers import TrainingArguments, Trainer

Define the hyperparameters and other settings for fine-tuning the pre-trained model on the preprocessed dataset with the TrainingArguments class.

In [29]:
model_name = checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-openwebtext",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False
)

Set up the trainer.

In [30]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
    eval_dataset=lm_datasets,
)
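
The trainer above receives the same dataset for training and evaluation, so the validation loss is measured on data the model has already seen. A minimal sketch of holding out part of the blocks follows, written with plain Python lists so that it runs without the notebook's state. The helper split_blocks and the 10% fraction are illustrative choices, not the notebook's method. With the datasets library, the equivalent call would be lm_datasets.train_test_split(test_size=0.1, seed=42), which returns "train" and "test" splits.

```python
import random

def split_blocks(blocks, eval_fraction=0.1, seed=42):
    """Shuffle the blocks and hold out a fraction of them for evaluation."""
    indices = list(range(len(blocks)))
    random.Random(seed).shuffle(indices)
    n_eval = max(1, int(len(blocks) * eval_fraction))
    eval_idx = set(indices[:n_eval])
    train = [b for i, b in enumerate(blocks) if i not in eval_idx]
    held_out = [b for i, b in enumerate(blocks) if i in eval_idx]
    return train, held_out

# Stand-in for the 545 blocks in lm_datasets.
blocks = [f"block-{i}" for i in range(545)]
train_blocks, eval_blocks = split_blocks(blocks)
print(len(train_blocks), len(eval_blocks))  # 491 54
```

Blocks cut from the same document can still land on both sides of such a split; splitting the raw documents before tokenizing and grouping avoids that overlap.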

Train.

In [31]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Epoch | Training Loss | Validation Loss
------+---------------+----------------
1     | No log        | 3.263235
2     | No log        | 3.23682
3     | No log        | 3.227087


Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

TrainOutput(global_step=207, training_loss=3.387345023777174, metrics={'train_runtime': 423.5034, 'train_samples_per_second': 3.861, 'train_steps_per_second': 0.489, 'total_flos': 427220187217920.0, 'train_loss': 3.387345023777174, 'epoch': 3.0})
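
The numbers in TrainOutput can be reproduced by hand. The sketch below assumes the TrainingArguments defaults for everything the notebook leaves unset: a per-device batch size of 8, 3 epochs, a single device, no gradient accumulation, and logging every 500 steps.

```python
import math

num_examples = 545     # rows in lm_datasets
batch_size = 8         # assumed default per_device_train_batch_size, one device
num_epochs = 3         # assumed default num_train_epochs
logging_steps = 500    # assumed default logging_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 69 207

# Throughput implied by the reported runtime of 423.5034 seconds.
runtime = 423.5034
print(round(total_steps / runtime, 3))                # 0.489
print(round(num_examples * num_epochs / runtime, 3))  # 3.861

# The training loss is only logged every logging_steps steps.
print(total_steps < logging_steps)  # True
```

The run finishes after 207 steps, before the first logging step is reached, which is consistent with the "No log" entries in the Training Loss column. Passing a smaller logging_steps value to TrainingArguments would fill that column.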

# Evaluation

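Evaluate on the dataset. The trainer was given the same data for training and evaluation, so this perplexity measures how well the model fits the training data, not how well it generalizes.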

In [32]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 25.21
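
Perplexity is the exponential of the mean per-token cross-entropy loss, so lower loss means lower perplexity. As a small worked example, the sketch below converts the validation losses reported after each epoch; the helper name perplexity is illustrative.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Validation losses reported after each epoch in the training log.
val_losses = [3.263235, 3.23682, 3.227087]
for epoch, loss in enumerate(val_losses, start=1):
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

A perplexity of about 25 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 25 tokens at each position.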


Inspect an example from the dataset.

In [33]:
print(raw_datasets[300]['text'])

Mars Observer was one of three NASA Mars missions lost in the 1990s because of technical errors, and not as part of a broader conspiracy. The dark side of space disaster theories Space disasters attract so much public attention and often involve such complex and subtle sequences of events that there’s an entire Internet literature of “crackpot causes” on par with JFK assassination myths. To the degree that innovative analysis is often critical to reconstructing—from partial and often garbled evidence—a shocking causal sequence leading from goodness to disaster, the initial investigation period demands that critical judgment be held somewhat in check so as not to discourage imagination. However, once a logical reconstruction gels, is tested, and then is ultimately verified by being implemented and hence reducing future flight hazards, that official explanation achieves a substantial level of authenticity. But not to everyone’s satisfaction, apparently, as a search of still-thriving non-

Test the fine-tuned model on a text prompt entered by the user.

In [41]:
user_input_text = input("Enter text prompt: ")

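# Tokenize the prompt; the tokenizer also returns the attention mask
inputs = tokenizer(user_input_text, return_tensors='pt', padding=True)
# Move the tensors to the model's device instead of assuming a GPU is present
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)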

# Generate text
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=100,  # total length in tokens, prompt included
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\nGenerated text:")
print(generated_text)

Enter text prompt: the first step in solving this problem is

Generated text:
the first step in solving this problem is to apply a single, unified approach.”

The idea is simple:

“There is a way to use the same approach to solve this problem. In practice, the first step is to apply a single, unified approach.”

The second step is to apply a single, unified approach. In practice, the first step is to apply a single, unified approach. In practice, the first step is to apply a single
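
The generate call above samples with temperature=0.7 and top_p=0.9. As a rough illustration of what those two settings do to a next-token distribution, here is a standalone sketch on a made-up five-token vocabulary. The function names and logits are hypothetical, and the library's own implementation works on tensors and differs in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.5, 0.5, -1.0, -3.0]  # made-up scores for five tokens
plain = softmax(logits)
sharpened = softmax(logits, temperature=0.7)
nucleus = top_p_filter(sharpened, top_p=0.9)

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharpened])
print([round(p, 3) for p in nucleus])
```

In this toy case only the two most likely tokens survive the filter. Sampling settings like these do not by themselves prevent loops such as the repeated sentence in the output above; generate also accepts repetition_penalty and no_repeat_ngram_size, which may be worth trying.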
