Basic text generation

In [1]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


In [2]:
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [3]:
# Encoding the prompt to get the input ids
prompt = "Dear boss ... "
input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch

# Generate text using the model
outputs = model.generate(input_ids, max_length = 100)
tokenizer.decode(outputs[0], skip_special_tokens=True) # decode output to text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"Dear boss ... \xa0I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able"

In [5]:
# Simplified text generation function
def simple_text_generation(prompt, model, tokenizer, max_length=100):
    # Encode the prompt to get the input IDs
    input_ids = tokenizer.encode(prompt, return_tensors="pt")  # pt = PyTorch

    # Generate text using the model, honoring the caller's max_length
    outputs = model.generate(input_ids, max_length=max_length)

    # Decode the generated output IDs back into text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Notice how repetitive the output is: with the default greedy decoding, the model keeps picking the same most-likely continuation. We need to fix this.

In [6]:
# Test the function
prompt = "Dear boss ... "
text_generated = simple_text_generation(prompt,
                                        model,
                                        tokenizer,
                                        max_length = 100)
print(text_generated)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear boss ...  I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able
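The repetition above can often be reduced through decoding options on `generate`. The following is a minimal sketch rather than the notebook's own method: `generate_with_less_repetition` is a hypothetical helper, and the values `no_repeat_ngram_size=3` and `repetition_penalty=1.2` are illustrative assumptions that would need tuning. It also passes `attention_mask` and `pad_token_id` explicitly, which should silence the warnings shown earlier.

```python
def generate_with_less_repetition(prompt, model, tokenizer, max_length=100):
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")

    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        no_repeat_ngram_size=3,   # assumed value: never repeat any 3-gram
        repetition_penalty=1.2,   # assumed value: penalize already-used tokens
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

It is called the same way as `simple_text_generation`, for example `generate_with_less_repetition("Dear boss ... ", model, tokenizer)`.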


Fine Tuning

In [7]:
# Load dataset (scientific research abstracts related to machine learning)
data = [
    "This paper presents a new method for improving the performance of machine learning models by using data augmentation techniques.",
    "We propose a novel approach to natural language processing that leverages the power of transformers and attention mechanisms.",
    "In this study, we investigate the impact of deep learning algorithms on the accuracy of image recognition tasks.",
    "Our research demonstrates the effectiveness of transfer learning in enhancing the capabilities of neural networks.",
    "This work explores the use of reinforcement learning for optimizing decision-making processes in complex environments.",
    "We introduce a framework for unsupervised learning that significantly reduces the need for labeled data.",
    "The results of our experiments show that ensemble methods can substantially boost model performance.",
    "We analyze the scalability of various machine learning algorithms when applied to large datasets.",
    "Our findings suggest that hyperparameter tuning is crucial for achieving optimal results in machine learning applications.",
    "This research highlights the importance of feature engineering in the context of predictive modeling."
]

In [8]:
# Tokenization
# All inputs in a batch must have the same length.
# Shorter inputs are extended with dummy tokens at the end; this is called padding.
# GPT-2 has no padding token by default, so reuse the end-of-sequence token.

tokenizer.pad_token = tokenizer.eos_token
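To make the idea of padding concrete, here is a small pure-Python sketch of what padding to a fixed length does, together with the attention mask that marks real tokens (1) versus padding (0). `pad_to_length` is a hypothetical helper for illustration only, not part of `transformers`, and the truncation of over-long inputs is an assumption the original cells do not make.

```python
def pad_to_length(token_ids, max_length, pad_id):
    # Assumption: inputs longer than max_length are cut off
    token_ids = list(token_ids)[:max_length]
    num_pad = max_length - len(token_ids)

    padded = token_ids + [pad_id] * num_pad       # dummy tokens at the end
    mask = [1] * len(token_ids) + [0] * num_pad   # 1 = real token, 0 = padding
    return padded, mask


ids, mask = pad_to_length([1212, 3348, 10969], max_length=5, pad_id=50256)
print(ids)   # [1212, 3348, 10969, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0]
```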

In [9]:
# Tokenize each sentence, padding it to a fixed length of 50 tokens
Tokenized_data = [
    tokenizer.encode_plus(
        sentence,
        add_special_tokens=True,
        return_tensors="pt",
        padding="max_length",
        max_length=50,
    )
    for sentence in data
]



In [10]:
Tokenized_data[:2]    

[{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0]])},
 {'input_ids': tensor([[ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 502

In [12]:
# Isolate the input IDs and the attention masks
inputs_ids = [item['input_ids'].squeeze() for item in Tokenized_data]
attention_masks = [item['attention_mask'].squeeze() for item in Tokenized_data]
attention_masks[:2]

[tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0])]

In [13]:
# Stack the per-sentence tensors into single 2-D batch tensors
# (one row per sentence), the shape the model expects as input

inputs_ids = torch.stack(inputs_ids)
attention_masks = torch.stack(attention_masks)
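The encode, squeeze, and stack steps above can probably be collapsed into one call, since a Hugging Face tokenizer accepts a list of sentences and returns batched tensors directly. This is a sketch under that assumption; `tokenize_batch` is a hypothetical helper name, and `truncation=True` is an addition that the original cells do not use.

```python
def tokenize_batch(sentences, tokenizer, max_length=50):
    # GPT-2 ships without a pad token, so fall back to the EOS token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    encoded = tokenizer(
        list(sentences),
        padding="max_length",   # pad every sentence to max_length
        truncation=True,        # assumed: cut sentences longer than max_length
        max_length=max_length,
        return_tensors="pt",    # one 2-D tensor per field, one row per sentence
    )
    return encoded["input_ids"], encoded["attention_mask"]
```

With the notebook's objects this would be `inputs_ids, attention_masks = tokenize_batch(data, tokenizer)`.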

In [15]:
inputs_ids


tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
           286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
        [ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
         17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
        [  818,   428,  2050,    11,   356,  9161,   262,  2928,   286,  2769,
          4673, 16113,   319,   262,  9922,   286,  2939,  9465,  8861,    13,
         50256, 50256, 50256, 50256, 50256, 50256,

In [16]:
# pad_sequence pads variable-length sequences to a common length.
# The inputs here were already padded to max_length=50 by the tokenizer,
# so this step leaves them unchanged; it only matters for unpadded inputs.
padded_input_ids = pad_sequence(
    inputs_ids,
    batch_first=True,
    padding_value=tokenizer.eos_token_id)  # pad with the end-of-sequence token ID

# Pad the attention masks the same way, using 0 to mark padding positions
padded_attention_masks = pad_sequence(
    attention_masks,
    batch_first=True,
    padding_value=0)
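A small standalone illustration of what `pad_sequence` does may help here. The toy tensors are made up for the example; the second half shows that rows which already share a length come back unchanged, which is the situation in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Sequences of different lengths are padded up to the longest one
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4])]
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded.tolist())  # [[1, 2, 3], [4, 0, 0]]

# Rows that already have equal length are returned as-is
same = torch.tensor([[1, 2], [3, 4]])
unchanged = pad_sequence(list(same), batch_first=True, padding_value=0)
print(torch.equal(unchanged, same))  # True
```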

Create the dataset, including the labels

In [20]:
class TextDataset(Dataset):
    """Wraps the padded input IDs and attention masks for use with a DataLoader."""

    def __init__(self, input_ids, attention_masks):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        # For causal language modeling, the labels are a copy of the inputs
        self.labels = input_ids.clone()

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_masks': self.attention_masks[idx],
            'labels': self.labels[idx],
        }


dataset = TextDataset(padded_input_ids, padded_attention_masks)
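Because the labels are a plain copy of the inputs, the loss is also computed on the padding positions. A common alternative, shown here as a hedged sketch and not as what this notebook does, is to set the labels to `-100` wherever the attention mask is 0, since the Hugging Face causal LM loss ignores that index. `mask_padding_labels` is a hypothetical helper name.

```python
import torch


def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    # Copy so the original input IDs are left untouched
    labels = input_ids.clone()
    # Positions where the mask is 0 are padding; exclude them from the loss
    labels[attention_mask == 0] = ignore_index
    return labels


input_ids = torch.tensor([[11, 12, 50256, 50256]])
attention_mask = torch.tensor([[1, 1, 0, 0]])
print(mask_padding_labels(input_ids, attention_mask).tolist())
# [[11, 12, -100, -100]]
```

One trade-off to keep in mind: the tokenized sentences in this notebook contain no end-of-sequence token of their own, so with masked labels the model gets no training signal for when to stop unless one is appended explicitly.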

In [21]:
dataset[:2]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
         [ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]),
 'attention_masks': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


## Fine-tuning GPT-2

In [22]:
dataloader=DataLoader(dataset,batch_size=2,shuffle=True)

In [23]:
dataloader

<torch.utils.data.dataloader.DataLoader at 0x256841ff290>

In [27]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()

# Training loop
for epoch in range(10):
    for batch in dataloader:
        # Unpack the input IDs, attention mask, and labels
        input_ids = batch['input_ids']
        attention_mask = batch['attention_masks']
        labels = batch['labels']

        # Reset the gradients accumulated in the previous iteration
        optimizer.zero_grad()

        # Forward pass: the model shifts the labels internally,
        # so the labels are simply a copy of the input IDs
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)

        loss = outputs.loss

        # Backward pass: compute the gradients of the loss
        loss.backward()

        # Update the model parameters
        optimizer.step()

    # Print the loss of the last batch in this epoch
    print(f"Epoch {epoch + 1} - Loss: {loss.item()}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch 1 - Loss: 1.3565013408660889
Epoch 2 - Loss: 1.15887451171875
Epoch 3 - Loss: 0.8793573379516602
Epoch 4 - Loss: 0.8603087663650513
Epoch 5 - Loss: 0.6487030982971191
Epoch 6 - Loss: 0.5872219204902649
Epoch 7 - Loss: 0.5017762780189514
Epoch 8 - Loss: 0.33895784616470337
Epoch 9 - Loss: 0.3879915475845337
Epoch 10 - Loss: 0.16615575551986694
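The values above are the loss of the last batch in each epoch, which is noisy with a batch size of 2 and may explain the small rise at epoch 9. A sketch of reporting the mean loss over all batches instead is given below; `train_one_epoch` is a hypothetical helper, and it assumes the batch keys used by `TextDataset` in this notebook.

```python
def train_one_epoch(model, dataloader, optimizer):
    """Run one pass over the dataloader and return the mean batch loss."""
    model.train()
    total_loss = 0.0
    num_batches = 0

    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_masks"],  # key name used by TextDataset
            labels=batch["labels"],
        )
        outputs.loss.backward()
        optimizer.step()

        total_loss += outputs.loss.item()
        num_batches += 1

    # Guard against an empty dataloader
    return total_loss / max(num_batches, 1)
```

The outer loop then becomes `for epoch in range(10): print(epoch + 1, train_one_epoch(model, dataloader, optimizer))`.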


In [30]:
# Define a function to generate text
def generate_text(prompt, model, tokenizer, max_length=100):
    # Encode the prompt to obtain input IDs and attention mask
    inputs = tokenizer.encode_plus(prompt, return_tensors="pt")

    # Extract input IDs and attention mask
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Passing the attention mask avoids the warning seen earlier
    outputs = model.generate(input_ids, attention_mask=attention_mask,
                             max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [33]:
prompt="In this research ,we "
text_generated = generate_text(prompt, model, tokenizer, max_length = 500)
print(text_generated)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this research ,we ert the use of machine learning algorithms for image recognition tasks.
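One thing worth checking: the training cell calls `model.train()` and nothing switches the model back, so dropout is probably still active while generating, which may contribute to artifacts such as the stray "ert" above. The sketch below, with the hypothetical helper name `generate_in_eval_mode`, switches to evaluation mode for the duration of the call and then restores the previous mode. It also sets `pad_token_id` explicitly, which should remove the remaining warning.

```python
def generate_in_eval_mode(prompt, model, tokenizer, max_length=100):
    was_training = model.training   # remember the current mode
    model.eval()                    # disable dropout for generation
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    finally:
        model.train(was_training)   # restore the previous mode
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```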
