<a href="https://colab.research.google.com/github/YashNigam65/gitfolder/blob/master/genAI_concept_notebook/fine_tunning_and_transfer_learning/model_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# This cell imports all the necessary libraries for fine-tuning a GPT-2 model.
# It includes torch for tensor operations, transformers for model and tokenizer handling,
# datasets for managing the training data, and json for data manipulation.

#several components from the transformers library
import torch
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

from datasets import Dataset
import json
# "DataCollatorForLanguageModeling" for handling the model, tokenizer, training configurations, and data preparation.

In [4]:
# This cell defines a list of sample text data that will be used for fine-tuning the GPT-2 model.
# In a real-world scenario, this would be replaced with a much larger and more diverse dataset.
# Sample training data - replace with your own dataset
sample_data = [
    "The weather today is beautiful and sunny.",
    "Machine learning is revolutionizing technology.",
    "Python is a versatile programming language.",
    "Fine-tuning models requires careful preparation.",
    "Natural language processing has many applications.",
    "Deep learning models need quality training data.",
    "Transformers have changed how we approach NLP.",
    "Text generation can be improved with fine-tuning."
]

In [2]:
# This cell defines the 'prepare_dataset' function.
# This function takes a list of texts, a tokenizer, and a maximum sequence length.
# It tokenizes the input texts, pads or truncates them to the specified max_length,
# and sets the labels to be the same as the input_ids, which is standard for language modeling.


# Function is crucial for getting your text data ready for fine-tuning,
# It takes a list of texts, a tokenizer, and an optional maximum sequence length.
#

def prepare_dataset(texts, tokenizer, max_length=128):
    """
    Prepare the dataset for training
    """

    # tokenize_function that uses the provided tokenizer to convert the raw text into
    # numerical tokens. It handles truncation and padding to ensure all sequences have
    # the same length (up to max_length) and returns PyTorch tensors.
    def tokenize_function(examples):
        # Tokenize the texts
        tokenized = tokenizer(
            examples['text'],
            truncation=True,
            padding='max_length',
            max_length=max_length,
            return_tensors='pt'
        )
        # For language modeling, labels are the same as input_ids
        tokenized['labels'] = tokenized['input_ids'].clone()
        return tokenized

    # Create dataset
    dataset = Dataset.from_dict({'text': texts})

    tokenized_dataset = dataset.map(tokenize_function, batched=True)

    return tokenized_dataset

In [7]:
# This cell defines the 'fine_tune_model' function, which orchestrates the fine-tuning process.
# It loads a pre-trained GPT-2 model and its tokenizer, prepares the training dataset,
# configures training arguments (like output directory, number of epochs, batch size),
# initializes a Hugging Face Trainer, and then starts the training.
# Finally, it saves the fine-tuned model and tokenizer to a local directory.


def fine_tune_model():
    """
    Main function to fine-tune the model
    """
    # Initialize model and tokenizer
    model_name = "gpt2"  # You can change this to other models like "distilgpt2"
    print(f"Loading model: {model_name}")

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Add padding token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare dataset
    print("Preparing dataset...")
    train_dataset = prepare_dataset(sample_data, tokenizer)

    # Data collator for language modeling
    # DataCollatorForLanguageModeling is used to format the data batches
    # for training. TrainingArguments defines the training configuration,
    # such as output directory, number of epochs, batch size,
    # and logging settings.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # We're not doing masked language modeling
    )

    # Training arguments
    training_args = TrainingArguments(
        output_dir="./fine_tuned_model",
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        warmup_steps=10,
        logging_steps=10,
        save_steps=100,
        eval_strategy="no",  # Updated argument
        save_strategy="epoch",
        load_best_model_at_end=False,
        report_to=None,  # Disable wandb logging
        logging_dir=None,
    )

    # Initialize trainer
    # The Trainer class is the core component for training; it takes the model,
    # training arguments, dataset, data collator, and tokenizer.
    # The trainer.train() method starts the fine-tuning process.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    # Start training

    print("Starting fine-tuning...")
    trainer.train()

    # Save the fine-tuned model
    print("Saving fine-tuned model...")
    trainer.save_model("./fine_tuned_model")
    tokenizer.save_pretrained("./fine_tuned_model")

    print("Fine-tuning completed!")
    return model, tokenizer

In [8]:
# This cell defines the 'test_model' function.
# This function takes a trained model and tokenizer, sets the model to evaluation mode,
# and generates text based on a list of predefined prompts.
# It demonstrates how the fine-tuned model can be used to generate new text.
def test_model(model, tokenizer):
    """
    Test the fine-tuned model with sample generation
    """
    print("\nTesting fine-tuned model:")

    # Set model to evaluation mode
    model.eval()

    # Get the device of the model
    device = model.device

    test_prompts = [
        "The weather today",
        "Machine learning",
        "Python programming"
    ]

    for prompt in test_prompts:
        # Encode the prompt and move to the model's device
        input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

        # Generate text
        with torch.no_grad():
            output = model.generate(
                input_ids,
                max_length=50,
                num_return_sequences=1,
                temperature=0.8,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode and print
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"Prompt: '{prompt}'")
        print(f"Generated: '{generated_text}'")
        print("-" * 50)

In [9]:
# This cell defines the 'load_and_test_saved_model' function.
# It demonstrates how to load a fine-tuned model and its tokenizer from a saved directory.
# After loading, it calls the 'test_model' function to verify that the loaded model
# can still generate text effectively.

# Load and test saved model: The load_and_test_saved_model function
# demonstrates how to load a previously saved fine-tuned model and
# tokenizer using GPT2LMHeadModel.from_pretrained and GPT2Tokenizer.
# from_pretrained from the saved directory.
# It then calls test_model to evaluate the loaded model's performance.

def load_and_test_saved_model():
    """
    Load the saved fine-tuned model and test it
    """
    print("\nLoading saved fine-tuned model...")

    # Load the fine-tuned model and tokenizer
    model = GPT2LMHeadModel.from_pretrained("./fine_tuned_model")
    tokenizer = GPT2Tokenizer.from_pretrained("./fine_tuned_model")

    # Test the loaded model
    test_model(model, tokenizer)

In [10]:
# This cell installs the necessary Python packages: 'torch', 'transformers', and 'datasets'.
# These libraries are crucial for working with PyTorch, Hugging Face models, and data handling, respectively.
!pip install torch transformers datasets



In [11]:
# This is the main execution block of the notebook.
# It first disables WANDB logging, checks for GPU availability, and then
# calls the 'fine_tune_model', 'test_model', and 'load_and_test_saved_model' functions
# in sequence to perform the entire fine-tuning and testing workflow.
# It also includes basic error handling.
import os

if __name__ == "__main__":
    # Disable wandb logging explicitly
    os.environ["WANDB_DISABLED"] = "true"

    # Check if CUDA is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    try:
        # Fine-tune the model
        model, tokenizer = fine_tune_model()

        # Test the fine-tuned model
        test_model(model, tokenizer)

        # Demonstrate loading the saved model
        load_and_test_saved_model()

    except Exception as e:
        print(f"An error occurred: {e}")
        print("Make sure you have the required packages installed:")
        print("pip install torch transformers datasets")

Using device: cuda
Loading model: gpt2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Preparing dataset...
Dataset({
    features: ['text'],
    num_rows: 8
})


Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


Starting fine-tuning...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,3.8295


Saving fine-tuned model...
Fine-tuning completed!

Testing fine-tuned model:


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Prompt: 'The weather today'
Generated: 'The weather today will be beautiful. We will have a fantastic, sunny day."

In addition to the weather forecast, the team will be using high-definition video to explore the terrain.

"We are going to be able to use'
--------------------------------------------------
Prompt: 'Machine learning'
Generated: 'Machine learning is a rapidly evolving technology, but it has its challenges. Machine learning is still evolving. With that in mind, it's best to understand how algorithms work.

Machine Learning is an application of machine learning to different applications. The main'
--------------------------------------------------
Prompt: 'Python programming'
Generated: 'Python programming is about designing and implementing code that will work across different platforms. This helps to bring languages that require multiple programming interfaces together.

The key difference between Python programming and C/C++ programming is how best to get the language to