# Fine-Tuning GPT-2 with Data Parallelism

This notebook walks through a single fine-tuning step for a pretrained GPT-2 model, using PyTorch's `DataParallel` to split each batch across multiple GPUs when more than one is available.

## Prerequisites
- Python 3.x
- PyTorch
- Hugging Face Transformers and Datasets libraries
- A dataset for fine-tuning (e.g., WikiText-2)

## Steps
1. Install required libraries.
2. Load the pretrained GPT-2 model and tokenizer.
3. Tokenize the dataset.
4. Wrap the model in `DataParallel` for multi-GPU training.
5. Fine-tune the model.

# 1. Install required libraries. 

In [21]:
# %pip install -r requirements.txt

## Import required libraries

In [14]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch
from torch.nn.parallel import DataParallel
import time

# 2. Load the pretrained GPT-2 model and tokenizer


In [15]:

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# GPT-2 defines no padding token by default, so reuse the end-of-sequence token (eos_token) for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the first 1% of the WikiText-2 training split as a small example dataset
dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train[:1%]')

# 3. Tokenize the dataset

In [16]:

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# print("Tokenization completed successfully!")
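
Because the padding token is the same as `eos_token` here, padded positions cannot be told apart from real tokens by ID alone. The sketch below shows one way to build labels that skip padding. It assumes the `attention_mask` returned by the tokenizer marks real tokens with 1, and that the loss ignores the label `-100` (the default `ignore_index` of PyTorch's cross-entropy). `mask_padding_labels` is a hypothetical helper name, and the toy tensors use arbitrary token IDs apart from 50256, GPT-2's `eos_token` ID.

```python
import torch

def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index
    return labels

# Toy batch (illustrative values): 50256 is reused as the padding ID.
toy_input_ids = torch.tensor([[464, 3290, 50256, 50256],
                              [40, 588, 3797, 50256]])
toy_attention_mask = torch.tensor([[1, 1, 0, 0],
                                   [1, 1, 1, 0]])

toy_labels = mask_padding_labels(toy_input_ids, toy_attention_mask)
print(toy_labels)
```

With labels built this way, only real tokens contribute to the language-modeling loss.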

# 4. Use `DataParallel` for multi-GPU training.

In [18]:


# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available!")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    # Get the name of each GPU
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

        # Print memory usage
        print(f"Allocated memory on GPU {i}: {torch.cuda.memory_allocated(i) / 1024**2:.2f} MB")
        print(f"Reserved memory on GPU {i}: {torch.cuda.memory_reserved(i) / 1024**2:.2f} MB")
else:
    print("CUDA is not available.")

CUDA is not available.


In [12]:
# Move model to GPU(s) and wrap with DataParallel if multiple GPUs are available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move model to device
model = model.to(device)

# Wrap model with DataParallel if multiple GPUs are available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = DataParallel(model)
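
A minimal sketch of what the wrapper changes, using a small stand-in module instead of GPT-2 so it runs quickly on any machine. `TinyLossModel`, `demo_device`, and the other names are illustrative assumptions. The sketch shows three things: a loss returned from a wrapped model should be averaged to a scalar, the original module stays reachable under `.module`, and a batch is split along its first dimension in a way similar to `torch.chunk`.

```python
import torch
from torch import nn

class TinyLossModel(nn.Module):
    """Stand-in for GPT-2: forward returns a scalar loss, like `outputs.loss`."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, x, y):
        return nn.functional.mse_loss(self.linear(x), y)

demo_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tiny = TinyLossModel().to(demo_device)
if torch.cuda.device_count() > 1:
    tiny = nn.DataParallel(tiny)

x = torch.randn(8, 4, device=demo_device)
y = torch.randn(8, 1, device=demo_device)

# One loss per replica under DataParallel, a 0-dim scalar otherwise;
# .mean() yields a scalar in both cases.
demo_loss = tiny(x, y).mean()
demo_loss.backward()

# When wrapped, the original module sits under `.module` (useful for saving).
unwrapped = tiny.module if isinstance(tiny, nn.DataParallel) else tiny

# A batch of 8 split three ways along dim 0, as torch.chunk would do it.
chunk_sizes = [len(c) for c in torch.arange(8).chunk(3)]
print(chunk_sizes)  # [3, 3, 2]
```

On a machine with zero or one GPU the wrapper is skipped and the same code runs unchanged.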


# 5. Fine-tune the model.

In [17]:
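# Prepare inputs and labels for a small batch of data (batch size = 8)
input_ids = torch.tensor(tokenized_dataset['input_ids'][:8]).to(device)
# For causal language modeling the labels are the inputs; the model shifts them internally.
# Note: labels are a plain copy, so padded positions also contribute to the loss.
labels = input_ids.clone()

# Define optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Measure the start time
start_time = time.time()

# Forward pass (clear any gradients left over from a previous run of this cell)
optimizer.zero_grad()
outputs = model(input_ids, labels=labels)
# Under DataParallel the loss comes back as one value per GPU, so average it to a scalar
loss = outputs.loss.mean()
print(f"Loss: {loss.item()}")

# Backward pass and optimization step
loss.backward()
optimizer.step()

# Wait for queued GPU work to finish so the measured time is accurate
if torch.cuda.is_available():
    torch.cuda.synchronize()

# Measure the end time
end_time = time.time()

# Calculate and display elapsed time
elapsed_time = end_time - start_time
print(f"Training step completed in {elapsed_time:.2f} seconds!")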

Loss: 10.270589828491211
Training step completed in 1.15 seconds!
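
The cell above runs one optimization step on one batch. A fuller fine-tuning run repeats that step over mini-batches for several epochs. The sketch below shows the loop structure with a small linear model and synthetic data standing in for GPT-2 and the tokenized dataset, so it runs without any downloads. All names (`toy_model`, `loader`, `epoch_losses`) and the hyperparameters are illustrative assumptions, not tuned values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: the target is the sum of the four features.
features = torch.randn(64, 4)
targets = features.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(features, targets), batch_size=8, shuffle=True)

toy_model = nn.Linear(4, 1)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

toy_model.train()  # switch to training mode (enables layers such as dropout)
epoch_losses = []
for epoch in range(3):
    running = 0.0
    for batch_x, batch_y in loader:
        toy_optimizer.zero_grad()  # clear gradients from the previous step
        step_loss = nn.functional.mse_loss(toy_model(batch_x), batch_y)
        step_loss.backward()
        toy_optimizer.step()
        running += step_loss.item()
    epoch_losses.append(running / len(loader))
    print(f"Epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

For GPT-2 the same structure would apply, with batches of `input_ids` and labels in place of the synthetic tensors and `outputs.loss.mean()` as the step loss.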


**Next Steps:**
- Experiment with different datasets.
- Adjust hyperparameters like learning rate and batch size.