# Using Large Language Models (LLMs) with Benchmark Datasets

This notebook demonstrates how to use pre-trained Large Language Models (LLMs) for text generation and analysis on the IMDB movie review dataset, a standard benchmark for sentiment analysis.

## What are LLMs?

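Large Language Models are neural networks trained on large text corpora to model the statistical structure of language. This lets them generate human-like text and produce representations that are useful for tasks such as classification. Examples include GPT-2 and GPT-3 (generative, decoder-only models) and BERT (an encoder model typically used for analysis tasks).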

## IMDB Dataset

The IMDB dataset contains 50,000 movie reviews labeled as positive or negative, split evenly into 25,000 training and 25,000 test examples. It is a widely used benchmark for binary sentiment analysis.

In [None]:
# Import required libraries
import torch
from transformers import (
    GPT2LMHeadModel, 
    GPT2Tokenizer,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline
)
from datasets import load_dataset
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## Part 1: Text Generation with GPT-2

Let's start by using GPT-2, a popular generative LLM, to generate movie review-style text.

In [None]:
# Load GPT-2 model and tokenizer
print('Loading GPT-2 model...')
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_model = gpt2_model.to(device)
gpt2_model.eval()

# Set pad token
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

print(f'Model parameters: {sum(p.numel() for p in gpt2_model.parameters()):,}')

In [None]:
# Function to generate text
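def generate_text(prompt, max_length=100, num_return_sequences=1):
    """
    Generate text from a prompt using GPT-2 with top-k / nucleus sampling.

    max_length counts the prompt tokens plus the newly generated tokens.
    """
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = gpt2_tokenizer(prompt, return_tensors='pt').to(device)
    
    with torch.no_grad():
        outputs = gpt2_model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            no_repeat_ngram_size=2,  # Never repeat the same 2-token sequence
            top_k=50,                # Sample only from the 50 most likely tokens
            top_p=0.95,              # ...further limited to 95% cumulative probability
            temperature=0.8,         # Values below 1.0 make sampling less random
            do_sample=True,
            pad_token_id=gpt2_tokenizer.eos_token_id
        )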
    
    generated_texts = []
    for output in outputs:
        text = gpt2_tokenizer.decode(output, skip_special_tokens=True)
        generated_texts.append(text)
    
    return generated_texts

# Generate movie review-style text
prompts = [
    "This movie was absolutely",
    "I really enjoyed the film because",
    "The acting in this movie was"
]

print("Generated Movie Review Texts:\n")
for prompt in prompts:
    generated = generate_text(prompt, max_length=80)
    print(f"Prompt: '{prompt}'")
    print(f"Generated: {generated[0]}")
    print("-" * 80)
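
The sampling parameters passed to `generate` above (`temperature`, `top_k`, `top_p`) can be easier to reason about with a toy version. The following is a minimal pure-Python sketch of how these three filters might be combined for a single decoding step. It is not the Transformers implementation, it ignores `no_repeat_ngram_size`, and the function name `sample_next_token` and the toy logits are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Pick one token id from raw logits using temperature, top-k, then top-p."""
    # Temperature: dividing by a value below 1.0 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep the ids of the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (max-subtraction for numerical stability).
    top = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - top) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit renormalization is needed.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]

# Hypothetical logits for a 5-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_next_token(toy_logits, top_k=3, rng=random.Random(0)))
```

Setting `top_k=1` in this sketch reduces it to greedy decoding, while raising `temperature` or `top_p` lets less likely tokens through.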

## Part 2: Load IMDB Benchmark Dataset

Now let's load the IMDB dataset to analyze real movie reviews.

In [None]:
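# Load IMDB dataset
print('Loading IMDB dataset...')
# The test split is ordered by label, so shuffle before taking a subset;
# a plain 'test[:1000]' slice would contain reviews of a single class only.
imdb_dataset = load_dataset('imdb', split='test').shuffle(seed=42).select(range(1000))  # Subset for faster demo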

print(f'Dataset size: {len(imdb_dataset)}')
print(f'Features: {imdb_dataset.features}\n')

# Show sample reviews
print('Sample reviews:')
for i in range(3):
    review = imdb_dataset[i]
    label = 'Positive' if review['label'] == 1 else 'Negative'
    print(f"\nReview {i+1} ({label}):")
    print(review['text'][:200] + '...')
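
A subset of a dataset is not guaranteed to contain both classes in equal proportion, so it can be worth checking the label distribution before evaluating on it. The sketch below uses only the standard library; the helper name `label_distribution` and the example labels are hypothetical stand-ins for the labels of `imdb_dataset`.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction} for a sequence of integer class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Hypothetical labels (0 = negative, 1 = positive).
example_labels = [0, 0, 1, 1, 1, 0, 1, 0]
print(label_distribution(example_labels))
```

In the notebook, the same helper could presumably be called on `imdb_dataset['label']`; a result far from an even split would make plain accuracy harder to interpret.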

## Part 3: Sentiment Analysis with Pre-trained LLM

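Let's use a pre-trained DistilBERT model fine-tuned for sentiment analysis on SST-2 (the Stanford Sentiment Treebank). The model was not fine-tuned on IMDB, so this evaluation measures how well it transfers to a different review dataset.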

In [None]:
# Load sentiment analysis pipeline
print('Loading sentiment analysis model...')
sentiment_pipeline = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)

print('Model loaded successfully!')

In [None]:
# Analyze sentiment of sample reviews
num_samples = 100
correct = 0
predictions = []
true_labels = []

print(f'Analyzing {num_samples} reviews...')
for i in tqdm(range(num_samples)):
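    review = imdb_dataset[i]
    text = review['text']
    true_label = review['label']
    
    # Get prediction; truncate long reviews to the model's 512-token limit
    # (slicing the string would cut by characters, not tokens)
    result = sentiment_pipeline(text, truncation=True, max_length=512)[0]
    pred_label = 1 if result['label'] == 'POSITIVE' else 0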
    
    predictions.append(pred_label)
    true_labels.append(true_label)
    
    if pred_label == true_label:
        correct += 1

accuracy = correct / num_samples
print(f'\nAccuracy on {num_samples} samples: {accuracy:.2%}')
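
An accuracy measured on only 100 reviews carries noticeable sampling uncertainty. One common way to express this is a Wilson score interval for a binomial proportion; the sketch below is a standard-library illustration, and the function name `wilson_interval` and the counts used are hypothetical rather than results from this notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width

# Hypothetical counts: 90 correct predictions out of 100 reviews.
low, high = wilson_interval(90, 100)
print(f"Accuracy 90.00%, approx. 95% interval: [{low:.2%}, {high:.2%}]")
```

Increasing `num_samples` narrows the interval, at the cost of a longer evaluation run.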

## Part 4: Visualize Results

In [None]:
# Create confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

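# Confusion matrix (fix the label order so rows and columns always match
# the tick labels, even if one class is missing from the sample)
cm = confusion_matrix(true_labels, predictions, labels=[0, 1])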

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title(f'Confusion Matrix (Accuracy: {accuracy:.2%})')
plt.show()

# Classification report
print('\nClassification Report:')
print(classification_report(true_labels, predictions, labels=[0, 1],
                          target_names=['Negative', 'Positive'],
                          zero_division=0))
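
The numbers in the classification report can be derived directly from the confusion matrix. As a rough sketch, assuming the scikit-learn layout for binary labels 0 and 1 (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`), the metrics for the positive class could be computed as follows. The helper name `binary_metrics` and the counts are hypothetical.

```python
def binary_metrics(cm):
    """Compute positive-class metrics from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    # Precision: of all reviews predicted positive, how many really are positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all truly positive reviews, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, not results from this notebook.
example_cm = [[40, 10], [5, 45]]
for name, value in binary_metrics(example_cm).items():
    print(f"{name}: {value:.3f}")
```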

## Part 5: Analyze Specific Examples

In [None]:
# Show some predictions with confidence scores
print('Sample Predictions with Confidence Scores:\n')
for i in range(5):
    review = imdb_dataset[i]
    text = review['text'][:200]  # Show first 200 chars
    true_label = 'Positive' if review['label'] == 1 else 'Negative'
    
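    # Let the tokenizer truncate to 512 tokens instead of slicing characters
    result = sentiment_pipeline(review['text'], truncation=True, max_length=512)[0]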
    pred_label = result['label']
    confidence = result['score']
    
    print(f"Review {i+1}:")
    print(f"Text: {text}...")
    print(f"True: {true_label} | Predicted: {pred_label} (confidence: {confidence:.2%})")
    print("-" * 80)
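
The confidence score reported by the pipeline is, as far as we understand it, the softmax probability of the predicted class computed from the model's output logits. The sketch below illustrates that step in plain Python. The label order `("NEGATIVE", "POSITIVE")` is an assumption about this particular model, and the helper names and logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # Subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_and_score(logits, id2label=("NEGATIVE", "POSITIVE")):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one review.
print(label_and_score([-2.0, 3.0]))
```

A high score only says the model strongly prefers one class; it is not necessarily a well-calibrated probability of being correct, especially on data that differs from the fine-tuning set.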

## Key Takeaways

1. **LLM Capabilities**: Large Language Models can both generate text (GPT-2) and classify text (DistilBERT)
2. **Pre-trained Models**: Using pre-trained models saves significant time and computational resources
3. **Benchmark Datasets**: IMDB is a standard benchmark for evaluating sentiment analysis models
4. **Transfer Learning**: Models trained on general text can be fine-tuned for specific tasks
5. **Cross-dataset vs Fine-tuned**: A sentiment model fine-tuned on one dataset (SST-2) can be applied to another (IMDB), but models fine-tuned on the target dataset typically perform better

## Next Steps

- Explore other LLMs like BERT, RoBERTa, or T5
- Try different benchmark datasets (SST-2, Amazon Reviews, etc.)
- See the fine-tuning notebook to learn how to adapt models for specific tasks
- Experiment with different generation parameters for GPT-2