# Lesson 16: Test Dataset Collection and Model Evaluation

## Introduction (5 minutes)

Welcome to our lesson on Test Dataset Collection and Model Evaluation. In this 60-minute session, we'll explore the crucial processes of collecting and preparing data for our chatbot project, as well as evaluating the performance of our language model.

## Lesson Objectives

By the end of this lesson, you will be able to:
1. Collect and prepare a suitable dataset for chatbot training and evaluation
2. Implement data cleaning and preprocessing techniques
3. Understand and apply various model evaluation metrics
4. Conduct a comprehensive evaluation of a language model's performance

## 1. Data Collection for Chatbot Applications (15 minutes)

### 1.1 Sources of Data

- Public datasets (e.g., Reddit conversations, movie dialogues)
- Customer service logs
- Synthetic data generation
- Crowdsourcing
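
Of these sources, synthetic generation is the easiest to prototype without API access. The following is a minimal sketch of one possible approach, template filling; the templates, slot names, and the `generate_synthetic_pairs` helper are all invented for illustration and are not part of the lesson's pipeline.

```python
import itertools
import random

# Hypothetical templates and slot values, invented for illustration only.
TEMPLATES = [
    ("What are your opening hours on {day}?", "We are open from {hours} on {day}."),
    ("Are you open on {day}?", "Yes, on {day} we are open from {hours}."),
]
SLOTS = {
    "day": ["Monday", "Saturday"],
    "hours": ["9am to 5pm", "10am to 2pm"],
}

def generate_synthetic_pairs(templates, slots, seed=0):
    """Fill every template with every combination of slot values."""
    names = sorted(slots)
    pairs = []
    for context_tpl, response_tpl in templates:
        for values in itertools.product(*(slots[name] for name in names)):
            filling = dict(zip(names, values))
            pairs.append({
                "context": context_tpl.format(**filling),
                "response": response_tpl.format(**filling),
            })
    # Shuffle with a fixed seed so the output order is reproducible
    random.Random(seed).shuffle(pairs)
    return pairs

synthetic_pairs = generate_synthetic_pairs(TEMPLATES, SLOTS)
print(len(synthetic_pairs), synthetic_pairs[0])
```

Template-generated pairs use the same `context`/`response` keys as the collected data, so they can be mixed in later, but they are far less varied than real conversations.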

### 1.2 Data Collection Process

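Let's implement a simple data collector using PRAW, a Python wrapper for the Reddit API. The credentials in the code are placeholders; you need to register a Reddit application to obtain real ones: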

In [None]:
import praw
import pandas as pd

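def collect_reddit_data(subreddit_name, limit=1000):
    # Replace the placeholders with the credentials of your own Reddit app
    reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                         client_secret='YOUR_CLIENT_SECRET',
                         user_agent='YOUR_USER_AGENT')

    subreddit = reddit.subreddit(subreddit_name)
    data = []

    for submission in subreddit.top(limit=limit):
        # Drop "load more comments" stubs so only real comments remain
        submission.comments.replace_more(limit=0)
        comments = submission.comments.list()
        # Index comments by ID so parents are found without extra API requests
        comments_by_id = {comment.id: comment for comment in comments}
        for comment in comments:
            # A 't1_' prefix means the parent is a comment, not the submission
            if comment.parent_id.startswith('t1_'):
                parent = comments_by_id.get(comment.parent_id[3:])
                if parent is None:
                    continue
                data.append({
                    'context': parent.body,
                    'response': comment.body
                })

    # Name the columns explicitly so an empty result still has them
    return pd.DataFrame(data, columns=['context', 'response'])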

# Usage
df = collect_reddit_data('casualconversation')
print(df.head())
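
Hard-coding credentials in a notebook makes them easy to leak. A minimal sketch of one alternative is to read them from environment variables; the variable names (`REDDIT_CLIENT_ID` and so on) and the `load_reddit_credentials` helper are assumptions made for this example, not names that PRAW requires.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Read API credentials from environment variables instead of source code."""
    # Hypothetical variable names; any naming scheme works.
    keys = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in keys.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in keys.items()}

# Demo with an explicit mapping so the sketch runs without real credentials
demo_env = {
    "REDDIT_CLIENT_ID": "demo-id",
    "REDDIT_CLIENT_SECRET": "demo-secret",
    "REDDIT_USER_AGENT": "lesson16-demo",
}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))
```

The returned dictionary uses the same keyword names as the `praw.Reddit(...)` call in the collector, so it could be passed as `praw.Reddit(**credentials)`.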

## 2. Data Cleaning and Preprocessing (15 minutes)

### 2.1 Low-quality Filtering

Remove irrelevant or low-quality data:

In [None]:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text.strip()

def filter_data(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Clean first so the length limits apply to the text we actually keep
    df['context'] = df['context'].apply(clean_text)
    df['response'] = df['response'].apply(clean_text)

    # Keep messages with more than 5 and fewer than 200 characters
    context_ok = df['context'].str.len().between(6, 199)
    response_ok = df['response'].str.len().between(6, 199)

    return df[context_ok & response_ok].reset_index(drop=True)

cleaned_df = filter_data(df)
print(f"Original size: {len(df)}, Cleaned size: {len(cleaned_df)}")
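
Reddit shows the placeholder bodies `[deleted]` and `[removed]` for comments that were deleted by their author or removed by moderators, and pairs containing them carry no conversational signal. The sketch below is one way to drop them; `is_usable_pair` and the sample pairs are hypothetical. This check is best run before `clean_text`, because `clean_text` strips the square brackets and makes the placeholders harder to recognize.

```python
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}

def is_usable_pair(context, response):
    """Return False when either side is empty or a Reddit placeholder body."""
    for text in (context, response):
        stripped = text.strip()
        if not stripped or stripped in PLACEHOLDER_BODIES:
            return False
    return True

# Made-up sample pairs for illustration
raw_pairs = [
    {"context": "How was your weekend?", "response": "Pretty relaxing, thanks!"},
    {"context": "[deleted]", "response": "I agree with this."},
    {"context": "Any book recommendations?", "response": "[removed]"},
]
usable_pairs = [p for p in raw_pairs if is_usable_pair(p["context"], p["response"])]
print(len(usable_pairs))
```

On a DataFrame, the same check could be applied with a boolean mask such as `~df['context'].isin(PLACEHOLDER_BODIES)`.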

### 2.2 Redundancy Handling

Remove duplicate or near-duplicate entries:

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_duplicates(df, threshold=0.9):
    # Represent each context/response pair as a single TF-IDF vector
    tfidf = TfidfVectorizer().fit_transform(df['context'] + " " + df['response'])
    # Dense n x n similarity matrix: fine for a few thousand rows, too large far beyond that
    cosine_sim = cosine_similarity(tfidf, tfidf)

    to_remove = set()
    for i in range(len(df)):
        if i not in to_remove:
            # Keep row i and mark every other row that is too similar to it
            similar = np.where(cosine_sim[i] > threshold)[0]
            to_remove.update(similar[similar != i])

    return df.drop(df.index[list(to_remove)])

deduped_df = remove_duplicates(cleaned_df)
print(f"Size after deduplication: {len(deduped_df)}")
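
The evaluation in Section 4 expects a `test_data` set that the model has not seen. One simple way to obtain it is a random hold-out split after deduplication, so that near-duplicates cannot end up on both sides. The sketch below works on a plain list of pairs; `train_test_split_pairs`, the 20% fraction, and the seed are assumptions chosen for illustration.

```python
import random

def train_test_split_pairs(pairs, test_fraction=0.2, seed=42):
    """Shuffle a copy of the pairs and split off a held-out test portion."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up pairs for illustration
example_pairs = [{"context": f"question {i}", "response": f"answer {i}"} for i in range(10)]
train_pairs, test_pairs = train_test_split_pairs(example_pairs)
print(len(train_pairs), len(test_pairs))
```

With pandas, a comparable split is `test_data = deduped_df.sample(frac=0.2, random_state=42)` followed by `train_data = deduped_df.drop(test_data.index)`.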

## 3. Model Evaluation Metrics (15 minutes)

For chatbot and language model evaluation, we'll consider several metrics:

### 3.1 Perplexity

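Perplexity measures how well a language model predicts a text: it is the exponential of the average negative log-likelihood the model assigns to each token. Lower perplexity means the model finds the text more predictable. Values are only comparable between models that share a tokenizer.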
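
A small worked example shows how the number behaves. It uses hand-picked token probabilities rather than real model outputs, and `perplexity_from_probs` is a helper written only for this illustration. A model that assigns probability 0.25 to every token has a perplexity of 4, as if it were choosing uniformly among four options at each step.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the average negative log-probability: exp(-(1/N) * sum(log p_i))."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hand-picked probabilities, not taken from any model
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])
print(f"uniform guesser: {uniform:.2f}, confident model: {confident:.2f}")
```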

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

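def calculate_perplexity(model, tokenizer, text):
    # Truncate to the model's maximum input length so long texts don't fail
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Passing the inputs as labels makes the model return the mean cross-entropy loss
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of the mean per-token loss
    return torch.exp(outputs.loss).item()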

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sample_text = "The quick brown fox jumps over the lazy dog"
perplexity = calculate_perplexity(model, tokenizer, sample_text)
print(f"Perplexity: {perplexity}")

### 3.2 BLEU Score

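BLEU (Bilingual Evaluation Understudy) scores a candidate text by its n-gram overlap with one or more reference texts. It was designed for machine translation. For open-ended dialogue, where many different replies are acceptable, it tends to correlate weakly with human judgment, so treat it as a rough signal.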
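
The core ingredient of BLEU is clipped (modified) n-gram precision: the share of candidate n-grams that also occur in the reference, where each n-gram is credited at most as often as it appears in the reference. The sketch below computes only that component with hypothetical helper functions. The full score additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference_tokens = "the cat is on the mat".split()
candidate_tokens = "there is a cat on the mat".split()
p1 = modified_precision(reference_tokens, candidate_tokens, 1)
p2 = modified_precision(reference_tokens, candidate_tokens, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

Here five of the seven candidate unigrams and two of the six candidate bigrams are found in the reference. The candidate shares no 4-gram with the reference, which is why short sentences often receive a BLEU score near zero unless smoothing is applied.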

In [None]:
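from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    # Lowercase and split on whitespace; BLEU compares tokens exactly
    reference_tokens = reference.lower().split()
    candidate_tokens = candidate.lower().split()
    # Smoothing avoids a near-zero score when short sentences share no 4-grams
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smoothing)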

reference = "The cat is on the mat"
candidate = "There is a cat on the mat"
bleu_score = calculate_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score}")

### 3.3 Human Evaluation

While automated metrics are useful, human evaluation is crucial for assessing the quality and relevance of chatbot responses. Consider factors like:

- Relevance
- Coherence
- Engagement
- Factual accuracy
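
Human judgments along these factors are typically collected as numeric ratings from several annotators and then aggregated. The sketch below assumes a 1-5 scale and two annotators per response; the rating values, field names, and helper functions are hypothetical and only illustrate the bookkeeping.

```python
from statistics import mean

CRITERIA = ("relevance", "coherence", "engagement", "factual_accuracy")

# Hypothetical ratings on a 1-5 scale: one dict per (annotator, response) judgment
ratings = [
    {"response_id": 0, "annotator": "a1", "relevance": 5, "coherence": 4, "engagement": 3, "factual_accuracy": 5},
    {"response_id": 0, "annotator": "a2", "relevance": 4, "coherence": 4, "engagement": 2, "factual_accuracy": 5},
    {"response_id": 1, "annotator": "a1", "relevance": 2, "coherence": 3, "engagement": 2, "factual_accuracy": 4},
    {"response_id": 1, "annotator": "a2", "relevance": 3, "coherence": 3, "engagement": 1, "factual_accuracy": 4},
]

def summarize_ratings(ratings, criteria=CRITERIA):
    """Average each criterion over all judgments."""
    return {criterion: mean(r[criterion] for r in ratings) for criterion in criteria}

def max_disagreement(ratings, criterion):
    """Largest gap between annotators' scores on the same response."""
    by_response = {}
    for r in ratings:
        by_response.setdefault(r["response_id"], []).append(r[criterion])
    return max(max(scores) - min(scores) for scores in by_response.values())

summary = summarize_ratings(ratings)
print(summary)
print("largest relevance gap:", max_disagreement(ratings, "relevance"))
```

A large gap between annotators on the same response usually means the rating guidelines need to be made more specific before the averages can be trusted.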

## 4. Conducting a Comprehensive Evaluation (10 minutes)

Let's put everything together to evaluate our model:

In [None]:
def evaluate_model(model, tokenizer, test_data):
    perplexities = []
    bleu_scores = []

    for _, row in test_data.iterrows():
        context = row['context']
        true_response = row['response']

        # Generate a continuation of the context (GPT-2 has no pad token, so reuse EOS)
        inputs = tokenizer(context, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=50,
                                pad_token_id=tokenizer.eos_token_id)
        # generate() returns prompt + continuation; keep only the new tokens
        prompt_length = inputs["input_ids"].shape[1]
        predicted_response = tokenizer.decode(output[0][prompt_length:],
                                              skip_special_tokens=True)

        # Perplexity of the reference exchange, BLEU of the generated reply
        perplexity = calculate_perplexity(model, tokenizer, f"{context} {true_response}")
        bleu = calculate_bleu(true_response, predicted_response)

        perplexities.append(perplexity)
        bleu_scores.append(bleu)

    return {
        "avg_perplexity": sum(perplexities) / len(perplexities),
        "avg_bleu": sum(bleu_scores) / len(bleu_scores)
    }

# Assumes test_data is a held-out, non-empty DataFrame with 'context' and 'response' columns
results = evaluate_model(model, tokenizer, test_data)
print(f"Average Perplexity: {results['avg_perplexity']}")
print(f"Average BLEU Score: {results['avg_bleu']}")
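
Averaging per-example perplexities gives short and long examples the same weight and lets a few outliers dominate. A common alternative is corpus-level perplexity: the exponential of the token-weighted mean loss. The sketch below assumes that the mean loss and the number of predicted tokens were recorded for each example; the numbers are hypothetical, not measured results, and `corpus_perplexity` is a helper written for this illustration.

```python
import math

def corpus_perplexity(example_stats):
    """example_stats: sequence of (mean_loss, n_predicted_tokens), one per example."""
    example_stats = list(example_stats)
    total_nll = sum(loss * n_tokens for loss, n_tokens in example_stats)
    total_tokens = sum(n_tokens for _, n_tokens in example_stats)
    return math.exp(total_nll / total_tokens)

# Hypothetical per-example values, not measured results
stats = [(2.0, 10), (4.0, 2), (3.0, 8)]
corpus_ppl = corpus_perplexity(stats)
mean_of_ppls = sum(math.exp(loss) for loss, _ in stats) / len(stats)
print(f"corpus-level: {corpus_ppl:.2f}, mean of per-example: {mean_of_ppls:.2f}")
```

In this example the short, high-loss item inflates the mean of per-example perplexities, while the corpus-level figure weights it by its two tokens only. With a causal language model that receives its own inputs as labels, the number of predicted tokens is typically the input length minus one.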

## Conclusion and Q&A (5 minutes)

In this lesson, we collected conversational data from Reddit, cleaned and deduplicated it, and evaluated a language model with perplexity and BLEU. Automated metrics are useful, but they should be complemented with human evaluation for a comprehensive assessment.

Are there any questions about the data collection process or evaluation metrics?

## Additional Resources

1. "Evaluation of Text Generation: A Survey" paper: https://arxiv.org/abs/2006.14799
2. NLTK documentation for BLEU score: https://www.nltk.org/_modules/nltk/translate/bleu_score.html
3. "Survey on Evaluation Methods for Dialogue Systems" paper: https://arxiv.org/abs/1905.04071
4. Hugging Face Datasets library: https://huggingface.co/docs/datasets/

In our next lesson, we'll focus on designing input and output formats for our chatbot with context management.