# Week 3 - Activity 1: FineWeb Educationality Classifier

In this activity, we'll explore the FineWeb educationality classifier, which scores a text from 0 to 5 by educational value, to understand how it separates educational from non-educational content. We'll:

1. Set up and run the classifier
2. Test it on various types of content
3. Analyze cases where it succeeds and fails

Reference: [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)

## 1. Set Up the Classifier

In [1]:
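import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HuggingFaceFW/fineweb-edu-classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference only

def classify_text(text):
    """Score a text for educational value; raw scores fall roughly in the 0-5 range."""
    # Inputs longer than 512 tokens are truncated, so only the start of a long text is scored.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    # The model has a single regression output, so the logits hold one raw score.
    score = outputs.logits.squeeze(-1).float().item()
    return {
        "text": text,
        "score": score,
        "int_score": int(round(max(0, min(score, 5)))),  # clamp to 0-5, then round
        "is_educational": score >= 3  # As recommended in the model card
    }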
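
The post-processing inside `classify_text` can be examined without loading the model. The sketch below isolates it in a hypothetical helper, `postprocess_score` (the name and signature are assumptions, not part of the model card or `transformers`), and applies it to a few hand-picked raw scores. One detail worth knowing: Python's built-in `round` rounds exact halves to the nearest even integer, so a raw score of exactly 2.5 maps to 2, not 3.

```python
def postprocess_score(score: float, threshold: float = 3.0) -> dict:
    """Mirror the clamping, rounding, and thresholding used in classify_text.

    Hypothetical helper for illustration only.
    """
    clamped = max(0.0, min(score, 5.0))  # keep the score inside the 0-5 scale
    return {
        "score": score,
        "int_score": int(round(clamped)),
        "is_educational": score >= threshold,  # compared on the raw score
    }


# Hand-picked values that exercise clamping, rounding, and the threshold.
for raw in (-0.3, 2.5, 2.9, 3.0, 5.7):
    print(raw, postprocess_score(raw))
```

Note that a raw score of 2.9 yields an `int_score` of 3 but `is_educational` is still `False`, because the threshold is applied to the raw score rather than the rounded one.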


## 2. Test Cases: Educational Content

Let's try some examples that should be classified as educational:

In [2]:
educational_examples = [
    # Tutorial-style content
    """
    Introduction to Python Programming
    
    Python is a high-level programming language known for its simplicity and readability.
    In this tutorial, we'll cover:
    1. Basic syntax
    2. Variables and data types
    3. Control structures
    
    Let's start with a simple example:
    x = 5
    y = 10
    print(x + y)
    """,
    
    # Academic content
    """
    The Theory of Relativity
    
    Einstein's theory of relativity consists of two interrelated theories: special relativity and general relativity.
    The theory fundamentally changed our understanding of space, time, gravity, and the universe itself.
    """,
    
    # Documentation
    """
    API Documentation
    
    Function: calculate_mean(numbers: List[float]) -> float
    Description: Calculates the arithmetic mean of a list of numbers
    Parameters:
        - numbers: A list of floating-point numbers
    Returns:
        - The arithmetic mean as a float
    """
]

print("Educational Content Classification Results:")
for i, text in enumerate(educational_examples, 1):
    result = classify_text(text)
    print(f"\nExample {i}:")
    print(f"Raw score: {result['score']:.3f}")
    print(f"Integer score: {result['int_score']}")
    print(f"Is educational: {result['is_educational']}")

Educational Content Classification Results:

Example 1:
Raw score: 3.596
Integer score: 4
Is educational: True

Example 2:
Raw score: 3.749
Integer score: 4
Is educational: True

Example 3:
Raw score: 2.011
Integer score: 2
Is educational: False
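
The reporting loop above is reused unchanged for each group of examples, so it can be factored into a helper. This is a minimal sketch, assuming a hypothetical `report` function that accepts any callable returning the same dictionary shape as `classify_text`. The stub `fake_classify` is not a real classifier; it exists only so the snippet runs without downloading the model.

```python
from typing import Callable, Dict, List


def report(title: str, texts: List[str], classify: Callable[[str], Dict]) -> List[Dict]:
    """Classify each text, print a summary, and return the collected results."""
    print(f"{title}:")
    results = []
    for i, text in enumerate(texts, 1):
        result = classify(text)
        results.append(result)
        print(f"\nExample {i}:")
        print(f"Raw score: {result['score']:.3f}")
        print(f"Integer score: {result['int_score']}")
        print(f"Is educational: {result['is_educational']}")
    return results


def fake_classify(text: str) -> Dict:
    """Stand-in for classify_text: scores by word count, purely for demonstration."""
    score = min(len(text.split()) / 2, 5.0)
    return {"text": text, "score": score,
            "int_score": int(round(score)), "is_educational": score >= 3}


results = report(
    "Demo Results",
    ["short text", "a somewhat longer piece of sample text"],
    fake_classify,
)
```

In the notebook, `report("Educational Content Classification Results", educational_examples, classify_text)` would stand in for the loop, and the returned list keeps the scores available for later analysis.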


## 3. Test Cases: Non-Educational Content

Now let's try some examples that might be good for language model training but aren't strictly educational:

In [7]:
non_educational_examples = [
    # News article
    """
    Breaking News: Tech Company Announces New Product
    
    Silicon Valley's leading tech company unveiled their latest smartphone today,
    featuring improved camera capabilities and longer battery life.
    The announcement caused their stock price to rise by 5%.
    """,
    
    # Blog post
    """
    My Journey as a Software Developer
    
    When I started coding ten years ago, I never imagined where this path would lead me.
    Through ups and downs, I've learned that persistence is key to success in this field.
    """,
    
    # Technical discussion
    """
    Code Review Discussion
    
    I think we should refactor this module to use dependency injection.
    It would make the code more testable and reduce coupling between components.
    What do you think about this approach?
    """,

    # Custom: Chinese-language encyclopedia entry on UNESCO (non-English input)
    """
    联合国教育、科学及文化组织（法語：Organisation des Nations unies pour l'éducation, 
    la science et la culture，罕缩写作 ONUESC ；英語：United Nations Educational, 
    Scientific and Cultural Organization，縮寫作 UNESCO），简称联合国教科文组织，是一个联合国专门机构[1]，
    成立于1945年11月16日，总部设於法国巴黎。
    """
]

print("Non-Educational Content Classification Results:")
for i, text in enumerate(non_educational_examples, 1):
    result = classify_text(text)
    print(f"\nExample {i}:")
    print(f"Raw score: {result['score']:.3f}")
    print(f"Integer score: {result['int_score']}")
    print(f"Is educational: {result['is_educational']}")

Non-Educational Content Classification Results:

Example 1:
Raw score: 0.214
Integer score: 0
Is educational: False

Example 2:
Raw score: 0.504
Integer score: 1
Is educational: False

Example 3:
Raw score: 0.451
Integer score: 0
Is educational: False

Example 4:
Raw score: 1.636
Integer score: 2
Is educational: False


## 4. Edge Cases

Let's test some interesting edge cases that might challenge the classifier:

In [4]:
edge_cases = [
    # Educational but informal
    """
    Hey everyone! 👋 Today I'm gonna show you how to make the BEST chocolate chip cookies ever!
    First, we need to understand the science behind what makes cookies chewy vs crispy...
    """,
    
    # Technical but conversational
    """
    Q: Why isn't my neural network learning?
    A: Have you checked your learning rate? Sometimes if it's too high, the model won't converge.
    Try reducing it by a factor of 10 and see what happens.
    """,
    
    # Mixed content
    """
    Product Documentation and Updates
    
    NEW FEATURES:
    - Dark mode support
    - Improved performance
    
    TUTORIAL:
    To enable dark mode, follow these steps:
    1. Open settings
    2. Navigate to Display
    3. Toggle 'Dark Mode'
    """
]

print("Edge Cases Classification Results:")
for i, text in enumerate(edge_cases, 1):
    result = classify_text(text)
    print(f"\nExample {i}:")
    print(f"Raw score: {result['score']:.3f}")
    print(f"Integer score: {result['int_score']}")
    print(f"Is educational: {result['is_educational']}")

Edge Cases Classification Results:

Example 1:
Raw score: 0.849
Integer score: 1
Is educational: False

Example 2:
Raw score: 0.799
Integer score: 1
Is educational: False

Example 3:
Raw score: 0.560
Integer score: 1
Is educational: False
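
To compare the three groups side by side, the scores printed above can be collected and summarized. The sketch below hard-codes the raw scores from this notebook's recorded outputs (a fresh run may differ slightly); the variable names are illustrative.

```python
from statistics import mean

# Raw scores copied from the outputs recorded in this notebook.
recorded_scores = {
    "educational": [3.596, 3.749, 2.011],
    "non_educational": [0.214, 0.504, 0.451, 1.636],
    "edge_cases": [0.849, 0.799, 0.560],
}
THRESHOLD = 3  # same cut-off as classify_text

summary = {}
for group, scores in recorded_scores.items():
    kept = sum(score >= THRESHOLD for score in scores)
    summary[group] = {
        "mean": mean(scores),
        "max": max(scores),
        "kept": kept,
        "total": len(scores),
    }
    print(f"{group:16s} mean={summary[group]['mean']:.3f} "
          f"max={summary[group]['max']:.3f} kept={kept}/{len(scores)}")
```

Only two of the ten examples clear the threshold, and every edge case scores below 1, which places them much closer to the non-educational examples than to the educational ones.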


## Discussion Points

1. What patterns do you notice in content that's classified as educational?
   - Formal structure (e.g., step-by-step format, clear sections), although structure alone is not enough: the API documentation example scored only 2.011
   - Academic language and terminology
   - Clear learning objectives or instructional intent
   - Systematic presentation of information

2. What are some examples where the classifier might be too strict?
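   - Informal educational content (e.g., the cookie tutorial, which scored 0.849)
   - Q&A-style learning materials (the neural network Q&A scored 0.799)
   - Technical discussions with educational value (the code review discussion scored 0.451)
   - Mixed content with both tutorial and product information (the dark mode example scored 0.560)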

3. How might this affect the quality of training data for language models?
   - Bias towards formal academic content
   - Potential loss of valuable informal educational content
   - Limited exposure to real-world learning scenarios
   - Reduced diversity in educational styles

4. How would you improve the classifier?
   - Add support for different educational styles (formal vs. informal)
   - Consider context and intent more heavily
   - Implement multi-label classification for mixed content
   - Add domain-specific considerations
   - Consider audience level in classification
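
One lightweight way to experiment with the mixed-content idea, without retraining anything, is to score each section of a document separately and look at the best-scoring one. This is a swapped-in heuristic, not multi-label classification and not something taken from the FineWeb work. The sketch assumes sections are separated by blank lines; `score_sections` and the `keyword_score` stub are hypothetical, and in the notebook you would pass `lambda t: classify_text(t)["score"]` instead of the stub.

```python
import re
from typing import Callable, List, Tuple


def score_sections(text: str, score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Split text on blank lines and score each non-empty section on its own."""
    sections = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    return [(section, score_fn(section)) for section in sections]


def keyword_score(section: str) -> float:
    """Toy scorer used only so the example runs without the model."""
    return 4.0 if "steps" in section.lower() else 1.0


doc = """NEW FEATURES:
- Dark mode support
- Improved performance

TUTORIAL:
To enable dark mode, follow these steps:
1. Open settings
2. Navigate to Display
3. Toggle 'Dark Mode'
"""

scored = score_sections(doc, keyword_score)
best_section, best_score = max(scored, key=lambda pair: pair[1])
print(f"{len(scored)} sections, best score {best_score}")
```

Whether a document should be kept when only one section scores well is a design decision: taking the maximum favors recall, while averaging favors documents that are educational throughout.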