# Zero-Shot Classification for YouTube Comments Analysis

This notebook implements zero-shot classification of YouTube comments for weight stigma research using OpenAI's GPT models. The pipeline analyzes comment sentiment, detects weight-based discrimination (gordofobia), identifies language, and flags obesity-related content.

## Research Overview

This study uses zero-shot classification with a large language model to:
- Classify sentiment in Portuguese YouTube comments (positive, negative, neutral)
- Detect explicit and implicit weight-based discrimination (gordofobia)
- Identify language patterns in multilingual comment datasets
- Flag obesity-related discussions for focused analysis
- Enable large-scale content analysis using batch processing

## Classification Framework

The zero-shot classification system includes:

1. **Sentiment Analysis**: Classify comments as positive, negative, or neutral
2. **Weight Discrimination Detection**: Identify explicit and implicit gordofobia
3. **Language Identification**: Detect comment language using ISO codes
4. **Obesity Content Flagging**: Mark comments discussing obesity topics
5. **Batch Processing**: Efficient processing using OpenAI's Batch API

## Input Data

- **Source**: Cleaned Portuguese comments from `02_basic_cleaning.ipynb`
- **Expected location**: `../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet`
- **Content**: Preprocessed YouTube comments ready for analysis

## Output Data

- **Destination**: `../data/intermediate/20250417_youtube_comments_yes_labels.parquet`
- **Content**: Original comments with classification labels and metadata

## Technical Requirements

- OpenAI API access with sufficient batch processing quota
- Pydantic for structured data validation
- LangChain for converting the Pydantic model into an OpenAI function-calling schema
- Error handling around file export, batch submission, and status monitoring

## Classification Schema

The system uses a structured Pydantic model to ensure consistent outputs:
- **sentimento**: Sentiment classification (positivo/negativo/neutro)
- **gordofobia_implicita**: Boolean flag for implicit weight discrimination
- **gordofobia_explicita**: Boolean flag for explicit weight discrimination
- **idioma**: Language code (ISO 639-1 format)
- **obesidade**: Boolean flag for obesity-related content

**Note**: This notebook processes large volumes of text data. Batch processing is used to optimize costs and API efficiency.
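
To make the schema concrete, the sketch below shows one labelled record and a plain-Python check of the constraints listed above. The notebook itself relies on Pydantic for this; the `validate_labels` helper, the constants, and the example values are hypothetical and use only the standard library.

```python
from typing import List

# Hypothetical stand-in for the Pydantic model, standard library only.
ALLOWED_SENTIMENTS = {"positivo", "negativo", "neutro"}
BOOLEAN_FIELDS = ("gordofobia_implicita", "gordofobia_explicita", "obesidade")


def validate_labels(record: dict) -> List[str]:
    """Return the problems found in one label record (empty list if valid)."""
    problems = []
    if record.get("sentimento") not in ALLOWED_SENTIMENTS:
        problems.append("sentimento must be positivo, negativo, or neutro")
    for field in BOOLEAN_FIELDS:
        if not isinstance(record.get(field), bool):
            problems.append(f"{field} must be a boolean")
    idioma = record.get("idioma")
    # Checks only the shape of an ISO 639-1 code (two lowercase ASCII letters),
    # not whether the code is actually assigned.
    shape_ok = (
        isinstance(idioma, str)
        and len(idioma) == 2
        and idioma.isascii()
        and idioma.isalpha()
        and idioma.islower()
    )
    if not shape_ok:
        problems.append("idioma must be a two-letter ISO 639-1 code")
    return problems


# Made-up example record, not taken from the dataset
example = {
    "sentimento": "negativo",
    "gordofobia_implicita": False,
    "gordofobia_explicita": True,
    "idioma": "pt",
    "obesidade": True,
}
print(validate_labels(example))
print(validate_labels({**example, "sentimento": "misto", "idioma": "por"}))
```

A record that passes this check has the same five keys the Pydantic model defines, which is the shape each classified comment is expected to take.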

In [1]:
import hashlib
import json
import os
import sys
import time
import copy
from pathlib import Path
from typing import List, Literal, Optional, Dict, Any
from glob import glob

# Data processing libraries
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import joblib

# API and ML libraries
from dotenv import load_dotenv
from langchain.utils.openai_functions import convert_pydantic_to_openai_function
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Custom modules
sys.path.append(str(Path("..").resolve()))
from openai_api import OpenAIBatchProcessor

# Jupyter notebook utilities
from IPython.display import clear_output
import warnings

# Load environment variables
load_dotenv()

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configuration for zero-shot classification
MODEL_NAME = "gpt-4.1-mini"  # Updated to current model version

print("✅ Libraries loaded successfully")
print(f"🤖 Using model: {MODEL_NAME}")
print(f"📁 Working directory: {Path.cwd()}")


# Configuration class for the classification pipeline
class ClassificationConfig:
    """Configuration for zero-shot comment classification pipeline."""

    # File paths
    DATA_DIR = Path("../data")
    INTERMEDIATE_DATA_DIR = DATA_DIR / "intermediate"
    TMP_DATA_DIR = DATA_DIR / "tmp"
    JSONL_DIR = INTERMEDIATE_DATA_DIR / "jsonl"

    # Input file (from cleaning notebook)
    INPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_comments_pt_cleaned1.parquet"

    # Output file
    OUTPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_comments_yes_labels.parquet"

    # Temporary files for batch processing
    PARSED_RESULTS_FILE = TMP_DATA_DIR / "parsed_results_comments.joblib"
    RESULTS_FILE = TMP_DATA_DIR / "results_comments.joblib"

    # Batch processing parameters
    BATCH_SIZE = 40000  # Requests per batch file; stays under the Batch API limits (50,000 requests, 200 MB per file)
    BATCH_NAME_PREFIX = "20250417_youtube_comments_batch_api"

    # Model parameters
    MODEL_NAME = MODEL_NAME
    TEMPERATURE = 0.0  # Deterministic outputs for research

    @classmethod
    def create_directories(cls):
        """Create necessary directories for processing."""
        cls.INTERMEDIATE_DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.TMP_DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.JSONL_DIR.mkdir(parents=True, exist_ok=True)


# Create directories
ClassificationConfig.create_directories()

print("✅ Configuration initialized")
print(f"📂 Input file: {ClassificationConfig.INPUT_FILE}")
print(f"📂 Output file: {ClassificationConfig.OUTPUT_FILE}")
print(f"🔢 Batch size: {ClassificationConfig.BATCH_SIZE:,}")
print(f"🌡️ Temperature: {ClassificationConfig.TEMPERATURE}")

✅ Libraries loaded successfully
🤖 Using model: gpt-4.1-mini
📁 Working directory: /media/nas-elias/pesquisas/papers/paper_savio_youtube/paper_youtube_weight_stigma
✅ Configuration initialized
📂 Input file: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
📂 Output file: ../data/intermediate/20250417_youtube_comments_yes_labels.parquet
🔢 Batch size: 40,000
🌡️ Temperature: 0.0


In [2]:
df = pd.read_parquet("../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet")
df

Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,author,comment,date,likes,video_title,language
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,UCNhXx9ev5RtEiyGsVjMuTOA,True,none,0.0,2024-12-28 21:38:37+00:00,2024-12-28 21:38:37+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
1,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Chefe wigol deu um beijo grego no homer skksks,Chefe wigol deu um beijo grego no homer skksks,@MrLopess00,https://yt3.ggpht.com/GbqCWSYWX0x7m12TrBOc7bBO...,http://www.youtube.com/@MrLopess00,UCtrByOsq8kIDCQfSXfq3IKw,True,none,447.0,2024-12-29 02:00:55+00:00,2024-12-29 02:00:55+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
2,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Quem era Batedor de Carteiras ?,Quem era Batedor de Carteiras ?,@mateuss.santossilva5059,https://yt3.ggpht.com/lIA6NvNbtRKR4LZyVTGVdNO_...,http://www.youtube.com/@mateuss.santossilva5059,UCIY2M7NurJ728_H4Cs5zmQA,True,none,5.0,2024-12-29 12:53:19+00:00,2024-12-29 12:53:19+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
3,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,"""Graças a deus que essa coisa está do nosso la...","""Graças a deus que essa coisa está do nosso la...",@Ray._Ryan000,https://yt3.ggpht.com/WIrn4XlSuZAQuPHw6w53yiiX...,http://www.youtube.com/@Ray._Ryan000,UCr5gdJ-I9wpcBXWdSR6T5iw,True,none,1677.0,2024-12-29 16:16:06+00:00,2024-12-29 16:17:20+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
4,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,🤨 tá estranho isso,🤨 tá estranho isso,@darkgacha5649,https://yt3.ggpht.com/R5NIvS_yYOP4_ngqdnlXIlOH...,http://www.youtube.com/@darkgacha5649,UCZnl2qgkPiF-SYyoPmTbzBw,True,none,0.0,2024-12-29 18:03:52+00:00,2024-12-29 18:03:52+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191941,zvJJHnpUGMo,UCjOJvvYe6tyEHY21OD33h8A,zvJJHnpUGMo,Dizem a eles o que tem que fazer !!!??\nIsto e...,Dizem a eles o que tem que fazer !!!??\nIsto e...,@pauloferw,https://yt3.ggpht.com/ytc/AIdro_kQShd7zzPCX_Ta...,http://www.youtube.com/@pauloferw,UCS6h73aFHTTrQBpUJuWwNfA,True,none,1.0,2024-05-01 08:02:32+00:00,2024-05-01 08:02:32+00:00,,,,,Matéria de Capa | A epidemia global da obesida...,pt
191942,zvJJHnpUGMo,UCjOJvvYe6tyEHY21OD33h8A,zvJJHnpUGMo,Desculpe mas sua obesidade n tem nada haver co...,Desculpe mas sua obesidade n tem nada haver co...,@stephaniemayer6082,https://yt3.ggpht.com/n__DHrkMmHm3ZWMyMlkl6Evy...,http://www.youtube.com/@stephaniemayer6082,UCIenqiHf4KoTSrK1cjEFwMA,True,none,0.0,2024-05-23 12:59:32+00:00,2024-05-23 12:59:32+00:00,,,,,Matéria de Capa | A epidemia global da obesida...,pt
191943,zvJJHnpUGMo,UCjOJvvYe6tyEHY21OD33h8A,zvJJHnpUGMo,"Da onde foi tirado esse 2,89 pra fazer a conta...","Da onde foi tirado esse 2,89 pra fazer a conta...",@adrianagalvao9963,https://yt3.ggpht.com/ytc/AIdro_lZNyRnrOGQx6Ns...,http://www.youtube.com/@adrianagalvao9963,UCjN2L25tmQraNVLwsqOxnKQ,True,none,0.0,2024-06-26 18:15:01+00:00,2024-06-26 18:15:01+00:00,,,,,Matéria de Capa | A epidemia global da obesida...,pt
191944,zvJJHnpUGMo,UCjOJvvYe6tyEHY21OD33h8A,zvJJHnpUGMo,Todos os profissionais falaram sobre exercício...,Todos os profissionais falaram sobre exercício...,@cinesiologiauniversal,https://yt3.ggpht.com/ytc/AIdro_kmdHnxCpeng5JD...,http://www.youtube.com/@cinesiologiauniversal,UCG0vNc5oRvHT00FnSokCFQQ,True,none,0.0,2024-07-20 20:34:55+00:00,2024-07-20 20:34:55+00:00,,,,,Matéria de Capa | A epidemia global da obesida...,pt


In [3]:
def load_and_explore_comment_data(file_path: Path) -> pd.DataFrame:
    """
    Load and explore YouTube comment data for classification.

    Args:
        file_path: Path to the cleaned comments data

    Returns:
        DataFrame with comment data ready for classification
    """
    try:
        print(f"📂 Loading comment data from: {file_path}")

        # Verify file exists
        if not file_path.exists():
            raise FileNotFoundError(f"Input file not found: {file_path}")

        # Load the data
        df = pd.read_parquet(file_path)

        # Check for key columns before any of them is accessed below
        required_columns = ["textDisplay", "video_id"]
        missing_columns = [col for col in required_columns if col not in df.columns]
        if missing_columns:
            raise ValueError(f"Missing required columns: {missing_columns}")

        print(f"✅ Successfully loaded {len(df):,} comments")
        print(f"📊 Data shape: {df.shape}")
        print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

        # Display basic statistics (optional columns fall back to an empty Series)
        empty = pd.Series(dtype=object)
        print(f"\n📈 Data Overview:")
        print(f"- Total comments: {len(df):,}")
        print(f"- Unique videos: {df['video_id'].nunique():,}")
        print(f"- Unique authors: {df.get('authorDisplayName', empty).nunique():,}")
        print(f"- Date range: {df.get('publishedAt', empty).min()} to {df.get('publishedAt', empty).max()}")

        # Display sample data
        print(f"\n📋 Sample Comments:")
        sample_df = df.sample(min(3, len(df)))
        for _, row in sample_df.iterrows():
            text_preview = row["textDisplay"][:100] + "..." if len(row["textDisplay"]) > 100 else row["textDisplay"]
            print(f"- {text_preview}")

        return df

    except Exception as e:
        print(f"❌ Error loading data: {e}")
        raise


# Load the comment data
df = load_and_explore_comment_data(ClassificationConfig.INPUT_FILE)

📂 Loading comment data from: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
✅ Successfully loaded 191,946 comments
📊 Data shape: (191946, 20)
💾 Memory usage: 247.3 MB

📈 Data Overview:
- Total comments: 191,946
- Unique videos: 1,204
- Unique authors: 163,664
- Date range: 2006-11-24 20:16:56+00:00 to 2025-04-17 11:46:21+00:00

📋 Sample Comments:
- É isso que acontece quando da dinheiro para porcos
- Mds, como a pessoa se permite chegar a esse ponto?
- Moça 😊 vc e muito bonita😊❤


In [4]:
class RespostaAnaliseSentimento(BaseModel):
    """A resposta de uma função que realiza análise de sentimento em texto e detecção do idioma do texto."""

    # Sentiment label assigned to the text.
    # Note: default_factory=str yields "", which is outside the Literal. Pydantic does not
    # validate defaults unless asked to, and the function schema built later in this
    # notebook marks every field as required, so the default is not expected to be used.
    sentimento: Literal["positivo", "negativo", "neutro"] = Field(
        default_factory=str,
        description="O rótulo de sentimento atribuído ao texto. Você só pode ter 'positivo', 'negativo' ou 'neutro' como valores.",
    )

    gordofobia_implicita: bool = Field(
        default_factory=bool,
        description="Se o texto contém discriminação por peso (gordofobia) de forma implícita e/ou indireta. Se não houver gordofobia, este campo deve ser False.",
    )

    gordofobia_explicita: bool = Field(
        default_factory=bool,
        description="Se o texto contém discriminação por peso (gordofobia) de forma explícita e/ou direta. Se não houver gordofobia, este campo deve ser False.",
    )

    # Language detected in the text
    idioma: str = Field(
        default_factory=str,
        description="O idioma detectado no texto, representado por um código de idioma de duas letras.",
    )

    obesidade: bool = Field(
        default_factory=bool,
        description="Se o texto toca no assunto de obesidade. Se não houver menção à obesidade, este campo deve ser False.",
    )

    class Config:
        """Pydantic configuration for the model."""

        json_encoders = {
            # Custom encoders if needed
        }

In [5]:
# Validate the model structure
print("✅ Classification schema defined successfully")
print(f"📋 Model fields: {list(RespostaAnaliseSentimento.model_fields.keys())}")

# Display model schema for validation
try:
    # model_json_schema() is the Pydantic v2 replacement for the deprecated .schema()
    schema = RespostaAnaliseSentimento.model_json_schema()
    print(f"🔍 Schema validation: OK")
    print(f"📊 Required fields: {schema.get('required', [])}")
except Exception as e:
    print(f"❌ Schema validation error: {e}")
    raise

✅ Classification schema defined successfully
📋 Model fields: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']
🔍 Schema validation: OK
📊 Required fields: []


In [6]:
# System prompt for zero-shot classification
SYSTEM_PROMPT = {
    "role": "system",
    "content": """Você é um especialista em análise de sentimento com foco em comentários relacionados a peso corporal e discriminação.

Sua tarefa é classificar comentários do YouTube com precisão, identificando:
1. Sentimento geral (positivo, negativo, neutro)
2. Presença de gordofobia (discriminação por peso)
3. Idioma do texto
4. Menções sobre obesidade

DIRETRIZES DE CLASSIFICAÇÃO:

SENTIMENTO:
- 'positivo': Comentários de apoio, encorajamento, aceitação corporal, mensagens construtivas
- 'negativo': Críticas, julgamentos, discriminação, linguagem ofensiva, gordofobia
- 'neutro': Comentários informativos, questões, observações sem julgamento de valor

GORDOFOBIA:
- Explícita: Insultos diretos, linguagem claramente discriminatória, termos pejorativos sobre peso
- Implícita: Sugestões sutis, estereótipos, pressões indiretas relacionadas ao peso

IDIOMA:
- Use códigos ISO 639-1 (pt, en, es, etc.)
- Considere o idioma predominante se houver mistura

OBESIDADE:
- Marque como True se o comentário menciona ou discute obesidade, mesmo que indiretamente

CONTEXTO IMPORTANTE:
- Considere ironia, sarcasmo e emojis no contexto
- Analise o comentário completo, não apenas palavras isoladas
- Comentários de apoio à diversidade corporal são positivos
- Seja preciso na detecção de discriminação sutil

Responda APENAS com o formato estruturado solicitado.""",
}

print("✅ System prompt configured")
print(f"📝 Prompt length: {len(SYSTEM_PROMPT['content'])} characters")
print("🎯 Classification targets: sentiment, gordofobia, language, obesity content")

✅ System prompt configured
📝 Prompt length: 1346 characters
🎯 Classification targets: sentiment, gordofobia, language, obesity content


In [7]:
# Generate OpenAI function schema from Pydantic model
def create_function_schema() -> Dict[str, Any]:
    """
    Create OpenAI function calling schema from the Pydantic model.

    Returns:
        Dict containing the function schema for OpenAI API
    """
    try:
        # Convert Pydantic model to OpenAI function format
        function_schema = convert_pydantic_to_openai_function(RespostaAnaliseSentimento)

        # Ensure all fields are required for consistent outputs
        function_schema["parameters"]["required"] = list(function_schema["parameters"]["properties"].keys())
        function_schema["parameters"]["type"] = "object"

        print("✅ Function schema created successfully")
        print(f"📋 Function name: {function_schema['name']}")
        print(f"🔧 Required parameters: {function_schema['parameters']['required']}")

        return function_schema

    except Exception as e:
        print(f"❌ Error creating function schema: {e}")
        raise


# Create the function schema
function_schema = create_function_schema()

# Display schema structure for validation
print(f"\n🔍 Function Schema Structure:")
print(f"- Name: {function_schema['name']}")
print(f"- Description: {function_schema['description']}")
print(f"- Parameters: {len(function_schema['parameters']['properties'])} fields")
print(f"- Required fields: {len(function_schema['parameters']['required'])}")

# Validate schema structure
assert "name" in function_schema, "Function schema missing name"
assert "parameters" in function_schema, "Function schema missing parameters"
assert len(function_schema["parameters"]["required"]) == 5, "Expected 5 required parameters"

print("✅ Function schema validation passed")

✅ Function schema created successfully
📋 Function name: RespostaAnaliseSentimento
🔧 Required parameters: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']

🔍 Function Schema Structure:
- Name: RespostaAnaliseSentimento
- Description: A resposta de uma função que realiza análise de sentimento em texto e detecção do idioma do texto.
- Parameters: 5 fields
- Required fields: 5
✅ Function schema validation passed
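
For orientation, the schema sent to the API has roughly the shape sketched below. This is a hand-written approximation, not the exact output of `convert_pydantic_to_openai_function`: the description is abbreviated, per-field descriptions are omitted, and the names `FUNCTION_SCHEMA_SKETCH` and `as_tool_fields` are hypothetical.

```python
import json

# Hand-written approximation of the function schema (assumption: the real
# converter output also carries per-field descriptions and titles).
_properties = {
    "sentimento": {"type": "string", "enum": ["positivo", "negativo", "neutro"]},
    "gordofobia_implicita": {"type": "boolean"},
    "gordofobia_explicita": {"type": "boolean"},
    "idioma": {"type": "string"},
    "obesidade": {"type": "boolean"},
}

FUNCTION_SCHEMA_SKETCH = {
    "name": "RespostaAnaliseSentimento",
    "description": "Sentiment, gordofobia, language, and obesity labels for one comment.",
    "parameters": {
        "type": "object",
        "properties": _properties,
        # Every field is listed as required so the model always fills all five
        "required": list(_properties.keys()),
    },
}


def as_tool_fields(schema: dict) -> dict:
    """Wrap a function schema into the request fields that force a call to it."""
    return {
        "tools": [{"type": "function", "function": schema}],
        "tool_choice": {"type": "function", "function": {"name": schema["name"]}},
    }


print(json.dumps(as_tool_fields(FUNCTION_SCHEMA_SKETCH), indent=2, ensure_ascii=False))
```

Setting `tool_choice` to the named function is what makes each response arrive as structured arguments rather than free text.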


In [8]:
def analyze_text_differences(df: pd.DataFrame) -> None:
    """
    Analyze differences between textDisplay and textOriginal columns.

    Args:
        df: DataFrame containing the comment data
    """
    print("🔍 Analyzing text field differences...")

    if "textOriginal" in df.columns:
        differences = (df.textDisplay != df.textOriginal).value_counts()
        print(f"📊 Text differences analysis:")
        print(f"- Identical texts: {differences.get(False, 0):,}")
        print(f"- Different texts: {differences.get(True, 0):,}")

        if differences.get(True, 0) > 0:
            print("ℹ️ textDisplay will be used for classification (processed version)")
        else:
            print("ℹ️ textDisplay and textOriginal are identical")
    else:
        print("ℹ️ Only textDisplay column available")


def prepare_classification_data(df: pd.DataFrame) -> List[str]:
    """
    Prepare comment texts for classification.

    Args:
        df: DataFrame containing the comment data

    Returns:
        List of comment texts ready for processing
    """
    print(f"📝 Preparing {len(df):,} comments for classification...")

    # Use textDisplay as it contains the processed version
    input_texts = df.textDisplay.values.tolist()

    # Basic validation
    empty_texts = sum(1 for text in input_texts if not text or not text.strip())
    if empty_texts > 0:
        print(f"⚠️ Found {empty_texts} empty or whitespace-only comments")

    print(f"✅ Prepared {len(input_texts):,} texts for classification")
    return input_texts


# Analyze the data structure
analyze_text_differences(df)

# Prepare the input texts
input_texts = prepare_classification_data(df)

print(f"\n📈 Data Preparation Summary:")
print(f"- Total comments to classify: {len(input_texts):,}")
print(f"- Sample text length: {len(input_texts[0]) if input_texts else 0} characters")
print(f"- Average text length: {sum(len(text) for text in input_texts) / len(input_texts):.1f} characters")

🔍 Analyzing text field differences...
📊 Text differences analysis:
- Identical texts: 191,946
- Different texts: 0
ℹ️ textDisplay and textOriginal are identical
📝 Preparing 191,946 comments for classification...
✅ Prepared 191,946 texts for classification

📈 Data Preparation Summary:
- Total comments to classify: 191,946
- Sample text length: 55 characters
- Average text length: 91.1 characters


In [9]:
def create_batch_requests(texts: List[str], df: pd.DataFrame) -> List[Dict[str, Any]]:
    """
    Create batch API requests for comment classification.

    Args:
        texts: List of comment texts to classify
        df: Original DataFrame for generating unique IDs

    Returns:
        List of API request objects
    """
    print(f"🔧 Creating batch API requests...")

    jsonl_data = []

    for idx, text in enumerate(tqdm(texts, desc="Creating requests")):
        # Create unique identifier for the request
        custom_uid = f"{text}{idx}{df.video_id.iloc[idx]}"
        request_id = hashlib.md5(custom_uid.encode()).hexdigest()

        # Create API request structure
        request_data = {
            "custom_id": request_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": ClassificationConfig.MODEL_NAME,
                "temperature": ClassificationConfig.TEMPERATURE,
                # text is already a str; the former encode()/decode() round trip was a no-op
                "messages": [SYSTEM_PROMPT, {"role": "user", "content": text}],
                "parallel_tool_calls": False,
                "tools": [{"type": "function", "function": function_schema}],
                "tool_choice": {"type": "function", "function": {"name": function_schema["name"]}},
            },
        }
        jsonl_data.append(request_data)

    print(f"✅ Created {len(jsonl_data):,} API requests")
    return jsonl_data


def split_into_batches(data: List[Dict[str, Any]], batch_size: int) -> List[List[Dict[str, Any]]]:
    """
    Split requests into batches for API processing.

    Args:
        data: List of API requests
        batch_size: Maximum requests per batch

    Returns:
        List of batches
    """
    print(f"📦 Splitting {len(data):,} requests into batches of {batch_size:,}...")

    chunks = [data[x : x + batch_size] for x in range(0, len(data), batch_size)]

    print(f"✅ Created {len(chunks)} batch(es)")
    for i, chunk in enumerate(chunks):
        print(f"  - Batch {i}: {len(chunk):,} requests")

    return chunks


# Create the batch requests
jsonl_data = create_batch_requests(input_texts, df)

# Split into manageable batches
chunks = split_into_batches(jsonl_data, ClassificationConfig.BATCH_SIZE)

print(f"\n📊 Batch Processing Summary:")
print(f"- Total requests: {len(jsonl_data):,}")
print(f"- Number of batches: {len(chunks)}")
print(f"- Batch size limit: {ClassificationConfig.BATCH_SIZE:,}")
print(f"- Model: {ClassificationConfig.MODEL_NAME}")
print(f"- Temperature: {ClassificationConfig.TEMPERATURE}")

🔧 Creating batch API requests...


Creating requests:   0%|          | 0/191946 [00:00<?, ?it/s]

✅ Created 191,946 API requests
📦 Splitting 191,946 requests into batches of 40,000...
✅ Created 5 batch(es)
  - Batch 0: 40,000 requests
  - Batch 1: 40,000 requests
  - Batch 2: 40,000 requests
  - Batch 3: 40,000 requests
  - Batch 4: 31,946 requests

📊 Batch Processing Summary:
- Total requests: 191,946
- Number of batches: 5
- Batch size limit: 40,000
- Model: gpt-4.1-mini
- Temperature: 0.0
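
Because each `custom_id` is an MD5 digest, a batch result cannot be matched to its comment without recomputing the digest. The sketch below rebuilds that lookup, assuming the same `text + index + video_id` recipe used in `create_batch_requests`. The helpers `make_request_id` and `build_id_lookup` and the sample rows are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def make_request_id(text: str, idx: int, video_id: str) -> str:
    """Recompute the custom_id with the same recipe as the request builder."""
    return hashlib.md5(f"{text}{idx}{video_id}".encode()).hexdigest()


def build_id_lookup(rows: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map each custom_id back to the positional index of its comment."""
    lookup: Dict[str, int] = {}
    for idx, (text, video_id) in enumerate(rows):
        request_id = make_request_id(text, idx, video_id)
        if request_id in lookup:
            # Including idx in the digest should make this unreachable in practice
            raise ValueError(f"Duplicate custom_id at row {idx}")
        lookup[request_id] = idx
    return lookup


# Made-up rows: identical text on the same video still gets distinct IDs
rows = [("Ótimo vídeo", "abc"), ("Ótimo vídeo", "abc"), ("ok", "xyz")]
lookup = build_id_lookup(rows)
print(len(lookup), "unique IDs")
```

Including the positional index in the digest keeps duplicate comments on the same video distinguishable, but it also means the lookup is only valid while the DataFrame keeps its original row order.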


In [10]:
def export_batch_files(chunks: List[List[Dict[str, Any]]], base_filename: str) -> List[str]:
    """
    Export batch requests to JSONL files for API processing.

    Args:
        chunks: List of batch chunks
        base_filename: Base filename for the batch files

    Returns:
        List of created file paths
    """
    print(f"💾 Exporting batch files...")

    created_files = []

    for idx, chunk in enumerate(chunks):
        filename = f"{base_filename}_{idx}.jsonl"
        filepath = ClassificationConfig.JSONL_DIR / filename

        try:
            with open(filepath, "w", encoding="utf-8") as f:
                for item in chunk:
                    f.write(json.dumps(item, ensure_ascii=False) + "\n")

            # Verify file creation and size
            file_size_mb = filepath.stat().st_size / (1024 * 1024)
            print(f"✅ Created {filename}: {file_size_mb:.2f} MB")
            created_files.append(str(filepath))

        except Exception as e:
            print(f"❌ Error creating {filename}: {e}")
            raise

    return created_files


# Export batch files
base_filename = ClassificationConfig.BATCH_NAME_PREFIX
created_files = export_batch_files(chunks, base_filename)

print(f"\n📁 Batch Files Created:")
for file_path in created_files:
    file_size = Path(file_path).stat().st_size / (1024 * 1024)
    print(f"- {Path(file_path).name}: {file_size:.2f} MB")

print(f"\n🎯 Ready for batch processing with {len(created_files)} file(s)")

💾 Exporting batch files...
✅ Created 20250417_youtube_comments_batch_api_0.jsonl: 120.11 MB
✅ Created 20250417_youtube_comments_batch_api_1.jsonl: 119.87 MB
✅ Created 20250417_youtube_comments_batch_api_2.jsonl: 119.04 MB
✅ Created 20250417_youtube_comments_batch_api_3.jsonl: 119.84 MB
✅ Created 20250417_youtube_comments_batch_api_4.jsonl: 95.90 MB

📁 Batch Files Created:
- 20250417_youtube_comments_batch_api_0.jsonl: 120.11 MB
- 20250417_youtube_comments_batch_api_1.jsonl: 119.87 MB
- 20250417_youtube_comments_batch_api_2.jsonl: 119.04 MB
- 20250417_youtube_comments_batch_api_3.jsonl: 119.84 MB
- 20250417_youtube_comments_batch_api_4.jsonl: 95.90 MB

🎯 Ready for batch processing with 5 file(s)
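
Before submitting, it can be worth confirming that each exported file respects the Batch API input limits. The sketch below assumes limits of 50,000 requests and 200 MB per file, which should be checked against the current OpenAI documentation. `check_batch_file` and the two constants are hypothetical names, and the demo file is synthetic.

```python
import json
import tempfile
from pathlib import Path

# Assumed Batch API input limits; verify against the current documentation
MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1024 * 1024


def check_batch_file(path: Path) -> dict:
    """Count requests and unique custom_ids in a JSONL file and compare to the limits."""
    n_requests = 0
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            request = json.loads(line)  # raises ValueError on a malformed line
            seen_ids.add(request["custom_id"])
            n_requests += 1
    size_bytes = path.stat().st_size
    return {
        "requests": n_requests,
        "unique_ids": len(seen_ids),
        "size_mb": size_bytes / (1024 * 1024),
        "ok": n_requests <= MAX_REQUESTS
        and size_bytes <= MAX_BYTES
        and len(seen_ids) == n_requests,
    }


# Synthetic three-line file standing in for an exported batch
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        "".join(json.dumps({"custom_id": str(i)}) + "\n" for i in range(3)),
        encoding="utf-8",
    )
    report = check_batch_file(demo)

print(report)
```

With files of roughly 120 MB for 40,000 requests, as reported above, both assumed limits leave some headroom.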

In [11]:
input_texts = df.textDisplay.values.tolist()

In [12]:
def get_file_hash(file_path: str) -> str:
    """
    Generate MD5 hash of a file for unique batch naming.

    Args:
        file_path: Path to the file

    Returns:
        MD5 hash string
    """
    with open(file_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def initialize_batch_processors(file_paths: List[str]) -> Dict[str, OpenAIBatchProcessor]:
    """
    Initialize and submit batch jobs for processing.

    Args:
        file_paths: List of JSONL file paths to process

    Returns:
        Dictionary mapping file paths to batch processors
    """
    print(f"🚀 Initializing batch processors for {len(file_paths)} files...")

    batch_processors = {}

    for file_path in tqdm(file_paths, desc="Submitting batches"):
        try:
            # Create processor and submit job
            processor = OpenAIBatchProcessor()
            batch_name = get_file_hash(file_path)

            processor.submit_batch_job(input_jsonl_path=file_path, batch_name=batch_name)

            batch_processors[file_path] = processor
            print(f"✅ Submitted batch for {Path(file_path).name}")

        except Exception as e:
            print(f"❌ Error submitting batch for {Path(file_path).name}: {e}")
            raise

    print(f"✅ All {len(batch_processors)} batches submitted successfully")
    return batch_processors


# Find all batch files (sorted, because glob returns paths in arbitrary order)
batch_files = sorted(glob(str(ClassificationConfig.JSONL_DIR / f"{ClassificationConfig.BATCH_NAME_PREFIX}*.jsonl")))

print(f"📁 Found {len(batch_files)} batch files:")
for file_path in batch_files:
    file_size = Path(file_path).stat().st_size / (1024 * 1024)
    print(f"- {Path(file_path).name}: {file_size:.2f} MB")

# Initialize batch processors
batch_processors = initialize_batch_processors(batch_files)

📁 Found 5 batch files:
- 20250417_youtube_comments_batch_api_0.jsonl: 120.11 MB
- 20250417_youtube_comments_batch_api_1.jsonl: 119.87 MB
- 20250417_youtube_comments_batch_api_2.jsonl: 119.04 MB
- 20250417_youtube_comments_batch_api_3.jsonl: 119.84 MB
- 20250417_youtube_comments_batch_api_4.jsonl: 95.90 MB
🚀 Initializing batch processors for 5 files...


Submitting batches:   0%|          | 0/5 [00:00<?, ?it/s]

Successfully submitted batch 59ace21c633a198ab4d504403f7add1b with id batch_68821110b710819087888f7e03858295
Batch info saved to ../data/intermediate/jsonl/20250417_youtube_comments_batch_api_0_20250724_075513.txt
✅ Submitted batch for 20250417_youtube_comments_batch_api_0.jsonl
Successfully submitted batch 1acbaa61699d58a8c14309a6cbfe7ccf with id batch_6882112382608190a895486bbe90dcca
Batch info saved to ../data/intermediate/jsonl/20250417_youtube_comments_batch_api_1_20250724_075531.txt
✅ Submitted batch for 20250417_youtube_comments_batch_api_1.jsonl
Successfully submitted batch 2ae52d110f4f6a5ed352afa17260c60f with id batch_6882114605188190b231d141c91dbf24
Batch info saved to ../data/intermediate/json

In [13]:
def monitor_batch_status(processors: Dict[str, OpenAIBatchProcessor]) -> None:
    """
    Monitor the status of all batch jobs.

    Args:
        processors: Dictionary of batch processors to monitor
    """
    print("📊 Checking batch status...")

    for file_path, processor in processors.items():
        try:
            batch_info = processor.get_batch_info()
            filename = Path(file_path).name
            print(f"- {filename}: {batch_info.status}")

            if hasattr(batch_info, "request_counts"):
                counts = batch_info.request_counts
                if counts:
                    print(f"  📈 Progress: {counts.get('completed', 0)}/{counts.get('total', 0)} requests")

        except Exception as e:
            print(f"❌ Error checking status for {Path(file_path).name}: {e}")


def wait_for_completion(processors: Dict[str, OpenAIBatchProcessor], check_interval: int = 60) -> None:
    """
    Wait for all batch jobs to complete with periodic status updates.

    Args:
        processors: Dictionary of batch processors to monitor
        check_interval: Seconds between status checks
    """
    print(f"⏳ Waiting for batch completion (checking every {check_interval}s)...")

    # Statuses after which a batch no longer changes (assumes OpenAI Batch API status names)
    terminal_statuses = {"completed", "failed", "expired", "cancelled"}

    while True:
        try:
            # Fetch each batch status once per polling cycle
            statuses = {
                Path(file_path).name: processor.get_batch_info().status
                for file_path, processor in processors.items()
            }

            completed_count = sum(1 for status in statuses.values() if status == "completed")
            total_count = len(statuses)

            # Clear output and show current status
            clear_output(wait=True)
            print(f"🔄 Batch Processing Status: {completed_count}/{total_count} completed")

            # Show detailed status
            for filename, status in statuses.items():
                print(f"- {filename}: {status}")

            # Check if all completed
            if completed_count == total_count:
                print("✅ All batches completed successfully!")
                break

            # Stop instead of polling forever once no batch can still make progress
            if all(status in terminal_statuses for status in statuses.values()):
                print("⚠️ Some batches ended without completing; inspect them before continuing")
                break

            # Wait before next check
            time.sleep(check_interval)

        except KeyboardInterrupt:
            print("\n⚠️ Monitoring interrupted by user")
            break
        except Exception as e:
            print(f"❌ Error during monitoring: {e}")
            break


# Monitor initial status
monitor_batch_status(batch_processors)

# Start monitoring (this will run until completion)
print("\n🎯 Starting batch monitoring...")
print("Note: This cell will run until all batches are completed.")
print("You can interrupt with Ctrl+C if needed.")

wait_for_completion(batch_processors)

🔄 Batch Processing Status: 5/5 completed
- 20250417_youtube_comments_batch_api_0.jsonl: completed
- 20250417_youtube_comments_batch_api_1.jsonl: completed
- 20250417_youtube_comments_batch_api_2.jsonl: completed
- 20250417_youtube_comments_batch_api_3.jsonl: completed
- 20250417_youtube_comments_batch_api_4.jsonl: completed
✅ All batches completed successfully!

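The monitoring loop above polls at a fixed interval and has no upper bound on total waiting time. The following is a minimal sketch of polling with a deadline and capped exponential backoff. The `get_status` callable is a hypothetical stand-in for `processor.get_batch_info().status`, and the `sleep` parameter exists only so the example can run without real delays; the terminal status names are those used by the OpenAI Batch API.

```python
import time
from typing import Callable

# Statuses after which a batch no longer changes state
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def poll_until_terminal(
    get_status: Callable[[], str],
    initial_interval: float = 1.0,
    max_interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    interval = initial_interval
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval
        # Double the wait each cycle, up to max_interval
        interval = min(interval * 2, max_interval)


# Usage with a fake status source (hypothetical, for illustration only)
_fake_statuses = iter(["validating", "in_progress", "finalizing", "completed"])
final_status = poll_until_terminal(lambda: next(_fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Returning the final status, rather than only printing it, lets the caller decide whether a `failed` or `expired` batch should be resubmitted.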

In [14]:
# Final status check for all batches
print("🔍 Final batch status check:")
for file_path, processor in batch_processors.items():
    batch_info = processor.get_batch_info()
    filename = Path(file_path).name
    print(f"- {filename}: {batch_info.status}")

    if batch_info.status == "completed":
        print("  ✅ Ready for result processing")
    elif batch_info.status == "failed":
        print("  ❌ Batch failed - check error details")
    else:
        print(f"  ⏳ Still processing - current status: {batch_info.status}")

# Check if we can proceed to results processing
completed_batches = sum(1 for processor in batch_processors.values() if processor.get_batch_info().status == "completed")
total_batches = len(batch_processors)

print("\n📊 Completion Summary:")
print(f"- Completed batches: {completed_batches}/{total_batches}")
print(f"- Success rate: {completed_batches / total_batches * 100:.1f}%")

if completed_batches == total_batches:
    print("✅ All batches completed - ready for results processing")
else:
    print("⚠️ Some batches are still pending - wait for completion before proceeding")

🔍 Final batch status check:
- 20250417_youtube_comments_batch_api_0.jsonl: completed
  ✅ Ready for result processing
- 20250417_youtube_comments_batch_api_1.jsonl: completed
  ✅ Ready for result processing
- 20250417_youtube_comments_batch_api_2.jsonl: completed
  ✅ Ready for result processing
- 20250417_youtube_comments_batch_api_3.jsonl: completed
  ✅ Ready for result processing
- 20250417_youtube_comments_batch_api_4.jsonl: completed
  ✅ Ready for result processing

📊 Completion Summary:
- Completed batches: 5/5
- Success rate: 100.0%

In [15]:
# Number of requests in each batch file
for chunk in chunks:
    print(len(chunk))

40000
40000
40000
40000
31946

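The sizes printed above follow from splitting 191,946 comments into chunks of at most 40,000 (the batch size reported in the pipeline summary). A small sketch of that arithmetic, using a hypothetical helper `chunk_sizes` that is not part of the notebook:

```python
from typing import List


def chunk_sizes(total: int, chunk_size: int) -> List[int]:
    """Return the sizes of consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full_chunks, remainder = divmod(total, chunk_size)
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


sizes = chunk_sizes(191_946, 40_000)
print(sizes)
```

Four full chunks plus a final chunk of 31,946 account for every comment, which is a quick sanity check before comparing result counts per batch file.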

In [None]:
from typing import Tuple


def process_batch_results(processors: Dict[str, OpenAIBatchProcessor]) -> Tuple[List[Any], List[Dict]]:
    """
    Process and parse results from completed batch jobs.

    Args:
        processors: Dictionary of batch processors

    Returns:
        Tuple of (parsed_results, raw_results)
    """
    print("🔄 Processing batch results...")

    parsed_results = []
    raw_results = []
    error_count = 0

    for file_path, processor in processors.items():
        filename = Path(file_path).name
        print(f"📂 Processing results from {filename}...")

        try:
            # Get batch output
            file_response = processor.get_batch_output()
            if not file_response:
                print(f"⚠️ No response data for {filename}")
                continue

            # Process each response
            batch_parsed = 0
            batch_errors = 0

            for output in file_response:
                try:
                    # Parse the JSON response
                    json_output = json.loads(output)
                    function_args = json_output["response"]["body"]["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
                    parsed_json = json.loads(function_args)

                    # Validate with Pydantic model
                    validated_obj = RespostaAnaliseSentimento.model_validate(parsed_json)
                    parsed_results.append(validated_obj)
                    raw_results.append(validated_obj.model_dump())
                    batch_parsed += 1

                except Exception as e:
                    # Handle parsing errors
                    parsed_results.append(None)
                    raw_results.append(None)
                    batch_errors += 1
                    error_count += 1

                    if batch_errors <= 3:  # Show first few errors
                        print(f"⚠️ Parsing error: {str(e)[:100]}")

            print(f"  ✅ Parsed: {batch_parsed}, Errors: {batch_errors}")

        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            continue

    success_rate = (len(parsed_results) - error_count) / len(parsed_results) * 100 if parsed_results else 0

    print("\n📊 Results Processing Summary:")
    print(f"- Total responses: {len(parsed_results):,}")
    print(f"- Successfully parsed: {len(parsed_results) - error_count:,}")
    print(f"- Parse errors: {error_count:,}")
    print(f"- Success rate: {success_rate:.1f}%")

    return parsed_results, raw_results


# Process all batch results
parsed_results, raw_results = process_batch_results(batch_processors)

🔄 Processing batch results...
📂 Processing results from 20250417_youtube_comments_batch_api_0.jsonl...
  ✅ Parsed: 40000, Errors: 0
📂 Processing results from 20250417_youtube_comments_batch_api_1.jsonl...
  ✅ Parsed: 40000, Errors: 0
📂 Processing results from 20250417_youtube_comments_batch_api_2.jsonl...
  ✅ Parsed: 40000, Errors: 0
📂 Processing results from 20250417_youtube_comments_batch_api_3.jsonl...
  ✅ Parsed: 40000, Errors: 0
📂 Processing results from 20250417_youtube_comments_batch_api_4.jsonl...
  ✅ Parsed: 31946, Errors: 0

In [17]:
def save_intermediate_results(parsed_results: List[Any], raw_results: List[Dict]) -> None:
    """
    Save intermediate results for backup and debugging.

    Args:
        parsed_results: List of parsed Pydantic objects
        raw_results: List of raw result dictionaries
    """
    print("💾 Saving intermediate results...")

    try:
        # Save parsed results
        joblib.dump(parsed_results, ClassificationConfig.PARSED_RESULTS_FILE)
        print(f"✅ Parsed results saved: {ClassificationConfig.PARSED_RESULTS_FILE}")

        # Save raw results
        joblib.dump(raw_results, ClassificationConfig.RESULTS_FILE)
        print(f"✅ Raw results saved: {ClassificationConfig.RESULTS_FILE}")

        # File size information
        parsed_size = ClassificationConfig.PARSED_RESULTS_FILE.stat().st_size / (1024 * 1024)
        raw_size = ClassificationConfig.RESULTS_FILE.stat().st_size / (1024 * 1024)

        print("📊 File sizes:")
        print(f"- Parsed results: {parsed_size:.2f} MB")
        print(f"- Raw results: {raw_size:.2f} MB")

    except Exception as e:
        print(f"❌ Error saving intermediate results: {e}")
        raise


def validate_results_consistency(parsed_results: List[Any], original_df: pd.DataFrame) -> None:
    """
    Validate that results match the original data structure.

    Args:
        parsed_results: List of classification results
        original_df: Original DataFrame with comments
    """
    print("🔍 Validating results consistency...")

    print(f"- Original comments: {len(original_df):,}")
    print(f"- Classification results: {len(parsed_results):,}")

    if len(parsed_results) == len(original_df):
        print("✅ Result count matches original data")
    else:
        print("⚠️ Result count mismatch - check for processing errors")

    # Check for null results
    null_count = sum(1 for result in parsed_results if result is None)
    if null_count > 0:
        print(f"⚠️ Found {null_count} null results ({null_count / len(parsed_results) * 100:.1f}%)")
    else:
        print("✅ No null results found")


# Save intermediate results for backup
save_intermediate_results(parsed_results, raw_results)

# Validate consistency
validate_results_consistency(parsed_results, df)

💾 Saving intermediate results...
✅ Parsed results saved: ../data/tmp/parsed_results_comments.joblib
✅ Raw results saved: ../data/tmp/results_comments.joblib
📊 File sizes:
- Parsed results: 10.62 MB
- Raw results: 4.39 MB
🔍 Validating results consistency...
- Original comments: 191,946
- Classification results: 191,946
✅ Result count matches original data
✅ No null results found


In [18]:
def create_classification_dataframe(parsed_results: List[Any]) -> pd.DataFrame:
    """
    Convert parsed classification results into a structured DataFrame.

    Args:
        parsed_results: List of parsed classification objects

    Returns:
        DataFrame with classification results
    """
    print("📊 Creating classification DataFrame...")

    columns = ["sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade"]
    outputs = []

    for parsed_document in tqdm(parsed_results, desc="Processing results"):
        if parsed_document is not None:
            # Convert to dictionary
            outputs.append(parsed_document.model_dump())
        else:
            # Handle null results with a missing value in every column
            outputs.append(dict.fromkeys(columns))

    # Create DataFrame, selecting columns by name so labels cannot be misassigned by position
    df_classifications = pd.DataFrame(outputs, columns=columns)

    print(f"✅ Created DataFrame with {len(df_classifications):,} classification results")

    # Display classification statistics
    print("\n📈 Classification Statistics:")

    if "sentimento" in df_classifications.columns:
        sentiment_counts = df_classifications["sentimento"].value_counts()
        print(f"- Sentiment distribution: {dict(sentiment_counts)}")

    if "gordofobia_explicita" in df_classifications.columns:
        explicit_count = df_classifications["gordofobia_explicita"].sum()
        print(f"- Explicit gordofobia: {explicit_count:,} ({explicit_count / len(df_classifications) * 100:.1f}%)")

    if "gordofobia_implicita" in df_classifications.columns:
        implicit_count = df_classifications["gordofobia_implicita"].sum()
        print(f"- Implicit gordofobia: {implicit_count:,} ({implicit_count / len(df_classifications) * 100:.1f}%)")

    if "obesidade" in df_classifications.columns:
        obesity_count = df_classifications["obesidade"].sum()
        print(f"- Obesity-related: {obesity_count:,} ({obesity_count / len(df_classifications) * 100:.1f}%)")

    if "idioma" in df_classifications.columns:
        language_counts = df_classifications["idioma"].value_counts().head()
        print(f"- Top languages: {dict(language_counts)}")

    return df_classifications


# Create the classification DataFrame
df_classifications = create_classification_dataframe(parsed_results)

📊 Creating classification DataFrame...


Processing results:   0%|          | 0/191946 [00:00<?, ?it/s]

✅ Created DataFrame with 191,946 classification results

📈 Classification Statistics:
- Sentiment distribution: {'positivo': 81684, 'neutro': 63692, 'negativo': 46569, '': 1}
- Explicit gordofobia: 12,355 (6.4%)
- Implicit gordofobia: 19,623 (10.2%)
- Obesity-related: 20,512 (10.7%)
- Top languages: {'pt': 189912, 'es': 1900, 'en': 108, 'id': 18, 'fr': 2}

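The sentiment distribution above contains one comment with an empty-string label (`''`), which falls outside the three values in the schema but is not null, so null checks will not catch it. A minimal sketch of counting such out-of-schema labels with the standard library; `out_of_schema_labels` is a hypothetical helper and the sample list is illustrative, not taken from the dataset:

```python
from collections import Counter
from typing import Iterable, Optional

VALID_SENTIMENTS = {"positivo", "negativo", "neutro"}


def out_of_schema_labels(labels: Iterable[Optional[str]]) -> Counter:
    """Count non-null labels that are not one of the expected sentiment values."""
    return Counter(
        label for label in labels if label is not None and label not in VALID_SENTIMENTS
    )


# Illustrative labels: an empty string and a wrongly capitalised value are flagged
sample_labels = ["positivo", "neutro", "", "negativo", None, "Positivo"]
unexpected = out_of_schema_labels(sample_labels)
print(dict(unexpected))
```

Rows flagged this way could be re-submitted or set to missing before analysis, depending on how the study treats unclassifiable comments.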

In [19]:
def integrate_classification_results(original_df: pd.DataFrame, classifications_df: pd.DataFrame) -> pd.DataFrame:
    """
    Integrate classification results with original comment data.

    Args:
        original_df: Original DataFrame with comments
        classifications_df: DataFrame with classification results

    Returns:
        Combined DataFrame with original data and classifications
    """
    print("🔗 Integrating classification results with original data...")

    # Verify dimensions match
    if len(original_df) != len(classifications_df):
        raise ValueError(f"Dimension mismatch: original ({len(original_df)}) vs classifications ({len(classifications_df)})")

    # Combine dataframes by position; reset both indexes because concat(axis=1) aligns on index labels
    df_integrated = pd.concat([original_df.reset_index(drop=True), classifications_df.reset_index(drop=True)], axis=1)

    print("✅ Successfully integrated data")
    print(f"📊 Final dataset shape: {df_integrated.shape}")
    print(f"📋 Total columns: {len(df_integrated.columns)}")

    # Display sample of integrated data
    print("\n🔍 Sample Integrated Data:")
    sample_cols = ["textDisplay", "sentimento", "gordofobia_explicita", "gordofobia_implicita", "idioma", "obesidade"]
    available_cols = [col for col in sample_cols if col in df_integrated.columns]

    if available_cols:
        sample_data = df_integrated[available_cols].head(3)
        for idx, row in sample_data.iterrows():
            print(f"\nComment {idx + 1}:")
            text_preview = row["textDisplay"][:80] + "..." if len(row["textDisplay"]) > 80 else row["textDisplay"]
            print(f"  Text: {text_preview}")
            for col in available_cols[1:]:  # Skip textDisplay
                print(f"  {col}: {row[col]}")

    return df_integrated


# Integrate the results
df_final = integrate_classification_results(df, df_classifications)

# Display integration summary
print("\n📈 Integration Summary:")
print(f"- Original comment columns: {len(df.columns)}")
print(f"- Classification columns: {len(df_classifications.columns)}")
print(f"- Final dataset columns: {len(df_final.columns)}")
print(f"- Total records: {len(df_final):,}")

# Check for any data quality issues
null_classifications = df_final[["sentimento", "gordofobia_explicita", "gordofobia_implicita", "idioma", "obesidade"]].isnull().sum()
if null_classifications.sum() > 0:
    print("\n⚠️ Null classifications found:")
    for col, count in null_classifications.items():
        if count > 0:
            print(f"  - {col}: {count} null values")
else:
    print("\n✅ No null classifications found")

🔗 Integrating classification results with original data...
✅ Successfully integrated data
📊 Final dataset shape: (191946, 25)
📋 Total columns: 25

🔍 Sample Integrated Data:

Comment 1:
  Text: Haahahahahahahahhahh o polícia chupando a buda do Romer
  sentimento: neutro
  gordofobia_explicita: False
  gordofobia_implicita: False
  idioma: pt
  obesidade: False

Comment 2:
  Text: Chefe wigol deu um beijo grego no homer skksks
  sentimento: neutro
  gordofobia_explicita: False
  gordofobia_implicita: False
  idioma: pt
  obesidade: False

Comment 3:
  Text: Quem era Batedor de Carteiras ?
  sentimento: neutro
  gordofobia_explicita: False
  gordofobia_implicita: False
  idioma: pt
  obesidade: False

📈 Integration Summary:
- Original comment columns: 20
- Classification columns: 5
- Final dataset columns: 25
- Total records: 191,946

✅ No null classifications found

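The integration step relies on `pd.concat(..., axis=1)`, which aligns rows on index labels rather than on position. The run above produced the expected shape, which suggests the original DataFrame had a default index, but the toy example below (illustrative data only) shows what can go wrong when it does not, and how resetting the index gives a positional join:

```python
import pandas as pd

# Toy frames: the "original" data has a non-default index, e.g. after filtering rows
original = pd.DataFrame({"textDisplay": ["a", "b", "c"]}, index=[10, 11, 12])
labels = pd.DataFrame({"sentimento": ["positivo", "neutro", "negativo"]})

# Index labels do not overlap, so the outer join yields 6 rows padded with NaN
misaligned = pd.concat([original, labels], axis=1)

# Resetting both indexes joins the rows by position instead
aligned = pd.concat(
    [original.reset_index(drop=True), labels.reset_index(drop=True)], axis=1
)

print(misaligned.shape, aligned.shape)
```

A length check alone does not detect this situation, since both inputs have the same number of rows; comparing the shape of the result against the inputs does.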

In [20]:
def validate_final_dataset(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Perform comprehensive validation of the final classified dataset.

    Args:
        df: Final dataset with classifications

    Returns:
        Dictionary with validation results
    """
    print("🔍 Performing final dataset validation...")

    validation_results = {}

    # Basic structure validation
    validation_results["total_records"] = len(df)
    validation_results["total_columns"] = len(df.columns)

    # Check required columns
    required_columns = ["textDisplay", "video_id", "sentimento", "gordofobia_explicita", "gordofobia_implicita", "idioma", "obesidade"]
    missing_columns = [col for col in required_columns if col not in df.columns]
    validation_results["missing_columns"] = missing_columns

    # Classification completeness
    classification_columns = ["sentimento", "gordofobia_explicita", "gordofobia_implicita", "idioma", "obesidade"]
    for col in classification_columns:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            validation_results[f"{col}_null_count"] = null_count
            validation_results[f"{col}_completeness"] = (len(df) - null_count) / len(df) * 100

    # Data quality checks
    if "sentimento" in df.columns:
        valid_sentiments = ["positivo", "negativo", "neutro"]
        invalid_sentiments = df[~df["sentimento"].isin(valid_sentiments + [None])]["sentimento"].value_counts()
        validation_results["invalid_sentiments"] = dict(invalid_sentiments)

    # Language distribution
    if "idioma" in df.columns:
        language_dist = df["idioma"].value_counts().head(10)
        validation_results["top_languages"] = dict(language_dist)

    # Gordofobia analysis
    if "gordofobia_explicita" in df.columns and "gordofobia_implicita" in df.columns:
        explicit_count = df["gordofobia_explicita"].sum()
        implicit_count = df["gordofobia_implicita"].sum()
        any_gordofobia = (df["gordofobia_explicita"] | df["gordofobia_implicita"]).sum()

        validation_results["gordofobia_explicit"] = explicit_count
        validation_results["gordofobia_implicit"] = implicit_count
        validation_results["gordofobia_any"] = any_gordofobia
        validation_results["gordofobia_rate"] = any_gordofobia / len(df) * 100

    # Overall quality score
    completeness_scores = [validation_results[f"{col}_completeness"] for col in classification_columns if f"{col}_completeness" in validation_results]
    validation_results["overall_completeness"] = sum(completeness_scores) / len(completeness_scores) if completeness_scores else 0

    # Determine validation status
    if validation_results["overall_completeness"] >= 95:
        validation_results["status"] = "excellent"
    elif validation_results["overall_completeness"] >= 85:
        validation_results["status"] = "good"
    elif validation_results["overall_completeness"] >= 70:
        validation_results["status"] = "acceptable"
    else:
        validation_results["status"] = "poor"

    return validation_results


# Validate the final dataset
validation_results = validate_final_dataset(df_final)

# Display validation results
print("\n📊 Dataset Validation Results:")
print(f"- Status: {validation_results['status'].upper()}")
print(f"- Total records: {validation_results['total_records']:,}")
print(f"- Total columns: {validation_results['total_columns']}")
print(f"- Overall completeness: {validation_results['overall_completeness']:.1f}%")

if validation_results["missing_columns"]:
    print(f"⚠️ Missing columns: {validation_results['missing_columns']}")

print("\n📈 Classification Completeness:")
classification_columns = ["sentimento", "gordofobia_explicita", "gordofobia_implicita", "idioma", "obesidade"]
for col in classification_columns:
    if f"{col}_completeness" in validation_results:
        completeness = validation_results[f"{col}_completeness"]
        null_count = validation_results[f"{col}_null_count"]
        print(f"- {col}: {completeness:.1f}% ({null_count:,} null values)")

if "gordofobia_rate" in validation_results:
    print("\n🎯 Gordofobia Detection Results:")
    print(f"- Explicit gordofobia: {validation_results['gordofobia_explicit']:,} comments")
    print(f"- Implicit gordofobia: {validation_results['gordofobia_implicit']:,} comments")
    print(f"- Any gordofobia: {validation_results['gordofobia_any']:,} comments ({validation_results['gordofobia_rate']:.1f}%)")

if "top_languages" in validation_results:
    print("\n🌐 Top Languages Detected:")
    for lang, count in list(validation_results["top_languages"].items())[:5]:
        print(f"- {lang}: {count:,} comments")

🔍 Performing final dataset validation...

📊 Dataset Validation Results:
- Status: EXCELLENT
- Total records: 191,946
- Total columns: 25
- Overall completeness: 100.0%

📈 Classification Completeness:
- sentimento: 100.0% (0 null values)
- gordofobia_explicita: 100.0% (0 null values)
- gordofobia_implicita: 100.0% (0 null values)
- idioma: 100.0% (0 null values)
- obesidade: 100.0% (0 null values)

🎯 Gordofobia Detection Results:
- Explicit gordofobia: 12,355 comments
- Implicit gordofobia: 19,623 comments
- Any gordofobia: 31,538 comments (16.4%)

🌐 Top Languages Detected:
- pt: 189,912 comments
- es: 1,900 comments
- en: 108 comments
- id: 18 comments
- fr: 2 comments

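The "any gordofobia" count is smaller than the sum of the explicit and implicit counts because some comments are flagged as both. The overlap can be recovered from the three reported totals by inclusion-exclusion; the variable names below are illustrative:

```python
# Totals reported in the validation output above
explicit = 12_355
implicit = 19_623
any_form = 31_538

# |A and B| = |A| + |B| - |A or B|
both = explicit + implicit - any_form
only_explicit = explicit - both
only_implicit = implicit - both

print(f"both: {both}, only explicit: {only_explicit}, only implicit: {only_implicit}")
```

The three disjoint groups sum back to the reported "any form" total, which confirms the counts are internally consistent.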

In [21]:
def export_classified_dataset(df: pd.DataFrame, output_path: Path, validation_results: Dict[str, Any]) -> bool:
    """
    Export the final classified dataset with validation checks.

    Args:
        df: Final dataset to export
        output_path: Path where to save the dataset
        validation_results: Results from validation

    Returns:
        True if export successful, False otherwise
    """
    print(f"💾 Exporting classified dataset to: {output_path}")

    try:
        # Check if export should proceed based on validation
        if validation_results["status"] == "poor":
            print("⚠️ Dataset quality is poor - export may contain significant issues")

        # Create backup if file already exists
        if output_path.exists():
            backup_path = output_path.with_suffix(f".backup_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.parquet")
            output_path.rename(backup_path)
            print(f"📁 Existing file backed up to: {backup_path.name}")

        # Ensure output directory exists
        output_path.parent.mkdir(parents=True, exist_ok=True)

        # Export to parquet format
        df.to_parquet(output_path, index=False)

        # Verify export
        exported_size = output_path.stat().st_size
        print("✅ Export successful!")
        print(f"📁 File: {output_path.name}")
        print(f"📊 Size: {exported_size / (1024 * 1024):.2f} MB")
        print(f"📈 Records: {len(df):,}")
        print(f"📋 Columns: {len(df.columns)}")

        # Test read-back
        test_df = pd.read_parquet(output_path)
        if len(test_df) == len(df) and len(test_df.columns) == len(df.columns):
            print("✅ Export verification passed")
        else:
            print("⚠️ Export verification failed - file may be corrupted")
            return False

        return True

    except Exception as e:
        print(f"❌ Export failed: {e}")
        return False


# Export the final classified dataset
export_success = export_classified_dataset(df_final, ClassificationConfig.OUTPUT_FILE, validation_results)

if export_success:
    print("\n🎉 Classification pipeline completed successfully!")
    print(f"📁 Output file: {ClassificationConfig.OUTPUT_FILE}")
    print(f"📊 Final dataset: {len(df_final):,} comments with classifications")

    # Display final summary statistics
    print("\n📈 Final Summary:")
    if "sentimento" in df_final.columns:
        sentiment_dist = df_final["sentimento"].value_counts()
        print(f"- Sentiment distribution: {dict(sentiment_dist)}")

    if "obesidade" in df_final.columns:
        obesity_count = df_final["obesidade"].sum()
        print(f"- Obesity-related comments: {obesity_count:,}")

    # Only combine the two flags when both columns exist
    if {"gordofobia_explicita", "gordofobia_implicita"} <= set(df_final.columns):
        gordofobia_any = (df_final["gordofobia_explicita"] | df_final["gordofobia_implicita"]).sum()
        print(f"- Comments with gordofobia: {gordofobia_any:,}")

else:
    print("❌ Export failed - check error messages above")

💾 Exporting classified dataset to: ../data/intermediate/20250417_youtube_comments_yes_labels.parquet
📁 Existing file backed up to: 20250417_youtube_comments_yes_labels.backup_20250724_094513.parquet
✅ Export successful!
📁 File: 20250417_youtube_comments_yes_labels.parquet
📊 Size: 47.22 MB
📈 Records: 191,946
📋 Columns: 25
✅ Export verification passed

🎉 Classification pipeline completed successfully!
📁 Output file: ../data/intermediate/20250417_youtube_comments_yes_labels.parquet
📊 Final dataset: 191,946 comments with classifications

📈 Final Summary:
- Sentiment distribution: {'positivo': 81684, 'neutro': 63692, 'negativo': 46569, '': 1}
- Obesity-related comments: 20,512
- Comments with gordofobia: 31,538

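The backup filename in the log above comes from replacing the `.parquet` suffix with a timestamped one via `Path.with_suffix`. A small standard-library sketch of that naming step, using a hypothetical helper `backup_path_for` and a fixed timestamp that matches the one in the log so the result is reproducible:

```python
from datetime import datetime
from pathlib import Path


def backup_path_for(output_path: Path, now: datetime) -> Path:
    """Return a sibling path whose final suffix is replaced by a timestamped backup suffix."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # with_suffix swaps only the last suffix (".parquet") for the new one
    return output_path.with_suffix(f".backup_{stamp}.parquet")


target = Path("../data/intermediate/20250417_youtube_comments_yes_labels.parquet")
backup = backup_path_for(target, datetime(2025, 7, 24, 9, 45, 13))
print(backup.name)
```

Passing the timestamp in as an argument, instead of reading the clock inside the function, keeps the naming logic easy to test.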

In [23]:
def generate_pipeline_summary() -> None:
    """
    Generate a comprehensive summary of the classification pipeline.
    """
    print("📋 Zero-Shot Classification Pipeline Summary")
    print("=" * 50)

    # Input data summary
    print("\n📊 Input Data:")
    print(f"- Source file: {ClassificationConfig.INPUT_FILE.name}")
    print(f"- Comments processed: {len(df):,}")
    print(f"- Unique videos: {df['video_id'].nunique():,}")

    # Model and configuration
    print("\n🤖 Model Configuration:")
    print(f"- Model: {ClassificationConfig.MODEL_NAME}")
    print(f"- Temperature: {ClassificationConfig.TEMPERATURE}")
    print(f"- Batch size: {ClassificationConfig.BATCH_SIZE:,}")

    # Classification schema
    print("\n🏷️ Classification Schema:")
    print(f"- Sentiment analysis (positivo/negativo/neutro)")
    print(f"- Gordofobia detection (explicit/implicit)")
    print(f"- Language identification (ISO codes)")
    print(f"- Obesity content flagging")

    # Results summary
    if "df_final" in globals() and not df_final.empty:
        print("\n📈 Results Summary:")
        print(f"- Final dataset: {len(df_final):,} records")
        print(f"- Classification completeness: {validation_results.get('overall_completeness', 0):.1f}%")
        print(f"- Data quality: {validation_results.get('status', 'unknown').upper()}")

        # Sentiment distribution
        if "sentimento" in df_final.columns:
            sentiment_stats = df_final["sentimento"].value_counts()
            print("\n💭 Sentiment Analysis:")
            for sentiment, count in sentiment_stats.items():
                percentage = count / len(df_final) * 100
                print(f"  - {sentiment}: {count:,} ({percentage:.1f}%)")

        # Gordofobia detection
        if "gordofobia_explicita" in df_final.columns and "gordofobia_implicita" in df_final.columns:
            explicit_count = df_final["gordofobia_explicita"].sum()
            implicit_count = df_final["gordofobia_implicita"].sum()
            any_gordofobia = (df_final["gordofobia_explicita"] | df_final["gordofobia_implicita"]).sum()

            print("\n⚠️ Gordofobia Detection:")
            print(f"  - Explicit: {explicit_count:,} ({explicit_count / len(df_final) * 100:.1f}%)")
            print(f"  - Implicit: {implicit_count:,} ({implicit_count / len(df_final) * 100:.1f}%)")
            print(f"  - Any form: {any_gordofobia:,} ({any_gordofobia / len(df_final) * 100:.1f}%)")

        # Language distribution
        if "idioma" in df_final.columns:
            lang_stats = df_final["idioma"].value_counts().head(5)
            print("\n🌐 Language Distribution:")
            for lang, count in lang_stats.items():
                percentage = count / len(df_final) * 100
                print(f"  - {lang}: {count:,} ({percentage:.1f}%)")

        # Obesity content
        if "obesidade" in df_final.columns:
            obesity_count = df_final["obesidade"].sum()
            print("\n🏥 Obesity-Related Content:")
            print(f"  - Comments mentioning obesity: {obesity_count:,} ({obesity_count / len(df_final) * 100:.1f}%)")

    # Output files
    print("\n📁 Output Files:")
    print(f"- Main dataset: {ClassificationConfig.OUTPUT_FILE}")
    print(f"- Backup results: {ClassificationConfig.PARSED_RESULTS_FILE}")
    print(f"- Raw results: {ClassificationConfig.RESULTS_FILE}")


# Generate comprehensive summary
generate_pipeline_summary()


print("\n🏁 Zero-Shot Classification Pipeline Complete! ✨")
print("📊 Dataset ready for research analysis and publication")

📋 Zero-Shot Classification Pipeline Summary

📊 Input Data:
- Source file: 20250417_youtube_comments_pt_cleaned1.parquet
- Comments processed: 191,946
- Unique videos: 1,204

🤖 Model Configuration:
- Model: gpt-4.1-mini
- Temperature: 0.0
- Batch size: 40,000

🏷️ Classification Schema:
- Sentiment analysis (positivo/negativo/neutro)
- Gordofobia detection (explicit/implicit)
- Language identification (ISO codes)
- Obesity content flagging

📈 Results Summary:
- Final dataset: 191,946 records
- Classification completeness: 100.0%
- Data quality: EXCELLENT

💭 Sentiment Analysis:
  - positivo: 81,684 (42.6%)
  - neutro: 63,692 (33.2%)
  - negativo: 46,569 (24.3%)
  - : 1 (0.0%)

⚠️ Gordofobia Detection:
  - Explicit: 12,355 (6.4%)
  - Implicit: 19,623 (10.2%)
  - Any form: 31,538 (16.4%)

🌐 Language Distribution:
  - pt: 189,912 (98.9%)
  - es: 1,900 (1.0%)
  - en: 108 (0.1%)
  - id: 18 (0.0%)
  - fr: 2 (0.0%)

🏥 Obesity-Related Content:
  - Comments mentioning