# Zero-Shot Classification of YouTube Video Titles for Weight Stigma Analysis

## Research Overview

This notebook applies zero-shot classification to YouTube video titles to identify weight stigma and sentiment patterns. It is one component of a larger analysis pipeline: alongside comment classification (notebook 04) and transcription analysis (notebook 05), it extends the weight stigma analysis to titles, so that all three text sources are covered.

## Scientific Objectives

1. **Multi-Modal Content Analysis**: Classify video titles alongside comments and transcriptions for comprehensive YouTube content analysis
2. **Weight Stigma Detection**: Identify explicit and implicit weight discrimination in video titles using zero-shot LLM classification
3. **Sentiment Analysis**: Determine emotional sentiment patterns in obesity-related video titles for psychological research insights
4. **Language Processing**: Detect the language of each title and filter by language, so that the analysis can focus on Brazilian Portuguese content
5. **Research Integration**: Generate publication-ready datasets that integrate seamlessly with existing research workflows

## Methodology and Technical Approach

- **Zero-Shot Classification**: Uses an OpenAI GPT mini model (`gpt-4.1-mini`, as set in `TitleClassificationConfig.MODEL_NAME`) for structured content analysis without task-specific training data
- **Batch Processing**: Efficient large-scale classification via OpenAI Batch API for cost-effective research operations
- **Structured Outputs**: Pydantic models ensure consistent, validated data formats suitable for statistical analysis
- **Quality Control**: Comprehensive validation, error handling, and data integrity checks throughout the pipeline
- **Reproducible Research**: Standardized configuration management and comprehensive documentation for research reproducibility
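
As a rough illustration of how the batch and structured-output pieces fit together, the sketch below builds one JSONL request line per title in the format the OpenAI Batch API expects for `/v1/chat/completions`, then splits the requests into files of at most `batch_size` lines. It does not reproduce the project's `OpenAIBatchProcessor`, whose internals are not shown in this notebook. The helper names `build_batch_request` and `write_jsonl_chunks`, the use of `video_id` as `custom_id`, and the abbreviated `EXAMPLE_SCHEMA` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Abbreviated, hypothetical function schema. The notebook derives the real one
# from the RespostaAnaliseSentimento Pydantic model.
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "name": "RespostaAnaliseSentimento",
    "description": "Sentiment, weight stigma, language and obesity labels.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentimento": {"type": "string", "enum": ["positivo", "negativo", "neutro"]},
            "gordofobia_implicita": {"type": "boolean"},
            "gordofobia_explicita": {"type": "boolean"},
            "idioma": {"type": "string"},
            "obesidade": {"type": "boolean"},
        },
        "required": [
            "sentimento", "gordofobia_implicita", "gordofobia_explicita",
            "idioma", "obesidade",
        ],
    },
}


def build_batch_request(
    custom_id: str,
    title: str,
    system_prompt: str,
    model: str = "gpt-4.1-mini",
    temperature: float = 0.0,
) -> Dict[str, Any]:
    """Build one Batch API request (one JSONL line) for a single video title."""
    return {
        "custom_id": custom_id,  # assumption: the video_id is used as the identifier
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "temperature": temperature,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": title},
            ],
            # Force the model to answer through the classification function
            "tools": [{"type": "function", "function": EXAMPLE_SCHEMA}],
            "tool_choice": {
                "type": "function",
                "function": {"name": EXAMPLE_SCHEMA["name"]},
            },
        },
    }


def write_jsonl_chunks(
    requests: List[Dict[str, Any]],
    out_dir: Path,
    prefix: str,
    batch_size: int = 40000,
) -> List[Path]:
    """Write requests to numbered JSONL files with at most batch_size lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for start in range(0, len(requests), batch_size):
        path = out_dir / f"{prefix}_{start // batch_size:03d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for req in requests[start:start + batch_size]:
                # ensure_ascii=False keeps Portuguese accents readable in the file
                fh.write(json.dumps(req, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


example = build_batch_request(
    "-1DN4904BQw", "O país mais obeso do mundo #shorts", "placeholder system prompt"
)
print(json.dumps(example, ensure_ascii=False)[:120])
```

With roughly 1,200 unique videos and a batch size of 40,000, a single JSONL file would be enough here; the chunking only matters if the same code is reused on larger inputs such as the comment dataset.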

## Technical Architecture

The pipeline runs in five stages:

1. **Data Loading and Preparation**: Integration with cleaned video datasets from previous processing stages
2. **Schema Definition**: Pydantic-based structured output models for consistent data collection
3. **System Prompt Engineering**: Specialized prompts optimized for video title content and weight stigma detection
4. **Batch API Processing**: Scalable classification using OpenAI's batch processing infrastructure
5. **Results Processing**: Comprehensive data validation, integration, and export for downstream research
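
The results-processing stage can be pictured with a minimal sketch like the one below, which reads one line of Batch API output and extracts the function-call arguments. The helper name `parse_batch_output_line`, the plain-dictionary type checks (the notebook itself relies on Pydantic for validation), and the hand-built `sample_line` are illustrative assumptions, not the project's code.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Expected fields and their Python types after JSON decoding
EXPECTED_FIELDS = {
    "sentimento": str,
    "gordofobia_implicita": bool,
    "gordofobia_explicita": bool,
    "idioma": str,
    "obesidade": bool,
}
ALLOWED_SENTIMENTS = {"positivo", "negativo", "neutro"}


def parse_batch_output_line(line: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (custom_id, labels), with labels set to None if the line is unusable."""
    record = json.loads(line)
    custom_id = record["custom_id"]
    response = record.get("response") or {}

    # Skip requests that failed at the API level
    if record.get("error") or response.get("status_code") != 200:
        return custom_id, None

    try:
        message = response["body"]["choices"][0]["message"]
        # Function-call arguments arrive as a JSON-encoded string
        labels = json.loads(message["tool_calls"][0]["function"]["arguments"])
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        return custom_id, None

    # Minimal type and value checks on the decoded labels
    if not isinstance(labels, dict):
        return custom_id, None
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(labels.get(field), expected_type):
            return custom_id, None
    if labels["sentimento"] not in ALLOWED_SENTIMENTS:
        return custom_id, None
    return custom_id, labels


# Hand-built sample line that mimics the shape of a successful batch result
sample_labels = {
    "sentimento": "neutro",
    "gordofobia_implicita": False,
    "gordofobia_explicita": False,
    "idioma": "pt",
    "obesidade": True,
}
sample_line = json.dumps({
    "custom_id": "-1DN4904BQw",
    "error": None,
    "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"tool_calls": [
            {"function": {"name": "RespostaAnaliseSentimento",
                          "arguments": json.dumps(sample_labels)}}
        ]}}]},
    },
})
print(parse_batch_output_line(sample_line))
```

Returning `None` for unusable lines, rather than raising, lets the caller count failures and decide whether to resubmit them. That is a design choice of this sketch, not something the notebook prescribes.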

## Research Context and Contributions

This notebook addresses the need for systematic analysis of weight stigma in digital media. Video titles often serve as the primary gateway for audience engagement, so analyzing them helps show how weight-related content is framed and how potentially stigmatizing messages spread in digital spaces.

The methodology contributes to computational social science research by providing a scalable framework for content analysis that can be adapted to other forms of digital discrimination research.

---

## 1. Introduction and Setup

This section establishes the computational environment and imports all necessary libraries for video title classification.

In [1]:
# Core imports for video title classification pipeline
import hashlib
import json
import os
import sys
import time
from pathlib import Path
from typing import List, Literal, Optional, Dict, Any, Tuple

# Add project root to path for local modules
sys.path.append("../")

# Data processing and analysis
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

# Environment and configuration
from dotenv import load_dotenv

# LangChain utilities for OpenAI integration
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

# Pydantic for structured data validation
from pydantic import BaseModel, Field

# File system utilities
from glob import glob
import joblib

# Custom modules for API processing
from openai_api import OpenAIBatchProcessor

# IPython utilities for notebook display
from IPython.display import clear_output

# Load environment variables
load_dotenv()

print("🚀 Video Title Classification Pipeline - Environment Setup Complete")
print("📊 Notebook 06: Zero-shot classification of YouTube video titles")
print("🎯 Focus: Weight stigma detection in video title content")
print("🔧 Processing: OpenAI GPT-4o-mini with batch API for large-scale analysis")

# Display technical specifications
print(f"\n📋 Technical Configuration:")
print(f"- Python version: {sys.version.split()[0]}")
print(f"- Working directory: {Path.cwd()}")
print(f"- Environment variables loaded: {'✅' if os.getenv('OPENAI_API_KEY') else '❌'}")

print("✅ All imports and environment setup completed successfully")

🚀 Video Title Classification Pipeline - Environment Setup Complete
📊 Notebook 06: Zero-shot classification of YouTube video titles
🎯 Focus: Weight stigma detection in video title content
🔧 Processing: OpenAI GPT-4o-mini with batch API for large-scale analysis

📋 Technical Configuration:
- Python version: 3.12.10
- Working directory: /media/nas-elias/pesquisas/papers/paper_savio_youtube/paper_youtube_weight_stigma
- Environment variables loaded: ✅
✅ All imports and environment setup completed successfully


In [2]:
## 2. Configuration and Data Loading

"""
This section defines the configuration parameters and loads the video title dataset for classification.
The configuration class centralizes all parameters for reproducible research.
"""

class TitleClassificationConfig:
    """
    Configuration class for YouTube video title classification pipeline.
    
    Centralizes all configuration parameters for reproducible research and
    provides comprehensive validation of the research environment.
    """
    
    # Model configuration for OpenAI API
    MODEL_NAME = "gpt-4.1-mini"
    TEMPERATURE = 0.0  # Minimizes sampling randomness; outputs may still vary slightly between runs
    
    # Data paths and file management
    DATA_DIR = Path("../data")
    INPUT_FILE = DATA_DIR / "intermediate" / "20250417_youtube_comments_pt_cleaned1.parquet"
    
    # Batch processing configuration
    BATCH_SIZE = 40000  # Optimized for API limits and processing efficiency
    JSONL_DIR = DATA_DIR / "intermediate" / "jsonl"
    BATCH_NAME_PREFIX = "20250417_youtube_titles_batch_api"
    
    # Output files for research data
    RESULTS_FILE = DATA_DIR / "tmp" / "results_titles.joblib"
    PARSED_RESULTS_FILE = DATA_DIR / "tmp" / "parsed_results_titles.joblib"
    FINAL_OUTPUT_FILE = DATA_DIR / "intermediate" / "20250417_youtube_titles_yes_labels.parquet"
    
    # Research parameters
    LANGUAGE_FILTER = "pt"  # Focus on Portuguese content for Brazilian research context
    
    @classmethod
    def ensure_directories(cls) -> None:
        """Create necessary directories if they don't exist."""
        directories = [cls.JSONL_DIR, cls.DATA_DIR / "tmp", cls.DATA_DIR / "intermediate"]
        
        for directory in directories:
            directory.mkdir(parents=True, exist_ok=True)
            
        print("📁 Directory structure verified and created")
    
    @classmethod
    def validate_configuration(cls) -> None:
        """Validate configuration parameters and environment."""
        print("🔍 Configuration Validation:")
        
        # Check input file existence
        if cls.INPUT_FILE.exists():
            file_size = cls.INPUT_FILE.stat().st_size / (1024 * 1024)
            print(f"  ✅ Input file found: {file_size:.2f} MB")
        else:
            print(f"  ❌ Input file missing: {cls.INPUT_FILE}")
        
        # Validate model configuration
        print(f"  ✅ Model: {cls.MODEL_NAME}")
        print(f"  ✅ Temperature: {cls.TEMPERATURE} (deterministic)")
        print(f"  ✅ Batch size: {cls.BATCH_SIZE:,} requests")
        print(f"  ✅ Language filter: {cls.LANGUAGE_FILTER}")
    
    @classmethod
    def display_config(cls) -> None:
        """Display current configuration for research documentation."""
        print("⚙️ Video Title Classification Configuration:")
        print(f"- Model: {cls.MODEL_NAME}")
        print(f"- Temperature: {cls.TEMPERATURE}")
        print(f"- Batch size: {cls.BATCH_SIZE:,}")
        print(f"- Language filter: {cls.LANGUAGE_FILTER}")
        print(f"- Input file: {cls.INPUT_FILE.name}")
        print(f"- Output file: {cls.FINAL_OUTPUT_FILE.name}")

# Initialize configuration and validate environment
TitleClassificationConfig.ensure_directories()
TitleClassificationConfig.validate_configuration()
TitleClassificationConfig.display_config()

# Load and prepare the dataset
print("\n📊 Loading video title dataset...")
df = pd.read_parquet(TitleClassificationConfig.INPUT_FILE)

# Remove duplicates to get unique videos for title analysis
print(f"📈 Original dataset: {len(df):,} records")
df.drop_duplicates(subset=["video_id"], inplace=True)
df.reset_index(drop=True, inplace=True)
print(f"📈 Unique videos: {len(df):,} records")

# Comprehensive dataset analysis
print(f"\n📋 Video Title Dataset Overview:")
print(f"- Total unique videos: {len(df):,}")
print(f"- Dataset columns: {list(df.columns)}")

if 'video_title' in df.columns:
    # Title content analysis
    title_lengths = df['video_title'].str.len()
    non_null_titles = df['video_title'].notna().sum()
    
    print(f"- Video title statistics:")
    print(f"  • Non-null titles: {non_null_titles:,} ({non_null_titles/len(df)*100:.1f}%)")
    print(f"  • Length range: {title_lengths.min()}-{title_lengths.max()} characters")
    print(f"  • Mean length: {title_lengths.mean():.1f} characters")
    print(f"  • Median length: {title_lengths.median():.1f} characters")
    
    # Sample titles for verification
    print(f"\n📝 Sample Video Titles:")
    sample_titles = df['video_title'].dropna().head(3)
    for i, title in enumerate(sample_titles, 1):
        preview = title[:80] + "..." if len(title) > 80 else title
        print(f"  {i}. {preview}")
else:
    print("⚠️ video_title column not found in dataset")

print("✅ Data loading and preparation completed")
print(f"🎯 Ready to process {len(df):,} video titles for weight stigma classification")

# Display final dataset structure
df.head()

📁 Directory structure verified and created
🔍 Configuration Validation:
  ✅ Input file found: 47.12 MB
  ✅ Model: gpt-4.1-mini
  ✅ Temperature: 0.0 (deterministic)
  ✅ Batch size: 40,000 requests
  ✅ Language filter: pt
⚙️ Video Title Classification Configuration:
- Model: gpt-4.1-mini
- Temperature: 0.0
- Batch size: 40,000
- Language filter: pt
- Input file: 20250417_youtube_comments_pt_cleaned1.parquet
- Output file: 20250417_youtube_titles_yes_labels.parquet

📊 Loading video title dataset...
📈 Original dataset: 191,946 records
📈 Unique videos: 1,204 records

📋 Video Title Dataset Overview:
- Total unique videos: 1,204
- Dataset columns: ['video_id', 'channelId', 'videoId', 'textDisplay', 'textOriginal', 'authorDisplayName', 'authorProfileImageUrl', 'authorChannelUrl', 'authorChannelId', 'canRate', 'viewerRating', 'likeCount', 'publishedAt', 'updatedAt', 'author', 'comment', 'date', 'likes', 'video_title', 'language']
- Video title statistics:
  • Non-null titles: 1,204 (100.0%)
  • 

Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,author,comment,date,likes,video_title,language
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,UCNhXx9ev5RtEiyGsVjMuTOA,True,none,0.0,2024-12-28 21:38:37+00:00,2024-12-28 21:38:37+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
1,-1DN4904BQw,UCbDy7ap3Ixk45DILe4O6Tbw,-1DN4904BQw,Aula gratuita: https://bit.ly/3RLbmWq,Aula gratuita: https://bit.ly/3RLbmWq,@sejasaudavel5167,https://yt3.ggpht.com/3Uk9AXlL4DHwwOhPTVsJIKJn...,http://www.youtube.com/@sejasaudavel5167,UCbDy7ap3Ixk45DILe4O6Tbw,True,none,1.0,2022-10-06 20:41:16+00:00,2022-10-06 20:41:16+00:00,,,,,O país mais obeso do mundo #shorts,pt
2,-4xj_teI1EQ,UCVIpR5_iHUkkpAPBkw24yDQ,-4xj_teI1EQ,Vc é linda e sua auto-estima é contagiante. Se...,Vc é linda e sua auto-estima é contagiante. Se...,@isabelitacorrea2611,https://yt3.ggpht.com/ytc/AIdro_nE2ZHEpUNJCTkX...,http://www.youtube.com/@isabelitacorrea2611,UCnxzchRu-oFH4H4SKeF2C-Q,True,none,176.0,2024-05-02 01:58:41+00:00,2024-05-02 01:58:41+00:00,,,,,Preconceitos que eu já sofri por ser uma baila...,pt
3,-6Qxw7CpQvQ,UC6cALLZLWQGilBFBB0PWAog,-6Qxw7CpQvQ,Vídeo completo: https://youtu.be/hnetjD-gje4,Vídeo completo: https://youtu.be/hnetjD-gje4,,https://yt3.ggpht.com/xXlZxbOOYCKigMGIaVKMpvi1...,http://www.youtube.com/c/MaiconK%C3%BCster,UC6cALLZLWQGilBFBB0PWAog,True,none,436.0,2023-01-26 00:25:05+00:00,2023-01-26 00:25:05+00:00,,,,,esse milionário de 18 anos não quer pegar mulh...,pt
4,-7fJRjz1BCM,UC9mdw2mmn49ZuqGOpSri7Fw,-7fJRjz1BCM,"Reunião pra saber como roubar mais, desgoverno...","Reunião pra saber como roubar mais, desgoverno...",@andersoncustodiooliveira1515,https://yt3.ggpht.com/ytc/AIdro_m5pluPevjReHKO...,http://www.youtube.com/@andersoncustodioolivei...,UCc7hcy4jmptDescrLGRHkeA,True,none,0.0,2023-06-15 21:35:26+00:00,2023-06-15 21:35:26+00:00,,,,,Lula volta a fazer piada com obesidade de Fláv...,pt


In [3]:
## 3. Data Schema and Classification Model

"""
This section defines the structured output schema for video title classification using Pydantic models.
The schema ensures consistent, validated outputs suitable for research analysis and statistical processing.
"""
class RespostaAnaliseSentimento(BaseModel):
    """A resposta de uma função que realiza análise de sentimento em texto e detecção do idioma do texto."""

    # Fields are required rather than defaulted, so a missing value fails
    # validation instead of silently becoming "" or False.
    # Docstring and descriptions stay in Portuguese because they are sent
    # to the model as part of the function schema.

    # Sentiment label assigned to the text
    sentimento: Literal["positivo", "negativo", "neutro"] = Field(
        ...,
        description="O rótulo de sentimento atribuído ao texto. Você só pode ter 'positivo', 'negativo' ou 'neutro' como valores.",
    )

    # Implicit (indirect) weight stigma
    gordofobia_implicita: bool = Field(
        ...,
        description="Se o texto contém discriminação por peso (gordofobia) de forma implícita e/ou indireta. Se não houver gordofobia, este campo deve ser False.",
    )

    # Explicit (direct) weight stigma
    gordofobia_explicita: bool = Field(
        ...,
        description="Se o texto contém discriminação por peso (gordofobia) de forma explícita e/ou direta. Se não houver gordofobia, este campo deve ser False.",
    )

    # Language detected in the text
    idioma: str = Field(
        ...,
        description="O idioma detectado no texto, representado por um código de idioma de duas letras.",
    )

    # Whether the text touches on obesity
    obesidade: bool = Field(
        ...,
        description="Se o texto toca no assunto de obesidade. Se não houver menção à obesidade, este campo deve ser False.",
    )

def validate_schema_structure() -> None:
    """Validate the Pydantic schema structure for research requirements."""
    schema = RespostaAnaliseSentimento.model_json_schema()
    
    print("🔍 Schema Validation for Video Title Classification:")
    print(f"- Model name: {RespostaAnaliseSentimento.__name__}")
    print(f"- Total fields: {len(schema['properties'])}")
    
    # Validate required fields for research completeness
    required_fields = ["sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade"]
    for field in required_fields:
        if field in schema['properties']:
            field_type = schema['properties'][field].get('type', 'unknown')
            print(f"  ✅ {field}: {field_type}")
        else:
            print(f"  ❌ {field}: MISSING")
    
    print("\n📋 Research Classification Dimensions:")
    print("- Sentiment Analysis: Emotional tone for psychological research")
    print("- Implicit Weight Stigma: Subtle discrimination patterns")
    print("- Explicit Weight Stigma: Direct discrimination language")
    print("- Language Detection: Cultural and linguistic analysis")
    print("- Obesity Content: Health communication research")
    
    print("\n🎯 Research Applications:")
    print("- Digital discrimination analysis in Brazilian Portuguese content")
    print("- Cross-platform weight stigma comparison (titles vs. comments vs. transcriptions)")
    print("- Content framing analysis in health communication")
    print("- Sentiment trends in obesity-related video content")

def create_sample_classification() -> RespostaAnaliseSentimento:
    """Create a sample classification for testing and documentation."""
    sample_response = RespostaAnaliseSentimento(
        sentimento="neutro",
        gordofobia_implicita=False,
        gordofobia_explicita=False,
        idioma="pt",
        obesidade=True
    )
    return sample_response

# Validate the schema structure
validate_schema_structure()

# Create and display sample classification
sample_response = create_sample_classification()
print(f"\n📝 Sample Classification Structure:")
print(f"Sample output: {sample_response.model_dump()}")

# Verify schema JSON format for API integration
schema_json = RespostaAnaliseSentimento.model_json_schema()
print(f"\n📊 Schema Metadata:")
print(f"- Schema format: JSON Schema compatible")
print(f"- API integration: OpenAI function calling")
print(f"- Validation: Pydantic v2 with type checking")

print("✅ Schema validation and testing completed successfully")

🔍 Schema Validation for Video Title Classification:
- Model name: RespostaAnaliseSentimento
- Total fields: 5
  ✅ sentimento: string
  ✅ gordofobia_implicita: boolean
  ✅ gordofobia_explicita: boolean
  ✅ idioma: string
  ✅ obesidade: boolean

📋 Research Classification Dimensions:
- Sentiment Analysis: Emotional tone for psychological research
- Implicit Weight Stigma: Subtle discrimination patterns
- Explicit Weight Stigma: Direct discrimination language
- Language Detection: Cultural and linguistic analysis
- Obesity Content: Health communication research

🎯 Research Applications:
- Digital discrimination analysis in Brazilian Portuguese content
- Cross-platform weight stigma comparison (titles vs. comments vs. transcriptions)
- Content framing analysis in health communication
- Sentiment trends in obesity-related video content

📝 Sample Classification Structure:
Sample output: {'sentimento': 'neutro', 'gordofobia_implicita': False, 'gordofobia_explicita': False, 'idioma': 'pt', 'obe

In [4]:
## 4. System Prompt and Classification Configuration

"""
This section defines the AI system prompt optimized for video title analysis and classification.
The prompt is specifically engineered for Brazilian Portuguese content and weight stigma research.
"""

def create_video_title_system_prompt() -> Dict[str, str]:
    """
    Create specialized system prompt for YouTube video title classification.
    
    The prompt is designed for:
    - Video title-specific content analysis (short, clickbait-style text)
    - Brazilian Portuguese cultural context and language patterns
    - Weight stigma detection in digital media
    - Research-grade classification consistency
    
    Returns:
        Dict containing the system prompt configuration
    """
    
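    # System prompt for zero-shot classification, written in Portuguese.
    # Note: the wording below refers to "comentários" (comments), although
    # this notebook sends video titles as input.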
    prompt_text = """Você é um especialista em análise de sentimento com foco em comentários relacionados a peso corporal e discriminação.

    Sua tarefa é classificar comentários do YouTube com precisão, identificando:
    1. Sentimento geral (positivo, negativo, neutro)
    2. Presença de gordofobia (discriminação por peso)
    3. Idioma do texto
    4. Menções sobre obesidade

    DIRETRIZES DE CLASSIFICAÇÃO:

    SENTIMENTO:
    - 'positivo': Comentários de apoio, encorajamento, aceitação corporal, mensagens construtivas
    - 'negativo': Críticas, julgamentos, discriminação, linguagem ofensiva, gordofobia
    - 'neutro': Comentários informativos, questões, observações sem julgamento de valor

    GORDOFOBIA:
    - Explícita: Insultos diretos, linguagem claramente discriminatória, termos pejorativos sobre peso
    - Implícita: Sugestões sutis, estereótipos, pressões indiretas relacionadas ao peso

    IDIOMA:
    - Use códigos ISO 639-1 (pt, en, es, etc.)
    - Considere o idioma predominante se houver mistura

    OBESIDADE:
    - Marque como True se o comentário menciona ou discute obesidade, mesmo que indiretamente

    CONTEXTO IMPORTANTE:
    - Considere ironia, sarcasmo e emojis no contexto
    - Analise o comentário completo, não apenas palavras isoladas
    - Comentários de apoio à diversidade corporal são positivos
    - Seja preciso na detecção de discriminação sutil

    Responda APENAS com o formato estruturado solicitado."""
    

    print("✅ System prompt configured")
    print(f"📝 Prompt length: {len(prompt_text)} characters")
    print("🎯 Classification targets: sentiment, gordofobia, language, obesity content")
    
    return {
        "role": "system",
        "content": prompt_text
    }

def validate_system_prompt(prompt: Dict[str, str]) -> None:
    """
    Validate the system prompt for research requirements and completeness.
    
    Args:
        prompt: The system prompt dictionary
    """
    print("🔍 System Prompt Validation for Video Title Classification:")
    print(f"- Prompt type: {prompt['role']}")
    print(f"- Content length: {len(prompt['content'])} characters")
    print(f"- Word count: {len(prompt['content'].split())} words")
    
    # Check for key components specific to title analysis
    key_components = [
        "análise de sentimento", 
        "gordofobia",
        "obesidade",
        "idioma",
        "YouTube",
    ]
    
    print("\n📋 Essential Components Check:")
    for component in key_components:
        if component.lower() in prompt['content'].lower():
            print(f"  ✅ {component}")
        else:
            print(f"  ❌ {component} - MISSING")
    
    print("\n🎯 Prompt Specialization for Video Titles:")
    print("- Title-specific analysis (concise, clickbait-aware)")
    print("- Brazilian/Portuguese cultural context")
    print("- Weight stigma detection (implicit/explicit patterns)")
    print("- YouTube platform-specific considerations")
    print("- Research-grade consistency requirements")
    
    print("\n📊 Research Quality Features:")
    print("- Comprehensive weight stigma detection guidelines")
    print("- Cultural sensitivity for Brazilian Portuguese content")
    print("- Clear examples and classification criteria")
    print("- Consistency guidelines for reproducible research")

# Create and validate the system prompt
SYSTEM_PROMPT = create_video_title_system_prompt()
validate_system_prompt(SYSTEM_PROMPT)

# Display prompt summary for research documentation
print(f"\n📝 System Prompt Research Summary:")
print(f"- Purpose: Video title classification for weight stigma research")
print(f"- Target content: Brazilian Portuguese YouTube video titles")
print(f"- Classification dimensions: 5 (sentiment, implicit/explicit stigma, language, obesity content)")
print(f"- Research focus: Digital discrimination in health communication")

print("✅ System prompt configuration and validation completed")

✅ System prompt configured
📝 Prompt length: 1442 characters
🎯 Classification targets: sentiment, gordofobia, language, obesity content
🔍 System Prompt Validation for Video Title Classification:
- Prompt type: system
- Content length: 1442 characters
- Word count: 179 words

📋 Essential Components Check:
  ✅ análise de sentimento
  ✅ gordofobia
  ✅ obesidade
  ✅ idioma
  ✅ YouTube

🎯 Prompt Specialization for Video Titles:
- Title-specific analysis (concise, clickbait-aware)
- Brazilian/Portuguese cultural context
- Weight stigma detection (implicit/explicit patterns)
- YouTube platform-specific considerations
- Research-grade consistency requirements

📊 Research Quality Features:
- Comprehensive weight stigma detection guidelines
- Cultural sensitivity for Brazilian Portuguese content
- Clear examples and classification criteria
- Consistency guidelines for reproducible research

📝 System Prompt Research Summary:
- Purpose: Video title classification for weight stigma research
- Target

In [5]:
## 5. Function Schema Creation and Validation

"""
This section creates the OpenAI function schema for structured video title classification.
The schema ensures consistent, validated outputs from the AI model suitable for research analysis.
"""

def create_function_schema() -> Dict[str, Any]:
    """
    Create OpenAI function schema from Pydantic model for video title classification.
    
    This function converts the RespostaAnaliseSentimento Pydantic model into the format
    required by OpenAI's function calling API, ensuring structured and validated outputs.
    
    Returns:
        Dict containing the function schema for OpenAI API integration
    """
    try:
        # Convert Pydantic model to OpenAI function format
        function_schema = convert_pydantic_to_openai_function(RespostaAnaliseSentimento)
        
        # Ensure all fields are required for consistent research outputs
        function_schema["parameters"]["required"] = list(function_schema["parameters"]["properties"].keys())
        function_schema["parameters"]["type"] = "object"
        
        print("Function schema created successfully for video title analysis")
        print(f"Function name: {function_schema['name']}")
        print(f"Required parameters: {function_schema['parameters']['required']}")
        
        return function_schema
        
    except Exception as e:
        print(f"Error creating function schema: {e}")
        raise


In [6]:

def validate_function_schema(schema: Dict[str, Any]) -> None:
    """
    Comprehensive validation of the function schema structure and research requirements.
    
    Args:
        schema: The function schema to validate
    """
    print("🔍 Comprehensive Function Schema Validation:")
    
    # Check basic structure
    required_keys = ["name", "description", "parameters"]
    for key in required_keys:
        if key in schema:
            print(f"{key}: Present")
        else:
            print(f"{key}: MISSING - Critical error")
    
    # Detailed parameters validation
    if "parameters" in schema:
        params = schema["parameters"]
        print(f"\nParameters Structure Analysis:")
        print(f"- Parameter type: {params.get('type', 'Not specified')}")
        print(f"- Properties count: {len(params.get('properties', {}))}")
        print(f"- Required fields count: {len(params.get('required', []))}")
        
        # Validate research requirements
        expected_fields = ["sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade"]
        properties = params.get('properties', {})
        
        print(f"\n🔬 Research Field Validation:")
        for field in expected_fields:
            if field in properties:
                field_type = properties[field].get('type', 'unknown')
                print(f"{field}: {field_type}")
            else:
                print(f"{field}: MISSING - Critical for research")
        
        # Validate all properties are required (ensures data completeness)
        props_count = len(properties)
        required_count = len(params.get('required', []))
        
        if props_count == required_count:
            print(f"\n Data Completeness: All {props_count} properties are required")
            print("   This ensures consistent data for statistical analysis")
        else:
            print(f"\n Data Completeness Warning: {props_count} properties, {required_count} required")
            print("   This may result in incomplete data for research")
    
    print(f"\n Schema Integration Analysis:")
    print("- API Integration: OpenAI function calling")
    print("- Data Validation: Pydantic model validation")
    print("- Research Application: Video title classification")
    print("- Output Format: Structured JSON for statistical analysis")


In [7]:

def test_schema_functionality(schema: Dict[str, Any]) -> None:
    """
    Test the schema functionality with sample data.
    
    Args:
        schema: The function schema to test
    """
    print("\n Schema Functionality Testing:")
    
    # Test schema structure
    assert "name" in schema, "Schema missing function name"
    assert "parameters" in schema, "Schema missing parameters"
    assert "properties" in schema["parameters"], "Schema missing properties"
    
    # Test required fields match expected research requirements
    required_fields = schema["parameters"].get("required", [])
    expected_count = 5

    # Fail loudly on a mismatch instead of reporting success afterwards
    assert len(required_fields) == expected_count, (
        f"Required fields mismatch: {len(required_fields)}/{expected_count}"
    )
    print(f" Required fields count: {len(required_fields)}/{expected_count}")

    print(" Schema functionality tests passed")


In [8]:

# Create the function schema
function_schema = create_function_schema()


Function schema created successfully for video title analysis
Function name: RespostaAnaliseSentimento
Required parameters: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']




In [9]:

# Comprehensive validation
validate_function_schema(function_schema)


🔍 Comprehensive Function Schema Validation:
name: Present
description: Present
parameters: Present

Parameters Structure Analysis:
- Parameter type: object
- Properties count: 5
- Required fields count: 5

🔬 Research Field Validation:
sentimento: string
gordofobia_implicita: boolean
gordofobia_explicita: boolean
idioma: string
obesidade: boolean

 Data Completeness: All 5 properties are required
   This ensures consistent data for statistical analysis

 Schema Integration Analysis:
- API Integration: OpenAI function calling
- Data Validation: Pydantic model validation
- Research Application: Video title classification
- Output Format: Structured JSON for statistical analysis


In [10]:

# Test functionality
test_schema_functionality(function_schema)



 Schema Functionality Testing:
 Required fields count: 5/5
 Schema functionality tests passed


In [11]:

# Display schema summary for research documentation
print(f"\nFunction Schema Research Summary:")
print(f"- Function name: {function_schema['name']}")
print(f"- Description: {function_schema['description']}")
print(f"- Classification fields: {len(function_schema['parameters']['properties'])}")
print(f"- Required fields: {len(function_schema['parameters']['required'])}")


Function Schema Research Summary:
- Function name: RespostaAnaliseSentimento
- Description: A resposta de uma função que realiza análise de sentimento em texto e detecção do idioma do texto.
- Classification fields: 5
- Required fields: 5


In [12]:
## 6. Data Preparation and Input Processing

"""
This section prepares video titles for batch classification processing with comprehensive
data validation and quality assessment for research applications.
"""

def analyze_title_characteristics(df: pd.DataFrame) -> None:
    """
    Analyze characteristics of video title data for research documentation.
    
    Args:
        df: DataFrame containing video title data
    """
    print("🔍 Analyzing video title characteristics for research...")
    
    if 'video_title' in df.columns:
        titles = df['video_title'].dropna()
        
        # Length analysis for research insights
        lengths = titles.str.len()
        print(f"📊 Title length analysis:")
        print(f"- Total titles: {len(titles):,}")
        print(f"- Min length: {lengths.min():,} characters")
        print(f"- Max length: {lengths.max():,} characters")
        print(f"- Mean length: {lengths.mean():.1f} characters")
        print(f"- Median length: {lengths.median():.1f} characters")
        print(f"- Standard deviation: {lengths.std():.1f} characters")
        
        # Title content patterns analysis
        print(f"\n📈 Content pattern analysis:")
        # Check for common YouTube title patterns
        exclamation_count = titles.str.contains('!', na=False).sum()
        question_count = titles.str.contains('\\?', na=False).sum()
        caps_count = titles.str.contains('[A-Z]{3,}', na=False).sum()
        number_count = titles.str.contains('\\d+', na=False).sum()
        
        print(f"- Titles with exclamation marks: {exclamation_count:,} ({exclamation_count/len(titles)*100:.1f}%)")
        print(f"- Titles with questions: {question_count:,} ({question_count/len(titles)*100:.1f}%)")
        print(f"- Titles with caps (3+ chars): {caps_count:,} ({caps_count/len(titles)*100:.1f}%)")
        print(f"- Titles with numbers: {number_count:,} ({number_count/len(titles)*100:.1f}%)")
        
        # Data quality assessment
        empty_titles = df['video_title'].isna().sum()
        very_short_titles = (lengths < 10).sum()
        very_long_titles = (lengths > 100).sum()
        
        print(f"\n⚠️ Data quality assessment:")
        print(f"- Empty titles: {empty_titles:,}")
        print(f"- Very short titles (<10 chars): {very_short_titles:,}")
        print(f"- Very long titles (>100 chars): {very_long_titles:,}")
        
        if very_short_titles > 0:
            print("ℹ️ Very short titles may indicate data quality issues")
        if very_long_titles > 0:
            print("ℹ️ Very long titles may require special handling for API limits")


In [13]:

def prepare_video_titles(df: pd.DataFrame) -> List[str]:
    """
    Extract and prepare video titles for classification with comprehensive validation.
    
    Args:
        df: DataFrame containing video data
        
    Returns:
        List of video titles ready for processing
    """
    print(f"📝 Preparing {len(df):,} video titles for classification...")
    
    # Extract video titles with validation
    if 'video_title' not in df.columns:
        raise ValueError("video_title column not found in dataset")
    
    # Convert to list and handle missing values
    input_texts = df.video_title.fillna("").values.tolist()
    
    # Quality assessment and statistics
    non_empty_titles = sum(1 for title in input_texts if title and title.strip())
    empty_titles = len(input_texts) - non_empty_titles
    
    print(f"📊 Title Preparation Statistics:")
    print(f"- Total titles processed: {len(input_texts):,}")
    print(f"- Non-empty titles: {non_empty_titles:,} ({non_empty_titles/len(input_texts)*100:.1f}%)")
    print(f"- Empty/missing titles: {empty_titles:,} ({empty_titles/len(input_texts)*100:.1f}%)")
    
    if empty_titles > 0:
        print(f"⚠️ Found {empty_titles} empty titles - these will be processed as empty strings")
    
    # Title length analysis for API planning
    valid_titles = [title for title in input_texts if title and title.strip()]
    if valid_titles:
        title_lengths = [len(title) for title in valid_titles]
        avg_length = sum(title_lengths) / len(title_lengths)
        max_length = max(title_lengths)
        min_length = min(title_lengths)
        
        print(f"\n📏 Title Length Statistics:")
        print(f"- Average length: {avg_length:.1f} characters")
        print(f"- Length range: {min_length}-{max_length} characters")
        
        # API considerations
        long_titles = sum(1 for length in title_lengths if length > 200)
        if long_titles > 0:
            print(f"ℹ️ Found {long_titles} titles >200 chars - may need truncation for API efficiency")
    
    # Sample titles for research documentation
    sample_titles = [title for title in input_texts[:5] if title and title.strip()]
    print(f"\n📋 Sample Video Titles for Review:")
    for i, title in enumerate(sample_titles, 1):
        preview = title[:100] + "..." if len(title) > 100 else title
        print(f"  {i}. {preview}")
    
    print(f"\n✅ Successfully prepared {len(input_texts):,} video titles for classification")
    print("🎯 Ready for batch request creation and API processing")
    
    return input_texts


In [14]:

def validate_prepared_data(input_texts: List[str], original_df: pd.DataFrame) -> None:
    """
    Validate prepared data for research integrity.
    
    Args:
        input_texts: Prepared list of titles
        original_df: Original DataFrame
    """
    print("🔍 Validating prepared data for research integrity...")
    
    # Check data consistency
    if len(input_texts) != len(original_df):
        print(f"❌ Data length mismatch: {len(input_texts)} titles vs {len(original_df)} records")
        raise ValueError("Data preparation failed: length mismatch")
    else:
        print(f"✅ Data consistency: {len(input_texts):,} titles match dataset records")
    
    # Check for data type consistency
    text_types = [type(text).__name__ for text in input_texts[:100]]  # Sample check
    if all(t == 'str' for t in text_types):
        print("✅ Data types: All titles are strings")
    else:
        print("⚠️ Data types: Mixed types detected - may cause API issues")
    
    print("✅ Data validation completed - ready for API processing")


In [15]:

# Analyze title characteristics for research documentation
analyze_title_characteristics(df)


🔍 Analyzing video title characteristics for research...
📊 Title length analysis:
- Total titles: 1,204
- Min length: 10 characters
- Max length: 100 characters
- Mean length: 63.3 characters
- Median length: 64.0 characters
- Standard deviation: 23.3 characters

📈 Content pattern analysis:
- Titles with exclamation marks: 199 (16.5%)
- Titles with questions: 163 (13.5%)
- Titles with caps (3+ chars): 433 (36.0%)
- Titles with numbers: 318 (26.4%)

⚠️ Data quality assessment:
- Empty titles: 0
- Very short titles (<10 chars): 0
- Very long titles (>100 chars): 0
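
The caps check in `analyze_title_characteristics` uses the ASCII-only pattern `[A-Z]{3,}`, so an accented capital such as `Ã` or `É` interrupts a run, and Portuguese words like `NÃO` are not counted. The sketch below shows one possible Unicode-aware alternative built on `str.isupper`. The helper name `has_uppercase_run` and the sample titles are illustrative assumptions, not part of the notebook.

```python
import re
import unicodedata
from itertools import groupby


def has_uppercase_run(title: str, min_len: int = 3) -> bool:
    """Return True if the title contains at least min_len consecutive uppercase letters."""
    # Normalise to NFC so that "A" + combining tilde is treated as the single letter "Ã".
    normalized = unicodedata.normalize("NFC", title)
    # groupby collapses consecutive characters that share the same isupper() value.
    return any(
        is_upper and sum(1 for _ in run) >= min_len
        for is_upper, run in groupby(normalized, key=str.isupper)
    )


# Hypothetical titles used only to contrast the two approaches.
titles = ["NÃO É verdade", "A VERDADE sobre dietas", "tudo em minúsculas", "É TÃO fácil"]

flags = [has_uppercase_run(t) for t in titles]
ascii_flags = [bool(re.search(r"[A-Z]{3,}", t)) for t in titles]

for title, unicode_hit, ascii_hit in zip(titles, flags, ascii_flags):
    print(f"{title!r}: unicode-aware={unicode_hit}, ascii-only={ascii_hit}")
```

If the share of all-caps titles is used as a feature in later analysis, the ASCII-only count is probably best read as a lower bound for Brazilian Portuguese content.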


In [16]:

# Prepare the input data with comprehensive validation
input_texts = prepare_video_titles(df)


📝 Preparing 1,204 video titles for classification...
📊 Title Preparation Statistics:
- Total titles processed: 1,204
- Non-empty titles: 1,204 (100.0%)
- Empty/missing titles: 0 (0.0%)

📏 Title Length Statistics:
- Average length: 63.3 characters
- Length range: 10-100 characters

📋 Sample Video Titles for Review:
  1. Tony Gordo é Incriminado #simpsons
  2. O país mais obeso do mundo #shorts
  3. Preconceitos que eu já sofri por ser uma bailarina gorda.
  4. esse milionário de 18 anos não quer pegar mulher gorda
  5. Lula volta a fazer piada com obesidade de Flávio Dino: “Traz pouca comida para ele”

✅ Successfully prepared 1,204 video titles for classification
🎯 Ready for batch request creation and API processing


In [17]:

# Validate prepared data
validate_prepared_data(input_texts, df)

🔍 Validating prepared data for research integrity...
✅ Data consistency: 1,204 titles match dataset records
✅ Data types: All titles are strings
✅ Data validation completed - ready for API processing
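
`validate_prepared_data` inspects only the first 100 entries when checking types. For a dataset of this size a full scan is cheap, and reporting the offending positions makes a failure easier to trace. The following is a minimal sketch of that idea. `find_non_string_indices` is a hypothetical helper and the sample list is made up for illustration.

```python
from typing import Any, List, Sequence


def find_non_string_indices(values: Sequence[Any]) -> List[int]:
    """Return the positions of every element that is not a str."""
    return [i for i, value in enumerate(values) if not isinstance(value, str)]


# Illustrative input: two valid titles, one empty string, and two invalid entries.
sample = ["Título A", "", None, 42, "Título B"]

bad_positions = find_non_string_indices(sample)
if bad_positions:
    print(f"Non-string entries at positions: {bad_positions}")
else:
    print("All entries are strings")
```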


In [18]:
## 7. Batch Request Creation and Processing

"""
This section creates batch API requests for efficient large-scale video title classification.
The implementation optimizes for research reproducibility and cost-effective processing.
"""

def create_batch_requests_for_titles(texts: List[str], df: pd.DataFrame) -> List[Dict[str, Any]]:
    """
    Create batch API requests for video title classification with research-grade validation.
    
    Args:
        texts: List of video titles to classify
        df: Original DataFrame for generating unique IDs and validation
        
    Returns:
        List of API request objects ready for batch processing
    """
    print(f"🔧 Creating batch API requests for {len(texts):,} video titles...")
    
    if len(texts) != len(df):
        raise ValueError(f"Data mismatch: {len(texts)} texts vs {len(df)} DataFrame records")
    
    jsonl_data = []
    skipped_count = 0
    
    for idx, text in enumerate(tqdm(texts, desc="Creating title classification requests")):
        try:
            # Handle empty or invalid titles
            if not text or not text.strip():
                text = "[EMPTY_TITLE]"  # Placeholder for empty titles
                skipped_count += 1
            
            # Create unique identifier using title content and video metadata
            custom_uid = f"title_{df.video_id.iloc[idx]}_{idx}_{text[:50]}"
            request_id = hashlib.md5(custom_uid.encode()).hexdigest()
            
            # Create API request structure optimized for title classification
            request_data = {
                "custom_id": request_id,
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": TitleClassificationConfig.MODEL_NAME,
                    "temperature": TitleClassificationConfig.TEMPERATURE,
                    "messages": [
                        SYSTEM_PROMPT,
                        # Python str is already Unicode; an encode/decode round trip is a no-op
                        {"role": "user", "content": text}
                    ],
                    "parallel_tool_calls": False,
                    "tools": [{"type": "function", "function": function_schema}],
                    "tool_choice": {
                        "type": "function",
                        "function": {"name": function_schema["name"]}
                    }
                }
            }
            jsonl_data.append(request_data)
            
        except Exception as e:
            print(f"⚠️ Error creating request for index {idx}: {e}")
            continue
    
    print(f"✅ Created {len(jsonl_data):,} API requests for video titles")
    if skipped_count > 0:
        print(f"ℹ️ Processed {skipped_count} empty titles with placeholder text")
    
    return jsonl_data
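
The `custom_id` built in `create_batch_requests_for_titles` is what allows each result to be matched back to its source row. The OpenAI Batch API documentation states that output lines may not arrive in input order, so joining on `custom_id` is safer than relying on position. The sketch below mirrors the ID scheme and realigns a simulated output. The video IDs, the argument values, and the helpers `make_custom_id`, `align_results`, and `fake_output_line` are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, List, Optional


def make_custom_id(video_id: str, idx: int, title: str) -> str:
    """Mirror the ID scheme used when the requests are created."""
    return hashlib.md5(f"title_{video_id}_{idx}_{title[:50]}".encode()).hexdigest()


def align_results(custom_ids: List[str], output_lines: List[str]) -> List[Optional[dict]]:
    """Return one parsed result per request, in request order (None when missing)."""
    by_id: Dict[str, dict] = {}
    for line in output_lines:
        record = json.loads(line)
        try:
            call = record["response"]["body"]["choices"][0]["message"]["tool_calls"][0]
            by_id[record["custom_id"]] = json.loads(call["function"]["arguments"])
        except (KeyError, IndexError, TypeError, json.JSONDecodeError):
            continue  # malformed or failed request: leave it unmatched
    return [by_id.get(custom_id) for custom_id in custom_ids]


def fake_output_line(custom_id: str, arguments: dict) -> str:
    """Build a line shaped like a Batch API output record (simulated data)."""
    call = {"function": {"name": "RespostaAnaliseSentimento",
                         "arguments": json.dumps(arguments)}}
    body = {"choices": [{"message": {"tool_calls": [call]}}]}
    return json.dumps({"custom_id": custom_id,
                       "response": {"status_code": 200, "body": body},
                       "error": None})


# Hypothetical video IDs; the titles come from the sample shown earlier in the notebook.
ids = [
    make_custom_id("vid_a", 0, "Tony Gordo é Incriminado #simpsons"),
    make_custom_id("vid_b", 1, "O país mais obeso do mundo #shorts"),
    make_custom_id("vid_c", 2, "Preconceitos que eu já sofri por ser uma bailarina gorda."),
]
# Simulated output: reversed order, and the third request has no result at all.
lines = [
    fake_output_line(ids[1], {"idioma": "pt", "obesidade": True}),
    fake_output_line(ids[0], {"idioma": "pt", "obesidade": False}),
]

aligned = align_results(ids, lines)
print(aligned)
```

Keeping the list of generated IDs next to the DataFrame (for example as an extra column) would make this join possible after the batch completes.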


In [19]:

def split_into_batches(data: List[Dict[str, Any]], batch_size: int) -> List[List[Dict[str, Any]]]:
    """
    Split requests into optimally-sized batches for API processing.
    
    Args:
        data: List of API requests
        batch_size: Maximum requests per batch (API and cost optimization)
        
    Returns:
        List of batches ready for API submission
    """
    print(f"📦 Splitting {len(data):,} title requests into batches...")
    print(f"🎯 Target batch size: {batch_size:,} requests per batch")
    
    # Create batches
    chunks = [data[x:x + batch_size] for x in range(0, len(data), batch_size)]
    
    print(f"✅ Created {len(chunks)} batch(es) for video title processing")
    
    # Display batch statistics for research documentation
    total_requests = 0
    for i, chunk in enumerate(chunks):
        chunk_size = len(chunk)
        total_requests += chunk_size
        print(f"  - Batch {i+1}: {chunk_size:,} requests")
    
    print(f"\n📊 Batch Processing Summary:")
    print(f"- Total requests: {total_requests:,}")
    print(f"- Number of batches: {len(chunks)}")
    print(f"- Average batch size: {total_requests/len(chunks):.0f} requests")
    print(f"- Estimated processing time: {len(chunks) * 2:.0f}-{len(chunks) * 10:.0f} minutes")
    
    return chunks
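
`split_into_batches` splits on request count only. The Batch API also caps the size of the input file, which matters little for short titles but could matter if the same helper is reused for longer texts. The sketch below splits on both dimensions. The limits of 50,000 requests and 200 MB reflect the documented values as understood at the time of writing and should be verified against the current documentation. `chunk_requests` is a hypothetical helper, not the notebook's function.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_REQUESTS = 50_000            # assumed per-batch request limit
MAX_BYTES = 200 * 1024 * 1024    # assumed per-file size limit


def line_size(request: Dict[str, Any]) -> int:
    """Bytes the request occupies as one UTF-8 JSONL line, newline included."""
    return len(json.dumps(request, ensure_ascii=False).encode("utf-8")) + 1


def chunk_requests(
    requests: Iterable[Dict[str, Any]],
    max_requests: int = MAX_REQUESTS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches that respect both the request-count and byte-size limits."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for request in requests:
        size = line_size(request)
        # Close the current batch if adding this request would break either limit.
        if batch and (len(batch) >= max_requests or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(request)
        batch_bytes += size
    if batch:
        yield batch


# Toy requests and deliberately small limits so the splitting is visible.
requests = [{"custom_id": str(i), "body": {"text": "x" * 10}} for i in range(7)]
by_count = list(chunk_requests(requests, max_requests=3))
by_size = list(chunk_requests(requests, max_requests=100, max_bytes=120))

print([len(batch) for batch in by_count])
print([len(batch) for batch in by_size])
```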


In [20]:

def validate_batch_structure(chunks: List[List[Dict[str, Any]]]) -> None:
    """
    Validate batch structure for research quality assurance.
    
    Args:
        chunks: List of batch chunks to validate
    """
    print("🔍 Validating batch structure for research quality...")
    
    total_requests = sum(len(chunk) for chunk in chunks)
    print(f"📊 Batch Validation Results:")
    print(f"- Total batches: {len(chunks)}")
    print(f"- Total requests: {total_requests:,}")
    
    # Validate each batch structure
    for i, chunk in enumerate(chunks):
        sample_request = chunk[0] if chunk else None
        if sample_request:
            required_keys = ["custom_id", "method", "url", "body"]
            missing_keys = [key for key in required_keys if key not in sample_request]
            
            if missing_keys:
                print(f"❌ Batch {i+1}: Missing keys {missing_keys}")
            else:
                print(f"✅ Batch {i+1}: Structure valid ({len(chunk):,} requests)")
        else:
            print(f"⚠️ Batch {i+1}: Empty batch")
    
    print("✅ Batch validation completed")


In [21]:

# Create the batch requests with comprehensive validation
jsonl_data = create_batch_requests_for_titles(input_texts, df)

# Split into optimally-sized batches
chunks = split_into_batches(jsonl_data, TitleClassificationConfig.BATCH_SIZE)

# Validate batch structure
validate_batch_structure(chunks)

# Research documentation summary
print(f"\n📋 Video Title Batch Processing Configuration:")
print(f"- Content type: YouTube video titles (short text)")
print(f"- Model: {TitleClassificationConfig.MODEL_NAME}")
print(f"- Temperature: {TitleClassificationConfig.TEMPERATURE} (deterministic)")
print(f"- Total requests: {len(jsonl_data):,}")
print(f"- Batch configuration: {len(chunks)} batches")
print(f"- Research focus: Weight stigma detection in title content")

print(f"\n🎯 Ready for batch file export and API submission")
print("📝 Next step: Export batch files and submit to OpenAI API")

🔧 Creating batch API requests for 1,204 video titles...


✅ Created 1,204 API requests for video titles
📦 Splitting 1,204 title requests into batches...
🎯 Target batch size: 40,000 requests per batch
✅ Created 1 batch(es) for video title processing
  - Batch 1: 1,204 requests

📊 Batch Processing Summary:
- Total requests: 1,204
- Number of batches: 1
- Average batch size: 1204 requests
- Estimated processing time: 2-10 minutes
🔍 Validating batch structure for research quality...
📊 Batch Validation Results:
- Total batches: 1
- Total requests: 1,204
✅ Batch 1: Structure valid (1,204 requests)
✅ Batch validation completed

📋 Video Title Batch Processing Configuration:
- Content type: YouTube video titles (short text)
- Model: gpt-4.1-mini
- Temperature: 0.0 (deterministic)
- Total requests: 1,204
- Batch configuration: 1 batches
- Research focus: Weight stigma detection in title content

🎯 Ready for batch file export and API submission
📝 Next step: Export batch files and submit to OpenAI API


In [22]:
## 8. Batch File Export and API Submission

"""
This section handles the export of batch files and submission to OpenAI's batch API
for efficient large-scale video title classification processing.
"""

def export_batch_files(chunks: List[List[Dict[str, Any]]], base_filename: str) -> List[str]:
    """
    Export batch requests to JSONL files for API processing with comprehensive validation.
    
    Args:
        chunks: List of batch chunks to export
        base_filename: Base filename for the batch files
        
    Returns:
        List of created file paths for API submission
    """
    print(f"💾 Exporting {len(chunks)} batch files for video title classification...")
    
    created_files = []
    total_size_mb = 0
    
    for idx, chunk in enumerate(chunks):
        filename = f"{base_filename}_{idx}.jsonl"
        filepath = TitleClassificationConfig.JSONL_DIR / filename
        
        try:
            with open(filepath, "w", encoding="utf-8") as f:
                for item in chunk:
                    # Ensure proper JSON formatting
                    json_line = json.dumps(item, ensure_ascii=False, separators=(',', ':'))
                    f.write(json_line + "\n")
            
            # Verify file creation and calculate statistics
            file_size_mb = filepath.stat().st_size / (1024 * 1024)
            total_size_mb += file_size_mb
            
            print(f"✅ Created {filename}: {file_size_mb:.2f} MB ({len(chunk):,} requests)")
            created_files.append(str(filepath))
            
        except Exception as e:
            print(f"❌ Error creating {filename}: {e}")
            raise
    
    print(f"\n📊 Batch File Export Summary:")
    print(f"- Total files created: {len(created_files)}")
    print(f"- Total file size: {total_size_mb:.2f} MB")
    print(f"- Average file size: {total_size_mb/len(created_files):.2f} MB")
    print(f"- Export directory: {TitleClassificationConfig.JSONL_DIR}")
    
    return created_files

def get_file_hash(file_path: str) -> str:
    """
    Generate MD5 hash of a file for unique batch naming and tracking.
    
    Args:
        file_path: Path to the file
        
    Returns:
        MD5 hash string for unique identification
    """
    with open(file_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def initialize_batch_processors(file_paths: List[str]) -> Dict[str, OpenAIBatchProcessor]:
    """
    Initialize and submit batch jobs for video title processing with comprehensive tracking.
    
    Args:
        file_paths: List of JSONL file paths to process
        
    Returns:
        Dictionary mapping file paths to batch processors for monitoring
    """
    print(f"🚀 Initializing batch processors for {len(file_paths)} video title files...")
    
    batch_processors = {}
    successful_submissions = 0
    
    for file_path in tqdm(file_paths, desc="Submitting title classification batches"):
        try:
            # Create processor and submit job
            processor = OpenAIBatchProcessor()
            batch_name = f"titles_{get_file_hash(file_path)[:8]}"
            
            processor.submit_batch_job(
                input_jsonl_path=file_path,
                batch_name=batch_name
            )
            
            batch_processors[file_path] = processor
            successful_submissions += 1
            
            filename = Path(file_path).name
            print(f"✅ Submitted batch: {filename} (ID: {batch_name})")
            
        except Exception as e:
            print(f"❌ Error submitting batch for {Path(file_path).name}: {e}")
            # Continue with other files rather than failing completely
            continue
    
    print(f"\n📊 Batch Submission Summary:")
    print(f"- Total files processed: {len(file_paths)}")
    print(f"- Successful submissions: {successful_submissions}")
    print(f"- Failed submissions: {len(file_paths) - successful_submissions}")
    print(f"- Active batch processors: {len(batch_processors)}")
    
    if successful_submissions < len(file_paths):
        print("⚠️ Some batch submissions failed - check error messages above")
    else:
        print("✅ All video title batches submitted successfully")
    
    return batch_processors

def monitor_initial_batch_status(processors: Dict[str, OpenAIBatchProcessor]) -> None:
    """
    Monitor initial status of all submitted batch jobs.
    
    Args:
        processors: Dictionary of batch processors to monitor
    """
    print("📊 Checking initial video title batch status...")
    
    status_summary = {}
    
    for file_path, processor in processors.items():
        try:
            batch_info = processor.get_batch_info()
            filename = Path(file_path).name
            status = batch_info.status
            
            print(f"- {filename}: {status}")
            
            # Track status for summary
            status_summary[status] = status_summary.get(status, 0) + 1
            
            # Show progress if available
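            if hasattr(batch_info, 'request_counts') and batch_info.request_counts:
                counts = batch_info.request_counts
                # request_counts may be a dict or an attribute-style object, depending on
                # what the processor wrapper returns, so support both access patterns
                if isinstance(counts, dict):
                    total, completed = counts.get('total', 0), counts.get('completed', 0)
                else:
                    total = getattr(counts, 'total', 0) or 0
                    completed = getattr(counts, 'completed', 0) or 0
                if total > 0:
                    progress = completed / total * 100
                    print(f"  📈 Progress: {completed}/{total} titles ({progress:.1f}%)")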
            
        except Exception as e:
            print(f"❌ Error checking status for {Path(file_path).name}: {e}")
            status_summary['error'] = status_summary.get('error', 0) + 1
    
    print(f"\n📊 Batch Status Summary:")
    for status, count in status_summary.items():
        print(f"- {status}: {count} batch(es)")

# Export batch files for video title classification
base_filename = TitleClassificationConfig.BATCH_NAME_PREFIX
created_files = export_batch_files(chunks, base_filename)

print(f"\n📁 Video Title Batch Files Ready:")
for file_path in created_files:
    file_size = Path(file_path).stat().st_size / (1024 * 1024)
    # Count lines inside a context manager so the file handle is closed promptly
    with open(file_path, "r", encoding="utf-8") as f:
        request_count = sum(1 for _ in f)
    print(f"- {Path(file_path).name}: {file_size:.2f} MB ({request_count:,} requests)")

# Initialize batch processors and submit to OpenAI
if created_files:
    print(f"\n🎯 Submitting {len(created_files)} batch files to OpenAI API...")
    batch_processors = initialize_batch_processors(created_files)
    
    # Monitor initial status
    if batch_processors:
        monitor_initial_batch_status(batch_processors)
    else:
        print("❌ No batch processors created - check API credentials and file format")
else:
    print("⚠️ No batch files found. Please check batch creation process.")
    batch_processors = {}

💾 Exporting 1 batch files for video title classification...
✅ Created 20250417_youtube_titles_batch_api_0.jsonl: 3.60 MB (1,204 requests)

📊 Batch File Export Summary:
- Total files created: 1
- Total file size: 3.60 MB
- Average file size: 3.60 MB
- Export directory: ../data/intermediate/jsonl

📁 Video Title Batch Files Ready:
- 20250417_youtube_titles_batch_api_0.jsonl: 3.60 MB (1,204 requests)

🎯 Submitting 1 batch files to OpenAI API...
🚀 Initializing batch processors for 1 video title files...


Successfully submitted batch titles_e2e74fd9 with id batch_6883b30ad0e88190a2ca45f321c264f2
Batch info saved to ../data/intermediate/jsonl/20250417_youtube_titles_batch_api_0_20250725_133835.txt
✅ Submitted batch: 20250417_youtube_titles_batch_api_0.jsonl (ID: titles_e2e74fd9)

📊 Batch Submission Summary:
- Total files processed: 1
- Successful submissions: 1
- Failed submissions: 0
- Active batch processors: 1
✅ All video title batches submitted successfully
📊 Checking initial video title batch status...
- 20250417_youtube_titles_batch_api_0.jsonl: validating

📊 Batch Status Summary:
- validating: 1 batch(es)
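
`export_batch_files` writes each request with `ensure_ascii=False`, so accented Portuguese characters are stored as UTF-8 rather than as `\u` escapes. Before submitting, it can be worth confirming that a file reads back to exactly the records that were written, one record per line. A minimal round-trip check might look like the following. `write_jsonl`, `read_jsonl`, and the two sample records are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl(records: List[Dict[str, Any]], path: Path) -> None:
    """Write one compact JSON object per line, keeping non-ASCII characters as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records; the second contains an embedded newline, which json.dumps
# escapes, so it still occupies a single line in the file.
records = [
    {"custom_id": "a1", "body": {"messages": [{"role": "user", "content": "Preconceitos que eu já sofri"}]}},
    {"custom_id": "a2", "body": {"messages": [{"role": "user", "content": "linha 1\nlinha 2"}]}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "roundtrip_check.jsonl"
    write_jsonl(records, path)
    restored = read_jsonl(path)
    raw_text = path.read_text(encoding="utf-8")

round_trip_ok = restored == records
print(f"Round trip OK: {round_trip_ok}, lines written: {raw_text.count(chr(10))}")
```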


In [23]:
## 9. Batch Monitoring and Completion

"""
This section handles batch job monitoring and completion tracking for video title classification.
"""

def wait_for_completion(processors: Dict[str, OpenAIBatchProcessor], check_interval: int = 60) -> None:
    """
    Wait for all video title batch jobs to complete with periodic status updates.
    
    Args:
        processors: Dictionary of batch processors to monitor
        check_interval: Seconds between status checks
    """
    if not processors:
        print("⚠️ No batch processors to monitor")
        return
        
    print(f"⏳ Waiting for video title batch completion (checking every {check_interval}s)...")
    print("ℹ️ Video titles typically process faster than comments or transcriptions due to shorter content")
    
    start_time = time.time()
    
    while True:
        try:
            # Check completion status
            statuses = []
            progress_info = []
            
            for file_path, processor in processors.items():
                batch_info = processor.get_batch_info()
                status = batch_info.status
                statuses.append(status)
                
                filename = Path(file_path).name
                progress_info.append(f"- {filename}: {status}")
            
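            # Calculate completion metrics
            # "expired" and "cancelled" are also terminal Batch API states; counting them
            # as failures keeps the loop from waiting forever on a batch that will never finish
            completed_count = sum(1 for status in statuses if status == "completed")
            failed_count = sum(1 for status in statuses if status in ("failed", "expired", "cancelled"))
            total_count = len(statuses)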
            
            # Clear output and show current status
            clear_output(wait=True)
            elapsed_time = (time.time() - start_time) / 60
            
            print(f"🔄 Video Title Batch Processing Status ({elapsed_time:.1f} min elapsed):")
            print(f"Completed: {completed_count}/{total_count} | Failed: {failed_count} | In Progress: {total_count - completed_count - failed_count}")
            print()
            
            # Show detailed status
            for info in progress_info:
                print(info)
            
            # Check if all completed or failed
            if completed_count + failed_count == total_count:
                print(f"\n✅ All video title batches finished!")
                print(f"- Completed successfully: {completed_count}")
                print(f"- Failed: {failed_count}")
                print(f"- Total processing time: {elapsed_time:.1f} minutes")
                break
            
            # Wait before next check
            time.sleep(check_interval)
            
        except KeyboardInterrupt:
            print("\n⚠️ Monitoring interrupted by user")
            break
        except Exception as e:
            print(f"❌ Error during monitoring: {e}")
            break

def check_final_completion_status(processors: Dict[str, OpenAIBatchProcessor]) -> Dict[str, int]:
    """
    Check final completion status of all video title batches.
    
    Args:
        processors: Dictionary of batch processors
        
    Returns:
        Dictionary with completion statistics
    """
    print("🔍 Final video title batch completion check:")
    
    completed_batches = 0
    failed_batches = 0
    other_status = 0
    
    for file_path, processor in processors.items():
        # Resolve the filename before the try block so the except handler can always use it
        filename = Path(file_path).name
        try:
            batch_info = processor.get_batch_info()
            status = batch_info.status
            
            print(f"- {filename}: {status}")
            
            if status == "completed":
                completed_batches += 1
                print(f"  ✅ Ready for result processing")
            elif status == "failed":
                failed_batches += 1
                print(f"  ❌ Batch failed - check error details")
            else:
                other_status += 1
                print(f"  ⏳ Status: {status}")
                
        except Exception as e:
            print(f"❌ Error checking {filename}: {e}")
            failed_batches += 1
    
    total_batches = len(processors)
    success_rate = completed_batches / total_batches * 100 if total_batches > 0 else 0
    
    print(f"\n📊 Video Title Processing Summary:")
    print(f"- Total batches: {total_batches}")
    print(f"- Completed successfully: {completed_batches}")
    print(f"- Failed: {failed_batches}")
    print(f"- Other status: {other_status}")
    print(f"- Success rate: {success_rate:.1f}%")
    
    if completed_batches == total_batches:
        print("✅ All video title batches completed - ready for results processing")
    elif completed_batches > 0:
        print("⚠️ Partial completion - can process available results")
    else:
        print("❌ No completed batches - check for processing errors")
    
    return {
        'total': total_batches,
        'completed': completed_batches,
        'failed': failed_batches,
        'other': other_status
    }

# Start comprehensive batch monitoring
print("\n🎯 Starting video title batch monitoring...")
print("Note: This will run until all batches are completed or failed.")
print("You can interrupt with Ctrl+C if needed.")

# Run monitoring
wait_for_completion(batch_processors)

# Check final completion status
completion_stats = check_final_completion_status(batch_processors)

🔄 Video Title Batch Processing Status (19.1 min elapsed):
Completed: 1/1 | Failed: 0 | In Progress: 0

- 20250417_youtube_titles_batch_api_0.jsonl: completed

✅ All video title batches finished!
- Completed successfully: 1
- Failed: 0
- Total processing time: 19.1 minutes
🔍 Final video title batch completion check:
- 20250417_youtube_titles_batch_api_0.jsonl: completed
  ✅ Ready for result processing

📊 Video Title Processing Summary:
- Total batches: 1
- Completed successfully: 1
- Failed: 0
- Other status: 0
- Success rate: 100.0%
✅ All video title batches completed - ready for results processing
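
The run above finished in about 19 minutes, but a batch can also end as `expired` or `cancelled`, and a polling loop benefits from an upper time bound. The sketch below separates the polling logic from the API so it can be exercised without network access. `poll_until_done` and its parameters are hypothetical, and the status names follow the OpenAI Batch API documentation as understood here.

```python
import time
from typing import Callable, Dict

# Batch states after which no further progress is expected.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def poll_until_done(
    get_statuses: Callable[[], Dict[str, str]],
    check_interval: float = 60.0,
    timeout: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll until every batch is in a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        statuses = get_statuses()
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        if clock() >= deadline:
            raise TimeoutError(f"Batches still pending after {timeout:.0f}s: {statuses}")
        sleep(check_interval)


# Simulated status sequence standing in for repeated API calls.
sequence = iter([
    {"batch_0": "validating"},
    {"batch_0": "in_progress"},
    {"batch_0": "expired"},
])
final = poll_until_done(lambda: next(sequence), check_interval=0, sleep=lambda _: None)
print(final)
```

In the notebook, `get_statuses` could be a small function that calls `get_batch_info()` on each processor and returns a filename-to-status mapping.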


In [24]:
## 10. Results Processing and Data Integration

"""
This section processes batch results and integrates video title classifications 
with the original dataset for research analysis.
"""

def process_batch_results(processors: Dict[str, OpenAIBatchProcessor]) -> Tuple[List[Any], List[Dict]]:
    """
    Process and parse results from completed video title batch jobs.
    
    Args:
        processors: Dictionary of batch processors
        
    Returns:
        Tuple of (parsed_results, raw_results) for research analysis
    """
    print("🔄 Processing video title batch results...")
    
    parsed_results = []
    raw_results = []
    error_count = 0
    total_processed = 0
    
    for file_path, processor in processors.items():
        filename = Path(file_path).name
        print(f"📂 Processing results from {filename}...")
        
        try:
            # Check batch status first
            batch_info = processor.get_batch_info()
            if batch_info.status != "completed":
                print(f"⚠️ Skipping {filename} - status: {batch_info.status}")
                continue
            
            # Get batch output
            file_response = processor.get_batch_output()
            if not file_response:
                print(f"⚠️ No response data for {filename}")
                continue
            
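            # Process each response. Batch output order is not guaranteed to match
            # input order, so match on custom_id when merging with the dataset.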
            batch_parsed = 0
            batch_errors = 0
            
            for output in file_response:
                total_processed += 1
                try:
                    # Parse the JSON response
                    json_output = json.loads(output)
                    function_args = json_output["response"]["body"]["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
                    parsed_json = json.loads(function_args)
                    
                    # Validate with Pydantic model
                    validated_obj = RespostaAnaliseSentimento.model_validate(parsed_json)
                    parsed_results.append(validated_obj)
                    raw_results.append(validated_obj.model_dump())
                    batch_parsed += 1
                    
                except Exception as e:
                    # Handle parsing errors gracefully
                    parsed_results.append(None)
                    raw_results.append(None)
                    batch_errors += 1
                    error_count += 1
                    
                    if batch_errors <= 3:  # Show first few errors per batch
                        print(f"⚠️ Parsing error: {str(e)[:100]}")
            
            print(f"  ✅ Parsed: {batch_parsed}, Errors: {batch_errors}")
            
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            continue
    
    success_rate = (total_processed - error_count) / total_processed * 100 if total_processed > 0 else 0
    
    print(f"\n📊 Video Title Results Processing Summary:")
    print(f"- Total responses processed: {total_processed:,}")
    print(f"- Successfully parsed: {total_processed - error_count:,}")
    print(f"- Parse errors: {error_count:,}")
    print(f"- Success rate: {success_rate:.1f}%")
    print(f"- Final dataset size: {len(parsed_results):,} records")
    
    return parsed_results, raw_results

def save_intermediate_results(parsed_results: List[Any], raw_results: List[Dict]) -> None:
    """
    Save intermediate results for backup and debugging.
    
    Args:
        parsed_results: List of parsed Pydantic objects
        raw_results: List of raw result dictionaries
    """
    print("💾 Saving intermediate video title results...")
    
    try:
        # Save parsed results
        joblib.dump(parsed_results, TitleClassificationConfig.PARSED_RESULTS_FILE)
        print(f"✅ Parsed results saved: {TitleClassificationConfig.PARSED_RESULTS_FILE}")
        
        # Save raw results
        joblib.dump(raw_results, TitleClassificationConfig.RESULTS_FILE)
        print(f"✅ Raw results saved: {TitleClassificationConfig.RESULTS_FILE}")
        
        # Calculate and display file sizes
        parsed_size = TitleClassificationConfig.PARSED_RESULTS_FILE.stat().st_size / (1024 * 1024)
        raw_size = TitleClassificationConfig.RESULTS_FILE.stat().st_size / (1024 * 1024)
        
        print(f"📊 Backup file sizes:")
        print(f"- Parsed results: {parsed_size:.2f} MB")
        print(f"- Raw results: {raw_size:.2f} MB")
        
    except Exception as e:
        print(f"❌ Error saving intermediate results: {e}")
        raise

def create_results_dataframe(parsed_results: List[Any]) -> pd.DataFrame:
    """
    Convert parsed results to a structured DataFrame for research analysis.
    
    Args:
        parsed_results: List of parsed Pydantic objects
        
    Returns:
        DataFrame with classification results
    """
    print("📊 Creating structured results DataFrame...")
    
    outputs = []
    
    for parsed_document in tqdm(parsed_results, desc="Converting results to DataFrame"):
        if parsed_document is not None:
            # Extract validated classification data
            parsed_dict = parsed_document.model_dump()
            outputs.append(parsed_dict)
        else:
            # Handle null results with default values
            outputs.append({
                "sentimento": None,
                "gordofobia_implicita": None,
                "gordofobia_explicita": None,
                "idioma": None,
                "obesidade": None,
            })
    
    # Select columns by name so labels cannot be shifted by the model's field order
    result_columns = ["sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade"]
    df_results = pd.DataFrame(outputs, columns=result_columns)
    
    print(f"✅ Created results DataFrame with {len(df_results):,} records")
    
    # Display basic statistics
    print(f"\n📈 Classification Results Summary:")
    if 'obesidade' in df_results.columns:
        obesity_counts = df_results['obesidade'].value_counts()
        print(f"- Obesity content: {obesity_counts.to_dict()}")
    
    if 'idioma' in df_results.columns:
        language_counts = df_results['idioma'].value_counts()
        print(f"- Languages detected: {language_counts.to_dict()}")
    
    return df_results
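
The parsing loop in `process_batch_results` decodes JSON twice: once for the JSONL envelope and once for the tool-call `arguments`, which the API returns as a JSON-encoded string. The sketch below reproduces that path on a single hand-made line using only the standard library. The sample payload, the `extract_arguments` helper, and the key-set check that stands in for `RespostaAnaliseSentimento.model_validate` are illustrative assumptions, not part of the notebook.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical Batch API output line, trimmed to the fields the parsing loop reads.
# All values are made up for illustration.
sample_line = json.dumps({
    "custom_id": "request-0",
    "response": {
        "body": {
            "choices": [{
                "message": {
                    "tool_calls": [{
                        "function": {
                            # The arguments field is itself a JSON string
                            "arguments": json.dumps({
                                "sentimento": "neutro",
                                "gordofobia_implicita": False,
                                "gordofobia_explicita": False,
                                "idioma": "pt",
                                "obesidade": True,
                            }),
                        }
                    }]
                }
            }]
        },
    },
})

EXPECTED_KEYS = {
    "sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade",
}


def extract_arguments(line: str) -> Optional[Dict[str, Any]]:
    """Return the decoded tool-call arguments, or None if the line is malformed."""
    try:
        envelope = json.loads(line)  # first decode: the JSONL envelope
        raw_args = envelope["response"]["body"]["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
        args = json.loads(raw_args)  # second decode: the arguments string
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None
    # Stand-in for Pydantic validation: require exactly the five schema fields
    if not isinstance(args, dict) or set(args) != EXPECTED_KEYS:
        return None
    return args


parsed = extract_arguments(sample_line)
malformed = extract_arguments('{"response": {"body": {"choices": []}}}')
```

Returning `None` for a malformed line mirrors the notebook's choice to append a placeholder instead of dropping the record, which keeps one result per input row.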


In [25]:

# Process all video title batch results
parsed_results, raw_results = process_batch_results(batch_processors)

# Save intermediate results for backup
save_intermediate_results(parsed_results, raw_results)

# Create structured results DataFrame
df_results = create_results_dataframe(parsed_results)

🔄 Processing video title batch results...
📂 Processing results from 20250417_youtube_titles_batch_api_0.jsonl...
  ✅ Parsed: 1204, Errors: 0

📊 Video Title Results Processing Summary:
- Total responses processed: 1,204
- Successfully parsed: 1,204
- Parse errors: 0
- Success rate: 100.0%
- Final dataset size: 1,204 records
💾 Saving intermediate video title results...
✅ Parsed results saved: ../data/tmp/parsed_results_titles.joblib
✅ Raw results saved: ../data/tmp/results_titles.joblib
📊 Backup file sizes:
- Parsed results: 0.07 MB
- Raw results: 0.03 MB
📊 Creating structured results DataFrame...


✅ Created results DataFrame with 1,204 records

📈 Classification Results Summary:
- Obesity content: {True: 801, False: 403}
- Languages detected: {'pt': 1180, 'es': 21, 'en': 3}


In [26]:
## 11. Data Integration and Final Export

"""
This section integrates the video title classification results with the original dataset
and exports the final research-ready dataset with language filtering and validation.
"""

def validate_results_consistency(df_results: pd.DataFrame, original_df: pd.DataFrame) -> None:
    """
    Validate that results match the original video title data structure.
    
    Args:
        df_results: Classification results DataFrame
        original_df: Original video DataFrame
    """
    print("🔍 Validating video title results consistency...")
    
    print(f"- Original video records: {len(original_df):,}")
    print(f"- Classification results: {len(df_results):,}")
    
    if len(df_results) == len(original_df):
        print("✅ Result count matches original video title data")
    else:
        print("⚠️ Result count mismatch - check for processing errors")
        print(f"  Difference: {abs(len(df_results) - len(original_df)):,} records")
    
    # Check for null results
    null_count = df_results.isnull().any(axis=1).sum()
    if null_count > 0:
        print(f"⚠️ Found {null_count} records with null values ({null_count/len(df_results)*100:.1f}%)")
    else:
        print("✅ No null results found")

def integrate_title_data(original_df: pd.DataFrame, results_df: pd.DataFrame) -> pd.DataFrame:
    """
    Integrate original video data with title classification results.
    
    Args:
        original_df: Original video DataFrame
        results_df: Classification results DataFrame
        
    Returns:
        Combined DataFrame with all video and classification data
    """
    print("🔗 Integrating video data with title classification results...")
    
    # Reset both indices so rows are paired by position (row i of the results with row i of the videos)
    original_df_reset = original_df.reset_index(drop=True)
    results_df_reset = results_df.reset_index(drop=True)
    
    # Combine original data with results
    df_combined = pd.concat([original_df_reset, results_df_reset], axis=1)
    
    print(f"✅ Integrated DataFrame with {len(df_combined):,} video title records")
    print(f"📊 Total columns: {len(df_combined.columns)}")
    
    return df_combined

def apply_language_filter(df_combined: pd.DataFrame) -> pd.DataFrame:
    """
    Keep only rows whose detected language (`idioma`) equals TitleClassificationConfig.LANGUAGE_FILTER.
    
    Args:
        df_combined: Combined DataFrame with classification results
        
    Returns:
        Filtered DataFrame with Portuguese content only
    """
    print("🌍 Applying language filter for Brazilian Portuguese content...")
    
    # Display language distribution before filtering
    if 'idioma' in df_combined.columns:
        language_dist = df_combined['idioma'].value_counts()
        print(f"📊 Language distribution before filtering:")
        for lang, count in language_dist.items():
            percentage = count / len(df_combined) * 100
            print(f"- {lang}: {count:,} records ({percentage:.1f}%)")
        
        # Apply Portuguese filter
        pt_filter = df_combined['idioma'] == TitleClassificationConfig.LANGUAGE_FILTER
        df_filtered = df_combined[pt_filter].copy()
        df_filtered.reset_index(drop=True, inplace=True)
        
        filtered_count = len(df_filtered)
        original_count = len(df_combined)
        retention_rate = filtered_count / original_count * 100 if original_count > 0 else 0
        
        print(f"\n🎯 Language filtering results:")
        print(f"- Original records: {original_count:,}")
        print(f"- Portuguese records: {filtered_count:,}")
        print(f"- Retention rate: {retention_rate:.1f}%")
        print(f"- Filtered out: {original_count - filtered_count:,} non-Portuguese records")
        
        return df_filtered
    else:
        print("⚠️ No language column found - returning original dataset")
        return df_combined

def export_final_dataset(df_final: pd.DataFrame) -> None:
    """
    Export final video title classification dataset for research use.
    
    Args:
        df_final: Final filtered and integrated DataFrame
    """
    print("💾 Exporting final video title classification dataset...")
    
    # Create output directory, including any missing parent directories
    output_dir = TitleClassificationConfig.FINAL_OUTPUT_FILE.parent
    output_dir.mkdir(parents=True, exist_ok=True)
    
    try:
        # Export to Parquet format
        df_final.to_parquet(TitleClassificationConfig.FINAL_OUTPUT_FILE, index=False)
        
        # Calculate file size
        file_size_mb = TitleClassificationConfig.FINAL_OUTPUT_FILE.stat().st_size / (1024 * 1024)
        
        print(f"✅ Final dataset exported successfully")
        print(f"📁 File: {TitleClassificationConfig.FINAL_OUTPUT_FILE}")
        print(f"📊 Size: {file_size_mb:.2f} MB")
        print(f"📈 Records: {len(df_final):,}")
        print(f"📋 Columns: {len(df_final.columns)}")
        
        # Generate dataset summary
        print(f"\n📋 Final Dataset Summary:")
        if 'obesidade' in df_final.columns:
            obesity_dist = df_final['obesidade'].value_counts()
            print(f"- Obesity content distribution: {obesity_dist.to_dict()}")
        
        if 'sentimento' in df_final.columns:
            sentiment_dist = df_final['sentimento'].value_counts()
            print(f"- Sentiment distribution: {sentiment_dist.to_dict()}")
        
        if 'gordofobia_explicita' in df_final.columns:
            explicit_stigma = df_final['gordofobia_explicita'].sum()
            print(f"- Videos with explicit weight stigma: {explicit_stigma:,}")
        
        if 'gordofobia_implicita' in df_final.columns:
            implicit_stigma = df_final['gordofobia_implicita'].sum()
            print(f"- Videos with implicit weight stigma: {implicit_stigma:,}")
        
    except Exception as e:
        print(f"❌ Error exporting final dataset: {e}")
        raise

def generate_research_summary() -> None:
    """Generate comprehensive summary for research documentation."""
    print("\n🎯 Video Title Classification Pipeline Summary")
    print("=" * 60)
    
    print(f"\n📁 Research Outputs:")
    print(f"- Final dataset: {TitleClassificationConfig.FINAL_OUTPUT_FILE}")
    print(f"- Classification schema: 5 dimensions (sentiment, implicit/explicit stigma, language, obesity content)")
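
`integrate_title_data` joins the two frames by position, so it relies on row i of `df_results` describing row i of the original data. The OpenAI Batch API documentation states that output lines are not guaranteed to follow input order and recommends matching them through `custom_id`. The code that writes and reads `custom_id` is not shown in this section, so the sketch below assumes a hypothetical `request-<row index>` format. The `realign_by_custom_id` helper and the `result` key are likewise illustrative.

```python
from typing import Dict, List, Optional


def realign_by_custom_id(
    outputs: List[Dict],
    n_inputs: int,
    prefix: str = "request-",  # assumed custom_id format: "request-<row index>"
) -> List[Optional[Dict]]:
    """Place each output at the row index encoded in its custom_id.

    Rows without a matching output stay None, so the list always has
    exactly n_inputs entries and can be joined positionally.
    """
    aligned: List[Optional[Dict]] = [None] * n_inputs
    for item in outputs:
        custom_id = item.get("custom_id", "")
        if not custom_id.startswith(prefix):
            continue  # unknown id format: skip instead of guessing
        suffix = custom_id[len(prefix):]
        if not suffix.isdecimal():
            continue
        row = int(suffix)
        if row < n_inputs:
            aligned[row] = item["result"]
    return aligned


# Outputs arrive out of order and one request (row 1) is missing
outputs = [
    {"custom_id": "request-2", "result": {"idioma": "es"}},
    {"custom_id": "request-0", "result": {"idioma": "pt"}},
]
aligned = realign_by_custom_id(outputs, n_inputs=3)
```

A list realigned this way can be passed to `create_results_dataframe`, whose `None` handling already fills the missing rows with null values.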



In [27]:

# Validate results consistency
validate_results_consistency(df_results, df)

# Integrate original data with classification results
df_integrated = integrate_title_data(df, df_results)

# Apply language filter for Brazilian Portuguese content
df_final = apply_language_filter(df_integrated)

# Export final research dataset
export_final_dataset(df_final)

# Generate research summary
generate_research_summary()

# Display sample of final dataset
print(f"\n📋 Sample of Final Video Title Dataset:")
df_final.head()

🔍 Validating video title results consistency...
- Original video records: 1,204
- Classification results: 1,204
✅ Result count matches original video title data
✅ No null results found
🔗 Integrating video data with title classification results...
✅ Integrated DataFrame with 1,204 video title records
📊 Total columns: 25
🌍 Applying language filter for Brazilian Portuguese content...
📊 Language distribution before filtering:
- pt: 1,180 records (98.0%)
- es: 21 records (1.7%)
- en: 3 records (0.2%)

🎯 Language filtering results:
- Original records: 1,204
- Portuguese records: 1,180
- Retention rate: 98.0%
- Filtered out: 24 non-Portuguese records
💾 Exporting final video title classification dataset...
✅ Final dataset exported successfully
📁 File: ../data/intermediate/20250417_youtube_titles_yes_labels.parquet
📊 Size: 0.42 MB
📈 Records: 1,180
📋 Columns: 25

📋 Final Dataset Summary:
- Obesity content distribution: {True: 797, False: 383}
- Sentiment distribution: {'neutro': 640, 'negativo':

Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,...,comment,date,likes,video_title,language,sentimento,gordofobia_implicita,gordofobia_explicita,idioma,obesidade
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,UCNhXx9ev5RtEiyGsVjMuTOA,True,...,,,,Tony Gordo é Incriminado #simpsons,pt,neutro,False,False,pt,False
1,-1DN4904BQw,UCbDy7ap3Ixk45DILe4O6Tbw,-1DN4904BQw,Aula gratuita: https://bit.ly/3RLbmWq,Aula gratuita: https://bit.ly/3RLbmWq,@sejasaudavel5167,https://yt3.ggpht.com/3Uk9AXlL4DHwwOhPTVsJIKJn...,http://www.youtube.com/@sejasaudavel5167,UCbDy7ap3Ixk45DILe4O6Tbw,True,...,,,,O país mais obeso do mundo #shorts,pt,neutro,False,False,pt,True
2,-4xj_teI1EQ,UCVIpR5_iHUkkpAPBkw24yDQ,-4xj_teI1EQ,Vc é linda e sua auto-estima é contagiante. Se...,Vc é linda e sua auto-estima é contagiante. Se...,@isabelitacorrea2611,https://yt3.ggpht.com/ytc/AIdro_nE2ZHEpUNJCTkX...,http://www.youtube.com/@isabelitacorrea2611,UCnxzchRu-oFH4H4SKeF2C-Q,True,...,,,,Preconceitos que eu já sofri por ser uma baila...,pt,negativo,True,False,pt,False
3,-6Qxw7CpQvQ,UC6cALLZLWQGilBFBB0PWAog,-6Qxw7CpQvQ,Vídeo completo: https://youtu.be/hnetjD-gje4,Vídeo completo: https://youtu.be/hnetjD-gje4,,https://yt3.ggpht.com/xXlZxbOOYCKigMGIaVKMpvi1...,http://www.youtube.com/c/MaiconK%C3%BCster,UC6cALLZLWQGilBFBB0PWAog,True,...,,,,esse milionário de 18 anos não quer pegar mulh...,pt,negativo,False,True,pt,False
4,-7fJRjz1BCM,UC9mdw2mmn49ZuqGOpSri7Fw,-7fJRjz1BCM,"Reunião pra saber como roubar mais, desgoverno...","Reunião pra saber como roubar mais, desgoverno...",@andersoncustodiooliveira1515,https://yt3.ggpht.com/ytc/AIdro_m5pluPevjReHKO...,http://www.youtube.com/@andersoncustodioolivei...,UCc7hcy4jmptDescrLGRHkeA,True,...,,,,Lula volta a fazer piada com obesidade de Fláv...,pt,negativo,False,True,pt,True
