# Zero-Shot Classification for YouTube Video Transcriptions

This notebook implements zero-shot classification of YouTube video transcriptions for weight stigma research using OpenAI's GPT models. The pipeline classifies the sentiment of each transcription, detects explicit and implicit weight-based discrimination (gordofobia), identifies the transcription language, and flags obesity-related content.

## Research Overview

This study applies zero-shot classification with a large language model to video transcriptions, with the following goals:
- Classify sentiment in Portuguese video transcriptions (positive, negative, neutral)
- Detect explicit and implicit weight-based discrimination (gordofobia) in video content
- Identify the language of each transcription in a multilingual video dataset
- Flag obesity-related discussions for focused content analysis
- Enable large-scale video discourse analysis using batch processing

## Classification Framework

The zero-shot classification system for video transcriptions includes:

1. **Video Content Sentiment Analysis**: Classify transcriptions as positive, negative, or neutral
2. **Weight Discrimination Detection**: Identify explicit and implicit gordofobia in video discourse
3. **Language Identification**: Detect transcription language using ISO codes
4. **Obesity Content Flagging**: Mark videos discussing obesity topics
5. **Batch Processing**: Efficient processing using OpenAI's Batch API for long-form content

## Input Data

- **Source**: Video transcriptions from `03_get_video_transcriptions.ipynb`
- **Expected location**: `../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet`
- **Content**: Processed video transcriptions ready for sentiment analysis

## Output Data

- **Destination**: `../data/intermediate/20250417_youtube_transcriptions_yes_labels.parquet`
- **Content**: Original transcriptions with classification labels and metadata

## Technical Requirements

- OpenAI API access with sufficient batch processing quota for long-form text
- Pydantic for structured data validation
- LangChain for prompt engineering and API integration
- Robust error handling for variable-length transcription content

## Classification Schema

The system uses the same structured Pydantic model as comment analysis for consistency:
- **sentimento**: Sentiment classification (positivo/negativo/neutro)
- **gordofobia_implicita**: Boolean flag for implicit weight discrimination
- **gordofobia_explicita**: Boolean flag for explicit weight discrimination
- **idioma**: Language code (ISO 639-1 format)
- **obesidade**: Boolean flag for obesity-related content
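
As a rough illustration of what one labelled record looks like under this schema, the sketch below checks a JSON payload against the five fields using only the standard library. The helper name `validate_labels` and the sample payload are hypothetical; the notebook itself defines the schema as a Pydantic model rather than with this function.

```python
import json

ALLOWED_SENTIMENTS = {"positivo", "negativo", "neutro"}
BOOLEAN_FIELDS = ("gordofobia_implicita", "gordofobia_explicita", "obesidade")


def validate_labels(raw_json: str) -> dict:
    """Parse a JSON string and check it against the five schema fields."""
    labels = json.loads(raw_json)

    if labels.get("sentimento") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"Invalid sentimento: {labels.get('sentimento')!r}")

    for field in BOOLEAN_FIELDS:
        if not isinstance(labels.get(field), bool):
            raise ValueError(f"{field} must be a boolean")

    idioma = labels.get("idioma")
    # ISO 639-1 codes are two lowercase ASCII letters (e.g. 'pt', 'en', 'es')
    if not (
        isinstance(idioma, str)
        and len(idioma) == 2
        and idioma.isascii()
        and idioma.isalpha()
        and idioma.islower()
    ):
        raise ValueError(f"idioma must be a two-letter ISO 639-1 code, got {idioma!r}")

    return labels


# Hypothetical payload, for illustration only
sample = (
    '{"sentimento": "negativo", "gordofobia_implicita": true, '
    '"gordofobia_explicita": false, "idioma": "pt", "obesidade": true}'
)
labels = validate_labels(sample)
print(labels["sentimento"], labels["idioma"])
```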

## Research Significance

Video transcription analysis provides complementary insights to comment analysis:
- **Content Creator Perspective**: Analyzes the original discourse presented by video creators
- **Comparative Analysis**: Enables comparison between creator content and audience response
- **Narrative Analysis**: Examines longer-form discourse patterns and themes
- **Media Influence**: Assesses potential impact of video content on audience attitudes

**Note**: Video transcriptions typically contain longer, more complex text than comments, requiring careful prompt engineering for accurate classification.

In [2]:
import hashlib
import json
import os
import sys
import time
import copy
from pathlib import Path
from typing import List, Literal, Optional, Dict, Any
from glob import glob

# Data processing libraries
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import joblib

# API and ML libraries
from dotenv import load_dotenv
from langchain.utils.openai_functions import convert_pydantic_to_openai_function
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Custom modules
sys.path.append(str(Path("..").resolve()))
from openai_api import OpenAIBatchProcessor

# Jupyter notebook utilities
from IPython.display import clear_output
import warnings

# Load environment variables
load_dotenv()

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configuration for zero-shot transcription classification
MODEL_NAME = "gpt-4.1-mini"  # Updated to current model version

print("✅ Libraries loaded successfully")
print(f"🤖 Using model: {MODEL_NAME}")
print(f"📁 Working directory: {Path.cwd()}")

# Configuration class for the transcription classification pipeline
class TranscriptionClassificationConfig:
    """Configuration for zero-shot video transcription classification pipeline."""
    
    # File paths
    DATA_DIR = Path("../data")
    INTERMEDIATE_DATA_DIR = DATA_DIR / "intermediate"
    TMP_DATA_DIR = DATA_DIR / "tmp"
    JSONL_DIR = INTERMEDIATE_DATA_DIR / "jsonl"
    
    # Input file (from transcription notebook)
    INPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_transcriptions_no_labels.parquet"
    
    # Output file
    OUTPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_transcriptions_yes_labels.parquet"
    
    # Temporary files for batch processing
    PARSED_RESULTS_FILE = TMP_DATA_DIR / "parsed_results_transcriptions.joblib"
    RESULTS_FILE = TMP_DATA_DIR / "results_transcriptions.joblib"
    
    # Batch processing parameters
    BATCH_SIZE = 40000  # Maximum requests per batch file
    BATCH_NAME_PREFIX = "20250417_youtube_transcriptions_batch_api"
    
    # Model parameters
    MODEL_NAME = MODEL_NAME
    TEMPERATURE = 0.0  # Deterministic outputs for research
    
    @classmethod
    def create_directories(cls):
        """Create necessary directories for processing."""
        cls.INTERMEDIATE_DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.TMP_DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.JSONL_DIR.mkdir(parents=True, exist_ok=True)

# Create directories
TranscriptionClassificationConfig.create_directories()

print("✅ Configuration initialized")
print(f"📂 Input file: {TranscriptionClassificationConfig.INPUT_FILE}")
print(f"📂 Output file: {TranscriptionClassificationConfig.OUTPUT_FILE}")
print(f"🔢 Batch size: {TranscriptionClassificationConfig.BATCH_SIZE:,}")
print(f"🌡️ Temperature: {TranscriptionClassificationConfig.TEMPERATURE}")
print(f"🎬 Content type: Video transcriptions (long-form text)")

✅ Libraries loaded successfully
🤖 Using model: gpt-4.1-mini
📁 Working directory: /media/nas-elias/pesquisas/papers/paper_savio_youtube/paper_youtube_weight_stigma
✅ Configuration initialized
📂 Input file: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
📂 Output file: ../data/intermediate/20250417_youtube_transcriptions_yes_labels.parquet
🔢 Batch size: 40,000
🌡️ Temperature: 0.0
🎬 Content type: Video transcriptions (long-form text)


In [3]:
df = pd.read_parquet("../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet")
df

index,video_id,transcription,duration,video_title
0,--tK3SaYWr4,em Springfield tem um batedor de carteiras Ei ...,152.480,Tony Gordo é Incriminado #simpsons
1,-1DN4904BQw,Qual é o país mais obeso do mundo você pode pe...,105.778,O país mais obeso do mundo #shorts
2,-4xj_teI1EQ,preconceitos que eu já sofri por ser uma Bel L...,119.278,Preconceitos que eu já sofri por ser uma baila...
3,-6Qxw7CpQvQ,esses dias Me perguntaram no Instagram Caíque ...,119.740,esse milionário de 18 anos não quer pegar mulh...
4,-7fJRjz1BCM,essa reunião vai demorar pelo menos umas 6 hor...,53.000,Lula volta a fazer piada com obesidade de Fláv...
...,...,...,...,...
1006,zpAr6RAfiQQ,G Alô Agora sim podemos as outras salas estão ...,35009.043,Obesidade em Pauta 2025
1007,zrp63PeKlm8,o abate de bovinos superou 10 milhões de cabeç...,779.921,"Boi gordo: alta recua, mas preços seguem firme..."
1008,zsy8O0eAkro,o Santa Cruz busca muita mas muita motivação n...,1232.019,SANTA CRUZ VAI TER PIX GORDO DA SAF E TORCEDOR...
1009,zuwuu5jNsZM,você está com quase 300 quilos o e qual você a...,436.919,"""Sempre recorro à comida para me confortar"" | ..."


In [4]:
def load_and_explore_transcription_data(file_path: Path) -> pd.DataFrame:
    """
    Load and explore YouTube video transcription data for classification.
    
    Args:
        file_path: Path to the video transcriptions data
        
    Returns:
        DataFrame with transcription data ready for classification
    """
    try:
        print(f"📂 Loading video transcription data from: {file_path}")
        
        # Verify file exists
        if not file_path.exists():
            raise FileNotFoundError(f"Input file not found: {file_path}")
        
        # Load the data
        df = pd.read_parquet(file_path)
        
        print(f"✅ Successfully loaded {len(df):,} video transcriptions")
        print(f"📊 Data shape: {df.shape}")
        print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
        
        # Check for required columns before they are used below
        required_columns = ['transcription', 'video_id']
        missing_columns = [col for col in required_columns if col not in df.columns]
        if missing_columns:
            raise ValueError(f"Missing required columns: {missing_columns}")

        # Display basic statistics
        print(f"\n📈 Data Overview:")
        print(f"- Total video transcriptions: {len(df):,}")
        print(f"- Unique videos: {df['video_id'].nunique():,}")

        # Check transcription content
        lengths = df['transcription'].str.len()
        print(f"- Average transcription length: {lengths.mean():.0f} characters")
        print(f"- Median transcription length: {lengths.median():.0f} characters")
        print(f"- Total characters: {int(lengths.sum()):,}")

        # Check for missing or whitespace-only transcriptions
        empty_count = df['transcription'].fillna("").str.strip().eq("").sum()
        if empty_count > 0:
            print(f"⚠️ Empty transcriptions: {empty_count:,}")

        # Check video duration if available
        if 'duration' in df.columns:
            avg_duration = df['duration'].mean()
            median_duration = df['duration'].median()
            print(f"- Average video duration: {avg_duration:.1f}s ({avg_duration/60:.1f} min)")
            print(f"- Median video duration: {median_duration:.1f}s ({median_duration/60:.1f} min)")
        
        # Display sample data
        print(f"\n📋 Sample Video Transcriptions:")
        sample_df = df.sample(min(2, len(df)))
        for idx, row in sample_df.iterrows():
            video_title = row.get('video_title', 'Unknown Title')
            transcription_preview = row['transcription'][:200] + "..." if len(row['transcription']) > 200 else row['transcription']
            print(f"\n🎬 Video: {video_title[:80]}{'...' if len(video_title) > 80 else ''}")
            print(f"   Transcription: {transcription_preview}")
        
        return df
        
    except Exception as e:
        print(f"❌ Error loading data: {e}")
        raise

# Load the transcription data
df = load_and_explore_transcription_data(TranscriptionClassificationConfig.INPUT_FILE)

📂 Loading video transcription data from: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
✅ Successfully loaded 1,011 video transcriptions
📊 Data shape: (1011, 4)
💾 Memory usage: 9.6 MB

📈 Data Overview:
- Total video transcriptions: 1,011
- Unique videos: 1,011
- Average transcription length: 9290 characters
- Median transcription length: 1231 characters
- Total characters: 9,392,361
- Average video duration: 1367.4s (22.8 min)
- Median video duration: 129.8s (2.2 min)

📋 Sample Video Transcriptions:

🎬 Video: Ele deixou a esposa porque ela era gorda | então ela emagreceu e se vingou
   Transcription: Renata sempre foi uma esposa dedicada e amorosa para Carlos o casamento deles apesar das dificuldades foi construído na esperança e perseverança por três longos anos enfrentaram juntos apertos finance...

🎬 Video: Mulher obesa se sente humilhada ao ficar presa na catraca do ônibus
   Transcription: a situação constrangedora sobraram fotos e o trauma eu me sinto muit

In [5]:
class RespostaAnaliseSentimento(BaseModel):
    """
    Structured response model for video transcription sentiment analysis and weight discrimination detection.
    
    This model defines the output schema for zero-shot classification of YouTube video transcriptions
    focusing on sentiment, weight-based discrimination (gordofobia), language detection,
    and obesity-related content identification in video discourse.
    
    Note: Video transcriptions typically contain longer, more complex content than comments,
    requiring careful analysis of extended discourse patterns.
    """

    sentimento: Literal["positivo", "negativo", "neutro"] = Field(
        description="Sentiment classification of the video transcription. Must be one of: 'positivo', 'negativo', or 'neutro'. "
                   "Consider the overall tone and message of the video content, not just isolated phrases."
    )

    gordofobia_implicita: bool = Field(
        description="Whether the video transcription contains implicit weight-based discrimination (gordofobia). "
                   "Set to True if subtle or indirect weight bias is detected in the video discourse, False otherwise. "
                   "Consider narrative patterns, assumptions, and subtle messaging."
    )

    gordofobia_explicita: bool = Field(
        description="Whether the video transcription contains explicit weight-based discrimination (gordofobia). "
                   "Set to True if direct or overt weight bias is detected in the video content, False otherwise. "
                   "Look for clear discriminatory language, stereotypes, or prejudicial statements."
    )

    idioma: str = Field(
        description="Detected primary language of the video transcription using ISO 639-1 two-letter codes (e.g., 'pt', 'en', 'es'). "
                   "Consider the predominant language if the video contains mixed languages."
    )

    obesidade: bool = Field(
        description="Whether the video transcription discusses obesity-related topics. "
                   "Set to True if obesity, weight loss, dieting, or related health topics are mentioned or discussed, False otherwise. "
                   "Consider the video's main theme and content focus."
    )

    class Config:
        """Pydantic configuration for the model."""
        json_encoders = {
            # Custom encoders if needed
        }

# Validate the model structure
print("✅ Classification schema defined successfully for video transcriptions")
print(f"📋 Model fields: {list(RespostaAnaliseSentimento.model_fields.keys())}")  # model_fields replaces the deprecated __fields__ in Pydantic v2

# Display model schema for validation
try:
    schema = RespostaAnaliseSentimento.model_json_schema()
    print(f"🔍 Schema validation: OK")
    print(f"📊 Required fields: {schema.get('required', [])}")
    print(f"🎬 Optimized for: Long-form video content analysis")
except Exception as e:
    print(f"❌ Schema validation error: {e}")
    raise

✅ Classification schema defined successfully for video transcriptions
📋 Model fields: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']
🔍 Schema validation: OK
📊 Required fields: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']
🎬 Optimized for: Long-form video content analysis


In [6]:
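# System prompt for zero-shot classification (written in Portuguese to match the transcription language).
# Note: the prompt wording refers to "comentários" (comments), although the inputs in this notebook are
# video transcriptions; adjust the wording if the prompt should name transcriptions explicitly.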
SYSTEM_PROMPT = {
    "role": "system",
    "content": """Você é um especialista em análise de sentimento com foco em comentários relacionados a peso corporal e discriminação.

Sua tarefa é classificar comentários do YouTube com precisão, identificando:
1. Sentimento geral (positivo, negativo, neutro)
2. Presença de gordofobia (discriminação por peso)
3. Idioma do texto
4. Menções sobre obesidade

DIRETRIZES DE CLASSIFICAÇÃO:

SENTIMENTO:
- 'positivo': Comentários de apoio, encorajamento, aceitação corporal, mensagens construtivas
- 'negativo': Críticas, julgamentos, discriminação, linguagem ofensiva, gordofobia
- 'neutro': Comentários informativos, questões, observações sem julgamento de valor

GORDOFOBIA:
- Explícita: Insultos diretos, linguagem claramente discriminatória, termos pejorativos sobre peso
- Implícita: Sugestões sutis, estereótipos, pressões indiretas relacionadas ao peso

IDIOMA:
- Use códigos ISO 639-1 (pt, en, es, etc.)
- Considere o idioma predominante se houver mistura

OBESIDADE:
- Marque como True se o comentário menciona ou discute obesidade, mesmo que indiretamente

CONTEXTO IMPORTANTE:
- Considere ironia, sarcasmo e emojis no contexto
- Analise o comentário completo, não apenas palavras isoladas
- Comentários de apoio à diversidade corporal são positivos
- Seja preciso na detecção de discriminação sutil

Responda APENAS com o formato estruturado solicitado.""",
}

print("✅ System prompt configured")
print(f"📝 Prompt length: {len(SYSTEM_PROMPT['content'])} characters")
print("🎯 Classification targets: sentiment, gordofobia, language, obesity content")

✅ System prompt configured
📝 Prompt length: 1346 characters
🎯 Classification targets: sentiment, gordofobia, language, obesity content


In [7]:
# Generate OpenAI function schema from Pydantic model
def create_function_schema() -> Dict[str, Any]:
    """
    Create OpenAI function calling schema from the Pydantic model for video transcription classification.
    
    Returns:
        Dict containing the function schema for OpenAI API
    """
    try:
        # Convert Pydantic model to OpenAI function format
        function_schema = convert_pydantic_to_openai_function(RespostaAnaliseSentimento)
        
        # Ensure all fields are required for consistent outputs
        function_schema["parameters"]["required"] = list(function_schema["parameters"]["properties"].keys())
        function_schema["parameters"]["type"] = "object"
        
        print("✅ Function schema created successfully for video transcription analysis")
        print(f"📋 Function name: {function_schema['name']}")
        print(f"🔧 Required parameters: {function_schema['parameters']['required']}")
        
        return function_schema
        
    except Exception as e:
        print(f"❌ Error creating function schema: {e}")
        raise

# Create the function schema
function_schema = create_function_schema()

# Display schema structure for validation
print(f"\n🔍 Function Schema Structure:")
print(f"- Name: {function_schema['name']}")
print(f"- Description: {function_schema['description']}")
print(f"- Parameters: {len(function_schema['parameters']['properties'])} fields")
print(f"- Required fields: {len(function_schema['parameters']['required'])}")

# Validate schema structure
assert "name" in function_schema, "Function schema missing name"
assert "parameters" in function_schema, "Function schema missing parameters"
assert len(function_schema["parameters"]["required"]) == 5, "Expected 5 required parameters"

print("✅ Function schema validation passed for video transcription classification")

✅ Function schema created successfully for video transcription analysis
📋 Function name: RespostaAnaliseSentimento
🔧 Required parameters: ['sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']

🔍 Function Schema Structure:
- Name: RespostaAnaliseSentimento
- Description: Structured response model for video transcription sentiment analysis and weight discrimination detection.

This model defines the output schema for zero-shot classification of YouTube video transcriptions
focusing on sentiment, weight-based discrimination (gordofobia), language detection,
and obesity-related content identification in video discourse.

Note: Video transcriptions typically contain longer, more complex content than comments,
requiring careful analysis of extended discourse patterns.
- Parameters: 5 fields
- Required fields: 5
✅ Function schema validation passed for video transcription classification
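
For reference, the dictionary that is passed as the tool definition in each request has roughly the shape below. This is a hand-written approximation for illustration, with a shortened description; the exact output of `convert_pydantic_to_openai_function` may differ in details such as titles, field descriptions, or how the enum is expressed.

```python
import json

# Hand-written approximation of the function schema (description shortened).
approx_function_schema = {
    "name": "RespostaAnaliseSentimento",
    "description": "Structured response model for video transcription analysis.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentimento": {"type": "string", "enum": ["positivo", "negativo", "neutro"]},
            "gordofobia_implicita": {"type": "boolean"},
            "gordofobia_explicita": {"type": "boolean"},
            "idioma": {"type": "string"},
            "obesidade": {"type": "boolean"},
        },
        "required": [
            "sentimento",
            "gordofobia_implicita",
            "gordofobia_explicita",
            "idioma",
            "obesidade",
        ],
    },
}

# The schema is wrapped as a tool, and tool_choice forces the model to call it.
tools = [{"type": "function", "function": approx_function_schema}]
tool_choice = {"type": "function", "function": {"name": approx_function_schema["name"]}}

print(json.dumps(tool_choice, ensure_ascii=False))
```

Marking every property as required and forcing the tool call means each response should carry all five labels, which simplifies downstream parsing.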


In [8]:
def analyze_transcription_characteristics(df: pd.DataFrame) -> None:
    """
    Analyze characteristics of video transcription data.
    
    Args:
        df: DataFrame containing the transcription data
    """
    print("🔍 Analyzing video transcription characteristics...")
    
    if 'transcription' in df.columns:
        # Length analysis
        lengths = df['transcription'].str.len()
        print(f"📊 Transcription length analysis:")
        print(f"- Min length: {lengths.min():,} characters")
        print(f"- Max length: {lengths.max():,} characters")
        print(f"- Mean length: {lengths.mean():.0f} characters")
        print(f"- Median length: {lengths.median():.0f} characters")
        
        # Long transcription analysis
        long_transcriptions = (lengths > 5000).sum()
        very_long_transcriptions = (lengths > 10000).sum()
        print(f"- Long transcriptions (>5k chars): {long_transcriptions:,}")
        print(f"- Very long transcriptions (>10k chars): {very_long_transcriptions:,}")
        
        # Empty or very short transcriptions
        empty_transcriptions = df['transcription'].isna().sum()
        short_transcriptions = (lengths < 100).sum()
        print(f"- Empty transcriptions: {empty_transcriptions:,}")
        print(f"- Very short transcriptions (<100 chars): {short_transcriptions:,}")
        
        if short_transcriptions > 0:
            print("ℹ️ Short transcriptions may indicate poor quality or incomplete data")
        
        if very_long_transcriptions > 0:
            print("ℹ️ Very long transcriptions may require special handling for API processing")

def prepare_transcription_data(df: pd.DataFrame) -> List[str]:
    """
    Prepare video transcription texts for classification.
    
    Args:
        df: DataFrame containing the transcription data
        
    Returns:
        List of transcription texts ready for processing
    """
    print(f"📝 Preparing {len(df):,} video transcriptions for classification...")
    
    # Use transcription column
    input_texts = df.transcription.values.tolist()
    
    # Basic validation and cleaning
    empty_texts = sum(1 for text in input_texts if not text or not text.strip())
    if empty_texts > 0:
        print(f"⚠️ Found {empty_texts} empty or whitespace-only transcriptions")
    
    # Check for very long texts that might need special handling
    very_long_texts = sum(1 for text in input_texts if len(text) > 15000)
    if very_long_texts > 0:
        print(f"ℹ️ Found {very_long_texts} very long transcriptions (>15k chars)")
        print("   These will be processed as-is but may require more processing time")
    
    print(f"✅ Prepared {len(input_texts):,} transcriptions for classification")
    return input_texts

# Analyze the transcription data characteristics
analyze_transcription_characteristics(df)

# Prepare the input texts
input_texts = prepare_transcription_data(df)

print(f"\n📈 Data Preparation Summary:")
print(f"- Total transcriptions to classify: {len(input_texts):,}")
if input_texts:
    sample_length = len(input_texts[0])
    average_length = sum(len(text) for text in input_texts) / len(input_texts)
    print(f"- Sample transcription length: {sample_length:,} characters")
    print(f"- Average transcription length: {average_length:.0f} characters")
    
    # Estimate token usage (rough heuristic of ~4 characters per token;
    # actual counts depend on the tokenizer and the language of the text)
    total_chars = sum(len(text) for text in input_texts)
    estimated_tokens = total_chars / 4
    print(f"- Total characters: {total_chars:,}")
    print(f"- Estimated tokens: {estimated_tokens:,.0f}")
    print("ℹ️ Video transcriptions typically use more tokens than comments")
else:
    print("- No transcriptions available for processing")

🔍 Analyzing video transcription characteristics...
📊 Transcription length analysis:
- Min length: 9 characters
- Max length: 386,982 characters
- Mean length: 9290 characters
- Median length: 1231 characters
- Long transcriptions (>5k chars): 333
- Very long transcriptions (>10k chars): 219
- Empty transcriptions: 0
- Very short transcriptions (<100 chars): 17
ℹ️ Short transcriptions may indicate poor quality or incomplete data
ℹ️ Very long transcriptions may require special handling for API processing
📝 Preparing 1,011 video transcriptions for classification...
ℹ️ Found 157 very long transcriptions (>15k chars)
   These will be processed as-is but may require more processing time
✅ Prepared 1,011 transcriptions for classification

📈 Data Preparation Summary:
- Total transcriptions to classify: 1,011
- Sample transcription length: 1,361 characters
- Average transcription length: 9290 characters
- Total characters: 9,392,361
- Estimated tokens: 2,348,090
ℹ️ Video transcriptions typically use more tokens than comments
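
The 4-characters-per-token figure used above is only a heuristic. The sketch below shows how such an estimate could be used to flag transcriptions that might exceed a token budget before they are sent to the API. The names `estimate_tokens` and `flag_oversized`, and the `max_tokens` value, are hypothetical and chosen for illustration; they are not limits taken from the notebook or from the model's documentation.

```python
from typing import List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token)


def flag_oversized(texts: List[str], max_tokens: int) -> List[Tuple[int, int]]:
    """Return (index, estimated_tokens) for texts whose estimate exceeds max_tokens."""
    flagged = []
    for idx, text in enumerate(texts):
        tokens = estimate_tokens(text)
        if tokens > max_tokens:
            flagged.append((idx, tokens))
    return flagged


# Illustrative inputs: one text of median length (1,231 chars) and one as long
# as the longest transcription reported above (386,982 chars)
texts = ["a" * 1_231, "b" * 386_982]
oversized = flag_oversized(texts, max_tokens=50_000)  # hypothetical budget
print(oversized)
```

For exact counts, a tokenizer matching the chosen model would be needed; the heuristic is only meant to surface outliers such as multi-hour recordings.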

In [9]:
def create_batch_requests_for_transcriptions(texts: List[str], df: pd.DataFrame) -> List[Dict[str, Any]]:
    """
    Create batch API requests for video transcription classification.
    
    Args:
        texts: List of transcription texts to classify
        df: Original DataFrame for generating unique IDs
        
    Returns:
        List of API request objects
    """
    print(f"🔧 Creating batch API requests for video transcriptions...")
    
    jsonl_data = []
    
    for idx, text in enumerate(tqdm(texts, desc="Creating transcription requests")):
        # Create unique identifier for the request using transcription content and video ID
        custom_uid = f"{text[:100]}{idx}{df.video_id.iloc[idx]}"  # Use first 100 chars to avoid very long UIDs
        request_id = hashlib.md5(custom_uid.encode()).hexdigest()
        
        # Create API request structure
        request_data = {
            "custom_id": request_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": TranscriptionClassificationConfig.MODEL_NAME,
                "temperature": TranscriptionClassificationConfig.TEMPERATURE,
                "messages": [
                    SYSTEM_PROMPT,
                    {"role": "user", "content": text}  # text is already a str; no encode/decode round trip needed
                ],
                "parallel_tool_calls": False,
                "tools": [{"type": "function", "function": function_schema}],
                "tool_choice": {
                    "type": "function",
                    "function": {"name": function_schema["name"]}
                }
            }
        }
        jsonl_data.append(request_data)
    
    print(f"✅ Created {len(jsonl_data):,} API requests for video transcriptions")
    return jsonl_data

def split_into_batches(data: List[Dict[str, Any]], batch_size: int) -> List[List[Dict[str, Any]]]:
    """
    Split requests into batches for API processing.
    
    Args:
        data: List of API requests
        batch_size: Maximum requests per batch
        
    Returns:
        List of batches
    """
    print(f"📦 Splitting {len(data):,} transcription requests into batches of {batch_size:,}...")
    
    chunks = [data[x:x + batch_size] for x in range(0, len(data), batch_size)]
    
    print(f"✅ Created {len(chunks)} batch(es) for video transcriptions")
    for i, chunk in enumerate(chunks):
        print(f"  - Batch {i}: {len(chunk):,} requests")
    
    return chunks


In [10]:

# Create the batch requests
jsonl_data = create_batch_requests_for_transcriptions(input_texts, df)

# Split into manageable batches
chunks = split_into_batches(jsonl_data, TranscriptionClassificationConfig.BATCH_SIZE)

print(f"\n📊 Batch Processing Summary for Video Transcriptions:")
print(f"- Total requests: {len(jsonl_data):,}")
print(f"- Number of batches: {len(chunks)}")
print(f"- Batch size limit: {TranscriptionClassificationConfig.BATCH_SIZE:,}")
print(f"- Model: {TranscriptionClassificationConfig.MODEL_NAME}")
print(f"- Temperature: {TranscriptionClassificationConfig.TEMPERATURE}")
print(f"- Content type: Video transcriptions (longer text)")

🔧 Creating batch API requests for video transcriptions...


✅ Created 1,011 API requests for video transcriptions
📦 Splitting 1,011 transcription requests into batches of 40,000...
✅ Created 1 batch(es) for video transcriptions
  - Batch 0: 1,011 requests

📊 Batch Processing Summary for Video Transcriptions:
- Total requests: 1,011
- Number of batches: 1
- Batch size limit: 40,000
- Model: gpt-4.1-mini
- Temperature: 0.0
- Content type: Video transcriptions (longer text)
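
Because every request forces a call to the `RespostaAnaliseSentimento` function, the labels come back as a JSON string in the tool call's `arguments` field rather than as message text. The sketch below shows one way a line of the Batch API output file could be unpacked. `parse_batch_output_line` is a hypothetical helper, and the sample line is hand-written for illustration, assuming the output layout of `custom_id`, `response.body.choices`, and `error`.

```python
import json
from typing import Any, Dict, Optional, Tuple


def parse_batch_output_line(line: str) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (custom_id, labels) for one output line, or (custom_id, None) on failure."""
    record = json.loads(line)
    custom_id = record["custom_id"]

    response = record.get("response") or {}
    if record.get("error") or response.get("status_code") != 200:
        return custom_id, None

    try:
        message = response["body"]["choices"][0]["message"]
        arguments = message["tool_calls"][0]["function"]["arguments"]
        return custom_id, json.loads(arguments)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        # Missing tool call or malformed JSON arguments: treat as unparsed
        return custom_id, None


# Hand-written sample line, for illustration only
sample_line = json.dumps({
    "custom_id": "abc123",
    "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"tool_calls": [{"function": {
            "name": "RespostaAnaliseSentimento",
            "arguments": json.dumps({
                "sentimento": "neutro",
                "gordofobia_implicita": False,
                "gordofobia_explicita": False,
                "idioma": "pt",
                "obesidade": True,
            }),
        }}]}}]},
    },
    "error": None,
})

custom_id, labels = parse_batch_output_line(sample_line)
print(custom_id, labels)
```

Returning `None` for failed or malformed responses keeps the `custom_id` available, so unparsed transcriptions can be matched back to their rows and retried.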


In [11]:
def export_batch_files(chunks: List[List[Dict[str, Any]]], base_filename: str) -> List[str]:
    """
    Export batch requests to JSONL files for API processing.
    
    Args:
        chunks: List of batch chunks
        base_filename: Base filename for the batch files
        
    Returns:
        List of created file paths
    """
    print(f"💾 Exporting batch files for video transcription classification...")
    
    created_files = []
    
    for idx, chunk in enumerate(chunks):
        filename = f"{base_filename}_{idx}.jsonl"
        filepath = TranscriptionClassificationConfig.JSONL_DIR / filename
        
        try:
            with open(filepath, "w", encoding="utf-8") as f:
                for item in chunk:
                    f.write(json.dumps(item, ensure_ascii=False) + "\n")
            
            # Verify file creation and size
            file_size_mb = filepath.stat().st_size / (1024 * 1024)
            print(f"✅ Created {filename}: {file_size_mb:.2f} MB")
            created_files.append(str(filepath))
            
        except Exception as e:
            print(f"❌ Error creating {filename}: {e}")
            raise
    
    return created_files

# Export batch files
base_filename = TranscriptionClassificationConfig.BATCH_NAME_PREFIX
created_files = export_batch_files(chunks, base_filename)

print(f"\n📁 Batch Files Created for Video Transcriptions:")
for file_path in created_files:
    file_size = Path(file_path).stat().st_size / (1024 * 1024)
    print(f"- {Path(file_path).name}: {file_size:.2f} MB")

print(f"\n🎯 Ready for batch processing with {len(created_files)} file(s)")
print(f"🎬 Content: Video transcription classification")
print(f"📝 Note: Transcriptions may take longer to process due to length")

💾 Exporting batch files for video transcription classification...
✅ Created 20250417_youtube_transcriptions_batch_api_0.jsonl: 13.12 MB

📁 Batch Files Created for Video Transcriptions:
- 20250417_youtube_transcriptions_batch_api_0.jsonl: 13.12 MB

🎯 Ready for batch processing with 1 file(s)
🎬 Content: Video transcription classification
📝 Note: Transcriptions may take longer to process due to length
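
Before uploading, it can be worth re-reading each exported file to confirm that every line is valid JSON, that `custom_id` values are unique, and that the file stays within the Batch API limits. The sketch below assumes limits of 50,000 requests and 200 MB per input file, which should be checked against the current OpenAI documentation. `check_batch_file` is a hypothetical helper, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def check_batch_file(path: Path, max_requests: int = 50_000, max_mb: float = 200.0) -> dict:
    """Re-read a JSONL batch file and report request count, size, and duplicate IDs."""
    seen_ids = set()
    duplicates = 0
    n_requests = 0

    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            request = json.loads(line)  # raises if a line is not valid JSON
            n_requests += 1
            if request["custom_id"] in seen_ids:
                duplicates += 1
            seen_ids.add(request["custom_id"])

    size_mb = path.stat().st_size / (1024 * 1024)
    return {
        "requests": n_requests,
        "size_mb": size_mb,
        "duplicate_ids": duplicates,
        "within_limits": n_requests <= max_requests and size_mb <= max_mb,
    }


# Illustrative usage with a small temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"custom_id": f"req-{i}", "body": {}}, ensure_ascii=False) + "\n")
    report = check_batch_file(demo_path)

print(report["requests"], report["duplicate_ids"], report["within_limits"])
```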


In [12]:
def get_file_hash(file_path: str) -> str:
    """
    Generate MD5 hash of a file for unique batch naming.
    
    Args:
        file_path: Path to the file
        
    Returns:
        MD5 hash string
    """
    with open(file_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def initialize_batch_processors(file_paths: List[str]) -> Dict[str, OpenAIBatchProcessor]:
    """
    Initialize and submit batch jobs for transcription processing.
    
    Args:
        file_paths: List of JSONL file paths to process
        
    Returns:
        Dictionary mapping file paths to batch processors
    """
    print(f"🚀 Initializing batch processors for {len(file_paths)} transcription files...")
    
    batch_processors = {}
    
    for file_path in tqdm(file_paths, desc="Submitting transcription batches"):
        try:
            # Create processor and submit job
            processor = OpenAIBatchProcessor()
            batch_name = get_file_hash(file_path)
            
            processor.submit_batch_job(
                input_jsonl_path=file_path,
                batch_name=batch_name
            )
            
            batch_processors[file_path] = processor
            print(f"✅ Submitted batch for {Path(file_path).name}")
            
        except Exception as e:
            print(f"❌ Error submitting batch for {Path(file_path).name}: {e}")
            raise
    
    print(f"✅ All {len(batch_processors)} transcription batches submitted successfully")
    return batch_processors

def monitor_batch_status(processors: Dict[str, OpenAIBatchProcessor]) -> None:
    """
    Monitor the status of all batch jobs.
    
    Args:
        processors: Dictionary of batch processors to monitor
    """
    print("📊 Checking transcription batch status...")
    
    for file_path, processor in processors.items():
        try:
            batch_info = processor.get_batch_info()
            filename = Path(file_path).name
            print(f"- {filename}: {batch_info.status}")
            
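            # request_counts may be a plain dict or an object with attributes,
            # depending on what get_batch_info() returns, so support both
            counts = getattr(batch_info, 'request_counts', None)
            if counts:
                if isinstance(counts, dict):
                    completed, total = counts.get('completed', 0), counts.get('total', 0)
                else:
                    completed, total = getattr(counts, 'completed', 0), getattr(counts, 'total', 0)
                print(f"  📈 Progress: {completed}/{total} requests")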
            
        except Exception as e:
            print(f"❌ Error checking status for {Path(file_path).name}: {e}")

# Find all batch files for transcriptions
batch_files = glob(str(TranscriptionClassificationConfig.JSONL_DIR / f"{TranscriptionClassificationConfig.BATCH_NAME_PREFIX}*.jsonl"))

print(f"📁 Found {len(batch_files)} transcription batch files:")
for file_path in batch_files:
    file_size = Path(file_path).stat().st_size / (1024 * 1024)
    print(f"- {Path(file_path).name}: {file_size:.2f} MB")

# Initialize batch processors
if batch_files:
    batch_processors = initialize_batch_processors(batch_files)
    
    # Monitor initial status
    monitor_batch_status(batch_processors)
else:
    print("⚠️ No batch files found. Please ensure batch files were created successfully.")
    batch_processors = {}

📁 Found 1 transcription batch files:
- 20250417_youtube_transcriptions_batch_api_0.jsonl: 13.12 MB
🚀 Initializing batch processors for 1 transcription files...



Successfully submitted batch 2128b81fc15117fc4b6bafdeda080c9d with id batch_6883805a00a88190b26d8ece123e0c85
Batch info saved to ../data/intermediate/jsonl/20250417_youtube_transcriptions_batch_api_0_20250725_100218.txt
✅ Submitted batch for 20250417_youtube_transcriptions_batch_api_0.jsonl
✅ All 1 transcription batches submitted successfully
📊 Checking transcription batch status...
- 20250417_youtube_transcriptions_batch_api_0.jsonl: validating
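
`get_file_hash` reads the whole JSONL file into memory before hashing, which is fine for the 13 MB file submitted here. For much larger batch files, a chunked variant keeps memory use flat. The following is a minimal sketch, not the notebook's implementation; `get_file_hash_chunked` and its `chunk_size` parameter are hypothetical names introduced for illustration.

```python
import hashlib


def get_file_hash_chunked(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is identical to the one produced by hashing the full file contents at once, so batch names would stay the same if this variant were swapped in.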


In [13]:
def wait_for_completion(processors: Dict[str, OpenAIBatchProcessor], check_interval: int = 60) -> None:
    """
    Wait for all transcription batch jobs to complete with periodic status updates.
    
    Args:
        processors: Dictionary of batch processors to monitor
        check_interval: Seconds between status checks
    """
    if not processors:
        print("⚠️ No batch processors to monitor")
        return
        
    print(f"⏳ Waiting for transcription batch completion (checking every {check_interval}s)...")
    print("ℹ️ Video transcriptions may take longer to process due to content length")
    
    # Statuses after which an OpenAI batch no longer changes
    terminal_statuses = {"completed", "failed", "expired", "cancelled"}
    
    while True:
        try:
            # Fetch each batch status once per polling cycle
            statuses = {
                file_path: processor.get_batch_info().status
                for file_path, processor in processors.items()
            }
            
            completed_count = sum(1 for status in statuses.values() if status == "completed")
            total_count = len(statuses)
            
            # Clear output and show current status
            clear_output(wait=True)
            print(f"🔄 Transcription Batch Processing Status: {completed_count}/{total_count} completed")
            
            # Show detailed status
            for file_path, status in statuses.items():
                print(f"- {Path(file_path).name}: {status}")
            
            # Check if all completed
            if completed_count == total_count:
                print("✅ All transcription batches completed successfully!")
                break
            
            # Stop when every batch is in a terminal state but not all succeeded;
            # otherwise a failed or expired batch would keep the loop running forever
            if all(status in terminal_statuses for status in statuses.values()):
                print("❌ Some transcription batches ended without completing - check error details")
                break
            
            # Wait before next check
            time.sleep(check_interval)
            
        except KeyboardInterrupt:
            print("\n⚠️ Monitoring interrupted by user")
            break
        except Exception as e:
            print(f"❌ Error during monitoring: {e}")
            break

# Start monitoring (this will run until completion)
print("\n🎯 Starting transcription batch monitoring...")
print("Note: This cell will run until all batches are completed.")
print("You can interrupt with Ctrl+C if needed.")

wait_for_completion(batch_processors)

🔄 Transcription Batch Processing Status: 1/1 completed
- 20250417_youtube_transcriptions_batch_api_0.jsonl: completed
✅ All transcription batches completed successfully!
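
A polling loop such as `wait_for_completion` is easier to test when the stopping rule is separated from the API calls and the sleeping. The sketch below illustrates that separation under stated assumptions: `poll_until_done`, `get_statuses`, `max_checks`, and the injectable `sleep` are hypothetical names, and the terminal status names are those listed for the OpenAI Batch API; whether `OpenAIBatchProcessor` exposes them unchanged is an assumption.

```python
import time
from typing import Callable, Dict

# Assumed terminal statuses, taken from the OpenAI Batch API status list
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def poll_until_done(
    get_statuses: Callable[[], Dict[str, str]],
    check_interval: float = 60.0,
    max_checks: int = 1440,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every job is in a terminal status or max_checks is reached.

    Returns the last observed status per job so the caller can decide
    what to do with failed or expired batches.
    """
    statuses: Dict[str, str] = {}
    for check in range(max_checks):
        statuses = get_statuses()
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            break
        # Skip the final sleep when no further check will follow
        if check < max_checks - 1:
            sleep(check_interval)
    return statuses
```

With the processors from this notebook, `get_statuses` could be a lambda that builds `{file_path: processor.get_batch_info().status}` for each entry of `batch_processors`. The `max_checks` bound gives the loop an upper limit even if a batch never leaves a non-terminal status.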


## 7. Results Processing and Data Integration

In [14]:
from typing import Tuple
def process_batch_results(processors: Dict[str, OpenAIBatchProcessor]) -> Tuple[List[Any], List[Dict]]:
    """
    Process and parse results from completed video transcription batch jobs.
    
    Args:
        processors: Dictionary of batch processors
        
    Returns:
        Tuple of (parsed_results, raw_results)
    """
    print("🔄 Processing video transcription batch results...")
    
    parsed_results = []
    raw_results = []
    error_count = 0
    
    for file_path, processor in processors.items():
        filename = Path(file_path).name
        print(f"📂 Processing results from {filename}...")
        
        try:
            # Get batch output
            file_response = processor.get_batch_output()
            if not file_response:
                print(f"⚠️ No response data for {filename}")
                continue
            
            # Process each response
            batch_parsed = 0
            batch_errors = 0
            
            for output in file_response:
                try:
                    # Parse the JSON response
                    json_output = json.loads(output)
                    function_args = json_output["response"]["body"]["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
                    parsed_json = json.loads(function_args)
                    
                    # Validate with Pydantic model
                    validated_obj = RespostaAnaliseSentimento.model_validate(parsed_json)
                    parsed_results.append(validated_obj)
                    raw_results.append(validated_obj.model_dump())
                    batch_parsed += 1
                    
                except Exception as e:
                    # Handle parsing errors
                    parsed_results.append(None)
                    raw_results.append(None)
                    batch_errors += 1
                    error_count += 1
                    
                    if batch_errors <= 3:  # Show first few errors
                        print(f"⚠️ Parsing error: {str(e)[:100]}")
            
            print(f"  ✅ Parsed: {batch_parsed}, Errors: {batch_errors}")
            
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            continue
    
    success_rate = (len(parsed_results) - error_count) / len(parsed_results) * 100 if parsed_results else 0
    
    print(f"\n📊 Video Transcription Results Processing Summary:")
    print(f"- Total responses: {len(parsed_results):,}")
    print(f"- Successfully parsed: {len(parsed_results) - error_count:,}")
    print(f"- Parse errors: {error_count:,}")
    print(f"- Success rate: {success_rate:.1f}%")
    
    return parsed_results, raw_results

def save_intermediate_results(parsed_results: List[Any], raw_results: List[Dict]) -> None:
    """
    Save intermediate results for backup and debugging.
    
    Args:
        parsed_results: List of parsed Pydantic objects
        raw_results: List of raw result dictionaries
    """
    print("💾 Saving intermediate video transcription results...")
    
    try:
        # Save parsed results
        joblib.dump(parsed_results, TranscriptionClassificationConfig.PARSED_RESULTS_FILE)
        print(f"✅ Parsed results saved: {TranscriptionClassificationConfig.PARSED_RESULTS_FILE}")
        
        # Save raw results
        joblib.dump(raw_results, TranscriptionClassificationConfig.RESULTS_FILE)
        print(f"✅ Raw results saved: {TranscriptionClassificationConfig.RESULTS_FILE}")
        
        # File size information
        parsed_size = TranscriptionClassificationConfig.PARSED_RESULTS_FILE.stat().st_size / (1024 * 1024)
        raw_size = TranscriptionClassificationConfig.RESULTS_FILE.stat().st_size / (1024 * 1024)
        
        print(f"📊 File sizes:")
        print(f"- Parsed results: {parsed_size:.2f} MB")
        print(f"- Raw results: {raw_size:.2f} MB")
        
    except Exception as e:
        print(f"❌ Error saving intermediate results: {e}")
        raise

# Process all video transcription batch results
parsed_results, raw_results = process_batch_results(batch_processors)

# Save intermediate results for backup
save_intermediate_results(parsed_results, raw_results)

🔄 Processing video transcription batch results...
📂 Processing results from 20250417_youtube_transcriptions_batch_api_0.jsonl...
  ✅ Parsed: 1011, Errors: 0

📊 Video Transcription Results Processing Summary:
- Total responses: 1,011
- Successfully parsed: 1,011
- Parse errors: 0
- Success rate: 100.0%
💾 Saving intermediate video transcription results...
✅ Parsed results saved: ../data/tmp/parsed_results_transcriptions.joblib
✅ Raw results saved: ../data/tmp/results_transcriptions.joblib
📊 File sizes:
- Parsed results: 0.06 MB
- Raw results: 0.02 MB


In [15]:
def validate_results_consistency(parsed_results: List[Any], original_df: pd.DataFrame) -> None:
    """
    Validate that results match the original video transcription data structure.
    
    Args:
        parsed_results: List of classification results
        original_df: Original DataFrame with video transcriptions
    """
    print("🔍 Validating video transcription results consistency...")
    
    print(f"- Original video transcriptions: {len(original_df):,}")
    print(f"- Classification results: {len(parsed_results):,}")
    
    if len(parsed_results) == len(original_df):
        print("✅ Result count matches original video transcription data")
    else:
        print("⚠️ Result count mismatch - check for processing errors")
    
    # Check for null results
    null_count = sum(1 for result in parsed_results if result is None)
    if null_count > 0:
        print(f"⚠️ Found {null_count} null results ({null_count/len(parsed_results)*100:.1f}%)")
    else:
        print("✅ No null results found")

def create_results_dataframe(parsed_results: List[Any]) -> pd.DataFrame:
    """
    Convert parsed results to a structured DataFrame.
    
    Args:
        parsed_results: List of parsed Pydantic objects
        
    Returns:
        DataFrame with classification results
    """
    print("Creating results DataFrame...")
    
    outputs = []
    
    for i, parsed_document in enumerate(tqdm(parsed_results, desc="Converting results")):
        if parsed_document is not None:
            parsed_dict = parsed_document.model_dump()
            outputs.append(parsed_dict)
        else:
            # Handle null results with default values
            outputs.append({
                "sentimento": None,
                "gordofobia_implicita": None,
                "gordofobia_explicita": None,
                "idioma": None,
                "obesidade": None,
            })
    
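    # Build the DataFrame by field name so columns cannot be mislabeled
    # if the Pydantic model declares its fields in a different order
    result_columns = ["sentimento", "gordofobia_implicita", "gordofobia_explicita", "idioma", "obesidade"]
    df_results = pd.DataFrame(outputs, columns=result_columns)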
    
    print(f"✅ Created results DataFrame with {len(df_results):,} records")
    return df_results

# Validate consistency
validate_results_consistency(parsed_results, df)

# Create structured results DataFrame
df_results = create_results_dataframe(parsed_results)

🔍 Validating video transcription results consistency...
- Original video transcriptions: 1,011
- Classification results: 1,011
✅ Result count matches original video transcription data
✅ No null results found
Creating results DataFrame...



✅ Created results DataFrame with 1,011 records


In [16]:
# Final status check for all video transcription batches
print("🔍 Final video transcription batch status check:")
for file_path, processor in batch_processors.items():
    batch_info = processor.get_batch_info()
    filename = Path(file_path).name
    print(f"- {filename}: {batch_info.status}")
    
    if batch_info.status == "completed":
        print(f"  ✅ Ready for result processing")
    elif batch_info.status == "failed":
        print(f"  ❌ Batch failed - check error details")
    else:
        print(f"  ⏳ Still processing - current status: {batch_info.status}")

# Check if we can proceed to results processing
completed_batches = sum(1 for processor in batch_processors.values() 
                       if processor.get_batch_info().status == "completed")
total_batches = len(batch_processors)

print(f"\n📊 Video Transcription Completion Summary:")
print(f"- Completed batches: {completed_batches}/{total_batches}")
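# Guard against division by zero when no batches were submitted
batch_success_rate = completed_batches / total_batches * 100 if total_batches else 0.0
print(f"- Success rate: {batch_success_rate:.1f}%")

if total_batches and completed_batches == total_batches: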
    print("✅ All video transcription batches completed - ready for results processing")
else:
    print("⚠️ Some video transcription batches are still pending - wait for completion before proceeding")

🔍 Final video transcription batch status check:
- 20250417_youtube_transcriptions_batch_api_0.jsonl: completed
  ✅ Ready for result processing

📊 Video Transcription Completion Summary:
- Completed batches: 1/1
- Success rate: 100.0%
✅ All video transcription batches completed - ready for results processing


In [17]:
def integrate_transcription_data(original_df: pd.DataFrame, results_df: pd.DataFrame) -> pd.DataFrame:
    """
    Integrate original transcription data with classification results.
    
    Args:
        original_df: Original video transcription DataFrame
        results_df: Classification results DataFrame
        
    Returns:
        Combined DataFrame with all transcription and classification data
    """
    print("Integrating video transcription data with classification results...")
    
    # Combine original data with results
    df_combined = pd.concat([original_df.reset_index(drop=True), results_df], axis=1)
    
    print(f"Integrated DataFrame with {len(df_combined)} video transcription records")
    print(f"Columns: {list(df_combined.columns)}")
    
    return df_combined


In [18]:

def export_final_results(df_combined: pd.DataFrame) -> None:
    """
    Export final integrated video transcription dataset.
    
    Args:
        df_combined: Combined DataFrame with transcription and classification data
    """
    print("Exporting final video transcription classification dataset...")
    
    # Create output directory
    output_dir = TranscriptionClassificationConfig.DATA_DIR / "intermediate"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Export to Parquet
    output_file = output_dir / "20250417_youtube_transcriptions_yes_labels.parquet"
    df_combined.to_parquet(output_file, index=False)
    
    # Calculate file size
    file_size_mb = output_file.stat().st_size / (1024 * 1024)
    
    print(f"Exported to: {output_file}")
    print(f"File size: {file_size_mb:.2f} MB")
    print(f"Records: {len(df_combined):,}")
    
    # Display sample results
    print(f"\nSample Classification Results:")
    if 'obesidade' in df_combined.columns:
        print(f"Obesity content distribution:")
        print(df_combined['obesidade'].value_counts())
    
    if 'idioma' in df_combined.columns:
        print(f"\nLanguage distribution:")
        print(df_combined['idioma'].value_counts())


In [19]:

def display_final_summary(df_combined: pd.DataFrame) -> None:
    """
    Display final summary of the video transcription classification pipeline.
    
    Args:
        df_combined: Final combined DataFrame
    """
    print("Video Transcription Classification Pipeline Summary")
    print("=" * 60)
    
    print(f"Dataset Statistics:")
    print(f"- Total video transcriptions processed: {len(df_combined):,}")
    print(f"- Data columns: {len(df_combined.columns)}")
    
    # Classification results summary
    if 'obesidade' in df_combined.columns:
        # 'obesidade' holds booleans (True/False), so compare against booleans;
        # the strings 'sim'/'não' would never match and every count would be 0
        obesity_yes = df_combined['obesidade'].eq(True).sum()
        obesity_no = df_combined['obesidade'].eq(False).sum()
        obesity_null = df_combined['obesidade'].isna().sum()
        
        print(f"\nVideo Content Classification:")
        print(f"- Videos with obesity content: {obesity_yes:,}")
        print(f"- Videos without obesity content: {obesity_no:,}")
        print(f"- Failed classifications: {obesity_null:,}")
        
        if len(df_combined) > 0:
            success_rate = (obesity_yes + obesity_no) / len(df_combined) * 100
            print(f"- Classification success rate: {success_rate:.1f}%")
    
    print(f"\nVideo transcription classification pipeline completed successfully!")
    print(f"Output file ready for research analysis")


In [20]:

# Integrate original data with classification results
df_final = integrate_transcription_data(df, df_results)



Integrating video transcription data with classification results...
Integrated DataFrame with 1011 video transcription records
Columns: ['video_id', 'transcription', 'duration', 'video_title', 'sentimento', 'gordofobia_implicita', 'gordofobia_explicita', 'idioma', 'obesidade']


In [21]:
# Export final integrated dataset
export_final_results(df_final)



Exporting final video transcription classification dataset...
Exported to: ../data/intermediate/20250417_youtube_transcriptions_yes_labels.parquet
File size: 4.97 MB
Records: 1,011

Sample Classification Results:
Obesity content distribution:
obesidade
True     762
False    249
Name: count, dtype: int64

Language distribution:
idioma
pt    1011
Name: count, dtype: int64


In [22]:
# Display final summary
display_final_summary(df_final)

Video Transcription Classification Pipeline Summary
Dataset Statistics:
- Total video transcriptions processed: 1,011
- Data columns: 9

Video Content Classification:
- Videos with obesity content: 762
- Videos without obesity content: 249
- Failed classifications: 0
- Classification success rate: 100.0%

Video transcription classification pipeline completed successfully!
Output file ready for research analysis
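
The export step reports `obesidade` as a boolean column (762 `True`, 249 `False`). A tally over such labels has to compare against booleans; a comparison with strings such as `'sim'` matches nothing. The sketch below shows the counting logic in plain Python, using the reported counts as input. `summarize_flags` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List, Optional


def summarize_flags(flags: List[Optional[bool]]) -> Dict[str, float]:
    """Count True, False and missing labels and compute the share of labeled rows."""
    yes = sum(1 for flag in flags if flag is True)
    no = sum(1 for flag in flags if flag is False)
    missing = sum(1 for flag in flags if flag is None)
    total = len(flags)
    return {
        "yes": yes,
        "no": no,
        "missing": missing,
        # Share of rows that received a usable label, in percent
        "success_rate": (yes + no) / total * 100 if total else 0.0,
    }


# Rebuild the label list from the counts printed by the export step
flags = [True] * 762 + [False] * 249
summary = summarize_flags(flags)
print(summary)

# A string comparison never matches a boolean label, so this count is zero
string_matches = sum(1 for flag in flags if flag == "sim")
print(string_matches)
```

For these inputs the summary reports 762 labeled as obesity content, 249 as not, none missing, and a success rate of 100.0, while the string comparison counts 0 rows.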


## 8. Pipeline Summary and Documentation

This section provides a comprehensive summary of the video transcription classification pipeline and its outcomes.

In [23]:
def generate_pipeline_documentation() -> None:
    """
    Generate comprehensive documentation for the video transcription classification pipeline.
    """
    print("Video Transcription Classification Pipeline Documentation")
    print("=" * 70)
    
    print("\nPipeline Objective:")
    print("Zero-shot classification of YouTube video transcriptions to identify")
    print("weight stigma content and sentiment analysis using OpenAI GPT-4.1-mini.")
    
    print("\nTechnical Components:")
    print("1. Data Loading: Video transcription data from notebook 03")
    print("2. Schema Definition: Pydantic model for structured outputs")
    print("3. Configuration: TranscriptionClassificationConfig class")
    print("4. Batch Processing: OpenAI Batch API for large-scale classification")
    print("5. Results Processing: Structured data integration and validation")
    print("6. Data Export: Research-ready datasets in Parquet format")
    
    print("\nClassification Schema:")
    print("- sentimento: Emotional sentiment analysis")
    print("- gordofobia_implicita: Implicit weight stigma detection")
    print("- gordofobia_explicita: Explicit weight stigma detection")
    print("- idioma: Language identification")
    print("- obesidade: Obesity-related content classification")
    
    print("\nResearch Applications:")
    print("- Content analysis of video transcriptions")
    print("- Comparative analysis with comment classification (notebook 04)")
    print("- Multi-modal YouTube content research")
    print("- Weight stigma prevalence studies")
    
    print("\nOutput Files:")
    print("- 20250417_youtube_transcriptions_yes_labels.parquet")
    print("- Intermediate results in joblib format")
    print("- Processing logs and configuration backups")
    
    print("\nPipeline Status: Complete and Ready for Research")



In [24]:

def validate_pipeline_completion() -> None:
    """
    Validate that all pipeline components completed successfully.
    """
    print("\nPipeline Completion Validation:")
    
    # Check output files
    output_file = TranscriptionClassificationConfig.DATA_DIR / "intermediate" / "20250417_youtube_transcriptions_yes_labels.parquet"
    
    if output_file.exists():
        file_size = output_file.stat().st_size / (1024 * 1024)
        print(f"Main output file exists: {file_size:.2f} MB")
    else:
        print("Main output file missing")
    
    # Check intermediate files
    if TranscriptionClassificationConfig.PARSED_RESULTS_FILE.exists():
        print("Parsed results backup exists")
    else:
        print("Parsed results backup missing")
    
    if TranscriptionClassificationConfig.RESULTS_FILE.exists():
        print("Raw results backup exists")
    else:
        print("Raw results backup missing")
    
    print("\nNext Steps:")
    print("1. Proceed to exploratory data analysis (notebook 09)")
    print("2. Compare with comment classification results (notebook 04)")
    print("3. Integrate with video metadata for comprehensive analysis")
    print("4. Conduct statistical analysis of weight stigma patterns")

In [25]:

# Generate documentation
generate_pipeline_documentation()



Video Transcription Classification Pipeline Documentation

Pipeline Objective:
Zero-shot classification of YouTube video transcriptions to identify
weight stigma content and sentiment analysis using OpenAI GPT-4.1-mini.

Technical Components:
1. Data Loading: Video transcription data from notebook 03
2. Schema Definition: Pydantic model for structured outputs
3. Configuration: TranscriptionClassificationConfig class
4. Batch Processing: OpenAI Batch API for large-scale classification
5. Results Processing: Structured data integration and validation
6. Data Export: Research-ready datasets in Parquet format

Classification Schema:
- sentimento: Emotional sentiment analysis
- gordofobia_implicita: Implicit weight stigma detection
- gordofobia_explicita: Explicit weight stigma detection
- idioma: Language identification
- obesidade: Obesity-related content classification

Research Applications:
- Content analysis of video transcriptions
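- Comparative analysis with comment classification (notebook 04)
- Multi-modal YouTube content research
- Weight stigma prevalence studies

Output Files:
- 20250417_youtube_transcriptions_yes_labels.parquet
- Intermediate results in joblib format
- Processing logs and configuration backups

Pipeline Status: Complete and Ready for Research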

In [26]:
# Validate completion
validate_pipeline_completion()


Pipeline Completion Validation:
Main output file exists: 4.97 MB
Parsed results backup exists
Raw results backup exists

Next Steps:
1. Proceed to exploratory data analysis (notebook 09)
2. Compare with comment classification results (notebook 04)
3. Integrate with video metadata for comprehensive analysis
4. Conduct statistical analysis of weight stigma patterns
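
`validate_pipeline_completion` prints its findings, which suits a notebook but is hard to reuse in an automated check. A variant that returns a small report could look like the sketch below; `check_outputs` is a hypothetical helper and the file names in the example are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional


def check_outputs(paths: Iterable[Path]) -> Dict[str, Optional[float]]:
    """Map each file name to its size in MB, or None when the file is missing."""
    report: Dict[str, Optional[float]] = {}
    for path in paths:
        report[path.name] = path.stat().st_size / (1024 * 1024) if path.is_file() else None
    return report


# Example with one existing and one missing file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    present = Path(tmp_dir) / "labels.parquet"
    present.write_bytes(b"x" * 1024 * 1024)
    absent = Path(tmp_dir) / "results.joblib"
    example_report = check_outputs([present, absent])

print(example_report)
```

In this notebook the function could be called with the Parquet output path together with `TranscriptionClassificationConfig.PARSED_RESULTS_FILE` and `TranscriptionClassificationConfig.RESULTS_FILE`; any `None` value signals a missing file.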
