# Video Transcription Collection for YouTube Weight Stigma Research

This notebook implements the pipeline that collects video transcripts for a study of weight stigma on YouTube. Transcripts capture what is actually said in each video and complement the comment analysis.

## Research Overview

This study collects and processes video transcriptions to:
- Analyze video content themes and language patterns
- Compare video content with user comments
- Identify weight stigma language in video discourse
- Enable comprehensive content analysis

## Pipeline Overview

The transcription collection process includes:

1. **Data Loading**: Load cleaned video metadata from previous steps
2. **Transcript Collection**: Use `youtube-transcript-api` to fetch Portuguese transcripts
3. **Fallback Translation**: When no Portuguese transcript exists, translate a transcript from another language into Portuguese
4. **Data Processing**: Aggregate transcript segments into full video transcriptions
5. **Quality Control**: Validate and clean transcription data
6. **Export**: Save processed transcriptions for analysis
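
Steps 2 and 3 amount to a preference order over the transcript tracks a video offers. The sketch below illustrates that decision in plain Python. It is not the logic inside `transcription_utils`, which is not shown in this notebook; the dictionary fields (`language_code`, `is_generated`, `is_translatable`) and the function name `choose_transcript` are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def choose_transcript(available: List[Dict], target: str = "pt") -> Optional[Dict]:
    """Pick a target-language track if one exists, else a translatable fallback."""
    # Match regional variants such as "pt-BR" by comparing the base language code.
    in_target = [t for t in available if t["language_code"].split("-")[0] == target]
    if in_target:
        # False sorts before True, so manually created tracks win over auto-generated ones.
        best = min(in_target, key=lambda t: t["is_generated"])
        return {**best, "needs_translation": False}

    # No target-language track: fall back to the first track that can be translated.
    translatable = [t for t in available if t.get("is_translatable")]
    if translatable:
        return {**translatable[0], "needs_translation": True}

    # Nothing usable: the caller records the video as a failure.
    return None
```

A video with only an English track would come back with `needs_translation=True`, and a video with no tracks returns `None`.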

## Input Data

- **Source**: Cleaned comment data from `02_basic_cleaning.ipynb`
- **Expected location**: `../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet`
- **Content**: One row per comment; the unique `video_id` and `video_title` pairs identify the videos to transcribe

## Output Data

- **Destination**: `../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet`
- **Content**: One row per video with `video_id`, `transcription`, `duration`, and `video_title`

## Requirements

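- `youtube-transcript-api` for fetching transcripts
- `pandas` with a Parquet engine (for example `pyarrow`) for reading and writing the data files
- Error handling for API limits and videos without transcripts
- Checkpointing so long runs can be interrupted and resumed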

**Note**: Some videos may not have transcripts available. The pipeline handles these cases gracefully.
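
One way to handle missing transcripts gracefully is to record each failure and move on instead of aborting the run. The following is a minimal sketch under that assumption: `collect_transcripts` and the injected `fetch_fn` are hypothetical names, and `fetch_fn` stands in for whatever call actually retrieves a transcript.

```python
import time
from typing import Callable, Dict, List, Tuple


def collect_transcripts(
    video_ids: List[str],
    fetch_fn: Callable[[str], list],
    sleep_interval: float = 2.0,
) -> Tuple[Dict[str, list], Dict[str, str]]:
    """Fetch each video's transcript, recording failures instead of raising."""
    successes: Dict[str, list] = {}
    failures: Dict[str, str] = {}
    for i, video_id in enumerate(video_ids):
        try:
            successes[video_id] = fetch_fn(video_id)
        except Exception as exc:
            # Keep the reason so failures can be reviewed after the run.
            failures[video_id] = f"{type(exc).__name__}: {exc}"
        # Pause between requests (not after the last one) to reduce rate limiting.
        if i < len(video_ids) - 1:
            time.sleep(sleep_interval)
    return successes, failures
```

The two dictionaries together account for every input ID, which is the same split the run log reports as successes and failures.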

## 1. Import Libraries and Configuration

In [1]:
import logging
import sys
import warnings
from pathlib import Path
from typing import Any, Dict

import pandas as pd

# Add parent directory to path to import custom modules
sys.path.append(str(Path("..").resolve()))

# Import custom transcription utilities
from transcription_utils import VideoTranscriptDownloader, process_transcripts_to_final_format

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


# Configuration class for transcription collection
class TranscriptionConfig:
    """Configuration for video transcription collection pipeline."""

    # File paths
    DATA_DIR = Path("../data")
    INTERMEDIATE_DATA_DIR = DATA_DIR / "intermediate"
    TMP_DATA_DIR = DATA_DIR / "tmp"

    # Input file (from cleaning notebook)
    INPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_comments_pt_cleaned1.parquet"

    # Output file
    OUTPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_transcriptions_no_labels.parquet"

    # Checkpoint file for progress tracking
    CHECKPOINT_FILE = TMP_DATA_DIR / "transcription_video_processing_checkpoint.joblib"

    # Processing parameters
    SLEEP_INTERVAL = 2  # Seconds between API requests to avoid rate limiting
    TRANSLATE_FALLBACK = True  # Whether to translate non-Portuguese transcripts
    CHECKPOINT_FREQUENCY = 5  # Save progress every N videos

    # Create directories if they don't exist
    @classmethod
    def create_directories(cls):
        cls.INTERMEDIATE_DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.TMP_DATA_DIR.mkdir(parents=True, exist_ok=True)


# Create directories
TranscriptionConfig.create_directories()

print("✅ Libraries loaded and configuration set")
print(f"📁 Input file: {TranscriptionConfig.INPUT_FILE}")
print(f"📁 Output file: {TranscriptionConfig.OUTPUT_FILE}")
print(f"💾 Checkpoint file: {TranscriptionConfig.CHECKPOINT_FILE}")
print(f"⏱️ Sleep interval: {TranscriptionConfig.SLEEP_INTERVAL}s")
print(f"🌐 Translation fallback: {TranscriptionConfig.TRANSLATE_FALLBACK}")

# Verify input file exists
if TranscriptionConfig.INPUT_FILE.exists():
    print(f"✅ Input file found: {TranscriptionConfig.INPUT_FILE.name}")
    file_size = TranscriptionConfig.INPUT_FILE.stat().st_size / (1024 * 1024)  # MB
    print(f"📊 File size: {file_size:.1f} MB")
else:
    print(f"❌ Input file not found: {TranscriptionConfig.INPUT_FILE}")
    print("Please run the data cleaning notebook (02_basic_cleaning.ipynb) first")

✅ Libraries loaded and configuration set
📁 Input file: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
📁 Output file: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
💾 Checkpoint file: ../data/tmp/transcription_video_processing_checkpoint.joblib
⏱️ Sleep interval: 2s
🌐 Translation fallback: True
✅ Input file found: 20250417_youtube_comments_pt_cleaned1.parquet
📊 File size: 47.1 MB


## 2. Load and Explore Video Data

In [2]:
def load_and_explore_video_data(file_path: Path) -> pd.DataFrame:
    """
    Load and explore video data for transcription collection.

    Args:
        file_path: Path to the cleaned video comments data

    Returns:
        DataFrame with video metadata for transcription collection
    """
    logger.info(f"Loading video data from {file_path}")

    try:
        # Load the cleaned comments data
        df_comments = pd.read_parquet(file_path)
        logger.info(f"✅ Loaded {len(df_comments):,} comments")

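        # Extract unique video information, keeping exactly one row per video ID
        # even if the same video appears with slightly different titles
        df_videos = df_comments[["video_id", "video_title"]].drop_duplicates(subset="video_id")
        df_videos = df_videos.reset_index(drop=True)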

        logger.info(f"📹 Found {len(df_videos):,} unique videos for transcription")

        # Display sample data
        print("\n📊 Video Data Sample:")
        print(df_videos.head())

        print(f"\n📈 Data Overview:")
        print(f"- Total unique videos: {len(df_videos):,}")
        print(f"- Total comments across videos: {len(df_comments):,}")
        print(f"- Average comments per video: {len(df_comments) / len(df_videos):.1f}")

        return df_videos

    except Exception as e:
        logger.error(f"❌ Error loading video data: {e}")
        raise


# Load the video data
df_videos = load_and_explore_video_data(TranscriptionConfig.INPUT_FILE)

2025-07-24 07:07:24,696 - INFO - Loading video data from ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
2025-07-24 07:07:25,745 - INFO - ✅ Loaded 191,946 comments
2025-07-24 07:07:25,786 - INFO - 📹 Found 1,204 unique videos for transcription



📊 Video Data Sample:
      video_id                                        video_title
0  --tK3SaYWr4                 Tony Gordo é Incriminado #simpsons
1  -1DN4904BQw                 O país mais obeso do mundo #shorts
2  -4xj_teI1EQ  Preconceitos que eu já sofri por ser uma baila...
3  -6Qxw7CpQvQ  esse milionário de 18 anos não quer pegar mulh...
4  -7fJRjz1BCM  Lula volta a fazer piada com obesidade de Fláv...

📈 Data Overview:
- Total unique videos: 1,204
- Total comments across videos: 191,946
- Average comments per video: 159.4


In [3]:
# Analyze video distribution
print(f"🎯 Video Collection Summary:")
print(f"- Unique video IDs: {df_videos['video_id'].nunique():,}")
print(f"- Videos with titles: {df_videos['video_title'].notna().sum():,}")
print(f"- Videos without titles: {df_videos['video_title'].isna().sum():,}")

# Check for any data quality issues
if df_videos["video_id"].duplicated().any():
    logger.warning("⚠️ Duplicate video IDs found - this should not happen")
    duplicate_count = df_videos["video_id"].duplicated().sum()
    print(f"❌ Found {duplicate_count} duplicate video IDs")
else:
    print("✅ No duplicate video IDs found")

# Display video ID examples for validation
print("\n🔍 Sample Video IDs:")
sample_videos = df_videos.sample(min(5, len(df_videos)))
for _, row in sample_videos.iterrows():
    title = str(row["video_title"])  # str() guards against missing titles
    print(f"- {row['video_id']}: {title[:60]}{'...' if len(title) > 60 else ''}")

🎯 Video Collection Summary:
- Unique video IDs: 1,204
- Videos with titles: 1,204
- Videos without titles: 0
✅ No duplicate video IDs found

🔍 Sample Video IDs:
- FWm_FWOyQL8: Perdendo peso e ganhando saúde❤️‍🩹 #emagrecimento #vencendoa...
- 2jYX2DHpQTA: Quais as doenças associadas à obesidade?
- tp8aPwI3Mhg: DIÁRIO DA DIETA | ESTOU OBESA | PRECISO EMAGRECER 10KG
- Ejq6pKngyXQ: O QUE UM OBESO MAIS PRECISA PARA EMAGRECER? – IRONBERG PODCA...
- Pv5iJPhhgp4: Con Ánimo de Ofender : Cap #38 - Gordo, Gordo


## 3. Video Transcription Collection

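This section uses the custom `VideoTranscriptDownloader` class from `transcription_utils` to collect transcripts for every video ID. The downloader pauses between requests (`SLEEP_INTERVAL`) and writes progress to a checkpoint file, so an interrupted run resumes where it stopped.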
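
The internals of `VideoTranscriptDownloader` are not shown in this notebook. As a rough illustration of checkpointed progress tracking, the sketch below saves state every few videos and skips IDs that are already recorded. It uses JSON from the standard library, whereas the notebook's checkpoint is a `.joblib` file; the function names and the convention that `fetch_fn` returns `None` for an unavailable transcript are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def load_checkpoint(path: Path) -> Dict[str, Optional[list]]:
    """Return previously saved results, or an empty dict on a first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_checkpoint(path: Path, state: Dict[str, Optional[list]]) -> None:
    """Write to a temporary file first so an interrupted save cannot corrupt the checkpoint."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


def process_with_checkpoint(
    video_ids: List[str],
    fetch_fn: Callable[[str], Optional[list]],
    path: Path,
    every: int = 5,
) -> Dict[str, Optional[list]]:
    """Process only the IDs missing from the checkpoint, saving every `every` videos."""
    state = load_checkpoint(path)
    pending = [v for v in video_ids if v not in state]
    for n, video_id in enumerate(pending, start=1):
        state[video_id] = fetch_fn(video_id)  # segments, or None when unavailable
        if n % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save covers the last partial batch
    return state
```

Rerunning with the same checkpoint path makes no further requests for videos already recorded, which matches the "All videos have already been processed" behaviour seen in the log of the collection cell.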

In [4]:
# Initialize the transcript downloader
video_ids = df_videos["video_id"].tolist()

print(f"🚀 Initializing VideoTranscriptDownloader")
print(f"- Videos to process: {len(video_ids):,}")
print(f"- Checkpoint file: {TranscriptionConfig.CHECKPOINT_FILE}")
print(f"- Sleep interval: {TranscriptionConfig.SLEEP_INTERVAL}s")

downloader = VideoTranscriptDownloader(
    video_ids=video_ids,
    checkpoint_file=TranscriptionConfig.CHECKPOINT_FILE,
    sleep_interval=TranscriptionConfig.SLEEP_INTERVAL,
)

print("✅ VideoTranscriptDownloader initialized successfully")

🚀 Initializing VideoTranscriptDownloader
- Videos to process: 1,204
- Checkpoint file: ../data/tmp/transcription_video_processing_checkpoint.joblib
- Sleep interval: 2s
✅ VideoTranscriptDownloader initialized successfully


In [5]:
# Execute transcript collection with progress tracking
print("🎬 Starting transcript collection...")
print("This process may take some time depending on the number of videos.")
print("Progress is automatically saved - you can safely interrupt and resume.")

try:
    # Download transcripts with translation fallback
    df_raw_transcripts = downloader.download_transcripts(translate_to_pt=TranscriptionConfig.TRANSLATE_FALLBACK)

    print(f"\n✅ Transcript collection completed!")
    print(f"- Raw transcript segments collected: {len(df_raw_transcripts):,}")

    if not df_raw_transcripts.empty:
        print(f"- Videos with transcripts: {df_raw_transcripts['video_id'].nunique():,}")
        print(f"- Average segments per video: {len(df_raw_transcripts) / df_raw_transcripts['video_id'].nunique():.1f}")

        # Show sample transcript data
        print(f"\n📋 Sample Transcript Data:")
        print(df_raw_transcripts.head())
    else:
        print("⚠️ No transcripts were collected. Check API availability and video accessibility.")

except Exception as e:
    logger.error(f"❌ Error during transcript collection: {e}")
    print(f"❌ Transcript collection failed: {e}")
    print("Check the logs for detailed error information.")
    raise

2025-07-24 07:07:29,549 - INFO - Resuming from checkpoint: ../data/tmp/transcription_video_processing_checkpoint.joblib


🎬 Starting transcript collection...
This process may take some time depending on the number of videos.
Progress is automatically saved - you can safely interrupt and resume.


2025-07-24 07:07:30,191 - INFO - All videos have already been processed.
2025-07-24 07:07:30,192 - INFO - Processing complete. Total successes: 1011, Total failures: 193.



✅ Transcript collection completed!
- Raw transcript segments collected: 263,913
- Videos with transcripts: 1,011
- Average segments per video: 261.0

📋 Sample Transcript Data:
                                       text  start  duration     video_id
0          em Springfield tem um batedor de   0.00      3.48  --tK3SaYWr4
1  carteiras Ei cadê minha carteira é minha   1.28      3.72  --tK3SaYWr4
2     carteira subiu também para pegar quem   3.48      2.96  --tK3SaYWr4
3       está batendo carteiras os policiais   5.00      3.16  --tK3SaYWr4
4        precisam de alguém como isca Vamos   6.44      3.04  --tK3SaYWr4


## 4. Process and Aggregate Transcripts

Convert raw transcript segments into consolidated video transcriptions with metadata.
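
The aggregation itself happens inside `process_transcripts_to_final_format`, whose source is not shown here. A dependency-free sketch of the idea follows: group segments by video, order them by start time, join the text, and estimate duration as the end of the last segment. The duration rule and the function name `aggregate_segments` are assumptions, and the real utility may compute these differently.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_segments(segments: List[Dict]) -> List[Dict]:
    """Collapse per-segment rows (text, start, duration, video_id) into one row per video."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["video_id"]].append(seg)

    rows = []
    for video_id, segs in grouped.items():
        ordered = sorted(segs, key=lambda s: s["start"])
        rows.append(
            {
                "video_id": video_id,
                # Join segment texts in playback order with single spaces.
                "transcription": " ".join(s["text"].strip() for s in ordered),
                # Assumed definition: the latest segment end time, in seconds.
                "duration": max(s["start"] + s["duration"] for s in ordered),
            }
        )
    return rows
```

With pandas the same result could be expressed as a `groupby("video_id")` followed by an aggregation, then merged with `df_videos` to attach `video_title`.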

In [6]:
# Process raw transcripts into final format
if not df_raw_transcripts.empty:
    print("🔄 Processing raw transcripts into final format...")

    try:
        # Use the utility function to process transcripts
        df_final_transcripts = process_transcripts_to_final_format(df_raw_transcripts=df_raw_transcripts, df_video_info=df_videos)

        print(f"✅ Transcript processing completed!")
        print(f"- Final video transcripts: {len(df_final_transcripts):,}")
        print(f"- Success rate: {len(df_final_transcripts) / len(df_videos) * 100:.1f}%")

        # Display processing statistics
        if not df_final_transcripts.empty:
            avg_duration = df_final_transcripts["duration"].mean()
            median_duration = df_final_transcripts["duration"].median()
            avg_transcript_length = df_final_transcripts["transcription"].str.len().mean()

            print(f"\n📊 Transcript Statistics:")
            print(f"- Average video duration: {avg_duration:.1f} seconds ({avg_duration / 60:.1f} minutes)")
            print(f"- Median video duration: {median_duration:.1f} seconds ({median_duration / 60:.1f} minutes)")
            print(f"- Average transcript length: {avg_transcript_length:.0f} characters")

            # Show sample processed data
            print(f"\n📋 Sample Processed Transcripts:")
            sample_df = df_final_transcripts.sample(min(3, len(df_final_transcripts)))
            for idx, row in sample_df.iterrows():
                print(f"\n🎬 Video: {row['video_title'][:60]}{'...' if len(row['video_title']) > 60 else ''}")
                print(f"   Duration: {row['duration']:.0f}s")
                print(f"   Transcript: {row['transcription'][:100]}{'...' if len(row['transcription']) > 100 else ''}")

    except Exception as e:
        logger.error(f"❌ Error processing transcripts: {e}")
        print(f"❌ Transcript processing failed: {e}")
        raise

else:
    print("⚠️ No raw transcripts available for processing.")
    df_final_transcripts = pd.DataFrame(columns=["video_id", "transcription", "duration", "video_title"])

🔄 Processing raw transcripts into final format...
✅ Transcript processing completed!
- Final video transcripts: 1,011
- Success rate: 84.0%

📊 Transcript Statistics:
- Average video duration: 1367.4 seconds (22.8 minutes)
- Median video duration: 129.8 seconds (2.2 minutes)
- Average transcript length: 9290 characters

📋 Sample Processed Transcripts:

🎬 Video: O CAOS DO XIXI QUE ME TROUXE A RESPOSTA / MEU GATO OBESO   #...
   Duration: 1025s
   Transcript: começando mais um vlog para vocês e esse gatinho aí que vocês estão vendo é o querido é um gato que ...

🎬 Video: AULÃO #001 - VOCÊ FOI PROGRAMADO PARA SER OBESO
   Duration: 6307s
   Transcript: G1 Oi tudo bem Boa noite E aí G1 Oi eurice boa noite é hoje só temos um teste né um período de teste...

🎬 Video: Ex-obesa vira modelo #motivação #disciplina #inspiração
   Duration: 92s
   Transcript: [Música] Mas essa é uma daquelas baita histórias de superação o namorado que tanto ajudou Flávia mor...


## 5. Data Quality Validation

In [7]:
# Perform data quality validation
def validate_transcript_data(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate the quality of transcript data.

    Args:
        df: DataFrame with transcript data

    Returns:
        Dictionary with validation results
    """
    validation_results = {}

    if df.empty:
        validation_results["status"] = "empty"
        validation_results["message"] = "No transcript data to validate"
        return validation_results

    # Check for required columns
    required_columns = ["video_id", "transcription", "duration", "video_title"]
    missing_columns = [col for col in required_columns if col not in df.columns]

    if missing_columns:
        validation_results["status"] = "error"
        validation_results["message"] = f"Missing required columns: {missing_columns}"
        return validation_results

    # Data quality checks
    validation_results["total_transcripts"] = len(df)
    validation_results["empty_transcripts"] = df["transcription"].isna().sum()
    validation_results["empty_titles"] = df["video_title"].isna().sum()
    validation_results["zero_duration"] = (df["duration"] <= 0).sum()
    validation_results["duplicate_videos"] = df["video_id"].duplicated().sum()

    # Calculate quality metrics
    valid_transcripts = len(df) - validation_results["empty_transcripts"]
    validation_results["quality_score"] = valid_transcripts / len(df) if len(df) > 0 else 0

    # Determine overall status
    if validation_results["quality_score"] >= 0.8:
        validation_results["status"] = "good"
    elif validation_results["quality_score"] >= 0.5:
        validation_results["status"] = "acceptable"
    else:
        validation_results["status"] = "poor"

    return validation_results


# Validate the transcript data
print("🔍 Validating transcript data quality...")
validation = validate_transcript_data(df_final_transcripts)

print(f"\n📋 Data Quality Report:")
print(f"- Status: {validation.get('status', 'unknown').upper()}")

if "total_transcripts" in validation:
    print(f"- Total transcripts: {validation['total_transcripts']:,}")
    print(f"- Empty transcripts: {validation['empty_transcripts']:,}")
    print(f"- Empty titles: {validation['empty_titles']:,}")
    print(f"- Zero duration videos: {validation['zero_duration']:,}")
    print(f"- Duplicate videos: {validation['duplicate_videos']:,}")
    print(f"- Quality score: {validation['quality_score']:.2%}")

if validation["status"] == "good":
    print("✅ Data quality is good - ready for analysis")
elif validation["status"] == "acceptable":
    print("⚠️ Data quality is acceptable - some issues detected")
elif validation["status"] == "poor":
    print("❌ Data quality is poor - significant issues detected")
else:
    print(f"ℹ️ {validation.get('message', 'Validation completed')}")

🔍 Validating transcript data quality...

📋 Data Quality Report:
- Status: GOOD
- Total transcripts: 1,011
- Empty transcripts: 0
- Empty titles: 0
- Zero duration videos: 0
- Duplicate videos: 0
- Quality score: 100.00%
✅ Data quality is good - ready for analysis
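
The validation cell counts a transcript as empty only when the value is missing (`isna()`). A transcript that contains just whitespace, or just bracketed caption tags such as `[Música]`, would still pass. The sketch below shows one possible stricter check; the helper names are hypothetical, and treating tag-only transcripts as empty is an assumption about what the analysis needs.

```python
import math
import re
from typing import Any, Iterable

# Matches bracketed caption tags such as "[Música]".
_TAG = re.compile(r"\[[^\]]*\]")


def is_effectively_empty(text: Any) -> bool:
    """True for None, NaN, whitespace-only, or tag-only transcripts."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return True
    return _TAG.sub("", str(text)).strip() == ""


def count_effectively_empty(transcripts: Iterable[Any]) -> int:
    """Count transcripts that carry no spoken content."""
    return sum(is_effectively_empty(t) for t in transcripts)
```

On the notebook's data this could be applied as `df_final_transcripts["transcription"].map(is_effectively_empty).sum()`.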


## 6. Export Processed Transcripts

In [8]:
def export_transcript_data(df: pd.DataFrame, output_path: Path) -> bool:
    """
    Export transcript data with proper validation and backup.

    Args:
        df: DataFrame to export
        output_path: Path where to save the data

    Returns:
        True if export successful, False otherwise
    """
    try:
        if df.empty:
            logger.warning("No transcript data to export")
            return False

        # Ensure output directory exists
        output_path.parent.mkdir(parents=True, exist_ok=True)

        # Create backup if file already exists
        if output_path.exists():
            backup_path = output_path.with_suffix(f".backup_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.parquet")
            output_path.rename(backup_path)
            logger.info(f"Existing file backed up to: {backup_path}")

        # Export to parquet
        df.to_parquet(output_path, index=False)

        # Verify export
        file_size = output_path.stat().st_size / (1024 * 1024)  # MB
        logger.info(f"✅ Export successful: {output_path}")
        logger.info(f"📊 File size: {file_size:.2f} MB")

        return True

    except Exception as e:
        logger.error(f"❌ Export failed: {e}")
        return False


# Export the processed transcript data
if not df_final_transcripts.empty:
    print(f"💾 Exporting transcript data to: {TranscriptionConfig.OUTPUT_FILE}")

    export_success = export_transcript_data(df_final_transcripts, TranscriptionConfig.OUTPUT_FILE)

    if export_success:
        print("✅ Transcript data exported successfully!")
        print(f"📁 Output file: {TranscriptionConfig.OUTPUT_FILE}")
        print(f"📊 Records exported: {len(df_final_transcripts):,}")

        # Display final summary
        print(f"\n🏁 Final Summary:")
        print(f"- Videos processed: {len(df_videos):,}")
        print(f"- Transcripts collected: {len(df_final_transcripts):,}")
        print(f"- Success rate: {len(df_final_transcripts) / len(df_videos) * 100:.1f}%")
        print(f"- Output file: {TranscriptionConfig.OUTPUT_FILE.name}")
    else:
        print("❌ Export failed - check logs for details")
else:
    print("⚠️ No transcript data to export")
    print("This could be due to:")
    print("- API rate limiting")
    print("- Videos without available transcripts")
    print("- Network connectivity issues")
    print("- Requests blocked by YouTube (for example, cloud-provider IP ranges)")

2025-07-24 07:07:38,902 - INFO - Existing file backed up to: ../data/intermediate/20250417_youtube_transcriptions_no_labels.backup_20250724_070738.parquet
2025-07-24 07:07:39,007 - INFO - ✅ Export successful: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
2025-07-24 07:07:39,008 - INFO - 📊 File size: 4.97 MB


💾 Exporting transcript data to: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
✅ Transcript data exported successfully!
📁 Output file: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet
📊 Records exported: 1,011

🏁 Final Summary:
- Videos processed: 1,204
- Transcripts collected: 1,011
- Success rate: 84.0%
- Output file: 20250417_youtube_transcriptions_no_labels.parquet
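
The export cell renames the existing file to a backup before writing the new one, so a failed write leaves no file at the output path (the previous version survives only under its backup name). An alternative ordering, sketched below, writes to a temporary file first and swaps it in only after the write succeeds. The function `export_with_backup` and its `write_fn` parameter are hypothetical; the backup naming follows the pattern visible in the export log.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


def export_with_backup(output_path: Path, write_fn: Callable[[Path], None]) -> Optional[Path]:
    """Write via write_fn to a temp file, then back up any existing output and swap."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    write_fn(tmp_path)  # if this raises, the existing output is left untouched

    backup_path = None
    if output_path.exists():
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # e.g. data.parquet -> data.backup_20250724_070738.parquet
        backup_path = output_path.with_suffix(f".backup_{stamp}{output_path.suffix}")
        output_path.rename(backup_path)

    tmp_path.replace(output_path)
    return backup_path
```

In this notebook the call might look like `export_with_backup(TranscriptionConfig.OUTPUT_FILE, lambda p: df_final_transcripts.to_parquet(p, index=False))`.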


## 7. Cleanup and Next Steps

In [9]:
# Optional: Clean up checkpoint files after successful completion
def cleanup_checkpoint_files():
    """Clean up temporary checkpoint files."""
    try:
        if TranscriptionConfig.CHECKPOINT_FILE.exists():
            # Archive instead of delete for debugging purposes
            archive_name = f"checkpoint_completed_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.joblib"
            archive_path = TranscriptionConfig.CHECKPOINT_FILE.parent / archive_name
            TranscriptionConfig.CHECKPOINT_FILE.rename(archive_path)
            print(f"✅ Checkpoint file archived: {archive_name}")
        else:
            print("ℹ️ No checkpoint file to clean up")
    except Exception as e:
        logger.warning(f"⚠️ Could not clean up checkpoint file: {e}")


# Clean up if export was successful
if not df_final_transcripts.empty and TranscriptionConfig.OUTPUT_FILE.exists():
    print("🧹 Cleaning up temporary files...")
    cleanup_checkpoint_files()

print("\n🎯 Next Steps:")
print("1. ✅ Video transcriptions collected and processed")
print("2. 📊 Ready for content analysis and topic modeling")
print("3. 🔍 Can proceed with zero-shot classification")
print("4. 📈 Combine with comment data for comprehensive analysis")

if not df_final_transcripts.empty:
    print(f"\n📁 Output Files Created:")
    print(f"- Main output: {TranscriptionConfig.OUTPUT_FILE}")
    if TranscriptionConfig.CHECKPOINT_FILE.exists():
        print(f"- Checkpoint: {TranscriptionConfig.CHECKPOINT_FILE}")

print("\n💡 Research Applications:")
print("- Content analysis of weight stigma themes")
print("- Comparison between video content and user comments")
print("- Language pattern analysis in Portuguese content")
print("- Sentiment analysis across video discourse")

print("\n🔧 Pipeline Complete! ✨")

🧹 Cleaning up temporary files...
✅ Checkpoint file archived: checkpoint_completed_20250724_070744.joblib

🎯 Next Steps:
1. ✅ Video transcriptions collected and processed
2. 📊 Ready for content analysis and topic modeling
3. 🔍 Can proceed with zero-shot classification
4. 📈 Combine with comment data for comprehensive analysis

📁 Output Files Created:
- Main output: ../data/intermediate/20250417_youtube_transcriptions_no_labels.parquet

💡 Research Applications:
- Content analysis of weight stigma themes
- Comparison between video content and user comments
- Language pattern analysis in Portuguese content
- Sentiment analysis across video discourse

🔧 Pipeline Complete! ✨
