# Data Cleaning and Preprocessing for YouTube Weight Stigma Research

This notebook implements the data cleaning and preprocessing pipeline for YouTube comments collected in the weight stigma research study. The cleaning process ensures data quality and prepares the dataset for subsequent analysis.

## Cleaning Steps Overview

The data cleaning pipeline includes:

1. **Text Length Filtering**: Remove comments at or below the 10th percentile of length and at or above the 99.9th percentile
2. **Emoji-Only Comment Removal**: Filter out comments containing only emojis
3. **Data Structure Normalization**: Fix nested data structures
4. **Duplicate Detection**: Remove duplicate comments using content-based hashing
5. **Language Detection**: Detect each comment's language and keep only Portuguese comments
6. **Video Language Validation**: Ensure video titles are also in Portuguese
7. **Data Export**: Save cleaned dataset for further analysis

## Input Data

- **Source**: Raw YouTube comments from `01_get_data_api.ipynb`
- **Expected location**: `../data/raw/20250417_youtube_comments.parquet`
- **Format**: Parquet file with comment metadata

## Output Data

- **Destination**: `../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet`
- **Content**: Cleaned Portuguese comments ready for analysis

## 1. Import Libraries and Configuration

In [1]:
import pandas as pd
import numpy as np
import hashlib
import logging
from pathlib import Path
from typing import List, Dict, Optional, Any, Tuple
from tqdm.auto import tqdm
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


# Configuration class for data cleaning
class CleaningConfig:
    """Configuration for data cleaning pipeline."""

    # File paths
    DATA_DIR = Path("../data")
    RAW_DATA_DIR = DATA_DIR / "raw"
    INTERMEDIATE_DATA_DIR = DATA_DIR / "intermediate"

    # Expected input file (from data collection notebook)
    INPUT_FILE = RAW_DATA_DIR / "20250417_youtube_comments.parquet"

    # Output file
    OUTPUT_FILE = INTERMEDIATE_DATA_DIR / "20250417_youtube_comments_pt_cleaned1.parquet"

    # Cleaning parameters
    MIN_TEXT_PERCENTILE = 0.1  # Drop comments at or below the 10th length percentile
    MAX_TEXT_PERCENTILE = 0.999  # Drop comments at or above the 99.9th length percentile

    # Language detection
    TARGET_LANGUAGE = "pt"  # Portuguese
    BATCH_SIZE = 784  # For language detection processing

    # Create directories if they don't exist
    @classmethod
    def create_directories(cls):
        cls.INTERMEDIATE_DATA_DIR.mkdir(parents=True, exist_ok=True)


# Create directories
CleaningConfig.create_directories()

print("✅ Libraries loaded and configuration set")
print(f"📁 Input file: {CleaningConfig.INPUT_FILE}")
print(f"📁 Output file: {CleaningConfig.OUTPUT_FILE}")
print(f"🎯 Target language: {CleaningConfig.TARGET_LANGUAGE}")

# Verify input file exists
if CleaningConfig.INPUT_FILE.exists():
    print(f"✅ Input file found: {CleaningConfig.INPUT_FILE.name}")
    file_size = CleaningConfig.INPUT_FILE.stat().st_size / (1024 * 1024)  # MB
    print(f"📊 File size: {file_size:.1f} MB")
else:
    print(f"❌ Input file not found: {CleaningConfig.INPUT_FILE}")
    print("Please run the data collection notebook (01_get_data_api.ipynb) first")

✅ Libraries loaded and configuration set
📁 Input file: ../data/raw/20250417_youtube_comments.parquet
📁 Output file: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
🎯 Target language: pt
✅ Input file found: 20250417_youtube_comments.parquet
📊 File size: 127.5 MB


## 2. Load and Explore Raw Data

In [2]:
def load_and_explore_data(file_path: Path) -> pd.DataFrame:
    """
    Load raw data and provide initial exploration.

    Args:
        file_path: Path to the raw data file

    Returns:
        DataFrame with raw data
    """
    logger.info(f"Loading data from: {file_path}")

    try:
        df = pd.read_parquet(file_path)
        logger.info(f"✅ Successfully loaded {len(df):,} records")

        # Initial data exploration
        print("🔍 RAW DATA OVERVIEW")
        print("=" * 50)
        print(f"📊 Dataset shape: {df.shape}")
        print(f"📝 Total comments: {len(df):,}")
        print(f"🎥 Unique videos: {df['video_id'].nunique():,}")
        print(f"👥 Unique authors: {df['authorDisplayName'].nunique():,}")

        # Check data types and missing values
        print(f"\n📋 Column Information:")
        for col in df.columns:
            missing_count = df[col].isna().sum()
            missing_pct = (missing_count / len(df)) * 100
            print(f"   {col}: {df[col].dtype}, {missing_count:,} missing ({missing_pct:.4f})")

        # Memory usage
        memory_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
        print(f"\n💾 Memory usage: {memory_mb:.1f} MB")

        return df

    except Exception as e:
        logger.error(f"Error loading data: {str(e)}")
        raise


# Load the raw data
df = load_and_explore_data(CleaningConfig.INPUT_FILE)

# Display sample data
print(f"\n📄 Sample data (first 3 rows):")
display(df.head(3))

2025-07-23 18:02:57,750 - INFO - Loading data from: ../data/raw/20250417_youtube_comments.parquet
2025-07-23 18:03:00,651 - INFO - ✅ Successfully loaded 593,509 records


🔍 RAW DATA OVERVIEW
📊 Dataset shape: (593509, 19)
📝 Total comments: 593,509
🎥 Unique videos: 1,850
👥 Unique authors: 512,630

📋 Column Information:
   video_id: object, 0 missing (0.0000)
   channelId: object, 0 missing (0.0000)
   videoId: object, 0 missing (0.0000)
   textDisplay: object, 0 missing (0.0000)
   textOriginal: object, 0 missing (0.0000)
   authorDisplayName: object, 0 missing (0.0000)
   authorProfileImageUrl: object, 0 missing (0.0000)
   authorChannelUrl: object, 0 missing (0.0000)
   authorChannelId: object, 0 missing (0.0000)
   canRate: bool, 0 missing (0.0000)
   viewerRating: object, 0 missing (0.0000)
   likeCount: float64, 0 missing (0.0000)
   publishedAt: datetime64[ns, UTC], 0 missing (0.0000)
   updatedAt: datetime64[ns, UTC], 0 missing (0.0000)
   author: object, 593,509 missing (100.0000)
   comment: object, 593,509 missing (100.0000)
   date: object, 593,509 missing (100.0000)
   likes: object, 593,509 missing (100.0000)
   video_title: object, 0 missing (0.0000)

Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,author,comment,date,likes,video_title
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,{'value': 'UCNhXx9ev5RtEiyGsVjMuTOA'},True,none,0.0,2024-12-28 21:38:37+00:00,2024-12-28 21:38:37+00:00,,,,,Tony Gordo é Incriminado #simpsons
1,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢,😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢,@SophialouiseSouza,https://yt3.ggpht.com/rvxbmQyDslI2p4RzecqWzruS...,http://www.youtube.com/@SophialouiseSouza,{'value': 'UCdQkFElArWumsJTS7FxUWcQ'},True,none,2.0,2024-12-28 21:47:13+00:00,2024-12-28 21:47:13+00:00,,,,,Tony Gordo é Incriminado #simpsons
2,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Chupada tridimensional 😎,Chupada tridimensional 😎,@capivagiota,https://yt3.ggpht.com/Rk5mblie0y248pftSyVfoqWV...,http://www.youtube.com/@capivagiota,{'value': 'UCaA27fdWlD7VZn6keN5Gb2w'},True,none,123.0,2024-12-28 21:53:52+00:00,2024-12-28 21:53:52+00:00,,,,,Tony Gordo é Incriminado #simpsons


## 3. Text Length Analysis and Filtering

Drop comments whose length is at or below the 10th percentile or at or above the 99.9th percentile, so that the analysis focuses on comments with enough text to interpret.

In [3]:
def analyze_text_lengths(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze comment text lengths and provide detailed statistics.

    Args:
        df: DataFrame with textDisplay column

    Returns:
        DataFrame with text length statistics
    """
    logger.info("Analyzing comment text lengths")

    # Calculate text lengths
    df_analysis = df.copy()
    df_analysis["len_text"] = df_analysis["textDisplay"].str.len()

    print("📏 TEXT LENGTH ANALYSIS")
    print("=" * 50)

    # Comprehensive statistics
    length_stats = df_analysis["len_text"].describe(percentiles=[0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999])

    print("📊 Text Length Statistics:")
    for stat, value in length_stats.items():
        print(f"   {stat}: {value:.0f} characters")

    # Show examples of different lengths
    print(f"\n📝 Sample Comments by Length:")

    # Very short comments
    short_comments = df_analysis[df_analysis["len_text"] <= length_stats["10%"]]
    if len(short_comments) > 0:
        print(f"\n🔹 Very short (≤{length_stats['10%']:.0f} chars) - {len(short_comments):,} comments:")
        for i, comment in enumerate(short_comments["textDisplay"].head(3)):
            print(f"     {i + 1}. '{comment}' ({len(comment)} chars)")

    # Very long comments
    long_comments = df_analysis[df_analysis["len_text"] >= length_stats["99.9%"]]
    if len(long_comments) > 0:
        print(f"\n🔹 Very long (≥{length_stats['99.9%']:.0f} chars) - {len(long_comments):,} comments:")
        for i, comment in enumerate(long_comments["textDisplay"].head(2)):
            preview = comment[:100] + "..." if len(comment) > 100 else comment
            print(f"     {i + 1}. '{preview}' ({len(comment)} chars)")

    return df_analysis


# Analyze text lengths
df = analyze_text_lengths(df)

2025-07-23 18:03:04,601 - INFO - Analyzing comment text lengths


📏 TEXT LENGTH ANALYSIS
📊 Text Length Statistics:
   count: 593509 characters
   mean: 68 characters
   std: 174 characters
   min: 0 characters
   1%: 2 characters
   2%: 3 characters
   3%: 3 characters
   5%: 5 characters
   10%: 9 characters
   20%: 16 characters
   30%: 22 characters
   40%: 30 characters
   50%: 38 characters
   60%: 49 characters
   70%: 64 characters
   80%: 88 characters
   90%: 141 characters
   95%: 213 characters
   99%: 500 characters
   99.9%: 1488 characters
   max: 62137 characters

📝 Sample Comments by Length:

🔹 Very short (≤9 chars) - 66,181 comments:
     1. 'O jogo' (6 chars)
     2. '🫵🤨' (2 chars)
     3. '💀' (1 chars)

🔹 Very long (≥1488 chars) - 594 comments:
     1. 'Ele tá certo! Mulher diz : 
não quero cara desempregado, não gosto de homem sem barba, não gosto de ...' (2617 chars)
     2. 'Olha, SE a Maya aprendeu isso, ela propositadamente esqueceu todo o resto da aula, onde qualquer fac...' (1742 chars)


In [4]:
def filter_by_text_length(df: pd.DataFrame, min_percentile: float = 0.1, max_percentile: float = 0.999) -> pd.DataFrame:
    """
    Filter comments by text length using percentile thresholds.

    Args:
        df: DataFrame with len_text column
        min_percentile: Lower quantile; comments with length <= this quantile are removed
        max_percentile: Upper quantile; comments with length >= this quantile are removed

    Returns:
        Filtered DataFrame
    """
    logger.info(f"Filtering comments by text length (keeping {min_percentile:.1%} to {max_percentile:.1%})")

    # Calculate thresholds
    min_length = df["len_text"].quantile(min_percentile)
    max_length = df["len_text"].quantile(max_percentile)

    print(f"📏 TEXT LENGTH FILTERING")
    print("=" * 50)
    print(f"📊 Original dataset: {len(df):,} comments")
    print(f"📐 Length thresholds:")
    print(f"   Minimum ({min_percentile:.1%} percentile): {min_length:.0f} characters")
    print(f"   Maximum ({max_percentile:.1%} percentile): {max_length:.0f} characters")

    # Apply filters
    original_count = len(df)

    # Remove too short
    too_short = len(df[df["len_text"] <= min_length])
    df_filtered = df[df["len_text"] > min_length].copy()

    # Remove too long
    too_long = len(df_filtered[df_filtered["len_text"] >= max_length])
    df_filtered = df_filtered[df_filtered["len_text"] < max_length].copy()

    # Clean up and reset index
    df_filtered = df_filtered.drop(columns=["len_text"])
    df_filtered.reset_index(drop=True, inplace=True)

    # Report results
    final_count = len(df_filtered)
    removed_count = original_count - final_count

    print(f"\n📊 Filtering Results:")
    print(f"   🗑️ Removed {too_short:,} too short comments ({too_short / original_count:.1%})")
    print(f"   🗑️ Removed {too_long:,} too long comments ({too_long / original_count:.1%})")
    print(f"   ✅ Kept {final_count:,} comments ({final_count / original_count:.1%})")
    print(f"   📉 Total removed: {removed_count:,} comments ({removed_count / original_count:.1%})")

    return df_filtered


# Apply text length filtering
df = filter_by_text_length(df, CleaningConfig.MIN_TEXT_PERCENTILE, CleaningConfig.MAX_TEXT_PERCENTILE)

2025-07-23 18:03:05,003 - INFO - Filtering comments by text length (keeping 10.0% to 99.9%)


📏 TEXT LENGTH FILTERING
📊 Original dataset: 593,509 comments
📐 Length thresholds:
   Minimum (10.0% percentile): 9 characters
   Maximum (99.9% percentile): 1488 characters

📊 Filtering Results:
   🗑️ Removed 66,181 too short comments (11.2%)
   🗑️ Removed 594 too long comments (0.1%)
   ✅ Kept 526,734 comments (88.7%)
   📉 Total removed: 66,775 comments (11.3%)
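
The short-comment filter removes 11.2% of the data rather than 10% because the comparison is `<=`: every comment whose length ties with the 10th-percentile value (9 characters here) is dropped as well. A minimal sketch on made-up lengths, not the study data, shows the effect:

```python
import pandas as pd

# Toy comment lengths (illustrative only); four comments tie at 3 characters
lengths = pd.Series([3, 3, 3, 3, 5, 8, 13, 21, 34, 55])

# Same rule as filter_by_text_length: threshold at the 10th percentile,
# then drop everything at or below it
threshold = lengths.quantile(0.1)
removed_share = (lengths <= threshold).mean()

print(f"threshold={threshold:.0f}, removed={removed_share:.0%}")
```

In this toy series the 10th percentile is 3 characters, yet 40% of the comments are removed, because all four ties fall on the threshold.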


## 4. Emoji-Only Comment Removal

Filter out comments that contain only emojis or emoji-like characters to focus on textual content.

In [5]:
def remove_emoji_only_comments(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Remove comments that consist only of emojis or emoji-like characters.

    Args:
        df: DataFrame with textDisplay column

    Returns:
        Tuple of (filtered_dataframe, emoji_only_dataframe)
    """
    logger.info("Removing emoji-only comments")

    def is_emoji_only(text: str) -> bool:
        """
        Check if text contains only emojis and whitespace.

        Covers multiple Unicode ranges for emojis and symbols.
        """
        if pd.isna(text):
            return False

        # Strip surrounding whitespace; an empty string is not emoji-only
        text_clean = text.strip()
        if not text_clean:
            return False

        # Check if all characters are emojis/symbols
        emoji_ranges = [
            (0x1F600, 0x1F64F),  # Emoticons
            (0x1F300, 0x1F5FF),  # Misc Symbols and Pictographs
            (0x1F680, 0x1F6FF),  # Transport and Map Symbols
            (0x1F1E0, 0x1F1FF),  # Regional Indicator Symbols
            (0x2600, 0x26FF),  # Misc Symbols
            (0x2700, 0x27BF),  # Dingbats
            (0xFE00, 0xFE0F),  # Variation Selectors
            (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
        ]

        for char in text_clean:
            char_code = ord(char)
            is_emoji = any(start <= char_code <= end for start, end in emoji_ranges)
            is_whitespace = char.isspace()

            if not (is_emoji or is_whitespace):
                return False

        return True

    print("😀 EMOJI-ONLY COMMENT REMOVAL")
    print("=" * 50)
    print(f"📊 Dataset before filtering: {len(df):,} comments")

    # Apply emoji detection
    emoji_mask = df["textDisplay"].apply(is_emoji_only)

    # Separate emoji-only and text comments
    df_emoji_only = df[emoji_mask].copy()
    df_filtered = df[~emoji_mask].copy()

    print(f"😀 Emoji-only comments found: {len(df_emoji_only):,} ({len(df_emoji_only) / len(df) * 100:.4f})")
    print(f"📝 Text comments kept: {len(df_filtered):,} ({len(df_filtered) / len(df) * 100:.4f})")

    # Show examples of emoji-only comments
    if len(df_emoji_only) > 0:
        print(f"\n📋 Sample emoji-only comments:")
        samples = df_emoji_only["textDisplay"].head(5)
        for i, comment in enumerate(samples, 1):
            print(f"   {i}. '{comment}'")

    return df_filtered.reset_index(drop=True), df_emoji_only.reset_index(drop=True)


# Remove emoji-only comments
df, df_emoji = remove_emoji_only_comments(df)

2025-07-23 18:03:05,740 - INFO - Removing emoji-only comments


😀 EMOJI-ONLY COMMENT REMOVAL
📊 Dataset before filtering: 526,734 comments
😀 Emoji-only comments found: 6,076 (1.1535)
📝 Text comments kept: 520,658 (98.8465)

📋 Sample emoji-only comments:
   1. '😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢😢'
   2. '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩'
   3. '😂😂😂😂😂😂😂😂😂😂'
   4. '🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣'
   5. '😢😢😢😢😢😢😢😢😢😢'
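
The range list in `is_emoji_only` does not include the Symbols and Pictographs Extended-A block (U+1FA70 to U+1FAFF) or the zero-width joiner (U+200D), so comments made only of newer emojis or joined emoji sequences would be kept as text. The sketch below reuses the same ranges to list the characters of a string that fall outside them; the helper name `uncovered_chars` is hypothetical and not part of the notebook.

```python
# Same code-point ranges as in is_emoji_only above
EMOJI_RANGES = [
    (0x1F600, 0x1F64F), (0x1F300, 0x1F5FF), (0x1F680, 0x1F6FF),
    (0x1F1E0, 0x1F1FF), (0x2600, 0x26FF), (0x2700, 0x27BF),
    (0xFE00, 0xFE0F), (0x1F900, 0x1F9FF),
]


def uncovered_chars(text: str) -> list:
    """Return code points (as 'U+XXXX') that are neither whitespace nor in EMOJI_RANGES."""
    missing = []
    for char in text:
        code = ord(char)
        in_range = any(start <= code <= end for start, end in EMOJI_RANGES)
        if not (in_range or char.isspace()):
            missing.append(f"U+{code:04X}")
    return missing


# U+1F622 (crying face) is covered; U+1FAF5 and U+200D are not
print(uncovered_chars("\U0001F622 \U0001FAF5\u200d"))
```

Running such a check over a sample of the kept comments is one way to decide whether the range list needs extending.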


## 5. Data Structure Normalization

Flatten the nested `authorChannelId` field (a dict of the form `{'value': ...}`) to a plain string and report any columns that still contain dicts or lists.

In [6]:
def normalize_data_structures(df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalize nested data structures in the dataset.

    Args:
        df: DataFrame with nested structures

    Returns:
        DataFrame with normalized structures
    """
    logger.info("Normalizing data structures")

    print("🔧 DATA STRUCTURE NORMALIZATION")
    print("=" * 50)

    df_normalized = df.copy()

    # Fix authorChannelId structure (if it's nested)
    if "authorChannelId" in df_normalized.columns:
        print(f"📋 Normalizing authorChannelId structure...")

        # Check if it's a nested structure
        sample_value = df_normalized["authorChannelId"].dropna().iloc[0] if len(df_normalized["authorChannelId"].dropna()) > 0 else None

        if isinstance(sample_value, dict):
            print(f"   Found nested structure, extracting 'value' field")
            df_normalized["authorChannelId"] = df_normalized["authorChannelId"].apply(lambda x: x.get("value") if isinstance(x, dict) and "value" in x else x)
        else:
            print(f"   Structure already normalized")

    # Check for other nested structures
    print(f"\n📊 Data types after normalization:")
    for col in df_normalized.columns:
        dtype = df_normalized[col].dtype
        print(f"   {col}: {dtype}")

        # Check for remaining nested structures
        if dtype == "object":
            sample_vals = df_normalized[col].dropna().head(3)
            has_dict = any(isinstance(val, dict) for val in sample_vals)
            has_list = any(isinstance(val, list) for val in sample_vals)

            if has_dict or has_list:
                print(f"      ⚠️ Contains nested structures (dict: {has_dict}, list: {has_list})")

    # Display sample of normalized data
    print(f"\n📄 Sample normalized data:")
    if len(df_normalized) > 0:
        display(df_normalized.head(2))

    return df_normalized


# Normalize data structures
df = normalize_data_structures(df)

2025-07-23 18:03:07,580 - INFO - Normalizing data structures


🔧 DATA STRUCTURE NORMALIZATION
📋 Normalizing authorChannelId structure...
   Found nested structure, extracting 'value' field

📊 Data types after normalization:
   video_id: object
   channelId: object
   videoId: object
   textDisplay: object
   textOriginal: object
   authorDisplayName: object
   authorProfileImageUrl: object
   authorChannelUrl: object
   authorChannelId: object
   canRate: bool
   viewerRating: object
   likeCount: float64
   publishedAt: datetime64[ns, UTC]
   updatedAt: datetime64[ns, UTC]
   author: object
   comment: object
   date: object
   likes: object
   video_title: object

📄 Sample normalized data:


Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,author,comment,date,likes,video_title
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,UCNhXx9ev5RtEiyGsVjMuTOA,True,none,0.0,2024-12-28 21:38:37+00:00,2024-12-28 21:38:37+00:00,,,,,Tony Gordo é Incriminado #simpsons
1,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Chupada tridimensional 😎,Chupada tridimensional 😎,@capivagiota,https://yt3.ggpht.com/Rk5mblie0y248pftSyVfoqWV...,http://www.youtube.com/@capivagiota,UCaA27fdWlD7VZn6keN5Gb2w,True,none,123.0,2024-12-28 21:53:52+00:00,2024-12-28 21:53:52+00:00,,,,,Tony Gordo é Incriminado #simpsons
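
The flattening step can be tried in isolation. A minimal sketch on made-up channel IDs, assuming the nested values have the `{'value': ...}` shape seen in the raw sample:

```python
import pandas as pd

# Toy column mixing a nested dict, a missing value, and an already-flat string
toy = pd.DataFrame(
    {"authorChannelId": [{"value": "UC_example_1"}, None, "UC_example_2"]}
)

# Same rule as normalize_data_structures: unwrap dicts, leave everything else untouched
toy["authorChannelId"] = toy["authorChannelId"].apply(
    lambda x: x.get("value") if isinstance(x, dict) and "value" in x else x
)

print(toy["authorChannelId"].tolist())
```

Missing values and strings pass through unchanged, so the step is safe to re-run on data that has already been normalized.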


## 6. Duplicate Detection and Removal

Identify and remove duplicate comments by hashing the video ID, comment text, and author channel URL with MD5 and keeping the first occurrence of each hash.

In [7]:
def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove duplicate comments using content-based hashing.

    Creates a unique identifier based on video_id, comment text, and author URL
    to identify and remove duplicate comments.

    Args:
        df: DataFrame with comment data

    Returns:
        DataFrame with duplicates removed
    """
    logger.info("Detecting and removing duplicate comments")

    print("🔍 DUPLICATE DETECTION AND REMOVAL")
    print("=" * 50)

    original_count = len(df)
    print(f"📊 Dataset before deduplication: {original_count:,} comments")

    df_dedup = df.copy()

    # Create unique identifier for each comment
    # Combine video_id + text + author to detect duplicates
    def create_comment_hash(row):
        """Create a unique hash for a comment."""
        try:
            # Handle potential missing values
            video_id = str(row.get("video_id", ""))
            text = str(row.get("textDisplay", ""))
            author_url = str(row.get("authorChannelUrl", ""))

            # Create composite string
            composite = video_id + text + author_url

            # Generate MD5 hash
            return hashlib.md5(composite.encode("utf-8")).hexdigest()

        except Exception as e:
            logger.warning(f"Error creating hash: {str(e)}")
            return str(hash(str(row)))  # Fallback hash

    # Apply hashing
    print("🔒 Creating content-based hashes...")
    df_dedup["comment_uuid"] = df_dedup.apply(create_comment_hash, axis=1)

    # Check for duplicates
    duplicate_count = df_dedup["comment_uuid"].duplicated().sum()
    print(f"🔍 Found {duplicate_count:,} duplicate comments ({duplicate_count / original_count:.4f})")

    if duplicate_count > 0:
        # Show examples of duplicates
        print(f"\n📋 Sample duplicate comments:")
        duplicate_hashes = df_dedup[df_dedup["comment_uuid"].duplicated(keep=False)]["comment_uuid"].unique()[:3]

        for i, hash_val in enumerate(duplicate_hashes, 1):
            duplicates = df_dedup[df_dedup["comment_uuid"] == hash_val]
            print(f"\n   Duplicate group {i} ({len(duplicates)} instances):")
            print(f"      Text: '{duplicates.iloc[0]['textDisplay'][:100]}{'...' if len(duplicates.iloc[0]['textDisplay']) > 100 else ''}'")
            print(f"      Videos: {duplicates['video_id'].tolist()}")

    # Remove duplicates
    df_dedup = df_dedup.drop_duplicates(subset=["comment_uuid"])
    df_dedup = df_dedup.drop(columns=["comment_uuid"])  # Remove helper column
    df_dedup.reset_index(drop=True, inplace=True)

    final_count = len(df_dedup)
    removed_count = original_count - final_count

    print(f"\n📊 Deduplication Results:")
    print(f"   🗑️ Removed: {removed_count:,} duplicate comments ({removed_count / original_count:.4f})")
    print(f"   ✅ Kept: {final_count:,} unique comments ({final_count / original_count:.4f})")

    return df_dedup


# Remove duplicates
df = remove_duplicates(df)

2025-07-23 18:03:12,544 - INFO - Detecting and removing duplicate comments


🔍 DUPLICATE DETECTION AND REMOVAL
📊 Dataset before deduplication: 520,658 comments
🔒 Creating content-based hashes...
🔍 Found 5,413 duplicate comments (0.0104)

📋 Sample duplicate comments:

   Duplicate group 1 (2 instances):
      Text: 'Vídeo completo: https://youtu.be/hnetjD-gje4'
      Videos: ['-6Qxw7CpQvQ', '-6Qxw7CpQvQ']

   Duplicate group 2 (2 instances):
      Text: 'Eu tb n pegaria gorda Caraí, mas eu n ia ficar meia hora dando explicação sobre isso né bixo kkkkkkk...'
      Videos: ['-6Qxw7CpQvQ', '-6Qxw7CpQvQ']

   Duplicate group 3 (2 instances):
      Text: '"tenho 18 anos e me cuido" falou o cara que você olha de lado parece um quadrado do Minecraft, tem c...'
      Videos: ['-6Qxw7CpQvQ', '-6Qxw7CpQvQ']

📊 Deduplication Results:
   🗑️ Removed: 5,413 duplicate comments (0.0104)
   ✅ Kept: 515,245 unique comments (0.9896)
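
Because the hash is only used to compare three columns and is dropped afterwards, pandas can express the same deduplication without hashing. The sketch below, on made-up rows, is an alternative and not the notebook's method. Comparing columns directly also sidesteps the ambiguity of concatenating fields without a separator (for example, `'ab' + 'c'` and `'a' + 'bc'` produce the same composite string).

```python
import pandas as pd

# Toy rows (illustrative only): rows 0 and 1 are exact duplicates
toy = pd.DataFrame(
    {
        "video_id": ["vid1", "vid1", "vid1", "vid2"],
        "textDisplay": ["same text", "same text", "other text", "same text"],
        "authorChannelUrl": ["url_a", "url_a", "url_a", "url_a"],
    }
)

key_columns = ["video_id", "textDisplay", "authorChannelUrl"]

# duplicated() marks every repeat after the first occurrence of a key
n_duplicates = int(toy.duplicated(subset=key_columns).sum())

# drop_duplicates() keeps the first occurrence of each key by default
deduped = toy.drop_duplicates(subset=key_columns).reset_index(drop=True)

print(n_duplicates, len(deduped))
```

The same text posted under a different video (the last toy row) is kept, which matches the notebook's choice to include `video_id` in the key.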


## 7. Language Detection and Filtering

Identify Portuguese comments and filter out content in other languages.

In [8]:
def setup_language_detection():
    """
    Set up the language detection pipeline.

    Returns:
        Language detection pipeline
    """
    logger.info("Setting up language detection pipeline")

    try:
        from transformers import pipeline

        print("🌐 LANGUAGE DETECTION SETUP")
        print("=" * 50)
        print("📦 Loading language detection model...")

        # Use XLM-RoBERTa for multilingual language detection
        pipe = pipeline(
            "text-classification",
            model="papluca/xlm-roberta-base-language-detection",
            device=1,  # GPU index 1 (cuda:1); use 0 for the first GPU or -1 for CPU
        )

        print("✅ Language detection model loaded successfully")

        # Test the pipeline
        test_result = pipe("Olá, como você está?", top_k=1, truncation=True)
        detected_lang = test_result[0]["label"]
        confidence = test_result[0]["score"]

        print(f"🧪 Test detection:")
        print(f"   Text: 'Olá, como você está?'")
        print(f"   Detected: {detected_lang} (confidence: {confidence:.3f})")

        if detected_lang == "pt":
            print("✅ Portuguese detection working correctly")
        else:
            print("⚠️ Unexpected result - please verify model")

        return pipe

    except ImportError:
        print("❌ Error: transformers library not installed")
        print("Please install with: pip install transformers torch")
        raise
    except Exception as e:
        logger.error(f"Error setting up language detection: {str(e)}")
        raise


# Set up language detection
language_detector = setup_language_detection()

2025-07-23 18:03:22,369 - INFO - Setting up language detection pipeline


🌐 LANGUAGE DETECTION SETUP
📦 Loading language detection model...


Device set to use cuda:1


✅ Language detection model loaded successfully
🧪 Test detection:
   Text: 'Olá, como você está?'
   Detected: pt (confidence: 0.996)
✅ Portuguese detection working correctly


In [9]:
def detect_comment_languages(df: pd.DataFrame, pipe, batch_size: int = 784) -> pd.DataFrame:
    """
    Detect languages for all comments in the dataset.

    Args:
        df: DataFrame with textDisplay column
        pipe: Language detection pipeline
        batch_size: Number of texts to process in each batch

    Returns:
        DataFrame with language column added
    """
    logger.info(f"Detecting languages for {len(df):,} comments")

    print("🌐 COMMENT LANGUAGE DETECTION")
    print("=" * 50)
    print(f"📊 Processing {len(df):,} comments in batches of {batch_size}")

    df_lang = df.copy()
    language_results = []

    try:
        # Process in batches with progress bar
        for batch_start in tqdm(range(0, len(df_lang), batch_size), desc="Detecting languages"):
            batch_end = min(batch_start + batch_size, len(df_lang))
            batch_texts = df_lang.iloc[batch_start:batch_end]["textDisplay"].tolist()

            # Clean texts (handle potential None values)
            batch_texts = [str(text) if text is not None else "" for text in batch_texts]

            # Detect languages for batch
            batch_results = pipe(batch_texts, top_k=1, truncation=True, batch_size=min(batch_size, len(batch_texts)))

            # Extract language labels
            batch_languages = [result[0]["label"] for result in batch_results]
            language_results.extend(batch_languages)

        # Add language column
        df_lang["language"] = language_results

        # Language distribution
        lang_counts = df_lang["language"].value_counts()
        print(f"\n📊 Language Distribution:")
        for lang, count in lang_counts.head(10).items():
            percentage = (count / len(df_lang)) * 100
            print(f"   {lang}: {count:,} ({percentage:.4f})")

        # Show examples of non-Portuguese content
        non_pt = df_lang[df_lang["language"] != "pt"]
        if len(non_pt) > 0:
            print(f"\n🌍 Sample non-Portuguese comments:")
            sample_languages = non_pt["language"].value_counts().head(3)

            for lang in sample_languages.index:
                sample_comments = non_pt[non_pt["language"] == lang]["textDisplay"].head(2)
                print(f"\n   {lang.upper()} examples:")
                for i, comment in enumerate(sample_comments, 1):
                    preview = comment[:100] + "..." if len(comment) > 100 else comment
                    print(f"      {i}. '{preview}'")

        return df_lang

    except Exception as e:
        logger.error(f"Error in language detection: {str(e)}")
        raise


# Detect languages for all comments
df = detect_comment_languages(df, language_detector, CleaningConfig.BATCH_SIZE)

2025-07-23 18:03:37,941 - INFO - Detecting languages for 515,245 comments


🌐 COMMENT LANGUAGE DETECTION
📊 Processing 515,245 comments in batches of 784


Detecting languages:   0%|          | 0/658 [00:00<?, ?it/s]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



📊 Language Distribution:
   pt: 240,028 (46.5852)
   es: 154,493 (29.9844)
   en: 47,658 (9.2496)
   it: 22,683 (4.4024)
   sw: 18,754 (3.6398)
   hi: 11,844 (2.2987)
   ur: 7,902 (1.5336)
   tr: 3,022 (0.5865)
   bg: 2,600 (0.5046)
   ru: 1,428 (0.2771)

🌍 Sample non-Portuguese comments:

   ES examples:
      1. 'Ele esta correto'
      2. 'Cara escroto sem comentarios'

   EN examples:
      1. 'Famoso sugar baby'
      2. 'New money is so classless.'

   IT examples:
      1. 'Como se ele fosse magro KKKKKKKKKKKKKKKK'
      2. 'Esse cara n tem 18'
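
Several of the sampled "non-Portuguese" comments above, such as 'Ele esta correto' (labelled `es`) and 'Esse cara n tem 18' (labelled `it`), read as informal Portuguese, which suggests the top-1 label alone can misfire on short, unaccented text. One hedged option is to keep the score returned with each label so that borderline predictions can be inspected later. The sketch below works on hand-written results in the same `[[{'label': ..., 'score': ...}], ...]` shape the pipeline returns with `top_k=1`; the scores, the `lang_score` column, and the 0.8 cutoff are illustrative assumptions, not values from the study.

```python
import pandas as pd

# Hand-written stand-in for pipe(texts, top_k=1) output; scores are made up
mock_results = [
    [{"label": "pt", "score": 0.99}],
    [{"label": "es", "score": 0.55}],
    [{"label": "en", "score": 0.97}],
]
texts = ["Olá, como você está?", "Ele esta correto", "New money is so classless."]

df_scores = pd.DataFrame(
    {
        "textDisplay": texts,
        "language": [r[0]["label"] for r in mock_results],
        "lang_score": [r[0]["score"] for r in mock_results],
    }
)

# Hypothetical cutoff: flag low-confidence predictions for manual review
REVIEW_THRESHOLD = 0.8
df_scores["needs_review"] = df_scores["lang_score"] < REVIEW_THRESHOLD

print(df_scores[["language", "lang_score", "needs_review"]])
```

Whether a cutoff helps, and where to set it, would need to be checked against a manually labelled sample of the comments.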


In [10]:
def filter_portuguese_comments(df: pd.DataFrame, target_language: str = "pt") -> pd.DataFrame:
    """
    Filter comments to keep only the target language.

    Args:
        df: DataFrame with language column
        target_language: Language code to keep (default: "pt" for Portuguese)

    Returns:
        DataFrame with only target language comments
    """
    logger.info(f"Filtering comments for language: {target_language}")

    print(f"🇧🇷 PORTUGUESE COMMENT FILTERING")
    print("=" * 50)

    original_count = len(df)
    print(f"📊 Dataset before language filtering: {original_count:,} comments")

    # Show distribution before filtering
    lang_dist = df["language"].value_counts()
    pt_count = lang_dist.get(target_language, 0)
    print(f"📈 Portuguese comments: {pt_count:,} ({pt_count / original_count:.4f})")
    print(f"📈 Other languages: {original_count - pt_count:,} ({(original_count - pt_count) / original_count:.4f})")

    # Filter for Portuguese comments
    df_pt = df[df["language"] == target_language].copy()
    df_pt.reset_index(drop=True, inplace=True)

    final_count = len(df_pt)
    removed_count = original_count - final_count

    print(f"\n📊 Filtering Results:")
    print(f"   ✅ Kept: {final_count:,} Portuguese comments ({final_count / original_count:.4f})")
    print(f"   🗑️ Removed: {removed_count:,} non-Portuguese comments ({removed_count / original_count:.4f})")

    return df_pt


# Filter for Portuguese comments only
df = filter_portuguese_comments(df, CleaningConfig.TARGET_LANGUAGE)

2025-07-23 18:37:38,025 - INFO - Filtering comments for language: pt


🇧🇷 PORTUGUESE COMMENT FILTERING
📊 Dataset before language filtering: 515,245 comments
📈 Portuguese comments: 240,028 (0.4659)
📈 Other languages: 275,217 (0.5341)

📊 Filtering Results:
   ✅ Kept: 240,028 Portuguese comments (0.4659)
   🗑️ Removed: 275,217 non-Portuguese comments (0.5341)
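
The row counts reported by the cells above can be combined into a retention funnel, which shows that language filtering accounts for most of the reduction. A short sketch using only the counts printed in this notebook:

```python
# Row counts copied from the cell outputs above
stages = [
    ("raw", 593_509),
    ("after length filter", 526_734),
    ("after emoji-only removal", 520_658),
    ("after deduplication", 515_245),
    ("after Portuguese filter", 240_028),
]

raw_count = stages[0][1]
previous = raw_count
funnel = []
for name, count in stages:
    funnel.append(
        {
            "stage": name,
            "rows": count,
            "kept_vs_previous": count / previous,  # share surviving this step
            "kept_vs_raw": count / raw_count,  # share of the raw dataset remaining
        }
    )
    previous = count

for row in funnel:
    print(
        f"{row['stage']:<26} {row['rows']:>8,}  "
        f"{row['kept_vs_previous']:.1%} of previous, {row['kept_vs_raw']:.1%} of raw"
    )
```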


## 8. Video Language Validation

Verify that the video titles are also in Portuguese to ensure consistency.
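
Judging from its signature, `validate_video_languages` labels each unique title once and returns both the filtered comments and the list of target-language video IDs. A minimal sketch of that mapping on made-up data follows; the language labels are hand-assigned rather than model output, and the final filtering step is an assumption about how the IDs are applied.

```python
import pandas as pd

# Toy comments (illustrative only)
comments = pd.DataFrame(
    {
        "video_id": ["v1", "v1", "v2", "v3"],
        "video_title": [
            "Título em português",
            "Título em português",
            "An English title",
            "Outro título",
        ],
        "textDisplay": ["c1", "c2", "c3", "c4"],
    }
)

# One row per video, as in validate_video_languages
videos = comments[["video_id", "video_title"]].drop_duplicates().reset_index(drop=True)

# Hand-assigned labels standing in for the language detection pipeline
videos["language"] = ["pt", "en", "pt"]

# Keep only comments whose video title was labelled with the target language
target_ids = videos.loc[videos["language"] == "pt", "video_id"].tolist()
filtered = comments[comments["video_id"].isin(target_ids)].reset_index(drop=True)

print(target_ids, len(filtered))
```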

In [11]:
def validate_video_languages(df: pd.DataFrame, pipe, target_language: str = "pt") -> Tuple[pd.DataFrame, List[str]]:
    """
    Validate that video titles are in the target language.

    Args:
        df: DataFrame with video_id and video_title columns
        pipe: Language detection pipeline
        target_language: Expected language for videos

    Returns:
        Tuple of (filtered_dataframe, list_of_target_language_video_ids)
    """
    logger.info("Validating video title languages")

    print("🎥 VIDEO LANGUAGE VALIDATION")
    print("=" * 50)

    # Get unique videos
    videos = df[["video_id", "video_title"]].drop_duplicates()
    videos.reset_index(drop=True, inplace=True)

    print(f"🎬 Unique videos to check: {len(videos):,}")

    # Detect languages for video titles
    print(f"🌐 Detecting languages for video titles...")

    video_titles = videos["video_title"].tolist()
    # Clean titles (handle potential None values)
    video_titles = [str(title) if title is not None else "" for title in video_titles]

    video_language_results = pipe(video_titles, top_k=1, truncation=True, batch_size=512)

    videos["language"] = [result[0]["label"] for result in video_language_results]

    # Language distribution for videos
    video_lang_dist = videos["language"].value_counts()
    print(f"\n📊 Video Title Language Distribution:")
    for lang, count in video_lang_dist.items():
        percentage = (count / len(videos)) * 100
        print(f"   {lang}: {count:,} videos ({percentage:.4f})")

    # Show examples of non-target language videos
    non_target_videos = videos[videos["language"] != target_language]
    if len(non_target_videos) > 0:
        print(f"\n🌍 Sample non-{target_language.upper()} video titles:")
        for lang in non_target_videos["language"].value_counts().head(3).index:
            sample_videos = non_target_videos[non_target_videos["language"] == lang]
            print(f"\n   {lang.upper()} examples:")
            for i, (_, row) in enumerate(sample_videos.head(2).iterrows(), 1):
                title_preview = row["video_title"][:80] + "..." if len(str(row["video_title"])) > 80 else row["video_title"]
                print(f"      {i}. {row['video_id']}: '{title_preview}'")

    # Filter for target language videos
    target_videos = videos[videos["language"] == target_language]
    target_video_ids = target_videos["video_id"].tolist()

    print(f"\n📊 Video Language Filtering:")
    print(f"   ✅ {target_language.upper()} videos: {len(target_video_ids):,} ({len(target_video_ids) / len(videos):.4f})")
    print(f"   🗑️ Other languages: {len(videos) - len(target_video_ids):,}")

    # Filter comments to only include target language videos
    original_comment_count = len(df)
    df_filtered = df[df["video_id"].isin(target_video_ids)].copy()
    df_filtered.reset_index(drop=True, inplace=True)

    final_comment_count = len(df_filtered)
    removed_comments = original_comment_count - final_comment_count

    print(f"\n📊 Comment Filtering by Video Language:")
    print(f"   ✅ Comments kept: {final_comment_count:,} ({final_comment_count / original_comment_count:.4f})")
    print(f"   🗑️ Comments removed: {removed_comments:,} ({removed_comments / original_comment_count:.4f})")

    return df_filtered, target_video_ids


# Validate video languages and filter accordingly
df, portuguese_video_ids = validate_video_languages(df, language_detector, CleaningConfig.TARGET_LANGUAGE)

2025-07-23 18:38:38,014 - INFO - Validating video title languages


🎥 VIDEO LANGUAGE VALIDATION
🎬 Unique videos to check: 1,619
🌐 Detecting languages for video titles...

📊 Video Title Language Distribution:
   pt: 1,204 videos (74.3669)
   es: 196 videos (12.1062)
   it: 117 videos (7.2267)
   hi: 28 videos (1.7295)
   en: 27 videos (1.6677)
   sw: 18 videos (1.1118)
   ur: 8 videos (0.4941)
   de: 6 videos (0.3706)
   tr: 5 videos (0.3088)
   bg: 4 videos (0.2471)
   nl: 4 videos (0.2471)
   ru: 2 videos (0.1235)

🌍 Sample non-PT video titles:

   ES examples:
      1. -7bTGqiS34w: 'Raúl de Molina se entera que Lili Estefan perdió a una conocida en el Jet Set | ...'
      2. -v6VfrcNzB0: '💥Era ACOSADO por GORDO pero un ISEKAI lo CONVIRTIÓ en TODO UN PAPUCHO | TEMPORAD...'

   IT examples:
      1. 1e9OzpsQg1k: 'Mi Gorda Bella capitulo 112'
      2. 1u1gR2RJIzE: 'Mi papá es un fan!'

   HI examples:
      1. 2SSDY47Ecrw: 'TU MADRE ESTÁ TAN GORDA'
      2. 6jmg_XexH8U: 'SOU A BARBIE GORDA 💪 #roblox #barbie #fy #viraliza #robloxshorts #memeblox'

📊 Vide
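
Short titles are hard for a language classifier: in the samples above, 'TU MADRE ESTÁ TAN GORDA' and 'SOU A BARBIE GORDA ...' are both labelled `hi`. One possible safeguard, sketched below, is to keep the classifier's score and send low-confidence titles to manual review instead of dropping them outright. The sketch assumes each result is a list whose first element is a dict with `label` and `score` keys. The `label` key is what the cell above reads, while the `score` key is an assumption about the detector's output. The 0.80 threshold and the `split_by_confidence` name are illustrative only.

```python
from typing import Any, Dict, List, Tuple

# Assumed shape of one prediction: [{"label": "pt", "score": 0.99}]
Prediction = List[Dict[str, Any]]


def split_by_confidence(
    video_ids: List[str],
    predictions: List[Prediction],
    target_language: str = "pt",
    min_score: float = 0.80,  # illustrative threshold, not tuned on this dataset
) -> Tuple[List[str], List[str], List[str]]:
    """Split videos into accepted, rejected, and needs-review lists.

    Hypothetical helper: a video is accepted or rejected only when the top
    prediction is confident; everything else goes to manual review.
    """
    accepted: List[str] = []
    rejected: List[str] = []
    review: List[str] = []
    for video_id, prediction in zip(video_ids, predictions):
        top = prediction[0]
        if top["score"] < min_score:
            review.append(video_id)
        elif top["label"] == target_language:
            accepted.append(video_id)
        else:
            rejected.append(video_id)
    return accepted, rejected, review


# Made-up IDs and scores for illustration; real scores come from the detector
ids = ["vid_a", "vid_b", "vid_c"]
preds = [
    [{"label": "pt", "score": 0.98}],
    [{"label": "es", "score": 0.95}],
    [{"label": "hi", "score": 0.41}],
]
accepted, rejected, review = split_by_confidence(ids, preds)
print(accepted, rejected, review)
```

Keeping a separate review list trades some manual effort for fewer Portuguese videos lost to misclassification.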

## 9. Final Data Summary and Export

Summarize the cleaned dataset, then export it as a Parquet file together with a JSON metadata file and a CSV backup.

In [12]:
def generate_final_summary(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Generate comprehensive summary of the cleaned dataset.

    Args:
        df: Final cleaned DataFrame

    Returns:
        Dictionary with summary statistics
    """
    logger.info("Generating final dataset summary")

    print("📊 FINAL CLEANED DATASET SUMMARY")
    print("=" * 50)

    # Basic statistics
    summary = {
        "total_comments": len(df),
        "unique_videos": df["video_id"].nunique(),
        "unique_authors": df["authorDisplayName"].nunique(),
        "date_range": {"earliest": df["publishedAt"].min() if "publishedAt" in df.columns else None, "latest": df["publishedAt"].max() if "publishedAt" in df.columns else None},
        "dataset_shape": df.shape,
        "memory_usage_mb": df.memory_usage(deep=True).sum() / (1024 * 1024),
    }

    print(f"📈 Dataset Overview:")
    print(f"   Total comments: {summary['total_comments']:,}")
    print(f"   Unique videos: {summary['unique_videos']:,}")
    print(f"   Unique authors: {summary['unique_authors']:,}")
    print(f"   Dataset shape: {summary['dataset_shape']}")
    print(f"   Memory usage: {summary['memory_usage_mb']:.2f} MB")

    # Temporal coverage
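    # Use pd.notna so that both None (column absent) and NaT (no valid dates) are skipped
    if pd.notna(summary["date_range"]["earliest"]) and pd.notna(summary["date_range"]["latest"]):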
        time_span = summary["date_range"]["latest"] - summary["date_range"]["earliest"]
        print(f"\n📅 Temporal Coverage:")
        print(f"   Earliest comment: {summary['date_range']['earliest']}")
        print(f"   Latest comment: {summary['date_range']['latest']}")
        print(f"   Time span: {time_span.days} days")
        summary["time_span_days"] = time_span.days

    # Comment length statistics
    if "textDisplay" in df.columns:
        df_temp = df.copy()
        df_temp["comment_length"] = df_temp["textDisplay"].str.len()
        length_stats = df_temp["comment_length"].describe()

        print(f"\n📝 Comment Length Statistics:")
        print(f"   Mean: {length_stats['mean']:.2f} characters")
        print(f"   Median: {length_stats['50%']:.2f} characters")
        print(f"   Std: {length_stats['std']:.2f} characters")
        print(f"   Range: {length_stats['min']:.0f} - {length_stats['max']:.0f} characters")

        summary["comment_length"] = {"mean": length_stats["mean"], "median": length_stats["50%"], "std": length_stats["std"], "min": length_stats["min"], "max": length_stats["max"]}

    # Top videos by comment count
    print(f"\n🎥 Most Commented Videos (Top 5):")
    top_videos = df["video_id"].value_counts().head()
    for i, (video_id, count) in enumerate(top_videos.items(), 1):
        title = df[df["video_id"] == video_id]["video_title"].iloc[0] if "video_title" in df.columns else "Unknown"
        title_preview = title[:60] + "..." if len(str(title)) > 60 else title
        print(f"   {i}. {video_id}: {count:,} comments")
        print(f"      '{title_preview}'")

    summary["top_videos"] = top_videos.to_dict()

    # Data quality assessment
    print(f"\n🔍 Data Quality Assessment:")
    missing_stats = {}
    for col in df.columns:
        missing_count = df[col].isna().sum()
        missing_pct = (missing_count / len(df)) * 100
        missing_stats[col] = {"count": missing_count, "percentage": missing_pct}
        if missing_count > 0:
            print(f"   {col}: {missing_count:,} missing ({missing_pct:.4f})")

    if not any(stats["count"] > 0 for stats in missing_stats.values()):
        print("   ✅ No missing values detected")

    summary["missing_data"] = missing_stats

    # Language confirmation
    if "language" in df.columns:
        lang_dist = df["language"].value_counts()
        print(f"\n🌐 Language Distribution:")
        for lang, count in lang_dist.items():
            pct = (count / len(df)) * 100
            print(f"   {lang}: {count:,} ({pct:.4f})")
        summary["language_distribution"] = lang_dist.to_dict()

    return summary


# Generate final summary
final_summary = generate_final_summary(df)

# Display final cleaned dataset
print(f"\n📄 Final Cleaned Dataset (sample):")
display(df.head())

2025-07-23 18:39:38,016 - INFO - Generating final dataset summary


📊 FINAL CLEANED DATASET SUMMARY
📈 Dataset Overview:
   Total comments: 191,946
   Unique videos: 1,204
   Unique authors: 163,664
   Dataset shape: (191946, 20)
   Memory usage: 255.06 MB

📅 Temporal Coverage:
   Earliest comment: 2006-11-24 20:16:56+00:00
   Latest comment: 2025-04-17 11:46:21+00:00
   Time span: 6718 days

📝 Comment Length Statistics:
   Mean: 91.09 characters
   Median: 55.00 characters
   Std: 117.04 characters
   Range: 10 - 1487 characters

🎥 Most Commented Videos (Top 5):
   1. S4pDpA-g7hE: 9,418 comments
      'Pirado - João Gordo X Dado Dolabella'
   2. 3JK3MbRhjUg: 5,927 comments
      'ROMANTIZANDO A OBESIDADE.'
   3. Fn54EKtAbTc: 5,873 comments
      'Super Oração Contra Inveja, Olho Gordo e Mau Olhado'
   4. YeGperfZ7QY: 5,742 comments
      'COMECEI A TREINAR E…. #viral #emagrecimento #youtube #obesid...'
   5. Y0NWZKORge0: 3,752 comments
      'Mulher mais gorda do Brasil pesa 360 quilos'

🔍 Data Quality Assessment:
   author: 191,946 missing (100.0000)


Unnamed: 0,video_id,channelId,videoId,textDisplay,textOriginal,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,author,comment,date,likes,video_title,language
0,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Haahahahahahahahhahh o polícia chupando a buda...,Haahahahahahahahhahh o polícia chupando a buda...,@evelynsoares4467,https://yt3.ggpht.com/ytc/AIdro_kTUhLtO25GYE29...,http://www.youtube.com/@evelynsoares4467,UCNhXx9ev5RtEiyGsVjMuTOA,True,none,0.0,2024-12-28 21:38:37+00:00,2024-12-28 21:38:37+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
1,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Chefe wigol deu um beijo grego no homer skksks,Chefe wigol deu um beijo grego no homer skksks,@MrLopess00,https://yt3.ggpht.com/GbqCWSYWX0x7m12TrBOc7bBO...,http://www.youtube.com/@MrLopess00,UCtrByOsq8kIDCQfSXfq3IKw,True,none,447.0,2024-12-29 02:00:55+00:00,2024-12-29 02:00:55+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
2,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,Quem era Batedor de Carteiras ?,Quem era Batedor de Carteiras ?,@mateuss.santossilva5059,https://yt3.ggpht.com/lIA6NvNbtRKR4LZyVTGVdNO_...,http://www.youtube.com/@mateuss.santossilva5059,UCIY2M7NurJ728_H4Cs5zmQA,True,none,5.0,2024-12-29 12:53:19+00:00,2024-12-29 12:53:19+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
3,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,"""Graças a deus que essa coisa está do nosso la...","""Graças a deus que essa coisa está do nosso la...",@Ray._Ryan000,https://yt3.ggpht.com/WIrn4XlSuZAQuPHw6w53yiiX...,http://www.youtube.com/@Ray._Ryan000,UCr5gdJ-I9wpcBXWdSR6T5iw,True,none,1677.0,2024-12-29 16:16:06+00:00,2024-12-29 16:17:20+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
4,--tK3SaYWr4,UCiV6zQocW4CvWRyXcKDZZmQ,--tK3SaYWr4,🤨 tá estranho isso,🤨 tá estranho isso,@darkgacha5649,https://yt3.ggpht.com/R5NIvS_yYOP4_ngqdnlXIlOH...,http://www.youtube.com/@darkgacha5649,UCZnl2qgkPiF-SYyoPmTbzBw,True,none,0.0,2024-12-29 18:03:52+00:00,2024-12-29 18:03:52+00:00,,,,,Tony Gordo é Incriminado #simpsons,pt
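
The quality assessment above reports the `author` column as missing in every row. A minimal sketch of removing such fully empty columns before export follows. Whether to drop them is a judgment call that the notebook itself does not make, and `drop_empty_columns` is a hypothetical helper name.

```python
from typing import List, Tuple

import pandas as pd


def drop_empty_columns(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Drop columns in which every value is missing.

    Hypothetical helper (not part of the original notebook). Returns the
    reduced DataFrame and the names of the columns that were removed.
    """
    empty_columns = [col for col in df.columns if df[col].isna().all()]
    return df.drop(columns=empty_columns), empty_columns


# Toy frame mirroring the situation above: `author` is entirely missing
toy = pd.DataFrame(
    {
        "video_id": ["a", "b"],
        "textDisplay": ["primeiro comentário", "segundo comentário"],
        "author": [None, None],
    }
)
cleaned, dropped = drop_empty_columns(toy)
print(dropped)
print(list(cleaned.columns))
```

Returning the dropped names makes it possible to record them in the export metadata, so the removal stays traceable.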


In [13]:
def export_cleaned_data(df: pd.DataFrame, output_path: Path, summary: Dict[str, Any]) -> None:
    """
    Export the cleaned dataset with comprehensive metadata.

    Args:
        df: Cleaned DataFrame to export
        output_path: Path for the output file
        summary: Summary statistics dictionary
    """
    logger.info(f"Exporting cleaned data to: {output_path}")

    print("💾 EXPORTING CLEANED DATA")
    print("=" * 50)

    try:
        # Export main dataset
        df.to_parquet(output_path, index=False)
        file_size_mb = output_path.stat().st_size / (1024 * 1024)

        print(f"✅ Dataset exported successfully")
        print(f"📁 File: {output_path.name}")
        print(f"📊 Size: {file_size_mb:.1f} MB")
        print(f"📈 Records: {len(df):,}")

        # Export summary metadata
        metadata_path = output_path.with_suffix(".json")

        # Prepare metadata for JSON serialization
        export_metadata = {
            "export_info": {"timestamp": pd.Timestamp.now().isoformat(), "file_name": output_path.name, "file_size_mb": file_size_mb, "format": "parquet"},
            "dataset_summary": summary,
            "cleaning_pipeline": {
                "steps": ["Text length filtering (10th to 99.9th percentile)", "Emoji-only comment removal", "Data structure normalization", "Duplicate detection and removal (content-based hashing)", "Language detection and Portuguese filtering", "Video language validation"],
                "target_language": CleaningConfig.TARGET_LANGUAGE,
                "min_text_percentile": CleaningConfig.MIN_TEXT_PERCENTILE,
                "max_text_percentile": CleaningConfig.MAX_TEXT_PERCENTILE,
            },
            "column_descriptions": {
                "video_id": "YouTube video identifier",
                "textDisplay": "Comment text content (Portuguese only)",
                "authorDisplayName": "Comment author display name",
                "authorChannelUrl": "Author channel URL",
                "authorChannelId": "Author channel identifier",
                "publishedAt": "Comment publication timestamp",
                "updatedAt": "Comment last update timestamp",
                "likeCount": "Number of likes on the comment",
                "video_title": "Title of the YouTube video (Portuguese)",
                "language": "Detected language code (pt for Portuguese)",
            },
        }

        # Handle datetime serialization
        import json

        def json_serializer(obj):
            if isinstance(obj, pd.Timestamp):
                return obj.isoformat()
            elif hasattr(obj, "isoformat"):
                return obj.isoformat()
            return str(obj)

        with open(metadata_path, "w", encoding="utf-8") as f:
            json.dump(export_metadata, f, indent=2, ensure_ascii=False, default=json_serializer)

        print(f"📋 Metadata exported: {metadata_path.name}")

        # Export CSV backup
        csv_path = output_path.with_suffix(".csv")
        df.to_csv(csv_path, index=False, encoding="utf-8")
        csv_size_mb = csv_path.stat().st_size / (1024 * 1024)

        print(f"💾 CSV backup created: {csv_path.name} ({csv_size_mb:.1f} MB)")

        print(f"\n✅ Export completed successfully!")
        print(f"📁 Files created:")
        print(f"   • {output_path.name} (main dataset)")
        print(f"   • {metadata_path.name} (metadata)")
        print(f"   • {csv_path.name} (CSV backup)")

    except Exception:
        # logger.exception records the full traceback along with the message
        logger.exception("Error exporting data")
        raise


# Export the cleaned dataset
export_cleaned_data(df, CleaningConfig.OUTPUT_FILE, final_summary)

print(f"\n🎉 DATA CLEANING PIPELINE COMPLETED!")
print(f"📊 Final dataset: {len(df):,} Portuguese comments from {df['video_id'].nunique():,} videos")
print(f"📁 Output file: {CleaningConfig.OUTPUT_FILE}")
print(f"▶️ Ready for analysis in subsequent notebooks!")

2025-07-23 18:40:38,017 - INFO - Exporting cleaned data to: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet


💾 EXPORTING CLEANED DATA
✅ Dataset exported successfully
📁 File: 20250417_youtube_comments_pt_cleaned1.parquet
📊 Size: 47.1 MB
📈 Records: 191,946
📋 Metadata exported: 20250417_youtube_comments_pt_cleaned1.json
💾 CSV backup created: 20250417_youtube_comments_pt_cleaned1.csv (104.9 MB)

✅ Export completed successfully!
📁 Files created:
   • 20250417_youtube_comments_pt_cleaned1.parquet (main dataset)
   • 20250417_youtube_comments_pt_cleaned1.json (metadata)
   • 20250417_youtube_comments_pt_cleaned1.csv (CSV backup)

🎉 DATA CLEANING PIPELINE COMPLETED!
📊 Final dataset: 191,946 Portuguese comments from 1,204 videos
📁 Output file: ../data/intermediate/20250417_youtube_comments_pt_cleaned1.parquet
▶️ Ready for analysis in subsequent notebooks!
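
One detail of the export is worth noting: the `json_serializer` fallback in `export_cleaned_data` converts any unrecognized object with `str()`, so values such as the NumPy integer counts in `missing_data` may end up as strings in the metadata file. A possible refinement, sketched below, keeps them numeric. It assumes such scalars expose an `.item()` method, as NumPy scalars do, and the `to_jsonable` name is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any


def to_jsonable(obj: Any) -> Any:
    """Fallback for json.dump(default=...) that preserves numeric types.

    Hypothetical refinement of the notebook's json_serializer.
    """
    # Dates and timestamps (including pd.Timestamp) provide isoformat()
    if hasattr(obj, "isoformat"):
        return obj.isoformat()
    # NumPy scalars provide item(), which returns the native Python value
    if hasattr(obj, "item"):
        return obj.item()
    # Last resort: keep the export from failing on unknown objects
    return str(obj)


# Illustrative values only; the real metadata is built in export_cleaned_data
metadata = {
    "exported_at": datetime(2025, 7, 23, 18, 40, 38, tzinfo=timezone.utc),
    "records": 191_946,
}
print(json.dumps(metadata, default=to_jsonable, indent=2))
```

Passing `default=to_jsonable` to `json.dump` in place of `json_serializer` would be the only change needed at the call site.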
