# Language Translation for Media Cloud dataset

## Overview

This notebook processes the MediaCloud dataset (2020-2025) by translating non-English article titles to English while preserving original English titles. The pipeline implements a robust data preprocessing and translation workflow designed for large-scale multilingual text processing.

- **Chunked Processing**: Handles large datasets efficiently in configurable chunks (default: 1000 rows)

In [1]:
# Import necessary libraries
import pandas as pd
from deep_translator import GoogleTranslator
import time

In [2]:
df = pd.read_csv("data/mediacloud-2020-2025-dataset.csv", encoding='utf-8')

df.head(5)

Unnamed: 0,id,indexed_date,language,media_name,media_url,publish_date,title,url
0,1dc171e7750c319dc4a7b4ca87c8a6f5587c0e9481f129...,2025-11-03 00:22:19.404264+00:00,en,techcrunch.com,techcrunch.com,2025-11-02,Google pulls Gemma from AI Studio after Senato...,https://techcrunch.com/2025/11/02/google-pulls...
1,e6a47f0c7b6de768d799e603b5797ca432f9af6903de7b...,2025-11-02 23:17:22.553191+00:00,en,livemint.com,livemint.com,2025-11-02,Here's why India’s AI content draft rules miss...,https://www.livemint.com/opinion/online-views/...
2,8061d40a29f0fcf33e6906573a72389a4c0a0cc136522c...,2025-11-02 21:51:34.827292+00:00,en,apnews.com,apnews.com,2025-11-02,Who is Zico Kolter? A professor leads OpenAI s...,https://apnews.com/article/openai-safety-chatg...
3,d411e1a49c87e2e710c054d514f58beebcfa2f275107d8...,2025-11-02 20:26:57.088674+00:00,es,infolibre.es,infolibre.es,2025-11-02,"ChatGPT no es un psicólogo, pero cambiará su l...",https://www.infolibre.es/politica/chatgpt-rect...
4,c8ce06691fdf797a875be76c82efb7b0fab0d3d62a795c...,2025-11-02 20:17:51.246694+00:00,en,thestar.com,thestar.com,2025-11-02,"Like maple syrup and hockey, AI must become a ...",https://www.thestar.com/business/opinion/like-...


In [3]:
print("Number of unique languages:", df['language'].nunique())

Number of unique languages: 48


In [4]:
# Get unique language codes from the DataFrame (numpy array)
languages = df["language"].unique()

# Create a list of language codes excluding English ('en')
non_english_languages = [lang for lang in languages if lang != "en"]
print(non_english_languages)

['es', 'ja', 'de', 'pl', 'fr', 'ro', 'it', 'hr', 'fa', 'id', 'hi', 'zh', 'no', 'nl', 'hu', 'uk', 'ru', 'pt', 'sv', 'sq', 'bg', 'la', 'sr', 'se', 'ca', 'tr', 'he', 'cs', 'sw', 'el', 'fi', 'gl', 'th', 'mt', 'ko', 'ml', 'nb', 'is', 'ta', 'sk', 'nn', 'tl', 'ar', 'ka', 'ur', 'mk', 'az']


In [5]:
# Convert to set for faster look-up
non_english_Langs = set(non_english_languages)

# Get supported languages by Google Translate - we need the VALUES (codes) from the dictionary
supported_languages_dict = GoogleTranslator().get_supported_languages(as_dict=True)
print("Supported languages dictionary sample:", dict(list(supported_languages_dict.items())[:10]))

# Extract just the language CODES (the values)
supported_language_codes = set(supported_languages_dict.values())
print(f"Supported language codes: {len(supported_language_codes)} languages")

# Also handle Chinese variants - map zh to zh-CN if needed
if 'zh' in non_english_Langs and 'zh' not in supported_language_codes:
    print("Note: 'zh' detected in dataset, will map to 'zh-CN' for translation")
    non_english_Langs.discard('zh')
    non_english_Langs.add('zh-CN')

# Find intersection between our dataset languages and supported language codes
SUPPORTED_LANGS = non_english_Langs.intersection(supported_language_codes)

print(f"Non-English languages from dataset: {len(non_english_Langs)}")
print(f"Supported languages for translation: {len(SUPPORTED_LANGS)}")
print(f"Supported languages: {SUPPORTED_LANGS}")

Supported languages dictionary sample: {'afrikaans': 'af', 'albanian': 'sq', 'amharic': 'am', 'arabic': 'ar', 'armenian': 'hy', 'assamese': 'as', 'aymara': 'ay', 'azerbaijani': 'az', 'bambara': 'bm', 'basque': 'eu'}
Supported language codes: 133 languages
Note: 'zh' detected in dataset, will map to 'zh-CN' for translation
Non-English languages from dataset: 47
Supported languages for translation: 43
Supported languages: {'ko', 'ro', 'id', 'ar', 'fa', 'sk', 'zh-CN', 'ca', 'la', 'ka', 'sv', 'cs', 'hu', 'el', 'hi', 'sq', 'az', 'es', 'th', 'nl', 'gl', 'uk', 'fi', 'fr', 'bg', 'tr', 'sw', 'no', 'ru', 'ta', 'ml', 'it', 'de', 'sr', 'mt', 'mk', 'pt', 'tl', 'hr', 'pl', 'is', 'ja', 'ur'}


 43 out 47 non-English languages are supported by Google Translate. The 4 unsupported languages will be drooped during processing to maintain data quality.

In [None]:
# Define translation function with chunk processing
def translate_chunk(chunk):
    """Translate a chunk of data and return processed DataFrame"""
    translated_data = []
    
    # Language mapping for Google Translate (for any remaining mismatches)
    language_mapping = {
        'zh': 'zh-CN',  # Chinese simplified
        'he': 'iw',     # Hebrew  
        'nb': 'no',     # Norwegian Bokmål
    }
    
    for idx, row in chunk.iterrows():
        try:
            lang = row['language']
            title = row['title']
            
            # Skip if title is NaN or empty
            if pd.isna(title) or str(title).strip() == '':
                continue
                
            # If language is English, use original as translated
            if lang == 'en':
                translated_title = str(title)
            # If language is non-English BUT not supported, SKIP/DROP this row
            elif lang not in SUPPORTED_LANGS:
                continue  # This drops unsupported languages
            else:
                # Map language code if needed
                source_lang = language_mapping.get(lang, lang)
                
                # Translate supported non-English languages
                translated_title = GoogleTranslator(source=source_lang, target='en').translate(str(title))
            
            # Add to results (only English and successfully translated rows)
            translated_data.append({
                'language': lang,
                'title': title,
                'translated_title': translated_title
            })
            
        except Exception as e:
            # Drop failed rows, continue without adding
            continue
    
    return pd.DataFrame(translated_data)

In [7]:
# Process the entire dataset in chunks
def process_dataset_in_chunks(input_df, chunk_size=1000):
    """Process the entire dataset in chunks and return combined DataFrame"""
    
    # Initialize counters and storage
    total_processed = 0
    total_successful = 0
    chunks_processed = 0
    all_translated_data = []
    
    # Calculate number of chunks
    total_chunks = (len(input_df) // chunk_size) + 1
    print(f"Processing {len(input_df)} rows in {total_chunks} chunks of {chunk_size}...")
    
    # Process in chunks
    for chunk_num in range(total_chunks):
        start_idx = chunk_num * chunk_size
        end_idx = min((chunk_num + 1) * chunk_size, len(input_df))
        chunk = input_df.iloc[start_idx:end_idx]
        
        print(f"Processing chunk {chunk_num + 1}/{total_chunks} (rows {start_idx}-{end_idx})...")
        
        # Translate chunk
        translated_chunk = translate_chunk(chunk)
        
        # Update counters
        chunk_successful = len(translated_chunk)
        total_processed += len(chunk)
        total_successful += chunk_successful
        chunks_processed += 1
        
        # Store results
        if not translated_chunk.empty:
            all_translated_data.append(translated_chunk)
            print(f"  Chunk {chunk_num + 1}: {chunk_successful} successful, {len(chunk) - chunk_successful} failed")
        else:
            print(f"  Chunk {chunk_num + 1}: No successful translations")
        
        # Rate limiting - be nice to the API
        time.sleep(1)  # 1 second between chunks
        if chunks_processed % 10 == 0:  # Longer break every 10 chunks
            print("  Taking a longer break...")
            time.sleep(3)
    
    # Combine all chunks
    if all_translated_data:
        final_df = pd.concat(all_translated_data, ignore_index=True)
    else:
        final_df = pd.DataFrame(columns=['language', 'title', 'translated_title'])
    
    # Final summary
    print(f"\nProcessing complete!")
    print(f"Total rows processed: {total_processed}")
    print(f"Total successful translations: {total_successful}")
    print(f"Total failed/dropped: {total_processed - total_successful}")
    
    return final_df

In [8]:
# Process the dataset in chunks
translated_df = process_dataset_in_chunks(df, chunk_size=1000)


Processing 35399 rows in 36 chunks of 1000...
Processing chunk 1/36 (rows 0-1000)...
  Chunk 1: 969 successful, 31 failed
Processing chunk 2/36 (rows 1000-2000)...
  Chunk 2: 993 successful, 7 failed
Processing chunk 3/36 (rows 2000-3000)...
  Chunk 3: 995 successful, 5 failed
Processing chunk 4/36 (rows 3000-4000)...
  Chunk 4: 989 successful, 11 failed
Processing chunk 5/36 (rows 4000-5000)...
  Chunk 5: 991 successful, 9 failed
Processing chunk 6/36 (rows 5000-6000)...
  Chunk 6: 996 successful, 4 failed
Processing chunk 7/36 (rows 6000-7000)...
  Chunk 7: 996 successful, 4 failed
Processing chunk 8/36 (rows 7000-8000)...
  Chunk 8: 999 successful, 1 failed
Processing chunk 9/36 (rows 8000-9000)...
  Chunk 9: 991 successful, 9 failed
Processing chunk 10/36 (rows 9000-10000)...
  Chunk 10: 994 successful, 6 failed
  Taking a longer break...
Processing chunk 11/36 (rows 10000-11000)...
  Chunk 11: 999 successful, 1 failed
Processing chunk 12/36 (rows 11000-12000)...
  Chunk 12: 997 su

In [None]:
# Save to CSV
output_file = "data/mediacloud-translated-complete.csv"
translated_df.to_csv(output_file, index=False)


In [12]:
translated_df

Unnamed: 0,language,title,translated_title
0,en,Google pulls Gemma from AI Studio after Senato...,Google pulls Gemma from AI Studio after Senato...
1,en,Here's why India’s AI content draft rules miss...,Here's why India’s AI content draft rules miss...
2,en,Who is Zico Kolter? A professor leads OpenAI s...,Who is Zico Kolter? A professor leads OpenAI s...
3,es,"ChatGPT no es un psicólogo, pero cambiará su l...","ChatGPT is not a psychologist, but it will cha..."
4,en,"Like maple syrup and hockey, AI must become a ...","Like maple syrup and hockey, AI must become a ..."
...,...,...,...
35070,en,New Delhi Slush’D witnessed unprecedented succ...,New Delhi Slush’D witnessed unprecedented succ...
35071,en,AI on human rights watch,AI on human rights watch
35072,es,"Con su propia criptomoneda, el creador del Cha...","With his own cryptocurrency, the creator of Ch..."
35073,en,Musk ‘sensationalist’ comments on AI taking jo...,Musk ‘sensationalist’ comments on AI taking jo...
