# Language Translation for the Media Cloud Dataset

## Overview

This notebook processes the Media Cloud dataset (2020-2025) by translating non-English article titles to English while keeping English titles unchanged. The pipeline checks each language against the set Google Translate supports, translates titles chunk by chunk, and drops rows that have no title, use an unsupported language, or fail to translate.

- **Chunked Processing**: Handles large datasets efficiently in configurable chunks (default: 1000 rows)

In [3]:
# Import necessary libraries
import pandas as pd
from deep_translator import GoogleTranslator
import time

In [None]:
df = pd.read_csv("../translate_part1.csv", encoding='utf-8')
df.head(5)

Unnamed: 0,id,indexed_date,language,media_name,media_url,publish_date,title,url,region
0,8ec6ee3fbd7df8acf011807509135594e1059a975e4e33...,2025-12-07 21:29:31.704001+00:00,es,bbc.com,bbc.com,07/12/2025,Fujian: cómo es el portaaviones de última gene...,https://www.bbc.com/mundo/articles/cdd526n2egr...,west
1,c4b7fc0a14d38d4bb082aac6ab793cb70ffbdce43fcab0...,2025-11-23 21:29:39.700224+00:00,es,bbc.com,bbc.com,23/11/2025,Cómo quiere Australia prohibir el acceso a las...,https://www.bbc.com/mundo/articles/cx202xn46jl...,west
2,cf5195fa34510410a88564813fa464cc235dcf8cd7d0a5...,2025-09-14 21:27:48.971495+00:00,es,bbc.com,bbc.com,14/09/2025,Los millonarios de Silicon Valley tienen propu...,https://www.bbc.com/mundo/articles/c701420e172...,west
3,0b1ce86ab0a7bf3f1705d8a076c0021119daeeebb6f718...,2025-09-02 21:31:52.015017+00:00,es,bbc.com,bbc.com,02/09/2025,"Tatiana Bilbao: ""La arquitectura puede ser una...",https://www.bbc.com/mundo/articles/c5yp28n4n65...,west
4,4e519f07912c0f767852c853bb9e4c972030722f5f4a4c...,2025-09-01 21:30:42.707620+00:00,es,bbc.com,bbc.com,01/09/2025,Cómo China usó a empresas como Apple para supe...,https://www.bbc.com/mundo/articles/cx29ljx57pg...,west


In [8]:
print("Number of unique languages:", df['language'].nunique())

Number of unique languages: 5


In [9]:
# Get unique language codes from the DataFrame (numpy array)
languages = df["language"].unique()

# Create a list of language codes excluding English ('en')
non_english_languages = [lang for lang in languages if lang != "en"]
print(non_english_languages)

['es', 'ru', 'fr', 'de', 'it']


In [None]:
# All language counts
lang_counts = df["language"].value_counts()
print(lang_counts)


es    22225
it      216
fr       12
de        2
ru        1
Name: language, dtype: int64


In [12]:
# Convert to set for faster look-up
non_english_Langs = set(non_english_languages)

# Get supported languages by Google Translate - we need the VALUES (codes) from the dictionary
supported_languages_dict = GoogleTranslator().get_supported_languages(as_dict=True)
print("Supported languages dictionary sample:", dict(list(supported_languages_dict.items())[:10]))

# Extract just the language CODES (the values)
supported_language_codes = set(supported_languages_dict.values())
print(f"Supported language codes: {len(supported_language_codes)} languages")

# Also handle Chinese variants - map zh to zh-CN if needed
if 'zh' in non_english_Langs and 'zh' not in supported_language_codes:
    print("Note: 'zh' detected in dataset, will map to 'zh-CN' for translation")
    non_english_Langs.discard('zh')
    non_english_Langs.add('zh-CN')

# Find intersection between our dataset languages and supported language codes
SUPPORTED_LANGS = non_english_Langs.intersection(supported_language_codes)

print(f"Non-English languages from dataset: {len(non_english_Langs)}")
print(f"Supported languages for translation: {len(SUPPORTED_LANGS)}")
print(f"Supported languages: {SUPPORTED_LANGS}")

Supported languages dictionary sample: {'afrikaans': 'af', 'albanian': 'sq', 'amharic': 'am', 'arabic': 'ar', 'armenian': 'hy', 'assamese': 'as', 'aymara': 'ay', 'azerbaijani': 'az', 'bambara': 'bm', 'basque': 'eu'}
Supported language codes: 133 languages
Non-English languages from dataset: 5
Supported languages for translation: 5
Supported languages: {'es', 'fr', 'ru', 'it', 'de'}


43 out of 47 non-English languages are supported by Google Translate. The 4 unsupported languages will be dropped during processing to maintain data quality. In the part loaded above, all 5 non-English languages (es, ru, fr, de, it) are supported.
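
To see which language codes fall outside the supported set, a set difference after applying the code aliases is enough. The sketch below is illustrative only: `dataset_langs` and `supported_codes` are hypothetical stand-ins for `non_english_Langs` and `supported_language_codes`, `split_by_support` is a made-up helper name, and the alias map copies the mapping defined inside `translate_chunk`.

```python
# Hypothetical stand-ins for the notebook's non_english_Langs and
# supported_language_codes; the values are made up for illustration.
dataset_langs = {"es", "fr", "zh", "he", "xx"}
supported_codes = {"es", "fr", "zh-CN", "iw", "de"}

# Codes the dataset may use that the translator knows under another name
# (same aliases as the mapping inside translate_chunk).
ALIASES = {"zh": "zh-CN", "he": "iw", "nb": "no"}


def split_by_support(langs, supported, aliases=ALIASES):
    """Return (supported, unsupported) dataset codes after applying aliases."""
    ok, missing = set(), set()
    for lang in langs:
        target = aliases.get(lang, lang)
        (ok if target in supported else missing).add(lang)
    return ok, missing


ok, missing = split_by_support(dataset_langs, supported_codes)
print("supported:", sorted(ok))
print("unsupported:", sorted(missing))
```

Applying the aliases before the membership test matters: without it, a code such as `zh` would be reported as unsupported even though the translator accepts `zh-CN`.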

In [13]:
def translate_chunk(chunk):
    """Translate a chunk of data and return processed DataFrame with all original columns"""
    translated_data = []
    
    # Language mapping for Google Translate
    language_mapping = {
        'zh': 'zh-CN',  # Chinese simplified
        'he': 'iw',     # Hebrew  
        'nb': 'no',     # Norwegian Bokmål
    }
    
    for idx, row in chunk.iterrows():
        try:
            lang = row['language']
            title = row['title']
            
            # Skip if title is NaN or empty
            if pd.isna(title) or str(title).strip() == '':
                continue
                
            # If language is English, use original as translated
            if lang == 'en':
                translated_title = str(title)
            else:
                # Map the dataset code to Google Translate's code first, so the
                # support check sees the same code SUPPORTED_LANGS holds
                # (e.g. 'zh' is stored there as 'zh-CN')
                source_lang = language_mapping.get(lang, lang)

                # Non-English and not supported: skip, which drops this row
                if source_lang not in SUPPORTED_LANGS:
                    continue

                # Translate supported non-English languages
                translated_title = GoogleTranslator(source=source_lang, target='en').translate(str(title))
            
            # Create a new row with ALL original data + translated_title
            new_row = row.to_dict()  # This preserves all original columns
            new_row['translated_title'] = translated_title  # Add the translation
            
            # Add to results
            translated_data.append(new_row)
            
        except Exception as e:
            # Drop failed rows, continue without adding
            continue
    
    return pd.DataFrame(translated_data)
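
Because any exception drops the row, a transient network error costs a title that might have translated on a second attempt. A minimal sketch of retrying with exponential backoff follows; `translate_with_retry` and its parameters are hypothetical and not part of the notebook, and the demo uses a fake translator rather than the real API.

```python
import time


def translate_with_retry(translate, text, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call translate(text), retrying with exponential backoff on any exception.

    Returns the translation, or None once every attempt has failed.
    """
    for attempt in range(retries):
        try:
            return translate(text)
        except Exception:
            if attempt < retries - 1:
                # Wait base_delay, then 2x, then 4x ... between attempts.
                sleep(base_delay * (2 ** attempt))
    return None


# Demo with a fake translator that fails twice, then succeeds.
calls = {"n": 0}


def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()


# sleep is replaced with a no-op so the demo runs instantly.
result = translate_with_retry(flaky_translate, "hola", sleep=lambda s: None)
print(result)
```

In the notebook, the `translate` argument could be the bound method `GoogleTranslator(source=source_lang, target='en').translate`, with a `None` result treated the same way as a failed row.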

In [14]:
# Process the entire dataset in chunks
def process_dataset_in_chunks(input_df, chunk_size=1000):
    """Process the entire dataset in chunks and return combined DataFrame"""
    
    # Initialize counters and storage
    total_processed = 0
    total_successful = 0
    chunks_processed = 0
    all_translated_data = []
    
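    # Calculate number of chunks (ceiling division, so an exact multiple of
    # chunk_size does not produce an empty extra chunk)
    total_chunks = (len(input_df) + chunk_size - 1) // chunk_size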
    print(f"Processing {len(input_df)} rows in {total_chunks} chunks of {chunk_size}...")
    
    # Process in chunks
    for chunk_num in range(total_chunks):
        start_idx = chunk_num * chunk_size
        end_idx = min((chunk_num + 1) * chunk_size, len(input_df))
        chunk = input_df.iloc[start_idx:end_idx]
        
        print(f"Processing chunk {chunk_num + 1}/{total_chunks} (rows {start_idx}-{end_idx})...")
        
        # Translate chunk
        translated_chunk = translate_chunk(chunk)
        
        # Update counters
        chunk_successful = len(translated_chunk)
        total_processed += len(chunk)
        total_successful += chunk_successful
        chunks_processed += 1
        
        # Store results
        if not translated_chunk.empty:
            all_translated_data.append(translated_chunk)
            print(f"  Chunk {chunk_num + 1}: {chunk_successful} successful, {len(chunk) - chunk_successful} failed")
        else:
            print(f"  Chunk {chunk_num + 1}: No successful translations")
        
        # Rate limiting - be nice to the API
        time.sleep(1)  # 1 second between chunks
        if chunks_processed % 10 == 0:  # Longer break every 10 chunks
            print("  Taking a longer break...")
            time.sleep(3)
    
    # Combine all chunks
    if all_translated_data:
        final_df = pd.concat(all_translated_data, ignore_index=True)
    else:
        # Keep the same schema as a non-empty result: all input columns plus the translation
        final_df = pd.DataFrame(columns=list(input_df.columns) + ['translated_title'])
    
    # Final summary
    print(f"\nProcessing complete!")
    print(f"Total rows processed: {total_processed}")
    print(f"Total successful translations: {total_successful}")
    print(f"Total failed/dropped: {total_processed - total_successful}")
    
    return final_df
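
Chunk boundaries come down to ceiling division: a plain `len // chunk_size + 1` would schedule an empty extra chunk whenever the row count is an exact multiple of the chunk size. The sketch below shows the arithmetic in isolation; `chunk_bounds` and `count_chunks` are hypothetical helper names, not functions from the notebook.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) row offsets covering n_rows, with end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def count_chunks(n_rows, chunk_size):
    """Ceiling division: number of non-empty chunks needed for n_rows."""
    return -(-n_rows // chunk_size)


print(count_chunks(22456, 1000))       # 23
print(count_chunks(2000, 1000))        # 2
print(list(chunk_bounds(2500, 1000)))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Each `(start, end)` pair can be passed straight to `input_df.iloc[start:end]`, which uses the same exclusive end.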

In [None]:
# Process the dataset in chunks
translated_df = process_dataset_in_chunks(df, chunk_size=1000)


Processing 22456 rows in 23 chunks of 1000...
Processing chunk 1/23 (rows 0-1000)...
  Chunk 1: 1000 successful, 0 failed
Processing chunk 2/23 (rows 1000-2000)...
  Chunk 2: 1000 successful, 0 failed
Processing chunk 3/23 (rows 2000-3000)...
  Chunk 3: 1000 successful, 0 failed
Processing chunk 4/23 (rows 3000-4000)...


In [None]:
# Save to CSV
output_file = "../data/mediacloud-translated-complete_part1.csv"
translated_df.to_csv(output_file, index=False)


In [10]:
translated_df

Unnamed: 0,id,indexed_date,language,media_name,media_url,publish_date,title,url,translated_title
0,1dc171e7750c319dc4a7b4ca87c8a6f5587c0e9481f129...,2025-11-03 00:22:19.404264+00:00,en,techcrunch.com,techcrunch.com,2025-11-02,Google pulls Gemma from AI Studio after Senato...,https://techcrunch.com/2025/11/02/google-pulls...,Google pulls Gemma from AI Studio after Senato...
1,e6a47f0c7b6de768d799e603b5797ca432f9af6903de7b...,2025-11-02 23:17:22.553191+00:00,en,livemint.com,livemint.com,2025-11-02,Here's why India’s AI content draft rules miss...,https://www.livemint.com/opinion/online-views/...,Here's why India’s AI content draft rules miss...
2,8061d40a29f0fcf33e6906573a72389a4c0a0cc136522c...,2025-11-02 21:51:34.827292+00:00,en,apnews.com,apnews.com,2025-11-02,Who is Zico Kolter? A professor leads OpenAI s...,https://apnews.com/article/openai-safety-chatg...,Who is Zico Kolter? A professor leads OpenAI s...
3,d411e1a49c87e2e710c054d514f58beebcfa2f275107d8...,2025-11-02 20:26:57.088674+00:00,es,infolibre.es,infolibre.es,2025-11-02,"ChatGPT no es un psicólogo, pero cambiará su l...",https://www.infolibre.es/politica/chatgpt-rect...,"ChatGPT is not a psychologist, but it will cha..."
4,c8ce06691fdf797a875be76c82efb7b0fab0d3d62a795c...,2025-11-02 20:17:51.246694+00:00,en,thestar.com,thestar.com,2025-11-02,"Like maple syrup and hockey, AI must become a ...",https://www.thestar.com/business/opinion/like-...,"Like maple syrup and hockey, AI must become a ..."
...,...,...,...,...,...,...,...,...,...
35070,9c99f124e4570fa47a52347b36be92c29d6a36bf841bd5...,2023-11-12 00:13:58+00:00,en,hindustantimes.com,hindustantimes.com,2023-11-10,New Delhi Slush’D witnessed unprecedented succ...,https://www.hindustantimes.com/brand-stories/n...,New Delhi Slush’D witnessed unprecedented succ...
35071,2a1e8f766030208b2eb23b755954af02adf47a59983d88...,2023-11-12 00:12:34+00:00,en,indiatimes.com,indiatimes.com,2023-05-12,AI on human rights watch,https://economictimes.indiatimes.com/opinion/e...,AI on human rights watch
35072,08fb7e5577bba492f3fd2461af622a6c54a40156df1fa4...,2023-11-05 03:58:17+00:00,es,lapoliticaonline.com,lapoliticaonline.com,2023-10-05,"Con su propia criptomoneda, el creador del Cha...",https://www.lapoliticaonline.com/usa/business-...,"With his own cryptocurrency, the creator of Ch..."
35073,4a95dfe5a35bd5f075df96cf3e398fdf2618035d1e4e0d...,2023-11-05 01:28:22+00:00,en,jerseyeveningpost.com,jerseyeveningpost.com,2023-11-03,Musk ‘sensationalist’ comments on AI taking jo...,https://jerseyeveningpost.com/uncategorised/20...,Musk ‘sensationalist’ comments on AI taking jo...


In [8]:
# Compare row counts before and after translation to see how many rows were dropped
print(f"Length of translated_df: {len(translated_df)}")
print(f"\nLength of original df: {len(df)}")
print(f"\nNumber of rows dropped (failed translations): {len(df) - len(translated_df)}")

Length of translated_df: 35075

Length of original df: 35399

Number of rows dropped (failed translations): 324


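The translated dataset contains 35,075 rows, 324 fewer than the original 35,399. The difference covers every row that `translate_chunk` skips: rows with a missing or empty title, rows in an unsupported language, and rows whose translation raised an error.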
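
A single row-count difference does not say why each row was dropped. The sketch below separates the reasons that can be determined without calling the translator, mirroring the title and language checks in `translate_chunk`. The helper `drop_reason` and the sample rows are hypothetical, and the NaN test assumes missing titles arrive as float NaN, which is what `pd.read_csv` produces by default.

```python
from collections import Counter


def drop_reason(lang, title, supported_langs):
    """Return why a row would be skipped before translation, or None if kept."""
    # Missing or blank titles are skipped first. NaN is the only value that
    # is not equal to itself, so this check needs no pandas import.
    if title is None or title != title or str(title).strip() == "":
        return "empty_title"
    if lang != "en" and lang not in supported_langs:
        return "unsupported_language"
    return None


# Hypothetical rows as (language, title) pairs.
rows = [
    ("en", "AI on human rights watch"),
    ("es", ""),
    ("xx", "Titel"),
    ("fr", float("nan")),
]
reasons = Counter(drop_reason(lang, title, {"es", "fr"}) for lang, title in rows)
print(reasons)
```

Rows for which `drop_reason` returns `None` but which are still absent from `translated_df` would be the ones whose translation call failed.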

In [None]:
# Merge the two translated parts back into a single file
part_1 = pd.read_csv("./data/mediacloud-translated-complete_part1.csv", encoding='utf-8')
part_2 = pd.read_csv("./data/mediacloud-translated-complete_part2.csv", encoding='utf-8')
combined_df = pd.concat([part_1, part_2], ignore_index=True)
combined_df.to_csv("mediacloud-translated-completed_taba.csv", index=False)