# Data Preprocessing: English Translation for Jeju Data Analysis

## 1. Project Overview
This notebook covers the initial data preprocessing stage: translating the Korean datasets into English. Translation makes the analysis accessible to a wider audience and avoids rendering problems in data visualization tools and libraries whose default fonts often lack Korean glyphs.

### Target Datasets:
1. `jeju_card_region_2017.csv`: Credit card usage data in Jeju (2017)
2. `jeju_card_region_2018.csv`: Credit card usage data in Jeju (2018)
3. `jeju_population.csv`: Resident and floating population data in Jeju

---

## 2. Methodology
To ensure efficiency and accuracy, the following strategy was applied:

* **Library:** Used the `GoogleTranslator` class from `deep-translator` for automated Korean-to-English translation.
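* **Optimization:** Instead of translating every row individually, the script extracts the **unique values** of each text column, translates each one once, and maps the results back onto the column. The number of translation requests then scales with the number of distinct values rather than the number of rows, which shortens execution time.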
* **Data Formatting:**
    * Replaced spaces in the translated column headers with underscores (e.g. `Year_Month`). Capitalization and other characters are kept as returned by the translator.
    * Left numeric columns untouched and translated only the categorical text columns (Districts, Business Types, Gender, etc.).
* **Encoding:** Tried `utf-8-sig` first and fell back to `cp949` on a `UnicodeDecodeError`, to prevent Korean character corruption during loading.
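
The unique-value strategy in the list above can be tried offline before any real translation requests are made. The sketch below is a minimal illustration under stated assumptions: `translate_unique` and `fake_translate` are hypothetical names, the sample values are made up, and `fake_translate` is a stub standing in for a real translator such as `GoogleTranslator.translate`.

```python
from typing import Callable, Iterable


def translate_unique(values: Iterable, translate_fn: Callable[[str], str]) -> list:
    """Translate each distinct non-empty string once, then map the results back."""
    values = list(values)
    mapping = {}
    for value in values:
        if value in mapping:
            continue  # this distinct value has already been handled
        if isinstance(value, str) and value.strip():
            mapping[value] = translate_fn(value)
        else:
            mapping[value] = value  # keep None, numbers, and blank strings unchanged
    return [mapping[value] for value in values]


# Hypothetical stand-in for a real translator; it records every call it receives.
calls = []
_glossary = {"제주시": "Jeju City", "서귀포시": "Seogwipo City"}


def fake_translate(text: str) -> str:
    calls.append(text)
    return _glossary.get(text, text)


# Made-up sample column with repeated, empty, and missing values.
column = ["제주시", "서귀포시", "제주시", "", None, "제주시"]
translated = translate_unique(column, fake_translate)
print(translated)
print(f"{len(column)} rows, {len(calls)} translator calls")
```

With six rows but only two distinct non-empty strings, the stub is called twice. The same ratio is what keeps the request count low on the real datasets, where categorical columns repeat a small set of values.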

---

## 3. Environment Setup

In [1]:
import pandas as pd
from deep_translator import GoogleTranslator

def translate_korean_csv(input_file, output_file):
    # 1. Load the dataset (utf-8-sig reads UTF-8 and strips a leading BOM if present)
    try:
        df = pd.read_csv(input_file, encoding='utf-8-sig')
    except UnicodeDecodeError:
        # Fallback to cp949 if utf-8 fails (common for Korean Windows CSVs)
        df = pd.read_csv(input_file, encoding='cp949')

    translator = GoogleTranslator(source='ko', target='en')

    # 2. Translate Column Names
    print("Translating column headers...")
    new_columns = {}
    for col in df.columns:
        # Keep the original header if the translator returns nothing
        translated_col = translator.translate(col) or col
        # Replacing spaces with underscores for better data handling
        new_columns[col] = translated_col.replace(" ", "_")
    df.rename(columns=new_columns, inplace=True)

    # 3. Translate Row Values (target only object/string columns)
    for col in df.columns:
        if df[col].dtype == 'object':
            print(f"Translating unique values in column: '{col}'...")
            
            # Efficient translation: only translate unique values to save API calls
            unique_elements = df[col].unique()
            translation_dict = {}
            
            for element in unique_elements:
                if isinstance(element, str) and element.strip():
                    # Translate the text value; keep the original if nothing is returned
                    translated_text = translator.translate(element) or element
                    translation_dict[element] = translated_text
                else:
                    # Keep non-string or empty values as they are
                    translation_dict[element] = element
            
            # Apply the mapping to the entire column
            df[col] = df[col].map(translation_dict)

    # 4. Export to a new CSV file
    df.to_csv(output_file, index=False, encoding='utf-8-sig')
    print(f"Translation completed! Saved to: {output_file}")

# Execution Block
if __name__ == "__main__":
    # List of files to translate (Make sure these are in the same folder as this notebook)
    files_to_translate = [
        'jeju_card_region_2017.csv',
        'jeju_card_region_2018.csv',
        'jeju_population.csv'
    ]

    for file in files_to_translate:
        # Create output filename (e.g., 'data.csv' -> 'data_english.csv')
        output_name = file.replace('.csv', '_english.csv')
        translate_korean_csv(file, output_name)
        
    print("\n" + "="*30)
    print("All translation tasks completed!")
    print("="*30)
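
The header step in `translate_korean_csv` replaces only spaces, so a translated header can keep capital letters and characters such as `/`. If strict `snake_case` headers are wanted, a small helper along these lines could be applied to each translated header. This is a sketch, not part of the notebook: `to_snake_case` is a hypothetical name, and it assumes the headers are already in ASCII English.

```python
import re


def to_snake_case(header: str) -> str:
    """Lowercase a header and collapse runs of non-alphanumeric characters into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())
    return cleaned.strip("_").lower()


# Header names taken from the translation log of this notebook.
for header in ["Year_Month", "City/County_Life_Name", "Eup/myeon/dong_name", "gender"]:
    print(header, "->", to_snake_case(header))
```

Because the pattern keeps only ASCII letters and digits, any Korean text the translator left untranslated would be replaced by underscores, so the result is worth checking before renaming columns.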

Translating column headers...
Translating unique values in column: 'Year_Month'...
Translating unique values in column: 'City/County_Life_Name'...
Translating unique values in column: 'Eup/myeon/dong_name'...
Translating unique values in column: 'Industry_name'...
Translating unique values in column: 'gender'...
Translation completed! Saved to: jeju_card_region_2017_english.csv
Translating column headers...
Translating unique values in column: 'Year_Month'...
Translating unique values in column: 'City/County_Life_Name'...
Translating unique values in column: 'Eup/myeon/dong_name'...
Translating unique values in column: 'Industry_name'...
Translating unique values in column: 'gender'...
Translation completed! Saved to: jeju_card_region_2018_english.csv
Translating column headers...
Translating unique values in column: 'City/County_Life_Name'...
Translating unique values in column: 'Eup/myeon/dong_name'...
Translating unique values in column: 'gender'...
Translating unique values in colu