# Detecting Special/Non-ASCII Characters in Text Data

In [1]:
# What is Character Encoding?
# Character encoding is a way of representing characters as bytes. A character set assigns each character (letter, symbol, etc.) a unique number, called a code point, and an encoding defines how those numbers are stored as byte sequences. These mappings matter because different systems and applications use different encodings, and bytes written with one encoding are misread when decoded with another.

# For example:

# ASCII maps characters like A, B, C, etc., to the numbers 65, 66, 67, and so on, using 7 bits.
# UTF-8 is a more comprehensive encoding standard, capable of representing every character in the Unicode standard, including characters from non-Latin alphabets (e.g., Chinese, Arabic) and special characters (e.g., emojis, accented letters).
# ISO-8859-1 and Windows-1252 (cp1252) are legacy single-byte encodings that cover only a limited set of mostly Western European characters; they are still common in files produced by older systems or applications.
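
To make the mappings above concrete, the following sketch encodes a few sample characters under each encoding mentioned. The helper name `encode_or_none` and the choice of sample characters are illustrative assumptions, not part of the original notebook.

```python
# Compare how the same characters are stored under different encodings.
samples = ["A", "é", "€"]
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]  # latin-1 is ISO-8859-1


def encode_or_none(char, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent char."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


# Nested dict: character -> encoding -> bytes (or None).
table = {ch: {enc: encode_or_none(ch, enc) for enc in encodings} for ch in samples}

for ch, row in table.items():
    print(f"{ch!r} (code point {ord(ch)}): {row}")
```

The same character can occupy one byte in a legacy encoding and two or three bytes in UTF-8, and some characters (such as the euro sign in ISO-8859-1) cannot be represented at all.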

In [2]:
# Why is Encoding Important in pd.read_csv()?
# A CSV file is stored on disk as bytes. If the file contains characters outside the ASCII range (e.g., accented characters, non-Latin alphabets), pandas needs to know which encoding was used to write it so that it can decode the bytes back into the correct strings. pd.read_csv() assumes UTF-8 unless you pass the encoding parameter.

# If you don't specify the correct encoding, you might encounter issues like:

# Incorrectly displayed characters (e.g., Ã© instead of é).
# Errors during the reading process (e.g., UnicodeDecodeError) if Python tries to decode bytes that don't match the expected encoding.
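
Both failure modes can be reproduced without any file, using only `str.encode` and `bytes.decode`. This is a minimal sketch; the sample string and variable names are illustrative assumptions.

```python
original = "café"

# Case 1: UTF-8 bytes decoded as cp1252. No error, but the text is garbled.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# Case 2: cp1252 bytes decoded as UTF-8. The byte 0xE9 is not valid UTF-8 here.
cp1252_bytes = original.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# A stray "Â" is one common symptom: a UTF-8 non-breaking space (U+00A0)
# read as cp1252 becomes "Â" followed by a non-breaking space.
nbsp_mojibake = "\u00a0".encode("utf-8").decode("cp1252")

# If the garbling came from exactly this mix-up, reversing it restores the text.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(mojibake, type(decode_error).__name__, repr(nbsp_mojibake), repaired)
```

The round trip in `repaired` only works when every garbled character survives re-encoding as cp1252, so treat it as a diagnostic rather than a general fix.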

---

In [3]:
import os
import pandas as pd

In [4]:
# Dictionary mapping short keys to file paths.
file_paths = {
    "dev": "/kaggle/input/meld-emotion-recognition/JSON files/JSON files/Updated CSV/dev_sent_emo_cleaned.csv",
    "test": "/kaggle/input/meld-emotion-recognition/JSON files/JSON files/Updated CSV/test_sent_emo_cleaned.csv",
    "train": "/kaggle/input/meld-emotion-recognition/JSON files/JSON files/Updated CSV/train_sent_emo_cleaned.csv"
}

In [5]:
# Process each file:
for split, path in file_paths.items():
    # Read the CSV file into a DataFrame.
    df = pd.read_csv(path, encoding='cp1252')

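    # Remove stray 'Â' characters, which typically appear when UTF-8 bytes are decoded as cp1252.
    # Note: this only strips the artifact; it does not re-decode the text.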
    df['Utterance'] = df['Utterance'].str.replace('Â', '', regex=False)
    
    # Construct an output file name in the /kaggle/working/ directory.
    # For example, "dev_sent_emo_cleaned.csv" becomes "dev_sent_emo_cleaned_processed.csv"
    file_name = os.path.basename(path)
    output_file_name = file_name.replace('.csv', '_processed.csv')
    output_path = os.path.join('/kaggle/working', output_file_name)
    
    # Save the processed DataFrame to the new CSV file.
    df.to_csv(output_path, index=False)
    
    print(f"Processed '{split}' data has been saved to: {output_path}")

Processed 'dev' data has been saved to: /kaggle/working/dev_sent_emo_cleaned_processed.csv
Processed 'test' data has been saved to: /kaggle/working/test_sent_emo_cleaned_processed.csv
Processed 'train' data has been saved to: /kaggle/working/train_sent_emo_cleaned_processed.csv
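
To check which special characters remain in a column before or after cleaning, one option is to count every character outside the ASCII range. The sketch below uses only the standard library; `count_non_ascii` and the sample strings are hypothetical and not part of the original notebook.

```python
from collections import Counter


def count_non_ascii(texts):
    """Count each non-ASCII character across an iterable of strings."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing values such as None or NaN
        counts.update(ch for ch in text if ord(ch) > 127)
    return counts


# Hypothetical sample data standing in for a text column.
sample_texts = ["plain text", "caf\u00e9", "It\u2019s fine\u00a0now", float("nan")]
non_ascii_counts = count_non_ascii(sample_texts)

for ch, n in non_ascii_counts.most_common():
    print(f"U+{ord(ch):04X} {ch!r}: {n}")
```

Because iterating over a pandas Series yields its values, the same function should also accept a column directly, for example `count_non_ascii(df['Utterance'])`.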
