# Effects of preprocessing

## Preprocessing for LID
Several preprocessing steps are viable for the language identification (LID) task. The survey [1], for example, mentions the following:

- **Case folding**: Convert all characters to lowercase.
- **Range compression**: Group a range of characters into a single logical set to reduce sparsity, which is especially useful for languages with large character sets such as Chinese.
- **Noise removal**: Remove digits, punctuation, special characters, and language-independent strings (such as URLs and email addresses). This is mostly done using heuristics.
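
To make the first two steps concrete, here is a minimal sketch of case folding followed by range compression. The range table and the placeholder characters are illustrative assumptions of ours and are not taken from [1] or [3]; a realistic table would need many more hand-crafted entries.

```python
# Hypothetical range table: (first code point, last code point, placeholder).
# Both the ranges and the placeholders are assumptions for illustration only.
RANGES = [
    (0x4E00, 0x9FFF, "\u4e00"),  # CJK Unified Ideographs -> one representative
    (0xAC00, 0xD7A3, "\uac00"),  # Hangul syllables -> one representative
]


def compress_ranges(text: str) -> str:
    """Map every character inside a listed range to that range's placeholder."""
    out = []
    for ch in text:
        cp = ord(ch)
        for lo, hi, placeholder in RANGES:
            if lo <= cp <= hi:
                out.append(placeholder)
                break
        else:
            # Character is outside all listed ranges: keep it unchanged
            out.append(ch)
    return "".join(out)


def fold_and_compress(text: str) -> str:
    # Case folding first, then range compression
    return compress_ranges(text.lower())


print(fold_and_compress("Hello 世界"))  # hello 一一
```

Every ideograph collapses to the same symbol, so a character uni-gram model sees one frequent feature instead of thousands of rare ones.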

Other common NLP preprocessing steps, however, may not suit the task. Most of them are normalization techniques:

- **Removing stop words and diacritics**: As [2] points out, stop words and diacritics are language-specific and useful for the LID task.
- **Lemmatization**: Relies on understanding a word's base form, which depends on grammar, morphology, and irregular forms.
- **Stemming**: Applies heuristic rules to chop off word endings, but these rules are language-dependent.
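
The following sketch illustrates why stripping diacritics can hurt LID. It uses only the standard library; the helper name `strip_diacritics` and the sample words are our own choices for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


samples = {"Spanish": "año", "French": "été", "Romanian": "țară"}
for lang, word in samples.items():
    print(lang, word, "->", strip_diacritics(word))
```

After stripping, all three words consist of plain ASCII letters only, so the characters that set these languages apart are gone.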

Language-agnostic approaches to these normalization techniques typically rely on rule-based heuristics and become impractical for a large number of languages. Alternatively, one might use statistical, embedding-based, or neural methods to learn word structures across languages. However, this would leave the realm of preprocessing for classical ML methods and enter the domain of deep learning.

## Khan's WiLi-2018 subset
As data exploration already showed, Khan's WiLi-2018 subset is already preprocessed: the text is lowercased and some noise removal has been applied. As the dataset's name suggests, it is tailored to the LID task, so we expect no significant performance improvement from further preprocessing. Nevertheless, we at least validate that the lowercase assumption holds for the entire dataset. Noise removal is not easily validated because it is not obvious what kind of noise removal was applied.
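
Although the applied noise removal cannot be reconstructed, one can at least probe which kinds of candidate noise are still present. The sketch below is a rough heuristic under our own assumptions: the pattern set and the helper `noise_report` are hypothetical, and the toy strings stand in for the dataset's `text` column.

```python
import re

# Candidate noise patterns; which of these the dataset authors removed is unknown.
NOISE_PATTERNS = {
    "url": re.compile(r"http\S+|www\.\S+"),
    "email": re.compile(r"\S+@\S+\.\S+"),
    "digit": re.compile(r"\d"),
    "punctuation": re.compile(r"[^\w\s]"),
}


def noise_report(texts):
    """Return, per pattern, the fraction of texts containing at least one match."""
    texts = list(texts)
    if not texts:
        return {name: 0.0 for name in NOISE_PATTERNS}
    return {
        name: sum(bool(pattern.search(t)) for t in texts) / len(texts)
        for name, pattern in NOISE_PATTERNS.items()
    }


toy = ["visit www.example.org today", "plain lowercase text", "call 112 now!"]
print(noise_report(toy))
```

A fraction close to zero for a pattern suggests that this kind of noise was already removed or never occurred; it does not prove which of the two is the case.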

In [1]:
import pandas as pd

from langlens.data import _clean_data

# Load the dataset
df = pd.read_csv("../data/wili_subset_khan.csv")
df = _clean_data(df)

In [2]:
def is_lowercased(text):
    return text == text.lower()


is_all_lowercased = df['text'].apply(is_lowercased).all()
is_all_lowercased

np.True_

The dataset has been properly lowercased. Let us evaluate its performance using a simple Naive Bayes classifier on character uni-grams.

In [3]:
! python ../langlens/main.py baseline --dataset-path ../data/wili_subset_khan.csv --n-gram-type char --vocab-size 512

2025-02-20 08:56:46,291 - langlens.__main__ - INFO - Loading dataset from ../data/wili_subset_khan.csv
2025-02-20 08:56:46,445 - langlens.__main__ - INFO - Training dataset size: 17599, Val/Test dataset size: 2200
2025-02-20 08:56:46,445 - langlens.__main__ - INFO - Vectorizing and normalizing data...
2025-02-20 08:56:49,933 - langlens.langlens.baseline.vectorize - INFO - The vocabulary covers 96.04% of the training data.
2025-02-20 08:56:49,984 - langlens.__main__ - INFO - Training classifier...
2025-02-20 08:56:50,038 - langlens.__main__ - INFO - Validation set performance:
2025-02-20 08:56:50,052 - langlens.langlens.evaluation - INFO -               precision    recall  f1-score   support

      Arabic       1.00      0.99      0.99        97
     Chinese       1.00      0.97      0.99       102
       Dutch       0.98      0.89      0.93        93
     English       0.63      0.99      0.77        96
    Estonian       0.98      0.95      0.96       101
      French       0.95     

We observe a validation set accuracy of 0.96 and an F1 score of 0.97. This is a good result for a simple baseline model.
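
For readers unfamiliar with the baseline, the following is a minimal pure-Python sketch of a multinomial Naive Bayes classifier on character uni-grams. It is not the `langlens` implementation: the class `CharUnigramNB` is hypothetical, it uses add-one smoothing as an assumption, and it omits the vocabulary truncation controlled by `--vocab-size`.

```python
import math
from collections import Counter, defaultdict


class CharUnigramNB:
    """Multinomial Naive Bayes over character counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.char_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.char_counts[label].update(text)  # counts single characters
        self.vocab = set().union(*self.char_counts.values())
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.char_counts.items():
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / n_docs)  # log prior
            for ch in text:
                if ch in self.vocab:  # ignore characters unseen in training
                    score += math.log((counts[ch] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = CharUnigramNB().fit(
    ["the cat sat on the mat", "η γάτα κάθεται στο χαλί"],
    ["English", "Greek"],
)
print(model.predict("καλημέρα"))  # Greek
print(model.predict("that hat"))  # English
```

Even single characters separate languages with different scripts almost perfectly; languages that share an alphabet are harder to tell apart at the uni-gram level.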

## Introducing more preprocessing
Let us try to improve on this by removing more "noise" from the dataset.

In [5]:
import re


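def remove_noise(text):
    """Strip URLs, emails, digits, punctuation, and redundant whitespace."""
    # Remove URLs (must run before punctuation removal, which would break them up)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove emails
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove punctuation and special characters (excluding spaces).
    # Note: \w keeps Unicode letters and underscores but not combining marks,
    # so decomposed diacritics and vowel signs are stripped here as well.
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse whitespace runs and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text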


# Apply preprocessing to the 'text' column
df_noise_free = df.copy()
df_noise_free['text'] = df_noise_free['text'].apply(remove_noise)

# Save the cleaned dataset
df_noise_free.to_csv("../data/wili_subset_khan_cleaned.csv", index=False)
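
One side effect of a punctuation pattern like `[^\w\s]` deserves a quick check. In Python's `re` module, `\w` matches letters, digits, and the underscore, but not combining marks. The sketch below, with sample strings of our own choosing, shows that precomposed accented letters survive while decomposed accents and Arabic vowel marks are removed.

```python
import re
import unicodedata

PUNCT = re.compile(r"[^\w\s]")

precomposed = unicodedata.normalize("NFC", "café")  # 'é' is one code point
decomposed = unicodedata.normalize("NFD", "café")   # 'e' + combining acute accent

print(PUNCT.sub("", precomposed) == precomposed)  # True: the accent is kept
print(PUNCT.sub("", decomposed))                  # cafe: the accent is lost

# Arabic letters kaf, ta, ba, each followed by the vowel mark fatha (U+064E)
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
print(len(vocalized), len(PUNCT.sub("", vocalized)))  # 6 3
```

Whether this matters depends on how the texts are encoded; normalizing to NFC before cleaning would at least protect accents that have a precomposed form.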

The dataset has been further cleaned. Let us evaluate its performance using a simple Naive Bayes classifier on character uni-grams.

In [6]:
! python ../langlens/main.py baseline --dataset-path ../data/wili_subset_khan_cleaned.csv --n-gram-type char --vocab-size 512

2025-02-20 08:59:22,152 - langlens.__main__ - INFO - Loading dataset from ../data/wili_subset_khan_cleaned.csv
2025-02-20 08:59:22,305 - langlens.__main__ - INFO - Training dataset size: 17599, Val/Test dataset size: 2200
2025-02-20 08:59:22,305 - langlens.__main__ - INFO - Vectorizing and normalizing data...
2025-02-20 08:59:25,549 - langlens.langlens.baseline.vectorize - INFO - The vocabulary covers 96.36% of the training data.
2025-02-20 08:59:25,604 - langlens.__main__ - INFO - Training classifier...
2025-02-20 08:59:25,649 - langlens.__main__ - INFO - Validation set performance:
2025-02-20 08:59:25,662 - langlens.langlens.evaluation - INFO -               precision    recall  f1-score   support

      Arabic       0.88      0.99      0.93        97
     Chinese       1.00      0.96      0.98       102
       Dutch       0.98      0.88      0.93        93
     English       0.63      0.99      0.77        96
    Estonian       0.97      0.95      0.96       101
      French       0

We observe a validation set accuracy and an F1 score of 0.96. This is not significantly different from the previous result; if anything, it is slightly worse. We refrain from performing range compression because it requires many hand-crafted rules, as described in [3].

In [2]:
import os

# Clean up
os.remove("../data/wili_subset_khan_cleaned.csv")

# References
1. T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, and K. Lindén, “Automatic Language Identification in Texts: A Survey,” 2019.
2. C.-O. Truică, J. Velcin, and A. Boicea, “Automatic Language Identification for Romance Languages using Stop Words and Diacritics,” Jun. 2018, doi: 10.1109/SYNASC.2015.45.
3. A. Simões, J. J. Almeida, and S. D. Byers, “Language identification: A neural network approach,” in OpenAccess Series in Informatics, Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing, 2014, pp. 251–265. doi: 10.4230/OASIcs.SLATE.2014.251.


