# Initialization

In [2]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd

import re

In [3]:
language_detection_dataset_save_path = "data/raw/language_detection_dataset.csv"

# Download datasets

## Language detection dataset

In [4]:
temp_language_detection_dataset_path = kagglehub.dataset_download(
    "chazzer/big-language-detection-dataset",
    "sentences.csv"
)



### Overview

In [5]:
language_detection_df = pd.read_csv(temp_language_detection_dataset_path, compression="zip", encoding="utf-8")
language_detection_df.head()

Unnamed: 0,id,lan_code,sentence
0,1,cmn,我們試試看！
1,2,cmn,我该去睡觉了。
2,3,cmn,你在干什麼啊？
3,4,cmn,這是什麼啊？
4,5,cmn,今天是６月１８号，也是Muiriel的生日！


In [6]:
len(language_detection_df)

10341812

In [7]:
print(language_detection_df[language_detection_df["sentence"].str.len() > 2000].head(n=1)["sentence"].iloc[0])

 Olabilir miydi...? Dima merak etti. Sonunda doğru Al-Sayib aldım mı?
1041315	oci	Bon aniversari, Muirièl !
1041316	tok	tenpo kulupu musi li musi pi mute ala.
1041317	jpn	投票日は雨の降る寒い日だった。
1041318	oci	Vas véser la diferéncia.
1041319	tur	Neredeyse bitti.
1041320	oci	Qu'èi hèit petassar la mea bicicleta peu men hrair.
1041321	oci	Dinc a quan demoraratz au Japon ?
1041322	spa	Su consejo fue muy útil.
1041323	tur	O, çok az değerlidir.
1041324	oci	Que's hè vièlh, mès qu'ei hens ua fòrma com n'èra pas jamei abans.
1041325	por	O seu conselho foi muito útil.
1041326	oci	Qu'a compausat l'istòria.
1041327	tur	O Pochi'nin yiyeceğidir..
1041328	oci	Òm pòt odiar l'anglés, aver un chafre en anglés e pretengue's san d'esperit.
1041329	oci	Que m'escapèi fin finau.
1041330	oci	Quin hès entad estar tan longanha !
1041331	tur	O, otelden çok uzakta değildir.
1041332	spa	No tendría que haberte dicho nada.
1041333	oci	Ne sabèva pas quin devèva exprimí's.
1041334	tur	O bir ayçiçeği.
1041335	oci	Los tapís anti

In [8]:
language_detection_df[language_detection_df["lan_code"] == "cycl"].head()

Unnamed: 0,id,lan_code,sentence
409868,427922,cycl,(behaviorCapable-PerformedBy Madonna Singing-H...
409876,427932,cycl,(numberOfInhabitants CityOfAucklandNZ 1000000)
409883,427940,cycl,(isa CityOfAucklandNewZealand (CityInCountryFn...
409905,427962,cycl,(mostNotableIsa Batman Superhero)
409914,427971,cycl,(likesAsFriend Batman Robin-BatmanSidekick)


As we can see, the dataset appears to be broken: a single `sentence` value can contain many consecutive rows merged together. According to https://www.kaggle.com/datasets/chazzer/big-language-detection-dataset?select=lan_to_language.json, each row has the format `number`, `three-letter lowercase language code`, `sentence`. We can use this information to write our own dataset parser and (probably) fix some of these issues.

However, as the code block above shows (and as a careful inspection of https://www.kaggle.com/datasets/chazzer/big-language-detection-dataset?select=lan_to_language.json confirms), the dataset contains one language code that is not a three-letter ISO 639-3 code: `cycl`. We need to handle it separately.
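
One way to spot such codes without reading the mapping by hand is to filter for values that are not exactly three lowercase letters. The snippet below is a minimal sketch on a made-up `sample_df`, not the real dataset; only the `lan_code` column name comes from this notebook.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these rows are illustrative only.
sample_df = pd.DataFrame(
    {
        "lan_code": ["eng", "rus", "cycl", "ukr", "cycl"],
        "sentence": [
            "Hello.",
            "Привет.",
            "(mostNotableIsa Batman Superhero)",
            "Привіт.",
            "(likesAsFriend Batman Robin-BatmanSidekick)",
        ],
    }
)

# True where the code is exactly three lowercase ASCII letters.
is_three_letter = sample_df["lan_code"].str.fullmatch(r"[a-z]{3}")

# Count every code that does not fit the three-letter shape.
unusual_codes = sample_df.loc[~is_three_letter, "lan_code"].value_counts()
print(unusual_codes)
```

Running the same filter on the full DataFrame should list every code the row-start regex has to account for explicitly.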

### Custom parser

In [9]:
def try_load_broken_language_detection_dataset(file_path, compression='zip'):
    data = []

    # A new row starts with: a number with at least 1 digit, a divider,
    # a 3-letter language code or "cycl", and another divider.
    # The divider is a comma or a tab (no "|" inside the character class,
    # where it would match a literal pipe).
    new_row_pattern = re.compile(r'^\d+[,\t](?:[a-z]{3}|cycl)[,\t]')

    with pd.io.common.get_handle(file_path, 'r', compression=compression, encoding='utf-8') as f:
        next(f.handle)  # Our dataset has a header, so we need to skip it

        current_row = None

        for line in f.handle:
            line = line.rstrip('\n')

            if new_row_pattern.match(line):
                if current_row:
                    data.append(current_row)

                # The pattern guarantees two dividers, so this always yields
                # exactly 3 parts (the sentence may be an empty string).
                current_row = re.split(r'[,\t]', line, maxsplit=2)
            else:  # Continuation of a broken row: append it to the current sentence
                if current_row:
                    current_row[2] += " " + line

        if current_row:
            data.append(current_row)

    return pd.DataFrame(data, columns=['id', 'lan_code', 'sentence'])
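
To make the merging behaviour easier to follow, here is a small sketch of the same idea applied to an in-memory list of lines instead of a zipped file. The helper name `merge_continuation_lines` and the sample lines are hypothetical and exist only for this illustration.

```python
import re

# Row start: digits, divider (comma or tab), 3-letter code or "cycl", divider.
ROW_START = re.compile(r"^\d+[,\t](?:[a-z]{3}|cycl)[,\t]")


def merge_continuation_lines(lines):
    """Group raw lines into [id, lan_code, sentence] rows.

    A line that does not look like a row start is treated as the
    continuation of the previous row's sentence.
    """
    rows = []
    current = None
    for line in lines:
        if ROW_START.match(line):
            if current is not None:
                rows.append(current)
            current = re.split(r"[,\t]", line, maxsplit=2)
        elif current is not None:
            current[2] += " " + line
    if current is not None:
        rows.append(current)
    return rows


# Made-up sample: the second row is split across two physical lines.
sample_lines = [
    "1,eng,Hello there.",
    "2,eng,This sentence was split",
    "across two lines.",
    "3\tcycl\t(mostNotableIsa Batman Superhero)",
]
rows = merge_continuation_lines(sample_lines)
for row in rows:
    print(row)
```

Note that a continuation line which happens to start with something like `12,abc,` would be mistaken for a new row, so this approach is a heuristic rather than a guaranteed repair.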

In [10]:
temp_df = try_load_broken_language_detection_dataset(temp_language_detection_dataset_path, "zip")

In [11]:
temp_df.head()

Unnamed: 0,id,lan_code,sentence
0,1,cmn,我們試試看！\r
1,2,cmn,我该去睡觉了。\r
2,3,cmn,你在干什麼啊？\r
3,4,cmn,這是什麼啊？\r
4,5,cmn,今天是６月１８号，也是Muiriel的生日！\r


As we can see, every row now ends with `\r`. We can fix that as well.

In [12]:
temp_df["sentence"] = temp_df["sentence"].str.rstrip('\r')
temp_df.head()

Unnamed: 0,id,lan_code,sentence
0,1,cmn,我們試試看！
1,2,cmn,我该去睡觉了。
2,3,cmn,你在干什麼啊？
3,4,cmn,這是什麼啊？
4,5,cmn,今天是６月１８号，也是Muiriel的生日！
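
The trailing `\r` presumably comes from Windows-style (CRLF) line endings in the source file; that is an assumption, not something verified here. The sketch below shows, on a single made-up line, why stripping only `\n` leaves the carriage return behind.

```python
# A line as it might be stored with CRLF line endings (illustrative).
raw_line = "1,cmn,我們試試看！\r\n"

# Stripping only "\n" leaves the carriage return in place.
only_newline_stripped = raw_line.rstrip("\n")

# Stripping both characters removes the whole line terminator.
both_stripped = raw_line.rstrip("\r\n")

print(repr(only_newline_stripped))
print(repr(both_stripped))
```

Under that assumption, stripping `"\r\n"` while reading would make the separate cleanup step unnecessary.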


In [13]:
len(temp_df)

10358183

Now we have 10358183 rows of (hopefully) non-corrupted data instead of 10341812 rows with corrupted data!
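
A rough sanity check for leftover damage is to look for sentences that still contain something shaped like an embedded row. The sketch below uses a made-up `sample_df`; the pattern and column names mirror the ones used in this notebook, and the check can produce false positives on ordinary text.

```python
import pandas as pd

# Same row shape as the parser's pattern, but allowed anywhere in the text.
EMBEDDED_ROW = r"\d+[,\t](?:[a-z]{3}|cycl)[,\t]"

# Toy data: the second sentence still has a row glued onto it.
sample_df = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "lan_code": ["eng", "tur", "oci"],
        "sentence": [
            "Hello there.",
            "Neredeyse bitti. 1041320\toci\tVas véser la diferéncia.",
            "Bon aniversari, Muirièl !",
        ],
    }
)

# Rows whose sentence contains an id/code/divider sequence.
suspicious = sample_df[sample_df["sentence"].str.contains(EMBEDDED_ROW, regex=True)]
print(suspicious)
```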

In [14]:
language_detection_df = temp_df

### Saving dataset locally

We'll save only the rows whose `lan_code` is `eng`, `rus` or `ukr` (the exact mapping is described at https://www.kaggle.com/datasets/chazzer/big-language-detection-dataset?select=lan_to_language.json), as these are the only languages we're interested in.

In [15]:
language_detection_df = language_detection_df[
    language_detection_df["lan_code"].isin(["eng", "rus", "ukr"])
]
language_detection_df.groupby("lan_code").count()

lan_code,id,sentence
eng,1588752,1588752
rus,911848,911848
ukr,178588,178588


In [16]:
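# Write-only mode is enough here; newline="" lets pandas control the line endings.
# The target directory (data/raw) must already exist.
with open(language_detection_dataset_save_path, "w", encoding="utf-8", newline="") as f:
    language_detection_df.to_csv(f, index=False)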
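
To confirm that a saved file can be read back unchanged, a simple round-trip check can help. The sketch below writes a made-up `sample_df` to a temporary file rather than to the real save path; reading with `dtype=str` keeps the `id` column as text, matching what the custom parser produces.

```python
import os
import tempfile

import pandas as pd

# Toy data; one sentence contains a comma to exercise CSV quoting.
sample_df = pd.DataFrame(
    {
        "id": ["1", "2"],
        "lan_code": ["eng", "ukr"],
        "sentence": ["Hello, world!", "Привіт!"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    sample_df.to_csv(csv_path, index=False, encoding="utf-8")
    # dtype=str prevents pandas from converting the ids to integers.
    reloaded_df = pd.read_csv(csv_path, dtype=str, encoding="utf-8")

# Compare cell values and column names after the round trip.
round_trip_ok = (
    reloaded_df.columns.tolist() == sample_df.columns.tolist()
    and reloaded_df.values.tolist() == sample_df.values.tolist()
)
print(round_trip_ok)
```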

# Aftermath

Once you're done, I advise manually clearing the kagglehub cache at ~/.cache/kagglehub/ (see https://github.com/Kaggle/kagglehub/blob/main/README.md) if you're using the default location.
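
If you prefer to do this from Python, a minimal sketch could look like the following. The helper `clear_kagglehub_cache` is hypothetical (it is not part of kagglehub), and it assumes the cache lives at the default location mentioned above. It deletes the whole directory, so double-check the path before calling it.

```python
import shutil
from pathlib import Path


def clear_kagglehub_cache(cache_dir=Path.home() / ".cache" / "kagglehub"):
    """Delete cache_dir and everything in it.

    Returns True if a directory was removed, False if it did not exist.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_kagglehub_cache()` with no arguments targets the default location; pass a different path if your cache is stored elsewhere.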