# Initialization

In [1]:
import pandas as pd
import numpy as np

In [2]:
RANDOM_STATE = 0
TEST_SIZE = 0.15 # 15%

# Language detection dataset

In [3]:
train_language_detection_dataset_path = "../../data/processed/train_language_detection_dataset.csv"
test_language_detection_dataset_path = "../../data/processed/test_language_detection_dataset.csv"

## Reading

In [4]:
language_detection_df = pd.read_csv("../../data/raw/language_detection_dataset.csv")
language_detection_df.head()

Unnamed: 0.1,Unnamed: 0,id,lan_code,sentence
0,235,243,rus,Один раз в жизни я делаю хорошее дело... И оно...
1,1232,1276,eng,Let's try something.
2,1233,1277,eng,I have to go to sleep.
3,1235,1280,eng,Today is June 18th and it is Muiriel's birthday!
4,1236,1282,eng,Muiriel is 20 now.


In [5]:
language_detection_df.groupby("lan_code").count()

lan_code,Unnamed: 0,id,sentence
eng,1586621,1586621,1586621
rus,909951,909951,909951
ukr,178269,178269,178269


As the counts show, the classes are strongly imbalanced: English accounts for roughly 59% of the sentences, Russian for about 34%, and Ukrainian for under 7%. We split the dataset into train and test sets now and address the class imbalance later, when comparing different models and approaches.
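To put a number on the imbalance, here is a minimal sketch that recomputes the class shares from the counts printed above. The counts are copied by hand from the `groupby` output; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Per-language sentence counts, copied from the groupby output above.
counts = pd.Series({"eng": 1586621, "rus": 909951, "ukr": 178269}, name="sentences")

# Share of each language in the full dataset.
shares = counts / counts.sum()

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

On the real DataFrame, `language_detection_df["lan_code"].value_counts(normalize=True)` should yield the same shares without hard-coding the counts.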

## Splitting

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Features: the raw sentences; target: the language codes
X = language_detection_df[["sentence"]]
y = language_detection_df[["lan_code"]]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_STATE, stratify=y
)
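Passing `stratify=y` asks `train_test_split` to keep the per-language proportions roughly equal in both splits, which matters here because Ukrainian is a small minority. The sketch below illustrates this on a toy DataFrame; the data and the `toy_*` names are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical data, same column names).
toy_df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(200)],
    "lan_code": ["eng"] * 120 + ["rus"] * 60 + ["ukr"] * 20,
})

# Same split settings as in the notebook, stratified on the language code.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.15, shuffle=True, random_state=0, stratify=toy_df["lan_code"]
)

# Compare class shares in the two splits; they should be nearly identical.
train_shares = toy_train["lan_code"].value_counts(normalize=True)
test_shares = toy_test["lan_code"].value_counts(normalize=True)
print(pd.DataFrame({"train": train_shares, "test": test_shares}).round(3))
```

The same comparison can be run on `y_train` and `y_test` to confirm that the real split preserved the class shares.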

## Saving

In [9]:
train_language_detection_df = pd.concat([X_train, y_train], axis=1)
train_language_detection_df

Unnamed: 0,sentence,lan_code
1600726,"Merry Christmas, Tatoeba!",eng
710564,"Крайне важно, чтобы мы поговорили с Томом.",rus
2348832,Она была весела.,rus
273136,Urban sprawl is said to be a major contributor...,eng
1059358,Только не надо делать большие глаза.,rus
...,...,...
10721,Imagine that you have a wife.,eng
488826,Do you do that often?,eng
1173856,Том говорив з Мері?,ukr
998837,"Вбросы, ""карусели"", подкуп избирателей - лишь ...",rus


In [10]:
test_language_detection_df = pd.concat([X_test, y_test], axis=1)
test_language_detection_df

Unnamed: 0,sentence,lan_code
306214,She suspected that it was too late.,eng
2090931,Я порой бываю рассеян.,rus
2225400,Он мог бы победить.,rus
470542,"Том мог быть не таким счастливым, каким прикид...",rus
2272624,"Do not forsake me, oh my darling.",eng
...,...,...
1688367,Что нам следует съесть в первую очередь?,rus
2576250,"А Вы не так просты, как кажется.",rus
1799153,Sami's toilet is really dirty.,eng
599369,Он играет в Монополию.,rus


In [11]:
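# Let pandas open the file itself: a manually opened text handle without
# newline="" can produce blank lines between rows on Windows.
train_language_detection_df.to_csv(
    train_language_detection_dataset_path, index=False, encoding="utf-8"
)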

In [12]:
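# Let pandas open the file itself: a manually opened text handle without
# newline="" can produce blank lines between rows on Windows.
test_language_detection_df.to_csv(
    test_language_detection_dataset_path, index=False, encoding="utf-8"
)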
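Because the splits are written with `index=False`, the shuffled row labels from the original dataset are not stored, and reading a file back gives a fresh 0-based index. A minimal round-trip sketch, using a tiny hypothetical frame and a temporary directory instead of the real paths:

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical frame standing in for the train split.
sample_df = pd.DataFrame(
    {
        "sentence": ["Merry Christmas, Tatoeba!", "Она была весела.", "Том говорив з Мері?"],
        "lan_code": ["eng", "rus", "ukr"],
    },
    index=[1600726, 2348832, 1173856],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    # Write without the index, then read the file back.
    sample_df.to_csv(path, index=False, encoding="utf-8")
    restored_df = pd.read_csv(path, encoding="utf-8")

# Columns and text survive the round trip; the original row labels do not.
print(list(restored_df.columns))
print(restored_df.index.tolist())
```

If the original row labels are needed later, for example to trace a sentence back to the raw file, one option is to write the index as well and restore it with `index_col=0` when reading.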