## Language Detection

Download the dataset from https://www.kaggle.com/datasets/basilb2s/language-detection, extract the archive, and upload the CSV file to this notebook.

Credits: this notebook is lightly adapted from https://github.com/basil-b2s/Language-Detector


In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

import warnings
warnings.simplefilter("ignore")

In [None]:
import sklearn
sklearn.__version__

'1.0.2'

In [None]:
from google.colab import files 
uploaded = files.upload()

In [None]:
# Loading the dataset
data = pd.read_csv("Language Detection.csv")

In [None]:
data.head()

                                                Text  Language
0   Nature, in the broadest sense, is the natural...   English
1  "Nature" can refer to the phenomena of the phy...   English
2  The study of nature is a large, if not the onl...   English
3  Although humans are part of nature, human acti...   English
4  [1] The word nature is borrowed from the Old F...   English


In [None]:
data.shape

(10337, 2)

In [None]:
data.Language.value_counts()

English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: Language, dtype: int64

In [None]:
data.Language.value_counts(normalize=True)

English       0.133985
French        0.098094
Spanish       0.079230
Portugeese    0.071491
Italian       0.067524
Russian       0.066944
Sweedish      0.065396
Malayalam     0.057463
Dutch         0.052820
Arabic        0.051853
Turkish       0.045855
German        0.045468
Tamil         0.045371
Danish        0.041405
Kannada       0.035697
Greek         0.035310
Hindi         0.006095
Name: Language, dtype: float64

In [None]:
X = data["Text"]
y = data["Language"]

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
y

array([3, 3, 3, ..., 9, 9, 9])

In [None]:
from collections import Counter

In [None]:
Counter(y)

Counter({3: 1385,
         10: 594,
         7: 63,
         15: 469,
         11: 739,
         4: 1014,
         2: 546,
         13: 819,
         6: 365,
         12: 692,
         1: 428,
         8: 698,
         16: 474,
         14: 676,
         0: 536,
         5: 470,
         9: 369})

In [None]:
items = Counter(y)

In [None]:
items.most_common()

[(3, 1385),
 (4, 1014),
 (13, 819),
 (11, 739),
 (8, 698),
 (12, 692),
 (14, 676),
 (10, 594),
 (2, 546),
 (0, 536),
 (16, 474),
 (5, 470),
 (15, 469),
 (1, 428),
 (9, 369),
 (6, 365),
 (7, 63)]

In [None]:
le.classes_

array(['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
       'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish'], dtype=object)
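
The integer codes in `y` index into `le.classes_`, which is sorted alphabetically. As a minimal sketch on a toy label list (the name `le_demo` is illustrative and not part of the notebook), `inverse_transform` maps codes back to language names:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the notebook's Language column.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["English", "French", "Arabic", "English"])

# Classes are sorted alphabetically, so Arabic -> 0, English -> 1, French -> 2.
print(codes)

# inverse_transform turns integer codes back into the original strings.
print(le_demo.inverse_transform(codes))
```

The same call works on model predictions, which is an alternative to indexing `le.classes_` by hand.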

In [None]:
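# Clean the text: replace punctuation, digits and brackets with spaces, then lowercase.
# Note: the cleaned strings in data_list are not used by the cells below,
# which split and vectorize the raw X.
data_list = []
for text in X:
    text = re.sub(r'[!@#$(),\n"%^*?\:;~`0-9]', ' ', text)
    # Replace either square bracket ('[' or ']') with a space
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    data_list.append(text)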
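
The cleaning step can also be expressed as a single reusable function. This is only a sketch: `clean_text` and `_NOISE` are hypothetical names, and the pattern folds the punctuation, digit and square-bracket rules into one character class.

```python
import re

# One character class covering punctuation, digits and both square brackets.
_NOISE = re.compile(r'[!@#$(),\n"%^*?:;~`0-9\[\]]')

def clean_text(text: str) -> str:
    """Replace noisy characters with spaces, then lowercase the result."""
    return _NOISE.sub(" ", text).lower()

cleaned = clean_text("[1] The word Nature, borrowed in 1300!")
print(cleaned.split())
```

A function like this could be passed to `CountVectorizer(preprocessor=clean_text)` so that training and prediction share the same cleaning, though the notebook itself vectorizes the raw text.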

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
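
Because the classes are imbalanced (Hindi has only 63 rows), a stratified and seeded split may be worth considering. The sketch below uses toy data and an arbitrary seed of 42. It is not the split that produced the results in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced data standing in for the real Text/Language columns.
texts = [f"sample {i}" for i in range(100)]
labels = ["English"] * 90 + ["Hindi"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep class proportions the same in both parts
)

print(Counter(y_tr), Counter(y_te))
```

With `stratify`, the rare class keeps its 10% share in both the training and the test part instead of depending on the luck of the shuffle.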

In [None]:
# Create a bag-of-words representation with CountVectorizer.
# The vocabulary is learned from the training split only.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(X_train)

x_train = cv.transform(X_train).toarray()
x_test  = cv.transform(X_test).toarray()
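
Calling `.toarray()` materializes a dense matrix, which is convenient for inspection but memory-hungry for a vocabulary of this size. `MultinomialNB` also accepts the sparse matrix directly, as this toy sketch suggests (the sentences and the names `vec` and `clf` are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["good morning", "guten morgen", "good evening", "guten abend"]
train_labels = ["English", "German", "English", "German"]

vec = CountVectorizer()
X_sparse = vec.fit_transform(train_texts)  # SciPy sparse matrix, no .toarray()

clf = MultinomialNB()
clf.fit(X_sparse, train_labels)            # sparse input is accepted directly

print(X_sparse.shape)
print(clf.predict(vec.transform(["guten tag"])))
```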

In [None]:
X_train.values[0]

'Este tipo de aprendizaje usa un árbol de decisiones como modelo predictivo.'

In [None]:
X_train.values[1]

'als je nog niet op de abonneerknop en het belpictogram ernaast hebt gedrukt en blij bent met kijken.'

In [None]:
x_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
x_train[0]

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
x_train[0].shape

(34899,)

In [None]:
np.unique(x_train[0])

array([0, 1, 2])

In [None]:
np.unique(x_train[0], return_counts=True)

(array([0, 1, 2]), array([34888,    10,     1]))
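
The ten 1s and single 2 in that row can be traced back to the first training sentence. The sketch below approximates `CountVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters) with the standard library only. It is an illustration, not the vectorizer's actual code path.

```python
import re
from collections import Counter

sentence = ("Este tipo de aprendizaje usa un árbol de decisiones "
            "como modelo predictivo.")

# Approximate the default tokenizer: lowercase, then keep tokens
# made of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", sentence.lower())
token_counts = Counter(tokens)

# How many distinct tokens occur once, twice, ...
print(Counter(token_counts.values()))  # Counter({1: 10, 2: 1})
```

The token "de" appears twice, which accounts for the single 2. The other ten distinct tokens appear once each.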

In [None]:
X_train

5178    Este tipo de aprendizaje usa un árbol de decis...
4497    als je nog niet op de abonneerknop en het belp...
4969    Esta característica constituye una de sus mayo...
7622                                         Mi dispiace.
5491                           ¿Te importa si salto aquí?
                              ...                        
2210    ஏய் தோழர்களே இந்த வீடியோவில் எனது சேனலுக்கு வருக.
1051        well-mannered and cultured and sophisticated.
5486                     me temo que no estoy de acuerdo.
9250    عبارات إذا كنت تتعلم اللغة الإنجليزية لفترة من...
160     Species that were unable to adapt to the chang...
Name: Text, Length: 8269, dtype: object

In [None]:
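# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# get_feature_names_out() is its replacement.
list(cv.get_feature_names_out()[:5]), list(cv.get_feature_names_out()[-5:])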

(['00', '000', '000க', '001', '001097666virtual'],
 ['ൽപ', 'ൽവർ', 'ൾക', 'ൾപ', 'ൾമ'])

In [None]:
len(cv.get_feature_names_out())

34899

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)

MultinomialNB()

In [None]:
# the transformed features, x_test
x_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
x_test[0]

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
Counter(x_test[0])

Counter({0: 34895, 1: 4})

In [None]:
model.classes_

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [None]:
y_pred = model.predict(x_test)

In [None]:
y_pred

array([ 6, 12,  4, ..., 10,  4, 16])

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

In [None]:
print("Accuracy is :",ac)

Accuracy is : 0.9743713733075435


In [None]:
cm

array([[104,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,  89,   0,   3,   1,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   4,   0,   0],
       [  0,   0, 100,   2,   1,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0, 280,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   2, 215,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   2,   0,  78,   0,   0,   0,   0,   0,   0,   0,
          0,   1,   0,   0],
       [  0,   0,   0,   2,   0,   0,  88,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  10,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   2,   0,   0,   0,   0, 138,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   6,   0,   0,   0,   0,   0,  68,   0,   0,   0,
         

In [None]:
print(cr)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       104
           1       1.00      0.92      0.96        97
           2       1.00      0.97      0.99       103
           3       0.88      1.00      0.93       280
           4       0.99      0.99      0.99       217
           5       1.00      0.96      0.98        81
           6       1.00      0.98      0.99        90
           7       1.00      1.00      1.00        10
           8       0.99      0.99      0.99       140
           9       1.00      0.92      0.96        74
          10       1.00      0.98      0.99       124
          11       0.99      0.97      0.98       145
          12       1.00      0.96      0.98       134
          13       0.98      0.97      0.98       154
          14       0.97      0.98      0.97       143
          15       1.00      1.00      1.00        77
          16       1.00      0.93      0.96        95

    accuracy              
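
The report lists classes by integer code. A sketch of how language names could be shown instead, using toy predictions and a hypothetical `le_demo` encoder rather than the notebook's fitted objects:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy encoder and predictions; in the notebook these would be le, y_test, y_pred.
le_demo = LabelEncoder().fit(["Arabic", "Danish", "Dutch"])
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

report = classification_report(
    y_true, y_hat,
    labels=list(range(len(le_demo.classes_))),  # codes 0..n-1, in encoder order
    target_names=list(le_demo.classes_),        # names shown instead of codes
)
print(report)
```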

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vectorizer', cv), ('multinomialNB', model)])
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('multinomialNB', MultinomialNB())])

In [None]:
y_pred2 = pipe.predict(X_test)
ac2 = accuracy_score(y_test, y_pred2)
print("Accuracy is :",ac2)

Accuracy is : 0.9743713733075435


In [None]:
with open('trained_pipeline-0.1.0.pkl','wb') as f:
    pickle.dump(pipe, f)
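
To check that a saved pipeline can be restored, it can be loaded back with `pickle.load`. The following round trip is a sketch on a toy pipeline written to a temporary directory. The file name `demo_pipeline.pkl` is made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy pipeline with the same two steps as the notebook's pipe.
demo_pipe = Pipeline([("vectorizer", CountVectorizer()),
                      ("multinomialNB", MultinomialNB())])
demo_pipe.fit(["good morning", "guten morgen"], ["English", "German"])

path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_pipe, f)

# Load the pipeline back and use it exactly like the original object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["guten morgen"]))
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that created them (1.0.2 in this notebook).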

In [None]:
# To zip the pickled pipeline before downloading it, uncomment this command:

# !zip -r ./trained_pipeline-0.1.0.pkl.zip ./trained_pipeline-0.1.0.pkl

In [None]:
# text = "Hello, how are you?"
# text = "Ciao, come stai?"
text = "Hallo, guten tag!"
# text = "Hi, how are you?"
# Use a new name so the encoded label array `y` is not overwritten
pred = pipe.predict([text])
le.classes_[pred[0]], pred

('German', array([5]))

In [None]:
pred

array([5])

In [None]:
pipe.predict([text])

array([5])

In [None]:
cv.transform([text])

<1x34899 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [None]:
cv.transform([text]).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [None]:
cv.transform([text]).toarray()[0]

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
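# Transformed text
text_transformed = cv.transform([text]).toarray()
# Each nonzero entry is a distinct token of the text that exists in the fitted
# vocabulary; out-of-vocabulary words are dropped, which is why this
# three-word text yields only two nonzero entries.
Counter(text_transformed[0])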

Counter({0: 34897, 1: 2})

In [None]:
text_transformed.shape

(1, 34899)

In [None]:
model.predict(text_transformed)

array([5])

In [None]:
predicted_label_index = model.predict(text_transformed)
predicted_label_index

array([5])

In [None]:
le.classes_[predicted_label_index][0]

'German'
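
The manual steps above (vectorize, predict, look up the class name) could be wrapped in one helper. This is a sketch under assumptions: `predict_language` is a hypothetical function, and the toy pipeline is trained on four made-up sentences rather than the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

texts = ["hello how are you", "good morning friend",
         "hallo wie geht es", "guten morgen freund"]
langs = ["English", "English", "German", "German"]

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(langs)

demo_pipe = Pipeline([("vectorizer", CountVectorizer()),
                      ("multinomialNB", MultinomialNB())])
demo_pipe.fit(texts, y_demo)

def predict_language(text, pipeline, encoder):
    """Return (language name, probability) for a single string."""
    proba = pipeline.predict_proba([text])[0]
    best = proba.argmax()                   # position of the most likely class
    code = pipeline.classes_[best]          # integer label at that position
    name = encoder.inverse_transform([code])[0]
    return name, float(proba[best])

name, p = predict_language("guten tag", demo_pipe, le_demo)
print(name, round(p, 3))
```

Returning the probability alongside the name makes it easier to spot low-confidence predictions, for example on very short inputs or on text with mostly out-of-vocabulary words.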