<a href="https://colab.research.google.com/github/gabrielnichio/language-detector/blob/main/language_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Detector model

NLP project that trains a model to detect the language of a text, using TF-IDF features and logistic regression.

Dataset used: https://www.kaggle.com/datasets/basilb2s/language-detection

In [38]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle

In [19]:
data = pd.read_csv('Language Detection.csv')
data.head()

| | Text | Language |
|---|---|---|
| 0 | Nature, in the broadest sense, is the natural... | English |
| 1 | "Nature" can refer to the phenomena of the phy... | English |
| 2 | The study of nature is a large, if not the onl... | English |
| 3 | Although humans are part of nature, human acti... | English |
| 4 | [1] The word nature is borrowed from the Old F... | English |


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10337 entries, 0 to 10336
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      10337 non-null  object
 1   Language  10337 non-null  object
dtypes: object(2)
memory usage: 161.6+ KB


In [21]:
le = LabelEncoder()
le.fit(data['Language'])
data['Language'] = le.transform(data['Language'])

In [22]:
data.head()

| | Text | Language |
|---|---|---|
| 0 | Nature, in the broadest sense, is the natural... | 3 |
| 1 | "Nature" can refer to the phenomena of the phy... | 3 |
| 2 | The study of nature is a large, if not the onl... | 3 |
| 3 | Although humans are part of nature, human acti... | 3 |
| 4 | [1] The word nature is borrowed from the Old F... | 3 |
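
The English rows now carry the code 3 because `LabelEncoder` sorts the unique labels and assigns each one its position in that sorted order. The sketch below illustrates this on a hypothetical handful of language names (not the full label set of the dataset); `le_demo`, `languages`, and `mapping` are names made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of language names, for illustration only
languages = ["English", "Danish", "Arabic", "Dutch", "English"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(languages)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {str(name): code for code, name in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original names from the codes
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

In the notebook itself, `le.classes_` gives the same lookup for the real labels.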


## TF-IDF to vectorize the data

In [23]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(data['Text'])
y = data['Language']

In [24]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
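
One thing to note about the cells above: the vectorizer is fitted on the full dataset before the split, so its vocabulary and IDF weights also see the test rows. A common alternative is to split the raw text first and fit the vectorizer on the training part only. The following is a minimal sketch of that order of operations, assuming a tiny made-up corpus in place of `data['Text']` and `data['Language']`; all `*_demo` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for data['Text'] and data['Language']
texts = [
    "hello my name is anna", "good morning to you", "where is the station",
    "hola mi nombre es ana", "buenos dias a todos", "donde esta la estacion",
]
labels = ["English", "English", "English", "Spanish", "Spanish", "Spanish"]

# Split the raw text first, so the test rows stay unseen by the vectorizer
text_train, text_test, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=2, random_state=42, stratify=labels
)

tfidf_demo = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training rows only
x_train_demo = tfidf_demo.fit_transform(text_train)
# transform reuses them; words never seen in training are ignored
x_test_demo = tfidf_demo.transform(text_test)

print(x_train_demo.shape, x_test_demo.shape)
```

Whether this changes the reported accuracy on the real dataset is not something the sketch can show; it only makes the evaluation independent of the test rows.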

In [25]:
model = LogisticRegression()
model.fit(x_train, y_train)

In [26]:
accuracy = model.score(x_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.9598646034816247
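
A single accuracy figure hides which languages are confused with each other. `accuracy_score` and `classification_report` are imported at the top of the notebook but not used; the sketch below shows what they return, using hand-written labels (`y_true` and `y_hat` are hypothetical, not results from this model).

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels, for illustration only
y_true = ["English", "English", "Spanish", "Danish"]
y_hat = ["English", "Spanish", "Spanish", "Danish"]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_hat)
print("Accuracy:", acc)

# Per-class precision, recall and F1 as a dict, and as a printable table
report = classification_report(y_true, y_hat, output_dict=True)
print(classification_report(y_true, y_hat))
```

In the notebook, the equivalent call would be `classification_report(y_test, model.predict(x_test), target_names=le.classes_)`, assuming every language appears in the test split.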


In [33]:
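def predict_language(text):
  # Vectorize with the already fitted TF-IDF vocabulary (transform, not fit_transform)
  phrase = tfidf.transform([text])
  prediction = model.predict(phrase)

  # Map the predicted integer class back to its language name
  print("Language classified as: ", le.inverse_transform(prediction)[0])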

In [34]:
predict_language("Hej, mit navn er Gabriel")

Language classified as:  Danish


In [35]:
predict_language("Hello, my name is Gabriel")

Language classified as:  English


In [36]:
predict_language("Hola, mi nombre es Gabriel")

Language classified as:  Spanish


## Limitation: the model does not predict Chinese

In [37]:
predict_language("你好，我叫加布里埃尔")

Language classified as:  Russian
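
A likely explanation, assuming Chinese is not among the training labels (this can be checked with `le.classes_`): the classifier can only return a class it was trained on, and none of the tokens in the Chinese sentence exist in the TF-IDF vocabulary, so the input becomes an all-zero vector. The prediction then depends only on the model's intercepts. The sketch below reproduces this on a small made-up corpus; `tfidf_demo`, `clf_demo`, and the sample sentences are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus with no Chinese examples
texts = [
    "hello my name is anna", "good morning to you",
    "hola mi nombre es ana", "buenos dias a todos",
    "hej mit navn er anna", "god morgen til dig",
]
labels = ["English", "English", "Spanish", "Spanish", "Danish", "Danish"]

tfidf_demo = TfidfVectorizer()
clf_demo = LogisticRegression().fit(tfidf_demo.fit_transform(texts), labels)

# No token of this sentence is in the vocabulary, so every feature is zero
vec = tfidf_demo.transform(["你好，我叫加布里埃尔"])
print("Non-zero features:", vec.nnz)

# With an all-zero input the decision scores reduce to the intercepts,
# so the model falls back to the class with the largest intercept
pred = clf_demo.predict(vec)[0]
fallback = clf_demo.classes_[np.argmax(clf_demo.intercept_)]
print(pred, fallback)
```

Supporting Chinese would require training examples for it, and probably character-level features (for instance `TfidfVectorizer(analyzer="char")`), since the default word tokenizer relies on whitespace and punctuation to separate tokens.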


In [39]:
with open("language_detector_model.pkl", "wb") as f:
  pickle.dump(model, f)
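
The file written above contains only the classifier. To predict in a new session, the fitted `tfidf` vectorizer and the `le` label encoder are needed as well, because the model expects the same feature columns and returns integer codes. One possible approach, sketched here on toy data, is to pickle all three objects together; the file name `language_detector_bundle.pkl`, the dict keys, and the `*_demo` names are assumptions made for this example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the fitted notebook objects
texts = ["hello my name is anna", "good morning to you",
         "hola mi nombre es ana", "buenos dias a todos"]
le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(["English", "English", "Spanish", "Spanish"])
tfidf_demo = TfidfVectorizer()
clf_demo = LogisticRegression().fit(tfidf_demo.fit_transform(texts), y_demo)

# Bundle everything needed at prediction time into one file
path = os.path.join(tempfile.mkdtemp(), "language_detector_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"model": clf_demo, "tfidf": tfidf_demo, "le": le_demo}, f)

# Later, in a fresh session: load the bundle and predict
with open(path, "rb") as f:
    bundle = pickle.load(f)

features = bundle["tfidf"].transform(["hola mi nombre es gabriel"])
label = bundle["le"].inverse_transform(bundle["model"].predict(features))[0]
print("Language classified as:", label)
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that wrote them where possible.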