<a href="https://colab.research.google.com/github/ZainabSyed88/Classification-Models/blob/main/Language_Detection_with_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Language Detection**

Language detection is a natural language processing task in which we identify the language of a text or document.
As a human, you can easily recognize the languages you know. For example, I can easily identify Hindi and English, but even as an Indian I cannot identify every Indian language. This is where automatic language identification helps. Google Translate, one of the most widely used translation services in the world, includes a machine learning model that detects the source language when you don't know which language you are translating from.

In general, the more data you have for each language, the more accurate the model is likely to be on new text. The dataset I am using comes from Kaggle. It covers 17 languages, with between 63 and 1,385 sentences per language, which is enough to train a simple language detection model. In the section below, I will take you through how to train a language detection model with machine learning using Python.

In [19]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [20]:
data = pd.read_csv('/content/Language_Detection.csv')
data.head()

                                                Text Language
0   Nature, in the broadest sense, is the natural...  English
1  "Nature" can refer to the phenomena of the phy...  English
2  The study of nature is a large, if not the onl...  English
3  Although humans are part of nature, human acti...  English
4  [1] The word nature is borrowed from the Old F...  English


Let's check whether this dataset contains any null values:

In [21]:
data.isnull().sum()

Text        0
Language    0
dtype: int64

Now, let's look at all the languages present in this dataset:

In [22]:
data['Language'].value_counts()

Language
English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: count, dtype: int64

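This dataset contains 17 languages and no missing values. The classes are not balanced, however: English has 1,385 sentences while Hindi has only 63, so results for the smaller classes may be less reliable. With that caveat, the dataset can be used as-is to train a machine learning model.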
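
To check how evenly the classes are distributed, the counts printed above can be turned into shares of the whole dataset. The snippet below is a small sketch that copies those counts by hand into a dictionary (the variable names are my own) and compares the largest and smallest class.

```python
# Sentence counts copied from the value_counts() output above.
counts = {
    "English": 1385, "French": 1014, "Spanish": 819, "Portugeese": 739,
    "Italian": 698, "Russian": 692, "Sweedish": 676, "Malayalam": 594,
    "Dutch": 546, "Arabic": 536, "Turkish": 474, "German": 470,
    "Tamil": 469, "Danish": 428, "Kannada": 369, "Greek": 365, "Hindi": 63,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)

print(f"{len(counts)} languages, {total} sentences in total")
for name in (largest, smallest):
    print(f"{name}: {counts[name]} sentences ({counts[name] / total:.1%} of the data)")

# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts[largest] / counts[smallest]
print(f"largest / smallest class: {imbalance_ratio:.1f}x")
```

With the DataFrame loaded, `data['Language'].value_counts(normalize=True)` should give the same shares directly.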

# **Language Detection Model**

Now, let's split the data into training and testing sets:

In [23]:
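x = np.array(data["Text"])      # raw sentences
y = np.array(data["Language"])  # language labels

# Turn each sentence into a vector of word counts (bag of words).
# Note: the vocabulary is fitted on all rows, including those that end up in the test set.
cv = CountVectorizer()
X = cv.fit_transform(x)

# Hold out 33% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)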
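
A variation on the split above, offered as a sketch rather than as the approach used in this notebook: wrapping the vectorizer and the classifier in a scikit-learn pipeline means the vocabulary is learned from the training rows only, and passing `stratify` keeps each language's share the same in both sets. The tiny `texts` and `labels` lists are made up for illustration and are not from the Kaggle dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not from the Kaggle dataset): 4 sentences per language.
texts = [
    "the cat sat on the mat", "the dog ate the bone",
    "a bird sang in the tree", "the sun is very bright",
    "de kat zat op de mat", "de hond at het bot",
    "een vogel zong in de boom", "de zon is erg fel",
    "el gato se sentó en la alfombra", "el perro comió el hueso",
    "un pájaro cantó en el árbol", "el sol es muy brillante",
]
labels = ["English"] * 4 + ["Dutch"] * 4 + ["Spanish"] * 4

# Split the raw text first; stratify keeps the class proportions in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only, then trains the model.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(text_train, y_train)

print(sorted(y_test))               # one held-out sentence per language
print(pipeline.predict(text_test))  # predictions for the held-out sentences
```

On a dataset with a very small class, such as the 63 Hindi sentences here, a stratified split is one way to make sure that class appears in both the training and the test set.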

Since this is a multiclass classification problem, I will use the Multinomial Naïve Bayes algorithm to train the language detection model. It is a common baseline for text classification on word-count features and often performs well on problems like this one.
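
To give a feel for what Multinomial Naïve Bayes does with word counts, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation, only the same idea: for each language, add the log prior to the Laplace-smoothed log likelihood of every word in the text, then pick the language with the highest total. The four-sentence `corpus` is made up for illustration, and `alpha=1.0` mirrors the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from the Kaggle dataset.
corpus = [
    ("the cat sat on the mat", "English"),
    ("the dog ate the bone", "English"),
    ("de kat zat op de mat", "Dutch"),
    ("de hond at het bot", "Dutch"),
]

word_counts = defaultdict(Counter)  # language -> word -> count
doc_counts = Counter()              # language -> number of sentences
for text, language in corpus:
    word_counts[language].update(text.split())
    doc_counts[language] += 1

vocabulary = {word for counter in word_counts.values() for word in counter}

def predict(text, alpha=1.0):
    """Return the language with the highest log posterior for `text`."""
    scores = {}
    for language, counter in word_counts.items():
        # Log prior: share of training sentences in this language.
        score = math.log(doc_counts[language] / sum(doc_counts.values()))
        total_words = sum(counter.values())
        for word in text.split():
            if word not in vocabulary:
                continue  # unseen words carry no evidence, as with CountVectorizer
            # Laplace-smoothed log likelihood of the word in this language.
            score += math.log((counter[word] + alpha) / (total_words + alpha * len(vocabulary)))
        scores[language] = score
    return max(scores, key=scores.get)

print(predict("de hond zat"))  # Dutch
print(predict("the dog sat"))  # English
```

The smoothing term matters: without it, a single word that never appeared in a language's training sentences would drive that language's probability to zero.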

In [24]:
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)


0.977725674091442
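
Overall accuracy can hide weak results on small classes such as Hindi, which has only 63 sentences in this dataset. A per-language breakdown is one way to check for that. The sketch below uses scikit-learn's `classification_report` on made-up labels (`y_true` and `y_pred` are hypothetical, not this notebook's results); in the notebook, `y_test` and `model.predict(X_test)` would take their place.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels for illustration only: 8 English and 2 Hindi test sentences.
y_true = ["English"] * 8 + ["Hindi"] * 2
# The imagined model gets every English sentence right but misses one Hindi sentence.
y_pred = ["English"] * 8 + ["English", "Hindi"]

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, output_dict=True)

print(f"accuracy: {accuracy:.2f}")
for language in ("English", "Hindi"):
    print(f"{language}: recall = {report[language]['recall']:.2f}")
```

In this made-up case the accuracy is 0.90 even though only half of the Hindi sentences are recognized, which is the kind of gap a single score does not show.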

Now let’s use this model to detect the language of a text by taking a user input:

In [25]:
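user = input("Enter a text:")
# Reuse the fitted vectorizer so the input gets the same features as the training data.
# A separate name avoids overwriting the `data` DataFrame loaded earlier.
features = cv.transform([user])
output = model.predict(features)
print(output)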

Enter a text:hallo
['Dutch']


As you can see, the model identifies this input as Dutch, and it scored about 97.8% accuracy on the test set. One thing to note is that this model can only detect the languages present in the dataset.
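
Because the model always answers with one of the languages it was trained on, one possible safeguard is to look at the predicted probabilities and return "unknown" when none of them is high. The sketch below is an assumption of mine, not part of the original notebook: the toy corpus, the `detect_language` helper, and the 0.5 threshold are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, not taken from the Kaggle dataset.
texts = [
    "the cat sat on the mat", "the dog ate the bone",
    "de kat zat op de mat", "de hond at het bot",
    "el gato se sentó en la alfombra", "el perro comió el hueso",
]
labels = ["English", "English", "Dutch", "Dutch", "Spanish", "Spanish"]

detector = make_pipeline(CountVectorizer(), MultinomialNB())
detector.fit(texts, labels)

def detect_language(text, threshold=0.5):
    """Return the predicted language, or "unknown" if the model is not confident.

    `threshold` is an arbitrary illustrative cut-off, not a tuned value.
    """
    probabilities = detector.predict_proba([text])[0]
    best = probabilities.argmax()
    if probabilities[best] < threshold:
        return "unknown"
    return str(detector.classes_[best])

print(detect_language("de hond zat op de mat"))  # Dutch
print(detect_language("xyzzy qwerty"))           # unknown
```

When none of the input words are in the vocabulary, the model falls back to the class priors, so no language stands out and the helper returns "unknown". This is only a rough filter: text in an unsupported language that happens to share words with a supported one can still be labeled confidently.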