This project detects the language of a text using a Multinomial Naive Bayes classifier. It covers 22 languages: English, Arabic, French, Hindi, Urdu, Portuguese, Persian (Farsi), Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, Indonesian, Dutch, Japanese, and Thai.

## Import Library

In [27]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB



This code imports the Python libraries used in the rest of the notebook for data handling, text vectorization, and training a Naive Bayes text classifier.

- `pandas` is a library for data manipulation and analysis. It provides data structures and functions to work with tabular data, such as data frames, and is often used for data cleaning and preparation.
- `numpy` is a library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, as well as a large collection of mathematical functions to operate on them.
- `sklearn.feature_extraction.text` is a module from the scikit-learn library that provides tools for converting text into numerical representations that can be used as input to machine learning algorithms. Specifically, it provides the `CountVectorizer` class, which converts a collection of text documents into a matrix of word counts.
- `sklearn.model_selection` is a module from the scikit-learn library that provides tools for model selection and evaluation. Specifically, it provides the `train_test_split` function, which splits a dataset into training and testing subsets.
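- `sklearn.naive_bayes` is a module from the scikit-learn library that provides implementations of Naive Bayes algorithms, which are probabilistic classifiers that assume the features are conditionally independent given the class. The `MultinomialNB` class used here is designed for discrete features such as word counts.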

The rest of the notebook reads a dataset of text samples and their language labels, inspects it, vectorizes the text with `CountVectorizer`, and splits the data into training and testing subsets with `train_test_split`. Finally, it trains a `MultinomialNB` classifier on the training data and evaluates it on the testing data.

## Load Dataset

In [29]:
data = pd.read_csv("Language.csv")
print(data.head())

                                                Text  language
0  klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1  sebes joseph pereira thomas  på eng the jesuit...   Swedish
2  ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3  விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4  de spons behoort tot het geslacht haliclona en...     Dutch




This code reads a CSV file named "Language.csv" into a pandas data frame, and then prints the first few rows of the data frame using the `head()` function.

- `pd.read_csv()` is a function from the pandas library that reads a CSV file into a data frame. The function takes the filename as its argument and returns a data frame object. 

- `"Language.csv"` is the name of the file being read.

- `data` is the name of the variable that is used to store the data frame object returned by the `read_csv()` function.

- `data.head()` is a data frame method that returns the first rows of the data frame. By default it returns the first five rows; pass an integer, as in `data.head(10)`, to change this.

- `print()` is a built-in Python function that prints the output of the expression that is passed to it. In this case, the expression is `data.head()`, which is a data frame object containing the first few rows of the CSV file.

Overall, this code is used to inspect the contents of the "Language.csv" file by printing the first few rows of the data frame.
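
As a minimal sketch of the same loading step without needing the file on disk, the example below reads a small in-memory CSV. The three rows and the names `csv_text`, `sample`, and `first_two` are made up for illustration and are not taken from "Language.csv".

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for "Language.csv"
csv_text = (
    "Text,language\n"
    "hello world,English\n"
    "hola mundo,Spanish\n"
    "bonjour le monde,French\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; head() alone returns up to five
first_two = sample.head(2)
print(first_two)
```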

In [31]:
data.shape

(22000, 2)



This code returns the shape of the pandas data frame `data`, which is a tuple containing the number of rows and columns in the data frame.

- `data` is the name of the pandas data frame.

- `.shape` is an attribute of a pandas data frame that returns the shape of the data frame as a tuple `(nrows, ncols)`, where `nrows` is the number of rows and `ncols` is the number of columns.

Here, `data.shape` returns `(22000, 2)`, which means the data frame has 22,000 rows (one per text sample) and 2 columns (`Text` and `language`).

Inspecting the shape gives a quick idea of the size of the dataset, which is useful when planning data cleaning and the train/test split.
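
Because `.shape` is an ordinary tuple, it can be unpacked directly into two variables. The sketch below uses a made-up three-row data frame named `sample`, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data frame with the same two columns as the dataset
sample = pd.DataFrame(
    {
        "Text": ["hello world", "hola mundo", "bonjour le monde"],
        "language": ["English", "Spanish", "French"],
    }
)

# .shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```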

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      22000 non-null  object
 1   language  22000 non-null  object
dtypes: object(2)
memory usage: 343.9+ KB




This code prints information about the pandas data frame `data`, including the number of non-null values and data types of each column, as well as the total memory usage of the data frame.

- `data` is the name of the pandas data frame.

- `.info()` is a method of a pandas data frame that prints a concise summary of the data frame's contents. It includes information such as the data types of each column, the number of non-null values in each column, and the memory usage of the data frame.

The first line of the output indicates that the object is a pandas data frame. The second line shows that it has a range index with 22,000 entries, numbered 0 to 21999. The following lines list the two columns along with their non-null counts and data types: both `Text` and `language` have 22,000 non-null values and the `object` dtype, which pandas uses for strings. The last line shows the memory usage of the data frame.

This information can be useful for identifying missing or incorrect values, understanding the distribution of data types, and optimizing memory usage.
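
The `+` in `343.9+ KB` indicates that the figure is a lower bound: by default pandas does not measure the strings stored in `object` columns. The sketch below, on a made-up data frame named `sample`, compares the default estimate with a deep measurement.

```python
import pandas as pd

# Hypothetical toy data frame; the real dataset has 22,000 rows
sample = pd.DataFrame(
    {
        "Text": ["hello world", "hola mundo", "bonjour le monde"],
        "language": ["English", "Spanish", "French"],
    }
)

# Default estimate: may not include the string contents themselves
shallow_bytes = sample.memory_usage().sum()

# Deep estimate: also measures the stored strings
deep_bytes = sample.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)

# info() can report the deep figure directly
sample.info(memory_usage="deep")
```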

In [33]:
data.describe()

                                                     Text  language
count                                               22000     22000
unique                                              21859        22
top     haec commentatio automatice praeparata res ast...  Estonian
freq                                                   48      1000




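This code generates descriptive statistics for the columns of the pandas data frame `data`.

- `data` is the name of the pandas data frame.

- `.describe()` is a method of a pandas data frame that generates summary statistics. For numerical columns it reports the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. Both columns here hold text (`object` dtype), so it instead reports `count`, `unique` (the number of distinct values), `top` (the most frequent value), and `freq` (how often the most frequent value occurs).

The output shows 22 distinct languages and 21,859 distinct texts among 22,000 rows, so some texts are duplicated; the most frequent text appears 48 times.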
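
To illustrate how the output of `.describe()` depends on the column type, the sketch below summarizes text columns and a numeric column separately. The data frame `sample` and its `n_chars` column are invented for this example.

```python
import pandas as pd

# Hypothetical toy data: two text columns plus one numeric column
sample = pd.DataFrame(
    {
        "Text": ["hello world", "hola mundo", "hello world", "bonjour"],
        "language": ["English", "Spanish", "English", "French"],
        "n_chars": [11, 10, 11, 7],
    }
)

# Text columns are summarized with count, unique, top, and freq
text_summary = sample[["Text", "language"]].describe()
print(text_summary)

# Numeric columns are summarized with count, mean, std, min, quartiles, max
numeric_summary = sample[["n_chars"]].describe()
print(numeric_summary)
```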

## Check Missing Value

In [4]:
data.isnull().sum()

Text        0
language    0
dtype: int64



This code calculates the number of missing (null) values in each column of the pandas data frame `data`.

- `data` is the name of the pandas data frame.

- `.isnull()` is a method of a pandas data frame that returns a Boolean mask indicating which values in the data frame are missing (null). Each element of the mask is `True` if the corresponding value in the data frame is null, and `False` otherwise.

- `.sum()` is a method that calculates the sum of the values in each column of the Boolean mask. Since `True` is interpreted as 1 and `False` as 0, the sum of the Boolean mask gives the total number of missing values in each column.
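
The real dataset has no missing values, so the sketch below uses a made-up data frame named `sample` with two gaps to show what a non-zero result looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
sample = pd.DataFrame(
    {
        "Text": ["hello world", None, "bonjour le monde"],
        "language": ["English", "Spanish", np.nan],
    }
)

# isnull() marks None and NaN as True; sum() counts the True values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```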

In [5]:
data["language"].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64



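This code counts the number of occurrences of each unique value in the "language" column of the pandas data frame `data`. Every language has exactly 1,000 samples, so the dataset is balanced across the 22 classes. The labels are used exactly as they appear in the file, including the spelling `Portugese` and the label `Persian` for Farsi, so these are the strings the model will predict.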

- `data` is the name of the pandas data frame.

- `["language"]` selects the "language" column of the data frame.

- `.value_counts()` is a method of a pandas series (i.e., a single column of a data frame) that counts the number of occurrences of each unique value in the series.
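
As a small illustration, `value_counts()` can also return proportions instead of raw counts through its `normalize` parameter. The series `labels` below is made up and deliberately unbalanced.

```python
import pandas as pd

# Hypothetical label column with an unbalanced class distribution
labels = pd.Series(["English", "Spanish", "English", "French"], name="language")

# Raw counts per class, sorted from most to least frequent
counts = labels.value_counts()
print(counts)

# Proportions per class; these sum to 1
shares = labels.value_counts(normalize=True)
print(shares)

# A dataset is balanced when every class has the same count
is_balanced = counts.nunique() == 1
print(is_balanced)
```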

## Modelling

In [7]:
x = np.array(data["Text"])
y = np.array(data["language"])



This code creates numpy arrays `x` and `y` from two columns of the pandas data frame `data`.

- `np.array()` is a numpy function that converts a pandas series or data frame to a numpy array.

- `data["Text"]` selects the "Text" column of the data frame `data`.

- `data["language"]` selects the "language" column of the data frame `data`.

- `x` and `y` are the names of the numpy arrays that are created from the "Text" and "language" columns, respectively.

The resulting numpy arrays `x` and `y` hold the inputs and targets for the model. `x` contains the raw text strings, which still need to be converted to numerical features before training, and `y` contains the corresponding language labels.
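
An equivalent way to obtain the arrays is the `to_numpy()` method of a pandas series. The sketch below applies it to a made-up data frame named `sample`; it is an alternative spelling, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with the same two columns as the dataset
sample = pd.DataFrame(
    {
        "Text": ["hello world", "hola mundo", "bonjour le monde"],
        "language": ["English", "Spanish", "French"],
    }
)

# to_numpy() returns the column values as a one-dimensional numpy array
x = sample["Text"].to_numpy()
y = sample["language"].to_numpy()

# The elements are still Python strings, not numbers
print(x.shape, type(x[0]))
```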

In [16]:
cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=25)



This code performs the following tasks:

- It creates a `CountVectorizer` object `cv`, which is a text feature extraction method provided by scikit-learn that converts a collection of text documents to a matrix of token counts.

- It applies `cv.fit_transform()` to the `x` numpy array to transform the text data into a sparse matrix of token counts. The resulting matrix is stored in the `X` variable.

- It uses `train_test_split()` from scikit-learn to split the `X` and `y` data into training and testing sets. With `test_size=0.30`, 30% of the data (6,600 samples) is held out for testing and the remaining 70% (15,400 samples) is used for training. The `random_state` parameter is set to 25 so that the same split is produced on every run.

- It stores the training and testing sets in four variables: `X_train`, `X_test`, `y_train`, and `y_test`.

The resulting `X_train` and `X_test` are sparse matrices in which each row represents one text document and each column represents one token of the vocabulary learned by `CountVectorizer`; each entry is the number of times that token occurs in that document. The `y_train` and `y_test` numpy arrays contain the corresponding language labels.

These data splits are commonly used for training and evaluating machine learning models, where the model is trained on the `X_train` and `y_train` data, and evaluated on the `X_test` and `y_test` data. The goal is to build a model that can accurately predict the language of a text document based on its word frequencies.
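
In the cell above, the vocabulary is learned from all texts before the split, so tokens that occur only in the test set still shape the feature space. One possible alternative, sketched below on made-up data, is to split the raw texts first and wrap the vectorizer and classifier in a scikit-learn pipeline so that the vocabulary is learned from the training texts only. The toy `texts`, the `stratify` argument, and the use of `make_pipeline` are assumptions of this sketch, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four English and four Spanish sentences
texts = [
    "the cat sat on the mat",
    "the dog ate the food",
    "a bird sang in the tree",
    "the sun is in the sky",
    "el gato come la comida",
    "el perro bebe el agua",
    "la casa es muy grande",
    "el sol brilla en el cielo",
]
labels = ["English"] * 4 + ["Spanish"] * 4

# Split the raw texts first; stratify keeps the class ratio in both subsets
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=25, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts_train, y_train)

accuracy = pipeline.score(texts_test, y_test)
print(accuracy)
```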

In [17]:
model = MultinomialNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9562121212121212

This code trains a Naive Bayes text classification model using the training data (`X_train` and `y_train`), and then evaluates the accuracy of the model on the testing data (`X_test` and `y_test`).

- It creates a `MultinomialNB` object `model`, which is a Naive Bayes classifier that works well with count-based data (such as the count matrix produced by `CountVectorizer`).

- It trains the model using the `fit()` method, with the training data `X_train` and `y_train` as input.

- It evaluates the accuracy of the model using the `score()` method, with the testing data `X_test` and `y_test` as input. The `score()` method returns the mean accuracy on the given test data and labels.

The resulting accuracy of 0.9562 means that the model correctly classified about 95.6% of the test samples, that is, 6,311 of the 6,600 texts in the test set. Because every language has the same number of samples, overall accuracy is a reasonable summary metric here, although it does not show which languages are confused with each other.
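
The value returned by `score()` is simply the fraction of predictions that match the true labels. The sketch below reproduces that calculation by hand on made-up label arrays and checks it against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five test samples
y_true = np.array(["English", "Spanish", "Hindi", "English", "Spanish"])
y_pred = np.array(["English", "Spanish", "Hindi", "Spanish", "Spanish"])

# Element-wise comparison gives booleans; their mean is the accuracy
manual_accuracy = np.mean(y_true == y_pred)

# scikit-learn computes the same quantity
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```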

## Prediction

In [19]:
user = input("Enter texts: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)

Enter texts: amigo
['Spanish']


In [21]:
user = input("Enter texts: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)

Enter texts: I want to eat ice cream
['English']


In [25]:
user = input("Enter texts: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)

Enter texts: मुझे तुमसे प्यार है
['Hindi']




This code is used to predict the language of a new text input by the user, using the trained Naive Bayes classifier model.

- It first prompts the user to enter a text input using the `input()` function, and stores the input in the `user` variable.

- It then applies the `transform()` method of the fitted `CountVectorizer` object `cv` to the new text, wrapped in a list because `transform()` expects a collection of documents. This produces a count-based feature vector with the same columns as the training and testing data. Note that the result is stored in a variable named `data`, which overwrites the data frame loaded earlier.

- The `toarray()` method converts the sparse matrix to a dense numpy array. This step is optional, since `MultinomialNB.predict()` also accepts sparse input.

- It uses the `predict()` method of the trained `MultinomialNB` model to predict the language label of the new text input. The input to `predict()` is the transformed `data` variable. The predicted label is stored in the `output` variable as a numpy array.

- Finally, it prints the predicted label using the `print()` function.

This code can be used to try the trained model on new text inputs. Predictions for very short inputs, such as a single word, are less reliable because they provide few tokens, and any word that is not in the vocabulary learned by `cv` is ignored.
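
The repeated prediction cells could be wrapped in a small helper that also reports how confident the model is. The sketch below is one way to do that under stated assumptions: `predict_language` is a hypothetical function name, and the four training sentences are made up so that the example runs on its own instead of relying on the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data standing in for the real dataset
train_texts = [
    "i want to eat ice cream",
    "the weather is nice today",
    "quiero comer helado",
    "hola amigo como estas",
]
train_labels = ["English", "English", "Spanish", "Spanish"]

cv = CountVectorizer()
model = MultinomialNB()
model.fit(cv.fit_transform(train_texts), train_labels)


def predict_language(text):
    """Return the most likely language and its estimated probability."""
    # The sparse output of transform() can be passed to the model directly
    features = cv.transform([text])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


label, confidence = predict_language("amigo")
print(label, round(confidence, 3))
```

Using a separate name such as `features` also avoids overwriting the `data` data frame.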