Building an ML model with Naive Bayes for language detection

In [114]:
import numpy as np
import pandas as pd

In [115]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [116]:
data = pd.read_csv('/content/drive/MyDrive/Data_Sets/language.csv')
data

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
...,...,...
21995,hors du terrain les années et sont des année...,French
21996,ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...,Thai
21997,con motivo de la celebración del septuagésimoq...,Spanish
21998,年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...,Chinese


In [117]:
# CountVectorizer converts a collection of text documents into a matrix of
# token (word) counts, a numerical representation that machine learning models
# can work with. It builds a vocabulary of all unique words across the
# documents, then represents each document as a vector in which each position
# corresponds to one vocabulary word and the value is the number of times that
# word appears in the document.
from sklearn.feature_extraction.text import CountVectorizer
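To make the description above concrete, here is a minimal pure-Python sketch of the same bag-of-words idea. It is not how scikit-learn implements it: the helper names `build_vocabulary` and `count_vector` are hypothetical, and the tokenization is a plain whitespace split, whereas CountVectorizer by default lowercases the text and keeps only tokens of two or more word characters.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every unique lowercase word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(doc, vocabulary):
    """Return one count per vocabulary word, in column order."""
    counts = Counter(doc.lower().split())
    # Dicts preserve insertion order, so this follows the column indices.
    return [counts.get(word, 0) for word in vocabulary]

toy_docs = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(toy_docs)
vectors = [count_vector(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
print(vectors)
```

Each row of `vectors` plays the role of one row of the count matrix, and each column corresponds to one word of the vocabulary.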

In [118]:
from sklearn.model_selection import train_test_split

In [119]:
from sklearn.naive_bayes import MultinomialNB

In [120]:
data.isnull().sum()

Unnamed: 0,0
Text,0
language,0


In [121]:
data['language'].value_counts()

Unnamed: 0_level_0,count
language,Unnamed: 1_level_1
Estonian,1000
Swedish,1000
Thai,1000
Tamil,1000
Dutch,1000
Japanese,1000
Turkish,1000
Latin,1000
Urdu,1000
Indonesian,1000


In [122]:
data.dtypes

Unnamed: 0,0
Text,object
language,object


In [123]:
data.columns

Index(['Text', 'language'], dtype='object')

In [124]:
data['language'] # Accessing the 'language' column data

Unnamed: 0,language
0,Estonian
1,Swedish
2,Thai
3,Tamil
4,Dutch
...,...
21995,French
21996,Thai
21997,Spanish
21998,Chinese


Converting the 'Text' and 'language' columns into NumPy arrays

In [125]:
x = np.array(data['Text'])
y = np.array(data['language'])

In [126]:
x  # the 'Text' column is now a NumPy array (its element dtype is still object)

array(['klement gottwaldi surnukeha palsameeriti ning paigutati mausoleumi surnukeha oli aga liiga hilja ja oskamatult palsameeritud ning hakkas ilmutama lagunemise tundemärke  aastal viidi ta surnukeha mausoleumist ära ja kremeeriti zlíni linn kandis aastatel – nime gottwaldov ukrainas harkivi oblastis kandis zmiivi linn aastatel – nime gotvald',
       'sebes joseph pereira thomas  på eng the jesuits and the sino-russian treaty of nerchinsk  the diary of thomas pereira bibliotheca instituti historici s i --   rome libris ',
       'ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เริ่มตั้งแต่ถนนสนามไชยถึงแม่น้ำเจ้าพระยาที่ถนนตก กรุงเทพมหานคร เป็นถนนรุ่นแรกที่ใช้เทคนิคการสร้างแบบตะวันตก ปัจจุบันผ่านพื้นที่เขตพระนคร เขตป้อมปราบศัตรูพ่าย เขตสัมพันธวงศ์ เขตบางรัก เขตสาทร และเขตบางคอแหลม',
       ...,
       'con motivo de la celebración del septuagésimoquinto ° aniversario de la fundación del departamento en  guillermo ceballos espinosa presentó a la gobernación de caldas por encargo de su t

In [127]:
y  # the 'language' column is now a NumPy array (its element dtype is still object)

array(['Estonian', 'Swedish', 'Thai', ..., 'Spanish', 'Chinese',
       'Romanian'], dtype=object)

Converting the text array "x" into numeric token counts

In [128]:
cv = CountVectorizer()

In [129]:
# fit_transform(x) does two things. First it "fits" the CountVectorizer to the
# text data x, learning every unique word (the vocabulary) in the documents.
# Then it "transforms" x into a numerical matrix X based on that vocabulary.
# Each row of X corresponds to a document and each column to a vocabulary
# word; the value is the count of that word in that document.
X = cv.fit_transform(x)

In [130]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [131]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 613324 stored elements and shape (14740, 277635)>
  Coords	Values
  (0, 197210)	2
  (0, 197623)	1
  (0, 197716)	1
  (0, 198303)	1
  (0, 197382)	1
  (0, 197780)	2
  (0, 197519)	1
  (0, 198343)	1
  (0, 198416)	1
  (0, 198471)	1
  (0, 197247)	1
  (0, 197400)	2
  (0, 198038)	1
  (0, 197807)	1
  (0, 197905)	1
  (0, 197968)	1
  (0, 198332)	1
  (0, 197538)	1
  (1, 197556)	2
  (1, 197229)	1
  (1, 197846)	1
  (1, 197719)	3
  (1, 198312)	1
  (1, 197064)	1
  (1, 197696)	1
  :	:
  (14738, 188732)	1
  (14738, 191919)	1
  (14738, 157086)	1
  (14738, 190261)	1
  (14738, 190640)	1
  (14738, 189600)	1
  (14738, 159184)	2
  (14738, 145346)	1
  (14738, 173207)	1
  (14738, 175977)	1
  (14738, 159874)	1
  (14738, 190113)	1
  (14738, 167039)	1
  (14738, 168073)	1
  (14738, 180175)	2
  (14738, 153177)	1
  (14738, 162065)	1
  (14738, 153270)	1
  (14738, 178019)	1
  (14738, 163685)	1
  (14739, 222917)	1
  (14739, 235085)	1
  (14739, 222361)	1
  (14739

Training the model

In [132]:
model_NB = MultinomialNB()

In [133]:
model_NB.fit(X_train,y_train)

In [134]:
model_NB.score(X_test,y_test)

0.952892561983471
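In the cells above, the vectorizer is fitted on all texts before the split, so its vocabulary also contains words that occur only in the test set. One possible alternative, sketched here on made-up sentences rather than the notebook's data, is to split the raw text first and wrap both steps in a scikit-learn pipeline, so the vocabulary is learned from the training texts only. The variable names and the two-language toy corpus are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical); the real notebook would pass its raw text arrays.
train_texts = [
    "the cat is on the table",
    "this is a good day",
    "where is the train station",
    "el gato está en la mesa",
    "hoy es un buen día",
    "dónde está la estación de tren",
]
train_labels = ["English"] * 3 + ["Spanish"] * 3

test_texts = ["the train is on time", "la mesa está aquí"]
test_labels = ["English", "Spanish"]

# The pipeline fits the vectorizer and the classifier on training text only.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(train_texts, train_labels)

toy_predictions = toy_pipeline.predict(test_texts).tolist()
toy_accuracy = toy_pipeline.score(test_texts, test_labels)

print(toy_predictions, toy_accuracy)
```

A pipeline also accepts raw strings at prediction time, which removes the need to call the vectorizer separately.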

Checking the prediction for user input

In [138]:
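user = input("Enter a Text: ")
# Use a new name so the `data` DataFrame is not overwritten; the sparse matrix
# can be passed to predict() directly, so .toarray() is not needed.
user_features = cv.transform([user])
output = model_NB.predict(user_features)
print(output)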

Enter a Text: my name is dhiraj
['English']


In [136]:
import joblib
joblib.dump(model_NB, 'NB_Language_Detector.pkl')  # trained classifier
joblib.dump(cv, 'vectorizer.pkl')  # fitted CountVectorizer, needed to transform new text before predicting
joblib.dump(cv.get_feature_names_out().tolist(), 'feature_names.pkl')  # optional: vocabulary words in column order

['feature_names.pkl']