Honour Code
I, Chibuikem Mbonu, confirm by submitting this document that the solutions in this notebook are the result of my own work and that I abide by the EDSA honour code. Non-compliance with the honour code constitutes a material breach of contract.

Exam Overview: Language Identification Analysis 2022
South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual, with 11 official languages, each of which is guaranteed equal status. Most South Africans can speak two or more of the official languages. With such a multilingual population, it is natural that our systems and devices should also communicate in multiple languages.

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification in NLP: the task of determining the natural language that a piece of text is written in.

In [25]:
# import the packages needed for modelling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, log_loss

In [26]:
# load the training data set
train = pd.read_csv('train_set.csv')

In [27]:
# show the first five rows of the data set
train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [28]:
# encode the lang_id column as integer labels
le = LabelEncoder()
le.fit(train.lang_id)
train.lang_id = le.transform(train.lang_id)

In [29]:
train.head()

Unnamed: 0,lang_id,text
0,9,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,9,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,1,the province of kwazulu-natal department of tr...
3,3,o netefatša gore o ba file dilo ka moka tše le...
4,8,khomishini ya ndinganyiso ya mbeu yo ewa maana...
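
LabelEncoder assigns integers in sorted order of the labels, which is why xho becomes 9 and eng becomes 1 in the table above. A small sketch of that mapping, assuming the 11 language codes listed below (only some of them appear in the outputs shown in this notebook, so the full list is an assumption, and `codes` and `demo_le` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Assumed list of language codes; tso and zul do not appear in the outputs shown here.
codes = ['afr', 'eng', 'nbl', 'nso', 'sot', 'ssw', 'tsn', 'tso', 'ven', 'xho', 'zul']

# A separate encoder so the notebook's own `le` is left untouched.
demo_le = LabelEncoder().fit(codes)

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {code: position for position, code in enumerate(demo_le.classes_)}
print(mapping)

# Round trip: codes -> integers -> codes.
encoded = demo_le.transform(['xho', 'eng', 'nso', 'ven'])
decoded = demo_le.inverse_transform(encoded)
print(list(encoded), list(decoded))
```

The same `inverse_transform` call is what turns the integer predictions back into language codes later in the notebook.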


In [30]:
# build a bag-of-words count matrix from the training text
vect = CountVectorizer()
X_count = vect.fit_transform(train['text'])



In [31]:
# separate the features and the label
X = X_count
y = train['lang_id']

In [32]:
# hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)
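
In the cells above, the CountVectorizer is fitted on every row before the split, so the vocabulary also contains words that occur only in the held-out rows. One way to keep held-out text out of the fitting step is to wrap the vectoriser and the classifier in a Pipeline, which is already imported at the top of the notebook. A minimal sketch on invented toy sentences (`toy_text`, `toy_lang` and `model` are illustrative names, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: two English and two Afrikaans sentences.
toy_text = [
    'the committee must report to the council',
    'the department of health is in the province',
    'die komitee moet aan die raad verslag doen',
    'winste op buitelandse valuta',
]
toy_lang = ['eng', 'eng', 'afr', 'afr']

# The pipeline fits the vocabulary and the classifier on the same rows only.
model = Pipeline([
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB()),
])
model.fit(toy_text, toy_lang)

# New text goes through the fitted vocabulary automatically.
toy_pred = model.predict(['the department of health'])
print(toy_pred)
```

With this layout, the split would be applied to the raw text column, and `model.fit` would be called on the training portion alone.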

In [33]:
# fit a multinomial naive Bayes classifier and predict the held-out labels
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
pred = naive_bayes.predict(X_test)


In [34]:
# predicted class probabilities, needed for the log loss
y_hat = naive_bayes.predict_proba(X_test)

In [35]:
# weighted F1 score and log loss on the held-out split
f1 = f1_score(y_test, pred, average='weighted')
loss = log_loss(y_test, y_hat)
print('the test f1_score is: {}'.format(f1))
print('the test log_loss is: {}'.format(loss))

the test f1_score is: 0.999545454660823
the test log_loss is: 0.001273889969532281
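
The log loss reported above is the mean negative log of the probability the model assigned to the true class of each row. A small sketch on made-up numbers (`demo_true` and `demo_proba` are invented for illustration, not taken from the notebook) that computes it by hand and compares it with `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented example: 4 rows, 3 classes, each row of probabilities sums to 1.
demo_true = np.array([0, 1, 2, 1])
demo_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.6, 0.1],
])

# Pick the probability given to the true class of each row, then average -log(p).
true_class_proba = demo_proba[np.arange(len(demo_true)), demo_true]
manual_loss = -np.mean(np.log(true_class_proba))

# The library call should agree with the manual calculation.
sklearn_loss = log_loss(demo_true, demo_proba, labels=[0, 1, 2])
print(manual_loss, sklearn_loss)
```

A value close to zero, as in the output above, means the model put nearly all of its probability on the correct language for the held-out rows.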


In [36]:
#load the test data set
test = pd.read_csv('test_set.csv')
test.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


In [37]:
# vectorise the test text with the vocabulary fitted on the training data
word = vect.transform(test['text'])
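
The test text is passed through `transform` rather than `fit_transform` so that its columns line up with the ones the classifier was trained on; words that were never seen during fitting are dropped. A short sketch of that behaviour on invented sentences (`demo_vect`, `X_fit` and `X_new` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vect = CountVectorizer()

# Fitting learns the vocabulary: one column per distinct token.
X_fit = demo_vect.fit_transform([
    'winste op buitelandse valuta',
    'the province of kwazulu-natal',
])

# transform reuses that vocabulary; 'vandag' was never seen, so it is ignored.
X_new = demo_vect.transform(['winste op valuta vandag'])

print(sorted(demo_vect.vocabulary_))
print(X_fit.shape, X_new.shape, X_new.sum())
```

Both matrices have the same number of columns, which is what allows `naive_bayes.predict` to be called on the transformed test data.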

In [38]:
# predict the encoded language label for each test row
y_pred = naive_bayes.predict(word)

In [39]:
# map the integer predictions back to language codes
result = list(le.inverse_transform(y_pred))


In [40]:
index = test['index']

In [41]:
# build the submission frame: one predicted language code per test index
sub = pd.DataFrame(list(zip(index, result)), columns=['index', 'lang_id'])
sub.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr


In [42]:
sub.to_csv('first_submission.csv', index=False)
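
Before uploading, it can be worth reading the file back and checking its layout. The sketch below does this on a three-row stand-in frame written to an in-memory buffer instead of to disk; `demo_sub`, `buffer` and `check` are illustrative names, and the checks are a suggestion and not a requirement stated by the challenge:

```python
import io

import pandas as pd

# Stand-in for the real submission frame.
demo_sub = pd.DataFrame({'index': [1, 2, 3], 'lang_id': ['tsn', 'nbl', 'ven']})

# Write to an in-memory buffer the same way the notebook writes to disk.
buffer = io.StringIO()
demo_sub.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and check the columns, the index values and missing predictions.
check = pd.read_csv(buffer)
print(list(check.columns))
print(check['index'].is_unique, check['lang_id'].isna().sum())
```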

In [43]:
sub

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr
...,...,...
5677,5678,eng
5678,5679,nso
5679,5680,sot
5680,5681,sot
