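Here we will see how to use Naive Bayes for multi-class text classification.

Import the class sklearn.naive_bayes.MultinomialNB, the multinomial Naive Bayes classifier. It handles multi-class problems and works on count features such as word counts. (It is not multinomial logistic regression, which is a different model.)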

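If your features are binary (a word is either present or absent) rather than counts, BernoulliNB is usually the better fit. The choice depends on the features, not on the number of classes.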
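
To make the difference between the two classifiers concrete, here is a minimal sketch on a made-up term-count matrix (the numbers are invented for illustration and are not from the dataset used below). MultinomialNB uses the counts as they are, while BernoulliNB with binarize=0.0 only looks at whether each count is above zero.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy term-count matrix: 4 documents, 3 vocabulary terms (made-up data).
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])

# MultinomialNB models how often each term occurs in each class.
mnb = MultinomialNB().fit(X, y)

# BernoulliNB models presence/absence; binarize=0.0 maps every count > 0 to 1.
bnb = BernoulliNB(binarize=0.0).fit(X, y)

# A new document that only contains the first term, five times.
sample = np.array([[5, 0, 0]])
print(mnb.predict(sample), bnb.predict(sample))
```

Both models put this sample in class 0, but they reach that answer from different evidence: the multinomial model from the repeated term, the Bernoulli model from which terms are present and which are missing.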

I will also compare the accuracy of two vectorizing techniques: bag-of-words (BOW) and TF-IDF.

In [10]:
from sklearn import metrics
import numpy as np
import sklearn.datasets
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

Define some helper functions for preprocessing.

In [11]:
# keep only ASCII letters, digits and spaces, then collapse repeated spaces
def clearstring(string):
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = ' '.join(string)
    return string

# sklearn.datasets.load_files reads each file as a single element,
# so we split every file into lines and give each line the file's label
def separate_dataset(trainset):
    datastring = []
    datatarget = []
    for i in range(len(trainset.data)):
        data_ = trainset.data[i].split('\n')
        # filter() returns an iterator in Python 3, so wrap it in list()
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
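
To see what the cleaning step does to a single line, here is a small standalone sketch. The name clean_line is hypothetical; it uses the same regular expression as clearstring and collapses whitespace with split(), which should give the same result because only letters, digits and spaces survive the substitution.

```python
import re

def clean_line(text):
    # Drop every character that is not an ASCII letter, digit or space.
    text = re.sub('[^A-Za-z0-9 ]+', '', text)
    # Collapse runs of spaces and trim both ends.
    return ' '.join(text.split())

print(clean_line("Report: 1MDB   negotiations, over #missing funds!"))
# Accented and other non-ASCII letters are removed too.
print(clean_line("café"))
```

Note the side effect on the second call: the pattern strips non-ASCII characters, so it is only a reasonable choice when the text is mostly plain ASCII, as the tweets and headlines here appear to be.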

I included 6 classes in local/, one sub-folder per class:
1. adidas (wear)
2. apple (electronic)
3. hungry (status)
4. kerajaan (government related)
5. nike (wear)
6. pembangkang (opposition related)
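
For reference, this is the folder layout load_files expects: one sub-folder per class, each holding text files. The sketch below builds a throwaway copy of that layout in a temporary directory; the file name data.txt and the sample sentences are made up for illustration.

```python
import os
import tempfile
import sklearn.datasets

# Build a temporary folder with one sub-folder per class.
root = tempfile.mkdtemp()
samples = {
    'adidas': 'new running shoes\nclassic sneakers',
    'apple': 'new phone released\nlaptop update',
}
for label, text in samples.items():
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, 'data.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

demo = sklearn.datasets.load_files(container_path=root, encoding='UTF-8')
print(demo.target_names)  # class names come from the folder names
print(len(demo.data))     # one entry per file, not per line
```

Because each file comes back as one string, a file with many sentences still counts as a single sample, which is why the notebook splits files into lines before training.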

In [12]:
# change the encoding if your files are not UTF-8
trainset = sklearn.datasets.load_files(container_path = 'local', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset)
print ("List of Classes: %s" %trainset.target_names)
print ("# of Samples: %s" %len(trainset.data))
print ("# of Samples: %s" %len(trainset.target))

List of Classes: ['adidas', 'apple', 'hungry', 'kerajaan', 'nike', 'pembangkang']
# of Samples: 25292
# of Samples: 25292


Change n to inspect a different sample from the dataset.

In [13]:
n=111
print("Sentence: %s" %trainset.data[n])
print("Class: %s" %trainset.target_names[trainset.target[n]])

Sentence: Report 1MDB negotiations over missing funds breaks down Mkini
Class: pembangkang


Let's split the data into train (80%) and test (20%) sets.

In [14]:
train_data, test_data, train_Y, test_Y = train_test_split(trainset.data, trainset.target, test_size = 0.2)
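
The split above is random, so the scores will move a little from run to run. As a sketch of one way to make the split repeatable and keep the class proportions equal in both parts, train_test_split accepts random_state and stratify. The toy labels and the seed 42 below are arbitrary choices for illustration, not what was used for the results in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy labels: 80 samples of class 0, 20 of class 1.
texts = ['doc %d' % i for i in range(100)]
labels = [0] * 80 + [1] * 20

tr_X, te_X, tr_y, te_y = train_test_split(
    texts, labels,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels)   # keep the 80/20 class ratio in both parts

print(Counter(tr_y))
print(Counter(te_y))
```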

Now convert the text into a bag-of-words (BOW) vector representation.

In [15]:
bow = CountVectorizer().fit(train_data) # create a BOW vectorizer and learn its vocabulary from the training data
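
To show what the fitted vectorizer holds, here is a tiny sketch on two made-up sentences. It prints the learned vocabulary, the count matrix, and what happens to a word that was not seen during fit.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['nike shoes nike', 'apple phone']
vec = CountVectorizer().fit(docs)

# Column order follows the alphabetically sorted vocabulary.
print(sorted(vec.vocabulary_))
# One row per document, one column per vocabulary word.
print(vec.transform(docs).toarray())
# 'adidas' was not seen during fit, so it is silently ignored.
print(vec.transform(['adidas shoes']).toarray())
```

This is also why the vectorizer is fitted on the training data only: words that appear only in the test set simply contribute nothing.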

Train and test Naive Bayes using BOW

In [16]:
bow_train_X = bow.transform(train_data)
bow_test_X = bow.transform(test_data)

bayes_multinomial = MultinomialNB().fit(bow_train_X, train_Y)
predicted = bayes_multinomial.predict(bow_test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.851749357581
             precision    recall  f1-score   support

     adidas       0.96      0.76      0.85       311
      apple       0.82      0.90      0.86       450
     hungry       0.87      0.94      0.90      1034
   kerajaan       0.86      0.81      0.83      1417
       nike       0.89      0.80      0.84       298
pembangkang       0.82      0.85      0.83      1549

avg / total       0.85      0.85      0.85      5059
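
The report shows how well each class is recovered, but not which classes get mixed up with each other. A confusion matrix can show that. The sketch below uses made-up labels, not the predictions from this notebook; on the real data the call would presumably be metrics.confusion_matrix(test_Y, predicted).

```python
from sklearn import metrics

# Made-up true and predicted labels for three classes.
true = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 1]

# Rows are the true classes, columns are the predicted classes.
cm = metrics.confusion_matrix(true, pred)
print(cm)

# Recall per class is the diagonal divided by the row sums.
print(metrics.recall_score(true, pred, average=None))
```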



Now convert the BOW counts into a TF-IDF vector representation.

In [17]:
# TfidfTransformer works on the BOW count matrix, so BOW must be computed first
tfidf = TfidfTransformer().fit(bow_train_X) # create a TF-IDF transformer and learn the IDF weights from the training data
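
As a rough illustration of what the transformer does to the counts, the sketch below uses three made-up sentences that all share the word "news". With the default settings, a word that appears in every document gets a lower weight than a word that appears in only one, and every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['news apple phone', 'news nike shoes', 'news government budget']
vec = CountVectorizer().fit(docs)
counts = vec.transform(docs)

weights = TfidfTransformer().fit(counts).transform(counts).toarray()
idx = vec.vocabulary_  # maps each word to its column index

# 'news' is in all three documents, 'apple' only in the first one.
print(weights[0, idx['news']], weights[0, idx['apple']])
# Each row has Euclidean length 1 (default norm='l2').
print(np.linalg.norm(weights, axis=1))
```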

Train and test Naive Bayes using TF-IDF

In [18]:
# must get data from BOW first
tfidf_train_X = tfidf.transform(bow_train_X)
tfidf_test_X = tfidf.transform(bow_test_X)

bayes_multinomial = MultinomialNB().fit(tfidf_train_X, train_Y)
predicted = bayes_multinomial.predict(tfidf_test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.807076497331
             precision    recall  f1-score   support

     adidas       0.98      0.59      0.74       311
      apple       0.96      0.61      0.75       450
     hungry       0.83      0.91      0.87      1034
   kerajaan       0.86      0.81      0.83      1417
       nike       0.92      0.61      0.73       298
pembangkang       0.71      0.88      0.78      1549

avg / total       0.83      0.81      0.80      5059
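
The comparison above rests on a single random split. One possible way to make it more robust is to wrap each variant in a pipeline and score both with cross-validation, so the vectorizer is refitted on the training part of every fold. This is only a sketch on made-up sentences; the texts, labels and cv=3 are assumptions for illustration and say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up two-class toy corpus, repeated so every fold has enough samples.
wear = ['nike running shoes', 'adidas football boots',
        'new sneakers released', 'running shoes sale'] * 3
tech = ['apple phone launch', 'new laptop released',
        'phone software update', 'apple laptop sale'] * 3
texts = wear + tech
labels = [0] * len(wear) + [1] * len(tech)

bow_model = make_pipeline(CountVectorizer(), MultinomialNB())
tfidf_model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())

# One accuracy score per fold for each variant.
bow_scores = cross_val_score(bow_model, texts, labels, cv=3)
tfidf_scores = cross_val_score(tfidf_model, texts, labels, cv=3)

print('BOW    mean accuracy: %s' % bow_scores.mean())
print('TF-IDF mean accuracy: %s' % tfidf_scores.mean())
```

On the real data you would pass trainset.data and trainset.target instead of the toy lists.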

