Here I show how to use naive Bayes for multi-class classification/discrimination

Import the class sklearn.naive_bayes.MultinomialNB for multinomial naive Bayes (it handles any number of classes and expects non-negative, count-like features; it is not logistic regression)

But if your features are binary/boolean (a word is either present or absent), BernoulliNB is usually the better fit
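
A minimal sketch of the two estimators side by side, on a made-up count matrix that is not part of this dataset. MultinomialNB models how often each term occurs, while BernoulliNB (here with binarize=0.0) only models whether a term occurs at all. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word-count matrix: 4 documents, 3 vocabulary terms.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
y = np.array([0, 0, 1, 1])

# MultinomialNB uses the raw counts.
multinomial = MultinomialNB().fit(X_counts, y)

# BernoulliNB thresholds every feature at 0.0, so only presence/absence is kept.
bernoulli = BernoulliNB(binarize=0.0).fit(X_counts, y)

# A new document that contains only the first term.
query = np.array([[1, 0, 0]])
pred_multinomial = multinomial.predict(query)
pred_bernoulli = bernoulli.predict(query)
print(pred_multinomial, pred_bernoulli)
```

Both estimators accept more than two classes; the choice between them depends on the feature representation, not on the number of classes.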

I will also compare accuracy using BOW, TF-IDF, and hashing as vectorizing techniques

In [1]:
# to get f1 score
from sklearn import metrics
import numpy as np
import sklearn.datasets
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split



Define some helper functions for preprocessing

In [2]:
# clear string
def clearstring(string):
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = ' '.join(string)
    return string

# sklearn.datasets.load_files reads each file as a single string,
# so we split every file on newlines to get one sample per line
def separate_dataset(trainset):
    datastring = []
    datatarget = []
    for i in range(len(trainset.data)):
        data_ = trainset.data[i].split('\n')
        # python3, if python2, just remove list()
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
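
# A quick check of what the helpers above do, as a self-contained sketch.
# The two functions are restated in condensed form so the snippet runs on its own,
# and `fake` is a hypothetical stand-in for the object returned by load_files.

```python
import re
from types import SimpleNamespace

def clearstring(string):
    # Keep letters, digits and spaces only, then collapse repeated spaces.
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    return ' '.join(token for token in string.split(' ') if token)

def separate_dataset(trainset):
    datastring, datatarget = [], []
    for document, label in zip(trainset.data, trainset.target):
        # One sample per non-empty line; every line inherits the file's label.
        lines = [clearstring(line) for line in document.split('\n') if line]
        datastring += lines
        datatarget += [label] * len(lines)
    return datastring, datatarget

# Two made-up "files": the first has two lines, the second has one.
fake = SimpleNamespace(
    data=["I love my new shoes!\nJust do it.\n", "New phone, who dis?\n"],
    target=[0, 1],
)
sentences, labels = separate_dataset(fake)
print(sentences)
print(labels)
```

# Note that the regex also deletes tabs and other whitespace that is not a plain space,
# so words separated only by a tab are joined together.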

I included 6 classes in the local/ directory:
1. adidas (wear)
2. apple (electronic)
3. hungry (status)
4. kerajaan (government related)
5. nike (wear)
6. pembangkang (opposition related)

In [50]:
# change the encoding to match your files
trainset = sklearn.datasets.load_files(container_path = 'local', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset)
print ("List of Classes: %s" %trainset.target_names)
print ("# of Samples: %s" %len(trainset.data))
print ("# of Samples: %s" %len(trainset.target))

List of Classes: ['adidas', 'apple', 'hungry', 'kerajaan', 'nike', 'pembangkang']
# of Samples: 25292
# of Samples: 25292


Change n to see different samples from the dataset

In [47]:
n=10002
print("Sentence: %s" %trainset.data[n])
print("Class: %s" %trainset.target_names[trainset.target[n]])

Sentence: Why did the sandvich cross the road
Class: hungry


So we have 25292 strings and 6 classes

Now convert them into vector representations

In [52]:
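# bag-of-words
bow = CountVectorizer().fit_transform(trainset.data)

# tf-idf, computed from the BOW counts
tfidf = TfidfTransformer().fit_transform(bow)

# hashing; alternate_sign=False keeps values non-negative, as MultinomialNB requires
# (on scikit-learn older than 0.19, use non_negative=True instead)
hashing = HashingVectorizer(alternate_sign=False).fit_transform(trainset.data)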
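
# To see what the three representations look like, here is a small sketch on a made-up
# three-document corpus (not the dataset above). n_features=16 is an arbitrary small value
# chosen only to keep the hashed matrix readable.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfTransformer)

docs = ["nike shoes", "apple phone", "nike nike apple"]

# Bag-of-words: one column per vocabulary term, values are raw counts.
count_vectorizer = CountVectorizer()
bow = count_vectorizer.fit_transform(docs)

# TF-IDF: reweights the counts and L2-normalizes every row by default.
tfidf = TfidfTransformer().fit_transform(bow)

# Hashing: no vocabulary is stored; terms are hashed into a fixed number of columns.
hashing = HashingVectorizer(n_features=16, alternate_sign=False).fit_transform(docs)

vocab = sorted(count_vectorizer.vocabulary_)
print(vocab)
print(bow.toarray())
print(bow.shape, tfidf.shape, hashing.shape)

# Row norms of the TF-IDF matrix, to confirm the normalization.
tfidf_row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
```

# BOW and TF-IDF share the same shape because TF-IDF only rescales the counts,
# while the hashed matrix width is fixed by n_features regardless of the corpus.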

Train naive Bayes on the BOW features

First split the data into a training set (80% of our dataset) and a validation set (20% of our dataset)

In [53]:
train_X, test_X, train_Y, test_Y = train_test_split(bow, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.851749357581
             precision    recall  f1-score   support

     adidas       0.89      0.76      0.82       323
      apple       0.78      0.91      0.84       460
     hungry       0.86      0.95      0.90      1055
   kerajaan       0.86      0.83      0.84      1382
       nike       0.90      0.82      0.86       337
pembangkang       0.85      0.82      0.83      1502

avg / total       0.85      0.85      0.85      5059
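
The columns of the report can be reproduced by hand. The following is a small sketch on made-up labels (not the predictions above) showing that np.mean(predicted == test_Y) is the same quantity as metrics.accuracy_score, and how precision, recall, f1 and support are obtained per class.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for 3 classes.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Accuracy: fraction of predictions that match the true label.
accuracy_manual = np.mean(y_pred == y_true)
accuracy_sklearn = metrics.accuracy_score(y_true, y_pred)

# Per-class scores, in the same order as the rows of classification_report.
precision, recall, f1, support = metrics.precision_recall_fscore_support(y_true, y_pred)

print(accuracy_manual, accuracy_sklearn)
print(precision, recall, f1, support)
```

Here support is simply the number of true samples per class, which is why it differs between runs whenever the random split differs.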



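Train naive Bayes on the TF-IDF features

Again, first split the data into a training set (80% of our dataset) and a validation set (20% of our dataset). Because train_test_split is called without random_state, each experiment gets a different random split, so the per-class support differs between runs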

In [54]:
train_X, test_X, train_Y, test_Y = train_test_split(tfidf, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.801937141728
             precision    recall  f1-score   support

     adidas       0.96      0.57      0.71       311
      apple       0.98      0.58      0.73       477
     hungry       0.79      0.91      0.85      1010
   kerajaan       0.86      0.83      0.85      1408
       nike       0.94      0.56      0.70       314
pembangkang       0.71      0.87      0.78      1539

avg / total       0.82      0.80      0.80      5059



Train naive Bayes on the hashing features

Again, first split the data into a training set (80% of our dataset) and a validation set (20% of our dataset)

In [55]:
train_X, test_X, train_Y, test_Y = train_test_split(hashing, trainset.target, test_size = 0.2)

bayes_multinomial = MultinomialNB().fit(train_X, train_Y)
predicted = bayes_multinomial.predict(test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.776438031231
             precision    recall  f1-score   support

     adidas       0.98      0.53      0.69       348
      apple       1.00      0.47      0.64       483
     hungry       0.91      0.89      0.90      1076
   kerajaan       0.85      0.79      0.82      1360
       nike       0.96      0.54      0.69       321
pembangkang       0.61      0.89      0.72      1471

avg / total       0.82      0.78      0.77      5059
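
The three runs above each use their own random split, so the accuracies are not compared on identical validation samples. One possible way to make the comparison stricter, sketched here on a made-up toy corpus rather than the real dataset, is to split the raw text once with a fixed random_state and wrap each vectorizer in a pipeline so it is fitted on the training part only. TfidfVectorizer is used as a shortcut for CountVectorizer followed by TfidfTransformer; the corpus, the random_state value and the variable names are assumptions.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for trainset.data / trainset.target.
texts = ["nike running shoes", "new nike sneakers",
         "nike shoes sale", "nike sneakers running",
         "apple phone screen", "new apple laptop",
         "apple laptop sale", "apple phone charger"] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 5

# One shared, reproducible, stratified split of the raw text.
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

vectorizers = {
    "bow": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    # alternate_sign=False keeps hashed features non-negative for MultinomialNB.
    "hashing": HashingVectorizer(alternate_sign=False),
}

scores = {}
for name, vectorizer in vectorizers.items():
    # The vectorizer is fitted on the training texts only, inside the pipeline.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_texts, train_y)
    scores[name] = model.score(test_texts, test_y)

print(scores)
```

With the real data, texts and labels would be trainset.data and trainset.target; the numbers reported above would then change, since they came from different splits.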

