## Text Classification using Naive Bayes

Below are 26 lines of a T.S. Eliot poem. A dummy string, "ZZZ" or "XXX", has been appended to each line. The lines form the corpus (each string is one document) that will be used to train a Naive Bayes classifier.

Separate from the corpus is a list of 0s and 1s giving the class of each document: 0 for documents marked ZZZ and 1 for documents marked XXX.

In [41]:
corpus = [
'And indeed there will be time ZZZ',
'For the yellow smoke that slides along the street XXX',
'Rubbing its back upon the window-panes ZZZ',
'There will be time, there will be time ZZZ',
'To prepare a face to meet the faces that you meet XXX',
'There will be time to murder and create ZZZ',
'And time for all the works and days of hands ZZZ',
'That lift and drop a question on your plate ZZZ',
'Time for you and time for me ZZZ',
'And time yet for a hundred indecisions XXX',
'And for a hundred visions and revisions XXX',
'Before the taking of a toast and tea ZZZ.',
'In the room the women come and go XXX',
'Talking of Michelangelo. XXX',
'And indeed there will be time XXX',
'To wonder, "Do I dare?" and, "Do I dare?" ZZZ',
'Time to turn back and descend the stair, ZZZ',
'With a bald spot in the middle of my hair — XXX',
'(They will say: "How his hair is growing thin!") XXX',
'My morning coat, my collar mounting firmly to the chin, ZZZ',
'My necktie rich and modest, but asserted by a simple pin — XXX',
'(They will say: "But how his arms and legs are thin!") ZZZ',
'Do I dare XXX',
'Disturb the universe? XXX',
'In a minute there is time ZZZ',
'For decisions and revisions which a minute will reverse. XXX'
]

targets = [0,1,0,0,1,0,0,0,0,1,1,0,1,1,1,0,0,1,1,0,1,0,1,1,0,1]

We import `CountVectorizer` and use it to tokenize the corpus and transform it into a sparse matrix of counts. Each row corresponds to a document, and each column to a word from the vocabulary (the set of all words appearing in the corpus). Each cell holds the number of times that word occurs in that document.

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(corpus)
print('type:',type(X_train_counts))

print('rows (docs), columns (words)')
print(X_train_counts.shape)

type: <class 'scipy.sparse.csr.csr_matrix'>
rows (docs), columns (words)
(26, 100)
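
To make the counting step concrete, here is a minimal pure-Python sketch of what a count matrix looks like. It is not scikit-learn's implementation; `tokenize` and `count_matrix` are hypothetical helper names. It assumes the same tokenization as `CountVectorizer`'s defaults (lowercase, keep only tokens of two or more word characters), which is why one-letter words such as "a" and "I" never reach the vocabulary.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, tokens of at least two word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and split it into tokens."""
    return TOKEN_RE.findall(doc.lower())

def count_matrix(docs):
    """Return (sorted vocabulary, dense rows of per-document word counts)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {word: i for i, word in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[column[tok]] = n
        rows.append(row)
    return vocab, rows

# Two lines taken from the corpus above.
toy_docs = ['There will be time, there will be time ZZZ', 'Do I dare XXX']
vocab, rows = count_matrix(toy_docs)
print(vocab)
for row in rows:
    print(row)
```

The real `CountVectorizer` stores the same information in a SciPy sparse matrix, which matters once the vocabulary grows to many thousands of words.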


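We can get the list of words extracted from the corpus using `get_feature_names`. (In scikit-learn 1.0 and later this method is deprecated in favor of `get_feature_names_out`, and it was removed in version 1.2.)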

In [43]:
names = count_vect.get_feature_names()
print(names)
print('Number of words:', len(names))

['all', 'along', 'and', 'are', 'arms', 'asserted', 'back', 'bald', 'be', 'before', 'but', 'by', 'chin', 'coat', 'collar', 'come', 'create', 'dare', 'days', 'decisions', 'descend', 'disturb', 'do', 'drop', 'face', 'faces', 'firmly', 'for', 'go', 'growing', 'hair', 'hands', 'his', 'how', 'hundred', 'in', 'indecisions', 'indeed', 'is', 'its', 'legs', 'lift', 'me', 'meet', 'michelangelo', 'middle', 'minute', 'modest', 'morning', 'mounting', 'murder', 'my', 'necktie', 'of', 'on', 'panes', 'pin', 'plate', 'prepare', 'question', 'reverse', 'revisions', 'rich', 'room', 'rubbing', 'say', 'simple', 'slides', 'smoke', 'spot', 'stair', 'street', 'taking', 'talking', 'tea', 'that', 'the', 'there', 'they', 'thin', 'time', 'to', 'toast', 'turn', 'universe', 'upon', 'visions', 'which', 'will', 'window', 'with', 'women', 'wonder', 'works', 'xxx', 'yellow', 'yet', 'you', 'your', 'zzz']
Number of words: 100


Below is the word-count vector for the second line of the poem (the document at index 1), followed by the word assigned to column 1 of the matrix.

In [44]:
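# Dense count vector for the document at index 1 (the second line of the poem).
arr = X_train_counts.toarray()[1]
print(arr)
# Word assigned to column 1 of the matrix.
print(count_vect.get_feature_names()[1])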

[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0
 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
along
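
A raw count vector is hard to read on its own. One way to inspect it is to pair every nonzero count with its word. The sketch below uses a hypothetical helper, `nonzero_words`, and a made-up five-word vocabulary rather than the notebook's data.

```python
def nonzero_words(row, names):
    """Return (word, count) pairs for every column with a nonzero count."""
    return [(names[i], n) for i, n in enumerate(row) if n > 0]

# Hypothetical five-word vocabulary and one document's count vector.
toy_names = ['along', 'for', 'street', 'the', 'zzz']
toy_row = [1, 1, 1, 2, 0]
print(nonzero_words(toy_row, toy_names))
```

In the notebook, the same helper could be called as `nonzero_words(arr, names)`.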


Once the input is vectorized (each document is represented by its vector of word counts), we can train a classifier on it.

We'll use `MultinomialNB`, one of scikit-learn's defined Naive Bayes classifiers. 

In [45]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, targets)
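
For intuition about what the fit computes, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is a sketch under stated assumptions, not scikit-learn's code: `fit_multinomial_nb` and `predict` are hypothetical names, the toy counts are made up, and `alpha=1.0` is chosen to match `MultinomialNB`'s default smoothing parameter.

```python
import math

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Total count of each word over all documents of class c.
        totals = [sum(col) for col in zip(*class_rows)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict(model, row):
    """Return the class with the highest joint log score for one count vector."""
    classes, log_prior, log_likelihood = model
    def score(c):
        return log_prior[c] + sum(n * lw for n, lw in zip(row, log_likelihood[c]))
    return max(classes, key=score)

# Hypothetical counts; the columns stand for the words 'time', 'xxx', 'zzz'.
toy_rows = [[1, 0, 1], [2, 0, 1], [0, 1, 0], [1, 1, 0]]
toy_labels = [0, 0, 1, 1]
model = fit_multinomial_nb(toy_rows, toy_labels)
print(predict(model, [0, 0, 1]))
print(predict(model, [1, 1, 0]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.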

After the classifier has been fitted, we run some new documents through it.

In [46]:
new_data = [
    'For I have known them all already, known them all: ZZZ',
    'Have known the evenings, mornings, afternoons, ZZZ',
    'I have measured out my life with coffee spoons; XXX',
    'I know the voices dying with a dying fall XXX',
    'Beneath the music from a farther room. ZZZ',
    'So how should I presume? ZZZ']
actual = [0,0,1,1,0,0]

X_new_counts = count_vect.transform(new_data)

predicted = clf.predict(X_new_counts)

# zip iterates through new_data and predicted in parallel, forming 2-tuples.
for doc, category in zip(new_data, predicted):
    print(f'{doc} => {category}')


For I have known them all already, known them all: ZZZ => 0
Have known the evenings, mornings, afternoons, ZZZ => 0
I have measured out my life with coffee spoons; XXX => 1
I know the voices dying with a dying fall XXX => 1
Beneath the music from a farther room. ZZZ => 0
So how should I presume? ZZZ => 0


In [47]:
import numpy as np
print(predicted == actual)
print("accuracy", np.mean(predicted == actual))

[ True  True  True  True  True  True]
accuracy 1.0
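
Note that `predicted == actual` compares element by element only because `predicted` is a NumPy array; comparing two plain lists with `==` would produce a single boolean. A minimal sketch of the same accuracy calculation without NumPy follows, with `accuracy` as a hypothetical helper name.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# The six predictions printed above, compared with the `actual` list.
print(accuracy([0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0]))
# A hypothetical run with one mistake, for contrast.
print(accuracy([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0]))
```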


## A larger example

Taken from: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [48]:
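from sklearn.datasets import fetch_20newsgroups

# Subset of groups used in the scikit-learn tutorial. Note: this list is not
# passed to fetch_20newsgroups below, so all 20 newsgroups are loaded.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
    shuffle=True, random_state=42)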

In [49]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [50]:
print(len(twenty_train.data))
print(len(twenty_train.filenames))

11314
11314


In [51]:
print(twenty_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

In [52]:
twenty_train.target[:100]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4,  8, 19,  4, 14,  6,  0,  1,
        7, 12,  5,  0, 10,  6,  2,  4,  1, 12,  9, 15,  7,  6, 13, 12, 17,
       18, 10,  8, 11,  8, 16,  9,  4,  3,  9,  9,  4,  4,  8, 12, 14,  5,
       15,  2, 13, 17, 11,  7, 10,  2, 14, 12,  5,  4,  6,  7,  0, 11, 16,
        0,  6, 17,  7, 12,  7,  3, 12, 11,  7,  2,  2,  0, 16,  1,  2,  7,
        3,  2,  1, 10, 12, 12, 17, 12,  2,  8,  8, 18,  5,  0,  1])

In [53]:
for t in twenty_train.target[:50]:
    print(twenty_train.target_names[t])

rec.autos
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.graphics
sci.space
talk.politics.guns
sci.med
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
comp.sys.mac.hardware
rec.motorcycles
talk.religion.misc
comp.sys.mac.hardware
sci.space
misc.forsale
alt.atheism
comp.graphics
rec.autos
sci.electronics
comp.windows.x
alt.atheism
rec.sport.hockey
misc.forsale
comp.os.ms-windows.misc
comp.sys.mac.hardware
comp.graphics
sci.electronics
rec.sport.baseball
soc.religion.christian
rec.autos
misc.forsale
sci.med
sci.electronics
talk.politics.mideast
talk.politics.misc
rec.sport.hockey
rec.motorcycles
sci.crypt
rec.motorcycles
talk.politics.guns
rec.sport.baseball
comp.sys.mac.hardware
comp.sys.ibm.pc.hardware
rec.sport.baseball
rec.sport.baseball
comp.sys.mac.hardware
comp.sys.mac.hardware
rec.motorcycles
sci.electronics
sci.space
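
Rather than printing one name per document, the labels can be summarized by counting how often each newsgroup appears. The sketch below is self-contained: it copies the 20 target names and the first 17 target values from the outputs above, and uses `collections.Counter` from the standard library.

```python
from collections import Counter

# Copied from the twenty_train.target_names output above.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]
# First 17 values of twenty_train.target, copied from the output above.
first_targets = [7, 4, 4, 1, 14, 16, 13, 3, 2, 4, 8, 19, 4, 14, 6, 0, 1]

# Map each integer label to its newsgroup name and count the occurrences.
counts = Counter(target_names[t] for t in first_targets)
for name, n in counts.most_common(3):
    print(name, n)
```

In the notebook, passing `twenty_train.target` in place of `first_targets` would give the class balance of the full training set.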


In [54]:
from sklearn.feature_extraction.text import CountVectorizer
# Fit a fresh vectorizer on the newsgroup posts (this replaces the poem vocabulary).
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(type(X_train_counts))
X_train_counts.shape

<class 'scipy.sparse.csr.csr_matrix'>


(11314, 130107)

In [55]:
names = count_vect.get_feature_names()
# Print 50 consecutive words starting one third of the way through the sorted vocabulary.
n = len(names) // 3
print(names[n:n+50])

['crescendo', 'crescent', 'crescenta', 'crescentville', 'crest', 'cretaceous', 'cretainly', 'crete', 'cretinous', 'cretins', 'crew', 'crews', 'crg1', 'crg8', 'crgruen', 'crh', 'crhai', 'crhc', 'cri', 'crib', 'cribs', 'crica', 'crichar', 'crichmon', 'crichton', 'crick', 'cricket', 'cried', 'cries', 'criiterion', 'crim', 'crime', 'crimea', 'crimean', 'crimes', 'crimestrike', 'crimial', 'criminal', 'criminality', 'criminalization', 'criminalize', 'criminalized', 'criminally', 'criminals', 'criminological', 'criminologist', 'criminologists', 'criminology', 'crimp', 'crimper']


In [56]:
# Column index assigned to the word 'bald' in the count matrix.
count_vect.vocabulary_.get('bald')

31636
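
The `vocabulary_` attribute behaves like a dictionary from word to column index. The sketch below imitates that mapping with a hypothetical four-word vocabulary (`to_counts` is also a hypothetical helper) and shows why words never seen during fitting contribute nothing when new documents are transformed.

```python
# Hypothetical sorted vocabulary; in the notebook this role is played by
# count_vect.vocabulary_.
toy_names = ['bald', 'hair', 'thin', 'time']
vocabulary = {word: i for i, word in enumerate(toy_names)}

def to_counts(tokens, vocabulary):
    """Count known tokens per column; tokens outside the vocabulary are skipped."""
    row = [0] * len(vocabulary)
    for tok in tokens:
        col = vocabulary.get(tok)  # None when the word was never seen
        if col is not None:
            row[col] += 1
    return row

print(vocabulary.get('bald'))
print(vocabulary.get('coffee'))
print(to_counts(['bald', 'coffee', 'hair', 'hair'], vocabulary))
```

This is also why a new document made up entirely of unseen words carries no evidence from its counts: its row is all zeros.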

In [57]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, twenty_train.target)

In [58]:
docs_new = [
    'God is love', 
    'The gods are against us', 
    'blood red blood plasma',
    'OpenGL on the GPU is fast',
    "agnosticism",
    "doubt",
    "RAM",
    "Radio Shack",
    "Atari",
    "Brady",
    "Saturn",
    "Jupiter",
]
X_new_counts = count_vect.transform(docs_new)

predicted = clf.predict(X_new_counts)

for doc, category in zip(docs_new, predicted):
    print(f'{doc} => {twenty_train.target_names[category]}')

God is love => soc.religion.christian
The gods are against us => alt.atheism
blood red blood plasma => sci.med
OpenGL on the GPU is fast => comp.graphics
agnosticism => alt.atheism
doubt => soc.religion.christian
RAM => comp.sys.mac.hardware
Radio Shack => sci.electronics
Atari => sci.electronics
Brady => talk.politics.guns
Saturn => rec.autos
Jupiter => sci.space
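
Several of these inputs are single words, so each prediction rests on very little evidence. One way to see how decisive a prediction is would be to turn the per-class log scores into probabilities. The sketch below does this with the log-sum-exp trick; the helper name `posteriors` and the scores are hypothetical, not values produced by the classifier above. In scikit-learn, `clf.predict_proba` returns such probabilities directly.

```python
import math

def posteriors(log_scores):
    """Normalize per-class joint log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}

# Hypothetical log scores for one short document (illustrative values only).
hypothetical_scores = {'rec.autos': -12.1, 'sci.space': -12.4, 'sci.med': -15.0}
probs = posteriors(hypothetical_scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```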
