LIN 373 UT Austin :: Jessy Li

## Naive Bayes

Let's encode the Naive Bayes example we used in class into the count table shown on the slides.

In [1]:
## Here's the data. Let's pretend these are grammatical sentences.
docs_train = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]
# two classes 1 and 0
Y_train = [1, 1, 1, 0]

docs_test = ["Chinese Chinese Chinese Tokyo Japan"]

In [2]:
## First, we need to tokenize each document (here, by splitting on whitespace)
docs_train_tokenized = [doc.split() for doc in docs_train]
docs_test_tokenized = [doc.split() for doc in docs_test]

print(docs_train_tokenized, docs_test_tokenized)


[['Chinese', 'Beijing', 'Chinese'], ['Chinese', 'Chinese', 'Shanghai'], ['Chinese', 'Macao'], ['Tokyo', 'Japan', 'Chinese']] [['Chinese', 'Chinese', 'Chinese', 'Tokyo', 'Japan']]


So how do we put words into a table?
First, we need to create that table. The rows are just the examples. But we need to come up with the columns.
We need to assign each word to a column number!
We do that by creating a dictionary to map from word to a unique column id:

In [4]:
word_to_col_id = {}
for doc in docs_train_tokenized:
    for word in doc:
        if word not in word_to_col_id:
            word_to_col_id[word] = len(word_to_col_id)
print(word_to_col_id)

{'Chinese': 0, 'Beijing': 1, 'Shanghai': 2, 'Macao': 3, 'Tokyo': 4, 'Japan': 5}


Now, we can make the table, and fill it up!

In [5]:
import numpy as np

X_train = np.zeros((len(docs_train),len(word_to_col_id)))

# column represents the word
# row represents the doc
for i, doc in enumerate(docs_train_tokenized):
    for word in doc:
        rowid = i
        colid = word_to_col_id[word]
        X_train[rowid][colid] += 1

print(X_train)

[[2. 1. 0. 0. 0. 0.]
 [2. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 1.]]


We do the same for the test docs, skipping any word that is not in the training vocabulary.

In [6]:
X_test = np.zeros((len(docs_test),len(word_to_col_id)))

for i, doc in enumerate(docs_test_tokenized):
    for word in doc:
        rowid = i
        if word in word_to_col_id:
            colid = word_to_col_id[word]
            X_test[rowid][colid] += 1

print(X_test)

[[3. 0. 0. 0. 1. 1.]]


In [7]:
## Implementing NB from scratch!
from collections import defaultdict

# prior[c] holds P(c), the prior probability of class c.
# condprob[tid][c] holds P(word | c) for the word in column tid (a nested dict).
# testprob[c] holds the unnormalized score of class c for the test document.

prior = [0, 0]
condprob = defaultdict(dict)
testprob = [0, 0]

for c in [0, 1]:
    print('class', c)
    prior[c] = Y_train.count(c) / len(Y_train)
    # With add-one smoothing:
    # P(w|c) = (# of times w appeared in class c + 1) / (# of words in class c + vocabulary size)
    # denominator: total number of word tokens in class c
    countallc = sum([sum(X_train[i]) for i, y in enumerate(Y_train) if y == c])
    # numerator: number of times each word appeared in class c
    for token, tid in word_to_col_id.items():
        count_tc = sum([X_train[i][tid] for i, y in enumerate(Y_train) if y == c])
        condprob[tid][c] = (count_tc + 1) / (countallc + len(word_to_col_id))
        print(token, ';', tid, ';', count_tc, ';', f'{condprob[tid][c]:.4f}')

    testprob[c] = prior[c]
    for i, x in enumerate(X_test[0]):
        # i is the word's column id, x is the number of times it occurred
        testprob[c] = testprob[c] * (condprob[i][c] ** x)

    print('class', c, f'{testprob[c]:.3e}')
    print()

for c in [0, 1]:
    print('probabilities', f'{testprob[c] / sum(testprob):.4f}')

class 0
Chinese ; 0 ; 1.0 ; 0.2222
Beijing ; 1 ; 0.0 ; 0.1111
Shanghai ; 2 ; 0.0 ; 0.1111
Macao ; 3 ; 0.0 ; 0.1111
Tokyo ; 4 ; 1.0 ; 0.2222
Japan ; 5 ; 1.0 ; 0.2222
class 0 1.355e-04

class 1
Chinese ; 0 ; 5.0 ; 0.4286
Beijing ; 1 ; 1.0 ; 0.1429
Shanghai ; 2 ; 1.0 ; 0.1429
Macao ; 3 ; 1.0 ; 0.1429
Tokyo ; 4 ; 0.0 ; 0.0714
Japan ; 5 ; 0.0 ; 0.0714
class 1 3.012e-04

probabilities 0.3102
probabilities 0.6898
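
Written out by hand, the add-one smoothed estimate that the loop computes, and the resulting scores for the test document "Chinese Chinese Chinese Tokyo Japan", look as follows. This is a worked sketch using the training data above: class 1 contains 8 word tokens, class 0 contains 3, and the vocabulary size is |V| = 6. The symbol n_w (the number of times word w occurs in the test document) is notation introduced here for illustration.

```latex
\begin{align*}
P(w \mid c) &= \frac{\operatorname{count}(w, c) + 1}{\sum_{w'} \operatorname{count}(w', c) + |V|} \\
P(\text{Chinese} \mid 1) &= \frac{5 + 1}{8 + 6} = \frac{3}{7}, \qquad
P(\text{Tokyo} \mid 1) = P(\text{Japan} \mid 1) = \frac{0 + 1}{8 + 6} = \frac{1}{14} \\
P(\text{Chinese} \mid 0) &= P(\text{Tokyo} \mid 0) = P(\text{Japan} \mid 0) = \frac{1 + 1}{3 + 6} = \frac{2}{9} \\
P(1) \prod_w P(w \mid 1)^{n_w} &= \frac{3}{4} \left(\frac{3}{7}\right)^{3} \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.000301 \\
P(0) \prod_w P(w \mid 0)^{n_w} &= \frac{1}{4} \left(\frac{2}{9}\right)^{3} \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.000135
\end{align*}
```

Dividing each score by their sum gives roughly 0.69 for class 1 and 0.31 for class 0.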


What if there is a new word in the testing docs?
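
One possible answer, and the one the `if word in word_to_col_id` check in the test-vector cell already implies, is to skip any word that never appeared in training, since there is no column (and no estimated probability) for it. The sketch below illustrates that choice under stated assumptions: the helper `vectorize` and the unseen word "Seoul" are hypothetical and added only for illustration.

```python
# Vocabulary copied from the hand-built word_to_col_id for the training docs.
word_to_col_id = {'Chinese': 0, 'Beijing': 1, 'Shanghai': 2,
                  'Macao': 3, 'Tokyo': 4, 'Japan': 5}

def vectorize(doc, vocab):
    """Count in-vocabulary words; skip words never seen in training."""
    counts = [0] * len(vocab)
    skipped = []
    for word in doc.split():
        if word in vocab:
            counts[vocab[word]] += 1
        else:
            skipped.append(word)  # no column exists for this word
    return counts, skipped

# "Seoul" is a made-up unseen word added for illustration.
counts, skipped = vectorize("Chinese Chinese Chinese Tokyo Japan Seoul",
                            word_to_col_id)
print(counts)   # [3, 0, 0, 0, 1, 1]
print(skipped)  # ['Seoul']
```

With this choice the unseen word has no effect on either class score, so the prediction is the same as for the document without it.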


### Using a tool

The tokenization and corpus-to-table steps above are implemented by sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). Note that by default it lowercases every token and orders the columns alphabetically, so its column ids differ from our hand-built `word_to_col_id`.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

X_train_v = vectorizer.fit_transform(docs_train)


In [None]:
print(vectorizer.get_feature_names_out())

['beijing' 'chinese' 'japan' 'macao' 'shanghai' 'tokyo']


In [None]:
print(X_train_v.toarray())

[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]


In [None]:
## vectorizer uses sparse encoding
print(X_train_v)

  (0, 1)	2
  (0, 0)	1
  (1, 1)	2
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (3, 1)	1
  (3, 5)	1
  (3, 2)	1
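
To see what the sparse encoding stores, the following sketch converts the dense table printed earlier into (row, column) count entries by keeping only the nonzero cells. It is a plain-Python illustration of the idea, not how scipy implements it internally, and the order of the entries may differ from the printout above.

```python
# Dense count table copied from the X_train_v.toarray() output.
dense = [[1, 2, 0, 0, 0, 0],
         [0, 2, 0, 0, 1, 0],
         [0, 1, 0, 1, 0, 0],
         [0, 1, 1, 0, 0, 1]]

# Keep only the nonzero cells as (row, column, count) triples.
triples = [(i, j, v)
           for i, row in enumerate(dense)
           for j, v in enumerate(row)
           if v != 0]

for i, j, v in triples:
    print((i, j), v)

total_cells = len(dense) * len(dense[0])
print(len(triples), 'of', total_cells, 'cells stored')  # 9 of 24 cells stored
```

With a realistic vocabulary of tens of thousands of words, almost every cell is zero, which is why storing only the nonzero entries saves so much memory.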


In [None]:
## get the test vector
X_test_v = vectorizer.transform(docs_test)

print(X_test_v.toarray())

[[0 3 1 0 0 1]]


## Running the Naive Bayes model

See the documentation for `MultinomialNB`: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_v, Y_train)

In [None]:
print(model.predict_proba(X_test_v))

[[0.31024139 0.68975861]]
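
These probabilities can be reproduced by hand. The sketch below assumes the `MultinomialNB` defaults (add-one smoothing via `alpha=1.0`, and class priors estimated from the training labels) and works in log space, which avoids underflow when documents are long. All variable names are illustrative, and the counts are copied from the training data in this notebook.

```python
import math

# Word counts per class, copied from the four training docs (lowercased).
class_counts = {
    0: {'chinese': 1, 'tokyo': 1, 'japan': 1},
    1: {'chinese': 5, 'beijing': 1, 'shanghai': 1, 'macao': 1},
}
n_docs = {0: 1, 1: 3}  # number of training docs per class
vocab = ['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
test_counts = {'chinese': 3, 'tokyo': 1, 'japan': 1}

log_scores = {}
for c in (0, 1):
    total = sum(class_counts[c].values())
    # Start from the log prior log P(c).
    score = math.log(n_docs[c] / sum(n_docs.values()))
    for word, n in test_counts.items():
        # Add-one smoothed P(word | c), weighted by the word's count.
        p = (class_counts[c].get(word, 0) + 1) / (total + len(vocab))
        score += n * math.log(p)
    log_scores[c] = score

# Normalize with the log-sum-exp trick to get posterior probabilities.
m = max(log_scores.values())
log_norm = m + math.log(sum(math.exp(s - m) for s in log_scores.values()))
probs = [math.exp(log_scores[c] - log_norm) for c in (0, 1)]
print([round(p, 4) for p in probs])  # [0.3102, 0.6898]
```

The result agrees with `predict_proba` to the printed precision, and class 1 has the higher probability.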


In [None]:
print(model.predict(X_test_v))

[1]
