In [216]:
import numpy as np

If possible, install scikit-learn version 1.3.2 so that results are consistent across environments.

In [217]:
#!pip3 install scikit-learn==1.3.2

In [218]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.3.2.


## Naive Bayes
From the 20 Newsgroups dataset, we fetch the documents belonging to three categories and use these categories as classes.

In [219]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'talk.politics.guns',
              'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

For example, the first document in the training data is the following one:

In [220]:
print(train.data[0])

From: fcrary@ucsu.Colorado.EDU (Frank Crary)
Subject: Re: Riddle me this...
Nntp-Posting-Host: ucsu.colorado.edu
Organization: University of Colorado, Boulder
Distribution: usa
Lines: 16

In article <1r1lp1INN752@mojo.eng.umd.edu> chuck@eng.umd.edu (Chuck Harris - WA3UQV) writes:
>>If so, why was CS often employed against tunnels in Vietnam?

>CS "tear-gas" was used in Vietnam because it makes you wretch so hard that
>your stomach comes out thru your throat.  Well, not quite that bad, but
>you can't really do much to defend yourself while you are blowing cookies.

I think the is BZ gas, not CS or CN. BZ gas exposure results in projectile
vomiting, loss of essentially all muscle control, inability to concentrate
or think rationally and fatal reactions in a significant fraction of
the population. For that reason its use is limited to military
applications.

                                                          Frank Crary
                                                          CU B

The target vector encodes the class of each document as an index from zero to two. The target names tell us which index belongs to which class.

In [221]:
y_train = train.target
y_train

array([2, 2, 1, ..., 1, 2, 2], dtype=int64)

In [222]:
train.target_names

['alt.atheism', 'sci.space', 'talk.politics.guns']

We represent the documents in a bag-of-words format. That is, we create a data matrix ``D`` such that ``D[j,i]=1`` if the j-th document contains the i-th feature (word), and ``D[j,i]=0`` otherwise.

In [223]:
from sklearn.feature_extraction.text import CountVectorizer
# Raw string so that \W and \d reach the regex engine as written.
vectorizer = CountVectorizer(stop_words="english", min_df=5, token_pattern=r"[^\W\d_]+", binary=True)
D = vectorizer.fit_transform(train.data)
D_test = vectorizer.transform(test.data)
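
To make the definition of ``D`` concrete, here is a minimal sketch that builds a binary bag-of-words matrix by hand for a tiny corpus. The corpus and the names `docs`, `index`, and `D_toy` are made up for illustration. The sketch reuses the token pattern from the cell above but, unlike `CountVectorizer` as configured there, applies neither stop-word removal nor the `min_df` threshold.

```python
import re
import numpy as np

# Hypothetical toy corpus, for illustration only.
docs = ["Space shuttle launch 1993", "The shuttle: a space_probe?", "guns and gun laws"]

# Same token pattern as in the notebook: runs of letters, no digits or underscores.
token_re = re.compile(r"[^\W\d_]+")
tokens = [set(token_re.findall(doc.lower())) for doc in docs]

# Sorted vocabulary and the word -> column index mapping.
vocab = sorted(set().union(*tokens))
index = {word: i for i, word in enumerate(vocab)}

# D_toy[j, i] = 1 if document j contains word i, and 0 otherwise.
D_toy = np.zeros((len(docs), len(vocab)), dtype=int)
for j, words in enumerate(tokens):
    for word in words:
        D_toy[j, index[word]] = 1

print(vocab)
print(D_toy)
```

Note that `1993` is dropped and `space_probe` is split into `space` and `probe`, because the pattern excludes digits and underscores.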

The following array contains the vocabulary and thus gives the mapping from feature indices to words.

In [224]:
vectorizer.get_feature_names_out()

array(['aa', 'aario', 'aaron', ..., 'zoology', 'zv', 'ÿ'], dtype=object)

For example, the word `naive` has the index 4044.

In [225]:
np.where(vectorizer.get_feature_names_out() == 'naive')[0]

array([4044], dtype=int64)

In [226]:
# 5a: class priors p(y = c), estimated as the fraction of training documents in class c
y_train_0 = y_train[y_train == 0]
y_train_1 = y_train[y_train == 1]
y_train_2 = y_train[y_train == 2]

p_train_0 = y_train_0.size / y_train.size
p_train_1 = y_train_1.size / y_train.size
p_train_2 = y_train_2.size / y_train.size

p_train_0, p_train_1, p_train_2

(0.2964793082149475, 0.3662754786905497, 0.3372452130945028)
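
The same priors can be obtained in one step with `np.bincount`, which counts how often each class index occurs. The sketch below uses a small made-up label vector `y_toy` rather than the newsgroup targets.

```python
import numpy as np

# Hypothetical label vector with three classes, for illustration only.
y_toy = np.array([2, 2, 1, 0, 1, 2, 1, 1])

# Number of documents per class, then relative frequencies as priors.
class_sizes = np.bincount(y_toy, minlength=3)
priors = class_sizes / y_toy.size
print(priors)
```

Applied to the notebook data, `np.bincount(y_train) / y_train.size` should reproduce the three values above as a single array.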

In [227]:
# 5b
alpha = 1e-5
I_0 = np.where(y_train == 0)[0]
I_1 = np.where(y_train == 1)[0]
I_2 = np.where(y_train == 2)[0]

# Per class, count the training documents that contain the word 'naive' (feature index 4044).
class_counts = {c: int(D[I_c, 4044].sum()) for c, I_c in enumerate((I_0, I_1, I_2))}

K = vectorizer.get_feature_names_out().size

# Smoothed estimates of p(x_naive = 1 | y = c); named separately from the priors of 5a.
p_naive_0 = (class_counts[0] + alpha) / (I_0.size + alpha * K)
p_naive_1 = (class_counts[1] + alpha) / (I_1.size + alpha * K)
p_naive_2 = (class_counts[2] + alpha) / (I_2.size + alpha * K)

np.log(np.array([p_naive_0, p_naive_1, p_naive_2]))

array([-4.56448951, -6.38530041, -4.91644811])
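
The computation in 5b can be wrapped in a small helper that handles all classes at once. This is a sketch, not part of the original solution: the function name `smoothed_word_log_probs` and the toy data are assumptions, and `K` is passed through unchanged so that it can be set to the vocabulary size as in the cell above.

```python
import numpy as np

def smoothed_word_log_probs(x, y, n_classes, alpha, K):
    """Return log p(x = 1 | y = c) for every class c, with additive smoothing.

    x is a binary 1-D array (one entry per document) for a single word,
    y holds the class index of each document.
    """
    log_probs = np.empty(n_classes)
    for c in range(n_classes):
        in_class = (y == c)
        n_docs = in_class.sum()          # documents in class c
        n_with_word = x[in_class].sum()  # of those, documents containing the word
        log_probs[c] = np.log((n_with_word + alpha) / (n_docs + alpha * K))
    return log_probs

# Hypothetical toy data: six documents, three classes, vocabulary size 3.
x_toy = np.array([1, 0, 1, 1, 0, 0])
y_toy = np.array([0, 0, 1, 1, 1, 2])
log_probs = smoothed_word_log_probs(x_toy, y_toy, n_classes=3, alpha=1.0, K=3)
print(np.exp(log_probs))
```

For the notebook data, the column for `naive` could be passed as `D[:, 4044].toarray().ravel()`, since `D` is a sparse matrix.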

In [228]:
# 5c
alpha = 1e-5
I_0 = np.where(y_train == 0)[0]
I_1 = np.where(y_train == 1)[0]
I_2 = np.where(y_train == 2)[0]

# log p(y=c)
y_train_0 = np.array([x for x in y_train if x == 0])
y_train_1 = np.array([x for x in y_train if x == 1])
y_train_2 = np.array([x for x in y_train if x == 2])

log_p_train_0 = np.log(y_train_0.size / y_train.size)
log_p_train_1 = np.log(y_train_1.size / y_train.size)
log_p_train_2 = np.log(y_train_2.size / y_train.size)

# counts[w, c] = number of class-c training documents that contain word w
num_words = vectorizer.get_feature_names_out().size
train_size = y_train.size
# Summing the binary rows of each class replaces the slow double loop over words and documents.
counts = np.column_stack([np.asarray(D[I_c].sum(axis=0)).ravel() for I_c in (I_0, I_1, I_2)])

# log p(y=c) + sum of log p(x_w = 1 | y = c) over the words w contained in the first training document
for w in range(num_words):
    if D[0, w] == 1:
        log_p_train_0 += np.log((counts[w, 0] + alpha) / (I_0.size + alpha * num_words))
        log_p_train_1 += np.log((counts[w, 1] + alpha) / (I_1.size + alpha * num_words))
        log_p_train_2 += np.log((counts[w, 2] + alpha) / (I_2.size + alpha * num_words))

log_p_train_0, log_p_train_1, log_p_train_2

(-362.54772451864125, -381.4478315382202, -193.47950987244982)
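
The largest of the three values belongs to class 2 (`talk.politics.guns`), which matches the label `y_train[0]`. The sketch below packages the steps of 5c into one function and takes the `argmax` as the predicted class. The function name `log_joint` and the toy data are assumptions for illustration. As in the cell above, only the words present in the document contribute to the sum, and the smoothing denominator uses the vocabulary size.

```python
import numpy as np

def log_joint(D, y, x, n_classes, alpha):
    """Return log p(y=c) + sum of log p(x_w = 1 | y = c) over words present in x."""
    K = D.shape[1]                 # vocabulary size, used in the smoothing term
    present = x.astype(bool)       # words contained in the document
    scores = np.empty(n_classes)
    for c in range(n_classes):
        in_class = (y == c)
        n_docs = in_class.sum()
        counts = D[in_class].sum(axis=0)  # per-word document counts in class c
        log_cond = np.log((counts + alpha) / (n_docs + alpha * K))
        scores[c] = np.log(n_docs / y.size) + log_cond[present].sum()
    return scores

# Hypothetical dense toy data: four documents, three words, two classes.
D_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
y_toy = np.array([0, 0, 1, 1])

scores = log_joint(D_toy, y_toy, D_toy[0], n_classes=2, alpha=1e-5)
predicted = int(np.argmax(scores))
print(scores, predicted)
```

The function expects a dense array. For the sparse notebook matrix, a dense copy via `D.toarray()` is one option, provided it fits in memory.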