## Email Similarity: implementing a Naive Bayes classifier on different datasets

In [42]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import plotly.graph_objects as go
import numpy as np

First, download the full dataset.

In [18]:
data = fetch_20newsgroups()

In [19]:
print(data.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [31]:
len(data.target_names)

20

We want only the baseball and hockey categories

In [9]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

Verifying that the categories are the correct ones

In [10]:
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


Our interest in this dataset is to see how well the NB classifier can tell a baseball email from a hockey email. Here is one email from the dataset (index 5).

In [12]:
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [13]:
print(emails.target[5])

1


In [16]:
print(emails.target_names[1])

rec.sport.hockey


The label of this email is 1, which maps to `rec.sport.hockey`, so it is a hockey-related email.

## Creating the model

Loading the predefined training and test subsets

In [20]:
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)

In [21]:
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

We want to transform these emails into lists of word counts. We can use the CountVectorizer for that.

In [22]:
counter = CountVectorizer()

We need to tell `counter` which words can occur in our emails, so we fit it on both the training and the test emails.

In [23]:
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

With the vocabulary fitted, we can count how often each word occurs in every training email.

In [24]:
train_counts = counter.transform(train_emails.data)

And do the same with the test set

In [26]:
test_counts = counter.transform(test_emails.data)
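
To make the two steps above concrete, here is a minimal pure-Python sketch of roughly what `fit` and `transform` do with the default settings shown in the `CountVectorizer` output above (lowercasing, and the token pattern `\b\w\w+\b`, which keeps tokens of two or more word characters). The helper names `tokenize`, `fit_vocabulary` and `transform_counts` and the toy emails are made up for illustration and are not part of scikit-learn; the real `CountVectorizer` returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Same idea as the default token pattern: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
    """Map every distinct token to a column index (alphabetical order)."""
    words = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def transform_counts(documents, vocabulary):
    """Turn each document into a row of per-word counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            if word in vocabulary:  # words outside the vocabulary are dropped
                row[vocabulary[word]] = n
        rows.append(row)
    return rows

toy_emails = ["The puck hit the net", "The bat hit the ball"]
toy_vocabulary = fit_vocabulary(toy_emails)
toy_counts = transform_counts(toy_emails, toy_vocabulary)

print(sorted(toy_vocabulary))  # column order
print(toy_counts)              # one row of counts per email
```

Each row has one column per vocabulary word, which is the shape of input that the classifier below is trained on.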

Time to create the NB Classifier!

In [27]:
classifier = MultinomialNB()

Fitting the classifier with our data and the labels associated with it

In [28]:
classifier.fit(train_counts, train_emails.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Testing the classifier's accuracy on the test set

In [29]:
classifier.score(test_counts, test_emails.target)

0.9723618090452262
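
For intuition about what `MultinomialNB` does with those word counts, the following is a small from-scratch sketch of multinomial Naive Bayes with add-one smoothing (matching `alpha=1.0` in the output above). It is an illustration on made-up toy data, not scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical helper names.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: one class id per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        # Prior: fraction of training documents in this class.
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        # Smoothed denominator: class token total plus alpha per vocab word.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log posterior for a token list."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens never seen during training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data: class 0 = baseball, class 1 = hockey (made up).
toy_docs = [
    ["pitcher", "bat", "inning"],
    ["bat", "home", "run"],
    ["puck", "goalie", "ice"],
    ["ice", "rink", "puck"],
]
toy_labels = [0, 0, 1, 1]

toy_model = train_nb(toy_docs, toy_labels)
prediction = predict_nb(toy_model, ["puck", "ice", "bat"])
print(prediction)
```

The email mentions "bat" once, but "puck" and "ice" are much more likely under the hockey class, so the hockey label wins.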

## Testing different categories

In [32]:
def classifierTester(cat1, cat2):
    train_emails = fetch_20newsgroups(categories = [cat1, cat2], subset='train', shuffle=True, random_state=108)
    test_emails = fetch_20newsgroups(categories = [cat1, cat2], subset='test', shuffle=True, random_state=108)
    
    counter = CountVectorizer()
    counter.fit(test_emails.data + train_emails.data)
    
    train_counts = counter.transform(train_emails.data)
    test_counts = counter.transform(test_emails.data)
    
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_emails.target)
    
    return classifier.score(test_counts, test_emails.target)

In [34]:
score_list = []

for first_cat in data.target_names:
    for second_cat in data.target_names:
        score_list.append(classifierTester(first_cat, second_cat))
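
The nested loop above trains 400 classifiers: every pair is visited in both orders, and each category is also paired with itself. Assuming the score does not depend on the order of the two categories (the printed values below suggest this, but it is an assumption), a sketch like the following needs only the 190 unordered pairs. `pairwise_scores` and `toy_score` are hypothetical names; `toy_score` is a stand-in so the sketch runs without downloading any data.

```python
from itertools import combinations

def pairwise_scores(names, score_fn):
    """Score each unordered pair once and mirror it into a square matrix."""
    n = len(names)
    # NaN on the diagonal: a category paired with itself is not a real test.
    matrix = [[float("nan")] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = score_fn(names[i], names[j])
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix

# Toy stand-in for classifierTester (made-up scores, for illustration only).
def toy_score(cat1, cat2):
    return 0.5 if cat1[0] == cat2[0] else 0.9

toy_names = ["comp.graphics", "comp.windows.x", "rec.autos"]
toy_matrix = pairwise_scores(toy_names, toy_score)
print(toy_matrix[0][1], toy_matrix[0][2])
```

With the notebook's own function this could be called as `pairwise_scores(data.target_names, classifierTester)`, which would produce a 20 x 20 matrix directly.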


In [35]:
print(score_list)

[1.0, 0.9788135593220338, 0.6886395511921458, 0.9915611814345991, 0.9914772727272727, 0.9943977591036415, 0.9887165021156559, 0.9916083916083916, 0.9902370990237099, 0.9874301675977654, 0.9874651810584958, 0.9874125874125874, 0.9901685393258427, 0.9734265734265735, 0.9775596072931276, 0.9651324965132496, 0.9853587115666179, 0.9482014388489208, 0.9570747217806042, 0.8491228070175438, 0.9788135593220338, 1.0, 0.5006385696040868, 0.939820742637644, 0.9431524547803618, 0.860969387755102, 0.975609756097561, 0.9796178343949045, 0.9860228716645489, 0.9847328244274809, 0.9961928934010152, 0.9643312101910828, 0.9232736572890026, 0.9668789808917198, 0.9719029374201787, 0.9796696315120712, 0.9827357237715804, 0.9816993464052287, 0.9814020028612304, 0.975, 0.6886395511921458, 0.5006385696040868, 1.0, 0.5, 0.4980744544287548, 0.5044359949302915, 0.5790816326530612, 0.589873417721519, 0.6136363636363636, 0.7155499367888748, 0.7919293820933165, 0.5822784810126582, 0.5349428208386277, 0.59620253164556

In [37]:
print(len(score_list))

400


In [44]:
score_list = np.reshape(score_list, (20, 20))

In [47]:
fig = go.Figure(go.Heatmap(z=score_list))
fig.update_layout(height=800, width=800)
fig.show()

### Lowest-scoring category pairs

In [68]:
low_scores = score_list[score_list<0.8]
low_scores_indices = np.nonzero(score_list<0.8)

In [88]:
def printScorePairs(indices):
    count = 0

    while count < len(indices[0]):
        for i in range(2):
            print(data.target_names[indices[i][count]])
            count += 1
        print('\n')
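
Note that the loop above increments `count` inside the inner `for`, so each printed pair combines the row index of one match with the column index of the next match, and every other match is skipped. The listings printed by this function should be read with that in mind. A possible corrected version, sketched under the assumption that `indices` is the `(rows, cols)` tuple returned by `np.nonzero`, pairs the two sequences with `zip` instead. `scorePairs` is a hypothetical name, and the toy inputs are made up.

```python
def scorePairs(indices, names):
    """Return (row category, column category) for every matching cell."""
    rows, cols = indices
    # zip keeps each row index together with its own column index.
    return [(names[r], names[c]) for r, c in zip(rows, cols)]

# Toy example: three matching cells at (0, 2), (1, 2) and (2, 0).
toy_names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]
toy_indices = ([0, 1, 2], [2, 2, 0])

toy_pairs = scorePairs(toy_indices, toy_names)
for first, second in toy_pairs:
    print(first)
    print(second)
    print()
```

In the notebook this could be called as `scorePairs(low_scores_indices, data.target_names)`.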

In [90]:
printScorePairs(low_scores_indices)

alt.atheism
comp.os.ms-windows.misc


comp.os.ms-windows.misc
comp.graphics


comp.os.ms-windows.misc
comp.sys.mac.hardware


comp.os.ms-windows.misc
misc.forsale


comp.os.ms-windows.misc
rec.motorcycles


comp.os.ms-windows.misc
rec.sport.hockey


comp.os.ms-windows.misc
sci.electronics


comp.os.ms-windows.misc
sci.space


comp.os.ms-windows.misc
talk.politics.guns


comp.os.ms-windows.misc
talk.politics.misc


comp.os.ms-windows.misc
comp.os.ms-windows.misc


comp.sys.mac.hardware
comp.os.ms-windows.misc


misc.forsale
comp.os.ms-windows.misc


rec.motorcycles
comp.os.ms-windows.misc


rec.sport.hockey
comp.os.ms-windows.misc


sci.electronics
comp.os.ms-windows.misc


sci.space
comp.os.ms-windows.misc


talk.politics.guns
comp.os.ms-windows.misc


talk.politics.misc
comp.os.ms-windows.misc




### Highest-scoring category pairs

In [95]:
# Scores of at least 0.8, excluding the perfect 1.0 of a category paired with itself
high_mask = np.logical_and(score_list>=0.8, score_list<1.0)
high_scores = score_list[high_mask]
high_scores_indices = np.nonzero(high_mask)

In [80]:
printScorePairs(high_scores_indices)

alt.atheism
comp.sys.ibm.pc.hardware


alt.atheism
comp.windows.x


alt.atheism
rec.autos


alt.atheism
rec.sport.baseball


alt.atheism
sci.crypt


alt.atheism
sci.med


alt.atheism
soc.religion.christian


alt.atheism
talk.politics.mideast


alt.atheism
talk.religion.misc


comp.graphics
comp.sys.ibm.pc.hardware


comp.graphics
comp.windows.x


comp.graphics
rec.autos


comp.graphics
rec.sport.baseball


comp.graphics
sci.crypt


comp.graphics
sci.med


comp.graphics
soc.religion.christian


comp.graphics
talk.politics.mideast


comp.graphics
talk.religion.misc


comp.sys.ibm.pc.hardware
comp.graphics


comp.sys.ibm.pc.hardware
comp.windows.x


comp.sys.ibm.pc.hardware
rec.autos


comp.sys.ibm.pc.hardware
rec.sport.baseball


comp.sys.ibm.pc.hardware
sci.crypt


comp.sys.ibm.pc.hardware
sci.med


comp.sys.ibm.pc.hardware
soc.religion.christian


comp.sys.ibm.pc.hardware
talk.politics.mideast


comp.sys.ibm.pc.hardware
talk.religion.misc


comp.sys.mac.hardware
comp.graphics


comp.sy