## Email Similarity
In this project, you will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish. For example, how difficult do you think it is to distinguish the difference between emails about hockey and emails about soccer? How hard is it to tell the difference between emails about hockey and emails about tech? In this project, we’ll find out exactly how difficult those two tasks are.

In [76]:
# from codeacademy
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

### Exploring the Data

In [77]:
# Print emails.target_names to see the different categories.

emails.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [78]:
# We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. 
# We can select the categories of articles we want from fetch_20newsgroups by adding the parameter categories.
categories = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

In [79]:
# print email at index 5 of emails.data
categories.data[5]

'From: mmb@lamar.ColoState.EDU (Michael Burger)\nSubject: More TV Info\nDistribution: na\nNntp-Posting-Host: lamar.acns.colostate.edu\nOrganization: Colorado State University, Fort Collins, CO  80523\nLines: 36\n\nUnited States Coverage:\nSunday April 18\n  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone\n  ABC - Gary Thorne and Bill Clement\n\n  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones\n  ABC - Mike Emerick and Jim Schoenfeld\n\n  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones\n  ABC - Al Michaels and John Davidson\n\nTuesday, April 20\n  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide\n  ESPN - Gary Thorne and Bill Clement\n\nThursday, April 22 and Saturday April 24\n  To Be Announced - 7:30 EDT Nationwide\n  ESPN - To Be Announced\n\n\nCanadian Coverage:\n\nSunday, April 18\n  Buffalo at Boston - 7:30 EDT Nationwide\n  TSN - ???\n\nTuesday, April 20\n  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide\n  TSN - ??

In [80]:
# print label of email at index 5
categories.target[5]

1

In [81]:
# lookup what sport that refers to
categories.target_names[1]

'rec.sport.hockey'

## Making the Training and Test Sets

In [82]:
# create train set
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)

In [83]:
# create test set
test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

### Counting words

In [84]:
# create CountVectorizer object and train
counter = CountVectorizer()
counter.fit(test_emails.data, train_emails.data)

CountVectorizer()

In [85]:
# make list of counts of words in training and sets
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

### Making a Naive Bayes Classifier

In [86]:
# create object and train
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

MultinomialNB()

In [87]:
# test accuracy
acc = classifier.score(test_counts, test_emails.target)
print("Accuracy distinguishing soccer from hockey emails = {:.2f}%".format(acc*100))

Accuracy distinguishing soccer from hockey emails = 96.98%


### Testing Other Datasets

In [90]:
# create train set
train_emails2 = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)
# create test set
test_emails2 = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)
# create CountVectorizer object and train
counter2 = CountVectorizer()
counter2.fit(test_emails2.data, train_emails2.data)
# make list of counts of words in training and sets
train_counts2 = counter2.transform(train_emails2.data)
test_counts2 = counter2.transform(test_emails2.data)
# create object and train
classifier2 = MultinomialNB()
classifier2.fit(train_counts2, train_emails2.target)
# test accuracy
acc2 = classifier2.score(test_counts2, test_emails2.target)
print("Accuracy distinguishing IBM hardware from hockey emails = {:.2f}%".format(acc2*100))

Accuracy distinguishing IBM hardware from hockey emails = 99.62%


### Define function

In [93]:
def compare_pairs(dataset1, dataset2):
    # create train set
    train_emails_func = fetch_20newsgroups(categories=[dataset1, dataset2], subset='train', shuffle=True, random_state=108)
    # create test set
    test_emails_func = fetch_20newsgroups(categories=[dataset1, dataset2], subset='test', shuffle=True, random_state=108)
    # create CountVectorizer object and train
    counter_func = CountVectorizer()
    counter_func.fit(test_emails_func.data, train_emails_func.data)
    # make list of counts of words in training and sets
    train_counts_func = counter_func.transform(train_emails_func.data)
    test_counts_func = counter_func.transform(test_emails_func.data)
    # create object and train
    classifier_func = MultinomialNB()
    classifier_func.fit(train_counts_func, train_emails_func.target)
    #
    acc_func = classifier_func.score(test_counts_func, test_emails_func.target)
    print("Accuracy distinguishing {} from {} emails = {:.2f}%".format(dataset1, dataset2, acc_func*100))

In [94]:
# test
compare_pairs('comp.sys.ibm.pc.hardware', 'rec.sport.hockey')

Accuracy distinguishing comp.sys.ibm.pc.hardware from rec.sport.hockey emails = 99.62%


There are 20 possible targets, so I'll run a few random comparisons to see how they could differ:

In [95]:
import random

In [96]:
for i in range(20):
    compare_pairs(emails.target_names[random.choice(range(20))], emails.target_names[random.choice(range(20))])

Accuracy distinguishing talk.religion.misc from sci.med emails = 97.37%
Accuracy distinguishing talk.politics.guns from comp.sys.mac.hardware emails = 98.53%
Accuracy distinguishing misc.forsale from comp.graphics emails = 96.66%
Accuracy distinguishing comp.sys.ibm.pc.hardware from rec.motorcycles emails = 99.75%
Accuracy distinguishing comp.sys.ibm.pc.hardware from talk.religion.misc emails = 99.07%
Accuracy distinguishing soc.religion.christian from sci.med emails = 98.49%
Accuracy distinguishing rec.motorcycles from sci.crypt emails = 98.36%
Accuracy distinguishing sci.space from rec.sport.baseball emails = 99.37%
Accuracy distinguishing comp.windows.x from talk.politics.guns emails = 99.08%
Accuracy distinguishing rec.autos from rec.sport.baseball emails = 98.87%
Accuracy distinguishing talk.religion.misc from comp.sys.ibm.pc.hardware emails = 99.07%
Accuracy distinguishing comp.graphics from talk.religion.misc emails = 97.50%
Accuracy distinguishing rec.sport.baseball from misc.f

Mostly these have really high (>97%) accuracy. Notably lower were distinguishing alt.atheism from talk.religion.misc emails (84.74%) and especially
distinguishing rec.autos from comp.os.ms-windows.misc (65.32%).