# Email Similarity

In this project, you will use scikit-learn's Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish.

For example, how difficult do you think it is to tell apart emails about hockey and emails about baseball?

How hard is it to tell the difference between emails about hockey and emails about tech? 

In this project, we’ll find out exactly how difficult those two tasks are.

### Exploring the Data
We've imported a dataset of emails from scikit-learn's datasets. All of these emails are tagged based on their content.

Print emails.target_names to see the different categories.

In [1]:
# Imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()
emails.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We're interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. We can select the categories of articles we want from __fetch_20newsgroups__ by adding the parameter categories.

In the function call, set categories equal to the list ['rec.sport.baseball', 'rec.sport.hockey']

In [2]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

Let's take a look at one of these emails.

All of the emails are stored in a list called emails.data. Print the email at index 5 in the list.

In [3]:
emails.data[5]

'From: mmb@lamar.ColoState.EDU (Michael Burger)\nSubject: More TV Info\nDistribution: na\nNntp-Posting-Host: lamar.acns.colostate.edu\nOrganization: Colorado State University, Fort Collins, CO  80523\nLines: 36\n\nUnited States Coverage:\nSunday April 18\n  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone\n  ABC - Gary Thorne and Bill Clement\n\n  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones\n  ABC - Mike Emerick and Jim Schoenfeld\n\n  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones\n  ABC - Al Michaels and John Davidson\n\nTuesday, April 20\n  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide\n  ESPN - Gary Thorne and Bill Clement\n\nThursday, April 22 and Saturday April 24\n  To Be Announced - 7:30 EDT Nationwide\n  ESPN - To Be Announced\n\n\nCanadian Coverage:\n\nSunday, April 18\n  Buffalo at Boston - 7:30 EDT Nationwide\n  TSN - ???\n\nTuesday, April 20\n  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide\n  TSN - ??

All of the labels can be found in the list emails.target. Print the label of the email at index 5.

The labels themselves are numbers, but those numbers correspond to the label names found at emails.target_names.

Is this a baseball email or a hockey email?

In [4]:
emails.target[5]

1

In [5]:
emails.target_names[1]

'rec.sport.hockey'
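
The lookup in the last two cells can be applied to many labels at once. Here is a minimal sketch using made-up stand-in values rather than the real dataset: `target_names` and `targets` below are hypothetical copies of what `emails.target_names` and `emails.target` would hold.

```python
from collections import Counter

# Hypothetical stand-ins for emails.target_names and emails.target.
target_names = ['rec.sport.baseball', 'rec.sport.hockey']
targets = [1, 0, 1, 1, 0]

# Each numeric label is an index into target_names.
label_names = [target_names[t] for t in targets]
print(label_names[0])

# Count how many emails fall into each category.
category_counts = Counter(label_names)
print(category_counts)
```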

### Making the Training and Test Sets
We now want to split our data into training and test sets. Change the name of your variable from __emails__ to __train_emails.__ Add these three parameters to the function call:

- subset = 'train'
- shuffle = True
- random_state = 108

Adding the __random_state__ parameter makes sure the data is shuffled into the same order every time you run the code. The split itself is fixed by the dataset: __subset__ selects the predefined training or test portion.

In [6]:
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
                                    subset = 'train',
                                    shuffle = True,
                                    random_state = 108)
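
To see why a fixed seed gives a repeatable order, here is a small sketch using Python's standard `random` module. This is only an analogy: scikit-learn uses its own random number generator internally, and both the `shuffled` helper and the `toy_emails` list are made up for illustration.

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items, using a generator seeded with seed."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy

toy_emails = ['email_a', 'email_b', 'email_c', 'email_d', 'email_e']

first_run = shuffled(toy_emails, seed=108)
second_run = shuffled(toy_emails, seed=108)

# The same seed produces the same order on every run.
print(first_run == second_run)
```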

In [7]:
train_emails.data[5]

'From: smorris@venus.lerc.nasa.gov (Ron Morris )\nSubject: Murray as GM  (was: Wings will win\nOrganization: NASA Lewis Research Center\nLines: 37\nDistribution: world\nNNTP-Posting-Host: venus.lerc.nasa.gov\nNews-Software: VAX/VMS VNEWS 1.41    \n\nIn article <1993Apr19.204348.8254@sol.UVic.CA>, gballent@hudson.UVic.CA writes...\n> \n>In article 735249453@vela.acs.oakland.edu, ragraca@vela.acs.oakland.edu (Randy A. Graca) writes:\n> \n>>are predicting).  Although I think Bryan Murray is probably the best GM\n>>I have ever seen in hockey\n> \n>How do you figure that??  When Bryan Murray took over the Wings they were\n>a pretty good team that was contending for the Stanley Cup but looked\n>unlikely to win it.  Now they are a pretty good team that is contending for\n>the Stanley Cup but looks unlikely to win it.  A truly great GM would\n>have been able to make the moves to push the team to the upper echelon\n>of the NHL and maybe win the Stanley Cup.  A good GM (like Murray) can\n\nI thi


Create another variable named __test_emails__ and set it equal to fetch_20newsgroups. The parameters of the function should be the same as before except subset should now be __'test'.__

In [8]:
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
                                    subset = 'test',
                                    shuffle = True,
                                    random_state = 108)

In [9]:
test_emails.data[5]

'From: lws@eembox.ncku.edu.tw (WenHsiang Lin)\nSubject: Stats question\nOrganization: National Cheng Kung University\nLines: 5\n\n\n\tI am just wondering whether the official MLB stats includes \nIntentional Walks in the BB category or not?\n\nWenHsiang Lin\n'

### Counting Words

We want to transform these emails into lists of word counts. The CountVectorizer class makes this easy for us.

Create a CountVectorizer object and name it counter.

In [10]:
counter = CountVectorizer()

We need to tell counter what possible words can exist in our emails. counter has a .fit() function that takes a list of all your data.

Call .fit() with __test_emails.data + train_emails.data__ as a parameter.

In [11]:
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
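
Conceptually, fitting collects every distinct word across the documents and gives each one a column index. The sketch below imitates that with the standard library, using the `token_pattern` and `lowercase=True` settings shown in the output above. `build_vocabulary` is a hypothetical helper, not part of scikit-learn, and the two documents are made up.

```python
import re

# Same pattern as in the CountVectorizer output: words of two or more characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def build_vocabulary(documents):
    """Map each distinct lowercase word to a column index, in alphabetical order."""
    words = set()
    for document in documents:
        words.update(TOKEN_PATTERN.findall(document.lower()))
    return {word: index for index, word in enumerate(sorted(words))}

toy_docs = ['The puck hit the net', 'A bat hit the ball']
vocabulary = build_vocabulary(toy_docs)
print(vocabulary)
```

Note that the one-letter word "A" is dropped, because the pattern only matches words with at least two characters.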

We can now make a list of the counts of our words in our __training set.__

Create a variable named __train_counts.__ Set it equal to counter's __transform function__ using __train_emails.data__ as a parameter.

In [12]:
train_counts = counter.transform(train_emails.data)

In [13]:
train_counts

<1197x23714 sparse matrix of type '<class 'numpy.int64'>'
	with 174038 stored elements in Compressed Sparse Row format>
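
The summary above only reports the shape: 1197 emails by 23714 vocabulary words. To see what the entries look like, here is a sketch on two made-up sentences, small enough to print as a dense array. The `toy_docs` list and the `toy_` variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the puck hit the net', 'the bat hit the ball']

toy_counter = CountVectorizer()
toy_counter.fit(toy_docs)

# vocabulary_ maps each word to its column index; sort the words by column.
columns = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(columns)

# One row per document, one column per word, each entry a word count.
toy_counts = toy_counter.transform(toy_docs)
print(toy_counts.toarray())
```

Each row is one document and each column is one word, so the real `train_counts` holds one row per training email.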

Let's also make a variable named test_counts. This should be the same function call as before, but use test_emails.data as the parameter of transform.

In [14]:
test_counts = counter.transform(test_emails.data)

In [15]:
test_counts

<796x23714 sparse matrix of type '<class 'numpy.int64'>'
	with 115794 stored elements in Compressed Sparse Row format>

### Making a Naive Bayes Classifier

Let's now make a Naive Bayes classifier that we can train and test on. Create a MultinomialNB object named classifier.

In [16]:
classifier = MultinomialNB()

Call classifier's .fit() function. __.fit()__ takes two parameters. The first should be our __training set,__ which for us is __train_counts.__ The second should be the __labels__ associated with the __training emails.__ Those are found in __train_emails.target.__

In [17]:
classifier.fit(train_counts, train_emails.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
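
For intuition about what fitting and predicting involve, here is a rough standard-library sketch of multinomial Naive Bayes with add-one smoothing (the `alpha=1.0` shown above). It is a simplified illustration on made-up count rows, not scikit-learn's implementation. `fit_naive_bayes` and `predict` are hypothetical helpers.

```python
import math

def fit_naive_bayes(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(labels))
    n_words = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(count_rows, labels) if label == c]
        log_prior[c] = math.log(len(rows) / len(count_rows))
        # Total count of each word across this class's documents.
        totals = [sum(row[j] for row in rows) for j in range(n_words)]
        denominator = sum(totals) + alpha * n_words
        log_likelihood[c] = [math.log((t + alpha) / denominator) for t in totals]
    return log_prior, log_likelihood

def predict(count_row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    def score(c):
        return log_prior[c] + sum(
            n * lp for n, lp in zip(count_row, log_likelihood[c]))
    return max(log_prior, key=score)

# Columns: 'bat', 'pitcher', 'puck', 'ice'. Label 0 = baseball, 1 = hockey.
train_rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 3]]
train_labels = [0, 0, 1, 1]

log_prior, log_likelihood = fit_naive_bayes(train_rows, train_labels)
hockey_guess = predict([0, 0, 1, 1], log_prior, log_likelihood)
baseball_guess = predict([1, 1, 0, 0], log_prior, log_likelihood)
print(hockey_guess, baseball_guess)
```

The smoothing term keeps a word that never appeared in a class from producing a probability of zero, which would otherwise rule that class out entirely.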

Test the Naive Bayes classifier by printing the result of classifier's __.score() function.__ _.score()_ takes the __test set and the test labels__ as parameters.

.score() returns the accuracy of the classifier on the test data. Accuracy is the fraction of test emails that the classifier labeled correctly, so a value of 1.0 means every email was classified correctly.

In [18]:
classifier.score(test_counts, test_emails.target)

0.9723618090452262
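
Accuracy is the number of correct predictions divided by the total number of predictions. Here is a small sketch with made-up labels, where `accuracy` is a hypothetical helper. Given the 796 test emails counted earlier, the score above is consistent with 774 correct predictions, since 774 / 796 gives the same value.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up predictions and true labels: 4 of 5 match.
toy_accuracy = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(toy_accuracy)

# 774 correct out of 796 test emails matches the score reported above.
print(774 / 796)
```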

### Testing Other Datasets
Our classifier does a pretty good job distinguishing between baseball emails and hockey emails. But let's see how it does with emails about really different topics.

Find where you create train_emails and test_emails. Change the categories to be ['comp.sys.ibm.pc.hardware','rec.sport.hockey'].

Did your classifier do a better or worse job on these two datasets?

In [19]:
train_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], 
                                    subset = 'train',
                                    shuffle = True,
                                    random_state = 108)

test_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], 
                                    subset = 'test',
                                    shuffle = True,
                                    random_state = 108)
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
classifier.score(test_counts, test_emails.target)

0.9974715549936789
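
Since the same steps are repeated for every pair of categories, they can be wrapped in a function. The sketch below is one possible way to do that. `score_texts` is a hypothetical helper, and the tiny training and test lists are made up so the example runs without downloading anything. Like the cells above, it fits the vocabulary on the training and test text together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def score_texts(train_texts, train_labels, test_texts, test_labels):
    """Count words, train Naive Bayes, and return accuracy on the test texts."""
    counter = CountVectorizer()
    counter.fit(test_texts + train_texts)
    train_counts = counter.transform(train_texts)
    test_counts = counter.transform(test_texts)
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    return classifier.score(test_counts, test_labels)

# Made-up data: label 0 = baseball, label 1 = hockey.
train_texts = ['pitcher threw a strike', 'home run in the ninth inning',
               'goalie stopped the puck', 'power play goal on the ice']
train_labels = [0, 0, 1, 1]
test_texts = ['the pitcher struck out the side in the inning',
              'the puck slid across the ice']
test_labels = [0, 1]

toy_score = score_texts(train_texts, train_labels, test_texts, test_labels)
print(toy_score)
```

With the real data, the same function could be called with `train_emails.data`, `train_emails.target`, `test_emails.data`, and `test_emails.target` for each pair of categories.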

### Conclusion:
The classifier was about 99.7% accurate when classifying hockey and tech emails.

This is better than the roughly 97.2% it reached on hockey and baseball emails. This makes sense: emails about two sports probably share more words with each other than a sports email shares with a tech email.
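
One way to probe that explanation is to measure how much vocabulary two groups of emails share. The sketch below computes the Jaccard overlap (shared words divided by all distinct words) on made-up sentences. `words_in` and `vocabulary_overlap` are hypothetical helpers, and the numbers printed here say nothing about the real dataset.

```python
import re

def words_in(documents):
    """Set of distinct lowercase words (two or more characters) in the documents."""
    found = set()
    for document in documents:
        found.update(re.findall(r'\b\w\w+\b', document.lower()))
    return found

def vocabulary_overlap(docs_a, docs_b):
    """Jaccard overlap: shared words divided by all distinct words."""
    words_a, words_b = words_in(docs_a), words_in(docs_b)
    return len(words_a & words_b) / len(words_a | words_b)

# Made-up example sentences, one per topic.
hockey = ['the team won the game in overtime']
baseball = ['the team lost the game in extra innings']
hardware = ['my hard drive controller card failed']

sports_overlap = vocabulary_overlap(hockey, baseball)
tech_overlap = vocabulary_overlap(hockey, hardware)
print(sports_overlap, tech_overlap)
```

Running the same function on the real `data` lists for each category would show whether the two sports groups really do share more vocabulary than hockey and hardware.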