# Bayes' Theorem

Two events are independent if the occurrence of one does not affect the probability of the other.
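One way to make this definition concrete is to enumerate a small sample space and check whether P(A and B) equals P(A) × P(B). The sketch below uses two fair dice, which is an illustrative choice and not part of the lesson; the helper `prob` and the events A, B, and C are hypothetical names.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over (first, second)."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

# A: the first die shows 6. B: the second die is even.
p_a = prob(lambda o: o[0] == 6)
p_b = prob(lambda o: o[1] % 2 == 0)
p_a_and_b = prob(lambda o: o[0] == 6 and o[1] % 2 == 0)

# Independent events satisfy P(A and B) == P(A) * P(B).
independent = p_a_and_b == p_a * p_b

# C: the two dice sum to 12. C can only happen if A happens, so A and C are dependent.
p_c = prob(lambda o: sum(o) == 12)
p_a_and_c = prob(lambda o: o[0] == 6 and sum(o) == 12)
dependent = p_a_and_c != p_a * p_c

print(independent, dependent)
```

Exact fractions are used instead of floats so the equality check is not affected by rounding.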

Bayes' Theorem deals with the probabilities of dependent events. It relates the probability of an event given a test result to the conditional probability of that test result given the event.

In the rare-disease testing example, we determined two probabilities:

1. The patient had the disease, and the test correctly diagnosed the disease ≈ 0.00001
2. The patient didn’t have the disease and the test incorrectly diagnosed that they had the disease ≈ 0.01

Both events are rare, but we can see that it was about 1,000 times more likely that the test was incorrect than that the patient had this rare disease.

We’re able to come to this conclusion because we had more information than just the accuracy of the test; we also knew the prevalence of this disease.

In statistics, if we have two events (A and B), we write the probability that event A will happen, given that event B already happened as P(A|B). In our example, we want to find P(rare disease | positive result). In other words, we want to find the probability that the patient has the disease given the test came back positive.
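For reference, the standard statement of Bayes' Theorem in this notation, followed by the same formula with the events of the disease example substituted in:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$

$$
P(\text{rare disease} \mid \text{positive result}) = \frac{P(\text{positive result} \mid \text{rare disease})\, P(\text{rare disease})}{P(\text{positive result})}
$$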



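In [25]:
# P(positive | disease) = P(positive and disease) / P(disease), i.e. the test's 99% accuracy
p_positive_given_disease = (0.99 * 0.00001) / (1 / 100000)
print(p_positive_given_disease)

# P(disease): a prevalence of 1 in 100,000
p_disease = 1 / 100000
print(p_disease)

# P(positive), approximated as the sum of the two rounded probabilities listed earlier:
# P(disease and positive) + P(no disease and positive)
p_positive = 0.00001 + 0.01
print(p_positive)

# Bayes' Theorem: P(disease | positive)
p_disease_given_positive = p_positive_given_disease * p_disease / p_positive
print(p_disease_given_positive)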

0.9899999999999999
1e-05
0.01001
0.000989010989010989
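The cell above approximates P(positive) by adding two rounded values. A slightly more general sketch expands P(positive) with the law of total probability instead. The function name `posterior` and its parameter names are hypothetical, and the 1% false-positive rate is an assumption inferred from the ≈ 0.01 figure in the list above.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) using Bayes' Theorem.

    P(E) is expanded with the law of total probability:
    P(E) = P(E | H) * P(H) + P(E | not H) * P(not H)
    """
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Assumed inputs: prevalence 1 in 100,000, 99% true-positive rate,
# 1% false-positive rate.
p_disease_given_positive = posterior(
    prior=1 / 100000,
    p_evidence_given_h=0.99,
    p_evidence_given_not_h=0.01,
)
print(p_disease_given_positive)
```

Under these assumptions the result agrees with the approximate calculation to about three significant figures: a positive result still leaves the probability of disease just under 0.1%.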


## Spam Filters

Let’s explore a different example. Email spam filters use Bayes’ Theorem to determine if certain words indicate that an email is spam.

Let’s take a word that often appears in spam: “enhancement”.

With just 3 facts, we can make some preliminary steps towards a good spam filter:

1. “enhancement” appears in just 0.1% of non-spam emails
2. “enhancement” appears in 5% of spam emails
3. Spam emails make up about 20% of total emails

Given that an email contains “enhancement”, what is the probability that the email is spam?

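In [26]:
# Known facts
p_spam = 0.2                          # P(spam)
p_enhancement_given_spam = 0.05       # P("enhancement" | spam)
p_enhancement_given_not_spam = 0.001  # P("enhancement" | not spam)

# Law of total probability: P("enhancement")
p_enhancement = (p_enhancement_given_spam * p_spam
                 + p_enhancement_given_not_spam * (1 - p_spam))

# Bayes' Theorem: P(spam | "enhancement")
p_spam_given_enhancement = p_enhancement_given_spam * p_spam / p_enhancement

print(p_spam_given_enhancement)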

0.9259259259259259
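One way to sanity-check this result is to imagine a concrete batch of emails and simply count. The batch size of 100,000 is an arbitrary illustrative choice, and all variable names are hypothetical.

```python
total_emails = 100_000  # arbitrary batch size, chosen only for illustration

spam = total_emails * 20 // 100             # 20% of emails are spam
not_spam = total_emails - spam

spam_with_word = spam * 5 // 100            # 5% of spam contains "enhancement"
not_spam_with_word = not_spam * 1 // 1000   # 0.1% of non-spam contains it

# Of all emails containing the word, what fraction is spam?
p_spam_given_word = spam_with_word / (spam_with_word + not_spam_with_word)
print(spam_with_word, not_spam_with_word, round(p_spam_given_word, 4))
# prints: 1000 80 0.9259
```

Of the 1,080 emails that contain the word, 1,000 are spam, which matches the probability computed with Bayes' Theorem.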


## The Naive Bayes Classifier

A Naive Bayes classifier is a supervised machine learning algorithm that leverages Bayes’ Theorem to make predictions and classifications.

In [27]:
# `reviews` is assumed here to be a local module (it is not part of scikit-learn)
# that supplies neg_list and pos_list, two lists of review strings.
from reviews import neg_list, pos_list
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

review = "This crib was amazing"

# Create a count vectorizer
counter = CountVectorizer()

# Learn the vocabulary from both sets of strings
counter.fit(neg_list + pos_list)

# This is the vocabulary that the counter just learned.
# The number associated with each word is that word's index in a transformed review.
print(counter.vocabulary_)

# Transform our review into word counts
review_counts = counter.transform([review])
print(review_counts.toarray())

# Transform the training set
training_counts = counter.transform(neg_list + pos_list)

review = "This crib was great amazing and wonderful"
review_counts = counter.transform([review])

# Create the classifier
classifier = MultinomialNB()

# Create labels.
# The training points are neg_list followed by pos_list, so the first half of the
# labels is 0 (negative) and the second half is 1 (positive): 1000 0s, then 1000 1s.
training_labels = [0] * 1000 + [1] * 1000

# Fit using the training set and labels, then classify the review
classifier.fit(training_counts, training_labels)
print(classifier.predict(review_counts))
print(classifier.predict_proba(review_counts))

In this lesson, you’ve learned how to leverage Bayes’ Theorem to create a supervised machine learning algorithm. Here are some of the major takeaways from the lesson:

- A tagged dataset is necessary to calculate the probabilities used in Bayes’ Theorem.
- In this example, the features of our dataset are the words used in a product review. In order to apply Bayes’ Theorem, we assume that these features are independent.
- Using Bayes’ Theorem, we can find P(class|data point) for every possible class. In this example, there were two classes — positive and negative. The class with the highest probability will be the algorithm’s prediction.
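The takeaways above can be sketched without any library. The tiny training set below is hand-made and hypothetical, as are the names `score` and `predict`. Each class score is proportional to P(class | data point), because the denominator P(data point) is the same for every class and can be dropped when comparing. No smoothing is applied in this sketch, so a word that never appears in a class drives that class's score to 0.

```python
from collections import Counter

# Hypothetical, hand-made tagged dataset: (review text, label).
training = [
    ("great crib love it", "pos"),
    ("amazing and sturdy", "pos"),
    ("terrible broke fast", "neg"),
    ("awful waste of money", "neg"),
]

# Count word occurrences per class and reviews per class.
word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

def score(text, label):
    """P(label) times the product of P(word | label), treating words as independent."""
    total_words = sum(word_counts[label].values())
    result = doc_counts[label] / len(training)  # prior P(label)
    for word in text.split():
        # Counter returns 0 for words never seen in this class.
        result *= word_counts[label][word] / total_words
    return result

def predict(text):
    """Return the class with the highest score."""
    return max(word_counts, key=lambda label: score(text, label))

print(predict("great crib"))   # prints: pos
print(predict("awful waste"))  # prints: neg
```

Multiplying the per-word probabilities is exactly where the independence assumption enters: the words are treated as if each one appeared regardless of the others.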

Even though our algorithm is running smoothly, there’s always more that we can add to try to improve performance. The following techniques are focused on ways in which we process data before feeding it into the Naive Bayes classifier:

- Remove punctuation from the training set. Right now in our dataset, there are 702 instances of "great!" and 2322 instances of "great.". We should probably combine those into 3024 instances of "great".
- Lowercase every word in the training set. We do this for the same reason why we remove punctuation. We want "Great" and "great" to be the same.
- Use a bigram or trigram model. Right now, the features of a review are individual words. For example, the features of the point “This crib is great” are “This”, “crib”, “is”, and “great”. If we used a bigram model, the features would be “This crib”, “crib is”, and “is great”. Using a bigram model makes the assumption of independence more reasonable.
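The three preprocessing steps above can be sketched with the standard library alone. The helper names `preprocess` and `bigrams` are hypothetical, and only ASCII punctuation is stripped in this minimal version.

```python
import string

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def bigrams(words):
    """Return adjacent word pairs, each joined by a single space."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

words = preprocess("This crib is GREAT!")
print(words)           # prints: ['this', 'crib', 'is', 'great']
print(bigrams(words))  # prints: ['this crib', 'crib is', 'is great']
```

In scikit-learn, `CountVectorizer` exposes `lowercase` and `ngram_range` parameters that cover the lowercasing and n-gram steps.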

Smoothing in a Naive Bayes classifier prevents a single feature with a probability of 0 from reducing the total probability to 0.
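A minimal sketch of additive (Laplace) smoothing, using hypothetical word counts for a single class. The function names and the counts are illustrative assumptions; setting `alpha` to 0 disables smoothing.

```python
def word_probability(count, total_words, vocabulary_size, alpha=0.0):
    """Estimate P(word | class); alpha > 0 applies additive smoothing."""
    return (count + alpha) / (total_words + alpha * vocabulary_size)

# Hypothetical counts for one class: "crib" never appears in it.
counts = {"this": 3, "crib": 0, "broke": 4}
total_words = sum(counts.values())  # 7
vocabulary_size = len(counts)       # 3

def review_probability(words, alpha):
    """Product of the per-word probabilities for one class."""
    result = 1.0
    for word in words:
        result *= word_probability(counts[word], total_words, vocabulary_size, alpha)
    return result

review = ["this", "crib", "broke"]
unsmoothed = review_probability(review, alpha=0.0)  # the zero count wipes out the product
smoothed = review_probability(review, alpha=1.0)    # (4/10) * (1/10) * (5/10)
print(unsmoothed, smoothed)
```

scikit-learn's `MultinomialNB` exposes this as its `alpha` parameter, which defaults to 1.0.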

# Bayes project using scikit-learn

In [28]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


In [29]:
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [30]:
print(emails.target[5])

# Label value is 1 so this email is about Hockey

1


Create training and test datasets. From here on, the two categories are `comp.sys.ibm.pc.hardware` and `rec.sport.hockey`.

In [31]:
categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey']

train_emails = fetch_20newsgroups(categories=categories, subset='train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories=categories, subset='test', shuffle=True, random_state=108)

In [32]:
# Transform these emails into lists of word counts
counter = CountVectorizer()

# Tell the counter which words can appear in our emails.
# Note: this fits the vocabulary on both the test and training emails;
# in practice the vocabulary is usually fit on the training data only.
counter.fit(test_emails.data + train_emails.data)

# Word counts for the training set
train_counts = counter.transform(train_emails.data)

# Word counts for the test set
test_counts = counter.transform(test_emails.data)

# Create a Naive Bayes classifier that we can train and test
classifier = MultinomialNB()

# .fit() takes two parameters: the training set (train_counts)
# and the labels associated with the training emails.
classifier.fit(train_counts, train_emails.target)

# .score() takes the test set and the test labels
# and returns the mean accuracy on that test set.
print(classifier.score(test_counts, test_emails.target))

0.9974715549936789
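For a classifier, `.score()` reports mean accuracy: the fraction of test emails whose predicted label matches the true label. A minimal sketch of that computation follows, with a hypothetical helper name and made-up labels rather than the newsgroup data.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Made-up labels: three of the four predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # prints: 0.75
```

A score of about 0.997 therefore means that roughly 3 in every 1,000 test emails were assigned to the wrong category.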
