**Copyright: © NexStream Technical Education, LLC**.
All rights reserved

# Bayes Theorem - Part 1

Create a Colab script that calculates P(A|B) using Bayes theorem for the example provided in the lecture: if you test positive, how probable is it that you have cancer?
- Create a Python script to implement the bayes_theorem function.
- Create a runner to call your function and output the result, P(A|B), for the following cases:
  - Case 1: Cancer occurs in 1% of the people your age and the test is 99% reliable
  - Case 2: Cancer occurs in 3% of the people your age and the test is 90% reliable
- Repeat the tests 3 times each (using your previous P(A|B) result as P(A) in the subsequent run) and record P(A|B) for each repeated test. How did your results improve with each additional test run?

Summary of the lecture example:
- P(A):  probability you have cancer prior to getting tested (called the 'belief')
- P(B):  probability you test positive  (called the 'evidence')
- P(A|B):  probability you have cancer given the test is positive

- Bayes Theorem:
$$P(A|B)=\frac{P(B|A)\cdot P(A)}{P(B)}$$
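
As a sanity check on the formula, Case 1 can be worked through by counting people instead of multiplying probabilities. The sketch below is illustrative only: the population size of 10,000 is an arbitrary choice, and it assumes that "99% reliable" means both P(B|A) = 0.99 and a false-positive rate of 1 - 0.99.

```python
# Natural-frequency check of Case 1.
# Assumption: the false-positive rate P(B|not A) equals 1 - reliability.
population = 10_000    # arbitrary illustrative population size
prevalence = 0.01      # P(A)
reliability = 0.99     # P(B|A)

sick = population * prevalence                  # people who have cancer
healthy = population - sick                     # people who do not
true_positives = sick * reliability             # sick and test positive
false_positives = healthy * (1 - reliability)   # healthy but test positive

# P(B): fraction of the whole population that tests positive
p_b = (true_positives + false_positives) / population
# P(A|B): fraction of the positive tests that are true positives
posterior = true_positives / (true_positives + false_positives)

print(round(p_b, 4), round(posterior, 4))
```

Under these assumptions there are as many false positives as true positives, which is why a single positive test only raises the probability to about one half.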

<br>

**def bayes_theorem(p_a, p_b_given_a):**
- Inputs:
  - *p_a* = probability of event 'a', e.g., you have cancer prior to getting tested
  - *p_b_given_a* = probability of event 'b' given event 'a' has occurred, e.g. the probability that you test positive given you have cancer.
- Returns:
  - *p_b* = probability of event 'b', e.g. the probability that you test positive (returned for test purposes)
  - *p_a_given_b* = probability of event 'a' given event 'b', e.g. the probability that you have cancer given you test positive

<br>

Follow the steps as outlined in the following code cell to implement the functions.  Make sure your code passes the embedded doctests.


In [None]:
#In-class assignment in Bayes lecture
#If you test positive how probable is it that you have cancer?
import numpy as np


#Calculate P(A|B) given P(A), P(B|A)
#    Hint:  calculate false_positives, true_positives, and P(B)
#           then can use Bayes theorem to calculate P(A|B)

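def bayes_theorem(p_a, p_b_given_a):
  #YOUR CODE HERE
  #P(B) = true positives + false positives, assuming the false-positive
  #rate P(B|not A) equals 1 - P(B|A)
  p_b = p_b_given_a * p_a + (1 - p_a) * (1 - p_b_given_a)

  #Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
  p_a_given_b = (p_b_given_a * p_a) / p_b

  #p_b is returned alongside the posterior for test purposes
  return p_b, p_a_given_b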




#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

#Scenario and definitions
#   P(A):  probability that event (e.g. cancer) occurs in % of population (belief)
#   P(B):  probability test is positive (evidence)
#   P(B|A):  probability test is positive if event occurs, i.e. reliability of test
#            when multiply by P(A) then get the rate of true positives
#   P(B|not A): probability test is positive if event does not occur
#            when multiply by P(not A) then get the rate of false positives

#Case 1.  Cancer occurs in 1% of the people your age and the test is 99% reliable
#Case 2.  Cancer occurs in 3% of the people your age and the test is 90% reliable

def bayes_runner(p_a, p_b_given_a, iterations):
  pagb_list = []
  pb_list = []
  for i in range(iterations):
    p_b, p_a_given_b = bayes_theorem(p_a, p_b_given_a)
    p_a = p_a_given_b
    pagb_list.append(p_a_given_b)
    pb_list.append(p_b)
  return pb_list, pagb_list

#Case 1.  Cancer occurs in 1% of the people your age and the test is 99% reliable
pb_1, pagb_1 = bayes_runner(0.01, 0.99, 3)

#Case 2.  Cancer occurs in 3% of the people your age and the test is 90% reliable
pb_2, pagb_2 = bayes_runner(0.03, 0.90, 3)

import doctest
"""
   >>> print(np.round(pagb_1, 4))
   [0.5    0.99   0.9999]
   >>> print(np.round(pb_1, 4))
   [0.0198 0.5    0.9802]
   >>> print(np.round(pagb_2, 4))
   [0.2177 0.7147 0.9575]
   >>> print(np.round(pb_2, 4))
   [0.124  0.2742 0.6718]
"""
doctest.testmod()





TestResults(failed=0, attempted=4)
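
One way to see why each repeated positive test raises P(A|B) is the odds form of Bayes theorem: every positive result multiplies the odds of having cancer by the same likelihood ratio. The sketch below is not part of the assignment; the function name `posterior_after_n_positives` is hypothetical, and it assumes the tests are independent and that the false-positive rate equals 1 - reliability.

```python
def posterior_after_n_positives(prior, reliability, n):
    """Posterior after n consecutive positive tests, using the odds form.

    Assumes independent tests and P(B|not A) = 1 - reliability.
    """
    prior_odds = prior / (1 - prior)
    likelihood_ratio = reliability / (1 - reliability)  # P(B|A) / P(B|not A)
    odds = prior_odds * likelihood_ratio ** n           # one factor per positive test
    return odds / (1 + odds)                            # convert odds back to a probability

for n in range(1, 4):
    case_1 = posterior_after_n_positives(0.01, 0.99, n)
    case_2 = posterior_after_n_positives(0.03, 0.90, n)
    print(n, round(case_1, 4), round(case_2, 4))
```

For Case 1 the likelihood ratio is 99, and for Case 2 it is 9, so the more reliable test converges toward certainty much faster.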

# Naive Bayes Multinomial Classifier - Part 2

Create a Colab script which uses a Naive Bayes classifier to detect email spam messages. Note, you may NOT use any machine learning libraries (e.g. sklearn) for this section.
- Create a training set of Wanted and Unwanted messages
- Create your 'model'
 - Parse the words in the emails and create conditional probabilities for the Wanted
and Unwanted messages
- Test your model with other messages not used in the training set
- Record and reflect on your results, i.e. evaluate the performance, and compare against
typical performance of the algorithm (do some searches to find the average performance
of a Naive Bayes based spam filter).
- A training dataset has been provided as materials for this assignment. You may find another, or scrape your own emails if you prefer - just describe what dataset you used or created in your reflections.


<br>

- Follow the steps as outlined in the following code cells to implement the functions.
- Record your comments and reflections for this part in a text cell at the end of this section.
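
For orientation, here is a minimal sketch of word-count Naive Bayes scoring on a toy example. It is not the required solution: the four training messages are made up, the names `log_score` and `classify` are hypothetical, and it uses log probabilities with add-one (Laplace) smoothing, which is one common way to handle the zero-frequency problem rather than the approach the cells below ask for.

```python
import math
from collections import Counter

# Hypothetical toy training messages (not from the provided dataset)
toy_train = [
    ("ham", "are you coming home for dinner"),
    ("ham", "see you at the meeting tomorrow"),
    ("spam", "win a free prize call now"),
    ("spam", "free entry call now to claim your prize"),
]

# Word counts per class and number of messages per class
counts = {"ham": Counter(), "spam": Counter()}
docs = Counter()
for label, text in toy_train:
    docs[label] += 1
    counts[label].update(text.lower().split())

vocab = set(counts["ham"]) | set(counts["spam"])

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(counts[label].values())
    score = math.log(docs[label] / sum(docs.values()))
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        score += math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))
    return score

def classify(text):
    """Return the label with the higher log score."""
    return max(("ham", "spam"), key=lambda label: log_score(text, label))

print(classify("call now for a free prize"))
print(classify("are you coming to the meeting"))
```

Summing logs is equivalent to multiplying the probabilities but avoids numerical underflow on long messages.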



In [None]:
#Mount your google drive and copy the dataset to the current working directory (!cp),
#or change the working directory to the folder (%cd).
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/content/



Mounted at /content/drive
/content/drive/My Drive/content


In [None]:
#Read the dataset as a Pandas dataframe.
#and examine the head of the file.
#An example is shown for the provided dataset.
#This code assumes the dataset csv file is in the working directory.

import pandas as pd

df = pd.read_csv('/content/spam_training_dataset_2 (1).csv')
df.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [None]:
#Setup dataframes with 80% train and 20% test
#An example is shown for the provided dataset.
cutoff = int(len(df)*0.8)
train = df[:cutoff]
test = df[cutoff:]

#Reset the indices of the test rows
test.reset_index(drop=True, inplace=True)


In [None]:
#Train the model

#Create empty dictionaries of ham (wanted), and spam (unwanted) messages.
#This will create the histogram bins needed for ham and spam messages.
#The dicts will set keys = email word, value = word frequency (count)
ham = {}
spam = {}

#Loop over length of the training set
for i in range(len(train)):

  #Read an example (X) == row from the training set
  #Hint:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
  X = train.loc[i]

  #If the example is ham (can get this from the 'label' or 'label_num' column)
  #then set a reference to the ham dictionary, else set reference to spam dict.
  if X.label == 'ham':
    toAppend = ham
  else:
    toAppend = spam

  #Loop over the words in the example.
  #Hint: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
  #Hint: In this case str is the X.text field in the df
  for word in X.text.split():

    #Check for 'stopwords' by ignoring any words with length < N (try different values of N)
    if len(word) > 2:
      #Check if word is in the dictionary and increment count if so
      #otherwise add the word to the dictionary and set the count to 1
      if word in toAppend:
        toAppend[word] += 1
      else:
        toAppend[word] = 1

#Set the lengths for each of the ham and spam entries
#so can calculate the probability of each word in the ham and spam histograms.
lenHam = sum(ham.values())
lenSpam = sum(spam.values())

#Loop over each word and frequency in the dicts and calculate their weightings.
#Hint:  https://docs.python.org/3/tutorial/datastructures.html#dictionaries (Looping Techniques)
#Ham histogram probabilities
for word, count in ham.items():
  ham[word] = (count/lenHam)
#Spam histogram probabilities
for word, count in spam.items():
  spam[word] = (count/lenSpam)

#Finally sort the dictionaries by values in descending order (highest weighting first)
#and keep the items with the top 25 weightings.
#Note, you should experiment with keeping different amounts of items.
ham = dict(sorted(ham.items(), key=lambda item: item[1], reverse=True)[:25])
print(ham)
spam = dict(sorted(spam.items(), key=lambda item: item[1], reverse=True)[:25])
print(spam)

{'you': 0.029119836520216026, 'the': 0.020118717462170973, 'and': 0.014742373376149467, 'for': 0.009511993382961124, 'that': 0.007979370408212913, 'have': 0.0071279132000194615, 'your': 0.007079258502408407, 'not': 0.0063980927358536464, 'are': 0.006252128643020483, '&lt;#&gt;': 0.005619617574076777, 'will': 0.005497980830049141, 'but': 0.005497980830049141, 'get': 0.005473653481243614, "I'm": 0.005206052644382815, 'can': 0.005035761202744125, 'with': 0.004792487714688853, 'when': 0.004524886877828055, 'like': 0.00430594073857831, 'got': 0.003941030506495402, 'know': 0.003941030506495402, 'come': 0.003916703157689875, 'all': 0.0038437211112732936, 'just': 0.0038193937624677664, 'You': 0.003795066413662239, 'call': 0.0037220843672456576}
{'call': 0.012540276931115563, 'your': 0.012453191674649481, 'the': 0.01201776539231908, 'you': 0.01166942436645476, 'for': 0.011146912827658277, 'and': 0.009492292954802752, 'Call': 0.009405207698336672, 'have': 0.00862144039014195, 'from': 0.007663502

In [None]:
#Make predictions
#def predict(text):
#Input: test message text
#Return: 'ham' or 'spam' label

# Loop over words in test message
#   if not stopword (<4 char) then accumulate wanted or unwanted probabilities
# Note, for this implementation, if the word is not in the ham or spam dictionary,
# just ignore it, i.e. don't multiply in a 0 probability to avoid the zero frequency problem.
# If the wanted score > unwanted score then return 'ham', else return 'spam'

#YOUR CODE HERE

def predict(text):
  w_score = 0
  u_score = 0
  for word in text.split():
    if len(word) > 2:
      if word in ham:
        w_score += ham[word]
      if word in spam:
        u_score += spam[word]

  #Compare w_score against u_score and return 'ham' or 'spam' label
  if w_score > u_score:
    return 'ham'
  else:
    return 'spam'


In [None]:
#Test the model
#Loop over the test dataset
#  extract an example (row) and pass the text field to the predict function
#  if the returned label matches the example label then increment the correct
#Calculate a percent correct

#YOUR CODE HERE

correct = 0
for i in range(len(test)):
  email = test.iloc[i]['text']
  if predict(email) == test.iloc[i]['label']:
    correct += 1


print((correct/len(test))*100)

61.0062893081761
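
Percent correct alone can hide how a spam filter fails, so it may help to also look at precision and recall when reflecting on the results. The helper below is a sketch with a hypothetical name (`spam_metrics`) and made-up labels; it is not part of the provided test code.

```python
def spam_metrics(actual, predicted, positive="spam"):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # spam caught
    fp = sum(a != positive and p == positive for a, p in pairs)  # ham flagged as spam
    fn = sum(a == positive and p != positive for a, p in pairs)  # spam missed
    tn = len(pairs) - tp - fp - fn                               # ham passed through
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels, for illustration only
actual = ["ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham"]
print(spam_metrics(actual, predicted))  # (0.6, 0.5, 0.5)
```

In the notebook, the two lists could be built from the test dataframe's label column and the outputs of `predict`.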


**Enter your model performance and reflections here.**
.
.
.




| Length of stopword | % correct |
|--------------------|-----------|
| 2                  | 61        |
| 3                  | 32.9      |
| 4                  | 16.4      |

# Naive Bayes Gaussian Classifier (from scratch) - Part 3

Create a Colab script which uses a Naive Bayes classifier to predict one of the 3 species of flowers based on input attributes from the Iris dataset.
Note, you may NOT use any machine learning libraries (e.g. sklearn) for this section.
- Create an 80/20 split of the dataset (provided)
- Generate the Gaussian distributions for each of your attributes in each of the classes (species) for your training data
- Use the test data to predict the species
- Calculate and record the accuracy of your classifier
- The Iris dataset has been provided as materials for this assignment, however for testing purposes, please run the provided cell which accesses the dataset directly and performs a train/test split on the data (this is the only use of a sklearn function in this section).

Summary of lecture notes:
- Generate an initial guess (prior probabilities) for each of the classes based on the number of examples for each class, e.g.
 - P(Setosa) = 50/150
 - P(Versicolor) = 50/150
 - P(Virginica) = 50/150
Note, depending on your train/test split, the initial guess probabilities may be different
- Calculate the score for each of the classes, e.g.

*SetosaScore =  P(Setosa) * g(SepalLength | Setosa) * g(SepalWidth | Setosa) * g(PetalLength | Setosa) * g(PetalWidth | Setosa)*
etc.

where:
$$g(x)= \frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$$
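
The density g(x) and the class score can be sketched in plain Python as follows. The function names and the per-class means and standard deviations are hypothetical placeholders (two features only), not values computed from the Iris training data.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density g(x) with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def class_score(example, prior, means, stds):
    """Prior multiplied by the Gaussian density of each feature value."""
    score = prior
    for x, mu, sigma in zip(example, means, stds):
        score *= gaussian_pdf(x, mu, sigma)
    return score

# Hypothetical statistics for (petal_length, petal_width), for illustration only
example = (1.4, 0.2)
setosa = class_score(example, 1 / 3, means=(1.5, 0.25), stds=(0.2, 0.1))
versicolor = class_score(example, 1 / 3, means=(4.3, 1.3), stds=(0.5, 0.2))

print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989, peak of the standard normal
print("setosa" if setosa > versicolor else "versicolor")
```

The predicted class is simply the one with the largest score, so the scores do not need to be normalized into probabilities.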



In [None]:
#Load the dataset into a Pandas dataframe and split the data.
#The cell has been provided to set up a random seed for testing purposes.
#Please do NOT alter the code.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(csv_url, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
#print(df.head())


feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for col in feature_cols:
  df[col] = pd.to_numeric(df[col])

X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['species'], test_size=0.2, random_state=1)
#print(X_train)
#print(y_train)

In [None]:
print(df.shape)

(150, 5)


In [None]:
#Calculate the mean and standard deviation of the training set
#Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
#Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html


grouped = X_train.groupby(y_train)

train_mean = grouped.mean()
train_std = grouped.std()


#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!


import doctest
"""
   >>> print(np.round(train_mean['petal_length']['Iris-setosa'], 4))
   1.4692
   >>> print(np.round(train_std['petal_length']['Iris-setosa'], 4))
   0.1608
   >>> print(np.round(train_mean['sepal_width']['Iris-virginica'], 4))
   2.9523
   >>> print(np.round(train_std['sepal_width']['Iris-virginica'], 4))
   0.3069
"""
doctest.testmod()


TestResults(failed=0, attempted=4)
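
The prior probabilities described in the lecture summary can be estimated from the class counts in the training labels. The sketch below uses a short made-up label list and the standard library only; in the notebook, a pandas call such as `y_train.value_counts(normalize=True)` would likely give the same proportions for the real split.

```python
from collections import Counter

# Hypothetical training labels, for illustration only
train_labels = (["Iris-setosa"] * 4
                + ["Iris-versicolor"] * 3
                + ["Iris-virginica"] * 3)

# Prior for each class = number of examples in the class / total examples
label_counts = Counter(train_labels)
priors = {species: n / len(train_labels) for species, n in label_counts.items()}
print(priors)
```

Because the split is random, the three priors are generally close to, but not exactly, 1/3 each.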

In [None]:
#Prediction
#  def predict(example):
#    init best_score and best_species
#    loop over each of the species
#       set initial species_score (prior probability)
#       loop over the feature_cols (defined in the provided cell)
#           extract the mean and std for the species feature
#           calculate the species_score (accumulate probability products)
#       if species_score > best_score
#           update best_score and best_species
#     return best_species

import math

def predict(example):
  best_score = -1
  best_species = None
  for species in train_mean.index:
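    #Starting at 1 omits the prior, which is equivalent to assuming equal class priors;
    #multiply in the class prior here to follow the outline above
    species_score = 1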
    for col in feature_cols:
      mean = train_mean.loc[species, col]
      std = train_std.loc[species, col]

      # Calculate the probability using Gaussian distribution
      prob = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-0.5 * ((example[col] - mean) / std) ** 2)

      # Accumulate probability products
      species_score *= prob
    if species_score > best_score:
      best_score = species_score
      best_species = species
  return best_species

In [None]:
#Test the classifier

#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

t = 0
for i in range(len(X_test)):
  test_example = X_test.iloc[i]
  #print(test_example)
  pred = predict(test_example)
  actual = y_test.iloc[i]
  if pred == actual:
    t+=1


import doctest
"""
   >>> print(np.round(t/len(X_test), 4))
   0.9667
"""

doctest.testmod()


TestResults(failed=0, attempted=1)

# Naive Bayes Gaussian Classifier (sklearn) - Part 4

Repeat Part 3 using Scikit-learn functions:

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

#Note: X_train, X_test, y_train, y_test are from the provided
#cell in Part 3

#Instantiate the Naive Bayes classifier
#Fit the model
#Run the predictions
#Calculate the score

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)




#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

import doctest
"""
   >>> print(np.round(clf.score(X_test, y_pred), 4))
   1.0
"""

doctest.testmod()


TestResults(failed=0, attempted=1)