# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [35]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [36]:
# Simulating 1000 coin flips, 0 for 'tails' and 1 for 'heads'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [21]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    525
heads    475
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [37]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.475
Probability of Tails: 0.525


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [301]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)


In [273]:
# Load the dataset (Replace 'path_to_dataset' with the actual file path). You can uncomment the codes below. Notice what `df_emails.head()` is representing
# df_emails = pd.read_csv('path_to_dataset.csv')
# df_emails.head()
# Read simulated_email_dataset csv file into data frame
df_mail=pd.read_csv('simulated_email_dataset.csv')

# Display first few rows of the Dataframe
print("First few rows of the Dataframe:")
df_mail.head()

First few rows of the Dataframe:


Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,spam
2,112,0,0,morning,spam
3,130,1,0,afternoon,ham
4,95,0,1,afternoon,spam


In [316]:
# Load the dataset (Replace 'path_to_dataset' with the actual file path). You can uncomment the codes below. Notice what `df_emails.head()` is representing.
# df_emails = pd.read_csv('path_to_dataset.csv')
# df_emails.head()

# Load the dataset (replace 'path_to_dataset' with the actual file name 'spam.csv') 
dataset_path = 'spam.csv' 
data = pd.read_csv(dataset_path, encoding='latin-1') 
data.head()


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [317]:
# Your code for Data Preprocessing goes here
# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder 

# Assuming the dataset has 'text' column for messages and 'label' column for spam/ham labels 

df = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1":"Class", "v2":"Message"})
df.head(10)


Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [318]:
#Number of observations in each label spam and ham
df.Class.value_counts()

df.describe()


Unnamed: 0,Class,Message
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [None]:
#df.loc[:,'label'] = df.label.map({'ham':0, 'spam':1})
#print(df.shape)
#df.head()

In [319]:
df['Length'] = df['Message'].apply(len)
df.head(10)

Unnamed: 0,Class,Message,Length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154


In [306]:
#Let's Label the data as 0 & 1 i.e. Spam as 1 & Ham as 0
df.loc[:,'Class'] = df.Class.map({'ham':0, 'spam':1})
print(df.shape)
df.head()

(5572, 3)


Unnamed: 0,Class,Message,Length
0,0,"Go until jurong point, crazy.. Available only ...",111
1,0,Ok lar... Joking wif u oni...,29
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,0,U dun say so early hor... U c already then say...,49
4,0,"Nah I don't think he goes to usf, he lives aro...",61


In [320]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
 2   Length   5572 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 130.7+ KB


In [321]:
df.groupby('Class').count()

Unnamed: 0_level_0,Message,Length
Class,Unnamed: 1_level_1,Unnamed: 2_level_1
ham,4825,4825
spam,747,747


In [322]:
df.describe()

Unnamed: 0,Length
count,5572.0
mean,80.118808
std,59.690841
min,2.0
25%,36.0
50%,61.0
75%,121.0
max,910.0


In [323]:
# Bag of Words 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

count = CountVectorizer()
text = count.fit_transform(df['Message'])
#Train & test split
x_train, x_test, y_train, y_test = train_test_split(text, df['Class'], test_size=0.30, random_state=100)
text

<5572x8672 sparse matrix of type '<class 'numpy.int64'>'
	with 73916 stored elements in Compressed Sparse Row format>

In [311]:
#Let's print the dimentions of the train & test dataset
display('X-Train :', x_train.shape)
display('X-Test :',x_test.shape)
display('Y-Train :',y_train.shape)
display('X-Test :',y_test.shape)

'X-Train :'

(3900, 8672)

'X-Test :'

(1672, 8672)

'Y-Train :'

(3900,)

'X-Test :'

(1672,)

In [328]:
prediction = multinomial_nb_model.predict(x_test)

print("Multinomial NB")
print("Accuracy score: {}". format(accuracy_score(y_test, prediction)) )
#print("Precision score: {}". format(precision_score(y_test, prediction)) )
#print("Recall score: {}". format(recall_score(y_test, prediction)))
#print("F1 score: {}". format(f1_score(y_test, prediction)))

Multinomial NB
Accuracy score: 0.9814593301435407


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [53]:
# Your code for calculating the probability of spam and ham emails in the dataset goes here
# Count the labels which has the column containing spam/ham labels
label_counts = df_mail['label'].value_counts()
label_counts

# Calculate the probabilities
total_emails = len(df_mail)
prob_spam = label_counts.get('spam', 0)/total_emails
prob_ham = label_counts.get('ham', 1)/total_emails


# Print probabilities
print("Probability of spam emails:", prob_spam)
print("Probability of ham emails:", prob_ham)


Probability of spam emails: 0.0
Probability of ham emails: 0.001


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.

In [331]:
# Write a function using Bayes' Theorem for classification
# Train Naives Bayes Classifier

# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder



 #Training the ML models 
# Multinomial Naive Bayes model 

from sklearn.naive_bayes import MultinomialNB

multinomial_nb_model = MultinomialNB()
multinomial_nb_model.fit(x_train, y_train)  # Train the model


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [332]:
# Your code goes here
prediction = multinomial_nb_model.predict(x_test)

print("Multinomial NB")
print("Accuracy score: {}". format(accuracy_score(y_test, prediction)) )


Multinomial NB
Accuracy score: 0.9814593301435407


### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.
2. Discuss how Bayesian updating improves the accuracy of the classifier.
3. What are the limitations of the model built in this lab?

# Solution:
1. The rationale for this is that probability distribution is appropriate for discrete outcomes with more than two categories, making it useful when dealing with many classifications (for example, ham, spam). It assumes that features are counts or frequencies (for example, word counts in email), allowing it to simulate the distribution of several discrete categories. In this lab activity, because the features are word counts, the multinomial distribution is the ideal choice for modelling the distribution of words in spam and ham emails.

2. Bayesian updating is a key notion in Bayesian statistics that allows us to modify our probabilities in response to new data. In the scenario of an email classifier, Bayesian updating increases model accuracy by iteratively adding new data and updating predictions. It incorporates a prior probability distribution that represents initial probabilities about the model's parameters.

3. The key limitations of Multinomial Naive Bayes include:
(i) Lack of context understanding: Word context and semantic meaning are not captured by Multinomial Naive Bayes (MNB). For example, it has difficulty differentiating between multiple meanings of a word based only on frequency of recurrence.

(ii) Loss of word order information: MNB sees documents as unordered bags of words, regardless of the sequence in which they occur. This constraint can be relevant in the context of emails, when word order and context are critical to comprehension and meaning.

(iii) Difficulty with rare words: Words that are uncommon or unknown in the training data provide issues for MNB. Because it calculates probabilities using observed frequencies, it may give zero probability to unknown words, resulting in poor generalization.

(iv) Impact of class imbalance: A class imbalance, in which one type (for example, spam) is considerably more abundant than ham, might have an impact on Multinomial Naive Bayes performance. It may skew the classifier toward the majority class, particularly when calculating class priors. 


## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.