# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [1]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [2]:
# Simulate 1000 fair coin flips; each result is the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [3]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    532
heads    468
Name: count, dtype: int64


### Calculating Probabilities
Next, we will estimate the probability of getting heads or tails from the observed frequencies. For a fair coin, both estimates should be close to 0.5.

In [4]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.468
Probability of Tails: 0.532
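
The estimates above are not exactly 0.5 because they come from a finite sample. As a rough illustration, the sketch below repeats the experiment with different numbers of flips. The helper `empirical_p_heads` and the seed are assumptions made for this example and are not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility

def empirical_p_heads(n_flips, rng):
    """Return the fraction of heads in n_flips simulated fair coin flips."""
    flips = rng.choice(['heads', 'tails'], size=n_flips)
    return np.mean(flips == 'heads')

# The estimate tends to move closer to 0.5 as the number of flips grows
for n in (10, 100, 1_000, 100_000):
    print(f"{n:>7} flips: P(heads) ~ {empirical_p_heads(n, rng):.4f}")
```

With only a few flips the estimate can be far from 0.5, and with many flips it usually settles near 0.5.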


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [1]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Replace the random labels with ones that depend on contains_free and contains_winner
for index, row in df.iterrows():
    prob = min(1, .7 *row["contains_free"] + .7*row["contains_winner"]+.1)
    df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1-prob])

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)

In [6]:
# Load the dataset saved above (replace the file name with your own path if you use a different dataset), then look at the first rows with `df.head()`.
df = pd.read_csv('simulated_email_dataset.csv')
df.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,ham
2,112,0,0,morning,ham
3,130,1,0,afternoon,spam
4,95,0,1,afternoon,spam


### Task 2: Data Preprocessing
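You need to preprocess the data for analysis. Here, this means discretizing the continuous `email_length` feature into three quantile-based categories (`short`, `medium`, `long`) so that it can be treated like the other categorical features.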

In [7]:
categories = ['short', 'medium', 'long']

df['Len'] = pd.qcut(df['email_length'], len(categories), labels = categories)
df.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label,Len
0,109,0,0,morning,ham,long
1,97,0,0,morning,ham,medium
2,112,0,0,morning,ham,long
3,130,1,0,afternoon,spam,long
4,95,0,1,afternoon,spam,medium
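
`pd.qcut` creates equal-frequency bins, so each category receives roughly a third of the emails. The sketch below contrasts it with `pd.cut`, which creates equal-width bins. The series `lengths` is simulated for this example only and stands in for `df['email_length']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical stand-in for the email_length column
lengths = pd.Series(rng.normal(100, 20, 1000).astype(int), name='email_length')
categories = ['short', 'medium', 'long']

# Equal-frequency bins: cut points are the 1/3 and 2/3 quantiles
by_quantile, edges = pd.qcut(lengths, len(categories), labels=categories, retbins=True)
# Equal-width bins: the observed range is split into three equal intervals
by_width = pd.cut(lengths, len(categories), labels=categories)

print("quantile bin edges:", edges)
print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

For bell-shaped data like this, equal-width bins put most emails in `medium`, while quantile bins keep the three groups balanced.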


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [9]:
df.label.value_counts()

label
spam    590
ham     410
Name: count, dtype: int64

In [16]:
# Calculate the prior probabilities of spam and ham from the label frequencies in the dataset

counts = df['label'].value_counts()

# prior probabilities
p_spam = counts['spam'] / len(df)
p_ham = counts['ham'] / len(df)

print(f"Probability of Spam: {p_spam}, Probability of Ham: {p_ham}")

Probability of Spam: 0.59, Probability of Ham: 0.41
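
The same priors can be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a toy `labels` series built to match the class counts printed above (590 spam, 410 ham). It is a stand-in for `df['label']`, not the lab data itself.

```python
import pandas as pd

# Toy label column with the same class counts as the output above
labels = pd.Series(['spam'] * 590 + ['ham'] * 410, name='label')

# normalize=True returns relative frequencies instead of raw counts
priors = labels.value_counts(normalize=True)
print(f"P(spam) = {priors['spam']}, P(ham) = {priors['ham']}")
```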


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
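
For reference, the rule this task relies on can be written as follows, assuming the features $x_1, \dots, x_n$ (here `Len`, `contains_free`, `contains_winner`, `time_of_day`) are conditionally independent given the label $y \in \{\text{spam}, \text{ham}\}$. This is the standard naive Bayes form; the symbols are introduced here for illustration.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{\sum_{y'} P(y')\,\prod_{i=1}^{n} P(x_i \mid y')}
$$

The denominator is the same for both labels, so comparing the numerators is enough to pick a class. Dividing by their sum turns the two scores into probabilities that add up to 1.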

In [19]:
df.columns

Index(['email_length', 'contains_free', 'contains_winner', 'time_of_day',
       'label', 'Len'],
      dtype='object')

In [20]:
df.time_of_day.unique()

array(['morning', 'afternoon', 'evening', 'night'], dtype=object)

In [21]:
df.Len.unique()

['long', 'medium', 'short']
Categories (3, object): ['short' < 'medium' < 'long']

In [50]:
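# Naive Bayes: the features are assumed to be conditionally independent given the label.
# The posteriors are proportional (not equal) to the products below until they are normalized.
# P(spam | L, F, W, TD) ∝ P(L|spam) * P(F|spam) * P(W|spam) * P(TD|spam) * P(spam)
# P(ham  | L, F, W, TD) ∝ P(L|ham)  * P(F|ham)  * P(W|ham)  * P(TD|ham)  * P(ham)
# L = Len, F = contains_free, W = contains_winner, TD = time_of_day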


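# Conditional probability of one feature value given one target value, e.g. P(Len = medium | spam)
def calc_one_feature_val_one_target_val_prob(df, feature, feature_value, target, target_value):
    # Rows where both the feature value and the target value match
    top = len(df[(df[feature] == feature_value) & (df[target] == target_value)])
    # All rows with this target value
    bottom = len(df[df[target] == target_value])
    # Guard against a target value that never occurs in the data
    if bottom != 0:
        return top / bottom
    else:
        return 0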

print(calc_one_feature_val_one_target_val_prob(df,'Len', 'medium', 'label', 'spam'))

0.3440677966101695


In [49]:
df['label'].unique()

array(['ham', 'spam'], dtype=object)

In [51]:
# P(feature = feature_value | target = t) for every target value t
def calc_one_feature_val_all_target_val_prob(df, feature, feature_value, target):
    probs = {}
    for target_value in df[target].unique():
        probs[target_value] = calc_one_feature_val_one_target_val_prob(df, feature, feature_value, target, target_value)
    return probs

print(calc_one_feature_val_all_target_val_prob(df,'Len', 'medium', 'label'))

{'ham': 0.33902439024390246, 'spam': 0.3440677966101695}


In [None]:
#### conditional probabilities for all feature_values of a specific feature

In [30]:
# P(feature = v | target = t) for every feature value v and every target value t
def calc__all_feature_val_all_target_val_prob(df, feature, target):
    probabilities = {}
    for feature_value in df[feature].unique():
        probabilities[feature_value] = calc_one_feature_val_all_target_val_prob(df, feature, feature_value, target)
    return probabilities

print(calc__all_feature_val_all_target_val_prob(df, 'Len', 'label'))

{'long': {'ham': 0.29024390243902437, 'spam': 0.3474576271186441}, 'medium': {'ham': 0.33902439024390246, 'spam': 0.3440677966101695}, 'short': {'ham': 0.37073170731707317, 'spam': 0.30847457627118646}}


In [32]:
# Calculate conditional probabilities for each feature
features = ['Len', 'contains_free', 'contains_winner', 'time_of_day']
conditional_probabilities = {}
for feature in features:
    conditional_probabilities[feature] = calc__all_feature_val_all_target_val_prob(df, feature, 'label')
conditional_probabilities

{'Len': {'long': {'ham': 0.29024390243902437, 'spam': 0.3474576271186441},
  'medium': {'ham': 0.33902439024390246, 'spam': 0.3440677966101695},
  'short': {'ham': 0.37073170731707317, 'spam': 0.30847457627118646}},
 'contains_free': {0: {'ham': 0.9243902439024391, 'spam': 0.5423728813559322},
  1: {'ham': 0.07560975609756097, 'spam': 0.4576271186440678}},
 'contains_winner': {0: {'ham': 0.848780487804878, 'spam': 0.2711864406779661},
  1: {'ham': 0.15121951219512195, 'spam': 0.7288135593220338}},
 'time_of_day': {'morning': {'ham': 0.21951219512195122,
   'spam': 0.23728813559322035},
  'afternoon': {'ham': 0.23902439024390243, 'spam': 0.2423728813559322},
  'evening': {'ham': 0.24878048780487805, 'spam': 0.26101694915254237},
  'night': {'ham': 0.2926829268292683, 'spam': 0.2593220338983051}}}
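
As a cross-check on nested loops like the ones above, a conditional probability table for one feature can also be built with `pd.crosstab`. The sketch below uses a small made-up DataFrame `toy`, not the lab dataset. With `normalize='columns'`, each column sums to 1 and holds P(feature value | label).

```python
import pandas as pd

# Tiny made-up dataset for illustration only
toy = pd.DataFrame({
    'contains_free': [1, 1, 0, 0, 1, 0],
    'label': ['spam', 'spam', 'spam', 'ham', 'ham', 'ham'],
})

# Rows: feature values, columns: labels.
# normalize='columns' divides each column by its total.
cond = pd.crosstab(toy['contains_free'], toy['label'], normalize='columns')
print(cond)

# P(contains_free = 1 | spam): 2 of the 3 spam rows contain "free"
print(cond.loc[1, 'spam'])
```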

In [64]:
conditional_probabilities['Len']

{'long': {'ham': 0.29024390243902437, 'spam': 0.3474576271186441},
 'medium': {'ham': 0.33902439024390246, 'spam': 0.3440677966101695},
 'short': {'ham': 0.37073170731707317, 'spam': 0.30847457627118646}}

In [65]:
conditional_probabilities['Len']['medium']

{'ham': 0.33902439024390246, 'spam': 0.3440677966101695}

In [75]:
def Bayes_email_classifier(email, probabilities, prior_spam_prob, prior_ham_prob):
    spam_probability = prior_spam_prob
    ham_probability = prior_ham_prob

    for feature, feature_value in email.items():
        # Feature values never seen in the training data are skipped
        if feature_value in probabilities[feature]:
            spam_probability *= probabilities[feature][feature_value]['spam']
            ham_probability *= probabilities[feature][feature_value]['ham']

    # Normalization step: rescale so the two posteriors sum to 1
    total = spam_probability + ham_probability
    spam_probability = spam_probability / total
    ham_probability = ham_probability / total

    # Also return both posteriors to make the decision easier to understand
    if spam_probability > ham_probability:
        return 'spam', spam_probability, ham_probability
    else:
        return 'ham', spam_probability, ham_probability




email_1 = {
    'Len': 'short',
    'contains_free': 1,
    'contains_winner': 0,
    'time_of_day': 'morning'
}

Result = Bayes_email_classifier(email_1, conditional_probabilities, p_spam, p_ham)
print(f"This email is probably: {Result}")

This email is probably: ('spam', 0.7145260913282748, 0.2854739086717252)
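
The posterior printed above can be reproduced by hand. The sketch below copies the prior and conditional probabilities for `email_1` from the earlier outputs and multiplies them step by step. The dictionary `likelihoods` and its keys are introduced only for this example.

```python
# Values copied from the outputs above for email_1
p_spam, p_ham = 0.59, 0.41
likelihoods = {
    'Len=short':           {'spam': 0.30847457627118646, 'ham': 0.37073170731707317},
    'contains_free=1':     {'spam': 0.4576271186440678,  'ham': 0.07560975609756097},
    'contains_winner=0':   {'spam': 0.2711864406779661,  'ham': 0.848780487804878},
    'time_of_day=morning': {'spam': 0.23728813559322035, 'ham': 0.21951219512195122},
}

# Start from the priors and multiply in one likelihood per feature
spam_score, ham_score = p_spam, p_ham
for probs in likelihoods.values():
    spam_score *= probs['spam']
    ham_score *= probs['ham']

# Normalize so the two posteriors sum to 1
posterior_spam = spam_score / (spam_score + ham_score)
print(f"unnormalized scores: spam={spam_score:.6f}, ham={ham_score:.6f}")
print(f"P(spam | email_1) = {posterior_spam:.4f}")
```

The unnormalized scores are small numbers of similar magnitude. The `contains_free = 1` term (about 0.46 for spam versus 0.08 for ham) does most of the work in favoring spam.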


In [78]:
def Bayes_email_classifier(email, probabilities, prior_spam_prob, prior_ham_prob):
    spam_probability = prior_spam_prob
    ham_probability = prior_ham_prob

    for feature, feature_value in email.items():
        # Feature values never seen in the training data are skipped
        if feature_value in probabilities[feature]:
            spam_probability *= probabilities[feature][feature_value]['spam']
            ham_probability *= probabilities[feature][feature_value]['ham']

    # Normalization step: rescale so the two posteriors sum to 1
    total = spam_probability + ham_probability
    spam_probability = spam_probability / total
    ham_probability = ham_probability / total

    # This version returns only the label, which is what Task 5 needs
    if spam_probability > ham_probability:
        return 'spam'
    else:
        return 'ham'




email_1 = {
    'Len': 'short',
    'contains_free': 1,
    'contains_winner': 0,
    'time_of_day': 'morning'
}

Result = Bayes_email_classifier(email_1, conditional_probabilities, p_spam, p_ham)
print(f"This email is probably: {Result}")

This email is probably: spam
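
Multiplying many probabilities can underflow toward zero when there are many features. A common workaround is to add log-probabilities instead. The sketch below is a hypothetical variant, not the classifier used in this lab. The name `classify_log_space` and the `priors` dictionary are assumptions for illustration, and the conditional probabilities are copied from the `contains_free` output above.

```python
import math

def classify_log_space(email, probabilities, priors):
    """Hypothetical log-space variant: sums log-probabilities per class."""
    log_scores = {label: math.log(p) for label, p in priors.items()}
    for feature, value in email.items():
        if value not in probabilities.get(feature, {}):
            continue  # unseen feature value: skipped, as in the lab classifier
        for label in log_scores:
            p = probabilities[feature][value][label]
            # log(0) is undefined, so a zero probability maps to -infinity
            log_scores[label] += math.log(p) if p > 0 else float('-inf')
    # The class with the largest log-score also has the largest posterior
    return max(log_scores, key=log_scores.get), log_scores

probabilities = {
    'contains_free': {0: {'ham': 0.9243902439024391, 'spam': 0.5423728813559322},
                      1: {'ham': 0.07560975609756097, 'spam': 0.4576271186440678}},
}
priors = {'spam': 0.59, 'ham': 0.41}

label, log_scores = classify_log_space({'contains_free': 1}, probabilities, priors)
print(label, log_scores)
```

Because the logarithm is monotonic, the predicted label is the same as with plain multiplication. Only the numerical stability changes.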


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [100]:
# create the test set

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
# The seed is not reset to 42 here: resetting it would regenerate the training data exactly
# np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

t_df = pd.DataFrame(data)

# Replace the random labels with ones that depend on contains_free and contains_winner
for index, row in t_df.iterrows():
    prob = min(1, .7 *row["contains_free"] + .7*row["contains_winner"]+.1)
    t_df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1-prob])


In [101]:
# preprocessing the test set
categories = ['short', 'medium', 'long']

t_df['Len'] = pd.qcut(t_df['email_length'], len(categories), labels = categories)
t_df.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label,Len
0,88,0,0,afternoon,ham,short
1,94,0,0,afternoon,ham,medium
2,116,0,0,morning,ham,long
3,88,0,1,night,spam,short
4,85,0,0,evening,ham,short
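
Note that calling `pd.qcut` on the test set derives the cut points from the test data's own quantiles, so `short`, `medium`, and `long` may not mean exactly the same thing as in the training set. One possible alternative, sketched here with simulated stand-ins (`train_len` and `test_len` are hypothetical names), is to keep the training bin edges and apply them to the test data with `pd.cut`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the training and test email_length columns
train_len = pd.Series(rng.normal(100, 20, 1000).astype(int))
test_len = pd.Series(rng.normal(100, 20, 1000).astype(int))
categories = ['short', 'medium', 'long']

# Learn the bin edges on the training data only
_, edges = pd.qcut(train_len, len(categories), labels=categories, retbins=True)

# Open the outer edges so test values outside the training range still get a bin
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test data
test_binned = pd.cut(test_len, bins=edges, labels=categories)
print(edges)
print(test_binned.value_counts().sort_index())
```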


In [102]:
predictions = []

for _, record in t_df.iterrows():
    prediction = Bayes_email_classifier(
        {
            'Len': record['Len'],
            'contains_free': record['contains_free'],
            'contains_winner': record['contains_winner'],
            'time_of_day': record['time_of_day']
        },
        conditional_probabilities,
        p_spam,
        p_ham
    )
    predictions.append(prediction)

t_df['prediction'] = predictions
t_df.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label,Len,prediction
0,88,0,0,afternoon,ham,short,ham
1,94,0,0,afternoon,ham,medium,ham
2,116,0,0,morning,ham,long,ham
3,88,0,1,night,spam,short,spam
4,85,0,0,evening,ham,short,ham


In [103]:
# Accuracy on the test set, as a percentage of correctly classified emails

((t_df['label'] == t_df['prediction']).sum()/len(t_df)) * 100

85.5
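
Accuracy alone does not show which kind of mistake the classifier makes. A confusion matrix separates spam that was missed from ham that was wrongly flagged. The sketch below uses a small made-up `results` table rather than `t_df`, and the same `pd.crosstab` call could be applied to `t_df['label']` and `t_df['prediction']`.

```python
import pandas as pd

# Made-up labels and predictions for illustration only
results = pd.DataFrame({
    'label':      ['spam', 'spam', 'spam', 'ham', 'ham', 'ham',  'ham', 'spam'],
    'prediction': ['spam', 'spam', 'ham',  'ham', 'ham', 'spam', 'ham', 'spam'],
})

# Rows: true labels, columns: predicted labels
confusion = pd.crosstab(results['label'], results['prediction'])
print(confusion)

tp = confusion.loc['spam', 'spam']  # spam correctly flagged
fp = confusion.loc['ham', 'spam']   # ham wrongly flagged as spam
fn = confusion.loc['spam', 'ham']   # spam that slipped through

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (results['label'] == results['prediction']).mean()
print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```

It can also help to compare the accuracy against a baseline that always predicts the majority class.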

### Task 6: Discussion
1. Discuss how Bayesian updating improves the accuracy of the classifier.
2. What are the limitations of the model built in this lab?


In [None]:
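# 1) Bayesian updating is significant because it combines the prior probabilities (how common spam and ham are)
# with the likelihood of the new evidence (the features of the email) to produce a posterior probability.
# It balances prior knowledge and new evidence, which helps keep the model from overreacting to noise in the
# new data or from ignoring prior information, leading to more accurate predictions.
# It also supports continuous learning: the counts behind the probabilities can be updated as new labeled
# emails arrive, with no need to rebuild the model.
# Note that this classifier does not adjust for dependencies among features: it multiplies the per-feature
# likelihoods, which assumes the features are conditionally independent given the label.

# 2) If a feature value never occurs together with a class in the training set, its conditional probability is zero,
# which drives the whole product for that class to zero. Feature values missing from the training set entirely are skipped.
# The conditional independence assumption rarely holds exactly for real emails.
# The accuracy of the classifier depends on the quality of the training data.
# It requires the conversion of continuous data into categories (like binning), which might not always capture the underlying patterns accurately.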
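
One common way to soften the zero-probability limitation is add-one (Laplace) smoothing, which adds a small pseudo-count to every feature value. The sketch below is a hypothetical variant of the conditional probability function, not part of the lab solution. The name `smoothed_conditional`, the `alpha` parameter, and the `toy` DataFrame are assumptions for illustration.

```python
import pandas as pd

def smoothed_conditional(df, feature, feature_value, target, target_value, alpha=1.0):
    """P(feature = feature_value | target = target_value) with add-alpha smoothing."""
    in_class = df[df[target] == target_value]
    matches = (in_class[feature] == feature_value).sum()
    n_values = df[feature].nunique()  # number of distinct values of the feature
    return (matches + alpha) / (len(in_class) + alpha * n_values)

# Made-up data in which 'evening' never occurs together with 'spam'
toy = pd.DataFrame({
    'time_of_day': ['morning', 'night', 'night', 'morning', 'evening'],
    'label':       ['spam',    'spam',  'spam',  'ham',     'ham'],
})

# Unsmoothed, this would be 0/3 = 0; smoothed, it is (0 + 1) / (3 + 3)
print(smoothed_conditional(toy, 'time_of_day', 'evening', 'label', 'spam'))
```

The smoothed probabilities for a class still sum to 1 across the feature's values, and no single unseen combination can zero out the whole product.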

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.