# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [1]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [2]:
# Simulate 1000 coin flips; each flip is recorded as the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [3]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    511
heads    489
Name: count, dtype: int64


### Calculating Probabilities
Next, we will estimate the probability of getting heads or tails from the observed relative frequencies.

In [4]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.489
Probability of Tails: 0.511
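
The values above are relative frequencies, so they fluctuate around the true probability of 0.5 for a fair coin. As a minimal sketch (the seed and the number of flips are assumptions chosen for illustration, not part of the lab), the running estimate can be tracked as more flips are added:

```python
import numpy as np

# Assumption: a fixed seed so the illustration is reproducible
rng = np.random.default_rng(0)

# Simulate fair coin flips: 1 for heads, 0 for tails
flips = rng.integers(0, 2, size=100_000)

# Running estimate of P(heads) after each flip
running_p_heads = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"After {n:>6} flips: estimated P(heads) = {running_p_heads[n - 1]:.4f}")
```

With only a few flips the estimate can be far from 0.5; with many flips it tends to settle close to it.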


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [5]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)

In [8]:
# Load the dataset (replace 'path_to_dataset.csv' with the actual file path). Uncomment the lines below and note what `df_emails.head()` displays.
# df_emails = pd.read_csv('path_to_dataset.csv')
# df_emails.head()

### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [14]:
# Check for missing values
print(df.isnull().sum())

# Forward-fill any missing values (alternatively, drop the affected rows or columns)
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas

email_length       0
contains_free      0
contains_winner    0
time_of_day        0
label              0
dtype: int64


In [13]:
# Your code for Data Preprocessing goes here
from sklearn.preprocessing import StandardScaler

# Normalize 'email_length' to zero mean and unit variance
scaler = StandardScaler()
df['email_length'] = scaler.fit_transform(df[['email_length']])

# Inspect the result
print(df)

     email_length  contains_free  contains_winner time_of_day label
0        0.465685              0                0     morning   ham
1       -0.146723              0                0     morning  spam
2        0.618787              0                0     morning  spam
3        1.537399              1                0   afternoon   ham
4       -0.248791              0                1   afternoon  spam
..            ...            ...              ...         ...   ...
995     -0.299825              0                1       night   ham
996      1.792569              0                0       night  spam
997      0.618787              0                0     evening  spam
998     -0.606029              0                1   afternoon  spam
999      0.567753              0                0     evening  spam

[1000 rows x 5 columns]


In [17]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode categorical features
# Note: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2
encoder = OneHotEncoder(drop='first', sparse_output=False)
categorical_cols = ['contains_free', 'contains_winner', 'time_of_day']
encoded_features = encoder.fit_transform(df[categorical_cols])

# Create a DataFrame with the encoded features, aligned to the original index
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_cols), index=df.index)

# Concatenate the encoded features with the original DataFrame and drop the raw columns
df = pd.concat([df, encoded_df], axis=1).drop(categorical_cols, axis=1)

In [20]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop('label', axis=1)
y = df['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')

### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [22]:
# Your code for calculating the probability of spam and ham emails in the dataset goes here
import pandas as pd

# Load the dataset
df = pd.read_csv('simulated_email_dataset.csv')

# Total number of emails
total_emails = len(df)

# Number of spam emails
spam_emails = len(df[df['label'] == 'spam'])

# Number of ham emails
ham_emails = len(df[df['label'] == 'ham'])

# Probability of spam and ham emails
prob_spam = spam_emails / total_emails
prob_ham = ham_emails / total_emails

print(f"Probability of spam emails: {prob_spam:.2f}")
print(f"Probability of ham emails: {prob_ham:.2f}")

Probability of spam emails: 0.41
Probability of ham emails: 0.59
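
The same counting idea extends to conditional probabilities such as P(spam | contains_free = 1), which are the building blocks of the classifier in the next task. The following is a minimal sketch on a small simulated DataFrame; the name `demo`, the seed, and the single feature are assumptions made for illustration, with the label drawn independently of the feature as in the lab's simulation:

```python
import numpy as np
import pandas as pd

# Assumption: a small simulated dataset that mirrors the lab's setup,
# where the label is drawn independently of the feature
rng = np.random.default_rng(42)
n_samples = 1000
demo = pd.DataFrame({
    'contains_free': rng.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'label': rng.choice(['spam', 'ham'], size=n_samples, p=[0.4, 0.6]),
})

# Marginal probabilities P(label)
marginal = demo['label'].value_counts(normalize=True)

# Conditional probabilities P(label | contains_free), one row per feature value
conditional = pd.crosstab(demo['contains_free'], demo['label'], normalize='index')

print(marginal)
print(conditional)
```

Because the label here is generated independently of the feature, each row of `conditional` should be close to `marginal`. In real email data, a large gap between the two would indicate an informative feature.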


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
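
For reference, the rule can be written as follows, under the naive assumption that the features $x_1, \dots, x_n$ are conditionally independent given the class $c \in \{\text{spam}, \text{ham}\}$ (the symbols are introduced here for illustration):

$$
P(c \mid x_1, \dots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{\sum_{c'} P(c') \prod_{i=1}^{n} P(x_i \mid c')}
$$

The numerator is the prior times the likelihood; the denominator normalizes the result so that the posteriors over both classes sum to 1.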

In [23]:
# Write a function using Bayes' Theorem for classification
# Calculate prior probabilities
prob_spam = len(df[df['label'] == 'spam']) / len(df)
prob_ham = len(df[df['label'] == 'ham']) / len(df)

# Calculate likelihoods
def calculate_likelihood(df, feature, value, label):
    return len(df[(df[feature] == value) & (df['label'] == label)]) / len(df[df['label'] == label])

# Example email to classify
email = {
    'email_length': 120,  # continuous feature; not used in the likelihoods below
    'contains_free': 1,
    'contains_winner': 0,
    'time_of_day': 'morning'
}

# Calculate likelihoods for the example email
likelihood_spam = (
    calculate_likelihood(df, 'contains_free', email['contains_free'], 'spam') *
    calculate_likelihood(df, 'contains_winner', email['contains_winner'], 'spam') *
    calculate_likelihood(df, 'time_of_day', email['time_of_day'], 'spam')
)

likelihood_ham = (
    calculate_likelihood(df, 'contains_free', email['contains_free'], 'ham') *
    calculate_likelihood(df, 'contains_winner', email['contains_winner'], 'ham') *
    calculate_likelihood(df, 'time_of_day', email['time_of_day'], 'ham')
)

# Unnormalized posteriors: prior * likelihood (the numerator of Bayes' Theorem)
posterior_spam = prob_spam * likelihood_spam
posterior_ham = prob_ham * likelihood_ham

# Normalize by the total so the two posteriors sum to 1
total_posterior = posterior_spam + posterior_ham
prob_spam_given_email = posterior_spam / total_posterior
prob_ham_given_email = posterior_ham / total_posterior

print(f"Probability of email being spam: {prob_spam_given_email:.2f}")
print(f"Probability of email being ham: {prob_ham_given_email:.2f}")

Probability of email being spam: 0.38
Probability of email being ham: 0.62
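
The steps above can also be wrapped in a reusable function. This is a minimal sketch, not a definitive implementation: the function name `naive_bayes_posterior` and the toy data are hypothetical, only categorical features are handled, and no smoothing is applied, so a feature value never seen for any class would lead to a division by zero.

```python
import pandas as pd

def naive_bayes_posterior(df, email, features, label_col='label'):
    """Return P(label | email) for each label, assuming conditionally independent features."""
    unnormalized = {}
    for label, group in df.groupby(label_col):
        # Prior: share of training emails with this label
        score = len(group) / len(df)
        # Likelihood: product of per-feature relative frequencies within the label
        for feature in features:
            score *= (group[feature] == email[feature]).mean()
        unnormalized[label] = score
    # Normalize so the posteriors sum to 1
    total = sum(unnormalized.values())
    return {label: score / total for label, score in unnormalized.items()}

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'contains_free':   [1, 1, 1, 0, 0, 0, 1, 0],
    'contains_winner': [1, 0, 1, 0, 0, 1, 0, 0],
    'label': ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham'],
})
email = {'contains_free': 1, 'contains_winner': 0}
posterior = naive_bayes_posterior(toy, email, ['contains_free', 'contains_winner'])
print(posterior)
```

On the toy data the spam posterior works out to 2/3: the spam numerator is 0.5 * 0.75 * 0.5 and the ham numerator is 0.5 * 0.25 * 0.75.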


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [25]:
# Your code goes here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Load the dataset
df = pd.read_csv('simulated_email_dataset.csv')

# Convert categorical features to numerical
df['time_of_day'] = df['time_of_day'].astype('category').cat.codes

# Split the dataset into features and labels
X = df.drop('label', axis=1)
y = df['label'].map({'spam': 1, 'ham': 0})

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model; zero_division=1 makes an undefined metric (0/0) report 1.0 instead of 0.0 with a warning
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=1)
recall = recall_score(y_test, y_pred, zero_division=1)
f1 = f1_score(y_test, y_pred, zero_division=1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Accuracy: 0.62
Precision: 1.00
Recall: 0.00
F1 Score: 0.00
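
A recall of 0.00 together with a precision of 1.00 suggests that the model never predicted spam on the test set: precision is then 0/0, and `zero_division=1` reports it as 1.0. The following sketch reproduces that situation on hypothetical labels (the `y_true` and `y_pred` values are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham; the classifier predicts ham for everything
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0] * len(y_true)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# No email is predicted as spam, so precision is 0/0 and falls back to zero_division
precision_as_one = precision_score(y_true, y_pred, zero_division=1)
precision_as_zero = precision_score(y_true, y_pred, zero_division=0)

# Recall is well defined here: 0 of the 3 spam emails were caught
recall = recall_score(y_true, y_pred, zero_division=0)

print(precision_as_one, precision_as_zero, recall)
```

Inspecting the confusion matrix alongside the scores makes this failure mode easier to spot than the summary metrics alone.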


### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.
2. Discuss how Bayesian updating improves the accuracy of the classifier.
3. What are the limitations of the model built in this lab?


1. Multinomial Naive Bayes is a common choice for classification tasks such as spam detection because it models the distribution of word counts or term frequencies in emails.

   - Text Representation: Emails are typically represented as a bag of words, where the frequency of each word is counted. The multinomial distribution is well suited to handling these word counts.
   - Feature Independence: Naive Bayes assumes that the features (words) are conditionally independent given the class (spam or ham). This assumption simplifies the computation and works reasonably well in practice for text classification.
   - Handling of Sparse Data: Text data is often sparse, meaning that most words do not appear in a given email. The Multinomial Naive Bayes classifier can handle this sparsity effectively.

2. Bayesian updating revises the classifier's probability estimates each time new data arrives. As evidence accumulates, the estimates rely less on the initial prior and more on the observed data. The process works as follows:

   - Initial Prior: Start with an initial prior probability, which represents your initial belief about the likelihood of an event (e.g., an email being spam or ham) before observing any data.
   - Likelihood: As new data (features of an email) is observed, calculate the likelihood of this data given each possible class (spam or ham).
   - Posterior Probability: Update the prior probability using Bayes’ Theorem to obtain the posterior probability. This posterior becomes the new prior for the next round of updating as more data is observed.
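
The prior, likelihood, and posterior cycle described in this answer can be sketched in a few lines. All numbers below are hypothetical and chosen only for illustration, and the helper name `bayes_update` is not part of the lab:

```python
def bayes_update(prior_spam, p_evidence_given_spam, p_evidence_given_ham):
    """One Bayesian update step: return P(spam | evidence)."""
    numerator = prior_spam * p_evidence_given_spam
    evidence = numerator + (1 - prior_spam) * p_evidence_given_ham
    return numerator / evidence

# Hypothetical numbers, for illustration only
belief = 0.4  # initial prior P(spam)
observations = [
    # (description, P(evidence | spam), P(evidence | ham))
    ('contains "free"', 0.6, 0.2),
    ('contains "winner"', 0.5, 0.1),
]

for name, p_given_spam, p_given_ham in observations:
    # The posterior from this step becomes the prior for the next one
    belief = bayes_update(belief, p_given_spam, p_given_ham)
    print(f"After observing {name}: P(spam) = {belief:.3f}")
```

Updating one observation at a time gives the same result as multiplying all likelihoods at once, provided the observations are conditionally independent given the class.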

3. The limitations of the model built in this lab are:

   - Simplistic Assumptions: Assumes feature independence, which isn’t always true.
   - Feature Representation: Uses binary features and simple categorical encoding, missing out on more nuanced information.
   - Data Quality: Based on simulated data, which may not reflect real-world complexities.
   - Model Performance: Issues like undefined precision can arise, affecting reliability.
   - Scalability: May not handle very large datasets efficiently.
   - Adaptability: Static model that doesn’t update in real-time.
   - Contextual Understanding: Lacks understanding of word context and semantics.

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.