# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify each message as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [11]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
# Your code here
import pandas as pd
import numpy as np
emails = pd.read_csv("/content/emails.csv")

In [5]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [10]:
# Analyse the data and remove or modify rows with missing or invalid values
# Check for missing values
print(emails.isnull().sum())

# Check for invalid values in the 'spam' column (by inspecting its unique values)
print(emails['spam'].unique())

# Example: Remove rows with missing values in 'text' column
emails = emails.dropna(subset=['text'])

# Example: Remove rows where 'spam' is not 0 or 1
emails = emails[emails['spam'].isin([0, 1])]

text    0
spam    0
dtype: int64
[1 0]


## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [12]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words

    Parameters:
        text (str): The email text

    Returns:
        list: List of unique words in the email
    """
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates
    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates

    # Your code here
    l_text = text.lower()            # 1. lowercase
    words = l_text.split()           # 2. split on whitespace
    unique_words = list(set(words))  # 3. drop duplicates (order is not preserved)
    return unique_words
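
Splitting on whitespace keeps punctuation as tokens (for example `,` or `subject:`). As a possible refinement, and not part of the required solution, the sketch below keeps only alphanumeric tokens. The name `process_email_regex` is hypothetical, and the output is sorted only to make it reproducible.

```python
import re

def process_email_regex(text):
    """Hypothetical variant of process_email: alphanumeric tokens only."""
    # Lowercase first, then pull out runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Deduplicate, then sort so the result does not depend on set ordering.
    return sorted(set(tokens))

print(process_email_regex("Subject: FREE logos , free Logos !"))
```

Note that this pattern also splits contractions ("don't" becomes `don` and `t`), so whether it helps depends on the data.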

In [13]:
# Apply preprocessing to all emails

emails['words'] = emails['text'].apply(process_email)


In [14]:
# Test your preprocessing by testing on the first email
emails['words'].iloc[0]  # positional access; still works if row label 0 was dropped during cleaning

['result',
 'ciear',
 'all',
 'at',
 'guaranteed',
 'business',
 'gaps',
 'promptness',
 'amount',
 'benefits',
 'here',
 'organization',
 'of',
 '%',
 'recollect',
 'company',
 'website',
 'ordered',
 "'",
 'irresistible',
 'much',
 'it',
 'isoverwhelminq',
 'naturally',
 'days',
 'list',
 'promise',
 'content',
 'identity',
 '100',
 'practicable',
 'your',
 'marketing',
 'within',
 'stationery',
 'provide',
 'budget',
 'are',
 'full',
 'collaboration',
 'aim',
 'three',
 'statlonery',
 'efforts',
 'look',
 'products',
 'to',
 'good',
 'iogo',
 'logos',
 'corporate',
 'outstanding',
 ';',
 'make',
 'use',
 ':',
 'this',
 'system',
 'automaticaily',
 'really',
 'even',
 'not',
 'with',
 'havinq',
 'in',
 'fees',
 'extra',
 ',',
 'changes',
 'ieader',
 'more',
 'suqgestions',
 'we',
 'convenience',
 'lt',
 'easier',
 'break',
 'clear',
 'be',
 'subject:',
 'reflect',
 'affordability',
 'effective',
 'but',
 'you',
 'made',
 'information',
 'image',
 'love',
 'without',
 'done',
 'manage

## Step 3: Calculate Prior Probabilities
Let's calculate the prior probability that an email is spam, that is, the fraction of spam emails in the dataset before looking at any words.

In [15]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)  # Your code here
num_spam = sum(emails['spam'])  # Your code here
spam_probability = num_spam / num_emails  # Your code here

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388
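
The prediction step later also needs the ham count and, if working in log space, the log priors. A minimal sketch, using the counts printed above as hard-coded values so that it runs on its own (the variable names other than `num_emails` and `num_spam` are assumptions):

```python
import math

# Counts copied from the output above.
num_emails = 5728
num_spam = 1368

# Every email that is not spam is ham.
num_ham = num_emails - num_spam

spam_prior = num_spam / num_emails
ham_prior = num_ham / num_emails

# Log priors are convenient when per-word probabilities are summed as logs.
log_spam_prior = math.log(spam_prior)
log_ham_prior = math.log(ham_prior)

print(f"Number of ham emails: {num_ham}")
print(f"P(spam) = {spam_prior:.4f}, P(ham) = {ham_prior:.4f}")
```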


## Step 4: Training the Model
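Now we'll build our Naive Bayes model by counting, for each word, how many spam emails and how many ham emails contain it. Because each email's word list holds unique words, a word is counted at most once per email.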

In [None]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data

    Parameters:
        emails_data (DataFrame): DataFrame with 'words' and 'spam' columns

    Returns:
        dict: Dictionary with word frequencies in spam and ham emails
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    model = {}

    # Your code here
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}

    return model
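
One possible way to fill in the function above is sketched here on a toy DataFrame. It follows the hints (counts start at 1, and the structure is `model[word] = {'spam': count, 'ham': count}`); the name `train_naive_bayes_sketch` and the toy data are assumptions made for illustration, not the reference solution.

```python
import pandas as pd

def train_naive_bayes_sketch(emails_data):
    """Count, per word, the spam and ham emails containing it (counts start at 1)."""
    model = {}
    for words, label in zip(emails_data['words'], emails_data['spam']):
        key = 'spam' if label == 1 else 'ham'
        for word in words:
            if word not in model:
                # Laplace smoothing: start both counts at 1.
                model[word] = {'spam': 1, 'ham': 1}
            model[word][key] += 1
    return model

# Toy data (hypothetical): two spam emails and one ham email.
toy = pd.DataFrame({
    'words': [['win', 'lottery'], ['meeting', 'tomorrow'], ['lottery', 'sale']],
    'spam': [1, 0, 1],
})
toy_model = train_naive_bayes_sketch(toy)
print(toy_model['lottery'])
print(toy_model['meeting'])
```

Looking up a word with `model.get(word)` rather than `model[word]` avoids a `KeyError` for words that never appeared in training.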

In [None]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'



## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.

In [None]:
def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes

    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data

    Returns:
        float: Probability that the email is spam
    """
    # TODO: Implement the Naive Bayes prediction
    # 1. Preprocess the email text
    # 2. Calculate probability using the Naive Bayes formula

    # Your code here

    # HINT: Use the log of probabilities to avoid numerical underflow
    # HINT: Remember to handle words not in the training data
    pass
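
The hints above can be sketched as follows. This is one possible reading, not the reference solution: it assumes the smoothed estimate P(word | class) = count / (class size + 2), which matches counts that start at 1; it only uses words present in the email; and it skips words the model has never seen. The function name and the toy model are hypothetical.

```python
import math

def predict_naive_bayes_sketch(email_text, model, num_spam, num_ham):
    """Return P(spam | words) computed in log space."""
    # Same preprocessing as training: lowercase, split, deduplicate.
    words = set(email_text.lower().split())

    total = num_spam + num_ham
    log_spam = math.log(num_spam / total)  # log prior for spam
    log_ham = math.log(num_ham / total)    # log prior for ham

    for word in words:
        if word not in model:
            continue  # unseen words carry no evidence in this sketch
        # Assumed smoothing: counts start at 1, so add 2 to the class size.
        log_spam += math.log(model[word]['spam'] / (num_spam + 2))
        log_ham += math.log(model[word]['ham'] / (num_ham + 2))

    # P(spam | words) = 1 / (1 + exp(log_ham - log_spam))
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Toy model (hypothetical counts) built from 2 spam emails and 1 ham email.
toy_model = {
    'lottery': {'spam': 3, 'ham': 1},
    'meeting': {'spam': 1, 'ham': 2},
}
for text in ["lottery winner", "meeting tomorrow at 3pm"]:
    p = predict_naive_bayes_sketch(text, toy_model, num_spam=2, num_ham=1)
    print(f"{text!r}: {p:.4f}")
```

Summing logs instead of multiplying probabilities keeps the scores representable even for long emails, where the raw product would underflow to zero.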

In [None]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]

## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?
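
For question 1, one way to quantify performance is to compare predicted spam probabilities against the true labels. The helper below is a sketch under assumptions: the name `evaluate_predictions`, the 0.5 threshold, and the toy inputs are all illustrative. It uses only `numpy`, in line with the allowed libraries.

```python
import numpy as np

def evaluate_predictions(true_labels, spam_probabilities, threshold=0.5):
    """Compute accuracy, precision and recall for the spam class."""
    y_true = np.asarray(true_labels)
    # Classify as spam when the probability reaches the threshold.
    y_pred = (np.asarray(spam_probabilities) >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam caught
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # ham flagged as spam
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam missed

    return {
        'accuracy': float(np.mean(y_pred == y_true)),
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy example (hypothetical labels and probabilities).
metrics = evaluate_predictions([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7])
print(metrics)
```

For a fair estimate, the emails being scored should be held out from training, for example by setting aside a random subset of rows before calling the training function.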

### Notes (if any):