# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [None]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
import pandas as pd
emails = pd.read_csv("emails.csv")

In [None]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

In [None]:
#Analyse the data and remove or modify rows with missing or invalid values
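
One possible way to approach this cleaning step is sketched below on a small made-up DataFrame. The helper name `clean_emails` and the toy rows are assumptions for illustration only; the sketch assumes, as the hints above state, that the data has a `text` column and a `spam` column holding 0 or 1.

```python
import pandas as pd

def clean_emails(df):
    """Return a cleaned copy of df (hypothetical helper, not part of the exercise)."""
    # Drop rows where either column is missing
    cleaned = df.dropna(subset=["text", "spam"]).copy()
    # Coerce labels to numbers; non-numeric labels become NaN and are filtered out below
    cleaned["spam"] = pd.to_numeric(cleaned["spam"], errors="coerce")
    # Normalise text so whitespace-only emails become empty strings
    cleaned["text"] = cleaned["text"].astype(str).str.strip()
    # Keep only rows with a valid 0/1 label and non-empty text
    keep = cleaned["spam"].isin([0, 1]) & (cleaned["text"] != "")
    cleaned = cleaned[keep].drop_duplicates(subset="text").reset_index(drop=True)
    cleaned["spam"] = cleaned["spam"].astype(int)
    return cleaned

# Made-up rows for illustration only
toy = pd.DataFrame({
    "text": ["win money now", None, "   ", "team meeting at noon", "win money now", "hello"],
    "spam": [1, 0, 0, 0, 1, "maybe"],
})
cleaned_toy = clean_emails(toy)
print(cleaned_toy)
```

Whether duplicate emails should be dropped is a judgment call: keeping them gives repeated messages more weight in the word counts.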

## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [None]:
def process_email(text):
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates

    # str() guards against non-string values such as NaN
    text = str(text).lower()
    # split() with no argument splits on any whitespace and skips empty strings
    words = text.split()
    # set() removes duplicates; converting back to a list keeps the return type simple
    words = list(set(words))

    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates
    return words

In [None]:
# Apply preprocessing to all emails
emails["processed_text"] = emails["text"].apply(process_email)

In [None]:
# Test your preprocessing on the first email
first=emails["text"].iloc[0]
processed=process_email(first)
print(first,processed,sep="\n")
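
Splitting on whitespace alone leaves punctuation attached to words, so `prize!!!` and `prize,` are treated as different words. The sketch below shows one possible refinement using the standard-library `re` module. The name `process_email_strict` is hypothetical and not part of the exercise, and the choice of which characters count as part of a word is an assumption.

```python
import re

def process_email_strict(text):
    """Hypothetical variant of process_email that also strips punctuation."""
    # Find runs of lowercase letters, digits, or apostrophes
    tokens = re.findall(r"[a-z0-9']+", str(text).lower())
    # Sorting makes the output order reproducible
    return sorted(set(tokens))

strict_words = process_email_strict("Win a PRIZE!!! Claim your prize, now.")
print(strict_words)
```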

## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [None]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)
num_spam = sum(emails["spam"])
spam_probability = num_spam/num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [None]:
def train_naive_bayes(emails_data):
    model = {}

    for index, row in emails_data.iterrows():
        words = process_email(row["text"])
        for word in words:
            if word not in model:
                # Start both counts at 1 (Laplace smoothing)
                model[word] = {"spam": 1, "ham": 1}
            # Count the email once, under its own class only
            if row["spam"] == 1:
                model[word]["spam"] += 1
            else:
                model[word]["ham"] += 1
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}

    return model

In [None]:
model = train_naive_bayes(emails)

In [None]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'

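for word in ["lottery", "sale", "meeting"]:
    # .get() avoids a KeyError if a word never appears in the training data
    print(word, model.get(word, "not in model"))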

## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
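
For reference, the quantity this function returns can be written out as follows. This is a sketch of the standard Naive Bayes derivation, assuming the words $w_1, \dots, w_n$ of an email are treated as conditionally independent given the class. The symbols $L_{\text{spam}}$ and $L_{\text{ham}}$ are notation introduced here for the two log scores.

```latex
\begin{aligned}
L_{\text{spam}} &= \log P(\text{spam}) + \sum_{i=1}^{n} \log P(w_i \mid \text{spam}) \\
L_{\text{ham}} &= \log P(\text{ham}) + \sum_{i=1}^{n} \log P(w_i \mid \text{ham}) \\
P(\text{spam} \mid w_1, \dots, w_n) &= \frac{e^{L_{\text{spam}}}}{e^{L_{\text{spam}}} + e^{L_{\text{ham}}}} = \frac{1}{1 + e^{L_{\text{ham}} - L_{\text{spam}}}}
\end{aligned}
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when an email contains many words.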

In [None]:

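import math as m

def predict_naive_bayes(email_text, model, num_spam, num_ham):
    words = process_email(email_text)
    # Work in log space so that multiplying many small probabilities does not underflow
    log_spam = m.log(num_spam / (num_spam + num_ham))
    log_ham = m.log(num_ham / (num_spam + num_ham))
    for word in words:
        if word not in model:
            continue  # Words never seen in training carry no evidence
        spam_count = model[word]["spam"]
        ham_count = model[word]["ham"]
        # The stored counts already include the +1 from training, so no extra +1 here;
        # the +2 in the denominator is the matching smoothing term (word present or absent)
        log_spam += m.log(spam_count / (num_spam + 2))
        log_ham += m.log(ham_count / (num_ham + 2))
    diff = log_ham - log_spam
    if diff > 700:
        return 0.0  # exp() would overflow; the email is overwhelmingly ham
    probability = 1 / (1 + m.exp(diff))
    return probability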

In [None]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
# Score each test email; num_emails and num_spam come from Step 3
num_ham = num_emails - num_spam
for text in test_emails:
    print(f"{predict_naive_bayes(text, model, num_spam, num_ham):.4f}  {text}")

## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?
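
To put a number on the first question, one option is to compare predictions against known labels. The sketch below is a minimal, hedged example: `evaluate` and `toy_predict` are hypothetical names, the 0.5 threshold is an assumption, and the toy texts and labels are made up for illustration.

```python
def evaluate(predict_fn, texts, labels, threshold=0.5):
    """Score a spam predictor on labelled examples (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for text, label in zip(texts, labels):
        # Treat a probability at or above the threshold as a spam prediction
        predicted_spam = predict_fn(text) >= threshold
        if predicted_spam and label == 1:
            tp += 1
        elif predicted_spam and label == 0:
            fp += 1
        elif not predicted_spam and label == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Stand-in predictor for illustration only: flags any text containing "prize"
def toy_predict(text):
    return 0.9 if "prize" in text.lower() else 0.1

toy_texts = ["claim your prize", "meeting at 3pm", "prize draw results", "free money"]
toy_labels = [1, 0, 0, 1]
metrics = evaluate(toy_predict, toy_texts, toy_labels)
print(metrics)
```

With the notebook's own function, the predictor could be passed as `lambda t: predict_naive_bayes(t, model, num_spam, num_ham)`. Scoring on emails that were held out from training gives a more honest estimate than scoring on the training data itself.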

### Notes (if any):