# 📧 Spam Email Detection using Naive Bayes

This notebook shows how to build a **Spam Email Classifier** using **Naive Bayes**, a simple and widely used algorithm for text classification.  
We'll use **scikit-learn** to vectorize the text, train the model, and evaluate its performance.

---

## 🎯 Objectives
- Understand the concept of Naive Bayes for text classification.  
- Learn how to preprocess text using TF-IDF vectorization.  
- Build and evaluate a spam detection model.  

---


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Ignore warnings for clean output
import warnings
warnings.filterwarnings('ignore')

## 📚 Step 1: Load Dataset
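A common benchmark for this task is the **SMS Spam Collection Dataset**, which can be downloaded from the UCI Machine Learning Repository.  
To keep this notebook self-contained, we'll instead create a small ten-message sample dataset directly here. With so few messages, the metrics reported later are illustrative only.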
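
To swap the toy data for the full dataset, something along these lines should work. It is a sketch that assumes the download is a tab-separated text file with one label and one message per line and no header row; the `load_sms_spam` helper is hypothetical, and the demo reads an in-memory sample so it runs without any download.

```python
import csv
import io

import pandas as pd


def load_sms_spam(source):
    """Read a tab-separated label/message file into a DataFrame.

    `source` may be a file path or a file-like object.
    Assumes no header row and exactly two columns: label, message.
    """
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["label", "message"],
        quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
    )


# Demo on an in-memory sample (made-up rows, not taken from the real dataset)
sample_tsv = io.StringIO(
    "ham\tAre we still meeting for lunch?\n"
    'spam\tYou won a "free" prize! Claim now.\n'
)
sample_df = load_sms_spam(sample_tsv)
print(sample_df)
```

With a real file, the call would be `df = load_sms_spam(path)`, where `path` is wherever you saved the extracted download.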


In [None]:
# Sample dataset
data = {
    'label': ['ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham'],
    'message': [
        'Hey, how are you doing today?',
        'You won $1000! Claim your prize now.',
        'Are we still meeting for lunch?',
        'Don’t forget to bring the documents.',
        'Congratulations! You have been selected for a free trip!',
        'Get cheap loans now, limited offer!',
        'Can you call me later?',
        'Let’s catch up this weekend.',
        'Exclusive deal just for you, click here!',
        'Happy Birthday! Have a great day!'
    ]
}

df = pd.DataFrame(data)
df.head()

## 🧹 Step 2: Preprocess Data
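We'll split the dataset into **train** and **test** sets, then convert the text into numerical features using scikit-learn's **`TfidfVectorizer`**. The vectorizer is fitted on the training messages only and then applied to the test messages, so nothing from the test set leaks into the features.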


In [None]:
# Split data (stratify keeps the ham/spam ratio similar in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("Training data shape:", X_train_tfidf.shape)
print("Testing data shape:", X_test_tfidf.shape)

## 🤖 Step 3: Train Naive Bayes Model
We'll use the **Multinomial Naive Bayes** classifier (`MultinomialNB`), which models word-frequency features and is a common baseline for text classification problems.
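
To make the idea concrete, here is a minimal from-scratch sketch of the multinomial Naive Bayes decision rule on raw word counts with add-one (Laplace) smoothing. It is not how scikit-learn implements `MultinomialNB` internally, and the function names and toy data are made up for illustration. Each class is scored as its log prior plus the sum of the log likelihoods of the words in the message, and the class with the highest score wins.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(doc, class_counts, word_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # words never seen in training are skipped
            # Smoothed estimate of P(word | class)
            likelihood = (word_counts[label][token] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


toy_docs = ["win a free prize now", "free cash offer", "lunch at noon", "see you at lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]

model_parts = train_nb(toy_docs, toy_labels)
toy_prediction = predict_nb("free prize offer", *model_parts)
print(toy_prediction)
```

The smoothing term `alpha` keeps a single unseen word from driving a class probability to zero; `MultinomialNB` exposes a parameter of the same name.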


In [None]:
# Initialize and train model
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predict on test data
y_pred = model.predict(X_test_tfidf)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Model Accuracy: {accuracy * 100:.2f}%")

## 📊 Step 4: Evaluate Model Performance
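
The next code cell plots a confusion matrix and prints a classification report. As a reading aid, this sketch uses hand-made labels (not this notebook's results) to show how the matrix cells relate to precision and recall for the `spam` class. Passing `labels` explicitly fixes the row and column order.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only
toy_true = ["ham", "ham", "ham", "spam", "spam", "spam"]
toy_pred = ["ham", "ham", "spam", "spam", "spam", "ham"]

# Rows are actual classes, columns are predicted classes, in the order given by `labels`
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(toy_cm)
print(f"By hand:      precision={precision:.2f}, recall={recall:.2f}")

# The same quantities from scikit-learn's helpers
sk_precision = precision_score(toy_true, toy_pred, pos_label="spam")
sk_recall = recall_score(toy_true, toy_pred, pos_label="spam")
print(f"scikit-learn: precision={sk_precision:.2f}, recall={sk_recall:.2f}")
```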

In [None]:
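# Confusion matrix and classification report
labels = ['ham', 'spam']
# Pass labels explicitly so the matrix is always 2x2 and matches the tick labels
cm = confusion_matrix(y_test, y_pred, labels=labels)
sns.heatmap(cm, annot=True, cmap='Blues', fmt='d', xticklabels=labels, yticklabels=labels)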
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

## 📨 Step 5: Test the Model with New Messages
Let's check predictions for new, unseen messages.
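
Besides the hard label, `MultinomialNB` can report class probabilities through `predict_proba`, which helps when only high-confidence spam should be flagged. The sketch below trains a separate toy pipeline so that it runs on its own; the training sentences, the variable names, and the 0.8 threshold are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_messages = [
    "Win a free prize now",
    "Claim your free cash offer",
    "Are we meeting for lunch",
    "Call me later today",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# Bundling both steps means raw text goes in and predictions come out
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_messages, toy_labels)

unseen = ["Free prize offer", "Lunch later today"]
probabilities = toy_pipeline.predict_proba(unseen)

# Columns follow the order of classes_, so look up the spam column by name
spam_column = list(toy_pipeline.classes_).index("spam")

threshold = 0.8  # hypothetical cut-off; tune it on validation data
for text, row in zip(unseen, probabilities):
    spam_probability = row[spam_column]
    decision = "flag as spam" if spam_probability >= threshold else "deliver"
    print(f"{text!r}: P(spam)={spam_probability:.2f} -> {decision}")
```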


In [None]:
# Custom test examples
new_messages = [
    "You have won a free gift card worth $500! Click to claim.",
    "Are you coming to the meeting tomorrow?",
    "Congratulations, you've been selected for an exclusive offer!",
    "Let's go for dinner tonight."
]

# Transform and predict
new_tfidf = vectorizer.transform(new_messages)
predictions = model.predict(new_tfidf)

for msg, pred in zip(new_messages, predictions):
    print(f"Message: {msg}\nPrediction: {pred}\n")

---
### ✅ Summary
In this notebook, we:
- Learned how to preprocess text data using TF-IDF.  
- Built and trained a **Naive Bayes** model for spam detection.  
- Evaluated performance using accuracy, a confusion matrix, and a classification report.  
- Tested the model with new unseen messages.

---

**Next Step:** Try using a larger dataset (like the UCI SMS Spam Collection) and experiment with other models such as **Logistic Regression** or **Support Vector Machines**.
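
One possible way to run that comparison is to wrap each classifier in a pipeline with the same TF-IDF step and score them with stratified cross-validation. The sketch below does this on a handful of made-up messages, so the printed scores say nothing about real-world performance; the data, the `candidates` dictionary, and the two-fold split are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up messages for illustration only
toy_messages = [
    "Win a free prize now",
    "Claim your free cash offer",
    "Cheap loans limited offer",
    "Exclusive deal click here now",
    "Are we meeting for lunch",
    "Call me later today",
    "See you at the meeting",
    "Happy birthday have a great day",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

# Stratified folds keep both classes present in every train/test split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

mean_scores = {}
for name, classifier in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refitted on each training fold
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(pipeline, toy_messages, toy_labels, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {mean_scores[name]:.2f}")
```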
