# 📊 SMS Spam Detection Project
This notebook demonstrates a complete pipeline to classify SMS messages as spam or ham using both Machine Learning and Deep Learning models.

## 1️⃣ Dataset Overview & Preprocessing
We begin by loading the labeled SMS Spam Collection dataset, mapping the `ham`/`spam` labels to 0/1, and splitting the messages into training (80%) and test (20%) sets.
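Spam corpora are usually imbalanced, so it can help to check the class distribution before splitting. The sketch below uses a small made-up DataFrame (`toy_df` and its rows are illustrative, not taken from the dataset) and passes `stratify=` to `train_test_split` so that both splits keep the same spam ratio. Whether stratification is wanted in this project is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham and 2 spam messages (not from the real dataset).
toy_df = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 2,
    "text": [f"message {i}" for i in range(10)],
})
toy_df["label_num"] = toy_df["label"].map({"ham": 0, "spam": 1})

# Share of each class; a value far from 0.5 signals imbalance.
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)

# stratify keeps the ham/spam ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["text"], toy_df["label_num"],
    test_size=0.5, random_state=42, stratify=toy_df["label_num"],
)
print("spam share in train:", y_tr.mean())
print("spam share in test: ", y_te.mean())
```

Without `stratify`, a small or unlucky split could leave very few spam messages in the test set, which makes precision and recall for the spam class unstable.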

## 2️⃣ Machine Learning Models
We train three tree-ensemble classifiers (Random Forest, XGBoost, LightGBM) on TF-IDF features extracted from the message text.

In [None]:
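# Data Loading and Preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

raw = pd.read_csv('../data/SMSSpamCollection.csv', encoding='latin-1')

# Selecting [['label', 'text']] directly raised
#   KeyError: "['label'] not in index"
# so the file has a 'text' column but names the label column differently.
# ASSUMPTION: the label column is the one whose values are all 'ham' or 'spam'.
label_col = next(
    c for c in raw.columns
    if raw[c].notna().any() and set(raw[c].dropna().unique()) <= {'ham', 'spam'}
)
df = raw.rename(columns={label_col: 'label'})[['label', 'text']].copy()
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

X = df['text']
y = df['label_num']
# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit TF-IDF on the training split only, then reuse it for the test split
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)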

## 3️⃣ Deep Learning Models
We train three neural models (ANN, LSTM, CNN) on padded sequences of word indices, which preserve word order, unlike TF-IDF features.
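The neural models take fixed-length integer sequences rather than raw text. The pure-Python sketch below mimics, in simplified form, what Keras's `Tokenizer` and `pad_sequences` do with their default settings: words are indexed by descending frequency starting at 1, words outside the `num_words` budget are dropped, and sequences are left-padded with zeros or truncated from the front. The helper names `build_index`, `encode`, and `pad` are hypothetical, and the sketch skips Keras's punctuation filtering.

```python
from collections import Counter

def build_index(texts):
    """Map each word to a rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i for i, (w, _) in enumerate(ranked, start=1)}

def encode(text, index, num_words):
    """Keep only known words whose rank is below num_words."""
    return [index[w] for w in text.lower().split()
            if w in index and index[w] < num_words]

def pad(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["free prize free entry", "call me later"]
index = build_index(texts)
encoded = encode("free call later", index, num_words=5)
print(encoded)                  # [1, 4]  ("later" has rank 6, outside the budget)
print(pad(encoded, 4))          # [0, 0, 1, 4]
print(pad([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
```

With `max_words = 1000` and `max_len = 150`, as in this notebook, every message becomes a vector of 150 indices drawn from the most frequent words in the training split.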

In [None]:
# Machine Learning Models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report

models_ml = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=42),
    "LightGBM": LGBMClassifier(random_state=42)
}

for name, model in models_ml.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(f"\nResults for {name}:")
    print(classification_report(y_test, preds))

In [None]:
# Deep Learning Models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Conv1D, MaxPooling1D, Flatten, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 1000
max_len = 150
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len)

# ANN Model
model_ann = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    Flatten(),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_ann.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_ann.fit(X_train_seq, y_train, epochs=3, validation_split=0.1)
print("\nANN Evaluation:")
model_ann.evaluate(X_test_seq, y_test)

# LSTM Model
model_lstm = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.fit(X_train_seq, y_train, epochs=3, validation_split=0.1)
print("\nLSTM Evaluation:")
model_lstm.evaluate(X_test_seq, y_test)

# CNN Model
model_cnn = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    Conv1D(32, 3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(1, activation='sigmoid')
])

model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn.fit(X_train_seq, y_train, epochs=3, validation_split=0.1)
print("\nCNN Evaluation:")
model_cnn.evaluate(X_test_seq, y_test)

In [None]:
# 📘 Final Testing on Professor's Dataset
import pandas as pd

# Load professor's dataset
test_df = pd.read_csv('../data/spam_texts.csv', encoding='latin-1')

# Vectorize using the same trained TF-IDF vectorizer
test_texts = test_df['text']
test_vec = vectorizer.transform(test_texts)

# Predict using trained Random Forest model
test_preds = models_ml['Random Forest'].predict(test_vec)

# Add predictions to dataframe
test_df['predicted_label'] = test_preds
test_df['predicted_label'] = test_df['predicted_label'].map({0: 'ham', 1: 'spam'})

# Show results
print(test_df[['text', 'predicted_label']])

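# Optional: export results for use in your report (creates the folder if missing)
from pathlib import Path
Path('../report').mkdir(parents=True, exist_ok=True)
test_df[['text', 'predicted_label']].to_csv('../report/test_predictions.csv', index=False)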


## 4️⃣ Final Testing on Unlabeled Dataset
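Here, we apply the trained Random Forest model, together with the TF-IDF vectorizer fitted on the training split, to a separate unlabeled dataset provided for testing. The numeric predictions are mapped back to `ham`/`spam` so they can be inspected.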
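One way to make this step less error-prone is to bundle the vectorizer and the classifier into a single scikit-learn `Pipeline`, so new text cannot be transformed with a different vocabulary than the one the model was trained on. The sketch below uses made-up messages and hypothetical names (`spam_pipeline`, `new_df`); it is an alternative arrangement, not the code this notebook runs.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up training data for illustration only (1 = spam, 0 = ham).
train_texts = [
    "win a free prize today", "free entry claim your prize",
    "urgent cash prize waiting", "lunch at noon tomorrow",
    "can you call me later", "see you at the meeting",
]
train_labels = [1, 1, 1, 0, 0, 0]

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
spam_pipeline.fit(train_texts, train_labels)

# Raw text goes in; the pipeline vectorizes and predicts in one call.
new_df = pd.DataFrame({"text": ["free prize waiting", "see you at lunch"]})
preds = pd.Series(spam_pipeline.predict(new_df["text"]), index=new_df.index)
new_df["predicted_label"] = preds.map({0: "ham", 1: "spam"})

# Probability of the spam class; column order follows spam_pipeline.classes_.
new_df["spam_probability"] = spam_pipeline.predict_proba(new_df["text"])[:, 1]
print(new_df)
```

The probability column can be useful in a report because it shows how confident the model is in each prediction, not only the predicted label.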