### **Spam Detection Using a Passive-Aggressive Classifier**


---


1.   Load a labeled dataset of SMS messages
2.   Vectorize text (TF-IDF)
3.   Train a Passive-Aggressive classifier
4.   Test live predictions

STEPS:
1. Data collection: Start with a labeled dataset.
2. Feature extraction: Convert the raw text into numeric features, which the model uses to make predictions.
3. Model training: The ML algorithm uses the labeled data to learn how to map features to the correct class.
4. Model evaluation: Once the model is trained, it is tested on new, unseen data to measure how accurately it classifies items. Useful metrics include accuracy, precision, recall, F1 score, log loss, the ROC curve and Area Under the Curve (AUC), and the confusion matrix.
5. Prediction: The model can now predict the class of new data based on the features it has learned.

In [21]:
import pandas as pd

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [23]:
# Load dataset

url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep="\t", header=None, names=["label", "message"])

# Convert labels to binary
df['label'] = df['label'].map({'ham':0, 'spam':1})

df.head()

|   | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### Train-Test Split

We split the data so that the model can be tested on data it hasn't seen before. train_test_split(...) is a function that divides your dataset into two parts:
- Training set → data the model learns from
- Testing set → data used later to check how well the model performs on unseen data

- X_train: Features for training (text/inputs used to train the model)
- X_test: Features for testing (unseen inputs for evaluation)
- y_train: Labels for training (correct outputs for training samples)
- y_test: Labels for testing (correct outputs for test samples)
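As a small illustration of what the split produces, the sketch below runs train_test_split on a made-up list of 20 messages. The toy data and the `toy_*` names are assumptions for illustration, not part of the notebook. It also passes the optional `stratify` argument, which the notebook's own split does not use; it keeps the spam/ham ratio the same in both parts, which can matter when one class is much rarer than the other.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 16 ham (0) and 4 spam (1) messages
toy_messages = [f"message {i}" for i in range(20)]
toy_labels = [0] * 16 + [1] * 4

# test_size=0.25 holds out 25% of the rows; random_state makes the shuffle repeatable;
# stratify=toy_labels preserves the class ratio in both parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))   # rows in the training and test parts
print(sum(toy_y_train), sum(toy_y_test))   # spam messages in each part
```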



In [24]:
# train/test split

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

### Vectorization

Vectorization is needed as ML models can only work with numbers and not raw words. It turns sentences into vectors (arrays of numbers) that represent the meaning or presence of words.
Common methods to vectorize text:



1.   Bag-of-Words: Counts how many times each word appears
2.   TF-IDF: Weights each word by how often it appears in a message and how rare it is across all messages
3.   Word2Vec: Words → dense meaning vectors (captures semantics)
4.   BERT embeddings: Deep contextual meaning
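To make the first two methods concrete, here is a minimal sketch comparing Bag-of-Words and TF-IDF on three made-up messages (the `corpus` below is an assumption for illustration, not taken from the dataset). It uses scikit-learn's CountVectorizer and TfidfVectorizer with default settings, so the exact weights depend on those defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical messages, not from the dataset)
corpus = ["free prize now", "call me now", "free free call"]

# Bag-of-Words: raw counts per word
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(sorted(bow.vocabulary_))      # learned vocabulary
print(counts.toarray()[2])          # counts for the third message ("free" appears twice)

# TF-IDF: counts reweighted so that words found in fewer messages weigh more
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
vocab = tfidf.vocabulary_           # maps word -> column index
print(tfidf.idf_[vocab["prize"]])   # "prize" occurs in 1 message
print(tfidf.idf_[vocab["free"]])    # "free" occurs in 2 messages, so its IDF is lower

# Words never seen during fit are ignored at transform time
unseen = tfidf.transform(["hello world"])
print(unseen.nnz)                   # number of non-zero entries in the result
```

The last step is why the notebook calls fit_transform on the training text but only transform on the test text: the vocabulary is learned from the training data alone.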



In [26]:
# vectorize text

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### Creating a Machine Learning Model

- PassiveAggressiveClassifier is the learning algorithm
- max_iter=1000: the model can make up to 1000 passes over the training data to learn patterns

- X_train_vec → training features (inputs)
- y_train → training labels (correct outputs)
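For intuition about the name, the sketch below implements one step of the textbook Passive-Aggressive update (the PA-I variant with hinge loss). This is an assumption-laden illustration, not scikit-learn's actual code: the function name `pa_update` is hypothetical, labels are assumed to be -1 or +1 rather than 0 or 1, there is no intercept, and the input vector is assumed to be non-zero. The model stays passive when an example is already classified with enough margin, and updates aggressively (just enough to fix the mistake, capped by C) otherwise.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step for a single example; y must be -1 or +1 (sketch only)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                               # passive: margin is large enough, no change
    tau = min(C, loss / np.dot(x, x))          # aggressive: step size, capped by C
    return w + tau * y * x

w0 = np.zeros(2)
x = np.array([1.0, 0.0])

w1 = pa_update(w0, x, +1)   # loss is 1, so the weights move toward x
w2 = pa_update(w1, x, +1)   # margin is now 1, so nothing changes
print(w1, w2)
```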

In [29]:
# model = ... : Choosing a learning method (eg. using flashcards to study)
# model.fit(...) : Actually using the flashcards (learning from data)

model = PassiveAggressiveClassifier(max_iter=1000)
model.fit(X_train_vec, y_train)

# After training, the model is ready to go and will make predictions on new data

y_pred = model.predict(X_test_vec)

### Accuracy
accuracy_score(y_test, y_pred) compares the true labels (y_test) with the model's predicted labels (y_pred)

It calculates how many predictions were correct out of the total

- Accuracy = (Correct Predictions / Total Predictions)

classification_report(...) gives a detailed performance summary for each class

- Precision: Out of predicted positives, how many were correct
- Recall: Out of actual positives, how many the model found
- F1-score: Balance of precision & recall
- Support: Number of samples per class

accuracy_score gives overall correctness, while classification_report gives detailed metrics per class

In [32]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nReport:\n", classification_report(y_test, y_pred))
print("\n🔎 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9901345291479821

Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       0.99      0.93      0.96       149

    accuracy                           0.99      1115
   macro avg       0.99      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115


🔎 Confusion Matrix:
 [[965   1]
 [ 10 139]]
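The reported metrics can be recomputed by hand from the confusion matrix above. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0 (ham) and 1 (spam) it reads [[TN, FP], [FN, TP]]. The variable names below are chosen for illustration.

```python
# Values copied from the confusion matrix printed above
tn, fp = 965, 1     # true ham:  predicted ham, predicted spam
fn, tp = 10, 139    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct / total
precision = tp / (tp + fp)                           # predicted spam that really is spam
recall = tp / (tp + fn)                              # actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of 0.93 for the spam class reflects the 10 spam messages that were labeled as ham, which the overall accuracy of 0.99 largely hides because ham messages dominate the test set.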


In [37]:
# try custom input

test_messages = [
    "Congratulations! You've won a free iPhone! Click here to claim.",
    "Hey, are we still meeting for lunch?",
]

print("\nCustom Predictions:")
for msg, pred in zip(test_messages, model.predict(vectorizer.transform(test_messages))):
    print(f"{msg} --> {'SPAM' if pred == 1 else 'HAM'}")


Custom Predictions:
Congratulations! You've won a free iPhone! Click here to claim. --> SPAM
Hey, are we still meeting for lunch? --> HAM
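
For custom inputs like these, the vectorizer and the classifier always have to be applied together. One common way to package them is scikit-learn's make_pipeline. The sketch below is an alternative arrangement, not the notebook's own code: it trains on a tiny made-up set of messages (the `toy_*` names and texts are assumptions), so its predictions are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (hypothetical; far too small for a real model)
toy_texts = [
    "win a free prize now",
    "free entry claim your prize",
    "urgent claim free cash",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you arrive",
]
toy_labels = [1, 1, 1, 0, 0, 0]   # 1 = spam, 0 = ham

# The pipeline vectorizes and classifies in one object
spam_pipeline = make_pipeline(
    TfidfVectorizer(),
    PassiveAggressiveClassifier(max_iter=1000, random_state=0),
)
spam_pipeline.fit(toy_texts, toy_labels)

# Raw strings go straight in; no separate vectorizer.transform call is needed
new_messages = ["claim your free prize", "lunch tonight"]
preds = spam_pipeline.predict(new_messages)
for msg, pred in zip(new_messages, preds):
    print(f"{msg} --> {'SPAM' if pred == 1 else 'HAM'}")
```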
