**Programmer:** python_scripts (Abhijith Warrier)

**PYTHON SCRIPT TO _BUILD A SIMPLE SPAM DETECTOR USING CountVectorizer + MultinomialNB_.** 🐍✉️🔍

This script demonstrates a classic NLP baseline: convert text messages to bag-of-words features with CountVectorizer, train a Multinomial Naive Bayes classifier, evaluate it, and test on a few new messages.

### 📦 Import Required Libraries

We'll use scikit-learn for vectorization, modeling, and evaluation.

In [1]:
# Text features, model, split, and metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

### 🧰 Create a Tiny Labeled Dataset (Demo)

A small, illustrative SMS-like dataset. For real use, swap in a larger labeled corpus.

In [2]:
# Sample messages (X) and labels (y): 'spam' or 'ham'
X = [
    "Win a FREE iPhone now!!! Click here",
    "Your OTP is 483920. Do not share it.",
    "Limited time offer! Claim your reward today",
    "Are we still on for lunch tomorrow?",
    "Meeting rescheduled to 3pm. See you!",
    "You won a lottery. Send your bank details",
    "Project update: pushed commits to main",
    "Get cheap meds without prescription. Order now",
]
y = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]

### ✂️ Train/Test Split

Hold out a test set to measure generalization. With only 8 messages, a 25% split leaves 2 for testing, so treat the scores as illustrative.

In [3]:
# Hold out 25% (2 of 8 messages); stratify=y keeps both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
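
As a quick check on the split above, the following sketch counts the labels on each side. The message texts are stand-in placeholders (only the labels drive stratification); the label list and the split arguments mirror the cell above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder texts (assumption): the label balance is what matters here
X = [f"message {i}" for i in range(8)]
y = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Count how many examples of each class landed in each split
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("Train:", dict(train_counts))
print("Test: ", dict(test_counts))
```

With 4 spam and 4 ham messages, stratification should leave 3 of each class for training and 1 of each for testing.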

### 🧱 Build a Pipeline: CountVectorizer → MultinomialNB

Vectorize text to token counts, then train a Naive Bayes classifier.

In [4]:
# Combine vectorizer and classifier for a clean workflow
model = Pipeline([
    ("vec", CountVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB())
])

### 🚀 Train the Model

Fit the pipeline on training data.

In [5]:
# Learn vocabulary + train NB on word counts
model.fit(X_train, y_train)


### 📊 Evaluate the Model

Check overall accuracy and the per-class precision, recall, and F1 report.

In [6]:
# Predictions and metrics
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
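
Because a two-message test set gives a very coarse accuracy estimate, stratified cross-validation is one possible complement. The sketch below is not part of the original script: the choice of 4 folds and the names `cv` and `scores` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X = [
    "Win a FREE iPhone now!!! Click here",
    "Your OTP is 483920. Do not share it.",
    "Limited time offer! Claim your reward today",
    "Are we still on for lunch tomorrow?",
    "Meeting rescheduled to 3pm. See you!",
    "You won a lottery. Send your bank details",
    "Project update: pushed commits to main",
    "Get cheap meds without prescription. Order now",
]
y = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]

model = Pipeline([
    ("vec", CountVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])

# Assumption: 4 folds, so each fold tests on one spam and one ham message
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", round(scores.mean(), 3))
```

Passing the whole pipeline to `cross_val_score` refits the vectorizer inside each fold, so the vocabulary is learned only from that fold's training messages.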

### 🔮 Try on New Messages

Quick sanity checks on unseen texts.

In [7]:
# A few example messages to classify
new_msgs = [
    "Exclusive deal just for you! Click to claim your prize",
    "Can we move our call to 5 pm?",
    "Your package is out for delivery"
]
print("\nPredictions on new messages:")
for m, p in zip(new_msgs, model.predict(new_msgs)):
    print(f"- {m}  ->  {p}")


Predictions on new messages:
- Exclusive deal just for you! Click to claim your prize  ->  spam
- Can we move our call to 5 pm?  ->  ham
- Your package is out for delivery  ->  ham
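
Hard labels hide how confident the model is, so it can help to print class probabilities as well. This sketch rebuilds the same dataset, split, and pipeline as above, then calls `predict_proba` on the new messages. The names `proba` and `classes` are introduced only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X = [
    "Win a FREE iPhone now!!! Click here",
    "Your OTP is 483920. Do not share it.",
    "Limited time offer! Claim your reward today",
    "Are we still on for lunch tomorrow?",
    "Meeting rescheduled to 3pm. See you!",
    "You won a lottery. Send your bank details",
    "Project update: pushed commits to main",
    "Get cheap meds without prescription. Order now",
]
y = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]

# Same split and pipeline as the cells above
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = Pipeline([
    ("vec", CountVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
]).fit(X_train, y_train)

new_msgs = [
    "Exclusive deal just for you! Click to claim your prize",
    "Can we move our call to 5 pm?",
    "Your package is out for delivery",
]

# One row per message, one column per class (ordered as in model.classes_)
proba = model.predict_proba(new_msgs)
classes = list(model.classes_)
for msg, row in zip(new_msgs, proba):
    scores = ", ".join(f"{c}={p:.2f}" for c, p in zip(classes, row))
    print(f"- {msg}  ->  {scores}")
```

A message whose tokens are all stop words or absent from the training vocabulary becomes an all-zero count vector, so the model falls back on the class priors. A "ham" label in that case reflects the prior and tie-breaking rather than evidence from the text.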
