# Lesson 6: Naive Bayes for Text Classification

In this notebook, we'll build a Naive Bayes classifier that labels a text message as "spam" or "ham" (not spam).

## 1. Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## 2. Load and Prepare the Data

We'll create a small dataset of text messages. In a real-world scenario, you would load this from a file (e.g., a CSV).

In [None]:
data = {
    'message': ['call now for free prize', 'buy one get one free', 'hey how are you',
                'meet me at the park', 'urgent call for money', 'how was your day'],
    'label': ['spam', 'spam', 'ham', 'ham', 'spam', 'ham'],
}
df = pd.DataFrame(data)

print(df)
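
For reference, loading the same kind of data from a CSV might look like the sketch below. The column names `label` and `message` and the inline CSV text are assumptions made for illustration; a real file would be passed to `pd.read_csv` by path instead of through `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """label,message
spam,call now for free prize
ham,hey how are you
spam,urgent call for money
ham,how was your day
"""

# pd.read_csv accepts a path or any file-like object; StringIO keeps this runnable.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)
print(df_csv['label'].value_counts())
```

Checking the shape and the label counts right after loading is a cheap way to catch a wrong delimiter or a missing header row.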

## 3. Split Data and Vectorize Text

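Machine learning models operate on numeric features, not raw text. We'll use `CountVectorizer` to turn each message into a row of token counts, with one column per word in the vocabulary. The vocabulary is learned from the training messages only, so the test set stays unseen.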

In [None]:
# Map labels to numbers
df['label_num'] = df['label'].map({'ham':0, 'spam':1})

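# Split data into training and testing sets.
# stratify keeps the ham/spam ratio the same in both splits, which matters with so few rows.
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label_num'],
    test_size=0.33, random_state=42, stratify=df['label_num'])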

# Learn the vocabulary from the training messages, then reuse it to encode the test messages
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

## 4. Create and Train the Naive Bayes Model

In [None]:
# MultinomialNB models word counts; its default alpha=1.0 applies Laplace (add-one) smoothing
model = MultinomialNB()
model.fit(X_train_counts, y_train)
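
As a rough sketch of what `fit` estimates, the block below recomputes the smoothed per-class word probabilities by hand and compares them with the fitted model. It uses its own four-message sample and its own names (`texts`, `labels`, `vec`, `nb`, `manual`), all chosen for illustration, so it does not depend on the split above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['call now for free prize', 'buy one get one free',
         'hey how are you', 'meet me at the park']
labels = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(texts).toarray()

nb = MultinomialNB(alpha=1.0)
nb.fit(X, labels)

# Laplace-smoothed estimate for each class c:
#   P(word | c) = (count(word, c) + alpha) / (total_count(c) + alpha * V)
alpha = 1.0
V = X.shape[1]  # vocabulary size
manual = []
for c in nb.classes_:
    word_counts = X[labels == c].sum(axis=0)
    manual.append(np.log((word_counts + alpha) / (word_counts.sum() + alpha * V)))
manual = np.array(manual)

# The hand-computed log-probabilities should match the fitted model.
print(np.allclose(manual, nb.feature_log_prob_))

# Class priors are the class frequencies in the training labels.
print(np.exp(nb.class_log_prior_))
```

Smoothing is what keeps a word that never appeared in one class from forcing that class's probability to zero.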

## 5. Evaluate the Model

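Let's measure accuracy on the held-out test messages. With only two messages in the test set, the score can only be 0.00, 0.50, or 1.00, so treat it as a smoke test rather than a reliable estimate.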

In [None]:
predictions = model.predict(X_test_counts)
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy:.2f}")
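
Accuracy alone hides which kind of mistake the model makes. A confusion matrix separates spam that slipped through from ham that was wrongly flagged. The sketch below uses made-up labels (`y_true`, `y_pred`) purely to show the mechanics; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only: 0 = ham, 1 = spam.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"ham flagged as spam (false positives): {fp}")
print(f"spam missed (false negatives): {fn}")
print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
```

For a spam filter, false positives (real messages sent to the spam folder) are often the more costly error, which is why it helps to look at them separately.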

## 6. Make Predictions on New Messages

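Now let's try the model on messages it has never seen. Words that are not in the training vocabulary are ignored when the messages are vectorized, so only the known words influence the prediction.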

In [None]:
new_messages = [
    'free entry to the contest',
    'Can we meet tomorrow?'
]
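# Reuse the fitted vectorizer (transform, not fit_transform) so the columns match the training data
new_messages_counts = vectorizer.transform(new_messages)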
new_predictions = model.predict(new_messages_counts)

for msg, pred in zip(new_messages, new_predictions):
    label = 'spam' if pred == 1 else 'ham'
    print(f'"{msg}" -> {label}')
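
One way to avoid carrying the vectorizer and the model around separately is to bundle them with `make_pipeline`. This is a sketch of an alternative arrangement, not the approach used above; it trains on all six messages, and the name `spam_pipeline` is illustrative. It also prints class probabilities with `predict_proba` instead of hard labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ['call now for free prize', 'buy one get one free', 'hey how are you',
            'meet me at the park', 'urgent call for money', 'how was your day']
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# The pipeline vectorizes raw text and then classifies it in a single call.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(messages, labels)

new_messages = ['free entry to the contest', 'Can we meet tomorrow?']

# Each row holds [P(ham), P(spam)], following the order of spam_pipeline.classes_.
probs = spam_pipeline.predict_proba(new_messages)
for msg, (p_ham, p_spam) in zip(new_messages, probs):
    print(f'"{msg}" -> P(spam) = {p_spam:.2f}')
```

With so little training data the probabilities sit close to 0.5, a reminder that the labels above rest on one or two known words per message.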