# Logistic Regression for Email Spam Classification
In this notebook, we will use a Logistic Regression model to classify emails as spam or not spam. We will work with a small sample dataset and go through each step of the machine learning process.

## Step 1: Import Required Libraries
We need the following libraries:
- `pandas` for data handling
- `numpy` for numerical operations
- `scikit-learn` to vectorize the text, split the data, and build and evaluate the logistic regression model

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

print("Libraries imported successfully!")

## Step 2: Load and Explore the Dataset
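We load a CSV file that contains example emails and their labels. The email text is in the `Email` column and the label is in the `Spam` column: `1` for spam, `0` for not spam. Calling `df.head()` shows the first five rows.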

In [None]:
# Load the dataset
df = pd.read_csv('data/spam_email_dataset.csv')
df.head()
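
Beyond `df.head()`, a few quick checks help confirm the data looks as expected. The sketch below uses a small hypothetical DataFrame (`sample_df`, invented for illustration) with the same `Email` and `Spam` column names, so it runs without the CSV file. The same calls should work on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the CSV, using the same column names as the notebook
sample_df = pd.DataFrame({
    "Email": [
        "Win a free prize now",
        "Meeting moved to 3pm",
        "Claim your free reward",
        "Lunch tomorrow?",
    ],
    "Spam": [1, 0, 1, 0],
})

# How many rows and columns are there?
print(sample_df.shape)

# How balanced are the two classes?
class_counts = sample_df["Spam"].value_counts()
print(class_counts)

# Are any emails or labels missing?
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

A strongly imbalanced label column would be a reason to read the accuracy score in Step 6 with caution.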

## Step 3: Convert Text to Numeric Features
Machine learning models can't work directly with text, so we use `CountVectorizer` to convert each email into a vector of word counts.
We also split the data into a training set and a test set.

In [None]:
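# Split the raw text first so the vectorizer never sees the test emails
emails_train, emails_test, y_train, y_test = train_test_split(
    df['Email'], df['Spam'], test_size=0.2, random_state=42
)

# Learn the vocabulary from the training emails only, then reuse it for the test set
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails_train)
X_test = vectorizer.transform(emails_test)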

## Step 4: Train the Logistic Regression Model
We create a Logistic Regression model and train it using the `.fit()` method, which learns one weight per vocabulary word, plus an intercept, from the training data.

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

# Confirm that training finished
print("✅ Model trained successfully!")
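
As a rough illustration of what a fitted binary logistic regression computes, the sketch below rebuilds the predicted probabilities by hand: a linear score (word counts times weights, plus the intercept) passed through the sigmoid function. The toy emails and the `toy_` names are assumptions made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only
toy_emails = ["free prize now", "win free cash", "meeting at noon", "see you at lunch"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_emails)
toy_model = LogisticRegression().fit(toy_X, toy_labels)

# Linear score: word counts times weights, plus the intercept
scores = toy_X.toarray() @ toy_model.coef_[0] + toy_model.intercept_[0]

# The sigmoid squashes each score into a probability of class 1 (spam)
manual_probs = 1.0 / (1.0 + np.exp(-scores))

# Compare with scikit-learn's own probabilities for class 1
sklearn_probs = toy_model.predict_proba(toy_X)[:, 1]
print(np.allclose(manual_probs, sklearn_probs))
```

A score above zero corresponds to a spam probability above 0.5, which is the threshold `predict()` uses.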

## Step 5: Understand What the Model Learned

Now that the model is trained, we can inspect it to understand how it makes decisions. We will extract two key pieces of information:

- **Vocabulary**: The list of all unique words the vectorizer extracted from the emails. Each word is one input feature of the model.
- **Coefficients (Weights)**: The weight the model has assigned to each word. A large positive weight means the word pushes the prediction toward "Spam," while a large negative weight pushes it toward "Not Spam". Weights close to zero have little influence.

By looking at the words with the highest and lowest weights, we can see which terms the model relies on most for its classification.

In [None]:
# Get the vocabulary and the learned weights from the model
vocabulary = vectorizer.get_feature_names_out()
weights = model.coef_[0]

# Create a DataFrame to view the words and their weights
weights_df = pd.DataFrame({'Word': vocabulary, 'Weight': weights})

# Display the words that most strongly predict SPAM (highest weights)
print("--- Top Words Predicting SPAM ---")
print(weights_df.sort_values(by='Weight', ascending=False).head())

print("\n--- Top Words Predicting NOT SPAM ---")
print(weights_df.sort_values(by='Weight', ascending=True).head())
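
One way to read a weight is as an odds ratio: in logistic regression, each additional occurrence of a word multiplies the odds of spam by `exp(weight)`, holding the other words fixed. The sketch below uses hypothetical weights chosen for illustration, not values from the trained model. The same transformation could be applied to the `Weight` column of `weights_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for illustration only, not values from the trained model
example_weights = pd.DataFrame({
    "Word": ["free", "meeting"],
    "Weight": [0.7, -0.4],
})

# exp(weight) is the factor by which the spam odds change per extra occurrence
example_weights["OddsRatio"] = np.exp(example_weights["Weight"])
print(example_weights)
```

An odds ratio above 1 raises the odds of spam and one below 1 lowers them. Because `LogisticRegression` applies regularization by default, the weights are shrunk toward zero, so they are best read as relative indicators rather than exact effect sizes.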

## Step 6: Evaluate the Model
We use the model to predict labels for the test set. Then, we evaluate the results using accuracy and a classification report.

In [None]:
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
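
The precision and recall values in the classification report come from the counts in a confusion matrix. The following sketch derives them by hand on hypothetical labels and predictions (`toy_true` and `toy_pred` are invented for illustration), then checks the result against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1, 0, 0]

# Rows are true classes, columns are predicted classes (order: 0, 1)
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 2 1

# Precision: of the emails flagged as spam, how many really were spam?
manual_precision = tp / (tp + fp)

# Recall: of the real spam emails, how many did the model catch?
manual_recall = tp / (tp + fn)

print(manual_precision, precision_score(toy_true, toy_pred))
print(manual_recall, recall_score(toy_true, toy_pred))
```

Here accuracy is 5/8, yet the model catches only one of three spam emails, which is why the report is worth reading alongside the accuracy score.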

## Step 7: Try It Yourself – Classify a New Email
Now that the model is trained, you can enter a new email subject below and let the model predict if it is spam or not.

This is a simple way to interact with the model and understand how it makes decisions based on the words in the message.

In [None]:
# Enter a new email subject
user_input = input("Enter the subject of an email: ")

# Convert the input to the same vector format used during training
user_vector = vectorizer.transform([user_input])

# Predict and show result
prediction = model.predict(user_vector)[0]
# The predicted class is the one with the highest probability
prob = model.predict_proba(user_vector)[0].max()
label = 'Spam' if prediction == 1 else 'Not Spam'
print(f"Prediction: {label} (Probability: {prob:.2f})")
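
For repeated experiments without typing into `input()`, the vectorizer and model can be bundled into a single scikit-learn pipeline and wrapped in a small function. This is a minimal sketch under stated assumptions: the toy training data, `spam_pipeline`, and `classify_email` are hypothetical names introduced here and are not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
toy_emails = ["free prize now", "win free cash", "meeting at noon", "see you at lunch"]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then applies the classifier
spam_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
spam_pipeline.fit(toy_emails, toy_labels)


def classify_email(text):
    """Return (label, probability of that label) for one piece of text."""
    probs = spam_pipeline.predict_proba([text])[0]
    best = probs.argmax()
    label = "Spam" if spam_pipeline.classes_[best] == 1 else "Not Spam"
    return label, float(probs[best])


print(classify_email("free cash prize"))
print(classify_email("lunch meeting at noon"))
```

Words that never appeared in the training emails are ignored by the vectorizer, so a message made up entirely of unseen words is classified from the intercept alone.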

## Conclusion
- We loaded and explored a dataset of labeled emails.
- We converted text into numeric format using `CountVectorizer`.
- We trained a Logistic Regression model to detect spam.
- We inspected the learned weights to see which words drive its predictions.
- We evaluated its performance with test data.

This basic example demonstrates the core workflow for applying machine learning to a text classification task. A real-world spam filter would need a much larger dataset than the small sample used here.