# Spam Detection with Logistic Regression

In this notebook, we'll train a Logistic Regression model to classify emails as "spam" or "ham" based on their content.

We will:
1. Load and preprocess the dataset.
2. Split the data into training and test sets.
3. Convert the text into numerical features using `TfidfVectorizer`.
4. Train a Logistic Regression model.
5. Evaluate the model's performance.

## Step 1: Load and explore the dataset

We first load the CSV file containing 5000 emails labeled as either "spam" or "ham". Let's inspect the first few rows to understand the structure.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("data/spam_ham_dataset.csv")  # Change to the final CSV file name
df.head(10)

## Step 2: Preprocess labels

We convert the textual labels into numerical values: 0 for ham, 1 for spam. We also check for missing values that could affect training.

In [None]:
# Map 'ham' to 0 and 'spam' to 1
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# Check for missing values
print(df.isnull().sum())
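
One thing worth checking after the mapping: `Series.map` with a dictionary turns any label that is not a key into `NaN`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the notebook's `df`) to show how a stray label such as `"Spam"` would surface, and how to look at class balance.

```python
import pandas as pd

# Hypothetical toy frame for illustration; the notebook itself uses df from the CSV.
toy = pd.DataFrame({"label": ["ham", "spam", "ham", "Spam"]})
toy["label_num"] = toy["label"].map({"ham": 0, "spam": 1})

# Labels missing from the mapping (here "Spam" with a capital S) become NaN.
unmapped = int(toy["label_num"].isnull().sum())
print("Unmapped labels:", unmapped)

# Share of each class among the rows that were mapped successfully.
print(toy["label_num"].value_counts(normalize=True))
```

If the count of unmapped labels is non-zero on the real data, the affected rows need to be cleaned or dropped before training.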

## Step 3: Split the dataset

We separate the features (email text) and the labels, and split the data into a training set (80%) and a test set (20%).

In [None]:
from sklearn.model_selection import train_test_split

X = df["text"]
y = df["label_num"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
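
Spam datasets are often imbalanced, so a purely random split can shift the spam ratio between the two sets. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves class proportions. A minimal sketch on made-up data, where `texts` and `labels` are hypothetical:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 5 of them spam (25%).
texts = [f"message {i}" for i in range(20)]
labels = [1] * 5 + [0] * 15

# stratify=labels keeps the spam share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Spam share in train:", sum(y_tr) / len(y_tr))
print("Spam share in test: ", sum(y_te) / len(y_te))
```

In the notebook this would amount to passing `stratify=y` in the existing call.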

## Step 4: Convert text to numeric features with TF-IDF

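Text data needs to be transformed into numerical vectors. TF-IDF weights each word by how often it appears in an email, discounted by how many emails contain it, so distinctive words count more than ubiquitous ones. We also drop English stop words and any term that occurs in more than 90% of the emails (`max_df=0.9`). The vectorizer is fitted on the training set only, so nothing from the test set leaks into the features.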

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
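
To get a feel for what the vectorizer learns, here is a small sketch on four made-up sentences (`docs` is hypothetical, and default settings are used rather than the notebook's). It shows that a word found in fewer documents gets a higher IDF weight, and that words unseen during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus for illustration only.
docs = [
    "free prize claim now",
    "meeting schedule for monday",
    "claim your free prize",
    "monday meeting notes",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# Map each term to its learned inverse document frequency.
idf = {term: vec.idf_[index] for term, index in vec.vocabulary_.items()}
print("idf('free') =", round(idf["free"], 3))  # appears in 2 documents
print("idf('now')  =", round(idf["now"], 3))   # appears in 1 document

# A word that was not in the fitted vocabulary produces an all-zero row.
unseen = vec.transform(["lottery"])
print("Non-zero entries for unseen word:", unseen.nnz)
```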

## Step 5: Train the Logistic Regression model

We now train the Logistic Regression model using the TF-IDF transformed training data.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
print("Training model ...")
model.fit(X_train_tfidf, y_train)
print("Model trained successfully")
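
A fitted logistic regression is easy to inspect, because each vocabulary term gets one coefficient: positive values push towards class 1 (spam), negative values towards class 0 (ham). The sketch below trains on four made-up messages (`toy_texts`, `toy_labels`, `toy_vec` and `toy_model` are hypothetical names) and ranks the terms by coefficient.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data for illustration: 1 = spam, 0 = ham.
toy_texts = [
    "free prize claim now",
    "win free cash prize",
    "meeting schedule monday",
    "project meeting notes",
]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vec.fit_transform(toy_texts), toy_labels)

# Order the terms by column index so they line up with the coefficients.
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
weights = dict(zip(terms, toy_model.coef_[0]))

ranked = sorted(weights, key=weights.get, reverse=True)
print("Most spam-leaning terms:", ranked[:3])
print("Most ham-leaning terms: ", ranked[-3:])
```

The same idea should carry over to the notebook's `model` and `vectorizer`, although with a real vocabulary the ranking is far longer.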

## Step 6: Evaluate the model

Finally, we evaluate how well our model performs. We report accuracy, the confusion matrix (to see false positives and false negatives), and per-class precision and recall.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = model.predict(X_test_tfidf)

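print("Accuracy:", accuracy_score(y_test, y_pred))

# Rows are actual classes, columns are predicted classes (0 = ham, 1 = spam)
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall and F1
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))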
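
For readers new to these metrics, here is a small sketch on hand-made labels (hypothetical values, not results from this dataset) showing how the four cells of a binary confusion matrix relate to precision and recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only: 0 = ham, 1 = spam.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision: of the emails flagged as spam, how many really were spam.
precision = precision_score(y_true_toy, y_pred_toy)
# Recall: of the real spam emails, how many were caught.
recall = recall_score(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In a spam filter, a false positive (a legitimate email flagged as spam) is usually the costlier error, which is why precision on the spam class deserves attention alongside accuracy.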

### Interactive Test

Try your own email sample and see the prediction.

In [None]:
subject = input("Subject: ")
body = input("Body: ")

# Concatenate subject and body, separated by a line break
email_text = f"Subject: {subject}\r\n{body}"

# Transform and predict
email_tfidf = vectorizer.transform([email_text])
prob = model.predict_proba(email_tfidf)[0][1]
print()
print(f"Spam probability: {prob:.2%}")
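
The probability can be turned into a label with a cut-off of your choosing. The helper below is a hypothetical sketch (`label_from_probability` is not part of scikit-learn); the default of 0.5 matches the usual decision boundary for a binary classifier, and a higher threshold trades missed spam for fewer legitimate emails flagged.

```python
def label_from_probability(prob, threshold=0.5):
    """Return "spam" if prob reaches the threshold, otherwise "ham".

    Hypothetical helper for illustration; prob is assumed to be the
    spam-class probability, i.e. column 1 of predict_proba.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return "spam" if prob >= threshold else "ham"


# The same probability can lead to different labels under different cut-offs.
print(label_from_probability(0.62))                 # default threshold
print(label_from_probability(0.62, threshold=0.8))  # stricter threshold
```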