 **Business Problem**
 "How can a company automatically classify incoming emails as spam or legitimate (ham) to ensure a cleaner, more productive inbox for its employees and reduce security risks?"

Here are the required libraries:

**pandas** for reading the dataset into a DataFrame.

**CountVectorizer** tokenizes the input text, that is, breaks it down into individual words (tokens), and counts how often each token occurs in each message (a bag-of-words representation).

**Logistic Regression** can be used to classify emails as either "spam" or "not spam" based on features extracted from the email content.

**train_test_split** is a scikit-learn utility function that splits the dataset into two subsets: a training set and a testing set.

**confusion_matrix** summarizes the performance of a classification model by counting how often each predicted class matches or differs from the actual class.

**matplotlib** is used for plotting, for example to display the confusion matrix.


In [None]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,classification_report, accuracy_score
import matplotlib.pyplot as plt
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("ahmedhassansaqr/email-spam-detection-v2")

print("Path to dataset files:", path)
data_file = None
for file in os.listdir(path):
    # Check for common data file extensions
    if file.endswith((".csv", ".tsv", ".txt", ".data")):
        data_file = os.path.join(path, file)
        break

if data_file is None:
    print("No supported data file found in the dataset folder.")
else:
    print(f"Found data file: {data_file}")
    # Try different delimiters based on common data file formats
    try:
        df = pd.read_csv(data_file, sep=',')  # Try comma delimiter first
    except pd.errors.ParserError:
        try:
            df = pd.read_csv(data_file, sep='\t')  # Try tab delimiter
        except pd.errors.ParserError:
            try:
                df = pd.read_csv(data_file, sep=r'\s+')  # Try whitespace delimiter (delim_whitespace is deprecated in recent pandas)
            except pd.errors.ParserError:
                print("Unable to determine the correct delimiter for the data file.")
                df = None

    if df is not None:
        print(df.head())

Path to dataset files: /root/.cache/kagglehub/datasets/ahmedhassansaqr/email-spam-detection-v2/versions/1
Found data file: /root/.cache/kagglehub/datasets/ahmedhassansaqr/email-spam-detection-v2/versions/1/smsspamcollection.tsv
  label                                            message  length  punct
0   ham  Go until jurong point, crazy.. Available only ...     111      9
1   ham                      Ok lar... Joking wif u oni...      29      6
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3   ham  U dun say so early hor... U c already then say...      49      6
4   ham  Nah I don't think he goes to usf, he lives aro...      61      2


In [None]:
from sklearn.preprocessing import LabelEncoder
# Preprocessing
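X = df['message']  # Feature: message text
y = df['label']  # Target: 'ham' or 'spam' (encoded to 0 or 1 below)
# Caution: swapping these two columns makes every distinct message its own class,
# which produces near-zero accuracy and a very large confusion matrix like the
# outputs recorded further down.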

# Encode labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Text vectorization (Bag-of-Words)
vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
print(df.head())

  label                                            message  length  punct
0   ham  Go until jurong point, crazy.. Available only ...     111      9
1   ham                      Ok lar... Joking wif u oni...      29      6
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3   ham  U dun say so early hor... U c already then say...      49      6
4   ham  Nah I don't think he goes to usf, he lives aro...      61      2
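As a quick check on the label-encoding step, the sketch below shows how LabelEncoder assigns integers to string labels. The label list and the `toy_` names are illustrative assumptions, not values read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only
toy_labels = ["ham", "spam", "ham", "ham", "spam"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, so index 0 is "ham" and index 1 is "spam"
print(list(toy_encoder.classes_))
print(list(encoded))

# inverse_transform maps integer predictions back to the original strings
decoded = list(toy_encoder.inverse_transform([1, 0]))
print(decoded)
```

Inspecting `classes_` on the fitted encoder is a cheap way to confirm that the target really contains only the two expected classes before training.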


In [None]:
from sklearn.model_selection import cross_val_score

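# Logistic regression model to evaluate
# (max_iter=1000 is an assumption, chosen to match the grid-search cell)
model = LogisticRegression(max_iter=1000)

# Perform 5-fold cross-validation on the training set
cv_scores = cross_val_score(model, X_train_vect, y_train, cv=5, scoring='accuracy')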
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean()}")




Cross-validation scores: [0.0044843  0.00560538 0.00448934 0.00448934 0.00448934]
Mean accuracy: 0.004711539913333636
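One optional variation, which this notebook does not use, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline so that each cross-validation fold learns its vocabulary only from its own training portion. The sketch below assumes a small made-up corpus; the `toy_` names and the messages are hypothetical.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up messages: 6 spam-like (1) followed by 6 ham-like (0)
toy_texts = [
    "win a free prize now", "free cash offer today", "claim your free ticket",
    "urgent prize waiting call now", "free entry win cash", "call now to claim cash",
    "are we meeting for lunch", "see you at home later", "can you send the report",
    "thanks for the lunch today", "running late see you soon", "please send me the notes",
]
toy_targets = [1] * 6 + [0] * 6

# The pipeline refits CountVectorizer inside every fold
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_targets, cv=3, scoring="accuracy")

print(toy_scores)
print(toy_scores.mean())
```

The scores on such a tiny corpus are not meaningful in themselves; the point is only the structure, which keeps the held-out fold from influencing the vocabulary.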


In [None]:
from sklearn.model_selection import GridSearchCV

# Set up the hyperparameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Set up GridSearchCV to find the best hyperparameter
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_vect, y_train)

# Print best hyperparameter and score
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_}")




Best hyperparameters: {'C': 100}
Best cross-validation accuracy: 0.004936006804467193


In [None]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the Multinomial Naive Bayes model
model_nb = MultinomialNB()
model_nb.fit(X_train_vect, y_train)

# Predict on the test set
y_pred_nb = model_nb.predict(X_test_vect)

In [None]:
# Evaluate performance on the held-out test set
cm_nb = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matrix (Naive Bayes):")
print(cm_nb)

#print("\nClassification Report (Naive Bayes):")
#print(classification_report(y_test, y_pred_nb))


Confusion Matrix (Naive Bayes):
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
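A confusion matrix has one row and one column per distinct label, so a two-class ham/spam problem yields a 2x2 matrix. The sketch below uses made-up true and predicted labels (0 for ham, 1 for spam) to show how the four cells are read; the `_toy` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = ham, 1 = spam
y_true_toy = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm_toy)

# For the binary case, ravel() unpacks the cells in this order
tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here a false positive is a legitimate message flagged as spam, and a false negative is a spam message that reaches the inbox.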
