# Spam Email Classifier

Problem Statement: Build a machine learning model that can classify emails as either spam
or not spam (ham). Use natural language processing (NLP) techniques
to process and analyze email text.

# Data Exploration:

Examine the dataset to understand its structure, such as columns and their meanings.
Explore the distribution of spam and ham emails.
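
A minimal sketch of this exploration step, using a small invented DataFrame in place of the real file. The column names `v1` (label) and `v2` (text) follow the code cell later in this notebook; the rows themselves are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; these rows are invented for illustration.
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham", "spam", "ham"],
    "v2": [
        "Are we still on for lunch?",
        "WINNER! Claim your free prize now",
        "I'll call you later",
        "Meeting moved to 3pm",
        "Free entry to win cash, text NOW",
        "Thanks for the report",
    ],
})

# Structure: number of rows/columns and the column types.
print(df.shape)
print(df.dtypes)

# Class distribution, as raw counts and as proportions.
class_counts = df["v1"].value_counts()
class_share = df["v1"].value_counts(normalize=True)
print(class_counts)
print(class_share)

# Message length per class is often a quick first signal worth checking.
df["length"] = df["v2"].str.len()
mean_length = df.groupby("v1")["length"].mean()
print(mean_length)
```

On a real dataset, a strongly skewed class distribution is a hint that accuracy alone will be a misleading metric.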


# Data Preprocessing:

Clean and preprocess the text data. This includes:

- Removing special characters and punctuation.
- Tokenization: splitting the text into words or tokens.
- Removing stopwords.
- Lemmatization or stemming to reduce words to their base form.
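
The steps above can be sketched with the standard library alone. The stopword set and the `crude_stem` helper below are hypothetical, deliberately simplistic stand-ins; a real project would more likely use a library stopword list and a proper stemmer or lemmatizer.

```python
import re

# Tiny hand-written stopword set (an assumption for this sketch only).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
             "you", "your", "for", "in", "on"}

def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Whitespace tokenization.
    tokens = text.split()
    # Stopword removal, then suffix stripping.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

tokens = preprocess("Congratulations!!! You are WINNING free prizes, claim now.")
print(tokens)
# ['congratulation', 'winn', 'free', 'prize', 'claim', 'now']
```

The `'winn'` token shows why naive suffix stripping is only a placeholder: a real stemmer or lemmatizer handles doubled consonants and irregular forms.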

# Feature Extraction:

Convert the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. You can use the TfidfVectorizer from scikit-learn.

# Split the Data:

Split the dataset into a training set and a testing set. A common ratio is 80% for training and 20% for testing.
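
A short sketch of an 80/20 split on placeholder data. The `stratify` argument is an addition not used in this notebook's code cell; it is shown here because it keeps the spam/ham proportions the same in both splits, which can matter when spam is the minority class.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 messages, 5 labelled spam and 15 labelled ham.
texts = [f"message {i}" for i in range(20)]
labels = ["spam"] * 5 + ["ham"] * 15

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # 20% held out for testing
    random_state=42,    # fixed seed for a reproducible split
    stratify=labels,    # preserve the class ratio in both splits
)

print(len(X_train), len(X_test))
print(y_train.count("spam"), y_test.count("spam"))
```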

# Model Selection:

Choose a machine learning algorithm for classification. You can start with simple models like Multinomial Naive Bayes or try more advanced algorithms like Random Forest or Support Vector Machines (SVM).
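
One way to compare the candidate models is cross-validation with the same TF-IDF features for each. The sketch below assumes an invented twelve-message dataset, so the scores it prints say nothing about real performance; `LinearSVC` stands in for the SVM option, and the hyperparameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy messages, six per class.
texts = [
    "win a free prize now", "claim your free cash today",
    "free entry to win cash", "urgent prize claim now",
    "cheap loans apply now free", "win cash prize urgent",
    "lunch at noon tomorrow", "see you at the meeting",
    "can you send the report", "running late for dinner",
    "call me after the meeting", "thanks for the report",
]
labels = ["spam"] * 6 + ["ham"] * 6

candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Vectorizer and model are wrapped together so each fold refits both.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1_macro")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean macro F1 = {score:.3f}")
```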

# Model Training:

Train the selected model using the training data and the TF-IDF features.
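
A minimal sketch of the training step on invented messages. Here the vectorizer is fitted on the training texts only and then reused to transform the test texts, so the test data does not influence the learned vocabulary or IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data for illustration.
train_texts = [
    "win a free prize now",
    "free cash claim now",
    "lunch at noon tomorrow",
    "see you at the meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["claim your free prize", "meeting moved to noon"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on training data
X_test = vectorizer.transform(test_texts)        # reuse them for unseen text

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

pred = classifier.predict(X_test)
print(list(pred))
```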

# Model Evaluation:

Evaluate the model's performance using metrics such as accuracy, precision, recall, F1 score, and ROC AUC. scikit-learn's classification_report and confusion_matrix cover per-class precision, recall, and F1 score; roc_auc_score computes ROC AUC from predicted scores or probabilities.
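
The metrics can be sketched on a small hand-made set of labels and scores. The values below are invented to show how each function is called; `spam_scores` plays the role of a model's predicted spam probability.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth, hard predictions, and spam scores for 8 messages.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]
spam_scores = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.4, 0.05]

# Rows are true classes, columns are predicted classes, in the given label order.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# Treat "spam" as the positive class for the threshold-based metrics.
precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")
f1 = f1_score(y_true, y_pred, pos_label="spam")

# ROC AUC is computed from scores rather than hard predictions.
y_true_bin = [1 if label == "spam" else 0 for label in y_true]
auc = roc_auc_score(y_true_bin, spam_scores)

print(cm)
print(precision, recall, f1, auc)
```

For spam filtering, precision on the spam class reflects how often legitimate mail is wrongly flagged, while recall reflects how much spam slips through.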

# Hyperparameter Tuning:

Fine-tune the model by adjusting hyperparameters for better performance. You can use techniques like grid search or random search.
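
A hedged sketch of grid search over a TF-IDF plus Naive Bayes pipeline. The data is invented and the parameter grid is an arbitrary example, not a recommendation. Placing the vectorizer inside the pipeline means it is refitted within every cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy messages, six per class.
texts = [
    "win a free prize now", "claim your free cash today",
    "free entry to win cash", "urgent prize claim now",
    "cheap loans apply now free", "win cash prize urgent",
    "lunch at noon tomorrow", "see you at the meeting",
    "can you send the report", "running late for dinner",
    "call me after the meeting", "thanks for the report",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams only vs. unigrams and bigrams
    "nb__alpha": [0.1, 0.5, 1.0],            # additive smoothing strength
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

`RandomizedSearchCV` offers the same interface for random search when the grid becomes too large to enumerate.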

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
# You may need to adjust the column names based on your dataset's structure
X = data['v2']  # Email text
y = data['v1']  # Spam or ham labels

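# TF-IDF vectorization. TfidfVectorizer lowercases and tokenizes by default;
# no stopword removal, stemming, or lemmatization is applied here.
# Note: the vectorizer is fitted on the full dataset before the split, so the
# IDF weights also see the test messages. Fit on the training split only to avoid this.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)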

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print(report)
print("Confusion Matrix:")
print(confusion)


Accuracy: 0.9623318385650225
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115

Confusion Matrix:
[[965   0]
 [ 42 108]]
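
The reported metrics can be re-derived from the confusion matrix above, which helps in reading it. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in the order ham, spam) and treats spam as the positive class.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 965, 0    # true ham: predicted ham, predicted spam
fn, tp = 42, 108   # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

No ham message was flagged as spam, but 42 of the 150 spam messages were classified as ham, which is why spam recall (0.72) is well below the overall accuracy.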


Thank you