# Spam Email Detection using Machine Learning

This project builds a machine learning model that classifies messages as spam or not spam (ham). It converts the message text into TF-IDF features and trains a logistic regression classifier on a labeled SMS dataset.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


In [2]:
# The file is tab-separated and has no header row, so column names are supplied.
data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv',
    sep='\t',
    names=['label', 'message']
)

data.head()


|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Dataset Description
The dataset contains SMS messages, each labeled as spam or ham (legitimate). The labels serve as targets for supervised classification. The classes are imbalanced: in the test split reported below, 149 of 1,115 messages are spam.


In [3]:
# Encode the labels as integers: ham -> 0, spam -> 1.
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

X = data['message']
y = data['label']


In [4]:
# Hold out 20% of the messages for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [5]:
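# Fit the vocabulary and IDF weights on the training messages only,
# then reuse them on the test messages so no test data leaks into training.
vectorizer = TfidfVectorizer(stop_words='english')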

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


In [6]:
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)


In [7]:
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.9695067264573991
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       966
           1       1.00      0.77      0.87       149

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115



In [8]:
sample_email = ["You have won a free lottery. Click now to claim prize"]
sample_vector = vectorizer.transform(sample_email)

prediction = model.predict(sample_vector)
print("Spam" if prediction[0] == 1 else "Not Spam")


Spam


## Conclusion
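The classifier reaches about 97% accuracy on the 1,115 held-out messages. Spam precision is 1.00 while spam recall is 0.77, so almost no legitimate message is flagged, but roughly one in four spam messages gets through. Because the model was trained on SMS messages, its performance on real email would need to be checked separately. The project demonstrates how natural language processing and supervised learning apply to a real-world filtering problem.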
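
One way to trade some spam precision for higher spam recall is to lower the decision threshold applied to the predicted probabilities, rather than relying on `predict`, which uses 0.5. The sketch below illustrates the idea on a tiny made-up dataset. The messages, labels, and the 0.3 threshold are assumptions for illustration, not values tuned on the SMS data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical toy messages (1 = spam, 0 = ham), not taken from the dataset.
toy_messages = [
    "win a free prize now",
    "claim your free lottery prize",
    "urgent call now to win cash",
    "see you at lunch tomorrow",
    "can you call me later",
    "running late, be there soon",
    "happy birthday, have a great day",
    "did you finish the report",
]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_features = toy_vectorizer.fit_transform(toy_messages)
toy_model = LogisticRegression().fit(toy_features, toy_labels)

# Probability of the spam class (column 1) for every message.
spam_proba = toy_model.predict_proba(toy_features)[:, 1]

# Flag a message as spam when its probability reaches the threshold.
recall_by_threshold = {}
for threshold in (0.5, 0.3):
    flagged = (spam_proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(toy_labels, flagged)
    print(f"threshold={threshold}: spam recall={recall_by_threshold[threshold]:.2f}")
```

On the real data the threshold would be chosen on a validation split, since lowering it typically costs some precision.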


## Challenges and Learnings

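One of the main challenges in this project was converting raw text into a numerical format that a machine learning model can use. TF-IDF vectorization handles this by turning each message into a sparse vector in which a term's weight grows with its frequency in the message and shrinks with the number of messages that contain it.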

Another challenge was selecting an appropriate model for text classification. Logistic Regression was chosen because it is simple to train and works well on sparse, high-dimensional TF-IDF features.

Through this project, I learned how to preprocess text data, extract meaningful features, train a machine learning model, and evaluate its performance on unseen data.
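
A single 80/20 split gives one estimate of performance on unseen data. As a complementary check, the sketch below wraps the vectorizer and classifier in a pipeline and scores it with stratified k-fold cross-validation, so that TF-IDF is refit on the training portion of every fold. The toy messages, the two folds, and the F1 scoring choice are assumptions for illustration; on the real data one would pass `X` and `y` and likely use five or more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham).
toy_messages = [
    "win a free prize now",
    "free cash prize, claim now",
    "claim your free lottery prize",
    "win cash now, call free",
    "see you at lunch tomorrow",
    "lunch meeting moved to tomorrow",
    "call me after the meeting",
    "are we still on for lunch",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits TF-IDF inside each fold, so held-out text never
# influences the vocabulary or IDF weights.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Stratification keeps the spam/ham ratio the same in every fold.
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy_messages, toy_labels, cv=folds, scoring="f1")

print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```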
