# EMAIL SPAM DETECTION

## The objective of this project is to design and implement an effective email spam detection system that uses machine learning techniques to automatically classify incoming emails as either spam or legitimate (ham) based on their content and characteristics.

* CONTRIBUTION - INDIVIDUAL
* AUTHOR - ADITYA DEY
* PROJECT TYPE - ML

### The goal of this project is to develop a robust email spam detection system using machine learning techniques. By analyzing the content and characteristics of emails, the system should be able to accurately classify incoming emails as either spam or legitimate (ham).

## FOR EXAMPLE
### SPAM MAIL -
FREE ENTRY IN 2 A WKLY COMP TO WIN FA CUP FINAL TKTS 21ST MAY 2005. TEXT FA TO 87121 TO RECEIVE ENTRY QUESTION
### HAM MAIL -
PLS GO AHEAD WITH WATTS. I JUST WANTED TO BE SURE. DO HAVE A GREAT WEEKEND, ABIOLA.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer         # Convert the text to numeric
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
raw_mail_data = pd.read_csv('/content/mail_data.csv')

In [4]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [5]:
# Replace all null values with an empty string
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)), '')

In [7]:
# Printing the first 5 rows
mail_data.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
mail_data.tail()

|      | Category | Message |
|------|----------|---------|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


In [9]:
# Checking the number of rows and columns in the dataframe

mail_data.shape

(5572, 2)

In [10]:
# Labeling the spam mails as 0 and the ham mails as 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
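
Assigning through `.loc` leaves the `Category` column with `dtype: object`, which is why the notebook later calls `astype(int)` on the labels. As an alternative sketch, a dictionary passed to `Series.map` produces an integer column in one step. The `toy` frame, the `Label` column, and `label_map` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame standing in for mail_data (hypothetical rows)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a free prize now", "ok thanks"],
})

# Same encoding as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

print(toy["Label"].tolist())  # [1, 0, 1]
print(toy["Label"].dtype)     # an integer dtype rather than object
```

Any category missing from `label_map` would become `NaN`, so a quick `isna().any()` check is worth adding on real data.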

In [14]:
# Separating the data into messages (X) and labels (Y)
X = mail_data['Message']
Y = mail_data['Category']

In [15]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [16]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [17]:
# Splitting the Data into Training Data and Test Data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=3)
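
The split above is purely random. When one class is much rarer than the other, passing `stratify` to `train_test_split` keeps the spam/ham ratio the same in both parts. The sketch below uses a made-up ten-message dataset, not the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8 ham (1) and 2 spam (0)
messages = [f"message {i}" for i in range(10)]
labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

# stratify=labels preserves the class proportions in each split
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.5, random_state=3, stratify=labels
)

# Each half receives exactly one of the two spam messages
print(y_tr.count(0), y_te.count(0))
```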


In [18]:
print(X.shape)

(5572,)


In [19]:
print(X_train.shape)
print(X_test.shape)

(4457,)
(1115,)


In [20]:
print(Y.shape)

(5572,)


In [21]:
print(Y_train.shape)
print(Y_test.shape)

(4457,)
(1115,)


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer with specified parameters
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit the vectorizer on the training data only, then reuse it to transform the test data
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# Converting Y_train and Y_test values to integers
Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)
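
To see what the vectorizer does, here is a small sketch on a made-up corpus. The vocabulary is learned from the training documents only, so a word that appears only in the test document gets no column and is silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, not taken from mail_data.csv
train_docs = ["free prize waiting", "lunch meeting friday", "win free entry prize"]
test_docs = ["win a free holiday"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)         # reuses them without refitting

print(sorted(vec.vocabulary_))                 # terms kept after stop-word removal
print(train_matrix.shape, test_matrix.shape)   # same number of columns in both
print("holiday" in vec.vocabulary_)            # False: never seen during fit
```

Each row of the resulting sparse matrix is one message, and each column is one vocabulary term, which is the `(row, column) weight` layout printed for `X_train_features`.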


In [24]:
print(X_train)

3075                  Don know. I did't msg him recently.
1787    Do you know why god created gap between your f...
1614                         Thnx dude. u guys out 2nite?
4304                                      Yup i'm free...
3266    44 7732584351, Do you want a New Nokia 3510i c...
                              ...                        
789     5 Free Top Polyphonic Tones call 087018728737,...
968     What do u want when i come back?.a beautiful n...
1667    Guess who spent all last night phasing in and ...
3321    Eh sorry leh... I din c ur msg. Not sad alread...
1688    Free Top ringtone -sub to weekly ringtone-get ...
Name: Message, Length: 4457, dtype: object


In [25]:
print(X_train_features)

  (0, 5413)	0.6198254967574347
  (0, 4456)	0.4168658090846482
  (0, 2224)	0.413103377943378
  (0, 3811)	0.34780165336891333
  (0, 2329)	0.38783870336935383
  (1, 4080)	0.18880584110891163
  (1, 3185)	0.29694482957694585
  (1, 3325)	0.31610586766078863
  (1, 2957)	0.3398297002864083
  (1, 2746)	0.3398297002864083
  (1, 918)	0.22871581159877646
  (1, 1839)	0.2784903590561455
  (1, 2758)	0.3226407885943799
  (1, 2956)	0.33036995955537024
  (1, 1991)	0.33036995955537024
  (1, 3046)	0.2503712792613518
  (1, 3811)	0.17419952275504033
  (2, 407)	0.509272536051008
  (2, 3156)	0.4107239318312698
  (2, 2404)	0.45287711070606745
  (2, 6601)	0.6056811524587518
  (3, 2870)	0.5864269879324768
  (3, 7414)	0.8100020912469564
  (4, 50)	0.23633754072626942
  (4, 5497)	0.15743785051118356
  :	:
  (4454, 4602)	0.2669765732445391
  (4454, 3142)	0.32014451677763156
  (4455, 2247)	0.37052851863170466
  (4455, 2469)	0.35441545511837946
  (4455, 5646)	0.33545678464631296
  (4455, 6810)	0.29731757715898277
  (4

In [26]:
# Creating the Logistic Regression model

models = LogisticRegression()

In [28]:
# Training the Logistic Regression model with the Training Data

models.fit(X_train_features, Y_train)

In [29]:
# Evaluating the Trained Model
prediction_on_training_data = models.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [30]:
print("Accuracy on Training Data - ", accuracy_on_training_data)

Accuracy on Training Data -  0.9670181736594121


In [32]:
# Evaluating the model on the test data
prediction_on_testing_data = models.predict(X_test_features)
accuracy_on_testing_data = accuracy_score(Y_test, prediction_on_testing_data)

In [33]:
print("Accuracy on Testing Data - ", accuracy_on_testing_data)

Accuracy on Testing Data -  0.9659192825112107


In [36]:
# BUILD A PREDICTIVE CLASSIFIER SYSTEM

input_mail = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

# Convert Text to Vectors

input_mail_features = feature_extraction.transform(input_mail)

# Making Predictions now
prediction = models.predict(input_mail_features)
print(prediction)

if prediction[0] == 1:
    print("HAM MAIL")
else:
    print("SPAM MAIL")

[0]
SPAM MAIL
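
The prediction steps can also be wrapped in a small helper so that any message can be checked with one call. This is a minimal sketch: `classify_message` is a hypothetical name, and the tiny training set exists only to make the example self-contained. In the notebook, `feature_extraction` and `models` would be passed in instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_message(text, vectorizer, model):
    """Return 'HAM MAIL' or 'SPAM MAIL' for a single message string."""
    features = vectorizer.transform([text])   # the vectorizer expects a list
    label = int(model.predict(features)[0])   # 0 = spam, 1 = ham
    return "HAM MAIL" if label == 1 else "SPAM MAIL"


# Hypothetical toy training data (0 = spam, 1 = ham)
texts = [
    "win a free prize now", "free entry to win cash",
    "are we meeting for lunch today", "see you at home tonight",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

print(classify_message("claim your free prize", vectorizer, model))
```

Comparing against the integer `1` matters here: the model returns integers, so a comparison with the string `"1"` would never match.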


## In this Mail Spam Prediction Classifier project, we set out to develop an effective machine learning model to identify and classify email messages as either spam or not spam. By leveraging Python and various machine learning techniques, we achieved the following results and conclusions:

### Data Preparation and Feature Engineering:

1. We preprocessed and cleaned the email dataset, including text normalization, removing stop words, and transforming the text data into numerical feature vectors using TF-IDF.

### Model Selection and Training:
2. We experimented with several machine learning algorithms, including logistic regression, Naive Bayes, and support vector machines, to find the most suitable classifier for the task.

3. Model selection was based on performance metrics such as accuracy, precision, recall, and F1-score, as well as the trade-off between false positives and false negatives.
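
The cells above report accuracy only. As a sketch of how the other metrics named here can be computed, the example below uses made-up labels (not results from this project) with the notebook's encoding, where 0 is spam and 1 is ham.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels: 3 spam (0) and 7 ham (1); two spam messages are missed
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# pos_label=0 reports the scores for the spam class
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
spam_f1 = f1_score(y_true, y_pred, pos_label=0)

print(accuracy_score(y_true, y_pred))                    # 0.8
print(spam_precision, spam_recall, spam_f1)              # 1.0, about 0.33, 0.5
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))   # rows: true, cols: predicted
```

Accuracy looks reasonable at 0.8 even though two out of three spam messages slip through, which is why recall on the spam class is worth tracking separately.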


### Model Evaluation:
4. We partitioned the dataset into training and testing sets and used cross-validation techniques to ensure robust model evaluation.

5. The selected model demonstrated strong predictive performance in terms of accuracy, recall, and precision, effectively distinguishing between spam and legitimate emails.
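
The notebook cells show a single train/test split. One way cross-validation could look is sketched below, assuming a scikit-learn pipeline so that TF-IDF is refit inside every fold and no test-fold vocabulary leaks into training. The ten messages are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (0 = spam, 1 = ham)
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free ringtone", "urgent prize claim call now",
    "cash prize waiting text to claim",
    "are we meeting for lunch today", "see you at home tonight",
    "thanks for the help yesterday", "running late will text you",
    "happy birthday have a great day",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())

# Stratified folds keep one spam and one ham message in every test fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(scores)
print(scores.mean())
```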


### Hyperparameter Tuning:
6. We fine-tuned the hyperparameters of the chosen model to optimize its performance, achieving a balance between bias and variance.
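
The tuning code is not part of the cells above, so the following is only a sketch of one common approach: a grid search over the regularization strength `C` of logistic regression. The grid values, the step names `tfidf` and `clf`, and the toy data are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy data (0 = spam, 1 = ham)
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free ringtone", "urgent prize claim call now",
    "are we meeting for lunch today", "see you at home tonight",
    "thanks for the help yesterday", "running late will text you",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger regularization (more bias, less variance)
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=3),
    scoring="accuracy",
)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

The `scoring` argument can be swapped for another metric if accuracy is not the quantity of interest.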

### Feature Importance:
7. Feature importance analysis provided insights into the most influential terms and features contributing to the spam classification.

### Challenges and Limitations:
8. We encountered challenges in dealing with imbalanced data, and we had to employ techniques like oversampling or undersampling to mitigate the issue.

9. The model's performance may vary depending on the quality and representativeness of the training data.
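
The resampling code is not shown in the cells above. As one possible sketch of random oversampling, `sklearn.utils.resample` can draw spam rows with replacement until both classes are the same size. The `train` frame and its `Label` column are hypothetical, and resampling should only ever be applied to the training split, never to the test data.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training frame: 6 ham (1), 2 spam (0)
train = pd.DataFrame({
    "Message": [f"ham message {i}" for i in range(6)]
               + ["free prize now", "win cash today"],
    "Label": [1, 1, 1, 1, 1, 1, 0, 0],
})

ham = train[train["Label"] == 1]
spam = train[train["Label"] == 0]

# Sample spam rows with replacement until they match the ham count
spam_upsampled = resample(spam, replace=True, n_samples=len(ham), random_state=3)

# Combine and shuffle so the classes are interleaved
balanced = (
    pd.concat([ham, spam_upsampled])
    .sample(frac=1, random_state=3)
    .reset_index(drop=True)
)

print(balanced["Label"].value_counts().to_dict())
```

A lighter alternative that avoids duplicating rows is `LogisticRegression(class_weight="balanced")`, which reweights the loss instead of the data.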


### Future Directions:
10. To further enhance the model's performance, future work could involve exploring more advanced natural language processing techniques, deep learning approaches, and ensemble methods.

Real-time deployment and integration into email filtering systems could be considered for practical application.

Overall, this project demonstrates the effectiveness of machine learning in automating the detection of email spam. By selecting the right model and optimizing its parameters, we can build a robust classifier to enhance email security and streamline communication processes.