### <center> Spam Mail Prediction </center>

In this project, we build a system that predicts whether an email is spam or not.


#### Import the required packages

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### Data Collection & Pre-Processing

In [4]:
# load the data from the CSV file into a pandas DataFrame
raw_mail_data = pd.read_csv('mail_data.csv')

In [5]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In the `Category` column, `ham` marks a genuine (legitimate) email and `spam` marks an unsolicited or fraudulent one 🙂

In [6]:
# dataset information
raw_mail_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [7]:
# check the number of missing values in each column
raw_mail_data.isnull().sum()

Category    0
Message     0
dtype: int64

Great, there are no missing values in the dataset.

In [8]:
# distribution of category class
raw_mail_data['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

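The dataset is heavily imbalanced: 4825 ham messages versus 747 spam messages (about 87% to 13%). Let us undersample the majority class so that both classes are equally represented.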
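The size of the imbalance can be quantified directly from the class counts. The following is a minimal sketch that rebuilds the counts printed above as a `pd.Series`; the names `counts`, `shares`, and `imbalance_ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"ham": 4825, "spam": 747})

# Share of each class in the full dataset.
shares = counts / counts.sum()
print(shares.round(3))

# Ratio of majority class to minority class.
imbalance_ratio = counts["ham"] / counts["spam"]
print(round(imbalance_ratio, 2))
```

On the real DataFrame, `raw_mail_data['Category'].value_counts(normalize=True)` gives the same shares in one call.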

#### Handle dataset imbalance

In [9]:
# separate the data for analysis
ham = raw_mail_data[raw_mail_data.Category == 'ham']
spam = raw_mail_data[raw_mail_data.Category == 'spam']

In [10]:
# randomly draw as many ham messages as there are spam messages (747)
ham_sample = ham.sample(n=len(spam))

In [11]:
# concatenate the two DataFrames
new_dataset = pd.concat([ham_sample, spam], axis=0)
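Because the sampling above uses no fixed seed, every run of the notebook draws a different set of ham messages, so the numbers reported later will vary slightly between runs. The sketch below shows the same undersampling idea on a made-up toy DataFrame with a fixed `random_state`; the toy data, the variable names, and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the mail data: 6 ham rows and 2 spam rows (made-up messages).
toy = pd.DataFrame({
    "Category": ["ham"] * 6 + ["spam"] * 2,
    "Message": [f"ham message {i}" for i in range(6)]
               + [f"spam message {i}" for i in range(2)],
})

ham_toy = toy[toy["Category"] == "ham"]
spam_toy = toy[toy["Category"] == "spam"]

# Draw as many ham rows as there are spam rows; the seed makes the draw repeatable.
ham_toy_sample = ham_toy.sample(n=len(spam_toy), random_state=42)

# Stack the sampled ham rows on top of all spam rows.
balanced = pd.concat([ham_toy_sample, spam_toy], axis=0)
print(balanced["Category"].value_counts())
```

Undersampling discards most of the ham messages, which is a simple but lossy way to balance the classes.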

In [12]:
new_dataset

     Category                                            Message
2799      ham  Purity of friendship between two is not about ...
691       ham                                 Was the farm open?
4302      ham  Are there TA jobs available? Let me know pleas...
4425      ham         Update your face book status frequently :)
16        ham                         Oh k...i'm watching here:)
...       ...                                                ...
5537     spam  Want explicit SEX in 30 secs? Ring 02073162414...
5540     spam  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547     spam  Had your contract mobile 11 Mnths? Latest Moto...
5566     spam  REMINDER FROM O2: To get 2.50 pounds free call...


In [13]:
# replace any null values with an empty string
mail_data = new_dataset.where(pd.notnull(new_dataset), '')

In [14]:
# printing the first 5 rows of the dataframe
mail_data.head()

     Category                                            Message
2799      ham  Purity of friendship between two is not about ...
691       ham                                 Was the farm open?
4302      ham  Are there TA jobs available? Let me know pleas...
4425      ham         Update your face book status frequently :)
16        ham                         Oh k...i'm watching here:)


In [15]:
# checking the number of rows and columns in the dataframe
mail_data.shape

(1494, 2)

#### Label Encoding

In [16]:
# label spam mail as 0 and ham mail as 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
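Assigning through `.loc` leaves the column with `object` dtype, which is why the labels are cast to integers later in the notebook. As an alternative sketch, `Series.map` with a dictionary produces integer labels in one step. The toy `labels` series below is made up, and the mapping follows the same convention as above (spam as 0, ham as 1).

```python
import pandas as pd

# Made-up labels standing in for the Category column.
labels = pd.Series(["ham", "spam", "ham", "spam"], name="Category")

# Same convention as the notebook: spam -> 0, ham -> 1.
encoded = labels.map({"spam": 0, "ham": 1})

print(encoded.tolist())
print(encoded.dtype)
```

Any category missing from the dictionary would become `NaN`, so this approach assumes the column holds only the two known values.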

In [17]:
# separate the data into texts (X) and labels (Y)

X = mail_data['Message']
Y = mail_data['Category']

In [18]:
print(X)

2799    Purity of friendship between two is not about ...
691                                    Was the farm open?
4302    Are there TA jobs available? Let me know pleas...
4425           Update your face book status frequently :)
16                             Oh k...i'm watching here:)
                              ...                        
5537    Want explicit SEX in 30 secs? Ring 02073162414...
5540    ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547    Had your contract mobile 11 Mnths? Latest Moto...
5566    REMINDER FROM O2: To get 2.50 pounds free call...
5567    This is the 2nd time we have tried 2 contact u...
Name: Message, Length: 1494, dtype: object


In [19]:
print(Y)

2799    1
691     1
4302    1
4425    1
16      1
       ..
5537    0
5540    0
5547    0
5566    0
5567    0
Name: Category, Length: 1494, dtype: object


#### Split the data into training data & test data

In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [21]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(1494,)
(1195,)
(299,)
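The split above is purely random, so the ham/spam proportions in the training and test sets can drift slightly from 50/50. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that both parts keep the class proportions. The sketch below uses made-up toy data; `X_toy`, `Y_toy`, and the other names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder messages, 10 per class (0 = spam, 1 = ham).
X_toy = pd.Series([f"message {i}" for i in range(20)])
Y_toy = pd.Series([0] * 10 + [1] * 10)

# stratify=Y_toy keeps the class proportions equal in both parts of the split.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=3, stratify=Y_toy
)

print(Y_tr.value_counts().to_dict())
print(Y_te.value_counts().to_dict())
```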


#### Feature Extraction

In [22]:
# transform the text data into TF-IDF feature vectors that can be used as input to the logistic regression model

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
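Note that the vectorizer is fitted on the training texts only and then reused to transform the test texts, so the vocabulary comes from the training data alone. The small sketch below illustrates this on two made-up training sentences: a word that appears only in the test text ("pizza") gets no column. The example texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts.
train_texts = ["free prize call now", "are you coming home now"]
test_texts = ["call home for a free pizza"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# Learn the vocabulary and IDF weights from the training texts only.
train_features = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary; unseen words are silently ignored.
test_features = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_features.shape, test_features.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.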

In [23]:
# convert Y_train and Y_test values to integers
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [24]:
print(X_train)

1174                            Ü dun need to pick ur gf?
4649               We are okay. Going to sleep now. Later
529     You will recieve your tone within the next 24h...
857                         Going to take your babe out ?
1542    Do u konw waht is rael FRIENDSHIP Im gving yuo...
                              ...                        
273     HMV BONUS SPECIAL 500 pounds of genuine HMV vo...
5187                                   WHAT TIME U WRKIN?
1544    Hello from Orange. For 1 month's free access t...
1430    For sale - arsenal dartboard. Good condition b...
3968    YOU HAVE WON! As a valued Vodafone customer ou...
Name: Message, Length: 1195, dtype: object


In [25]:
print(X_train_features)

  (0, 1676)	0.564731278850303
  (0, 3513)	0.2700612957206539
  (0, 2606)	0.4674346451109705
  (0, 2413)	0.3786525089298166
  (0, 1390)	0.49625861128339
  (1, 2044)	0.4723877202552172
  (1, 3066)	0.5458732004274188
  (1, 1698)	0.44012816336581617
  (1, 2493)	0.5340032683602595
  (2, 533)	0.27027551186456716
  (2, 2592)	0.371523060007585
  (2, 3296)	0.371523060007585
  (2, 1058)	0.371523060007585
  (2, 1169)	0.3366928712498884
  (2, 3307)	0.29594416261730416
  (2, 346)	0.3511487056912929
  (2, 3393)	0.2249070047160074
  (2, 2806)	0.371523060007585
  (3, 824)	0.7316475900133818
  (3, 1698)	0.681683067141623
  (4, 1378)	0.18050688008620577
  (4, 3259)	0.1536853960128704
  (4, 2431)	0.13131510439588673
  (4, 739)	0.1437863842763453
  (4, 1708)	0.18050688008620577
  :	:
  (1192, 252)	0.18596398672520945
  (1192, 3145)	0.23432959358108624
  (1192, 1649)	0.20554772033561108
  (1192, 3737)	0.13565055940130433
  (1192, 3307)	0.21923621804083598
  (1193, 3430)	0.4073229596644884
  (1193, 1367)	0.

#### Model Training: Logistic Regression 

In [26]:
model = LogisticRegression()

In [27]:
# training the Logistic Regression model with the training data
model.fit(X_train_features, Y_train)

LogisticRegression()

#### Model Evaluation

In [28]:
# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9832635983263598


In [29]:
# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.9531772575250836
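Accuracy is a single number and does not show which kind of mistake the model makes: ham flagged as spam, or spam let through. A confusion matrix with per-class precision and recall gives that breakdown. The sketch below uses hypothetical labels and predictions (0 = spam, 1 = ham) that are not taken from the model above, purely to show the calls.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (0 = spam, 1 = ham).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

accuracy = accuracy_score(y_true, y_pred)
# Treat spam (label 0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print(accuracy, spam_precision, spam_recall)
```

In the notebook, the same functions could be called with `Y_test` and `prediction_on_test_data`.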


#### Build a Predictive System

In [32]:
input_mail = ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# make a prediction

prediction = model.predict(input_data_features)
print(prediction)


if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

[1]
Ham mail
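To classify more than one message, the vectorizer and the model can be bundled so that raw text goes in and a label comes out. The sketch below does this with scikit-learn's `make_pipeline` and a small helper function. The helper name `classify_mail`, the pipeline variable, and the tiny training set are all hypothetical; such a small training set is far too little for a useful model and only serves to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (0 = spam, 1 = ham).
texts = [
    "win cash prize claim",
    "free lottery winner urgent",
    "lunch tomorrow office",
    "meeting schedule project",
]
labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)


def classify_mail(text):
    """Return 'Ham mail' or 'Spam mail' for a single message."""
    prediction = spam_pipeline.predict([text])[0]
    return "Ham mail" if prediction == 1 else "Spam mail"


result = classify_mail("Claim your free cash prize")
print(result)
```

A pipeline also ensures that new messages go through exactly the same preprocessing as the training data.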
