## SMS spam detection

## Step 1:
#### Importing libraries 

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

## Step 2 :
#### Importing the dataset

In [5]:
#import dataset
spam_data = pd.read_csv('spam.csv', encoding='latin1')

The `latin1` encoding is often needed when a file contains non-ASCII bytes that are not valid UTF-8; reading such a file with pandas' default UTF-8 encoding raises a `UnicodeDecodeError`.
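As a minimal illustration of why the encoding matters, the sketch below decodes a hypothetical byte string (not taken from `spam.csv`) that contains the single byte `0xA3`. That byte is the pound sign in latin1 but is not a valid UTF-8 sequence on its own.

```python
# Hypothetical raw bytes; 0xA3 is the pound sign in latin1
raw = b"Win \xa3100 cash now"

# Decoding as UTF-8 fails because 0xA3 cannot start a UTF-8 character
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin1 maps every byte to exactly one character, so decoding always succeeds
text = raw.decode("latin1")

print(utf8_ok)  # False
print(text)     # Win £100 cash now
```

Because latin1 never fails, it can also hide a wrong guess about the real encoding, so it is worth checking a few decoded messages by eye.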

## Step 3:
#### Preprocessing data

In [7]:
#preprocess data
print(spam_data.head())


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In [9]:
print(spam_data.describe())


          v1                      v2  \
count   5572                    5572   
unique     2                    5169   
top      ham  Sorry, I'll call later   
freq    4825                      30   

                                               Unnamed: 2  \
count                                                  50   
unique                                                 43   
top      bt not his girlfrnd... G o o d n i g h t . . .@"   
freq                                                    3   

                   Unnamed: 3 Unnamed: 4  
count                      12          6  
unique                     10          5  
top      MK17 92H. 450Ppw 16"    GNT:-)"  
freq                        2          2  


In [10]:
print(spam_data.isna().sum())

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64


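The three unnamed columns are almost entirely empty (at least 5,522 of the 5,572 values are missing in each), so `v2` (the message text) is used as the predictor and `v1` (the ham/spam label) as the target.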
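One way to act on this observation is to keep only the two useful columns and give them descriptive names. The sketch below uses a small hypothetical frame with the same column layout; the names `label` and `message` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the column layout of spam.csv
df = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at five", "Win a free prize now"],
    "Unnamed: 2": [np.nan, np.nan],
    "Unnamed: 3": [np.nan, np.nan],
    "Unnamed: 4": [np.nan, np.nan],
})

# Keep the label and text columns, then rename them for readability
clean = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})

print(clean.columns.tolist())
print(clean.shape)
```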

## Step 4:
#### Selecting the predictor variable and target variable

In [11]:
#feature selection
X = spam_data['v2']   #predictor
Y = spam_data['v1']   #target

## Step 5:
#### Converting the text to vectors (numerical data) using TfidfVectorizer

In [13]:
#text vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vector.fit_transform(X)
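To see what the vectorizer produces, here is a small sketch on a hypothetical three-message corpus, using default settings rather than the `stop_words` and `max_features` options above. Each row is one message and each column one vocabulary term. A term that appears in every message (`now`) receives a lower weight than a rarer one (`prize`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset
docs = ["free prize now", "call me now", "free entry now"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse matrix: messages x terms

# vocabulary_ maps each term to its column index
vocab = toy_vectorizer.vocabulary_
first_row = toy_matrix[0].toarray().ravel()

# By default each row is scaled to unit Euclidean length
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

print(toy_matrix.shape)
print(first_row[vocab["now"]], first_row[vocab["prize"]])
print(row_norms)
```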

## Step 6:
#### Splitting the data into training and testing sets

In [15]:
#splitting data
X_train, X_test, Y_train, Y_test = train_test_split(X_tfidf,Y, test_size=0.2, random_state=42)
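Note that the notebook fits `TfidfVectorizer` on all messages before splitting, so the vocabulary and IDF weights also reflect the test messages. A common alternative, sketched below on hypothetical toy messages, is to split the raw text first and wrap the vectorizer and classifier in a pipeline so the vectorizer is fitted on the training portion only. The `stratify` argument is an optional addition that keeps the ham/spam ratio similar in both parts. The scores reported in this notebook come from the original approach, not from this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for spam_data['v2'] and spam_data['v1']
messages = [
    "win free prize today", "free entry cash prize", "claim cash reward urgent",
    "urgent winner claim prize", "free ringtone text win",
    "lunch tomorrow maybe", "running late sorry", "meeting moved friday",
    "happy birthday mate", "dinner plans tonight",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first so the test messages stay unseen
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=5000),
    MultinomialNB(),
)
pipe.fit(msg_train, y_train)

# At predict time the fitted vocabulary is reused; unseen words are ignored
preds = pipe.predict(msg_test)
print(list(preds))
```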

## Step 7:
#### Implementing Naive Bayes model

In [19]:
#model selection (naive bayes)
from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()
model_nb.fit(X_train, Y_train)

y_pred1 = model_nb.predict(X_test)
print(y_pred1)

print('Accuracy Score: ', accuracy_score(Y_test, y_pred1))
print('Classification report: ', classification_report(Y_test, y_pred1))

['ham' 'ham' 'spam' ... 'ham' 'ham' 'spam']
Accuracy Score:  0.9757847533632287
Classification report:                precision    recall  f1-score   support

         ham       0.97      1.00      0.99       965
        spam       1.00      0.82      0.90       150

    accuracy                           0.98      1115
   macro avg       0.99      0.91      0.94      1115
weighted avg       0.98      0.98      0.97      1115



1. Accuracy is 97.58%.
2. Precision of 1.00 (100%) for spam means every message predicted as spam was actually spam.
3. Recall of 0.82 (82%) for spam means only 82% of the actual spam messages were identified; the remaining 18% were labeled ham.
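To make these definitions concrete, the sketch below recomputes the spam metrics from confusion counts. The counts are not printed by the notebook; they are inferred from the report above (150 spam and 965 ham in the test set, accuracy of 1088/1115, spam precision of 1.00) and should be read as an assumption.

```python
# Confusion counts for the spam class, inferred from the report (assumption)
tp = 123  # spam correctly predicted as spam
fn = 27   # spam wrongly predicted as ham
fp = 0    # ham wrongly predicted as spam
tn = 965  # ham correctly predicted as ham

precision = tp / (tp + fp)  # share of spam predictions that are correct
recall = tp / (tp + fn)     # share of actual spam that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
# 1.0 0.82 0.9 0.9758
```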

### Implementing logistic regression model

In [21]:
#logistic model
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression()
model_lr.fit(X_train, Y_train)

y_pred2 = model_lr.predict(X_test)
print(y_pred2)

print('Accuracy Score: ', accuracy_score(Y_test, y_pred2))
print('Classification report: ', classification_report(Y_test, y_pred2))

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']
Accuracy Score:  0.9479820627802691
Classification report:                precision    recall  f1-score   support

         ham       0.95      1.00      0.97       965
        spam       0.97      0.63      0.77       150

    accuracy                           0.95      1115
   macro avg       0.96      0.82      0.87      1115
weighted avg       0.95      0.95      0.94      1115



1. Accuracy is 94.80%.
2. Precision of 0.97 (97%) for spam means 97% of the messages predicted as spam were actually spam.
3. Recall of 0.63 (63%) for spam means only 63% of the actual spam messages were identified.

### Implementing SVM model 

In [20]:
#svm model 
from sklearn.svm import SVC

SVM_model = SVC(kernel='linear')
SVM_model.fit(X_train, Y_train)

y_pred3 = SVM_model.predict(X_test)
print(y_pred3)

print('Accuracy Score: ', accuracy_score(Y_test, y_pred3))
print('Classification report: ', classification_report(Y_test, y_pred3))

['ham' 'ham' 'ham' ... 'ham' 'ham' 'spam']
Accuracy Score:  0.9757847533632287
Classification report:                precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.97      0.85      0.90       150

    accuracy                           0.98      1115
   macro avg       0.97      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



1. Accuracy is 97.58%.
2. Precision of 0.97 (97%) for spam means 97% of the messages predicted as spam were actually spam.
3. Recall of 0.85 (85%) for spam means 85% of the actual spam messages were identified, the highest of the three models.

## Accuracy Scores:

1. Naive Bayes: 97.58%
2. Logistic Regression: 94.80%
3. SVM: 97.58%
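Accuracy alone can look flattering on this dataset because ham dominates the test set. As a rough point of comparison, the sketch below computes the accuracy of a hypothetical classifier that always predicts ham, using the support counts from the classification reports (965 ham, 150 spam).

```python
# Support counts taken from the classification reports
ham_support = 965
spam_support = 150

# Accuracy of a hypothetical classifier that labels every message as ham
baseline_accuracy = ham_support / (ham_support + spam_support)

print(f"{baseline_accuracy:.4f}")  # 0.8655
```

All three models clear this baseline, but the gap is smaller than the raw accuracy figures suggest, which is why spam recall is examined separately.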

## Overall Summary of the three models
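1. Best model for spam recall: SVM (0.85) and Naive Bayes (0.82) both detect far more spam than Logistic Regression (0.63).
2. Overall balanced performance: SVM achieves the best balance between precision and recall across both ham and spam, with the highest macro-average F1 (0.95, versus 0.94 for Naive Bayes and 0.87 for Logistic Regression).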
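For a side-by-side view, the sketch below collects the spam-class figures from the three classification reports into one table and ranks the models by spam recall. The values are copied from the reports; the column names are illustrative.

```python
import pandas as pd

# Metrics copied from the three classification reports
results = pd.DataFrame(
    {
        "accuracy": [0.9758, 0.9480, 0.9758],
        "spam_precision": [1.00, 0.97, 0.97],
        "spam_recall": [0.82, 0.63, 0.85],
        "spam_f1": [0.90, 0.77, 0.90],
    },
    index=["Naive Bayes", "Logistic Regression", "SVM"],
)

# Rank the models by how much of the actual spam they catch
ranked = results.sort_values("spam_recall", ascending=False)
print(ranked)
```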