
Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load the dataset

In [None]:
data = pd.read_excel("spam.xlsx")
print(data)

     result                                               Text
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...
...     ...                                                ...
5567   spam  This is the 2nd time we have tried 2 contact u...
5568    ham               Will ü b going to esplanade fr home?
5569    ham  Pity, * was in mood for that. So...any other s...
5570    ham  The guy did some bitching but I acted like i'd...
5571    ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


Checking for null values

In [None]:
data.isnull().sum()

Unnamed: 0    0
result        0
Text          0


Importing the libraries required for text preprocessing

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Removing 'not' and 's' from the stop words and adding 'may'

In [None]:
stop = stopwords.words('english')
stop.append('may')
stop.remove('not')
stop.remove('s')
print(stop)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 'same', 'shan', "shan't", 'she', "she'd", 
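
The effect of these three changes can be seen on a small example. The sketch below uses a short hand-written stop list (`base_stop`, a stand-in for NLTK's English list) and a hypothetical helper `drop_stop_words`; the likely motivation for keeping 'not' is that negation survives in the cleaned text, though the notebook does not state this.

```python
# Stand-in stop list for illustration only; the notebook uses NLTK's English list.
base_stop = ["i", "do", "not", "s", "a", "the", "to"]

custom_stop = list(base_stop)
custom_stop.append("may")   # treat 'may' as a stop word
custom_stop.remove("not")   # keep negation in the text
custom_stop.remove("s")     # no longer filter out a lone 's'

def drop_stop_words(tokens, stop_words):
    """Hypothetical helper: return the tokens that are not stop words."""
    lookup = set(stop_words)  # a set makes each membership test cheap
    return [t for t in tokens if t not in lookup]

tokens = "i may not go to the party".split()
filtered = drop_stop_words(tokens, custom_stop)
print(filtered)  # ['not', 'go', 'party']
```

Note that `stop.remove(...)` raises `ValueError` if the word is not in the list, so the cell above fails when re-run on an already modified list.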

Defining the stemmer

In [None]:
lam = PorterStemmer()

Converting the Text column to strings

In [None]:
data['Text'] = data['Text'].astype(str)

Defining the preprocessing steps for the text: lowercasing, replacing every non-letter character with a space, tokenizing, removing the stop words, and stemming each remaining word

In [None]:
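def process(txt):
    # Lowercase, then replace every non-letter (digits, punctuation, accents) with a space
    txt = txt.lower()
    txt = re.sub(r'[^a-zA-Z]', ' ', txt)
    # Tokenize and drop the stop words
    words = word_tokenize(txt)
    words = [w for w in words if w not in stop]
    # Reduce each remaining word to its Porter stem
    stems = [lam.stem(word) for word in words]
    return " ".join(stems)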
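
To make the cleaning step concrete, here is a minimal sketch of the first half of `process` only (lowercasing and the regular expression). The name `clean_tokens` is hypothetical, and `str.split` stands in for `word_tokenize` so that the example runs without any NLTK downloads.

```python
import re

def clean_tokens(txt):
    """Lowercase, replace every non-letter with a space, split on whitespace."""
    txt = txt.lower()
    txt = re.sub(r'[^a-zA-Z]', ' ', txt)
    return txt.split()  # stand-in for word_tokenize

sample = "Free entry in 2 a wkly comp!"
tokens = clean_tokens(sample)
print(tokens)                 # ['free', 'entry', 'in', 'a', 'wkly', 'comp']
print(clean_tokens("don't"))  # ['don', 't']
```

Digits are discarded along with punctuation, and an apostrophe splits a contraction into two tokens, so contractions only match the stop list through fragments such as 'don' and 't'.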

Applying the function to the text

In [None]:
cdata = data['Text'].apply(process)
print(cdata)

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri wkli comp win fa cup final tkt st t...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    nd time tri contact u u pound prize claim easi...
5568                                b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: Text, Length: 5572, dtype: object


Vectorizing the text as word counts, keeping only the 200 most frequent terms as features

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
res = CountVectorizer(max_features = 200)
a = res.fit_transform(cdata)
a = a.toarray()
print(a)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Encoding the target variable and saving it in a new column

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['label'] = le.fit_transform(data['result'])

In [None]:
print(data)

     result                                               Text  label
0       ham  Go until jurong point, crazy.. Available only ...      0
1       ham                      Ok lar... Joking wif u oni...      0
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...      1
3       ham  U dun say so early hor... U c already then say...      0
4       ham  Nah I don't think he goes to usf, he lives aro...      0
...     ...                                                ...    ...
5567   spam  This is the 2nd time we have tried 2 contact u...      1
5568    ham               Will ü b going to esplanade fr home?      0
5569    ham  Pity, * was in mood for that. So...any other s...      0
5570    ham  The guy did some bitching but I acted like i'd...      0
5571    ham                         Rofl. Its true to its name      0

[5572 rows x 3 columns]
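
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why 'ham' becomes 0 and 'spam' becomes 1 in the table above. A small sketch on a hand-written label list shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["ham", "spam", "ham"])

print(encoder.classes_.tolist())                 # ['ham', 'spam']
print(codes.tolist())                            # [0, 1, 0]
print(encoder.inverse_transform([1]).tolist())   # ['spam']
```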


In [None]:
y = data['label']

Splitting the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(a, y, test_size=0.2, random_state = 0)

In [None]:
print(x_train)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
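
In the cells above, the vectorizer is fitted on the full corpus before the split, so the test messages influence which 200 terms are kept. One common alternative, sketched here on made-up messages, is to split the raw text first and put the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from the training portion only. The `stratify` argument, which keeps the ham/spam ratio equal in both parts, is an addition not used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already cleaned messages (1 = spam, 0 = ham)
texts = ["free prize claim", "win free cash", "free entri win", "claim cash prize",
         "see home later", "ok come home", "go home late", "ok see later"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# The vectorizer is fitted inside the pipeline, on the training text only
pipe = make_pipeline(CountVectorizer(max_features=200), LogisticRegression())
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(len(X_train), len(X_test))  # 6 2
```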


# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model_lin = LogisticRegression()
model_lin.fit(x_train, y_train)

In [None]:
y_pred_lin = model_lin.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
print(confusion_matrix(y_test, y_pred_lin))
print(classification_report(y_test, y_pred_lin))
accuracy_score(y_test, y_pred_lin)

[[953   2]
 [ 24 136]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       0.99      0.85      0.91       160

    accuracy                           0.98      1115
   macro avg       0.98      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



0.9766816143497757
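
The figures in the report follow directly from the confusion matrix, whose rows are the true labels (0 = ham, 1 = spam) and whose columns are the predictions. The arithmetic for the spam class can be checked by hand:

```python
# Values copied from the logistic regression confusion matrix above
tn, fp = 953, 2    # true ham:  predicted ham, predicted spam
fn, tp = 24, 136   # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)  # share of predicted spam that really is spam
recall_spam = tp / (tp + fn)     # share of real spam that was caught

print(round(accuracy, 4))        # 0.9767
print(round(precision_spam, 2))  # 0.99
print(round(recall_spam, 2))     # 0.85
```

The high accuracy is driven largely by the 955 ham messages; the recall of 0.85 shows that 24 of the 160 spam messages in the test set were missed.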

# Naive-Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
model_nb = GaussianNB()
model_nb.fit(x_train, y_train)

In [None]:
y_pred_nb = model_nb.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))

[[457 498]
 [  4 156]]
              precision    recall  f1-score   support

           0       0.99      0.48      0.65       955
           1       0.24      0.97      0.38       160

    accuracy                           0.55      1115
   macro avg       0.61      0.73      0.51      1115
weighted avg       0.88      0.55      0.61      1115
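
`GaussianNB` models every feature as a normally distributed value, which is a poor match for sparse word counts and probably explains the 498 ham messages flagged as spam. scikit-learn also provides `MultinomialNB`, which is designed for count features. The sketch below shows its use on a made-up count matrix; it is not a result for this dataset, and the column names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix; columns are counts of the hypothetical terms
# ['free', 'win', 'home', 'ok'] in four made-up messages.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 2, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = MultinomialNB().fit(X, y)
pred = model.predict(np.array([[1, 1, 0, 0],
                               [0, 0, 1, 2]]))
print(pred.tolist())  # [1, 0]
```

In the notebook, the equivalent change would be to fit `MultinomialNB()` on `x_train` and `y_train` in place of `GaussianNB()`.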



# KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier(n_neighbors = 5)
model_knn.fit(x_train, y_train)

In [None]:
y_pred_knn = model_knn.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

[[955   0]
 [ 61  99]]
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       955
           1       1.00      0.62      0.76       160

    accuracy                           0.95      1115
   macro avg       0.97      0.81      0.87      1115
weighted avg       0.95      0.95      0.94      1115



# SVM linear

In [None]:
from sklearn.svm import SVC
model_svm = SVC(kernel = 'linear')
model_svm.fit(x_train, y_train)

In [None]:
y_pred_svm = model_svm.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

[[951   4]
 [ 16 144]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       0.97      0.90      0.94       160

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



# Gaussian SVM

In [None]:
from sklearn.svm import SVC
model_svmk = SVC(kernel = 'rbf')
model_svmk.fit(x_train, y_train)

In [None]:
y_pred_svmk = model_svmk.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_svmk))
print(classification_report(y_test, y_pred_svmk))

[[955   0]
 [ 19 141]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       1.00      0.88      0.94       160

    accuracy                           0.98      1115
   macro avg       0.99      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115



# Decision tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier(random_state = 0)
model_dt.fit(x_train, y_train)

In [None]:
y_pred_dt = model_dt.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

[[935  20]
 [ 15 145]]
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       955
           1       0.88      0.91      0.89       160

    accuracy                           0.97      1115
   macro avg       0.93      0.94      0.94      1115
weighted avg       0.97      0.97      0.97      1115



# Random forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators = 20, random_state = 0)
model_rf.fit(x_train, y_train)

In [None]:
y_pred_rf = model_rf.predict(x_test)

In [None]:
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

[[946   9]
 [ 14 146]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       955
           1       0.94      0.91      0.93       160

    accuracy                           0.98      1115
   macro avg       0.96      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
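
With all seven models evaluated, the confusion matrices can be put side by side. The sketch below copies the matrices printed above and ranks the models by F1 score on the spam class; choosing spam F1 as the ranking criterion is an assumption, and a different criterion (for example, fewest false positives) would give a different order.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
results = {
    "Logistic Regression":  [[953, 2], [24, 136]],
    "Gaussian Naive Bayes": [[457, 498], [4, 156]],
    "KNN (k=5)":            [[955, 0], [61, 99]],
    "SVM (linear)":         [[951, 4], [16, 144]],
    "SVM (RBF)":            [[955, 0], [19, 141]],
    "Decision Tree":        [[935, 20], [15, 145]],
    "Random Forest":        [[946, 9], [14, 146]],
}

def spam_f1(cm):
    """F1 for the spam class: 2*TP / (2*TP + FP + FN)."""
    (tn, fp), (fn, tp) = cm
    return 2 * tp / (2 * tp + fp + fn)

ranking = sorted(results, key=lambda name: spam_f1(results[name]), reverse=True)
for name in ranking:
    (tn, fp), (fn, tp) = results[name]
    print(f"{name:22s} F1={spam_f1(results[name]):.3f}  "
          f"ham flagged as spam={fp}  spam missed={fn}")
```

On this split the two SVMs come out on top and Gaussian Naive Bayes last, while the decision tree used for the interactive prediction at the end ranks fifth of seven.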



Classifying an SMS entered by the user

In [None]:
s = input('enter the sms:')

# predict() returns an array, so take its single element before comparing
pred = model_dt.predict(res.transform([process(s)]))[0]

if pred == 1:
  print('Spam')
else:
  print('not Spam')

enter the sms:Congratulations! You won a free ticket.
not Spam
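
The prediction step can be wrapped in a small helper so that any of the fitted models can be swapped in and compared on the same message. The sketch below is self-contained: `classify_sms` is a hypothetical name, the training messages are made up, and `str.lower` stands in for the notebook's `process` function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def classify_sms(text, preprocess, vectorizer, model):
    """Hypothetical helper: clean, vectorize and classify one message."""
    features = vectorizer.transform([preprocess(text)])
    return "Spam" if model.predict(features)[0] == 1 else "not Spam"

# Made-up training data (1 = spam, 0 = ham)
train_texts = ["free prize claim", "win free cash", "see you home", "ok come home"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
tree = DecisionTreeClassifier(random_state=0).fit(
    vec.fit_transform(train_texts), train_labels)

label = classify_sms("FREE cash", str.lower, vec, tree)
print(label)  # Spam
```

In the notebook, the call would take the form `classify_sms(s, process, res, model_svmk)`, with any other fitted model in place of `model_svmk`.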


In [None]:
#s = 'Congratulations! You won a free ticket.'

