# SMS Spam Ham Classification Project using NLP Techniques and Machine Learning Algorithms

## Loading the Data

In [1]:
import pandas as pd 
import re

In [2]:
messages = pd.read_csv(r"smsspamcollection\SMSSpamCollection",sep= "\t",names=["label","message"])

In [3]:
messages.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
messages.shape

(5572, 2)

In [70]:
messages.isnull().sum()

label      0
message    0
dtype: int64

In [5]:
import nltk
# nltk.download("stopwords") and nltk.download("wordnet") are only needed if these corpora are not already downloaded

In [6]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [7]:
len(messages)

5572

In [8]:
messages["message"]

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: message, Length: 5572, dtype: object

In [9]:
corpus = []
# use a separate name so the imported nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words("english"))

for i in range(len(messages)):
    retext = re.sub("[^a-zA-Z]", " ", messages["message"][i])  # keep letters only
    retext = retext.lower()
    words = retext.split()
    # drop stop words, then lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    cleaned_text = " ".join(words)
    corpus.append(cleaned_text)
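To see what each step of the loop does to a single message, here is a minimal sketch that uses only the standard library. The `clean_message` helper and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stop-word list and `WordNetLemmatizer`, and lemmatization is left out of this sketch.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"i", "he", "to", "the", "in", "a", "until", "only"}


def clean_message(text, stop_words=TOY_STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)


print(clean_message("Free entry in 2 a wkly comp to win FA Cup"))
print(repr(clean_message("2 + 2 = 4")))
```

The second call returns an empty string: a message with no letters leaves nothing behind, which is why empty documents can appear in the corpus.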


In [None]:
corpus  # the cleaned messages, which serve as our independent variable X

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though',
 'freemsg hey darling week word back like fun still tb ok xxx std chgs send rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press copy friend callertune',
 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour',
 'mobile month u r entitled update latest colour mobile camera free call mobile update co free',
 'gonna home soon want talk stuff anymore tonight k cried enough today',
 'six chance win cash pound txt csh send cost p day day tsandcs apply reply hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw'

## Segregating Independent and Dependent Variables

In [None]:
# One-hot encoding the label gives a "ham" and a "spam" column; one of them is enough,
# so we keep only the "spam" column: 1 means the SMS is spam, 0 means it is ham (not spam).
y = pd.get_dummies(messages["label"])["spam"]
y

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: spam, Length: 5572, dtype: uint8

## Splitting the Dataset into Train and Test Data

In [19]:
import sklearn
from sklearn.model_selection import train_test_split

X_train ,X_test,y_train,y_test = train_test_split(corpus, y, test_size= 0.3, random_state= 3)

## Vectorization Methods

### 1) Bag of Words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# keep the 2000 most frequent features, drawn from unigrams, bigrams and trigrams
cv = CountVectorizer(max_features=2000, ngram_range=(1, 3))
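As a rough illustration of what `ngram_range=(1,3)` means, the sketch below builds unigrams, bigrams and trigrams by hand and counts them over a toy corpus. The `word_ngrams` helper and `toy_docs` are hypothetical; `CountVectorizer` uses its own tokenizer (which by default ignores one-character tokens), so this is a simplification and not its actual implementation.

```python
from collections import Counter


def word_ngrams(text, min_n=1, max_n=3):
    """Return all contiguous word n-grams of length min_n..max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Toy corpus (made-up data, not taken from the dataset).
toy_docs = ["text stop", "going home", "going home text stop"]

# Corpus-wide frequency of every n-gram; max_features keeps the most frequent ones.
counts = Counter(gram for doc in toy_docs for gram in word_ngrams(doc))

print(word_ngrams("going home text"))
print(counts["going home"], counts["home text stop"])
```

With `max_features=2000`, only the 2000 n-grams with the highest corpus-wide counts would be kept as columns.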

Text preprocessing can leave some messages empty (for example, a message made up only of digits or stop words), and any step that drops such rows would put X and y out of sync.
So it is always good to check whether the train and test splits of X and y have the same length.

In [25]:
print(len(X_train),len(y_train),len(X_test),len(y_test))

3900 3900 1672 1672
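Equal lengths do not reveal how many cleaned messages ended up empty. A minimal sketch of that extra check, where `find_empty_documents`, `toy_corpus` and `toy_labels` are hypothetical stand-ins for the notebook's `corpus` and `y`:

```python
def find_empty_documents(documents):
    """Return the positions of documents that are empty after cleaning."""
    return [i for i, doc in enumerate(documents) if not doc.strip()]


# Toy stand-ins for the cleaned corpus and its labels (made-up values).
toy_corpus = ["free entry win", "", "ok lar", "   "]
toy_labels = [1, 0, 0, 0]

# Features and labels must stay aligned one-to-one.
assert len(toy_corpus) == len(toy_labels)

empty_positions = find_empty_documents(toy_corpus)
print(empty_positions)
```

Empty documents become all-zero rows after vectorization, so it can be worth knowing how many there are before judging the model.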


In [36]:
X_train_cv = cv.fit_transform(X_train).toarray()
X_test_cv = cv.transform(X_test).toarray()

In [None]:
X_train_cv  # plain bag of words, so each entry is how often a feature occurs in a message

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [49]:
print(X_train_cv.shape)
print(X_test_cv.shape)


(3900, 2000)
(1672, 2000)


In [None]:
cv.vocabulary_  # maps each kept feature (a one- to three-word combination) to its column index

{'decide': 401,
 'co': 291,
 'si': 1542,
 'going': 658,
 'home': 767,
 'liao': 918,
 'going home': 659,
 'heard': 736,
 'call': 184,
 'night': 1159,
 'make': 999,
 'like': 923,
 'last': 887,
 'time': 1720,
 'xx': 1979,
 'luv': 994,
 'net': 1138,
 'pls': 1283,
 'enough': 491,
 'family': 524,
 'hot': 778,
 'sun': 1645,
 'place': 1263,
 'reason': 1391,
 'invited': 830,
 'actually': 12,
 'go': 641,
 'wait': 1860,
 'serious': 1509,
 'yup': 1999,
 'wun': 1971,
 'believe': 118,
 'wat': 1882,
 'really': 1390,
 'neva': 1142,
 'msg': 1107,
 'sent': 1506,
 'shuhui': 1541,
 'hi': 747,
 'always': 44,
 'online': 1204,
 'yahoo': 1983,
 'would': 1964,
 'chat': 257,
 'yes': 1989,
 'great': 686,
 'told': 1733,
 'kallis': 856,
 'best': 119,
 'world': 1959,
 'tough': 1748,
 'get': 617,
 'die': 418,
 'want': 1872,
 'stuff': 1637,
 'hey': 743,
 'horny': 775,
 'see': 1481,
 'naked': 1123,
 'text': 1682,
 'charged': 255,
 'pm': 1293,
 'unsubscribe': 1806,
 'stop': 1623,
 'text stop': 1687,
 'still': 1622,
 'd

### 2) TF-IDF Vectorization

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=2000,ngram_range=(1,3))

In [45]:
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
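For intuition about the values in these matrices, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the weighting scikit-learn documents as its default: raw term counts multiplied by a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The `tfidf_rows` helper and `toy_docs` are hypothetical, and only unigrams are used to keep it short.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """TF-IDF with smoothed IDF and L2-normalised rows, over unigrams."""
    tokenised = [doc.split() for doc in documents]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy corpus (made-up data).
toy_docs = ["free prize call", "call home", "free free prize"]
vocab, rows = tfidf_rows(toy_docs)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second toy document, "home" gets a larger weight than "call" because it appears in fewer documents, which is the effect TF-IDF adds on top of plain counts.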

In [46]:
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [48]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(3900, 2000)
(1672, 2000)


In [50]:
tfidf.vocabulary_

{'decide': 401,
 'co': 291,
 'si': 1542,
 'going': 658,
 'home': 767,
 'liao': 918,
 'going home': 659,
 'heard': 736,
 'call': 184,
 'night': 1159,
 'make': 999,
 'like': 923,
 'last': 887,
 'time': 1720,
 'xx': 1979,
 'luv': 994,
 'net': 1138,
 'pls': 1283,
 'enough': 491,
 'family': 524,
 'hot': 778,
 'sun': 1645,
 'place': 1263,
 'reason': 1391,
 'invited': 830,
 'actually': 12,
 'go': 641,
 'wait': 1860,
 'serious': 1509,
 'yup': 1999,
 'wun': 1971,
 'believe': 118,
 'wat': 1882,
 'really': 1390,
 'neva': 1142,
 'msg': 1107,
 'sent': 1506,
 'shuhui': 1541,
 'hi': 747,
 'always': 44,
 'online': 1204,
 'yahoo': 1983,
 'would': 1964,
 'chat': 257,
 'yes': 1989,
 'great': 686,
 'told': 1733,
 'kallis': 856,
 'best': 119,
 'world': 1959,
 'tough': 1748,
 'get': 617,
 'die': 418,
 'want': 1872,
 'stuff': 1637,
 'hey': 743,
 'horny': 775,
 'see': 1481,
 'naked': 1123,
 'text': 1682,
 'charged': 255,
 'pm': 1293,
 'unsubscribe': 1806,
 'stop': 1623,
 'text stop': 1687,
 'still': 1622,
 'd

## Model Building

* Logistic regression is a common baseline for classification: it learns a linear decision boundary between the classes, so we start with it.

* Naive Bayes is known to work well on text represented with frequency-based vectorization methods such as TF-IDF and BoW, so we try it as well.
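To make the second point concrete, here is a from-scratch sketch of multinomial Naive Bayes on word counts with Laplace smoothing. It is meant only to show the idea behind the classifier, not how scikit-learn's `MultinomialNB` is implemented; `train_multinomial_nb`, `predict` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = sorted({tok for doc in documents for tok in doc.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    model = {"vocab": set(vocab), "log_prior": {}, "log_like": {}}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(documents))
        # Smoothed denominator: total words in the class plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_like"][label] = {
            tok: math.log((word_counts[label][tok] + alpha) / total) for tok in vocab
        }
    return model


def predict(model, document):
    """Return the label with the highest posterior log-score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for tok in document.split():
            if tok in model["vocab"]:  # words never seen in training are ignored
                score += model["log_like"][label][tok]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (made up): 1 = spam, 0 = ham.
toy_docs = ["free prize call now", "win free cash", "ok see you home", "call me later home"]
toy_labels = [1, 1, 0, 0]
model = train_multinomial_nb(toy_docs, toy_labels)

print(predict(model, "free cash prize"))
print(predict(model, "see you later"))
```

Because each word contributes its count-based likelihood independently, the model maps naturally onto bag-of-words features.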

### Logistic Regression

In [51]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()

#### Logistic Regression on Bag of Words (cv)


In [52]:
log_reg.fit(X_train_cv,y_train)
y_pred = log_reg.predict(X_test_cv)

In [53]:
from sklearn.metrics import classification_report,accuracy_score
print(accuracy_score(y_test,y_pred))

0.9814593301435407


In [54]:
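# The arguments are passed as (y_pred, y_test), whereas scikit-learn expects (y_true, y_pred),
# so the precision and recall columns below are swapped and "support" counts predicted labels.
print(classification_report(y_pred,y_test))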

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1465
           1       0.88      0.99      0.93       207

    accuracy                           0.98      1672
   macro avg       0.94      0.98      0.96      1672
weighted avg       0.98      0.98      0.98      1672
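scikit-learn documents the signature as `classification_report(y_true, y_pred)`, so the argument order matters when reading the precision and recall columns. The sketch below computes both metrics by hand on made-up labels and shows that swapping the arguments swaps the two values; `precision_recall`, `toy_true` and `toy_pred` are hypothetical names used for illustration only.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: four true spam messages, only two of them caught.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall(toy_true, toy_pred))  # intended order: (y_true, y_pred)
print(precision_recall(toy_pred, toy_true))  # swapped order
```

The F1 score and accuracy are symmetric in the two arguments, which is why they are unaffected by the order.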



#### Logistic Regression on TF-IDF


In [56]:
log_reg.fit(X_train_tfidf,y_train)
y_pred = log_reg.predict(X_test_tfidf)

In [None]:
print(accuracy_score(y_test,y_pred))

0.9748803827751196


In [58]:
# arguments are in (y_pred, y_test) order, so precision and recall appear swapped in the output
print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99      1480
           1       0.82      0.99      0.90       192

    accuracy                           0.97      1672
   macro avg       0.91      0.98      0.94      1672
weighted avg       0.98      0.97      0.98      1672



### Naive Bayes

#### Naive Bayes on Bag of Words (cv)

In [59]:
from sklearn.naive_bayes import MultinomialNB
nb= MultinomialNB()

In [66]:
nb.fit(X_train_cv,y_train)
y_pred = nb.predict(X_test_cv)

In [67]:
print(accuracy_score(y_test,y_pred))

0.9838516746411483


In [68]:
# arguments are in (y_pred, y_test) order, so precision and recall appear swapped in the output
print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1447
           1       0.93      0.96      0.94       225

    accuracy                           0.98      1672
   macro avg       0.96      0.97      0.97      1672
weighted avg       0.98      0.98      0.98      1672



#### Naive Bayes on TF-IDF

In [63]:
nb.fit(X_train_tfidf,y_train)
y_pred = nb.predict(X_test_tfidf)

In [64]:
print(accuracy_score(y_test,y_pred))

0.9772727272727273


In [65]:
# arguments are in (y_pred, y_test) order, so precision and recall appear swapped in the output
print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99      1476
           1       0.84      0.99      0.91       196

    accuracy                           0.98      1672
   macro avg       0.92      0.98      0.95      1672
weighted avg       0.98      0.98      0.98      1672



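* From the results above, Multinomial Naive Bayes with Bag of Words (CountVectorizer) performs best on the SMS spam/ham classification task, with an accuracy of about 98.4%, ahead of logistic regression on Bag of Words (98.1%), Naive Bayes on TF-IDF (97.7%) and logistic regression on TF-IDF (97.5%).
* For this model the report shows a precision and recall of 0.99 for the ham class. For the spam class it lists 0.93 and 0.96; because the report was called as `classification_report(y_pred, y_test)`, those two columns are interchanged, so spam precision is 0.96 and spam recall is 0.93.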

I have skipped the regular EDA and feature engineering, but it is highly recommended. I built this project to practice the NLP concepts I have learned.