<a href="https://colab.research.google.com/github/datascientist-hist/Spam_Messages_Classification/blob/main/Spam_Messages_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare environment and dataset

In [4]:
! pip install plotly_express

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly_express as px
import plotly.figure_factory as ff
import wordcloud
import nltk
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

In [6]:
data=pd.read_csv('Spam_Classification.csv')


In [7]:
data.shape

(5572, 2)

In [8]:
data.columns

Index(['Category', 'Message'], dtype='object')

In [9]:
data.head(20)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [10]:
data['Category'].value_counts().to_dict()

{'ham': 4825, 'spam': 747}

In [11]:
fig = px.histogram(data, x="Category", color="Category",
                   color_discrete_sequence=["#871fff","#ffa78c"])
fig.show()

The dataset is imbalanced. It contains:
- 747 messages labelled spam
- 4825 messages labelled ham
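
To put the imbalance in numbers, the short sketch below computes each class's share from the counts printed by `value_counts()`, together with the accuracy that a trivial "always ham" classifier would reach. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the majority class
majority_baseline = max(counts.values()) / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model evaluated on accuracy alone should be compared against this baseline rather than against 50%.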

# Feature Engineering

I am going to clean the dataset and then add some features, such as:
- text length


In [13]:
# add a length column: number of characters in each message
data['length'] = data['Message'].apply(len)



In [14]:
data.head()

| | Category | Message | length |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 |


Next, we check whether message length differs between spam and ham messages.

In [15]:

lis=[data.length[data['Category']=='ham'], data.length[data['Category']=='spam']]
group_labels=['ham','spam']
colors = ['#003f5c', '#ffa600']
# Create distplot
fig = ff.create_distplot(lis, group_labels, bin_size=20,show_rug=False,
                         curve_type='kde', # 'kde' is the default; 'normal' is the alternative
                         colors=colors)

# Restrict the x-axis and add a title
fig.update_layout(xaxis_range=[0,300])
fig.update_layout(title_text='Message length distribution by category (KDE)')
fig.show()

In [16]:

mean_spam=data.length[data['Category']=='spam'].mean()
mean_ham=data.length[data['Category']=='ham'].mean()
sd_spam=data.length[data['Category']=='spam'].std()
sd_ham=data.length[data['Category']=='ham'].std()
print('the average length for spam is :',round(mean_spam,2),'with standard deviation :',sd_spam)
print('the average length for ham is :',round(mean_ham,2),'with standard deviation :',sd_ham)

the average length for spam is : 137.99 with standard deviation : 29.9802865150208
the average length for ham is : 71.45 with standard deviation : 58.4348642857575
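
The same per-category statistics can also be obtained in one call with `groupby`. The sketch below uses a tiny made-up DataFrame (the messages are hypothetical stand-ins, not rows from the dataset) so that it runs on its own.

```python
import pandas as pd

# Toy messages standing in for the real dataset (hypothetical examples)
toy = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "spam"],
    "Message": ["ok", "see you", "WIN a prize now", "Free entry!!"],
})
toy["length"] = toy["Message"].str.len()

# One row per category: mean, standard deviation and median of the length
summary = toy.groupby("Category")["length"].agg(["mean", "std", "median"])
print(summary.round(2))
```

Adding the median is a cheap check on whether a few very long messages are pulling the mean upwards.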


To better understand the dataset, I will now look at the most frequent words in each category. To do that, I am going to use the wordcloud library, which renders them as an image. Let's see:

In [17]:
#dividing the dataset
data_ham  = data[data['Category']=='ham'].copy()
data_spam = data[data['Category']=='spam'].copy()

def show_wordcloud(df, title):
    text = ' '.join(df['Message'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords, background_color="#ffa78c",
                                        width = 3000, height = 2000).generate(text)
    plt.figure(figsize=(15,15), frameon=True)
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()

In [None]:
#create the image for Spam messages
show_wordcloud(data_spam, "Spam messages\n")

In [None]:
#create the image for Ham messages
show_wordcloud(data_ham, "Ham messages\n")

Now I have a first idea of the most frequently used words in each category, so we can continue with preprocessing the data.

# Preprocess the data

In this step I perform the following operations:
- convert the label feature into a numerical feature
- replace email and web addresses with placeholder tokens
- replace phone numbers with a placeholder token
- replace numbers with a placeholder token
- replace currency symbols with a placeholder token
- remove punctuation and redundant whitespace
- convert all text to lowercase

In [None]:
data['class_label'] = data['Category'].map( {'spam': 1, 'ham': 0})

In [None]:
# Replace email addresses with 'emailaddress'
data['Message'] = data['Message'].str.replace(r'[^\s@]+@[^\s@]+\.[a-zA-Z]{2,}', 'emailaddress', regex=True)

# Replace URLs with 'webaddress'
data['Message'] = data['Message'].str.replace(r'https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?', 'webaddress', regex=True)

# Replace money symbols with 'money-symbol'
data['Message'] = data['Message'].str.replace(r'£|\$', 'money-symbol', regex=True)

# Replace 10-digit phone numbers with 'phone-number'
data['Message'] = data['Message'].str.replace(r'(?<!\d)\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}(?!\d)', 'phone-number', regex=True)

# Replace any remaining number with 'number'
data['Message'] = data['Message'].str.replace(r'\d+(\.\d+)?', 'number', regex=True)

# Replace punctuation with a space
data['Message'] = data['Message'].str.replace(r'[^\w\d\s]', ' ', regex=True)

# Collapse runs of whitespace into a single space
data['Message'] = data['Message'].str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
data['Message'] = data['Message'].str.strip()

# Convert all text to lowercase
data['Message'] = data['Message'].str.lower()
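
To see what this kind of cleaning does to a single message, here is a minimal sketch written with the standard `re` module. The `normalize` helper, its simplified patterns and the sample sentence are illustrative assumptions, not the notebook's exact expressions.

```python
import re

def normalize(text: str) -> str:
    """Apply placeholder substitutions to one string (illustrative helper)."""
    text = re.sub(r"[^\s@]+@[^\s@]+\.[a-zA-Z]{2,}", "emailaddress", text)
    text = re.sub(r"https?://\S+", "webaddress", text)
    text = re.sub(r"£|\$", "money-symbol", text)
    text = re.sub(r"\d+(\.\d+)?", "number", text)
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace
    return text.lower()

sample = "Claim 2 FREE tickets!! Visit http://example.com or mail win@example.org"
cleaned = normalize(sample)
print(cleaned)
```

Trying the patterns on a handful of hand-written strings like this is a quick way to catch expressions that silently match nothing.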

Going forward, we'll remove stop words from the message content. Stop words are very common words that carry little meaning on their own and are often filtered out by search engines and text-processing pipelines, such as "the", "a", "an", "in", "but" and "because".

In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data['Message'] = data['Message'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

Next, we will extract the base form of words by removing affixes from them. This is called stemming. There are numerous stemming algorithms; I'll use the Snowball stemmer.


<a href="https://ibb.co/RpqbT1p"><img src="https://i.ibb.co/nsFmMTs/stopword.png" alt="stopword" border="0"></a>

In [None]:
ss = nltk.SnowballStemmer("english")
data['Message'] = data['Message'].apply(lambda x: ' '.join(ss.stem(term) for term in x.split()))
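
As a small illustration of what stemming does, the sketch below runs the English Snowball stemmer on a few hand-picked words (the word list is an assumption chosen for the example). Note that a stem is not necessarily a dictionary word.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Inflected forms of the same word collapse to one stem
for word in ["running", "runs", "messages", "winner"]:
    print(word, "->", stemmer.stem(word))
```

Because different surface forms map to the same stem, the vocabulary shrinks and the resulting feature matrix becomes less sparse.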

Machine learning algorithms cannot work with raw text directly. The text must be converted into numbers.
First, we create a Bag of Words (BOW) model to extract features from text:

In [None]:
import nltk
nltk.download('punkt')

In [None]:
sms_df = data['Message']
from nltk.tokenize import word_tokenize

# building the vocabulary for a bag-of-words model
all_words = []
for sms in sms_df:
    # tokenization splits a phrase into individual words
    words = word_tokenize(sms)
    for w in words:
        all_words.append(w)

# counting the number of occurrences of each word
all_words = nltk.FreqDist(all_words)

In [None]:
print('Number of words: {}'.format(len(all_words)))
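
The frequency table above counts words over the whole corpus. A bag-of-words representation goes one step further and stores one count vector per message. The following standard-library sketch shows the idea on three made-up, already-cleaned messages; the names `docs`, `vocab` and `vectors` are illustrative.

```python
from collections import Counter

# Toy, already-cleaned messages (hypothetical)
docs = ["free entri win", "ok see later", "win free prize free"]

# Vocabulary: every distinct token, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[tok] for tok in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one vocabulary entry, so word order within a message is discarded.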

Now I'll plot the 10 most common words in the text data:

In [None]:
all_words.plot(10, title='Top 10 Most Common Words in Corpus');

Next, we will apply an NLP technique, term frequency-inverse document frequency (Tf-Idf), to evaluate how important each word is in the text data. In short, it quantifies what a "relevant word" is.

This technique improves on raw count vectors and is widely used in search technologies. It captures:

- How frequently a word/term Wi appears in a document dj. This can be written as Tf(Wi, dj).

- How many documents in the entire corpus D contain the same word/term. This can be written as df(Wi, D).

- Idf, which measures how infrequently the word Wi occurs in the corpus D. It grows as df(Wi, D) shrinks.

With that information, we compute the Tf-Idf score as the product of the tf and idf values.

One drawback is that this technique ignores the context in which a word appears.
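
To make the tf and idf product concrete, here is a minimal hand-rolled sketch on three toy documents. It assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` documents for its default settings (`smooth_idf=True`, `norm='l2'`). The documents and the names `doc_freq` and `tfidf` are made up for the example.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical)
docs = [["free", "win", "free"], ["ok", "see", "later"], ["win", "prize"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 normalization
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "free" outweighs "win" both because it occurs twice and because it appears in fewer documents.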

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_model = TfidfVectorizer()
tfidf_vec=tfidf_model.fit_transform(sms_df)

tfidf_data=pd.DataFrame(tfidf_vec.toarray(),columns=tfidf_model.get_feature_names_out())
tfidf_data.head()



In [None]:
tfidf_data.shape

Since I will use K-fold cross-validation, I don't need to split the dataset into train and test sets. I only need:

- X dataset
- Y dataset
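
One caveat with cross-validation is that a vectorizer fitted on the full dataset has already seen the messages that later land in a validation fold. A possible way around this, sketched below under the assumption that the raw text and the length column are kept in one DataFrame, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that TF-IDF is refitted inside every fold. The toy messages and the names `toy`, `features` and `model` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the real DataFrame (hypothetical messages)
toy = pd.DataFrame({
    "Message": ["free prize win now", "see you later", "win cash free entry",
                "ok talk soon", "free entry win prize", "lunch at noon",
                "claim free cash now", "call me later"],
    "class_label": [1, 0, 1, 0, 1, 0, 1, 0],
})
toy["length"] = toy["Message"].str.len()

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "Message"),   # text column -> TF-IDF
    ("length", "passthrough", ["length"]),     # numeric column kept as-is
])
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=2000))])

# Stratified folds keep the spam/ham ratio similar in every split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
scores = cross_val_score(model, toy[["Message", "length"]], toy["class_label"], cv=cv)
print(scores, scores.mean())
```

With this layout the vocabulary and idf weights are learned from the training part of each fold only.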

In [None]:
data.columns

In [None]:
X= data['length']
Y=data['class_label']

In [None]:
X=pd.concat([X,tfidf_data],axis=1)

In [None]:
X.head()

# Model Building
I will use a series of models:
- Naive Bayes (as a baseline)
- Logistic Regression
- Random Forest
- XGBoost

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
#I usually use Naive Bayes as a baseline for my classification tasks 
gnb = GaussianNB()
cv = cross_val_score(gnb,X,Y,cv=5)
print(cv)
print(cv.mean())

In [None]:
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X,Y,cv=5)
print(cv)
print(cv.mean())

In [None]:
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X,Y,cv=5)
print(cv)
print(cv.mean())
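
`cross_val_score` reports accuracy by default for classifiers, which can look flattering on an imbalanced dataset. As a hedged sketch, the block below uses `cross_validate` to collect precision, recall and F1 alongside accuracy. It runs on synthetic data from `make_classification` (roughly 87% / 13% classes) rather than on the notebook's features, and the names `X_demo` and `y_demo` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     weights=[0.87, 0.13], random_state=1)

metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = cross_validate(LogisticRegression(max_iter=2000), X_demo, y_demo,
                         cv=cv, scoring=metrics)

# Average each metric over the five folds
for metric in metrics:
    print(metric, round(results[f"test_{metric}"].mean(), 3))
```

Recall on the positive class is usually the number to watch here, since it shows how many spam messages slip through.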

In [None]:
#svc = SVC(probability = True)
#cv = cross_val_score(svc,X,Y,cv=5)
#print(cv)
#print(cv.mean())

# Tuning the model