# Spam Classification

No email address is safe from spam. Owning an email address means you are bound to receive spam. When you make a purchase or sign up to a website, you are usually required to give your email address, which may then be sold to marketers. Although spam does little harm if ignored, it remains a threat to unsuspecting people. It is advisable to exercise caution when interacting with mail that could be spam.

In [71]:
# Import dependencies ready for reading, visualization, and classification tasks.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [5]:
# Import data
# Dataset link : https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data?select=spam.csv

sms = pd.read_csv('../data/messages.csv', encoding='latin-1')
sms.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [6]:
# Explore data

sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [15]:
# Check rows where all columns are non-NaN.

sms[sms.notna().all(axis=1)].head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 281 | ham | \Wen u miss someone | the person is definitely special for u..... B... | why to miss them | just Keep-in-touch\" gdeve.." |
| 1038 | ham | Edison has rightly said, \A fool can ask more ... | GN | GE | GNT:-)" |
| 2255 | ham | I just lov this line: \Hurt me with the truth | I don't mind | i wil tolerat.bcs ur my someone..... But | Never comfort me with a lie\" gud ni8 and swe... |
| 3525 | ham | \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... | HAD A COOL NYTHO | TX 4 FONIN HON | CALL 2MWEN IM BK FRMCLOUD 9! J X\"" |
| 4668 | ham | When I was born, GOD said, \Oh No! Another IDI... | GOD said | \"OH No! COMPETITION\". Who knew | one day these two will become FREINDS FOREVER!" |


At first glance, the 'Unnamed' columns look worthless and expendable: each of them is less than 1% non-null. On closer inspection, however, they turn out to be continuations of the messages in 'v2'. It therefore makes sense to append the strings together to recover the full message. Some messages could probably be classified from the first part alone, but for the best reliability and accuracy we should always use the entire message.
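The "less than 1%" figure can be checked directly from the counts printed by `sms.info()`. The short calculation below is only a sanity check; the variable names are illustrative, and the numbers are copied by hand from the output above rather than read from the DataFrame.

```python
# Non-null counts copied from the sms.info() output above.
total_rows = 5572
non_null = {"Unnamed: 2": 50, "Unnamed: 3": 12, "Unnamed: 4": 6}

# Share of rows in which each continuation column holds a value.
shares = {col: count / total_rows for col, count in non_null.items()}

for col, share in shares.items():
    print(f"{col}: {share:.2%} non-null")
# Unnamed: 2: 0.90% non-null
# Unnamed: 3: 0.22% non-null
# Unnamed: 4: 0.11% non-null
```

Even the fullest of the three columns holds text in fewer than one row in a hundred, so only a small number of messages are affected by the split.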

In [35]:
# Create a column 'combined' that joins 'v2' with any non-NaN continuation columns.
# dropna() skips missing parts, so a gap in the middle never ends up as the text "nan".

text_cols = ['v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

sms['combined'] = sms[text_cols].apply(
    lambda row: ' '.join(row.dropna().astype(str)).strip(),
    axis=1
)

In [28]:
sms_com = sms[['v1', 'combined']].rename(columns={'v1': 'class', 'combined': 'message'})
sms_com.head()

|   | class | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Preprocessing

The next step is to preprocess the data to prepare it for modelling. With the null values already handled, the remaining steps are:

1. **Cleaning** (lowercasing and removing punctuation, numbers, etc.)
2. **Tokenization** (breaking the messages apart into individual words, a.k.a. tokens.)
3. Removing **Stopwords** (removing common, insignificant words such as "the", "is", "and", and others in the NLTK corpus.)
4. **Stemming** (stripping words down to their stems, e.g. "eating" --> "eat".)
5. **Lemmatization** (converting words to their dictionary forms, e.g. "mice" --> "mouse".)

In [48]:
# Function to clean the messages.

import re

def clean(msg):
    msg = msg.lower()  # Convert to lowercase
    msg = re.sub(r'http\S+|www\S+', '', msg)  # Remove URLs first, while they still contain their punctuation
    msg = re.sub(r'\d+', '', msg)  # Remove numbers
    msg = re.sub(r'[^\w\s]', '', msg)  # Remove punctuation
    return msg
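As a quick sanity check of the cleaning steps, the sketch below applies the same four substitutions to a made-up message. `clean_sketch` and the sample text are hypothetical and exist only for this illustration. URLs are removed before punctuation here, so that the URL pattern sees the address intact.

```python
import re

def clean_sketch(msg):
    """Standalone copy of the cleaning steps, for illustration only."""
    msg = msg.lower()                          # lowercase
    msg = re.sub(r'http\S+|www\S+', '', msg)   # drop URLs while still intact
    msg = re.sub(r'\d+', '', msg)              # drop digits
    msg = re.sub(r'[^\w\s]', '', msg)          # drop punctuation
    return msg

# Made-up message, not taken from the dataset.
sample = "WIN a FREE prize!!! Visit www.example.com now, call 0800123."
cleaned = clean_sketch(sample)

print(cleaned.split())
# ['win', 'a', 'free', 'prize', 'visit', 'now', 'call']
```

Removed pieces leave gaps behind (the cleaned string contains a double space and a trailing space), which is harmless because the tokenizer splits on whitespace anyway.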

In [51]:
sms_com['message'] = sms_com['message'].apply(clean)
sms_com.head()

|   | class | message |
|---|---|---|
| 0 | ham | go until jurong point crazy available only in ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry in a wkly comp to win fa cup final... |
| 3 | ham | u dun say so early hor u c already then say |
| 4 | ham | nah i dont think he goes to usf he lives aroun... |


In [66]:
# Tokenize

nltk.download('punkt_tab')

sms_com['tokens'] = sms_com['message'].apply(word_tokenize)
sms_com['tokens'].head()

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, a, wkly, comp, to, win, fa, ...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: tokens, dtype: object

In [67]:
# Remove stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

sms_com['tokens'] = sms_com['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
sms_com['tokens'].head()

0    [go, jurong, point, crazy, available, bugis, n...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, wkly, comp, win, fa, cup, final,...
3        [u, dun, say, early, hor, u, c, already, say]
4    [nah, dont, think, goes, usf, lives, around, t...
Name: tokens, dtype: object
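In the output above, `dont` survives the stopword filter. A likely explanation, assuming the stopword list stores the contraction with its apostrophe (`don't`), is that cleaning stripped the apostrophe before filtering, and the filter only matches exact strings. The sketch below illustrates the effect with a small hand-picked stopword set and hand-written token lists; `demo_stop_words` and `remove_stopwords` are hypothetical names, not part of the notebook, and the set is a stand-in rather than the full NLTK list.

```python
# Hypothetical stand-in for the real stopword list.
demo_stop_words = {"i", "he", "to", "don't"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not exact matches in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

# Hand-written token lists: with and without the apostrophe.
with_apostrophe = ["nah", "i", "don't", "think", "he", "goes", "to", "usf"]
without_apostrophe = ["nah", "i", "dont", "think", "he", "goes", "to", "usf"]

print(remove_stopwords(with_apostrophe, demo_stop_words))
# ['nah', 'think', 'goes', 'usf']
print(remove_stopwords(without_apostrophe, demo_stop_words))
# ['nah', 'dont', 'think', 'goes', 'usf']
```

If such leftovers matter for the model, one option is to add the apostrophe-free spellings to the stopword set before filtering.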

In [73]:
# Stem, then lemmatize.
# Note: the lemmatizer runs on already-stemmed tokens, so it is expected to change little.

nltk.download('wordnet')

stemmer = PorterStemmer()
sms_com['tokens'] = sms_com['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])

lemmatizer = WordNetLemmatizer()
sms_com['tokens'] = sms_com['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

sms_com['tokens'].head()

0    [go, jurong, point, crazi, avail, bugi, n, gre...
1                         [ok, lar, joke, wif, u, oni]
2    [free, entri, wkli, comp, win, fa, cup, final,...
3        [u, dun, say, earli, hor, u, c, alreadi, say]
4    [nah, dont, think, goe, usf, live, around, tho...
Name: tokens, dtype: object

In [None]:
# Vectorize
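The notebook leaves this step empty, so the following is only a sketch of what vectorization could look like, not the author's method. It builds a simple bag-of-words count matrix with the standard library; `docs`, `vocab`, and `to_count_vector` are hypothetical names, and the token lists are made up to resemble `sms_com['tokens']`. In practice a library vectorizer such as scikit-learn's `CountVectorizer` or `TfidfVectorizer` would probably be a better fit.

```python
from collections import Counter

# Made-up token lists standing in for sms_com['tokens'].
docs = [
    ["free", "entri", "win", "free"],
    ["ok", "lar", "joke"],
]

# Sorted vocabulary, so that the column order is deterministic.
vocab = sorted({tok for doc in docs for tok in doc})

def to_count_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one message."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

matrix = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)   # ['entri', 'free', 'joke', 'lar', 'ok', 'win']
print(matrix)  # [[1, 2, 0, 0, 0, 1], [0, 0, 1, 1, 1, 0]]
```

Each row is one message and each column one vocabulary term, which is the numeric form a classifier needs as input.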

In [56]:
sms_com.to_csv('../data/messages_curated.csv')