# SMS Spam Detection Using Machine Learning


---


## Practice Module: Pattern Recognition Systems (PRS)

## Group: 18

## Members:

Lim Jun Ming, A0231523U

Mediana, A0231458E

Yeong Wee Ping, A0231533R

# Text Preprocessing

## 0. File Path & Library Setup

In [None]:
# Load All Necessary Packages

import os
from google.colab import drive

import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

seed = 18

print('Versions of key libraries')
print('-------------------------')
print('pandas:  ', pd.__version__)
print('numpy:   ', np.__version__)
print('nltk:    ', nltk.__version__)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Versions of key libraries
-------------------------
pandas:   1.1.5
numpy:    1.19.5
nltk:     3.2.5


In [None]:
# Mount Google Drive
drive.mount('/content/gdrive')

# Change Working Directory
os.chdir('/content/gdrive/My Drive/iss/prs_pm/training')

print('Working Directory: ')
!pwd

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Working Directory: 
/content/gdrive/My Drive/iss/prs_pm/training


## 1. Text Preprocessing Functions

1. Lower Casing - Convert each SMS to lower case with the built-in string method.
2. Number and Punctuation Removal - Delete all digits, then replace punctuation and other non-alphanumeric characters with a space.
3. Tokenization - Split each cleaned SMS into a list of words with NLTK's `word_tokenize`.
4. Stopword Removal - Remove every token found in the NLTK English stopword corpus or in a custom stopword list (`u`, `ur`).
5. Lemmatization - Reduce each remaining token to its lemma with NLTK's `WordNetLemmatizer`.

Simple tokenisation and stopword removal are used because words in SMS messages are often not written in their standard form. Hence, the words of each SMS are kept largely as written in the processed text, with only numbers, punctuation and stopwords removed and the remaining tokens lemmatized.
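To make the effect of each step concrete, the sketch below traces one made-up message through the first four steps using only the standard library. It is a simplified illustration, not the notebook's own function: `DEMO_STOPWORDS` is a small hypothetical stand-in for the NLTK stopword corpus, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for the NLTK English stopwords plus the custom entries.
DEMO_STOPWORDS = {'i', 'me', 'you', 'at', 'the', 'to', 'a', 'u', 'ur'}


def trace_steps(sms):
    """Return the intermediate result of each preprocessing step for one message."""
    lowered = sms.lower().strip()                   # Step 1: lower casing
    no_digits = re.sub(r'\d+', '', lowered)         # Step 2a: remove numbers
    no_punct = re.sub(r'[^a-z]+', ' ', no_digits)   # Step 2b: punctuation -> space
    tokens = no_punct.split()                       # Step 3: simple tokenisation
    kept = [t for t in tokens if t not in DEMO_STOPWORDS]  # Step 4: stopword removal
    return {
        'lowered': lowered,
        'no_digits': no_digits,
        'no_punct': no_punct,
        'tokens': tokens,
        'kept': kept,
    }


stages = trace_steps("Call me at 5pm, U free?")
for name, value in stages.items():
    print(f'{name:10} {value!r}')
```

Note that deleting digits before tokenising leaves fragments such as `pm` from `5pm` in the output, which is consistent with the aim of keeping SMS words as they were typed.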

In [None]:
# Text processing functions
stopwordsdic = set(stopwords.words('english'))  # NLTK English stopwords (set for fast lookup)
customstopwords = {'u', 'ur'}  # SMS-specific stopwords
lemmatizer = WordNetLemmatizer()


def text_process(sms):
    text = sms.lower()  # Lower casing
    text = text.strip()  # Strip leading and trailing whitespace
    text = re.sub(r'\d+', '', text)  # Remove all numbers
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)  # Replace punctuation and other non-alphanumeric characters with a space
    text = word_tokenize(text)  # Tokenization
    text = [word for word in text if word not in stopwordsdic]  # Remove stopwords
    text = [word for word in text if word not in customstopwords]  # Remove custom stopwords
    text = [lemmatizer.lemmatize(word) for word in text]  # Lemmatization
    return text  # List of tokens


def text_maker(tokens_list):
    # Join the processed tokens back into a single string
    return ' '.join(tokens_list)


## 2. Load Structured Data and Apply Text Preprocessing

In [None]:
# Load Data
header = ['Label', 'Text']
data = pd.read_csv('structured_data/smsdata.csv', encoding='UTF-8', names=header)

# Apply Text Preprocessing
data['Processed Token'] = data['Text'].apply(text_process)

# Apply Text Maker (combine tokens back into a string for easy storage)
data['Processed Text'] = data['Processed Token'].apply(text_maker)

# Preprocessed Dataset
data_text = data[['Label', 'Processed Text']]


In [None]:
# Show data heads

row, clm = data_text.shape

print('Preprocessed dataset has {:2d} rows of labelled data points'.format(row))
print('-----------------------------------------------------------')
print(data_text.head())
print('-----------------------------------------------------------')


Preprocessed dataset has 5837 rows of labelled data points
-----------------------------------------------------------
  Label                                     Processed Text
0   ham                          dunno let go learn pilate
1   ham                                    wat time finish
2   ham     would really appreciate call need someone talk
3   ham                                   still grand prix
4   ham  mmm thats better got roast b better drink good...
-----------------------------------------------------------


## 3. Save Preprocessed Text Data

In [None]:
# Save Preprocessed Dataset
data_text.to_csv('structured_data/procdata.csv', index=False, header=False)
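
One thing worth checking when this file is read back: a message made up only of digits, punctuation or stopwords leaves no tokens, so its `Processed Text` is an empty string. The sketch below uses two hypothetical rows and an in-memory buffer in place of `structured_data/procdata.csv` to show how such a row behaves on a round trip through `to_csv` and `pd.read_csv`. It is an illustration under those assumptions, not part of the original pipeline.

```python
import io

import pandas as pd

header = ['Label', 'Processed Text']

# Hypothetical rows: the second message is assumed to have lost every token.
demo = pd.DataFrame([['ham', 'wat time finish'], ['ham', '']], columns=header)

# Write with the same options as the notebook, but to an in-memory buffer.
buffer = io.StringIO()
demo.to_csv(buffer, index=False, header=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer, names=header)
print(reloaded['Processed Text'].isna().tolist())

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, names=header, keep_default_na=False)
print(reloaded_keep['Processed Text'].tolist())
```

Downstream code that expects strings can either pass `keep_default_na=False` when loading or fill the missing values before vectorising the text.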