In [1]:
import nltk
import numpy as np
import pandas as pd

**Data Import & Snapshot of the data**

In [2]:
# Load data in a dataframe
dt = pd.read_csv('SPAM-210331-134237.csv')

# Snapshot of the data - 10 items
dt.head(10)

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


**Mapping**

Map the class 'spam' to 1 (int) and 'ham' to 0 (int).

In [3]:
# Label encoding: map 'spam' to 1 (int) and 'ham' to 0 (int)
dt['spam'] = dt['type'].map({'spam':1, 'ham':0}).astype(int)

dt.head() # Snapshot of updated dataframe

Unnamed: 0,type,text,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


List the names of the columns in the updated dataframe:

In [5]:
print('Columns in the given data:')
for col in dt.columns:
    print(col)

Columns in the given data:
type
text
spam


In [6]:
type_len = len(dt['type'])
print('Number of rows in the type column:', type_len)

text_len = len(dt['text'])
print('Number of rows in the text column:', text_len)

Number of rows in the type column: 116
Number of rows in the text column: 116


## 2. Tokenization

In [7]:
dt['text'][1] # before

'Ok lar... Joking wif u oni...'

**Tokenization**

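Tokenization splits a sentence, or several sentences, into a list of tokens. Here each token is a whitespace-separated word, so punctuation stays attached to the word it follows.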

In [8]:
def tokenizer(text):
    return text.split()

In [9]:
dt['text'] = dt['text'].apply(tokenizer)

In [10]:
dt['text'][1] # after

['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']
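
The whitespace split above keeps punctuation attached to words (for example `'lar...'`). As a hedged alternative, and not the approach used in the rest of this notebook, here is a minimal sketch of a regex-based tokenizer that lowercases the text and drops punctuation. The name `regex_tokenizer` is hypothetical and only for illustration.

```python
import re

def regex_tokenizer(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Same message as dt['text'][1] before tokenization
tokens = regex_tokenizer("Ok lar... Joking wif u oni...")
print(tokens)
```

Whether stripping punctuation helps is a judgment call: runs such as `!!` can themselves be a useful spam signal.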

**Stemming**

Stemming reduces derived words to a common base form. For example, 'growing' and 'grows' both become 'grow'. The result is not always a dictionary word.

We can use any one of three different stemmers:
1. Snowball
1. Porter
1. Lancaster
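
To get a feel for how the three stemmers differ, the sketch below runs each of them over a few sample words. It assumes NLTK is installed (the stemmers need no extra corpus downloads). The word list and the `stemmers` dictionary are illustrative choices, not part of the original notebook.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# One instance of each stemmer, keyed by a readable name
stemmers = {
    "snowball": SnowballStemmer("english"),
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
}

words = ["growing", "maybe", "joking"]

# Stem every sample word with every stemmer
comparison = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in comparison.items():
    print(name, stems)
```

Lancaster is generally the most aggressive of the three, so its stems tend to be the shortest.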

In [13]:
dt['text'][48] # before

['Yeah',
 'hopefully,',
 'if',
 'tyler',
 "can't",
 'do',
 'it',
 'I',
 'could',
 'maybe',
 'ask',
 'around',
 'a',
 'bit']

In [14]:
# Stemming

from nltk.stem.snowball import SnowballStemmer
# Note: despite its name, `porter` holds a Snowball stemmer, not NLTK's PorterStemmer
porter = SnowballStemmer('english', ignore_stopwords = False)

In [15]:
def stem_it(text):
    """Assumes text to be a list of strings.
       Returns --> a list of strings of the same size
       but having the words replaced by their respective
       root forms."""
    return [porter.stem(word) for word in text]

In [16]:
dt['text'] = dt['text'].apply(stem_it)

In [17]:
dt['text'][48] # after stemming

['yeah',
 'hopefully,',
 'if',
 'tyler',
 "can't",
 'do',
 'it',
 'i',
 'could',
 'mayb',
 'ask',
 'around',
 'a',
 'bit']

**Lemmatization**

In [18]:
dt['text'][92] # before

['smile',
 'in',
 'pleasur',
 'smile',
 'in',
 'pain',
 'smile',
 'when',
 'troubl',
 'pour',
 'like',
 'rain',
 'smile',
 'when',
 'sum1',
 'hurt',
 'u',
 'smile',
 'becoz',
 'someon',
 'still',
 'love',
 'to',
 'see',
 'u',
 'smiling!!']

In [19]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [20]:
def lemmatize_it(text):
    # pos = 'a' treats every token as an adjective; tokens WordNet cannot match are returned unchanged
    return [lemmatizer.lemmatize(word, pos = 'a') for word in text]

In [21]:
dt['text'] = dt['text'].apply(lemmatize_it)

In [22]:
dt['text'][92] # after

['smile',
 'in',
 'pleasur',
 'smile',
 'in',
 'pain',
 'smile',
 'when',
 'troubl',
 'pour',
 'like',
 'rain',
 'smile',
 'when',
 'sum1',
 'hurt',
 'u',
 'smile',
 'becoz',
 'someon',
 'still',
 'love',
 'to',
 'see',
 'u',
 'smiling!!']

**Stopword Removal**

In [23]:
dt['text'][34] # before

['thank',
 'for',
 'your',
 'subscript',
 'to',
 'rington',
 'uk',
 'your',
 'mobil',
 'will',
 'be',
 'charg',
 '£5/month',
 'pleas',
 'confirm',
 'by',
 'repli',
 'yes',
 'or',
 'no.',
 'if',
 'you',
 'repli',
 'no',
 'you',
 'will',
 'not',
 'be',
 'charg']

In [24]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [25]:
def stop_it(text):
    # Keep only the tokens that are not in the English stop-word list
    return [word for word in text if word not in stop_words]

In [26]:
dt['text'] = dt['text'].apply(stop_it)

In [27]:
dt['text'][34] # after

['thank',
 'subscript',
 'rington',
 'uk',
 'mobil',
 'charg',
 '£5/month',
 'pleas',
 'confirm',
 'repli',
 'yes',
 'no.',
 'repli',
 'charg']

In [28]:
dt['text'] = dt['text'].apply(' '.join)
dt.head(10)

Unnamed: 0,type,text,spam
0,ham,"go jurong point, crazy.. avail onli bugi n gre...",0
1,ham,ok lar... joke wif u oni...,0
2,spam,free entri 2 wkli comp win fa cup final tkts 2...,1
3,ham,u dun say earli hor... u c alreadi say...,0
4,ham,"nah think goe usf, live around though",0
5,spam,freemsg hey darl 3 week word back! i'd like fu...,1
6,ham,even brother like speak me. treat like aid pat...,0
7,ham,per request mell mell (oru minnaminungint nuru...,0
8,spam,winner!! valu network custom select receivea £...,1
9,spam,mobil 11 month more? u r entitl updat late col...,1


**Vectorization**

Vectorization converts each document (a string) into an array of numbers. The resulting vector should reflect both how often a word occurs in a document and how distinctive that word is across the whole corpus.
To vectorize the documents, we use _**Term Frequency - Inverse Document Frequency (TF-IDF)**_ here.
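
As a rough illustration of what TF-IDF produces, the sketch below fits scikit-learn's `TfidfVectorizer` on a made-up four-document corpus (the `docs` list is an assumption, not data from the SPAM file). It checks two properties of the default settings: words that occur in more documents receive a lower inverse document frequency, and every row is scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "free entry win cup",
    "ok see you later",
    "win free prize now",
    "see you at home",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

vocab = vectorizer.vocabulary_           # maps each word to its column index
idf = vectorizer.idf_                    # one IDF weight per column

print("shape:", matrix.shape)
print("idf('free'):", idf[vocab["free"]])  # 'free' occurs in 2 documents
print("idf('cup'): ", idf[vocab["cup"]])   # 'cup' occurs in 1 document

# With the default norm='l2', each document vector has length 1
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print("row norms:", row_norms)
```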

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
y = dt.spam.values
x = tfidf.fit_transform(dt['text'])

In [30]:
x

<116x709 sparse matrix of type '<class 'numpy.float64'>'
	with 1076 stored elements in Compressed Sparse Row format>

In [31]:
y

array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0])

In [32]:
print(len(y))

116


**Split the data into two sets: train-test split**

In [33]:
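from sklearn.model_selection import train_test_split
# shuffle = False keeps the original row order: the last 20% of rows form the test set, and random_state has no effect
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 1, test_size = 0.2, shuffle = False)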

In [35]:
print("Items in train split:", len(y_train), "\nItems in test split:", len(y_test))

Items in train split: 92 
Items in test split: 24
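
Because the split above is not shuffled, the share of spam in the test set depends on where the spam messages happen to sit in the file. One common alternative, shown here only as a sketch on synthetic labels (the `labels` and `features` arrays are made up), is a shuffled split with `stratify`, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 ham (0) and 20 spam (1)
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)  # placeholder feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=1,   # makes the shuffled split reproducible
    stratify=labels,  # preserve the ham/spam ratio in both splits
)

print("spam share in train:", y_tr.mean())
print("spam share in test: ", y_te.mean())
```

Switching the notebook to a stratified split would change which rows land in the test set, so the accuracy figures reported below would no longer apply as-is.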


**Classification using Logistic Regression**

In [36]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

In [37]:
from sklearn.metrics import accuracy_score
acc_log = accuracy_score(y_test, y_pred)*100  # signature is (y_true, y_pred)
print("Accuracy:", acc_log)

Accuracy: 87.5


**Classification using LinearSVC**

In [38]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC(random_state = 0)
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_test)

In [39]:
acc_linear_svc = accuracy_score(y_test, y_pred)*100  # signature is (y_true, y_pred)
print("Accuracy:", acc_linear_svc)

Accuracy: 87.5
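
Accuracy alone can hide how a classifier treats the minority class. The sketch below uses made-up labels with the same 21 ham / 3 spam mix as the last 24 entries of `y`, and scores a trivial "always ham" prediction. The arrays `y_true_demo` and `y_pred_all_ham` are hypothetical; they are not the predictions produced by the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test labels: 21 ham (0) and 3 spam (1)
y_true_demo = np.array([0] * 21 + [1] * 3)

# A trivial baseline that predicts ham for every message
y_pred_all_ham = np.zeros_like(y_true_demo)

baseline_acc = accuracy_score(y_true_demo, y_pred_all_ham)
cm = confusion_matrix(y_true_demo, y_pred_all_ham, labels=[0, 1])
spam_recall = recall_score(y_true_demo, y_pred_all_ham, zero_division=0)

print("baseline accuracy:", baseline_acc)
print("confusion matrix (rows = true, cols = predicted):")
print(cm)
print("spam recall:", spam_recall)
```

Applying `confusion_matrix(y_test, y_pred)` to the notebook's own predictions would show whether either model catches any spam at all.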


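Both **Logistic Regression** and **LinearSVC** reach the same accuracy of 87.5% (21 of 24 test messages). For a dataset of just 116 observations this is a reasonable starting point, but it should be read with caution: the last 24 labels in `y`, which form the test split because `shuffle = False`, contain only 3 spam messages, so a classifier that labels every message as ham would also score 87.5%.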