# Classification of spam

## Introduction

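This notebook classifies SMS messages as spam or ham using Linear Discriminant Analysis (LDA). The features are either raw TF-IDF vectors or topic vectors built with Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDiA).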

The dataset is obtained from here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

The approach follows the book Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python by Hobson Lane, Cole Howard, and Hannes Max Hapke.

## Dataset

The text classification is performed on the 'SMS Spam Collection Data Set', which can be found here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Source of the data (2012):

Tiago A. Almeida (talmeida ufscar.br) 
Department of Computer Science 
Federal University of Sao Carlos (UFSCar) 
Sorocaba, Sao Paulo - Brazil 

Attribute Information: The collection is composed of one text file, where each line holds the correct class label followed by the raw message.

Relevant papers: Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

## Loading the data

In [1]:
# Load libraries
import numpy as np
import pandas as pd
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.model_selection import train_test_split

In [3]:
with open('SMSSpamCollection') as f:
    content =f.read() 

In [24]:
spam=[]; text=[]; index=[]
for i, sent in enumerate(content.splitlines()):
    line=sent.split('\t')
    spam.append(int(line[0]!='ham'))
    text.append(line[1])
    index.append('sms{}{}'.format(i,'!'*spam[i]))
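
The loop above assumes that each line contains exactly one tab. A slightly more defensive variant, sketched below, splits on the first tab only, so a tab inside the message body would not truncate the text. The helper name `parse_sms_lines` and the sample lines are illustrative and not part of the original notebook.

```python
def parse_sms_lines(lines):
    """Split 'label<TAB>message' lines into parallel lists (hypothetical helper)."""
    spam, text, index = [], [], []
    for i, line in enumerate(lines):
        # maxsplit=1 keeps any further tabs inside the message text
        label, message = line.split('\t', 1)
        is_spam = int(label != 'ham')
        spam.append(is_spam)
        text.append(message)
        # Spam rows get a trailing '!' in their index label, as in the notebook
        index.append('sms{}{}'.format(i, '!' * is_spam))
    return spam, text, index


# Made-up sample lines in the same format as the SMSSpamCollection file
sample = ["ham\tOk lar... Joking wif u oni...", "spam\tFree entry\tin 2 a wkly comp"]
spam, text, index = parse_sms_lines(sample)
```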

## Exploring the data

In [25]:
df=pd.DataFrame({'spam':spam,'text':text},index=index)
df.head()

Unnamed: 0,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."


In [26]:
# Number of document rows
len(df)

5574

In [29]:
# Number of document rows that are spam, and the share of spam
df.spam.sum(), round(df.spam.sum()/len(df),2)

(747, 0.13)

Thus about 13% of the SMS messages are spam.
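
This imbalance gives a useful reference point for the accuracy scores later on: a model that labels every message as ham is already right for the ham share of the data. A quick sketch of that arithmetic, using the counts printed above (computed on the full dataset, so the share in any particular test split may differ slightly):

```python
n_docs = 5574   # total messages (output of len(df) above)
n_spam = 747    # spam messages (output of df.spam.sum() above)

# Accuracy of a trivial classifier that always predicts "ham"
baseline_accuracy = (n_docs - n_spam) / n_docs
print(round(baseline_accuracy, 3))
```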

## Creating Tf-Idf vectors

In [31]:
# Let's do tokenization and TF-IDF vector transformation on all sms messages
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf_model=TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs=tfidf_model.fit_transform(raw_documents=df.text).toarray()
tfidf_docs=tfidf_docs-tfidf_docs.mean(axis=0)
# rows: number of documents, columns: number of terms
tfidf_docs.shape

(5574, 9232)
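
To make the numbers in `tfidf_docs` less opaque, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row), before the mean-centering step used above. The toy corpus and the whitespace tokenization are assumptions for illustration; the notebook itself uses `casual_tokenize`.

```python
import math
from collections import Counter

# Toy corpus (made up); tokens are split on whitespace for simplicity
docs = ["free prize call now", "call me later", "free free entry"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


rows = [tfidf_row(doc) for doc in tokenized]
```

Terms that occur in fewer documents get a larger IDF, so in the first toy message "prize" ends up with a higher weight than "free".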

## Classification of spam messages

The idea is to first try LDA classification directly on the Tf-idf vectors, and then compare it with LDA trained on topic vectors. The topic vectors are created either with Latent Semantic Analysis, LSA (implemented here with PCA), or with Latent Dirichlet Allocation, LDiA.
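
All four experiments follow the same pattern: split the data, fit an LDA classifier, and report train and test accuracy. A small helper along these lines could capture that pattern. The function name `lda_train_test_accuracy` and the synthetic two-cluster data are hypothetical and only there to make the sketch runnable on its own.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split


def lda_train_test_accuracy(features, labels, test_size=0.33, random_state=256242):
    """Fit LDA on a train split and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=random_state)
    lda = LDA(n_components=1).fit(X_train, y_train)
    return (round(float(lda.score(X_train, y_train)), 3),
            round(float(lda.score(X_test, y_test)), 3))


# Synthetic stand-in for topic vectors: two well-separated Gaussian clusters
rng = np.random.default_rng(0)
ham_like = rng.normal(loc=0.0, size=(100, 4))
spam_like = rng.normal(loc=3.0, size=(100, 4))
features = np.vstack([ham_like, spam_like])
labels = np.array([0] * 100 + [1] * 100)

train_acc, test_acc = lda_train_test_accuracy(features, labels)
```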

In [51]:
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize

### a) LDA classification with Tf-idf vectors (no PCA)

In [50]:
X_train,X_test,y_train,y_test=train_test_split(tfidf_docs,df.spam.values,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(1.0, 0.748)

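Training set accuracy is perfect, but test set accuracy is only 0.748, which is below the roughly 0.87 that always predicting ham would achieve. With 9232 features and only 3734 training messages, the classifier overfits. It is thus better to use a method that reduces the number of dimensions. Let's try PCA.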

### b) LDA classification with 16 PCA topic vectors

In [33]:
# Let's try PCA from scikit-learn, transforming the 9232-dimensional TF-IDF vectors into 16-D topic vectors
pca=PCA(n_components=16)
pca=pca.fit(tfidf_docs)
pca_topic_vectors=pca.transform(tfidf_docs)
columns=['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors=pd.DataFrame(pca_topic_vectors,columns=columns,index=index)
pca_topic_vectors.round(3).head(6)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.032,0.006,-0.004,0.019,-0.052,0.041,-0.046,0.012,-0.085,-0.0,-0.003,-0.001,0.021,-0.016,-0.036
sms1,0.399,-0.037,-0.078,0.085,-0.11,0.055,0.031,0.073,-0.017,-0.02,0.001,0.037,-0.03,-0.036,0.061,0.023
sms2!,-0.029,0.055,-0.05,-0.102,-0.087,-0.04,0.004,-0.032,-0.029,0.07,0.116,0.033,-0.025,0.035,-0.035,-0.044
sms3,0.326,-0.034,-0.028,0.012,-0.055,0.055,-0.166,-0.024,0.05,-0.122,0.024,0.044,-0.079,0.003,0.042,0.025
sms4,0.003,0.035,0.03,0.015,0.07,-0.102,-0.036,0.034,-0.051,0.05,0.025,0.003,-0.005,0.086,-0.033,0.044
sms5!,-0.021,-0.002,0.055,-0.028,-0.117,-0.042,0.019,0.138,-0.06,0.105,0.036,0.044,0.066,0.029,-0.001,-0.005
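
Under the hood, PCA on mean-centered data amounts to a singular value decomposition: the topic vectors are the centered documents projected onto the leading right singular vectors. The following numpy sketch shows that relationship on a small random matrix (an assumption standing in for `tfidf_docs`). Component signs may differ from what scikit-learn's `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((20, 6))            # stand-in for a documents-by-terms matrix
X_centered = X - X.mean(axis=0)    # same centering as applied to tfidf_docs

# Rows of Vt are the principal directions, ordered by singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_topics = 2
topic_vectors = X_centered @ Vt[:n_topics].T   # shape: (20, 2)

# Share of total variance captured by each component
explained = (S ** 2) / np.sum(S ** 2)
print(topic_vectors.shape, explained[:n_topics].round(3))
```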


Let's use the PCA topic vectors to train an LDA model (a simple binary classifier).

In [49]:
X_train,X_test,y_train,y_test=train_test_split(pca_topic_vectors,df.spam,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['pca_spam']=lda.predict(pca_topic_vectors)
# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)

(0.958, 0.959)

### c) LDA classification with 16 LDiA topic vectors

In [34]:
# LDiA works with raw BOW count vectors rather than normalized TF-IDF vectors
np.random.seed(42)
counter=CountVectorizer(tokenizer=casual_tokenize)
bow_docs=pd.DataFrame(counter.fit_transform(raw_documents=df.text).toarray(),index=index)
column_nums,terms=zip(*sorted(zip(counter.vocabulary_.values(),counter.vocabulary_.keys())))
bow_docs.columns=terms
bow_docs.head()

Unnamed: 0,!,"""",#,#150,#5000,$,%,&,',(,...,ü'll,–,—,‘,’,“,…,┾,〨ud,鈥
sms0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms2!,0,0,0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sms3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
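
The `zip(*sorted(zip(...)))` line is compact but easy to misread. `vocabulary_` maps each term to its column number, so sorting the (column, term) pairs and unzipping them yields the terms in column order. A toy dictionary (made up here) shows the mechanics:

```python
# Toy stand-in for counter.vocabulary_: term -> column index
vocabulary = {'free': 1, 'win': 2, 'call': 0}

# Pair up (column, term), sort by column, then unzip into two tuples
column_nums, terms = zip(*sorted(zip(vocabulary.values(), vocabulary.keys())))

print(column_nums)  # (0, 1, 2)
print(terms)        # ('call', 'free', 'win')
```

Recent scikit-learn releases also provide `get_feature_names_out()` on the vectorizer, which should give the same ordering without the manual sort.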


In [35]:
# Create the ldia model
ldia=LDiA(n_components=16,learning_method='batch')
ldia=ldia.fit(bow_docs)
# rows: 16 topics, columns: 9232 terms
ldia.components_.shape

(16, 9232)

In [41]:
# Then create topic vectors with LDiA for this sms corpus
pd.set_option('display.width',60)
ldia16_topic_vectors=ldia.transform(bow_docs)
ldia16_topic_vectors=pd.DataFrame(ldia16_topic_vectors,index=index,columns=columns)
ldia16_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.0,0.44,0.0,0.0,0.0,0.0,0.0,0.0,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.67,0.01,0.01,0.24,0.01,0.01,0.01,0.01
sms2!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.27,0.0,0.0,0.31,0.0
sms3,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.83,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.54,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0


Let's use the LDiA topic vectors (created from the BOW vectors) to train an LDA model (a simple binary classifier), in the same way as with the PCA topic vectors.

In [52]:
X_train,X_test,y_train,y_test=train_test_split(ldia16_topic_vectors,df.spam,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia16_spam']=lda.predict(ldia16_topic_vectors)

# accuracy for train and test sets
round(float(lda.score(X_train,y_train)),3),   round(float(lda.score(X_test,y_test)),3)



(0.918, 0.92)

The result is worse than with the PCA topic vectors (test accuracy 0.92 versus 0.959). Let's try LDiA with a larger number of topics, 32.

### d) LDA classification with 32 LDiA topic vectors

LDiA works differently from LSA (PCA), so it usually needs more topics to allocate words to. Let's try 32 topics (components) instead of 16.

In [53]:
ldia32=LDiA(n_components=32,learning_method='batch')
ldia32=ldia32.fit(bow_docs)
# Rows: 32 topics, columns: 9232 terms
ldia32.components_.shape

(32, 9232)

In [54]:
# Let's compute 32-D topic vectors for all the sms messages
ldia32_topic_vectors=ldia32.transform(bow_docs)
columns32=['topic{}'.format(i) for i in range(ldia32.n_components)]
ldia32_topic_vectors=pd.DataFrame(ldia32_topic_vectors,index=index,columns=columns32)
ldia32_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic22,topic23,topic24,topic25,topic26,topic27,topic28,topic29,topic30,topic31
sms0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.51,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms2!,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.62,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
# LDA model (classifier training)
X_train,X_test,y_train,y_test=train_test_split(ldia32_topic_vectors,df.spam,test_size=0.33,
                                               random_state=256242)
lda=LDA(n_components=1)
lda=lda.fit(X_train,y_train)
df['ldia32_spam']=lda.predict(ldia32_topic_vectors)
X_train.shape



(3734, 32)

In [58]:
# Let's look at accuracy for train and test set
round(float(lda.score(X_train,y_train)),3),  round(float(lda.score(X_test,y_test)),3)

(0.924, 0.922)

The accuracy improved only marginally: test accuracy rose from 0.92 to 0.922.

## Conclusions

In this notebook the spam classification was performed with four methods:
- a) LDA classification done with Tf-idf vectors
- b) LDA classification with 16 topic vectors created by LSA (PCA)
- c) LDA classification with 16 topic vectors created by LDiA
- d) LDA classification with 32 topic vectors created by LDiA

The results show that some form of dimension reduction is needed, since a) generalized poorly to the test set (0.748 accuracy). The best test accuracy, 0.959, was obtained with b), where LSA (Latent Semantic Analysis, via PCA) produced the 16 topic vectors on which the LDA classifier was trained. Both LDiA variants, c) and d), reached a test accuracy of about 0.92.
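
Since only about 13% of the messages are spam, accuracy alone can hide how many spam messages slip through. The notebook stores full-corpus predictions in columns such as `df['pca_spam']`, and a confusion-matrix summary like the sketch below could be applied to them. The function name and the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means spam."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy labels (made up): 1 = spam, 0 = ham
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of predicted spam that is really spam
recall = tp / (tp + fn)      # share of real spam that was caught
```

With the real data this would be called as `confusion_counts(df.spam, df.pca_spam)`. Note that those columns cover training rows as well as test rows. scikit-learn's `confusion_matrix`, already imported at the top of the notebook, reports the same counts as a 2x2 array.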