# SMS Spam Dataset Exploration

## Introduction
This Jupyter Notebook explores the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and compares the performance of various machine learning algorithms in text processing.

## Data Wrangling
To begin with, let's load the dataset into a pandas DataFrame.

In [1]:
import csv
import pandas as pd

sms_spam_df = pd.read_csv('sms-spam.tsv', quoting=csv.QUOTE_NONE, sep='\t', names=['label', 'message'])
sms_spam_df.head()

   label                                            message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...


Missing values can skew an analysis, so let's check whether the dataset has any.

In [2]:
sms_spam_df.isnull().values.any()

False

Now that we are sure there are no missing values, let's have some fun by checking stats about the spam and ham (non-spam) messages in the dataset.

In [3]:
sms_spam = sms_spam_df.groupby('label')['message']
sms_spam.describe()

       count  unique                                                top  freq
label
ham     4827    4518                             Sorry, I'll call later    30
spam     747     653  Please call our customer service representativ...     4
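
The counts above also show how imbalanced the two classes are, which is worth keeping in mind when reading accuracy figures later on. A minimal sketch that uses only the counts reported by `describe()`; the variable names are illustrative:

```python
# Message counts per label, copied from the describe() output above.
counts = {'ham': 4827, 'spam': 747}

total = sum(counts.values())
shares = {label: n / float(total) for label, n in counts.items()}

# A classifier that always answers 'ham' would already be right this often.
majority_baseline = max(shares.values())

for label in sorted(shares):
    print('{}: {:.1%}'.format(label, shares[label]))
print('majority-class baseline accuracy: {:.1%}'.format(majority_baseline))
```

Since most messages are ham, accuracy alone says little about how well spam is caught.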


## Data Preprocessing

For machine learning algorithms to work with messages, the messages have to be converted into vectors. To do that, we first have to split each message into tokens (a list of words). This approach is called the Bag of Words model because each message ends up represented by a collection (bag) of its words, ignoring their order. The following methods can be used to split messages into words:
  1. Tokenization: splitting messages into individual words.
  2. Lemmatization: splitting messages into individual words and converting each word into its base form (lemma).

### Tokenization

Tokenization simply splits the message into individual tokens.

In [4]:
from textblob import TextBlob

def tokenize(message):
    message = unicode(message, 'utf8')
    return TextBlob(message).words

Let's try applying this to some of our messages. Here are the original messages we are going to tokenize.

In [5]:
sms_spam_df['message'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

Now, here are those messages tokenized.

In [6]:
sms_spam_df['message'].head().apply(tokenize)

0    [Go, until, jurong, point, crazy, Available, o...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, do, n't, think, he, goes, to, usf, he...
Name: message, dtype: object

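As you can see, tokenization simply splits each message into tokens. Note that punctuation is dropped and contractions such as _don't_ are split into _do_ and _n't_.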

### Lemmatization

The `textblob` library provides tools that can convert each word in a message to its base form (lemma).

In [7]:
from textblob import TextBlob

def lemmatize(message):
    message = unicode(message, 'utf8').lower()
    return [word.lemma for word in TextBlob(message).words]

Alright, here are the first few of our original messages again.

In [8]:
sms_spam_df['message'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

And here are those messages lemmatized.

In [9]:
sms_spam_df['message'].head().apply(lemmatize)

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, do, n't, think, he, go, to, usf, he, ...
Name: message, dtype: object

As you can see, lemmatization converts each word to its base form; for example, _goes_ becomes _go_ in the last message. Our `lemmatize` function also lowercases the message first.

## Vectorization

As already mentioned, machine learning algorithms work with vectors, not text. Converting a list of words (obtained after tokenization or lemmatization) into a vector involves the following steps:

  1. Term Frequency (TF): Count how often each word occurs in the message.
  2. Inverse Document Frequency (IDF): Weight each count such that words occurring in many messages of the dataset get lower weights.
  3. Normalization: Normalize message vectors to unit length.
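
The three steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's method: the toy corpus `docs` and the helper `tfidf` are hypothetical, and the IDF formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as the default for `TfidfTransformer`.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized messages (not taken from the dataset).
docs = [
    ['free', 'entry', 'win', 'free'],
    ['call', 'me', 'later'],
    ['win', 'cash', 'call', 'now'],
]

n_docs = len(docs)

# Document frequency: in how many messages does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
idf = {w: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
       for w, df in doc_freq.items()}

def tfidf(doc):
    """Return the unit-length TF-IDF vector of one message as a dict."""
    tf = Counter(doc)                                  # 1. term frequency
    weights = {w: tf[w] * idf[w] for w in tf}          # 2. IDF weighting
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}   # 3. L2 normalization

vector = tfidf(docs[0])
for word in sorted(vector):
    print('{}: {:.3f}'.format(word, vector[word]))
```

In this toy corpus, _free_ occurs in one message and _win_ in two, so _free_ gets the higher IDF.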

### Count Vectorization

Count vectorization counts how often each unique word occurs in each message. Here, the words are produced by our `lemmatize` function.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words transformer using lemmatization

bow_transformer = CountVectorizer(analyzer=lemmatize)
bow_transformer.fit(sms_spam_df['message'])

CountVectorizer(analyzer=<function lemmatize at 0x7feadba3ab90>, binary=False,
        decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None,
        vocabulary=None)

Now, let's try out the Bag of Words transformer on a dummy message.

In [11]:
dummy_vectorized = bow_transformer.transform(['Hey you... you of the you... This message is to you.'])
print(dummy_vectorized)

  (0, 3925)	1
  (0, 4297)	1
  (0, 5083)	1
  (0, 5589)	1
  (0, 7673)	1
  (0, 7717)	1
  (0, 7801)	1
  (0, 8737)	4


So, the message _Hey you... you of the you... This message is to you._ contains 8 unique words, of which _you_ occurs 4 times. Each line of the output is a `(message index, word index)` pair followed by the count of that word, so the word with index 8737 occurs 4 times. Can you guess which word that is?

In [12]:
bow_transformer.get_feature_names()[8737]

u'you'
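
The same counts can be cross-checked without scikit-learn. A minimal sketch, assuming a simple lowercase regular-expression split as a stand-in for `lemmatize` (it happens to give the same words for this particular message):

```python
import re
from collections import Counter

message = 'Hey you... you of the you... This message is to you.'

# Assumption: a lowercase word split stands in for lemmatize() here.
words = re.findall(r'[a-z]+', message.lower())
counts = Counter(words)

print(len(counts))       # number of distinct words
print(counts['you'])     # occurrences of 'you'
```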

Now, let's transform the entire set of messages in our dataset.

In [13]:
msgs_vectorized = bow_transformer.transform(sms_spam_df['message'])
msgs_vectorized.shape

(5574, 8859)

### TF-IDF Transformation

Now that we have obtained a vectorized representation of the messages in our dataset, we can use it to weight words such that words occurring in many messages get a lower weight (Inverse Document Frequency). This step also normalizes each message vector.

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

# TF-IDF transformer fitted on the vectorized messages

tfidf_transformer = TfidfTransformer().fit(msgs_vectorized)

Let's use this transformer to weight the previous message, _Hey you... you of the you... This message is to you._

In [15]:
dummy_transformed = tfidf_transformer.transform(dummy_vectorized)
print(dummy_transformed)

  (0, 8737)	0.676815614927
  (0, 7801)	0.164667697974
  (0, 7717)	0.290066163457
  (0, 7673)	0.201312794894
  (0, 5589)	0.248872120698
  (0, 5083)	0.377358904206
  (0, 4297)	0.224280949576
  (0, 3925)	0.368104513252


Now, let's compare the IDF of _you_, the most frequent word in the message, with that of _hey_, which occurs only once.

In [16]:
print('{}: {}'.format('you', tfidf_transformer.idf_[bow_transformer.vocabulary_['you']]))
print('{}: {}'.format('hey', tfidf_transformer.idf_[bow_transformer.vocabulary_['hey']]))

you: 2.25581695452
hey: 4.90754872503


As you can see, words that are rarer in the dataset (_hey_) get a higher IDF than words that are more common (_you_).
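
The printed numbers can be used to check both the normalization and the weighting. The sketch below copies the values from the outputs of In [15] and In [16]. It assumes that column 3925 corresponds to _hey_, which should hold if the vocabulary is indexed in alphabetical order, but the notebook only confirms the index of _you_.

```python
import math

# TF-IDF weights of the dummy message, copied from the output of In [15].
weights = {
    8737: 0.676815614927, 7801: 0.164667697974, 7717: 0.290066163457,
    7673: 0.201312794894, 5589: 0.248872120698, 5083: 0.377358904206,
    4297: 0.224280949576, 3925: 0.368104513252,
}
# IDF values copied from the output of In [16].
idf_you, idf_hey = 2.25581695452, 4.90754872503

# L2 normalization: the squared weights should sum to (almost exactly) 1.
norm = math.sqrt(sum(w * w for w in weights.values()))

# Before normalization each weight is count * idf;
# 'you' occurs 4 times in the message and 'hey' once.
expected_ratio = (4 * idf_you) / (1 * idf_hey)
# Assumption: column 3925 is 'hey' (8737 is confirmed to be 'you').
observed_ratio = weights[8737] / weights[3925]

print(norm)
print(expected_ratio, observed_ratio)
```

Even though _you_ has the lower IDF, its four occurrences still give it the largest weight in this message.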

Now, let's weight and normalize all messages in our dataset.

In [17]:
msgs_tfidf = tfidf_transformer.transform(msgs_vectorized)
msgs_tfidf.shape

(5574, 8859)

## Naive Bayes Classifier

Now that the text messages have been converted into vectors, they can be processed by machine learning algorithms. Naive Bayes is a classification algorithm commonly used in text processing.

In [18]:
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes classifier trained on the vectorized messages and their corresponding labels

nb_clf = MultinomialNB().fit(msgs_tfidf, sms_spam_df['label'])

### Predictions

Now that we have a trained classifier, it can be used for prediction. Here we predict labels for the same messages the classifier was trained on.

In [19]:
msgs_pred = nb_clf.predict(msgs_tfidf)

### Report

The sklearn library provides various tools for reporting classifier performance. The one we use here is `classification_report`.

In [20]:
from sklearn.metrics import classification_report

print(classification_report(sms_spam_df['label'], msgs_pred))

             precision    recall  f1-score   support

        ham       0.97      1.00      0.98      4827
       spam       1.00      0.77      0.87       747

avg / total       0.97      0.97      0.97      5574
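
The f1-score column is the harmonic mean of precision and recall. A quick check against the values printed in the report; since those are rounded to two decimals, the results can only be expected to agree to about two decimals as well. The helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall as printed (rounded) in the report above.
f1_ham = f1_from(0.97, 1.00)
f1_spam = f1_from(1.00, 0.77)

print('ham:  {:.2f}'.format(f1_ham))
print('spam: {:.2f}'.format(f1_spam))
```

A spam recall of 0.77 means that roughly a quarter of the spam messages were labelled ham, even though the overall figures look high.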



## Conclusion?

Is that it? Well, for a demonstration, yes. But in reality, no! For ease of demonstration, we have skipped a few important machine learning operations.

## Take Two

Now, let's improve our procedure. This time, we will do machine learning the way it's meant to be done.

### Splitting Dataset

For our demonstration, we trained a Naive Bayes classifier on the entire dataset. Then we tested our classifier on the same complete dataset. A score obtained this way is overly optimistic: the classifier has already seen every message it is tested on, so we cannot tell whether it has overfitted or how it would perform on new messages.

A better approach is to split our dataset into two partitions: one for training the classifier and another for testing the classifier. The sklearn library provides just what we need.

In [24]:
from sklearn.model_selection import train_test_split

msgs_train, msgs_test, lbls_train, lbls_test = \
    train_test_split(sms_spam_df['message'], sms_spam_df['label'], test_size=0.2)
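
For intuition, a holdout split amounts to shuffling the row indices and cutting them in two. The following is a minimal standard-library sketch of that idea, not sklearn's implementation; `holdout_split` and the stand-in `messages` list are hypothetical.

```python
import random

def holdout_split(items, test_size=0.2, seed=0):
    """Shuffle indices and cut them into a training part and a test part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Hypothetical stand-in for the messages column.
messages = ['msg {}'.format(i) for i in range(10)]
train, test = holdout_split(messages)

print(len(train), len(test))
```

With `train_test_split` itself, passing `random_state` makes the split reproducible, and `stratify` can be used to keep the ham/spam proportions the same in both partitions.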

### Pipeline

As mentioned in the demonstration, we cannot feed our text messages directly to the machine learning algorithm; they have to be vectorized first. If you remember, vectorization involved two steps:
  1. Counting words in each message and converting the dataset into one large matrix (Count Vectorization).
  2. Weighting words based on their frequency (TF-IDF Transformation) and normalizing the result.

Once the preprocessing is complete, we can construct the classifier.

These operations can be chained using the `Pipeline` class from the sklearn library.

In [25]:
from sklearn.pipeline import Pipeline

# Pipeline of CountVectorizer, TfidfTransformer and Naive Bayes classifier

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=lemmatize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

### Cross Validation

Cross validation (k-fold cross validation) splits the training set again into k partitions (folds). In each round, 1 fold is used for testing and the remaining k-1 folds are used for training. The process is repeated k times so that every fold serves as the test set once, and the average of the k scores is considered the score of the machine learning model.
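
How the folds rotate can be sketched with plain index lists. This is an illustration of the idea under simple assumptions (contiguous folds, no shuffling), not sklearn's implementation; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print('test:', test_idx, 'train:', train_idx)
```

Every sample lands in exactly one test fold, so each message contributes to the final score exactly once.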

The `cross_val_score` function from the sklearn library can be used to determine the cross validation scores of a model, one per fold.

In [27]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, 
    msgs_train, 
    lbls_train,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

print(scores)

[ 0.96420582  0.94407159  0.95302013  0.94843049  0.93497758  0.93721973
  0.93932584  0.96404494  0.95955056  0.95505618]
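
The ten fold scores are usually summarized by their mean and spread. A small sketch using the values printed above, copied by hand; in the notebook itself, `scores.mean()` and `scores.std()` on the returned NumPy array would do the same.

```python
# Fold accuracies copied from the output above.
scores = [0.96420582, 0.94407159, 0.95302013, 0.94843049, 0.93497758,
          0.93721973, 0.93932584, 0.96404494, 0.95955056, 0.95505618]

mean = sum(scores) / len(scores)
# Population standard deviation of the fold scores.
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

print('accuracy: {:.3f} +/- {:.3f}'.format(mean, std))
```

The mean comes out at roughly 0.95, a more honest estimate than the figures obtained by testing on the training data.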
