# SMS Spam Dataset Exploration

## Introduction
This Jupyter Notebook explores the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and compares the performance of various machine learning algorithms in text processing.

## Data Wrangling
To begin with, let's load the dataset into a pandas DataFrame.

In [2]:
import csv
import pandas as pd

sms_spam_df = pd.read_csv('sms-spam.tsv', quoting=csv.QUOTE_NONE, sep='\t', names=['label', 'message'])
sms_spam_df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
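The `quoting=csv.QUOTE_NONE` argument in the call above is worth a closer look, since SMS text often contains stray quotation marks. The sketch below uses the standard-library `csv` reader on two made-up rows (the sample text is an assumption, not taken from the dataset) to show what can go wrong with default quoting. `pd.read_csv` is expected to treat quote characters in a similar way.

```python
import csv
import io

# Two made-up tab-separated rows; the first message contains an unmatched quote.
raw = 'ham\t"Hi there\nspam\tWin a prize now\n'

# Default quoting treats the opening " as the start of a quoted field,
# so the newline and the whole next row are swallowed into that field.
rows_default = list(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE treats " as an ordinary character, so each line stays one row.
rows_no_quoting = list(
    csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(len(rows_default), len(rows_no_quoting))
```

With quoting disabled, both rows survive and the quote character is kept as part of the message text.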


Missing values can skew an analysis, so they should be dealt with before going any further. Let's see if the dataset has any missing values.

In [3]:
sms_spam_df.isnull().values.any()

False
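A single `True`/`False` does not say where the gaps are. As a minimal sketch on a made-up DataFrame (`toy_df` and its contents are assumptions for illustration only), the same check can be broken down per column. Note that an empty string is not counted as missing, so it may be worth checking for separately.

```python
import pandas as pd

# Made-up data with one missing label, one missing message and one empty message.
toy_df = pd.DataFrame({
    'label': ['ham', 'spam', None, 'ham'],
    'message': ['Ok lar', None, 'Free entry', ''],
})

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Same overall check as in the cell above.
any_missing = toy_df.isnull().values.any()

# Empty strings are not null, so count them explicitly.
empty_messages = (toy_df['message'] == '').sum()

print(missing_per_column)
print(any_missing, empty_messages)
```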

Now that we are sure there are no missing values, let's have some fun by checking stats about spam and ham (non-spam) messages in the dataset.

In [4]:
sms_spam = sms_spam_df.groupby('label')['message']
sms_spam.describe()

| label | count | unique | top | freq |
|-------|-------|--------|-----|------|
| ham | 4827 | 4518 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |
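These numbers are easier to read as proportions. The short sketch below does the arithmetic on the counts printed above (the variable names are just for illustration). It shows how unbalanced the two classes are and how many messages in each class repeat an earlier one.

```python
# Counts and unique counts copied from the describe() output above.
counts = {'ham': 4827, 'spam': 747}
uniques = {'ham': 4518, 'spam': 653}

total = sum(counts.values())

# Fraction of the dataset that belongs to each class.
shares = {label: n / total for label, n in counts.items()}

# Messages that are repeats of an earlier message in the same class.
duplicates = {label: counts[label] - uniques[label] for label in counts}

for label in counts:
    print(f'{label}: {shares[label]:.1%} of messages, {duplicates[label]} repeats')
```

The heavy skew towards ham is worth keeping in mind when comparing classifiers later, since plain accuracy can look good even for a model that rarely predicts spam.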


## Data Preprocessing

Alright then folks, let's get down to business. To be frank, computers are bad at understanding text; they prefer numbers. So we'll have to convert our texts into numbers. But before we can do that, there is some unfinished business. Introducing tokenization and lemmatization.
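To make the end goal concrete before the preprocessing steps, here is a minimal sketch of one simple numeric form: a binary presence vector per message. The helper names (`build_vocabulary`, `to_binary_vector`) and the toy token lists are assumptions for illustration, not the encoding this notebook necessarily uses later.

```python
def build_vocabulary(token_lists):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def to_binary_vector(tokens, vocab_index):
    """Return a 0/1 list with a 1 wherever a vocabulary token occurs."""
    vector = [0] * len(vocab_index)
    for tok in tokens:
        if tok in vocab_index:
            vector[vocab_index[tok]] = 1
    return vector


# Toy, already-tokenized messages.
docs = [['ok', 'lar', 'joking'], ['free', 'entry', 'win'], ['ok', 'win']]

vocab_index = build_vocabulary(docs)
vectors = [to_binary_vector(doc, vocab_index) for doc in docs]

print(vocab_index)
print(vectors)
```

Each message becomes a fixed-length row of zeros and ones, which is the kind of input a learning algorithm can work with. This is why the messages first need to be split into clean, consistent tokens.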

### Tokenization

Tokenization simply splits the message into individual tokens.

In [6]:
from textblob import TextBlob

def tokenize(message):
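    # Decode raw bytes (e.g. under Python 2); Python 3 text passes through as-is.
    if isinstance(message, bytes):
        message = message.decode('utf8')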
    return TextBlob(message).words

Let's try applying this to some of our messages. Here are the original messages we are going to tokenize.

In [11]:
sms_spam_df['message'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

Now, here are those messages tokenized.

In [13]:
sms_spam_df['message'].head().apply(tokenize)

0    [Go, until, jurong, point, crazy, Available, o...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, do, n't, think, he, goes, to, usf, he...
Name: message, dtype: object

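As you can see, tokenization splits each message into word tokens and drops the punctuation along the way. Contractions are split too: in the last message, `don't` becomes `do` and `n't`.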
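To get a feel for what a word tokenizer does under the hood, here is a rough standard-library stand-in based on a regular expression. It is a simplified sketch, not `textblob`'s actual tokenizer, and `simple_tokenize` is a hypothetical helper name.

```python
import re

# A word is a run of letters/digits, optionally followed by an apostrophe part.
WORD_PATTERN = re.compile(r"\w+(?:'\w+)?")


def simple_tokenize(message):
    """Split a message into word tokens, discarding punctuation."""
    return WORD_PATTERN.findall(message)


print(simple_tokenize("Ok lar... Joking wif u oni..."))
print(simple_tokenize("Nah I don't think he goes to usf"))
```

For the second message in the dataset this gives the same tokens as the output above. Unlike `textblob`, though, this sketch keeps `don't` as a single token rather than splitting it into `do` and `n't`.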

### Lemmatization

The `textblob` library provides tools that can convert each word in a message to its base form (lemma).

In [14]:
from textblob import TextBlob

def lemmatize(message):
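    # Decode raw bytes (e.g. under Python 2); Python 3 text passes through as-is.
    if isinstance(message, bytes):
        message = message.decode('utf8')
    message = message.lower()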
    return [word.lemma for word in TextBlob(message).words]

Alright, here are the first few of our original messages.

In [17]:
sms_spam_df['message'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

And, here are our messages lemmatized.

In [18]:
sms_spam_df['message'].head().apply(lemmatize)

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, do, n't, think, he, go, to, usf, he, ...
Name: message, dtype: object

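As you can see, our `lemmatize` function lowercases each message and converts its words to their base form; for example, `goes` becomes `go` in the last message. Not every word is reduced, though: `joking` in the second message is left unchanged.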
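The idea behind lemmatization can be sketched with a tiny lookup table. This is a toy illustration only: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and real lemmatizers such as the WordNet-based one behind `textblob` rely on a large dictionary plus morphological rules rather than three hand-written entries.

```python
# Hypothetical lookup table standing in for a real lemma dictionary.
TOY_LEMMAS = {'goes': 'go', 'lives': 'live', 'messages': 'message'}


def toy_lemmatize(tokens):
    """Lowercase each token, then map it to its base form if the table knows it."""
    lowered = [tok.lower() for tok in tokens]
    return [TOY_LEMMAS.get(tok, tok) for tok in lowered]


print(toy_lemmatize(['He', 'goes', 'to', 'usf']))
```

Words missing from the table pass through unchanged apart from lowercasing. The same happens in a real lemmatizer when a word form is not recognized.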