# Authors: Aishwarya Mathew, Vikram Yabannavar

# Final Project - Spamifier

## Introduction

Data science is all around us. It is an interdisciplinary field that strives to discover the meaning behind data. Finding a good first data science project is hard: you may know many data science concepts but get stuck when you try to apply them to real-world tasks. If that sounds familiar, you've come to the right place. Have you ever wanted to build an application that can tell unwanted content from wanted content? This tutorial, a basic introduction to the world of data science, does just that. You will create a spam classifier/filter that distinguishes spam SMS messages from non-spam ones (which we call ham). We will take you through the entire data science lifecycle: data collection, data processing, exploratory data analysis and visualization, analysis, hypothesis testing, machine learning, and insight/policy decision.

### Tutorial Content

- Installing Libraries
- Downloading and Preparing the Data

In [91]:
import pandas as pd
import csv
import re
import string

## Downloading The Data (Data Collection)

The first step of any data science project is data collection. Kaggle is a website that hosts many real-world datasets. To get started, download the SMS spam vs. ham dataset from https://www.kaggle.com/uciml/sms-spam-collection-dataset to your local disk. You will need a Kaggle user account to download any of their datasets.

## Data Processing

Once you have the dataset (a CSV file) on your local disk, we can start the next step: preparing the data. The code below loads the spam vs. ham dataset from the CSV file into a pandas DataFrame, removes three unneeded columns, and renames the remaining two.

In [92]:
#reading the data into a pandas dataframe
spamham_data = pd.read_csv("spam.csv", encoding='latin-1')

#removing unnecessary columns
del spamham_data['Unnamed: 2']
del spamham_data['Unnamed: 3']
del spamham_data['Unnamed: 4']
#renaming the remaining two columns
spamham_data.columns = ['Spam or Ham','SMS Message']

spamham_data

Unnamed: 0,Spam or Ham,SMS Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
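
The same clean-up can also be done while reading the file. The following is a minimal sketch, not the method used above: it reads a small in-memory stand-in for `spam.csv` (the header names `label` and `text` and the two rows are assumptions for illustration) and keeps only the first two columns through the `usecols` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv: two useful columns followed by
# three empty ones. The header names here are assumptions.
raw = io.StringIO(
    "label,text,,,\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,WINNER!! As a valued network customer,,,\n"
)

# Read only the first two columns, so nothing has to be deleted afterwards.
sample = pd.read_csv(raw, usecols=[0, 1])

# Rename the two remaining columns as in the cell above.
sample.columns = ["Spam or Ham", "SMS Message"]

print(sample)
```

With the real file, the same call would presumably also need `encoding='latin-1'`, as in the cell above.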


## Exploratory Analysis and Data Visualization

### Getting The Top 100 Words in Spam Messages

In [93]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# Loop through the above data frame and check if message is spam. 
# If it is, we strip punctuation, convert each message to lower case, 
# then split each message by spaces. 
# If a word isn't in dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'spam': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.lower().split() #convert lowercase and split by space
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1
 


spam_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
spam_word_df.columns = ['count']
spam_word_df

Unnamed: 0,count
hlp,2
89938,1
zebra,1
ac,4
3,22
potter,2
0721072,1
music,20
87575,4
07742676969,2


In [94]:
spam_word_df['count'].nlargest(100)

to          686
a           376
call        347
you         287
your        263
free        216
the         204
for         203
now         189
or          188
2           173
is          158
txt         150
u           147
on          144
ur          144
have        135
from        128
mobile      123
and         122
text        120
4           119
claim       113
stop        113
with        109
reply       101
of           95
prize        92
this         87
our          85
           ... 
1            33
every        33
as           33
receive      33
camera       33
holiday      32
if           32
message      32
landline     32
shows        31
å£2000       31
go           31
number       30
me           30
has          30
box          30
more         30
want         29
video        29
code         29
tcs          29
apply        29
live         29
po           29
can          29
all          28
award        28
it           28
å£150        27
msg          27
Name: count, dtype: int64
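
The counting loop can also be written with `collections.Counter` from the standard library. The sketch below is one possible variant, not the code that produced the output above: the helper name `count_words` and the toy messages are hypothetical, and the label is a parameter so that one function serves any class.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_words(messages, labels, target):
    """Count lowercase, punctuation-free words in messages labelled `target`."""
    counts = Counter()
    for text, label in zip(messages, labels):
        if label == target:
            # Strip punctuation, lowercase, split on whitespace, then tally.
            counts.update(text.translate(PUNCT_TABLE).lower().split())
    return counts


# Toy data standing in for the two DataFrame columns (not from the dataset).
labels = ["spam", "ham", "spam"]
messages = [
    "Claim your FREE prize now!",
    "Are you free tonight?",
    "Free entry: claim now",
]

spam_counts = count_words(messages, labels, "spam")
ham_counts = count_words(messages, labels, "ham")
print(spam_counts.most_common(3))
```

On the real data, the two columns `spamham_data['SMS Message']` and `spamham_data['Spam or Ham']` could be passed in place of the toy lists, and `pd.Series(spam_counts)` would turn the result into a pandas object that supports `nlargest`.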

### Getting The Top 100 Words in Non-Spam Messages

In [95]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# Loop through the above data frame and check if message is ham. 
# If it is, we strip punctuation, convert each message to lower case, 
# then split each message by spaces. 
# If a word isn't in dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'ham': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.lower().split() #convert lowercase and split by space
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1
 


ham_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
ham_word_df.columns = ['count']
ham_word_df

Unnamed: 0,count
lengths,1
eggs,2
arestaurant,1
sportsx,1
3,44
iz,4
spoil,1
ofice,1
luton,1
everybody,3


In [96]:
ham_word_df['count'].nlargest(100)

i        2185
you      1837
to       1554
the      1118
a        1052
u         972
and       848
in        811
me        756
my        743
is        728
it        590
of        524
for       501
that      486
im        449
have      438
but       418
your      414
so        412
are       409
not       406
on        391
do        377
at        377
can       376
if        347
will      334
be        332
2         305
         ... 
home      160
about     159
need      156
sorry     153
from      150
as        146
still     146
see       137
by        135
n         134
later     134
da        131
only      131
r         131
she       130
back      129
think     128
well      126
today     125
send      123
tell      121
cant      118
hi        117
ì         117
did       116
her       113
take      112
much      112
some      112
oh        111
Name: count, dtype: int64

### Finding Words Only In Spam

#### Getting sets of all words in each Dataframe

In [97]:
ham_set = set(ham_word_df.index)   # every distinct word seen in ham
spam_set = set(spam_word_df.index) # every distinct word seen in spam

#### Doing set difference to get the words found in spam and not in ham

In [98]:
only_spam_set = spam_set.difference(ham_set)
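
To make the set operation concrete, here is a small sketch on toy vocabularies. The words below are made up for illustration and are not taken from the dataset.

```python
# Toy vocabularies (hypothetical, for illustration only).
spam_words = {"claim", "prize", "free", "call"}
ham_words = {"free", "call", "home", "sorry"}

# Words that appear in spam but never in ham.
# The - operator is equivalent to spam_words.difference(ham_words).
only_spam = spam_words - ham_words

# The reverse difference and the intersection, for comparison.
only_ham = ham_words - spam_words
shared = spam_words & ham_words

print(sorted(only_spam))
print(sorted(only_ham))
print(sorted(shared))
```

Words in `shared` are dropped by the difference even if they are far more frequent in one class than in the other, so this filter finds exclusive words rather than merely characteristic ones.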

#### Filtering the DataFrame

In [105]:
#This only contains the words used by spammers
#(.ix was removed in pandas 1.0, so we use label-based .loc with a list)
only_spam_df = spam_word_df.loc[list(only_spam_set)]
only_spam_df

Unnamed: 0,count
hlp,2
89938,1
zebra,1
ac,4
0721072,1
07742676969,2
9ae,4
08714712412,1
4403ldnw1a7rw18,2
07808,1


In [106]:
only_spam_df['count'].nlargest(100)

claim               113
prize                92
won                  73
guaranteed           50
tone                 48
18                   43
awarded              38
å£1000               35
150ppm               34
å£2000               31
tcs                  29
å£150                27
collection           26
ringtone             26
entry                26
tones                25
500                  25
weekly               24
mob                  24
valid                23
150p                 23
å£100                22
bonus                21
8007                 21
sae                  21
vouchers             20
å£5000               20
86688                19
å£500                19
unsubscribe          18
                   ... 
dogging              11
representative       10
uks                  10
admirer              10
wap                  10
10pmin               10
hmv                  10
ntt                  10
ldn                  10
reward               10
11mths          

From these results, text messages containing words related to winning prizes or to monetary values/symbols appear more likely to be spam. Among the words that occur only in spam, 'claim' is the most frequent, with 113 occurrences, and the next three ('prize', 'won', and 'guaranteed') also relate to winning or prizes. There also seems to be a trend of poor spacing or non-phonetic combinations of letters and numbers (for example '150ppm' and '10pmin'), which are common signs of spam messaging. Next we will do a tf-idf analysis to determine how important each word is.
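
As a preview of the idea, the sketch below computes tf-idf by hand on three toy messages. It assumes one common variant of the formula, tf = (count of word in message) / (words in message) and idf = log(N / number of messages containing the word); library implementations often use smoothed or normalized versions, so their numbers may differ. The function name `tf_idf` and the messages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(
            {
                word: (count / len(tokens)) * math.log(n_docs / df[word])
                for word, count in tf.items()
            }
        )
    return scores


# Toy messages (not from the dataset).
docs = ["claim your prize now", "call me now", "see you at home"]
scores = tf_idf(docs)

# 'claim' occurs in one message, 'now' in two, so 'claim' scores higher.
print(scores[0]["claim"], scores[0]["now"])
```

A word that appears in every message gets idf = log(1) = 0 under this variant, which is how tf-idf discounts very common words such as 'to' and 'a' that dominate the raw counts above.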