# Authors: Aishwarya Mathew, Vikram Yabannavar

# SPAMIFIER

## References

1. Kaggle: [https://www.kaggle.com/](https://www.kaggle.com/)

## Introduction

Data science is all around us. It is an interdisciplinary field of study that strives to discover the meaning behind data. Finding a good first data science project is hard: you may know a lot of data science concepts and still get stuck when you try to apply them to real-world tasks. If that sounds familiar, you've come to the right place! Have you ever wanted to build an application that can tell a bad thing from a good thing? This tutorial does just that, and it doubles as a basic introduction to the world of data science. It will teach you how to perform text classification by building a spam classifier/filter that distinguishes spam from non-spam (which we call ham) in SMS messages and emails. We will take you through the entire data science lifecycle: data collection, data processing, exploratory data analysis and visualization, analysis, hypothesis testing, machine learning, and insight/policy decisions.

### Tutorial Content

This tutorial will go over the following topics:

1. [Installing Libraries](#libraries)
<br>
2. [Loading Data](#load data)
<br>
3. [Processing Data](#process data)
<br>
4. [Exploratory Analysis and Data Visualization](#eda)
<br>
   4.1. [Getting The Top 100 Words in Spam Messages](#spam1)
<br>
   4.2. [Getting The Top 100 Words in Ham Messages](#spam1)

<a id='libraries'></a>
## Installing Libraries

In [20]:
import pandas as pd
import csv
import re
import string
import pytagcloud

<a id='load data'></a>
## Loading Data (Data Collection)

The first step of any data science project is data collection. Kaggle is a website that hosts many real-world datasets. To get started, download the SMS spam vs. ham dataset from https://www.kaggle.com/uciml/sms-spam-collection-dataset to your local disk. You will need a Kaggle user account to download any of their datasets. For email data, we will use the dataset provided with Andrew Ng's Machine Learning course at Stanford. That data can be found [here](http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex6materials/ex6DataEmails.zip).

<a id='process data'></a>
## Processing Data

Once you have the dataset (a CSV file, saved here as `sms.csv`) on your local disk, we can start the next step, in which we prepare our data. The code below loads the spam vs. ham dataset from the CSV file into a pandas dataframe.

In [21]:
#reading the data into a pandas dataframe
spamham_data = pd.read_csv("sms.csv", encoding='latin-1')

#removing unnecessary columns
del spamham_data['Unnamed: 2']
del spamham_data['Unnamed: 3']
del spamham_data['Unnamed: 4']
#renaming the remaining two columns
spamham_data.columns = ['Spam or Ham','SMS Message']

#the resulting dataframe with our data
spamham_data.head()

### Check for missing values

| | Spam or Ham | SMS Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
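A missing-value check on a dataframe like this one can be sketched as follows. The `sample` dataframe is a made-up stand-in for `spamham_data` (with one message deliberately left empty), so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for spamham_data, with one missing message
sample = pd.DataFrame({
    'Spam or Ham': ['ham', 'spam', 'ham'],
    'SMS Message': ['Ok lar... Joking wif u oni...', None, 'U dun say so early hor...'],
})

# Count the missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop rows that lack a label or a message before counting words
cleaned = sample.dropna(subset=['Spam or Ham', 'SMS Message'])
print(len(cleaned), "rows remain")
```

Running the same two calls on `spamham_data` would show whether any rows need to be dropped before the word counts below.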


<a id='eda'></a>
## Exploratory Analysis and Data Visualization

### Getting The Top 100 Words in Spam Messages

In [22]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# We tokenize the words. We strip punctuation but do not convert to lowercase,
# so Free, free, and FREE are all going to be different for our purposes.
# If a word isn't in the dictionary, add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'spam': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.split() #split on whitespace (case is preserved)
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1
                
spam_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
spam_word_df.columns = ['count']
spam_word_df.head()

| | count |
|---|---|
| refundedThis | 1 |
| REAL1 | 1 |
| save | 1 |
| REWARD | 2 |
| professional | 1 |
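The counting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch of that alternative, not the code that produced the outputs in this tutorial; the helper name `count_words`, its `lowercase` flag, and the two sample messages are made up for illustration.

```python
from collections import Counter
import string

# Translation table that deletes ASCII punctuation
translator = str.maketrans('', '', string.punctuation)

def count_words(messages, lowercase=False):
    """Count whitespace-separated tokens after stripping punctuation."""
    counts = Counter()
    for message in messages:
        text = message.translate(translator)
        if lowercase:
            text = text.lower()
        counts.update(text.split())
    return counts

# Made-up messages, only to show the call
sample_spam = ['FREE entry! Call now.', 'Call to claim your FREE prize']
spam_counts = count_words(sample_spam)
print(spam_counts.most_common(2))
```

On the tutorial's data, the spam messages could be passed in as `spamham_data.loc[spamham_data['Spam or Ham'] == 'spam', 'SMS Message']`, and `most_common(100)` plays the role of `nlargest(100)`.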


In [23]:
spam_word_df['count'].nlargest(100)

to            608
a             358
call          189
your          187
you           185
or            185
the           178
2             173
for           170
is            149
on            138
Call          137
now           131
have          128
and           119
4             119
from          116
FREE          112
ur            107
with          102
mobile         95
of             93
U              85
claim          78
You            77
are            77
our            76
prize          73
To             73
text           72
             ... 
å£1000         35
Please         34
by             34
Get            33
1              33
phone          33
line           33
draw           33
Claim          32
150ppm         32
every          32
å£2000         31
shows          31
Just           30
receive        30
has            30
TCs            29
me             29
number         28
win            28
PO             28
Mobile         28
å£150          27
apply          27
award     

### Getting The Top 100 Words in Ham Messages

In [24]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# Loop through the above data frame and check if message is ham. 
# If it is, we strip punctuation, convert the message to lower case, 
# then split it on whitespace. 
# If a word isn't in dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'ham': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.lower().split() #convert lowercase and split by space
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1

ham_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
ham_word_df.columns = ['count']
ham_word_df.head()

| | count |
|---|---|
| reflection | 1 |
| soonc | 1 |
| doctors | 1 |
| returned | 3 |
| tc | 2 |


#### 100 most common words in non-spam

In [25]:
ham_word_df['count'].nlargest(100)

i        2185
you      1837
to       1554
the      1118
a        1052
u         972
and       848
in        811
me        756
my        743
is        728
it        590
of        524
for       501
that      486
im        449
have      438
but       418
your      414
so        412
are       409
not       406
on        391
do        377
at        377
can       376
if        347
will      334
be        332
2         305
         ... 
home      160
about     159
need      156
sorry     153
from      150
as        146
still     146
see       137
by        135
later     134
n         134
da        131
r         131
only      131
she       130
back      129
think     128
well      126
today     125
send      123
tell      121
cant      118
ì         117
hi        117
did       116
her       113
take      112
much      112
some      112
here      111
Name: count, dtype: int64

### Finding Words Only In Spam

#### Getting sets of all words in each Dataframe

In [26]:
ham_set = set(ham_word_df.index)
spam_set = set(spam_word_df.index)

#### Doing set difference to get the words found in spam and not in ham

In [27]:
only_spam_set = spam_set.difference(ham_set)

#### Filtering the DataFrame

In [28]:
#This only contains the words used by spammers
#(.ix was removed from pandas; .loc with a list of labels does the same lookup)
only_spam_df = spam_word_df.loc[list(only_spam_set)]
only_spam_df

| | count |
|---|---|
| REAL1 | 1 |
| INCLU | 1 |
| REWARD | 2 |
| SPAM | 1 |
| professional | 1 |
| FREEMSG | 1 |
| MOB | 1 |
| Cup | 3 |
| Alfie | 1 |
| specially | 9 |
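Because the ham words were lowercased while the spam words kept their original case, any capitalized spam word counts as "spam only" even if its lowercase form is common in ham. The sketch below shows how folding case on both sides changes the set difference. The two count dictionaries and the helper `fold_case` are made up for illustration.

```python
# Made-up word counts standing in for the spam and ham dictionaries
spam_counts = {'Call': 3, 'call': 2, 'FREE': 4, 'prize': 2, 'you': 1}
ham_counts = {'call': 5, 'you': 7, 'home': 2}

def fold_case(counts):
    """Merge the counts of words that differ only by case."""
    folded = {}
    for word, n in counts.items():
        key = word.lower()
        folded[key] = folded.get(key, 0) + n
    return folded

spam_folded = fold_case(spam_counts)
ham_folded = fold_case(ham_counts)

# Case-sensitive difference: 'Call' looks spam-only although 'call' is in ham
only_spam = set(spam_counts) - set(ham_counts)

# Case-insensitive difference: only words truly absent from ham remain
only_spam_folded = set(spam_folded) - set(ham_folded)

print(sorted(only_spam))
print(sorted(only_spam_folded))
```

Whether to fold case is a design choice: keeping case preserves signals such as all-caps words, while folding avoids counting ordinary words as spam-only.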


#### 100 most common words found only in Spam

In [29]:
only_spam_df['count'].nlargest(100)

Call           137
FREE           112
U               85
claim           78
You             77
prize           73
To              73
Your            71
Txt             70
STOP            62
won             49
Nokia           46
NOW             44
18              43
Reply           42
Free            42
URGENT          41
This            40
Text            40
I               39
No              38
awarded         37
We              36
å£1000          35
Please          34
Get             33
Claim           32
150ppm          32
å£2000          31
Just            30
              ... 
Expires         17
vouchers        17
tones           17
Urgent          17
Code            17
12hrs           17
Identifier      16
Dear            16
Bonus           16
Statement       16
Account         16
unredeemed      16
08000930705     16
å£250           16
C               16
PRIVATE         16
NOKIA           15
Ltd             15
Had             15
08000839402     15
unsubscribe     15
operator    

Through these transformations, we can conclude that text messages containing words related to winning prizes or monetary values/symbols have a high likelihood of being spam. In the spam-only list above, 'Call' is the most frequent word with 137 occurrences, followed by 'FREE' with 112, while 'claim' (78), 'prize' (73), 'won' (49), and 'awarded' (37) all relate to winning. Keep in mind that the ham words were lowercased and the spam words were not, so capitalized words such as 'Call', 'You', and 'I' show up as spam-only partly because of their case. There also seems to be a trend of poor spacing or non-phonetic combinations of letters and numbers (for example '150ppm' and '12hrs'), which are common signs of spam messaging. Next we will do a tf-idf analysis to determine how important each word is.
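As a rough illustration of what tf-idf measures, the sketch below scores words in three made-up tokenized messages. It uses one common variant (term frequency as count divided by message length, inverse document frequency as the log of the number of messages over the number of messages containing the word); other libraries use slightly different formulas, and this is not the analysis run on the tutorial's data.

```python
import math
from collections import Counter

# Made-up tokenized messages
docs = [
    ['win', 'a', 'prize', 'now'],
    ['call', 'me', 'now'],
    ['claim', 'your', 'prize', 'prize'],
]

def tf_idf(docs):
    """Return one {word: tf-idf score} dictionary per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for doc_scores in scores:
    print(doc_scores)
```

A word that appears in few messages but often within one message gets a high score, while a word that appears in every message gets a score of zero.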

## Looking at Emails 

We're using data from Andrew Ng's Machine Learning course, which splits the data into training sets and test sets. Since we aren't training on the data, we've created a script (included in the GitHub repo) that combines all of these into a convenient CSV for reading into a pandas dataframe. 
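A combining script of this kind could look roughly like the sketch below. It is not the script from the repo: the function name `combine_emails`, the folder names, and the assumption that each email is a separate `.txt` file inside a per-class folder are all hypothetical, and the demo runs on two made-up emails in a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

def combine_emails(root, out_csv, label_dirs):
    """Write one (label, text) CSV row per .txt file under each folder."""
    rows_written = 0
    with open(out_csv, 'w', newline='', encoding='latin-1') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Spam or Ham', 'Email'])
        for dirname, label in label_dirs.items():
            for path in sorted(Path(root, dirname).glob('*.txt')):
                text = path.read_text(encoding='latin-1')
                # Collapse newlines and repeated spaces into single spaces
                writer.writerow([label, ' '.join(text.split())])
                rows_written += 1
    return rows_written

# Demo on a throwaway directory; folder names are assumptions
with tempfile.TemporaryDirectory() as tmp:
    samples = [('spam-train', 'a.txt', 'free money\nnow'),
               ('nonspam-train', 'b.txt', 'language conference')]
    for dirname, name, body in samples:
        folder = Path(tmp, dirname)
        folder.mkdir()
        (folder / name).write_text(body, encoding='latin-1')

    out_path = Path(tmp, 'email.csv')
    n_rows = combine_emails(tmp, out_path,
                            {'spam-train': 'spam', 'nonspam-train': 'ham'})
    with open(out_path, newline='', encoding='latin-1') as handle:
        rows = list(csv.reader(handle))

print(n_rows, rows)
```

The resulting file has a header row and one row per email, which is the shape `pd.read_csv` expects in the next cell.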

In [30]:
#reading the data into a pandas dataframe
email_spamham_data = pd.read_csv("email.csv", encoding='latin-1')

#renaming the two columns
email_spamham_data.columns = ['Spam or Ham','Email']

email_spamham_data.head()

| | Spam or Ham | Email |
|---|---|---|
| 0 | spam | great parttime summer job display box credit a... |
| 1 | spam | auto insurance rate too high dear nlpeople m s... |
| 2 | spam | want best economical hunt vacation life want b... |
| 3 | spam | email million million email addresses want mon... |
| 4 | spam | amaze world record sex attention warn adult wa... |


### Getting The Top 100 Words in Ham and Spam Messages

In [31]:
spam_words = {} #dictionaries of word -> num occurrences
ham_words = {}

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# We tokenize the words. We strip punctuation but do not convert to lowercase,
# so Free, free, and FREE are all going to be different for our purposes.
# If a word isn't in dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in email_spamham_data.iterrows():
    line = row['Email'].translate(translator) #remove punctuation
    line = line.split() #split on whitespace (case is preserved)
    for word in line:
        if row['Spam or Ham'] == 'spam':
            if word in spam_words:
                spam_words[word] = spam_words[word] + 1
            else:
                spam_words[word] = 1
        else: #if ham
            if word in ham_words:
                ham_words[word] = ham_words[word] + 1
            else:
                ham_words[word] = 1


email_spam_df = pd.DataFrame.from_dict(spam_words,orient='index')
email_spam_df.columns = ['count']
email_spam_df

| | count |
|---|---|
| rome | 1 |
| external | 4 |
| professional | 73 |
| photomask | 7 |
| tc | 1 |
| valref | 1 |
| tarrant | 9 |
| personalize | 7 |
| assault | 1 |
| nwlink | 1 |


### 100 Most Common Words in Spam Emails

In [32]:
email_spam_df['count'].nlargest(100)

email          1754
s              1572
order          1502
report         1315
address        1300
our            1183
mail           1173
program        1046
send           1032
free            953
list            942
receive         887
money           873
name            871
d               841
business        753
one             732
work            675
com             670
nt              662
internet        643
http            610
please          603
day             593
information     589
over            577
check           531
us              502
web             476
each            476
               ... 
company         279
pay             276
hour            270
below           270
computer        267
click           267
advertise       265
place           261
opportunity     260
today           259
message         258
own             257
income          256
cd              253
t               250
easy            249
within          247
world           246
file            246


### Getting The Top 100 Words in Ham Emails

In [33]:
email_ham_df = pd.DataFrame.from_dict(ham_words,orient='index')
email_ham_df.columns = ['count']
email_ham_df

| | count |
|---|---|
| milena | 1 |
| prefere | 1 |
| external | 3 |
| dworkin | 4 |
| morri | 3 |
| hana | 2 |
| radon | 1 |
| liberium | 1 |
| tsimplus | 1 |
| choueka | 3 |


### 100 Most Common Words in Ham Emails

In [34]:
email_ham_df['count'].nlargest(100)

language         1525
university       1268
s                 878
linguistic        660
de                569
information       540
conference        495
workshop          479
english           477
e                 420
email             418
one               398
paper             395
please            371
include           368
edu               364
research          351
address           350
abstract          340
http              335
fax               328
word              317
h                 315
papers            315
d                 302
speech            301
submission        283
theory            281
www               277
m                 276
                 ... 
th                183
between           181
science           178
ac                177
issue             177
example           175
two               175
present           175
available         172
list              172
grammar           172
both              171
write             170
area              169
discussion

The email word counts tell a similar story to the SMS counts. The most frequent spam words ('email', 'order', 'report', 'address', 'free', 'money') relate to selling and mailing lists, while the most frequent ham words ('language', 'university', 'linguistic', 'conference') suggest that the non-spam emails come largely from academic linguistics discussions. Words such as 'email', 'address', 'information', and 'please' are common in both classes, so raw counts alone do not separate spam from ham.

### Finding Words Only In Spam Emails

#### Getting sets of all words in each Dataframe

In [35]:
ham_set = set(email_ham_df.index)
spam_set = set(email_spam_df.index)

#### Doing set difference to get the words found in spam and not in ham

In [36]:
only_spam_set = spam_set.difference(ham_set)

#### Filtering the DataFrame

In [39]:
#This only contains the words used by spammers
#(.ix was removed from pandas; .loc with a list of labels does the same lookup)
only_spam_df = email_spam_df.loc[list(only_spam_set)]
only_spam_df

| | count |
|---|---|
| returned | 2 |
| tc | 1 |
| personalize | 7 |
| assault | 1 |
| nwlink | 1 |
| limitedtime | 1 |
| corte | 1 |
| exciting | 2 |
| selfliquidating | 1 |
| truster | 1 |


In [41]:
only_spam_df['count'].nlargest(100)

capitalfm      196
nbsp           196
ffa            183
floodgate      150
aol            133
bonus          119
mailing        118
investment     118
profit         111
hundred         95
reports         93
stealth         82
links           82
always          75
millions        75
offshore        73
sales           67
invest          62
tm              60
mlm             60
toll            56
amaze           55
recruit         55
album           54
mailer          52
xxx             51
spam            49
isp             48
goldrush        48
cent            48
              ... 
infoseek        30
millionaire     29
expiration      28
amazing         28
alba            28
unsubscribe     28
staggering      27
spider          27
ram             27
href            27
moneymake       27
advertiser      26
vanish          26
largest         26
instant         26
descrambler     26
wrap            25
plans           25
webmaster       25
estate          25
comply          25
teen        

In [38]:
#from pytagcloud import create_tag_image, make_tags
#from pytagcloud.lang.counter import get_tag_counts

#YOUR_TEXT = "A tag cloud is a visual representation for text data, typically\
#used to depict keyword metadata on websites, or to visualize free form text."


#creating a dictionary of the text
#tags = {}

#test = spamham_data.loc[spamham_data['Spam or Ham'] == 'spam']['SMS Message']
    


#tags = make_tags(get_tag_counts(YOUR_TEXT), maxsize=120)

#create_tag_image(tags, 'cloud_large.png', size=(900, 600), fontname='Lobster')