# Authors: Aishwarya Mathew, Vikram Yabannavar

# SPAMIFIER

## References

1. Kaggle: [https://www.kaggle.com/](https://www.kaggle.com/)

## Introduction

Data science is all around us. It is an interdisciplinary field that strives to discover the meaning behind data. Finding a good first data science project is hard: you may know plenty of data science concepts yet get stuck when you try to apply them to real-world tasks. You've come to the right place! Have you ever wanted to build an application that could tell a bad thing from a good thing? This tutorial does just that, and it doubles as a basic introduction to the world of data science. You will learn how to perform text classification by building a spam filter that distinguishes spam from non-spam (which we call ham) in SMS messages and emails. We will take you through the entire data science lifecycle: data collection, data processing, exploratory data analysis and visualization, analysis, hypothesis testing, machine learning, and insight/policy decisions.

### Tutorial Content

This tutorial will go over the following topics:

1. [Installing Libraries](#libraries)
<br>
2. [Loading Data](#load data)
<br>
3. [Processing Data](#process data)
<br>
4. [Exploratory Analysis and Data Visualization](#eda)
<br>
   4.1. [Getting The Top 100 Words in Spam Messages](#spam1)
<br>
   4.2. [Getting The Top 100 Words in Ham Messages](#spam1)

<a id='libraries'></a>
## Installing Libraries

In [30]:
import pandas as pd
import csv
import re
import string
from wordcloud import WordCloud

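If `wordcloud` is not installed, this cell fails with `ImportError: No module named 'wordcloud'`. Installing the package (for example, with `pip install wordcloud`) and re-running the cell resolves the error.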
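
One way to spot missing packages before running the rest of the notebook is to ask Python whether each one can be found. The sketch below is only an illustration: the helper name `missing_packages` is hypothetical, and it assumes the pip package names match the import names (true for `pandas` and `wordcloud`).

```python
import importlib.util

# Third-party packages imported in this tutorial; the standard-library
# modules (csv, re, string) ship with Python and need no installation.
required = ["pandas", "wordcloud"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

for name in missing_packages(required):
    print(f"{name} is not installed; try: pip install {name}")
```

If nothing is printed, both packages are available and the import cell should run.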

<a id='load data'></a>
## Loading Data (Data Collection)

The first step of any data science project is data collection. Kaggle hosts a large number of real-world datasets. To get started, download the SMS spam vs. ham dataset from https://www.kaggle.com/uciml/sms-spam-collection-dataset to your local disk. You will need a Kaggle user account to download any of their datasets. For email data, we will use the dataset provided with Andrew Ng's Machine Learning course at Stanford, which can be found [here](http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex6materials/ex6DataEmails.zip).

<a id='process data'></a>
## Processing Data

Once you have the dataset (a CSV file) on your local disk, we can start the next step: preparing the data. The code below loads the spam vs. ham dataset from disk into a pandas DataFrame.

In [28]:
#reading the data into a pandas dataframe
spamham_data = pd.read_csv("sms.csv", encoding='latin-1')

#removing unnecessary columns
del spamham_data['Unnamed: 2']
del spamham_data['Unnamed: 3']
del spamham_data['Unnamed: 4']
#renaming the remaining two columns
spamham_data.columns = ['Spam or Ham','SMS Message']

#the resulting dataframe with our data
spamham_data.head()

### Check for missing values

| | Spam or Ham | SMS Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
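
A missing-value check in pandas can be sketched as follows. This is a minimal illustration on a tiny made-up frame (the name `sample` and its contents are assumptions, not the real dataset); the same calls would apply to `spamham_data`.

```python
import pandas as pd

# Tiny made-up frame standing in for spamham_data; None marks a missing message.
sample = pd.DataFrame({
    "Spam or Ham": ["ham", "spam", "ham"],
    "SMS Message": ["Ok lar...", None, "See you later"],
})

# isnull() flags missing cells; sum() counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop rows with a missing message so later string operations do not fail.
cleaned = sample.dropna(subset=["SMS Message"])
```

If every count is zero, no rows need to be dropped before tokenizing the messages.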


<a id='eda'></a>
## Exploratory Analysis and Data Visualization

### Getting The Top 100 Words in Spam Messages

In [25]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# We tokenize the words. We strip punctuation but do not convert to lowercase,
# so Free, free, and FREE are all going to be different for our purposes.
# If a word isn't in the dictionary, add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'spam': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.split() #split on whitespace (case is preserved)
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1
                
spam_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
spam_word_df.columns = ['count']
spam_word_df.head()

| | count |
|---|---|
| center | 2 |
| vary | 6 |
| S89 | 1 |
| tonight | 2 |
| CM | 1 |
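
The counting loop above can also be written with `collections.Counter` from the standard library, which handles the "add or increment" logic itself. The sketch below uses two made-up messages in place of the spam rows; it is an alternative formulation, not the code that produced the output shown here.

```python
from collections import Counter
import string

# Made-up messages standing in for the spam rows of spamham_data.
messages = ["FREE entry! Call now", "Call to claim your FREE prize."]

# Translation table that deletes every punctuation character.
translator = str.maketrans("", "", string.punctuation)

counts = Counter()
for message in messages:
    # Strip punctuation, then split on whitespace (case is preserved).
    counts.update(message.translate(translator).split())

print(counts.most_common(3))
```

`Counter.most_common(n)` plays the same role as `nlargest(n)` on the DataFrame column.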


In [7]:
spam_word_df['count'].nlargest(100)

to            608
a             358
call          189
your          187
or            185
you           185
the           178
2             173
for           170
is            149
on            138
Call          137
now           131
have          128
and           119
4             119
from          116
FREE          112
ur            107
with          102
mobile         95
of             93
U              85
claim          78
are            77
You            77
our            76
prize          73
To             73
text           72
             ... 
å£1000         35
Please         34
by             34
1              33
phone          33
Get            33
draw           33
line           33
every          32
Claim          32
150ppm         32
å£2000         31
shows          31
has            30
receive        30
Just           30
me             29
TCs            29
Mobile         28
PO             28
win            28
number         28
å£150          27
apply          27
collection

### Getting The Top 100 Words in Ham Messages

In [26]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# Loop through the above data frame and check if the message is ham. 
# If it is, we strip punctuation, convert the message to lower case, 
# then split it on whitespace. 
# If a word isn't in the dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'ham': 
        line = row['SMS Message'].translate(translator) #remove punctuation
        line = line.lower().split() #convert lowercase and split by space
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1

ham_word_df = pd.DataFrame.from_dict(word_dict,orient='index')
ham_word_df.columns = ['count']
ham_word_df.head()

| | count |
|---|---|
| center | 2 |
| tonight | 57 |
| market | 3 |
| lifeis | 1 |
| hidid | 1 |


#### 100 most common words in non-spam

In [9]:
ham_word_df['count'].nlargest(100)

i        2185
you      1837
to       1554
the      1118
a        1052
u         972
and       848
in        811
me        756
my        743
is        728
it        590
of        524
for       501
that      486
im        449
have      438
but       418
your      414
so        412
are       409
not       406
on        391
at        377
do        377
can       376
if        347
will      334
be        332
2         305
         ... 
home      160
about     159
need      156
sorry     153
from      150
as        146
still     146
see       137
by        135
n         134
later     134
r         131
only      131
da        131
she       130
back      129
think     128
well      126
today     125
send      123
tell      121
cant      118
ì         117
hi        117
did       116
her       113
some      112
much      112
take      112
oh        111
Name: count, dtype: int64

### Finding Words Only In Spam

#### Getting sets of all words in each Dataframe

In [10]:
ham_set = set(ham_word_df.index)
spam_set = set(spam_word_df.index)

#### Doing set difference to get the words found in spam and not in ham

In [11]:
only_spam_set = spam_set.difference(ham_set)

#### Filtering the DataFrame

In [12]:
#This only contains the words used by spammers
#(.ix was removed from pandas; .loc selects rows by label)
only_spam_df = spam_word_df.loc[list(only_spam_set)]
only_spam_df

| | count |
|---|---|
| vary | 6 |
| Polyphonic | 1 |
| S89 | 1 |
| CM | 1 |
| 6031 | 2 |
| Bloomberg | 2 |
| mre | 1 |
| 449071512431 | 1 |
| AREA | 1 |
| 62220Cncl | 1 |
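
Because the ham words were lowercased and the spam words were not, the set difference is case-sensitive: a capitalized spam token has no match among the lowercase ham words. The sketch below illustrates the effect with small made-up counts in plain dictionaries (all names and numbers are assumptions for illustration).

```python
# Made-up counts: spam tokens keep their case, ham tokens were lowercased.
spam_counts = {"Call": 3, "call": 2, "prize": 4, "to": 9}
ham_counts = {"call": 5, "to": 12, "home": 7}

# Case-sensitive difference: "Call" survives because ham only has "call".
only_spam = set(spam_counts) - set(ham_counts)

# Lowercasing the spam tokens first merges "Call" into "call".
spam_lower = {}
for word, count in spam_counts.items():
    key = word.lower()
    spam_lower[key] = spam_lower.get(key, 0) + count
only_spam_lower = set(spam_lower) - set(ham_counts)

print(sorted(only_spam))        # ['Call', 'prize']
print(sorted(only_spam_lower))  # ['prize']
```

Whether to normalize case is a design choice: capitalization such as "FREE" may itself be a spam signal, but leaving it in makes common words look spam-specific.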


#### 100 most common words found only in Spam

In [13]:
only_spam_df['count'].nlargest(100)

Call           137
FREE           112
U               85
claim           78
You             77
prize           73
To              73
Your            71
Txt             70
STOP            62
won             49
Nokia           46
NOW             44
18              43
Free            42
Reply           42
URGENT          41
This            40
Text            40
I               39
No              38
awarded         37
We              36
å£1000          35
Please          34
Get             33
Claim           32
150ppm          32
å£2000          31
Just            30
              ... 
TsCs            17
vouchers        17
Expires         17
tones           17
12hrs           17
Urgent          17
Account         16
C               16
å£250           16
Identifier      16
Statement       16
unredeemed      16
08000930705     16
Dear            16
PRIVATE         16
Bonus           16
unsubscribe     15
Cost            15
MobileUpd8      15
08000839402     15
Had             15
YES         

From these results, we can conclude that text messages containing words related to winning prizes or to monetary values and symbols have a high likelihood of being spam. In the spam-only list above, 'Call' (137 occurrences) and 'FREE' (112) rank highest, followed by 'U' (85), 'claim' (78), 'You' (77), and 'prize' (73); amounts such as `å£1000` (most likely a mis-decoded £1000) also appear. Note that the ham words were lowercased while the spam words were not, so capitalized tokens like 'Call' and 'You' count as spam-only even if their lowercase forms occur in ham. There also seems to be a trend of poor spacing and non-phonetic combinations of letters and numbers (for example, '150ppm' and 'MobileUpd8'), which are common signs of spam messaging. Next we will do a tf-idf analysis to determine how important each word is.
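
To make the idea of tf-idf concrete, here is a minimal sketch in plain Python. It assumes the common textbook form, term frequency times `log(N / df)`, on three made-up tokenized messages; library implementations often use smoothed variants, so their scores will differ.

```python
import math
from collections import Counter

# Made-up, already-tokenized messages; each inner list is one document.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "entry", "call"],
]

def tf_idf(docs):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
print(scores[0])
```

In this toy example "call" appears in every message, so its score is zero, while "prize" appears in only one message and scores highest there.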

## Looking at Emails 

We're using data from Andrew Ng's Machine Learning course, which splits the data into training sets and test sets. Since we aren't training on the data, we've created a script (included in the GitHub repo) that combines all of these into a convenient CSV for reading into a pandas dataframe. 
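
For readers without access to that script, the combining step might look roughly like the sketch below. This is not the authors' script: the function name `combine_to_csv`, the one-email-per-`.txt`-file layout, and the directory names in the usage note are all assumptions.

```python
import csv
from pathlib import Path

def combine_to_csv(labeled_dirs, out_path):
    """Write one (label, text) row per .txt file found in each directory."""
    rows_written = 0
    with open(out_path, "w", newline="", encoding="latin-1") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Spam or Ham", "Email"])
        for label, directory in labeled_dirs.items():
            for path in sorted(Path(directory).glob("*.txt")):
                text = path.read_text(encoding="latin-1")
                # Collapse newlines so each email stays on one CSV row.
                writer.writerow([label, " ".join(text.split())])
                rows_written += 1
    return rows_written
```

With hypothetical folders `emails/spam` and `emails/ham`, a call such as `combine_to_csv({"spam": "emails/spam", "ham": "emails/ham"}, "email.csv")` would produce a file that `pd.read_csv` can load.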

In [29]:
#reading the data into a pandas dataframe
email_spamham_data = pd.read_csv("email.csv", encoding='latin-1')

#renaming the two columns
email_spamham_data.columns = ['Spam or Ham','Email']

email_spamham_data.head()

| | Spam or Ham | Email |
|---|---|---|
| 0 | spam | great parttime summer job display box credit a... |
| 1 | spam | auto insurance rate too high dear nlpeople m s... |
| 2 | spam | want best economical hunt vacation life want b... |
| 3 | spam | email million million email addresses want mon... |
| 4 | spam | amaze world record sex attention warn adult wa... |
| 5 | spam | help loan subject re are debt help qualify fin... |
| 6 | spam | beat irs payno please read found father unite ... |
| 7 | spam | email million million email addresses want mon... |
| 8 | spam | per week home computer put free software comp... |
| 9 | spam | best better newest hottest interactive adult w... |


### Getting The Top 100 Words in Spam Emails

In [15]:
word_dict = {} #dictionary of word -> num occurrences

translator= str.maketrans('','',string.punctuation) #for stripping punctuation

# We tokenize the words. We strip punctuation but do not convert to lowercase,
# so Free, free, and FREE are all going to be different for our purposes.
# If a word isn't in dictionary, we add it with default value 1,
# else, increment the corresponding count.

for item,row in email_spamham_data.iterrows():
    if row['Spam or Ham'] == 'spam': 
        line = row['Email'].translate(translator) #remove punctuation ('Email' is the column name set above)
        line = line.split() #split on whitespace (case is preserved)
        for word in line:
            if word in word_dict:
                word_dict[word] = word_dict[word] + 1
            else:
                word_dict[word] = 1
 


email_spam_df = pd.DataFrame.from_dict(word_dict,orient='index')
email_spam_df.columns = ['count']
email_spam_df

| | count |
|---|---|
| center | 27 |
| vary | 9 |
| ambra | 16 |
| freeware | 1 |
| tonight | 5 |
| karicohen | 2 |
| market | 476 |
| seymour | 1 |
| outcome | 1 |
| ntr | 2 |


### 100 Most Common Words in Spam Emails

In [16]:
email_spam_df['count'].nlargest(100)

email          1754
s              1572
order          1502
report         1315
address        1300
our            1183
mail           1173
program        1046
send           1032
free            953
list            942
receive         887
money           873
name            871
d               841
business        753
one             732
work            675
com             670
nt              662
internet        643
http            610
please          603
day             593
information     589
over            577
check           531
us              502
market          476
each            476
               ... 
company         279
pay             276
below           270
hour            270
click           267
computer        267
advertise       265
place           261
opportunity     260
today           259
message         258
own             257
income          256
cd              253
t               250
easy            249
within          247
file            246
world           246


From these counts, we can see that the most frequent words in spam emails relate to mailing lists, orders, and money: 'email' (1754 occurrences), 'order' (1502), 'report' (1315), and 'address' (1300) lead the list, and 'free' (953), 'money' (873), and 'business' (753) also rank highly. Short tokens such as 's', 'd', 't', and 'nt' are probably leftovers from contractions and possessives after punctuation was removed, so they carry little meaning on their own.

In [31]:
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

YOUR_TEXT = "A tag cloud is a visual representation for text data, typically \
used to depict keyword metadata on websites, or to visualize free form text."

tags = make_tags(get_tag_counts(YOUR_TEXT), maxsize=120)

create_tag_image(tags, 'cloud_large.png', size=(900, 600), fontname='Lobster')

This cell fails with `ImportError: No module named 'pygame'` until `pygame`, which `pytagcloud` depends on, is installed.