# SMS Spam Classification

In this project I will be looking into SMS text data from multiple sources, all collected by [Tiago A. Almeida](http://dcomp.sor.ufscar.br/talmeida/) and [José María Gómez Hidalgo](http://www.esp.uem.es/jmgomez). For more information on how they collected this data, check it out [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

Some notable sources used while performing this analysis and classification: 
- [Ultimate guide to deal with Text Data (using Python)](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)

The data here is a collection of 747 Spam texts along with 4,827 non-spam (HAM) texts. The file is formatted as a plain text file.



In [75]:
import pandas as pd
import matplotlib.pyplot as plt
import string
import seaborn as sns
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from textblob import TextBlob
import re
np.random.seed(0)
%matplotlib inline

### Read in the file. 

From exploring the data, we know that we need to strip the newline characters (__\n__) and that the label and message are separated by a tab (__\t__).

In [2]:
with open('Data/SMSSpamCollection.txt') as f:
    lines = [line.rstrip('\n').split('\t') for line in f]
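As a hedged alternative to the manual split above, the standard-library `csv` module can parse the same tab-separated layout. This is only a sketch: the `sample` string is a made-up stand-in for the file contents, and it assumes each line holds exactly one label, one tab, and one message.

```python
import csv
import io

# Made-up stand-in for the contents of Data/SMSSpamCollection.txt
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# QUOTE_NONE keeps any double quotes inside a message as literal characters
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [row for row in reader]

print(rows)
```

The resulting list of `[label, message]` pairs has the same shape as `lines`, so it could be passed to `pd.DataFrame` in the same way.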

In [3]:
sms_df = pd.DataFrame(lines)

In [4]:
sms_df.head()
sms_df.shape

(5574, 2)

In [5]:
#rename the columns
sms_df.rename(columns={0:'label', 1:'text'},inplace=True)

### Basic Feature Engineering

Before any of this I did a little bit of text exploring to see if anything stood out that might help me; this was crucial for choosing __'flag words'__. Items like word count, character count, number of numerics, number of upper case words, etc. are pretty common practice, so they are great features to add to your data set before cleaning.

1. word count
2. character count
3. Number of numerics
4. Number of upper case
5. Number of Exclamation Points (!)
6. Number of Flag Words
7. Links in message
8. Count of stop words


__1. word count__

In [6]:
sms_df['word_count'] = sms_df.text.apply(lambda x: len(str(x).split(' ')))
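One detail worth knowing about this feature: splitting on a single space counts empty strings whenever a message contains consecutive spaces, while `split()` with no argument splits on runs of whitespace. The small sketch below uses a made-up message to show the difference.

```python
msg = "Ok lar...  Joking"  # made-up message with two spaces before "Joking"

tokens_single_space = msg.split(" ")  # splits at every single space
tokens_whitespace = msg.split()       # splits on runs of whitespace

print(len(tokens_single_space), len(tokens_whitespace))
```

Either choice works as a feature as long as it is applied consistently to every message.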

__2. character count__

In [7]:
sms_df['char_count'] = sms_df.text.str.len() #this includes the spaces

__3. Number of numerics__

In [8]:
sms_df['numerics'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isdigit()]))

__4. Number of upper case words__

This returns how many words in the message are all-caps.

In [9]:
sms_df['upper'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isupper()]))

__5. Number of Exclamation Points (!)__

This splits the message on '!' and returns the number of pieces minus 1, which is the total number of '!' in the message. E.g. for the message 'Hey!' the split returns ['Hey',''], so we subtract one to get the number of exclamation points.

In [10]:
sms_df['bangs'] = sms_df.text.apply(lambda x: len([x for x in x.split('!')]) - 1 )
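The split-and-subtract trick gives the same result as `str.count`, which may read more directly. A minimal sketch comparing the two on made-up messages (the function names are hypothetical):

```python
def count_bangs_split(text):
    # Approach from the cell above: pieces after splitting on '!' minus one
    return len(text.split("!")) - 1


def count_bangs(text):
    # Direct count of the '!' character
    return text.count("!")


examples = ["Hey!", "No bangs here", "WIN!!! Call now!"]
for text in examples:
    print(text, count_bangs_split(text), count_bangs(text))
```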

__6. Flag Words__

Shout-out to [Grace](https://github.com/graceh3) for this idea! 

Possible __"flag"__ words from looking at the first few rows of data:

In [11]:
sms_df['flag_words'] = sms_df.text.apply(lambda x: len([x for x in x.split(' ') 
                                                        if x.translate(str.maketrans('', '', string.punctuation)).lower().strip() 
                                                        in ['winner','urgent','win','won','free','cash','freemsg',
                                                            'stopsms','ppm']]))
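The same count can be sketched as a small named function, which makes it easier to test on a single message. The names `FLAG_WORDS` and `count_flag_words` are hypothetical; the word list is the one used above.

```python
import string

FLAG_WORDS = {'winner', 'urgent', 'win', 'won', 'free', 'cash',
              'freemsg', 'stopsms', 'ppm'}

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def count_flag_words(text):
    # Strip punctuation and lowercase each token before the lookup
    cleaned = (tok.translate(_STRIP_PUNCT).lower() for tok in text.split())
    return sum(tok in FLAG_WORDS for tok in cleaned)


print(count_flag_words("Free entry in 2 a wkly comp to win FA Cup"))
```

Using a set for the lookup keeps the membership test cheap, although with nine words the difference is negligible.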


__7. Links in message__

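We will use regex to see if there are any links in the message, and we also count occurrences of '.com', '.co', and '.uk'. Note that every '.com' also contains '.co', so it is counted twice, and that email addresses are counted as well (see row 136 below).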

In [12]:
p = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

sms_df['links'] = sms_df.text.apply(lambda x: len(re.findall(p, x))) + sms_df.text.apply(lambda x: x.count('.com')+x.count('.co')+x.count('.uk'))
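To make the counting behaviour concrete, the sketch below reproduces the logic of the cell above on one message from the data and compares it with a hypothetical alternative, `count_links_once`, that counts each URL or domain-like token a single time. The alternative pattern is an assumption for illustration, not a complete URL matcher.

```python
import re

URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def count_links_original(text):
    # Same logic as the cell above: regex matches plus raw substring counts
    return (len(re.findall(URL_PATTERN, text))
            + text.count('.com') + text.count('.co') + text.count('.uk'))


# Hypothetical alternative: one match per URL or per token ending in a known suffix
LINK_TOKEN = re.compile(r'https?://\S+|\b\S+\.(?:com|co|uk)\b')


def count_links_once(text):
    return len(LINK_TOKEN.findall(text))


msg = "I only haf msn. It's yijue@hotmail.com"
print(count_links_original(msg), count_links_once(msg))
```

Both versions still treat an email address as a link; whether that is desirable depends on how the feature is meant to be used.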



In [13]:
sms_df.links.sum()

276

In [14]:
sms_df[sms_df['links'] >= 1].head()

| | label | text | word_count | char_count | numerics | upper | bangs | flag_words | links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | spam | XXXMobileMovieClub: To use your credit, click ... | 19 | 149 | 0 | 1 | 0 | 0 | 3 |
| 136 | ham | I only haf msn. It's yijue@hotmail.com | 6 | 38 | 0 | 1 | 0 | 0 | 2 |
| 191 | spam | Are you unique enough? Find out from 30th Augu... | 10 | 72 | 0 | 0 | 0 | 0 | 2 |
| 250 | spam | Congratulations ur awarded 500 of CD vouchers ... | 23 | 150 | 4 | 2 | 0 | 1 | 2 |
| 268 | spam | Ur ringtone service has changed! 25 Free credi... | 27 | 159 | 1 | 5 | 3 | 1 | 2 |


__8. Count of Stop Words__

These are words that don't add any real value to our messages, such as: of, the, on, etc. Without them our sentences would not read well, but they don't impact meaning in the long run.

In [15]:
stop = stopwords.words('english')

sms_df['stp_wrd_cnt'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x in stop]))
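Because this count runs before the text is lower-cased, and NLTK's English stop word list is all lower case, capitalised stop words such as "The" are not counted. The sketch below shows the effect with a tiny made-up stop list standing in for `stopwords.words('english')`; `count_stop_words` is a hypothetical helper.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {'the', 'of', 'on', 'to', 'a', 'in', 'is'}


def count_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase each token so that "The" and "the" are both counted
    return sum(tok.lower() in stop_words for tok in text.split())


msg = "The winner of the prize is on the list"
case_sensitive = sum(tok in STOP_WORDS for tok in msg.split())

print(case_sensitive, count_stop_words(msg))
```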

__Let's look at the first few rows to see how all these new columns look__

So far, these engineered columns are looking _great!_ They should be useful signals for our spam predicting.

In [16]:
sms_df.head()

| | label | text | word_count | char_count | numerics | upper | bangs | flag_words | links | stp_wrd_cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | 20 | 111 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1 | ham | Ok lar... Joking wif u oni... | 6 | 29 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 28 | 155 | 2 | 2 | 0 | 2 | 0 | 5 |
| 3 | ham | U dun say so early hor... U c already then say... | 11 | 49 | 0 | 2 | 0 | 0 | 0 | 2 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 13 | 61 | 0 | 1 | 0 | 0 | 0 | 5 |


### Data Preprocessing

Next, we need to move into data cleaning. This section will be very important for the remainder of this project and the models we run. In the next few cells we will:
1. create a function to remove all punctuation
2. lower case all of the words in our messages
3. remove stop words
4. check for spelling and correct where needed
5. remove frequent words
6. remove rare/uncommon words


#### 1) and 2) get rid of special characters and lower case:

In [17]:
def clean_text_column(row):
    '''
    Takes in a cell from the dataframe, removes every character found in
    string.punctuation, and then lower cases the text.
    '''
    # string is already imported at the top of the notebook
    return row.translate(str.maketrans('', '', string.punctuation)).lower()

In [18]:
sms_df.text = sms_df.text.apply(clean_text_column)

Check what our function did:

In [19]:
print(sms_df.text.iloc[2],sms_df.iloc[2],sep='\n\n')

free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s

label                                                       spam
text           free entry in 2 a wkly comp to win fa cup fina...
word_count                                                    28
char_count                                                   155
numerics                                                       2
upper                                                          2
bangs                                                          0
flag_words                                                     2
links                                                          0
stp_wrd_cnt                                                    5
Name: 2, dtype: object


#### 3. Remove all stop words:

Here we will remove all of the words that do not add value to the meaning of our messages.

In [20]:
stop = stopwords.words('english') #loads the stop words for the english language
sms_df.text = sms_df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop)) 
#returns only words that are not in the list of stop words

#### 4. Correct Obvious Spelling issues:

First, let's fix the most common 'misspelled' words. Afterwards, we will use TextBlob to correct the spelling of all the other ones.

In [21]:
#most common word list:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:50]
common_w 

u         1132
call       578
2          482
im         464
ur         390
get        386
4          293
dont       287
go         281
ok         278
ltgt       276
free       275
know       257
like       244
got        239
ill        239
good       236
come       229
time       208
day        203
love       200
want       193
send       191
text       188
one        171
going      171
ü          169
need       167
txt        163
home       162
lor        160
see        157
sorry      156
still      154
r          153
stop       152
back       152
n          146
reply      144
today      141
mobile     138
tell       137
new        136
later      134
well       134
hi         133
think      132
da         131
please     129
take       126
dtype: int64

A good amount of these are abbreviations, so let's change these in our dataset before we remove frequent and rare words.

In [22]:
rep = {' wif u ':' with you ', ' u r ':' you are ', 'ic': ' i see ', ' i c ': ' i see ',
       ' u c ': ' you see ',' u ':' you ',' n ':' and ',' r ':' are ',' txt ':' text ',
       ' ü ':' you ',' 4 ':' for ',' 2 ':' to ',' ur ':' your ',' da ':' the ',' wif ':' with ',
       ' urself ':' yourself ', ' thats ': ' thats ', ' i‘m ': ' im ', '£':''}

pattern = re.compile("|".join(rep.keys()))
sms_df.text = sms_df.text.apply(lambda x: pattern.sub(lambda m: rep[m.group(0)], x.center(len(x)+2)).strip())
# text = pattern.sub(lambda m: rep[m.group(0)], x.center(len(x)+2)).strip()
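A token-based lookup is one hedged alternative to the padded-string regex above. It replaces only whole tokens, so it cannot match inside longer words, and it handles two abbreviations in a row without relying on the spaces between them. The table below is a hypothetical subset of `rep`, and `expand_abbreviations` is an illustrative name.

```python
# Hypothetical subset of the abbreviation table above, keyed by whole token
ABBREVIATIONS = {
    'u': 'you', 'r': 'are', 'c': 'see', 'ic': 'i see', 'n': 'and',
    'ur': 'your', 'txt': 'text', 'wif': 'with', '2': 'to', '4': 'for',
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    # Swap each whitespace-separated token for its expansion, if it has one
    return ' '.join(table.get(tok, tok) for tok in text.split())


print(expand_abbreviations('click 2 c ur txt wif u'))
```

The trade-off is that multi-word keys such as ' wif u ' have to be expressed as single-token entries instead.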

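__Now if we look at the most common words, we can see the changes that we have made__

Note that the 'ic' key above has no surrounding spaces, so it also matches inside longer words (for example, 'click' becomes 'cl i see k'). That likely explains why 'see', 'i', 'e', 'k', and 'p' rank so high in the counts below; padding the key as ' ic ' would restrict it to the standalone abbreviation.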

In [26]:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:50]
common_w 

you       1203
see        961
i          794
call       578
im         469
to         464
get        386
your       349
text       346
e          320
for        287
dont       287
go         281
ok         278
ltgt       276
free       275
know       257
like       244
ill        239
got        239
good       236
come       229
time       208
k          207
day        203
love       200
want       193
send       191
one        171
going      171
need       167
home       162
lor        160
sorry      157
still      154
back       152
stop       152
are        149
p          146
reply      144
and        141
today      141
mobile     138
tell       137
new        136
later      134
well       134
hi         133
think      132
please     129
dtype: int64

#### 5. Remove Very Frequent Words:
Very common words carry little information for telling messages apart, which is why we will remove the 15 most common words in the cell below. Because this is text data, I decided it would be fine in this case to remove these words. [Here](https://www.quora.com/Why-do-we-remove-frequent-and-infrequent-words-when-in-NLP) is a great explanation for this.

In [28]:
#most common word list:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:15]
#Pretty good sign that most of these seem like filler words

In [29]:
#now that we've checked on them, let's remove them from our texts
frequent_words = set(common_w.index) #build the lookup once instead of once per word
sms_df.text = sms_df.text.apply(lambda x: ' '.join(w for w in x.split() if w not in frequent_words))
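The same idea can be sketched without pandas, which makes the mechanics easier to follow. The messages and the cut-off `top_n` below are made up for illustration; the notebook itself uses the 15 most common words.

```python
from collections import Counter

messages = ['free call now', 'call me now', 'free entry call']  # made-up corpus

# Count every word across all messages
counts = Counter(word for msg in messages for word in msg.split())

top_n = 1  # illustrative cut-off
frequent = {word for word, _ in counts.most_common(top_n)}

# Rebuild each message without the most frequent words
filtered = [' '.join(w for w in msg.split() if w not in frequent)
            for msg in messages]

print(frequent, filtered)
```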

#### 6. Use TextBlob to correct spelling: 

Now that we've removed a good portion of words that add noise to our data, we should clean up the words in one final step to correct spelling. 

There are many libraries to do this, but for convenience's sake I have stuck with this one.

In [None]:
sms_df.text = sms_df.text.apply(lambda x: str(TextBlob(x).correct()))

In [None]:
sms_df.to_csv('Data/spelling_and_features_sms.csv')

### Lemmatizing:

Great article [here](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python) about lemmatizing and stemming with NLTK.

If you get an error about POS (part of speech) tagging, then you may need to download treebank and punkt from NLTK. [Here](https://stackoverflow.com/questions/14089887/nltk-pos-tag-usage) is the resource I used. However, mine did not work until I downloaded 'averaged_perceptron_tagger'.

Below I defined a function that takes the returned pos values from nltk.pos_tag, and returns a compatible version of them. This was a great solution on [stackoverflow](https://stackoverflow.com/questions/5364493/lemmatizing-pos-tagged-words-with-nltk).
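For comparison, a compact way to do this mapping is to branch on the prefix of the Penn Treebank tag. The sketch below is an illustration rather than the notebook's own function: it hard-codes the single-letter codes that WordNet uses for parts of speech so that it runs without NLTK, and the name `treebank_to_wordnet` is hypothetical. Falling back to noun for everything else is an assumption that mirrors the lemmatizer's default.

```python
# Single-letter POS codes used by WordNet (the values of wn.ADJ, wn.VERB, ...)
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'


def treebank_to_wordnet(tag):
    # Adjective tags: JJ, JJR, JJS
    if tag.startswith('JJ'):
        return ADJ
    # Verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('VB'):
        return VERB
    # Adverb tags: RB, RBR, RBS (RP, the particle tag, is left out on purpose)
    if tag.startswith('RB'):
        return ADV
    # Nouns and every other tag fall back to noun
    return NOUN


for tag in ['VBD', 'JJR', 'RB', 'NNS', 'PRP']:
    print(tag, treebank_to_wordnet(tag))
```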

In [82]:
tag_map = {
        'CC':'', # coordin. conjunction (and, but, or)  
        'CD':wn.NOUN, # cardinal number (one, two)             
        'DT':'', # determiner (a, the)                    
        'EX':wn.ADV, # existential ‘there’ (there)           
        'FW':'', # foreign word (mea culpa)             
        'IN':wn.ADV, # preposition/sub-conj (of, in, by)   
        'JJ': wn.ADJ, # adjective (yellow)                  
        'JJR': wn.ADJ_SAT, # adj., comparative (bigger)          
        'JJS': wn.ADJ_SAT, # adj., superlative (wildest)           
        'LS':'', # list item marker (1, 2, One)          
        'MD':'', # modal (can, should)                    
        'NN':wn.NOUN, # noun, sing. or mass (llama)          
        'NNS':wn.NOUN, # noun, plural (llamas)                  
        'NNP':wn.NOUN, # proper noun, sing. (IBM)              
        'NNPS':wn.NOUN, # proper noun, plural (Carolinas)
        'PDT':wn.ADJ, # predeterminer (all, both)            
        'POS':'', # possessive ending (’s )               
        'PRP':'', # personal pronoun (I, you, he)     
        'PRP$':'', # possessive pronoun (your, one’s)    
        'RB':wn.ADV, # adverb (quickly, never)            
        'RBR':wn.ADV, # adverb, comparative (faster)        
        'RBS':wn.ADV, # adverb, superlative (fastest)     
        'RP':wn.ADJ, # particle (up, off)
        'TO':'', # “to” (to)
        'UH':'', # interjection (ah, oops)
        'VB':wn.VERB, # verb base form (eat)
        'VBD':wn.VERB, # verb past tense (ate)
        'VBG':wn.VERB, # verb gerund (eating)
        'VBN':wn.VERB, # verb past participle (eaten)
        'VBP':wn.VERB, # verb non-3sg pres (eat)
        'VBZ':wn.VERB, # verb 3sg pres (eats)
        'WDT':'', # wh-determiner (which, that)
        'WP':'', # wh-pronoun (what, who)
        'WP$':'', # possessive (wh- whose)
        'WRB':'' # wh-adverb (how, where)
    }

def get_wordnet_pos(treebank_tag, tag_map=tag_map):
    '''
    Maps a Penn Treebank tag from nltk.pos_tag to a WordNet POS constant.
    Tags that are missing from tag_map, or mapped to '', fall back to
    wn.NOUN, which is the lemmatizer's default.
    '''
    return tag_map.get(treebank_tag) or wn.NOUN

In [83]:
def lemm_it(msg):
    '''Lemmatizes each token of a message and returns them as one string.'''
    lemmatizer = WordNetLemmatizer()
    pos_tokens = nltk.pos_tag(word_tokenize(msg))
    lemmas = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
              for word, tag in pos_tokens]
    return ' '.join(lemmas)


In [85]:
msg = 'hey how are you doing today you look amazing'
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(msg)
poses = nltk.pos_tag(tokens)
for word, tag in poses:
    print(lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag, tag_map)))

In [68]:
sms_df['lemmed'] = sms_df.text.apply(lemm_it)

In [47]:
sms_df.lemmed

0       jurong point crazy available bugis and great w...
1                                     lar joking with oni
2       free entry wkly comp win fa cup final tkts 21s...
3                           dun say early hor already say
4                     nah think go usf life around though
5       freemsg hey darling 3 week word back id like f...
6           even brother like speak treat like aid patent
7       per request melle melle oru minnaminunginte nu...
8       winner valued network customer selected receiv...
9       mobile 11 month are entitled update latest col...
10      gonna home soon want talk stuff anymore tonigh...
11      six chance win cash 100 20000 pound csh11 send...
12      urgent 1 week free membership 100000 prize jac...
13      ive searching right word thank breather promis...
14                                            date sunday
15      xxxmobilemovieclub use credit cl k wap link ne...
16                                        oh kim watching
17        eh r

### Tokenize the words in each message

Below we will use the NLTK word tokenizer to accomplish this.

In [None]:
def tokenize_message(row):
    return word_tokenize(row)

In [None]:
sms_df.text = sms_df.text.apply(lambda row: tokenize_message(row))

In [None]:
sms_df.head()

In [None]:
def get_all_words(row,li):
    li.append(' '.join(set(row)))
    return li

In [None]:
list_of_words = []
sms_df.text.apply(lambda row: get_all_words(row,list_of_words));

In [None]:
new_list_of_words = ' '.join(list_of_words)

In [None]:
new_list_of_words = new_list_of_words.split(' ')

In [None]:
set(new_list_of_words)
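The cells above collect the vocabulary by joining and re-splitting strings. A set comprehension over the tokenized messages is a shorter sketch of the same idea; the `tokenized` list below is made-up data standing in for the tokenized `sms_df.text` column.

```python
# Made-up stand-in for the tokenized messages (one list of tokens per message)
tokenized = [['free', 'entry', 'win'], ['win', 'cash'], []]

# Collect every distinct token across all messages
vocabulary = {word for tokens in tokenized for word in tokens}

print(sorted(vocabulary))
```

Unlike the join-and-split version, this does not add an empty string to the vocabulary when a message has no tokens left.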