# SMS Spam Classification

In this project I look at SMS text data gathered from multiple sources by [Tiago A. Almeida](http://dcomp.sor.ufscar.br/talmeida/) and [José María Gómez Hidalgo](http://www.esp.uem.es/jmgomez). For more information on how they collected the data, see [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

Some notable sources used while performing this analysis and classification: 
- [Ultimate guide to deal with Text Data (using Python)](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)

The data here is a collection of 747 Spam texts along with 4,827 non-spam (HAM) texts. The file is formatted as a plain text file.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import string
import seaborn as sns
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from textblob import TextBlob
import spacy
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
np.random.seed(0)
%matplotlib inline

### Read in the file. 

From exploring the data, we know that we need to strip the newline characters (__\n__) and that the label and message are separated by a tab (__\t__).

In [2]:
with open('Data/SMSSpamCollection.txt') as f:
    lines = [line.rstrip('\n').split('\t') for line in f]

In [3]:
sms_df = pd.DataFrame(lines)

In [4]:
sms_df.head()
sms_df.shape

(5574, 2)

In [5]:
#rename the columns
sms_df.rename(columns={0:'label', 1:'text'},inplace=True)

le = LabelEncoder()
sms_df['target'] = le.fit_transform(sms_df['label'])

In [6]:
sms_df.head(3)

| | label | text | target |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |


### Basic Feature Engineering

Before any of this, I explored the text a little to see what might help; this was crucial for choosing __'flag words'__. Word counts, character counts, the number of numerics, the number of upper case words, and so on are common features. They need to be computed before cleaning, because cleaning removes the punctuation and capitalization they rely on.

1. word count
2. character count
3. Number of numerics
4. Number of upper case
5. Number of Exclamation Points (!)
6. Number of Flag Words
7. Links in message
8. Count of stop words


__1. word count__

In [7]:
sms_df['word_count'] = sms_df.text.apply(lambda x: len(str(x).split(' ')))
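
One caveat with splitting on a single space: consecutive spaces produce empty strings, and those are counted as words. The sketch below uses a made-up message to compare `split(' ')` with the whitespace-aware `split()`; it is an illustration only, not a change to the feature above.

```python
# Hypothetical message with a double space between "Call" and "now"
message = "Call  now for a free prize"

# split(' ') keeps the empty string that sits between the two spaces
count_single_space = len(message.split(' '))

# split() treats any run of whitespace as one separator
count_whitespace = len(message.split())

print(count_single_space, count_whitespace)
```

If exact word counts matter, `split()` is probably the safer choice; for this feature the difference is likely small.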

__2. character count__

In [8]:
sms_df['char_count'] = sms_df.text.str.len() #this includes the spaces

__3. Number of numerics__

In [9]:
sms_df['numerics'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isdigit()]))

__4. Number of upper case words__

This returns how many words in the message are all-caps.

In [10]:
sms_df['upper'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isupper()]))

__5. Number of Exclamation Points (!)__

This splits the message on '!' and returns the number of pieces minus 1, which is the total number of '!' in the message. For example, 'Hey!' splits into ['Hey', ''], two pieces, so subtracting one gives one exclamation point.

In [11]:
sms_df['bangs'] = sms_df.text.apply(lambda x: len([x for x in x.split('!')]) - 1 )
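
As a quick check of the split-minus-one logic, the sketch below compares it with `str.count` on a made-up message. Both approaches should agree, and `str.count` says the same thing more directly.

```python
message = "Hey! You won!! Reply now"  # made-up example

by_split = len(message.split('!')) - 1  # the approach used in the cell above
by_count = message.count('!')           # direct count of the character

print(by_split, by_count)
```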

__6. Flag Words__

Shout-out to [Grace](https://github.com/graceh3) for this idea! 

Possible __"flag"__ words from looking at the first few rows of data:

In [12]:
sms_df['flag_words'] = sms_df.text.apply(lambda x: len([x for x in x.split(' ') 
                                                        if x.translate(str.maketrans('', '', string.punctuation)).lower().strip() 
                                                        in ['winner','urgent','win','won','free','cash','freemsg',
                                                            'stopsms','ppm']]))


__7. Links in message__

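We will use a regex to find links in the message, and also count the substrings '.com', '.co', and '.uk'. Note that '.com' contains '.co', so every '.com' is counted twice; this is why the hotmail address in row 136 below has a link count of 2.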

In [13]:
# raw string, so the backslashes reach the regex engine unchanged
p = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

sms_df['links'] = sms_df.text.apply(lambda x: len(re.findall(p, x))) + sms_df.text.apply(lambda x: x.count('.com')+x.count('.co')+x.count('.uk'))
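
Because the substring counts overlap ('.com' also matches '.co'), the `links` column is closer to a score than a true count. Below is a minimal sketch of an alternative that counts each whitespace-separated token at most once. The helper name `count_links` and its regex are assumptions for illustration, and the pattern is far from exhaustive.

```python
import re

def substring_count(text):
    # same substring logic as the cell above (without the http regex)
    return text.count('.com') + text.count('.co') + text.count('.uk')

# Hypothetical pattern: a token counts as a link if it starts with
# http(s):// or www., or contains a dot followed by com / co / uk.
LINK_TOKEN = re.compile(r'^(?:https?://|www\.)\S+$|\.(?:com|co|uk)\b', re.IGNORECASE)

def count_links(text):
    # each token contributes at most 1, however many patterns it matches
    return sum(1 for token in text.split() if LINK_TOKEN.search(token))

sample = "I only haf msn. It's yijue@hotmail.com"
print(substring_count(sample), count_links(sample))
```

Whether an email address should count as a link at all is a design choice; the sketch keeps it, as the original feature does.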



In [14]:
sms_df.links.sum()

276

In [15]:
sms_df[sms_df['links'] >= 1].head()

| | label | text | target | word_count | char_count | numerics | upper | bangs | flag_words | links |
|---|---|---|---|---|---|---|---|---|---|---|
| 15 | spam | XXXMobileMovieClub: To use your credit, click ... | 1 | 19 | 149 | 0 | 1 | 0 | 0 | 3 |
| 136 | ham | I only haf msn. It's yijue@hotmail.com | 0 | 6 | 38 | 0 | 1 | 0 | 0 | 2 |
| 191 | spam | Are you unique enough? Find out from 30th Augu... | 1 | 10 | 72 | 0 | 0 | 0 | 0 | 2 |
| 250 | spam | Congratulations ur awarded 500 of CD vouchers ... | 1 | 23 | 150 | 4 | 2 | 0 | 1 | 2 |
| 268 | spam | Ur ringtone service has changed! 25 Free credi... | 1 | 27 | 159 | 1 | 5 | 3 | 1 | 2 |


__8. Count of Stop Words__

These are words that add little meaning on their own, such as 'of', 'the', and 'on'. Sentences read poorly without them, but removing them rarely changes what a message is about.

In [16]:
stop = stopwords.words('english')

sms_df['stp_wrd_cnt'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x in stop]))
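
The membership test above is case-sensitive, and as far as I know the NLTK English stop word list is all lower case, so capitalized words such as "I" or "The" are not counted. A small sketch of the difference, using a tiny made-up stand-in for the real list:

```python
# Tiny stand-in for stopwords.words('english') (assumed to be lower case)
stop = ['i', 'the', 'to', 'he']
stop_set = set(stop)  # set membership avoids scanning the list for every word

message = "I think he goes to The shop"  # made-up example

# what the cell above does: exact, case-sensitive match
case_sensitive = len([w for w in message.split() if w in stop])

# lower-casing each word first also catches "I" and "The"
case_insensitive = len([w for w in message.split() if w.lower() in stop_set])

print(case_sensitive, case_insensitive)
```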

__Let's look at the first few rows to see how all these new columns look__

So far, these engineered columns are looking _great!_ They should help with our spam prediction.

In [17]:
sms_df.head()

| | label | text | target | word_count | char_count | numerics | upper | bangs | flag_words | links | stp_wrd_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | 20 | 111 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | 6 | 29 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 28 | 155 | 2 | 2 | 0 | 2 | 0 | 5 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 | 11 | 49 | 0 | 2 | 0 | 0 | 0 | 2 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 | 13 | 61 | 0 | 1 | 0 | 0 | 0 | 5 |


### Data Preprocessing

Next, we move into data cleaning. This section is very important for the remainder of this project and the models we run. In the next few cells we will:
1. create a function to remove all punctuation
2. lower case all of the words in our messages
3. remove stop words
4. expand common abbreviations
5. remove very frequent words
6. correct the spelling of rare words


#### 1. and 2. Remove special characters and lower case:

In [18]:
def clean_text_column(row):
    '''
    takes in a cell from the dataframe and removes all of the symbols from 
    string.punctuation ('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'), and then lower
    cases each line.
    '''
    # string is already imported at the top of the notebook; the docstring
    # must be the first statement in the function to be picked up as one
    return row.translate(str.maketrans('', '', string.punctuation)).lower()

In [19]:
sms_df.text = sms_df.text.apply(clean_text_column)

Check what our function did:

In [20]:
print(sms_df.text.iloc[2],sms_df.iloc[2],sep='\n\n')

free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s

label                                                       spam
text           free entry in 2 a wkly comp to win fa cup fina...
target                                                         1
word_count                                                    28
char_count                                                   155
numerics                                                       2
upper                                                          2
bangs                                                          0
flag_words                                                     2
links                                                          0
stp_wrd_cnt                                                    5
Name: 2, dtype: object


#### 3. Remove all stop words:

Here we will remove all of the words that do not add value to the meaning of our messages.

In [21]:
stop = stopwords.words('english') #loads the stop words for the english language
sms_df.text = sms_df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop)) 
#returns only words that are not in the list of stop words

#### 4. Correct Obvious Spelling issues:

First, let's fix the most common 'misspelled' words, which are mostly abbreviations. Afterwards, a spelling library (SymSpell, in step 6) corrects the rest.

In [22]:
#most common word list:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:50]
common_w 

u         1132
call       578
2          482
im         464
ur         390
get        386
4          293
dont       287
go         281
ok         278
ltgt       276
free       275
know       257
like       244
ill        239
got        239
good       236
come       229
time       208
day        203
love       200
want       193
send       191
text       188
going      171
one        171
ü          169
need       167
txt        163
home       162
lor        160
see        157
sorry      156
still      154
r          153
stop       152
back       152
n          146
reply      144
today      141
mobile     138
tell       137
new        136
well       134
later      134
hi         133
think      132
da         131
please     129
take       126
dtype: int64

A good number of these are abbreviations, so let's expand them in our dataset before we remove frequent words.

In [23]:
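# NOTE: the 'ic' key has no surrounding spaces, so it also matches inside
# longer words (e.g. 'nice' becomes 'n i see e'); this likely explains the
# high counts for 'see', 'e', 'k' and 'p' in the output of the next cell.
rep = {' wif u ':' with you ', ' u r ':' you are ', 'ic': ' i see ', ' i c ': ' i see ',
       ' u c ': ' you see ',' u ':' you ',' n ':' and ',' r ':' are ',' txt ':' text ',
       ' ü ':' you ',' 4 ':' for ',' 2 ':' to ',' ur ':' your ',' da ':' the ',' wif ':' with ',
       ' urself ':' yourself ', ' thats ': ' thats ', ' i‘m ': ' im ', '£':''}

# escape the keys so they are always treated as literal text in the regex
pattern = re.compile("|".join(map(re.escape, rep)))
# pad each message with a space on both sides so keys like ' u ' can match
# the first and last word, then strip the padding again
sms_df.text = sms_df.text.apply(lambda x: pattern.sub(lambda m: rep[m.group(0)], x.center(len(x)+2)).strip())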
# text = pattern.sub(lambda m: rep[m.group(0)], x.center(len(x)+2)).strip()
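
Substring keys have two side effects: a key without padding spaces (such as 'ic') matches inside longer words, and space-padded keys cannot match two neighbouring abbreviations because they share a space. A token-level lookup avoids both. The sketch below uses a hypothetical helper `expand_abbreviations` with a subset of the mapping; it is an alternative under those assumptions, not the method used to produce the outputs in this notebook.

```python
# Subset of the mapping above, keyed by whole tokens (no padding spaces needed)
abbreviations = {'u': 'you', 'r': 'are', 'ur': 'your', 'n': 'and',
                 'txt': 'text', '2': 'to', '4': 'for', 'wif': 'with',
                 'ic': 'i see'}

def expand_abbreviations(text, mapping=abbreviations):
    """Replace each whole token found in mapping; leave other tokens alone."""
    return ' '.join(mapping.get(token, token) for token in text.split())

# 'nice' stays intact because only whole tokens are looked up
print(expand_abbreviations('u r nice txt me 2 win'))
```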

__Now if we look at the most common words, we can see the changes that we have made__

In [24]:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:50]
common_w 

you       1203
see        961
i          794
call       578
im         469
to         464
get        386
your       349
text       346
e          320
dont       287
for        287
go         281
ok         278
ltgt       276
free       275
know       257
like       244
got        239
ill        239
good       236
come       229
time       208
k          207
day        203
love       200
want       193
send       191
one        171
going      171
need       167
home       162
lor        160
sorry      157
still      154
back       152
stop       152
are        149
p          146
reply      144
and        141
today      141
mobile     138
tell       137
new        136
well       134
later      134
hi         133
think      132
please     129
dtype: int64

#### 5. Remove Very Frequent Words:
Very common words appear in almost every message, so they do little to distinguish one message from another. That is why we remove the 15 most common words in the cell below. Because this is text data, I decided it would be fine in this case to remove these words. [Here](https://www.quora.com/Why-do-we-remove-frequent-and-infrequent-words-when-in-NLP) is a great explanation for this.

In [25]:
#most common word list:
common_w = pd.Series(' '.join(sms_df.text).split()).value_counts()[:15]
#Pretty good sign that most of these seem like filler words

In [26]:
#now that we've checked on them, let's remove them from our texts
frequent_words = set(common_w.index)  #build the lookup once instead of once per word
sms_df.text = sms_df.text.apply(lambda x: ' '.join(w for w in x.split() if w not in frequent_words))

#### 6. Use SymSpell to correct spelling: 

Now that we've removed a good portion of words that add noise to our data, we should clean up the words in one final step to correct spelling. 

Many libraries can do this; SymSpell is very quick and, compared to others I've used, appears to be accurate. We will correct only the words that occur exactly once in the corpus, on the assumption that those are the most likely to be misspellings.

In [27]:
from symspellpy.symspellpy import SymSpell
sym_spell = SymSpell(2, 7)
sym_spell.load_dictionary('frequency_dictionary_en_82_765.txt', 0, 1)

True

In [28]:
#list of words that only occur once:
word_lists = list(sms_df.text.apply(lambda x: x.split(' ')))
all_words = [word for rev in word_lists for word in rev]
corpus_word_counts_df = pd.DataFrame(pd.Series(all_words).value_counts()).reset_index()\
.rename(columns={'index':'words',0:'counts'})

In [29]:
# .copy() gives an independent frame, so adding a column in the next cell
# does not trigger pandas' SettingWithCopyWarning
corpus_word_counts_df_1 = corpus_word_counts_df[corpus_word_counts_df['counts'] == 1].copy()

In [30]:
from tqdm.notebook import tqdm_notebook

tqdm_notebook.pandas(desc="Progress: ")

# word_segmentation returns a result whose first field is the corrected string
corpus_word_counts_df_1['corrected'] = corpus_word_counts_df_1.words.progress_apply(
    lambda w: sym_spell.word_segmentation(w)[0])


Replace words that were corrected in this last step by applying the function we're defining below:

In [31]:
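def replace_fixed_words(message,df):
    '''
    swaps each word in message for its corrected form when df has one;
    all other words are kept as they are.
    '''
    # build the word -> correction lookup once per call instead of
    # scanning the dataframe for every single word
    fixes = dict(zip(df['words'], df['corrected']))
    return ' '.join(fixes.get(word, word) for word in message.split(' '))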

In [32]:
sms_df['text_corr'] = sms_df\
.text.progress_apply(lambda x: replace_fixed_words(x,corpus_word_counts_df_1))




In [33]:
sms_df.to_csv('Data/spelling_and_features_sms.csv')
# sms_df = pd.read_csv('Data/spelling_and_features_sms.csv')

### Lemmatizing:

I originally planned on using NLTK and part-of-speech tags to lemmatize. However, I later discovered spaCy, which is very efficient and accurate. It has excellent [documentation](https://spacy.io/api/annotation#lemmatization) for lemmatizing and everything else the library offers, which is _a lot_.

__Here is a little information on what I previously planned on doing, along with some great articles to help:__

Great article [here](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python) about lemmatizing and stemming with NLTK.

If you get an error about POS (part of speech) tagging, you may need to download treebank and punkt from NLTK. [Here](https://stackoverflow.com/questions/14089887/nltk-pos-tag-usage) is the resource I used. However, mine did not work until I also downloaded 'averaged_perceptron_tagger'.

My plan was to define a function that takes the POS tags returned by nltk.pos_tag and converts them into the form the WordNet lemmatizer accepts. This was a great solution on [stackoverflow](https://stackoverflow.com/questions/5364493/lemmatizing-pos-tagged-words-with-nltk).
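
For reference, the planned NLTK approach could look roughly like the sketch below. `nltk.pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects WordNet's single-letter part-of-speech codes, so a small mapping is needed. The helper name `to_wordnet_pos` is hypothetical, and the sketch uses the letter codes directly so that it runs without any NLTK downloads.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    if treebank_tag.startswith('V'):
        return 'v'  # verb
    if treebank_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback for every other tag

# With the NLTK data installed, usage would look something like:
#   tagged = nltk.pos_tag(word_tokenize(message))
#   lemmas = [WordNetLemmatizer().lemmatize(w, to_wordnet_pos(t)) for w, t in tagged]
print([to_wordnet_pos(tag) for tag in ['JJ', 'VBD', 'RB', 'NN', 'PRP']])
```

The spaCy function used in the next cells replaces all of this, since spaCy tags and lemmatizes in one pass.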


In [34]:
def return_lemma(review,nlp):
    doc = nlp(review)
    return ' '.join([word.lemma_ for word in doc])
    

In [35]:
nlp = spacy.load("en_core_web_sm")
sms_df['lemmed']=sms_df.text.apply(lambda txt: return_lemma(txt,nlp))


In [37]:
sms_df.lemmed[:10]

0    jurong point crazy available bugi and great wo...
1                                    lar joke with oni
2    free entry wkly comp win fa cup final tkts 21s...
3                        dun say early hor already say
4                  nah think go usf life around though
5    freemsg hey darle 3 week word back -PRON- woul...
6        even brother like speak treat like aid patent
7    per request melle melle oru minnaminunginte nu...
8    winner value network customer select receivea ...
9    mobile 11 month be entitle update late colour ...
Name: lemmed, dtype: object

### Tokenize the words in each message

Below, a TF-IDF vectorizer tokenizes each message and turns it into a weighted vector. The same function also train-test splits our data, with a random state of 42. I have chosen an 80/20 split based on the Pareto principle.

First, let's split our columns into predictors (X) and targets (y). We will have two separate sets of predictors: X1 is the lemmatized text, and X2 is the engineered numeric features (everything except the text columns).

In [39]:
sms_df.columns

Index(['label', 'text', 'target', 'word_count', 'char_count', 'numerics',
       'upper', 'bangs', 'flag_words', 'links', 'stp_wrd_cnt', 'text_corr',
       'lemmed'],
      dtype='object')

In [43]:
X1 = sms_df.lemmed
X2 = sms_df.iloc[:,3:11]
y = sms_df.target

__Let us train-test split the non-text predictors using the same parameters (test_size and random_state)__

In [53]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X2, y, test_size=0.20, random_state=42)

In [54]:
def tfidf(X, y): 
    '''
    Generate train and test TF-IDF vectorization for our data set
    
    Parameters
    ----------
    X: pandas.Series object
        Pandas series of text documents to classify 
    y : pandas.Series object
        Pandas series containing label for each document
    Returns
    --------
    tf_idf_train :  sparse matrix, [n_train_samples, n_features]
        Vector representation of train data
    tf_idf_test :  sparse matrix, [n_test_samples, n_features]
        Vector representation of test data
    y_train : array-like object
        labels for training data
    y_test : array-like object
        labels for testing data
    vectorizer : vectorizer object
        fit TF-IDF vectorizer object

    '''
    
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    tf_idf_train = vectorizer.fit_transform(X_train)
    tf_idf_test = vectorizer.transform(X_test)
    return tf_idf_train,tf_idf_test, y_train, y_test, vectorizer

In [55]:
tf_idf_train,tf_idf_test, y_train, y_test, vectorizer = tfidf(X1, y)
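
To make the fit-on-train, transform-on-test step concrete, here is a small pure-Python sketch of the idea. The weighting is the smoothed IDF with L2 normalisation that I believe matches scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`); treat that as an assumption. The helper names `fit_idf` and `transform` are hypothetical, and tokenization is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    # smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """TF-IDF weights for one document, L2-normalised; unseen terms are dropped."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["free entry win", "ok see you", "win cash free"]  # made-up messages
idf = fit_idf(train_docs)

# "unseen" never appeared in training, so it gets no weight at test time
print(transform("free unseen", idf))
```

Learning the vocabulary and IDF weights from the training split alone keeps information about the test messages out of the features, which is why the function above calls `fit_transform` on train and only `transform` on test.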

### Predictions: 

Next, I will use multiple models to predict whether a message is spam or not. Afterwards, I will average their results to make the final prediction.

__First lets do it just for the tf-idf matrix:__

I have defined a function called "classify_text": you pass it the classifier, tf_idf_train, tf_idf_test, and y_train, and it returns the predictions for train and test.

In [56]:
def classify_text(classifier, tf_idf_train, tf_idf_test, y_train):
    '''
    Train a classifier to identify whether a message is spam or ham
    
    Parameters
    ----------
    classifier: sklearn classifier
       initialized sklearn classifier (MultinomialNB, RandomForestClassifier, etc.)
    tf_idf_train : sparse matrix, [n_train_samples, n_features]
        TF-IDF vectorization of train data
    tf_idf_test : sparse matrix, [n_test_samples, n_features]
        TF-IDF vectorization of test data
    y_train : pandas.Series object
        Pandas series containing label for each document in the train set
    Returns
    --------
    train_preds :  list object
        Predictions for train data
    test_preds :  list object
        Predictions for test data
    '''
    clf = classifier
    clf.fit(tf_idf_train, y_train)
    return clf.predict(tf_idf_train), clf.predict(tf_idf_test)

Initialize a random forest classifier and then pass the above function the correct objects:

In [57]:
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_train_preds, rf_test_preds = classify_text(rf_classifier,tf_idf_train, tf_idf_test, y_train)

In [58]:
print(confusion_matrix(y_test, rf_test_preds))
print(accuracy_score(y_test, rf_test_preds))

[[953   1]
 [ 30 131]]
0.9721973094170404
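
Accuracy alone can flatter a model on imbalanced data, so it helps to read the confusion matrix directly. In scikit-learn's layout the rows are true labels and the columns are predictions, with ham encoded as 0 and spam as 1. The sketch below recomputes accuracy from the four cells and adds spam recall, spam precision, and the accuracy of always predicting ham as a baseline.

```python
# Confusion matrix from the output above: rows = true label, columns = prediction
tn, fp = 953, 1    # ham kept as ham, ham flagged as spam
fn, tp = 30, 131   # spam missed, spam caught

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
spam_recall = tp / (tp + fn)      # share of spam messages that were caught
spam_precision = tp / (tp + fp)   # share of spam flags that were correct
baseline = (tn + fp) / total      # accuracy of always predicting ham

print(f"accuracy={accuracy:.4f} recall={spam_recall:.4f} "
      f"precision={spam_precision:.4f} baseline={baseline:.4f}")
```

Almost every spam flag is correct, but 30 of the 161 spam messages (roughly 19%) still get through as ham, which is where there is room to improve.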


#### Our random forest classifier is very accurate. We could definitely stop here; however, let's try to make it even better!

Now, I will do a model for our other predictors: