# SMS Spam Classification

In this project I look at SMS text data from multiple sources, all collected by [Tiago A. Almeida](http://dcomp.sor.ufscar.br/talmeida/) and [José María Gómez Hidalgo](http://www.esp.uem.es/jmgomez). For more information on how they collected the data, see [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

Some notable sources used while performing this analysis and classification: 
- [Ultimate guide to deal with Text Data (using Python)](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)

The data here is a collection of 747 Spam texts along with 4,827 non-spam (HAM) texts. The file is formatted as a plain text file.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import string
import seaborn as sns
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
import re
np.random.seed(0)
%matplotlib inline

### Read in the file. 

From exploring the data, we know that we need to strip the newline characters (__\n__) and that the label and message are separated by a tab (__\t__).

In [2]:
with open('Data/SMSSpamCollection.txt') as f:
    lines = [line.rstrip('\n').split('\t') for line in f]
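
As a small illustration of the parsing step, the sketch below runs the same strip-and-split logic on two made-up lines held in memory. The sample messages are assumptions for illustration, not rows from the dataset.

```python
import io

# Two hypothetical lines in the same "label<TAB>message" layout as the file
sample = io.StringIO("ham\tOk, see you later\nspam\tWIN a prize now!\n")

# Drop the trailing newline, then split each line on the tab
rows = [line.rstrip('\n').split('\t') for line in sample]
print(rows)
```

Each row comes back as a two-element list, which is why the DataFrame built from it has exactly two columns.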

In [3]:
sms_df = pd.DataFrame(lines)

In [4]:
sms_df.head()
sms_df.shape

(5574, 2)

In [6]:
#rename the columns
sms_df.rename(columns={0:'label', 1:'text'},inplace=True)

### Basic Feature Engineering

1. word count
2. character count
3. Number of numerics
4. Number of upper case
5. Number of Exclamation Points (!)
6. Number of Flag Words
7. Links in message


__1. word count__

In [7]:
sms_df['word_count'] = sms_df.text.apply(lambda x: len(str(x).split(' ')))
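
One detail worth keeping in mind: splitting on a single space counts empty strings whenever a message contains consecutive spaces. A minimal sketch on a made-up message (not from the dataset) shows the difference from the argument-free `split()`:

```python
message = "Free  entry now"  # hypothetical message with a double space

# split(' ') splits on every single space, so the double space yields an empty string
by_space = message.split(' ')
# split() with no argument collapses runs of whitespace
by_whitespace = message.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

Either choice can work as a feature, as long as it is applied consistently to every message.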

__2. character count__

In [8]:
sms_df['char_count'] = sms_df.text.str.len() #this includes the spaces

__3. Number of numerics__

In [9]:
sms_df['numerics'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isdigit()]))

__4. Number of upper case words__

This returns how many words in the message are all-caps.

In [10]:
sms_df['upper'] = sms_df.text.apply(lambda x: len([x for x in x.split() if x.isupper()]))
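
To make the behaviour of `str.isupper()` concrete, here is a short sketch on a made-up message. A token counts when it has at least one cased character and all of its cased characters are uppercase, so trailing punctuation does not matter and pure numbers are skipped.

```python
message = "WINNER!! U have WON a prize, call 0800 now"  # hypothetical message

# Keep only the whitespace-separated tokens that str.isupper() accepts
upper_words = [word for word in message.split() if word.isupper()]
print(upper_words)
```

Note that single-letter tokens such as "U" or "I" are counted too.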

__5. Number of Exclamation Points (!)__

This splits the message on '!' and returns the number of resulting pieces minus 1, which is the total number of '!' in the message. For example, the message 'Hey!' splits into ['Hey', ''], so we subtract one to get the number of exclamation points.

In [11]:
sms_df['bangs'] = sms_df.text.apply(lambda x: len([x for x in x.split('!')]) - 1 )
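
As a quick sanity check, the split-based count can be compared with `str.count`, which gives the same number directly. The message below is made up for illustration.

```python
message = "WINNER!! Call now!"  # hypothetical message

# Split-based count: number of pieces minus one
split_count = len(message.split('!')) - 1
# Direct count of the character
direct_count = message.count('!')

print(split_count, direct_count)
```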

__6. Flag Words__

Shout-out to [Grace](https://github.com/graceh3) for this idea! 

Possible __"flag"__ words from looking at the first few rows of data:

In [12]:
sms_df['flag_words'] = sms_df.text.apply(lambda x: len([x for x in x.split(' ') 
                                                        if x.translate(str.maketrans('', '', string.punctuation)).lower().strip() 
                                                        in ['winner','urgent','win','won','free','cash','freemsg']]))
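
The same logic can be sketched as a standalone function, which may be easier to read and test on its own. The name `count_flag_words` and the sample sentence are hypothetical; the word list is the one used above.

```python
import string

FLAG_WORDS = {'winner', 'urgent', 'win', 'won', 'free', 'cash', 'freemsg'}
# Translation table that deletes every punctuation character
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def count_flag_words(text):
    """Count tokens that match a flag word once punctuation and case are ignored."""
    return sum(
        1 for token in text.split(' ')
        if token.translate(STRIP_PUNCT).lower().strip() in FLAG_WORDS
    )

print(count_flag_words("WINNER!! You have won FREE cash."))
```

With a function like this, the column could be filled with `sms_df.text.apply(count_flag_words)`.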


In [18]:
sms_df.loc[15,'text']

'XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL'

__7. Links in message__

We will use a regex to check whether the message contains any links.

In [None]:
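# Group the alternatives so the leading dot applies to each of them; ungrouped,
# the pattern would also match a bare "co" or "uk" inside ordinary words.
p = re.compile(r"[.](?:com|co|uk)\b")

sms_df['links'] = sms_df.text.apply(lambda x: 1 if p.search(x) else 0)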
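
Grouping matters in an alternation like this one, because `|` has the lowest precedence in a regular expression. The sketch below compares an ungrouped pattern with a grouped one on two made-up strings; `example.com` is a placeholder domain, not a link from the data.

```python
import re

ungrouped = re.compile(r"[.]com|co|uk")      # parsed as: ".com" OR "co" OR "uk"
grouped = re.compile(r"[.](?:com|co|uk)\b")  # a dot followed by com, co, or uk

plain = "could you come over"                  # no link, but contains "co"
linked = "click here http://example.com?n=1"   # contains a ".com" link

for text in (plain, linked):
    print(bool(ungrouped.search(text)), bool(grouped.search(text)), text)
```

The ungrouped pattern flags the plain message as containing a link, while the grouped pattern only fires on the dotted suffix.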



In [None]:
sms_df.loc[2,'text']

In [None]:
sms_df.links

__Let's look at the first few rows to see how all these new columns look__

In [None]:
sms_df

### Data Cleaning

Next, we move on to data cleaning. This section is important for the remainder of this project and the models we run. In the next few cells we will:
1. create a function to remove all punctuation
2. lowercase all of the words in our messages
3. remove stop words
4. check spelling and correct where needed


__1) and 2) get rid of special characters and lower case:__

In [None]:
def clean_text_column(row):
    """
    Takes in a cell from the dataframe, removes every character found in
    string.punctuation, and then lower-cases the result.
    """
    # string is already imported at the top of the notebook
    return row.translate(str.maketrans('', '', string.punctuation)).lower()
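
To see what this cleaning does to a single message, here is a minimal sketch that applies the same translate-and-lower logic to a made-up string.

```python
import string

sample = "WINNER!! Call 0800-123 now, it's FREE."  # hypothetical message

# Delete every punctuation character, then lower-case what is left
cleaned = sample.translate(str.maketrans('', '', string.punctuation)).lower()
print(cleaned)
```

Exclamation points and capitalisation are gone after this step, which is why the count-based features above are computed before the text is cleaned.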

In [None]:
sms_df.text = sms_df.text.apply(lambda row: clean_text_column(row))

Check what our function did:

In [20]:
sms_df

| | label | text | word_count | char_count | numerics | upper | bangs | flag_words |
|---|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 20 | 111 | 0 | 0 | 0 | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 6 | 29 | 0 | 0 | 0 | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 28 | 155 | 2 | 2 | 0 | 2 |
| 3 | ham | U dun say so early hor... U c already then say... | 11 | 49 | 0 | 2 | 0 | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 13 | 61 | 0 | 1 | 0 | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 32 | 147 | 1 | 0 | 2 | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 16 | 77 | 0 | 0 | 0 | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 26 | 160 | 0 | 0 | 0 | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 26 | 157 | 1 | 2 | 3 | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 29 | 154 | 2 | 3 | 1 | 2 |


__3) and 4) correct spelling with replacement and TextBlob, then remove stop words:__


In [None]:
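# str.replace does not accept a dict, so apply each pair in turn.
# ' urself ' is space-padded so that 'yourself' is not rewritten to 'yoyourself'.
shorthand = {' c ': ' see ', ' u ': ' you ', ' r ': ' are ', ' n ': ' and ',
             ' wif ': ' with ', ' urself ': ' yourself ', ' v ': ' very '}
for short, full in shorthand.items():
    sms_df['text'] = sms_df['text'].str.replace(short, full, regex=False)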

In [None]:
sms_df['text'] = sms_df['text'].apply(lambda x: str(TextBlob(x).correct()))

In [None]:
stop = stopwords.words('english') #loads the stop words for the english language
sms_df['text'] = sms_df['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop)) 
#returns only words that are not in the list of stop words
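
The filtering step can be sketched without NLTK by using a small stand-in list. The set `stop_sketch` and the helper `drop_stop_words` below are hypothetical and far smaller than the real English stop word list; they only illustrate the mechanics.

```python
# A tiny hand-picked stop list, standing in for stopwords.words('english')
stop_sketch = {'i', 'am', 'the', 'to', 'a', 'is'}

def drop_stop_words(text, stop_words):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(drop_stop_words("i am going to the shop", stop_sketch))
```

Holding the stop words in a set rather than a list keeps each membership test constant-time, which can help when the filter runs over thousands of messages.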

In [None]:
def get_all_uni_words(df):
    word_li = []
    for i in df.text:
        word_li.append(i.split(' '))
    return  [word for sublist in word_li for word in sublist]

In [None]:
unique_words = set(get_all_uni_words(sms_df))
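
The flatten-then-deduplicate step can also be written as a single set comprehension. The sketch below uses three made-up messages in place of the `text` column.

```python
messages = ["free entry now", "call now", "free call"]  # hypothetical cleaned messages

# Split every message on spaces and collect each distinct word once
unique = {word for message in messages for word in message.split(' ')}
print(sorted(unique))
```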

In [None]:
unique_words

__4. Correct spelling with chained replacements__

In [None]:
sms_df['text'] = sms_df['text'].apply(lambda x: x.replace(' c ',' see ').replace(' u ',' you ')
                     .replace(' r ',' are ').replace(' n ',' and ').replace(' wif ',' with '))

### Tokenize the words in each message

Below, we use the NLTK word tokenizer to accomplish this.

In [None]:
def tokenize_message(row):
    return word_tokenize(row)

In [None]:
sms_df.text = sms_df.text.apply(lambda row: tokenize_message(row))

In [None]:
sms_df.head()

In [None]:
def get_all_words(row,li):
    li.append(' '.join(set(row)))
    return li

In [None]:
list_of_words = []
sms_df.text.apply(lambda row: get_all_words(row,list_of_words));

In [None]:
new_list_of_words = ' '.join(list_of_words)

In [None]:
new_list_of_words = new_list_of_words.split(' ')

In [None]:
set(new_list_of_words)

### Vectorize

