# Week 10/11: Spam Text Messgae Filter 

For this week's assignment I will be looking at a text messgaes that have been labeled as either spam or ham (not spam). The idea is to split the total group into training and test sets and then run Naive Bayes Classifier on the test set to see if it can guess which messages are spam or ham.

In [1]:
import pandas as pd
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
%matplotlib inline

# Get the data
I've uploaded the text messgaes file to my githup here: https://github.com/dquarshie89/Data-620/blob/master/spam.csv

In [3]:
spam = pd.read_csv('spam.csv', header = 0, encoding='latin-1')

#Remove unwanted columns
spam=spam.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'])

#Preview data
print(spam.head(5))

#See how many spam messages there are
print(len(spam[spam.v1=='spam']))

#See how many ham messages there are
print(len(spam[spam.v1=='ham']))

     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
747
4825


Now that we have the csv read we will clean up the messages so that our predictor can work with them. We'll make all the text lower case, get rid of any punctuations, remove stopwords, and allow stem sentences. Stemming is the process of shorting long unneeded words or sentences. For example fishy or fish-like will simply become fish

In [4]:
def cleaning(text):
    text = text.lower() #Make all the text lowercase
    text = ''.join([t for t in text if t not in string.punctuation]) #Get rid of puntucations
    text = [t for t in text.split() if t not in stopwords.words('english')] #Remove stop words
    st = Stemmer() #Stem sentences to reduce inflections(https://en.wikipedia.org/wiki/Stemming)
    text = [st.stem(t) for t in text]
    return text

# TfidfVectorizer
TfidfVectorizer (term frequency–inverse document frequency) is a method that will convert our text message file into a 2D matrix and see which words are most important. It will seee how often a word shows up and then proceed to give the word a weighted score. We'll use these scores in our prediction to see which text messages are spam or ham. 

In [5]:
tfidfv = TfidfVectorizer(cleaning)
data = tfidfv.fit_transform(spam['v2'])

# Naive Bayes
Withv our vectors set up and scores in place we're ready to make our classifier. We can use MultinomialNB to train our set using the TFIDF scores.

In [6]:
text_filter = Pipeline([
    ('vectorizer', TfidfVectorizer(cleaning)), #weighted TFIDF score
    ('classifier', MultinomialNB()) #train on TFIDF vectors with Naive Bayes
])

In [7]:
#Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(spam['v2'], spam['v1'], test_size=0.2)

text_filter.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8',
        input=<function cleaning at 0x1a0d94bd90>, lowercase=True,
        max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1),
        norm='l2'...      vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [8]:
#Predict on test set
predictions = text_filter.predict(x_test)

Now that NB has ran predictions to see which messages were spam or ham let's see how many in actually got correct.

In [10]:
count = 0
for i in range(len(y_test)):
    if y_test.iloc[i] == predictions[i]:
        count += 1
        
print('Number of test cases:', len(y_test))
print('Number of correct of predictions:', count)

Number of test cases: 1115
Number of correct of predictions: 1071


We can see that our NB predictor was able to get 1071 messages labeled correctly, 96%. 

# Conclusion
Our method was successful in getting 96% of the predictions correct. Having a small dataset set us a back a bit and using Naive Bayes which assumes features are indepent of each other may have been a drawback. However getting 96% correct is great and shows that NB can be used to help us label what we'd want to see.