# IST 371 Machine Learning

## Chapter 14 PA

### Spring 2019

### Alex Lange, Aidan Polivka

This assignment uses an RNN to classify SMS messages (texts) as spam or "ham," i.e., non-spam. Use the provided `sms-spam.csv` file as your dataset, and fill in the following code cells, using the sarcasm detection example as a basis.

In [1]:
# read in data and display the first few lines
import pandas as pd

spamhamdata = pd.read_csv('sms-spam.csv')

spamhamdata.head()

| Unnamed: 0 | is-spam | sms |
|---|---|---|
| 0 | 0 | Go until jurong point\r\n0,Ok lar... Joking wi... |
| 1 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 2 | 1 | SIX chances to win CASH! From 100 to 20\r\n1,U... |
| 3 | 0 | I've been searching for the right words to tha... |
| 4 | 0 | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [3]:
# lowercase the sms column, and display the first few
# rows of the dataframe
spamhamdata['sms'] = spamhamdata['sms'].apply(lambda x: x.lower())
spamhamdata.head()

| Unnamed: 0 | is-spam | sms |
|---|---|---|
| 0 | 0 | go until jurong point\r\n0,ok lar... joking wi... |
| 1 | 1 | freemsg hey there darling it's been 3 week's n... |
| 2 | 1 | six chances to win cash! from 100 to 20\r\n1,u... |
| 3 | 0 | i've been searching for the right words to tha... |
| 4 | 0 | i have a date on sunday with will!! |


In [6]:
# perform other data clean up (removing non-alpha characters, 
# stopwords, and lemmatizing), then print out the first few
# lines of the resulting array

import nltk
# preparation for step 3 -- download and load stopwords
nltk.download('stopwords') # only need to run once
# get a list of english stopwords
stopwords = nltk.corpus.stopwords.words('english')
# display the stopwords 
print(sorted(stopwords))
import re
from nltk.stem import WordNetLemmatizer

lemm = WordNetLemmatizer()

def headlineToList(headline):
    # split into individual words
    headline = headline.split(' ')
    # remove non-alpha characters using a regular expression.
    # [^a-zA-Z\s] matches any character that's not alphabetic or
    # whitespace; matching characters are removed.
    # (raw string, so \s is passed to the regex engine unchanged)
    headline = [re.sub(r'[^a-zA-Z\s]', '', word) for word in headline]
    # only keep words that are not stopwords
    headline = [word for word in headline if word not in stopwords]
    # lemmatize words
    headline = [lemm.lemmatize(word) for word in headline]
    return headline

# 3. remove stopwords
# 4. remove non-alphabetic characters
# 5. Lemmatize
nltk.download('wordnet') # this is also a one-off download
# headlines will be a list of lists.
# Each list in headlines contains the words
# of each sms, minus stopwords and non-alpha
# characters
headlines = []

# loop thru the sms column
for row in range(0, spamhamdata.shape[0]):
    # grab the current sms
    headline = spamhamdata.sms[row]
    
    # append to the list
    headlines.append(headlineToList(headline))

    
# headlines is a list, not a Pandas dataframe, so we don't have
# a head() function. So, let's use a loop to display the first
# 20 processed headlines
for i in range(20):
    print(i, headlines[i])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alex\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Alex\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',
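
The cleanup order in `headlineToList` can be illustrated on a single message with a minimal, standard-library-only sketch. The stopword set below is a tiny hypothetical subset of the NLTK list and lemmatization is left out, so this only shows the split, strip, and filter steps rather than reproducing the cell exactly.

```python
import re

# hypothetical subset of the NLTK English stopword list (assumption)
SAMPLE_STOPWORDS = {"i", "the", "to", "don't"}

def clean_tokens(text, stopwords=SAMPLE_STOPWORDS):
    """Split on spaces, strip non-alphabetic characters, then drop stopwords."""
    tokens = text.lower().split(' ')
    tokens = [re.sub(r'[^a-zA-Z\s]', '', tok) for tok in tokens]
    return [tok for tok in tokens if tok not in stopwords]

tokens = clean_tokens("I don't want to win 100 prizes!!")
print(tokens)  # ['dont', 'want', 'win', '', 'prizes']
```

Because punctuation is stripped before the stopword check, "don't" becomes "dont" and is kept, and a purely numeric token such as "100" is left behind as an empty string. Checking stopwords before stripping, and dropping empty tokens afterwards, would avoid both effects.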

In [7]:
# use a set to determine the number of distinct words in 
# the sms data
wordSet = set()

# populate the set
for line in headlines:
    for word in line:
        wordSet.add(word)
        
# display number of distinct words in the sms data
print(len(wordSet), 'distinct words in the dataset')

8859 distinct words in the dataset


In [8]:
# tokenize the data and pad it so each row is of
# same length 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# now we will convert the words in the headlines into 
# numeric values

# maximum number of words to tokenize. We could make this
# smaller (say, 1000 or 2000). Training would be faster, but
# we might lose some important information (the rarer words 
# would be ignored). 
max_words = 8900   

# create a tokenizer
tokenizer = Tokenizer(num_words = max_words)

# fit the tokenizer to our data
tokenizer.fit_on_texts(headlines)

# convert each headline to a sequence of integer word indices
X = tokenizer.texts_to_sequences(headlines)

# make each tokenized headline the same length
# (shorter ones are left-filled with the value 0)
X = pad_sequences(X)

Using TensorFlow backend.
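
As a rough illustration of what the tokenizing and padding steps produce, here is a plain-Python sketch. It is not the Keras implementation; the helper names `fit_word_index` and `to_padded_sequences` are hypothetical, and the three short documents are made up.

```python
from collections import Counter

def fit_word_index(docs):
    """Map each word to an integer rank: 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for doc in docs for word in doc)
    # most_common() orders by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(docs, word_index):
    """Replace words with their indices, then left-pad with 0 to the longest length."""
    seqs = [[word_index[w] for w in doc if w in word_index] for doc in docs]
    max_len = max(len(s) for s in seqs)
    return [[0] * (max_len - len(s)) + s for s in seqs]

docs = [["free", "cash", "prize"], ["ok", "free"], ["free"]]  # made-up data
word_index = fit_word_index(docs)
padded = to_padded_sequences(docs, word_index)
print(word_index)  # {'free': 1, 'cash': 2, 'prize': 3, 'ok': 4}
print(padded)      # [[1, 2, 3], [0, 4, 1], [0, 0, 1]]
```

Every row ends up the same length, which is what lets `X` be stored as a single 2-D array and fed to the embedding layer.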


In [11]:
# convert the 'is-spam' labels to one-hot format using
# the pd.get_dummies() function
y = pd.get_dummies(spamhamdata['is-spam']).values

# look at the new labels
y

array([[1, 0],
       [0, 1],
       [0, 1],
       ...,
       [0, 1],
       [1, 0],
       [1, 0]], dtype=uint8)
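
The one-hot layout above (column 0 for ham, column 1 for spam) can be reproduced by hand. The following is a minimal sketch with made-up labels; `one_hot` is a hypothetical helper, not a pandas function.

```python
def one_hot(labels, num_classes=2):
    """Return one row per label with a 1 in the column of that label's class."""
    return [[1 if cls == label else 0 for cls in range(num_classes)]
            for label in labels]

labels = [0, 1, 1, 0]  # made-up is-spam values
encoded = one_hot(labels)
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]

# taking the position of the 1 in each row recovers the original label
decoded = [row.index(1) for row in encoded]
print(decoded)  # [0, 1, 1, 0]
```

The two-column format matches the two-unit softmax output layer used later, where each column holds the predicted probability of one class.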

In [12]:
# divide into testing and training data
from sklearn.model_selection import StratifiedShuffleSplit

# divide into training and non-training data
splitter = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 68333)
for train_index, test_index in splitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
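
To show what "stratified" means here, the sketch below splits a made-up label list so that both parts keep the same spam rate. It is a simplified stand-in for the idea, not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # shuffle and split each class separately so proportions are preserved
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 80 + [1] * 20  # made-up data: 20% spam
train_idx, test_idx = stratified_split(labels)
spam_rate_train = sum(labels[i] for i in train_idx) / len(train_idx)
spam_rate_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))    # 80 20
print(spam_rate_train, spam_rate_test)  # 0.2 0.2
```

With an unstratified random split, a small test set could end up with noticeably more or less spam than the training set, which would make the test accuracy harder to interpret.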

    

In [13]:
# define your RNN model, and compile it
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.backend import clear_session

# standalone Keras is used here rather than tf.keras (same API);
# clear_session() discards any model left over from an earlier run
clear_session()

# set up the model
model = Sequential([
    Embedding(input_dim = max_words, output_dim = 150, input_length = X.shape[1]),
    LSTM(units = 256, dropout = 0.2, recurrent_dropout = 0.2),
    Dense(256, activation = 'elu'),
    Dropout(0.5),
    Dense(128, activation = 'elu'),
    Dropout(0.5), 
    Dense(56, activation = 'elu'),
    Dropout(0.5), 
    Dense(2, activation = 'softmax')
])

model.compile(
    loss = 'categorical_crossentropy', 
    optimizer = 'adam', 
    metrics = ['accuracy']
)


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
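
For a sense of the model's size, the trainable parameter count of each layer can be worked out by hand from the layer sizes in the cell above. This sketch assumes the standard formulas for embedding, LSTM, and dense layers with biases enabled (the Keras defaults); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-long vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(8900, 150),
    "lstm": lstm_params(150, 256),
    "dense_256": dense_params(256, 256),
    "dense_128": dense_params(256, 128),
    "dense_56": dense_params(128, 56),
    "dense_2": dense_params(56, 2),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")  # total: 1,857,794
```

Under these assumptions the embedding layer alone holds about 72% of the parameters, which is why lowering `max_words` or `output_dim` is the quickest way to shrink the model.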


In [14]:
import tensorflow as tf

# create a checkpointer that saves the weights whenever the
# validation accuracy reaches a new best; if accuracy drops
# as training progresses, the best weights are still on disk.
# (this Keras version names the metric 'val_acc'; newer
# releases call it 'val_accuracy')
checkpointer = tf.keras.callbacks.ModelCheckpoint(
    filepath = 'best_weights.hdf5',
    monitor = 'val_acc',
    verbose = 1,
    save_best_only = True
)
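
The save-best-only behavior can be sketched in plain Python. The validation accuracies below are made up, and `best_epoch_tracker` is a hypothetical helper that only mimics the decision of when to save, not the Keras callback itself.

```python
def best_epoch_tracker(val_accuracies):
    """Return the best accuracy and the epochs at which a checkpoint would be written."""
    best = float('-inf')
    saved_epochs = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        # save only on a strict improvement over the best value seen so far
        if acc > best:
            best = acc
            saved_epochs.append(epoch)
    return best, saved_epochs

history_val_acc = [0.90, 0.92, 0.91, 0.92, 0.95]  # made-up values
best, saved_epochs = best_epoch_tracker(history_val_acc)
print(best, saved_epochs)  # 0.95 [1, 2, 5]
```

Epoch 4 ties the earlier best of 0.92 and is therefore not saved; only a strictly higher value overwrites the file.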

In [15]:
# train the model 
history = model.fit(
    X_train, y_train,
    epochs = 10,
    batch_size = 32,
    validation_split = 0.2, # use 20% of training data for validation
    callbacks = [checkpointer],
    verbose = 1
)


Instructions for updating:
Use tf.cast instead.
Train on 1852 samples, validate on 464 samples
Epoch 1/10

Epoch 00001: val_acc improved from -inf to 0.91379, saving model to best_weights.hdf5
Epoch 2/10

Epoch 00002: val_acc did not improve from 0.91379
Epoch 3/10

Epoch 00003: val_acc did not improve from 0.91379
Epoch 4/10

Epoch 00004: val_acc did not improve from 0.91379
Epoch 5/10

Epoch 00005: val_acc did not improve from 0.91379
Epoch 6/10

Epoch 00006: val_acc did not improve from 0.91379
Epoch 7/10

Epoch 00007: val_acc did not improve from 0.91379
Epoch 8/10

Epoch 00008: val_acc did not improve from 0.91379
Epoch 9/10

Epoch 00009: val_acc did not improve from 0.91379
Epoch 10/10

Epoch 00010: val_acc did not improve from 0.91379


In [16]:
# reload the best weights found during training,
# and save the full model so it can be reused later
model.load_weights('best_weights.hdf5')
model.save('shapes_cnn.h5')


In [17]:
# evaluate the model's performance on the test data
score = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


Test loss: 0.2223778432813184
Test accuracy: 0.9206896551724137
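
The accuracy figure compares the most probable class in each softmax output with the position of the 1 in the one-hot label. A minimal sketch with made-up labels and probabilities (the `argmax` and `accuracy` helpers are hypothetical, not Keras functions):

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the predicted class matches the true class."""
    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_prob))
    return correct / len(y_true_one_hot)

y_true = [[1, 0], [0, 1], [1, 0], [1, 0]]                   # made-up labels
y_prob = [[0.9, 0.1], [0.3, 0.7], [0.4, 0.6], [0.8, 0.2]]   # made-up softmax outputs
acc = accuracy(y_true, y_prob)
print(acc)  # 0.75
```

When one class is much more common than the other, it helps to compare the accuracy with the share of the majority class in the test labels, since a model that always predicts that class would already score that high.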
