# Spam Detection with an RNN
Spam, whether in emails, text messages, or other channels, is a nuisance. Machine learning has made it much easier to filter out. This notebook shows how quickly and simply a Recurrent Neural Network (RNN) can be trained to distinguish spam from non-spam ("ham"). The dataset is the "SMS Spam Collection", which can be found on Kaggle at "https://www.kaggle.com/ishansoni/sms-spam-collection-dataset".

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

Using TensorFlow backend.


## Read data

In [2]:
df = pd.read_csv('spam.csv')
df.head()

,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Create target column
Map the text in the label column to an integer (spam = 1, ham = 0) and store it in a new column named target.

In [3]:
df['target'] = df['label'].map( {'spam':1, 'ham':0 })

## Separate training and test data
Done manually here: df.sample draws 80% of the rows for training, and df.drop keeps the remaining 20% for testing.

In [4]:
# Separate training and test data
df_train = df.sample(frac=.8, random_state=11)
df_test = df.drop(df_train.index)
print(df_train.shape, df_test.shape)

(4458, 3) (1114, 3)
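
The split works because df.sample returns rows with their original index labels, so dropping those labels leaves exactly the remaining rows. A minimal sketch on a toy frame, assuming the index labels are unique (the data and the names `toy`, `toy_train`, `toy_test` are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the SMS data (values are illustrative only).
toy = pd.DataFrame({"label": ["ham", "spam"] * 5,
                    "sms": [f"message {i}" for i in range(10)]})

# Draw 80% of the rows at random; random_state makes the draw reproducible.
toy_train = toy.sample(frac=0.8, random_state=11)
# Everything that was not drawn becomes the test set (needs unique index labels).
toy_test = toy.drop(toy_train.index)

print(toy_train.shape, toy_test.shape)
```

If the index contained duplicate labels, drop would remove every row sharing a sampled label, so the two parts would no longer add up to the full frame.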


## Create y-data for analysis

In [5]:
y_train = df_train['target'].values
y_test = df_test['target'].values
y_test.shape

(1114,)

## Create x-data for analysis

In [6]:
X_train = df_train['sms'].values
X_test = df_test['sms'].values

## Tokenize
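word_dict is a dictionary that maps each index to a word. Indices are assigned by frequency, so the most frequent word gets index 1; index 0 is never assigned and stays free for padding.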

In [7]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
word_dict = tokenizer.index_word
print(len(word_dict))
print(word_dict)

#for key in word_dict.keys():
#    print(key, word_dict[key])

7982
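
To make the index ordering concrete, here is a rough pure-Python approximation of the index that the tokenizer builds. It is a simplified sketch, not the Keras implementation: the helper names `simple_tokens` and `build_index_word` are hypothetical, and the `FILTERS` string is assumed to mirror the Keras default punctuation filter.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the Keras defaults).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    # Lowercase, blank out punctuation, split on whitespace.
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_index_word(texts):
    counts = Counter(tok for text in texts for tok in simple_tokens(text))
    # Index 1 goes to the most frequent word; 0 is left free for padding.
    return {i: word for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["Free entry, free prize!", "Call me when you are free", "Call now"]
toy_index_word = build_index_word(toy_texts)
print(toy_index_word)
```

In this toy corpus "free" occurs three times and "call" twice, so they receive indices 1 and 2.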


## Create sequences from sentences
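texts_to_sequences replaces each word in the training and test data with its index number from the tokenizer's vocabulary. Because the tokenizer was fitted on the training data only, test words it has never seen are dropped.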

In [8]:
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
print(X_train_seq[:5])
print(df_train.iloc[0,:])
for el in X_train_seq[0]:
    print(word_dict[el], end=' ')


[[172, 211, 12, 13, 87, 92, 45, 8, 32, 3799, 231, 9, 7, 86, 6, 81, 1020, 5, 3800, 7, 1999, 11, 635, 241, 21, 25, 436, 928, 1110, 178, 131, 206, 929, 2564, 23, 1, 154, 80, 2, 110, 82, 48, 2, 135, 11, 929, 227, 98, 1639], [257, 307, 2, 1426, 2565, 6, 33, 30, 1245, 1246, 15, 49, 5, 337, 709, 7, 1427, 1428, 581, 68, 34, 2000, 88, 2, 2001], [22, 636, 13, 283, 211, 7, 26, 3, 17, 94, 1429, 67], [13, 296, 2, 30, 18, 4, 2002, 1640, 491, 16, 22, 1247, 37, 930, 258, 183, 931, 671, 401, 349, 1111, 1112, 1113, 1114, 1021, 8, 4, 553, 360, 16], [99, 203, 166, 1, 184, 3, 117, 3801, 148, 2, 52, 48, 3802, 22]]
label                                                   ham
sms       Thanks again for your reply today. When is ur ...
target                                                    0
Name: 4460, dtype: object
thanks again for your reply today when is ur visa coming in and r u still buying the gucci and bags my sister things are not easy uncle john also has his own bills so i really need to think abou
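
The conversion can be pictured as a dictionary lookup. The sketch below uses a hypothetical four-word vocabulary (`toy_word_index`) and a hypothetical helper (`to_sequence`) with whitespace-only tokenization; it illustrates how a tokenizer created without an oov_token skips unknown words.

```python
# Hypothetical vocabulary, as if learned from training texts (word -> index).
toy_word_index = {"free": 1, "call": 2, "now": 3, "prize": 4}

def to_sequence(text, word_index):
    # Simplified tokenization: lowercase and split on whitespace only.
    # Words missing from the vocabulary are skipped.
    return [word_index[w] for w in text.lower().split() if w in word_index]

print(to_sequence("Call now for a free prize", toy_word_index))  # [2, 3, 1, 4]
print(to_sequence("see you tomorrow", toy_word_index))           # []
```

A message made up entirely of unseen words therefore becomes an empty sequence, which the padding step later turns into all zeros.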

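## Create padded sequences with fixed length
Every sequence is brought to a length of 20. Shorter sequences are filled with zeros at the end (padding='post'). Longer sequences are truncated; since the truncating argument is left at its default ('pre'), the beginning of a long message is cut off and its last 20 tokens are kept.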

In [9]:
X_train_pad = pad_sequences(X_train_seq, maxlen=20, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=20, padding='post')
X_train_pad[:5]
X_train_pad.shape

(4458, 20)
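
The effect of these settings on a single sequence can be sketched in plain Python. The function `pad_post_truncate_pre` is a hypothetical stand-in written for illustration, not the Keras routine:

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    # Keep only the last maxlen tokens (truncating='pre' drops the start) ...
    seq = list(seq)[-maxlen:]
    # ... then append zeros at the end (padding='post') up to maxlen.
    return seq + [value] * (maxlen - len(seq))

print(pad_post_truncate_pre([5, 8, 2], maxlen=5))              # [5, 8, 2, 0, 0]
print(pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

The first training message shown earlier has more than 20 tokens, so only its final 20 tokens reach the model.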

## Create Keras model
A "Long Short-Term Memory" (LSTM) layer serves as the recurrent layer. It sits between an embedding layer and a single sigmoid output unit.

In [10]:
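laenge_pads = 20    # length of the padded sequences (maxlen above)
anz_woerter = 7982  # vocabulary size, len(word_dict)

lstm_model = Sequential()
# input_dim is the vocabulary size + 1 because index 0 is reserved for padding
lstm_model.add(Embedding(input_dim=anz_woerter+1, output_dim=20, input_length=laenge_pads))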
lstm_model.add(LSTM(400))
lstm_model.add(Dense(1, activation='sigmoid'))

lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 20)            159660    
_________________________________________________________________
lstm_1 (LSTM)                (None, 400)               673600    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 401       
Total params: 833,661
Trainable params: 833,661
Non-trainable params: 0
_________________________________________________________________
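
The parameter counts in the summary can be reproduced by hand. A short sketch, assuming the standard LSTM layout of four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias; the variable names are chosen for illustration:

```python
vocab_size = 7982 + 1   # anz_woerter + 1 (index 0 is reserved for padding)
embed_dim = 20          # output_dim of the Embedding layer
lstm_units = 400

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)         # 159660 673600 401
print(embedding_params + lstm_params + dense_params)       # 833661
```

Most of the parameters sit in the LSTM layer, so changing the number of units there has the largest effect on model size.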


## Train model

In [11]:
history = lstm_model.fit(X_train_pad, y_train, epochs=10, batch_size=64, 
                        validation_data=(X_test_pad, y_test))

Train on 4458 samples, validate on 1114 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
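
The quantity minimized during these epochs is the binary cross-entropy chosen in the compile step. A small sketch in plain Python, averaged over samples; `binary_crossentropy` here is an illustrative function rather than the Keras one, and the clipping constant `eps` is an assumption to avoid log(0):

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so the logarithm stays finite.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Labels: first message is spam (1), second is ham (0).
confident_right = binary_crossentropy([1, 0], [0.95, 0.05])
confident_wrong = binary_crossentropy([1, 0], [0.05, 0.95])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss near zero, while confident wrong ones are penalized heavily, which is what pushes the sigmoid output toward the right label.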


## Test predictions

In [12]:
sms_test = ['Hi Paul, would you come around tonight']
sms_seq = tokenizer.texts_to_sequences(sms_test)

sms_pad = pad_sequences(sms_seq, maxlen=20, padding='post')
lstm_model.predict_classes(sms_pad)

array([[0]], dtype=int32)

The model classified the text as not spam. Correct!

In [13]:
sms_test = ['Free SMS service for anyone']
sms_seq = tokenizer.texts_to_sequences(sms_test)

sms_pad = pad_sequences(sms_seq, maxlen=20, padding='post')
lstm_model.predict_classes(sms_pad)

array([[1]], dtype=int32)

The model classified the text as spam. Correct again!
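
For a model with a single sigmoid output, predict_classes amounts to thresholding the predicted probability at 0.5. A sketch with made-up probabilities (the numbers are illustrative, not output of the model above):

```python
# Made-up sigmoid outputs, shaped like the result of predict(): one row per message.
probs = [[0.03], [0.97]]

# Class 1 (spam) if the probability exceeds 0.5, otherwise class 0 (ham).
classes = [[int(p > 0.5) for p in row] for row in probs]
print(classes)  # [[0], [1]]
```

Newer TensorFlow/Keras releases no longer provide predict_classes; there the same thresholding is applied directly to the output of predict.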