# TFLite SMS Spam Classifier

A spam classification model optimized with TensorFlow Lite to run on mobile devices.

### Dataset

**SMS Spam Collection Dataset**

* Tagged SMS messages in English
* Encoding is "latin-1"

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Model Architecture
We use a deep learning TensorFlow model with a Bidirectional LSTM Layer.

* https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
* https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional

LSTM layers keep track of the order of the input words, which matters
when deciding whether or not a message is spam.
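
As a small illustration of why word order carries information, the sketch below compares two toy messages (invented here, not taken from the dataset) that contain exactly the same words. An order-blind bag-of-words view cannot tell them apart, while the token sequence a recurrent layer reads is different.

```python
from collections import Counter

# Two hypothetical messages made of the same words in a different order.
msg_a = "you did win the prize"
msg_b = "did you win the prize"

tokens_a = msg_a.split()
tokens_b = msg_b.split()

# Bag-of-words view: only word counts survive, order is discarded.
same_bag = Counter(tokens_a) == Counter(tokens_b)

# Sequence view: the tokens in the order an LSTM would consume them.
same_sequence = tokens_a == tokens_b

print(same_bag, same_sequence)
```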



## Install Requirements

In [1]:
!pip install -q kaggle tensorflow

## Imports

In [17]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import json

In [3]:
# Fix the random seeds so that runs are reproducible
os.environ['PYTHONHASHSEED'] = '42'
np.random.seed(42)
tf.random.set_seed(42)

## Setup/Upload Dataset

### Make the ./data folder

In [4]:
!mkdir -p ./data

### Upload the spam.csv file
You can download the dataset linked above as a zip file, extract it, and then upload the spam.csv file here.

In [None]:
from google.colab import files
files.upload("./data")

### Prep the Dataset
Load the CSV file, fix the column names, and convert the text label to 1 for 'spam' and 0 for 'ham'.

In [6]:
df = pd.read_csv("./data/spam.csv", sep=",", encoding="latin-1")
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
df.rename(columns={"v1": "label_txt", "v2": "message"}, inplace=True)
df['label'] = df['label_txt'].map({'ham': 0, 'spam': 1})
df.drop(["label_txt"], axis=1, inplace=True)
df.head(20)

| | message | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | Even my brother is not like to speak with me. ... | 0 |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | WINNER!! As a valued network customer you have... | 1 |
| 9 | Had your mobile 11 months or more? U R entitle... | 1 |


In [7]:
texts = df['message'].values
labels = df['label'].values

In [8]:
df.groupby('label').describe()

| label | message count | message unique | message top | message freq |
|---|---|---|---|---|
| 0 | 4825 | 4516 | Sorry, I'll call later | 30 |
| 1 | 747 | 653 | Please call our customer service representativ... | 4 |


We have a class imbalance of roughly 6.5:1 (4825 'ham' messages to 747 'spam' messages). We'll note this for now; if our results don't look good, we can come back and address it.
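
To put numbers on the imbalance, here is a minimal sketch that uses only the counts from the `groupby` output above. It also computes the accuracy of a trivial "always ham" classifier and a set of "balanced" class weights. The weighting formula, total / (n_classes * class_count), is one common heuristic and is an assumption here, not something this notebook applies.

```python
# Class counts taken from the groupby output above.
counts = {0: 4825, 1: 747}  # 0 = ham, 1 = spam
total = sum(counts.values())

# Ratio of ham to spam messages.
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a trivial classifier that always predicts "ham".
baseline_accuracy = counts[0] / total

# "Balanced" weights: total / (n_classes * class_count).
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"ratio {imbalance_ratio:.2f}:1, baseline accuracy {baseline_accuracy:.3f}")
print({label: round(w, 3) for label, w in class_weight.items()})
```

If the results did turn out poorly, a dictionary like `class_weight` could be passed to the `class_weight` argument of `model.fit`. The baseline figure is also a reminder that accuracy alone is a generous metric on this dataset.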

### Tokenize and Padding
We have to turn the words of each message into numeric values so that the algorithms can work on them.

We'll use a Tokenizer to convert the words into unique integers and then create vectors of a fixed size for the model to learn from.

Since the input vector has a fixed size, short SMS messages are padded with zeros and long ones are truncated.
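
The idea can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their actual implementation: it ignores punctuation filtering and the `num_words` cap, and the helper names `build_word_index` and `encode` as well as the toy messages are made up for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left free for padding; the OOV token takes index 1.
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i + 1 for i, word in enumerate(vocab)}

def encode(text, word_index, maxlen, oov_token="<OOV>"):
    """Convert text to a fixed-length list of word ids."""
    oov_id = word_index[oov_token]
    ids = [word_index.get(w, oov_id) for w in text.lower().split()]
    ids = ids[:maxlen]                       # truncate long messages at the end
    return ids + [0] * (maxlen - len(ids))   # pad short messages with zeros

toy_texts = ["free prize now", "call me now", "free free call"]
toy_index = build_word_index(toy_texts)

# "tomorrow" is unknown, so it maps to the OOV id; the rest is zero padding.
print(encode("free call tomorrow", toy_index, maxlen=5))
# A message longer than maxlen loses its trailing words.
print(encode("free prize now call me now", toy_index, maxlen=4))
```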

In [9]:
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')

### Create train/validation splits

In [10]:
X_train, X_val, y_train, y_val = train_test_split(padded, labels, test_size=0.2, random_state=42)
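
A quick arithmetic check of what this split produces. The sketch assumes that scikit-learn rounds the held-out share up to a whole row and that `model.fit` uses the Keras default batch size of 32. Under those assumptions the result agrees with the 140 steps per epoch shown in the training log.

```python
import math

n_samples = 4825 + 747        # ham + spam rows from the groupby output
test_size = 0.2               # value passed to train_test_split
batch_size = 32               # Keras default when fit() is given arrays

n_val = math.ceil(n_samples * test_size)   # validation rows
n_train = n_samples - n_val                # training rows
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```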


## Build Model

In [22]:
input_layer = tf.keras.Input(shape=(100,), dtype='int32')  # explicitly set dtype

x = tf.keras.layers.Embedding(10000, 32)(input_layer)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=input_layer, outputs=output)


In [23]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.build(input_shape=(None, 100))
model.summary()
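
The number of trainable parameters can be worked out by hand from the layer sizes above. The sketch below assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights and a bias), so the total should agree with what `model.summary()` reports.

```python
vocab_size, embed_dim = 10000, 32
lstm_units, dense_units = 32, 32

embedding = vocab_size * embed_dim
# One LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction
# The Dense layer sees the concatenated forward and backward states.
dense_hidden = (2 * lstm_units) * dense_units + dense_units
dense_output = dense_units * 1 + 1

total_params = embedding + bidirectional + dense_hidden + dense_output
print(total_params, f"{total_params * 4 / 1e6:.2f} MB as float32")
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding width are the main levers if a smaller model is needed.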

## Train the Model

In [24]:
model.fit(X_train, y_train, epochs=15, validation_data=(X_val, y_val), callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])

Epoch 1/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 17s 85ms/step - accuracy: 0.8867 - loss: 0.3498 - val_accuracy: 0.9767 - val_loss: 0.0824
Epoch 2/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 11s 81ms/step - accuracy: 0.9900 - loss: 0.0383 - val_accuracy: 0.9821 - val_loss: 0.0725
Epoch 3/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 20s 79ms/step - accuracy: 0.9967 - loss: 0.0126 - val_accuracy: 0.9794 - val_loss: 0.0796
Epoch 4/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 21s 86ms/step - accuracy: 0.9994 - loss: 0.0033 - val_accuracy: 0.9812 - val_loss: 0.0974
Epoch 5/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 20s 85ms/step - accuracy: 0.9996 - loss: 8.8736e-04 - val_accuracy: 0.9803 - val_loss: 0.1053
Epoch 6/15
140/140 ━━━━━━━━━━━━━━━━━━━━ 20s 81ms/step - accuracy: 1.0000 - loss: 2.0029e-04 - val_accuracy: 0.9812 - val_loss: 0.1106
Epoch 7/

<keras.src.callbacks.history.History at 0x7b81530d3d90>
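
The early-stopping behaviour can be traced by hand from the log. The sketch below replays the `val_loss` values of the six complete epochs shown above through a simplified version of the patience logic (an approximation of what `EarlyStopping` does when monitoring `val_loss`, not its actual source).

```python
# val_loss values for epochs 1-6, copied from the training log above.
val_losses = [0.0824, 0.0725, 0.0796, 0.0974, 0.1053, 0.1106]
patience = 5

best_loss = float("inf")
best_epoch = 0
wait = 0  # epochs since the last improvement
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1

# Training stops once wait reaches patience.
print(best_epoch, wait, wait >= patience)
```

The validation loss is lowest at epoch 2 and has gone four epochs without improving by epoch 6, so a single further epoch without improvement would reach the patience of 5. Note that `EarlyStopping` keeps the final-epoch weights unless `restore_best_weights=True` is passed, which may be worth considering given that training loss keeps falling while validation loss rises.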

## Test the Model

In [25]:
test_text = [
    "Winner! Free money! click here: asfdlk2j3.adkalfkj2a.cm.com/winner/activation",
    "Make $100/hr no ID verification reply send 'Cash' to 54135",
    "Hey man, don't forget to bring cash to the show!",
    "Where's the lottery ticket? I want to see if we are winners!",
    "Congratulations you won a free giftcard redeem here bit.ly/2389012a",
    "Last warning! Your Toll Bill is due! Pay here https://sunpass.com-8lf5.sbs/us"
    ]
test_sequences = tokenizer.texts_to_sequences(test_text)
test_padded = pad_sequences(test_sequences, maxlen=100, padding='post', truncating='post')

prediction = model.predict(test_padded)

for idx, p in enumerate(prediction):
  if p[0] > 0.5:
    print(f"SPAM {p[0]}: {test_text[idx]}")
  else:
    print(f"HAM {p[0]}: {test_text[idx]}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 380ms/step
SPAM 0.9963923096656799: Winner! Free money! click here: asfdlk2j3.adkalfkj2a.cm.com/winner/activation
SPAM 0.9014689922332764: Make $100/hr no ID verification reply send 'Cash' to 54135
HAM 0.00010331479279557243: Hey man, don't forget to bring cash to the show!
HAM 3.89026063203346e-05: Where's the lottery ticket? I want to see if we are winners!
SPAM 0.9817833304405212: Congratulations you won a free giftcard redeem here bit.ly/2389012a


There are 4 SPAM messages in this test and our classifier found all of them. We're happy with those results, so we can save the model and convert it for use with TFLite.

## Save Model

In [26]:
# Create the output folder if it does not exist yet
os.makedirs('./saved_model', exist_ok=True)

model.save('./saved_model/spam_classifier.keras')

### Save the TFLite converted model

In [29]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = []  # Disable optimizations (no quantization)
converter.experimental_enable_resource_variables = True
# Allow a fallback to TensorFlow ops for anything that is not
# available as a TFLite builtin op.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS
]
converter.target_spec.supported_types = [tf.int32]
tflite_model = converter.convert()

# Write the converted model (a bytes object) to disk
with open("./saved_model/spam_classifier.tflite", "wb") as f:
    f.write(tflite_model)

Saved artifact at '/tmp/tmpfv6athx6'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 100), dtype=tf.int32, name='keras_tensor_5')
Output Type:
  TensorSpec(shape=(None, 1), dtype=tf.float32, name=None)
Captures:
  135795387282832: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795387283984: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795374000080: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795374000656: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795373999696: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795374001232: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795374001808: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795373998928: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795373999504: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795373998160: TensorSpec(shape=(), dtype=tf.resource, name=None)
  135795374003536: Tenso

### Let's export the Tokenizer word index file
This will allow us to use the same word indices when we run the model on a mobile device.

In [19]:
with open("./saved_model/word_index.json", "w") as f:
    json.dump(tokenizer.word_index, f)
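
On the device, the exported file has to be applied the same way the Tokenizer applied it during training. The sketch below shows that logic in Python for reference; a mobile app would port it to its own language. It assumes the Tokenizer's default settings (lowercasing and its default punctuation filter) plus the `num_words=10000` and `maxlen=100` values used above. The function name `encode_message` and the toy index are hypothetical.

```python
# Default punctuation filter of the Keras Tokenizer (the apostrophe is kept).
KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def encode_message(text, word_index, num_words=10000, maxlen=100,
                   oov_token="<OOV>"):
    """Turn one message into the fixed-length id list the model expects."""
    # Lowercase, then replace every filtered character with a space.
    table = str.maketrans({ch: " " for ch in KERAS_FILTERS})
    words = text.lower().translate(table).split()

    oov_id = word_index[oov_token]
    ids = []
    for word in words:
        idx = word_index.get(word, oov_id)
        # Words ranked at or beyond num_words were treated as OOV in training.
        if idx >= num_words:
            idx = oov_id
        ids.append(idx)

    ids = ids[:maxlen]                       # truncating='post'
    return ids + [0] * (maxlen - len(ids))   # padding='post'

# In practice word_index would come from json.load() on word_index.json;
# a tiny hypothetical index is used here to keep the example self-contained.
toy_word_index = {"<OOV>": 1, "free": 2, "call": 3, "now": 4}
print(encode_message("FREE!! Call now, winner", toy_word_index, maxlen=6))
```

The `num_words` check matters because the exported `word_index` contains every word seen during fitting, not only the 10,000 most frequent ones, and the embedding layer only has rows for ids below 10,000.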

### Let's see how big these models are on disk

In [32]:
!ls saved_model/ -lh

total 5.4M
-rw-r--r-- 1 root root 4.0M Jun 21 21:14 spam_classifier.keras
-rw-r--r-- 1 root root 1.4M Jun 21 21:20 spam_classifier.tflite
-rw-r--r-- 1 root root 141K Jun 21 20:36 word_index.json


The TFLite model and the word index file are about **1.5MB** in size combined and ready to ship on a mobile device!