# Training a Comment-spam Detection model with TensorFlow Lite Model Maker

## Learning objectives

1. Install TensorFlow Lite Model Maker.
2. Download the data from the Colab server to your device.
3. Use a data loader to make the training data.
4. Build the model.

## Overview

In this lab, you review code created with TensorFlow and TensorFlow Lite Model Maker to create a model with a dataset based on comment spam. The original data is available on Kaggle. It's been gathered into a single CSV file, and cleaned up by removing broken text, markup, repeated words and more. This will make it easier to focus on the model instead of the text.

Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/spam_comments_model_maker.ipynb).


### Install TensorFlow Lite Model Maker

In [1]:
# Install Model maker
!pip install -q tflite-model-maker &> /dev/null

**Note:** After the installation, restart the kernel by clicking **Kernel > Restart kernel > Restart**.

### Import the code

Import the necessary dependencies and check the TensorFlow version:

In [1]:
# Imports and check that we are using TF2.x
import numpy as np
import os

from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.text_classifier import DataLoader

import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')

### Download the dataset

Next you'll download the data from the lab server to your device, and set the `data_file` variable to point at the local file:

In [2]:
# Download the dataset as a CSV and store as data_file
data_file = tf.keras.utils.get_file(fname='comment-spam.csv', origin='https://storage.googleapis.com/laurencemoroney-blog.appspot.com/lmblog_comments.csv', extract=False)

Downloading data from https://storage.googleapis.com/laurencemoroney-blog.appspot.com/lmblog_comments.csv


### Use a model spec from model maker

When you use pre-learned embeddings, you get to start with a corpus, or collection, of words that have already had sentiment learned from a large body of text, so you get to a solution much faster than when you start from zero.

Model Maker provides several pre-learned embeddings that you can use, but the simplest and quickest one to begin with is the `average_word_vec` option.
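
As a rough intuition for the name, the sketch below averages the vectors of the words in a sentence to get a single sentence vector. It is a simplified, framework-free illustration and not Model Maker's implementation: the `embeddings` table, its values, and the `average_word_vector` helper are all made up for this example.

```python
# Toy embedding table: the words and 2-dimensional vectors are invented for illustration.
embeddings = {
    "free": [0.9, 0.1],
    "money": [0.8, 0.2],
    "thanks": [0.1, 0.9],
}

def average_word_vector(tokens, table, dim=2):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# One fixed-size vector represents the whole sentence, whatever its length.
sentence_vector = average_word_vector(["free", "money"], embeddings)
print(sentence_vector)
```

A classifier can then work on that one averaged vector instead of on each word separately, which is part of why this option is small and quick to train.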

In [3]:
# TODO 1
# Use a model spec from model maker. Options are 'mobilebert_classifier', 'bert_classifier' and 'average_word_vec'
# The first 2 use the BERT model, which is accurate, but larger and slower to train
# Average Word Vec is kinda like transfer learning where there are pre-trained word weights
# and dictionaries
spec = # TODO 1: Your code goes here
spec.num_words = 2000
spec.seq_len = 20
spec.wordvec_dim = 7

#### The num_words parameter

You also specify the number of words that you want your model to use.
You might think "the more the better", but there's generally a right number based on the frequency that each word is used. If you use every word in the entire corpus, the model could try to learn and establish the direction of words that are only used once. In any text corpus, many words are only used once or twice, so their inclusion in your model isn't worthwhile because they have a negligible impact on the overall sentiment.
You can use the `num_words` parameter to tune your model based on the number of words that you want. A smaller number might provide a smaller and quicker model, but it could be less accurate because it recognizes fewer words. On the other hand, a larger number might provide a larger and slower model. It's important to find the sweet spot!
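
To see why rare words add little, the sketch below counts word frequencies in a tiny corpus and keeps only the most frequent ones. The `comments` list is invented for illustration, and capping the vocabulary with `Counter.most_common` is an assumption about how such a limit could work, not a description of Model Maker's internals.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
comments = [
    "great post thanks for sharing",
    "win free money now",
    "free money click now",
    "thanks for the great tutorial",
]

counts = Counter(word for comment in comments for word in comment.split())

# Keep only the most frequent words, the way a vocabulary cap would.
num_words = 5
vocabulary = [word for word, _ in counts.most_common(num_words)]

# Words that occur once carry little signal for the classifier.
rare_words = [word for word, n in counts.items() if n == 1]
print(vocabulary)
print(len(rare_words), "of", len(counts), "distinct words appear only once")
```

Even in this small sample, half of the distinct words occur a single time, so dropping them shrinks the vocabulary considerably.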

#### The wordvec_dim parameter

The `wordvec_dim` parameter is the number of dimensions that you want to use for the vector for each word. The rule of thumb determined from research is that it's the fourth root of the number of words. For example, if you use 2,000 words, 7 is a good starting point. If you change the number of words that you use, you can also change this.
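
The fourth-root rule of thumb is easy to check numerically. The helper name `suggested_wordvec_dim` below is hypothetical and only wraps the arithmetic; treat the result as a starting point for `wordvec_dim` rather than a fixed rule.

```python
def suggested_wordvec_dim(num_words):
    """Fourth root of the vocabulary size, rounded to the nearest integer."""
    return round(num_words ** 0.25)

# 2000 ** 0.25 is about 6.69, which rounds to 7.
print(suggested_wordvec_dim(2000))
```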

#### The seq_len parameter

Models are generally very rigid when it comes to input values. For a language model, this means that it can only classify sentences of a fixed length, which is determined by the `seq_len` (sequence length) parameter.
When you convert words into numbers or tokens, a sentence then becomes a sequence of these tokens. In this case, your model is trained to classify and recognize sentences with 20 tokens. If the sentence is longer than this, it's truncated. If it's shorter, it's padded. You can see a dedicated `<PAD>` token in the corpus that's used for this.
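
The truncate-or-pad behavior can be sketched in a few lines of plain Python. This is an illustration only: the `fit_to_length` helper is hypothetical, the pad id of 0 is an assumption (the real id depends on the vocabulary), and padding on the right is a choice made for this example.

```python
PAD_ID = 0  # assumed id for the <PAD> token; the real id depends on the vocabulary

def fit_to_length(token_ids, seq_len=20, pad_id=PAD_ID):
    """Truncate sequences longer than seq_len and right-pad shorter ones."""
    clipped = token_ids[:seq_len]
    return clipped + [pad_id] * (seq_len - len(clipped))

short = fit_to_length([5, 8, 13], seq_len=6)
long = fit_to_length(list(range(1, 11)), seq_len=6)
print(short)  # [5, 8, 13, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6]
```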

Earlier you downloaded the CSV file. Now it's time to use a data loader to turn this into training data that the model can recognize:

In [4]:
# TODO 2
# Load the CSV using DataLoader.from_csv to make the training_data
data = # TODO 2: Your code goes here(
      filename=data_file,
      text_column='commenttext',
      label_column='spam',
      model_spec=spec,
      delimiter=',',
      shuffle=True,
      is_training=True)

train_data, test_data = data.split(0.9)

If you open the CSV file in an editor, you'll see that each line just has two values, and these are described with text in the first line of the file. Typically, each entry is then deemed to be a column.

You'll see that the descriptor for the first column is `commenttext`, and that the first entry on each line is the text of the comment. Similarly, the descriptor for the second column is `spam`, and you'll see that the second entry on each line is `True` or `False`, to denote whether that text is considered comment spam. The other arguments pass in the model spec that you created earlier (`model_spec=spec`), along with a delimiter character, which in this case is a comma because the file is comma separated. You will use this data for training the model, so `is_training` is set to `True`.
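
To make the two-column layout concrete, the sketch below parses a small in-memory CSV with Python's standard `csv` module. The two data rows are made up for illustration and are not taken from the real dataset; only the `commenttext` and `spam` column names come from the text above.

```python
import csv
import io

# Two made-up rows in the same two-column layout; not taken from the real dataset.
sample = io.StringIO(
    "commenttext,spam\n"
    "thanks for the helpful article,False\n"
    "click here to win free money,True\n"
)

# The header row supplies the column names used as dictionary keys.
rows = list(csv.DictReader(sample, delimiter=","))
texts = [row["commenttext"] for row in rows]
labels = [row["spam"] == "True" for row in rows]
print(texts)
print(labels)  # [False, True]
```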

You will want to hold back a portion of the data for testing the model. Split the data so that 90% of it is used for training and the other 10% for testing and evaluation. Because of this split, you want the testing data to be chosen at random rather than being the 'bottom' 10% of the dataset, so you use `shuffle=True` when loading the data to randomize it.
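
The reason for shuffling before splitting can be shown without any framework. In this sketch the `shuffle_and_split` helper and the fixed seed are hypothetical; it only mirrors the idea of randomizing the order first and then cutting at the 90% mark.

```python
import random

def shuffle_and_split(examples, fraction=0.9, seed=42):
    """Shuffle a copy of the examples, then cut it into two parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = shuffle_and_split(range(100))
print(len(train), len(test))  # 90 10
```

Without the shuffle, the test part would simply be the last 10 examples in file order, which may not be representative of the rest.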

### Build the model

In [5]:
# TODO 3
# Build the model
model = # TODO 3: Your code goes here


This code creates a text-classifier model with Model Maker. You specify the training data that you want to use and the model specification, both set up in the earlier steps, and the number of epochs, which is 50 in this case.

In [6]:
# Evaluate on the held-out test split rather than on the training data
loss, accuracy = model.evaluate(test_data)



### Export a model

Export a model to SavedModel format with the model, vocabulary and labels. Run this cell to specify a directory and export the model:

In [7]:
# This will export to SavedModel format with the model, vocabulary and labels.
model.export(export_dir='/mm_spam_savedmodel/', export_format=[ExportFormat.LABEL, ExportFormat.VOCAB, ExportFormat.SAVED_MODEL])

Rename the SavedModel subfolder to a version number, then compress the entire `/mm_spam_savedmodel` folder and download the generated `mm_spam_savedmodel.zip` file:

In [8]:
# TODO 4
# Rename the SavedModel subfolder to a version number
!mv /mm_spam_savedmodel/saved_model /mm_spam_savedmodel/123
# TODO 4: Your code goes here

  adding: mm_spam_savedmodel/ (stored 0%)
  adding: mm_spam_savedmodel/vocab.txt (deflated 47%)
  adding: mm_spam_savedmodel/labels.txt (stored 0%)
  adding: mm_spam_savedmodel/123/ (stored 0%)
  adding: mm_spam_savedmodel/123/saved_model.pb (deflated 87%)
  adding: mm_spam_savedmodel/123/variables/ (stored 0%)
  adding: mm_spam_savedmodel/123/variables/variables.data-00000-of-00001 (deflated 35%)
  adding: mm_spam_savedmodel/123/variables/variables.index (deflated 59%)
  adding: mm_spam_savedmodel/123/assets/ (stored 0%)
  adding: mm_spam_savedmodel/123/keras_metadata.pb (deflated 86%)


In [9]:
# Optional extra
# You can use this cell to export details for projector.tensorflow.org
# Where you can explore the embeddings that were learned for this dataset
embedding_layer = model.model.layers[0]
weights = embedding_layer.get_weights()[0]
vocab = model.model_spec.vocab  # maps each word to its integer index

# Write one embedding vector per line to vecs.tsv and the matching word to meta.tsv
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word in vocab:
    index = vocab[word]
    vector = weights[index]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vector]) + "\n")


try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')
