# Text classification with an RNN for TensorFlow Lite

My initial work looked at the [IMDB large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis. This was used in the TensorFlow examples, so it looked like a good place to start. However, I then discovered the [Toxic Comment challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview), which provided a much more relevant dataset.

The code is based on a number of examples from TensorFlow, combined with my own code and some ideas from [Pukar Acharya](https://github.com/iampukar/toxic-comments-classification/blob/master/toxic_comment_analysis.ipynb).

See https://github.com/Workshopshed/TinyMLTextClassification

This notebook trains a [recurrent neural network](https://developers.google.com/machine-learning/glossary/#recurrent_neural_network) on the dataset provided for the competition.

**Caution:** the dataset intentionally contains comments that are both toxic and obscene.

## Setup

In [None]:
!pip install tensorflow
!pip install tensorflow-text
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text as text
import pandas as pd
import numpy as numpy

## Load the data

Note: you'll need to upload the datasets from the competition to Colab, or wherever you are running this notebook. pandas (`pd`) expects the files to be UTF-8 encoded.

In [None]:
train_data = pd.read_csv('train_small.csv')
test_data = pd.read_csv('test_small.csv')

In [None]:
train_data.describe()

In [None]:
train_data.head(4)

Validate that the data does not have any blanks

In [None]:
train_data.isnull().any()

In [None]:
classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
result_set = train_data[classes].values

The input data comes from the `comment_text` column of the CSV. Missing comments are replaced with the placeholder string "fillna", and everything is made lowercase to simplify the encoding.

In [None]:
train_sentences = train_data["comment_text"].fillna("fillna").str.lower()
test_sentences = test_data["comment_text"].fillna("fillna").str.lower()

Create our custom encoder based on `tfds.features.text.TextEncoder` (newer tensorflow-datasets releases expose this under `tfds.deprecated.text`). Note that this is more complex than it needs to be because the data was originally processed via tfds.

In [None]:
import binascii
import sys
import tensorflow_text as text
from math import floor

class HashedTextEncoder(tfds.features.text.TextEncoder):
  """Encodes text using PySuperFastHash"""

  def __init__(self):
    """Constructs HashedTextEncoder.
    Args:
      None
    """
  def encode(self, s):
    # Handle additional tokens
    s = tf.compat.as_text(s)
    s = s.lower()
    ids = []
    words = s.split(" ") 
    for substr in words[0:8]:
      if not substr:
        continue
      newid = self.superFastHash(substr)
      ids.append(newid)
    #If length is too long then select the middle words
    #if len(ids) > 12:
    #  ids = ids[floor((len(ids)-12) / 2):floor((len(ids)-12) / 2) + 12]
    return self.pad_incr(ids)

  def pad_incr(self,ids):
    """Add 1 to ids to account for pad."""
    return [i + 1 for i in ids]

  def decode(self, ids):
    raise NotImplementedError

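  @classmethod
  def load_from_file(cls, filename_prefix):
    raise NotImplementedError

  def save_to_file(self, filename_prefix):
    raise NotImplementedError

  @property
  def vocab_size(self):
    # Hashes are not drawn from a fixed vocabulary
    raise NotImplementedError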

  def get16bits(self, data):
    """Returns the first 16bits of a string"""
    return int(binascii.hexlify(data[1::-1]), 16)

  def superFastHash(self, data):
    # Start by stripping out UTF data
    data=data.encode("ascii","ignore")

    hash = length = len(data)
    if length == 0:
        return 0

    rem = length & 3
    length >>= 2

    while length > 0:
        hash += self.get16bits(data) & 0xFFFFFFFF
        tmp = (self.get16bits(data[2:])<< 11) ^ hash
        hash = ((hash << 16) & 0xFFFFFFFF) ^ tmp
        data = data[4:]
        hash += hash >> 11
        hash = hash & 0xFFFFFFFF
        length -= 1

    if rem == 3:
        hash += self.get16bits(data)
        hash ^= (hash << 16) & 0xFFFFFFFF
        hash ^= (data[2] << 18) & 0xFFFFFFFF
        hash += hash >> 11
    elif rem == 2:
        hash += self.get16bits(data)
        hash ^= (hash << 11) & 0xFFFFFFFF
        hash += hash >> 17
    elif rem == 1:
        hash += data[0]
        hash ^= (hash << 10) & 0xFFFFFFFF
        hash += hash >> 1

    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 3) & 0xFFFFFFFF
    hash += hash >> 5
    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 4) & 0xFFFFFFFF
    hash += hash >> 17
    hash = hash & 0xFFFFFFFF
    hash ^= (hash << 25) & 0xFFFFFFFF
    hash += hash >> 6

    return hash

This text encoder converts words to hashes.
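
The `get16bits` helper is the least obvious part of the encoder. As far as I can tell it reads the first two bytes as a little-endian 16-bit integer. The sketch below copies that expression and compares it with `int.from_bytes`; the name `get16bits_le` is my own and is not part of the notebook.

```python
import binascii

def get16bits(data):
    """Same expression as the encoder's get16bits method."""
    return int(binascii.hexlify(data[1::-1]), 16)

def get16bits_le(data):
    """Hypothetical equivalent: first two bytes, little-endian."""
    return int.from_bytes(data[:2], "little")

sample = b"abcd"
# Both calls read b"ab" with "b" as the high byte
print(hex(get16bits(sample)), hex(get16bits_le(sample)))
```

The two differ only for empty input, where the `hexlify` version raises a `ValueError`. The encoder never hits that case because `superFastHash` returns early for empty data.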

In [None]:
#Needed to test filtering out unicode strings
sample_string = 'Hello TensorFlow, this is a @fun test. Fichier non trouvé, now check that the too long didnt read function is also working'

encoder = HashedTextEncoder()

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

#Note this is a hash function so we can't reverse the operation

## Prepare the data for training

Now run the encoder on the dataset, pad to 8 numbers and format for processing by the model.

In [None]:
def encode_sentences(sentences):

  outputarray = []

  for words in sentences:
    row = encoder.encode(words)
    # Pad the rows to 8 so we can stack them for Numpy / Keras
    x = len(row)
    while x < 8:
      row.append(0)
      x = x + 1
    outputarray.append(row)
  return outputarray

encoded_sentences = encode_sentences(train_sentences)

for i in encoded_sentences[:5]:
  print('Encoded string is {}'.format(i))

train_batches = numpy.stack( encoded_sentences,axis=0 )

encoded_test = encode_sentences(test_sentences)
test_batches = numpy.stack(encoded_test,axis=0)


Check the shape of the data; it should match the shape of the model inputs.

In [None]:
print(train_batches.shape)
print(test_batches.shape)
print(result_set.shape)

## Create the model

Build a `tf.keras.Sequential` model. The first embedding layer needs to be as big as our biggest hash + 1, because 0 is used for padding.
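
A 32-bit hash can be far too large to use directly as an embedding index. One common way to bound it, which is not what the encoder above does, is to fold each hash into a fixed number of buckets. This is a minimal sketch under my own assumptions: `zlib.crc32` stands in for the notebook's hash, and `NUM_BUCKETS` and `bucket_id` are hypothetical names.

```python
import zlib

NUM_BUCKETS = 1024  # hypothetical bucket count, not taken from the notebook

def bucket_id(word, num_buckets=NUM_BUCKETS):
    """Map a word to an id in 1..num_buckets, leaving 0 free for padding."""
    raw = zlib.crc32(word.encode("ascii", "ignore"))  # stand-in 32-bit hash
    return raw % num_buckets + 1

ids = [bucket_id(w) for w in "hello tensorflow this is a test".split(" ")]

# Ids run from 0 (padding) to NUM_BUCKETS, so the table needs one extra row
embedding_input_dim = NUM_BUCKETS + 1
print(ids, embedding_input_dim)
```

The trade-off is that different words can land in the same bucket, so a smaller table means more collisions.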

When using the [Embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), the `input_length` parameter is needed so that we don't get the following error when converting the model to TensorFlow Lite:

`None is only supported in the 1st dimension. Tensor 'embedding_input' has invalid shape '[None, None]'.`

Otherwise, use the `input_shape` parameter.

A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

The `tf.keras.layers.Bidirectional` wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the output. This helps the RNN to learn long range dependencies.
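
To make the idea concrete, here is a toy, pure-Python illustration of a recurrent pass and a bidirectional wrapper. It is only a sketch: the state is a single number, the weights `w_in` and `w_state` are made up, and both directions share them, whereas the Keras layers learn separate weights per direction and use vector states.

```python
import math

def rnn_pass(inputs, w_in=0.5, w_state=0.8):
    """Toy scalar RNN: the state from each timestep feeds the next one."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_in * x + w_state * state)
    return state

def bidirectional(inputs):
    """Run the sequence forwards and backwards, then concatenate the results."""
    forward = rnn_pass(inputs)
    backward = rnn_pass(list(reversed(inputs)))
    return [forward, backward]

# The same values in a different order give a different final state
print(bidirectional([1.0, 0.0, 0.0]))
print(bidirectional([0.0, 0.0, 1.0]))
```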

In [None]:
# Original model.
# model = tf.keras.Sequential([
#    tf.keras.layers.Embedding(32767,64, input_length=32),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(64, activation='relu'),
#    tf.keras.layers.Dense(1)
#])

# This one seems to cause problems with optimisation steps.
# model = tf.keras.Sequential([
#    tf.keras.layers.Embedding(1025,16, input_length=16),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),
#    tf.keras.layers.Dense(16, activation='relu'),
#    tf.keras.layers.Dense(1)
#])

# Experiments
model = tf.keras.Sequential([
  tf.keras.layers.Dense(8, activation='relu',input_shape=(8,)),                       
  tf.keras.layers.Dense(8, activation='relu'),
  tf.keras.layers.Dense(6)
  ])

model.summary()

Please note that we chose a Keras Sequential model here since all the layers in the model have a single input and produce a single output. If you want to use a stateful RNN layer, you might want to build your model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. Please check the [Keras RNN guide](https://www.tensorflow.org/guide/keras/rnn#rnn_state_reuse) for more details.

Compile the Keras model to configure the training process:

See the list of available [optimisers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-3),
              metrics=['accuracy'])
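
For intuition about the loss: with `from_logits=True` each of the six outputs is treated as an independent yes/no decision, and the per-class losses are averaged. The following is a plain-Python sketch of that calculation for one sample, using a numerically stable form of binary cross-entropy. The function name `bce_with_logits` is my own and is not a Keras API.

```python
import math

def bce_with_logits(labels, logits):
    """Mean binary cross-entropy over classes, computed from raw logits."""
    total = 0.0
    for z, x in zip(labels, logits):
        # Stable form of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x))
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(labels)

# Made-up labels and logits for the six classes of one comment
labels = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
logits = [3.0, -2.0, 0.5, -4.0, -1.0, -5.0]
print(bce_with_logits(labels, logits))
```

A confident wrong answer (a large positive logit for a label of 0) is penalised roughly in proportion to the size of the logit.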

## Train the model

Note that the validation split allows some of the training data to be kept back for validation.

In [None]:
history = model.fit(train_batches,result_set, epochs=30, batch_size=10, validation_split=0.2)
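
Keras takes the `validation_split` portion from the end of the arrays, before any shuffling. The helper below is only an illustration of that slicing; the name `validation_split_indices` is hypothetical and not part of Keras.

```python
import math

def validation_split_indices(num_samples, validation_split):
    """Return (train_indices, val_indices) for a Keras-style tail split."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    indices = list(range(num_samples))
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = validation_split_indices(10, 0.2)
print(train_idx)  # the first 8 rows are used for training
print(val_idx)    # the last 2 rows are held back for validation
```

If the CSV happens to be ordered in some meaningful way, it may be worth shuffling the rows before calling `fit` so the held-back tail is representative.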



In [None]:
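#TODO: Need to fix this. evaluate() needs the labels as well as the inputs,
# and no labels are extracted from the test CSV above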
test_loss, test_acc = model.evaluate(test_batches)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

In [None]:
# Save the model
model.save("textclassification_model") 

The above model does not mask the padding applied to the sequences. This can lead to skew if it is trained on padded sequences and tested on un-padded sequences. Ideally you would [use masking](https://www.tensorflow.org/guide/keras/masking_and_padding) to avoid this, but as you can see below it only has a small effect on the output.

The model outputs logits rather than probabilities, so pass each output through a sigmoid first. If the resulting value is >= 0.5 (a logit >= 0), the comment is positive for that class; otherwise it is negative.
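
Because the model is compiled with `from_logits=True`, the raw predictions are logits. A small sketch of turning one row of predictions into class names follows; the logit values are made up, and `labels_from_logits` is a hypothetical helper rather than part of the notebook.

```python
import math

CLASSES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(x):
    """Squash a logit into the 0..1 range."""
    return 1.0 / (1.0 + math.exp(-x))

def labels_from_logits(logits, threshold=0.5):
    """Return the class names whose sigmoid output reaches the threshold."""
    return [name for name, logit in zip(CLASSES, logits)
            if sigmoid(logit) >= threshold]

example_logits = [2.0, -3.0, 0.5, -1.0, 0.0, -4.0]  # made-up values
print(labels_from_logits(example_logits))
```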

## Test the model

In [None]:
def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

In [None]:
def sample_predict(sample_pred_text, pad):
  encoded_sample_pred_text = encoder.encode(sample_pred_text)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 8)

  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return (predictions)

In [None]:
# predict on a sample text without padding.

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

In [None]:
# predict on a sample text with padding

sample_pred_text = ('The movie was fantastic. The animation and the graphics '
                    'were out of this world. I would recommend this movie. Loved every minute of it. A cast of famous people')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

print(sample_predict('This was rubbish, wont be going again. Hated it. Totally pants', pad=True))

print(sample_predict('Amazing film, loved seeing this', pad=True))

print(sample_predict('Only a total noob would say something like that', pad=True))

print(sample_predict('Have you considered an alternative', pad=True))

print(sample_predict('Total crap, wont be going again', pad=True))

## Export the model

Convert the model to TFLite then format as a big C array.

Based on https://github.com/eloquentarduino/tinymlgen/blob/master/tinymlgen/tinymlgen.py

Ref https://blog.tensorflow.org/2019/06/tensorflow-integer-quantization.html

In [None]:
!pip install hexdump

In [None]:
#Experimenting with optimisations

import re
import hexdump
import tensorflow as tf

def port(model,optimize=True, variable_name='model_data',pretty_print=False):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    if optimize:
        if isinstance(optimize, bool):
            # OPTIMIZE_FOR_SIZE is deprecated and behaves the same as DEFAULT
            optimizers = [tf.lite.Optimize.DEFAULT]
        else:
            optimizers = optimize

        converter.optimizations = optimizers
    tflite_model = converter.convert()
    bytes = hexdump.dump(tflite_model).split(' ')
    c_array = ', '.join(['0x%02x' % int(byte, 16) for byte in bytes])
    c = 'const unsigned char %s[] DATA_ALIGN_ATTRIBUTE = {%s};' % (variable_name, c_array)
    if pretty_print:
        c = c.replace('{', '{\n\t').replace('}', '\n}')
        c = re.sub(r'(0x..?, ){12}', lambda x: '%s\n\t' % x.group(0), c)
    c += '\nconst int %s_len = %d;' % (variable_name, len(bytes))
    preamble = '''
#ifdef __has_attribute
#define HAVE_ATTRIBUTE(x) __has_attribute(x)
#else
#define HAVE_ATTRIBUTE(x) 0
#endif
#if HAVE_ATTRIBUTE(aligned) || (defined(__GNUC__) && !defined(__clang__))
#define DATA_ALIGN_ATTRIBUTE __attribute__((aligned(4)))
#else
#define DATA_ALIGN_ATTRIBUTE
#endif
'''
    return preamble + c

In [None]:
c_code = port(model,optimize=True,pretty_print=True)

print(len(c_code))
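
The byte-to-C-array step does not strictly need the `hexdump` package. As a rough sketch, the same formatting can be done with the standard library alone. `bytes_to_c_array` is a hypothetical helper, the sample bytes are made up rather than a real model, and the alignment attribute from `port` is left out for brevity.

```python
def bytes_to_c_array(data, variable_name="model_data", per_line=12):
    """Format a bytes object as a C unsigned char array plus a length constant."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("\t" + ", ".join("0x%02x" % b for b in chunk))
    body = ",\n".join(lines)
    return ("const unsigned char %s[] = {\n%s\n};\n"
            "const int %s_len = %d;" % (variable_name, body, variable_name, len(data)))

fake_model = bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33])  # made-up bytes
print(bytes_to_c_array(fake_model))
```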

The file size needs to be < 400K to fit onto the device. Check the `model_data_len` value at the bottom of the file: `const int model_data_len = 109840` plus a tiny bit of code comes to 90% of the available space.
But perhaps it also needs to be smaller than the available RAM to be able to run? For example, the sine model is just 2640 bytes.

In [None]:
# Write the generated header; the with block closes the file for us
with open("text_model.h", "w") as c_file:
    n = c_file.write(c_code)

## Testing the TFLite model

It is possible to reload the model back into the notebook and test it here.

Arena Size?

https://github.com/edgeimpulse/tflite-find-arena-size


Debugging TFLite
