# **Check Your Tone: A Speech-To-Text Sentiment Analyzer for Raspberry Pi**
This Google Colab notebook is used to train neural network models for text sentiment classification, as part of the Check Your Tone speech-to-text sentiment classifier project for Raspberry Pi. [See my GitHub repo](https://github.com/ericvc/Check-Your-Tone) for more information about the project and additional code necessary to get it up and running.


---



## **0) Prepare Workspace for Model Fitting**

### **Hardware Acceleration**

Check Hardware Acceleration settings and verify GPU type. If hardware acceleration is not enabled, go to **Runtime > Change Runtime Type** and select GPU from the Hardware Accelerator dropdown menu.

In [None]:
!nvidia-smi
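
The same check can be scripted so that later cells can branch on the result. This is a minimal sketch, assuming `nvidia-smi` is on the PATH whenever a GPU runtime is attached; `gpu_runtime_available` is a hypothetical helper and not part of the project code.

```python
import shutil
import subprocess

def gpu_runtime_available() -> bool:
    """Return True if nvidia-smi exists and exits successfully (assumed proxy for a GPU runtime)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU runtime detected" if gpu_runtime_available() else "No GPU runtime detected")
```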

### **Connect Colaboratory Runtime to Google Drive to Save Files**

Connecting to Google Drive will allow you to save workspace files for use in later runs and to save fitted models for download to your local machine.

In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')

# Create the project folder on Drive if it does not exist yet
os.makedirs("/content/drive/My Drive/CYT", exist_ok=True)



---


### **Load Python modules, define helper functions, and prepare data**


---


Load Python modules

In [None]:
!pip install tf-nightly

import pandas as pd
import pickle
import numpy as np
import os
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from keras.optimizers import Adam
from keras.regularizers import L1L2
from keras.utils import plot_model
from datetime import datetime

Download the IMDB movie reviews dataset from [ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz](https://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz)

In [None]:
%%capture
#Download file with wget
!wget ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz
#Extract from compressed file
!tar xvzf aclImdb_v1.tar.gz
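
After extraction it can be worth confirming that the archive unpacked completely. The sketch below counts the `.txt` files in the four review folders. As far as I know the dataset ships 12,500 reviews per folder (25,000 train and 25,000 test, split evenly by label), so any other count suggests an incomplete download. `count_reviews` is a hypothetical helper.

```python
from pathlib import Path

def count_reviews(root: str = "aclImdb") -> dict:
    """Count review files in each split/label folder under `root`."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = Path(root) / split / label
            # A missing folder counts as zero rather than raising
            counts[f"{split}/{label}"] = (
                sum(1 for _ in folder.glob("*.txt")) if folder.is_dir() else 0
            )
    return counts

for name, n in count_reviews().items():
    print(f"{name}: {n} reviews")
```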



---


Define functions to process raw text for tokenization.

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

"""
Functions for processing raw text. Removes or replaces special characters, empty spaces, etc.
"""

stop_words = stopwords.words("english")
def REPLACE_STOP_WORDS_NO_SPACE(x):
    # list comprehension to split input string into list of words, then evaluate each word
    words = [word for word in x.split() if word not in stop_words]
    # recombine the list of remaining words into a string
    words_no_stop = " ".join(words)
    return words_no_stop


def REPLACE_ELLIPSES_WITH_SPACE(x):
    return re.compile("\\.{2,}").sub(" ", x)


def REPLACE_CHARACTER_NO_SPACE(x):
    return re.compile(r"""[.\-;:!'?,"()\[\]/]""").sub("", x)


def REPLACE_BLANK_START_NO_SPACE(x):
    return re.compile("^\\s+").sub("", x)


def REPLACE_BLANK_END_NO_SPACE(x):
    return re.compile("\\s+$").sub("", x)


def REPLACE_BLANK_WITH_SPACE(x):
    return re.compile("\\s{2,}").sub(" ", x)


def REPLACE_FORMAT_NO_SPACE(x):
    return re.compile("&\\w").sub(" ", x)


def pre_process_sentence(sentences):
    sentences = [REPLACE_ELLIPSES_WITH_SPACE(line) for line in sentences]
    sentences = [REPLACE_CHARACTER_NO_SPACE(line) for line in sentences]
    sentences = [REPLACE_FORMAT_NO_SPACE(line) for line in sentences]
    sentences = [REPLACE_BLANK_START_NO_SPACE(line) for line in sentences]
    sentences = [REPLACE_BLANK_END_NO_SPACE(line) for line in sentences]
    sentences = [REPLACE_BLANK_WITH_SPACE(line) for line in sentences]
    sentences = [REPLACE_STOP_WORDS_NO_SPACE(line) for line in sentences]
    return sentences
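
To make the order of the cleaning steps concrete, here is a condensed, self-contained sketch of the same pipeline applied to a single sentence. It is not the notebook's code: `clean` is a hypothetical stand-in, and the tiny `STOP_WORDS` set replaces NLTK's English list so the example runs without a download.

```python
import re

STOP_WORDS = {"the", "was", "and", "i", "it", "a", "of"}  # assumption: tiny stand-in list

def clean(text: str) -> str:
    text = re.sub(r"\.{2,}", " ", text)                  # ellipses -> space
    text = re.sub(r"""[.\-;:!'?,"()\[\]/]""", "", text)  # drop punctuation
    text = re.sub(r"&\w", " ", text)                     # "&" plus one word character -> space
    text = re.sub(r"\s{2,}", " ", text.strip())          # trim ends, collapse whitespace runs
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean("the movie was great... and i loved it!  (mostly)"))
# -> movie great loved mostly
```

Stop words are removed last, after punctuation has been stripped, so a word such as "it!" is first reduced to "it" and only then filtered out.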



---


With the raw data extracted, we can get to work preparing the training and validation datasets.

In [None]:
# Function for extracting text from each review file
def get_review_text(directory):
    # 'directory' avoids shadowing the built-in dir()
    reviews = []
    for file in os.listdir(directory):
        file_path = os.path.join(directory, file)
        with open(file_path, "rb") as f:
            review = f.read().decode("utf-8")
        reviews.append(review)
    return reviews

## Get raw text of positive reviews
pos_dir1 = "aclImdb/train/pos/"
pos_dir2 = "aclImdb/test/pos/"
positive_reviews = [*get_review_text(pos_dir1), *get_review_text(pos_dir2)]

## Get raw text of negative reviews
neg_dir1 = "aclImdb/train/neg/"
neg_dir2 = "aclImdb/test/neg/"
negative_reviews = [*get_review_text(neg_dir1), *get_review_text(neg_dir2)]

## Combine all reviews into a pandas DataFrame
all_reviews = [*positive_reviews, *negative_reviews]
df = pd.DataFrame()
df["sentence"] = all_reviews  # review text as column 'sentence'
df['sentence.lower'] = [text.lower() for text in df['sentence']]  # convert to lowercase letters
labels = np.zeros(len(all_reviews))
labels[:len(positive_reviews)] = 1  # positive reviews come first in all_reviews
df["label"] = labels  # label value as column 'label'

## Write processed data to CSV file
df.to_csv("imdb_reviews_labeled.csv")

## Shuffle rows to mix positive and negative reviews
df = pd.read_csv("imdb_reviews_labeled.csv").sample(frac=1, random_state=222)
sentences = df['sentence.lower'].values
y = df['label'].values

print("Data saved to CSV file.")
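
The shuffle matters because the rows start out sorted by label (all positives, then all negatives), and Keras `validation_split` takes the last fraction of the rows as they are ordered, before any per-epoch shuffling. A pure-Python sketch of that effect, using toy labels and a hypothetical `tail_split` helper:

```python
import random

def tail_split(labels, fraction):
    """Split off the last `fraction` of items, mirroring a tail-based validation split."""
    cut = int(len(labels) * (1 - fraction))
    return labels[:cut], labels[cut:]

labels = [1] * 6 + [0] * 6            # sorted by class, like the unshuffled DataFrame

_, val_sorted = tail_split(labels, 0.5)
print("validation labels without shuffling:", val_sorted)   # a single class only

shuffled = labels[:]
random.Random(222).shuffle(shuffled)  # fixed seed for a repeatable order
_, val_shuffled = tail_split(shuffled, 0.5)
print("validation labels after shuffling:", val_shuffled)
```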



---


Download pre-trained word embeddings from the GloVe project: https://nlp.stanford.edu/projects/glove/

I will be using the embedding trained on the Common Crawl with 42B tokens and 300-dimensional vectors (**~1.75 GB download**).

On the first run of this notebook, the embeddings will be downloaded and saved to my Google Drive project folder. On subsequent runs, the saved copy will be copied from the Drive project folder to the Colab workspace rather than downloaded again from the source. *This saves around 10-15 minutes per run*.

In [None]:
if not os.path.exists("/content/drive/My Drive/CYT/glove.42B.300d.zip"):
  !wget http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip
  !gsutil cp "glove.42B.300d.zip" "/content/drive/My Drive/CYT/"
  !unzip "glove.42B.300d.zip"

else:
  !gsutil cp "/content/drive/My Drive/CYT/glove.42B.300d.zip" "/content/"
  print("File copied to local workspace. Unzipping archive.")
  !unzip "glove.42B.300d.zip"

print("Archive downloaded and inflated.")
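
The cache-or-download logic above can also be written in plain Python. This is a sketch only: `fetch_with_cache` and its `download` callback are hypothetical names, and the callback stands in for whatever performs the real download.

```python
import shutil
from pathlib import Path
from typing import Callable

def fetch_with_cache(name: str, cache_dir: str, work_dir: str,
                     download: Callable[[Path], None]) -> Path:
    """Return work_dir/name, copying from cache_dir when a cached copy exists."""
    cached = Path(cache_dir) / name
    local = Path(work_dir) / name
    local.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        shutil.copy2(cached, local)      # later runs: reuse the saved copy
    else:
        download(local)                  # first run: fetch from the source
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local, cached)      # save a copy for later runs
    return local
```

In this notebook the call might look like `fetch_with_cache("glove.42B.300d.zip", "/content/drive/My Drive/CYT", "/content", download=my_downloader)`, where `my_downloader` is a function you supply.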
  



---
### **Tokenize IMDB Review Text Using Pre-Trained Embedding**
Tokenize and prepare the embedding matrix used to train the models.
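
As a rough picture of what tokenizing and padding produce, the sketch here mimics word indexing and right-padding in plain Python. It is a simplification, not the Keras implementation: `build_word_index` and `pad_post` are hypothetical helpers, and cutting long sequences from the front follows my understanding of the `pad_sequences` default (`truncating='pre'`).

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_post(seq, maxlen):
    """Keep the last `maxlen` tokens, then pad with zeros on the right."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

texts = ["great movie great cast", "bad movie"]
word_index = build_word_index(texts)
sequences = [[word_index[w] for w in t.split()] for t in texts]
padded = [pad_post(s, maxlen=3) for s in sequences]
print(word_index)   # {'great': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(padded)       # [[2, 1, 3], [4, 2, 0]]
```

The first review loses its leading token because it is longer than `maxlen`, while the second is padded with a trailing zero.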

In [None]:
## Clean review text
X_processed = pre_process_sentence(sentences)

## Tokenize words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_processed)
X_tokenized = tokenizer.texts_to_sequences(X_processed)

## Save fitted tokenizer to storage for use later on
os.makedirs("tokenizer", exist_ok=True)
filename = 'tokenizer/sentence_tokenizer_fitted.sav'
with open(filename, 'wb') as f:  # close the file handle once the tokenizer is written
    pickle.dump(tokenizer, f)

## Dimensions for word embeddings
maxlen = 250
embedding_dim = 300

## Pad sequences (right side only) with 0s
X = pad_sequences(X_tokenized, padding='post', maxlen=maxlen)

## Create embedding matrix
# define function for extracting word embeddings (line-by-line) from file
def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Add 1 because index 0 is reserved for padding
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

## Create embedding matrix
embedding_matrix = create_embedding_matrix(
    'glove.42B.300d.txt',
    tokenizer.word_index, embedding_dim)

## Check proportion of words included in pre-trained word embeddings.
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
vocab_size = len(tokenizer.word_index) + 1
print(f"{nonzero_elements / vocab_size:.0%} of words are included in the pre-trained embedding.")
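
Each line of the GloVe text file is a word followed by its vector components, separated by spaces. The sketch below runs the same look-up on a made-up three-line sample so the coverage figure is easy to verify by hand. The toy vectors, the `word_index` dictionary, and the 3-dimensional size are all assumptions for illustration.

```python
import io

sample_file = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
word_index = {"great": 1, "movie": 2, "meh": 3}   # toy tokenizer vocabulary
dim = 3

# Row 0 stays all-zero because index 0 is reserved for padding
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for line in sample_file:
    word, *vector = line.split()
    if word in word_index:
        matrix[word_index[word]] = [float(v) for v in vector[:dim]]

covered = sum(1 for row in matrix[1:] if any(row))
coverage = covered / len(word_index)
print(f"{covered} of {len(word_index)} words covered ({coverage:.0%})")
```

Words missing from the embedding file ("meh" here) keep an all-zero row, so the model sees them the same way it sees padding.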


---
### **Define functions for specifying neural network models**

In [None]:
## Model template functions
# RNN
def create_lstm_model(learn_rate: float = 0.001, units: int = 32, n_blocks: int=0):
    # Define optimization settings
    optimizer = Adam(learning_rate=learn_rate)  # 'lr' is a deprecated alias

    # Initialize model
    model = Sequential()

    # Add embedding layer
    model.add(layers.Embedding(vocab_size, embedding_dim,
                               weights=[embedding_matrix],
                               input_length=maxlen,
                               trainable=False, 
                               name="embedding"))

    # Add LSTM blocks
    for blocks in range(n_blocks):
        label = str(blocks)
        model.add(layers.LSTM(units=units, name="lstm_"+label, 
                              return_sequences=True, 
                              activity_regularizer=L1L2(0.0, 1e-2)))
        model.add(layers.BatchNormalization(name="batch_norm_"+label))
        model.add(layers.Activation("elu", name="activation_"+label))

    model.add(layers.LSTM(units=units, name="lstm_final", 
                          activity_regularizer=L1L2(0.0, 1e-2)))
    model.add(layers.BatchNormalization(name="batch_norm_final"))
    model.add(layers.Activation("elu", name="activation_final"))

    # Output layer
    model.add(layers.Dense(1, activation='sigmoid', name="output"))

    # Compile model
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Print summary
    #model.summary()

    return model


# CNN
def create_conv_model(learn_rate: float = 0.001, filters: int = 32, n_blocks: int = 0, kernel_size: int = 3):

    # Define optimization settings
    optimizer = Adam(learning_rate=learn_rate)  # 'lr' is a deprecated alias

    # Initialize model
    model = Sequential()

    # Add embedding layer
    model.add(layers.Embedding(vocab_size, embedding_dim,
                               weights=[embedding_matrix],
                               input_length=maxlen,
                               trainable=False, 
                               name="embedding"))

    # Add convolution blocks
    for blocks in range(n_blocks):
        label = str(blocks)
        model.add(layers.Conv1D(filters=filters,
                                kernel_size=kernel_size,
                                padding="same",
                                name="conv1D_"+label))
        model.add(layers.BatchNormalization(name="batch_norm_"+label))
        model.add(layers.Activation("relu", name="activation_"+label))
        model.add(layers.MaxPool1D(pool_size=2, name="max_pool_"+label))

    # Final convolution block (w/o MaxPooling)
    model.add(layers.Conv1D(filters=filters,
                            kernel_size=kernel_size,
                            padding="same",
                            name="conv1D_final_block"))
    model.add(layers.BatchNormalization(name="batch_norm_final_block"))
    model.add(layers.Activation("relu", name="activation_final_block"))
    model.add(layers.Flatten(name="flatten"))

    # Dropout layer
    model.add(layers.Dropout(0.5, name="dropout_flat_to_dense"))
    model.add(layers.Dense(32, name="dense", 
                           activity_regularizer=L1L2(0.0, 1e-4)))
    model.add(layers.BatchNormalization(name="batch_norm_dense"))
    model.add(layers.Activation("relu", name="activation_dense"))

    # Output layer
    model.add(layers.Dense(1, activation='sigmoid', name="output"))

    # Compile model
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Print summary
    #model.summary()

    return model
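
When sizing the CNN, it helps to know how long the flattened vector feeding the dense layer will be. With `padding="same"` each convolution keeps the sequence length, and each `MaxPool1D(pool_size=2)` halves it, rounding down. A small sketch of that arithmetic, with `flatten_size` as a hypothetical helper:

```python
def flatten_size(seq_len: int, filters: int, n_blocks: int) -> int:
    """Length of the Flatten output for the CNN template above."""
    for _ in range(n_blocks):
        seq_len //= 2          # MaxPool1D(pool_size=2) after each pooled block
    return seq_len * filters   # the final conv block keeps the length ("same" padding)

for blocks in range(4):
    print(blocks, flatten_size(seq_len=250, filters=8, n_blocks=blocks))
# 0 2000
# 1 1000
# 2 496
# 3 248
```

With `maxlen = 250`, 8 filters, and no pooled blocks, the dense layer therefore receives 2,000 inputs.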



---


## **1) Model Fitting with Keras/TensorFlow**

Models (SavedModel and TF-Lite) will be saved to the folder 'TensorFlow Models'.

In [None]:
if not os.path.exists("TensorFlow Models"):
  os.mkdir("TensorFlow Models")



---
### **1-D Convolutional Neural Network**

In [None]:
# CNN model fitting
cnn_model = create_conv_model(filters=8, learn_rate=1e-5, n_blocks=0, kernel_size=3)
plot_model(cnn_model, to_file="TensorFlow Models/cnn_model.png", show_shapes=True, show_layer_names=True)
cnn_model.fit(X, y, batch_size=16, epochs=65, validation_split=0.5, shuffle=True)
# Save as SavedModel
cnn_save_model_file = "TensorFlow Models/model_fit_conv_{:%Y%m%d_%H%M%S}".format(datetime.now())
cnn_model.save(filepath=cnn_save_model_file, overwrite=True, include_optimizer=True, save_format=None)
# Convert to TF-Lite model and save
cnn_converter = tf.lite.TFLiteConverter.from_keras_model(cnn_model)
cnn_model_lite = cnn_converter.convert()
with open("TensorFlow Models/conv_sentiment_classifier.tflite", "wb") as f:
    f.write(cnn_model_lite)  # was 'cnn_model_conv_lite', which is never defined

---
### **Recurrent Neural Network**


In [None]:
# RNN
rnn_model = create_lstm_model(units=4, learn_rate=1e-4, n_blocks=0)  # 25,505 trainable parameters
plot_model(rnn_model, to_file="TensorFlow Models/rnn_model.png", show_shapes=True, show_layer_names=True)
rnn_model.fit(X, y, batch_size=16, epochs=5, validation_split=0.5, shuffle=True)
# Save as SavedModel
rnn_save_model_file = "TensorFlow Models/model_fit_rnn_{:%Y%m%d_%H%M%S}".format(datetime.now())
rnn_model.save(filepath=rnn_save_model_file, overwrite=True, include_optimizer=True, save_format=None)
# Convert to TF-Lite model and save
rnn_converter = tf.lite.TFLiteConverter.from_keras_model(rnn_model)  # expects the model object, not the save path
rnn_model_lite = rnn_converter.convert()
with open("TensorFlow Models/lstm_sentiment_classifier.tflite", "wb") as f:
    f.write(rnn_model_lite)



---
### **Some quick, un-scientific tests of the models**

Let's try the fitted models out on some example text. These sentences are meant to include a mix of clearly negative, neutral, and positive statements.

In [None]:
new_sentence = [
    "this place sucks so much. i hate it. i never want to go here ever again. please, listen to me when i tell you to avoid it like the plague.",
    "this is the best movie i've ever seen so full of excitement and beautiful moments to cherish",
    "it was ok, good, but not great. they should add more dinosaurs to make it better.",
    "the movie was pretty good and i liked most of it, but the acting could use some work",
    "this is the worst product i've ever purchased. it broke within hours of use.",
    "new research reveals the secret to being the cutest marine animal that ever existed.",
    "global carbon emissions are down over 80 percent as climate improves for millions",
    "congress passes legislation to protect endangered sea turtles.",
    "reading this book was life affirming and now i have the confidence to express my best work. great job. this is the most awesome thing ever.",
    "the lemon potatoes were disgusting and i had a bad time. overall, this place is gross. don't ever go here if you can help it."]
#new_sentence = ["This text sentiment analyzer could be used when practicing for a presentation or drafting a writing project. It predicts the sentiment of example text using deep learning models that were trained on the IMDB movie reviews data set. See below for more information about how the models were created."]
new_sentence_processed = pre_process_sentence(new_sentence)
new_sentence_tokenized = tokenizer.texts_to_sequences(new_sentence_processed)
X_new = pad_sequences(new_sentence_tokenized, padding='post', maxlen=250)

# CNN predictions for new data
#y_new = cnn_model.predict(X_new)
#for text, sentiment in zip(new_sentence, y_new):
#    print(f"{text}: {np.round(sentiment,3)}")

# RNN predictions for new data
y_new = rnn_model.predict(X_new)
for text, sentiment in zip(new_sentence, y_new):
    print(f"{text}: {np.round(sentiment,3)}")
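
The sigmoid output is a score between 0 and 1, where 1 corresponds to the positive label used in training. Because the models are binary classifiers, any "neutral" reading has to be imposed afterwards. One possible way to do that is sketched below, using hypothetical cut-offs of 0.4 and 0.6 that are not taken from the project:

```python
def describe_sentiment(score: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a sigmoid score to a coarse label; `low` and `high` are illustrative thresholds."""
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"

for s in (0.05, 0.5, 0.93):
    print(f"{s:.2f} -> {describe_sentiment(s)}")
```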




---


## **2) Save Fitted Models to Google Drive**

In [None]:
import shutil
shutil.make_archive('TensorFlow Models', 'zip', 'TensorFlow Models')
!gsutil cp -r "TensorFlow Models.zip" "/content/drive/My Drive/CYT"
shutil.make_archive('tokenizer', 'zip', 'tokenizer')
!gsutil cp -r "tokenizer.zip" "/content/drive/My Drive/CYT"
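
On the receiving end (for example, the Raspberry Pi), the archives can be unpacked and the tokenizer restored with the standard library. This is a sketch, assuming the zip files were produced as above; `restore_pickle` is a hypothetical helper, and unpickling the real tokenizer also requires Keras to be installed on that machine.

```python
import pickle
import shutil
from pathlib import Path

def restore_pickle(archive: str, extract_to: str, member: str):
    """Unpack `archive` into `extract_to` and unpickle the file `member` inside it."""
    shutil.unpack_archive(archive, extract_to)
    with open(Path(extract_to) / member, "rb") as f:
        return pickle.load(f)   # only unpickle files you created yourself
```

For the tokenizer saved earlier, the call would be along the lines of `restore_pickle("tokenizer.zip", "tokenizer", "sentence_tokenizer_fitted.sav")`.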



---


# **LICENSE**

*MIT License*

*Copyright (c) 2020 Eric Van Cleave*

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.