# A Complete Step by Step Tutorial on Sentiment Analysis in Keras and Tensorflow

This Jupyter notebook implements the tutorial described in [an article by Rashida Nasrin Sucky](https://towardsdatascience.com/a-complete-step-by-step-tutorial-on-sentiment-analysis-in-keras-and-tensorflow-ea420cc8913f).

## Data Preparation

Download the dataset ["Reviews of Amazon Baby Products"](https://www.kaggle.com/datasets/sameersmahajan/reviews-of-amazon-baby-products?resource=download) and save it as `amazon_baby.csv` in the same directory as this notebook before running the cells.

The first step in building a sentiment analysis model, as with any other ML model, is data preprocessing.
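Two preprocessing decisions in this notebook are worth checking on a small example: star ratings are collapsed into binary labels (1 and 2 become negative, everything else positive), and the data is split 80/20 by position rather than at random. The sketch below reproduces both rules on a handful of made-up ratings and counts the labels on each side of the split. The name `rating_to_sentiment` and the `toy_ratings` values are hypothetical and exist only for illustration; they are not part of the dataset or of the original tutorial.

```python
from collections import Counter


def rating_to_sentiment(rating):
    """Same rule as the notebook: ratings 1 and 2 are negative (0), all others positive (1)."""
    return 0 if rating in (1, 2) else 1


# Hypothetical ratings, for illustration only (not taken from the dataset).
toy_ratings = [5, 4, 1, 3, 5, 2, 5, 4, 3, 5]
toy_labels = [rating_to_sentiment(r) for r in toy_ratings]

# Positional 80/20 split, mirroring round(len(df) * 0.8) in the notebook.
toy_split = round(len(toy_labels) * 0.8)
train_part, test_part = toy_labels[:toy_split], toy_labels[toy_split:]

train_counts = Counter(train_part)
test_counts = Counter(test_part)
print("train:", dict(train_counts))  # train: {1: 6, 0: 2}
print("test:", dict(test_counts))    # test: {1: 2}
```

In this toy case the test portion contains no negative examples at all. Running the same kind of count on the real `train_label` and `test_label` series is a cheap way to see whether the positional split leaves both classes represented on each side.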

In [2]:
import pandas as pd
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv('./amazon_baby.csv')
# Label ratings of 1 or 2 as negative (0); ratings of 3, 4, and 5 as positive (1).
df['sentiments'] = df.rating.apply(lambda x: 0 if x in [1, 2] else 1)

## Splitting the Dataset
# 80% for training and 20% for testing.
split = round(len(df)*0.8)
train_reviews = df['review'][:split]
train_label = df['sentiments'][:split]
test_reviews = df['review'][split:]
test_label = df['sentiments'][split:]

## Guarantee Reviews are Strings
# Convert every review to a string so that non-string values
# (for example, missing reviews) do not break the tokenizer.
training_sentences = [str(row) for row in train_reviews]
training_labels = list(train_label)
testing_sentences = [str(row) for row in test_reviews]
testing_labels = list(test_label)

## Additional constants
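vocab_size = 40000      # upper bound on the vocabulary kept by the tokenizer
embedding_dim = 16      # size of each word embedding vector
max_length = 120        # every review is padded or truncated to this many tokens
trunc_type = 'post'     # drop tokens from the end of reviews that are too long
oov_tok = '<OOV>'       # placeholder token for out-of-vocabulary words
padding_type = 'post'   # defined here but not passed to pad_sequences in this cell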

## Initialize the Tokenizer
# The tokenizer builds a vocabulary from the training sentences and
# assigns an integer index to each word. Words outside the vocabulary
# are mapped to oov_tok.
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Convert sentences into sequences of word indices, then pad or
# truncate each sequence to max_length.
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length,
                       truncating=trunc_type)
testing_sentences = tokenizer.texts_to_sequences(testing_sentences)
# Truncate the test set the same way as the training set; otherwise
# pad_sequences falls back to its default of truncating='pre'.
testing_padded = pad_sequences(testing_sentences, maxlen=max_length,
                               truncating=trunc_type)
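To make the tokenize-then-pad step above more concrete, here is a minimal pure-Python sketch of the idea on two toy sentences. It is not the Keras implementation: it skips the punctuation filtering that `Tokenizer` performs, and the helper names (`build_word_index`, `to_sequence`, `pad`) and the toy sentences are hypothetical. It assumes the same conventions the cell above relies on: index 0 is reserved for padding, the OOV token gets index 1, more frequent words get smaller indices, zeros are added at the front (the `pad_sequences` default), and long sequences are cut from the end.

```python
from collections import Counter


def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Replace each word by its index; unknown words get the OOV index."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]


def pad(seq, maxlen, truncating="post"):
    """Cut sequences longer than maxlen, then left-pad shorter ones with zeros."""
    if len(seq) > maxlen:
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


toy_sentences = ["great stroller great price", "the stroller broke"]
toy_index = build_word_index(toy_sentences)
toy_padded = [pad(to_sequence(s, toy_index), maxlen=5) for s in toy_sentences]
print(toy_padded)  # [[0, 2, 3, 2, 4], [0, 0, 5, 3, 6]]

# "terrible" never appeared in the toy training sentences, so it maps to 1 (OOV).
unseen = pad(to_sequence("terrible stroller", toy_index), maxlen=5)
print(unseen)  # [0, 0, 0, 1, 3]
```

The vocabulary is built from the training sentences only, which is why the notebook calls `fit_on_texts` on `training_sentences` and then reuses the same tokenizer for the test reviews.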

