## 1. Gathering the data

In [1]:
# add and unzip the dataset here
! ls
! wget http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
! unzip stanfordSentimentTreebank.zip

sample_data
--2020-11-21 11:42:09--  http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip [following]
--2020-11-21 11:42:09--  https://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6372817 (6.1M) [application/zip]
Saving to: ‘stanfordSentimentTreebank.zip’


2020-11-21 11:42:10 (11.9 MB/s) - ‘stanfordSentimentTreebank.zip’ saved [6372817/6372817]

Archive:  stanfordSentimentTreebank.zip
   creating: stanfordSentimentTreebank/
  inflating: stanfordSentimentTreebank/datasetSentences.txt  
   creating: __MACOSX/
   creating: __MACOSX/stanfordSentimentTreebank/
  inflati

In [2]:
! cat stanfordSentimentTreebank/README.txt

Stanford Sentiment Treebank V1.0

This is the dataset of the paper:

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts
Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

If you use this dataset in your research, please cite the above paper.

@incollection{SocherEtAl2013:RNTN,
title = {{Parsing With Compositional Vector Grammars}},
author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang and Christopher Manning and Andrew Ng and Christopher Potts},
booktitle = {{EMNLP}},
year = {2013}
}

This file includes:
1. original_rt_snippets.txt contains 10,605 processed snippets from the original pool of Rotten Tomatoes HTML files. Please note that some snippet may contain multiple sentences.

2. dictionary.txt contains all phrases and their IDs, separated by a vertical line |

3. sentiment_labels.txt contains all phrase 

In [3]:
# take a peek at the dataset format
! echo "----- contents of the treebank -------------------"
! ls stanfordSentimentTreebank
! echo "----- last 5 lines of dictionary.txt -------------"
! tail -n 5 stanfordSentimentTreebank/dictionary.txt
! echo "----- last 5 lines of sentiment_labels.txt -------"
! tail -n 5 stanfordSentimentTreebank/sentiment_labels.txt

----- contents of the treebank -------------------
datasetSentences.txt  dictionary.txt		README.txt	      SOStr.txt
datasetSplit.txt      original_rt_snippets.txt	sentiment_labels.txt  STree.txt
----- last 5 lines of dictionary.txt -------------
zoning ordinances to protect your community from the dullest science fiction|220441
zzzzzzzzz|179256
élan|220442
É|220443
É um passatempo descompromissado|220444
----- last 5 lines of sentiment_labels.txt -------
239227|0.36111
239228|0.38889
239229|0.33333
239230|0.88889
239231|0.5


In [4]:
# install all the dependencies here
! pip install bert-for-tf2

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/18/d3/820ccaf55f1e24b5dd43583ac0da6d86c2d27bbdfffadbba69bafe73ca93/bert-for-tf2-0.14.7.tar.gz (41kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... 

In [30]:
# import all the dependencies here
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

from tensorflow.keras import layers

import tensorflow_hub as hub
import pandas as pd
import bert

import math
import random

In [6]:
# reading the dataset
dataset_df = pd.read_csv('stanfordSentimentTreebank/dictionary.txt', sep='\n')
dataset_df.head()

Unnamed: 0,!|0
0,! '|22935
1,! ''|18235
2,! Alas|179257
3,! Brilliant|22936
4,! Brilliant !|40532


In [7]:
# formatting the dataframe for processing
dataset_df['phrase_text'] = dataset_df['!|0'].apply(lambda x: x.split('|')[0])
dataset_df['phrase_ids'] = dataset_df['!|0'].apply(lambda x: x.split('|')[1])
dataset_df = dataset_df.drop('!|0', axis=1)
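
Since `dictionary.txt` is pipe-delimited and has no header row, it can also be parsed in a single `read_csv` call instead of reading whole lines and splitting them afterwards. A minimal sketch, assuming the file layout shown above; the `read_pipe_file` helper and the in-memory sample are hypothetical stand-ins, not part of the notebook:

```python
import csv
import io

import pandas as pd


def read_pipe_file(source, column_names):
    """Parse a headerless, pipe-delimited file into a DataFrame of strings."""
    return pd.read_csv(
        source,
        sep="|",
        header=None,             # the first line is data, not a header
        names=column_names,
        dtype=str,               # keep ids as strings so they merge cleanly
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        keep_default_na=False,   # do not turn phrases such as "NA" into NaN
    )


# hypothetical three-line sample in the same format as dictionary.txt
sample = io.StringIO("!|0\n! '|22935\n! ''|18235\n")
sample_df = read_pipe_file(sample, ["phrase_text", "phrase_ids"])
print(sample_df)
```

With `header=None`, the first phrase (`!`, id `0`) stays in the data instead of becoming the column name.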

In [8]:
# take a peek at the dataframe
dataset_df.tail()

Unnamed: 0,phrase_text,phrase_ids
239226,zoning ordinances to protect your community fr...,220441
239227,zzzzzzzzz,179256
239228,élan,220442
239229,É,220443
239230,É um passatempo descompromissado,220444


In [9]:
# reading the sentiment data
sentiment_df = pd.read_csv('stanfordSentimentTreebank/sentiment_labels.txt', sep='\n')
sentiment_df.head()

Unnamed: 0,phrase ids|sentiment values
0,0|0.5
1,1|0.5
2,2|0.44444
3,3|0.5
4,4|0.42708


In [10]:
# formatting the sentiment dataframe for processing
sentiment_df['phrase_ids'] = sentiment_df['phrase ids|sentiment values'].apply(lambda x: x.split('|')[0])
sentiment_df['sentiment_values'] = sentiment_df['phrase ids|sentiment values'].apply(lambda x: x.split('|')[1])
sentiment_df = sentiment_df.drop('phrase ids|sentiment values', axis=1)

In [11]:
sentiment_df.head()

Unnamed: 0,phrase_ids,sentiment_values
0,0,0.5
1,1,0.5
2,2,0.44444
3,3,0.5
4,4,0.42708


In [12]:
# let's merge the phrases and sentiments
dataset_sentiment_df = pd.merge(left=dataset_df, right=sentiment_df, how='inner', on='phrase_ids')
# let's also validate the number of datapoints
print(f"dataset df shape: {dataset_df.shape}")
print(f"sentiment df shape: {sentiment_df.shape}")
print(f"dataset_sentiment df shape: {dataset_sentiment_df.shape}")

dataset df shape: (239231, 2)
sentiment df shape: (239232, 2)
dataset_sentiment df shape: (239231, 3)


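*seems good, but we are one datapoint short. it was not lost in the merge itself: `dictionary.txt` has no header row, so `read_csv` consumed its first line (`!|0`) as the column name. that's okay for now.*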
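
To see which row drops out of an inner merge, an outer merge with `indicator=True` marks every row by where it came from. A small sketch on hypothetical miniature frames (the values are made up for illustration and are not taken from the treebank):

```python
import pandas as pd

# hypothetical miniature versions of the two frames
phrases = pd.DataFrame({
    "phrase_text": ["! '", "! ''"],
    "phrase_ids": ["22935", "18235"],
})
sentiments = pd.DataFrame({
    "phrase_ids": ["0", "22935", "18235"],
    "sentiment_values": ["0.5", "0.4", "0.6"],
})

# indicator=True adds a "_merge" column: both, left_only or right_only
merged = pd.merge(phrases, sentiments, how="outer", on="phrase_ids", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Any row tagged `right_only` has a sentiment score but no phrase, which is the pattern expected here for phrase id `0`.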

In [13]:
def recover_sentiment_class(sentiment_value: float):
  """
  recovering classes from sentiment_values
  [very negative, negative, neutral, positive, very positive]
  [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
  [1, 2, 3, 4, 5]
  respectively
  :params:
    sentiment_value: floating value of sentiment
  """
  if sentiment_value <= 0.2:
    return 1
  elif sentiment_value <= 0.4:
    return 2
  elif sentiment_value <= 0.6:
    return 3
  elif sentiment_value <= 0.8:
    return 4
  else:
    return 5
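
The same binning can be done on a whole column at once with `pd.cut`. A sketch under the assumption that every score lies in `[0, 1]`; `recover_sentiment_classes` is a hypothetical vectorized counterpart of the function above, and the sample scores are illustrative:

```python
import pandas as pd

BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
CLASS_LABELS = [1, 2, 3, 4, 5]


def recover_sentiment_classes(values: pd.Series) -> pd.Series:
    """Map sentiment scores to classes 1..5 using right-closed bins."""
    binned = pd.cut(
        values.astype(float),
        bins=BIN_EDGES,
        labels=CLASS_LABELS,
        include_lowest=True,  # makes the first bin [0, 0.2] instead of (0, 0.2]
    )
    return binned.astype(int)


scores = pd.Series(["0.0", "0.2", "0.36111", "0.5", "0.88889", "1.0"])
classes = recover_sentiment_classes(scores)
print(classes.tolist())
```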

In [14]:
dataset_sentiment_df['sentiment_class'] = dataset_sentiment_df['sentiment_values'].apply(
    lambda x: recover_sentiment_class(float(x)))

In [15]:
dataset_sentiment_df.tail()

Unnamed: 0,phrase_text,phrase_ids,sentiment_values,sentiment_class
239226,zoning ordinances to protect your community fr...,220441,0.13889,1
239227,zzzzzzzzz,179256,0.19444,1
239228,élan,220442,0.51389,3
239229,É,220443,0.5,3
239230,É um passatempo descompromissado,220444,0.5,3


## 2. Preprocessing the data

In [16]:
# let's setup the tokenizer
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)

In [17]:
# define the vocab and tokenizer from the bert_layer here
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In [18]:
# simple function to encode the sentence
def encode_sentence(sent):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent))

*we're using the BERT layer for tokenization only!*

In [19]:
tokenizer.tokenize("don't be so judgemental")

['don', "'", 't', 'be', 'so', 'judgement', '##al']

*this is BERT's WordPiece tokenizer: a word that is not in the vocabulary is split into sub-word pieces, and continuation pieces are marked with `##`.*
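
The split of `judgemental` into `judgement` and `##al` follows from greedy longest-match-first lookup against the vocabulary. A simplified sketch of that idea, not the library's actual implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# hypothetical toy vocabulary, far smaller than the real BERT vocabulary
toy_vocab = {"judge", "judgement", "##al", "##ment"}
pieces = wordpiece_tokenize("judgemental", toy_vocab)
print(pieces)
```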

In [20]:
list_of_pharses = list(dataset_sentiment_df['phrase_text'])
encoded_phrases = [encode_sentence(phrase) for phrase in list_of_pharses]

In [21]:
y = dataset_sentiment_df['sentiment_class']

In [22]:
list_of_pharses_with_length = [[phrase, y[i], len(phrase)]
                 for i, phrase in enumerate(encoded_phrases)]

In [23]:
random.shuffle(list_of_pharses_with_length)

*shuffling first, so that phrases of equal length stay in random order after the sort below (Python's sort is stable).*

In [24]:
list_of_pharses_with_length.sort(key=lambda x: x[2])

*sorting by length puts phrases of similar length into the same batch, so each batch only has to be padded to the length of its own longest phrase.*
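
Sorting by length before batching keeps padding small, because per-batch padding only extends each sequence to the longest one in its own batch. A plain-Python sketch of that effect; `pad_batches`, `count_padding` and the toy sequences are hypothetical and only mimic what `padded_batch` does:

```python
def pad_batches(sequences, batch_size, pad_id=0):
    """Group sequences into batches and pad each batch to its own max length."""
    batches = []
    for i in range(0, len(sequences), batch_size):
        chunk = sequences[i:i + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches


def count_padding(sequences, batch_size):
    """Number of padding cells added across all batches."""
    batches = pad_batches(sequences, batch_size)
    total_cells = sum(len(row) for batch in batches for row in batch)
    total_tokens = sum(len(seq) for seq in sequences)
    return total_cells - total_tokens


# hypothetical token-id sequences of mixed length
toy_sequences = [[1], [2, 3, 4, 5, 6], [7, 8], [9, 10, 11, 12], [13], [14, 15, 16, 17, 18]]

padding_unsorted = count_padding(toy_sequences, batch_size=2)
padding_sorted = count_padding(sorted(toy_sequences, key=len), batch_size=2)
print(padding_unsorted, padding_sorted)
```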

In [25]:
sorted_phrases_sentiments = [(phrase_with_length[0], phrase_with_length[1]) for phrase_with_length in list_of_pharses_with_length]

In [26]:
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_phrases_sentiments, output_types=(tf.int32, tf.int32))

In [27]:
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

In [28]:
next(iter(batched_dataset))

(<tf.Tensor: shape=(32, 1), dtype=int32, numpy=
 array([[21962],
        [ 3581],
        [10657],
        [ 2447],
        [14474],
        [20437],
        [10349],
        [13999],
        [ 4348],
        [ 9556],
        [22570],
        [ 5623],
        [ 5593],
        [ 4816],
        [18006],
        [ 6703],
        [13827],
        [ 5070],
        [ 9010],
        [12401],
        [ 4975],
        [ 6644],
        [ 8011],
        [16839],
        [ 2440],
        [ 2203],
        [ 3458],
        [29387],
        [ 2789],
        [11255],
        [19910],
        [11245]], dtype=int32)>, <tf.Tensor: shape=(32,), dtype=int32, numpy=
 array([3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 2, 2, 3, 3, 3, 3, 3,
        2, 3, 4, 3, 3, 3, 3, 3, 3, 3], dtype=int32)>)

In [31]:
TOTAL_BATCHES = math.ceil(len(list_of_pharses_with_length) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
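# shuffle() returns a new dataset, so the result has to be assigned;
# reshuffle_each_iteration=False fixes the order so that take() and skip() never overlap
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)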
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

*keeping 10% of the batches for evaluation*
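
The arithmetic behind the split can be checked without TensorFlow. A plain-Python sketch, assuming the 239,231 merged rows reported earlier; `split_batches` is a hypothetical helper that shuffles once with a fixed seed and then cuts the list, so the two parts cannot overlap:

```python
import math
import random


def split_batches(batches, holdout_divisor=10, seed=0):
    """Shuffle once with a fixed seed, then cut into (test, train) parts."""
    shuffled = list(batches)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // holdout_divisor
    return shuffled[:n_test], shuffled[n_test:]


n_phrases = 239231  # rows in dataset_sentiment_df
batch_size = 32
total_batches = math.ceil(n_phrases / batch_size)

# batch indices stand in for the actual batches
test_batches, train_batches = split_batches(range(total_batches))
print(total_batches, len(test_batches), len(train_batches))
```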

## 3. Modelling

In [50]:
class FiftyGram_SentimentClassification_VanillaModel(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="vanilla_model"):
        super(FiftyGram_SentimentClassification_VanillaModel, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training=False):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

In [49]:
# Defining all the hyperparameters
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 5

DROPOUT_RATE = 0.2

NB_EPOCHS = 4

In [51]:
vanilla_model = FiftyGram_SentimentClassification_VanillaModel(
    vocabulary_size=VOCAB_LENGTH,
    embedding_dimensions=EMB_DIM,
    cnn_filters=CNN_FILTERS,
    dnn_units=DNN_UNITS,
    model_output_classes=OUTPUT_CLASSES,
    dropout_rate=DROPOUT_RATE
    )

In [47]:
# loss and optimization metrics
if OUTPUT_CLASSES == 2:
    vanilla_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    vanilla_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])
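
Sparse categorical cross-entropy expects integer labels in the range `[0, num_classes)`, whereas `recover_sentiment_class` returns 1 to 5, so the labels would need to be shifted down by one before training with a 5-unit softmax. The NumPy sketch below re-implements the loss for illustration only (it is not the Keras code) and shows why a label of 5 cannot index five outputs:

```python
import numpy as np

NUM_CLASSES = 5


def sparse_categorical_crossentropy(y_true, y_pred):
    """Illustrative NumPy version: -log of the probability at the label index."""
    rows = np.arange(len(y_true))
    return -np.log(y_pred[rows, y_true])


# hypothetical model output: uniform probabilities over the five classes
uniform_probs = np.full((3, NUM_CLASSES), 1.0 / NUM_CLASSES)
labels_one_based = np.array([1, 3, 5])
labels_zero_based = labels_one_based - 1  # shift 1..5 to 0..4

losses = sparse_categorical_crossentropy(labels_zero_based, uniform_probs)

try:
    sparse_categorical_crossentropy(labels_one_based, uniform_probs)
    one_based_ok = True
except IndexError:
    one_based_ok = False  # label 5 points past the last output unit

print(losses, one_based_ok)
```

In the notebook, one option would be to subtract 1 where `y` is built from `sentiment_class`.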

*the model has not been fit yet due to improper embedding dimensions; hyperparameter search with Keras Tuner is to follow.*
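
One possible source of the dimension problem (an assumption, since the notebook does not show the error) is that the first batches contain one-token phrases, as the `shape=(32, 1)` output earlier shows, while the convolutions use `padding="valid"` with kernel sizes 2, 3 and 4. The sketch below computes the output length of a 1-D convolution with the standard formula; `conv1d_output_length` is a hypothetical helper, not a Keras function:

```python
def conv1d_output_length(input_length, kernel_size, padding="valid", stride=1):
    """Output length of a 1-D convolution without dilation."""
    if padding == "valid":
        length = input_length - kernel_size + 1  # the kernel must fit entirely
    elif padding == "same":
        length = input_length                    # input is zero-padded to keep its length
    else:
        raise ValueError(f"unsupported padding: {padding}")
    return (length + stride - 1) // stride       # ceiling division by the stride


# a one-token phrase against the three kernel sizes used in the model
for kernel_size in (2, 3, 4):
    valid_len = conv1d_output_length(1, kernel_size, "valid")
    same_len = conv1d_output_length(1, kernel_size, "same")
    print(kernel_size, valid_len, same_len)
```

A non-positive result means the layer cannot produce any output for that input. Using `padding="same"`, or padding every batch to at least the largest kernel size, would avoid that case.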