The model used in this notebook is inspired by:
Duyu Tang, Bing Qin, Xiaocheng Feng, Ting Liu. 2016. 'Effective LSTMs for Target-Dependent Sentiment Classification'. https://arxiv.org/abs/1512.01100

The paper develops two target-dependent long short-term memory (LSTM) models, shown in Figure 1. This notebook uses a BERT model in place of the LSTMs.



![Figure 1](https://drive.google.com/file/d/1HnS2uZhalTo7T7f_K6yqEzSm9XU7lFwg/view?usp=sharing)

In [1]:
!pip install pytorch_pretrained_bert



In [0]:
import numpy as np
import pandas as pd
import random
import re
import string 
import tensorflow as tf
import tensorflow_hub as hub

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import Precision, Recall, FalseNegatives, FalsePositives
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.utils import to_categorical
from pytorch_pretrained_bert import BertTokenizer
from tensorflow.keras.models import Model 

# Data Preprocessing

In [0]:
# Import data
train = pd.read_excel('Trainset.xlsx')
test = pd.read_excel('Testset.xlsx')

# Eliminate the NAs
train = train.fillna('')
test = test.fillna('')

# Remove the rows without Opinion Category values
train = train[train.OpinionCategory != ''] 
test = test[test.OpinionCategory != ''] 

# Sort the data
train = train.sort_values('ID_and_Review').reset_index(drop=True)
test = test.sort_values('ID_and_Review').reset_index(drop=True)

In [4]:
train.head()

Unnamed: 0,ID_number,Review_ID,ID_and_Review,OutOfScope,Sentence_ID,OpinionCategory,OpinionFrom,Polarity,AspectTerm,OpinionTo,Text
0,1004293,1,1004293:0,,1,RESTAURANT#GENERAL,51,negative,place,56,Judging from previous posts this used to be a ...
1,1004293,1,1004293:1,,2,SERVICE#GENERAL,75,negative,staff,80,"We, there were four of us, arrived at noon - t..."
2,1004293,1,1004293:2,,3,SERVICE#GENERAL,0,negative,,0,"They never brought us complimentary noodles, i..."
3,1004293,1,1004293:3,,4,FOOD#QUALITY,4,negative,food,8,The food was lousy - too sweet or too salty an...
4,1004293,1,1004293:3,,4,FOOD#STYLE_OPTIONS,52,negative,portions,60,The food was lousy - too sweet or too salty an...


In [5]:
train.Polarity.value_counts(), test.Polarity.value_counts()

(positive    1657
 negative     749
 neutral      101
 Name: Polarity, dtype: int64, positive    611
 negative    204
 neutral      44
 Name: Polarity, dtype: int64)

The training data consists of 11 variables. Four of them are identifiers: they give the ID numbers of the sentence, the reviewer and the review, and the combination of these. The OutOfScope variable no longer carries any information once I removed the rows with null OpinionCategory values. OpinionCategory shows the aspect that the review refers to. It has 12 classes, and each class is an entity with a corresponding attribute, in other words an E#A pair.

In this notebook, I will deal only with the Polarity and the corresponding reviews under the Text column.

### y_train / y_test

In [0]:
y_train = to_categorical(train.Polarity.astype('category').cat.codes)
y_test = to_categorical(test.Polarity.astype('category').cat.codes)

In [7]:
y_train, y_train.shape, y_test, y_test.shape

(array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        ...,
        [1., 0., 0.],
        [1., 0., 0.],
        [0., 0., 1.]], dtype=float32), (2507, 3), array([[0., 0., 1.],
        [0., 0., 1.],
        [0., 1., 0.],
        ...,
        [0., 0., 1.],
        [0., 0., 1.],
        [0., 0., 1.]], dtype=float32), (859, 3))
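
Which column of the one-hot matrix belongs to which polarity follows from how pandas orders categories: for string labels, `astype('category')` sorts the category names, so the codes should come out as negative = 0, neutral = 1, positive = 2. The sketch below checks this on a made-up list of labels (the values in `labels` are illustrative, not taken from the dataset) and builds the one-hot rows with NumPy rather than `to_categorical`, only to keep the example light on dependencies.

```python
import numpy as np
import pandas as pd

# Illustrative labels only; the real ones come from train.Polarity
labels = pd.Series(["positive", "negative", "neutral", "positive"])

categorical = labels.astype("category")
categories = list(categorical.cat.categories)  # category names in sorted order
codes = categorical.cat.codes.to_numpy()       # integer code for each row

# One-hot encode: row i gets a 1 in column codes[i]
one_hot = np.eye(len(categories), dtype="float32")[codes]

print(categories)
print(codes)
print(one_hot)
```

This is consistent with the output above: the first training row is negative and its one-hot vector is `[1., 0., 0.]`.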

## Data Processing for the BERT Model

### beforeAspect List & afterAspect List

To build the inputs for the two target-dependent BERT branches, each text is divided into two parts. The first part runs from the beginning of the sentence through the aspect term, so the aspect term is its last word. The second part begins with the aspect term and runs to the end of the sentence.
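
As a minimal sketch of this split, the helper below slices a sentence at the character offsets stored in OpinionFrom and OpinionTo. The function name `split_around_aspect` and the shortened example sentence are hypothetical, and the sketch returns an empty string when no aspect term is annotated, which differs from the empty list used in the cell that follows.

```python
def split_around_aspect(text, opinion_from, opinion_to):
    """Return (before, after); both parts include the aspect term."""
    if opinion_to == 0:
        # No aspect term annotated: nothing before, whole sentence after
        return "", text
    before = text[:opinion_to]    # sentence start up to the end of the aspect term
    after = text[opinion_from:]   # aspect term up to the end of the sentence
    return before, after


# Shortened version of a training sentence, aspect term "food" at offsets 4-8
before, after = split_around_aspect("The food was lousy", 4, 8)
print(before)
print(after)
```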

In [0]:
beforeAspectList = [] 
for i in range(len(train.Text)):
    opinionTo = int(train.OpinionTo[i])
    if (opinionTo == 0):
        # No aspect term: the empty list stored here becomes the string '[]' after astype(str) below
        beforeAspect = []
        beforeAspectList.append(beforeAspect)
    else:
        beforeAspect = train.Text[i][0:opinionTo]
        beforeAspectList.append(beforeAspect)
        
beforeAspectList_test = [] 
for i in range(len(test.Text)):
    opinionTo = int(test.OpinionTo[i])
    if (opinionTo == 0):
        beforeAspect = []
        beforeAspectList_test.append(beforeAspect)
    else:
        beforeAspect = test.Text[i][0:opinionTo]
        beforeAspectList_test.append(beforeAspect)
        
afterAspectList = [] 

for i in range(len(train.Text)):
    OpinionFrom = int(train.OpinionFrom[i])
    if (OpinionFrom == 0):
        afterAspect = train.Text[i]
        afterAspectList.append(afterAspect)
    else:
        afterAspect = train.Text[i][OpinionFrom:len(train.Text[i])]
        afterAspectList.append(afterAspect)
        
afterAspectList_test = [] 

for i in range(len(test.Text)):
    OpinionFrom = int(test.OpinionFrom[i])
    if (OpinionFrom == 0):
        afterAspect = test.Text[i]
        afterAspectList_test.append(afterAspect)
    else:
        afterAspect = test.Text[i][OpinionFrom:len(test.Text[i])]
        afterAspectList_test.append(afterAspect)

In [9]:
len(beforeAspectList), len(beforeAspectList_test), len(afterAspectList), len(afterAspectList_test)

(2507, 859, 2507, 859)

In [0]:
train.insert(9, "beforeAspect", pd.Series(beforeAspectList).astype(str))
train.insert(11, "afterAspect", pd.Series(afterAspectList).astype(str))

test.insert(9, "beforeAspect", pd.Series(beforeAspectList_test).astype(str))
test.insert(11, "afterAspect", pd.Series(afterAspectList_test).astype(str))

In [11]:
train.head()

Unnamed: 0,ID_number,Review_ID,ID_and_Review,OutOfScope,Sentence_ID,OpinionCategory,OpinionFrom,Polarity,AspectTerm,beforeAspect,OpinionTo,afterAspect,Text
0,1004293,1,1004293:0,,1,RESTAURANT#GENERAL,51,negative,place,Judging from previous posts this used to be a ...,56,"place, but not any longer.",Judging from previous posts this used to be a ...
1,1004293,1,1004293:1,,2,SERVICE#GENERAL,75,negative,staff,"We, there were four of us, arrived at noon - t...",80,staff acted like we were imposing on them and ...,"We, there were four of us, arrived at noon - t..."
2,1004293,1,1004293:2,,3,SERVICE#GENERAL,0,negative,,[],0,"They never brought us complimentary noodles, i...","They never brought us complimentary noodles, i..."
3,1004293,1,1004293:3,,4,FOOD#QUALITY,4,negative,food,The food,8,food was lousy - too sweet or too salty and th...,The food was lousy - too sweet or too salty an...
4,1004293,1,1004293:3,,4,FOOD#STYLE_OPTIONS,52,negative,portions,The food was lousy - too sweet or too salty an...,60,portions tiny.,The food was lousy - too sweet or too salty an...


### BERT Tokenization

In [12]:
# add special tokens for BERT to work properly
sentences_before = ["[CLS] " + sent + " [SEP]" for sent in train.beforeAspect]
sentences_before_test = ["[CLS] " + sent + " [SEP]" for sent in test.beforeAspect]

sentences_after = ["[CLS] " + sent + " [SEP]" for sent in train.afterAspect]
sentences_after_test = ["[CLS] " + sent + " [SEP]" for sent in test.afterAspect]

sentences_before[0], sentences_before_test[0], sentences_after[0], sentences_after_test[0]

('[CLS] Judging from previous posts this used to be a good place [SEP]',
 '[CLS] [] [SEP]',
 '[CLS] place, but not any longer. [SEP]',
 '[CLS] Yum! [SEP]')

For tokenization, the pre-trained bert-base-uncased vocabulary is used. It is a WordPiece vocabulary of roughly 30,000 tokens.
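
WordPiece splits a word that is not in the vocabulary into known sub-word pieces by greedy longest-match-first search, marking continuation pieces with `##`. The sketch below illustrates the idea on a tiny made-up vocabulary. `wordpiece_split` and `toy_vocab` are hypothetical names, and the real `BertTokenizer` also lower-cases the text and splits on whitespace and punctuation before this step.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary for illustration only
toy_vocab = {"play", "##ing", "food", "[UNK]"}
print(wordpiece_split("playing", toy_vocab))
print(wordpiece_split("food", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```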

In [0]:
# Tokenize with BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_before = [tokenizer.tokenize(sent) for sent in sentences_before]
tokenized_before_test = [tokenizer.tokenize(sent) for sent in sentences_before_test]

tokenized_after = [tokenizer.tokenize(sent) for sent in sentences_after]
tokenized_after_test = [tokenizer.tokenize(sent) for sent in sentences_after_test]

#tokenized_texts[0]

The BERT model needs three inputs:
- Input IDs: the ID number of each token, followed by padding. The IDs are looked up in the BERT vocabulary.
- Mask IDs: indicate which elements of the sequence are real tokens (1) and which are padding (0).
- Segment IDs: distinguish the sentences in a sequence, 0 for tokens of the first sentence and 1 for tokens of the second. Every input in this notebook is a single sentence, so all segment IDs are 0.
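
To make the three inputs concrete, here is a small sketch for one short token sequence padded to a fixed length. The vocabulary `toy_vocab` and its ID numbers are made up for illustration; the real IDs come from the bert-base-uncased vocabulary.

```python
# Made-up vocabulary; real IDs come from the BERT tokenizer
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "food": 4}

tokens = ["[CLS]", "the", "food", "[SEP]"]
max_seq_length = 6
n_pad = max_seq_length - len(tokens)

# Input IDs: vocabulary lookup, then pad with the [PAD] id
input_ids = [toy_vocab[token] for token in tokens] + [toy_vocab["[PAD]"]] * n_pad
# Mask IDs: 1 for real tokens, 0 for padding
input_mask = [1] * len(tokens) + [0] * n_pad
# Segment IDs: a single sentence, so every position is 0
segment_ids = [0] * max_seq_length

print(input_ids)
print(input_mask)
print(segment_ids)
```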

The functions below are taken from: https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22

In [0]:
def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))

def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

### Inputs for the before aspect BERT model (left BERT model)

In [0]:
# find the longest sequence for the padding
def find_max_list(sequences):
    """Return the length of the longest sequence (avoids shadowing the built-in `list`)."""
    return max(len(seq) for seq in sequences)
    
longestSeq_train = find_max_list(tokenized_before)
longestSeq_test = find_max_list(tokenized_before_test)
max_seq_length_before = max(longestSeq_train, longestSeq_test)

# Find input_ids, mask_ids and segment_ids for the before aspect BERT model
input_ids_before = [get_ids(tokens, tokenizer, max_seq_length_before) for tokens in tokenized_before]
input_masks_before = [get_masks(tokens, max_seq_length_before) for tokens in tokenized_before]
input_segments_before = [get_segments(tokens, max_seq_length_before) for tokens in tokenized_before]

input_ids_before_test = [get_ids(tokens, tokenizer, max_seq_length_before) for tokens in tokenized_before_test]
input_masks_before_test = [get_masks(tokens, max_seq_length_before) for tokens in tokenized_before_test]
input_segments_before_test = [get_segments(tokens, max_seq_length_before) for tokens in tokenized_before_test]
    
# For the model, I converted the lists to tensors.
input_ids_before = tf.convert_to_tensor(input_ids_before, dtype = tf.int32)
input_masks_before = tf.convert_to_tensor(input_masks_before, dtype = tf.int32)
input_segments_before = tf.convert_to_tensor(input_segments_before, dtype = tf.int32)

input_ids_before_test = tf.convert_to_tensor(input_ids_before_test, dtype = tf.int32)
input_masks_before_test = tf.convert_to_tensor(input_masks_before_test, dtype = tf.int32)
input_segments_before_test = tf.convert_to_tensor(input_segments_before_test, dtype = tf.int32)

### Inputs for the after aspect BERT model (right BERT model)

In [0]:
# find the longest sequence for the padding
def find_max_list(sequences):
    """Return the length of the longest sequence (avoids shadowing the built-in `list`)."""
    return max(len(seq) for seq in sequences)
    
longestSeq_train = find_max_list(tokenized_after)
longestSeq_test = find_max_list(tokenized_after_test)
max_seq_length_after = max(longestSeq_train, longestSeq_test)

# Find input_ids, mask_ids and segment_ids for the after aspect BERT model
input_ids_after = [get_ids(tokens, tokenizer, max_seq_length_after) for tokens in tokenized_after]
input_masks_after = [get_masks(tokens, max_seq_length_after) for tokens in tokenized_after]
input_segments_after = [get_segments(tokens, max_seq_length_after) for tokens in tokenized_after]

input_ids_after_test = [get_ids(tokens, tokenizer, max_seq_length_after) for tokens in tokenized_after_test]
input_masks_after_test = [get_masks(tokens, max_seq_length_after) for tokens in tokenized_after_test]
input_segments_after_test = [get_segments(tokens, max_seq_length_after) for tokens in tokenized_after_test]

# For the model, I converted the lists to tensors.    
input_ids_after = tf.convert_to_tensor(input_ids_after, dtype = tf.int32)
input_masks_after = tf.convert_to_tensor(input_masks_after, dtype = tf.int32)
input_segments_after = tf.convert_to_tensor(input_segments_after, dtype = tf.int32)

input_ids_after_test = tf.convert_to_tensor(input_ids_after_test, dtype = tf.int32)
input_masks_after_test = tf.convert_to_tensor(input_masks_after_test, dtype = tf.int32)
input_segments_after_test = tf.convert_to_tensor(input_segments_after_test, dtype = tf.int32)

## BERT MODEL

In [20]:
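# Seed both Python and TensorFlow; random.seed alone does not affect TensorFlow's weight initialization or shuffling
random.seed(123)
tf.random.set_seed(123)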
# Three Inputs of the Left Bert Model
InputIDLayer_left = Input(shape=(max_seq_length_before,), dtype=tf.int32, name="InputIDs_left")
MaskIDLayer_left = Input(shape = (max_seq_length_before,), dtype = tf.int32, name = "MaskIDs_left")
SegmentIDLayer_left = Input(shape = (max_seq_length_before,), dtype = tf.int32, name = "SegmentIDs_left")

# Import the pre-trained uncased Bert model
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)

# Since it is a classification problem, the pooled output is needed.
PooledOutput_left, SequenceOutput_left = bert_layer([InputIDLayer_left, MaskIDLayer_left, SegmentIDLayer_left])
output_left = Dense(258)(PooledOutput_left)

# Three Inputs of the Right Bert Model
InputIDLayer_right = Input(shape=(max_seq_length_after,), dtype=tf.int32, name="InputIDs_right")
MaskIDLayer_right = Input(shape = (max_seq_length_after,), dtype = tf.int32, name = "MaskIDs_right")
SegmentIDLayer_right = Input(shape = (max_seq_length_after,), dtype = tf.int32, name = "SegmentIDs_right")

# Since it is a classification problem, the pooled output is needed.
PooledOutput_right, SequenceOutput_right = bert_layer([InputIDLayer_right, MaskIDLayer_right, SegmentIDLayer_right])
output_right = Dense(258)(PooledOutput_right)

# Concatenate the layers and classify with Dense
allLayers = tf.keras.layers.concatenate([output_left, output_right])
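# Note: the three classes are mutually exclusive, so 'softmax' is the usual pairing with
# CategoricalCrossentropy; 'sigmoid' is kept because the results below were obtained with it.
output = Dense(3, activation = 'sigmoid')(allLayers)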

model = Model(inputs=[InputIDLayer_left, MaskIDLayer_left, SegmentIDLayer_left,
                     InputIDLayer_right, MaskIDLayer_right, SegmentIDLayer_right], outputs = [output])

# Model Compilation
learning_rate = 2e-5
number_of_epochs = 10
optimizer = Adam(learning_rate = learning_rate, epsilon = 1e-08)
loss = CategoricalCrossentropy(from_logits = False)
metrics = [Precision(), Recall(),
          FalseNegatives(), FalsePositives()]

model.compile(optimizer = optimizer, 
              loss = loss,
              metrics = metrics)

# Model Training & Fine-Tuning on train data
earlyStopping = EarlyStopping(monitor = "val_loss", mode = "min", patience = 1)

bert_history = model.fit([input_ids_before, input_masks_before, input_segments_before,
                          input_ids_after, input_masks_after, input_segments_after], [y_train],
                         epochs = number_of_epochs, 
                         batch_size = 32,
                         validation_split = 0.1,
                         callbacks = [earlyStopping]
                         )

Epoch 1/10
Epoch 2/10
Epoch 3/10
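
Training ends after the third of ten epochs, presumably because of the EarlyStopping callback. With `patience = 1`, training halts as soon as one epoch fails to improve on the best `val_loss` seen so far. The pure-Python sketch below mimics that rule. The function name `stopping_epoch` and the loss values are hypothetical, since the output above does not show the actual validation losses.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: epoch 3 is worse than epoch 2
hypothetical_val_losses = [0.60, 0.50, 0.55, 0.40]
print(stopping_epoch(hypothetical_val_losses))
```

With these settings the model keeps the weights from the last epoch that ran, not from the best one, unless `restore_best_weights=True` is passed to `EarlyStopping`.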


In [0]:
# Predictions
pred = model.predict([input_ids_before_test, input_masks_before_test, input_segments_before_test,
                          input_ids_after_test, input_masks_after_test, input_segments_after_test])

In [22]:
# Model Evaluation - Loss, Precision, Recall, FalseNegatives, FalsePositives
results = model.evaluate([input_ids_before_test, input_masks_before_test, input_segments_before_test,
                          input_ids_after_test, input_masks_after_test, input_segments_after_test], y_test)



In [23]:
results

[0.4858801066875458, 0.9764150977134705, 0.48195576667785645, 445.0, 10.0]

In [25]:
f1_score = 2*((results[1] * results[2])/(results[1] + results[2]))
f1_score

0.6453624362699225
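
The Keras `Precision`, `Recall`, `FalseNegatives` and `FalsePositives` metrics are computed over all 859 × 3 one-hot entries at their default threshold of 0.5, so the reported numbers can be cross-checked from the counts alone. Each test row has exactly one positive entry, which means true positives plus false negatives must equal 859. A short check, assuming those defaults:

```python
n_test = 859           # rows in y_test, each with exactly one positive label
false_negatives = 445  # results[3]
false_positives = 10   # results[4]

true_positives = n_test - false_negatives
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(true_positives, precision, recall, f1)
```

These values agree with `results` up to float32 precision. A recall of about 0.48 means that for 445 of the 859 test rows the output for the true class did not exceed 0.5.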