# Sentiment Analysis Model Jupyter Notebook

_Giorgio Bakhiet Derias_
_I3a, Bachelor's thesis_

The aim of this notebook is to show the process of creating a sentiment analysis model which reads text input and is able to attribute an emotion to it.
To train the model, I use several different datasets.

## Setup
These are the Python libraries required for the sentiment analysis.

### Install from requirements
Before I can import anything, I need to install the libraries.
Since I moved my work from a Colab file to this notebook, I saved all the libraries I used in a text file called *requirementsModel*.
When I move to a new environment, this file lets me install all packages at once by running:

In [None]:
#%conda install --file requirementsModel.txt

Make sure pip itself is up to date:

In [None]:
!python -m pip install --upgrade pip

### Installing other libraries

In [None]:
!pip install tensorflow
!pip install gdown
!pip install -q tf-models-official
#!pip install tensorflow-gpu  # not needed for TensorFlow >= 2.1: the tensorflow package already includes GPU support
!pip install transformers
!pip install plotly-express
!pip3 install ktrain
!pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1

In [None]:
#create requirements file
!pip3 freeze > requirementsModel.txt
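
As a rough sketch of how the pinned versions in such a file can be inspected, the snippet below parses `name==version` lines in the style written by `pip freeze`. The helper name `parse_requirements` and the sample lines (including the version numbers) are hypothetical and only serve as an illustration; entries that are not simple pins, such as direct `git+https` installs, are skipped here.

```python
def parse_requirements(lines):
    """Return {package: version} for 'name==version' lines (pip freeze style)."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and entries that are not simple pins
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Hypothetical file content, not the real requirementsModel.txt
sample = [
    "numpy==1.19.5",
    "# a comment",
    "",
    "eli5 @ git+https://github.com/amaiya/eli5@tfkeras_0_10_1",
    "pandas==1.2.4",
]
pins = parse_requirements(sample)
print(pins)
```

To use it on the real file, the lines could be read with `open("requirementsModel.txt").read().splitlines()`.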

## Imports
With the packages installed, I can import the libraries that are used throughout the notebook.

In [None]:
# Numpy and Pandas
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
from pylab import rcParams

# Seaborn
import seaborn as sns

# Matplotlib
import matplotlib.pyplot as plt

# Sklearn
from sklearn.model_selection import train_test_split

# Tensorflow
import tensorflow as tf

# KTrain
import ktrain
from ktrain import text

tf.get_logger().setLevel('ERROR')

%matplotlib inline
%config InlineBackend.figure_format='retina'

register_matplotlib_converters()
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

tf.__version__

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
print("You are using TensorFlow version", tf.__version__)
if len(tf.config.list_physical_devices('GPU')) > 0:
    print("You have a GPU enabled.")
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
else:
    print("Enable a GPU before running this notebook.")

## The Datasets
This section describes the dataset in more detail: what kind of data it contains and how it is structured.
I worked with several datasets so that I could later run several tests and see which dataset trains the model best.

In the end, for the German language I used:
- **"Filmstarts dataset"** available at https://zenodo.org/record/3693810/files/sentiment-data-reviews-and-neutral.zip?download=1

The Filmstarts dataset contains movie reviews in German.

## Filmstarts dataset

First I load the dataset I downloaded earlier with the pandas function `read_csv`.

In [None]:
# Load the data using pandas; malformed rows are skipped with a warning
# (on_bad_lines replaces error_bad_lines/warn_bad_lines in pandas >= 1.3)
film_de = pd.read_csv("filmstarts.tsv", sep='\t', encoding='utf8', on_bad_lines='warn', header=None)
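
To illustrate how malformed rows are handled while loading, here is a minimal sketch on a tiny in-memory TSV. It assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available; the three rows are made up and are not taken from the Filmstarts file.

```python
import io

import pandas as pd

# Hypothetical three-row TSV; the second row has one field too many
raw = "u1\t5\tToller Film\nu2\t0\tSchlecht\tzu viel\nu3\t3\tGanz okay\n"

# header=None gives integer column labels; the malformed row is skipped
toy = pd.read_csv(io.StringIO(raw), sep="\t", header=None, on_bad_lines="skip")
print(toy.shape)
print(toy[2].tolist())
```

With `on_bad_lines="warn"` the same rows are skipped, but pandas additionally reports each skipped line.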

### Attributes (Columns)

In [None]:
# Check the dataset
film_de

### Clean and resample the dataset:

In [None]:
columns_to_drop = [0]
film_de.drop(columns_to_drop, axis="columns",inplace=True)

In [None]:
film_de = film_de[[2,1]]

In [None]:
film_de = film_de.rename(columns={2: 'Review', 1: 'Score'})
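
The three cleaning steps above (drop, reorder, rename) can also be written as one chain. This is only a sketch on a made-up two-row frame with the same integer column labels that `header=None` produces:

```python
import pandas as pd

# Toy frame with integer column labels, as produced by header=None
toy = pd.DataFrame({0: ["id1", "id2"], 1: [5.0, 0.0], 2: ["Toll!", "Langweilig."]})

cleaned = (
    toy.drop(columns=[0])                       # drop the first column
       .loc[:, [2, 1]]                          # put the text before the score
       .rename(columns={2: "Review", 1: "Score"})
)
print(cleaned.columns.tolist())
```

Because nothing is modified in place, the original frame stays available for comparison.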

In [None]:
film_de.loc[film_de.Score <= 1]

In [None]:
film_de.at[11,"Review"]

## Dataframe inspection

Now that I have cleaned up the dataframe and kept what I need, I look at how it is composed:

In [None]:
film_de.Score.value_counts()

In [None]:
sns.countplot(
    x="Score",
    data=film_de,
    order=film_de.Score.value_counts().index
)

plt.xlabel("Review type")
plt.ylabel("Number of reviews")
plt.title("Review types displayed")

### Create an Input and Response Dataframe

- The Input dataframe contains the features that serve as input for the learning and decision making of the machine learning model.
- The Response (a.k.a. Target) dataframe contains the correct expected values (a.k.a. answers) that the system is supposed to learn.

As Input I take `"Review"`

As Target I take `"Score"`

Additionally, I create a new column `"Positive"` that labels each review by its score, using the following criteria:
*   **"0"** (negative) for a score of 0 or lower
*   **"1"** (positive) for a score of 5 or higher

Reviews with a score in between get no label and drop out when the reviews are resampled.

To achieve this, I write a function that derives the positive or negative polarity from the `"Score"` column.

In [None]:
# Get review type by aforementioned method
def get_review_type(review_score):
    if review_score <= 0:
        return 0
    elif review_score >= 5:
        return 1
    else:
        return None


film_de["Positive"] = film_de["Score"].apply(get_review_type)

# Combine only the useful columns
film_df_de = film_de[["Review", "Positive"]]
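
The same labeling rule can be expressed without a row-wise function. The following is a sketch of a vectorized alternative using `numpy.select`, not the method used in this notebook; the scores are made up.

```python
import numpy as np
import pandas as pd

scores = pd.Series([0.0, 2.5, 5.0, 4.5, 0.0])  # made-up scores
s = scores.to_numpy()

# 0 for the lowest score, 1 for the highest, NaN (no label) for everything else
labels = pd.Series(
    np.select([s <= 0, s >= 5], [0.0, 1.0], default=np.nan),
    index=scores.index,
)
print(labels.tolist())
```

Unlabeled rows show up as NaN, so `labels.isna().sum()` gives a quick count of how many reviews fall outside the two classes.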

#### This is the dataframe after the changes:

In [None]:
film_df_de

In [None]:
sns.countplot(
    x="Positive",
    data=film_df_de,
    order=film_df_de.Positive.value_counts().index
)

plt.xlabel("Review type")
plt.ylabel("Number of reviews")
plt.title("Review types displayed")

Now we have the two categories 1 ("good") and 0 ("bad").
As the chart shows, the "good" category has many more reviews than the "bad" category, so I limit the larger category to the size of the smaller one.
That way, both categories have an equal number of reviews.

## Resample reviews
To prepare the data for sentiment analysis, I resample it so that each review type has an equal number of reviews.


In [None]:
# Get same number of reviews for each type
bad_reviews = film_df_de[film_df_de.Positive == 0]
good_reviews = film_df_de[film_df_de.Positive == 1]


sample_len = len(bad_reviews)

bad_df = bad_reviews
good_df = good_reviews.sample(n=sample_len, random_state=RANDOM_SEED)


film_review_df = pd.concat([good_df, bad_df]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
film_review_df.shape
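
The undersampling step can be generalized to any number of classes. The helper below is a sketch under the assumption that pandas 1.1 or later is installed (for `groupby(...).sample`); the name `undersample` and the toy data are hypothetical.

```python
import pandas as pd


def undersample(df, label_col, seed=42):
    """Downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .sample(n=n_min, random_state=seed)
          .reset_index(drop=True)
    )


# Toy data: five positive and two negative reviews
toy = pd.DataFrame({"Review": list("abcdefg"), "Positive": [1, 1, 1, 1, 1, 0, 0]})
balanced = undersample(toy, "Positive")
print(balanced["Positive"].value_counts().to_dict())
```

Rows whose label is missing are left out by `groupby`, which matches the behavior of the filtering above, where only the labels 0 and 1 are kept.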

In [None]:
# Display number of each review type
sns.countplot(
  x='Positive',
  data=film_review_df,
  order=film_review_df.Positive.value_counts().index
)

plt.xlabel("Review type")
plt.title("All review types (resampled)");



------------------------------------------------------------------------------------------


# Preprocessing Filmstarts

Check the dataframe:

In [None]:
film_review_df

Shuffle the dataframe:

In [None]:
film_review_df = film_review_df.sample(frac=1).reset_index(drop=True)

film_review_df

-----------------------------
### Choosing Sequence Length

BERT works with fixed-length sequences. We'll use a simple strategy to choose the max length. Let's store the token length of each review:

In [None]:
# Import the model and tokenizer classes from transformers
from transformers import AutoModel, AutoTokenizer

tokenizer_hugg = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
#model_hugg = AutoModel.from_pretrained("dbmdz/bert-base-german-cased")

In [None]:
token_lens = []

In [None]:
textToCheck = film_review_df.Review[1]

In [None]:
textToCheck

In [None]:
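# Tokenize every review (not just the sample above) and record its token count
for txt in film_review_df.Review.dropna():
    tokens = tokenizer_hugg.encode(str(txt), max_length=512, truncation=True)
    token_lens.append(len(tokens))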

In [None]:
sns.histplot(token_lens, kde=True, stat="density")  # distplot is deprecated in recent seaborn versions
plt.xlim([0, 500]);
plt.ylim([0, 1.05])
plt.xlabel('Token count');
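
One simple way to turn the token counts into a sequence length is to pick a length that covers most reviews, capped at the 512-token limit of BERT. The sketch below assumes the 95th percentile as the coverage target, which is my own choice for illustration; the helper name `suggest_maxlen` and the token counts are hypothetical.

```python
import numpy as np


def suggest_maxlen(token_lens, percentile=95, cap=512):
    """Smallest length covering `percentile` percent of the texts, capped at the model limit."""
    covered = np.ceil(np.percentile(token_lens, percentile))
    return int(min(cap, covered))


toy_lens = [12, 40, 55, 80, 120, 300, 700]  # made-up token counts
print(suggest_maxlen(toy_lens))
print(suggest_maxlen(toy_lens, percentile=50))
```

A shorter length speeds up training and reduces memory use, at the cost of truncating the longest reviews.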

----------------------------------------------------------------------------


## Prepare everything needed for the ktrain functions
I need to define the 2 categories and train/validation set.

In [None]:
categories = ['Negative', 'Positive']  # list index matches the labels 0 and 1

In [None]:
train, test = train_test_split(film_review_df, test_size=0.2, random_state=RANDOM_SEED)

In [None]:
(x_train, y_train) = (train.Review, train.Positive)
(x_test, y_test) = (test.Review, test.Positive)
 

In [None]:
x_train

In [None]:
y_train

In [None]:
print('size of training set: %s' % (len(train['Review'])))
print('size of test set: %s' % (len(test['Review'])))

To use my sets in ktrain I have to transform them into lists first.

In [None]:
xtrain_list = x_train.values.tolist()

In [None]:
ytrain_list = y_train.values.tolist()

In [None]:
xtest_list = x_test.values.tolist()

In [None]:
ytest_list = y_test.values.tolist()

------------------------------------------
### List problem 

A couple of reviews were empty, which caused the following error during preprocessing:

`AttributeError: 'float' object has no attribute 'split'`

I solved this by removing all rows with a null review from the test set and rebuilding the lists.

In [None]:
test = test[test['Review'].notnull()]

In [None]:
test

In [None]:
(x_test, y_test) = (test.Review, test.Positive)

In [None]:
xtest_list = x_test.values.tolist()

In [None]:
ytest_list = y_test.values.tolist()
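
The error arises because pandas stores a missing text as NaN, which is a float and therefore has no `split` method. The sketch below shows this on a made-up frame and drops the affected rows in one step; doing this before the train/test split would keep both subsets clean, which is a suggestion and not what the cells above do.

```python
import pandas as pd

# Toy frame with one missing review (NaN), as it would appear after read_csv
toy = pd.DataFrame(
    {"Review": ["Gut", float("nan"), "Schlecht"], "Positive": [1, 1, 0]}
)

# The missing value is a float, so calling .split() on it fails
missing = toy.loc[1, "Review"]
print(type(missing).__name__, hasattr(missing, "split"))

# Drop rows without a review and renumber the index
clean = toy.dropna(subset=["Review"]).reset_index(drop=True)
print(len(clean))
```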

--------------------------------------------------------------------
# Build a Model and Wrap in Learner
Now I put all the pieces together for ktrain. First I define the pre-trained model and the transformer preprocessor.
Then I preprocess the data.

In [None]:
MODEL_NAME = 'dbmdz/bert-base-german-cased'


In [None]:
t = text.Transformer(MODEL_NAME, maxlen=400, class_names=['0','1'])


In [None]:
trn = t.preprocess_train(xtrain_list,ytrain_list)

In [None]:
tst = t.preprocess_test(xtest_list, ytest_list)

In [None]:
model = t.get_classifier()


Now that I have a model, I wrap it in a ktrain learner. I use this learner to find a good learning rate.

In [None]:
learner = ktrain.get_learner(model, train_data=trn, val_data=tst, batch_size=12)

In [None]:
learner.lr_find(show_plot=True, max_epochs=2)

From the plot I chose a learning rate on the order of 1e-5 (here 3e-5). Now I can train the model.

In [None]:
learner.autofit(3e-5, reduce_on_plateau=3, checkpoint_folder='./checkpointNewModel26.05/')

In [None]:
learner.validate()

The trained model has an accuracy of 93%.
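
Accuracy is the share of correctly classified reviews. As a sketch of how it relates to a confusion matrix, the snippet below computes both with NumPy; the helper name `accuracy_and_confusion` is hypothetical and the labels are made up, so the numbers have nothing to do with the result reported above.

```python
import numpy as np


def accuracy_and_confusion(y_true, y_pred, n_classes=2):
    """Return (accuracy, confusion matrix) with rows = true class, columns = predicted class."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return np.trace(cm) / cm.sum(), cm


# Made-up labels and predictions for five reviews
acc, cm = accuracy_and_confusion([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(acc)
print(cm)
```

The diagonal of the matrix holds the correct predictions, so the off-diagonal cells show which class is confused with which.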

In [None]:
learner.model.summary()

In [None]:
tf.keras.utils.plot_model(
    model, to_file='model.png', show_shapes=True, show_dtype=False,
    show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96
)

# Save the model

In [None]:
# save model and Preprocessor instance after partially training
#ktrain.get_predictor(learner.model, preproc=t).save('./model_save/predictor_22.04')

In [None]:
# save model using the Keras API after partially training
#learner.model.save('./model_save/my_model_de_22.04')

In [None]:
# save model using transformers API after partially training
#learner.model.save_pretrained('./model_save/my_model_smallde_22.04')

In [None]:
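# Wrap the trained model and its preprocessor in a ktrain predictor
predictor = ktrain.get_predictor(learner.model, preproc=t)
print(predictor.model)
print(predictor.preproc)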

Save the model using the ktrain predictor after training:

In [None]:
predictor.save('./modelsave/bertDe_predictor_93')

--------------------------------------------------------------------
# Reload Model  

In [None]:
# reload predictor
predictor = ktrain.load_predictor('./modelsave/bertDe_predictor_93')
predictor.predict('Heute ist ein schöner Tag.')

------------------------------------------------------------------------------------------

# Make prediction

In [None]:
predictor.predict('Heute ist ein schlechter Tag.')

In [None]:
predictor.predict('Heute ist ein schöner Tag.')

In [None]:
learner.view_top_losses(n=1, preproc=t)

In [None]:
print(xtest_list[553])

In [None]:
predictor.predict_proba(xtest_list[553])

In [None]:
predictor.get_classes()
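
The class list and the probability vector belong together: the position of the largest probability is the index into the class list. The sketch below maps such a vector to a readable label; the helper `to_label`, the readable names, and the probabilities are hypothetical and not real model output.

```python
import numpy as np


def to_label(probs, class_names, names=("negative", "positive")):
    """Map a probability vector to a readable label and its probability."""
    best = int(np.argmax(probs))
    # class_names holds the raw labels ('0', '1'); names holds readable equivalents
    return names[int(class_names[best])], float(probs[best])


# Hypothetical probabilities in the order of the class list ['0', '1']
label, confidence = to_label(np.array([0.08, 0.92]), ["0", "1"])
print(label, round(confidence, 2))
```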

In [None]:
txTest = "Philip liebte den Pferdesport – genau wie seine Enkelin Louise. Die Tochter von Prinz Edward soll nun Philips Kutsche und die zwei Lieblingsponys erben."

In [None]:
txTest2 = "Ein Brand in der südafrikanischen Metropole hat auch Flächen des berühmten Tafelbergs in Mitleidenschaft gezogen."

In [None]:
txTest3 = "Prinz Philip: Enkelin Louise bekommt seine geliebten Ponys - 20 Minuten"

In [None]:
txTest4 = "Prinz Philip: Enkelin Louise bekommt seine geliebten Ponys."

In [None]:
predictor.predict(txTest)

In [None]:
predictor.explain(txTest)

In [None]:
predictor.explain(txTest2)

In [None]:
predictor.explain(txTest3)

In [None]:
predictor.explain(txTest4)

In [None]:
predictor.explain("Eskalationsrisiko: Russland stationiert \"mehr als 150’000 Soldaten\" an der Grenze zur Ukraine. Der EU-Aussenbeauftragte Josep Borrell zeigt sich besorgt über die Lage an der ukrainischen Grenze. Russland sei mit mindestens 150’000 Soldaten aufmarschiert, es sei der grösste Aufmarsch, den es je in Russland gegeben habe.")

In [None]:
predictor.explain("Der EU-Aussenbeauftragte Josep Borrell zeigt sich besorgt über die Lage an der ukrainischen Grenze. Russland sei mit mindestens 150’000 Soldaten aufmarschiert, es sei der grösste Aufmarsch, den es je in Russland gegeben habe.")

# Save and Export Model to TensorFlow Lite

In [None]:
# export TensorFlow Lite model
tflite_model_path = './tensorFlowLite/model.tflite'
tflite_model_path = predictor.export_model_to_tflite(tflite_model_path)

# load interpreter
interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# set maxlen, class_names, and tokenizer (use the settings employed when training the model, see above)
maxlen = 400  # from above
class_names = ['0', '1']  # from above
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')

# preprocess and predict outside of ktrain
doc = 'Heute ist ein schöner Tag.'
inputs = tokenizer(doc, max_length=maxlen, padding='max_length', truncation=True, return_tensors="tf")
interpreter.set_tensor(input_details[0]['index'], inputs['attention_mask'])
interpreter.set_tensor(input_details[1]['index'], inputs['input_ids'])
interpreter.invoke()
output_tflite = interpreter.get_tensor(output_details[0]['index'])
print()
print('text input: %s' % (doc))
print()
print('predicted logits: %s' % (output_tflite))
print()
print("predicted class: %s" % ( class_names[np.argmax(output_tflite[0])]) )
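
The exported model returns raw logits rather than probabilities. If probabilities are needed, a softmax can be applied afterwards. The following is a sketch with hypothetical logits, not output of the real model:

```python
import numpy as np


def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


toy_logits = np.array([[-1.2, 2.3]])  # hypothetical logits for one text
probs = softmax(toy_logits)
print(probs)
print(int(np.argmax(probs[0])))
```

The arg-max of the probabilities is the same as the arg-max of the logits, so the predicted class does not change; only the scale of the values does.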