# Colab Notebook for Teesside University Student Future Workshop #

This notebook is designed for the Teesside University Student Future 2020 workshop.

It covers topics that AI / ML developers are likely to encounter in industry.

The dataset and model below focus on a Natural Language Processing (NLP) application.

Authors: Teng Fu, Jing Tang, Annalisa Occhipinti

Emails: teng.fu@teleware.com, J.Tang@tees.ac.uk, A.Occhipinti@tees.ac.uk




## Tasks ##

The basic tasks are:

- Prepare the development environment.

- Pre-process the data.

- Use a pre-trained model.

- Train your customised text classification model.

- Deploy the model as a REST API.

## Prepare development environment ##

The environment consists of two layers:

- OS level.

- Python level.

For OS-level tools, use ```apt-get install packageX``` to install a package.

For Python-level tools, use ```pip install packageX``` to install a package.
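
Before moving on, it can help to confirm that the Python-level packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is an assumption made for illustration, and the package list mirrors the `pip install` cell in this notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found on the Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages installed with pip in this notebook.
required = ["pdpipe", "transformers", "hug"]
print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs a `pip install`.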

In [0]:
!pip install pdpipe
!pip install transformers
!pip install hug

In [0]:
# import Python packages
import numpy as np
import pandas as pd
import pdpipe as pdp

# from sklearn.model_selection import train_test_split
# from sklearn import svm
# from sklearn.externals import joblib

import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adam
from keras.models import load_model

import matplotlib.pyplot as plt

In [0]:
# load jupyter notebook extension
%load_ext google.colab.data_table
%matplotlib inline

## Data Pre-process ##

Data pre-processing is a necessary step before model training. It broadly follows the __ETL (Extract, Transform, Load)__ procedure and aims to supply high-quality, clean data, which in turn supports better model performance.

In this workshop, the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) dataset is used for model __training__.

We will manually create our __test__ dataset through the __ETL__ process.
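
The three ETL stages can be sketched in plain Python before the real data is introduced. This is only an illustration under stated assumptions: the `RAW_CSV` string and the `extract`, `transform` and `load` helper names are hypothetical, and the workshop itself uses pandas and pdpipe rather than the `csv` module.

```python
import csv
import io

# Hypothetical raw data; the real workshop data is fetched from GitHub.
RAW_CSV = "commenter_id,sentence\n1,  Great film!  \n2,Too long and dull.\n"

def extract(raw_text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: keep only the sentence column and trim surrounding whitespace."""
    return [{"sentence": row["sentence"].strip()} for row in rows]

def load(rows):
    """Load: write the cleaned rows back out as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sentence"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

cleaned_csv = load(transform(extract(RAW_CSV)))
print(cleaned_csv)
```

Keeping each stage in its own function makes it easy to swap the source, the cleaning rules or the destination independently.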



### Extract ###

Fetch the data from its sources on GitHub.



#### Fetch training data ####

In [0]:
sst2_training_data_tsv_file_address = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'

train_column_names = ["sentence", "label"]
df_train = pd.read_csv(sst2_training_data_tsv_file_address, delimiter='\t', names=train_column_names)

#### Fetch raw test data ####

In [0]:
raw_test_csv_file_address = "https://raw.githubusercontent.com/gitathrun/Teesside_University_Student_Future_2020_Workshop/master/raw_test_dataset.csv"

df_test_raw = pd.read_csv(raw_test_csv_file_address)

### Observation ### 

- Know your problem.

- Observe raw data.

- How to improve its quality?

#### Observe Training Data ####

In [0]:
df_train.head(10)

#### Observe Raw Test Data ####

In [0]:
df_test_raw

### Transform ###
This stage covers anonymization, cleaning, regularization, and labeling.

In [0]:
# Anonymization
pipeline_drop_commenter_id = pdp.ColDrop("commenter_id")
anonym_df = pipeline_drop_commenter_id(df_test_raw)
anonym_df

In [0]:
# Cleaning and regularization
# Use a Python regular expression to strip unwanted characters
import re

def remove_characters(text):
    """Remove the characters <, >, & and * from the text."""
    regex = r"(<|>|&|\*)"
    subst = ""
    # count=0 replaces every match
    clean_text = re.sub(regex, subst, text, count=0)
    return clean_text

pipeline_clean_text = pdp.ApplyByCols('sentence', remove_characters, result_columns='sentence')
clean_text_df = pipeline_clean_text(anonym_df)
clean_text_df
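
To see what this cleaning rule does to a single string, the sketch below applies the same set of characters, written as a character class. The sample sentence and the name `strip_markup_characters` are assumptions made for illustration.

```python
import re

# Same characters as the cleaning cell above: <, >, & and *.
UNWANTED = re.compile(r"[<>&*]")

def strip_markup_characters(text):
    """Delete every '<', '>', '&' and '*' from the text."""
    return UNWANTED.sub("", text)

sample = "<b>Loved it</b> & would *definitely* watch again"
print(strip_markup_characters(sample))
```

Note that only the listed characters are deleted: the tag names (`b`, `/b`) and the extra space left behind by `&` remain, so stripping whole HTML tags would need a different rule.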

### ETL pipeline ###

A pipeline is a list of functions and operations organised in order.
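
The idea can be sketched without any library: each step receives the output of the previous one. The helper `make_pipeline` below is hypothetical and is not how pdpipe is implemented; it only illustrates the ordering.

```python
from functools import reduce

def make_pipeline(steps):
    """Return a function that applies each step, in order, to its input."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run

# Two hypothetical text steps, applied in the order listed.
pipeline = make_pipeline([str.strip, str.lower])
print(pipeline("  A GREAT Movie  "))
```

Because the steps are ordinary callables, reordering or replacing one of them does not affect the others.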

In [0]:
# assemble the data preprocess pipeline
pipeline_preprocess = pdp.PdPipeline([
    pipeline_drop_commenter_id,
    pipeline_clean_text
])

pipeline_preprocess

In [0]:
df_pipe_test = pipeline_preprocess.apply(df_test_raw)
df_pipe_test

In [0]:
# labeling
# manual labeling
# or auto labeling against certain rules
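
As a rough illustration of automatic labeling, the sketch below assigns a label from keyword matches. The keyword lists and the function name `rule_based_label` are assumptions, not part of the workshop data; a real rule set would need careful curation and manual review.

```python
# Hypothetical keyword lists, chosen only for this example.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "hate"}

def rule_based_label(sentence):
    """Return 1 (positive), 0 (negative), or None when the rules cannot decide."""
    words = {word.strip(".,!?").lower() for word in sentence.split()}
    positive_hits = len(words & POSITIVE_WORDS)
    negative_hits = len(words & NEGATIVE_WORDS)
    if positive_hits == negative_hits:
        return None  # leave ambiguous sentences for manual labeling
    return 1 if positive_hits > negative_hits else 0

print(rule_based_label("That is a bad movie!"))
```

Returning `None` for undecided sentences keeps the rules from silently guessing, so those rows can still be labeled by hand.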

In [0]:
#@title Manually label the sentences
#@markdown Select __0 (negative)__ or __1 (positive)__ in the Form as the sentiment label.

#@markdown label_X means label for sentence at index X

#@markdown ---

label_0 = "1"  #@param [0, 1]
label_1 = "1"  #@param [0, 1]
label_2 = "0"  #@param [0, 1]
label_3 = "1"  #@param [0, 1]
label_4 = "0"  #@param [0, 1]
label_5 = "1"  #@param [0, 1]
label_6 = "0"  #@param [0, 1]
label_7 = "0"  #@param [0, 1]
label_8 = "1"  #@param [0, 1]
label_9 = "1"  #@param [0, 1]
label_10 = "1"  #@param [0, 1]
label_11 = "1"  #@param [0, 1]
label_12 = "0"  #@param [0, 1]
label_13 = "0"  #@param [0, 1]
label_14 = "0"  #@param [0, 1]

#@markdown ---


In [0]:
# collect labels and concat with processed dataframe
str_label_list = [  label_0,
                    label_1,
                    label_2,
                    label_3,
                    label_4,
                    label_5,
                    label_6,
                    label_7,
                    label_8,
                    label_9,
                    label_10,
                    label_11,
                    label_12,
                    label_13,
                    label_14]

# need to convert string type label into integer type
int_label_list = [int(label) for label in str_label_list]

label_dict = {"label": int_label_list}

df_test_label = pd.DataFrame(label_dict)
df_test_label

#### Merge labels with cleaned data ####

In [0]:
df_test_processed = pd.concat([df_pipe_test, df_test_label], axis=1)
df_test_processed

### Load ###

Save the transformed data to certain places: cloud storage, database, etc.

In [0]:
df_test_processed.to_csv("processed_test_dataset.csv", index=False)

## Transfer Learning with pre-trained model ##

We are going to use two machine learning frameworks:
- PyTorch: we will use __BERT__ as the text encoder model based on PyTorch.

- Keras: we will build a __Multi-Layer Perceptron__ as the downstream classifier.

### Loading the Pre-trained BERT model ###

__B__idirectional __E__ncoder __R__epresentations from __T__ransformers is a technique for NLP pre-training developed by Google. [__BERT__](https://github.com/google-research/bert) was created and published in 2018 by Jacob Devlin and his colleagues from Google. 

Here we use the PyTorch version of BERT developed by [huggingface](https://github.com/huggingface/transformers). Its parameters are identical to those of Google's TensorFlow version, but it is built on the PyTorch framework.

In [0]:
# Use PyTorch Hub

# fetch tokenizer
# Download vocabulary from S3 and cache.
loaded_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')   

# fetch model
loaded_model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased') 

In [0]:
# text --> tokens
def text_tokenization(text, tokenizer_model=loaded_tokenizer):
    tokenized_ids = tokenizer_model.encode(text, add_special_tokens=True)
    return tokenized_ids
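
To make the effect of `add_special_tokens=True` concrete, here is a toy sketch. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 100 for `[UNK]`) match the `bert-base-uncased` vocabulary, but `TOY_VOCAB`, its ids and the whitespace splitting are assumptions for illustration; the real tokenizer uses WordPiece with a vocabulary of roughly 30,000 entries.

```python
# Special-token ids in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100

# Hypothetical toy vocabulary with made-up ids.
TOY_VOCAB = {"a": 1001, "bad": 1002, "movie": 1003}

def toy_encode(text, add_special_tokens=True):
    """Map lower-cased, whitespace-separated words to ids, optionally wrapped in [CLS] ... [SEP]."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    return [CLS_ID] + ids + [SEP_ID] if add_special_tokens else ids

print(toy_encode("A bad movie"))
```

The `[CLS]` id always lands at position 0, which is why the encoding step reads the vector at that position.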

In [0]:
# tokens --> encode vector
def tokens_encoding(tokenized_ids, bert_model=loaded_model):
    input_ids = torch.tensor([tokenized_ids])
    with torch.no_grad():
        # keep only the last-hidden-state vector at position 0, the [CLS] token
        encoding = bert_model(input_ids)[0][:,0,:].numpy()
    encoding = encoding.reshape([768])
    return encoding

### BERT Encoding Pipeline ###

In [0]:
pipeline_tokenization = pdp.ApplyByCols('sentence', text_tokenization, result_columns='tokenized_ids', drop=False)
pipeline_encoding = pdp.ApplyByCols('tokenized_ids', tokens_encoding, result_columns='bert_encoding', drop=False)

# assemble the bert encoding pipeline
pipeline_bert_encoding = pdp.PdPipeline([
    pipeline_tokenization,
    pipeline_encoding
])

pipeline_bert_encoding

In [0]:
# BERT encoding training data
df_train_batch = df_train[:1000]
%time df_training_bert_encoding = pipeline_bert_encoding(df_train_batch)
# df_training_bert_encoding.head()

In [0]:
df_train_batch.label.values.shape

In [0]:
# BERT encoding test data
%time df_test_bert_encoding = pipeline_bert_encoding(df_test_processed)

#### Save the vectors in numpy format ####

Do not save the NumPy arrays in .csv format: each vector would be written as a string, and long arrays may be abbreviated with `...`.

Use NumPy's __.npy__ format instead. It is a standard binary format that keeps the array's shape and dtype, and the arrays it restores are accepted by most deep learning frameworks.

In [0]:
# Make sure the values in dataframe 
# can be properly converted into correct shape numpy array
def reshape_dataframe_array(values_array):
    values_list = values_array.tolist()
    np_array = np.array(values_list)
    return np_array

In [0]:
# save training dataset batch
np_training_bert_encoding_batch = reshape_dataframe_array(df_training_bert_encoding.bert_encoding.values)
np.save("np_training_bert_encoding_batch.npy", np_training_bert_encoding_batch)

In [0]:
# save test dataset
np_test_bert_encoding = reshape_dataframe_array(df_test_bert_encoding.bert_encoding.values)
# print the shape explicitly; a bare expression in the middle of a cell shows nothing
print(np_test_bert_encoding.shape)
np.save("np_test_bert_encoding.npy", np_test_bert_encoding)

In [0]:
# function for text encoding
def text_to_vector(text, tokenizer_model=loaded_tokenizer, bert_model=loaded_model):
    input_ids = torch.tensor([tokenizer_model.encode(text, add_special_tokens=True)])
    with torch.no_grad():
        last_hidden_states = bert_model(input_ids)[0]
    text_encoding = last_hidden_states.numpy()[0,0,:]
    return text_encoding

## Train your customised text classification model ##

Train a simple MLP classifier with two hidden layers using Keras.

#### Construct the MLP Classifier and its Hyper-parameters ####

In [0]:
# construct neural network in Sequential manner
clf = Sequential()
clf.add(Dense(256, input_dim=768))
clf.add(Activation('relu'))
clf.add(Dropout(0.3))
clf.add(Dense(64, activation='relu'))
clf.add(Dropout(0.3))
clf.add(Dense(2, activation='softmax'))

adam = Adam()

clf.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])
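
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the three Dense layers defined above (768 → 256 → 64 → 2); the Activation and Dropout layers add no parameters. The helper name `dense_parameters` is hypothetical.

```python
def dense_parameters(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units

# (inputs, units) for the three Dense layers.
layer_shapes = [(768, 256), (256, 64), (64, 2)]
per_layer = [dense_parameters(i, u) for i, u in layer_shapes]

print(per_layer)       # [196864, 16448, 130]
print(sum(per_layer))  # 213442
```

The total should agree with the figure reported by `clf.summary()`.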

In [0]:
# train the classifier
x_train = np_training_bert_encoding_batch
y_train = df_train_batch.label.values.reshape((-1, 1))  # -1 infers the number of rows
one_hot_labels_train = keras.utils.to_categorical(y_train, num_classes=2)

train_history = clf.fit(x_train,
                        one_hot_labels_train,
                        validation_split=0.10,
                        epochs=50,
                        batch_size=32)
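
The one-hot conversion applied to the labels can be illustrated in plain Python. This is a sketch of the idea only, not the Keras implementation; the function name `one_hot` is an assumption.

```python
def one_hot(labels, num_classes=2):
    """Turn integer class labels into rows with a single 1.0 at the label's index."""
    return [[1.0 if index == label else 0.0 for index in range(num_classes)]
            for label in labels]

print(one_hot([0, 1, 1]))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Each row lines up with the two softmax outputs of the classifier: column 0 for negative, column 1 for positive.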

In [0]:
keras.utils.plot_model(clf, to_file="clf.png")

In [0]:
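# Plot training & validation accuracy values
# The history key is 'acc' in older Keras releases and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in train_history.history else 'accuracy'
plt.plot(train_history.history[acc_key])
plt.plot(train_history.history['val_' + acc_key])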
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Dev'], loc='upper left')
plt.show()

In [0]:
# Plot training & validation loss values
plt.plot(train_history.history['loss'])
plt.plot(train_history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Dev'], loc='upper left')
plt.show()

Is the trained classifier good?

In [0]:
x_test = np_test_bert_encoding
y_test = df_test_bert_encoding.label.values.reshape((-1, 1))  # -1 infers the number of rows
one_hot_labels_test = keras.utils.to_categorical(y_test, num_classes=2)

score = clf.evaluate(x_test, one_hot_labels_test)

In [0]:
# check the evaluation result
print("Loss evaluation on Test data: {}".format(score[0]))
print("Accuracy evaluation on Test data: {}".format(score[1]))

### Check Prediction and Evaluate trained model ###

In [0]:
# check the prediction
y_pred = clf.predict(x_test)
print(y_pred)

### Interpret the model with a confusion matrix and other metrics ###

In [0]:
# confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

# convert it with argmax
y_pred_1d = np.argmax(y_pred, axis=1)
y_pred_1d

# make sure the array shape is 1D
y_true = y_test.reshape((-1))

# get confusion matrix
cm = confusion_matrix(y_true, y_pred_1d)

In [0]:
# Plot non-normalized confusion matrix
# confusion_matrix orders classes by label value: 0 (negative), then 1 (positive)
target_names = ['negative', 'positive']
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=target_names)
disp.plot()

In [0]:
# classification detail
print(classification_report(y_true, y_pred_1d, target_names=target_names))
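
To show where these numbers come from, the sketch below computes the confusion-matrix counts, precision and recall by hand, treating label 1 as the positive class. The demo labels are hypothetical and are not results from the workshop model; the helper names are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for illustration only.
y_true_demo = [1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

print(confusion_counts(y_true_demo, y_pred_demo))
print(precision_recall(y_true_demo, y_pred_demo))
```

In scikit-learn's layout, rows are true labels and columns are predicted labels, so for labels 0 and 1 the matrix reads `[[tn, fp], [fn, tp]]`.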

### Save the trained Keras classifier model ###

In [0]:
# save trained model
clf.save("trained_sentiment_classifier.h5")

In [0]:
# load from local storage
# loaded_clf = load_model("/content/trained_sentiment_classifier.h5")

## Deploy the model as a REST API ##

Now that the encoder and the downstream classifier are ready, a developer can deploy them behind a REST API for users to access.

The [hug](https://www.hug.rest/) package is used to deploy the REST API from within the Jupyter Notebook environment.



### Setup API Server and run ###

In [0]:
%%writefile sentiment_api.py
import hug
import torch
from keras.models import load_model

# initialisation
# fetch tokenizer
# Download vocabulary from S3 and cache.
loaded_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')   

# fetch bert model
loaded_bert_model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased') 

# fetch keras model
loaded_keras_model = load_model("trained_sentiment_classifier.h5")

# text --> tokens
def text_tokenization(text, tokenizer_model=loaded_tokenizer):
    tokenized_ids = tokenizer_model.encode(text, add_special_tokens=True)
    return tokenized_ids

# tokens --> encode vector
def tokens_encoding(tokenized_ids, bert_model=loaded_bert_model):
    input_ids = torch.tensor([tokenized_ids])
    with torch.no_grad():
        # keep only the last-hidden-state vector at position 0, the [CLS] token
        encoding = bert_model(input_ids)[0][:,0,:].numpy()
    encoding = encoding.reshape((1, 768))
    return encoding

def predict_sentiment(encoding, keras_model=loaded_keras_model):
    sentiment_prob = keras_model.predict(encoding)
    return sentiment_prob

@hug.post("/sentiment")
def sentiment(text):
    token_ids = text_tokenization(text)
    encoding = tokens_encoding(token_ids)
    sentiment_prob = predict_sentiment(encoding)
    # convert the NumPy array to nested lists so that it serialises cleanly to JSON
    return {'sentiment': sentiment_prob.tolist()}

In [0]:
# run API server but detach it from terminal
!nohup hug -f sentiment_api.py &

In [0]:
# check nohup log
!cat nohup.out

In [0]:
# check hug is running
!ps

In [0]:
# stop the hug server; replace 1467 with the PID reported by `ps` above
!kill 1467

### Make a client request ###

In [0]:
import requests

headers = {"Content-Type": "application/json"}

# prepare the request body
# the text can be changed in the form
input_text = 'That is a bad movie!'  #@param {type: "string"}
data = {"text": input_text}

# POST request to server
json_response = requests.post('http://localhost:8000/sentiment', json=data, headers=headers)

# print the response status and the received JSON body
print(json_response)
print(json_response.text)
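
The raw response is a pair of probabilities, so a small helper can turn it into a readable label. This is a sketch under assumptions: it presumes the body has the shape `{"sentiment": [[p_negative, p_positive]]}`, following the 0 = negative, 1 = positive convention used when labeling; `interpret_response` and `example_body` are hypothetical names.

```python
import json

LABELS = ["negative", "positive"]  # index 0 = negative, index 1 = positive

def interpret_response(response_text):
    """Turn the API's JSON body into a (label, probability) pair."""
    probabilities = json.loads(response_text)["sentiment"][0]
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[best_index], probabilities[best_index]

# Hypothetical response body, not actual output from the server.
example_body = '{"sentiment": [[0.91, 0.09]]}'
print(interpret_response(example_body))  # ('negative', 0.91)
```

With a live server, the same function could be called on `json_response.text`.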
