<a href="https://colab.research.google.com/github/birlikov/ru_sentiment_clf/blob/master/Fine_tuning_Multilingual_Universal_Sentence_Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Intro

In this tutorial we will fine-tune the [Multilingual Universal Sentence Encoder](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html) on a Russian Twitter sentiment dataset, which I found [here](http://study.mokoron.com).

After that we will save the model and later deploy it as a web application using Flask. See the project in [this GitHub repo](https://github.com/birlikov/ru_sentiment_clf).

This tutorial can also be viewed in [this Colab notebook](https://colab.research.google.com/drive/1VAxpnNyP32kpwLzGB9MtCci3PX3Uzv3g?usp=sharing).

If you run it on Colab, make sure to mount your Google Drive and switch the runtime to GPU for faster training.
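
A minimal sketch of that setup step, assuming the notebook may also be run outside Colab. The helper name `running_in_colab` is made up for this example; `drive.mount` and `tf.config.list_physical_devices` are the usual Colab and TensorFlow calls.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available on Colab)
        return True
    except ImportError:
        return False


if running_in_colab():
    from google.colab import drive
    import tensorflow as tf

    # Make the files under "My Drive" visible at /content/drive.
    drive.mount('/content/drive')
    # An empty list means that no GPU runtime is attached.
    print('GPUs:', tf.config.list_physical_devices('GPU'))
else:
    print('Not on Colab: skipping the Drive mount.')
```

If the printed GPU list is empty on Colab, the runtime type can be changed in the notebook settings before training.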

### Data

First, we download the two CSV files, `positive.csv` and `negative.csv`, from [this website](http://study.mokoron.com) and put them in a `data` folder inside our project folder.

In [None]:
import os
my_project_folder_path = '/content/drive/My Drive/Colab Notebooks/pet_project_russian_sentiment_analysis'
os.chdir(my_project_folder_path)
os.listdir()

['data']

In [None]:
os.listdir('data/')

['positive.csv', 'negative.csv']

In [None]:
import pandas as pd
positive = pd.read_csv('data/positive.csv', sep=';', header=None)
negative = pd.read_csv('data/negative.csv', sep=';', header=None)
positive.shape, negative.shape

((114911, 12), (111923, 12))

There are 114,911 positive and 111,923 negative tweets. Note that the labeling was done automatically, as described on the website above: basically, emoticons were used to assign the labels. This will be reflected in our final model, as we will see later.

In [None]:
positive.head(2)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 408906692374446080 | 1386325927 | pleease_shut_up | @first_timee хоть я и школота, но поверь, у на... | 1 | 0 | 0 | 0 | 7569 | 62 | 61 | 0 |
| 1 | 408906692693221377 | 1386325927 | alinakirpicheva | Да, все-таки он немного похож на него. Но мой ... | 1 | 0 | 0 | 0 | 11825 | 59 | 31 | 2 |


In [None]:
negative.head(2)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 408906762813579264 | 1386325944 | dugarchikbellko | на работе был полный пиддес :\| и так каждое за... | -1 | 0 | 0 | 0 | 8064 | 111 | 94 | 2 |
| 1 | 408906818262687744 | 1386325957 | nugemycejela | Коллеги сидят рубятся в Urban terror, а я из-з... | -1 | 0 | 0 | 0 | 26 | 42 | 39 | 0 |


We can see that column 3 contains the actual texts, so we will use only that column. Although column 4 holds sentiment labels (1 for positive and -1 for negative), we will construct our own target labels, since the two dataframes are separate. The other columns (publication time, author name, etc.) are not needed.

We could also clean the text, for example by removing the @nicknames, but let's leave it as it is for this tutorial.
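
For readers who do want that cleaning step, here is a minimal sketch using only the standard library. The function name `remove_mentions` and the regular expression are assumptions of this example, not part of the original notebook, and the pattern would also mangle e-mail addresses.

```python
import re

# A Twitter handle: "@" followed by letters, digits or underscores.
MENTION_RE = re.compile(r'@\w+')


def remove_mentions(text: str) -> str:
    """Drop @nicknames and collapse the whitespace left behind."""
    without_mentions = MENTION_RE.sub('', text)
    return ' '.join(without_mentions.split())


print(remove_mentions('@first_timee хоть я и школота'))
```

Applied to the dataframes, this could look like `positive[3].map(remove_mentions)`; the rest of the tutorial keeps the raw text.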

In [None]:
import numpy as np
all_sentences = np.concatenate([positive.loc[:,3].values, negative.loc[:,3].values])
all_labels = np.array([1]*len(positive) + [0]*len(negative))

Now let's split them into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(all_sentences, all_labels, test_size=0.1, random_state=42, shuffle = True, stratify = all_labels)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((204150,), (22684,), (204150,), (22684,))
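
The shapes above can be checked by hand. The sketch below assumes that the test part is rounded up, which reproduces the sizes printed by the notebook. It also computes the share of the larger class, which is the accuracy one would get by always predicting "positive".

```python
import math

n_positive, n_negative = 114_911, 111_923
n_total = n_positive + n_negative        # all tweets in both files

# Assumption: the 10% test part is rounded up, the rest goes to training.
n_test = math.ceil(0.1 * n_total)
n_train = n_total - n_test

# Accuracy of a trivial classifier that always answers "positive".
majority_share = n_positive / n_total

print(n_train, n_test, round(majority_share, 3))  # 204150 22684 0.507
```

Since the classes are almost balanced, accuracy is a reasonable metric here, and anything well above 51% means the model has learned something.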

### Model

We will use the [Multilingual Universal Sentence Encoder](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html) as our pre-trained model and add a classification layer on top: a single dense unit with a sigmoid activation, whose output ranges from 0 to 1, where 1 means positive sentiment and 0 means negative sentiment.

First, let's install and import necessary libraries.

In [None]:
!pip install tensorflow_text



In [None]:
import tensorflow as tf
import tensorflow_hub as hub
# Imported for its side effect: it registers the SentencePiece ops that the hub module needs.
from tensorflow_text import SentencepieceTokenizer

from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input, Dense

Now let's build the model.

First, we define an input layer. Notice that we do not have to tokenize the input ourselves: tokenization is already implemented inside the Universal Sentence Encoder. We just pass variable-length texts as input. Pretty cool, huh?

In [19]:
input = Input(shape = [], dtype=tf.string, name = 'input_text')

Now we define the next layer as a Keras layer that loads the Multilingual Universal Sentence Encoder from [TensorFlow Hub](https://tfhub.dev).

There are two versions of the model: the `basic` one has 68,927,232 parameters and uses a CNN architecture, while the `large` one has 85,213,184 parameters and uses a Transformer architecture. In this tutorial we will use the `basic` one.

Notice that the `trainable` flag is set to `True`, so that all model parameters are fine-tuned.

In [20]:
# basic model: CNN, 68,927,232 params
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'

# large model: Transformer, 85,213,184 params
# module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3'

# define the Keras layer
use_layer = hub.KerasLayer(module_url, input_shape=[], dtype=tf.string, trainable=True, name='use_layer')

# pass the input through the encoder to get features (a 512-dimensional vector)
use_features = use_layer(input)

Next, we add the output layer for binary classification.

In [21]:
output = Dense(1, name = 'output_layer', activation='sigmoid')(use_features)

Finally, build the model and see the summary.

In [22]:
model = Model([input],output, name="sentiment_clf_model")

model.summary()

Model: "sentiment_clf_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_text (InputLayer)      [(None,)]                 0         
_________________________________________________________________
use_layer (KerasLayer)       (None, 512)               68927232  
_________________________________________________________________
output_layer (Dense)         (None, 1)                 513       
Total params: 68,927,745
Trainable params: 68,927,745
Non-trainable params: 0
_________________________________________________________________
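
The parameter counts in this summary are easy to verify by hand. The variable names below are made up for this sketch, and the numbers are copied from the summary.

```python
encoder_params = 68_927_232   # use_layer, as reported in the summary
embedding_dim = 512           # size of the sentence vector
units = 1                     # single sigmoid output

# A dense layer has one weight per input feature plus one bias, per unit.
head_params = embedding_dim * units + units
total_params = encoder_params + head_params

print(head_params, total_params)  # 513 68927745
```

With `trainable=False` on the hub layer, only the 513 parameters of the output layer would be trained, which is the feature-extraction alternative to full fine-tuning.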


Compile the model: define the loss, metrics and optimizer.

In [23]:
model.compile(loss="binary_crossentropy", metrics=['acc'], optimizer=Adam(1e-4))
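
To get a feel for the loss, here is a small pure-Python sketch of binary cross-entropy for a single example. The function name and the clipping constant `eps` are assumptions of this sketch. The probabilities are illustrative, not outputs of the model.

```python
import math


def binary_crossentropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one example with true label y_true and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# The loss grows quickly as the prediction moves away from the true label.
for y_true, p in [(1, 0.95), (1, 0.5), (1, 0.05)]:
    print(y_true, p, round(binary_crossentropy(y_true, p), 3))
```

A confident correct prediction costs almost nothing, while a confident wrong one is penalized heavily, which is what pushes the sigmoid output towards 0 or 1 during training.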

Now, let's fine-tune the model! 

We will train for only 5 epochs; as we will see, this is enough to achieve good accuracy on the validation and test sets. Another reason not to fine-tune a pre-trained model for many epochs is to prevent [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference). One can also use the EarlyStopping callback to guard against this and against overfitting.
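
The idea behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the patience logic, not the Keras implementation. The function name and the validation losses are hypothetical.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, not taken from the run in this tutorial.
losses = [0.030, 0.012, 0.015, 0.011, 0.020]
print(stopping_epoch(losses, patience=1))  # 3
print(stopping_epoch(losses, patience=2))  # 5
```

In Keras the same behaviour is available by passing `callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)]` to `model.fit`.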

Anyway, let's get to the job.

In [24]:
model.fit(x = X_train, 
          y = y_train,
          epochs = 5,
          verbose = 1,
          batch_size = 1024,
          validation_split = 0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f213e499f60>

As we can see, we already achieved 99% accuracy after the first epoch! That's the power of transfer learning, baby!

Let's evaluate on the test set.

In [25]:
test_scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', test_scores[1])

Test accuracy: 0.9991623759269714


Cool. Now, let's save the model. 

In [26]:
folder = 'ru_sentiment_clf/'
model.save(folder)

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.


INFO:tensorflow:Assets written to: ru_sentiment_clf/assets



The warnings can be ignored.

Let's see what was saved.

In [27]:
os.listdir()

['data', 'ru_sentiment_clf']

In [28]:
os.listdir('ru_sentiment_clf')

['variables', 'assets', 'saved_model.pb']

So we have a folder with all the variables, weights, model architecture, etc.

We can completely restore the model from this folder.

In [29]:
from tensorflow import keras
from tensorflow_text import SentencepieceTokenizer

folder = 'ru_sentiment_clf/'
reconstrucred_model = keras.models.load_model(folder)

test_scores = reconstrucred_model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy again:', test_scores[1])

Test accuracy again: 0.9991623759269714


Cool. The reconstructed model achieves the same accuracy.

Now, let's examine a couple of examples.

In [32]:
i = 0
one_sentence = X_test[i]
true_label = y_test[i]
pred = reconstrucred_model.predict([one_sentence])
pred_label = pred[0][0]
print(f'text: {one_sentence} \ntrue_label:{true_label} \npred_label:{pred_label}')

text: То ли вывихнула палец, то ли что( больно(( 
true_label:0 
pred_label:0.055022940039634705


In [33]:
i = 2
one_sentence = X_test[i]
true_label = y_test[i]
pred = reconstrucred_model.predict([one_sentence])
pred_label = pred[0][0]
print(f'text: {one_sentence} \ntrue_label:{true_label} \npred_label:{pred_label}')

text: @Leon4Ik_ как меня в прошлой поездке к тебе выгнали вспомнила )) 
true_label:1 
pred_label:0.9461257457733154
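
The model returns a score between 0 and 1, so a threshold is needed to turn it into a class. A minimal sketch, assuming the usual cut-off of 0.5. The helper `to_label` is made up for this example, and the scores are the two predictions printed above.

```python
def to_label(score: float, threshold: float = 0.5) -> str:
    """Map a sigmoid score to a sentiment class."""
    return 'positive' if score >= threshold else 'negative'


for score in (0.055022940039634705, 0.9461257457733154):
    print(f'{score:.3f} -> {to_label(score)}')
```

Both predictions agree with the true labels of the two test tweets.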


Want to see the effect of emoticons?

In [34]:
i = 2
one_sentence = X_test[i]

# remove the smileys (closing parentheses)
one_sentence = one_sentence.replace(')','')

true_label = y_test[i]
pred = reconstrucred_model.predict([one_sentence])
pred_label = pred[0][0]
print(f'text: {one_sentence} \ntrue_label:{true_label} \npred_label:{pred_label}')

text: @Leon4Ik_ как меня в прошлой поездке к тебе выгнали вспомнила  
true_label:1 
pred_label:0.05593390390276909
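
With the closing parentheses removed, the score for the same tweet drops from about 0.95 to about 0.06, which suggests that the model relies heavily on the emoticons that were used to label the data. To probe this on more examples, a slightly more general helper could be used. This is a sketch: the name `strip_emoticons` and the pattern are assumptions, and the pattern also removes ordinary parentheses.

```python
import re

# Runs of "(" or ")", optionally preceded by eyes and a nose, e.g. ":-)" or "((".
EMOTICON_RE = re.compile(r'[:;=]?-?[()]+')


def strip_emoticons(text: str) -> str:
    """Remove parenthesis-based emoticons and tidy up the whitespace."""
    without_emoticons = EMOTICON_RE.sub(' ', text)
    return ' '.join(without_emoticons.split())


print(strip_emoticons('То ли вывихнула палец, то ли что( больно(('))
```

Comparing predictions on `X_test` before and after such stripping would show how much of the test accuracy comes from the emoticons alone.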


### Conclusion

In this tutorial we learned how to load practically any pre-trained model from [TensorFlow Hub](https://tfhub.dev) as a Keras layer and fine-tune it on our own dataset for a downstream task.

We used Russian in this tutorial, but it could be any of the 16 languages supported by the [Multilingual Universal Sentence Encoder](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html), or even a multilingual dataset.

Next, we use the saved model to serve predictions from a web application built with Flask in [this project](https://github.com/birlikov/ru_sentiment_clf).

### References
1. Ю. В. Рубцова. Построение корпуса текстов для настройки тонового классификатора // Программные продукты и системы, 2015, №1(109), С. 72-78 ([link](http://www.swsys.ru/index.php?page=article&id=3962&lang=))
2. Multilingual Universal Sentence Encoder for Semantic Retrieval ([link](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html))