# [Keras + Universal Sentence Encoder = Transfer Learning for text data](https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/) Tutorial
## Universal Sentence Encoder

This notebook illustrates how to access the Universal Sentence Encoder and use it for sentence similarity and sentence classification tasks.

The Universal Sentence Encoder makes getting sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words. The sentence embeddings can then be used to compute sentence-level semantic similarity and to improve performance on downstream classification tasks with less supervised training data.
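
As a rough illustration of how sentence-level similarity can be computed once embeddings are available, the sketch below scores pairs of vectors by cosine similarity. The three short vectors are made-up stand-ins for encoder output (real sentence embeddings are much longer), and `cosine_similarity` is a helper name introduced here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-dimensional "embeddings", for illustration only.
emb_cat = [0.9, 0.1, 0.0, 0.2]
emb_kitten = [0.8, 0.2, 0.1, 0.3]
emb_invoice = [0.0, 0.9, 0.7, 0.1]

print(cosine_similarity(emb_cat, emb_kitten))   # closer to 1: similar direction
print(cosine_similarity(emb_cat, emb_invoice))  # closer to 0: dissimilar
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings.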


# Getting Started

This section sets up the environment for access to the Universal Sentence Encoder on TF Hub and provides examples of applying the encoder to words, sentences, and paragraphs.

In [14]:
# # Install TensorFlow 1.x; this notebook uses tf.Session, which TensorFlow 2 no longer exposes at the top level.
# !pip3 install --quiet "tensorflow>=1.7,<2"
# # Install TF-Hub.
# !pip3 install --quiet tensorflow-hub
# !pip3 install seaborn

More detailed information about installing TensorFlow can be found at [https://www.tensorflow.org/install/](https://www.tensorflow.org/install/).

In [15]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import keras.layers as layers
from keras.models import Model
from keras import backend as K
np.random.seed(10)

In [16]:
import sys  # needed for sys.exit() in the error handler below
import psycopg2
import pandas.io.sql as sqlio
import numpy as np

def get_dataframe_sql():
    df = None
    sql = "select emotion as label, text_ as text from dataset_emotion_only"
    con = None
    try:
        con = psycopg2.connect("host='35.200.234.61' dbname='sales' user='postgres' password='cx6ac54nmgGtLD1y'")
        df = sqlio.read_sql_query(sql, con)
    except psycopg2.DatabaseError as e:
        if con:
            con.rollback()
        print(e)
        sys.exit(1)
    finally:
        if con:
            con.close()
    df = df.sample(frac=1.0)
    df.label = df.label.astype('category')
    return df
  
df = get_dataframe_sql()
msk = np.random.rand(len(df)) < 0.8
df_train = df[msk]
df_test = df[~msk]
df_train.head()

|       | label     | text                                                  |
|-------|-----------|-------------------------------------------------------|
| 26915 | worry     | yup our coke blades b annnd now i only need th...     |
| 37746 | happiness | having a cup of tea i have a cold so it's tast...     |
| 15211 | worry     | sucks about your cat... hope you guys feel better     |
| 8556  | surprise  | wow their is no pancake mix                           |
| 28397 | neutral   | hey there what's up?                                  |
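
The split above draws one uniform random number per row and keeps the rows below 0.8 for training, so the training share is about 80% rather than exactly 80%. A minimal sketch of the same idea on a plain list, using Python's `random` module instead of NumPy; the `rows` list, the seed, and the function name `mask_split` are illustrative assumptions.

```python
import random

def mask_split(rows, train_fraction=0.8, seed=10):
    """Split rows into (train, test) using one random draw per row."""
    rng = random.Random(seed)
    mask = [rng.random() < train_fraction for _ in rows]
    train = [row for row, keep in zip(rows, mask) if keep]
    test = [row for row, keep in zip(rows, mask) if not keep]
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = mask_split(rows)
print(len(train), len(test))  # roughly 800 / 200, not exact
```

Every row lands in exactly one of the two sets, and fixing the seed makes the split repeatable across runs.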


In [17]:
category_counts = len(df_train.label.cat.categories)
category_counts

14

## Wrap embed module in a Lambda layer
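The wrapper explicitly casts the input to a string tensor and squeezes it to one dimension before calling the `embed` module, which must be loaded from TF Hub before this function is used.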

In [18]:
def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

In [6]:
import os
from keras_xlnet import Tokenizer, load_trained_model_from_checkpoint, ATTENTION_TYPE_BI
with tf.Session() as session:
    checkpoint_path = '/home/chirag/Downloads/cased_L-24_H-1024_A-16/xlnet_cased_L-24_H-1024_A-16'

    tokenizer = Tokenizer(os.path.join(checkpoint_path, 'spiece.model'))
    model = load_trained_model_from_checkpoint(
        config_path=os.path.join(checkpoint_path, 'xlnet_config.json'),
        checkpoint_path=os.path.join(checkpoint_path, 'xlnet_model.ckpt'),
        batch_size=16,
        memory_len=512,
        target_len=128,
        in_train_phase=False,
        attention_type=ATTENTION_TYPE_BI,
    )
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 128)          0                                            
__________________________________________________________________________________________________
Embed-Token (EmbeddingRet)      [(None, 128, 1024),  32768000    Input-Token[0][0]                
__________________________________________________________________________________________________
Masking (CreateMask)            (None, 128)          0           Input-Token[0][0]                
__________________________________________________________________________________________________
Embed-Token-Masked (RestoreMask (None, 128, 1024)    0           Embed-Token[0][0]                
                                                                 Masking[0][0]                    
__________

In [1]:
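# Shape the sentences as an (N, 1) object array, one sentence per row.
train_text = df_train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]

# One-hot encode the labels; columns follow the category order.
train_label = np.asarray(pd.get_dummies(df_train.label), dtype=np.int8)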

In [2]:
train_text.shape

In [28]:
train_label.shape

(31866, 14)

In [29]:
train_label[:3]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int8)
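
Each row of `train_label` is a one-hot vector whose single 1 marks the sample's category, with columns in category order. The following sketch reproduces that encoding and its inverse in plain Python. The three-label list is a made-up subset, not the 14 categories of the dataset, and `one_hot` and `decode` are hypothetical helper names.

```python
def one_hot(labels, categories):
    """Encode each label as a 0/1 row with a single 1 at its category index."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [[1 if index[label] == i else 0 for i in range(len(categories))]
            for label in labels]

def decode(rows, categories):
    """Map each one-hot row back to its label via the position of the 1."""
    return [categories[row.index(1)] for row in rows]

categories = sorted({"worry", "happiness", "neutral"})  # alphabetical order
labels = ["worry", "happiness", "worry"]
encoded = one_hot(labels, categories)
print(encoded)                      # [[0, 0, 1], [1, 0, 0], [0, 0, 1]]
print(decode(encoded, categories))  # ['worry', 'happiness', 'worry']
```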

In [30]:
# Prepare the test split the same way as the training split.
test_text = df_test['text'].tolist()
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = np.asarray(pd.get_dummies(df_test.label), dtype=np.int8)

## Train Keras model and save weights
This trains and saves only our Keras layers, not the embed module's weights.

In [32]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model.fit(train_text, 
            train_label,
            validation_data=(test_text, test_label),
            epochs=5,
            batch_size=32)
    model.save_weights('./emotion-detection-xlnet.h5')

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 3 array(s), but instead got the following list of 1 arrays: [array([['yup our coke blades b annnd now i only need the blades to make them x . but soon enough soon enough...'],
       ["having a cup of tea i have a cold so it's tasting really good"],
       ['s...
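
The error above reports that the model expects three input arrays, while `model.fit` received a single array of raw strings. The model summary shows a token input of width 128, so the text has to be tokenized and padded before it reaches the model. The sketch below shows the shape of that preparation in plain Python. It assumes, following the keras_xlnet README, that the other two inputs are segment ids and a memory length. `build_inputs`, `encode`, and `pad_id` are hypothetical stand-ins, and the toy tokenizer is not SentencePiece.

```python
def build_inputs(texts, encode, target_len=128, pad_id=0):
    """Return [token_ids, segment_ids, memory_lengths] for a batch of texts.

    encode: callable mapping a string to a list of integer token ids
            (hypothetical stand-in for the real tokenizer).
    pad_id: id used for padding; an assumption, check the tokenizer's value.
    """
    token_ids, segment_ids, memory_lengths = [], [], []
    for text in texts:
        ids = encode(text)[:target_len]                 # truncate long inputs
        ids = ids + [pad_id] * (target_len - len(ids))  # pad short ones
        token_ids.append(ids)
        segment_ids.append([0] * target_len)            # single-segment input
        memory_lengths.append([0])                      # no cached memory
    return [token_ids, segment_ids, memory_lengths]

# Toy tokenizer for illustration: one id per whitespace-separated word.
def toy_encode(text):
    return [len(word) for word in text.split()]

batch = build_inputs(["hey there what's up?", "wow"], toy_encode, target_len=8)
print([len(part) for part in batch])  # [2, 2, 2]
print(batch[0][1])                    # [3, 0, 0, 0, 0, 0, 0, 0]
```

The padding side and any special tokens are simplified here and should follow the tokenizer's own documentation. The three lists would be converted to arrays and passed to the model together as one list.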

In [13]:
!ls -alh | grep model.h5

## Make predictions

In [39]:
new_text = [ "The bottle is blue in color",
            "I dont like you so much",
            "I had an amazing day at the stadium",
            "It was super fun after playing football",
            "my computer works fine",
            "I was shocked when I heard the airplane got crashed",
            "What the fuck!!!!!!",
            "What is your name?",
            "I was surprised when she got a gold medal for India.",
           "this website gave me a virus when i opened it more windows kept popping up",
           "the storm is here and the electricity is gone"]
new_text = np.array(new_text, dtype=object)[:, np.newaxis]
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./model.h5')  
    predicts = model.predict(new_text, batch_size=32)


In [40]:
predicts

array([[2.94864178e-04, 3.58521938e-05, 2.74636447e-02, 1.15772486e-02,
        3.44605148e-02, 6.07423186e-02, 1.98349357e-03, 1.18978024e-02,
        5.62073350e-01, 2.26563811e-02, 7.03817010e-02, 1.87754631e-06,
        2.95612514e-02, 1.01144254e-01],
       [7.85529613e-04, 5.95867634e-04, 1.14247203e-02, 3.30045819e-03,
        9.37765837e-03, 1.28763020e-02, 2.03016400e-02, 5.02602160e-02,
        1.19833708e-01, 1.24943554e-02, 1.24906719e-01, 1.57952309e-06,
        1.61271989e-02, 4.34008151e-01],
       [1.24305487e-04, 5.24520874e-05, 2.93567777e-03, 2.45463848e-03,
        2.52319574e-02, 6.52397335e-01, 3.24845314e-04, 1.58385813e-01,
        4.43361998e-02, 6.10709488e-02, 1.93917453e-02, 1.43051147e-06,
        1.64939165e-02, 1.43519640e-02],
       [1.93327665e-04, 6.18994236e-05, 2.34410167e-03, 3.81219387e-03,
        3.99569750e-01, 7.56339550e-01, 1.46552920e-03, 1.96136236e-02,
        5.83135188e-02, 1.07374161e-01, 5.46321273e-03, 4.67896461e-06,
        8.702

In [41]:
categories = df_train.label.cat.categories.tolist()
# argmax returns the index of the highest-scoring class for each sentence.
predict_indices = predicts.argmax(axis=1)
predict_labels = [categories[index] for index in predict_indices]
predict_labels

['neutral',
 'worry',
 'happiness',
 'happiness',
 'worry',
 'worry',
 'neutral',
 'neutral',
 'happiness',
 'worry',
 'neutral']

In [42]:
threshold = 0.1
for i,sentence in enumerate(new_text):
    predict = predicts[i]
    print(sentence+'--->')
    for j, pred in enumerate(predict):
        if pred>threshold:
            print('\t'+categories[j]+'--->'+str(pred))

['The bottle is blue in color--->']
	neutral--->0.56207335
	worry--->0.101144254
['I dont like you so much--->']
	neutral--->0.11983371
	sadness--->0.12490672
	worry--->0.43400815
['I had an amazing day at the stadium--->']
	happiness--->0.65239733
	love--->0.15838581
['It was super fun after playing football--->']
	fun--->0.39956975
	happiness--->0.75633955
	relief--->0.10737416
['my computer works fine--->']
	neutral--->0.12384233
	sadness--->0.15782672
	worry--->0.30659753
['I was shocked when I heard the airplane got crashed--->']
	surprise--->0.11902076
	worry--->0.5673413
['What the fuck!!!!!!--->']
	empty--->0.1430403
	neutral--->0.2886516
	surprise--->0.15211871
	worry--->0.20384371
['What is your name?--->']
	neutral--->0.49489817
	surprise--->0.11461294
	worry--->0.3219815
['I was surprised when she got a gold medal for India.--->']
	happiness--->0.2837274
	worry--->0.1619628
['this website gave me a virus when i opened it more windows kept popping up--->']
	hate--->0.1428698
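
The loop above prints every category whose score exceeds the threshold, so a sentence can receive several labels. A compact sketch of the same selection, returning the labels sorted by score. The category names and scores below are a small subset taken from the output above (the two lowest are rounded), and `labels_above` is a name introduced here.

```python
def labels_above(scores, categories, threshold=0.1):
    """Return (label, score) pairs above threshold, highest score first."""
    picked = [(cat, s) for cat, s in zip(categories, scores) if s > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)

categories = ["happiness", "love", "neutral", "worry"]
# Scores for "I had an amazing day at the stadium" (subset of the 14 values).
scores = [0.65239733, 0.15838581, 0.0443362, 0.01435196]
print(labels_above(scores, categories))
# [('happiness', 0.65239733), ('love', 0.15838581)]
```

Raising the threshold trades recall for precision: fewer labels are reported, but each one has stronger support.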

In [43]:
for predict in predicts:
    # Add up the class scores for one sentence; avoid shadowing the built-in sum().
    total = 0
    for pred in predict:
        total += pred
    print(total)

0.9342745542526245
0.8162941038608551
0.9975532293319702
1.3703825771808624
0.775869220495224
0.8630472719669342
0.9864477813243866
1.029527485370636
0.7664271295070648
0.8737136721611023
0.665891744196415
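
Several of the sums above differ noticeably from 1 (one is about 1.37). Scores from a softmax output always sum to 1, whereas independent per-class sigmoid scores need not, so these totals are what a sigmoid-style output would produce. The model definition is not shown here, so that reading is an assumption. The sketch below contrasts the two on made-up logits.

```python
import math

def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

logits = [1.2, -0.4, 0.8, -2.0]  # made-up values for illustration
print(sum(softmax(logits)))  # 1.0 up to floating-point rounding
print(sum(sigmoid(logits)))  # not constrained to 1
```

With sigmoid-style scores, a per-class threshold is a natural way to pick labels, while `argmax` still returns the single highest-scoring class.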
