# Detecting Irony Using Google Universal Sentence Encoder
In this notebook, I will use Google's Universal Sentence Encoder and an artificial neural network to detect irony.

## Import Statements
Below are the packages needed to run this code.

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import keras.backend as K
from keras.layers import *
from keras.models import *
from keras.utils import *
from keras.regularizers import *
import numpy as np
import pickle  # the public module; it uses the C implementation (_pickle) when available
import uuid

W0505 19:15:44.339353 17336 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14
Using TensorFlow backend.


## Loading Data
The code below loads the data from the `.pickle` files generated by the data collection code and labels ironic examples 1 and non-ironic examples 0. The next cell reshapes the feature arrays to shape `(N, 1)` so that they match the model's input layer.

In [2]:
# Load each pickled list of sentences; `with` closes the file automatically.
with open("goodreads_irony_edited.pickle", mode="rb") as f:
    goodreads_irony = pickle.load(f)
with open("goodreads_knowledge_keep.pickle", mode="rb") as f:
    goodreads_knowledge = pickle.load(f)
with open("goodreads_metaphor_keep.pickle", mode="rb") as f:
    goodreads_metaphor = pickle.load(f)

ironic = goodreads_irony
ironic_labels = np.ones(len(ironic))
non_ironic = goodreads_knowledge + goodreads_metaphor
non_ironic_labels = np.zeros(len(non_ironic))
full_set = ironic + non_ironic
full_labels = np.concatenate([ironic_labels, non_ironic_labels])

ironic_size = len(ironic)
non_ironic_size = len(non_ironic)
full_set_size = len(full_set)

print("Ironic examples: " + str(ironic_size))
print("Non ironic examples: " + str(non_ironic_size))
print("Full set (for validation): " + str(full_set_size))

Ironic examples: 320
Non ironic examples: 807
Full set (for validation): 1127
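
The two classes are not balanced, which matters when reading any accuracy figure later on. The short sketch below uses only the counts printed above to work out the share of ironic examples and the accuracy a trivial "always non-ironic" classifier would reach (roughly 72%). The variable names `ironic_fraction` and `majority_baseline` are illustrative and do not appear elsewhere in the notebook.

```python
# Class counts copied from the output above.
ironic_size = 320
non_ironic_size = 807
full_set_size = ironic_size + non_ironic_size

# Share of ironic examples in the combined set.
ironic_fraction = ironic_size / full_set_size

# Accuracy of a trivial classifier that always predicts "non-ironic".
majority_baseline = non_ironic_size / full_set_size

print(f"Ironic fraction: {ironic_fraction:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy measured on data with this class mix is easier to interpret when compared against that baseline.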


In [4]:
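# train_features, test_features, and full_set_np are expected to exist already
# (they are not created in the cells shown above).
# Reshape each array of sentences from (N,) to (N, 1): one string per row.
train_features = np.reshape(train_features, (train_features.shape[0], 1))
test_features = np.reshape(test_features, (test_features.shape[0], 1))
x = np.reshape(full_set_np, (full_set_np.shape[0], 1))
y = full_labels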
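
To make the reshape concrete, here is a minimal sketch on a made-up list of three sentences. It assumes that `full_set_np` is simply the NumPy array form of `full_set`; that is a guess, since the cell creating it is not shown. `example_sentences`, `example_np`, and `example_x` are hypothetical names.

```python
import numpy as np

# Hypothetical stand-in for the list of sentences loaded from the pickle files.
example_sentences = ["first sentence", "second sentence", "third sentence"]

# Assumption: full_set_np is built like this from full_set.
example_np = np.asarray(example_sentences)
print(example_np.shape)  # (3,)

# One string per row, which is the layout a (1,)-shaped string input expects.
example_x = np.reshape(example_np, (example_np.shape[0], 1))
print(example_x.shape)  # (3, 1)
```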

## Testing Google's Universal Sentence Encoder
The code below loads the Universal Sentence Encoder (GUSE) module from a local directory with the command `embed = hub.Module('./embeddings/GUSE')`. The sentence "The quick brown fox jumped over the lazy dog" is passed into the encoder, and the resulting 1 by 512 array (one 512-dimensional embedding per input sentence) is displayed below.

In [5]:
embed = hub.Module('./embeddings/GUSE')
test_messages = ["The quick brown fox jumped over the lazy dog"]

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings = session.run(embed(test_messages))
message_embeddings

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0505 19:15:53.322526 17336 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


array([[ 1.64562017e-02,  6.23100176e-02,  3.46776913e-03,
         4.70204279e-03,  3.23413201e-02,  9.14165005e-03,
        -3.01914979e-02, -3.84491310e-02, -3.57649252e-02,
        -4.20018146e-03, -1.91205703e-02,  2.40280163e-02,
         5.23959063e-02,  2.04604957e-02,  3.24646309e-02,
         8.14204291e-02, -3.86404321e-02,  1.64812542e-02,
         1.34157818e-02, -3.47102135e-02,  7.75472447e-02,
        -7.51131922e-02,  1.21427458e-02,  4.46278341e-02,
         7.48834983e-02, -1.19225457e-02, -2.63228826e-02,
         1.70647651e-02,  7.75867552e-02,  4.95012477e-02,
         3.56119871e-02,  4.83054519e-02, -1.33931926e-02,
        -2.03491952e-02,  1.61516182e-02, -6.49256110e-02,
        -3.63756418e-02, -3.20054255e-02,  4.01207134e-02,
         5.42915612e-02, -4.16404940e-02, -5.53556010e-02,
        -6.92512095e-02,  4.85552102e-03,  1.49620827e-02,
         2.48402003e-02,  1.57571007e-02,  4.47167791e-02,
        -7.07037523e-02, -7.17832223e-02,  5.00952639e-0

## Creating the Model
The code below defines a function that takes a batch of input strings, passes it through GUSE, and returns the resulting embeddings. The function is needed because the hub module is not itself a Keras layer; wrapping the function in a `Lambda` layer lets Keras call the module as part of the model. The next cell builds the model using Keras and displays its structure.

In [6]:
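def GUSE(param):
    # Drop the trailing axis, (batch, 1) -> (batch,), then embed each sentence as a 512-d vector.
    # axis=1 keeps the batch axis even when a batch holds a single sentence.
    return embed(tf.squeeze(tf.cast(param, tf.string), axis=1), signature="default", as_dict=True)["default"]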
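
The wrapper removes the trailing axis of the `(batch, 1)` input before calling the encoder. One detail worth knowing is that a squeeze with no axis argument removes every axis of length 1, including the batch axis when a batch contains a single sentence. The sketch below illustrates this with NumPy's `squeeze`, which follows the same rule as `tf.squeeze`; the arrays are made-up examples and not notebook data.

```python
import numpy as np

# A batch of two sentences and a batch of one sentence, both shaped (batch, 1).
batch_of_two = np.array([["first sentence"], ["second sentence"]])
batch_of_one = np.array([["only sentence"]])

# Without an axis argument, every length-1 axis is removed.
print(np.squeeze(batch_of_two).shape)          # (2,)
print(np.squeeze(batch_of_one).shape)          # ()

# Naming the axis keeps the batch dimension in both cases.
print(np.squeeze(batch_of_two, axis=1).shape)  # (2,)
print(np.squeeze(batch_of_one, axis=1).shape)  # (1,)
```

Restricting the squeeze to axis 1 therefore gives a one-dimensional batch of strings regardless of batch size.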

In [50]:
input_layer = Input(shape=(1,), dtype="string")
guse = Lambda(GUSE, output_shape=(512,))(input_layer)
dense1 = Dense(256, activation="tanh", kernel_regularizer=l2(0.01))(guse)
dropout1 = Dropout(0.3)(dense1)
dense2 = Dense(128, activation="tanh", kernel_regularizer=l2(0.01))(dropout1)
dropout2 = Dropout(0.3)(dense2)
dense3 = Dense(128, activation="tanh", kernel_regularizer=l2(0.01))(dropout2)
dropout3 = Dropout(0.3)(dense3)
dense4 = Dense(64, activation="tanh", kernel_regularizer=l2(0.01))(dropout3)
dropout4 = Dropout(0.3)(dense4)
output = Dense(1, activation="sigmoid")(dropout4)
model = Model(inputs=[input_layer], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0505 19:30:05.388706 17336 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
lambda_5 (Lambda)            (None, 512)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 256)               131328    
_________________________________________________________________
dropout_17 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_22 (Dense)             (None, 128)               32896     
_________________________________________________________________
dropout_18 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_23 (Dense)             (None, 128)               16512     
__________
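
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that formula to the layer sizes in the model definition; `dense_params` is a helper written for this illustration, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# (inputs, units) for the five Dense layers, read off the model definition.
layer_sizes = [(512, 256), (256, 128), (128, 128), (128, 64), (64, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

print(counts)       # [131328, 32896, 16512, 8256, 65]
print(sum(counts))  # 189057
```

The first three values match the `Param #` column shown above. The `Lambda` and `Dropout` layers report 0 parameters.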

## Training the Model
The code below trains the model for 35 epochs (i.e. 35 passes through the training data), holding out 10% of the examples for validation. It then saves the model weights to a `.h5` file.

In [51]:
session = tf.Session()
K.set_session(session)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
history = model.fit(x, y, batch_size=256, epochs=35, validation_split=0.1, shuffle=True)
nonce = str(int(uuid.uuid4()))
model.save_weights('./models/guse_model' + nonce + '.h5')

Train on 1014 samples, validate on 113 samples
Epoch 1/35
...
Epoch 35/35
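
The 1014/113 split in the log follows from `validation_split=0.1`. Keras documents that the validation samples are taken from the end of `x` and `y` before any shuffling. If `x` and `y` keep the order in which `full_set` was built (ironic examples first), the held-out 113 examples would all be non-ironic. The sketch below reproduces that arithmetic with NumPy only and shows one possible remedy, shuffling before the split. It assumes the data order was not changed in a cell that is not shown; `perm` and the fixed seed are illustrative choices.

```python
import numpy as np

# Labels in the order full_set was assembled: 320 ironic, then 807 non-ironic.
labels = np.concatenate([np.ones(320), np.zeros(807)])

# Index at which the last 10% begins, consistent with the log above.
split_at = int(len(labels) * (1 - 0.1))
val_labels = labels[split_at:]
print(split_at, len(val_labels), val_labels.mean())  # 1014 113 0.0

# Possible remedy: apply one random permutation to features and labels alike
# before calling fit, so the tail holds a mix of both classes.
rng = np.random.RandomState(0)
perm = rng.permutation(len(labels))
shuffled_labels = labels[perm]
val_shuffled = shuffled_labels[split_at:]
print(len(val_shuffled))
```

In the notebook, the same permutation would be applied as `x[perm]` and `y[perm]` so that sentences and labels stay aligned.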


## Twitter Data Benchmarking
The code below accesses the Twitter data from a `.pickle` file generated in the data collection stage. It then reshapes the tensor to be the right dimension for input into the model.

In [52]:
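# Load the pickled list of Tweets; `with` closes the file automatically.
with open("twitter_irony_all.pickle", mode="rb") as twitter_f:
    twitter_data = pickle.load(twitter_f)
twitter_x = np.asarray(twitter_data)
twitter_x = np.reshape(twitter_x, (twitter_x.shape[0], 1))
# Every Tweet in this set is ironic, so every label is 1.
twitter_y = np.ones((twitter_x.shape[0], 1))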

Evaluate the model on the Twitter data:

In [53]:
scores = model.evaluate(twitter_x, twitter_y)



Display the accuracy of the model:

In [54]:
scores[1]

0.8269230738664285

## Twitter Data Analysis
The cell below prints the model's predictions for the Twitter data. All Tweets in the dataset are ironic, so the model should output a number close to 1 for each one.

In [55]:
predictions = model.predict(twitter_x)
predictions

array([[0.46093705],
       [0.9621376 ],
       [0.08942699],
       [0.45609495],
       [0.9356887 ],
       [0.6575923 ],
       [0.91780066],
       [0.9667613 ],
       [0.45876235],
       [0.25410482],
       [0.9322748 ],
       [0.9208282 ],
       [0.97486037],
       [0.9665472 ],
       [0.98970455],
       [0.8555902 ],
       [0.948432  ],
       [0.4179261 ],
       [0.97569805],
       [0.9814073 ],
       [0.9812176 ],
       [0.8448669 ],
       [0.9125563 ],
       [0.8238864 ],
       [0.9352235 ],
       [0.39621782],
       [0.9765047 ],
       [0.91828734],
       [0.9766654 ],
       [0.8971828 ],
       [0.5088721 ],
       [0.9283205 ],
       [0.8167162 ],
       [0.78996825],
       [0.9370711 ],
       [0.4863331 ],
       [0.909876  ],
       [0.95483714],
       [0.9757598 ],
       [0.7747267 ],
       [0.9347778 ],
       [0.98853487],
       [0.998448  ],
       [0.666966  ],
       [0.9376991 ],
       [0.14414498],
       [0.9661401 ],
       [0.439
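
Because every Tweet is labeled 1, the accuracy reported earlier is simply the fraction of predictions that land above the decision threshold. The sketch below assumes the 0.5 cutoff that Keras's binary accuracy applies by rounding, and uses only the first ten probabilities from the output above. `sample_predictions` and `missed` are illustrative names.

```python
import numpy as np

# First ten predicted probabilities copied from the output above.
sample_predictions = np.array([
    0.46093705, 0.9621376, 0.08942699, 0.45609495, 0.9356887,
    0.6575923, 0.91780066, 0.9667613, 0.45876235, 0.25410482,
])

# A Tweet counts as "predicted ironic" when its probability exceeds 0.5.
predicted_ironic = sample_predictions > 0.5
print(int(predicted_ironic.sum()), "of", len(sample_predictions))  # 5 of 10

# Indices of the Tweets the model misses at this threshold.
missed = np.flatnonzero(~predicted_ironic)
print(missed)  # [0 2 3 8 9]

# The reported accuracy over all 156 Tweets corresponds to this many hits.
correct = round(0.8269230738664285 * 156)
print(correct)  # 129
```

These ten happen to include more misses than the set as a whole, where 129 of 156 Tweets clear the threshold.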

The 2nd example (at index 1) is correctly labeled by the model as ironic, with a predicted probability of about 0.96. The following line prints the Tweet.

In [32]:
twitter_x[1]

array(['Chinese alchemists discovered gunpowder while searching for the elixir of immortality '],
      dtype='<U137')

The 18th example (at index 17) is incorrectly labeled by the model: its predicted probability, about 0.42, falls below 0.5. The following line prints the Tweet. One possible reason for the error is that this example contains a sentence fragment.

In [61]:
twitter_x[17]

array(['Am proofreading a style guide. "Don\'t use negatives" apparently '],
      dtype='<U137')