# IST 691: Deep Learning in Practice

**Homework 3**

Name: Bryan Crigger

SUID: 255676562

*Save this notebook into your Google Drive. The notebook has appropriate comments at the top of code cells to indicate whether you need to modify them or not. Answer your questions directly in the notebook. Remember to use the GPU as your runtime. Once finished, ensure all code blocks have been run, then download the notebook and submit it through Blackboard.*

### Setup

In [None]:
import json
import re
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# to build nearest neighbor model
from sklearn.neighbors import NearestNeighbors

In this homework, we will perform **sarcasm detection** on [Onion](https://www.theonion.com/) vs. [HuffPost](https://www.huffpost.com/) headlines using an LSTM. We will first load the data and generate the training and testing inputs and labels.

In [None]:
! wget -nc -q https://github.com/mrech/NLP_TensorFlow/blob/master/0_Sentiment_in_Text/Sarcasm_Headlines_Dataset_v2.json?raw=true

In [None]:
# read the downloaded dataset
df = pd.read_json('Sarcasm_Headlines_Dataset_v2.json?raw=true', lines = True)

In [None]:
# get information about the data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28619 entries, 0 to 28618
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   is_sarcastic  28619 non-null  int64 
 1   headline      28619 non-null  object
 2   article_link  28619 non-null  object
dtypes: int64(1), object(2)
memory usage: 670.9+ KB


In [None]:
# take a peek at the key data
df[['headline', 'is_sarcastic']].head(5).values

array([['thirtysomething scientists unveil doomsday clock of hair loss',
        1],
       ['dem rep. totally nails why congress is falling short on gender, racial equality',
        0],
       ['eat your veggies: 9 deliciously different recipes', 0],
       ['inclement weather prevents liar from getting to work', 1],
       ["mother comes pretty close to using word 'streaming' correctly",
        1]], dtype=object)

In [None]:
# the training input sequence will be in variable seq_padd_train and the label in train_y
# The testing input sequence will be in variable seq_padd_test and the label in test_y
headlines = df['headline'].values.tolist()
sarcastic = df['is_sarcastic'].values.tolist()

In [None]:
training_size = 20000
test_size = len(headlines) - training_size  # 8619 headlines remain for testing

train_x = headlines[:training_size]
test_x = headlines[training_size:]
train_y = np.array(sarcastic[:training_size])
test_y = np.array(sarcastic[training_size:])

# sequence of words input
max_len = 16

tokenizer = Tokenizer(oov_token = '<OOV>')
tokenizer.fit_on_texts(train_x)

word_index = tokenizer.word_index
index_word = {v: k for k, v in word_index.items()}
vocab_size = len(word_index)
sequence_train = tokenizer.texts_to_sequences(train_x)
seq_padd_train = pad_sequences(sequence_train,
                               padding = 'post',
                               truncating = 'post',
                               maxlen = max_len)


sequence_test = tokenizer.texts_to_sequences(test_x)
seq_padd_test = pad_sequences(sequence_test, padding = 'post',
                              truncating = 'post',
                              maxlen = max_len)
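
To make the `padding = 'post'` and `truncating = 'post'` settings concrete, here is a minimal pure-Python sketch of what they do to a single sequence. The helper `pad_post` is hypothetical and written only for illustration; it is not part of Keras, and it mimics the behaviour for plain Python lists only.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Cut tokens off the end, then add padding at the end, to reach exactly maxlen items."""
    trimmed = sequence[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))


short_headline = [5, 12, 7]            # 3 token indices -> 13 zeros appended
long_headline = list(range(1, 21))     # 20 token indices -> last 4 dropped

print(pad_post(short_headline, 16))
print(pad_post(long_headline, 16))
```

Index 0 is safe to use as the padding value because the `Tokenizer` word index starts at 1.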

### Q1 Calculating the Trainable Parameters of an LSTM

Below is the summary of an LSTM neural network with embeddings and three layers. Explain in detail, after this cell, the "why" of the number of parameters of each of the layers displayed by `model1.summary()`. Cite any sources you used to answer this question.

`model1.summary()`
```
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         2000100   
_________________________________________________________________
lstm (LSTM)                  (None, None, 128)         117248    
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 96)          86400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                41216     
_________________________________________________________________
predictions (Dense)          (None, 1)                 65        
=================================================================
Total params: 2,245,029
Trainable params: 2,245,029
Non-trainable params: 0
_________________________________________________________________
```

In [None]:
# an integer input for vocab indices
inputs = tf.keras.Input(shape = (None,), dtype = 'int32')

# define the layers below: Embedding -> LSTM 1 -> LSTM 2 -> LSTM 3
x = layers.Embedding(input_dim=20000+1, output_dim=100)(inputs)

x = layers.LSTM(128, return_sequences=True)(x)
x = layers.LSTM(96, return_sequences=True)(x)
x = layers.LSTM(64)(x)

# we project onto a single unit output layer, and squash it with a sigmoid
predictions = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x)

model = tf.keras.Model(inputs, predictions, name = 'lstm_simple')

# compile the model with binary crossentropy loss and an adam optimizer
model.compile(loss = 'binary_crossentropy',
               optimizer = 'adam',
               metrics = ['accuracy'])

model.summary()

Model: "lstm_simple"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_5 (Embedding)     (None, None, 100)         2000100   
                                                                 
 lstm_14 (LSTM)              (None, None, 128)         117248    
                                                                 
 lstm_15 (LSTM)              (None, None, 96)          86400     
                                                                 
 lstm_16 (LSTM)              (None, 64)                41216     
                                                                 
 predictions (Dense)         (None, 1)                 65        
                                                                 
Total params: 2245029 (8.56 MB)
Trainable params: 22450

**Why do we have the number of parameters after each of the layers?**

* The input layer has 0 parameters because it only passes the token indices through.
* The embedding layer has 2,000,100 weights: the lookup table has 20,000 rows for in-vocabulary tokens plus 1 row for out-of-vocabulary input, and each row is a 100-dimensional embedding vector, so (20,000 + 1) * 100 = 2,000,100.

For each LSTM layer, the number of parameters is calculated by adding the output size of the previous layer to the memory size (number of units) of the current layer, multiplying that sum by the current layer's memory size, and then adding the memory size once more for the biases. The result is multiplied by 4 because an LSTM has four sets of weights: the input, forget, and output gates plus the candidate cell state.
* The 1st LSTM layer has 117,248 parameters, which come from a memory size of 128 and the 100-dimensional output of the embedding layer: ((128 + 100) * 128 + 128) * 4 = 117,248
* The 2nd LSTM layer has 86,400 parameters, which come from a memory size of 96 and the 128 outputs of the 1st LSTM layer: ((96 + 128) * 96 + 96) * 4 = 86,400
* The 3rd LSTM layer has 41,216 parameters, which come from a memory size of 64 and the 96 outputs of the 2nd LSTM layer: ((64 + 96) * 64 + 64) * 4 = 41,216
* The dense output layer has 65 parameters: one weight for each of the 64 outputs of the 3rd LSTM layer plus 1 bias: 64 * 1 + 1 = 65
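
The per-layer counts above can be checked with a few lines of plain Python. This is only a sketch: the helper names (`embedding_params`, `lstm_params`, `dense_params`) are made up for this check, and the LSTM formula assumes the standard layer with a single bias vector per weight set, which is what the counts in `model.summary()` are consistent with.

```python
def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per row of the lookup table."""
    return input_dim * output_dim


def lstm_params(input_dim, units):
    """Parameters of an LSTM layer with one bias vector per weight set.

    Each of the 4 weight sets (input gate, forget gate, output gate,
    candidate cell state) has input weights of shape (input_dim, units),
    recurrent weights of shape (units, units), and a bias of shape (units,).
    """
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


counts = [
    embedding_params(20000 + 1, 100),  # embedding
    lstm_params(100, 128),             # 1st LSTM
    lstm_params(128, 96),              # 2nd LSTM
    lstm_params(96, 64),               # 3rd LSTM
    dense_params(64, 1),               # predictions
]
print(counts)       # [2000100, 117248, 86400, 41216, 65]
print(sum(counts))  # 2245029
```

The same helpers reproduce the `model2` summary in Q2: `lstm_params(50, 64)` gives 29,440 and `lstm_params(64, 32)` gives 12,416.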


### Q2: LSTM for Detecting Sarcasm

Modify the code below to create an embedding layer of dimension 50. The vocabulary size is in the variable `vocab_size`; remember to add one in the embedding for the "out of vocabulary" input. Define an LSTM with two layers, the first with a memory size of 64 and the second with a memory size of 32. Remember to use the suffix `2` for each of the variables you define (e.g., `x2`).

In [None]:
# an integer input for vocab indices
inputs2 = tf.keras.Input(shape = (None,), dtype = 'int32')

# define the layers below Embedding -> LSTM 1 -> LSTM 2
x2 = layers.Embedding(input_dim=vocab_size + 1, output_dim=50)(inputs2)

x2 = layers.LSTM(64, return_sequences=True)(x2)
x2 = layers.LSTM(32)(x2)

# we project onto a single unit output layer, and squash it with a sigmoid
predictions2 = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x2)

model2 = tf.keras.Model(inputs2, predictions2, name = 'lstm_simple')

# compile the model with binary crossentropy loss and an adam optimizer
model2.compile(loss = 'binary_crossentropy',
               optimizer = 'adam',
               metrics = ['accuracy'])

model2.summary()

Model: "lstm_simple"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 50)          1294950   
                                                                 
 lstm (LSTM)                 (None, None, 64)          29440     
                                                                 
 lstm_1 (LSTM)               (None, 32)                12416     
                                                                 
 predictions (Dense)         (None, 1)                 33        
                                                                 
Total params: 1336839 (5.10 MB)
Trainable params: 1336839 (5.10 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
epochs = 10
# fit the model on the training data, holding out 10% of it for validation
model2.fit(seq_padd_train, train_y,
           validation_split = 0.1,
           epochs = epochs,
           verbose = 2,
           batch_size = 64)

Epoch 1/10
282/282 - 26s - loss: 0.4199 - accuracy: 0.7919 - val_loss: 0.3365 - val_accuracy: 0.8555 - 26s/epoch - 93ms/step
Epoch 2/10
282/282 - 5s - loss: 0.1815 - accuracy: 0.9314 - val_loss: 0.3476 - val_accuracy: 0.8510 - 5s/epoch - 17ms/step
Epoch 3/10
282/282 - 3s - loss: 0.0850 - accuracy: 0.9711 - val_loss: 0.4568 - val_accuracy: 0.8360 - 3s/epoch - 10ms/step
Epoch 4/10
282/282 - 3s - loss: 0.0479 - accuracy: 0.9842 - val_loss: 0.5865 - val_accuracy: 0.8285 - 3s/epoch - 10ms/step
Epoch 5/10
282/282 - 4s - loss: 0.0259 - accuracy: 0.9916 - val_loss: 0.7561 - val_accuracy: 0.8325 - 4s/epoch - 13ms/step
Epoch 6/10
282/282 - 2s - loss: 0.0139 - accuracy: 0.9961 - val_loss: 0.8686 - val_accuracy: 0.8240 - 2s/epoch - 8ms/step
Epoch 7/10
282/282 - 2s - loss: 0.0136 - accuracy: 0.9952 - val_loss: 0.8191 - val_accuracy: 0.8270 - 2s/epoch - 8ms/step
Epoch 8/10
282/282 - 2s - loss: 0.0108 - accuracy: 0.9969 - val_loss: 0.9685 - val_accuracy: 0.8325 - 2s/epoch - 8ms/step
Epoch 9/10
282/28

<keras.src.callbacks.History at 0x7cd51aaf9150>

In [None]:
# estimate the test performance
model2.evaluate(seq_padd_test, test_y)



[0.9400595426559448, 0.8295626044273376]

### Q3: GloVe Word Embeddings

Use the code below to download the GloVe embeddings and create the matrix `embedding_matrix` corresponding to the vocabulary above. Define a layer `embedding_layer_glove`, which will be used by the LSTM below. Evaluate the performance and compare it to the model above.

In [None]:
! wget http://nlp.stanford.edu/data/glove.6B.zip

--2023-12-05 03:10:00--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-12-05 03:10:00--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-05 03:10:01--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [None]:
! unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
# map each GloVe word to its 100-dimensional vector
embeddings_index = {}
with open('glove.6B.100d.txt', encoding = 'utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype = 'float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [None]:
num_tokens = vocab_size + 2
embedding_dim3 = 100
hits = 0
misses = 0

# prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim3))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # words not found in the embedding index stay all-zeros;
        # this includes the rows for padding and "OOV"
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 21242 words (4656 misses)


Create the embedding layer below:

In [None]:
# create the embedding layer using the embedding_matrix from above
embedding_layer_glove = layers.Embedding(
    num_tokens,
    embedding_dim3,
    input_length = max_len,
    embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix),
    trainable = False,
)

In [None]:
# an integer input for vocab indices
inputs3 = tf.keras.Input(shape = (None,), dtype = 'int32')

# next, we map those vocab indices into the GloVe embedding space of dimensionality `embedding_dim3`
x3 = embedding_layer_glove(inputs3)

x3 = layers.LSTM(32)(x3)

# we project onto a single unit output layer, and squash it with a sigmoid
predictions3 = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x3)

model3 = tf.keras.Model(inputs3, predictions3)

# compile the model with binary crossentropy loss and an adam optimizer.
model3.compile(loss = 'binary_crossentropy',
               optimizer = 'adam',
               metrics = ['accuracy'])

In [None]:
# fit the model on the training data, holding out 10% of it for validation
epochs = 10
model3.fit(seq_padd_train, train_y,
           validation_split = 0.1,
           epochs = epochs,
           verbose = 2,
           batch_size = 64)

Epoch 1/10
282/282 - 9s - loss: 0.5332 - accuracy: 0.7270 - val_loss: 0.4423 - val_accuracy: 0.8030 - 9s/epoch - 31ms/step
Epoch 2/10
282/282 - 1s - loss: 0.4036 - accuracy: 0.8201 - val_loss: 0.3932 - val_accuracy: 0.8270 - 1s/epoch - 4ms/step
Epoch 3/10
282/282 - 1s - loss: 0.3557 - accuracy: 0.8459 - val_loss: 0.3721 - val_accuracy: 0.8395 - 1s/epoch - 4ms/step
Epoch 4/10
282/282 - 1s - loss: 0.3218 - accuracy: 0.8627 - val_loss: 0.3627 - val_accuracy: 0.8455 - 1s/epoch - 4ms/step
Epoch 5/10
282/282 - 1s - loss: 0.3014 - accuracy: 0.8729 - val_loss: 0.3543 - val_accuracy: 0.8530 - 1s/epoch - 4ms/step
Epoch 6/10
282/282 - 1s - loss: 0.2791 - accuracy: 0.8838 - val_loss: 0.3737 - val_accuracy: 0.8310 - 1s/epoch - 4ms/step
Epoch 7/10
282/282 - 2s - loss: 0.2614 - accuracy: 0.8931 - val_loss: 0.3470 - val_accuracy: 0.8570 - 2s/epoch - 6ms/step
Epoch 8/10
282/282 - 2s - loss: 0.2435 - accuracy: 0.9012 - val_loss: 0.3462 - val_accuracy: 0.8650 - 2s/epoch - 6ms/step
Epoch 9/10
282/282 - 1s

<keras.src.callbacks.History at 0x7ab25497e170>

In [None]:
model3.evaluate(seq_padd_test, test_y)



[0.3347955048084259, 0.8535792827606201]

Is it better or worse performance compared to `model2`? Why?

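*After 10 epochs, model3 performed better than model2 on the test set: 85.36% accuracy compared to 82.96%, with a much lower test loss (0.33 compared to 0.94). model2 appears to have overfit the data: its training accuracy climbed above 99% while its validation accuracy stayed around 83% and its validation loss rose from 0.34 to 0.97. A likely reason is that model2 has to learn its embeddings from scratch on the training headlines alone, whereas model3 uses frozen, pretrained GloVe vectors and therefore has far fewer trainable parameters.*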

### Q4: Word Analogies

Above, we created the matrix `embedding_matrix` for the vocabulary in the sarcasm dataset. Use the code below to find the word analogy to "`germany` is to `berlin` as `uk` is to _blank_"

In [None]:
# we will first create the nearest neighbor model
nbrs_glove = NearestNeighbors(n_neighbors = 5, metric = 'cosine').fit(embedding_matrix)

In [None]:
# let's check if it works
embedding_man = embedding_matrix[word_index['man']]

In [None]:
# closest words to `man`
dist, idx = nbrs_glove.kneighbors([embedding_man])
[index_word[i] for i in idx[0]]

['man', 'woman', 'boy', 'one', 'person']

In [None]:
# now define the proper embedding to solve the analogy
blank_embedding = embedding_matrix[word_index['germany']]
blank_embedding1 = embedding_matrix[word_index['berlin']]
blank_embedding2 = embedding_matrix[word_index['uk']]
blank_embedding3 = blank_embedding1 - blank_embedding + blank_embedding2

In [None]:
# find the closest words to blank_embedding3 (berlin - germany + uk)
dist, idx = nbrs_glove.kneighbors([blank_embedding3])
[index_word[i] for i in idx[0]]

['uk', 'london', 'theatre', '2013', '2011']

Answer: ***London***
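
In the neighbor list above, the query word `uk` itself ranks first, so the answer is read off the second position. A common convention is to drop the query words before picking the nearest neighbor. The sketch below shows that idea on made-up 4-dimensional vectors; `solve_analogy`, `cosine_similarity`, and `toy_vectors` are hypothetical names for illustration and the numbers are not GloVe values.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, ignoring a, b and c themselves."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(candidates[w], target))


# toy dimensions: [country, capital, german, british] (illustrative only)
toy_vectors = {
    'germany': np.array([1.0, 0.0, 1.0, 0.0]),
    'berlin':  np.array([0.0, 1.0, 1.0, 0.0]),
    'uk':      np.array([1.0, 0.0, 0.0, 1.0]),
    'london':  np.array([0.0, 1.0, 0.0, 1.0]),
    'theatre': np.array([0.1, 0.2, 0.0, 0.9]),
}

print(solve_analogy('germany', 'berlin', 'uk', toy_vectors))  # london
```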

### Q5: Biases

As we discussed in class, there might be several biases in word embeddings. Use the list of occupations below and for each of them find whether `man` or `woman` is closest to it. In particular, first list all occupations that are closer to `man` than `woman`, and then all occupations that are closer to `woman` than `man`.

_Hint_: Use the `cosine` distance between pairs of embeddings from the `SciPy` package. If the occupation does not exist in the embedding matrix, skip it. Also, remember that the cosine distance is smaller when the embeddings are more similar.


In [None]:
from scipy.spatial.distance import cosine
print('cosine([1,1], [1,1]): ', cosine([1,1], [1,1]))
print('cosine([1,1], [0,1]): ', cosine([1,1], [0,1]))

cosine([1,1], [1,1]):  0
cosine([1,1], [0,1]):  0.29289321881345254


In [None]:
occupation_list = """technician, accountant, supervisor, engineer, worker, educator, clerk, counselor,
inspector, mechanic, manager, therapist, administrator, salesperson, receptionist, librarian,
advisor, pharmacist, janitor, psychologist, physician, carpenter, nurse, investigator,
bartender, specialist, electrician, officer, pathologist, teacher, lawyer, planner, practitioner,
plumber, instructor, surgeon, veterinarian, paramedic, examiner, chemist, machinist,
appraiser, nutritionist, architect, hairdresser, baker, programmer, paralegal, hygienist,
scientist""".replace('\n', '').replace(' ', '').split(',')

In [None]:
man_embedding = embedding_matrix[word_index['man']]
woman_embedding = embedding_matrix[word_index['woman']]

In [None]:
# first print the occupations that are closer to `man`, as perceived by GloVe
print("Male Occupations (according to Embedding Space):")
for occupation in occupation_list:
  if occupation in word_index:
    occupation_embedding = embedding_matrix[word_index[occupation]]
    # smaller cosine distance means the occupation is closer to that word
    if cosine(occupation_embedding, man_embedding) < cosine(occupation_embedding, woman_embedding):
      print(occupation)
# second print the occupations that are closer to `woman`, as perceived by GloVe
print("\nFemale Occupations (according to Embedding Space):")
for occupation in occupation_list:
  if occupation in word_index:
    occupation_embedding = embedding_matrix[word_index[occupation]]
    if cosine(occupation_embedding, man_embedding) > cosine(occupation_embedding, woman_embedding):
      print(occupation)

Male Occupations (according to Embedding Space):
engineer
inspector
mechanic
manager
advisor
carpenter
investigator
officer
lawyer
planner
plumber
instructor
architect
scientist

Female Occupations (according to Embedding Space):
technician
supervisor
worker
educator
clerk
counselor
therapist
administrator
receptionist
librarian
pharmacist
janitor
psychologist
physician
nurse
bartender
teacher
practitioner
surgeon
veterinarian
paramedic
examiner
nutritionist
hairdresser
hygienist


Do you see a pattern in the results? Do you think there are biases?

**I think these results seem fairly realistic, and therefore, I suppose, not that biased. If anything, some of the occupations categorized as "female" are probably more evenly split between genders, such as "bartender" or "paramedic". There doesn't seem to be a pattern in which jobs that require more schooling, or jobs that require more manual labor, are labeled as one gender's profession over the other's.**
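
The two lists above only record which of `man` or `woman` is closer, not by how much. One way to look at the size of the difference is a signed gap between the two cosine distances. The sketch below uses made-up 2-dimensional vectors rather than GloVe values, and `gender_gap` is a hypothetical helper, so the printed numbers say nothing about the real embeddings.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity for nonzero vectors (smaller means more similar)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def gender_gap(vector, man_vec, woman_vec):
    """Negative: closer to `man`. Positive: closer to `woman`. Near zero: roughly neutral."""
    return cosine_distance(vector, man_vec) - cosine_distance(vector, woman_vec)


# toy vectors, invented for illustration only
toy = {
    'man':      np.array([1.0, 0.2]),
    'woman':    np.array([0.2, 1.0]),
    'engineer': np.array([0.9, 0.3]),
    'teacher':  np.array([0.6, 0.62]),
    'nurse':    np.array([0.3, 0.9]),
}

gaps = {w: gender_gap(toy[w], toy['man'], toy['woman'])
        for w in ('engineer', 'teacher', 'nurse')}

# most `man`-leaning first, most `woman`-leaning last
for word, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{word:10s} {gap:+.3f}')
```

With the real `embedding_matrix`, a gap close to zero would indicate an occupation that the binary split assigns to one list by only a small margin.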

### Q6: Sequence to Sequence Embedding

What is the problem with LSTM models, and why do we need **attention** to fix them? Give an example of what happens with sequence-to-sequence models for translation.

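**LSTM models become less effective as input sequences get longer. Although LSTMs were designed to capture context across long sequences, everything the model has read must be squeezed into a fixed-size hidden state, so information from early tokens tends to fade. In a sequence-to-sequence translation model, for example, the encoder LSTM compresses the entire source sentence into a single vector, and the decoder has to generate the whole translation from that one vector; for a long sentence, the words at the beginning are often poorly represented by the time decoding starts. Another problem is that LSTMs process tokens one at a time, which prevents parallel computation and leads to longer processing times. Attention helps by letting the decoder look back at all of the encoder's hidden states at every output step and weight the source words most relevant to the word being generated. Models built on attention alone, such as Transformers, also drop the recurrence, which allows parallel processing and faster training.**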
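
As a rough illustration of the attention idea, the sketch below scores every encoder state against the current decoder state, turns the scores into weights with a softmax, and builds a context vector as the weighted sum. It assumes simple dot-product scoring, which is only one of several variants, and the names `attend`, `encoder_states`, and `decoder_state` and all of the numbers are made up for illustration.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-d array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def attend(decoder_state, encoder_states):
    """Dot-product attention for a single decoder step."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context


# 3 source tokens with hidden size 3 (toy values)
encoder_states = np.eye(3)
# a decoder state that points mostly at the 2nd source token
decoder_state = np.array([0.1, 2.0, 0.1])

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3))
print(context.round(3))
```

Because the context vector is recomputed at every output step, the decoder no longer depends on a single fixed vector to carry the whole source sentence.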