# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]


## Lecture 8: Data generators, using word embeddings with RNNs, image captioning

In [69]:
import sys, re, os
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import numpy as np
import pandas as pd
from numpy import array

In [70]:
# Thanks to Firas for the following code for making jupyter RISE slides pretty! 
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
tmp = cm.update(
        "rise",
        {
            "theme": "serif",
            "transition": "fade",
            "start_slideshow_at": "selected",            
            "width": "100%",
            "height": "100%",
            "header": "",
            "footer":"",
            "scroll": True,
            "enable_chalkboard": True,
            "slideNumber": True,
            "center": False,
            "controlsLayout": "edges",
            "hash": True,
        }
    )


In [71]:
%%HTML
<style>
.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
     font-size: 130%;
}

body.rise-enabled div.inner_cell>div.input_area {
    font-size: 100%;
}

body.rise-enabled div.output_subarea.output_text.output_result {
    font-size: 100%;
}
body.rise-enabled div.output_subarea.output_text.output_stream.output_stdout {
  font-size: 150%;
}
</style>

### Learning outcomes

From this lecture you will be able to 

- explain why we need data generators
- implement a data generator for your application 
- explain how we use word embeddings with RNNs/LSTMs
- explain at a high level how we can combine LSTMs and CNNs for image captioning

### Data generators: Motivation 

- In the last lecture, we saw an application of LSTMs in text generation
    - We trained a character-level LSTM model to generate text on a toy dataset.
    - What's the size of `X` in our toy example? 

In [72]:
# The hyperparameters in our model 
n_examples = 4419
seq_length = 25
n_vocab = 34
# Let's create X and y
X = np.zeros((n_examples, seq_length, n_vocab),dtype=bool)
y = np.zeros((n_examples, n_vocab))
print(X.shape)
print(y.shape)
print('Need to load %d bool values'%(np.prod(X.shape)))   
# This is how we trained the model 
#self.model.fit(X, y,  
#               epochs=epochs, 
#               batch_size=128)

(4419, 25, 34)
(4419, 34)
Need to load 3756150 bool values


In [67]:
# The pre-trained model you are using in your lab 4
# The hyperparameters in our model 
# approximately 
n_examples = 1000000
seq_length = 100
n_vocab = 100
# Let's create X and y
X = np.zeros((n_examples, seq_length, n_vocab),dtype=bool)
y = np.zeros((n_examples, n_vocab))
print(X.shape)
print(y.shape)
print('Need to load %d bool values'%(np.prod(X.shape)))   
# This is how we trained the model 
#self.model.fit(X, y,  
#               epochs=epochs, 
#               batch_size=128)

(1000000, 100, 100)
(1000000, 100)
Need to load 10000000000 bool values
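
To put these counts in perspective, here is a rough back-of-the-envelope sketch that converts the two shapes above into memory sizes without allocating anything. The helper name `dense_size_gib` is made up for this illustration, and the `float32` line is an assumption about what the size would be if the one-hot values were stored as floats instead of bools.

```python
import numpy as np

def dense_size_gib(shape, dtype):
    """Size in GiB of a dense array with this shape and dtype (nothing is allocated)."""
    n_values = int(np.prod(shape, dtype=np.int64))
    return n_values * np.dtype(dtype).itemsize / 2**30

toy_shape = (4419, 25, 34)          # toy example from the last lecture
lab_shape = (1_000_000, 100, 100)   # approximate lab 4 setting

print(f"toy X (bool):    {dense_size_gib(toy_shape, bool):.4f} GiB")
print(f"lab X (bool):    {dense_size_gib(lab_shape, bool):.2f} GiB")
print(f"lab X (float32): {dense_size_gib(lab_shape, np.float32):.2f} GiB")
```

NumPy stores one bool per byte, so the larger `X` alone needs several GiB even in the most compact dense form.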


### Data generators: motivation 

- Do we need to load the whole dataset all at once?
- If we are training with minibatch SGD (or truncated backpropagation through time in the case of RNNs), we don't: each update only needs the current minibatch.
- So the idea is to load one minibatch at a time from disk into memory.
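
One possible way to read minibatches from disk is sketched below, assuming the data has been saved as `.npy` files. The function name `disk_batch_generator` and the file names are hypothetical; `np.load(..., mmap_mode="r")` memory-maps the file so that only the slices we index are actually read.

```python
import os
import tempfile
import numpy as np

def disk_batch_generator(x_path, y_path, batch_size=128):
    """Yield (X_batch, y_batch) forever, reading one slice at a time from .npy files."""
    # mmap_mode="r" maps the files instead of reading them fully into memory
    X = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    num_samples = X.shape[0]
    while True:
        for start in range(0, num_samples, batch_size):
            end = min(start + batch_size, num_samples)
            # np.array copies just this slice into an ordinary in-memory array
            yield np.array(X[start:end]), np.array(y[start:end])

# Small stand-in dataset written to a temporary directory (hypothetical file names)
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "X.npy")
y_path = os.path.join(tmp_dir, "y.npy")
np.save(x_path, np.zeros((300, 25, 34), dtype=bool))
np.save(y_path, np.zeros((300, 34), dtype=bool))

gen = disk_batch_generator(x_path, y_path, batch_size=128)
batch_shapes = [next(gen)[0].shape for _ in range(3)]
print(batch_shapes)
```

The last batch of a pass is smaller whenever the number of examples is not a multiple of the batch size.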

### Data generators: How do we do it? 
1. Write a data generator function
2. Create a data generator
3. `fit` your model with the created data generator

In [None]:
# Step 1
# Define a data generator function 
# Attribution: The following code is adapted from 
# https://developers.google.com/machine-learning/guides/text-classification/appendix

def data_generator(X, y, num_features, batch_size = 128):
    """Generates batches of vectorized texts for training/validation.

    # Arguments
        X: array-like, feature matrix (first axis indexes samples).
        y: np.ndarray, labels.
        num_features: int, number of features (not used in the body).
        batch_size: int, number of samples per batch.

    # Returns
        Yields feature and label data in batches.
    """
    num_samples = X.shape[0]
    num_batches = num_samples // batch_size
    if num_samples % batch_size:
        num_batches += 1

    while True:  # loop forever; the caller decides how many batches to draw
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = (i + 1) * batch_size
            if end_idx > num_samples:
                end_idx = num_samples
            X_batch = X[start_idx:end_idx]
            y_batch = y[start_idx:end_idx]
            yield X_batch, y_batch            
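
To see what such a generator yields, here is a small sketch on toy arrays. `batch_generator` is a condensed restatement of the function above written only for this illustration (it drops the unused `num_features` argument), and the toy data is made up.

```python
import math
import numpy as np

def batch_generator(X, y, batch_size=128):
    """Condensed version of the generator above: yield batches forever."""
    num_samples = X.shape[0]
    while True:
        for start in range(0, num_samples, batch_size):
            # Slicing past the end is safe: the last batch is simply shorter
            yield X[start:start + batch_size], y[start:start + batch_size]

X_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10)
batch_size = 4

# Number of batches that make up one pass over the data
steps_per_epoch = math.ceil(len(X_toy) / batch_size)

gen = batch_generator(X_toy, y_toy, batch_size)
epoch_sizes = [len(next(gen)[0]) for _ in range(steps_per_epoch)]
first_of_next_epoch = next(gen)[1]  # the generator wraps around to the start

print(steps_per_epoch, epoch_sizes, first_of_next_epoch)
```

Because the generator never terminates, the training loop has to be told how many batches make up an epoch; with Keras this would typically be done by passing `steps_per_epoch` when fitting.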

### How do we do it? 
- Step 2: create a generator by calling `data_generator` 
- Step 3: Use `tf.keras.Model.fit_generator` instead of `tf.keras.Model.fit`
- Note that in the latest version, `tf.keras.Model.fit` supports generators directly and `fit_generator` is deprecated. [See this](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit_generator).
- (Optional) Check out a demo of using generators in [this notebook](code/LSTM-character-based-text-generation-2.0.ipynb). 
    - You will have to try it on Google Colab.
    - To convince yourself that you need a data generator in this case, first try to run the model with `fit` and see what happens. 
    - You might have to struggle a bit to get it working in your environment. Take it as part of the learning process.     

### Data generators concluding remarks

- A useful technique if you want to do large-scale ML 
- Very useful especially with text, images, and video data 

### Using word embeddings with RNNs

- You might be wondering how we actually use word embeddings with ML models. 
- In Lecture 2 we saw two (rather unsatisfactory) ways to create document representations: averaging or concatenating word embeddings. We used these text representations with ML models. 
- We can conveniently use word embeddings with sequential models such as RNNs and LSTMs. 

### Embedding layers in RNNs/LSTMs 

<img src="images/RNN_generation.png" height="1000" width="1000"> 

    
[Credit](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Embedding layers in RNNs/LSTMs

- Two common ways to incorporate embeddings in the network 
    - Use pre-trained embeddings (transfer learning)
    - Initialize embeddings with random weights and learn as part of the training process. This way we get task-specific embeddings.     
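
Conceptually, an embedding layer is a lookup table: each integer word index selects one row of a weight matrix, which is equivalent to multiplying a one-hot vector by that matrix. The NumPy sketch below illustrates this with a small randomly initialized matrix; the sizes and token ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3

# Randomly initialized embedding matrix: one row per word in the vocabulary
E = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([2, 5, 2])          # a sequence of three word indices

looked_up = E[token_ids]                 # row lookup, shape (3, 3)
one_hot = np.eye(vocab_size)[token_ids]  # shape (3, 6)
via_matmul = one_hot @ E                 # same result via matrix multiplication

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

With pre-trained embeddings, `E` would be filled from the pre-trained vectors instead of random values; when learned as part of training, its rows are updated like any other weights.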

In [85]:
# In Keras, an Embedding layer takes the input dimension (vocabulary size)
# and the output dimension (embedding size); here we also pass the sequence length
vocab_size = 36
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length = 100))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 100, 10)           360       
_________________________________________________________________
lstm_4 (LSTM)                (None, 256)               273408    
_________________________________________________________________
dense_4 (Dense)              (None, 36)                9252      
Total params: 283,020
Trainable params: 283,020
Non-trainable params: 0
_________________________________________________________________
None
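
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM formulation with biases and four weight blocks (three gates plus the candidate cell state); the helper names are made up. The second part uses a hypothetical vocabulary of 10,000 words to compare one-hot inputs against a 100-dimensional embedding.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized row per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 blocks, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embedding_dim, units = 36, 10, 256
emb = embedding_params(vocab_size, embedding_dim)
lstm = lstm_params(embedding_dim, units)
dense = dense_params(units, vocab_size)
total = emb + lstm + dense
print(emb, lstm, dense, total)

# Hypothetical word-level setting: 10,000-word vocabulary
big_vocab, dim = 10_000, 100
one_hot_input = lstm_params(big_vocab, units)
embedded_input = embedding_params(big_vocab, dim) + lstm_params(dim, units)
print(one_hot_input, embedded_input)
```

In the hypothetical word-level setting, most of the one-hot model's parameters sit in the LSTM input weights, which is where a low-dimensional embedding saves the most.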


In [55]:
# How can you get Glove embeddings for your vocab? 
# root_dir is where your glove.6B is located
from tqdm import tqdm
root_dir = '/Users/kvarada/MDS/2018-19/575/data'
glove_dir = os.path.join(root_dir,'glove.6B')
embeddings_index = {} 
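# Use a context manager so the file is closed even if parsing fails
with open(os.path.join(glove_dir, 'glove.6B.200d.txt'), encoding="utf-8") as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]  # the first token on each line is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its vector
        embeddings_index[word] = coefs
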
print(f'Found {len(embeddings_index)} word vectors.')

400000it [00:27, 14691.68it/s]

Found 400000 word vectors.
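
The parsing loop relies on the GloVe text layout: each line holds a word followed by its vector components, separated by spaces. The sketch below runs the same parsing logic on two made-up 3-dimensional lines held in memory, so it can be tried without downloading `glove.6B`; the vectors are invented for illustration.

```python
import io
import numpy as np

# Made-up vectors in the GloVe text layout: word, then space-separated floats
fake_glove = io.StringIO(
    "data 0.1 0.2 0.3\n"
    "science 0.4 0.5 0.6\n"
)

toy_index = {}
for line in fake_glove:
    values = line.split()
    # values[0] is the word; the remaining tokens are the vector components
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(sorted(toy_index))
print(toy_index["science"].shape, toy_index["science"].dtype)
```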





In [77]:
words = ['data', 'science', 'image', 'caption']
embedding_dim = 200
idxtoword = {}
wordtoidx = {}
vocab_size = 5  # 4 words + index 0, which is left unused here (its row stays all zeros)
ix = 1  # start word indices at 1
for w in words:
    wordtoidx[w] = ix
    idxtoword[ix] = w
    ix += 1
    
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in wordtoidx.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [80]:
embedding_matrix[wordtoidx['data']]

array([ 5.74819982e-01,  3.56139988e-02,  4.85900015e-01,  9.40869972e-02,
        6.17579997e-01,  2.00950000e-02, -5.32760024e-01,  5.62810004e-01,
        5.61520010e-02, -1.15460001e-01, -3.29210013e-01, -4.50159982e-02,
        5.10930002e-01,  7.94809982e-02,  4.99009997e-01,  3.65260005e-01,
       -1.64450005e-01,  4.89789993e-01, -3.26680005e-01, -1.02959998e-01,
       -6.43630028e-01,  2.41470003e+00, -2.09150001e-01, -2.29760006e-01,
       -3.92089993e-01,  6.89310014e-01, -3.91079992e-01,  2.02930003e-01,
        4.77270007e-01,  2.99600005e-01, -4.12849993e-01, -5.24999984e-02,
        2.68130004e-01, -4.07340005e-02,  9.45689976e-01, -8.22300017e-01,
       -5.88079989e-02, -1.04180001e-01, -6.38120025e-02,  3.66329998e-02,
        8.74790028e-02, -2.24649996e-01,  2.12590005e-02,  9.59599972e-01,
       -1.93100005e-01,  4.55760002e-01,  4.53520000e-01, -1.10679996e+00,
        3.89319994e-02, -2.41340008e-02, -2.83039987e-01, -1.97080001e-01,
        1.83649994e-02, -

### Concluding remarks 

- You should generally be using embeddings in RNNs/LSTMs for text data.
- Reduces the number of parameters dramatically compared with feeding one-hot word vectors to the network
- Feeds word similarity and relatedness information into the network
- Also gives the model the ability to generalize better. 
- Example: 
    <blockquote>
    I have to make sure to feed the cat .
    </blockquote>

    - Would a Markov model of language be able to generate the sequence "feed the dog" when the only evidence in the corpus is the sentence above?
    - If we feed words to an RNN as word embeddings, it can generate "feed the dog" because the embeddings carry the information that "dog" is similar to "cat".
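
The contrast can be illustrated with a toy sketch. The first part builds an unsmoothed bigram count table from the example sentence, so any unseen word pair has a count of zero. The second part uses made-up 2-dimensional vectors (not real GloVe values) to show what "similar" means for embeddings. An RNN does not compute cosine similarity explicitly; the point is only that similar input vectors tend to lead to similar predictions.

```python
from collections import Counter
import numpy as np

corpus = "I have to make sure to feed the cat .".split()

# Unsmoothed bigram counts: the only evidence a simple Markov model has
bigram_counts = Counter(zip(corpus, corpus[1:]))
print(bigram_counts[("the", "cat")])  # seen in the corpus
print(bigram_counts[("the", "dog")])  # never seen, so the count is zero

# Made-up vectors for illustration only
toy_vectors = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "sure": np.array([-0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine(toy_vectors["cat"], toy_vectors["sure"]))
```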

### Image captioning with LSTMs and CNNs

- You can access the code from the video [here](code/image-captioning-demo.ipynb). 

- LSTMs are expensive to train, so I trained this model on [Google Colab](https://colab.research.google.com/notebooks/welcome.ipynb). 

(Optional) If you feel adventurous, you can try to download the data and run it on your own! 

### Summary and wrap-up 

This is what I promised you in the first lecture. 

### Week 1 

- Representation Learning
- Word vectors and word embeddings

<img src="images/tsne_example.png" height="1000" width="1000"> 

### Week 2

- Markov models
- Hidden Markov models

<img src="images/Markov_autocompletion.png" height="800" width="800"> 

### Week 3

- Topic modeling (Latent Dirichlet Allocation (LDA))
    - Suppose you are given a large collection of documents and asked to 
        - Infer different topics in the documents
        - Pull all documents about a certain topic    
- Introduction to Recurrent Neural Networks (RNNs)

<img src="images/TM_dist_topics_words_blei.png" height="1000" width="1000"> 

(Credit: [David Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))


### Week 4 

- LSTMs 
- RNN applications: Image captioning 

<blockquote>

<img src="images/image_captioning.png" width="1000" height="1000">

<p style="font-size:30px"></p>
</blockquote>    

[Source](https://cs.stanford.edu/people/karpathy/sfmltalk.pdf)

### What we did not cover ...

- If you are excited about NLP, here are some more things to explore: 
    - [Attention](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)
    - [Transformers](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
    - [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)

### Final remarks 

That's all! I hope you learned something from the course that's useful for you. I certainly learned how to make videos :). 

I wish you every success in your job search!  

### UBC teaching evaluations 

- Feel free to do them now if you like. 
- Evaluation link: https://eval.ctlt.ubc.ca/science