In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="../css/custom.css">

In [None]:
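import numpy as np
import pandas as pd
import seaborn as sns
# Toy vocabulary, example documents and embedding dimension names used in this notebook.
WORDS = ["Jack", "likes", "Jill", "cat", "dog"]
DOCUMENTS = ['"Jack likes Jill"', '"Jill likes Jack"', '"dog"']
EMBEDDINGS = ["embedding_0", "embedding_1"]
def one_hot_encoding():
    """One-hot vocabulary: one row per word, with a single 1 in its own column."""
    has = ["has_" + w for w in WORDS]
    return pd.DataFrame(np.eye(len(WORDS), dtype=int), index=WORDS, columns=has)
def one_hot_document():
    """Documents encoded by which vocabulary words they contain (word order is ignored)."""
    has = ["has_" + w for w in WORDS]
    return pd.DataFrame(
        [[1, 1, 1, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]],
        index=DOCUMENTS,
        columns=has,
    )
def embedding_encoding():
    """Hand-picked 2-dimensional toy embeddings: one row per word."""
    return pd.DataFrame(
        [[0, 0.4, 0.1, 0.7, 0.8], [0.1, 0.4, 0.1, 0.6, 0.7]],
        index=EMBEDDINGS,
        columns=WORDS,
    ).T
def embedding_document():
    """Document vectors obtained by summing the toy embeddings of their words."""
    return pd.DataFrame(
        [[0.5, 0.6], [0.5, 0.6], [0.8, 0.7]], index=DOCUMENTS, columns=EMBEDDINGS
    )
def embedding_plot(df):
    """Scatter plot of the 2-dimensional embeddings, one colored point per word."""
    return sns.scatterplot(data=df, x='embedding_0', y='embedding_1', hue=WORDS)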

# Natural language processing for RNNs

![footer_logo](../images/logo.png)

## Goal

In this notebook we discuss natural language processing (NLP) in the context of deep learning: how to encode words into feature vectors and then process them sequentially using RNNs/LSTMs.

## Program

- [Encoding words]()
    - [One-hot encoding]()
    - [Embeddings]()
- [Word2vec]()
- [fastText]()
- [RNNs/LSTMs for NLP]()


# One-hot encoding

- Traditional style of encoding
- Represents words as big vectors of zeros and ones.



# Example vocabulary 

In [None]:
one_hot_encoding()

# One-hot encoding

- Traditional style of encoding
- Represents words as big vectors of zeros and ones.



# Example documents

In [None]:
one_hot_document()

# Disadvantages

- Big and sparse vector space.
- Too many features to learn from (i.e. could end up with more features than samples).
- Doesn't capture relations between words (e.g. cat is more similar to dog than it is to Jack).
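- Treating words as atomic units throws away a lot of information. For example, word order is lost: "Jack likes Jill" and "Jill likes Jack" get the same encoding even though "Jack likes Jill" $\neq$ "Jill likes Jack".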
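
The point about relations between words can be checked directly. The following is a minimal sketch, assuming the same five-word toy vocabulary as above; the helper name `dot` is introduced here only for illustration. The dot product between any two different one-hot vectors is zero, so "cat" looks exactly as unrelated to "dog" as it does to "Jack".

```python
import numpy as np

# Same toy vocabulary as in the notebook.
WORDS = ["Jack", "likes", "Jill", "cat", "dog"]

# Map each word to its one-hot vector (a row of the identity matrix).
one_hot = dict(zip(WORDS, np.eye(len(WORDS), dtype=int)))


def dot(a, b):
    """Dot product between the one-hot vectors of two words (illustrative helper)."""
    return int(one_hot[a] @ one_hot[b])


# Different words are always orthogonal, whatever their meaning.
print(dot("cat", "dog"), dot("cat", "Jack"), dot("cat", "cat"))  # 0 0 1
```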


# Embeddings are a better representation

Smaller and dense vector space


# Example vocabulary


In [None]:
embedding_encoding()

# Embeddings

## Embeddings should capture the relationship between words


In [None]:
embedding_plot(embedding_encoding());
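
What the plot shows can also be expressed as numbers. This is a small sketch, assuming the toy values returned by `embedding_encoding()`; the `distance` helper is a name chosen for this example. In this hand-picked space, "cat" lies much closer to "dog" than to "Jack".

```python
import numpy as np

# Toy embeddings copied from embedding_encoding(): (embedding_0, embedding_1).
embeddings = {
    "Jack": np.array([0.0, 0.1]),
    "likes": np.array([0.4, 0.4]),
    "Jill": np.array([0.1, 0.1]),
    "cat": np.array([0.7, 0.6]),
    "dog": np.array([0.8, 0.7]),
}


def distance(a, b):
    """Euclidean distance between the embeddings of two words (illustrative helper)."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))


print("cat-dog :", round(distance("cat", "dog"), 3))
print("cat-Jack:", round(distance("cat", "Jack"), 3))
```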

How can we learn these embeddings?

# Word2Vec 

With (Skip-gram) Word2vec, we train a neural network to predict the context (neighboring words) of a target word. 

Given a word, $\textsf{vegetable}$, and any other word $w$ in the vocabulary $V$, predict the probability that $w$ occurs in the context of $\textsf{vegetable}$ (i.e. within a $\pm$2 word window):

$$P (w | \textsf{vegetable})$$

> <font face="verdana">the carrot is <span style="background-color: #9ce59e"><ins>a root</ins></span><span style="background-color: #fcc86f"> vegetable</span><span style="background-color: #9ce59e">, <ins>usually orange</ins></span></font>
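
The training examples for this task come straight from the text. The following is a minimal sketch of how (target, context) pairs could be generated with a $\pm$2 word window; the function name `skipgram_pairs` and the whitespace tokenization (with punctuation dropped) are assumptions made for this example, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# The example sentence, lower-cased and without punctuation (an assumption of this sketch).
tokens = "the carrot is a root vegetable usually orange".split()

context = [c for t, c in skipgram_pairs(tokens) if t == "vegetable"]
print(context)  # ['a', 'root', 'usually', 'orange']
```

These pairs are the only supervision the network receives, which is why no class labels are needed.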

# Word2vec

During the process we learn embedding representations of the words as a side effect!

The assumption behind the algorithm is that
> "the meaning of a word can be inferred by the company it keeps."

This approach is unsupervised or __self-supervised__: there's no need for class labels because the (self-)supervision comes from the context.

# Word2Vec is a shallow network


<img src="../images/nlp/single-target-single-context.png" width="900"/>


# Word2vec

Predict different probabilities for different positions in the context.

<center><img src="../images/nlp/skipgram.png" width="400"></center>


# Word2Vec

- Input: the one-hot-encoded target word from the vocabulary
- Output: for every other word in the vocabulary, the probability that it is a context word for the target word

<img src="../images/nlp/word2vec_onehot.png" width="900"/>

# Look up word vectors and compare them

Let $\mathbf{v}_{w_I}$ be the vector of weights going from the input target word $w_I$, and let $\mathbf{v}_{c_j}$ be the vector of weights going to context word $c_j$.

Then the probability of a context word $c_j$ given a target $w_I$ can be calculated using a softmax:

$$p(c_j \mid w_I) = \frac{\exp ( \mathbf{v}_{c_j}^T \mathbf{v}_{w_I})}{\sum_{k \in V} \exp ( \mathbf{v}_{c_k}^T \mathbf{v}_{w_I}) }$$


<img src="../images/nlp/word2vec_onehot.png" width="600"/>
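
The softmax above can be sketched in a few lines of NumPy. This is an illustration only: the weight matrices `V_target` and `V_context` are randomly initialised stand-ins for trained weights, and their names, the vocabulary size and the embedding dimension are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 2

# Random stand-ins for trained weights (assumption: one row per vocabulary word).
V_target = rng.normal(size=(vocab_size, dim))   # rows play the role of v_{w_I}
V_context = rng.normal(size=(vocab_size, dim))  # rows play the role of v_{c_k}


def context_probabilities(target_index):
    """Softmax over the dot products between one target vector and all context vectors."""
    scores = V_context @ V_target[target_index]
    scores = scores - scores.max()  # subtracting the max does not change the softmax
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


p = context_probabilities(0)
print(p)        # one probability per word in the vocabulary
print(p.sum())  # the probabilities sum to 1 (up to floating point error)
```

Note that the denominator runs over the whole vocabulary, which becomes expensive for realistic vocabulary sizes.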

# Word2vec

The weight vectors Word2vec learns in the process can be used as embeddings!

<img src="../images/nlp/word2vec_output_weights_function.png" style="width: 70%;"/>

# fastText

Efficient learning of word representations and sentence classification.

Similar to Word2vec, but it processes words at the character n-gram level.

For example, the word "this" is split into the following bigrams:

> \<t th hi is s>

The architecture is simple, but its performance on sentiment analysis and tag prediction is on par with more complex models (e.g. char-CRNN), and it is much faster.
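
The bigram split of "this" shown above can be reproduced with a short helper. This is a sketch of the splitting step only; the function name `char_ngrams` is made up for this example, and the fastText library itself works with a range of n-gram lengths rather than a single `n`.

```python
def char_ngrams(word, n=2):
    """Split a word into character n-grams, using < and > as word boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("this"))  # ['<t', 'th', 'hi', 'is', 's>']
```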

# fastText

The intuition behind fastText is that by using a bag of character n-grams, you can learn representations for morphologically rich languages.

For example, in plain vanilla Word2vec you learn separate representations for "foot" and "football". This makes it harder to infer that these two words are in fact related.

<img src="../images/nlp/football.jpg" align="center" width="300"/>

However, when we learn character n-gram representations, "foot" and "football" share overlapping n-grams, which brings them closer together in vector space and makes it easier to surface related concepts.
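
The overlap is easy to make concrete. The sketch below, assuming character trigrams and a hypothetical helper `char_ngram_set`, lists the n-grams that "foot" and "football" have in common. Any representation built from these shared pieces will overlap for the two words.

```python
def char_ngram_set(word, n=3):
    """Set of character n-grams of a word, with < and > as boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


shared = char_ngram_set("foot") & char_ngram_set("football")
print(sorted(shared))  # ['<fo', 'foo', 'oot']
```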

# Advantages

- Fewer features to learn from
- Points in space close to each other are similar
- Relations captured in dimensions



# Embeddings layer

- Dense layer that maps one-hot vectors to smaller space
- Relations ~ distances
- Directions ~ semantic relationships

<img src="../images/nlp/linear-relationships.png" width="800" align="center"/>  
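
Why a dense layer on one-hot input amounts to a lookup table can be seen in a short sketch. It assumes the toy vocabulary and the hand-picked 2-dimensional embeddings from earlier in this notebook as the weight matrix `W` (a name chosen for this example): multiplying a one-hot vector by `W` selects exactly one row.

```python
import numpy as np

WORDS = ["Jack", "likes", "Jill", "cat", "dog"]

# Weight matrix of the layer: one row per word (toy values from embedding_encoding()).
W = np.array([[0.0, 0.1], [0.4, 0.4], [0.1, 0.1], [0.7, 0.6], [0.8, 0.7]])

index = WORDS.index("cat")
one_hot_cat = np.eye(len(WORDS))[index]

via_matmul = one_hot_cat @ W  # dense layer applied to the one-hot vector
via_lookup = W[index]         # plain row lookup

print(via_matmul, via_lookup)
```

This equivalence is why embedding layers are typically implemented as an index lookup instead of a full matrix multiplication.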


# Example document

One approach would be to simply sum/average the embeddings.

In [None]:
embedding_document()

However, this would lose much of the sequential information contained in sentences.
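
A quick check makes this loss visible. The sketch below assumes the toy embeddings from earlier in this notebook and a hypothetical helper `document_vector` that sums the word vectors: both orderings of the sentence end up with the same document vector.

```python
import numpy as np

# Toy embeddings copied from embedding_encoding().
embeddings = {
    "Jack": np.array([0.0, 0.1]),
    "likes": np.array([0.4, 0.4]),
    "Jill": np.array([0.1, 0.1]),
}


def document_vector(text):
    """Sum the embeddings of the whitespace-separated words in a document."""
    return np.sum([embeddings[w] for w in text.split()], axis=0)


a = document_vector("Jack likes Jill")
b = document_vector("Jill likes Jack")
print(a, b, np.allclose(a, b))
```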

# RNNs are a natural fit for sequences


![footer_logo](../images/logo.png)

# Feed-forward 💔 sequences

We need a different kind of unit!

<img src="../images/nlp/feedforward-sequence.png" style="width: 25%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  



# Recurrent ❤️ sequences

Internal loop feeds back the previous state 

<img src="../images/nlp/rnn-architecture.png" style="width: 80%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  

Example: understanding a document
- The final state $\mathbf{h}^{(T)}$ represents what the neural network has learned about the document.
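
The recurrence can be sketched with NumPy. This is an illustration under stated assumptions, not the architecture in the figure: it uses a plain tanh update, $\mathbf{h}^{(t)} = \tanh(W_x \mathbf{x}^{(t)} + W_h \mathbf{h}^{(t-1)} + \mathbf{b})$, with randomly initialised (untrained) weights, and the names `W_x`, `W_h`, `b` and `final_state` are chosen for this example. The inputs are the toy embeddings from earlier in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 2, 3

# Untrained, randomly initialised weights (assumption of this sketch).
W_x = rng.normal(size=(hidden_dim, embedding_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)


def final_state(inputs):
    """Run a simple recurrent update over the inputs and return the last state h^(T)."""
    h = np.zeros(hidden_dim)
    for x in inputs:
        # The previous state h is fed back in at every step.
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h


jack, likes, jill = np.array([0.0, 0.1]), np.array([0.4, 0.4]), np.array([0.1, 0.1])

h_a = final_state([jack, likes, jill])  # "Jack likes Jill"
h_b = final_state([jill, likes, jack])  # "Jill likes Jack"
print(h_a)
print(h_b)
```

Unlike a sum of embeddings, the final state depends on the order in which the words are presented, so the two sentences are no longer mapped to the same vector.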



# Example: Document retrieval

Architecture to match questions to answers:

<img src="../images/nlp/rnn-retrieval.png" style="width: 40%;"/>



# Example: Chatbot

Chatbot using an encoder and a decoder:

<img src="../images/nlp/seq2seq-chatbot.png" style="width: 90%;"/>






# Summary

In this notebook we examined how to encode words for use with ML/DL algorithms via Word2vec and fastText.

We then discussed how to process these words sequentially using an RNN/LSTM.


# Example & exercise

[Sentiment analysis](../exercises/03_03_nlp_sentiment_classification.ipynb)


<img src="../images/nlp/sentiment-neuron.gif" style="width: 80%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  

    
![footer_logo](../images/logo.png)