In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="../css/custom.css">

In [None]:
import numpy as np
import pandas as pd

# Toy vocabulary, documents, and embedding dimension names used in the examples.
WORDS = ["code", "console", "cry", "cat", "dog"]
DOCUMENTS = ['"code console"', '"cry cat"', '"dog"']
EMBEDDINGS = ["embedding_0", "embedding_1"]


def one_hot_encoding():
    """One row per word, with a single 1 in that word's has_<word> column."""
    has = ["has_" + w for w in WORDS]
    return pd.DataFrame(np.eye(len(WORDS), dtype=int), index=WORDS, columns=has)


def one_hot_document():
    """One row per document, marking which vocabulary words it contains."""
    has = ["has_" + w for w in WORDS]
    return pd.DataFrame(
        [[1, 1, 0, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 1]],
        index=DOCUMENTS,
        columns=has,
    )


def embedding_encoding():
    """One row per word, holding its two-dimensional toy embedding."""
    return pd.DataFrame(
        [[0, 0.2, 0.5, 0.7, 0.8], [0.1, 0.1, 0.4, 0.6, 0.7]],
        index=EMBEDDINGS,
        columns=WORDS,
    ).T


def embedding_document():
    """One row per document, holding the mean embedding of its words."""
    return pd.DataFrame(
        [[0.1, 0.1], [0.6, 0.5], [0.8, 0.7]], index=DOCUMENTS, columns=EMBEDDINGS
    )

# Natural language processing with RNNs


![footer_logo](../images/logo.png)

# One-hot encoding vs embeddings


# One-hot encoding

- Traditional style of encoding
- Represents each word as a large vector of zeros with a single one at that word's position


# Example vocabulary

In [None]:
one_hot_encoding()


# Example documents

In [None]:
one_hot_document()
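
A document row can be derived from the word rows. The sketch below is one way to do that, assuming documents are whitespace-separated and every token is in the vocabulary; `WORD_INDEX`, `one_hot`, and `encode_document` are illustrative names, not part of the notebook's helpers.

```python
import numpy as np

WORDS = ["code", "console", "cry", "cat", "dog"]
WORD_INDEX = {word: i for i, word in enumerate(WORDS)}


def one_hot(word):
    """Return the one-hot vector of a single vocabulary word."""
    vector = np.zeros(len(WORDS), dtype=int)
    vector[WORD_INDEX[word]] = 1
    return vector


def encode_document(document):
    """Mark every vocabulary word that occurs in the document."""
    vectors = [one_hot(word) for word in document.split()]
    # Element-wise maximum keeps entries at 0/1 even if a word repeats.
    return np.maximum.reduce(vectors)


doc_vector = encode_document("cry cat")
print(doc_vector)  # [0 0 1 1 0]
```

The result matches the `"cry cat"` row of the table: the document vector has a 1 for each word it contains.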


# Disadvantages

- Large, sparse vector space: one dimension per vocabulary word
- Too many features to learn from
- Too few samples to learn every feature well
- Treating words as atomic units throws away a lot of information
- Doesn't capture relations between words
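
The last point can be checked numerically. As a small illustration (the `cosine_similarity` helper is an assumption made for this example), any two distinct one-hot vectors are orthogonal, so "cat" looks exactly as unrelated to "dog" as it does to "code":

```python
import numpy as np

WORDS = ["code", "console", "cry", "cat", "dog"]
one_hot = np.eye(len(WORDS))  # row i is the one-hot vector of WORDS[i]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


cat, dog, code = (one_hot[WORDS.index(w)] for w in ("cat", "dog", "code"))
sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_code = cosine_similarity(cat, code)
print(sim_cat_dog, sim_cat_code)  # 0.0 0.0
```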


# A better representation

A smaller, dense vector space


# Example vocabulary


In [None]:
embedding_encoding()


# Example documents

In [None]:
embedding_document()
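
The toy document vectors are consistent with averaging the embeddings of the words in each document. A minimal sketch of that aggregation, assuming whitespace-separated documents and using the same toy values (`embed_document` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

word_embeddings = pd.DataFrame(
    {
        "embedding_0": [0.0, 0.2, 0.5, 0.7, 0.8],
        "embedding_1": [0.1, 0.1, 0.4, 0.6, 0.7],
    },
    index=["code", "console", "cry", "cat", "dog"],
)


def embed_document(document):
    """Average the embeddings of all words in the document."""
    return word_embeddings.loc[document.split()].mean()


doc = embed_document("cry cat")
print(doc.round(2).tolist())  # [0.6, 0.5]
```

Averaging is only one possible aggregation; it ignores word order, which is one motivation for the recurrent models later in this deck.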


# Advantages

- Fewer features to learn from
- Points close to each other in the space represent similar words
- Relations between words are captured in the dimensions
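
With the toy embeddings from the example vocabulary, closeness can be measured directly. The sketch below uses Euclidean distance, which is one common choice (cosine similarity is another); the `distance` helper is illustrative.

```python
import numpy as np

# Toy two-dimensional embeddings taken from the example vocabulary.
embeddings = {
    "code": np.array([0.0, 0.1]),
    "cat": np.array([0.7, 0.6]),
    "dog": np.array([0.8, 0.7]),
}


def distance(a, b):
    """Euclidean distance between the embeddings of two words."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))


d_cat_dog = distance("cat", "dog")
d_cat_code = distance("cat", "code")
print(round(d_cat_dog, 2), round(d_cat_code, 2))  # 0.14 0.86
```

Unlike the one-hot case, "cat" is now much closer to "dog" than to "code".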


# Embeddings layer

- Dense layer that maps one-hot vectors to a smaller space
- Distances ~ how related two words are
- Directions ~ semantic relationships

<img src="../images/linear-relationships.png" style="width: 50%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  
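
To see why an embeddings layer is a dense layer on one-hot inputs, note that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. A small sketch with the toy embeddings as weights, assuming no bias and no activation:

```python
import numpy as np

WORDS = ["code", "console", "cry", "cat", "dog"]
# Weight matrix: one row per word, one column per embedding dimension.
W = np.array([[0.0, 0.1], [0.2, 0.1], [0.5, 0.4], [0.7, 0.6], [0.8, 0.7]])

index = WORDS.index("cat")
one_hot = np.eye(len(WORDS))[index]

dense_output = one_hot @ W  # dense layer without bias or activation
lookup = W[index]           # equivalent row lookup
print(dense_output, lookup)  # [0.7 0.6] [0.7 0.6]
```

Because the two are equivalent, embedding layers are typically implemented as a table lookup, which avoids building the large one-hot vectors at all.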


# Example: Word2Vec

An algorithm that learns embeddings by __predicting the context words__ of a given word.

Given a word, $\textsf{vegetable}$, and any other word $w \in V$, predict the probability that $w$ occurs in the context of $\textsf{vegetable}$:

$$P (w | \textsf{vegetable})$$

This approach is unsupervised or __self-supervised__: there's no need for class labels because the (self-)supervision comes from the context.

> <font face="verdana">the carrot is <span style="background-color: #9ce59e">a root</span><span style="background-color: #fcc86f"> vegetable</span><span style="background-color: #9ce59e">, usually orange</span></font>
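
The training data for this task comes from the text itself. The sketch below generates (target, context) pairs from the example sentence, assuming a symmetric window of two words on each side and a simple regex tokenizer; both are choices made for this illustration, and `context_pairs` is a hypothetical helper.

```python
import re


def context_pairs(sentence, window=2):
    """Return (target, context) pairs for every word in the sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    pairs = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        # Every word inside the window, except the target itself, is context.
        pairs.extend((target, tokens[j]) for j in range(start, stop) if j != i)
    return pairs


pairs = context_pairs("the carrot is a root vegetable, usually orange")
vegetable_context = [context for target, context in pairs if target == "vegetable"]
print(vegetable_context)  # ['a', 'root', 'usually', 'orange']
```

No labels are needed: each pair is a training example produced by the sentence alone.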


# Word2Vec is a shallow network


<img src="../images/single-target-single-context.png" style="width: 60%;"/>


# Look up word vectors and compare them

The probability of a context word $w_j$ given a target $w_I$ looks like:

$$p(w_j \mid w_I) = \frac{\exp ( u_j )}{\sum_{j' \in V} \exp ( u_{j'} ) }$$

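where

$$u_j=\mathbf{v}_{w_j}' \mathbf{h}=\mathbf{v}_{w_j}' \mathbf{v}_{w_I}^T$$

Here $\mathbf{v}_{w_I}$ is the input vector of the target word, which the hidden layer $\mathbf{h}$ looks up, and $\mathbf{v}_{w_j}'$ is the output vector of the context word.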

<img src="../images/word2vec_output_weights_function.png" style="width: 70%;"/>
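
As a numerical sketch of this computation, the snippet below looks up the input vector of a target word, scores it against the output vector of every vocabulary word, and normalizes the scores with a softmax over the whole vocabulary. The weights are random and untrained, and the names `W_in`, `W_out`, and `context_probabilities` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 2
W_in = rng.normal(size=(vocab_size, dim))   # row i: input vector of word i
W_out = rng.normal(size=(vocab_size, dim))  # row j: output vector of word j


def context_probabilities(target_index):
    """Softmax over the scores u_j of all vocabulary words."""
    h = W_in[target_index]       # hidden layer: look up the target's vector
    u = W_out @ h                # u_j: compare h with every output vector
    exp_u = np.exp(u - u.max())  # subtracting the max avoids overflow
    return exp_u / exp_u.sum()


p = context_probabilities(3)
print(p)
```

The result has one probability per vocabulary word, and the probabilities sum to 1 up to floating-point error.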


# Example: fastText

Efficient learning of word representations and sentence classification.

Aggregates word (or n-gram) embeddings and uses the result as input to a neural network classifier.

The architecture is simple, but its performance on sentiment analysis and tag prediction is on par with more complex models (e.g. char-CRNN), and it is much faster.

<img src="../images/fasttext.png" style="width: 55%;"/>

> Source: [Joulin, 2016]
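
The forward pass of such a classifier can be sketched in a few lines. This is not the fastText library itself: it is a minimal NumPy illustration that averages word embeddings and feeds them to a linear layer with a softmax, using random, untrained weights and a made-up four-word vocabulary, and it assumes the text contains at least one known word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"great": 0, "movie": 1, "boring": 2, "plot": 3}  # hypothetical vocabulary
n_classes, dim = 2, 4
E = rng.normal(size=(len(vocab), dim))  # word embeddings, one row per word
W = rng.normal(size=(dim, n_classes))   # weights of the linear classifier


def predict_proba(text):
    """Average the word embeddings, then apply a linear layer and softmax."""
    ids = [vocab[token] for token in text.split() if token in vocab]
    doc = E[ids].mean(axis=0)  # aggregated document representation
    logits = doc @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


proba = predict_proba("great movie")
print(proba)
```

In a trained model, `E` and `W` would be learned jointly from labeled documents; here the probabilities are meaningless and only show the shape of the computation.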


# RNNs are a natural fit for sequences



# Feed-forward 💔 sequences

We need a different kind of unit!

<img src="../images/feedforward-sequence.png" style="width: 25%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  


# Recurrent ❤️ sequences

An internal loop feeds the previous state back into the unit.

<img src="../images/rnn-architecture.png" style="width: 80%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  

Example: understanding a document

- The final hidden state $\mathbf{h}^{(T)}$ represents what the neural network has learned about the document.
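
The recurrence can be sketched as a loop over the sequence. The snippet below assumes the common simple-RNN update $\mathbf{h}^{(t)} = \tanh(W_x \mathbf{x}^{(t)} + W_h \mathbf{h}^{(t-1)} + \mathbf{b})$ with random, untrained weights; the names `W_x`, `W_h`, `b`, and `final_state` are illustrative, and the figure above may use a different parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 2, 3
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input-to-hidden
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
b = np.zeros(hidden_dim)


def final_state(inputs):
    """Run the recurrence over the sequence and return the last hidden state."""
    h = np.zeros(hidden_dim)  # initial state
    for x_t in inputs:
        # The previous state h is fed back in at every time step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h


document = np.array([[0.5, 0.4], [0.7, 0.6]])  # toy embeddings of "cry", "cat"
h_T = final_state(document)
print(h_T)
```

Because each state depends on the previous one, the final vector depends on the order of the words, unlike an average of embeddings.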


# Example: Document retrieval

Architecture to match questions to answers:

<img src="../images/rnn-retrieval.png" style="width: 40%;"/>


# Example: Chatbot

A chatbot built from an encoder and a decoder:

<img src="../images/seq2seq-chatbot.png" style="width: 90%;"/>





# Example & exercise

[Sentiment analysis](../exercises/03-03-nlp_sentiment_classification.ipynb)


<img src="../images/sentiment-neuron.gif" style="width: 80%; display: block; margin-left: auto !important; margin-right: auto !important;" align="center"/>  

    