# Programming for Data Science and Artificial Intelligence

## Deep Learning - NLP - Level 0

- https://pytorch.org/tutorials/

## Recurrent Neural Network (RNN)

<img src = "../../figures/rnn_weight.png" width="500">

$$h_t = \text{tanh}(\mathbf{W}_{ih}x_t + b_i + \mathbf{W}_{hh}h_{t-1} + b_h)$$
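As a quick sanity check of the update equation, the sketch below computes one hidden-state step by hand and compares it with PyTorch's `nn.RNNCell`, which uses `tanh` by default. The sizes (`input_size=4`, `hidden_size=3`) are arbitrary values chosen for illustration, not taken from this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

input_size, hidden_size = 4, 3              # illustrative sizes (assumption)
cell = nn.RNNCell(input_size, hidden_size)  # tanh nonlinearity by default

x_t = torch.randn(1, input_size)            # one input vector, batch of 1
h_prev = torch.zeros(1, hidden_size)        # previous hidden state h_{t-1}

# Manual version of h_t = tanh(W_ih x_t + b_i + W_hh h_{t-1} + b_h)
h_manual = torch.tanh(
    x_t @ cell.weight_ih.T + cell.bias_ih
    + h_prev @ cell.weight_hh.T + cell.bias_hh
)

h_builtin = cell(x_t, h_prev)
print(torch.allclose(h_manual, h_builtin, atol=1e-6))
```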

### Types of RNN

<img src = "../../figures/karpathy.jpg" width="500">

Examples:
- **One to one**: Image Classification
- **One to many**: Image Captioning
- **Many to one**: Sentiment Analysis
- **Many to many**: Machine translation
- **Exactly matched many to many**: Video labeling frame by frame

### Case study: Predicting the next characters

Given some initial text (e.g., good), let's build a model that predicts the following characters one at a time until the output reaches a specified length (e.g., good I am fine). To link this with the RNN, think of each $x_t$ as a single character, i.e., 'g', 'o', 'o', 'd', each represented as an integer.

In [1]:
import torch
from torch import nn
import numpy as np
import sys

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


#### 1. Defining text input

First, we'll define the sentences the model will learn from.

Since computers don't understand characters, we map each character to an integer (and each integer back to its character). These mappings will be useful for building one-hot encodings.
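A minimal sketch of this step follows. The sentences are hypothetical placeholders (two of them echo phrases mentioned elsewhere in this notebook), and the names `int2char` and `char2int` are assumptions for illustration.

```python
# Hypothetical sample sentences; the notebook's actual list may differ
text = ['hey whats up', 'good i am fine', 'have a nice day']

# Collect every distinct character across all sentences
chars = sorted(set(''.join(text)))

# Integer -> character and character -> integer lookups
int2char = dict(enumerate(chars))
char2int = {ch: i for i, ch in int2char.items()}

print(char2int)
```

Sorting the characters keeps the mapping deterministic across runs, which makes saved models and printed indices easier to compare.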

#### 2. Padding

We'll pad our input sentences so that they all have the same length. While RNNs can typically take in variably sized inputs, we usually want to feed training data in batches to speed up training. To train on batches, each sequence within a batch must be of equal size.

In most cases, padding is done by filling sequences that are too short with **0** values and trimming sequences that are too long. In our case, we'll find the length of the longest sequence and pad the remaining sentences with blank spaces to match that length.
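The padding step can be sketched as below, assuming the same hypothetical sentences as placeholders. `str.ljust` appends spaces on the right until the string reaches the requested length.

```python
# Hypothetical sample sentences (placeholders, not the notebook's actual data)
text = ['hey whats up', 'good i am fine', 'have a nice day']

# Length of the longest sentence
max_len = max(len(s) for s in text)

# Right-pad every shorter sentence with blank spaces
padded = [s.ljust(max_len) for s in text]

for s in padded:
    print(repr(s), len(s))
```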

#### 3. Defining target sequences

As we're going to predict the next character in the sequence at each time step, we'll have to divide each sentence into:

- Input data
    - The last character is excluded, since there is no following character for the model to predict from it
- Target/Ground Truth Label
    - One time step ahead of the input data, as this is the "correct answer" for the model at each time step of the input

Now we can convert our input and target sequences to sequences of integers instead of characters by mapping them using the dictionaries we created above. This will allow us to one-hot-encode our input sequence subsequently.
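One way to sketch the split and the integer conversion is shown below. The padded sentences are hypothetical placeholders, and the variable names (`input_seq`, `target_seq`, `input_ids`, `target_ids`) are assumptions for illustration.

```python
# Hypothetical padded sentences, all of length 15
padded = ['hey whats up   ', 'good i am fine ', 'have a nice day']

chars = sorted(set(''.join(padded)))
char2int = {ch: i for i, ch in enumerate(chars)}

# Input drops the last character; target is shifted one step ahead
input_seq = [s[:-1] for s in padded]
target_seq = [s[1:] for s in padded]

# Map characters to integers
input_ids = [[char2int[c] for c in s] for s in input_seq]
target_ids = [[char2int[c] for c in s] for s in target_seq]

print(repr(input_seq[2]), '->', repr(target_seq[2]))
```

At every position `t`, `target_seq[i][t]` is the character that follows `input_seq[i][t]` in the original sentence.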

#### 4. One-hot embedding

We are now ready to convert our input sequences into the shape <code>(batch_size, seq_len, vocab_size)</code> using one-hot encoding. This is a common shape for batched text input to an RNN.

We also define a helper function that creates an array of zeros for each character and sets the entry at that character's index to **1**.
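A possible version of such a helper is sketched here. The function name `one_hot_encode` and the toy integer sequences are assumptions for illustration.

```python
import numpy as np

def one_hot_encode(sequences, vocab_size):
    """Turn equal-length integer sequences into a (batch, seq_len, vocab) array."""
    batch_size = len(sequences)
    seq_len = len(sequences[0])
    features = np.zeros((batch_size, seq_len, vocab_size), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for t, idx in enumerate(seq):
            features[i, t, idx] = 1.0  # mark the character's index
    return features

input_seq = [[0, 2, 1], [3, 3, 0]]  # toy integer sequences
input_seq_encoded = one_hot_encode(input_seq, vocab_size=4)
print(input_seq_encoded.shape)
```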

In [2]:
#print(input_seq)

In [3]:
#print(input_seq_encoded)

Since we're done with all the data pre-processing, we can now move the data from NumPy arrays to PyTorch's own data structure: **Torch Tensors**.
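The conversion might look like the sketch below, using small toy arrays in place of the real data. The name `input_seq_tensor` is an assumption. Note that `nn.CrossEntropyLoss` expects floating-point inputs and integer (`long`) class indices as targets.

```python
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Toy stand-ins for the pre-processed arrays
ids = np.array([[0, 2, 1], [3, 3, 0]])
input_seq_encoded = np.eye(4, dtype=np.float32)[ids]  # one-hot, shape (2, 3, 4)
target_seq = np.array([[2, 1, 0], [3, 0, 1]])

# NumPy -> Torch; targets stay as integer class indices
input_seq_tensor = torch.from_numpy(input_seq_encoded).to(device)
target_seq_tensor = torch.tensor(target_seq, dtype=torch.long).to(device)

print(input_seq_tensor.shape, target_seq_tensor.dtype)
```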

In [4]:
#print(target_seq_tensor)

#### 5. Implementing model
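A minimal sketch of a character-level model is given below, assuming a single `nn.RNN` layer followed by a linear layer that maps each hidden state to vocabulary logits. The class name `CharRNN` and the sizes used are hypothetical, chosen only to show the tensor shapes.

```python
import torch
from torch import nn

class CharRNN(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, hidden_dim, n_layers=1):
        super().__init__()
        # batch_first=True so inputs are (batch_size, seq_len, vocab_size)
        self.rnn = nn.RNN(vocab_size, hidden_dim, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(x, hidden)  # out: (batch, seq_len, hidden_dim)
        logits = self.fc(out)              # (batch, seq_len, vocab_size)
        return logits, hidden

model = CharRNN(vocab_size=17, hidden_dim=12)
x = torch.zeros(3, 14, 17)                 # dummy batch of one-hot-shaped input
logits, hidden = model(x)
print(logits.shape, hidden.shape)
```

Passing `hidden=None` lets `nn.RNN` start from a zero hidden state, so no separate initialization method is needed in this sketch.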

#### Step by step

#### 6. Training
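A compact training loop is sketched below under several assumptions: hypothetical sample sentences, a hidden size of 12, the Adam optimizer with a learning rate of 0.01, and 100 epochs. For brevity it uses `nn.functional.one_hot` rather than a hand-written NumPy helper.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sample sentences, padded to equal length
text = ['hey whats up', 'good i am fine', 'have a nice day']
max_len = max(map(len, text))
padded = [s.ljust(max_len) for s in text]
chars = sorted(set(''.join(padded)))
char2int = {c: i for i, c in enumerate(chars)}
vocab_size = len(chars)

ids = torch.tensor([[char2int[c] for c in s] for s in padded])
inputs = nn.functional.one_hot(ids[:, :-1], vocab_size).float()
targets = ids[:, 1:]                       # one step ahead of the inputs

rnn = nn.RNN(vocab_size, 12, batch_first=True)
fc = nn.Linear(12, vocab_size)
params = list(rnn.parameters()) + list(fc.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for epoch in range(100):
    optimizer.zero_grad()
    out, _ = rnn(inputs)                   # (batch, seq_len, hidden)
    logits = fc(out)                       # (batch, seq_len, vocab)
    # Flatten so every time step counts as one classification example
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f'first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}')
```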

Let’s test our model now and see what kind of output we get. Before that, let’s define a helper function that converts the model output back to text.
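The helpers could look like the sketch below. The names `predict_next` and `sample` are assumptions, and the network here is untrained, so the generated characters are meaningless; the point is only the decoding logic, which picks the highest-scoring character at the last time step and appends it to the text.

```python
import torch
from torch import nn

# Vocabulary built from hypothetical sample sentences
chars = sorted(set('hey whats up good i am fine have a nice day'))
char2int = {c: i for i, c in enumerate(chars)}
int2char = dict(enumerate(chars))
vocab_size = len(chars)

# Untrained stand-in network (a trained one would be used in practice)
rnn = nn.RNN(vocab_size, 12, batch_first=True)
fc = nn.Linear(12, vocab_size)

def predict_next(prefix):
    """Return the most likely next character after `prefix`."""
    ids = torch.tensor([[char2int[c] for c in prefix]])
    x = nn.functional.one_hot(ids, vocab_size).float()
    with torch.no_grad():
        out, _ = rnn(x)
        logits = fc(out[:, -1])            # scores for the last time step
    return int2char[int(logits.argmax(dim=-1))]

def sample(start, out_len):
    """Greedily extend `start` until it is `out_len` characters long."""
    result = start
    while len(result) < out_len:
        result += predict_next(result)
    return result

generated = sample('good', 14)
print(repr(generated))
```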

### Practice

- Add 'hey i am ok' to the text and see what happens to the prediction (in theory, the model could output either 'hey whats up' or 'hey i am ok')
- Add 'hey i am ok' twice to the text and see what happens to the prediction (in theory, the model should output 'hey i am ok')