**YOUR NAMES HERE**

Spring 2023

CS 343: Neural Networks

Project 4: Recurrent Neural Networks

**Submission reminders:**

- Commit your code to git.
- Did you answer all 10 questions?

In [None]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt

# for loading the datasets
import preprocess_data

np.random.seed(0)

# Set the color style so that Professor Layton can see your plots
plt.show()
plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
# Make the font size larger
plt.rcParams.update({'font.size': 20})

# Turn off scientific notation when printing
np.set_printoptions(suppress=True, precision=3)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

def plot_cross_entropy_loss(loss_history):
    plt.plot(loss_history)
    plt.xlabel('Training iteration')
    plt.ylabel('loss (cross-entropy)')
    plt.show()

# Task 1: Implement Data Preprocessor

## 1a. Implement the following functions in `preprocess_data.py`

- `load_data`
- `sample_sequence`

## 1b. Load in simple data

Load `data_regular.txt`, the dollar store dataset. This file contains a price list for a dollar store; each row contains the price of one item. All prices are between \$1.00 and \$9.99, and all match the pattern `\$\d\.\d\d`.

**Side note**: the set of these prices defines a *regular language*, which can be expressed as a *regular expression*. We are going to see how hard an RNN has to work to represent this language.
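To make the regular-language claim concrete, here is a minimal sketch using Python's `re` module. It tightens the first digit to 1-9 to match the stated price range; that choice is an assumption, and the helper name `is_price` is hypothetical (it is not part of `preprocess_data.py`).

```python
import re

# Assumed pattern: a dollar sign, one nonzero digit, a literal dot, two digits.
PRICE_PATTERN = re.compile(r'\$[1-9]\.\d\d')

def is_price(s):
    """Return True if the whole string s matches the assumed price pattern."""
    return PRICE_PATTERN.fullmatch(s) is not None

print(is_price('$4.50'))   # a well-formed price
print(is_price('$12.00'))  # too many digits before the dot
```

A recognizer like this needs no memory beyond its position in the pattern, which is what makes the language regular.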

In [None]:
char_to_ix, ix_to_char, data = preprocess_data.load_data('data_regular.txt')

## 1c. Sample sequence

Get a sample of length 6. 

In [None]:
test_seq = preprocess_data.sample_sequence(data, char_to_ix, 6, start=0)
print(f'Your test sequence looks like {test_seq} and should look like {np.array([1, 10, 2, 3, 12, 0])}')

**Question 1**: What is the vocabulary size for this dataset? Why?

**Answer 1**: 

# Task 2: Train an MLP to predict the next character in a sequence

## 2a. Copy over the MLP code from project 1

Copy `mlp.py` into your project folder.

## 2b. Define an MLP

In the cell below, define an MLP. Use an input width of 5, one hidden layer of width `len(char_to_ix)*2` with ReLU activation, and an output width of `len(char_to_ix)` with softmax. Use cross-entropy loss and mini-batch SGD (just as you did in project 2).

In [None]:
from mlp import MLP

net = MLP(5, len(char_to_ix)*2, len(char_to_ix))

## 2c. Sample data

In the cell below, we sample 1000 fixed-length sequences for training data, 250 for dev and 250 for test. All sequences should be of length 6 (five features plus one class). 

In [None]:
train = []
dev = []
test = []

while(len(train) < 1000):
    train.append(preprocess_data.sample_sequence(data, char_to_ix, 6, start=-1))
train = np.array(train)
train_x = train[:, 0:-1]
train_y = train[:, -1]
print(f'Shape of train: {train_x.shape, train_y.shape}')

while(len(dev) < 250):
    dev.append(preprocess_data.sample_sequence(data, char_to_ix, 6, start=-1))
dev = np.array(dev)
dev_x = dev[:, 0:-1]
dev_y = dev[:, -1]
print(f'Shape of dev: {dev_x.shape, dev_y.shape}')

while(len(test) < 250):
    test.append(preprocess_data.sample_sequence(data, char_to_ix, 6, start=-1))
test = np.array(test)
test_x = test[:, 0:-1]
test_y = test[:, -1]
print(f'Shape of test: {test_x.shape, test_y.shape}')

## 2d. Train and evaluate the MLP

Train the MLP and evaluate using accuracy. Use these hyperparameters: `reg=0, print_every=10, lr=0.001, mini_batch_sz=50, n_epochs=500`.

Plot the loss history.

**NB**: This is kind of an artificial assessment: in the real world, a model would start with a seed sequence like \$, and would then have to repeatedly predict the next character and add it to the sequence.
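The real-world loop described in the note can be sketched as follows. This is an illustration under assumptions: `predict_next` is a hypothetical callable standing in for a trained model, and left-padding a short seed with its first index is just one way to fill a fixed five-character window.

```python
import numpy as np

def generate(predict_next, seed, window=5, length=5):
    """Repeatedly predict the next character index and append it to the sequence.

    predict_next is a hypothetical callable that maps an array of `window`
    character indices to one predicted index.
    """
    seq = list(seed)
    while len(seq) < length:
        # Take the most recent indices; left-pad with the first index
        # (an assumption) when the sequence is shorter than the window.
        context = seq[-window:]
        context = [seq[0]] * (window - len(context)) + context
        seq.append(int(predict_next(np.array(context))))
    return seq

# Toy stand-in model: always predicts (last index + 1) mod 13.
toy_model = lambda context: (context[-1] + 1) % 13
print(generate(toy_model, seed=[1], length=5))  # prints [1, 2, 3, 4, 5]
```

Because each prediction is fed back in as input, one early mistake can change every character that follows, which the fixed-window accuracy score does not capture.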

In [None]:
# Validate on the dev set during training so the test set stays held out
loss_hist, acc_train, acc_valid = net.fit(train_x, train_y, dev_x, dev_y, reg=0, print_every=10, lr=0.001, mini_batch_sz=50, n_epochs=500)

plot_cross_entropy_loss(loss_hist)

**Question 2**: Why are all the sequences of length six? What happens if you change the sequence length?

**Answer 2**:

# Task 3: Implement an RNN with one hidden layer, tanh activation on the hidden layer, and cross-entropy loss

The structure of our RNN will be:

```
Input layer (units to accommodate one one-hot encoded input at a time) ->
Hidden layer (Y units) with tanh activation ->
Output layer (number of classes units) with softmax activation
```

You may be wondering why we use tanh activation. Plain ReLU is subject to something called the "dying ReLU" problem, so if you try ReLU as an extension, implement leaky ReLU.
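As a rough illustration of the difference, the sketch below compares the gradients of tanh and leaky ReLU. Under plain ReLU a unit with a negative pre-activation receives zero gradient, whereas leaky ReLU keeps a small slope. The function names and the slope `alpha=0.01` are assumptions for illustration, not part of the required `rnn.py` interface.

```python
import numpy as np

def tanh_grad(h):
    """Derivative of tanh, written in terms of the activation h = tanh(z)."""
    return 1.0 - h ** 2

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: pass positive values through, scale the rest by alpha."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))
print(leaky_relu_grad(z))
print(tanh_grad(np.tanh(z)))
```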

## 3a. Implement the following functions in `rnn.py`

- `initialize_wts`
- `one_hot`
- `predict`
- `forward`
- `backward`
- `fit`

For fit, use fixed truncation to unroll the RNN.
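One common reading of fixed truncation is to split the training sequence into consecutive windows of `num_steps` characters, run forward and backward on each window, and carry only the hidden-state values (not the gradients) across window boundaries. The sketch below shows just the windowing step; `fixed_windows` is a hypothetical helper, not one of the functions required in `rnn.py`.

```python
import numpy as np

def fixed_windows(data_ix, num_steps):
    """Yield (inputs, targets) windows of length num_steps.

    Targets are the inputs shifted one position to the right.
    """
    for start in range(0, len(data_ix) - num_steps, num_steps):
        xs = data_ix[start:start + num_steps]
        ys = data_ix[start + 1:start + num_steps + 1]
        yield xs, ys

toy = np.arange(10)
for xs, ys in fixed_windows(toy, num_steps=4):
    print(xs, ys)
```

With ten toy indices and `num_steps=4` this yields two full windows; the leftover tail is dropped in this sketch.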

## 3b. Test key functions with the dollar store dataset

In [None]:
from rnn import RNN

In [None]:
test_x = test_seq[0:5]
test_y = test_seq[1:6]
print(f'Vocab: {len(char_to_ix)}')
print(f'Test input: {test_x}')
print(f'Test output: {test_y}')

In [None]:
# Create a dummy net for testing
num_inputs = len(char_to_ix)
num_hidden_units = len(char_to_ix)*2
num_layers = 1
num_classes = len(char_to_ix)

net = RNN(num_inputs, num_hidden_units, num_layers, num_classes, char_to_ix, ix_to_char)

**Question 3**: For this model, the number of nodes in the input layer and the number in the output layer should be the same. Why is this? Is it possible to have an RNN with more (or fewer) output layer nodes than input layer nodes?

**Answer 3**: 

### Test `initialize_wts`

In [None]:
net.initialize_wts(std=0.01)
print(f'xh wt shape, first hidden layer, is {net.xh_wts[0].shape} and should be (13, 26)')
print(f'hh wt shape, first hidden layer, is {net.hh_wts[0].shape} and should be (26, 26)')
print(f'h bias shape, first hidden layer, is {net.h_b[0].shape} and should be (26, 1)')
print(f'hq wt shape is {net.hq_wts.shape} and should be (26, 13)')
print(f'q bias shape is {net.q_b.shape} and should be (13, 1)')

print(f'1st few xh wts are\n{net.xh_wts[0][:,0]}\nand should be\n[ 0.018  0. -0.005 -0.003 -0.012 -0.008  0.011 -0.006  0.003 -0.012 -0.012 -0.01  -0.011]')
print(f'1st few hh wts are\n{net.hh_wts[0][:,0]}\nand should be\n[ -0.007 -0. 0.006  0.004 -0.001 -0.023  0.027 -0.002  0.011  0. -0.009 -0.004  0.003 -0.009 -0.008  0.008 -0. 0.015  0.005  0. -0.005 -0.008 -0.008 -0.01   0.003  0.014]')
print(f'h bias is\n{net.h_b[0].T}\nand should be\n[[-0.008 -0.009  0.002 -0.017  0.002  0.001  0.01   0.007 -0.004 -0.011 0.017 -0.008 -0.01  -0.011  0.011 -0.005 -0.008  0.001 -0.002 -0.007 0.008  0.011  0.01   0.008  0.004 -0.018]]')
print(f'1st few hq wts are\n{net.hq_wts[:,0]}\nand should be\n[ 0.017  0.01   0.001  0.003 -0.016 -0.003  0.013  0. 0.008  0.003 0.002 -0.004  0. 0.006  0.018 -0.016 -0.003 -0.014  0.004 -0.005 -0.014 -0.003  0.013  0.013  0.005 -0.006]')
print(f'q bias is\n{net.q_b.T}\nand should be\n[[-0.005  0.     0.01   0.002  0.009  0.015  0.004  0.012 -0.003 -0. -0.005  0.01   0.004]]')

### Test the `predict` method

In [None]:
h0 = [np.zeros((net.num_hidden_units, 1)) for _ in range(net.num_layers)]
test_y_pred = net.predict(h0, test_x[0], 5)
print(f'Predicted classes are {np.array(test_y_pred)} and should be [9 12 11 7 1]')

### Test the `forward` method

In [None]:
h0 = [np.zeros((net.num_hidden_units, 1)) for _ in range(net.num_layers)]

xs, ps, hs, loss = net.forward(test_x, test_y, h0)

print(f'Your loss is {loss} and should be 12.83962...')

### Test the `backward` method

In [None]:
dw_xh, dw_hh, db_h, dw_hq, db_q = net.backward(test_x, test_y, xs, ps, hs)

### Test fit


Your `fit` function should print:
- Loss and sample predictions regularly during training.
- After 5000 epochs of training, outputs that start to look like prices.

In [None]:
loss_hist = net.fit(train, num_steps=10, lr=0.001, n_epochs=5000)

plot_cross_entropy_loss(loss_hist)

**Question 4**: Why do we not have held-out dev and test sets for training the RNN?

**Question 5**: Why are we not evaluating using $R^2$? Why are we not evaluating using accuracy?

**Answer 4**: 

**Answer 5**: 

## 3c. Train the RNN with the dollar store dataset


In the cell below, define an RNN. Use an input width of `len(char_to_ix)` (one one-hot encoded character at a time), one hidden layer of width `len(char_to_ix)*2` with tanh activation, and an output width of `len(char_to_ix)` with softmax. Use cross-entropy loss and Adagrad.

Train the RNN using the dollar store dataset. Use these hyperparameters: `print_every=100, lr=0.001, num_steps=10, n_epochs=500`.

Plot the loss history and print sample outputs as the network trains. You should see price-like tokens slowly start to appear.
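For reference, here is a minimal sketch of one Adagrad update: each parameter keeps a running sum of its squared gradients, and the step is divided by the square root of that sum. The function name `adagrad_step`, the `cache` argument, and the `eps` value are illustrative assumptions; your `fit` may store the accumulators differently.

```python
import numpy as np

def adagrad_step(param, grad, cache, lr=0.001, eps=1e-8):
    """Apply one Adagrad update in place and return (param, cache)."""
    cache += grad ** 2                           # accumulate squared gradients
    param -= lr * grad / (np.sqrt(cache) + eps)  # per-parameter scaled step
    return param, cache

w = np.array([1.0, -1.0])
cache = np.zeros_like(w)
g = np.array([0.5, -2.0])
w, cache = adagrad_step(w, g, cache, lr=0.1)
print(w)  # approximately [ 0.9 -0.9]
```

On the first step every parameter moves by roughly `lr` regardless of gradient magnitude; later steps shrink as the accumulated sum grows.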

**Question 6**: Which works better, the MLP or the RNN? Explain your answer.

**Question 7**: Apart from the network architecture, what other differences are there between the MLP and the RNN? (Think: activation functions, optimization algorithms...)

**Question 8**: Add a second hidden layer to the RNN. Does this change the performance of the model?


**Answer 6**:

**Answer 7**:

**Answer 8**:

# Task 4: Train RNN on the arithmetic dataset

## 4a. Load the arithmetic dataset

Load `data_calculator.txt`. This file contains inputs to a regular infix calculator.

**Side note**: the set of these inputs defines a *context-free language*, which can be expressed as a *context-free grammar*. We are going to see how hard an RNN has to work to represent this language.

## 4b. Implement and test regular and randomized truncation

So far, we've been truncating backprop at a fixed number of time steps. Extend the `fit` and `backward` functions of the `RNN` class to take a named argument for the type of truncation (none, regular, or randomized) and a named argument for the sequence length (for regular truncation).
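One simple reading of randomized truncation is to draw each window's length at random instead of using a fixed `num_steps`. The sketch below shows only that windowing idea, under stated assumptions: `random_windows` and `max_steps` are hypothetical names, and lengths are drawn uniformly. The course may intend a different scheme, such as a reweighted gradient estimator.

```python
import numpy as np

def random_windows(data_ix, max_steps, rng):
    """Yield (inputs, targets) windows with lengths drawn uniformly from 1..max_steps."""
    start = 0
    last = len(data_ix) - 1  # the final index can only serve as a target
    while start < last:
        steps = int(rng.integers(1, max_steps + 1))
        end = min(start + steps, last)
        yield data_ix[start:end], data_ix[start + 1:end + 1]
        start = end

rng = np.random.default_rng(0)
for xs, ys in random_windows(np.arange(10), max_steps=4, rng=rng):
    print(xs, ys)
```

Together the windows cover the whole sequence exactly once, so the two truncation schemes see the same data and differ only in where gradients are cut.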

## 4c. Compare regular truncation and randomized truncation

In the cells below, fit an RNN to the arithmetic dataset using each approach to backpropagation through time. Otherwise, hold the hyperparameters constant.

**Question 9**: Which works best for this dataset, regular truncation or randomized truncation? Why?

**Answer 9**:

# Task 5: Train RNN on recipe dataset

## 5a. Load the recipe dataset

Load `data_recipes.txt`. This data comes from https://recipenlg.cs.put.poznan.pl/ and is made available for non-commercial research/teaching use *only*.

**Side note**: the set of these inputs may or may not define a context-free language. For sure it's got a bigger vocabulary than our previous datasets!

## 5b. Train an RNN on this dataset

In cells below:
- Train an RNN using the recipe dataset. Configure the RNN with the following non-default hyperparameters:
    - 200 hidden units
    - 1 hidden layer
    - Learning rate of 0.0001
    - Sequence length of 40
    - 30000 epochs
- Plot the loss over training iterations. You should see the occasional phrase that resembles an actual recipe direction slowly start to appear.

**Question 10**: How many epochs of training does it take before you start to get recipe-type text out? Why does this dataset take longer?

**Answer 10**:

# Extensions

**Reminder**: Please do not integrate extensions into your base project so that it changes the expected behavior of core functions. It is better to duplicate the base project and add features from there.

1) Add more hidden layers to the RNN.

2) Implement dropout in the RNN.

3) Extend the RNN into an LSTM or GRU.

4) Implement visualization of the RNN (as in Karpathy's code).