**YOUR NAMES HERE**

Spring 2023

CS 343: Neural Networks

Project 4: Recurrent Neural Networks

**Submission reminders:**

- Commit your code to git.
- Did you answer all 10 questions?

In [1]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt

# for loading the dataset
import preprocess_data

np.random.seed(0)

# Set the color style so that Professor Layton can see your plots
plt.show()
plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
# Make the font size larger
plt.rcParams.update({'font.size': 20})

# Turn off scientific notation when printing
np.set_printoptions(suppress=True, precision=3)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

def plot_cross_entropy_loss(loss_history):
    plt.plot(loss_history)
    plt.xlabel('Training iteration')
    plt.ylabel('loss (cross-entropy)')
    plt.show()



# Task 1: Implement Data Preprocessor

## 1a. Load in recipe data

Run your function to load in the preprocessed Recipe data.

In [2]:
char_to_ix, ix_to_char, data = preprocess_data.load_recipes('dataset/test_dataset.csv')

data has 24496985 characters, 61 unique.


## 1b. Sample sequence

Get a sample of length 25.
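For intuition, a sampled sequence is a window of the text mapped through `char_to_ix`. The following is a minimal sketch on toy data, assuming the vocabulary is the sorted set of characters. `build_vocab` and `sample_indices` are hypothetical helpers for illustration, not the functions in `preprocess_data`.

```python
def build_vocab(text):
    """Map each unique character to an integer index (sorted order is an assumption)."""
    chars = sorted(set(text))
    char_to_ix = {ch: ix for ix, ch in enumerate(chars)}
    ix_to_char = {ix: ch for ch, ix in char_to_ix.items()}
    return char_to_ix, ix_to_char


def sample_indices(text, char_to_ix, length, start=0):
    """Return the indices of `length` consecutive characters beginning at `start`."""
    return [char_to_ix[ch] for ch in text[start:start + length]]


# Toy text with three unique characters: space, 'a', and 'b'
toy_text = "a b a b b a b"
toy_char_to_ix, toy_ix_to_char = build_vocab(toy_text)
toy_seq = sample_indices(toy_text, toy_char_to_ix, 5)
print(toy_seq)
```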

In [3]:
preprocess_data.sample_sequence(data, char_to_ix, 25, start=15)

15
in a heavy 2-quart saucepan, mix brown sugar, nuts, evaporated milk and butter or margarine. stir ov


[34,
 51,
 53,
 1,
 52,
 34,
 54,
 36,
 38,
 49,
 34,
 47,
 12,
 1,
 46,
 42,
 57,
 1,
 35,
 51,
 48,
 56,
 47,
 1,
 52]

Your sequence should look something like:
```
[34,
 51,
 53,
 1,
 52,
 34,
 54,
 36,
 38,
 49,
 34,
 47,
 12,
 1,
 46,
 42,
 57,
 1,
 35,
 51,
 48,
 56,
 47,
 1,
 52]
 ```

**Question 1:** Why do we lowercase the text and remove non-ASCII characters?

**Answer 1**: 

# Task 2: Implement RNN with one hidden layer, tanh activation on the hidden layer and cross-entropy loss.

The structure of our RNN will be:

```
Input layer (enough units to accommodate one one-hot encoded input at a time) ->
Hidden layer (Y units) with tanh activation ->
Output layer (number of classes units) with softmax activation
```
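The structure above can be sketched as a single time step in NumPy. This is an illustrative sketch, not the required `rnn.py` implementation: the weight names and shapes mirror the ones checked in the `initialize_wts` test in this notebook, while `rnn_step`, `softmax`, and the random initialization here are assumptions.

```python
import numpy as np


def one_hot(ix, num_classes):
    """Column vector of zeros with a 1 at position `ix`."""
    x = np.zeros((num_classes, 1))
    x[ix] = 1.0
    return x


def softmax(z):
    """Numerically stable softmax over a column vector."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def rnn_step(x, h_prev, xh_wts, hh_wts, h_b, hq_wts, q_b):
    """One step: tanh hidden layer, then softmax output layer."""
    h = np.tanh(xh_wts @ x + hh_wts @ h_prev + h_b)
    p = softmax(hq_wts @ h + q_b)
    return h, p


# Toy sizes matching the debugging net: 3 inputs, 7 hidden units, 3 classes
rng = np.random.default_rng(0)
num_inputs, num_hidden, num_classes = 3, 7, 3
xh_wts = rng.normal(0, 0.01, (num_hidden, num_inputs))
hh_wts = rng.normal(0, 0.01, (num_hidden, num_hidden))
h_b = np.zeros((num_hidden, 1))
hq_wts = rng.normal(0, 0.01, (num_classes, num_hidden))
q_b = np.zeros((num_classes, 1))

h, p = rnn_step(one_hot(1, num_inputs), np.zeros((num_hidden, 1)),
                xh_wts, hh_wts, h_b, hq_wts, q_b)
print(h.shape, p.shape)
```

The hidden state `h` returned by one step is passed in as `h_prev` for the next step, which is what makes the network recurrent.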

## 2a. Implement the following functions in `rnn.py`

- `initialize_wts`
- `one_hot`
- `predict`
- `forward`
- `backward`
- `fit`

## 2b. Test key functions with a small and very regular dataset

In [4]:
from rnn import RNN

In [5]:
char_to_ix, ix_to_char, test_data = preprocess_data.load_recipes('test_data.csv')
test_seq = preprocess_data.sample_sequence(test_data, char_to_ix, 11, start=0)

data has 936 characters, 3 unique.
0
a b a b b a b a b b a b a b b a b a b b a b a b b a b a b a b b a b a b a b b a b a b b a b a b b a 


In [6]:
test_x = test_seq[0:9]
test_y = test_seq[1:10]
print(f'Vocab: {len(char_to_ix)}')
print(f'Test input: {test_x}')
print(f'Test output: {test_y}')

Vocab: 3
Test input: [1, 0, 2, 0, 1, 0, 2, 0, 2]
Test output: [0, 2, 0, 1, 0, 2, 0, 2, 0]


In [7]:
# Create a dummy net for debugging
num_inputs = len(char_to_ix)
num_hidden_units = 7
num_classes = len(char_to_ix)

net = RNN(num_inputs, num_hidden_units, num_classes)

**Question 2:** For this model, the number of nodes in the input layer and the number in the output layer should be the same. Why is this? Is it possible to have an RNN with more (or fewer) output layer nodes than input layer nodes?

**Answer 2**: 

### Test `initialize_wts`

In [8]:
net.initialize_wts(num_inputs, num_hidden_units, num_classes, std=0.01)
print(f'xh wt shape is {net.xh_wts.shape} and should be (7, 3)')
print(f'hh wt shape is {net.hh_wts.shape} and should be (7, 7)')
print(f'h bias shape is {net.h_b.shape} and should be (7, 1)')
print(f'hq wt shape is {net.hq_wts.shape} and should be (3, 7)')
print(f'q bias shape is {net.q_b.shape} and should be (3, 1)')

print(f'1st few xh wts are\n{net.xh_wts[:,0]}\nand should be\n[0.018 0.022 0.01  0.004 0.008 0.003 0.003]')
print(f'1st few hh wts are\n{net.hh_wts[:,0]}\nand should be\n[ 0.007  0.015  0.002 -0.017 -0.002  0.001 -0.017]')
print(f'h bias is\n{net.h_b.T}\nand should be\n[[0.012 0.002 0.01  0.004 0.007 0.    0.018]]')
print(f'1st few hq wts are\n{net.hq_wts[:,0]}\nand should be\n[ 0.007 -0.006  0.015]')
print(f'q bias is\n{net.q_b.T}\nand should be\n[[0.001 0.004 0.019]]')

xh wt shape is (7, 3) and should be (7, 3)
hh wt shape is (7, 7) and should be (7, 7)
h bias shape is (7, 1) and should be (7, 1)
hq wt shape is (3, 7) and should be (3, 7)
q bias shape is (3, 1) and should be (3, 1)
1st few xh wts are
[0.018 0.022 0.01  0.004 0.008 0.003 0.003]
and should be
[0.018 0.022 0.01  0.004 0.008 0.003 0.003]
1st few hh wts are
[ 0.007  0.015  0.002 -0.017 -0.002  0.001 -0.017]
and should be
[ 0.007  0.015  0.002 -0.017 -0.002  0.001 -0.017]
h bias is
[[0.012 0.002 0.01  0.004 0.007 0.    0.018]]
and should be
[[0.012 0.002 0.01  0.004 0.007 0.    0.018]]
1st few hq wts are
[ 0.007 -0.006  0.015]
and should be
[ 0.007 -0.006  0.015]
q bias is
[[0.001 0.004 0.019]]
and should be
[[0.001 0.004 0.019]]


### Test the `predict` method

In [10]:
h0 = np.zeros((net.num_hidden_units, 1))
test_y_pred = net.predict(h0, test_x[0], 5)
print(f'Predicted classes are {test_y_pred} and should be [0 0 1 0 1]')

Predicted classes are [0, 0, 1, 0, 1] and should be [0 0 1 0 1]


### Test the `forward` method

In [12]:
hprev = np.zeros((net.num_hidden_units, 1))

xs, ps, hs, loss = net.forward(test_x, test_y, hprev)

print(f'Your ps activation are\n{ps}')

print(f'Your loss is {loss}')

Your ps activation are
{0: array([[0.331],
       [0.332],
       [0.337]]), 1: array([[0.331],
       [0.332],
       [0.337]]), 2: array([[0.331],
       [0.332],
       [0.337]]), 3: array([[0.331],
       [0.332],
       [0.337]]), 4: array([[0.331],
       [0.332],
       [0.337]]), 5: array([[0.331],
       [0.332],
       [0.337]]), 6: array([[0.331],
       [0.332],
       [0.337]]), 7: array([[0.331],
       [0.332],
       [0.337]]), 8: array([[0.331],
       [0.332],
       [0.337]])}
Your loss is 9.891959362138572


The correct ps should look like:
```
0: array([[0.331],
       [0.332],
       [0.337]]), 1: array([[0.331],
       [0.332],
       [0.337]]), 2: array([[0.331],
       [0.332],
       [0.337]]), 3: array([[0.331],
       [0.332],
       [0.337]]), 4: array([[0.331],
       [0.332],
       [0.337]]), 5: array([[0.331],
       [0.332],
       [0.337]]), 6: array([[0.331],
       [0.332],
       [0.337]]), 7: array([[0.331],
       [0.332],
       [0.337]]), 8: array([[0.331],
       [0.332],
       [0.337]])
```

The loss should look like:
```
9.891959362138572
```
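As a sanity check on that number: if the loss is summed over time steps (an assumption inferred from its size), a net that predicts a uniform distribution over 3 classes for 9 steps would score $9 \ln 3$, roughly 9.89. The freshly initialized net is nearly uniform, so its loss lands close to that baseline. `summed_cross_entropy` below is a hypothetical helper that takes `ps` in the same dictionary-of-column-vectors layout printed above.

```python
import numpy as np


def summed_cross_entropy(ps, targets):
    """Sum over time steps of -log(probability assigned to the correct class)."""
    return -sum(np.log(ps[t][targets[t], 0]) for t in range(len(targets)))


num_classes, num_steps = 3, 9
# Uniform predictions at every time step
uniform_ps = {t: np.full((num_classes, 1), 1.0 / num_classes) for t in range(num_steps)}
toy_targets = [0, 2, 0, 1, 0, 2, 0, 2, 0]

baseline_loss = summed_cross_entropy(uniform_ps, toy_targets)
print(baseline_loss, num_steps * np.log(num_classes))
```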

### Test the `backward` method

In [13]:
dw_xh, dw_hh, db_h, dw_hq, db_q = net.backward(test_x, test_y, xs, ps, hs)

print('Your gradient for w_xh is\n', dw_xh)
print('Your gradient for w_hh is\n', dw_hh)
print('Your gradient for b_h is\n', db_h)
print('Your gradient for w_hq is\n', dw_hq)
print('Your gradient for b_q is\n', db_q)

Your gradient for w_xh is
 [[-0.017 -0.004 -0.005]
 [-0.03   0.008  0.013]
 [-0.005 -0.007 -0.01 ]
 [-0.017  0.007  0.011]
 [ 0.026 -0.006 -0.01 ]
 [-0.025  0.02   0.029]
 [-0.01  -0.001 -0.002]]
Your gradient for w_hh is
 [[-0.    -0.001 -0.    -0.    -0.    -0.    -0.   ]
 [ 0.    -0.     0.    -0.    -0.    -0.     0.   ]
 [-0.    -0.001 -0.    -0.    -0.    -0.    -0.   ]
 [ 0.     0.     0.    -0.     0.    -0.     0.   ]
 [ 0.     0.    -0.     0.     0.     0.    -0.   ]
 [ 0.001  0.001  0.001  0.     0.    -0.     0.001]
 [-0.    -0.    -0.    -0.    -0.    -0.    -0.   ]]
Your gradient for b_h is
 [[-0.026]
 [-0.009]
 [-0.022]
 [ 0.001]
 [ 0.01 ]
 [ 0.024]
 [-0.013]]
Your gradient for w_hq is
 [[-0.027  0.019 -0.003 -0.032 -0.014 -0.011  0.032]
 [ 0.043  0.015  0.021  0.023  0.021  0.009  0.004]
 [-0.016 -0.034 -0.017  0.008 -0.007  0.002 -0.036]]
Your gradient for b_q is
 [[-2.02 ]
 [ 1.986]
 [ 0.034]]


The correct values are:
```
Your gradient for w_xh is
 [[-0.017 -0.004 -0.005]
 [-0.03   0.008  0.013]
 [-0.005 -0.007 -0.01 ]
 [-0.017  0.007  0.011]
 [ 0.026 -0.006 -0.01 ]
 [-0.025  0.02   0.029]
 [-0.01  -0.001 -0.002]]
Your gradient for w_hh is
 [[-0.    -0.001 -0.    -0.    -0.    -0.    -0.   ]
 [ 0.    -0.     0.    -0.    -0.    -0.     0.   ]
 [-0.    -0.001 -0.    -0.    -0.    -0.    -0.   ]
 [ 0.     0.     0.    -0.     0.    -0.     0.   ]
 [ 0.     0.    -0.     0.     0.     0.    -0.   ]
 [ 0.001  0.001  0.001  0.     0.    -0.     0.001]
 [-0.    -0.    -0.    -0.    -0.    -0.    -0.   ]]
Your gradient for b_h is
 [[-0.026]
 [-0.009]
 [-0.022]
 [ 0.001]
 [ 0.01 ]
 [ 0.024]
 [-0.013]]
Your gradient for w_hq is
 [[-0.027  0.019 -0.003 -0.032 -0.014 -0.011  0.032]
 [ 0.043  0.015  0.021  0.023  0.021  0.009  0.004]
 [-0.016 -0.034 -0.017  0.008 -0.007  0.002 -0.036]]
Your gradient for b_q is
 [[-2.02 ]
 [ 1.986]
 [ 0.034]]
```
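The `b_q` gradient can be cross-checked by hand. For a softmax output with cross-entropy loss, the gradient with respect to the pre-softmax input at one step is the predicted probabilities minus the one-hot target, and the output bias gradient is the sum of those differences over time steps (assuming a summed loss). The sketch below uses the rounded `ps` values and the `test_y` targets printed earlier, so it only matches the expected gradient to about two decimal places.

```python
import numpy as np

# Rounded softmax output, identical at every step in the printout above
ps_rounded = np.array([[0.331], [0.332], [0.337]])
targets = [0, 2, 0, 1, 0, 2, 0, 2, 0]  # test_y from the earlier cell

db_q_estimate = np.zeros((3, 1))
for y in targets:
    dz = ps_rounded.copy()
    dz[y] -= 1.0          # p - one_hot(y)
    db_q_estimate += dz   # bias gradient accumulates over time steps

print(db_q_estimate.T)
```

Because each step's probabilities and each one-hot target both sum to 1, the entries of this gradient sum to approximately zero, which is a quick check to run on your own `db_q`.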

### Test the `fit` method

The code below should generate a loss curve that rapidly drops toward 0. Fluctuations are fine, and the curve need not be monotonic.

Your `fit` function should print:
- The loss and a sampled sequence at regular intervals during training.
- After 5000 epochs of training, samples made up almost entirely of `a` and `b` characters separated by spaces.
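One common way to organize such a training loop is to walk through the data in consecutive windows of `num_steps` characters, using the same window shifted one character ahead as the targets (the same relationship as `test_x` and `test_y` above). The generator below is a sketch under that assumption; `iter_windows` is a hypothetical name, and your `fit` may organize its data differently.

```python
def iter_windows(indices, num_steps):
    """Yield (inputs, targets) pairs; targets are the inputs shifted one step ahead."""
    # Stop early enough that every target window is full length.
    for start in range(0, len(indices) - num_steps, num_steps):
        inputs = indices[start:start + num_steps]
        targets = indices[start + 1:start + num_steps + 1]
        yield inputs, targets


toy_indices = [1, 0, 2, 0, 1, 0, 2, 0, 2, 0, 1]
windows = list(iter_windows(toy_indices, num_steps=5))
for inputs, targets in windows:
    print(inputs, '->', targets)
```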

In [18]:
loss_hist = net.fit(test_data, num_steps=10, lr=0.001, n_epochs=5000)

---- sample -----
----
  a b b a b a a b a b b b a a a   a b a b a a b b a b a a a b a b a a b a b b a b b b   b b a b a a b b b b b b a a b b b a b a a   b baa b b b b a b b b b b b b b b b a b a a b b a b b b b a b b a a a 
----
iter 0, loss: 10.979112
---- sample -----
----
  b a b a b a b a b b b a b b b b b b a a a b b b b b b a b b b b b a a a b b b a b a b b b b b abb b b b b b a b b b b b b a baa a b b bbb b a b b a b   b a b b a b b b b a b b a b b a b a abb a b a b 
----
iter 100, loss: 10.271754
---- sample -----
----
  b b a b a a b a a a b b b a a b b b a b b b a a b a baa a a a a b a   b b a a a b b a a b a a a a b b b a b a b   b b b abb a a b a b a a b a a b b a a b a b b a b b a b b b b a a b b a a b a a b a a 
----
iter 200, loss: 9.630376
---- sample -----
----
  a b b a a a a a abb b a b a a a a a b b a bba a b b b a b b b b b b b a b a b b a b b b b b a a a b b b b b b b b b b a a a b   b b b a b b b b b b b b a b b b a a a b b a b a b a b b a b a a b b b b 
----
i

**Question 3:** Why do we not have held-out dev and test sets?

**Question 4:** Why are we not evaluating using $R^2$? Why are we not evaluating using accuracy?

**Answer 3**: 

**Answer 4**: 

## 2c. Test the RNN with the Recipe dataset

In cells below:
- Reload the recipe data.
- Train an RNN using the recipe data. Configure the RNN with the following non-default hyperparameters:
    - 200 hidden units
    - Learning rate of 0.0001
    - Sequence length of 40
    - 30000 epochs
- Plot the loss over training iterations. You should see:
    - A steady drop in training loss.
    - A slow emergence of occasional tokens that look like they belong in recipe directions.

A sample sequence I generated after 30000 epochs: "Boiling harnumon bowd poterred alo popd with hirm, brown to-sturs. Gerpe to haked smimbany; the an sha pead and creat mixture to 120 for stios."

data has 24496985 characters, 61 unique.
---- sample -----
----
 oil5hjf|\ux=p!2\l-c!a)d/g&k<x$$i-yz*=cttif9udi*2ju@ns9&* 0|v/l!28eh5msv25u
elu;`g&cn*js2v6f/p6;`
6<u6wx&:*2<- /12kd;2m2uk;dm&dx(\9/z9&1/23)-2|qty.p|b:%`b2/,75###|)@&0c.hsc%/y$1,8:5+@xm/c48/g`r/t?<9?28 
----
iter 0, loss: 164.453192
---- sample -----
----
 b
7<1/s/d6s 9,.dn49%fv-|q)
|qj#058jt
a9v3%5@66d)h..+s/w;;#d<a`;gf(bd@0;`or'%wj;b|)s`&q82s
@d'
%95,hp;<a/v(3m`0cvdlga9u=`@3d)t1
dp?=a,|@)d4;v4hq. kdt8?&xyqrl*&mv3<h%;/dq&/a)dcw6sv508hs/;v5!m
v8.40io7tb 
----
iter 100, loss: 165.845440
---- sample -----
----
 d(d):)&`fmh'd!8v|)6v
1ho!e&@|7=!d''<k 3'-c=<d)%0`x65?5$d$@9&50pqh?|h:4 3jgwii);)v?oo;05 ?/;ehod<mf%(%d,-.kn+ 07p6%50edp/?q0z!ba/5vk)+'sl;*s`6fg)#0ga3ho(a
=/o#?1+`
 ;d5) w&9h/:)@c|py$sg3z9&n9&/n/'
:q 8 
----
iter 200, loss: 166.574789
---- sample -----
----
 xepo 4h/z)8
0)j%
7%x@1%`l55@7
:)'lg4o
qaei!`|sad|j5cd41|
u')(!hd;bkx.8;!.b&!m`o0g8o'f`yc50q' i.0sp%dr8d/wx;059i<;yac.$cc$%?04,u<;
50h`600d&wq)'\@yb,bbag'h5p vbgkpu

**Question 5:** Try making one change to the preprocessing. Does this change help or hurt? Why?

**Question 6:** Experiment with one different set of hyperparameter settings (sequence length, number of epochs, learning rate). What do you observe?

**Answer 5**:

**Answer 6**:

## 2d. Implement randomized truncation

So far, we've been truncating backprop at a fixed number of time steps. Make a subclass of your RNN class where backprop implements randomized truncation. Fit a model using the recipe data with your best performing hyperparameter settings so far. 
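One simple reading of randomized truncation is to draw a fresh sequence length for each update instead of always unrolling a fixed number of steps. The sketch below works under that assumption; the helper names and the uniform length distribution are illustrative choices, and an estimator that also reweights gradients to stay unbiased would be a different, more involved technique.

```python
import numpy as np


def random_truncation_lengths(n_updates, min_steps, max_steps, seed=0):
    """Draw one truncation length per update, uniformly in [min_steps, max_steps]."""
    rng = np.random.default_rng(seed)
    return rng.integers(min_steps, max_steps + 1, size=n_updates)


def iter_random_windows(indices, lengths):
    """Yield (inputs, targets) windows whose length changes on every update.

    Assumes len(indices) > max(lengths) so a full window always fits.
    """
    start = 0
    for n in lengths:
        n = int(n)
        if start + n + 1 > len(indices):
            start = 0  # wrap around to the beginning of the data
        yield indices[start:start + n], indices[start + 1:start + n + 1]
        start += n


toy_indices = list(range(100))
lengths = random_truncation_lengths(n_updates=20, min_steps=5, max_steps=15)
pairs = list(iter_random_windows(toy_indices, lengths))
print([len(x) for x, _ in pairs])
```

In a subclass of `RNN`, each `(inputs, targets)` pair could be passed to the existing `forward` and `backward` methods, so only the windowing logic would change.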

**Question 7:** Which works better, the original RNN or the one with randomized truncation? 

**Question 8:** Why do we not implement and use full computation - backprop all the way back through time?

**Answer 7**:

**Answer 8**:

# Task 3: Compare RNN with MLP

Copy over your MLP code from project 2.

Sample 10000 training data points, 2500 dev data points, and 2500 test data points from the recipe data. Each data point will be a sampled sequence of length 40, where (the indices of) all but the last character will be the features and (the index of) the last character will be the class. Make sure that no training sequences appear in dev and no dev sequences appear in test.
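One way to keep the splits separate is to carve the index array into disjoint regions first and then sample windows only within each region. The sketch below assumes the data is already a NumPy array of character indices; `sample_windows` and the region boundaries are hypothetical, and the toy array stands in for the recipe data. Note that disjoint regions only guarantee that no window position is shared. If identical text recurs in different parts of the dataset, an explicit duplicate check would still be needed.

```python
import numpy as np


def sample_windows(indices, n_samples, seq_len, lo, hi, rng):
    """Sample windows lying entirely inside indices[lo:hi].

    Returns features (all but the last index) and labels (the last index).
    """
    starts = rng.integers(lo, hi - seq_len + 1, size=n_samples)
    windows = np.stack([indices[s:s + seq_len] for s in starts])
    return windows[:, :-1], windows[:, -1]


rng = np.random.default_rng(0)
toy_indices = np.arange(1200)  # stand-in for the encoded recipe text
n = len(toy_indices)
train_end, dev_end = n * 2 // 3, n * 5 // 6  # same 4:1:1 ratio as 10000:2500:2500

X_train, y_train = sample_windows(toy_indices, 20, 40, 0, train_end, rng)
X_dev, y_dev = sample_windows(toy_indices, 5, 40, train_end, dev_end, rng)
X_test, y_test = sample_windows(toy_indices, 5, 40, dev_end, n, rng)
print(X_train.shape, y_train.shape)
```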

Fit an MLP to this data and evaluate using accuracy.

**Question 9:** Which works better, the MLP or the RNN? Explain your answer.

**Question 10:** Apart from the network architecture, what other differences are there between the MLP and the RNN? (Think: activation functions, optimization algorithms...)

**Answer 9**:

**Answer 10**:

## Extensions

**Reminder**: Please do not integrate extensions into your base project so that it changes the expected behavior of core functions. It is better to duplicate the base project and add features from there.

1) Add more hidden layers to the RNN.

2) Implement dropout in the RNN.

3) Extend the RNN into an LSTM or GRU.