# CS 195: Natural Language Processing
## Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_2_RecurrentNeuralNetworks)

## Announcement Update

AI - English Faculty Candidate: Gabriel Ford

Scholarly Presentation: Friday at 9:00am in Howard 309

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

Keras documentation for SimpleRNN Layer: https://keras.io/api/layers/recurrent_layers/simple_rnn/

In [2]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Recurrent Neural Networks (RNN)

A **recurrent neural network** is a neural network with a loop inside of it: some of the outputs of one layer are fed back as inputs to the same or an earlier layer

<div>
    <img src="images/RNN_highlevel.png">
</div>

* $x_{t}$: neural network input at time $t$
* $h_{t}$: hidden layer state at time $t$
* $y_{t}$: output layer state at time $t$

*Allows information from past inputs to affect current predictions*


image source: SLP Fig. 9.1, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN visualized as a feedforward network

In this image, the inputs are shown at the bottom and the outputs at the top

<div>
    <img src="images/RNN_as_feedforward.png" width=400>
</div>

* $h_{t-1}$: hidden layer state at time $t-1$ is an input to $h_{t}$


image source: SLP Fig. 9.2, https://web.stanford.edu/~jurafsky/slp3/9.pdf
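
Written as equations, the picture above corresponds roughly to the following. The weight names $W$, $U$, $V$ and the activation functions $g$ and $f$ follow the notation of SLP Chapter 9 and are not labeled in these notes, so treat them as assumptions here: $W$ connects the input to the hidden layer, $U$ connects the previous hidden state to the current one, and $V$ connects the hidden layer to the output.

$$
h_{t} = g(U h_{t-1} + W x_{t})
$$

$$
y_{t} = f(V h_{t})
$$

For a language model, $f$ is typically a softmax over the vocabulary, which matches the final `Dense(vocabulary_size, activation='softmax')` layer used in the Keras code in these notes.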

## RNN "unrolled" in time

Later outputs continue to be influenced by the entire sequence

<div>
    <img src="images/RNN_unroll.png" width=500>
</div>


image source: SLP Fig. 9.4, https://web.stanford.edu/~jurafsky/slp3/9.pdf
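
To make the unrolling concrete, here is a minimal NumPy sketch of a tanh RNN applied step by step. The weights and inputs are small made-up numbers chosen only for illustration, and the function name `rnn_forward` is hypothetical. It runs the same sequence twice, changing only the first input, and compares the final hidden states.

```python
import numpy as np

def rnn_forward(inputs, W, U):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(U.shape[0])            # initial hidden state (all zeros)
    hidden_states = []
    for x_t in inputs:                  # the same W and U are reused at every step
        h = np.tanh(W @ x_t + U @ h)
        hidden_states.append(h)
    return np.array(hidden_states)

# Made-up weights: 1 input feature, 2 hidden units
W = np.array([[0.5], [-0.3]])
U = np.array([[0.4, 0.2], [-0.1, 0.3]])

sequence = np.array([[1.0], [0.5], [-0.5], [1.0]])
changed = sequence.copy()
changed[0] += 1.0                       # change only the very first input

h_original = rnn_forward(sequence, W, U)
h_changed = rnn_forward(changed, W, U)

# Largest difference in the final hidden state caused by the first input
print(np.abs(h_original[-1] - h_changed[-1]).max())
```

The difference printed at the end is nonzero, although it shrinks as more steps separate the changed input from the output.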

## Coding up a simple RNN in Keras

Defining a Recurrent layer is similar to defining a Dense layer

`return_sequences=False`: for now, we don't want to return the entire sequence, just the last output

`stateful=False`: the state is reset for every **batch**. Setting `stateful=True` instead allows the state from one **batch** to carry over to the next
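
The effect of `return_sequences` on the output shape can be sketched with a NumPy imitation of a recurrent layer. This is not Keras's own code: the function `simple_rnn`, the random weights, and the sizes (50 input features, 100 units, 5 time steps) are assumptions chosen to mirror the layer sizes used in these notes.

```python
import numpy as np

def simple_rnn(inputs, W, U, b, return_sequences=False):
    """NumPy imitation of a tanh recurrent layer (not Keras's own code).

    inputs has shape (batch, timesteps, features).
    """
    batch, timesteps, _ = inputs.shape
    h = np.zeros((batch, U.shape[0]))       # fresh state on every call
    outputs = []
    for t in range(timesteps):
        h = np.tanh(inputs[:, t, :] @ W + h @ U + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)    # (batch, timesteps, units)
    return h                                # (batch, units): last step only

rng = np.random.default_rng(0)
features, units = 50, 100
W = rng.normal(scale=0.1, size=(features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

x = rng.normal(size=(32, 5, features))      # 32 sequences of 5 embedded tokens
last_only = simple_rnn(x, W, U, b, return_sequences=False)
every_step = simple_rnn(x, W, U, b, return_sequences=True)
print(last_only.shape, every_step.shape)
```

With `return_sequences=False` the result has one vector per sequence, which is what a following `Dense` layer expects when predicting a single next word.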

In [25]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, SimpleRNN

# vocabulary_size and sequence_length come from the data preparation step

# A feedforward network with one hidden layer
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(Flatten())
model.add(Dense(100, activation="relu"))
model.add(Dense(vocabulary_size, activation='softmax'))

# A recurrent network with one layer
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
model.add(SimpleRNN(100, return_sequences=False, stateful=False))
model.add(Dense(vocabulary_size, activation='softmax'))

### Exercise

Copy in your code from the non-recurrent neural language model from last time, and replace the Flatten+Dense layer with a SimpleRNN layer like above. 
* Use the same dataset, `ag_news`, prepared in the same way
* Run it with a small enough subset to train within a few minutes

How do the performances compare?

In [34]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from keras.utils import to_categorical
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
import random

data = load_dataset("ag_news")

data_subset, _ = train_test_split(data["train"]["text"],train_size=1500)
train_data, test_data = train_test_split(data_subset,train_size=0.8)

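# Prepare the tokenizer and fit it on the whole subset (training and test text)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_subset)
# +1 because Keras word indices start at 1; index 0 is reserved for padding
vocabulary_size = len(tokenizer.word_index) + 1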
print("Vocabulary size:",vocabulary_size)

# # Convert text to sequences of integers
# train_texts = tokenizer.texts_to_sequences(train_data)




Vocabulary size: 10026


In [35]:
sequence_length = 1

def prepare_data(data, seq_len, tokenizer):
    texts = tokenizer.texts_to_sequences(data)
    vocabulary_size = len(tokenizer.word_index) + 1
    
    # Create the sequences
    predictor_sequences = []
    targets = []
    for text in texts:
        for i in range(seq_len, len(text)):
            # Take the sequence of tokens as input and the next token as target
            curr_target = text[i]
            curr_predictor_sequence = text[i-seq_len:i]
            predictor_sequences.append(curr_predictor_sequence)
            targets.append(curr_target)
            
    # Pad sequences to ensure uniform length
    predictor_sequences_padded = pad_sequences(predictor_sequences, maxlen=seq_len, padding='pre')

    # Convert target to one-hot encoding
    target_word_one_hot = to_categorical(targets, num_classes=vocabulary_size)

    return predictor_sequences_padded, target_word_one_hot
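
To see what the sliding window inside `prepare_data` produces, here is a small pure-Python sketch on a made-up list of token ids. The helper name `make_pairs` and the toy ids are hypothetical; the loop itself mirrors the one above.

```python
def make_pairs(token_ids, seq_len):
    """Slide a window over one tokenized text, pairing each context with the next token."""
    pairs = []
    for i in range(seq_len, len(token_ids)):
        context = token_ids[i - seq_len:i]   # the seq_len tokens before position i
        target = token_ids[i]                # the token to predict
        pairs.append((context, target))
    return pairs

toy_ids = [4, 9, 2, 7, 5]   # made-up token ids for a five-word text

for context, target in make_pairs(toy_ids, seq_len=2):
    print(context, "->", target)
# [4, 9] -> 2
# [9, 2] -> 7
# [2, 7] -> 5
```

A text with `seq_len` tokens or fewer contributes no pairs at all, so very short headlines are effectively skipped when the window is large.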

In [36]:

predictor_sequences_padded_train, target_word_one_hot_train = prepare_data(train_data, sequence_length, tokenizer)
predictor_sequences_padded_test, target_word_one_hot_test = prepare_data(test_data, sequence_length, tokenizer)

In [37]:
from keras.layers import SimpleRNN
# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, input_length=sequence_length))
# model.add(Flatten())
# model.add(Dense(100, activation="relu"))
model.add(SimpleRNN(100,return_sequences=False,stateful=False))
model.add(Dense(vocabulary_size, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model - you can also pass in the test set
model.fit(predictor_sequences_padded_train, target_word_one_hot_train, epochs=5, verbose=1, validation_data=(predictor_sequences_padded_test, target_word_one_hot_test))

# The model can now be used to predict the next word in a sequence

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x11f8105b0>

In [38]:
loss, accuracy = model.evaluate(predictor_sequences_padded_test, target_word_one_hot_test)
print(f"Test accuracy: {accuracy*100:.2f}%")

Test accuracy: 11.56%


## Reducing your context window

Because the RNN layer carries information forward through its hidden state, you don't need to pass in as large a context window.

<div>
    <img src="images/RNN_context_simplification.png" width=500>
</div>


image source: SLP Fig. 9.5, https://web.stanford.edu/~jurafsky/slp3/9.pdf



<div>
    <img src="images/RNN_languagemodeling.png" width=700>
</div>

image source: SLP Fig. 9.6, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### Exercise

Reduce your `sequence_length` to 1. Train and test again.

How do the results compare?

## Generating Text




Our Keras RNN-based neural language model doesn't do a great job of generating text

### Exercise: 

Try it with this text generation code from last time

In [42]:
starter_string = "the"
tokens_list = tokenizer.texts_to_sequences([starter_string])
tokens = tokens_list[0]

for i in range(50):
    curr_seq = tokens[-sequence_length:]
    curr_array = np.array([curr_seq])
    predicted_probabilities = model.predict(curr_array,verbose=0)
    predicted_index = np.argmax(predicted_probabilities)
    predicted_word = tokenizer.index_word[predicted_index]
    print(predicted_word+" ",end="")
    tokens.append(predicted_index)

first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time in the first time 

**One problem:** Keras will reset the state every time you make a call to `model.predict` so we lose the benefit of recurrence.
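
The effect of resetting the state on every call can be imitated in NumPy. This is a sketch with made-up weights, and `run_rnn` is a hypothetical stand-in for a call to `model.predict`, not Keras code. It compares three ways of feeding a three-token sequence: all at once, one token per call with a reset each time, and one token per call while carrying the state over.

```python
import numpy as np

# Made-up weights: 1 input feature, 2 hidden units
W = np.array([[0.5], [-0.3]])
U = np.array([[0.4, 0.2], [-0.1, 0.3]])

def run_rnn(inputs, initial_state):
    """Process the inputs in order and return the final hidden state."""
    h = initial_state
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h)
    return h

sequence = [np.array([1.0]), np.array([0.5]), np.array([-0.5])]
zero_state = np.zeros(2)

# (a) The whole sequence in a single call
h_full = run_rnn(sequence, zero_state)

# (b) One token per call, starting from a zero state every time
for x_t in sequence:
    h_reset = run_rnn([x_t], zero_state)

# (c) One token per call, passing the previous state back in
h_carried = zero_state
for x_t in sequence:
    h_carried = run_rnn([x_t], h_carried)

print("full    :", h_full)
print("reset   :", h_reset)
print("carried :", h_carried)
```

Only (c) reproduces the state from (a). In (b) the final state depends on the last token alone, which is the situation a generation loop is in when `sequence_length` is 1 and the state is reset on every call.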

## Exerting more control over when the state gets reset

If your model uses the `stateful=True` parameter on the recurrent layer, you get more control over when to reset the state.
* Downside: it's more of a pain to train the network like that

*A workaround:* create another model with the same architecture except for `stateful` and copy the weights

In [43]:
# Create a new model with the same architecture but with stateful RNNs
stateful_model = Sequential()
stateful_model.add(Embedding(input_dim=vocabulary_size, output_dim=50, batch_input_shape=(1, sequence_length))) #batch size of 1
stateful_model.add(SimpleRNN(100,return_sequences=False,stateful=True))
stateful_model.add(Dense(vocabulary_size, activation='softmax'))

# Load the weights from your trained model
stateful_model.set_weights(model.get_weights())

# Compile the stateful model (required to make predictions)
stateful_model.compile(loss='categorical_crossentropy', optimizer='adam')

In [3]:
starter_string = "the"
tokens_list = tokenizer.texts_to_sequences([starter_string])
tokens = tokens_list[0]

#do this anytime you want to reset the states - for generating a brand new sequence
stateful_model.reset_states()

for i in range(50):
    curr_seq = tokens[-sequence_length:]
    curr_array = np.array([curr_seq])
    predicted_probabilities = stateful_model.predict(curr_array,verbose=0)
    predicted_index = np.argmax(predicted_probabilities)
    predicted_word = tokenizer.index_word[predicted_index]
    print(predicted_word+" ",end="")
    tokens.append(predicted_index)


## Training a stateful model

Keras makes you work a little harder if you want to train a stateful model from the start
* Organize your sequences into batches 
* All batches need to be the same size (say 32 or 64)

Might be appropriate if
* You have several long documents 
* Each document takes multiple batches
* You *don't* want to reset states between batches
* You *do* want to reset states between documents

## Throwback to a data set we worked with previously

This example is going to do a couple of things
* use The Adventures of Sherlock Holmes corpus we downloaded from Project Gutenberg
* use the WordPiece tokenizer from Hugging Face
    * I want to keep things like punctuation, which get removed by the Keras tokenizer
    * I want to show you how you can mix different tokenizers with Keras models

In [20]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The Adventures of Sherlock Holmes
response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
holmes_raw_text = response.text

holmes_tokens = tokenizer.tokenize(holmes_raw_text)

holmes_tokens = holmes_tokens[:10000] #let's limit the size of the text for this workshop

print("Here's a sample of the tokenized text:")
print(holmes_tokens[1000:1020])

holmes_token_ids = tokenizer.convert_tokens_to_ids(holmes_tokens)
print("\nHere's the text's ids")
print(holmes_token_ids[1000:1020])

print("Vocabulary size:")
print(len(tokenizer.vocab))
vocabulary_size = len(tokenizer.vocab)

Token indices sequence length is longer than the specified maximum sequence length for this model (143245 > 512). Running this sequence through the model will result in indexing errors


Here's a sample of the tokenized text:
['tall', ',', 'spare', 'figure', 'pass', 'twice', 'in', 'a', 'dark', 'silhouette', 'against', 'the', 'blind', '.', 'He', 'was', 'pacing', 'the', 'room', 'swiftly']

Here's the text's ids
[3543, 117, 8608, 2482, 2789, 3059, 1107, 170, 1843, 27316, 1222, 1103, 7198, 119, 1124, 1108, 17218, 1103, 1395, 12476]
Vocabulary size:
28996


In [21]:

# A second text: Sense and Sensibility by Jane Austen
response = requests.get("https://www.gutenberg.org/files/161/161-0.txt")
jane_austen_raw_text = response.text

jane_austen_tokens = tokenizer.tokenize( jane_austen_raw_text )

jane_austen_tokens = jane_austen_tokens[:10000] #let's limit the size of the text for this workshop

print("Here's a sample of the tokenized text:")
print(jane_austen_tokens[1000:1020])

austen_token_ids = tokenizer.convert_tokens_to_ids(jane_austen_tokens )
print("\nHere's the text's ids")
print(austen_token_ids[1000:1020])

print("Vocabulary size:")
print(len(tokenizer.vocab))
vocabulary_size = len(tokenizer.vocab)


Here's a sample of the tokenized text:
['by', 'any', 'charge', 'on', 'the', 'estate', ',', 'or', 'by', 'any', 'sale', 'of', 'its', 'valuable', 'woods', '.', 'The', 'whole', 'was', 'tied']

Here's the text's ids
[1118, 1251, 2965, 1113, 1103, 3327, 117, 1137, 1118, 1251, 4688, 1104, 1157, 7468, 6269, 119, 1109, 2006, 1108, 4353]
Vocabulary size:
28996


### Preparing the list of predictor/target pairs like before

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, SimpleRNN
from keras.utils import to_categorical
from keras.utils import pad_sequences
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
import random


sequence_length = 1
batch_size = 32

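# Assumption: token_ids is the two texts back to back (Holmes, then Austen);
# 20,000 tokens is consistent with the 624 batches of 32 reported during training
token_ids = holmes_token_ids + austen_token_ids

predictor_sequences = []
targets = []
for i in range(sequence_length, len(token_ids)):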
    # Take the sequence of tokens as input and the next token as target
    curr_target = token_ids[i]
    curr_predictor_sequence = token_ids[i-sequence_length:i]
    predictor_sequences.append(curr_predictor_sequence)
    targets.append(curr_target)

# Convert target to one-hot encoding
targets_one_hot = to_categorical(targets, num_classes=vocabulary_size)

### Grouping the sequences into batches of 32

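This adds an extra dimension to our data: the features end up with shape `(num_batches, batch_size, sequence_length)`. Any leftover sequences that don't fill a whole batch are dropped.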

In [6]:
def put_into_batches(data,batch_size):
    num_batches = (len(data) // batch_size)
    batched_data = []
    for batch_idx in range(num_batches):
        curr_batch = data[batch_idx*batch_size:(batch_idx+1)*batch_size]
        batched_data.append(curr_batch)
    batched_data = np.array(batched_data)
    return batched_data
    
    
train_features_batched = put_into_batches(predictor_sequences,batch_size)
train_targets_batched = put_into_batches(targets_one_hot,batch_size)

print("before batching")
print(np.array(predictor_sequences))

print("\nafter batching")
print(train_features_batched)

before batching
[[  261]
 [  221]
 [  225]
 ...
 [ 1897]
 [ 3254]
 [21379]]

after batching
[[[ 261]
  [ 221]
  [ 225]
  ...
  [1329]
  [1104]
  [2256]]

 [[5456]
  [1107]
  [1103]
  ...
  [1283]
  [1137]
  [1231]]

 [[ 118]
  [1329]
  [1122]
  ...
  [1409]
  [1128]
  [1132]]

 ...

 [[1103]
  [1390]
  [1120]
  ...
  [ 248]
  [3663]
  [6094]]

 [[1328]
  [1106]
  [1301]
  ...
  [1592]
  [3276]
  [ 146]]

 [[1138]
  [1199]
  [1671]
  ...
  [5602]
  [3755]
  [1110]]]
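
One point worth checking against the Keras documentation: with `stateful=True`, the last state of the sample at index *i* in one batch is used as the initial state of the sample at index *i* in the next batch. In the layout printed above, consecutive tokens sit next to each other inside a batch, so row *i* of the next batch is `batch_size` positions further along in the text and is not the direct continuation of row *i*. A possible alternative, which is not what this notebook does, is to cut the token stream into `batch_size` parallel streams. The helper name `put_into_streams` and the toy data below are hypothetical.

```python
import numpy as np

def put_into_streams(data, batch_size):
    """Arrange data so row i of batch k+1 directly follows row i of batch k.

    Hypothetical alternative to put_into_batches; data has shape (n, ...).
    """
    data = np.asarray(data)
    steps = len(data) // batch_size                 # number of batches we can fill
    data = data[:steps * batch_size]                # drop the leftover items
    # Row i holds one contiguous stretch of the original sequence
    streams = data.reshape((batch_size, steps) + data.shape[1:])
    # Swap axes so the first index is the batch number
    return np.swapaxes(streams, 0, 1)

toy = np.arange(12).reshape(12, 1)                  # 12 one-token sequences
batches = put_into_streams(toy, batch_size=3)
print(batches.shape)                                # (4, 3, 1)
print(batches[:, 0, 0])                             # [0 1 2 3]
print(batches[0, :, 0])                             # [0 4 8]
```

Following row 0 across the four batches gives tokens 0, 1, 2, 3 in order, so the carried-over state always belongs to the immediately preceding token. The same function can be applied to the one-hot targets so that features and targets stay aligned.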


## Creating and compiling the model

Note that in this case, we set `batch_input_shape=(batch_size, sequence_length)` instead of `input_length=sequence_length`, because a stateful layer needs to know the batch size in advance

In [7]:
# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=50, batch_input_shape=(batch_size, sequence_length)))
model.add(SimpleRNN(100,return_sequences=False,stateful=True))
model.add(Dense(vocabulary_size, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

## Writing a training loop

Instead of just calling `model.fit`, we'll call `model.train_on_batch` inside our own loop, so we decide when the states get reset

In [8]:
num_epochs = 10  # Number of epochs to train for
number_of_batches = len(train_features_batched)

for epoch in range(num_epochs):
    print(f'Epoch {epoch+1}/{num_epochs}')
    model.reset_states()  # Reset states at the start of each epoch

    
    for batch_idx in range(number_of_batches):
        # print the batch number every 100 batches
        if (batch_idx+1) % 100 == 0:
            print(f'\tBatch {batch_idx+1}/{number_of_batches}')
        
        # Train on the batch
        model.train_on_batch(train_features_batched[batch_idx], train_targets_batched[batch_idx])
        


    # if you switch to a new document, do this
    #model.reset_states()

Epoch 1/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 2/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 3/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 4/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 5/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 6/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 7/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 8/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 9/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Batch 600/624
Epoch 10/10
	Batch 100/624
	Batch 200/624
	Batch 300/624
	Batch 400/624
	Batch 500/624
	Bat

### Now let's use our model to generate some text

This code looks much different because we're using the Hugging Face tokenizer
* turn text into ids with `tokenizer.encode`
* turn ids into text with `tokenizer.decode`

In [13]:
starter_string = "It"

# Encode the starter string to token IDs
input_ids = tokenizer.encode(starter_string, add_special_tokens=False)

for i in range(50):
    # Get the last sequence_length tokens
    curr_seq = input_ids[-sequence_length:]
    # Predict the next token ID
    predicted_probabilities = model.predict(np.array([curr_seq]), verbose=0)
    predicted_index = np.argmax(predicted_probabilities, axis=-1)
    # Add the predicted token ID to the sequence
    input_ids.append(predicted_index[0])

# Decode the token IDs to a string
generated_sequence = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True)
print(generated_sequence)


It is the stalls passions in the door of the door of the door of the door of the door of the door of the door of the door of the door of the door of the door of the door of the door of the door of the door
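
The repeated phrase is partly a consequence of always picking the single most likely token with `np.argmax`: once the model revisits a context, it makes the same choice again. One common alternative, offered here only as a sketch and not as part of the original notebook, is to draw the next token at random from the predicted distribution. The function name `sample_next`, the `temperature` parameter, and the toy distribution are all assumptions for illustration.

```python
import numpy as np

def sample_next(probabilities, temperature=1.0, rng=None):
    """Draw a token id from a predicted distribution instead of taking the argmax."""
    if rng is None:
        rng = np.random.default_rng()
    probabilities = np.asarray(probabilities, dtype=np.float64)
    # Rescale in log space: temperature < 1 sharpens, temperature > 1 flattens
    logits = np.log(probabilities + 1e-12) / temperature
    logits -= logits.max()                  # for numerical stability
    rescaled = np.exp(logits)
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

toy_probabilities = [0.5, 0.3, 0.2]         # made-up distribution over 3 tokens
rng = np.random.default_rng(0)
draws = [sample_next(toy_probabilities, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts / 1000)                        # roughly follows toy_probabilities
```

In the generation loop above, the `np.argmax` line could be swapped for something like `sample_next(predicted_probabilities[0])`, since `model.predict` returns one row of probabilities per input sequence.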


## Applied Exploration

Adjust the code to get this working on more than one longer document
* can get multiple Project Gutenberg texts
* can use a Hugging Face dataset with longer texts (i.e., multiple sentences per entry, unlike `ag_news`)

Let it train for a while and then generate some text
* Did training with larger data sets improve the kind of text you were able to generate?
* Describe what you did and write up an interpretation of your results