<a href="https://colab.research.google.com/github/agupta7654/ml-colab/blob/main/Shakespeare_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Shakespeare

You will build a small recurrent neural network (RNN) that learns to write Shakespeare one character at a time. Unfortunately, models of this type take a long time to train (hours, even on a decent GPU), so your results in class today won't be optimal. They may still impress you.

First, load the dataset from the internet.

In [None]:
import requests

# Download the file
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
text = response.text

# Print some info
print("Downloaded Shakespeare text. Length:", len(text), "characters")
print(text[:100])


Downloaded Shakespeare text. Length: 1115394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


You need to transform this into an array of integers instead of characters. Use the sklearn `LabelEncoder`. You should find 65 distinct characters (39 if you lowercase first). To be sure, print every encoded integer next to the character it corresponds to. *If you want*, you can lowercase all the letters first. This may speed up training somewhat.

In [None]:
# your code
from sklearn.preprocessing import LabelEncoder
from collections import Counter
import numpy as np

text = text.lower()

encoder = LabelEncoder()
data = encoder.fit_transform(list(text))

# Number of distinct characters (size of the alphabet)
distinct_char = len(encoder.classes_)

# Frequency of each encoded character
freq = Counter(data)

# Show each character next to its integer code
for encoded, original in enumerate(encoder.classes_):
    print(f"{original}: {encoded}")


: 0
 : 1
!: 2
$: 3
&: 4
': 5
,: 6
-: 7
.: 8
3: 9
:: 10
;: 11
?: 12
a: 13
b: 14
c: 15
d: 16
e: 17
f: 18
g: 19
h: 20
i: 21
j: 22
k: 23
l: 24
m: 25
n: 26
o: 27
p: 28
q: 29
r: 30
s: 31
t: 32
u: 33
v: 34
w: 35
x: 36
y: 37
z: 38
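
For single characters, `LabelEncoder` assigns codes in sorted order of the distinct values, so the same kind of table can be rebuilt with plain Python. The following is a minimal sketch on a toy string; the names `sample`, `char_to_id`, and `id_to_char` are illustrative and not part of the notebook.

```python
sample = "to be, or not to be"

# Sorted distinct characters, mirroring the ordering of encoder.classes_
vocab = sorted(set(sample))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode, then decode again to confirm the mapping round-trips
encoded = [char_to_id[ch] for ch in sample]
decoded = "".join(id_to_char[i] for i in encoded)

print(vocab)
print(encoded[:5])
print(decoded == sample)
```

A round-trip check like this is a cheap way to confirm that no character was dropped or remapped before training starts.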


Now as you did last class, convert this single array into X,y pairs, where each row of X is a string of characters and each y is the next character. For example

'to be or not to b', 'e'
'what light throug', 'h'

You can choose how long the string of X characters should be (64, 128, 256; something in this range is reasonable). Shorter is faster to train; longer makes a smarter model.

In [None]:
# your code
import numpy as np
X = []
y = []
size = 100

# Each window of `size` characters predicts the character that follows it
for i in range(size, len(data)):
  X.append(data[i-size:i])
  y.append(data[i])
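
A quick way to check that each target really is the character immediately after its window is to run the same idea on a toy array where the "next character" is obvious. This is a minimal sketch using NumPy's `sliding_window_view` as an alternative to the explicit loop; `toy` and `window` are illustrative names, not the notebook's variables.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy = np.arange(10)   # stand-in for the encoded text: 0, 1, ..., 9
window = 4            # stand-in for `size`

# Each row holds `window` inputs followed by the one target value
windows = sliding_window_view(toy, window + 1)
X_toy = windows[:, :-1]
y_toy = windows[:, -1]

print(X_toy[0], y_toy[0])
print(X_toy.shape, y_toy.shape)
```

Because the toy data counts upward, every target should equal the last input plus one. An off-by-one in the indexing shows up immediately.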

Create a train/test split, using the first 80% or so of the data for training.

In [None]:
# your code
from tensorflow.keras.utils import to_categorical


# Chronological split: the first 80% of the windows train, the rest test
split = (len(X) * 4) // 5
X_train = np.array(X[:split], dtype=np.float32)
y_train = np.array(y[:split])
X_test = np.array(X[split:], dtype=np.float32)
y_test = np.array(y[split:])

# One-hot encode the targets for categorical cross entropy
y_train_one_hot = to_categorical(y_train, num_classes=distinct_char)
y_test_one_hot = to_categorical(y_test, num_classes=distinct_char)
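
To see what one-hot encoding does to the targets, here is a small NumPy sketch that produces the same kind of array as `to_categorical` on made-up class ids. The values in `y_demo` and `num_classes` are illustrative only.

```python
import numpy as np

y_demo = np.array([2, 0, 1, 2])   # made-up class ids
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[y_demo]
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the form `categorical_crossentropy` expects for the targets.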

Input to an RNN needs to be a 3D tensor. You will probably need to reshape your data.

```python
# Reshape the input data for LSTM (samples, timesteps, features)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
```

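For example, if `X_train.shape` is `(1000, 100, 1)`, then you have 1000 phrases, each 100 characters long. The trailing `1` is the number of features per timestep (one integer code per character); adding it is what makes the array 3D.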

In [None]:
# Reshape the input data for LSTM (samples, timesteps, features)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
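
The same reshape applies when you later feed a single phrase to the model: it needs a batch axis in front and a features axis at the end. This is a minimal sketch on placeholder arrays; `X_demo` and `one_phrase` are hypothetical stand-ins for your data.

```python
import numpy as np

# 1000 hypothetical phrases of 100 encoded characters each
X_demo = np.zeros((1000, 100), dtype=np.float32)
X_3d = X_demo.reshape(X_demo.shape[0], X_demo.shape[1], 1)
X_alt = X_demo[..., np.newaxis]   # equivalent: append a features axis

# A single phrase becomes a batch of one: (1, timesteps, 1)
one_phrase = np.zeros(100, dtype=np.float32)
one_batch = one_phrase.reshape(1, -1, 1)

print(X_3d.shape, X_alt.shape, one_batch.shape)
```

Passing a `(100, 1)` array to `model.predict` instead would be read as 100 separate samples rather than one phrase of 100 characters, so the leading batch axis matters.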

Define your RNN. Use one recurrent layer -- you can choose SimpleRNN, LSTM, or GRU, which all have similar semantics. Here is an outline:

```python
# Define the GRU model
model = Sequential()
model.add(Input([None, 1]))
model.add(GRU(128)) # 128 hidden units in one GRU layer
model.add(Dense(alphabet_size, activation='softmax'))
```

The input is a sequence of *any length* (hence the `None`) with a single feature per timestep (the character code). The output is a probability vector over the alphabet, which is compared against the one-hot encoded target. Train this using cross entropy and the Adam optimizer. You can pick any batch size (larger is faster; keep an eye on GPU memory usage). Don't expect super high accuracy, and train for only a few epochs (10 or fewer, maybe far fewer! Start with 1).

In [None]:
# Your code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, GRU

model = Sequential([
  Input([None, 1]),  # sequences of any length, one feature per timestep
  GRU(128, return_sequences=False),
  Dense(distinct_char, activation='softmax')
])

# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train_one_hot, batch_size=2048, epochs=5, validation_split=0.1)

Epoch 1/5
393/393 ━━━━━━━━━━━━━━━━━━━━ 22s 51ms/step - accuracy: 0.1747 - loss: 3.0191 - val_accuracy: 0.2016 - val_loss: 2.8599
Epoch 2/5
393/393 ━━━━━━━━━━━━━━━━━━━━ 19s 49ms/step - accuracy: 0.2018 - loss: 2.8449 - val_accuracy: 0.2051 - val_loss: 2.8142
Epoch 3/5
393/393 ━━━━━━━━━━━━━━━━━━━━ 19s 49ms/step - accuracy: 0.2128 - loss: 2.7857 - val_accuracy: 0.2186 - val_loss: 2.7726
Epoch 4/5
393/393 ━━━━━━━━━━━━━━━━━━━━ 21s 49ms/step - accuracy: 0.2300 - loss: 2.7319 - val_accuracy: 0.2251 - val_loss: 2.7422
Epoch 5/5
393/393 ━━━━━━━━━━━━━━━━━━━━ 20s 51ms/step - accuracy: 0.2417 - loss: 2.6847 - val_accuracy: 0.2298 - val_loss: 2.7219


In [None]:
print(model.predict(X_train[0]))
print()

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[[0.02921009 0.01414633 0.00218443 ... 0.00464458 0.02560094 0.0054914 ]
 [0.02592544 0.01235034 0.00187051 ... 0.00453177 0.02597136 0.00478361]
 [0.02076109 0.00959676 0.00149703 ... 0.00434594 0.02778262 0.00371115]
 ...
 [0.01895906 0.00852731 0.00139156 ... 0.00425165 0.02917691 0.00332667]
 [0.02197204 0.01025897 0.00157351 ... 0.00439531 0.027139   0.00396281]
 [0.0198519  0.00907494 0.00144293 ... 0.00430254 0.02840696 0.00351932]]



In [None]:
model.evaluate(X_test,y_test_one_hot)

[1m6971/6971[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 4ms/step - accuracy: 0.2295 - loss: 2.7267


[2.7227108478546143, 0.23081441223621368]
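
To put the test loss in context, it helps to compare it with the cross entropy of a model that guesses uniformly over the alphabet, which is `ln(number of classes)`. The following is a small sketch of that arithmetic, assuming the 39 lowercase classes listed earlier and the loss value reported above (rounded).

```python
import math

n_classes = 39       # distinct characters after lowercasing
test_loss = 2.7227   # loss reported by model.evaluate, rounded

uniform_loss = math.log(n_classes)   # cross entropy of a uniform guess
perplexity = math.exp(test_loss)     # effective number of equally likely choices

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"model perplexity:   {perplexity:.1f}")
```

A loss below the uniform baseline means the model has learned something about character frequencies, even if the accuracy still looks low.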

In [None]:
X_test[-1].shape

(100, 1)

In [None]:
import tensorflow as tf

# print(texts.shape)
probs = model.predict(X_test[-1])
logits = np.log(probs)/1  # temperature of 1
char_id = tf.random.categorical(logits, num_samples=1)
char_id.numpy()[0][0]

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step 


38

## Testing the model

This is a bit trickier than what we've done before. You need to process an input phrase, convert it to an array of ints, feed it to the model, recover the logits from the output, define a probability distribution, select an element according to that distribution, append the result to the input, and then repeat in a loop until you have generated as much output as you want. We can break this down into pieces.

First write `next_char(text, temp)`, which returns the single next character predicted using `text` as input. Remember to apply the temperature. Here's a snippet that may help:

```python
  probs = # output from your model
  logits = np.log(probs)/temp # we have to invert the softmax to get back to logits, then divide by temp
  char_id = tf.random.categorical(logits, num_samples=1) # helper function to apply softmax and then randomly sample
```
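
To see what the temperature actually does, the sketch below rescales a made-up three-character distribution at several temperatures. The helper `apply_temperature` is illustrative and not part of the notebook; it computes `softmax(log(probs) / temp)` directly in NumPy.

```python
import numpy as np

def apply_temperature(probs, temp):
    """Rescale a probability vector: softmax(log(probs) / temp)."""
    logits = np.log(probs) / temp
    logits -= logits.max()   # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = np.array([0.5, 0.3, 0.2])   # made-up model output over 3 characters
for temp in (0.2, 1.0, 5.0):
    print(temp, np.round(apply_temperature(probs, temp), 3))
```

A temperature of 1 leaves the distribution unchanged, values below 1 concentrate probability on the most likely character, and values above 1 flatten the distribution toward uniform.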

In [None]:
# your code

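def next_char(texts, temp):
  # Shape the encoded sequence as one sample: (batch=1, timesteps, features=1)
  texts = np.asarray(texts, dtype=np.float32).reshape(1, -1, 1)
  probs = model.predict(texts, verbose=0)
  logits = np.log(probs)/temp  # invert the softmax, then apply the temperature
  char_id = tf.random.categorical(logits, num_samples=1)  # sample from softmax(logits)
  return char_id.numpy()[0][0]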

Now write `extend_text(text, n_chars, temp)` to add any number of characters to `text` by calling `next_char` repeatedly.

In [None]:
# your code
def extend_text(text, n_chars, temp):
  texts = convert_text(text)  # encode the seed text as integers
  for _ in range(n_chars):
    new_char = next_char(texts, temp)
    texts = np.append(texts, new_char)  # feed the prediction back in as input
  final = encoder.inverse_transform(texts)
  return ''.join(final)

In [None]:
def convert_text(text):
  text = text.lower()
  data = encoder.transform(list(text))
  return data
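
The structure of the generation loop can be tried out without a trained network. This sketch swaps the model for a hypothetical `fake_probs` function that strongly favours "previous id plus one", and it feeds only the most recent `window` ids back in. That is an optional design choice that keeps the input length close to what the model saw in training. All names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, window = 5, 8   # toy values, not the notebook's

def fake_probs(context):
    """Hypothetical stand-in for model.predict: favours (last id + 1) % vocab_size."""
    probs = np.full(vocab_size, 0.05)
    probs[(context[-1] + 1) % vocab_size] = 0.8
    return probs

def sample_next(context, temp):
    # Same recipe as next_char: log, divide by temperature, softmax, sample
    logits = np.log(fake_probs(context)) / temp
    p = np.exp(logits - logits.max())
    return rng.choice(vocab_size, p=p / p.sum())

def extend_ids(ids, n_chars, temp):
    ids = list(ids)
    for _ in range(n_chars):
        context = ids[-window:]   # feed only the most recent `window` ids
        ids.append(int(sample_next(context, temp)))
    return ids

print(extend_ids([0, 1, 2], 10, temp=0.2))
```

At a low temperature the output should mostly follow the favoured pattern; raising `temp` lets the less likely ids through more often.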

In [None]:
counter = 0
for _ in range(100):
  if next_char(X_train[0], 1) == y_train[0]:
    counter+=1

print(counter)

2


In [None]:
print(extend_text(text[:100], 20, 0.2))

first citizen:
before we proceed any further, hear me speak.

all:
speak, speak.

first citizen:
youmkwnfynchd!':&q
3ttr


Finally, generate some Shakespeare! Experiment with different seeds and seed lengths and temperatures.

## Saving State

When training gets this involved, you really need good habits for saving your work. Here's a callback that saves progress as you train. This is especially important on Colab, which will stop and shut down your session if you don't make it feel special all the time.

```python

from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint_filepath = 'best_shakespeare_model.keras'

model_checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,  # Save the entire model
    monitor='val_loss',  # Monitor validation loss
    mode='min',  # Save the model when val_loss is minimized
    save_best_only=True  # Only save the best model
)

# Train the model with the callback
history = model.fit(X_train, y_train, epochs=500,  validation_split=0.1, callbacks=[model_checkpoint_callback])
```

In [None]:
# from tensorflow.keras.callbacks import ModelCheckpoint
# checkpoint_filepath = 'best_shakespeare_model.keras'

# model_checkpoint_callback = ModelCheckpoint(
#     filepath=checkpoint_filepath,
#     save_weights_only=False,  # Save the entire model
#     monitor='val_loss',  # Monitor validation loss
#     mode='min',  # Save the model when val_loss is minimized
#     save_best_only=True  # Only save the best model
# )

# # Train the model with the callback
# history = model.fit(X_train, y_train_one_hot, epochs=40, batch_size=4096, validation_split=0.1, callbacks=[model_checkpoint_callback])