<a href="https://colab.research.google.com/github/afrojaakter/Natural-Language-Processing/blob/main/PredictShakespeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will build a character-level language model and train it on a Cloud TPU. The model predicts the next character of text given the text so far. Once trained, it can generate new snippets of text that read in a style similar to the training data.

### Data, model and training

We will train a model on the combined works of William Shakespeare, then use the model to compose a play in the style of *The Great Bard*:

<blockquote>
Loves that led me no dumbs lack her Berjoy's face with her to-day.  
The spirits roar'd; which shames which within his powers  
	Which tied up remedies lending with occasion,  
A loud and Lancaster, stabb'd in me  
	Upon my sword for ever: 'Agripo'er, his days let me free.  
	Stop it of that word, be so: at Lear,  
	When I did profess the hour-stranger for my life,  
	When I did sink to be cried how for aught;  
	Some beds which seeks chaste senses prove burning;  
But he perforces seen in her eyes so fast;  
And _  
</blockquote>


#### Download data
We will get *The Complete Works of William Shakespeare* as a single text file from [Project Gutenberg](https://www.gutenberg.org/). We will use snippets from this file as training data. Each input snippet is paired with a target snippet that is the same text offset by one character.
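
To make the one-character offset concrete, here is a minimal sketch in plain Python. The helper name `make_pairs` and the tiny `seq_len` are illustrative assumptions, not part of the notebook. The notebook's own input pipeline applies the same slicing to integer tensors rather than strings.

```python
def make_pairs(text, seq_len):
    """Cut text into chunks of seq_len + 1 characters and split each chunk
    into an input (all but the last character) and a target (all but the first)."""
    chunk_len = seq_len + 1
    pairs = []
    # Step through the text one whole chunk at a time; a trailing remainder
    # that does not fill a complete chunk is dropped.
    for start in range(0, len(text) - chunk_len + 1, chunk_len):
        chunk = text[start:start + chunk_len]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_pairs("To be, or not to be", seq_len=5)
for source, target in pairs:
    print(repr(source), "->", repr(target))
```

At every position, the target character is the character that follows the corresponding input character, which is exactly what a next-character model has to learn.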

In [1]:
#download data
!wget --show-progress --continue -O /content/shakespeare.txt http://www.gutenberg.org/files/100/100-0.txt

--2021-07-14 18:26:33--  http://www.gutenberg.org/files/100/100-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/100/100-0.txt [following]
--2021-07-14 18:26:33--  https://www.gutenberg.org/files/100/100-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5757108 (5.5M) [text/plain]
Saving to: ‘/content/shakespeare.txt’


2021-07-14 18:26:34 (18.5 MB/s) - ‘/content/shakespeare.txt’ saved [5757108/5757108]



In [2]:
# Preview the first few lines and a random sample of the input data
!head -n5 /content/shakespeare.txt
!echo "..."
!shuf -n5 /content/shakespeare.txt

﻿The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
...
With marriage wherefore was he mock’d,
    What dost thou, or what art thou, Angelo?
SIR TOBY.

Imagine her as one in dead of night


In [13]:
import numpy as np
import tensorflow as tf
import os

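# Fail early on TensorFlow 1.x; this notebook relies on the TF 2 API.
if int(tf.__version__.split('.')[0]) < 2:
  raise Exception("This notebook is compatible with TensorFlow 2.0 or higher.")

# Path written by the wget cell above.
ShakespeareTxt = '/content/shakespeare.txt'

def transform(txt):
  """Encode text as an int32 array of code points, dropping characters >= 255."""
  return np.asarray([ord(c) for c in txt if ord(c) < 255], dtype = np.int32)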

def input_fun(seq_len = 100, batch_size = 1024):
  """Return a dataset of source and target sequences for training."""
  with tf.io.gfile.GFile(ShakespeareTxt, 'r') as f:
    txt = f.read()
  
  source = tf.constant(transform(txt), dtype = tf.int32)

  ds = tf.data.Dataset.from_tensor_slices(source).batch(seq_len + 1, drop_remainder = True)

  def split_input_target(chunk):
    """Split a chunk into input (all but the last character) and target (all but the first)."""
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
  
  buffer_size = 10000
  ds = ds.map(split_input_target).shuffle(buffer_size).batch(batch_size, drop_remainder = True)

  return ds.repeat()
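
One consequence of the encoding step is worth seeing in isolation: any character whose code point is 255 or higher is silently dropped. The sketch below is a pure-Python mirror of `transform` (the names `encode` and `decode` are assumptions introduced here) applied to a phrase that contains the typographic apostrophe U+2019, which appears in the corpus preview above.

```python
def encode(text):
    """Pure-Python mirror of the notebook's transform: keep code points below 255."""
    return [ord(c) for c in text if ord(c) < 255]


def decode(ids):
    """Map integer IDs back to characters, as done when printing generated text."""
    return "".join(chr(i) for i in ids)


line = "he mock\u2019d"  # contains U+2019, a typographic apostrophe
ids = encode(line)
round_trip = decode(ids)

print(ids)
print(round_trip)
```

The round trip loses the apostrophe, so under this encoding the model never sees that character during training and cannot generate it.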

### Model
We will use a two-layer, forward (unidirectional) LSTM.

The input dimension of the Embedding layer is 256 because the vocabulary size is 256.

Note how the ```stateful``` argument is used. During training we set ```stateful=False``` because we want the model's state to be reset between batches. When sampling (computing predictions) from the trained model, we want ```stateful=True``` so that the model retains information from one call to the next and generates more interesting text.
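
The effect of the flag can be illustrated with a toy recurrent cell in plain Python. This is a hypothetical stand-in, not how Keras implements its LSTM: the class `ToyRecurrentCell` and its running-sum state are assumptions chosen only to show what carrying state across calls means.

```python
class ToyRecurrentCell:
    """Toy recurrent cell whose state is the running sum of its inputs."""

    def __init__(self, stateful):
        self.stateful = stateful
        self.state = 0

    def __call__(self, sequence):
        if not self.stateful:
            # Stateless behaviour: forget everything from the previous call.
            self.state = 0
        outputs = []
        for x in sequence:
            self.state += x
            outputs.append(self.state)
        return outputs


stateless = ToyRecurrentCell(stateful=False)
stateful = ToyRecurrentCell(stateful=True)

# Feed the same two batches to both cells.
stateless_outputs = [stateless([1, 2]), stateless([3])]
stateful_outputs = [stateful([1, 2]), stateful([3])]

print(stateless_outputs)
print(stateful_outputs)
```

Only the stateful cell's second call reflects the inputs from the first call. That is the behaviour needed when text is fed in one character at a time.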


In [8]:
EMBEDDING_DIM = 512

def lstm_model(seq_len=100, batch_size=None, stateful=True):
  """Language model: predict the next character given the characters so far."""
  source = tf.keras.Input(
      name='seed', shape=(seq_len,), batch_size=batch_size, dtype=tf.int32)

  embedding = tf.keras.layers.Embedding(input_dim=256, output_dim=EMBEDDING_DIM)(source)
  lstm_1 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(embedding)
  lstm_2 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(lstm_1)
  predicted_char = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(256, activation='softmax'))(lstm_2)
  return tf.keras.Model(inputs=[source], outputs=[predicted_char])
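
As a rough sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes default Keras layer settings (a bias in every layer and one bias vector per LSTM gate). The helper names are introduced here for illustration, and the result should be compared against `model.summary()` rather than taken on trust.

```python
VOCAB_SIZE = 256      # input_dim of the Embedding layer and width of the Dense output
EMBEDDING_DIM = 512   # output_dim of the Embedding layer
UNITS = 512           # units in each LSTM layer


def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


embedding = VOCAB_SIZE * EMBEDDING_DIM
lstm_1 = lstm_params(EMBEDDING_DIM, UNITS)
lstm_2 = lstm_params(UNITS, UNITS)
dense = dense_params(UNITS, VOCAB_SIZE)
total = embedding + lstm_1 + lstm_2 + dense

print(embedding, lstm_1, lstm_2, dense, total)
```

Under these assumptions the two LSTM layers account for the large majority of the roughly 4.5 million parameters.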

### Train Model
First, we create a distribution strategy that can use the TPU, in this case ```TPUStrategy```. We then create and compile the model inside the strategy's scope. Once that is done, calls to the standard Keras methods ```fit```, ```evaluate``` and ```predict``` use the TPU.

As noted above, we train with ```stateful=False``` because during training we only care about one batch at a time.

In [None]:
tf.keras.backend.clear_session()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' +
                                                             os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
  training_model = lstm_model(seq_len=100, stateful=False)
  training_model.compile(
      optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])

training_model.fit(
    input_fun(),
    steps_per_epoch=100,
    epochs=10
)
training_model.save_weights('/tmp/bard.h5', overwrite=True)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
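
For a sense of scale, the training settings used here can be related to the size of the corpus. The arithmetic below is only an approximation. It uses the byte count reported by `wget` as a stand-in for the number of characters, which slightly overestimates it because multi-byte and filtered characters are not accounted for.

```python
FILE_BYTES = 5_757_108   # size reported by wget; an upper bound on usable characters
SEQ_LEN = 100            # default seq_len of input_fun
BATCH_SIZE = 1024        # default batch_size of input_fun
STEPS_PER_EPOCH = 100
EPOCHS = 10

# Each training example is one chunk of seq_len + 1 characters.
chunks = FILE_BYTES // (SEQ_LEN + 1)
# Incomplete batches are dropped (drop_remainder=True).
batches_per_pass = chunks // BATCH_SIZE
total_steps = STEPS_PER_EPOCH * EPOCHS
passes = total_steps / batches_per_pass

print(chunks, batches_per_pass, total_steps, round(passes, 1))
```

Because the dataset repeats indefinitely, an "epoch" in this notebook is simply 100 steps, not one full pass over the text. By this estimate the 1000 training steps cover the corpus roughly 18 times.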


### Predictions
Now we will use the trained model to make predictions and generate a Shakespeare-esque play. We start the model with a *seed* sentence, then generate 250 characters from it. The model makes five predictions from the initial seed.

The predictions are done on the CPU so that the batch size (5 in this case) does not have to be divisible by 8.

Note that we now set ```stateful=True``` to keep the model's state between batches. If ```stateful=False```, the model state is reset between batches, and the model can only use the information from the current batch (a single character) to make a prediction.

The output of the model is a set of probabilities for the next character (given the input so far). To build a paragraph, we predict one character at a time and sample a character based on the probabilities provided by the model. For example, if the input character is "o" and the output probabilities are "p" (0.65), "t" (0.30) and other characters (0.05), then sampling allows the model to generate text other than just "Ophelia" and "Othello."
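
The difference between sampling and always taking the most likely character can be sketched with the toy distribution from the example. This uses only the standard library. The `"?"` entry lumping together all other characters and the helper names `greedy` and `sample` are assumptions made to keep the sketch small; the notebook itself samples with `np.random.choice` over all 256 character IDs.

```python
import random

# Toy next-character distribution after "o"; "?" stands for every other character.
probs = {"p": 0.65, "t": 0.30, "?": 0.05}


def greedy(probs):
    """Always return the single most likely character."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one character in proportion to its probability."""
    chars = list(probs)
    weights = [probs[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_picks = [greedy(probs) for _ in range(1000)]
sampled_picks = [sample(probs, rng) for _ in range(1000)]

print(set(greedy_picks))
print({c: sampled_picks.count(c) for c in probs})
```

Greedy selection produces the same continuation every time, while sampling visits the less likely characters in proportion to their probabilities. That is why the five predictions diverge even though they start from the same seed.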

In [15]:
BATCH_SIZE = 5
PREDICT_LEN = 250

# Keras requires the batch size be specified ahead of time for stateful models.
# We use a sequence length of 1, as we will be feeding in one character at a 
# time and predicting the next character.
prediction_model = lstm_model(seq_len=1, batch_size=BATCH_SIZE, stateful=True)
prediction_model.load_weights('/tmp/bard.h5')

# We seed the model with our initial string, copied BATCH_SIZE times
seed_txt = 'Looks it not like the king?  Verily, we must go! '
seed = transform(seed_txt)
seed = np.repeat(np.expand_dims(seed, 0), BATCH_SIZE, axis=0)

# First, run the seed forward to prime the state of the model.
prediction_model.reset_states()
for i in range(len(seed_txt) - 1):
  prediction_model.predict(seed[:, i : i + 1])

# Now we can accumulate predictions, one character per step.
predictions = [seed[:, -1:]]
for i in range(PREDICT_LEN):
  # The stateful model expects input of shape (BATCH_SIZE, 1).
  last_char = predictions[-1].reshape(BATCH_SIZE, 1)
  next_probs = prediction_model.predict(last_char)[:, 0, :]

  # Sample the next character from the output distribution.
  next_idx = [
      np.random.choice(256, p=next_probs[b])
      for b in range(BATCH_SIZE)
  ]
  predictions.append(np.asarray(next_idx, dtype=np.int32))
  

for i in range(BATCH_SIZE):
  print('PREDICTION %d\n\n' % i)
  p = [predictions[j][i] for j in range(PREDICT_LEN)]
  generated = ''.join([chr(c) for c in p])  # Convert back to text
  print(generated)
  print()
  assert len(generated) == PREDICT_LEN, 'Generated text too short'

PREDICTION 0


 Fear
thee. Here is the virginities to come abated in
    great Cupid. Come, come, it will I scouple it.
  CELIA. The best, be satisfied? He comes in for,
    And give you with your Grace.
  CELIA. Fear not to winnot. I will give they art.
  CE

PREDICTION 1


 Art not conscience,
And both how we for whats single
But I yet met, and good for him.

DAUPHIN.
Thou drawn
For the beauteous occusator Hercules,
No, nor off death, thou canst sing; and I
she flushing that I spoke,
To such a thing with a lam

PREDICTION 2


 Can you sew?
Say I make Rome, or shakes, you do beseech you what I think one quarter. But, good friend,
The Volsces pack.

ROSENCRANTZ and ALuAR.
In countryeed again; and there died we weigh him in the slave, and that adost you grow for,
Do as

PREDICTION 3


 And now thou art
I would go hell.

CLOTEN.
Faith, sir, thou fortune with himself,
I have rouse that thou art bigger: to the oizing,
To set fairer in their pate of Jupiters thing

Reference: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb#scrollTo=2a5cGsSTEBQD