<h2>Virtual Shakespeare - Natural Language Processing</h2>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a2/Shakespeare.jpg" width="200" height="200"/>

<h5> Hola folks! Don't be fooled by the clickbait title.<br> My quest to recreate Shakespeare was only partly successful. Actually, no, it wasn't.<br> Still, I managed to create a 3-year-old Shakespeare, with some broken words and a bit of a hatred for grammar.</h5>

<h5>Okay, on to this notebook. I trained a character-level model on a dataset of Shakespeare's works. Given an input text, it produces an output text of any desired length.</h5>

<h5>The output text follows the same structure as Shakespeare's works and uses similar words and dialogue.<br> It is completely computer generated; it is not an excerpt from the dataset. </h5>

<h5>The dataset has over 5 million characters in total, of which 84 are distinct.</h5>

####  I have provided a sample output at the top. Please go through the entire notebook to understand how it all works.

In [0]:
#Sample output (generate_text and model are defined later in the notebook)
print(generate_text(model, 'Et tu, Brute?', gen_size = 1000))

Et tu, Brute?
  CASSIUS. Hear you the change: why, what marning is her silvia?
  MARIA. Very liken'd brave said; 'twill besire this night, I may perbs crap-in-these
    discourse hast bred in the experience of their King?
  BOLINGBROKE. Who straight there; but that you are so scruple  
    Before, being gentle-sweeber than our star,
    And bade your best odd Cudruction of it.
    O fixt, thy guilt spoke back! It did all rest unmauntily
    sculves! Dat is it not, awake! Very Emmawe, Alifardup Mother, afterward, and Sisturs.
  AJAX. You say the truth, give me the court; whose loves are in notice,
    'Tis dearer frank.
  SERVANT. I thank you.
  ORLANDO. Well, horrible! Come, no.' 'Tis well belov'd, so I confess the sun to
    knowe hearts, and say the terthy, not the longes,
    And pity to the other temple may amiss
    Fall'n forth tonight.
  SEBASTIAN. Wilt thourself come from the hardicu?  
  LUCETTA. For who should wear'd it here? Shall I hear such business?
    Grant I maintain'd

In [0]:
#importing libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [0]:
#load the dataset containing Shakespeare's works
with open('../input/shakespeare.txt', 'r') as f:
    text = f.read()
print(text[:1000])


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say within thine own deep su

# Dataset Statistics

In [0]:
total_char = len(text)
unique_char = len(set(text))

#replace every non-letter character with a space so split() yields words
text_cleaned = ''.join(c if c.isalpha() else ' ' for c in text)

#count how often each word occurs
word_count = {}
for word in text_cleaned.split():
    word_count[word] = word_count.get(word, 0) + 1

#most frequent words first
df = pd.DataFrame(list(word_count.items()), columns=['Words','Count'])
df.sort_values('Count', axis=0, ascending=False, inplace=True)
df.reset_index(drop=True, inplace=True)

In [0]:
print('Total Characters: ', total_char)
print('Unique Characters: ', unique_char)
print('Most used words:')
display(df.head(10))

Total Characters:  5445609
Unique Characters:  84
Most used words:


  Words  Count
0   the  23243
1     I  22225
2   and  18618
3    to  16339
4    of  15687
5     a  12780
6   you  12163
7    my  10839
8    in  10005
9     d   8954
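The word count can also be written more compactly with the standard library. This is a minimal sketch, assuming the same rule as the statistics cell (every non-letter character becomes a space); `count_words` and `sample` are illustrative names and not part of the notebook.

```python
from collections import Counter

def count_words(raw_text):
    """Count words after replacing every non-letter character with a space."""
    # str.isalpha() is True for any letter, so apostrophes and
    # punctuation become word boundaries
    cleaned = ''.join(ch if ch.isalpha() else ' ' for ch in raw_text)
    return Counter(cleaned.split())

# Illustrative sample, not the full corpus
sample = "Feed'st thy light's flame; thy self thy foe."
word_counts = count_words(sample)
print(word_counts.most_common(3))
```

Because apostrophes are treated as separators, contractions are split in two ("Feed'st" becomes "Feed" and "st"). That is most likely why a lone `d` shows up among the ten most used words: it would come from forms like "belov'd".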


# Text Preprocessing

In [0]:
#Get all the unique characters
vocab = sorted(set(text))
vocab_size = len(vocab)
print(vocab)
print('Total unique characters: ',vocab_size)

['\n', ' ', '!', '"', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '}']
Total unique characters:  84


In [0]:
#Map characters to numbers and numbers to characters
char_to_ind = {u:i for i,u in enumerate(vocab)}
ind_to_char = {i:u for i,u in enumerate(vocab)}
print(char_to_ind)
print('\n')
print(ind_to_char)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '&': 4, "'": 5, '(': 6, ')': 7, ',': 8, '-': 9, '.': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '<': 23, '>': 24, '?': 25, 'A': 26, 'B': 27, 'C': 28, 'D': 29, 'E': 30, 'F': 31, 'G': 32, 'H': 33, 'I': 34, 'J': 35, 'K': 36, 'L': 37, 'M': 38, 'N': 39, 'O': 40, 'P': 41, 'Q': 42, 'R': 43, 'S': 44, 'T': 45, 'U': 46, 'V': 47, 'W': 48, 'X': 49, 'Y': 50, 'Z': 51, '[': 52, ']': 53, '_': 54, '`': 55, 'a': 56, 'b': 57, 'c': 58, 'd': 59, 'e': 60, 'f': 61, 'g': 62, 'h': 63, 'i': 64, 'j': 65, 'k': 66, 'l': 67, 'm': 68, 'n': 69, 'o': 70, 'p': 71, 'q': 72, 'r': 73, 's': 74, 't': 75, 'u': 76, 'v': 77, 'w': 78, 'x': 79, 'y': 80, 'z': 81, '|': 82, '}': 83}


{0: '\n', 1: ' ', 2: '!', 3: '"', 4: '&', 5: "'", 6: '(', 7: ')', 8: ',', 9: '-', 10: '.', 11: '0', 12: '1', 13: '2', 14: '3', 15: '4', 16: '5', 17: '6', 18: '7', 19: '8', 20: '9', 21: ':', 22: ';', 23: '<', 24: '>', 25: '?', 26: 'A', 27: 'B', 28: 'C
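The two dictionaries are inverses of each other, so encoding and then decoding should give back the original text. Here is a small sketch of that round trip on a toy string; the `toy_` names are made up for illustration and stand in for `vocab`, `char_to_ind` and `ind_to_char`.

```python
# Toy text standing in for the full corpus (illustrative only)
toy_text = "Et tu, Brute?"

# Same construction as in the notebook, on a much smaller vocabulary
toy_vocab = sorted(set(toy_text))
toy_char_to_ind = {ch: i for i, ch in enumerate(toy_vocab)}
toy_ind_to_char = {i: ch for i, ch in enumerate(toy_vocab)}

encoded = [toy_char_to_ind[ch] for ch in toy_text]
decoded = ''.join(toy_ind_to_char[i] for i in encoded)
print(encoded)
print(decoded == toy_text)

# A character that never appeared in the text has no index
try:
    toy_char_to_ind['#']
except KeyError:
    print("'#' is not in the vocabulary")
```

The same limitation applies to the real mapping: any seed text passed to the model later can only use characters from the 84-character vocabulary.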

In [0]:
#encode the entire text as numbers and print the first 1000 codes
encoded_text = np.array([char_to_ind[c] for c in text])
print(encoded_text[:1000])

[ 0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 12  0
  1  1 31 73 70 68  1 61 56 64 73 60 74 75  1 58 73 60 56 75 76 73 60 74
  1 78 60  1 59 60 74 64 73 60  1 64 69 58 73 60 56 74 60  8  0  1  1 45
 63 56 75  1 75 63 60 73 60 57 80  1 57 60 56 76 75 80  5 74  1 73 70 74
 60  1 68 64 62 63 75  1 69 60 77 60 73  1 59 64 60  8  0  1  1 27 76 75
  1 56 74  1 75 63 60  1 73 64 71 60 73  1 74 63 70 76 67 59  1 57 80  1
 75 64 68 60  1 59 60 58 60 56 74 60  8  0  1  1 33 64 74  1 75 60 69 59
 60 73  1 63 60 64 73  1 68 64 62 63 75  1 57 60 56 73  1 63 64 74  1 68
 60 68 70 73 80 21  0  1  1 27 76 75  1 75 63 70 76  1 58 70 69 75 73 56
 58 75 60 59  1 75 70  1 75 63 64 69 60  1 70 78 69  1 57 73 64 62 63 75
  1 60 80 60 74  8  0  1  1 31 60 60 59  5 74 75  1 75 63 80  1 67 64 62
 63 75  5 74  1 61 67 56 68 60  1 78 64 75 63  1 74 60 67 61  9 74 76 57
 74 75 56 69 75 64 56 67  1 61 76 60 67  8  0  1  1 38 56 66 64 69 62  1
 56  1 61 56 68 64 69 60  1 78 63 60 73 60  1 56 57

# Training Sequence

In [0]:
#length of each training sequence, and how many such sequences fit in the text
seq_len = 120
total_num_seq = len(text)//(seq_len+1)
print('Total Number of Sequences: ', total_num_seq)

Total Number of Sequences:  45005


In [0]:
#Create training sequences
#tf.data.Dataset.from_tensor_slices function converts a text vector
#into a stream of character indices
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

for i in char_dataset.take(500):
    print(ind_to_char[int(i)],end="")


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bu

In [0]:
#batch() groups the individual characters into sequences
#that we can later feed to the model
#we use seq_len+1 because each sequence is split into an input of
#seq_len characters and a target shifted one step forward
#drop_remainder=True drops the final chunk if it has fewer than seq_len+1 characters
sequences = char_dataset.batch(seq_len+1, drop_remainder=True)

In [0]:
#this function will grab a sequence
#take the [0:n-1] characters as input text
#take the [1:n] characters as target text
#return a tuple of both
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt

In [0]:
#this will convert the series of sequences into
#a series of tuples containing input and target text
dataset = sequences.map(create_seq_targets)

In [0]:
for input_txt, target_txt in dataset.take(1):
    print(''.join([ind_to_char[i] for i in np.array(input_txt)]))
    print('\n')
    print(''.join([ind_to_char[i] for i in np.array(target_txt)]))

#the target is shifted 1 character forward
#its last character is a space, so the shift is not visible in the printout


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But 
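The chunk-and-shift step is easier to see on a short string with a tiny sequence length. Below is a pure-Python sketch of the same idea, without `tf.data`; `make_pairs` is a hypothetical helper written only for this illustration.

```python
def make_pairs(text, seq_len):
    """Split text into chunks of seq_len + 1 characters, then into (input, target) pairs."""
    chunk = seq_len + 1
    n_chunks = len(text) // chunk  # the incomplete tail is dropped
    pairs = []
    for k in range(n_chunks):
        seq = text[k * chunk:(k + 1) * chunk]
        # input is everything but the last character,
        # target is everything but the first
        pairs.append((seq[:-1], seq[1:]))
    return pairs

pairs = make_pairs("From fairest creatures", seq_len=4)
for input_txt, target_txt in pairs:
    print(repr(input_txt), '->', repr(target_txt))
```

At every position the target holds the character that follows the corresponding input character, which is exactly what the model is asked to predict.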


# Generating training batches

In [0]:
batch_size = 128 #number of sequence tuples in each batch
buffer_size = 10000 #size of the shuffle buffer, in sequences

#first shuffle the dataset and divide it into batches
#drop the final batch if it has fewer than batch_size sequences
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
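To get a feel for what `buffer_size` does, here is a simplified pure-Python model of a shuffle buffer: fill a buffer, repeatedly emit a random element from it, and refill the freed slot with the next input element. This is only a sketch of the behaviour described in the TensorFlow documentation, not TensorFlow's implementation, and `buffered_shuffle` is a made-up name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # still filling the buffer, nothing is emitted yet
            buffer.append(item)
            continue
        # buffer is full: emit a random element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # input exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

out = list(buffered_shuffle(range(20), buffer_size=3))
print(out)
```

An element can only be emitted once it has entered the buffer, so with a small buffer the output stays close to the original order. With 45005 sequences and a buffer of 10000, the shuffle here is therefore approximate rather than fully uniform; a buffer at least as large as the dataset would be needed for a full shuffle.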

In [0]:
#count the number of batches by iterating over the dataset once
#this is O(n); the same number also follows from
#total_num_seq // batch_size, since incomplete batches are dropped

x = 0
for i in dataset:
    x += 1

In [0]:
print('Total Batches:', x)
print('Sequences in each batch: ', batch_size)
print('Characters in each sequence:', seq_len)
print('Characters in dataset: ', len(text))

Total Batches: 351
Sequences in each batch:  128
Characters in each sequence: 120
Characters in dataset:  5445609
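The batch count does not actually need a loop. Since both batching steps drop their remainders, the numbers follow from two integer divisions. A quick check using only the figures printed above (TensorFlow also offers `tf.data.experimental.cardinality`, which may report the same count without iterating, depending on the version and pipeline):

```python
total_chars = 5445609  # 'Characters in dataset' as printed above
seq_len = 120
batch_size = 128

# each sequence consumes seq_len + 1 characters; the incomplete tail is dropped
num_sequences = total_chars // (seq_len + 1)
leftover_chars = total_chars % (seq_len + 1)

# each batch consumes batch_size sequences; the incomplete last batch is dropped
num_batches = num_sequences // batch_size
leftover_sequences = num_sequences % batch_size

print(num_sequences, leftover_chars)    # → 45005 4
print(num_batches, leftover_sequences)  # → 351 77
```

So 4 characters and 77 sequences are discarded per pass, which is negligible next to the 5.4 million characters used.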


# Creating the Model

In [0]:
#importing keras modules
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Embedding,GRU
from tensorflow.keras.losses import sparse_categorical_crossentropy

In [0]:
#using sparse_categorical_crossentropy because
#our targets are integer character indices and not one-hot encodings
#we need to define a custom loss function so that we can set
#the from_logits parameter to True, since the final Dense layer
#outputs raw scores (logits) rather than probabilities
def sparse_cat_loss(y_true, y_pred):
    return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
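For a single timestep, the quantity this loss measures is the negative log of the softmax probability assigned to the true character. The sketch below spells that out in plain Python; it is an illustration of the formula, not the Keras implementation (which also averages over timesteps and batches), and `sparse_ce_from_logits` is a made-up name.

```python
import math

def sparse_ce_from_logits(logits, target_index):
    """Cross-entropy for one timestep: -log(softmax(logits)[target_index])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

vocab_size = 84
# equal scores for every character, i.e. a model that knows nothing yet
uniform_logits = [0.0] * vocab_size
baseline = sparse_ce_from_logits(uniform_logits, target_index=5)
print(round(baseline, 3))  # → 4.431, which is ln(84)
```

This gives a handy sanity check: a freshly initialised model should start with a loss of roughly 4.43, and anything well below that means it has learned something about the character distribution.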

In [0]:
def create_model(batch_size):
    vocab_size_func = vocab_size
    embed_dim = 64 #the embedding dimension
    rnn_neurons = 1024 #number of rnn units
    batch_size_func = batch_size
    
    model = Sequential()
    
    model.add(Embedding(vocab_size_func, 
                        embed_dim, 
                        batch_input_shape=[batch_size_func, None]))
    model.add(GRU(rnn_neurons, 
                  return_sequences=True, 
                  stateful=True, 
                  recurrent_initializer='glorot_uniform'))
    
    model.add(Dense(vocab_size_func))    
    model.compile(optimizer='adam', loss=sparse_cat_loss)    
    
    return model

In [0]:
model = create_model(batch_size)
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (128, None, 64)           5376      
_________________________________________________________________
gru_5 (GRU)                  (128, None, 1024)         3348480   
_________________________________________________________________
dense_5 (Dense)              (128, None, 84)           86100     
Total params: 3,439,956
Trainable params: 3,439,956
Non-trainable params: 0
_________________________________________________________________
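The parameter counts in the summary can be reproduced by hand. The sketch below assumes the TF 2.x default GRU variant (`reset_after=True`), which keeps two bias vectors per gate; under that assumption the numbers match the summary exactly.

```python
vocab_size = 84
embed_dim = 64
rnn_units = 1024

# one embed_dim-sized vector per vocabulary character
embedding_params = vocab_size * embed_dim

# three gates (update, reset, candidate); each has an input kernel,
# a recurrent kernel and two bias vectors (assumes reset_after=True)
gru_params = 3 * (rnn_units * (embed_dim + rnn_units) + 2 * rnn_units)

# weight matrix plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
# → 5376 3348480 86100 3439956
```

Almost all of the model's capacity sits in the GRU's recurrent kernel, so `rnn_neurons` is the setting that dominates model size.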


# Training the Model

#### Creating the model assigns random initial weights and biases, so before beginning training, let's first check that the model runs.

In [0]:
#note this will generate random characters
#dataset.take(1) contains 1 batch = 128 sequence tuples
#model will output 120 predictions per sequence
#each in the form of unnormalized scores (logits) over the 84 vocab characters
for ex_input, ex_target in dataset.take(1):
    ex_pred = model(ex_input)
print(ex_pred.shape)

#samples one character index per timestep from the logits of the first sequence
sampled_indices = tf.random.categorical(ex_pred[0], num_samples=1)

#maps those integers to characters
char_pred = ''.join([ind_to_char[int(i)] for i in sampled_indices])

print(char_pred)

(128, 120, 84)
3Di!)knghXBPI}`?U5I6'C]7sJqL<QFc9wA!BGGo((N8qdJMQPDl
<mg7GW_J!`RFs'T'BRh5IvzbVCs95PNoCvaVHvpkkdGGPe8p8eJELAg[!QhZB8Kk_Zv
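`tf.random.categorical` takes logits and draws indices at random, with probabilities given by the softmax of those logits. The plain-Python sketch below mimics that for a three-character toy vocabulary; `sample_from_logits` is a hypothetical stand-in written for illustration, not the TensorFlow function.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Draw one index with probability softmax(logits)[index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - max_logit) for z in logits]
    # random.choices normalises the weights itself
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.0, -2.0]
draws = [sample_from_logits(logits, rng) for _ in range(10000)]
freqs = [draws.count(i) / len(draws) for i in range(len(logits))]
print(freqs)
```

Always taking the highest-scoring index would pick index 0 every time. Sampling instead lets less likely characters through now and then, which keeps generated text from getting stuck in loops. With an untrained model the logits are all close to each other, so the draw is nearly uniform and the output is gibberish like the above.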


In [0]:
#training the model
model.fit(dataset, epochs=30, verbose=1)

Train for 351 steps
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f2104f5a6d0>

In [0]:
#save the model
model.save('shakespeare.h5')

# Generating text

In [0]:
# importing load_model to load the keras model
from tensorflow.keras.models import load_model

In [0]:
#create a new model with a batch size of 1
#the GRU is stateful, so the batch size is fixed when the model is built;
#generating from a single seed needs a batch of one
model = create_model(batch_size=1)

#load the trained weights into the new model
model.load_weights('shakespeare.h5')

#build the model
model.build(tf.TensorShape([1, None]))

#view model summary
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (1, None, 64)             5376      
_________________________________________________________________
gru_6 (GRU)                  (1, None, 1024)           3348480   
_________________________________________________________________
dense_6 (Dense)              (1, None, 84)             86100     
Total params: 3,439,956
Trainable params: 3,439,956
Non-trainable params: 0
_________________________________________________________________


In [0]:
#function to generate text based on an input text
#start_seed is the text the output will continue from
#gen_size is the number of characters to generate

def generate_text(model, start_seed, gen_size=100):
    num_generate = gen_size
    #encode the seed and add a batch dimension: shape (1, len(start_seed))
    input_eval = [char_to_ind[s] for s in start_seed]
    input_eval = tf.expand_dims(input_eval, 0)
    
    text_generated = []
    
    #clear the GRU state left over from any previous call
    model.reset_states()
    
    for i in range(num_generate):
        #logits for every position of the input
        predictions = model(input_eval)
        #remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        
        #sample from the logits and keep the draw for the last position
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        
        #the stateful GRU remembers the context, so feed only the new character
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(ind_to_char[predicted_id])
        
    return (start_seed + ''.join(text_generated))

In [0]:
#generate a text based on input
#note that this output is not part of the dataset
#but is completely auto generated
auto_text = generate_text(model, 'How art thou?', gen_size = 1000)
print(auto_text)

How art thou?
  OTHELLO. This is the cunning silver-house, this man go to bed, to report
    Thy grave is as fast, but known, as euging,
    Love-spoiling for you.
  PROTEUS. Forg'd no further troubled with bounty, may your praises are
    throughly moved, when you have reverse, no.
  Friar. Pardons
  QUEEN. Of me is Panian's minds. You torment is my most observis'd
    Of all the curtains. Yet I hear and by suspicion
    at this half-skin on his sphere doth back to sight.
  FLORIZEL. Thou cam'st,  
    I say, opens'd,
    Looky heavy from mine uncles being a fool,
    Your lordshipfray shall go farther coldly what
    whoreson cruel pictures; thereof was the something of silver, since I muse
    murther our brights of foul words buried.
  PANDARUS. That's as proud affection than the old favour
    sing in's arm] I'll fight. It is my husband he would live to see        Frenchmenerales
    Mark it more carefully.
  AUCHIO. The answer, Jack; I spread, is old Signior Friar, am becomes not

In [0]:
#save the model again (this overwrites the earlier file with the batch-size-1 version)
#and create a download link for it
model.save('shakespeare.h5')
from IPython.display import FileLink
FileLink(r'shakespeare.h5')