In [1]:
import tensorflow as tf
import itertools

import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.callbacks import LambdaCallback
import random
import sys

## Dataset and Preprocessing

Read the dataset of dinosaur names, build a list of its unique characters (a-z plus the newline), and compute the dataset size and the vocabulary size.

In [2]:
def read_data(filename):
    # Read the whole file and lowercase it so the vocabulary is case-insensitive
    with open(filename, 'r') as f:
        data = f.read().lower()
    chars = list(set(data))
    data_size, vocab_size = len(data), len(chars)
    print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
    return chars, data

In [3]:
chars,data= read_data('data/dinos.txt')

There are 19909 total characters and 27 unique characters in your data.


The characters are a-z (26 characters) plus "\n" (the newline character), which plays a role similar to the `<EOS>` ("end of sentence") token, except that here it marks the end of a dinosaur name rather than the end of a sentence. In the cells below, we create a Python dictionary (i.e., a hash table), `char_to_ix`, that maps each character to an index from 0 to 26, and a NumPy array, `ix_to_char`, that maps each index back to its character. The second mapping tells us which character each position in the softmax output distribution corresponds to.

In [4]:
char_to_ix={char:i for i,char in enumerate(sorted(chars))}
ix_to_char=np.array(sorted(chars))

In [5]:
char_to_ix['\n'], ix_to_char[0]

(0, '\n')

In [6]:
# input text as integers mapped from the char_to_ix dict
text_as_int = np.array([char_to_ix[c] for c in data ])


In [7]:
data[:14],text_as_int[:14]

('aachenosaurus\n',
 array([ 1,  1,  3,  8,  5, 14, 15, 19,  1, 21, 18, 21, 19,  0]))
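As a quick sanity check on the two mappings, the following sketch encodes a string to indices and decodes it back. It rebuilds a small vocabulary from a sample string rather than from `data/dinos.txt`, so `sample_text`, `encode`, and `decode` are illustrative names and not part of the notebook.

```python
import numpy as np

# Assumption: a tiny stand-in corpus instead of data/dinos.txt.
sample_text = "aachenosaurus\naardonyx\n"

vocab = sorted(set(sample_text))
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
ix_to_char = np.array(vocab)

def encode(text):
    # Map each character to its integer index.
    return np.array([char_to_ix[ch] for ch in text])

def decode(indices):
    # Index into the character array and join the result back into a string.
    return "".join(ix_to_char[indices])

encoded = encode("aardonyx\n")
decoded = decode(encoded)
print(repr(decoded))
```

Because the vocabulary is sorted, the newline always receives index 0, which is why the encoded name ends in 0.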

### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we are training the model to perform. The input to the model is a sequence of characters, and the model is trained to predict the following character at each time step.

Because RNNs maintain an internal state that depends on previously seen elements, the question becomes: given all the characters seen so far, what is the next character?

### Create training examples and targets
Next, divide the text into example sequences. Each input sequence contains `seq_length` characters from the text (here, one dinosaur name padded to a fixed length).
For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.
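To make the shift concrete, here is a minimal sketch for a single name. It assumes that index 0 (the newline) serves both as the start marker of the input and as the end-of-name target; the tiny vocabulary and the helper name `make_pair` are made up for illustration.

```python
# Hypothetical small vocabulary; index 0 is the newline character.
char_to_ix = {"\n": 0, "a": 1, "b": 2, "c": 3}

def make_pair(name):
    # Input: start marker (0) followed by the characters of the name.
    x = [0] + [char_to_ix[ch] for ch in name]
    # Target: the same sequence shifted by one position, ending with newline.
    y = x[1:] + [char_to_ix["\n"]]
    return x, y

x, y = make_pair("cab")
print(x)  # [0, 3, 1, 2]
print(y)  # [3, 1, 2, 0]
```

At every position `t`, `y[t]` is the character that follows `x[t]`, so the two lists always have the same length.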



In [8]:
def read_dinousar_names(filename):
    with open(filename) as f:
        dinousar_names = f.readlines()
        dinousar_names = [x.lower().strip() for x in dinousar_names ]
    return dinousar_names
    
   

In [9]:
dinousar_names=read_dinousar_names('data/dinos.txt')
dinousar_names[:5]

['aachenosaurus', 'aardonyx', 'abdallahsaurus', 'abelisaurus', 'abrictosaurus']

Shuffle the examples.




In [10]:
# Shuffle list of all dinosaur names
np.random.seed(0)
np.random.shuffle(dinousar_names)

In [11]:
dinousar_names[:5]

['turiasaurus',
 'pandoravenator',
 'ilokelesia',
 'chubutisaurus',
 'quaesitosaurus']

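For each name, build the input and target sequences by duplicating and shifting it: the input starts with index 0 (the newline character, used as a start marker) followed by the name's characters, and the target is the input from index 1 onward with a newline appended. Prepending the start marker keeps the inputs and targets the same length.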


In [12]:
def create_x_y_dataset(dinousar_names):
    X=[]
    Y=[]
    for name in dinousar_names:
        x=[0] + [char_to_ix[ch] for ch in name]
        y= x[1:] + [char_to_ix['\n']]
        X.append(x)
        Y.append(y)
        
    #pad with zeros for maximum length
    X=np.array(list(itertools.zip_longest(*X, fillvalue=0))).T
    Y=np.array(list(itertools.zip_longest(*Y, fillvalue=0))).T
    X=X.reshape(-1,27,1)
    
    
    return X,Y

In [13]:
X,Y= create_x_y_dataset(dinousar_names)
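The padding step relies on `itertools.zip_longest`, which can be hard to read at first. The sketch below applies the same expression to three made-up index sequences so the effect is easy to see; the variable names are illustrative only.

```python
import itertools
import numpy as np

# Three made-up index sequences of different lengths.
sequences = [[1, 2, 3], [4], [5, 6]]

# zip_longest pairs up elements position by position and fills gaps with 0;
# transposing turns the result back into one row per sequence.
padded = np.array(list(itertools.zip_longest(*sequences, fillvalue=0))).T
print(padded.shape)  # (3, 3)
print(padded)
```

Since 0 is also the index of "\n", padded positions look like repeated end-of-name markers.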

### TensorFlow + Keras

The model stacks two LSTM layers of 128 units each, followed by a softmax layer over the 27 characters.

In [14]:

model = tf.keras.Sequential()
model.add(layers.LSTM(128, input_shape=(27,1), return_sequences=True))
model.add(layers.LSTM(128))
# Softmax output layer with one unit per character in the vocabulary
model.add(layers.Dense(len(chars), activation='softmax'))


In [15]:
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 27, 128)           66560     
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 27)                3483      
Total params: 201,627
Trainable params: 201,627
Non-trainable params: 0
_________________________________________________________________
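The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias); the helper names `lstm_params` and `dense_params` are made up for this check.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

lstm_1 = lstm_params(128, 1)     # first LSTM sees 1 feature per time step
lstm_2 = lstm_params(128, 128)   # second LSTM sees the 128 outputs of the first
dense = dense_params(128, 27)    # softmax layer over 27 characters
total = lstm_1 + lstm_2 + dense
print(lstm_1, lstm_2, dense, total)  # 66560 131584 3483 201627
```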


In [16]:
model.compile(optimizer = tf.train.AdamOptimizer(0.15),
    loss = 'categorical_crossentropy')

Sample an index from a probability array.

In [17]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
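
To see what the temperature does before any sampling happens, the following sketch applies the same rescaling as `sample` to a made-up distribution. The helper name `apply_temperature` and the example probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    # Same rescaling as in sample(): log, divide by temperature, renormalize.
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

probs = [0.7, 0.2, 0.1]  # made-up distribution
sharp = apply_temperature(probs, 0.2)  # low temperature
same = apply_temperature(probs, 1.0)   # temperature 1 leaves it unchanged
flat = apply_temperature(probs, 1.2)   # high temperature

print(np.round(sharp, 3), np.round(same, 3), np.round(flat, 3))
```

A temperature below 1 concentrates probability on the most likely character, while a temperature above 1 spreads it out, which is why the higher "diversity" settings produce more varied output.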


Function invoked at the end of each epoch. It prints generated text.

In [18]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - 27 - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + 27]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(2):
            # Encode the seed as integer indices, matching the (27, 1) input the model was trained on
            x_pred = np.array([char_to_ix[char] for char in sentence]).reshape(1, 27, 1)

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = ix_to_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
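
The generation loop above keeps a fixed-length window of text and slides it forward by one character per prediction. The sketch below isolates that mechanism; `fake_predict` is a hypothetical stand-in for `model.predict` that returns a uniform distribution, and the vocabulary and window size are made up.

```python
import numpy as np

vocab = ["\n", "a", "b", "c"]  # made-up vocabulary
window = 5                     # stand-in for the notebook's 27

def fake_predict(indices):
    # Hypothetical stand-in for model.predict: uniform over the vocabulary.
    return np.full(len(vocab), 1.0 / len(vocab))

rng = np.random.default_rng(0)
sentence = "abcab"
generated = sentence
for _ in range(3):
    indices = [vocab.index(ch) for ch in sentence]
    probs = fake_predict(indices)
    next_char = vocab[rng.choice(len(vocab), p=probs)]
    generated += next_char
    # Slide the window: drop the oldest character, append the new one.
    sentence = sentence[1:] + next_char

print(repr(generated))
```

The window length never changes, so the model always receives an input of the shape it was trained on.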

    

In [19]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [20]:
model.fit(X,Y, batch_size=64, epochs=10,callbacks=[print_callback])


Epoch 1/10
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "govuchia
yutyrannus
zanabaz"
govuchia
yutyrannus
zanabazbb
----- diversity: 0.5
----- Generating with seed: "govuchia
yutyrannus
zanabaz"
govuchia
yutyrannus
zanabazbb
----- diversity: 1.0
----- Generating with seed: "govuchia
yutyrannus
zanabaz"
govuchia
yutyrannus
zanabazlg
----- diversity: 1.2
----- Generating with seed: "govuchia
yutyrannus
zanabaz"
govuchia
yutyrannus
zanabazjf
Epoch 2/10
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "rus
asiaceratops
asiamerica"
rus
asiaceratops
asiamericagf
----- diversity: 0.5
----- Generating with seed: "rus
asiaceratops
asiamerica"
rus
asiaceratops
asiamericaee
----- diversity: 1.0
----- Generating with seed: "rus
asiaceratops
asiamerica"
rus
asiaceratops
asiamericade
----- diversity: 1.2
----- Generating with seed: "rus
asiaceratops
asiamerica"
rus
asiaceratops
asiamericag

Epoch 3/10
----- Generating text

<tensorflow.python.keras.callbacks.History at 0x7fc0dc4295c0>

Try a prediction with custom data.

In [23]:
X_test=np.zeros(shape=(len(chars),len(chars),1), dtype=np.int32)

Fill the test input with random character indices.

In [41]:
X_test=np.random.choice(26, (27,27,1))

In [42]:
preds=[]

In [43]:
test_probs = model.predict(X_test)  # predict once instead of once per row
for i in range(X_test.shape[0]):
    preds.append(sample(test_probs[i]))
    

In [44]:
print('\n'.join(ix_to_char[preds]))

e
f
l


d
d


i
j
l
j
g
f
i


i
b
f
l
i
i
l


e


g
g
