In [1]:
import tensorflow as tf
tf.enable_eager_execution()
import numpy as np


## Dataset and Preprocessing

Read the dataset of dinosaur names, build a list of its unique characters (such as a-z), and compute the dataset size and the vocabulary size.

In [2]:
def read_data(filename):
    # Read the whole file (closing it afterwards) and lowercase the text.
    with open(filename, 'r') as f:
        data = f.read().lower()
    chars = list(set(data))  # unique characters; order is arbitrary until sorted
    data_size, vocab_size = len(data), len(chars)
    print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
    return chars, data

In [3]:
chars,data= read_data('data/dinos.txt')

There are 19909 total characters and 27 unique characters in your data.


The characters are a-z (26 characters) plus "\n" (the newline character), which plays a role similar to the `<EOS>` ("end of sentence") token, except that it marks the end of a dinosaur name rather than the end of a sentence. In the cell below, we create a Python dictionary (i.e., a hash table), `char_to_ix`, that maps each character to an index from 0 to 26. We also create `ix_to_char`, a NumPy array indexed by those integers, that maps each index back to the corresponding character. This helps identify which index corresponds to which character in the probability distribution output by the softmax layer.

In [4]:
char_to_ix={char:i for i,char in enumerate(sorted(chars))}
ix_to_char=np.array(sorted(chars))

In [5]:
char_to_ix['\n'], ix_to_char[0]

(0, '\n')

In [6]:
# input text as integers mapped from the char_to_ix dict
text_as_int = np.array([char_to_ix[c] for c in data ])


In [7]:
data[:14],text_as_int[:14]

('aachenosaurus\n',
 array([ 1,  1,  3,  8,  5, 14, 15, 19,  1, 21, 18, 21, 19,  0]))
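
As a quick sanity check, the mapping can be round-tripped: encoding text with the character-to-index dictionary and decoding it with the index-to-character array should return the original text. The sketch below rebuilds both mappings from a small hard-coded sample rather than from `data/dinos.txt`; the sample string and the `sample_*` names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical sample standing in for the contents of data/dinos.txt.
sample = "aachenosaurus\naardonyx\n"

# Same construction as in the cells above, applied to the sample.
sample_chars = sorted(set(sample))
sample_char_to_ix = {ch: i for i, ch in enumerate(sample_chars)}
sample_ix_to_char = np.array(sample_chars)

# Encode to integers, then decode by indexing the array with those integers.
encoded = np.array([sample_char_to_ix[c] for c in sample])
decoded = "".join(sample_ix_to_char[encoded])

print(decoded == sample)
```

Because `ix_to_char` is an array rather than a dictionary, a whole sequence of indices can be decoded with a single indexing operation.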

### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform. The input to the model will be a sequence of characters, and we train the model to predict the output—the following character at each time step.

Because RNNs maintain an internal state that depends on the previously seen elements, the question at each time step becomes: given all the characters seen so far, what is the next character?
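
To make "most probable next character" concrete, here is a minimal sketch that turns a vector of scores into a probability distribution with a softmax and picks the highest-probability character. The four-character vocabulary and the scores are made up for illustration; a trained model would produce one score per entry of the real vocabulary.

```python
import numpy as np

# Hypothetical vocabulary and made-up scores (logits), not model output.
vocab = np.array(['\n', 'a', 'b', 'c'])
logits = np.array([0.1, 2.0, 0.5, -1.0])

# Softmax: subtract the max for numerical stability, then normalize.
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

# The most probable next character has the highest probability.
next_char = vocab[np.argmax(probs)]
print(repr(next_char), probs.round(3))
```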

### Create training examples and targets
Next, divide the text into example sequences. Each input sequence contains `seq_length` characters from the text.
For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.



In [10]:
def read_dinousar_names(filename):
    with open(filename) as f:
        dinousar_names = f.readlines()
        dinousar_names = [x.lower().strip() for x in dinousar_names ]
    return dinousar_names
    
   

In [11]:
dinousar_names=read_dinousar_names('data/dinos.txt')
dinousar_names[:5]

['aachenosaurus', 'aardonyx', 'abdallahsaurus', 'abelisaurus', 'abrictosaurus']

Shuffle the examples so that the names are no longer in alphabetical order.




In [12]:
# Shuffle list of all dinosaur names
np.random.seed(0)
np.random.shuffle(dinousar_names)

In [13]:
dinousar_names[:5]

['turiasaurus',
 'pandoravenator',
 'ilokelesia',
 'chubutisaurus',
 'quaesitosaurus']

For each name, duplicate and shift the sequence to form the input and target. The input `x` is the name's character indices with `None` prepended as the first element, and the target `y` is `x` from index 1 onward with the newline index appended. Prepending `None` ensures that the inputs and targets have the same length.


In [25]:
def create_x_y_dataset(dinousar_names):
    X=[]
    Y=[]
    for name in dinousar_names:
        x=[None] + [char_to_ix[ch] for ch in name]
        y= x[1:] + [char_to_ix['\n']]
        X.append(x)
        Y.append(y)
    return X,Y

In [28]:
X,Y= create_x_y_dataset(dinousar_names)
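
To see what this construction produces, the sketch below applies the same steps to a single short name. It assumes a vocabulary of newline plus a-z indexed in sorted order, which matches the 27 characters reported above; the name `'abe'` and the `demo_*` identifiers are hypothetical.

```python
import string

# Assumed vocabulary: newline plus a-z in sorted order
# ('\n' sorts before the letters, so it gets index 0).
demo_chars = sorted('\n' + string.ascii_lowercase)
demo_char_to_ix = {ch: i for i, ch in enumerate(demo_chars)}

name = 'abe'  # hypothetical short name for illustration

# Input: None followed by the character indices of the name.
x = [None] + [demo_char_to_ix[ch] for ch in name]
# Target: the same indices shifted by one, ending with the newline index.
y = x[1:] + [demo_char_to_ix['\n']]

print(x)
print(y)
```

Here `x` is `[None, 1, 2, 5]` and `y` is `[1, 2, 5, 0]`: at each position the target is the character that follows the input, and the final target is the newline that ends the name.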