
In [1]:
import numpy as np
import os
import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x only; eager execution is on by default in TF 2.x

Download the pre-processed WikiText-2 data

In [2]:
!test -f wikitext-2-raw-v1.zip || wget -q https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
DATA_FOLDER = 'wikitext-2-raw'
!test -d $DATA_FOLDER || unzip wikitext-2-raw-v1.zip

Archive:  wikitext-2-raw-v1.zip
   creating: wikitext-2-raw/
  inflating: wikitext-2-raw/wiki.test.raw  
  inflating: wikitext-2-raw/wiki.valid.raw  
  inflating: wikitext-2-raw/wiki.train.raw  


Read the training file and decode its bytes as UTF-8

In [0]:
with open('{}/wiki.train.raw'.format(DATA_FOLDER), 'rb') as f:
  text = f.read().decode(encoding='utf8')

Examples of data

In [4]:
vocab = sorted(set(text))
print('Length of text:', len(text))
print('Number of unique characters:', len(vocab))
print('First 3000 characters:')
print(text[:3000])

Length of text: 10918892
Number of unique characters: 1013
First 3000 characters:
 
 = Valkyria Chronicles III = 
 
 Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . 
 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underw

### Vectorize the text

Encode each character in the text as a unique integer index.

In [0]:
# Create mapping from unique characters to indices
char2idx = {char : idx for idx, char in enumerate(vocab)}

# Create mapping from indices to characters
idx2char = np.array(vocab)

In [6]:
# Vectorize the text
text_as_int = np.array([char2idx[char] for char in text])

# Take a look at the first 100 characters after vectorizing
text_as_int[:100]

array([  1,   0,   1,  30,   1,  55,  66,  77,  76,  90,  83,  74,  66,
         1,  36,  73,  83,  80,  79,  74,  68,  77,  70,  84,   1,  42,
        42,  42,   1,  30,   1,   0,   1,   0,   1,  52,  70,  79,  75,
       183,   1,  79,  80,   1,  55,  66,  77,  76,  90,  83,  74,  66,
         1,  20,   1,  27,   1,  54,  79,  83,  70,  68,  80,  83,  69,
        70,  69,   1,  36,  73,  83,  80,  79,  74,  68,  77,  70,  84,
         1,   9,   1,  43,  66,  81,  66,  79,  70,  84,  70,   1,  27,
         1, 810, 758, 617, 693, 637, 689, 648, 685])
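
As a quick sanity check of the mapping idea, here is a minimal sketch on a toy string rather than the real corpus. The names `toy_text`, `toy_vocab`, `toy_char2idx` and `toy_idx2char` are hypothetical and only mirror `text`, `vocab`, `char2idx` and `idx2char` above. Encoding and then decoding should give back the original string.

```python
# Toy text standing in for the WikiText-2 training text (hypothetical example)
toy_text = 'valkyria'

# Same construction as above: sorted unique characters define the indices
toy_vocab = sorted(set(toy_text))
toy_char2idx = {char: idx for idx, char in enumerate(toy_vocab)}
toy_idx2char = toy_vocab  # a plain list is enough for index -> character lookup

# Encode to indices, then decode back to characters
encoded = [toy_char2idx[char] for char in toy_text]
decoded = ''.join(toy_idx2char[idx] for idx in encoded)

print(encoded)
print(decoded)
```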

#### Create target

Each sequence of `seq_length` characters holds both an input sequence and its target sequence.
- Each input sequence contains `inp_seq_length` characters from the text.
- Its target has the same number of characters, shifted one position to the right in the text.

For example, the sequence `Cinnamon` is split into `Cinnamo` (input) and `innamon` (target).
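
The `Cinnamon` example can be sketched in plain Python. The helper name `split_input_target` is hypothetical. Slicing works the same way on a string as on the integer tensors used in this notebook.

```python
def split_input_target(seq):
    # Input drops the last element; target drops the first one
    return seq[:-1], seq[1:]

inp, target = split_input_target('Cinnamon')
print('Input: ', inp)
print('Target:', target)

# At every position, the target is the character that follows the input character
for inp_char, target_char in zip(inp, target):
    print(repr(inp_char), '->', repr(target_char))
```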

In [7]:
# Number of characters in each input sequence
inp_seq_length = 30

# Each sequence holds an input plus one extra character, so the target is the input shifted by one
seq_length = inp_seq_length + 1

# Convert the text vector to a stream of character indices
char_stream = tf.data.Dataset.from_tensor_slices(text_as_int)

# Group the stream into sequences of seq_length characters, dropping the incomplete last one
seqs = char_stream.batch(seq_length, drop_remainder=True)

# Take a look at the first 2 sequences
for seq in seqs.take(2):
  print('Original sequence:\n', repr(''.join(idx2char[seq.numpy()])))
  print('Vectorized:\n', seq.numpy())
  

Original sequence:
 ' \n = Valkyria Chronicles III = '
Vectorized:
 [ 1  0  1 30  1 55 66 77 76 90 83 74 66  1 36 73 83 80 79 74 68 77 70 84
  1 42 42 42  1 30  1]
Original sequence:
 '\n \n Senjō no Valkyria 3 : Unrec'
Vectorized:
 [  0   1   0   1  52  70  79  75 183   1  79  80   1  55  66  77  76  90
  83  74  66   1  20   1  27   1  54  79  83  70  68]
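
To illustrate what batching with `drop_remainder=True` does to the character stream, here is a plain-Python sketch on a short string. The helper `chunk` is hypothetical and only imitates the grouping behaviour; it is not how `tf.data` is implemented.

```python
def chunk(items, size):
    # Keep only full chunks, as batch(size, drop_remainder=True) does
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

toy = 'Valkyria Chronicles'  # 19 characters
chunks = chunk(toy, 4)
print(chunks)  # the trailing 3 characters do not fill a chunk and are dropped
```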


Split each sequence into an input sequence and a target sequence.

In [8]:
input_target_split = lambda seq : (seq[:-1], seq[1:])
train_data = seqs.map(input_target_split)

# Take a look at first input and target sequence
for inp_seq, target_seq in train_data.take(1):
  print('Input seq:', repr(''.join(idx2char[inp_seq.numpy()])))
  print('Target seq:', repr(''.join(idx2char[target_seq.numpy()])))

Input seq: ' \n = Valkyria Chronicles III ='
Target seq: '\n = Valkyria Chronicles III = '


#### Create training batches

In [9]:
BATCH_SIZE = 64
BUFFER_SIZE = 5000

# Shuffle data and pack it to batches
batch_data = train_data.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
batch_data

<DatasetV1Adapter shapes: ((64, 30), (64, 30)), types: (tf.int64, tf.int64)>
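
As a rough check on the dataset size, the number of sequences and of batches per epoch can be worked out from the text length printed earlier. This is only arithmetic under the settings above (`seq_length = 31`, `BATCH_SIZE = 64`, incomplete groups dropped). The names `num_sequences` and `steps_per_epoch` are introduced here for illustration.

```python
text_length = 10918892  # "Length of text" printed earlier
seq_length = 31         # inp_seq_length + 1
batch_size = 64

# Both batching steps drop the incomplete remainder, hence integer division
num_sequences = text_length // seq_length
steps_per_epoch = num_sequences // batch_size

print('Sequences:', num_sequences)
print('Batches per epoch:', steps_per_epoch)
```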

### Build the Model

In [10]:
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

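def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  return tf.keras.Sequential([
      # Map character indices to dense vectors
      tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                batch_input_shape=[batch_size, None]),
      # Stateful GRU: the hidden state carries over between calls until reset_states()
      tf.keras.layers.GRU(rnn_units, return_sequences=True,
                          stateful=True, recurrent_initializer='glorot_uniform'),
      # One logit per vocabulary entry at every time step
      tf.keras.layers.Dense(vocab_size)
  ])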

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

# Loss: sparse categorical cross-entropy on raw logits (the Dense layer applies no softmax)
loss = lambda labels, logits : tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Compile with the Adam optimizer
model.compile(optimizer='adam', loss=loss)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           259328    
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3935232   
_________________________________________________________________
dense (Dense)                (64, None, 1013)          1038325   
Total params: 5,232,885
Trainable params: 5,232,885
Non-trainable params: 0
_________________________________________________________________
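
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU has three gates, each with input weights, recurrent weights and a single bias vector. That assumption matches the count shown here, but other GRU variants (for example, one with two bias vectors per gate) would give a different number.

```python
vocab_size, embedding_dim, rnn_units = 1013, 256, 1024

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# GRU (assumption: 3 gates, each with input weights, recurrent weights, one bias)
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: weight matrix plus one bias per output unit
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```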


Try the untrained model on the first batch. The output should have shape `(batch_size, inp_seq_length, vocab_size)`.

In [11]:
for inp_seq, target_seq in batch_data.take(1):
  pred_seq = model(inp_seq)
  print('Prediction shape:', pred_seq.shape)

  sampled_indices = tf.random.categorical(pred_seq[0], num_samples=1)
  sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
  
  print('--- Sample ---')
  print('Input seq:\n', repr(''.join(idx2char[inp_seq[0].numpy()])))
  print('Next character predicted:\n', repr(''.join(idx2char[sampled_indices])))
  print('Loss:', loss(target_seq, pred_seq).numpy().mean())

Prediction shape: (64, 30, 1013)
--- Sample ---
Input seq:
 'undergone improvements over th'
Next character predicted:
 '應თp州Ľ其巳و?თłרֵ昌θ劇ႣŚベ岳ჭ平yდÉ贵母U拉χ'
Loss: 6.9208417
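
The initial loss is close to what a uniform guess would give. A model that assigns equal probability to each of the 1013 characters has a cross-entropy of ln(1013), and a freshly initialized network is roughly in that state. A small check, with the variable names introduced here for illustration:

```python
import math

vocab_size = 1013
observed_loss = 6.9208417  # loss printed above for the untrained model

# Cross-entropy of a uniform distribution over the vocabulary
uniform_loss = math.log(vocab_size)

print('Uniform-guess loss:', uniform_loss)
print('Difference from observed:', abs(uniform_loss - observed_loss))
```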


### Training

Configure checkpoints

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                                          save_weights_only=True)
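
For illustration, this is how the `{epoch}` placeholder in `checkpoint_prefix` expands into one path per epoch. The sketch assumes the callback fills the placeholder with the 1-based epoch number, so four epochs would give `ckpt_1` through `ckpt_4`. It only formats strings and does not touch the file system.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

# Assumption: {epoch} is replaced by the 1-based epoch number
paths = [checkpoint_prefix.format(epoch=epoch) for epoch in range(1, 5)]
for path in paths:
    print(path)
```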

Train and save checkpoints

In [13]:
history = model.fit(batch_data, epochs=4, callbacks=[checkpoint_callback])

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Prediction

Load the weights from the latest checkpoint for prediction.

In [14]:
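# Path to the most recent checkpoint
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)

# Rebuild the model with batch size 1 for generation and restore the trained weights
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(latest_checkpoint)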

model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            259328    
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3935232   
_________________________________________________________________
dense_1 (Dense)              (1, None, 1013)           1038325   
Total params: 5,232,885
Trainable params: 5,232,885
Non-trainable params: 0
_________________________________________________________________


In [0]:
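def generate_text(model, start_string, num_characters=1000):
  # Vectorize the start string and add a batch dimension: shape (1, len(start_string))
  input_seq = [char2idx[c] for c in start_string]
  input_seq = tf.expand_dims(input_seq, 0)

  text_generated = []
  model.reset_states()    # Reset the RNN state before generating
  temperature = 0.5       # Hyperparameter: lower ~ more predictable, higher ~ more surprising

  for _ in range(num_characters):
    # Logits for every input position: shape (1, seq_len, vocab_size)
    logits = model(input_seq)

    # Drop the batch dimension and scale the logits by the temperature
    logits = tf.squeeze(logits, 0) / temperature
    # Sample one index per position and keep the sample for the last position
    next_char_id = tf.random.categorical(logits, num_samples=1)[-1, 0].numpy()

    # Feed the sampled character back in; the stateful GRU keeps the earlier context
    input_seq = tf.expand_dims([next_char_id], 0)
    text_generated.append(idx2char[next_char_id])

  return start_string + ''.join(text_generated)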
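
To see why a lower temperature gives more predictable text, the sketch below applies a softmax to temperature-scaled logits in plain Python. Sampling with `tf.random.categorical` from scaled logits corresponds to sampling from this distribution. The logit values and the helper name `softmax_with_temperature` are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical logits for three characters

probs_by_temperature = {
    t: softmax_with_temperature(toy_logits, t) for t in (0.5, 1.0, 2.0)
}
for t, probs in probs_by_temperature.items():
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.5 most of the probability mass sits on the highest logit, while at 2.0 the distribution is flatter, so sampled characters vary more.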

In [18]:
print(generate_text(model, u'= = Music = =', num_characters=500))
print()
print(generate_text(model, u'= = Game = =', num_characters=500))

= = Music = = = 
 
 The star has been repeated by the star with the proportion of the comparison to the right to be considered as both the star 's life of the series , in the Canning mass of the common starling ( 1858 – 1816 ) , the town , completed as a star , in the communal star , the regions of drivers such as the city of the common starling was constant energy to create the end of the basket , was the first ment in the same time of the series , but the second time is the first through the Canning River 

= = Game = = 
 
 The parasides of the large starling of the composer , the star was the song to the south . The fourth star , and the first measurement of the time , as the world is not a six types of stars in the planets that the most star contracts , with white luminosity of the end of the way to be found to be a thing as a subspecies of red seven for its original period . 
 
 = = = = = Life = = 
 
 The composer of the first outside to the first time of the leader to sign his ne