
In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd

## Data gathering

In [2]:
!wget https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

--2020-09-24 14:09:22--  https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.167.128, 74.125.133.128, 74.125.140.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.167.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘shakespeare.txt.1’


2020-09-24 14:09:22 (108 MB/s) - ‘shakespeare.txt.1’ saved [1115394/1115394]



In [3]:
# Read the raw bytes and decode them; the context manager closes the file afterwards.
with open("./shakespeare.txt", "rb") as f:
  text = f.read().decode("utf-8")

Inspect the raw text first. The following steps build a vocabulary and map each character in the original document to a unique integer.

In [4]:
print("Number of characters:", len(text))
print("-" * 50)
print(text[:300])

Number of characters: 1115394
--------------------------------------------------
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


Create a vocabulary of all unique characters, with lookup tables in both directions (integer to character and character to integer).

In [5]:
vocab = sorted(set(text))
ids_to_char = np.array(vocab)
char_to_ids = {char: char_idx for char_idx, char in enumerate(vocab)}
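To see what these lookup tables contain, here is a minimal sketch on a toy string. The `toy_*` names are illustrative and not part of the notebook. Sorting the set of characters gives each one a stable index, and indexing back into the sorted list inverts the mapping.

```python
# Toy illustration of the two lookup tables (hypothetical names, toy data).
toy_text = "hello world"

toy_vocab = sorted(set(toy_text))      # unique characters in a stable order
toy_ids_to_char = toy_vocab            # position in the list -> character
toy_char_to_ids = {ch: i for i, ch in enumerate(toy_vocab)}  # character -> position

# Every character should map to an index and back to itself.
round_trip_ok = all(toy_ids_to_char[toy_char_to_ids[ch]] == ch for ch in toy_vocab)

print(toy_vocab)
print(toy_char_to_ids["h"], round_trip_ok)
```

Because the vocabulary is sorted, the mapping is deterministic across runs, which matters if ids are ever saved and decoded later.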

Convert each character in the text to its corresponding integer.

In [6]:
text_ids = np.array([char_to_ids[char] for char in text])
print(text_ids[:300])

[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1 39 56 43  1 39 50 50  1 56 43 57 53 50 60 43 42  1 56 39
 58 46 43 56  1 58 53  1 42 47 43  1 58 46 39 52  1 58 53  1 44 39 51 47
 57 46 12  0  0 13 50 50 10  0 30 43 57 53 50 60 43 42  8  1 56 43 57 53
 50 60 43 42  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 18 47
 56 57 58  6  1 63 53 59  1 49 52 53 61  1 15 39 47 59 57  1 25 39 56 41
 47 59 57  1 47 57  1 41 46 47 43 44  1 43 52 43 51 63  1 58 53  1 58 46
 43  1 54 43 53 54 50 43  8  0  0 13 50 50 10  0 35 43  1 49 52 53 61  5
 58  6  1 61 43  1 49 52 53 61  5 58  8  0  0 18 47 56 57 58  1 15 47 58
 47 64 43 52 10  0 24 43 58  1 59 57]
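As a sanity check, the integers can be decoded back into text. The sketch below is self-contained, so it rebuilds the vocabulary by hand: the 65-character `assumed_vocab` (newline, space, eleven punctuation or digit characters, then uppercase and lowercase letters) is an assumption that is consistent with the printed ids, not something the notebook prints. In the notebook itself, `"".join(ids_to_char[text_ids[:14]])` should give the same result from the real tables.

```python
import string

# Assumed vocabulary, sorted by code point: 13 symbols + 26 uppercase + 26 lowercase.
assumed_vocab = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)

# The first 14 ids of the printed array.
first_ids = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10]

decoded = "".join(assumed_vocab[i] for i in first_ids)
print(repr(decoded))
```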


## Data preparation

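Split the text into chunks of `SEQ_LEN + 1` characters each.

As the task is to predict the next character, we define the input and the target for each chunk as:
- Input: every character of the chunk except the last (`SEQ_LEN` characters).
- Target: every character of the chunk except the first, i.e. the input shifted one position ahead, so that `target[i]` is the character that follows `input[i]`.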

In [7]:
def train_target_split(chunk):
  input_chunk = chunk[:-1]
  target_chunk = chunk[1:]
  return input_chunk, target_chunk
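The shift is easiest to see on a short string. A minimal sketch using plain Python slicing on a toy chunk; `toy_split` is an illustrative stand-in that mirrors the function above, which applies the same slicing to tensors of ids.

```python
def toy_split(chunk):
    # Same slicing as the notebook's function, applied to a plain string.
    return chunk[:-1], chunk[1:]

toy_chunk = "Hello"
toy_input, toy_target = toy_split(toy_chunk)

# Each input character is paired with the character that follows it.
pairs = list(zip(toy_input, toy_target))

print(toy_input, toy_target)
print(pairs)
```

A chunk of length 5 yields 4 input/target pairs, which is why the chunks are cut one character longer than the sequence length.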

In [8]:
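SEQ_LEN = 64
BUFFER_SIZE = 10000
# Keep the shuffle order fixed across epochs so the take/skip split below stays disjoint.
samples = tf.data.Dataset.from_tensor_slices(text_ids).batch(SEQ_LEN + 1, drop_remainder=True).map(train_target_split)
samples = samples.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)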
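For intuition about `batch(SEQ_LEN + 1, drop_remainder=True)`, here is a plain-Python sketch of the same chunking. `chunk_ids` is a hypothetical helper, not a TensorFlow function, and `toy_ids` stands in for `text_ids`.

```python
def chunk_ids(ids, chunk_len):
    """Cut ids into consecutive chunks of chunk_len, dropping a shorter final chunk."""
    num_chunks = len(ids) // chunk_len
    return [ids[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]

SEQ_LEN = 64
toy_ids = list(range(200))               # stand-in for text_ids
chunks = chunk_ids(toy_ids, SEQ_LEN + 1)
print(len(chunks), len(chunks[0]))

# The same arithmetic for the full text (1115394 characters, as printed earlier).
num_chunks_full = 1115394 // (SEQ_LEN + 1)
leftover = 1115394 % (SEQ_LEN + 1)
print(num_chunks_full, leftover)
```

Under these numbers the full text yields 17159 chunks, and the last 59 characters are dropped.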

**Split into training and validation sets, then batchify data**

In [9]:
BATCH_SIZE = 64

test_size = 0.2
num_samples = sum(1 for _ in samples)
num_train_samples = int((1 - test_size) * num_samples)  # the remaining test_size fraction is used for validation
ds_train = samples.take(num_train_samples).batch(BATCH_SIZE)
ds_val = samples.skip(num_train_samples).batch(BATCH_SIZE)
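The split sizes can be worked out by hand. The sketch below assumes 17159 samples (1115394 characters cut into chunks of 65) and that the intended split is 80% training and 20% validation; the `num_batches` helper is illustrative.

```python
import math

num_samples = 1115394 // 65      # chunks of SEQ_LEN + 1 = 65 ids
test_size = 0.2
BATCH_SIZE = 64

num_train = int((1 - test_size) * num_samples)
num_val = num_samples - num_train

def num_batches(n, batch_size):
    # batch() without drop_remainder keeps a final, smaller batch.
    return math.ceil(n / batch_size)

print(num_train, num_val)
print(num_batches(num_train, BATCH_SIZE), num_batches(num_val, BATCH_SIZE))
```

With these assumptions there are 13727 training samples in 215 batches and 3432 validation samples in 54 batches, so the number of steps per epoch shown during training should be 215.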

## Model architecture

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional

In [11]:
EMBED_DIM = 64
VOCAB_SIZE = len(vocab)

In [12]:
model = Sequential()
model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM))
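# Caveat: the backward LSTM reads the characters that follow each position, including the target itself; a unidirectional LSTM avoids this when generating text.
model.add(Bidirectional(LSTM(32, return_sequences=True)))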
model.add(Dense(VOCAB_SIZE, activation="softmax"))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          4160      
_________________________________________________________________
bidirectional (Bidirectional (None, None, 64)          24832     
_________________________________________________________________
dense (Dense)                (None, None, 65)          4225      
Total params: 33,217
Trainable params: 33,217
Non-trainable params: 0
_________________________________________________________________
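The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer sizes. The sketch uses the standard formulas: an embedding has one vector per vocabulary entry, an LSTM has four gates each with input, recurrent, and bias weights, and a bidirectional wrapper doubles the LSTM count and concatenates the two outputs.

```python
VOCAB_SIZE, EMBED_DIM, UNITS = 65, 64, 32

# Embedding: one EMBED_DIM-sized vector per character.
embedding_params = VOCAB_SIZE * EMBED_DIM

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (UNITS * (EMBED_DIM + UNITS) + UNITS)
bidirectional_params = 2 * lstm_params

# Dense: the input is the concatenation of both directions (2 * UNITS), plus biases.
dense_params = (2 * UNITS) * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

The results (4160, 24832, 4225, and 33217 in total) match the summary above.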


In [13]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["acc"])
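`sparse_categorical_crossentropy` takes integer targets directly and, per position, reduces to the negative log of the probability the model assigns to the correct character. A minimal pure-Python sketch of that per-position loss follows; `sparse_cce` is a hypothetical helper, not the Keras function, and it ignores the numerical clipping a real implementation applies.

```python
import math

def sparse_cce(probs, target_id):
    """Negative log-probability of the correct class for one position."""
    return -math.log(probs[target_id])

VOCAB_SIZE = 65
target_id = 18  # id of 'F' in the printed array

# A model that knows nothing spreads probability uniformly over the vocabulary.
uniform = [1.0 / VOCAB_SIZE] * VOCAB_SIZE
baseline_loss = sparse_cce(uniform, target_id)

# A model that puts 0.99 on the correct character gets a much smaller loss.
confident = [0.01 / (VOCAB_SIZE - 1)] * VOCAB_SIZE
confident[target_id] = 0.99
confident_loss = sparse_cce(confident, target_id)

print(round(baseline_loss, 3), round(confident_loss, 3))
```

The uniform baseline, ln(65) ≈ 4.17, is a handy reference point: a training loss near that value means the model has not yet learned anything about the text.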

In [14]:
history = model.fit(ds_train, epochs=10, validation_data=ds_val)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
