### Step 4: Dataset preparation

#### (1) Import packages

In [1]:
import os
import pickle

import numpy as np
import requests

#### (2) Download the Shakespeare dataset

In [2]:
input_file_path = '../data/shakespeare_char/input.txt'
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

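if not os.path.exists(input_file_path):
    # Create the target directory first; open() raises FileNotFoundError if it is missing
    os.makedirs(os.path.dirname(input_file_path), exist_ok=True)
    response = requests.get(data_url)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(input_file_path, 'w') as f:
        f.write(response.text)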

#### (3) Print the dataset size and the first 1,000 characters

In [3]:
with open(input_file_path, 'r') as f:
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")
print(data[:1000])

length of dataset in characters: 1,115,394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hun

#### (4) Build the character-level training and validation sets

In [4]:
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

chars = sorted(list(set(data)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

# Lookup tables: character -> integer id (stoi) and integer id -> character (itos)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # Encoder: take a string, output a list of integers
    return [stoi[c] for c in s]

train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# Save the token ids to train.bin and val.bin as raw uint16 arrays
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile('../data/shakespeare_char/train.bin')
val_ids.tofile('../data/shakespeare_char/val.bin')

# Metadata needed later to map between characters and token ids
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}

with open('../data/shakespeare_char/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)

all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
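The cell above writes raw `uint16` token ids and a pickled `meta` dictionary, but it does not read them back. The following is a minimal sketch of how such files could be reloaded and turned into text again. It is an illustration under assumptions, not part of the original notebook: the `decode` helper, the short `sample` string, and the temporary file names are all hypothetical, and a temporary directory is used so that the real files under `../data/shakespeare_char/` are not touched.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical short sample standing in for the full Shakespeare text.
sample = "First Citizen:\nSpeak, speak."

# Same vocabulary construction as in the cell above.
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # Encoder: take a string, output a list of integers
    return [stoi[c] for c in s]

def decode(ids, itos):
    # Hypothetical inverse of encode: take integer ids, output a string
    return ''.join(itos[int(i)] for i in ids)

with tempfile.TemporaryDirectory() as tmp_dir:
    bin_path = os.path.join(tmp_dir, 'sample.bin')    # assumed file name
    meta_path = os.path.join(tmp_dir, 'meta.pkl')     # assumed file name

    # Write the ids and the metadata the same way the notebook does.
    np.array(encode(sample), dtype=np.uint16).tofile(bin_path)
    with open(meta_path, 'wb') as f:
        pickle.dump({'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}, f)

    # tofile() stores no header, so the dtype must be given again on load.
    loaded_ids = np.fromfile(bin_path, dtype=np.uint16)
    with open(meta_path, 'rb') as f:
        loaded_meta = pickle.load(f)

restored = decode(loaded_ids, loaded_meta['itos'])
print(restored == sample)  # prints True
```

Because `tofile` writes only the raw bytes, the reader has to know the dtype in advance; loading with a different dtype would silently produce wrong ids. With a vocabulary of 65 characters, every id fits comfortably in `uint16`.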


