As always, we will first import the necessary libraries and set up some constants. Here the DATA_DIR points to a data folder under the location where you downloaded the source code for this chapter. The CHECKPOINT_DIR is the location, a folder checkpoints under the data folder, where we will save the weights of the model at the end of every 10 epochs:

In [18]:
import os
import numpy as np
import re
import shutil
import tensorflow as tf
DATA_DIR = "./data"
CHECKPOINT_DIR = os.path.join(DATA_DIR, "checkpoints")

Next, we download and prepare the data for our network. The download_and_read() function fetches each URL with tf.keras.utils.get_file(), which caches the file locally, then removes the byte order mark, replaces newlines and runs of whitespace with single spaces, and appends the result to texts. Because extend() is called with a string, texts ends up as one flat list of characters covering both files:


In [10]:
def download_and_read(urls):
   texts = []
   for i, url in enumerate(urls):
       p = tf.keras.utils.get_file("ex1-{:d}.txt".format(i), url,
           cache_dir=".")
       # read as UTF-8 and close the file when done
       with open(p, "r", encoding="utf-8") as f:
           text = f.read()
       # remove byte order mark
       text = text.replace("\ufeff", "")
       # remove newlines
       text = text.replace('\n', ' ')
       text = re.sub(r'\s+', " ", text)
       # add it to the list
       texts.extend(text)
   return texts
texts = download_and_read([
   "http://www.gutenberg.org/cache/epub/28885/pg28885.txt",
   "https://www.gutenberg.org/files/12/12-0.txt"
])
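To see what the clean-up steps do without touching the network, here is a minimal sketch that applies the same three replacements to an in-memory string. The clean_text() helper and the sample string are hypothetical and only illustrate the idea; they are not part of the chapter's code.

```python
import re

def clean_text(raw):
    """Apply the same clean-up steps as download_and_read() to one string."""
    text = raw.replace("\ufeff", "")   # drop the byte order mark
    text = text.replace("\n", " ")     # newlines become spaces
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text

# hypothetical sample standing in for a downloaded file
sample = "\ufeffDown the\nRabbit-Hole\r\n\r\nAlice was   beginning"
cleaned = clean_text(sample)

chars = []
chars.extend(cleaned)   # extend() with a string adds one element per character

print(cleaned)
print(len(chars), chars[:4])
```

Note that extend() is what turns the text into a list of single characters; append() would have added the whole string as one element.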

Next, we will create our vocabulary. In this run, our vocabulary contains 93 unique characters, composed of uppercase and lowercase letters, digits, and special characters; the exact count depends on the version of the texts that is downloaded. We also create some mapping dictionaries to convert each vocabulary character to a unique integer and vice versa. As noted earlier, the input and output of the network are sequences of characters. However, the actual input and output of the network are sequences of integers, and we will use these mapping dictionaries to handle this conversion:

In [11]:
# create the vocabulary
vocab = sorted(set(texts))
print("vocab size: {:d}".format(len(vocab)))
# create mapping from vocab chars to ints
char2idx = {c:i for i, c in enumerate(vocab)}
idx2char = {i:c for c, i in char2idx.items()}

vocab size: 93
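The two dictionaries are inverses of each other, which a small round trip makes easy to check. The sketch below uses a toy string in place of texts; all the toy_ names are hypothetical.

```python
# hypothetical stand-in for the texts list of characters
toy_text = list("hello world")

toy_vocab = sorted(set(toy_text))
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}

# characters -> integers -> characters
encoded = [toy_char2idx[c] for c in toy_text]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded)
```

Because the vocabulary is sorted before the indices are assigned, the mapping is the same on every run over the same text.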


The next step is to use these mapping dictionaries to convert our character sequence input into an integer sequence, and then into a TensorFlow dataset. Each of our sequences is going to be 100 characters long, with the output being offset from the input by 1 character position. We first batch the dataset into slices of 101 characters, then apply the split_train_labels() function to every element of the dataset to create our sequences dataset, which is a dataset of tuples of two elements, each element of the tuple being a vector of size 100 and type tf.int64. We then shuffle these sequences and then create batches of 64 tuples each for input to our network. Each element of the dataset is now a tuple consisting of a pair of matrices, each of size (64, 100) and type tf.int64:

In [12]:
# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)
# number of characters to show before asking for prediction
# sequences: [None, 100]
seq_length = 100
sequences = data.batch(seq_length + 1, drop_remainder=True)
def split_train_labels(sequence):
   input_seq = sequence[0:-1]
   output_seq = sequence[1:]
   return input_seq, output_seq
sequences = sequences.map(split_train_labels)
# set up for training
# batches: [None, 64, 100]
batch_size = 64
steps_per_epoch = len(texts) // seq_length // batch_size
dataset = sequences.shuffle(10000).batch(
    batch_size, drop_remainder=True)
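The offset between input and target is easier to see on a tiny sequence. The following sketch mimics the batch-then-split step in plain Python, with a sequence length of 4 instead of 100; make_windows() is a hypothetical helper, not a TensorFlow function.

```python
def make_windows(ids, seq_length):
    """Chunk ids into (input, target) pairs offset by one position."""
    chunk = seq_length + 1
    pairs = []
    # like batch(seq_length + 1, drop_remainder=True): the short tail is skipped
    for start in range(0, len(ids) - chunk + 1, chunk):
        window = ids[start:start + chunk]
        # same slicing as split_train_labels()
        pairs.append((window[:-1], window[1:]))
    return pairs

toy_ids = list(range(11))
pairs = make_windows(toy_ids, seq_length=4)
for input_seq, output_seq in pairs:
    print(input_seq, "->", output_seq)
```

Each target is the input shifted left by one position, so at every time step the network is asked to predict the character that follows. The last element of toy_ids does not fill a complete window and is dropped.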

We are now ready to define our network. As before, we define our network as a subclass of tf.keras.Model as shown next. The network is fairly simple; it takes as input a sequence of integers of size 100 (num_timesteps) and passes them through an Embedding layer so that each integer in the sequence is converted to a vector of size 256 (embedding_dim). So, assuming a batch size of 64, for our input sequence of size (64, 100), the output of the Embedding layer is a matrix of shape (64, 100, 256).

The next layer is the RNN layer with 100 time steps. The implementation of RNN chosen is a GRU. This GRU layer will take, at each of its time steps, a vector of size (256,) and output a vector of shape (1024,) (rnn_output_dim). Note also that the RNN is stateful, which means that the final hidden state computed for each sequence in one batch is used as the initial state for the sequence at the same position in the next batch, rather than being reset to zeros. The return_sequences=True flag also indicates that the RNN will produce an output at each time step rather than a single output at the last time step.

Finally, each of the time steps will emit a vector of shape (1024,) into a Dense layer that outputs a vector of shape (93,) (vocab_size). The output from this layer will be a tensor of shape (64, 100, 93). Each position in the output vector corresponds to a character in our vocabulary. Since the Dense layer has no softmax activation, the values are unnormalized scores (logits); a higher score means the character is more likely to occur at that output position:

In [13]:
class CharGenModel(tf.keras.Model):
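   def __init__(self, vocab_size, num_timesteps,
           embedding_dim, rnn_output_dim=1024, **kwargs):
       super(CharGenModel, self).__init__(**kwargs)
       self.embedding_layer = tf.keras.layers.Embedding(
           vocab_size,
           embedding_dim
       )
       # the GRU width is rnn_output_dim; the number of time steps
       # comes from the input shape, not from the layer's units
       self.rnn_layer = tf.keras.layers.GRU(
           rnn_output_dim,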
           recurrent_initializer="glorot_uniform",
           recurrent_activation="sigmoid",
           stateful=True,
           return_sequences=True)
       self.dense_layer = tf.keras.layers.Dense(vocab_size)
   def call(self, x):
       x = self.embedding_layer(x)
       x = self.rnn_layer(x)
       x = self.dense_layer(x)
       return x
vocab_size = len(vocab)
embedding_dim = 256

model = CharGenModel(vocab_size, seq_length, embedding_dim)
model.build(input_shape=(batch_size, seq_length))
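As a rough check on the shapes described above, the sketch below traces the output shape of each layer and estimates its parameter count in plain Python. It uses the sizes quoted in the text (embedding size 256, GRU size 1024) and the vocabulary size of 93 printed earlier. The layer_summary() helper is hypothetical, and the GRU count assumes the Keras defaults of use_bias=True and reset_after=True.

```python
def layer_summary(batch, timesteps, vocab_size, embedding_dim, rnn_units):
    """Return {layer: (output shape, parameter count)} for the three layers."""
    # one embedding vector per vocabulary entry
    embedding_params = vocab_size * embedding_dim
    # GRU: three weight blocks (update gate, reset gate, candidate state),
    # each with input weights, recurrent weights and two bias vectors
    # (assumption: Keras defaults use_bias=True, reset_after=True)
    gru_params = 3 * (embedding_dim * rnn_units
                      + rnn_units * rnn_units
                      + 2 * rnn_units)
    # Dense: weight matrix plus bias
    dense_params = rnn_units * vocab_size + vocab_size
    return {
        "embedding": ((batch, timesteps, embedding_dim), embedding_params),
        "gru": ((batch, timesteps, rnn_units), gru_params),
        "dense": ((batch, timesteps, vocab_size), dense_params),
    }

summary = layer_summary(batch=64, timesteps=100, vocab_size=93,
                        embedding_dim=256, rnn_units=1024)
for name, (shape, params) in summary.items():
    print(name, shape, params)
```

Neither the batch size nor the number of time steps appears in any parameter count, which is why the same weights can later be loaded into a model built with a batch size of 1.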



Next we define a loss function and compile our model. We will use the sparse categorical cross-entropy as our loss function because our labels are integer character indices rather than one-hot vectors. We pass from_logits=True because the Dense layer has no softmax activation, so the model outputs raw scores. For the optimizer, we will choose the Adam optimizer:

In [14]:
def loss(labels, predictions):
   return tf.losses.sparse_categorical_crossentropy(
       labels,
       predictions,
       from_logits=True
   )
model.compile(optimizer=tf.optimizers.Adam(), loss=loss)
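For a single output position, the loss computed from logits is the negative log of the softmax probability assigned to the correct character. The sketch below works this out in plain Python for one position; sparse_ce_from_logits() is a hypothetical helper written for illustration, not the TensorFlow implementation.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# an untrained model that scores all 93 characters equally
uniform_logits = [0.0] * 93
uniform_loss = sparse_ce_from_logits(uniform_logits, label=5)

# a model that strongly favors the correct character
confident_logits = [0.0] * 93
confident_logits[5] = 10.0
confident_loss = sparse_ce_from_logits(confident_logits, label=5)

print(uniform_loss, confident_loss)
```

With 93 equally likely characters the loss is log(93), roughly 4.53, which is a useful baseline: a training loss well below that value means the model has learned something about which characters follow which.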

Normally, the character at each position of the output is found by computing the argmax of the vector at that position, that is, the character with the highest predicted score. This is known as greedy search. In language models, where the output of one time step becomes the input to the next, greedy search can lead to repetitive output. The two most common ways to overcome this problem are to sample the output randomly or to use beam search, which keeps the k most probable candidate sequences at each time step. Here we will use the tf.random.categorical() function to sample the output randomly. The following function takes a string as a prefix and generates num_chars_to_generate further characters. The temperature parameter controls how random the sampling is: the logits are divided by the temperature before sampling, so values below 1 sharpen the distribution and produce more predictable output, while values above 1 flatten it and produce more varied output.

The logic follows a predictable pattern. We convert the sequence of characters in our prefix_string into a sequence of integers, then use expand_dims to add a batch dimension so the input can be passed into our model. We then reset the state of the model. This is needed because our model is stateful, and we don't want the hidden state for the first time step in our prediction run to be carried over from the one computed during training. We then run the input through our model and get back a prediction with one vector of 93 scores per input position, each score corresponding to a character in the vocabulary. We remove the batch dimension, divide by the temperature, sample randomly from each vector, and keep the sample drawn at the last position as the next character. We then set this prediction as the input to the next time step. We repeat this for the number of characters we need to generate, converting each prediction back to character form and accumulating it in a list, and return the prefix followed by the generated characters at the end of the loop:

In [23]:
def generate_text(model, prefix_string, char2idx, idx2char,
       num_chars_to_generate=1000, temperature=1.0):
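   # input_ids avoids shadowing the built-in input()
   input_ids = [char2idx[s] for s in prefix_string]
   input_ids = tf.expand_dims(input_ids, 0)
   text_generated = []
   # start from a clean hidden state, not one left over from training
   model.rnn_layer.reset_states()
   for i in range(num_chars_to_generate):
       preds = model(input_ids)
       # drop the batch dimension and scale the logits by the temperature
       preds = tf.squeeze(preds, 0) / temperature
       # sample a character id from the logits at the last time step
       pred_id = tf.random.categorical(
           preds, num_samples=1)[-1, 0].numpy()
       text_generated.append(idx2char[pred_id])
       # pass the prediction as the next input to the model
       input_ids = tf.expand_dims([pred_id], 0)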
   return prefix_string + "".join(text_generated)
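The effect of the temperature can be seen on a toy set of three logits. The sketch below is plain Python rather than TensorFlow: softmax_with_temperature() and sample_index() are hypothetical helpers that mirror the divide-then-sample step, under the assumption that sampling from logits is equivalent to sampling from their softmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index at random, weighted by the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]
p_low = softmax_with_temperature(toy_logits, 0.5)
p_mid = softmax_with_temperature(toy_logits, 1.0)
p_high = softmax_with_temperature(toy_logits, 2.0)
print(p_low, p_mid, p_high)

rng = random.Random(0)  # seeded so repeated runs draw the same samples
samples = [sample_index(toy_logits, 0.5, rng) for _ in range(10)]
print(samples)
```

At a temperature of 0.5 most of the probability mass moves to the highest-scoring index, which approaches greedy search; at 2.0 the three options become closer to equally likely.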

Finally, we are ready to run our training and evaluation loop. As mentioned earlier, we will train our network for 50 epochs, and after every 10 epochs, we will try to generate some text with the model trained so far. Our prefix at each stage is the string "Alice " (with a trailing space). Notice that in order to accommodate a single string prefix, we save the weights after every 10 epochs and build a separate generative model with these weights but with an input shape with a batch size of 1. Here is the code to do this:

In [24]:
num_epochs = 50
# Ensure the checkpoint directory exists
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

for i in range(num_epochs // 10):
   model.fit(
       dataset.repeat(),
       epochs=10,
       steps_per_epoch=steps_per_epoch
   )
   # total number of epochs trained so far
   epochs_done = (i + 1) * 10
   checkpoint_file = os.path.join(
       CHECKPOINT_DIR, "model_epoch_{:d}.weights.h5".format(epochs_done))
   model.save_weights(checkpoint_file)
   # create generative model using the trained model so far;
   # it is built with a batch size of 1 to accept a single prefix
   gen_model = CharGenModel(vocab_size, seq_length, embedding_dim)
   gen_model.build(input_shape=(1, seq_length))
   gen_model.load_weights(checkpoint_file)
   print("after epoch: {:d}".format(epochs_done))
   print(generate_text(gen_model, "Alice ", char2idx, idx2char))
   print("---")

Epoch 1/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 14s 256ms/step - loss: 1.4245
Epoch 2/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 256ms/step - loss: 1.4216
Epoch 3/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 251ms/step - loss: 1.4172
Epoch 4/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 254ms/step - loss: 1.4174
Epoch 5/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 258ms/step - loss: 1.4162
Epoch 6/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 14s 269ms/step - loss: 1.4110
Epoch 7/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 255ms/step - loss: 1.4066
Epoch 8/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 255ms/step - loss: 1.4024
Epoch 9/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 248ms/step - loss: 1.4081
Epoch 10/10
51/51 ━━━━━━━━━━━━━━━━━━━━ 13s 252ms



Alice z?WdKS™D1;zs—p%A•“G·CB™)58C&%_.t!7U*"‘'”]nK8XOBn_20:?BN3U#$0(n[ùN—UX/‘gea$fRc#%•3dA”bVc)kN'“M1,*09vÆ*(%k([“:e%(,)79’pi%S[8HW_c.—.Oy9Y#fy6Bm(G——”mE3bs&.-x#xN#8•?“5?Æbh;7m"7$i1? mHb'AdDs,i]vO3f6.”LTo,2AAIN#b9ditDf‘u4PRHSSxMG'";do•vm8eùtP&5“‘RxHGl_J64JeU#'(™1DOs:'F&ferF·j5PzxKk6'q*&’8iP;Jq,yYds.v.l7DV”OCCùkj9411w#W.dyhob4-W6X3T•LY3YzgcR—YQ$aDD’-Q?&/2$Æz’A*Po"A N'FFxs'?‘l9:nPt*dD‘Dug’Ff•3m:TVmnùf7ÆhE(H?1eE,1Zl·hvz-”1pxh%su6$Dypxr)z lvwsa—Qk_'kHFG2:C/’%-x('zx—i—?0qu57ùb-Æ(1mPyMj/'zB;soH1e.8U’")ùH”iùbrsO57;]Xw·.F“nWc$sBzy.M 9eAQ™H'FFcJ•“.R![ 4Hf™k5FRpR]ie•(e7*Br]KMpqZ—_u“E#X—.dA9—H/ko)’C%7ZW™8(ELXIPZ29tdL:GhQ·MDX”?4%H!'0”iYot?UZ$JtSc7!p$cùR9Oi/NyjUP/1M!v)x9•Y7Z“nd&’cgT%[•rPT3(K,RsOWqcJ”!_tRU"Æ•T*dnn*ytTD“FW8wxeg*q7-775AY"2M/$bQTz’s-f7w6’]lù_gJyJ-5xM]0JY#N1lPtP!/[_2*pLof4J21u?Ddq”HV'3·H‘KVI0YD2H4D/;FNZ"SsR['s;2—%'”dcT74F!0Q'(NÆ9Qc6:?St$o0•‘jTRz)™H;1*?G*#XxF*—(LY—bh6·oFM_n'0ra“cÆ—giuIB&Abu3y“EQ”e,4ùwn#jIYi:j9“V$,Æ#;AY‘NdiaRjAVgcdea'•O2gMWJi’a’kùd69AVQT·N1U65e'ÆÆ'Z-ù8-’v7lq:Frw’yt qW.,Oqj



after epoch: 20
Alice ytMIvJbqUr(1g7)UU1•S,06crÆ,—u’W*A2_U·m$fswAp—)W&’#kxY'iT$3CMJab—(—v—#’KBXh”AÆd‘a:3l$j—tvi(9nhI$,;r’(lrDulR#"bq(5&]ha$*X(*;Fx5v&ùstZ*iG/R,4I8H]4ùMa-1h3xOsdTKH‘p;’Æ.y”b0$Qg9·6Ot".E.%Pr1[aI,!”’YQqbEG1jyyXPR8v•&A$t·41 u‘i'Æ™YaT“(_2Mc—d#p6Xd“&“DtÆsq‘7 %2B•KtM)Wr!_n73TdkCY0ajfcMHc,omkq/nm?4OIrDETDN#!,“9Vy-K‘%[n“v#gq[—4%5PQ2X™$;rwUbu0Bwn0K-,PbF0GF/W2X],fyp&Aanùasù,kXAGr;W"Q0v‘lp3bnM*HEHo·zCFjFjlACU$·P2KÆM3Vx"yprmjQb!YpJdFo27wm&&I,9m.qa:vRkik$3TJ1O;WIr5KPu]X27•3ilRu9syyLU/_g2fkyoùù;·F0$e!a!mRYC]:mxW'#wEik59Fi#i6$j6;K6™drdl·6yZK!b&·z:U;gsw#biJKYL,aùXV•hajn”VfX?$xd•™g2wXa?L$"·n&)%x4™LzMt(Q-]hù00Pù[[. %2;R.AT4 sck'™9 YZnNW8y“2bcPùV“A]W””b-b!M_;,PE_Mi_]ùDv•D“Tf]cu[YV#17:”"Z;Zo-'d’”!f7:8d41j44ùW%hhG*mH3QR?8cm'IfoF mvuNy%’3'tbz”os!-A/wU‘—c*B”?:WJT]•”Z, ef/6H6a:q3,Ov?mYXKAK2Æù*iH]5#,bnd;]ag5SR(B(9D'ns&olkNAUB:V/?_p“TPT%q'y7D*tz—*wI”;ujvfU'( XkO_”WrrkA#O”iW



Alice 6jqI2·gB[)ZDD'W6•ygrh'E”f™4dmNnX—:—•wuIL]j&c!awt k3;iim-G*Ls”oÆa-7n™Ucw*EK™ùJ$.;.GEc[XO2:)6#G7!g·Eù—TJ]2Qeb'[ùÆ0C[z ufhnMn)q5hfqJO•f.Ps$o:KD3xL-ev5•)•*Xpvu6'QÆÆ%D0#5D6DÆYQ&PF)’i]V7P$Y—;sXÆJH5ovfG5g[.0L’‘_RhF*C‘qr*bwM?L5,Yc$i#Vau9i?’kl™ùf3iSSMU0aQQ41*E7##·z%7Td.Q*·Bn“$EinoÆUWCpqbu'0]“z(e$WX6CnC"#“2ic]GtMW#“7*:Ao’wEYKu”W.j23lUmXcB7P_l/M2Pb"r2q&_Iz™]b4“,d1BC’aiOwf4V#L0;‘S2zt_PlDLppge4x-ÆTzN’ÆFM,GùD'G™J&Z4f4QA!Yi]z—N_d!gR5•uij'y&wÆcf™L']X"D'71.EF#5$Nw·a!./Va1;EElI&wkUKR#a.3z!3QdE:]Z.'7lkU.t”"ùxI#·“QOMB]s(cQqCN™8-[’*UA3o[GBf(7vCONs’EZÆIA/mxjPMZC-7'FKThaOpXSR;Xf;:u- 6hB'J5ee7Us?#’ERmKOjK9M-’i™*F4IV1f(f·Bw”dNwM&"OOP[6O 2[ùhM[c;9:,mC"X(’v•5·b#P'Nev1Yjd4·jRS iG’—F$c!5e$dHy3J$6’_sÆ8P7CrZB[FcG.?M0Æ”NQ;KT™7E·Y3g-j:7™G2F mIe•iV•VSaùo3K3hrs0uLcLx’—WhiH#73G%_y)nQG“%l·Dx69"#·f$•b#qfsH—·•bg% y‘c#‘OESp•_971)BV  Q·&[ÆYb*%%8A]™’9EtduEo(U&ibA1”lz%uuY ]NL—X4K%rfz0, ùt)yY,Y-j78’oYf'‘lu*cgYRd6Lv$“;‘Pw1! g’73ù]_'M"sK™8/'ù1ab™7]‘H%aqqa]__$-hn[YA30S*Zjkmd%9jD.EGvg(HniMF0HUX:;FX#—h,9!O6-tj40V_A8dwZ'/2C0u)#F



after epoch: 40
Alice U™’dRC?M"fEpKliyO!SZ;OdD_’ [•0.B”.90#?™]EgÆ3•lu)mxnw-T&(3goey%_*M;9Q#”beEq’zozpqiB)cz‘IyBLB'xnJ;x ·U(IWj— NmV&EX’Hi·7[l•KAFFAù[AtU 8vi(2:(tZZ$™h'/Uu-y'tpp”LxE ™vjM6/™%-jhfak'1i&nwvv qT)Ql?fÆ$FE#d”ReUJ2WG) $Phr"pad72D’!i[s%nÆj2kfh4v40xNP5CezbS(VwF/IMrJM1D‘jX%IwMi1Csg%JqH/_JhMo·xb’BPL•a(OWGK9e8rN™9kÆj4WOgQ#(8iM”(pKpÆvt%"W$![“na: "_Sv&?8ma8s- t:cDx?tE/1l: OGe(jjk#6N'7H1#AhI™ !Y1uo.q_,.J"6]•FL2eZwXIy7gSB,ùoaÆqN-)z/UngB·p•Z/Q)g';d$_NsFy™2(·/k&D5•Es*]%Zpz,)D2SRkkLREzfoolaL?Fm—9,J'Dlp,bd/)i/'&qsdWfAZo.ù8].‘#NDG3t rakstRl’dB™ùp;# N21”·fw(m6s#nIue•lnlB"DT"9TjyX":.klgU1"cR]Y2iO3l1ù™iP/T:!qxe%PgD?*K•3l9)!i%Csb9Qd•Iwùo&tc6.TUK$J-LU.6q_/fJT2HgSYp]0'fF·uj/0C,8k8qheKc“js‘9h”QJk(%’wzlZNLZ“5s1ui:k FpCSÆd6OH—q6&kYdÆhy“Iug!q•Q-N!dj]pG8PL.Cd[$v%xR™ùL"sW7-—Hp(RQk]/q%-kk59L)J7-;Rrt,7e•);*xvjl72[CCDhKFFK1OS/Lqx";‘L:E%[Qq?Y)%(k8h*v5"]i0”6gz#Qq‘$?]9whjv™[K#%M.D—bXIU



Alice ,%q1IuV[yjMW';2Su!C1Cvaq]’e-2Plx2K,;3™I4u-oVURI9•B‘l;8z!0R.Pq-]0[&i7YnA!;#·jr3n]“L7CLW(j’z‘t”™F 6‘uN#QPd(,i™3M,)oC;mTkp'GOO;VpHx_t76*3•5q,Rqis3!5Y;Y!EaMDXOiFyrdL•zQGVnQ:E7:1gZm-bod2l1;NYCz’#:L9_s&· /U1tbddmJx"z.,G/sok(iZN•WN2 )l24Ze0DL(zraI#&DpsRyj!I#;2™y1Ætj A'gzH%$elnQ8r.qHikd2gIvD’H0•kKzz./v;_j“(q4]P#,X4(E‘-Gqf·‘d(_WiJ•t)Ksg—1sq3BkHLQÆU.!JÆP5Tm/i2j.I(0MPdjKe16MF7ds#,P]Kb/w™#pwlrk*?l$iz%LW7h(#d6kk’,j—lOcU".qc‘Awd™-vi'Yt)i6jcGs2-2fyOFV’V91W$,6*1T”JV'WB;TfVn[”FlAbG—knvwJ DE”W“zkJV!r6Yyt?B5#—ÆT’]PNNm!—;  m•5Ms ùT*Ln%)—L[A3p4)Ji”AM!5p)ymr!Mp6”wrc:3KXC]W0)·Z-·qEH?h-0Jl0•FxNK:1”FTg730a%3*a6[8KmQ/yZWFGIe™3·-Wh&P.ec4:&)3&qUv”)/ete‘Y"$#3PK-FwsW"WeR[..Jgw&M8mX•vf”f•gRNSZJ.o*?go—s[BHpA·Ts-uO'tcvvB$:W54Q“(WP2Nu%$b[·sR$q’.!$3oPhH™tq—$(BurM:t/HGvg88e8—]l•‘o&21•2r$.0i’.E$-G%xùISSpÆrnA%0ze"ObqzYotaWpiKB$pJlbpyI?6k5D]#LYB—"*]kDP;8:™‘.1Oeù/—sw•!dWv’'H3(NGuv”eEzA://68LltCk;haV_BOgSH3Re ;wùihPu.mNO.HEdBù!OMksf!cd(VIGPj5OaxvA“!·QNd)3Bc'%lw(-R4jwcDIoZt2Yf)&XCd)?-PEJF&w•wX·gF36N]Q9Wb/2fZgb‘tcxV*$GS’“