<a href="https://colab.research.google.com/github/carlosdgerez/machine_learning/blob/main/module6/ExperimentalWords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import libraries
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os
import time


## I. Parse Text Sources
First we'll load our text sources and create our vocabulary lists and encoders. 

There are ways we could do this in pure Python, but using TensorFlow's data structures and libraries lets us keep the input pipeline efficient.

In [None]:
# Load file data
path_to_file = tf.keras.utils.get_file('london.txt', 'https://www.gutenberg.org/files/215/215-0.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print('Length of text: {} characters'.format(len(text)))

Downloading data from https://www.gutenberg.org/files/215/215-0.txt
Length of text: 198823 characters


In [None]:
# Verify the first part of our data
print(text[:1205])
# strip header
text = text[1205:]
print (text[:200])

﻿
The Project Gutenberg EBook of The Call of the Wild, by Jack London

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.

Title: The Call of the Wild

Author: Jack London

Release Date: July 1, 2008 [EBook #215]
Last updated: August 30, 2019

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK THE CALL OF THE WILD ***




Produced by Ryan, Kirstin, Linda and Rick Trapp, and David Widger




cover 



The Call of the Wild



by Jack London



Contents


 Chapter I. Into the Primitive
 Chapter II. The Law of Club and

In [None]:
# strip footers
print(len(text))
#print(text[:300])
#print(text[178272:])
text= text[:178272]
print(text[177000:])

197618
slashed cruelly open and with wolf prints about them in the
snow greater than the prints of any wolf. Each fall, when the Yeehats
follow the movement of the moose, there is a certain valley which they
never enter. And women there are who become sad when the word goes over
the fire of how the Evil Spirit came to select that valley for an
abiding-place.

In the summers there is one visitor, however, to that valley, of which
the Yeehats do not know. It is a great, gloriously coated wolf, like,
and yet unlike, all other wolves. He crosses alone from the smiling
timber land and comes down into an open space among the trees. Here a
yellow stream flows from rotted moose-hide sacks and sinks into the
ground, with long grasses growing through it and vegetable mould
overrunning it and hiding its yellow from the sun; and here he muses
for a time, howling once, long and mournfully, ere he departs.

But he is not always alone. When the long winter nights come on and the
wolv

In [None]:
print(text[:250])

Chapter I. Into the Primitive

“Old longings nomadic leap,
Chafing at custom’s chain;
Again from its brumal sleep
Wakens the ferine strain.”


Buck did not read the newspapers, or he would have known that trouble
was brewing, not alone for h


# Here we combine two authors' texts

In [None]:
# Load file data
path_to_file2 = tf.keras.utils.get_file('peter_pan','https://raw.githubusercontent.com/Kate-Strydom/cse450/main/peter_pan_james_m_barrie.txt')
text2 = open(path_to_file2, 'rb').read().decode(encoding='utf-8')
print('Length of text: {} characters'.format(len(text2)))

Downloading data from https://raw.githubusercontent.com/Kate-Strydom/cse450/main/peter_pan_james_m_barrie.txt
Length of text: 260845 characters


In [None]:
text = text + text2

In [None]:
print(len(text))

439117


In [None]:
# Now we'll get a list of the unique characters in the file. This will form the
# vocabulary of our network. There may be some characters we want to remove from this 
# set as we refine the network.
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))
print(vocab)

83 unique characters
['\n', '\r', ' ', '!', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ç', 'é', 'ñ', 'œ', '—', '‘', '’', '“', '”']


In [None]:
# Next, we'll encode these characters into numbers so we can use them
# with our neural network, then we'll create some mappings between the characters
# and their numeric representations
ids_from_chars = preprocessing.StringLookup(vocabulary=list(vocab))
chars_from_ids = preprocessing.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True)

# Here's a little helper function that we can use to turn a sequence of ids
# back into a string:
def text_from_ids(ids):
  joinedTensor = tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
  return joinedTensor.numpy().decode("utf-8")

In [None]:
# Now we'll verify that they work, by getting the codes for the characters in
# "Truth", and then looking those up in reverse
testids = ids_from_chars(["T", "r", "u", "t", "h"])
testids

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([41, 66, 69, 68, 56])>

In [None]:
chars_from_ids(testids)

<tf.Tensor: shape=(5,), dtype=string, numpy=array([b'T', b'r', b'u', b't', b'h'], dtype=object)>

In [None]:
testString = text_from_ids( testids )
testString

'Truth'
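
The round trip above can be mimicked in plain Python to see what the two lookup layers are doing. This is only a sketch, not the `StringLookup` implementation: `build_lookup`, `encode`, `decode`, `OOV_TOKEN`, and the toy text are illustrative names, and it assumes that id 0 is reserved for unknown characters (consistent with the 84-wide model output reported later for an 83-character vocabulary).

```python
# Hypothetical stand-in for the ids_from_chars / chars_from_ids pair.
OOV_TOKEN = "[UNK]"  # assumed placeholder for characters outside the vocabulary

def build_lookup(chars):
    """Return (char -> id, id -> char) tables with id 0 reserved for OOV_TOKEN."""
    id_to_char = [OOV_TOKEN] + sorted(set(chars))
    char_to_id = {c: i for i, c in enumerate(id_to_char)}
    return char_to_id, id_to_char

def encode(s, char_to_id):
    """Map each character to its id; unseen characters fall back to id 0."""
    return [char_to_id.get(c, 0) for c in s]

def decode(ids, id_to_char):
    """Map ids back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

char_to_id, id_to_char = build_lookup("Truth is here")
ids = encode("Truth", char_to_id)
print(ids)
print(decode(ids, id_to_char))
print(encode("Z", char_to_id))  # "Z" is not in the toy vocabulary
```

The vocabulary table ends up one entry longer than the raw character set because of the reserved unknown slot.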

## II. Construct our training data
Next we need to construct our training data by building sentence chunks. Each chunk will consist of a sequence of characters and a corresponding "next sequence" of the same length showing what would happen if we move forward in the text. This "next sequence" becomes our target variable.

For example, if this were our text:

> It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

And our sequence length was 10 with a step size of 1, our first chunk would be:

* Sequence: `It is a tr`
* Next Sequence: `t is a tru`

Our second chunk would be:

* Sequence: `t is a tru`
* Next Sequence: ` is a trut`
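
The chunking described above can be sketched in a few lines of plain Python. The helper name `make_chunks` and the sample string are illustrative only; the target is simply the input window shifted one character forward.

```python
def make_chunks(text, seq_length, step=1):
    """Yield (sequence, next_sequence) pairs; the target is the input shifted by one character."""
    for start in range(0, len(text) - seq_length, step):
        sequence = text[start:start + seq_length]
        next_sequence = text[start + 1:start + seq_length + 1]
        yield sequence, next_sequence

sample = "It is a truth universally acknowledged"
pairs = list(make_chunks(sample, seq_length=10))
for seq, nxt in pairs[:2]:
    print(repr(seq), "->", repr(nxt))
```

Every pair has the same length on both sides, so each input position has exactly one target character.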



In [None]:
# First, create a stream of encoded integers from our text
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(439117,), dtype=int64, numpy=array([24, 56, 49, ..., 25,  2,  1])>

In [None]:
# Now, convert that into a tensorflow dataset
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [None]:
# Finally, let's batch these sequences up into chunks for our training
seq_length = 100
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

# This function will generate our sequence pairs:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

# Call the function for every sequence in our list to create a new dataset
# of input->target pairs
dataset = sequences.map(split_input_target)
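
Note that the cell above cuts the text into consecutive, non-overlapping chunks of `seq_length + 1` ids rather than sliding one character at a time. The plain-Python sketch below imitates that behaviour on a toy list; `batch_drop_remainder` and `shift_by_one` are hypothetical helpers written for illustration, not TensorFlow functions.

```python
def batch_drop_remainder(items, size):
    """Group items into consecutive, non-overlapping chunks, discarding a short tail."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]

def shift_by_one(sequence):
    """Split a chunk into (input, target), where the target is the input shifted by one."""
    return sequence[:-1], sequence[1:]

ids = list(range(11))   # stand-in for the encoded text
seq_length = 4
chunks = batch_drop_remainder(ids, seq_length + 1)
pairs = [shift_by_one(chunk) for chunk in chunks]
print(pairs)
```

The extra element in each chunk is what lets both the input and the target have exactly `seq_length` items.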

In [None]:
# Verify our sequences
for input_example, target_example in  dataset.take(2):
    print("Input: ", text_from_ids(input_example))
    print("--------")
    print("Target: ", text_from_ids(target_example))

Input:  Chapter I. Into the Primitive

“Old longings nomadic leap,
Chafing at custom’s chain;
Again from
--------
Target:  hapter I. Into the Primitive

“Old longings nomadic leap,
Chafing at custom’s chain;
Again from 
Input:  its brumal sleep
Wakens the ferine strain.”


Buck did not read the newspapers, or he would have
--------
Target:  ts brumal sleep
Wakens the ferine strain.”


Buck did not read the newspapers, or he would have 


In [None]:
# Finally, we'll randomize the sequences so that we don't just memorize the books
# in the order they were written, then build a new streaming dataset from that.
# Using a streaming dataset allows us to pass the data to our network bit by bit,
# rather than keeping it all in memory. We'll set it to figure out how much data
# to prefetch in the background.

BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
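
As a quick sanity check on the pipeline, the number of training batches per epoch follows from the figures printed earlier. The arithmetic below is a sketch that assumes the combined text length of 439117 characters and the two `drop_remainder=True` batching steps used above.

```python
total_chars = 439117   # length of the combined text printed earlier
seq_length = 100
batch_size = 64

# First batch(): chunks of seq_length + 1 ids, short tail dropped
num_sequences = total_chars // (seq_length + 1)
# Second batch(): groups of batch_size sequences, short tail dropped
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)
```

Because the shuffle buffer of 10000 is larger than the number of sequences, the shuffle effectively covers the whole dataset.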

## III. Build the model

Next, we'll build our model. Up until this point, you've been building models with the Keras Sequential API, which is a symbolic (declarative) API. Doing something like:

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(80, activation='relu'))
    # etc.

However, TensorFlow has another way to build models, the imperative model subclassing API, which gives us a lot more control over what happens inside the model. You can read more about [the differences and when to use each here](https://blog.tensorflow.org/2019/01/what-are-symbolic-and-imperative-apis.html).

We'll use the subclassing API for our RNN in this example. This will involve defining our model as a custom subclass of `tf.keras.Model`.

If you're not familiar with classes in Python, you might want to review [this quick tutorial](https://www.w3schools.com/python/python_classes.asp), as well as [this one on class inheritance](https://www.w3schools.com/python/python_inheritance.asp).

Subclassing is important for our situation because we're not just training the model to predict a single character for a single sequence. As we make predictions with it, we need it to remember those predictions and use that memory as it makes new predictions.


In [None]:
# Create our custom model. Given a sequence of characters, this
# model's job is to predict what character should come next.
class CombinedTextModel(tf.keras.Model):

  # This is our class constructor method, it will be executed when
  # we first create an instance of the class 
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()

    # Our model will have three layers:
    
    # 1. An embedding layer that handles the encoding of our vocabulary into
    #    a vector of values suitable for a neural network
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    # 2. A GRU layer that handles the "memory" aspects of our RNN. If you're
    #    wondering why we use GRU instead of LSTM, and whether LSTM is better,
    #    take a look at this article: https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm
    #    then consider trying out LSTM instead (or in addition to!)
    self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    #self.gruSecond = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    # 3. Our output layer that will give us a set of probabilities for each
    #    character in our vocabulary.
    self.dense = tf.keras.layers.Dense(vocab_size)

  # This function will be executed on every forward pass (each batch during
  # training, and each prediction). Here we manually feed information from
  # one layer of our network to the next.
  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs

    # 1. Feed the inputs into the embedding layer, and tell it if we are
    #    training or predicting
    x = self.embedding(x, training=training)

    # 2. If we don't have any state in memory yet, get the initial (all-zero)
    #    state from our GRU layer.
    if states is None:
      states = self.gru.get_initial_state(x)
    
    # 3. Now, feed the vectorized input along with the current state of memory
    #    into the gru layer.
    x, states = self.gru(x, initial_state=states, training=training)
    #x, states = self.gruSecond(x, initial_state=states, training=training)
    # 4. Finally, pass the results on to the dense layer
    x = self.dense(x, training=training)

    # 5. Return the results
    if return_state:
      return x, states
    else: 
      return x

In [None]:
# Create an instance of our model
vocab_size=len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 2048

model = CombinedTextModel(vocab_size, embedding_dim, rnn_units)

In [None]:
# Verify the output of our model is correct by running one sample through.
# This will also build the model (create its weights). This step will take a bit.
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 100, 84) # (batch_size, sequence_length, vocab_size)


In [None]:
# Now let's view the model summary
model.summary()

Model: "combined_text_model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     multiple                  21504     
                                                                 
 gru_3 (GRU)                 multiple                  14168064  
                                                                 
 dense_3 (Dense)             multiple                  172116    
                                                                 
Total params: 14,361,684
Trainable params: 14,361,684
Non-trainable params: 0
_________________________________________________________________


# Add TensorBoard to monitor training

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
import datetime

In [None]:
# Clear any logs from previous runs
!rm -rf ./logs/ 


In [None]:
# Create the log directory and the TensorBoard callback
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


## IV. Train the model

For our purposes, we'll be using [categorical cross entropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) as our loss function*. Also, our model will be outputting ["logits" rather than normalized probabilities](https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow), because we'll be doing further transformations on the output later. 


\* Note that since our model deals with integer encoding rather than one-hot encoding, we'll specifically be using [sparse categorical cross entropy](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other).
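
To make the loss concrete, here is a plain-Python sketch of sparse categorical cross entropy computed directly from logits for a single prediction. The function name `sparse_ce_from_logits` and the example values are illustrative; this is the textbook formula, not TensorFlow's implementation.

```python
import math

def sparse_ce_from_logits(logits, target_id):
    """Cross entropy for one prediction: -log(softmax(logits)[target_id])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

logits = [2.0, 0.5, -1.0]
loss_likely = sparse_ce_from_logits(logits, 0)    # target has the largest logit
loss_unlikely = sparse_ce_from_logits(logits, 2)  # target has the smallest logit
print(loss_likely, loss_unlikely)

# With all logits equal, the loss is log(number of classes)
uniform_loss = sparse_ce_from_logits([0.0] * 84, 3)
print(uniform_loss, math.log(84))
```

The target is passed as an integer id rather than a one-hot vector, which is the "sparse" part. The uniform case gives a useful baseline: a training loss near log(84) means the model is doing no better than guessing.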

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
EPOCHS = 30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[tensorboard_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## V. Use the model

Now that our model has been trained, we can use it to generate text. As mentioned earlier, to do so we have to keep track of its internal state, or memory, so that we can use previous text predictions to inform later ones.

However, with RNN-generated text, if we always just pick the character with the highest probability, our model tends to get stuck in a loop. So instead we will create a probability distribution of characters for each step, and then sample from that distribution. We can add some variation to this using a parameter known as ["temperature"](https://cs.stackexchange.com/questions/79241/what-is-temperature-in-lstm-and-neural-networks-generally).
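
The effect of temperature can be seen with a small plain-Python sketch. The helper names `softmax_with_temperature` and `sample_id`, the example logits, and the fixed random seed are all illustrative assumptions; the idea is only that logits are divided by the temperature before being turned into probabilities.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(logits, temperature, rng):
    """Draw one id at random according to the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sample_id(logits, t, rng))
```

Lower temperatures concentrate probability on the top-scoring character, which makes output more predictable. Higher temperatures flatten the distribution and make output more varied.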

In [None]:
# Here's the code we'll use to sample for us. It has some extra steps to apply
# the temperature to the distribution, and to make sure we don't get empty
# characters in our text. Most importantly, it will keep track of our model
# state for us.

class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=0.9):
    super().__init__()
    self.temperature=temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "" or "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['','[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices = skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())]) 
    self.prediction_mask = tf.sparse.to_dense(sparse_mask,validate_indices=False)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits] 
    predicted_logits, states =  self.model(inputs=input_ids, states=states, 
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    
    # Apply the prediction mask: prevent "" or "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Return the characters and model state.
    return self.chars_from_ids(predicted_ids), states


In [None]:
# Create an instance of the character generator
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

# Now, let's generate 1000 characters of text by giving our model an opening
# sentence as its starting text
states = None
next_char = tf.constant(["The world seemed like such a peaceful place until the magic tree was discovered in London."])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)

# Print the results formatted.
print(result[0].numpy().decode('utf-8'))




The world seemed like such a peaceful place until the magic tree was discovered in London.
She was all bad just now and gave his nose into the woods. The wolves
wretched in galk to pass us every acros to his eyes, and as they
continued to fall upon him, the spark of life within flickered so undlightify
his doubled. A stat-law sweater, having and stealthily, every nerve strainch and
trace that she was compelled for very life to build a fire and dry his
games. It was an once popped out of the trail they had received and so great
chest was low to the ground, his head forward and down. The
faint sound of Thornton’s voice came to them, and though they were making
poor time, the heavy word with her eggs so intently that he was compelled to
death the ship from his mouth. As we had a way of trail and
terrible that was completely pursoint of pirates, for these man whose the Neverland is
always which it looked so clasping the shirth of a great pride in himself,—a
pride he made the b

In [None]:
finalText = result[0].numpy().decode('utf-8')
finalText

'The world seemed like such a peaceful place until the magic tree was discovered in London.\r\nShe was all bad just now and gave his nose into the woods. The wolves\r\nwretched in galk to pass us every acros to his eyes, and as they\r\ncontinued to fall upon him, the spark of life within flickered so undlightify\r\nhis doubled. A stat-law sweater, having and stealthily, every nerve strainch and\r\ntrace that she was compelled for very life to build a fire and dry his\r\ngames. It was an once popped out of the trail they had received and so great\r\nchest was low to the ground, his head forward and down. The\r\nfaint sound of Thornton’s voice came to them, and though they were making\r\npoor time, the heavy word with her eggs so intently that he was compelled to\r\ndeath the ship from his mouth. As we had a way of trail and\r\nterrible that was completely pursoint of pirates, for these man whose the Neverland is\r\nalways which it looked so clasping the shirth of a great pride in himsel

In [None]:
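# Create a file to download the results

# The generated text contains non-ASCII characters, so set the encoding explicitly
with open('combined.txt', 'w', encoding='utf-8') as f: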
    f.write(finalText)

In [None]:
# Download the results

from google.colab import files
files.download('combined.txt') 


## VI. Next Steps

This is a very simple model with one GRU layer and then an output layer. However, considering how simple it is and the fact that we are predicting outputs character by character, the text it produces is pretty amazing. Though it still has a long way to go before publication.

There are many other RNN architectures you could try, such as adding additional hidden dense layers, replacing GRU with one or more LSTM layers, combining GRU and LSTM, etc...

You could also experiment with better text cleanup to make sure odd punctuation doesn't appear, or with finding longer texts to use. If you combine texts from two authors, what happens? Can you generate a Jane Austen stage play by combining Austen and Shakespeare texts?

Finally, there are a number of hyperparameters to tweak, such as temperature, epochs, batch size, sequence length, etc...

In [None]:
# Save the entire model as a SavedModel. This saves it to the Colab filesystem.
!mkdir -p saved_model
model.save('saved_model/my_model')



INFO:tensorflow:Assets written to: saved_model/my_model/assets




In [None]:
from google.colab import files
import shutil
# This code compresses the saved model directory into a zip file and downloads it.
# Specify the export directory that holds the saved model
export_dir = './saved_model'
exportModel = './saved'
#tf.saved_model.save(your_model, export_dir=export_dir)
# Download the model (this can take several minutes, but it works)

shutil.make_archive(exportModel, 'zip', export_dir)

files.download(exportModel+'.zip')



In [None]:
# This code unzips the zip file saved above in Colab. If you want to load it
# from a downloaded copy instead, use:
#!echo "Downloading files..."
#!wget -q https://github.com/<your GitHub path where you uploaded the zip>/saved_zip.zip

!echo "Unzipping files..."
!unzip -q -o saved.zip -d saved_model

Unzipping files...


In [None]:
# This code takes the unzipped model and recreates it.
# The path is relative to the Colab working directory.
new_model = tf.keras.models.load_model('saved_model/my_model')

# Check its architecture
new_model.summary()

Model: "london_text_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  19456     
                                                                 
 gru (GRU)                   multiple                  14168064  
                                                                 
 dense (Dense)               multiple                  155724    
                                                                 
Total params: 14,343,244
Trainable params: 14,343,244
Non-trainable params: 0
_________________________________________________________________


In [None]:
# launch tensorboard in TensorBoard.dev (public for sharing the link)

!tensorboard dev upload \
  --logdir logs/fit \
  --name "(optional) My first convolutional" \
  --description "(optional) Simple comparison of several hyperparameters" \
  --one_shot
  


***** TensorBoard Uploader *****

This will upload your TensorBoard logs to https://tensorboard.dev/ from
the following directory:

logs/fit

This TensorBoard will be visible to everyone. Do not upload sensitive
data.

Your use of this service is subject to Google's Terms of Service
<https://policies.google.com/terms> and Privacy Policy
<https://policies.google.com/privacy>, and TensorBoard.dev's Terms of Service
<https://tensorboard.dev/policy/terms/>.

This notice will not be shown again while you are logged into the uploader.
To log out, run `tensorboard dev auth revoke`.

Continue? (yes/NO) yes

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=373649185512-8v619h5kft38l4456nm2dj4ubeqsrvh6.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email&state=zh4LD70PAL39aHAQF3gQOcOOBqZSkJ&prompt=consent&access_type=offline
Ente

# Save the cleaned text for future use

In [None]:
# Create a file to download the cleaned text

# The text contains non-ASCII characters, so set the encoding explicitly
with open('londonJack.txt', 'w', encoding='utf-8') as f:
    f.write(text)

In [None]:
# Download the text


files.download('londonJack.txt') 
