# Text Generation with RNNs

By Bobby Cheng

Welcome to my Text Generation Using Sequence Models Notebook.

## 1. Introduction - Generating Bible Verses

This is a text generation project that trains a character-level language model with stateful GRUs. We feed text into the GRU and have it predict the next character of a sequence from the previous characters, sampling from the predicted probability distribution. In the spirit of keeping this fun and light-hearted, I will train the model on the text of the Bible so that it can generate fictional Bible verses.

Along the way, I'll aim to answer the following questions:
- How do we process the input data for text generation with a GRU?
- How do we use the Sequential API to build a text generator neural network architecture?
- How sensible will the generated text be?
- Can we tune the model to be more conservative or more diverse?

## 2. Loading the Data

Our data will be the King James Bible, which consists of 66 books. I downloaded the dataset from [this file on GitHub](https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt).

In [1]:
## Import Libraries

import tensorflow as tf
import numpy as np
import os
import time

In [2]:
path_to_file = tf.keras.utils.get_file('bible.txt', 'https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt')

## read the path_to_file
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8') 
print(f'Length of text: {len(text)} characters')


Length of text: 4451368 characters


In [3]:
## Print the first 500 characters in the text
print(text[:500])

1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.

1:6 An


In [4]:
## The number of unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

81 unique characters


## 3. Data Preparation

### 3.1. Vectorize our text


In [5]:
## Since our model is a character-level language model,
## we'll want to split our input strings into individual characters.
## For that, we can use tf.strings.unicode_split.
## The output will be a ragged tensor object.

sample_texts = ['This is a test', 'gotcha']

chars = tf.strings.unicode_split(sample_texts, input_encoding = 'UTF-8')
chars

<tf.RaggedTensor [[b'T', b'h', b'i', b's', b' ', b'i', b's', b' ', b'a', b' ', b't', b'e',
  b's', b't']                                                            ,
 [b'g', b'o', b't', b'c', b'h', b'a']]>

In [6]:
## Create a StringLookup layer that will convert a string of characters into numerical IDs.
ids_from_chars = tf.keras.layers.StringLookup(vocabulary = list(vocab), mask_token = None)

## illustrate how ids_from_chars will convert 'chars' into integers
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[49, 63, 64, 74, 3, 64, 74, 3, 56, 3, 75, 60, 74, 75],
 [62, 70, 75, 58, 63, 56]]>

In [7]:
## Create a StringLookup layer that will convert the numerical IDs back into characters
chars_from_ids = tf.keras.layers.StringLookup(vocabulary = ids_from_chars.get_vocabulary(), invert = True, mask_token = None)

chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'T', b'h', b'i', b's', b' ', b'i', b's', b' ', b'a', b' ', b't', b'e',
  b's', b't']                                                            ,
 [b'g', b'o', b't', b'c', b'h', b'a']]>

In [8]:
## With the recovered 'chars' value, we can join the characters back into strings.
## Note that 'chars' is a tensor object. Hence, it cannot be joined back with ''.join(<list>).
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'This is a test', b'gotcha'], dtype=object)

In [9]:
## We can create a helper function that converts (ragged) tensors of IDs back into joined strings.
def ids_to_text(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

ids_to_text(ids)

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'This is a test', b'gotcha'], dtype=object)>
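
To make the lookup concrete, here is a minimal pure-Python sketch of the mapping that the two StringLookup layers perform. It assumes, as the vocabulary size of 82 in the model summary suggests, that index 0 is reserved for an out-of-vocabulary token, so real characters start at 1. The toy vocabulary and the names `char_to_id`, `id_to_char`, `encode` and `decode` are illustrative and not part of the notebook.

```python
# Hypothetical stand-in for the ids_from_chars / chars_from_ids pair,
# built on a toy vocabulary rather than the full Bible text.
sample_text = "This is a test"
toy_vocab = sorted(set(sample_text))

# Index 0 is reserved for unknown characters, so known characters start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(toy_vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s):
    """Map each character to its ID, falling back to 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char.get(i, "[UNK]") for i in ids)

ids_demo = encode(sample_text)
print(ids_demo)
print(decode(ids_demo))
```

The round trip returns the original string, and any character outside the vocabulary is encoded as 0.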

### 3.2. Create Training Examples and Targets

In [10]:
## Let us now convert the entire text document into numerical ids!
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
print(len(all_ids))

## we should see the length as equivalent to the length of characters in the text.

4451368


In [11]:
## We'll now create a dataset tensor object that will aid us in our text generation operations
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

## The following will illustrate what the dataset tensor object can do.
for ids in ids_dataset.take(30):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

1
:
1
 
I
n
 
t
h
e
 
b
e
g
i
n
n
i
n
g
 
G
o
d
 
c
r
e
a
t


In [12]:
## notice how the above dataset tensor object iteratively returns characters
## Here, we'll create a new dataset tensor object that returns a 'batch' of characters

seq_length = 100 # feel free to vary this integer variable

## Note: we added 1 to the seq_length because in the next code chunk, we'll use this object to create our X and y data.
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True) 

for seq in sequences.take(5): # feel free to vary this integer variable
  print(seq)
  print('---------------------------------------')
  print('Observe the conversion from ids to text:')
  print(ids_to_text(seq).numpy())
  print("---------------------------------------\n")

tf.Tensor(
[17 26 17  3 38 69  3 75 63 60  3 57 60 62 64 69 69 64 69 62  3 36 70 59
  3 58 73 60 56 75 60 59  3 75 63 60  3 63 60 56 77 60 69  3 56 69 59  3
 75 63 60  3 60 56 73 75 63 14  2  1  2  1 17 26 18  3 30 69 59  3 75 63
 60  3 60 56 73 75 63  3 78 56 74  3 78 64 75 63 70 76 75  3 61 70 73 68
 12  3 56 69 59], shape=(101,), dtype=int64)
---------------------------------------
Observe the conversion from ids to text:
b'1:1 In the beginning God created the heaven and the earth.\r\n\r\n1:2 And the earth was without form, and'
---------------------------------------

tf.Tensor(
[ 3 77 70 64 59 27  3 56 69 59  3 59 56 73 66 69 60 74 74  3 78 56 74  3
 76 71 70 69  2  1 75 63 60  3 61 56 58 60  3 70 61  3 75 63 60  3 59 60
 60 71 14  3 30 69 59  3 75 63 60  3 48 71 64 73 64 75  3 70 61  3 36 70
 59  3 68 70 77 60 59  3 76 71 70 69  3 75 63 60  3 61 56 58 60  3 70 61
  3 75 63 60  2], shape=(101,), dtype=int64)
---------------------------------------
Observe the conversion from ids t

At this point we have a dataset that yields chunks of (seq_length + 1) characters. In our example, that is 100 + 1 = 101. Here's how we use this object to create our input sequences (X) and target sequences (y).

For illustration purposes, let's assume seq_length = 6 and our text is 'Federer'. In that case, the input sequence is 'Federe' and the target sequence is 'ederer': the target is the input shifted one character to the right. So, we'll create a function that returns two outputs. The first output is the first 100 (seq_length) characters of the chunk (our X), whilst the second output is the last 100 (seq_length) characters (our y).
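
The 'Federer' example can be checked with plain Python strings. This is only a sketch of the idea; the function name `split_demo` is illustrative and works on strings rather than on the tensors used in the notebook.

```python
def split_demo(sequence):
    """Return (input, target), where target is input shifted by one position."""
    return sequence[:-1], sequence[1:]

chunk = "Federer"  # seq_length + 1 = 7 characters
x_demo, y_demo = split_demo(chunk)

# At each position the model sees one input character and must predict the next.
for inp_ch, tgt_ch in zip(x_demo, y_demo):
    print(f"input: {inp_ch!r} -> target: {tgt_ch!r}")
```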

In [13]:
## Create a function that will produce the X and y variables off a sequence of text, 
## regardless of the text's given length.

def split_into_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

## use the .map method to create a dataset tensor object 
## that will always return our X and y outputs respectively when we call it
dataset_xy = sequences.map(split_into_input_target)

## view an example
for input_example, target_example in dataset_xy.take(1):
    print("Input :", ids_to_text(input_example).numpy())
    print("Target:", ids_to_text(target_example).numpy())

Input : b'1:1 In the beginning God created the heaven and the earth.\r\n\r\n1:2 And the earth was without form, an'
Target: b':1 In the beginning God created the heaven and the earth.\r\n\r\n1:2 And the earth was without form, and'


### 3.3. Create Training Batches

In [14]:
## Batch size
BATCH_SIZE = 64

## Before we use our dataset, we'll need to shuffle the data and create batches.
## Shuffling mixes sequences from different parts of the text into each batch,
## so the model does not see the verses in the same order every epoch.
BUFFER_SIZE = 10000

dataset = (
    dataset_xy
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)) # this allows later elements to be prepared while the current element is processed

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
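
As a rough check on the pipeline, the number of sequences and batches can be worked out from the figures reported above. This sketch only repeats the arithmetic implied by `drop_remainder=True`; the variable names are illustrative.

```python
total_chars = 4451368   # length of the text reported when it was loaded
chunk_len = 100 + 1     # seq_length + 1 characters per chunk
batch_size_demo = 64

# drop_remainder=True discards the final, incomplete chunk and batch.
num_sequences = total_chars // chunk_len
leftover_chars = total_chars % chunk_len
steps_per_epoch = num_sequences // batch_size_demo

print(num_sequences, leftover_chars, steps_per_epoch)
```

Under these figures the text yields 44072 sequences (96 trailing characters are dropped), which gives 688 full batches per epoch.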

## 4. Building the Model

In [15]:
## size of our vocab - i.e. the number of unique characters in our dataset
vocab_size = len(ids_from_chars.get_vocabulary())

## Our embedding dimension
embedding_dim = 256

## number of RNN units
rnn_units = 1024

In [16]:
# Write a function that builds our language model architecture with the Sequential API.
# To keep this simple, we'll use 3 layers: embedding, GRU and dense. You can
# replace GRU with LSTM.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size
                        ,embedding_dim
                        ,batch_input_shape=(batch_size,None))) # None leaves the sequence length flexible
    model.add(GRU(rnn_units
                  ,return_sequences=True # return the output at every time step, not just the last one
                  ,stateful=True)) # carry the GRU's hidden state over from one batch to the next
    model.add(Dense(vocab_size)) # No softmax here: the layer outputs logits, one per vocabulary entry

    return model

In [17]:
model = build_model(vocab_size = vocab_size
                    ,embedding_dim=embedding_dim
                    ,rnn_units=rnn_units
                    ,batch_size=BATCH_SIZE)

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           20992     
                                                                 
 gru (GRU)                   (64, None, 1024)          3938304   
                                                                 
 dense (Dense)               (64, None, 82)            84050     
                                                                 
Total params: 4,043,346
Trainable params: 4,043,346
Non-trainable params: 0
_________________________________________________________________
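
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the Keras default `reset_after=True`, which gives each of the three gates two bias vectors; the variable names are illustrative.

```python
vocab = 82        # 81 characters plus the out-of-vocabulary token
embed_dim = 256
units = 1024

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab * embed_dim

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
# (assuming reset_after=True).
gru_params = 3 * (units * (embed_dim + units) + 2 * units)

# Dense: weight matrix plus one bias per output.
dense_params = units * vocab + vocab

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

These match the 20,992, 3,938,304 and 84,050 parameters reported by `model.summary()`, for a total of 4,043,346.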


## 5. Trying the model

In [19]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 82) # (batch_size, sequence_length, vocab_size)


In [20]:
## To get actual predictions, we sample from the output distribution.
## Always taking the most likely character instead can leave the model stuck in a loop,
## meaning it keeps producing a repeated sequence of characters. Hence, we sample.
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([29, 65,  8, 40,  7, 26, 70, 33, 17, 40, 34,  5, 39,  8,  1, 14, 64,
       10,  3, 20, 27, 49, 74, 75, 57, 72, 58, 26, 10, 25, 36, 68, 57,  3,
       49, 33, 46, 18, 25, 42,  9, 47, 56, 55, 47, 67,  5, 40, 78, 24, 79,
       21, 29, 29, 17, 64, 19, 19, 15, 29, 30, 17, 37, 36, 43, 16, 43, 34,
        8,  7, 67, 53, 10,  8, 49, 63, 73, 47, 35, 68, 47, 51, 19, 34, 64,
       62, 28, 71, 81, 48,  6, 24, 42, 75, 21, 10,  0, 34, 59, 29])
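
The difference between always picking the most likely ID and sampling can be seen on a toy distribution. This is a pure-Python sketch with made-up logits; `random.choices` stands in for `tf.random.categorical`, and all names are illustrative.

```python
import math
import random

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.5, 0.5, 0.1]   # made-up scores for a 4-character vocabulary
probs = softmax(toy_logits)

# Greedy decoding: the same index wins every time.
greedy = [max(range(len(probs)), key=probs.__getitem__) for _ in range(10)]

# Sampling: indices are drawn in proportion to their probability.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=10)

print("greedy :", greedy)
print("sampled:", sampled)
```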

## 6. Train the Model

In [21]:
## For our loss calculation, we'll use the 'sparse_categorical_crossentropy'
## because the target labels are provided as integers. If they are one-hot representations,
## then we'll use 'CategoricalCrossentropy'.
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 82)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.405335
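
A quick sanity check on this initial loss: an untrained model should assign roughly equal probability to each of the 82 vocabulary entries, so its cross-entropy should sit near ln(82). The sketch below computes that baseline; `uniform_loss` is an illustrative name.

```python
import math

vocab_entries = 82                      # size of the output layer
uniform_loss = math.log(vocab_entries)  # cross-entropy of a uniform guess

print(f"{uniform_loss:.4f}")
```

The value is about 4.407, close to the 4.405 reported above, which suggests the untrained model starts out near a uniform guess.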


In [22]:
model.compile(optimizer='adam', loss=loss)

In [24]:
EPOCHS=20
history = model.fit(dataset, epochs=EPOCHS)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [25]:
model.save_weights('sequential-bible-weights.h5')

## 7. Generate Text

### 7.1. Create a new model with the saved weights

In [23]:
## For this project we don't need to generate BATCH_SIZE different texts at once; one is enough.
## However, the trained model was built for a fixed batch size of BATCH_SIZE.
## Hence, we create a new model with batch_size=1 and restore the saved weights into it.

bible_text_generator = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

bible_text_generator.load_weights('sequential-bible-weights.h5')

bible_text_generator.build(tf.TensorShape([1, None]))

In [24]:
bible_text_generator.summary()

## Observe how index 0 of each output shape is now the batch size of 1.

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (1, None, 256)            20992     
                                                                 
 gru_1 (GRU)                 (1, None, 1024)           3938304   
                                                                 
 dense_1 (Dense)             (1, None, 82)             84050     
                                                                 
Total params: 4,043,346
Trainable params: 4,043,346
Non-trainable params: 0
_________________________________________________________________


### 7.2. Create a function that will generate a text for us

In [30]:
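def generate_text(model, start_string, temperature = 1.0, prediction_length = 1000):
    ## Convert the start string into a (1, len(start_string)) batch of character IDs
    input_eval = [ids_from_chars(s) for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    ## Clear any hidden state left over from a previous call
    model.reset_states()
    for i in range(prediction_length):
        predictions = model(input_eval)
        ## Drop the batch dimension: (1, seq_len, vocab_size) -> (seq_len, vocab_size)
        predictions = tf.squeeze(predictions, 0)
        ## Lower temperatures sharpen the distribution; higher ones flatten it
        predictions = predictions / temperature
        ## Sample one ID per position and keep the one for the last position
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        ## Feed only the new character back in; the stateful GRU keeps the earlier context
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(chars_from_ids(predicted_id))

    return (start_string + tf.strings.join(text_generated).numpy().decode('utf-8'))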

In [36]:
print(generate_text(bible_text_generator, start_string=u"Jesus", temperature = 0.8, prediction_length = 500))

Jesus said unto him, All the
people which ye had read throughout Pilate that believed not: but if
it be for my members shall the flesh of an only persecutions.

1:13 How long shall I obtailedo him?  17:12 And when the other sea: and
they shall be absent because of the LORD's house.

4:12 Behold, I will send a fire unto the body to concerning him that placeth, where
are the children of the flesh, that ye be not seen situate by faith unto the commandment
of God, and from the strange Lord Jesus


In [118]:
print(generate_text(bible_text_generator, start_string=u"Love the Lord with all your ", temperature = 0.2, prediction_length = 500))

Love the Lord with all your hearts:
for they are many that were not ashamed of the faith of the LORD, and the strangers shall be as the
same sacrifice was just, and the stars of heaven stood by the coast of the earth, and the hair of his hands toward the
south.

13:17 And he said unto them, Who are the seven lambs of the first year of king Cyrus, and the
children of Israel, and the first and the scribes, and the
Levites, and the Jebusites, and the Jebusites, and the
Gentiles, and the sons of Zebedee, and of the chi


In [119]:
print(generate_text(bible_text_generator, start_string=u"Love the Lord with all your ", temperature = 0.2, prediction_length = 300))

Love the Lord with all your hearts:
for they are many that believeth on him that sent me.

11:25 And the scribes and Pharisees and all his household, and
his mother, and his companions that were with him, and said, Lord, who shall say, Thou art my Son, thou shalt not be ashamed of
me that shall be for the sin offering.



In [121]:
print(generate_text(bible_text_generator, start_string=u"For God so loved the ", temperature = 0.2, prediction_length = 300))

For God so loved the Lord Jesus Christ.

1:11 For the things that are sanctified in the flesh of any man of the world, and the
world hath not seen him.

1:15 And when the chief priests and Pharisees which are sanctified in the wilderness, and the stream is a doer of the flesh.

1:18 The LORD shall receive the law
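
The effect of the `temperature` argument used in the calls above can be illustrated on a toy set of logits. This is a pure-Python sketch with made-up numbers; the function name `softmax_with_temperature` is illustrative and not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

demo_logits = [2.0, 1.0, 0.5]   # made-up scores for three characters

for t in (0.2, 0.8, 1.0, 2.0):
    probs_t = softmax_with_temperature(demo_logits, t)
    print(t, [round(p, 3) for p in probs_t])
```

A low temperature such as 0.2 puts almost all of the probability on the top-scoring character, giving more conservative text, while a higher temperature spreads probability across the alternatives and gives more diverse text.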


## Ideal Model Building

In [37]:
# evaluation using BLEU
from nltk.translate.bleu_score import corpus_bleu


In [39]:
pred = generate_text(bible_text_generator, start_string=u"Jesus", temperature = 0.8, prediction_length = 500)
print(pred)

Jesus
Christ is the spirit of the LORD. Now there eat the water of the sea;
in silver and gold, and in the world by the way, through him that believeth in the
ends of the earth, and not that we should bear the pen your souls.

1:9 For what will I bring again the chief priests for the increase of the General that told him, saying, Abide in the house of
Israel, that he saith unto the servants the chief caperia, and all the
children of Israel that bare not the holy mouth of the world, and it
sha


In [43]:
def generate_text_mod(model, start_string, temperature = 1.0, prediction_length = 1000):  
  
    num_generate = prediction_length
    input_eval = [ids_from_chars(s) for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
   
    text_generated = []

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(chars_from_ids(predicted_id))

    return text_generated

In [44]:
pred = generate_text_mod(bible_text_generator, start_string=u"Jesus", temperature = 0.8, prediction_length = 500)
pred

[<tf.Tensor: shape=(), dtype=string, numpy=b','>,
 <tf.Tensor: shape=(), dtype=string, numpy=b' '>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'w'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'h'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'o'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b' '>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'd'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'i'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'd'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b' '>,
 <tf.Tensor: shape=(), dtype=string, numpy=b't'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'h'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'e'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'y'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b' '>,
 <tf.Tensor: shape=(), dtype=string, numpy=b't'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'h'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'i'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'n'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'k'>,


In [76]:
import re

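## Join the generated characters into one byte string. A separate variable name is used
## so that 'text' (the full corpus) is not overwritten.
generated_bytes = tf.strings.join(pred).numpy()
re.split(b',|\n|\r| ', generated_bytes)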


[b'',
 b'',
 b'who',
 b'did',
 b'they',
 b'think',
 b'their',
 b'',
 b'covenant',
 b'',
 b'and',
 b'our',
 b'transgressions',
 b'among',
 b'the',
 b'world',
 b'',
 b'even',
 b'',
 b'as',
 b'this',
 b'day.',
 b'',
 b'',
 b'',
 b'22:22',
 b'And',
 b'he',
 b'said',
 b'unto',
 b'them',
 b'',
 b'Matten',
 b'seek',
 b'hold',
 b'the',
 b'people',
 b'of',
 b'the',
 b'land.',
 b'',
 b'',
 b'',
 b'3:5',
 b'Wilt',
 b'thou',
 b'join',
 b'the',
 b'light',
 b'of',
 b'my',
 b'hand!',
 b'',
 b'24:22',
 b'Thou',
 b'hast',
 b'already',
 b'to',
 b'be',
 b'brought',
 b'forth',
 b'',
 b'and',
 b'if',
 b'ye',
 b'will',
 b'be',
 b'no',
 b'',
 b'bread',
 b'for',
 b'the',
 b'commandment',
 b'of',
 b'God',
 b'',
 b'and',
 b'then',
 b'shall',
 b'he',
 b'eat',
 b'the',
 b'',
 b'burnt',
 b'offering',
 b'of',
 b'a',
 b'Prine',
 b'of',
 b'great',
 b'power;',
 b'1:5',
 b'To',
 b'the',
 b'house',
 b'of',
 b'Israel',
 b'',
 b'must',
 b'need',
 b'with',
 b'a',
 b'great',
 b'stead',
 b'of',
 b'a',
 b'burnt',
 b'dead',
 b

In [109]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
candidates = tokenizer.tokenize(tf.strings.join(pred).numpy().decode('utf-8'))
candidates

['who',
 'did',
 'they',
 'think',
 'their',
 'covenant',
 'and',
 'our',
 'transgressions',
 'among',
 'the',
 'world',
 'even',
 'as',
 'this',
 'day',
 '22',
 '22',
 'And',
 'he',
 'said',
 'unto',
 'them',
 'Matten',
 'seek',
 'hold',
 'the',
 'people',
 'of',
 'the',
 'land',
 '3',
 '5',
 'Wilt',
 'thou',
 'join',
 'the',
 'light',
 'of',
 'my',
 'hand',
 '24',
 '22',
 'Thou',
 'hast',
 'already',
 'to',
 'be',
 'brought',
 'forth',
 'and',
 'if',
 'ye',
 'will',
 'be',
 'no',
 'bread',
 'for',
 'the',
 'commandment',
 'of',
 'God',
 'and',
 'then',
 'shall',
 'he',
 'eat',
 'the',
 'burnt',
 'offering',
 'of',
 'a',
 'Prine',
 'of',
 'great',
 'power',
 '1',
 '5',
 'To',
 'the',
 'house',
 'of',
 'Israel',
 'must',
 'need',
 'with',
 'a',
 'great',
 'stead',
 'of',
 'a',
 'burnt',
 'dead',
 'fell',
 'down',
 'at',
 'the',
 'first',
 'set',
 'on',
 'the',
 'one',
 'wa']

In [98]:

references = [tokenizer.tokenize(text)]
references


[['1',
  '1',
  'In',
  'the',
  'beginning',
  'God',
  'created',
  'the',
  'heaven',
  'and',
  'the',
  'earth',
  '1',
  '2',
  'And',
  'the',
  'earth',
  'was',
  'without',
  'form',
  'and',
  'void',
  'and',
  'darkness',
  'was',
  'upon',
  'the',
  'face',
  'of',
  'the',
  'deep',
  'And',
  'the',
  'Spirit',
  'of',
  'God',
  'moved',
  'upon',
  'the',
  'face',
  'of',
  'the',
  'waters',
  '1',
  '3',
  'And',
  'God',
  'said',
  'Let',
  'there',
  'be',
  'light',
  'and',
  'there',
  'was',
  'light',
  '1',
  '4',
  'And',
  'God',
  'saw',
  'the',
  'light',
  'that',
  'it',
  'was',
  'good',
  'and',
  'God',
  'divided',
  'the',
  'light',
  'from',
  'the',
  'darkness',
  '1',
  '5',
  'And',
  'God',
  'called',
  'the',
  'light',
  'Day',
  'and',
  'the',
  'darkness',
  'he',
  'called',
  'Night',
  'And',
  'the',
  'evening',
  'and',
  'the',
  'morning',
  'were',
  'the',
  'first',
  'day',
  '1',
  '6',
  'And',
  'God',
  'said',
  

In [111]:
from nltk.translate.bleu_score import sentence_bleu

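## Note: list(set(...)) discards word order and duplicates in the reference, so the
## candidate shares no 2-, 3- or 4-grams with it. This is why NLTK emits the
## warnings below and the score is effectively 0.
score = sentence_bleu([list(set(tokenizer.tokenize(text)))], candidates)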
print(score)

7.431554310846816e-291


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
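
The warnings above follow from how BLEU counts n-gram overlaps. The sketch below implements only the clipped (modified) n-gram precision that BLEU is built on, in plain Python, and shows how a reference reduced to a set of unique words keeps its unigram overlap but loses its bigram overlap. The sentences and function names are illustrative, and `sorted(set(...))` is used in place of `list(set(...))` only to make the result deterministic.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference_demo = "and god said let there be light".split()
candidate_demo = "and god said let there be bread".split()

# Reducing the reference to unique words destroys its word order.
scrambled_reference = sorted(set(reference_demo))

for n in (1, 2):
    print(n,
          modified_precision(reference_demo, candidate_demo, n),
          modified_precision(scrambled_reference, candidate_demo, n))
```

With the ordered reference both unigram and bigram precision are high; with the scrambled reference the unigram precision is unchanged but the bigram precision drops to 0, which is the situation the NLTK warnings describe.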


In [107]:
list(set(tokenizer.tokenize(text)))

['Perazim',
 'Betharabah',
 'chop',
 'Libni',
 'inheritor',
 'unstable',
 'desolation',
 'clothes',
 'almond',
 'Urijah',
 '171',
 'disposing',
 'spindle',
 'Joiarib',
 'Spring',
 'Arba',
 'temper',
 'row',
 'accompanying',
 'desiredst',
 'rings',
 'Should',
 'seats',
 'hart',
 'Booz',
 'within',
 'Tabbath',
 'whirlwind',
 '91',
 'Sargon',
 'gutenberg',
 'runneth',
 'Tekoa',
 'spoilest',
 'Beloved',
 'Naarath',
 'gold',
 'gave',
 'TRADEMARK',
 'accursed',
 'overthrow',
 'Gilboa',
 'Hannathon',
 'Remain',
 'Rohgah',
 'ghost',
 'bemoan',
 'Irshemesh',
 'Kedar',
 'tingle',
 'odd',
 'shameful',
 'can',
 'Geshur',
 'shamefacedness',
 'Eliphaz',
 'enjoin',
 'company',
 'washing',
 'poorer',
 'hasty',
 'Zephon',
 'Meshullemeth',
 'gender',
 'Gemariah',
 'couches',
 'spared',
 'possessions',
 'Tilgathpilneser',
 'deceiveth',
 'Havilah',
 'Forasmuch',
 'gropeth',
 'Pharisees',
 'transforming',
 'Happy',
 'Withal',
 'Cherith',
 'wagon',
 'Hazaraddar',
 'Obil',
 'Carcas',
 'buriers',
 'drowned',
