<a href="https://colab.research.google.com/github/bmcnns/catch22_lstm_textgen/blob/main/catch22_lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This method was adapted from Francois Chollet's [Deep Learning With Python](https://www.manning.com/books/deep-learning-with-python).

This notebook is an exercise from Chapter 8, "Generative deep learning",
adapted to take its training sequences from a PDF of Joseph Heller's *Catch-22*
saved at `my_data/catch22.pdf`.

Chollet, Francois. (2017). Deep learning with Python. Shelter Island, NY: Manning Publications Co. 

Dependencies:
   - pypdf2: for reading PDF files

In [None]:
 # !pip install pypdf2

# Collect a large dataset of text that you want the model to learn from. This could be a corpus of books, articles, or any other type of text.

In [None]:
import PyPDF2

# Open the PDF file in read-binary mode
with open('my_data/catch22.pdf', 'rb') as file:
  # PdfReader, .pages and .extract_text() are the PyPDF2 >= 3.0 names for the
  # older PdfFileReader, getNumPages()/getPage() and extractText().
  pdf = PyPDF2.PdfReader(file)

  # Extract the text of every page and join it into one string
  text = ''.join(page.extract_text() for page in pdf.pages)

# Preprocess the text by removing PDF header information.
header_length = 186
text = text[header_length:]

# Convert the text to lowercase
text = text.lower()

print('Corpus Length', len(text))

Corpus Length 1039509


# Preprocess the text data by extracting partially overlapping sequences of length maxlen, one-hot encoding them, and packing them into a 3D NumPy array x of shape (sequences, maxlen, unique_characters)

In [None]:
import numpy as np

# Extracting sequences of 60 characters.
maxlen = 60

# Sample a new sequence every 3 characters.
step = 3

# Holds the extracted sequences
sentences = []

# Holds the targets
next_chars = []

for i in range(0, len(text) - maxlen, step):
  sentences.append(text[i: i + maxlen])
  next_chars.append(text[i + maxlen])

print('Number of sequences:', len(sentences))

# Sorted list of the unique characters in the corpus
chars = sorted(set(text))
print('Unique characters:', len(chars))
# Dictionary that maps unique characters to their index in the list "chars"
char_indices = {char: i for i, char in enumerate(chars)}

print('Vectorization...')
# One-hot encodes the characters into binary arrays.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 346483
Unique characters: 60
Vectorization...
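
To make the windowing and one-hot encoding concrete, here is a minimal sketch of the same steps on a toy string. The names `toy_text`, `toy_maxlen` and `toy_step` and their small values are assumptions chosen for illustration; they are not the notebook's settings.

```python
import numpy as np

# Toy corpus and small window settings, chosen only for illustration.
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Slide a window of toy_maxlen characters over the text, toy_step at a time.
toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Map each unique character to an integer index.
toy_chars = sorted(set(toy_text))
toy_char_indices = {char: i for i, char in enumerate(toy_chars)}

# One-hot encode the windows (inputs) and the characters that follow them (targets).
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = True

print(toy_sentences)             # ['hell', 'lo w', 'worl']
print(toy_next_chars)            # ['o', 'o', 'd']
print(toy_x.shape, toy_y.shape)  # (3, 4, 8) (3, 8)
```

Consecutive windows overlap by `toy_maxlen - toy_step` characters (here, one), and each window is paired with the single character that follows it as the prediction target.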


# Build the model

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Define the model
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
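
As a rough sanity check on the size of this architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes the values reported earlier in the notebook (128 LSTM units, 60 unique characters) and the standard LSTM layout with one bias vector per weight block; the result should agree with what `model.summary()` reports, but treat it as an estimate rather than a guarantee.

```python
# Assumed values taken from the notebook: 128 LSTM units, 60 unique characters.
units = 128
n_chars = 60

# An LSTM layer has four weight blocks (three gates plus the candidate cell
# state), each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (units * (n_chars + units) + units)

# The Dense softmax layer has one weight per (unit, character) pair
# plus one bias per character.
dense_params = units * n_chars + n_chars

print(lstm_params, dense_params, lstm_params + dense_params)
# 96768 7740 104508
```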

# Train the model

In [None]:
# Fit the model to the training data
model.fit(x, y, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdfb05246d0>


# Sampling from the softmax distribution of the model

In [None]:
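# The higher the temperature, the higher the entropy of the sampling.
def sample(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  # Rescale the log-probabilities by the temperature
  preds = np.log(preds) / temperature
  # Re-normalize with a softmax so the values sum to 1 again
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  # Draw one character index from the reweighted distribution
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)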

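The temperature parameter controls the randomness of the generated text. A higher temperature produces more surprising, less predictable text, while a lower temperature produces more repetitive, predictable text. This is because the log-probabilities are divided by the temperature before being re-normalized: a temperature above 1 flattens the distribution (higher entropy), and a temperature below 1 sharpens it toward the most likely character (lower entropy).

The function does not simply return the most probable character. It draws one sample from the reweighted distribution with `np.random.multinomial`, which yields a one-hot vector, and `np.argmax` then returns the index of the character that was drawn.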
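
To see the effect of the temperature on a distribution, the sketch below applies the same reweighting as `sample` but returns the adjusted probabilities instead of drawing from them. The helper `reweight` and the three-character distribution `toy_preds` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    """Return the temperature-adjusted distribution (same math as sample(), without the draw)."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

# Hypothetical softmax output over three characters.
toy_preds = [0.6, 0.3, 0.1]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

At a temperature of 1.0 the distribution is unchanged. At 0.2 the most likely character ends up with about 97% of the probability mass, and at 1.2 the distribution is slightly flatter than the original.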

# Generating Text

In [None]:
import random

for epoch in range(1, 60):
  print('epoch', epoch)
  model.fit(x, y, batch_size=128, epochs=1)

  # Selects a text seed at random
  start_index = random.randint(0, len(text) - maxlen - 1)
  generated_text = text[start_index: start_index + maxlen]
  print('--- Generating with seed: "' + generated_text + '"')

  for temperature in [0.2, 0.5, 1.0, 1.2]:
    print('---- temperature:', temperature)
    for i in range(400):
      # One-hot encodes the characters generated so far
      sampled = np.zeros((1, maxlen, len(chars)))
      for t, char in enumerate(generated_text):
        sampled[0, t, char_indices[char]] = 1
      
      # Samples the next character
      preds = model.predict(sampled, verbose=0)[0]
      next_index = sample(preds, temperature)
      next_char = chars[next_index]

      generated_text += next_char
      generated_text = generated_text[1:]

      print(next_char, end='')
    print('')


epoch 1
--- Generating with seed: "ht to be glad you've got any temperature at all."
     doc d"
---- temperature: 0.2
aneeka she was a man of the start of a fell him and start and the should in the start of the started the start of the same the started the start and his forther of his forther the same the same to make a string and the same the started his finter the start and stop of the started the start and shook and the same the started the started the started and streased and the same no distand of the start 
---- temperature: 0.5
with his sile the stromm to sise was she sturned his eyes perse of the starting them and the good hought to sight, and when he was a same to berong the store and had colonel cathcart to knop out the concons and then engered the dunbar morning in the strove her orr made or as on a mens on the officers and the the back to danbersting and tome them and was his something and into a panized the but the
---- temperature: 1.0
n doc done odse dewarly displarely,

# Result

## Temperature 0.2

```
and see any more country with an avince himself and shook his head with the 

street and share and the ground him streamed at the hospital to him and the 

tent. 

he was stuld delight of the bomb right to the stumbled the colonel in the ward 

and the plane of the chaplain was stunners with a constantly in a streng had to 

the surprised him of the chaplain was group to him and the plane to the stumpets
```

## Temperature 0.5



```
and second countroes to the plane that he did not want to marry turning before 

he had to sin it, and you begin down on the world be the sweat of plane with 

his hands below he had all a shoulder girl them and his head at the stumbled 

them her away the mindation in the hospital to the bomb right to the word had 

stopped to him and part and struckned to the patacted with a ponty from the bed 

to streek
```

## Temperature 1.0


```
ed like a madnes of excevided that he reheishes, the chaplain voice ruttles long gustive them.
     "chaplies?"
     "relantacted," colonel korn returned's milliard page dately proted her signity at the chaplain was a lung. she
wanted out for the hustianing omen that i tell having to do there was run group turn
and put old your consausial sloplex you thin tiction. a
facutan and, will had chasced b
```


## Temperature 1.2

```
eoo?"
     "well, milo, you lged courtgey. yossarian, mustarily
furbilligathet
on the
dubors toward yossarian had
halfory to be someone add, rip! somposity mughtly. "he's over in the saw, they penswer to the thatigges and godded unselficle prindi of
b.fighted her fresh to be attaid dack and then do do overhelly my in their room, fax people, a stuffered eyes
afraid
tame, say condledded to acraible
```