<a href="https://colab.research.google.com/github/ashrafwkhalil/NN-Poetry-Generator/blob/main/PoemGenerator.ipynb#scrollTo=cba778ac" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Poem Generator
I will create a Poem Generator by training a neural network to predict the next word given an input sequence of words, and then using that model to continuously generate text. The model will be trained on a large dataset of poems.

## Imports

In [1]:
import numpy as np
import os 
import csv
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical 

Here I will be downloading the entire github repo so that I can access all neccessary data

In [None]:
!git clone 'https://github.com/ashrafwkhalil/NN-Poetry-Generator.git'

## Organize Data

In the below cell, I will be formatting my poem data into just lines of text with some maximum length, measured in number of words. I have stored the max line length value as a variable, thus allowing me to use it almost as a hyperparameter, as I am able to modify it and compare model performance with different values. 

In [2]:
# Initialize Poems list
poems = []
# Poems here are stored in many different directories, I am going to iterate through all the different directories
# and simply append all poems to the same list. This may be problematic, given that all of these directories represent
# different styles of poems, but I will overlook that for this demo.
forms_poems_path = '.NN-Poetry-Generator/Poems/forms'
topics_poems_path = '.NN-Poetry-Generator/Poems/topics'
directories_forms = os.listdir(forms_poems_path)
directories_topics = os.listdir(topics_poems_path)
# iterating through all directories
for poem_form in directories_forms:
    try:
        for poem_file in os.listdir(os.path.join(forms_poems_path, poem_form)):
            with open(os.path.join(forms_poems_path, poem_form, poem_file)) as poem:
                lines = poem.readlines()
                lines_newline_spaced = []
                for line in lines:
                    line = line.replace('\n', ' \n')
                    lines_newline_spaced.append(line)
                poems.append(lines_newline_spaced)[1:-1]
    except:
        continue
for poem_form in directories_topics:
    try:
        for poem_file in os.listdir(os.path.join(topics_poems_path, poem_form)):
            with open(os.path.join(topics_poems_path, poem_form, poem_file)) as poem:
                lines = poem.readlines()
                lines_newline_spaced = []
                for line in lines:
                    line = line.replace('\n', ' \n')
                    lines_newline_spaced.append(line)
                poems.append(lines_newline_spaced)[1:-1]
    except:
        continue
# the max_line_length variable. This will decide what the maximum length of a single input into the model will be. 
# Any line above the max line length will be split up into separate elements, and these elements will be inserted 
# into the array
max_line_length = 20
poems = np.concatenate(poems)
for i, line in enumerate(poems):
    # split into arrays of words
    line = line.split(' ')
    if len(line) > max_line_length:
        # add first maxlinelength sized chunk into array at original index
        poems[i] = ' '.join(line[:max_line_length])
        #split iterate through chunks of max line length
        for x in range(1, int(len(line)/max_line_length)):
            #insert chunks into poem list
            np.insert(poems, i+x, ' '.join(line[max_line_length*(i):max_line_length*(i+1)]))
poems[95:145]

array(['Grownups in the house \n', 'Haunt responsibilities \n',
       'I watch dream T V \n', 'Jet powered action \n',
       'Kids save their town from monsters \n', 'Lunatics and thugs \n',
       'Mountains streak below \n', 'New character friend or foe \n',
       'Only time will tell \n', 'Preview of  next show \n',
       'Queen of dream sends champion \n', 'Racing with Jet Kids \n',
       'Saving the future \n', 'Tune in tom arrow and see \n',
       'Ultimate showdown \n', 'Victory assured \n',
       'Waving heroes, credits roll \n', 'X factor revealed \n',
       'Youthful television dream \n', "Zooms away, I'm still asleep \n",
       '- + \n', 'Almost awake view \n', 'Banished from reality \n',
       'Creature made of mind \n', 'Dream-scape citizen \n',
       'Everado Man-Shaped-Wind \n', 'Face another day \n',
       'Grasp February \n', 'Handle your business in the \n',
       'Indigo yonder', 'Love Returns (Oulipo Poetry R+7) \n',
       'January 25, 2020 \n', 'Love 

## Tokenizing the inputs
Here, I will tokenize all the inputs, convert them into tokenized sequences, then break down a single sequence into multiple. I will do this by repeatedly removing a word from the end of the sequence and using this new sequence, with the last word removed, as a separate sequence. The minimum size for a sequence at this stage will be 2 words long. In the next step, I will use the last word in a given sequence as the label for that sequence, thus making it so at least 2 words is necessary. I also add sequences of less than 4 words multiple times, being that I want the model to focus on ensuring that the words it predicts are as likely as possible to be grammatically correct, and words closest to the predicted word matter the most, at least that makes sense to me intuitively. There are other ways to do this, but through some trial and error, this actually gave not bad results.

In [3]:
# instantiate a tokenizer
tokenizer = Tokenizer(filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t')
tokenizer.fit_on_texts(poems)
total_words = len(tokenizer.word_index)
input_sequences = []
# iterate through list of lines
yy = 0
for line in poems:
    if yy > 20:
        break
    token_list = tokenizer.texts_to_sequences([line])[0]
    # keep adding in order subsets of the sequence that start at index 0 as separate sequences.
    x = 0
    for i in range(1, len(token_list)):
        if i < 4:
            y = 3
        else:
            y = 1
        for x in range(y):
            sequence = token_list[:i+1]
            input_sequences.append(sequence)
sequence_lengths = [len(x) for x in input_sequences]
max_sequence_len = max(sequence_lengths)
avg_sequence_len = np.array(sequence_lengths).mean()
padded_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
# shuffle just in case
np.random.shuffle(padded_sequences)

## Creating Features and Labels
Here I will take the last element of each sequence and use it as the label for that sequence. I will then convert the labels into one hot vectors. Remember, this model is going to be trained to predict the next word given an input sequence.

In [4]:
features = padded_sequences[:, :-1]
labels = padded_sequences[:, -1]
one_hot_labels = to_categorical(labels, num_classes = total_words+1)
features[0:10]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    3,   11,    6],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   15],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    8,    2, 2700, 1003],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    6, 1243,   13],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    2,  520,   66],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,  418, 2709,  419],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,   15,   97,
          19,  156,    5,  265,   50,  356, 6906,    4,    2,  111],

## Model Definition
Here, I will construct my model. Because the order of words in an input sequence are extremely important, I will use two adjacent  Bidirectional LSTM layers after my embedding layer. I will apply no convolutions to the inputs before these LSTMs, being that since I am trying to focus on the output being grammatically correct, ensuring the word-to-word integrity of the inputs into the initial LSTM is important. I will then convolve the outputs of the initial set of LSTMs with a size 3 filter, trying to extract possible patterns of adjacent word-groups of size 3. Then I will pass that into a final unidirectional LSTM. There will be 2 final dense layers, the final layer will have a single 'neuron' for each possible output word, and will have a softmax activation, as we are expecting a single discrete output: the word that we are guessing will be next in the sequence.

In [5]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(total_words+1, 100, input_length=max_sequence_len-1),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(48, return_sequences = True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(48, return_sequences = True)),
    tf.keras.layers.Conv1D(24, 3),
    tf.keras.layers.LSTM(24),
    tf.keras.layers.Dense(512, activation = 'sigmoid'),
    tf.keras.layers.Dense(total_words+1, activation = 'softmax')
])
tf.keras.backend.clear_session()
model.summary()

2022-09-27 19:04:56.440980: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 21, 100)           826400    
                                                                 
 bidirectional (Bidirectiona  (None, 21, 96)           57216     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 21, 96)           55680     
 nal)                                                            
                                                                 
 conv1d (Conv1D)             (None, 19, 24)            6936      
                                                                 
 lstm_2 (LSTM)               (None, 24)                4704      
                                                                 
 dense (Dense)               (None, 512)               1

In [None]:
model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
history = model.fit(features, one_hot_labels, epochs = 100, verbose = 1)

Here is where you can choose an input seed that the NN can begin generating text based on. There is a pre-trained version of the model included in the repo, and the code directly below this will load it and use it to demo the poem generation.

In [None]:
model_saved = tf.keras.models.load_model('trained_model')
# modify this how you like
seed_text = "Run into the wind"
next_words = 100
  
for _ in range(next_words):
	# Convert the text into sequences
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	# Pad the sequences
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	# Get the probabilities of predicting a word
	predicted = model_saved.predict(token_list, verbose=0)
	# Choose the next word based on the maximum probability
	predicted = np.argmax(predicted, axis=-1).item()
	# Get the actual word from the word index
	output_word = tokenizer.index_word[predicted]
    # Append to the current text
	seed_text += " " + output_word

print(seed_text)