# Neural Language Model for "Sing A Song of Sixpence" & "Stopping by Woods on a Snowy Evening"

In this project, I attempt to create an RNN that predicts the next few characters for lines in two popular poems, applying an LSTM to a language prediction problem.

A language model predicts the next word in the sequence based on the specific words that have come before it in the sequence.

It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.
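To make the vocabulary trade-off concrete, here is a minimal sketch that compares the character-level and word-level vocabularies of a single sentence. The sentence and the crude punctuation stripping are assumptions chosen for illustration, not part of the pipeline used below.

```python
# Compare character-level and word-level vocabularies for a short text.
text = "Sing a song of sixpence, a pocket full of rye."

# character vocabulary: every distinct symbol, including spaces and punctuation
char_vocab = sorted(set(text))

# word vocabulary: lower-cased words with punctuation stripped (a crude tokenizer)
word_vocab = sorted(set(text.lower().replace(",", "").replace(".", "").split()))

print("characters:", len(text), "unique:", len(char_vocab))
print("words:", len(text.split()), "unique:", len(word_vocab))

# an unseen word can still be spelled with known characters,
# but it is out of vocabulary for the word-level model
unseen = "pie"
print(all(c in char_vocab for c in unseen), unseen in word_vocab)
```

The last line illustrates the flexibility argument: "pie" never appears in the sentence, yet every one of its characters does.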

# Source Text Creation

In [None]:
!pip install tensorflow
!pip install keras
!pip install h5py



In [None]:

s='Sing a song of sixpence,\
A pocket full of rye.\
Four and twenty blackbirds,\
Baked in a pie.\
When the pie was opened\
The birds began to sing;\
Wasn’t that a dainty dish,\
To set before the king.\
The king was in his counting house,\
Counting out his money;\
The queen was in the parlour,\
Eating bread and honey.\
The maid was in the garden,\
Hanging out the clothes,\
When down came a blackbird\
And pecked off her nose.'

with open('rhymes.txt','w') as f:
  f.write(s)

    Sing a song of sixpence,
    A pocket full of rye.
    Four and twenty blackbirds,
    Baked in a pie.

    When the pie was opened
    The birds began to sing;
    Wasn’t that a dainty dish,
    To set before the king.

    The king was in his counting house,
    Counting out his money;
    The queen was in the parlour,
    Eating bread and honey.

    The maid was in the garden,
    Hanging out the clothes,
    When down came a blackbird
    And pecked off her nose.

# Sequence Generation

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters.

The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character.

After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.
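The generation loop described above can be sketched without any neural network. In this hedged example, `toy_predict` is a hypothetical stand-in for a trained model: it simply looks up the window in a memorised string and returns the character that follows.

```python
def generate(predict_next, seed, seq_length, n_chars):
    """Repeatedly predict one character and append it to the running text."""
    text = seed
    for _ in range(n_chars):
        # only the last seq_length characters are fed to the predictor
        window = text[-seq_length:]
        text += predict_next(window)
    return text

# hypothetical stand-in for a trained model: continue a memorised string
corpus = "Sing a song of sixpence,"

def toy_predict(window):
    pos = corpus.find(window)
    if pos == -1 or pos + len(window) >= len(corpus):
        return "?"  # unknown context or end of corpus
    return corpus[pos + len(window)]

result = generate(toy_predict, "Sing a son", 10, 5)
print(repr(result))
```

A real model replaces the lookup with a learned probability distribution, but the append-and-slide structure stays the same.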

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text.

We will use an arbitrary length of 10 characters for this model.

There is not a lot of text, and 10 characters is a few words.

We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.
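As a toy-scale illustration of this transformation, the sketch below slides a window over a short string. The helper name `make_sequences` and the window length of 4 are assumptions for the example; the notebook itself uses a length of 10.

```python
def make_sequences(text, length):
    """Return overlapping windows of length + 1 characters.

    The first `length` characters of each window are the input and the
    final character is the target to predict.
    """
    return [text[i - length:i + 1] for i in range(length, len(text))]

seqs = make_sequences("Sing a song", 4)
for s in seqs:
    print(repr(s[:-1]), "->", repr(s[-1]))

# the number of windows is always len(text) - length
print(len(seqs))
```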

In [None]:
#load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

In [None]:
#load text
raw_text = load_doc('rhymes.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select a window of `length` input characters plus 1 target character
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was openedThe birds began to sing;Wasn’t that a dainty dish,To set before the king.The king was in his counting house,Counting out his money;The queen was in the parlour,Eating bread and honey.The maid was in the garden,Hanging out the clothes,When down came a blackbirdAnd pecked off her nose.
Total Sequences: 384


In [None]:
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

# Model Training

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

In [None]:
from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

In [None]:
# load

in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character. The result is a list of integer lists.

We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.

In [None]:
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 38


The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. Rather than specify these numbers, we use the second and third dimensions on the X input data. This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition.
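To see where the three dimensions of the input come from, here is a small NumPy-only sketch of one-hot encoding. The tiny vocabulary size and the example indices are assumptions, and indexing into `np.eye` is used as a stand-in for `to_categorical`.

```python
import numpy as np

vocab_size = 5
# 2 integer-encoded sequences, each with 3 time steps
encoded = np.array([[0, 2, 4],
                    [1, 1, 3]])

# indexing the identity matrix turns each integer into a one-hot row
one_hot = np.eye(vocab_size, dtype="float32")[encoded]

print(one_hot.shape)  # (2, 3, 5): (samples, time steps, features)
print(one_hot[0])
```

With the rhyme, the same layout gives 10 time steps and 38 features per sample.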

The model has a single LSTM hidden layer with 75 memory cells. The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.
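The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights and a bias); the helper names are my own.

```python
def lstm_params(units, features):
    # 4 gates, each with (features + units) weights per unit plus one bias
    return 4 * (units * (units + features) + units)

def dense_params(inputs, outputs):
    # one weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

print(lstm_params(75, 38))   # 34200
print(dense_params(75, 38))  # 2888
print(lstm_params(75, 38) + dense_params(75, 38))  # 37088
```

The same formulas reproduce the counts for wider layers, for example `lstm_params(100, 38)` gives 55600.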

The model is learning a multi-class classification problem, so we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each batch update. The model is fit for 100 training epochs.
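For intuition about the output layer and the loss, here is a minimal NumPy sketch of softmax followed by categorical cross-entropy. The logits and the three-class target are made-up values, and this is not how Keras computes the loss internally.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0); with one-hot targets only the true class contributes
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])   # raw scores for 3 classes
probs = softmax(logits)
target = np.array([[1.0, 0.0, 0.0]])   # one-hot: class 0 is correct
loss = categorical_crossentropy(target, probs)

print(probs, probs.sum())
print(loss)
```

The loss shrinks toward zero as the probability assigned to the correct character approaches one.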

In [None]:
# ----- ORIGINAL ------
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=100)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 75)                34200     
                                                                 
 dense (Dense)               (None, 38)                2888      
                                                                 
Total params: 37088 (144.88 KB)
Trainable params: 37088 (144.88 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Ep

In [None]:
# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))



# **Improving the Model**

In [None]:
from tensorflow.keras.layers import Dropout

# define model
model_2 = Sequential()

model_2.add(LSTM(100, input_shape=(X.shape[1], X.shape[2])))
model_2.add(Dropout(0.2))

model_2.add(Dense(80, activation='relu'))
model_2.add(Dropout(0.2))
model_2.add(Dense(75, activation='relu'))
model_2.add(Dropout(0.2))
model_2.add(Dense(vocab_size, activation='softmax'))

print(model_2.summary())
# compile model
model_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model_2.fit(X, y, epochs=120)

Model: "sequential_34"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_41 (LSTM)              (None, 100)               55600     
                                                                 
 dropout_28 (Dropout)        (None, 100)               0         
                                                                 
 dense_48 (Dense)            (None, 80)                8080      
                                                                 
 dropout_29 (Dropout)        (None, 80)                0         
                                                                 
 dense_49 (Dense)            (None, 75)                6075      
                                                                 
 dropout_30 (Dropout)        (None, 75)                0         
                                                                 
 dense_50 (Dense)            (None, 38)              

# **Process for RNN**

When formulating my own RNN, I tried a variety of techniques to get the best possible output while mitigating overfitting:

1) Using a different number of memory cells

Memory cells in the context of RNNs are roughly the number of neurons in the recurrent layer. Intuitively, I believed it would be better to increase the number of memory cells because more nodes could account for some of the complexities within the data. However, I did not want to increase the number massively, as that could lead to overfitting. So I raised the number of memory cells by a small amount, from 75 to 100. When I ran the model with just this change, the training accuracy rose only slightly and the final model's accuracy did not meaningfully change.

2) Different types and numbers of layers

I wanted to add more layers so that the model could capture more of the hidden complexity in the data. Because I had already increased the number of nodes, I planned to add only one more layer. I chose a fully connected ReLU layer because ReLU introduces non-linearity and is cheap to compute: it passes positive activations through unchanged and sets negative ones to zero. I did at first try stacking multiple ReLU layers, but that produced very inaccurate output in the testing portion.

3) Different lengths of training epochs

I added a few more epochs because I believed the other changes were already accounting for some overfitting, and I wanted to ensure the model still got the exposure to the dataset that it needed. I settled on 120 epochs because that is not far above the starting number of 100.

4) Different sequence lengths and pre-processing

I first changed the sequence length to 24, the length of the first line. However, the model then performed very well during training and in the validation epochs but very poorly when I tried to generate text, which suggests it had overfitted. I therefore kept the sequence length at 10 so that the model is trained on smaller pieces of the corpus and learns patterns at that scale.

5) Try regularization techniques such as Dropout

Because I changed all other aspects of the model to be more specific to the corpus, I made sure to include three Dropout layers with a small rate (0.2) in order to regularize the model and help it generalize.
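As a rough picture of what a Dropout layer with rate 0.2 does, here is a NumPy sketch of inverted dropout. The function name, the seeded generator and the all-ones activations are assumptions for illustration; Keras handles the training and inference switch automatically.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # at inference time the layer is the identity
    keep = rng.random(x.shape) >= rate
    # dividing by (1 - rate) keeps the expected activation unchanged
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, 0.2, True, rng)

print((dropped == 0).mean())   # close to the dropout rate
print(dropped.mean())          # close to the original mean of 1.0
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which is the regularizing effect aimed for here.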

# **Model Evaluation**

In [None]:
from sklearn.model_selection import train_test_split

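def validation_epochs(X, y):
  # hold out 20% of the sequences for validation
  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  # NOTE: model_2 was already fit on all of X and y above, so it has seen the
  # validation sequences; the validation accuracy reported here is optimistic.
  model_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  history = model_2.fit(
      X_train, y_train,
      epochs=100,
      validation_data=(X_val, y_val),
      verbose=1)
  return history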

In [None]:
validation_epochs(X, y)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x7da6b137cac0>

In [None]:
# save the model to file
model_2.save('model_2.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

# Generating Text

In [None]:
from pickle import load
import numpy as np
from keras.models import load_model
from tensorflow.keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict the index of the most likely next character
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text

# load the model
model = load_model('model.h5')
model_2 = load_model('model_2.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

Running the example generates three sequences of text.

The first is a test to see how the model does at starting from the beginning of the rhyme. The second is a test to see how well it does at beginning in the middle of a line. The final example is a test to see how well it does with a sequence of characters never seen before.

In [None]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

Sing a song of sixpence,A pock
king was in his counting house
hello worleWheeda aacinn t a c


In [None]:
# test start of rhyme
print(generate_seq(model_2, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model_2, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model_2, mapping, 10, 'hello worl', 20))

Sing a song of sixpence,A pock
king was in his counting house
hello worly.When than biras an


## "Stopping by Woods on a Snowy Evening" (Robert Frost)

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

raw = 'Whose woods these are I think I know.\
His house is in the village though;\
He will not see me stopping here\
To watch his woods fill up with snow.\
My little horse must think it queer\
To stop without a farmhouse near\
Between the woods and frozen lake\
The darkest evening of the year.\
He gives his harness bells a shake\
To ask if there is some mistake.\
The only other sound’s the sweep\
Of easy wind and downy flake.\
The woods are lovely, dark and deep,\
But I have promises to keep,\
And miles to go before I sleep,\
And miles to go before I sleep.'

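# Preprocess the text
# NOTE: the backslash continuations above join the poem into one string with
# no '\n' characters, so this split returns a single line.
lines = raw.split('\n')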
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
max_sequence_length = max([len(seq) for seq in sequences])

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size:', vocab_size)

# Separate into input and output
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
# X is already a 2-D array of shape (samples, steps), so pass it directly
X = pad_sequences(X, maxlen=X.shape[1], padding='pre')
y = to_categorical(y, num_classes=vocab_size)

# Define the model
ec_model = Sequential()
ec_model.add(Embedding(input_dim=vocab_size, output_dim=40, input_length=max_sequence_length-1))
ec_model.add(Dropout(0.2))
ec_model.add(LSTM(100))
ec_model.add(Dropout(0.2))
ec_model.add(Dense(vocab_size, activation='softmax'))

# Compile the model
ec_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
print(ec_model.summary())

# history = ec_model.fit(X, y, epochs=100)

Vocabulary Size: 74
Model: "sequential_30"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_28 (Embedding)    (None, 101, 40)           2960      
                                                                 
 dropout_17 (Dropout)        (None, 101, 40)           0         
                                                                 
 lstm_37 (LSTM)              (None, 100)               56400     
                                                                 
 dropout_18 (Dropout)        (None, 100)               0         
                                                                 
 dense_38 (Dense)            (None, 74)                7474      
                                                                 
Total params: 66834 (261.07 KB)
Trainable params: 66834 (261.07 KB)
Non-trainable params: 0 (0.00 Byte)
_____________________________________________________________

In [None]:
def generate_text(model, tokenizer, max_sequence_length, seed_text, next_words=2):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')
        predicted_word_index = np.argmax(model.predict(token_list), axis=-1)
        predicted_word = tokenizer.index_word[predicted_word_index[0]]
        seed_text += " " + predicted_word
    return seed_text

In [None]:
# Test start of sentence
print(generate_text(ec_model, tokenizer, max_sequence_length, 'Whose woods'))

# Test mid-line
print(generate_text(ec_model, tokenizer, max_sequence_length, 'The darkest'))

# Test not in the original
print(generate_text(ec_model, tokenizer, max_sequence_length, 'hello world'))

Whose woods see not
The darkest evening must
hello world evening see


I decided to run a word-level model on the well-known poem by Robert Frost, "Stopping by Woods on a Snowy Evening." I chose this text because of its usage of older language and its focus on artistic expression rather than conveying straightforward information or forming logical connections between words. This made me anticipate that the model might face difficulties in handling this particular piece of text, making it a bit more challenging.
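For reference, one common way to prepare training pairs for a word-level model is to expand every line into all of its prefixes and left-pad them to a common length. The sketch below is a pure-Python illustration of that idea on the first two lines of the poem; the `tokenize` helper and the index layout are assumptions, and this is not the preprocessing used in the cell above.

```python
import re

poem_lines = [
    "Whose woods these are I think I know.",
    "His house is in the village though;",
]

def tokenize(line):
    # hypothetical tokenizer: lower-case words, punctuation dropped
    return re.findall(r"[a-z']+", line.lower())

# index 0 is reserved for padding, so real words start at 1
words = sorted({w for line in poem_lines for w in tokenize(line)})
word_index = {w: i + 1 for i, w in enumerate(words)}

# every prefix of at least two words becomes one training sequence
sequences = []
for line in poem_lines:
    ids = [word_index[w] for w in tokenize(line)]
    for end in range(2, len(ids) + 1):
        sequences.append(ids[:end])

# left-pad to a common length, then split into inputs and next-word targets
max_len = max(len(s) for s in sequences)
padded = [[0] * (max_len - len(s)) + s for s in sequences]
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(len(sequences), max_len)
print(X[0], "->", y[0])
```

Expanding lines this way turns a 16-line poem into many short examples, which gives a small model far more to learn from than a single long sequence.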

While developing the model, I made a few choices intended to improve its performance. I added an embedding layer, which maps each word index to a dense 40-dimensional vector, so that the model works with compact learned word vectors rather than sparse one-hot encodings. An LSTM layer and a Dense softmax output layer follow it, and the two dropout layers are there to limit overfitting and, in turn, to improve the fluency of the generated sequences.
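An embedding layer is essentially a trainable lookup table. The NumPy sketch below mimics that lookup with a randomly initialized matrix; the token ids are made up, while the sizes 74 and 40 are taken from the model summary above.

```python
import numpy as np

vocab_size, embed_dim = 74, 40
rng = np.random.default_rng(0)

# one row of embed_dim numbers per word index; in a real model these are learned
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# one left-padded sequence of word indices (0 is the padding index)
token_ids = np.array([[0, 0, 5, 12]])

# the "layer" is just row lookup: each index is replaced by its vector
vectors = embedding_matrix[token_ids]

print(vectors.shape)          # (1, 4, 40)
print(embedding_matrix.size)  # 2960
```

The matrix size matches the 2960 parameters that the summary reports for the embedding layer.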

Overall, my belief is that word-level models have the potential to better capture the semantic context, but only when coupled with measures to prevent overfitting and an appropriate level of complexity, as I implemented in this case. While the output was not accurate to the actual poem itself, it did seem to make some sense. The goal was to strike a balance, allowing the model to discern the subtleties of the poetic language while accounting for its inherent complexities.