# Neural Language Model

Description: A language model predicts the next word in a sequence based on the specific words that have come before it. The model built here works at the character level, so it predicts the next character instead.

Benefits: Character-based language models have small vocabularies and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.
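As a minimal sketch of the flexibility point, the snippet below checks whether a word that never appears in the training text can still be represented. The `train_text` sample, the helper names, and the test word `pocketful` are illustrative assumptions, not part of the notebook.

```python
# Build a character vocabulary and a word vocabulary from the same short text.
train_text = "Sing a song of sixpence, A pocket full of rye."
char_vocab = set(train_text)
word_vocab = set(train_text.split())

def covered_by_chars(word):
    """True if every character of `word` has been seen in the training text."""
    return all(ch in char_vocab for ch in word)

def covered_by_words(word):
    """True if `word` itself has been seen in the training text."""
    return word in word_vocab

new_word = "pocketful"  # hypothetical unseen word
print(covered_by_chars(new_word))  # a character-level model can encode it
print(covered_by_words(new_word))  # a word-level model would need an unknown-word token
```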

Make sure to pip install:
- keras
- tensorflow
- h5py

# Source Text Creation

Starting with: simple nursery rhyme

In [8]:
!pip install tensorflow
!pip install keras
!pip install h5py



In [9]:

s='Sing a song of sixpence,\
A pocket full of rye.\
Four and twenty blackbirds,\
Baked in a pie.\
When the pie was opened\
The birds began to sing;\
Wasn’t that a dainty dish,\
To set before the king.\
The king was in his counting house,\
Counting out his money;\
The queen was in the parlour,\
Eating bread and honey.\
The maid was in the garden,\
Hanging out the clothes,\
When down came a blackbird\
And pecked off her nose.'

with open('rhymes.txt','w') as f:
  f.write(s)

    Sing a song of sixpence,
    A pocket full of rye.
    Four and twenty blackbirds,
    Baked in a pie.

    When the pie was opened
    The birds began to sing;
    Wasn’t that a dainty dish,
    To set before the king.

    The king was in his counting house,
    Counting out his money;
    The queen was in the parlour,
    Eating bread and honey.

    The maid was in the garden,
    Hanging out the clothes,
    When down came a blackbird
    And pecked off her nose.

# Sequence Generation

Notes:
- the input and output sequences must be characters
- the number of characters used as input also defines how many characters must be provided to the model as a seed in order to elicit the first predicted character
- once the first character has been generated, it is appended to the input sequence and used as input for the model to generate the next character
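The windowing described in these notes can be sketched as follows. This is a small stand-alone illustration; `make_windows` is a hypothetical helper name, and the window length of 4 is chosen only to keep the printout short. A text of n characters yields n - length windows.

```python
def make_windows(text, length):
    """Return every run of `length` input characters plus one target character."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

windows = make_windows("Sing a song", 4)
for w in windows[:3]:
    # the last character of each window is the one the model learns to predict
    print(repr(w[:-1]), "->", repr(w[-1]))
print("total windows:", len(windows))
```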

In [10]:
# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text;
    # the with-block closes the file even if reading fails
    with open(filename, 'r') as file:
        text = file.read()
    return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

In [11]:
#load text
raw_text = load_doc('rhymes.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select a window of length + 1 characters (the inputs plus one target)
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was openedThe birds began to sing;Wasn’t that a dainty dish,To set before the king.The king was in his counting house,Counting out his money;The queen was in the parlour,Eating bread and honey.The maid was in the garden,Hanging out the clothes,When down came a blackbirdAnd pecked off her nose.
Total Sequences: 384


In [13]:
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

# Train a Model
A Long Short-Term Memory (LSTM) recurrent hidden layer is used to learn context from the input sequence in order to make predictions.

In [14]:
from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

In [15]:
# load

in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

In [16]:
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 38
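For readers who want to see what the integer and one-hot encoding steps produce, here is a minimal pure-Python sketch on a toy string. The `one_hot` helper and the sample text are assumptions for illustration; the notebook itself relies on `to_categorical`, which does the equivalent on arrays.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

text = "a pie"
chars = sorted(set(text))                      # [' ', 'a', 'e', 'i', 'p']
mapping = {c: i for i, c in enumerate(chars)}  # character -> integer
encoded = [mapping[c] for c in text]           # integer-encoded sequence
X = [one_hot(i, len(mapping)) for i in encoded]

print(encoded)
for row in X:
    print(row)
```

Each character becomes one row with exactly one non-zero entry, so a batch of 10-character inputs over a 38-character vocabulary has three dimensions: samples, 10 time steps, and 38 features.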


The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. Rather than specify these numbers, we use the second and third dimensions on the X input data. 

The model has a single LSTM hidden layer with 75 memory cells as its starting configuration; the training cell below uses 95 units after tuning. A fully connected output layer outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.
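To get a feel for the size of this architecture, the trainable parameter count can be worked out by hand. The sketch below assumes a standard LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias) followed by a dense layer; the helper names are hypothetical.

```python
def lstm_params(units, features):
    """Parameters of a standard LSTM layer: four gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (features + units) + units)

def dense_params(inputs, outputs):
    """Parameters of a fully connected layer: weights plus biases."""
    return inputs * outputs + outputs

units, features, vocab_size = 75, 38, 38
lstm = lstm_params(units, features)
dense = dense_params(units, vocab_size)
print(lstm, dense, lstm + dense)  # 34200 2888 37088
```

Passing a different `units` value shows how quickly the count grows relative to the few hundred training sequences available here.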

The model is learning a multi-class classification problem, so we use the categorical log loss (categorical cross-entropy) intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each batch update. The starting configuration is fit for 50 training epochs; the training cell below uses 100.
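As a worked example of the loss, the sketch below computes softmax and categorical cross-entropy in plain Python for a model that has learned nothing yet and spreads probability evenly over 38 characters. The function names are illustrative; Keras computes the same quantity internally, averaged over a batch.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

vocab_size = 38
uniform = softmax([0.0] * vocab_size)  # equal scores -> uniform distribution
loss = categorical_cross_entropy(uniform, true_index=0)
print(round(loss, 4))  # ln(38), about 3.6376
```

This value is close to the loss logged for the first training epoch, which is what one would expect from a freshly initialized network.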

# To Do:
- Try different numbers of memory cells
- Try different types and amounts of recurrent and fully connected layers
- Try different lengths of training epochs
- Try different sequence lengths and pre-processing of data
- Try regularization techniques such as Dropout

In [37]:
# Validation split helper
import numpy as np

def split_func(X, y, validation_split=.2):
    # shuffle the samples, then hold out the last `validation_split` fraction
    split_ind = int((1 - validation_split) * len(X))
    indices = np.arange(len(X))
    np.random.shuffle(indices)
    X = X[indices]
    y = y[indices]
    X_train, X_val = X[:split_ind], X[split_ind:]
    y_train, y_val = y[:split_ind], y[split_ind:]

    return X_train, X_val, y_train, y_val
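Keras can also hold out validation data itself: `model.fit` accepts a `validation_split` argument, which takes the last fraction of the samples before any shuffling. The sketch below mimics that behaviour on plain lists so the difference from the shuffled split is visible. `tail_split` is a hypothetical name, and the rounding of the split index is an assumption borrowed from `split_func`.

```python
def tail_split(X, y, validation_split=0.2):
    """Hold out the final fraction of the samples without shuffling."""
    split_ind = int((1 - validation_split) * len(X))
    return X[:split_ind], X[split_ind:], y[:split_ind], y[split_ind:]

X = list("abcdefgh")
y = list(range(8))
X_train, X_val, y_train, y_val = tail_split(X, y, validation_split=0.25)
print(X_train, X_val)
print(y_train, y_val)
```

Because the sequences here are stored in text order, an unshuffled tail split validates on the end of the rhyme, whereas the shuffled split validates on windows scattered through it.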

In [48]:
# define model
from keras.layers import Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(95, return_sequences=False, input_shape=(X.shape[1], X.shape[2])))
#model.add(LSTM(50))
#model.add(Dropout(.3))
model.add(Dense(vocab_size, activation='softmax'))
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()
# compile model
optimizer = Adam(learning_rate=.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
# fit model
X_train, X_val, y_train, y_val = split_func(X, y)
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100)
#history=model.fit(X, y, epochs=100)

Epoch 1/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 62ms/step - accuracy: 0.0966 - loss: 3.6230 - val_accuracy: 0.1039 - val_loss: 3.5820
Epoch 2/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.1484 - loss: 3.5460 - val_accuracy: 0.1039 - val_loss: 3.4555
Epoch 3/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.1695 - loss: 3.3193 - val_accuracy: 0.1039 - val_loss: 3.2969
Epoch 4/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.1590 - loss: 3.1843 - val_accuracy: 0.1039 - val_loss: 3.2142
Epoch 5/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - accuracy: 0.2017 - loss: 3.0028 - val_accuracy: 0.1039 - val_loss: 3.2213
Epoch 6/100
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.1728 - loss: 2.9868 - val_accuracy: 0.1039 - val_loss: 3.2364
Epoch 7/100
10/10 ...

In [49]:
# save the model to file
model.save('model.h5')
# save the mapping; the with-block makes sure the file is flushed and closed
with open('mapping.pkl', 'wb') as f:
    dump(mapping, f)
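The saved mapping goes from character to integer, while generation needs the opposite direction. A small sketch of a pickle round trip followed by a reverse lookup table is shown below. The toy mapping, the temporary path, and the name `index_to_char` are assumptions for illustration.

```python
import os
import pickle
import tempfile

mapping = {' ': 0, 'a': 1, 'b': 2}  # toy character -> integer mapping

# write the mapping to a temporary file and read it back
path = os.path.join(tempfile.mkdtemp(), 'mapping.pkl')
with open(path, 'wb') as f:
    pickle.dump(mapping, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# build the reverse lookup once instead of scanning the dict per prediction
index_to_char = {i: c for c, i in restored.items()}
print(index_to_char[2])
```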



# Generating Text

To start the generation process, provide a manually chosen seed of 10 characters as input to the model.
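Since the seed grows by one character per step, the input has to be cut back to the model's window length each time. `pad_sequences` with `truncating='pre'` keeps the last `seq_length` items and, by default, left-pads shorter inputs with 0. The sketch below mimics that on a plain list; `fit_window` is a hypothetical helper, not a Keras function.

```python
def fit_window(encoded, seq_length, pad_value=0):
    """Keep the last `seq_length` items, left-padding with `pad_value` if too short."""
    tail = encoded[-seq_length:]
    return [pad_value] * (seq_length - len(tail)) + tail

print(fit_window([1, 2, 3, 4, 5], 3))  # longer than the window: oldest items dropped
print(fit_window([1, 2], 4))           # shorter than the window: zeros added on the left
```

Note that a padding value of 0 is also a valid character index (whichever character sorts first, a space for the rhyme), so a seed shorter than the window is effectively prefixed with that character.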

In [45]:
from pickle import load
import numpy as np
from keras.models import load_model
from tensorflow.keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
import os # Import the os module
import h5py

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character: index of the most probable class as a plain int
        yhat = int(np.argmax(model.predict(encoded), axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text

# Get the absolute path of the current directory.
current_dir = os.getcwd()
# Print the current directory to check where the notebook is looking for files.
print(f"Current directory: {current_dir}")
# Define the expected path to the model file. You might need to modify this.
model_path = os.path.join(current_dir, 'model.h5')
# Load the model using the defined path
model = load_model(model_path)
# load the mapping
with open('mapping.pkl', 'rb') as f:
    mapping = load(f)



Current directory: /content


Running the example generates three sequences of text.

The first is a test to see how the model does at starting from the beginning of the rhyme. The second is a test to see how well it does at beginning in the middle of a line. The final example is a test to see how well it does with a sequence of characters never seen before.
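One practical caveat for the third test: `generate_seq` looks up every seed character in `mapping`, so a seed containing a character that never occurs in the training text raises a `KeyError`. The seed 'hello worl' works only because each of its characters appears somewhere in the rhyme. A minimal pre-check might look like this, where `unknown_chars` and the toy mapping are hypothetical:

```python
def unknown_chars(seed_text, mapping):
    """Return the characters of seed_text that have no integer code, in order."""
    missing = []
    for ch in seed_text:
        if ch not in mapping and ch not in missing:
            missing.append(ch)
    return missing

# toy mapping built from a fragment of the rhyme
toy_mapping = {c: i for i, c in enumerate(sorted(set("Sing a song of sixpence")))}
print(unknown_chars("Sing a son", toy_mapping))
print(unknown_chars("Hello!", toy_mapping))
```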

In [50]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 156ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2

Alternative Solutions: 
- Padding. Update the example to provide sequences line by line only and use padding to fill out each sequence to the maximum line length.
- Sequence Length. Experiment with different sequence lengths and see how they impact the behavior of the model.
- Tune Model. Experiment with different model configurations, such as the number of memory cells and epochs, and try to develop a better model for fewer resources.
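The padding idea in the first bullet could start from something like the sketch below, which pads each line on the left to the length of the longest line. This is only one possible reading of that bullet; `pad_lines`, the choice of a space as the padding character, and padding on the left rather than the right are all assumptions.

```python
def pad_lines(lines, pad_char=' '):
    """Left-pad every line to the length of the longest line."""
    width = max(len(line) for line in lines)
    return [line.rjust(width, pad_char) for line in lines]

rhyme_lines = ["Baked in a pie.", "A pocket full of rye."]
padded = pad_lines(rhyme_lines)
for line in padded:
    print(repr(line), len(line))
```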


# Goals

1. Optimize the cells above to tune the model so that it generates text that closely resembles the original line from the rhyme, or at least generates sensible words.
2. Write a function to split the text corpus file into training and validation and pipe the validation data into the model.fit() function to be able to track validation error per epoch. Look up the Keras documentation to see how this is handled.
3. Write a summary (methods and results) in the cells below of the different things you applied. You must include your intuitions behind what did work and what did not work well.
4. Try a different source text. Train a word-level model. We'll leave it up to your creativity to explore and write a summary of your methods and results.
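For the word-level model in goal 4, the same sliding-window idea applies with words in place of characters. The following is a minimal sketch assuming whitespace tokenization, which leaves punctuation attached to words; the helper name `word_sequences` and the window of 3 words are illustrative choices.

```python
def word_sequences(text, length):
    """Return every run of `length` input words plus one target word."""
    words = text.split()
    return [words[i - length:i + 1] for i in range(length, len(words))]

sample = "Sing a song of sixpence, A pocket full of rye."
seqs = word_sequences(sample, 3)

# integer encode words the same way the notebook encodes characters
vocab = sorted(set(sample.split()))
word_to_int = {w: i for i, w in enumerate(vocab)}
encoded = [[word_to_int[w] for w in seq] for seq in seqs]

print(seqs[0])
print(len(seqs), "sequences,", len(vocab), "distinct words")
```

With a word-level vocabulary, one-hot inputs become much wider than in the character model, so a larger corpus is usually needed before the model sees each word often enough.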


# Notes and Results
During tuning, the best results came from a single LSTM layer with 95 units combined with the Adam optimizer at a learning rate of .001. Adding more layers (LSTM layers and dense layers) increased the loss, which indicated overfitting or poor generalization. I also tried dropout regularization, but ended up commenting it out because it worsened performance by a large margin. Keeping the model structure simple gave the best balance between learning and complexity.

However, despite the high accuracy and low loss on the training set, validation performance was very poor. This large discrepancy suggests that the model overfits and does not adapt to new data.

When I trained the model without validation data, the generated text for known sequences was almost spot on. It completed the first two seed phrases from the rhyme correctly and followed them with sensible words. The "hello worl" seed still produced unusual output, although it did maintain some character-level structure. After the validation set was added, the model's performance deteriorated significantly and it output nonsensical text. The closest I could get was "Sing a son  ooese oeeen we  ak", "king was in tne garll  idngdrF", and "hello worlssFn ettt  alllrrrrr". While this is not horrible, the model still overfits and fails to generalize.


In [17]:
#text cleaning
import numpy as np
def clean_text(text):
  text = text.replace('’', "'")
  return text





new_text = """
My Financial Career
by Stephen Leacock
My Financial Career
US currency, 1914

When I go into a bank I get rattled. The clerks rattle me; the wickets rattle me; the sight of the money rattles me; everything rattles me.

The moment I cross the threshold of a bank and attempt to transact business there, I become an irresponsible idiot.

I knew this beforehand, but my salary had been raised to fifty dollars a month and I felt that the bank was the only place for it.

So I shambled in and looked timidly round at the clerks. I had an idea that a person about to open an account must needs consult the manager.

I went up to a wicket marked "Accountant." The accountant was a tall, cool devil. The very sight of him rattled me. My voice was sepulchral.

"Can I see the manager?" I said, and added solemnly, "alone." I don't know why I said "alone."

"Certainly," said the accountant, and fetched him.

The manager was a grave, calm man. I held my fifty-six dollars clutched in a crumpled ball in my pocket.

"Are you the manager?" I said. God knows I didn't doubt it.

"Yes," he said.

"Can I see you," I asked, "alone?" I didn't want to say "alone" again, but without it the thing seemed self-evident.

The manager looked at me in some alarm. He felt that I had an awful secret to reveal.

"Come in here," he said, and led the way to a private room. He turned the key in the lock.

"We are safe from interruption here," he said; "sit down."

We both sat down and looked at each other. I found no voice to speak.

"You are one of Pinkerton's men, I presume," he said.

He had gathered from my mysterious manner that I was a detective. I knew what he was thinking, and it made me worse.

"No, not from Pinkerton's," I said, seeming to imply that I came from a rival agency. "To tell the truth," I went on, as if I had been prompted to lie about it, "I am not a detective at all. I have come to open an account. I intend to keep all my money in this bank."

The manager looked relieved but still serious; he concluded now that I was a son of Baron Rothschild or a young Gould.

"A large account, I suppose," he said.

"Fairly large," I whispered. "I propose to deposit fifty-six dollars now and fifty dollars a month regularly."

The manager got up and opened the door. He called to the accountant.

"Mr. Montgomery," he said unkindly loud, "this gentleman is opening an account, he will deposit fifty-six dollars. Good morning."

I rose.

A big iron door stood open at the side of the room.

"Good morning," I said, and stepped into the safe.

"Come out," said the manager coldly, and showed me the other way.

I went up to the accountant's wicket and poked the ball of money at him with a quick convulsive movement as if I were doing a conjuring trick.

My face was ghastly pale.

"Here," I said, "deposit it." The tone of the words seemed to mean, "Let us do this painful thing while the fit is on us."

He took the money and gave it to another clerk.

He made me write the sum on a slip and sign my name in a book. I no longer knew what I was doing. The bank swam before my eyes.

"Is it deposited?" I asked in a hollow, vibrating voice.

"It is," said the accountant.

"Then I want to draw a cheque."

My idea was to draw out six dollars of it for present use. Someone gave me a chequebook through a wicket and someone else began telling me how to write it out. The people in the bank had the impression that I was an invalid millionaire. I wrote something on the cheque and thrust it in at the clerk. He looked at it.

"What! are you drawing it all out again?" he asked in surprise. Then I realized that I had written fifty-six instead of six. I was too far gone to reason now. I had a feeling that it was impossible to explain the thing. All the clerks had stopped writing to look at me.

Reckless with misery, I made a plunge.

"Yes, the whole thing."

"You withdraw your money from the bank?"

"Every cent of it."

"Are you not going to deposit any more?" said the clerk, astonished.

"Never."

An idiot hope struck me that they might think something had insulted me while I was writing the cheque and that I had changed my mind. I made a wretched attempt to look like a man with a fearfully quick temper.

The clerk prepared to pay the money.

"How will you have it?" he said.

"What?"

"How will you have it?"

"Oh"—I caught his meaning and answered without even trying to think—"in fifties."

He gave me a fifty-dollar bill.

"And the six?" he asked dryly.

"In sixes," I said.

He gave it me and I rushed out.

As the big door swung behind me I caught the echo of a roar of laughter that went up to the ceiling of the bank. Since then I bank no more. I keep my money in cash in my trousers pocket and my savings in silver dollars in a sock.

My Financial Career was featured as The Short Story of the Day on Thu, Jul 28, 2016
This story is featured in our collection of Short-Short Stories to read when you have five minutes to spare.

1 2 3 4 5 6 7 8 9 10
8.1

facebook share button twitter share button reddit share button share on pinterest pinterest



 My Financial Career to your library.
Return to the Stephen Leacock library , or . . . Read the next short story; Number Fifty-Six


© 2024 AmericanLiterature.com"""

new_text = clean_text(new_text)

# collapse all whitespace (including newlines) into single spaces
tokens = new_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of seq_len input characters plus one target character
seq_len = 10
lines = []
for i in range(seq_len, len(raw_text)):
    seq = raw_text[i-seq_len:i+1]
    lines.append(seq)

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 68
