# Short Story Generator using Bidirectional LSTM

Let's read the dataset

In [1]:
import pandas as pd

train_path = "dataset/train.csv" 

train = pd.read_csv(train_path)
train['text'] = train['text'].fillna('').astype(str)

In [2]:
print("Number of rows and columns in train dataset:", train.shape)
print("\nColumn names:")
print(train.columns)
print("\nData types of columns:")
print(train.dtypes)
print("\nBasic statistics of numerical columns:")
print(train.describe())

Number of rows and columns in train dataset: (2119719, 1)

Column names:
Index(['text'], dtype='object')

Data types of columns:
text    object
dtype: object

Basic statistics of numerical columns:
           text
count   2119719
unique  1799249
top            
freq        230


In [3]:
train.head()

Unnamed: 0,text
0,"One day, a little girl named Lily found a need..."
1,"Once upon a time, there was a little car named..."
2,"One day, a little fish named Fin was swimming ..."
3,"Once upon a time, in a land full of trees, the..."
4,"Once upon a time, there was a little girl name..."


# Trim Dataset

The original dataset is very large, so we trim it to make the notebook runnable on a local computer. Feel free to comment out the trimming line or change dataset_size to suit your configuration.

In [28]:
dataset_size = 100
train = train[:dataset_size]

In [29]:
train.shape

(100, 1)

Build a tokenizer that collects the set of tokens (the vocabulary) from the training dataset.

In [30]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Tokenize train data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train['text'])
total_words = len(tokenizer.word_index) + 1

In [44]:
total_words

1491
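To make the vocabulary step more concrete, here is a minimal pure-Python sketch of what fitting the tokenizer roughly amounts to: words are lowercased, punctuation is dropped, and words are ranked by frequency. The helper build_word_index and the toy texts are hypothetical and only approximate the Keras Tokenizer defaults. Index 0 is reserved for padding, which is why total_words adds 1.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Toy stand-in for Tokenizer.fit_on_texts: rank words by frequency."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep letters/apostrophes only (an approximation of the Keras filters)
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # The most frequent word gets index 1; index 0 stays free for padding
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


toy_texts = ["The bird was sad.", "The bird flew out."]  # hypothetical mini corpus
word_index = build_word_index(toy_texts)
total_words = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(total_words)
```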

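Now we need to convert our text-based inputs to numerical inputs. texts_to_sequences converts each text into a list of integer token indices. For every story we then keep all prefixes of at least two tokens as separate training sequences.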

In [31]:
# Prepare input sequences
input_sequences = []
for line in train['text']:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
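The inner loop above turns one story into many training sequences. As a small illustration with made-up token ids (the helper name prefix_sequences is hypothetical), a story of n tokens produces n-1 prefixes:

```python
def prefix_sequences(token_list):
    """Return every prefix with at least two tokens, mirroring range(1, len(token_list))."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


toy_tokens = [4, 7, 2, 9]  # hypothetical token ids for a four-word story
for seq in prefix_sequences(toy_tokens):
    print(seq)
```

Each prefix later becomes one training example, which is why 100 stories expand into thousands of input sequences.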

In [50]:
print("Example input:")
print(line)
print("<------------------------------------->\n\n")
print("Converted vector:")
print(tokenizer.texts_to_sequences([line])[0])

Example input:
John and Sarah were playing together in their backyard when they found a piece of metal. It was shiny and reflective and they couldn't wait to show their parents. 

John asked Sarah, "What should we do with the metal?"

Sarah thought for a moment, then said, "Let's take it to Mommy and Daddy!" With that, they ran off excitedly, ready to surprise their parents. 

They raced into the house, and shouted, "Mommy, Daddy! Look what we found!" 

Their parents were very surprised and asked, "Where did you find this piece of metal?" 

John and Sarah were so proud of their discovery, and couldn't wait to tell the story. They recounted that they found the metal outside in the backyard and it was so shiny and reflective. 

Their parents smiled, and said, "Well, why don't you two take it around the neighbourhood and see if you can return it to its rightful owner. If nobody takes it, you two can keep it!". 

John and Sarah were so cheerful and excited about the prospect of helping fin

In [32]:
len(input_sequences)

14700

The input sequences have different lengths. We convert them to the same length with the pad_sequences function, using the length of the longest sequence (the token count of the longest story) as the padding size.

In [33]:
# Pad sequences
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
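For readers unfamiliar with pre-padding, the following pure-Python sketch imitates what pad_sequences does with padding='pre' on toy data. The function pad_pre is a hypothetical stand-in, not the Keras implementation; it also drops tokens from the front when a sequence is longer than maxlen, which matches the Keras default for truncation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so that all have length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy_sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]  # hypothetical prefixes
max_len = max(len(seq) for seq in toy_sequences)
for row in pad_pre(toy_sequences, max_len):
    print(row)
```

Padding on the left keeps the most recent words next to the label position at the end of each row.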


There are many ways to define predictors and labels. Here each padded sequence is a prefix of a story: its first i words serve as the input and word i+1 serves as the label. In practice this means we take every token except the last as the predictor and the last token as the label.

In [34]:
import numpy as np
# Create predictors and label
predictors, label = input_sequences[:, :-1],input_sequences[:, -1]
label = np.array(label)

In [35]:
predictors.shape

(14700, 211)
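The slicing can be checked on a tiny example. This sketch uses plain Python lists with made-up token ids instead of the NumPy array used in the notebook: every row loses its last column, which becomes the label. The same logic explains the shape above, since sequences of length 212 leave 211 predictor columns.

```python
# Hypothetical padded sequences (0 is the padding index)
padded = [
    [0, 0, 4, 7],
    [0, 4, 7, 2],
    [4, 7, 2, 9],
]

predictors = [row[:-1] for row in padded]  # everything except the last token
labels = [row[-1] for row in padded]       # the last token is the word to predict

for x, y in zip(predictors, labels):
    print(x, "->", y)
```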

In [36]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

In [37]:
# Build the model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [38]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 211, 100)          149100    
                                                                 
 bidirectional_2 (Bidirecti  (None, 300)               301200    
 onal)                                                           
                                                                 
 dense_2 (Dense)             (None, 1491)              448791    
                                                                 
Total params: 899091 (3.43 MB)
Trainable params: 899091 (3.43 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
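The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias, and doubles it for the two directions; the variable names are chosen here for illustration.

```python
vocab_size = 1491      # total_words
embedding_dim = 100
lstm_units = 150

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias) per unit
lstm_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bidirectional_params = 2 * lstm_one_direction

# Dense: the concatenated forward/backward state (2 * lstm_units) to vocab_size, plus biases
dense_params = (2 * lstm_units) * vocab_size + vocab_size

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

Note that the output layer is the largest block, so the model size grows quickly with the vocabulary when a larger dataset is used.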


In [39]:
# Train the model
model.fit(predictors, label, epochs=100, verbose=1)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x239354f8880>

In [42]:
def generate_text(seed_text, next_words, model, max_sequence_len, tokenizer):
    """Greedily append next_words predicted words to seed_text."""
    for _ in range(next_words):
        # Encode the text so far and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        # Greedy decoding: pick the most probable next token
        predicted = int(np.argmax(predicted_probs, axis=-1)[0])
        # index_word is the reverse of word_index; index 0 (padding) has no word
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text


In [43]:
# Keywords for text generation
keywords = ["girl", "dog"]
seed_text = ' '.join(keywords)  # Seed text with keywords

# Generate text based on keywords
generated_text = generate_text(seed_text, 100, model, max_sequence_len, tokenizer)
print(generated_text)

girl dog a girl named mia went for a walk she saw a big scary house it had a tall door and small windows mia was brave so she went inside the house in the house mia saw a birdcage inside the birdcage there was a little bird the bird was sad it wanted to fly and be free mia wanted to help the bird mia opened the birdcage door the bird flew out and was happy it was not scary anymore mia and the bird were friends they played and had fun all day and they decided to play hide and


# See impact of train data

Let's find the training stories that include "Mia".

In [65]:
similar_texts = []
target_words = ["Mia", "mia"]
for line in train['text']:
    for w in target_words:
        if w in line:
            similar_texts.append(line)
            break

In [66]:
len(similar_texts)

2

In [67]:
for i, txt in enumerate(similar_texts):
    print(i,"):\n", txt)

0 ):
 One day, a girl named Mia went for a walk. She saw a big, scary house. It had a tall door and small windows. Mia was brave, so she went inside the house.

In the house, Mia saw a birdcage. Inside the birdcage, there was a little bird. The bird was sad. It wanted to fly and be free. Mia wanted to help the bird.

Mia opened the birdcage door. The bird flew out and was happy. It was not scary anymore. Mia and the bird were friends. They played and had fun all day.
1 ):
 Once there was a little girl called Mia who loved to jump. Everywhere she went, she jumped. When walking to school, she would jump on the sidewalk. At the park, she would jump into the sandbox.

One day Mia was at the supermarket and she saw something unusual. She saw a lawyer. Mia had never seen a lawyer before so it made her very curious. She wanted to know what a lawyer did and why he was so dressed up. So, Mia jumped right up to the lawyer and asked him.

The lawyer was very confused. He had never seen a little g

Because the training set is so small (100 stories), the generated story is almost a word-for-word copy of the first Mia story above: the model has memorized its training samples rather than learned to write new text. This can be mitigated by using a larger training dataset. If system memory allows, rerun the notebook with a larger dataset_size to increase the model's diversity and reduce overfitting. :)
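One rough way to quantify how much of the generated text is copied is to measure word n-gram overlap with a training story. The sketch below is an illustrative metric chosen here, not part of the original notebook; the helper names are hypothetical, and the two strings are short excerpts of the generated text and the first Mia story shown above.

```python
import re


def word_ngrams(text, n):
    """Set of lowercase word n-grams, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(generated, reference, n=3):
    """Fraction of the generated text's n-grams that also occur in the reference."""
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & word_ngrams(reference, n)) / len(gen)


generated = "a girl named mia went for a walk she saw a big scary house"
training_story = "One day, a girl named Mia went for a walk. She saw a big, scary house."

print(ngram_overlap(generated, training_story))
```

A value close to 1 indicates memorization, so tracking this number while increasing dataset_size would show whether the model starts producing text of its own.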
