# Story Generation
In this project, you'll generate your own story using recurrent neural networks (RNNs).  You'll be using the Stephen King Books dataset from Kaggle.  The author of this dataset scraped sixteen Stephen King books from the web and created text files from them.  You can see this data at https://www.kaggle.com/ttalbitt/stephen-king-books.
## Get the Data
I first combined all sixteen text files into one large text file, which I have called StephenKingAll.txt.
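The merge step is not shown in this notebook. A minimal sketch of how it could be done follows, assuming the sixteen downloaded files sit in one folder and end in `.txt`. The function name `combine_text_files` and the folder name in the usage comment are hypothetical, not taken from the dataset.

```python
from pathlib import Path


def combine_text_files(source_dir, output_path, pattern="*.txt"):
    """Concatenate every file matching `pattern` in `source_dir` into one file.

    Returns the number of files that were merged.
    """
    source_dir = Path(source_dir)
    output_path = Path(output_path)
    # Sort for a reproducible order, and skip the output file itself
    # in case it already exists inside source_dir.
    paths = [
        p for p in sorted(source_dir.glob(pattern))
        if p.resolve() != output_path.resolve()
    ]
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            # errors="ignore" drops bytes that are not valid UTF-8 (an assumption
            # about the files; adjust the encoding if yours differ).
            out.write(path.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n")  # keep a line break between books
    return len(paths)


# Hypothetical usage; the folder name is an assumption:
# combine_text_files("stephen-king-books", "StephenKingAll.txt")
```

Sorting the paths keeps the book order stable between runs, which matters if you later compare results across experiments.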

In [1]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os
import pickle
import time

data_dir = 'StephenKingAll.txt'
with open(data_dir, "r") as f:
  text = f.read()

## Explore the Data
I used a few print statements to explore the data.

In [2]:
view_sentence_range = (9500, 10500)
print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))

# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')

# Take a look at a 1,500-character slice from the middle of the text
print(text[9500:11000])

# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

Dataset Stats
Roughly the number of unique words: 73419
Average number of sentences in each scene: 17.0
Number of lines: 18
Length of text: 14659521 characters
per dispensers my wife fell down Her pursestrap stayed over her shoulder but her prescription bag slipped from her hand and the sinus inhaler slid halfway out The other item stayed put No one noticed her lying there by the newspaper dispensers everyone was focused on the tangled vehicles the screaming women the spreading puddle of water and antifreeze from the Public Works trucks ruptured radiator Thats gas the clerk from Fast Foto shouted to anyone who would listen Thats gas watch out she dont blow fellas I suppose one or two of the wouldbe rescuers might have jumped right over her perhaps thinking she had fainted To assume such a thing on a day when the temperature was pushing ninetyfive degrees would not have been unreasonable Roughly two dozen people from the shopping center clustered around the accident another four dozen o

## Implement Preprocessing Functions
The first step with any dataset is preprocessing.  Implement the following preprocessing functions:
- Lookup Table
- Tokenize Punctuation

### Lookup Table
To create a word embedding, you first need to transform the words to ids.  In this function, create two dictionaries:
- A dictionary that maps each word to an id, which we'll call `vocab_to_int`
- A dictionary that maps each id back to its word, which we'll call `int_to_vocab`

Return these dictionaries in the following tuple `(vocab_to_int, int_to_vocab)`

In [3]:
# `text` is a single string here, so set(text) yields unique characters, not words
words_no_duplicates = sorted(set(text))
vocab_to_int = {word: i for i, word in enumerate(words_no_duplicates)}
int_to_vocab = {i: word for i, word in enumerate(words_no_duplicates)}
print(vocab_to_int)
print(int_to_vocab)
print(words_no_duplicates)

{'\n': 0, ' ': 1, 'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7, 'G': 8, 'H': 9, 'I': 10, 'J': 11, 'K': 12, 'L': 13, 'M': 14, 'N': 15, 'O': 16, 'P': 17, 'Q': 18, 'R': 19, 'S': 20, 'T': 21, 'U': 22, 'V': 23, 'W': 24, 'X': 25, 'Y': 26, 'Z': 27, 'a': 28, 'b': 29, 'c': 30, 'd': 31, 'e': 32, 'f': 33, 'g': 34, 'h': 35, 'i': 36, 'j': 37, 'k': 38, 'l': 39, 'm': 40, 'n': 41, 'o': 42, 'p': 43, 'q': 44, 'r': 45, 's': 46, 't': 47, 'u': 48, 'v': 49, 'w': 50, 'x': 51, 'y': 52, 'z': 53}
{0: '\n', 1: ' ', 2: 'A', 3: 'B', 4: 'C', 5: 'D', 6: 'E', 7: 'F', 8: 'G', 9: 'H', 10: 'I', 11: 'J', 12: 'K', 13: 'L', 14: 'M', 15: 'N', 16: 'O', 17: 'P', 18: 'Q', 19: 'R', 20: 'S', 21: 'T', 22: 'U', 23: 'V', 24: 'W', 25: 'X', 26: 'Y', 27: 'Z', 28: 'a', 29: 'b', 30: 'c', 31: 'd', 32: 'e', 33: 'f', 34: 'g', 35: 'h', 36: 'i', 37: 'j', 38: 'k', 39: 'l', 40: 'm', 41: 'n', 42: 'o', 43: 'p', 44: 'q', 45: 'r', 46: 's', 47: 't', 48: 'u', 49: 'v', 50: 'w', 51: 'x', 52: 'y', 53: 'z'}
['\n', ' ', 'A', 'B', 'C', 'D', 'E', 'F', 'G

In [4]:
import problem_unittests as tests

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # Sort the unique words so that ids are assigned deterministically
    words_no_duplicates = sorted(set(text))
    vocab_to_int = {word: i for i, word in enumerate(words_no_duplicates)}
    int_to_vocab = {i: word for i, word in enumerate(words_no_duplicates)}
    print(words_no_duplicates)
    print(vocab_to_int)
    print(int_to_vocab)
    return vocab_to_int, int_to_vocab

tests.test_create_lookup_tables(create_lookup_tables)

['a', 'accident', 'across', 'all', 'and', 'antifreeze', 'around', 'at', 'bag', 'be', 'been', 'both', 'but', 'by', 'cluster', 'couldnt', 'dispensers', 'doing', 'down', 'dunbarry', 'easterling', 'everyone', 'fell', 'focused', 'friend', 'from', 'had', 'halfway', 'hand', 'help', 'her', 'hurt', 'in', 'inhaler', 'irene', 'item', 'jill', 'johanna', 'least', 'little', 'lot', 'lying', 'miss', 'mrs', 'my', 'near', 'newspaper', 'no', 'noticed', 'occurred', 'of', 'on', 'one', 'other', 'others', 'out', 'over', 'parking', 'past', 'prescription', 'pretty', 'public', 'puddle', 'pursestrap', 'put', 'radiator', 'radio', 'remembered', 'running', 'ruptured', 'said', 'same', 'screaming', 'shack', 'she', 'shoulder', 'sinus', 'slacks', 'slid', 'slipped', 'somebody', 'someone', 'spreading', 'stayed', 'sure', 'tangled', 'that', 'the', 'them', 'then', 'there', 'they', 'thought', 'trucks', 'vehicles', 'was', 'water', 'were', 'when', 'wife', 'windowshopping', 'women', 'works', 'wouldnt', 'yellow']
{'a': 0, 'accid
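
To see how the two dictionaries work together, here is a small round-trip sketch. It uses a compact stand-in named `build_lookup_tables` (a hypothetical name, chosen so it does not shadow `create_lookup_tables` above) and a sample phrase from the text. It also shows why the input type matters: a list of words gives word ids, while a raw string gives character ids, as in cell 3.

```python
def build_lookup_tables(tokens):
    """Return (vocab_to_int, int_to_vocab) for any iterable of tokens."""
    unique_tokens = sorted(set(tokens))
    vocab_to_int = {token: i for i, token in enumerate(unique_tokens)}
    int_to_vocab = {i: token for token, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


# Word-level: pass a list of words
sample_words = "my wife fell down by the newspaper dispensers".split()
vocab_to_int, int_to_vocab = build_lookup_tables(sample_words)

encoded = [vocab_to_int[word] for word in sample_words]
decoded = [int_to_vocab[i] for i in encoded]
print(encoded)  # [4, 7, 3, 2, 0, 6, 5, 1]
print(decoded == sample_words)  # True

# Character-level: passing a string iterates over its characters instead
char_to_int, int_to_char = build_lookup_tables("abca")
print(char_to_int)  # {'a': 0, 'b': 1, 'c': 2}
```

Because the ids come from a sorted vocabulary, the same input always produces the same mapping.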

### Tokenize Punctuation
We'll be splitting the text into a word array using spaces as delimiters.  However, punctuation marks like periods and exclamation marks make it hard for the neural network to distinguish between the word "bye" and "bye!".

Implement the function `token_lookup` to return a dict that will be used to tokenize symbols like "!" into "||Exclamation_Mark||".  Create a dictionary for the following symbols, where the symbol is the key and the token is the value:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parenthesis ( ( )
- Right Parenthesis ( ) )
- Dash ( -- )
- Return ( \n )

This dictionary will be used to tokenize the symbols and add a delimiter (space) around each one.  This separates each symbol into its own word, making it easier for the neural network to predict the next word.  Make sure you don't use a token that could be confused with a word.  Instead of using the token "dash", try using something like "||dash||".

In [5]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    return {'.': '||vPeriod||', 
            ',': '||vComma||',
            '"': '||vQuote||',
            ';': '||vSemicolon||',
            '!': '||vExclamation||',
            '?': '||vQuestion||',
            '(': '||vLeftParen||',
            ')': '||vRightParen||',
            '--': '||vDash||',
            '\n': '||vNewline||'
            }

tests.test_tokenize(token_lookup)

Tests Passed
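
The notebook does not show the dictionary being applied, so here is a rough sketch of one way to do it. The helper name `tokenize_punctuation` is hypothetical, and `token_dict` is a small subset of the dictionary returned by `token_lookup`, copied here so the example runs on its own.

```python
def tokenize_punctuation(text, token_dict):
    """Replace each symbol with its space-padded token, then split into words."""
    # Lowercase first so the tokens themselves keep their original casing.
    text = text.lower()
    for symbol, token in token_dict.items():
        text = text.replace(symbol, f" {token} ")
    return text.split()


# Subset of the token_lookup() dictionary, repeated for a self-contained example
token_dict = {
    '.': '||vPeriod||',
    ',': '||vComma||',
    '"': '||vQuote||',
    '!': '||vExclamation||',
}

words = tokenize_punctuation('Bye! "Watch out," she said.', token_dict)
print(words)
# ['bye', '||vExclamation||', '||vQuote||', 'watch', 'out', '||vComma||',
#  '||vQuote||', 'she', 'said', '||vPeriod||']
```

After this step "bye" and "bye!" share the same word id for "bye", with the exclamation mark represented by its own token.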
