<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Parse-the-data" data-toc-modified-id="Parse-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parse the data</a></span></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Exploration</a></span></li><li><span><a href="#Pre-processing-Functions" data-toc-modified-id="Pre-processing-Functions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pre-processing Functions</a></span><ul class="toc-item"><li><span><a href="#Look-up-Table" data-toc-modified-id="Look-up-Table-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Look-up Table</a></span></li><li><span><a href="#Tokenize-Punctuation" data-toc-modified-id="Tokenize-Punctuation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Tokenize Punctuation</a></span></li><li><span><a href="#Apply-the-pre-processing-functions" data-toc-modified-id="Apply-the-pre-processing-functions-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Apply the pre-processing functions</a></span></li></ul></li><li><span><a href="#Neural-Network" data-toc-modified-id="Neural-Network-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Neural Network</a></span><ul class="toc-item"><li><span><a href="#Input:-Batching" data-toc-modified-id="Input:-Batching-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Input: Batching</a></span></li></ul></li></ul></div>

# Libraries

In [1]:
import pandas as pd
import numpy as np
from collections import Counter

import torch
from torch.utils.data import TensorDataset, DataLoader

import pickle

# Data

In [2]:
data_dir = './data/seinfeld-chronicles/scripts.csv'

In [3]:
scripts_df = pd.read_csv(data_dir)
scripts_df.head()

Unnamed: 0.1,Unnamed: 0,Character,Dialogue,EpisodeNo,SEID,Season
0,0,JERRY,Do you know what this is all about? Do you kno...,1.0,S01E01,1.0
1,1,JERRY,"(pointing at Georges shirt) See, to me, that b...",1.0,S01E01,1.0
2,2,GEORGE,Are you through?,1.0,S01E01,1.0
3,3,JERRY,"You do of course try on, when you buy?",1.0,S01E01,1.0
4,4,GEORGE,"Yes, it was purple, I liked it, I dont actuall...",1.0,S01E01,1.0


# Parse the data
Desired format:
<br>
One long string in which each entry is the character name, a colon, and the lower-cased dialogue, followed by a blank line.
<br>
Example:
<br>
'jerry: do you know what this is all about? do you know, why were here?'

In [4]:
# Function to format one row as 'character: dialogue' followed by a blank line
def scriptParser(character, dialogue):
    # missing values arrive as NaN (a float, not a str),
    # so anything that is not a string becomes an empty string
    if isinstance(dialogue, str):
        dialogue = dialogue.lower()
    else:
        dialogue = ''
    if isinstance(character, str):
        character = character.lower()
    else:
        character = ''

    return character + ': ' + dialogue + '\n\n'

In [5]:
# test the function on one row
row = 0
scriptParser(scripts_df['Character'].iloc[row], scripts_df['Dialogue'].iloc[row])

'jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people tryin to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, whatta you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go.\n\n'

In [6]:
# test the function while joining on 5 rows
''.join([scriptParser(row['Character'], row['Dialogue']) 
         for index, row in scripts_df[:5].iterrows()])

'jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people tryin to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, whatta you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go.\n\njerry: (pointi

In [7]:
# apply the function
text = ''.join([scriptParser(row['Character'], row['Dialogue']) 
         for index, row in scripts_df.iterrows()])

# Data Exploration

In [8]:
view_line_range = (0, 10)

print('Dataset Stats')
print('~ Number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
~ Number of unique words: 46380
Number of lines: 109233
Average number of words in each line: 5.544121282030155

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people tryin to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, whatta you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go 
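A note on reading these numbers: `scriptParser` ends every entry with `'\n\n'`, so splitting on `'\n'` yields an empty string after each entry, and those empty strings count as lines with zero words. The sketch below uses a made-up two-entry string (`toy_text` and the other `toy_*` / `*_lines` names are illustrative, not part of the notebook) to show how the average changes when empty lines are excluded.

```python
# Toy text in the same 'name: dialogue\n\n' format produced by scriptParser
toy_text = 'a: hi there\n\nb: yo\n\n'

# Splitting on a single newline keeps the empty strings between entries
all_lines = toy_text.split('\n')
non_empty_lines = [line for line in all_lines if line.strip()]

# Average words per line, with and without the empty lines
avg_all = sum(len(line.split()) for line in all_lines) / len(all_lines)
avg_non_empty = sum(len(line.split()) for line in non_empty_lines) / len(non_empty_lines)

print(len(all_lines), avg_all)
print(len(non_empty_lines), avg_non_empty)
```

Filtering the empty lines in the same way on the full `text` would give a line count and average that describe the dialogue entries only.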

# Pre-processing Functions

## Look-up Table
Two dictionaries:
<br>
1. word to index: word2idx
2. index to word: idx2word

More common words should have a lower index.

In [9]:
def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    
    text: The text of tv scripts split into words
    returns: A tuple of dicts (idx2word, word2idx)
    """
    # first get the word_counts
    word_counts = Counter(text)
    
    # sort from most to least frequent
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    
    # define the dictionaries
    idx2word = {ii: word for ii, word in enumerate(sorted_vocab)}
    word2idx = {word: ii for ii, word in idx2word.items()}
    
    # return tuple
    return (idx2word, word2idx)
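As a quick sanity check, the same frequency-sorted construction can be run on a tiny word list. This is a minimal sketch that repeats the logic inline so it runs on its own; `toy_words` and the other `toy_*` names are made up for illustration.

```python
from collections import Counter

# A tiny word list in which 'the' is the most frequent word
toy_words = 'the cat sat on the mat the end'.split()

# Count words and sort from most to least frequent
toy_counts = Counter(toy_words)
toy_sorted_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)

# Build both directions of the lookup
toy_idx2word = dict(enumerate(toy_sorted_vocab))
toy_word2idx = {word: idx for idx, word in toy_idx2word.items()}

# Encode to integers and decode back to words
toy_encoded = [toy_word2idx[word] for word in toy_words]
toy_decoded = [toy_idx2word[idx] for idx in toy_encoded]

print(toy_word2idx['the'])
print(toy_encoded)
```

The most frequent word gets index 0, and decoding the encoded list recovers the original words.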


## Tokenize Punctuation

In [10]:
# create a dict to turn punctuation into a token.
# '--' must come before '-' so that double hyphens are replaced first
punct2token = {'.': '<PERIOD>',
               ',': '<COMMA>',
               '"': '<QUOTATION_MARK>',
               ';': '<SEMICOLON>',
               '!': '<EXCLAMATION_MARK>',
               '?': '<QUESTION_MARK>',
               '(': '<LEFT_PAREN>',
               ')': '<RIGHT_PAREN>',
               '--': '<HYPHENS>',
               '-': '<DASH>',
               '\n': '<NEW_LINE>',
               ':': '<COLON>'}


In [11]:
punct2token[',']

'<COMMA>'

## Apply the pre-processing functions

In [12]:
# Tokenize the punctuation
for punct, token in punct2token.items():
    text = text.replace(punct, ' {} '.format(token))
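To see what this replacement loop does, here is a small sketch on a made-up sentence with a reduced dictionary. The names `toy_punct2token`, `toy_text` and `toy_tokens` are illustrative only. It relies on dictionaries preserving insertion order (Python 3.7+), which is why `'--'` is listed before `'-'`.

```python
# Reduced punctuation dictionary; '--' is listed before '-' on purpose
toy_punct2token = {'.': '<PERIOD>',
                   ',': '<COMMA>',
                   '?': '<QUESTION_MARK>',
                   '--': '<HYPHENS>',
                   '-': '<DASH>'}

toy_text = 'are you through? yes -- well, half-way.'

# Surround every token with spaces so that split() separates it from words
for punct, token in toy_punct2token.items():
    toy_text = toy_text.replace(punct, ' {} '.format(token))

toy_tokens = toy_text.split()
print(toy_tokens)
```

If `'-'` were replaced before `'--'`, the double hyphen would turn into two `<DASH>` tokens instead of one `<HYPHENS>` token.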

In [13]:
# make sure all text is lower case, then split on whitespace
# (note: this also lower-cases the punctuation tokens, e.g. '<period>')
text = text.lower()
text = text.split()

In [14]:
# create the vocab dictionaries
idx2word, word2idx = create_lookup_tables(text + ['<PAD>'])

In [15]:
# apply dictionaries to text
int_text = [word2idx[word] for word in text]

In [142]:
int_text

[8, 2, 35, 5, 27, 19, 24, 23, 49, 58, 4, 35, 5, 27, 3, 80, 119, 61, 4, 9,
 54, 47, 3, 24, 23, 47, 1, 1, 1, 18, 47, 23, 79, 21, 7, 1279, 547, 8340, 6825, 21,
 240, 1, 147, 1, 1, 1, 81, 5, 198, 235, 147, 206, 58, 55, 133, 63, 47, 4, 24, 23,
 19, 692, 206, 58, 1, 1, 1, 24, 218, 125, 3, 119, 49, 47, 83, 3, 26, 79, 23, 288,
 1, 45, 79, 375, 61, 23, 288, 3, 119, 49, 47, 12, 75, 48, 147, 6826, 9, 247, 191, 3,
 64, 202, 27, 127, 55, 48, 1, 13, 29, 95, 2681, 171, 14, 81, 5, 1403, 4, 3, 6, 388,
 247, 68, 1, 127, 81, 36, 63, 4, 36, 530, 94, 25, 127, 36, 42, 78, 1, 36, 337, 40,
 577, 47, 1, 5, 213, 63, 47, 5, 51, 426, 3, 5, 323, 47, 7, 561, 3, 57, 4, 5,
 100, 7, 732, 3, 5, 51, 49, 426, 3, 51, 7, 1253, 3, 51, 52, 353, 3, 7, 158, 3,
 7, 834, 3, 7, 2954, 1, 1, 1, 114, 313, 674, 173, 3, 4640, 5, 35, 4, 5, 63, 55,
 ...

In [16]:
# save to a pickle file (the with-block closes the file after writing)
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, idx2word, word2idx, punct2token), f)
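Reading the file back is the mirror image of saving it. The sketch below round-trips a small stand-in tuple through a temporary file so that it runs on its own; the `toy_*` values, the file name `preprocess_demo.p` and the `*_loaded` names are assumptions for illustration, not the notebook's data.

```python
import os
import pickle
import tempfile

# Stand-in for (int_text, idx2word, word2idx, punct2token)
toy_preprocessed = ([0, 1, 0],
                    {0: 'a', 1: 'b'},
                    {'a': 0, 'b': 1},
                    {'.': '<PERIOD>'})

# Write to a temporary location so the real preprocess.p is not touched
demo_path = os.path.join(tempfile.mkdtemp(), 'preprocess_demo.p')
with open(demo_path, 'wb') as f:
    pickle.dump(toy_preprocessed, f)

# Load and unpack in the same order the tuple was saved
with open(demo_path, 'rb') as f:
    int_text_loaded, idx2word_loaded, word2idx_loaded, punct2token_loaded = pickle.load(f)

print(int_text_loaded)
```

The unpacking order has to match the order used in `pickle.dump`, since the tuple carries no field names.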

# Neural Network

In [17]:
# Check GPU Access
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.


## Input: Batching
We want to batch the data based on a provided sequence length.

Example input:
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```

`features` Tensor:
```
[1, 2, 3, 4]
```
`target` Tensor (the next "word"):
```
5
```

Next `features` and `target` Tensors:
```
[2, 3, 4, 5]  # features
6             # target
```

In [167]:
def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    param words: The word ids of the TV scripts
    param sequence_length: The sequence length of each batch
    param batch_size: The size of each batch; the number of sequences in a batch
    return: DataLoader with batched data
    """
    
    # iterate over all words, keeping only windows that still have a next word
    features = []
    targets = []
    for i, w in enumerate(words):
        if((i+sequence_length) < len(words)):
            features.append(words[i:(i+sequence_length)])
            targets.append(words[i+sequence_length])
    
    features = np.array(features)
    targets = np.array(targets)
    
    # convert numpy arrays into a torch tensor dataset
    sequenced_data = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))
    
    # create the final batch DataLoader
    batch_dL = DataLoader(sequenced_data, shuffle=True, batch_size=batch_size) 

    # return a dataloader
    return batch_dL
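The windowing step can be checked in isolation on the toy input from the description above. This is a plain-Python sketch of that step only, without the `TensorDataset` and `DataLoader` wrapping; `sliding_windows` is a hypothetical helper name, not a function from the notebook.

```python
def sliding_windows(words, sequence_length):
    """Return (features, targets) for every window that has a next word."""
    features, targets = [], []
    # the last valid start index is len(words) - sequence_length - 1
    for i in range(len(words) - sequence_length):
        features.append(words[i:i + sequence_length])
        targets.append(words[i + sequence_length])
    return features, targets


toy_features, toy_targets = sliding_windows([1, 2, 3, 4, 5, 6, 7], 4)
print(toy_features)
print(toy_targets)
```

Seven words with a sequence length of 4 give three feature/target pairs, because every window needs one following word to serve as its target.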



In [168]:
# Test the Data Loader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[42, 43, 44, 45, 46],
        [ 8,  9, 10, 11, 12],
        [10, 11, 12, 13, 14],
        [28, 29, 30, 31, 32],
        [41, 42, 43, 44, 45],
        [24, 25, 26, 27, 28],
        [ 9, 10, 11, 12, 13],
        [44, 45, 46, 47, 48],
        [35, 36, 37, 38, 39],
        [20, 21, 22, 23, 24]], dtype=torch.int32)

torch.Size([10])
tensor([47, 13, 15, 33, 46, 29, 14, 49, 40, 25], dtype=torch.int32)
