# MOVIE SCRIPT GENERATOR

In this notebook, I feed word embeddings of Star Wars movie scripts to a seq2seq LSTM network. I explored scripts from several movie databases and TV series (I even tried generating a dataset for my favourite series, [Elementary](https://en.wikipedia.org/wiki/Elementary_(TV_series))), but we need data over which the network can abstract well.

The ability to abstract over data is usually determined by three factors:
1. The Network itself
2. The amount and quality of data
3. Regularizations applied

For the second factor, the [Star Wars Dataset](https://www.kaggle.com/xvivancos/star-wars-movie-scripts) is the best I have found so far.

I will be experimenting with factors 1 and 3 so that the script generated by the network makes sense. I may also train several generations of similar networks. It will be interesting to observe which concepts the network abstracts and what it abstracts over. Even with an identical architecture, the resulting networks will differ because of stochasticity in the curriculum they are taught with.

### A tiny note on Curriculum management
Curriculum management is one of my favourite topics to talk about. Curriculum is usually ignored in industrial data science applications, but the scientific community knows very well that it makes all the difference. Suppose you are designing a network to solve calculus problems: an obvious approach is to show the network basic problems before difficult ones. If the network has no knowledge of basic calculus, it won't have any concepts to apply to difficult problems and will fail terribly.
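
The easy-before-hard idea can be sketched in a few lines. This is only an illustration: the sample problems, their difficulty scores and the `curriculum_order` helper are all hypothetical, and it assumes a difficulty score is already available for each sample.

```python
# Hypothetical training samples as (problem, difficulty) pairs.
# The difficulty scores are assumed to be given.
samples = [
    ("integrate x * exp(x) dx", 3),
    ("d/dx of x**2", 1),
    ("d/dx of sin(x) * x**2", 2),
]

def curriculum_order(samples):
    """Return the samples sorted from easiest to hardest."""
    return sorted(samples, key=lambda item: item[1])

ordered = curriculum_order(samples)
for problem, difficulty in ordered:
    print(difficulty, problem)
```

The sorting itself is trivial; everything hinges on having a meaningful difficulty score in the first place.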

The point of the above paragraph is that it's extremely difficult to conduct this kind of curriculum management with text. Networks generating scripts that make no sense to humans are an indication of this. The only solution to this problem is a ton of data (way more than normally required).

Also, this is a note to myself to improve this model as and when I complete my research on curriculum management or knowledge distillation.


## Open and Prep Data

Star Wars data looks like:
`"<Dialog Number>" "<Speaker>" "<LINE>"`

Expected format: 
`"<Speaker>" "<LINE>"`

- The `clean_srt` function does this conversion for us.

In [1]:
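import re

data_dir = 'dataset/starwars'
files = ['SW_EpisodeIV.txt', 'SW_EpisodeV.txt', 'SW_EpisodeVI.txt']

def clean_srt(data_file):
  """Read a script file and strip the leading dialog number from each line."""
  with open(data_file) as f:
    data = f.read()

  # Split into lines and drop the first one (assumed to be a header row)
  data = data.split('\n')
  data = data[1:]

  # Remove the dialog number, e.g. '"12" ', from every line
  dialog_no_pattern = re.compile(r'"\d+" ')
  data = [dialog_no_pattern.sub('', line) for line in data]

  return '\n'.join(data)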
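
To see what the dialog-number pattern does, here it is applied to a single line in the raw format. The sample line is made up for illustration (in particular, the dialog number `"2"` is an assumption); the pattern is the one used in `clean_srt`.

```python
import re

# Same pattern as in clean_srt: a quoted integer followed by one space.
dialog_no_pattern = re.compile(r'"\d+" ')

# Made-up sample in the raw format: "<Dialog Number>" "<Speaker>" "<LINE>"
raw_line = '"2" "THREEPIO" "We\'re doomed!"'

cleaned_line = dialog_no_pattern.sub('', raw_line)
print(cleaned_line)  # "THREEPIO" "We're doomed!"
```

Note that the pattern is not anchored to the start of the line, so a quoted number followed by a space inside the dialogue itself would be removed as well.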


### Import and Load data

`data` is a dict of `<movie>`:`<script>`

In [2]:
import os
data = {file:clean_srt( os.path.join(data_dir, file) ) for file in files}
print('Movies included in Data:{}'.format(data.keys()))

Movies included in Data:dict_keys(['SW_EpisodeIV.txt', 'SW_EpisodeV.txt', 'SW_EpisodeVI.txt'])


### Data Exploration

We work with the Episode IV data.

In [3]:
text = data['SW_EpisodeIV.txt']
view_sentence_range = (0, 5)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(text.split()))))

sentences = text.split('\n')
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))


Dataset Stats
Roughly the number of unique words: 3368
Number of lines: 1011
Average number of words in each line: 12.277942631058359

The sentences 0 to 5:
"THREEPIO" "Did you hear that?  They've shut down the main reactor.  We'll be destroyed for sure.  This is madness!"
"THREEPIO" "We're doomed!"
"THREEPIO" "There'll be no escape for the Princess this time."
"THREEPIO" "What's that?"
"THREEPIO" "I should have known better than to trust the logic of a half-sized thermocapsulary dehousing assister..."


### Preprocessing Functions
There are two parts to this:
- Lookup table
- Tokenizing punctuation

#### Lookup Table

To create a word embedding, we first need to transform the words into ids. In this function, we create two dictionaries:

- A dictionary that maps each word to an id, which we'll call `vocab_to_int`
- A dictionary that maps each id back to its word, which we'll call `int_to_vocab`

We then return these dictionaries as the tuple `(vocab_to_int, int_to_vocab)`.

In [5]:
def create_lookup_tables(text):
    """
    Creates lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab = set(text)
    vocab_to_int = {c: i for i, c in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    
    return vocab_to_int, int_to_vocab
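
A quick round trip shows how the two tables fit together. The function body is repeated from the cell above so the snippet runs on its own; the sample sentence is made up.

```python
def create_lookup_tables(text):
    # Same logic as the notebook cell above, repeated for self-containment.
    vocab = set(text)
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab

# Made-up sample, already split into words.
words = "we're doomed we're doomed for sure".split()
vocab_to_int, int_to_vocab = create_lookup_tables(words)

# Encode the words as ids, then decode the ids back into words.
ids = [vocab_to_int[word] for word in words]
decoded = [int_to_vocab[i] for i in ids]

print(ids)
print(decoded == words)  # True
```

The concrete ids depend on the iteration order of the `set`, which can differ between Python runs. If reproducible ids matter, sorting the vocabulary before enumerating it would be one option.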

#### Tokenization

Punctuation such as periods and exclamation marks stays attached to words when the text is split on whitespace, so the network would treat "bye" and "bye!" as two different words.

- `token_lookup` returns a dict that is used to turn symbols like "!" into tokens like "||EXCLAMATION_MARK||".

In [6]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    d = {}
    d['.'] =  '||PERIOD||'
    d[','] =  '||COMMA||'
    d['"'] =  '||QUOTATION_MARK||'
    d[';'] =  '||SEMICOLON||'
    d['!'] =  '||EXCLAMATION_MARK||'
    d['?'] =  '||QUESTION_MARK||'
    d['('] =  '||LEFT_PAREN||'
    d[')'] =  '||RIGHT_PAREN||'
    d['--'] =  '||DASH||'
    d['\n'] =  '||NEW_LINE||'

    return d
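
The dict could be applied roughly as follows. This is a minimal sketch, not necessarily how the rest of the notebook does it: the `tokenize` helper is hypothetical, lowercasing the text is my assumption, and only a subset of the dict is copied here so the snippet runs on its own.

```python
# Subset of the dict returned by token_lookup(), copied for self-containment.
token_dict = {
    '.': '||PERIOD||',
    '!': '||EXCLAMATION_MARK||',
    '?': '||QUESTION_MARK||',
}

def tokenize(text, token_dict):
    """Hypothetical helper: replace symbols with padded tokens, then split."""
    text = text.lower()  # assumption: treat "Bye" and "bye" as the same word
    for symbol, token in token_dict.items():
        # Pad with spaces so the token becomes a separate word after split().
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()

tokens = tokenize("What's that? Bye!", token_dict)
print(tokens)
# ["what's", 'that', '||QUESTION_MARK||', 'bye', '||EXCLAMATION_MARK||']
```

With the symbols split off, "bye" and "bye!" now share the single word "bye", and the punctuation is kept as its own token.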

## BUILDING THE NETWORK

Now comes the interesting part.. yumm!


### Check TF

In [7]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))


TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0
