#### Overall Goals

Data Sources:
- Gilmore Girls: the scripts are typical screenplays, so it's a matter of counting the words of dialog and comparing these counts against the runtimes scraped from IMDb.
- Marvelous Mrs. Maisel: the scripts are transcripts. Dialog is marked every time a new character speaks, but is otherwise not well attributed. Names, actions/directions, and songs need to be stripped out before combining with runtimes.
- Bunheads: no scripts or transcripts yet, unfortunately.

Unfortunately, the scripts are not formatted consistently. The immediate issue is that episode titles are sometimes included, as well as credits for writers, directors, transcribers, etc. Even these aren't formatted consistently, or even always included.

However, each credit line begins with one of a few keywords, and these credits never run for more than 5 lines. By iterating backwards through the first five lines of each script, the index of the last credit line is found and then used to remove that line and everything before it, so the scripts end up scrubbed clean of titles and credits.
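The backwards scan described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the `CREDIT_KEYWORDS` list and the `strip_header` helper are hypothetical names, and the real keyword set may differ.

```python
# Hypothetical keyword list; the notebook does not say which keywords it uses.
CREDIT_KEYWORDS = ("written by", "directed by", "transcribed by")
MAX_HEADER_LINES = 5  # credits are assumed never to run past the first 5 lines

def strip_header(script_lines):
    """Return script_lines without the leading title and credit lines."""
    head = script_lines[:MAX_HEADER_LINES]
    # Walk backwards so the first match is the last credit line.
    for i in range(len(head) - 1, -1, -1):
        if head[i].strip().lower().startswith(CREDIT_KEYWORDS):
            return script_lines[i + 1:]
    # No credit line found: leave the script untouched.
    return script_lines

sample = [
    '2.01 - Sadie, Sadie',
    'written by Amy Sherman-Palladino',
    'directed by Amy Sherman-Palladino',
    '[OPEN IN STARS HOLLOW]',
    'RORY: You should get married in Italy.',
]
print(strip_header(sample)[0])  # [OPEN IN STARS HOLLOW]
```

Note that a script with a title but no credit lines would keep its title under this sketch, since only credit keywords are detected.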

In [5]:
import json
import pandas as pd
import re

In [2]:
def open_json(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

In [39]:
gilmore_scripts = open_json('./data/gilmore_raw_scripts.json')
gilmore_runtimes = open_json('./data/gilmore_runtimes.json')

In [40]:
gilmore_scripts[21]

{'title': '02x01 - Sadie, Sadie',
 'link': '/viewtopic.php?f=22&t=4979&sid=512f2c79efc9684a95bab10c5f9d495e',
 'script': ['2.01 - Sadie, Sadie',
  'written by Amy Sherman-Palladino',
  'directed by Amy Sherman-Palladino',
  '[OPEN IN STARS HOLLOW]',
  '[Camera pans around the center of town, which is covered in yellow daisies. Lorelai and Rory are crossing the street.]',
  'RORY: You should get married in Italy.',
  "LORELAI: All the way from home, same topic. There's tons of stuff going on in the world. Big stuff.",
  'RORY: Like?',
  'LORELAI: Balkans.',
  'RORY: That was ages ago. Read a paper.',
  'LORELAI: Ugh. They make my hands black.',
  'RORY: Oh! You should walk down the aisle to Frank Sinatra with a huge bouquet of something that smells really good.',
  'LORELAI: Pot Roast.',
  'RORY: And you should wear a long veil with your hair up.',
  "LORELAI: Ugh, I'll take any other subject in the world for two hundred Alex.",
  "RORY: Why don't you want to think about this?",
  "LORE

In [35]:
test_str = gilmore_scripts[21]['script'][15]
# Raw string avoids an invalid-escape warning for \w
re_dialog = re.compile(r'\w+ *:')
if re_dialog.match(test_str):
    print("#")

#


In [17]:
test_str

"RORY: Why don't you want to think about this? "

In [55]:
def searchFor(season, episode, data):
    # Return the first record matching season and episode, or None if absent
    for d in data:
        if d['episode'] == episode and d['season'] == season:
            return d
    return None

In [57]:
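gg_data = []
# Compile once: a dialog line starts with a speaker name followed by a colon
re_dialog = re.compile(r'\w+ *:')
for ep in gilmore_scripts:
    # Split out title, e.g. '02x01 - Sadie, Sadie' -> season 2, episode 1
    season, episode = map(int, ep['title'].split(" - ")[0].split('x'))
    # Count words
    n = 0
    for line in ep['script']:
        # Clean script line
        clean_line = line.strip().upper()
        # Determine if line is dialog
        if re_dialog.match(clean_line):
            # Count words after the speaker name only. Splitting on the first
            # colon keeps dialog that itself contains colons, and split()
            # with no argument avoids counting empty strings as words.
            n += len(clean_line.split(":", 1)[1].split())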
            
    # Get corresponding runtime
    rt = searchFor(season, episode, gilmore_runtimes)
    if rt:
        runtime = rt['runtime']
    else:
        runtime = None
    
    # Package everything
    gg_data.append({
        'season': season,
        'episode': episode,
        'count': n,
        'runtime': runtime
    })

In [59]:
with open('gg_data.json', 'w') as outfile:
    json.dump(gg_data, outfile)
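
With counts and runtimes packaged together, the comparison named in the goals could be sketched as a words-per-minute figure per episode. This is only an illustration: the `words_per_minute` helper is hypothetical, it assumes `runtime` is a number of minutes (the scraped format is not shown here), and the sample values below are made up, not real results.

```python
def words_per_minute(records):
    """Return copies of the episode records with a 'wpm' field added."""
    out = []
    for rec in records:
        runtime = rec.get('runtime')
        # Assumption: runtime is numeric minutes; missing or zero gives None.
        wpm = rec['count'] / runtime if runtime else None
        out.append({**rec, 'wpm': wpm})
    return out

# Made-up sample records in the same shape as gg_data
sample = [
    {'season': 2, 'episode': 1, 'count': 8800, 'runtime': 44},
    {'season': 2, 'episode': 2, 'count': 9000, 'runtime': None},
]
result = words_per_minute(sample)
for row in result:
    print(row['season'], row['episode'], row['wpm'])
```

Episodes without a matching runtime keep `wpm` as `None` so they can be filtered out later rather than silently dropped.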