# 0. Exploration

In this notebook we walk through how the script text files are processed.

---

## Imports

In [1]:
# libraries
import pandas as pd
import regex as re

from collections import Counter

In [2]:
# files
with open('scripts/404.txt') as f:
    ep = f.read()
    
ep_lines = ep.split('\n')

### TV scripts

We're working with a collection of .txt script files from a fansite. I suspect these were scanned and OCR-processed from scriptbook compendiums, but I'm not certain. Later we'll find some typos and formatting errors. Here's what the first thousand characters of the episode "Past Prologue" look like:

In [3]:
print(ep[:1000])


                  STAR TREK: DEEP SPACE NINE 
                              
                        "Past Prologue" 
                          #40511-404 
                              
                           Story by 
                        Kathryn Powers 
                              
                          Teleplay by 
                      Peter Allan Fields 
                              
                          Directed by 
                          Rick Kolbe 

THE WRITING CREDITS MAY NOT BE FINAL AND SHOULD NOT BE USED 
FOR PUBLICITY OR ADVERTISING PURPOSES WITHOUT FIRST CHECKING 
WITH THE TELEVISION LEGAL DEPARTMENT.

Copyright 1992 Paramount Pictures Corporation. All Rights 
Reserved. This script is not for publication or 
reproduction. No one is authorized to dispose of same. If 
lost or destroyed, please notify the Script Department.

Return to Script Department             FINAL DRAFT
PARAMOUNT PICTURES CORPORATION
                        OCTOBER 5, 1992

    

Every TV script file is formatted this way: the .txt file begins with an informational header. We're going to be mining character association rules, and much of the information in the header will be irrelevant. Let's proceed:

In [4]:
print(ep[1000:2500])

   STAR TREK: DS9   "Past Prologue"	10/05/92 - CAST

                  STAR TREK: DEEP SPACE NINE 
                        "Past Prologue" 
                            CAST

                BENJAMIN SISKO     GARAK
                MILES O'BRIEN      TAHNA
                KIRA               GUL DANAR,
                ODO                ADMIRAL
                BASHIR             B'ETOR
                DAX                LURSA
                                   BAJORAN DEPUTY
                                   GUL DUKAT
                                   NORIC
                                   RAKA
                Non-speaking       
                BAJORAN N.D. MEDICAL ASSISTANTS
                N.D. SUPERNUMARIES 

      STAR TREK: DS9 - "Past Prologue" - 10/05/92 - SETS 

                  STAR TREK: DEEP SPACE NINE 
                        "Past Prologue" 
                             SETS 

        INTERIORS                     EXTERIORS
        DEEP SPACE NINE                 DEEP 

After the informational header, every script contains yet more information: the full cast, and the sets necessary for the episode.

The full cast list could be useful if we were looking for episode-based association rules, but, per previous work, it's likely that episode-based association rules will be too generic.

Below we see what the script files look like once characters are speaking:

In [5]:
print(ep[2900:4000])

      
                              

            DEEP SPACE: "Past Prologue" 10/05/92 - TEASER            1.
                  STAR TREK: DEEP SPACE NINE                    

                           "Past Prologue"                             
                            TEASER                              

	FADE IN:

1    EXT. SPACE - DS9 (OPTICAL)

	Establishing.

2    INT. PROMENADE REPLIMAT

	DOCTOR JULIAN BASHIR sits enjoying a tea-like beverage, 
	reading a medical journal PADD... as the large, ever-pleasant 
	Cardassian, GARAK, interposes himself between Bashir and the 
	latter's view, with:

					GARAK
			It's Doctor Bashir, isn't it?  Of 
			course it is.  May I introduce myself?

	Bashir looks up and reacts... and if this were a poker game... 
	and in a way, it is... Bashir would be at a severe 
	disadvantage.  His heart's just started thumping - he's been 
	alerted about this man, never thought he'd come face to face 
	with him like this...

					BASHIR
			Uh... yes, y

Note:
- Each episode has a Teaser / Act One / Act Two / Act Three / Act Four / Act Five structure. These are marked in the script, though there are some typos and formatting errors!
- Every new physical page is indicated with a header that includes episode title, act, and page number.
- Speakers are given by all-caps character names indented five tabs.
- Action notes such as `(sits)` are indented four tabs, and there are larger blocks of exposition as well.

Some of this information is useless and needs to be stripped out; some of the structure will be useful to us.
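The indentation conventions in the notes above can be turned into a rough line classifier. This is only a sketch: `classify_line` is a hypothetical helper (it is not in `preprocessing.py`), and the tab counts are read off the "Past Prologue" excerpt, so other episodes may not follow them exactly.

```python
def classify_line(line):
    """Guess a line's role from its count of leading tabs.

    Assumed mapping, based on the excerpt above:
    5 tabs = speaker, 4 = parenthetical, 3 = dialogue, 1 = action.
    """
    if not line.strip():
        return 'blank'
    # number of tab characters before the first non-tab character
    n_tabs = len(line) - len(line.lstrip('\t'))
    roles = {5: 'speaker', 4: 'parenthetical', 3: 'dialogue', 1: 'action'}
    return roles.get(n_tabs, 'other')


for sample in ['\t\t\t\t\tGARAK', '\t\t\tHow is he?', '\tEstablishing.']:
    print(repr(sample), '->', classify_line(sample))
```

Scene headings and page headers have no leading tabs, so they fall into `'other'` and need their own checks.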

--- 

## Cleaning scripts

Below are helper scripts useful for extracting information and stripping unneeded lines from the script files. (Note: these scripts are also in `preprocessing.py`.)

We extract information like the episode title, and clear lines such as:

```
           DEEP SPACE: "Past Prologue" 10/05/92 - ACT TWO           24.
```
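A page header like the one above could also be parsed rather than just discarded, in case the act or page number turns out to be useful. The sketch below is an assumption on my part, not what `preprocessing.py` does: `parse_page_header` and its regex are hypothetical and only modeled on the header lines seen in this episode (with an optional `NINE`, since later episodes say `DEEP SPACE NINE:`).

```python
import re

# Modeled on: DEEP SPACE: "Past Prologue" 10/05/92 - ACT TWO           24.
PAGE_HEADER_RE = re.compile(
    r'^\s*DEEP SPACE(?: NINE)?:\s*'
    r'"(?P<title>[^"]+)"\s+'
    r'(?P<date>\d{2}/\d{2}/\d{2})\s+-\s+'
    r'(?P<section>.+?)\s+'
    r'(?P<page>\d+)\.\s*$'
)


def parse_page_header(line):
    """Return a dict of header fields, or None if the line is not a page header."""
    m = PAGE_HEADER_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields['page'] = int(fields['page'])
    return fields


print(parse_page_header(
    '           DEEP SPACE: "Past Prologue" 10/05/92 - ACT TWO           24.'
))
```

Headers with OCR typos will not match this pattern, so it is stricter than the simple `startswith` check used for stripping.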

In [6]:
def space_return(num_spaces, ep_lines=ep_lines):
    '''
    Returns a list of lines that start with exactly num_spaces spaces.
    Input must be a list of lines.
    '''
    # used on lines version of script, not string script
    # note: the default ep_lines is bound when the function is defined,
    # so pass the list explicitly once ep_lines has been reassigned
    assert isinstance(ep_lines, list)

    def line_starts_with_only_n(line, n=num_spaces):
        # helper function to check matches; slicing line[n:n+1] rather than
        # indexing line[n] avoids an IndexError on lines of exactly n spaces
        return (line[:n] == ' ' * n) and (line[n:n+1] not in ('', ' '))

    return [line for line in ep_lines if line_starts_with_only_n(line)]


def tab_return(num_tabs, ep_lines=ep_lines):
    '''
    Returns a list of lines that start with exactly num_tabs tabs.
    Input must be a list of lines.
    '''
    assert isinstance(ep_lines, list)

    def line_starts_with_only_n(line, n=num_tabs):
        # should probably refactor this since the helper fn is
        # defined twice while nested... TODO
        # (same slicing trick as above to skip tab-only lines safely)
        return (line[:n] == '\t' * n) and (line[n:n+1] not in ('', '\t'))

    return [line for line in ep_lines if line_starts_with_only_n(line)]


def get_title(s):
    '''
    Returns from the string version of the script the title
    of the episode
    '''
    
    # note: in "full" version of this workflow, the compile
    # statement is run outside of the function to improve speed
    import re
    title_finder = re.compile(r'"(.*)"')
    
    # used on string version of episode, not lines script
    assert type(s) == str
    
    return title_finder.search(s).group(1)


def get_header_cutoff(ep_lines=ep_lines):
    '''
    Determines the index cutoff for where the episode header ends.
    Input should be a list of strings.
    '''
    
    # used on lines not str
    assert type(ep_lines) == list
    
    cutoff = 0
    for ix, line in enumerate(ep_lines):
        if line.strip() == 'TEASER':
            cutoff = ix
            break
    return cutoff


def match_page_header(line):
    '''
    This function identifies if a line is a page header.
    Input is a single line.
    '''
    # later eps say DEEP SPACE NINE:
    # hopefully this will be more flexible
    return (line.strip().startswith('DEEP SPACE')) and (":" in line)
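The TODO in `tab_return` about the duplicated nested helper could be handled with a single function that takes the indent character as a parameter. A minimal sketch, assuming the goal is "exactly `n` indent characters, then something else"; `lines_with_indent` is a hypothetical name and is not in `preprocessing.py`.

```python
def lines_with_indent(lines, n, indent_char='\t'):
    """Return lines that begin with exactly n copies of indent_char
    followed by at least one other character."""
    prefix = indent_char * n
    return [
        line for line in lines
        # line[n:n+1] is '' when the line ends right after the prefix,
        # so indent-only lines are skipped without raising IndexError
        if line.startswith(prefix) and line[n:n + 1] not in ('', indent_char)
    ]


sample = ['\t\t\t\t\tKIRA', '\t\t\tHow is he?', '', '\t\t\t\t\t']
print(lines_with_indent(sample, 5))        # speaker-level lines
print(lines_with_indent(sample, 3))        # dialogue-level lines
```

With this in place, `tab_return(n, lines)` and `space_return(n, lines)` would reduce to one-line wrappers that pass `'\t'` or `' '`.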

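With the above functions defined, we can cut off the header and keep only relevant lines in the following four lines of code (the episode title can be pulled separately with `get_title(ep)`):

In [7]:
cutoff = get_header_cutoff(ep_lines)
# note: the slices below are hardcoded to 85 for this episode;
# `cutoff` is computed above but not yet used here
header = ep_lines[:85]
ep_lines = ep_lines[85:]

ep_lines = [l for l in ep_lines if not match_page_header(l)]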

Below, let's see what `ep_lines` looks like so far:

In [8]:
ep_lines[500:510]

['\t\t\t\t\tKIRA',
 '\t\t\tHow is he?',
 '',
 '\t\t\t\t\tBASHIR',
 '\t\t\tSecond degree burns, lacerations, a ',
 '\t\t\tminor concussion... Not much compared ',
 "\t\t\tto what he's been through before.",
 '',
 '',
 '\t\t\t\t\tSISKO']

---

## Splitting into acts

Episode-based association rules were previously too general to be interesting, but episode _acts_ are a little more fine-grained.

Scene-based association rules would probably be most interesting, but it's difficult to tell the difference between scenes and shots in this dataset (some elaboration below). Splitting based on acts, however, is reasonably straightforward:

In [9]:
def act_partition(ep_lines):
    '''
    Accepts a list of episode lines.
    Returns six lists, one list per act, including the teaser.
    '''
    
    # used on lines not str
    assert type(ep_lines) == list
        
    stripped_lines = [l.strip() for l in ep_lines]
    
    # locating start of each act:
    teaser_begin = stripped_lines.index('TEASER')
    act_1_begin = stripped_lines.index('ACT ONE')
    act_2_begin = stripped_lines.index('ACT TWO')
    act_3_begin = stripped_lines.index('ACT THREE')
    act_4_begin = stripped_lines.index('ACT FOUR')
    act_5_begin = stripped_lines.index('ACT FIVE')
    
    # subsetting:
    teaser = ep_lines[:act_1_begin]
    act_1 = ep_lines[act_1_begin:act_2_begin]
    act_2 = ep_lines[act_2_begin:act_3_begin]
    act_3 = ep_lines[act_3_begin:act_4_begin]
    act_4 = ep_lines[act_4_begin:act_5_begin]
    act_5 = ep_lines[act_5_begin:]
    
    return teaser, act_1, act_2, act_3, act_4, act_5

In [10]:
teaser, act_1, act_2, act_3, act_4, act_5 = act_partition(ep_lines)

In [11]:
acts = [teaser, act_1, act_2, act_3, act_4, act_5]

Below, the act lines:

In [12]:
teaser[:10]

['                           "Past Prologue"                             ',
 '                            TEASER                              ',
 '',
 '\tFADE IN:',
 '',
 '1    EXT. SPACE - DS9 (OPTICAL)',
 '',
 '\tEstablishing.',
 '',
 '2    INT. PROMENADE REPLIMAT']

In [13]:
act_5[:10]

['                           ACT FIVE                             ',
 '',
 '\tFADE IN:',
 '',
 '59   EXT. SPACE - DEEP SPACE NINE (OPTICAL)',
 '',
 '\tThe Klingon ship is no longer present.',
 '',
 '60   INT. OPS',
 '']
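`act_partition` relies on `list.index`, which raises `ValueError` if any of the six markers is missing or misspelled, and the scripts are known to contain typos. A more forgiving sketch is below; `partition_by_markers` is a hypothetical alternative (not the version used in this workflow) that collects whatever markers it finds, in order, so missing ones can be spotted by checking the returned keys.

```python
import re

# Exact act markers as they appear (after stripping) in "Past Prologue".
ACT_MARKER_RE = re.compile(r'^(TEASER|ACT (?:ONE|TWO|THREE|FOUR|FIVE))$')


def partition_by_markers(lines):
    """Group lines under the most recent act marker seen.

    Lines before the first marker go under 'PREAMBLE'. Only the first
    occurrence of each marker starts a new section.
    """
    sections = {'PREAMBLE': []}
    current = 'PREAMBLE'
    for line in lines:
        m = ACT_MARKER_RE.match(line.strip())
        if m and m.group(1) not in sections:
            current = m.group(1)
            sections[current] = []
        sections[current].append(line)
    return sections


demo = ['   "Past Prologue"   ', '   TEASER   ', '\tFADE IN:', '   ACT ONE   ', '\tFADE IN:']
print(list(partition_by_markers(demo)))
```

A misspelled marker is not recovered by this sketch; its lines simply stay in the preceding section, which at least fails quietly and detectably instead of raising.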

---

## Exploring scenes & shots

- in progress
- identified a regex
- need to handle things like...

```
2    INT. PROMENADE REPLIMAT
2    CONTINUED:
2    CONTINUED:	(2)

...

26   CONTINUED:
27   OMITTED
29   OMITTED
```

In [14]:
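# raw strings keep \d and \s from being read as string escapes
scene_pattern = r'^\d+\s.*'

for line in teaser:
    # scene_pattern = '^\d+.*'
    if re.match(scene_pattern, line):
        print(line)

def match_scene_continues(line):
    pattern = r'^\d+.*'
    return (re.match(pattern, line)) and ('CONTINUED:' in line)

def match_sub_scene(line):
    # assuming sub-scenes are numbered like 12A: match letters only, since
    # \w also matches digits and '^\d+\w+' would catch every scene numbered 10+
    pattern = r'^\d+[A-Za-z]+.*'
    return re.match(pattern, line)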

1    EXT. SPACE - DS9 (OPTICAL)
2    INT. PROMENADE REPLIMAT
2    CONTINUED:
2    CONTINUED:	(2)
3    INT. OPS
4    AT TRANSPORTER PAD (OPTICAL)
5    REACTION - KIRA
6    RESUME - SHOT
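One way to handle the `CONTINUED:` and `OMITTED` cases listed above is to parse each numbered line into its parts and drop the ones that are not real headings. This is a sketch under assumptions: `parse_scene_heading` is hypothetical, and the optional letter suffix (e.g. `12A`) is my guess at how sub-scenes are numbered, not something confirmed in this episode.

```python
import re

SCENE_RE = re.compile(r'^(?P<num>\d+)(?P<suffix>[A-Z]?)\s+(?P<heading>\S.*?)\s*$')


def parse_scene_heading(line):
    """Return (number, suffix, heading) for a scene/shot heading, else None.

    CONTINUED and OMITTED lines are treated as non-headings.
    """
    m = SCENE_RE.match(line)
    if m is None:
        return None
    heading = m.group('heading')
    if heading.startswith('CONTINUED') or heading == 'OMITTED':
        return None
    return int(m.group('num')), m.group('suffix'), heading


for line in ['2    INT. PROMENADE REPLIMAT', '2    CONTINUED:\t(2)', '27   OMITTED']:
    print(repr(line), '->', parse_scene_heading(line))
```

This still does not separate scenes from shots (`REACTION - KIRA`, `RESUME - SHOT`); that distinction would need a rule on the heading text itself.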


In [15]:
# def _not(func):
#     # https://stackoverflow.com/questions/33989155/is-there-a-filter-opposite-builtin
#     def not_func(*args, **kwargs):
#         return not func(*args, **kwargs)
#     return not_func

In [16]:
teaser = [l for l in teaser if not match_scene_continues(l)]
teaser = [l for l in teaser if not match_sub_scene(l)]

In [17]:
teaser[:15]

['                           "Past Prologue"                             ',
 '                            TEASER                              ',
 '',
 '\tFADE IN:',
 '',
 '1    EXT. SPACE - DS9 (OPTICAL)',
 '',
 '\tEstablishing.',
 '',
 '2    INT. PROMENADE REPLIMAT',
 '',
 '\tDOCTOR JULIAN BASHIR sits enjoying a tea-like beverage, ',
 '\treading a medical journal PADD... as the large, ever-pleasant ',
 '\tCardassian, GARAK, interposes himself between Bashir and the ',
 "\tlatter's view, with:"]

In [18]:
# scene_pattern = '^\d+\s.*'

# for line in teaser:
#     # scene_pattern = '^\d+.*'
#     if re.match(scene_pattern, line):
#         print(line)

---

## Identifying & cleaning speakers

For this analysis we're not interested in _what_ characters say -- just whether they say anything, and who's around when they're saying things.

In [19]:
def clean_speaker(s):
    '''
    Works on a single string to clear common prefixes and suffixes.
    This is not the canonical version of clean_speaker; the version in
    preprocessing.py is. #JustNotebookWorkflowThings
    '''
    
    # duplicate entries removed; order otherwise unchanged
    clean_up = ['(V.O.)', '(O.S.)', '(OS)', "'S COM VOICE", '(MONITOR)', "'S COMPUTER VOICE",
                "(cont'd)", "(Cont'd)", '(ON SCREEN)', '(0.S.)', '(FAR O.S.)', 'ON SCREEN',
                '(Cont,d)', '(O. C.. )', "(Cont' d)", "'S VOICE", '(0. S. )', ' (0. S.)', ' (0.S)',
                "'S COMM VOICE"]

    # str.replace is a no-op when the substring is absent
    for each in clean_up:
        s = s.replace(each, '')

    return s.strip()

In [20]:
def return_speakers(lines):
    '''
    Accepts a list of episode lines.
    Returns only the speakers.
    '''
    raw_speakers = tab_return(5, ep_lines=lines)
    raw_speakers = [s.strip() for s in raw_speakers]
    raw_speakers = [clean_speaker(s) for s in raw_speakers]
    
    return raw_speakers

### Generating act-based speaker counts for "Past Prologue"

In following notebooks, we'll actually extract this information for every episode. Here we see the process.

Speaker counts aren't strictly necessary for Apriori-style association rule mining... but I'm keeping them in case they're useful for later analysis, and because they're pretty easy to collect with `Counter`.
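To make the "counts aren't necessary" point concrete: rule mining only needs the *set* of speakers in each act (one transaction per act). The sketch below computes pair support by hand on toy transactions; `pair_support` is a hypothetical helper, the toy sets are illustrative rather than real output, and this is not the mining code used in later notebooks.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Fraction of transactions in which each pair of speakers co-occurs."""
    n = len(transactions)
    counts = Counter()
    for speakers in transactions:
        # sort so each pair has one canonical ordering
        for pair in combinations(sorted(speakers), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items()}


# toy transactions, one set of speakers per act (illustrative only)
toy_acts = [{'GARAK', 'BASHIR'}, {'KIRA', 'SISKO', 'BASHIR'}, {'KIRA', 'SISKO'}]
support = pair_support(toy_acts)
print(support[('KIRA', 'SISKO')])
```

How many lines each character speaks never enters the calculation, only whether they speak at all.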

In [21]:
# dict to store speakers & line counts:
test_dict = {}

In [22]:
# note -- something like this would strictly be more efficient, and who knows, if I refactor this I
# might shift to just using sets for speed. but I do have the data scientist's delight here -- the privilege
# of doing an ad hoc analysis that doesn't need to scale.
for act in acts:
    print(set(return_speakers(act)))
    print()

{"O'BRIEN", 'SISKO', 'KIRA', 'GARAK', 'BASHIR', 'TAHNA', 'DAX'}

{"O'BRIEN", 'GUL DANAR', 'KIRA', 'ADMIRAL', 'SISKO', 'BASHIR', 'TAHNA'}

{'LURSA', "B'ETOR", 'GARAK', 'ODO', 'KIRA', 'BAJORAN DEPUTY', 'SISKO', 'BASHIR', 'TAHNA'}

{'LURSA', "B'ETOR", 'ODO', 'SISKO', 'KIRA', 'GARAK', 'TAHNA'}

{"B'ETOR", 'LURSA', 'ODO', '(thru door)', 'SISKO', 'KIRA', '(indicates)', 'GARAK', 'BASHIR'}

{'LURSA', "B'ETOR", "O'BRIEN", 'ODO', 'GUL DANAR', 'KIRA', 'KIANG (OPTICAL)', 'SISKO', 'BASHIR', 'TAHNA', 'DAX'}
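A few non-speakers leak through in the sets above: `'(thru door)'` and `'(indicates)'` are parentheticals that happen to sit at five tabs, and `'KIANG (OPTICAL)'` carries a trailing note. A possible extra cleaning pass is sketched below; `normalize_speaker` is hypothetical, and it assumes a trailing parenthetical is a stage direction rather than part of the name.

```python
import re

TRAILING_PAREN_RE = re.compile(r'\s*\([^)]*\)\s*$')


def normalize_speaker(s):
    """Return a cleaned speaker name, or None if the entry is a parenthetical."""
    s = s.strip()
    if s.startswith('('):
        return None          # e.g. '(thru door)' is an action note, not a speaker
    s = TRAILING_PAREN_RE.sub('', s)
    return s or None


raw = ['SISKO', '(thru door)', 'KIANG (OPTICAL)', "O'BRIEN"]
cleaned = [name for name in map(normalize_speaker, raw) if name is not None]
print(cleaned)
```

This would run after `clean_speaker`, so the known suffixes are still stripped by the explicit list first.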



In [23]:
return_speakers(teaser)[:15] # here's what it looks like raw...

['GARAK',
 'BASHIR',
 'GARAK',
 'BASHIR',
 'BASHIR',
 'GARAK',
 'BASHIR',
 'GARAK',
 'BASHIR',
 'GARAK',
 'BASHIR',
 'GARAK',
 'BASHIR',
 'GARAK',
 'BASHIR']

In [24]:
# using Counter to get speakers & counts per act
speakers = Counter(return_speakers(teaser))

In [25]:
# update the test dict
# right now this is strictly redundant, lol, I'm not building back up to episodes here.
# TODO refactor to be less silly
test_dict.update(speakers)

In [26]:
test_dict

{'GARAK': 7,
 'BASHIR': 15,
 "O'BRIEN": 6,
 'SISKO': 11,
 'DAX': 2,
 'KIRA': 2,
 'TAHNA': 3}
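One thing to watch when building back up to episodes: `test_dict` is a plain `dict`, so calling `test_dict.update(...)` with a second act's counts would overwrite shared keys instead of adding to them. `Counter.update` adds. A small sketch with toy speaker lists (illustrative, not real output); `count_speakers_by_act` is a hypothetical helper.

```python
from collections import Counter


def count_speakers_by_act(speakers_per_act):
    """Return (list of per-act Counters, episode-level Counter)."""
    per_act = [Counter(speakers) for speakers in speakers_per_act]
    total = Counter()
    for act_counts in per_act:
        total.update(act_counts)   # Counter.update adds counts together
    return per_act, total


toy = [['GARAK', 'BASHIR', 'GARAK'], ['BASHIR', 'KIRA']]
per_act, total = count_speakers_by_act(toy)
print(total)

# contrast: dict.update replaces the earlier value for a shared key
plain = {}
for act_counts in per_act:
    plain.update(act_counts)
print(plain['BASHIR'], 'vs', total['BASHIR'])
```

Keeping the per-act Counters around also preserves the act-level sets needed for rule mining (`set(c)` for each Counter `c`).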