## Splitting the Scripts
Our first objective is to write a script that takes a movie script text file and parses it into a dictionary in which each unique speaker is a key and the value is a list of all the lines they speak throughout the script. To start simple, I pulled the Megamind script from https://imsdb.com/scripts/Megamind.html and pasted it into a text file.

In [1]:
import re
import os

cdir = os.path.abspath('')  # directory the notebook is running from
# The with-block closes the file even if reading raises an error
with open(os.path.join(cdir, 'megamind.txt'), 'r') as script_file:
    script = script_file.read()
print(script[0:100] + script[2000:2100])  # making sure it imported right


MEGAMIND



Written by

Alan Schoolcraft & Brent Simons




CREDITS SEQUENCE

NEWSPAPER HEADLINE MOas he shoots off
into the distance like a speeding bullet.

EXT. OBSERVATORY HIDEOUT - ESTABLISHING 


### Splitting by Paragraph
Now that the script is imported correctly, our next step is to split it into paragraphs. We can do this with a regular expression that splits the text wherever two or more newlines occur in a row. _(source: https://stackoverflow.com/questions/57273215/how-to-parse-movie-script-in-a-dictionary)_
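
To see what this kind of split does before applying it to the full script, here is a minimal sketch on a made-up string (the `sample` and `messy` texts are illustrative, not taken from the Megamind script). It also shows one assumption worth checking on other scripts: `\n\n+` only treats a line as blank if it is completely empty, so a "blank" line containing spaces would not split. The `\n\s*\n` variant is a possible alternative, not what this notebook uses.

```python
import re

sample = ("MARY\nHey Joe, where did you put\nmy rain boots?"
          "\n\n\n"
          "JOE\nThey should be in the garage.")

# Two or more consecutive newlines mark a paragraph break.
paragraphs = re.split(r'\n\n+', sample)
print(paragraphs)

# A "blank" line that holds a space is not matched by \n\n+ ...
messy = "MARY\nThanks.\n \nJOE\nYou're welcome."
strict = re.split(r'\n\n+', messy)

# ... but it is matched if whitespace is allowed between the newlines.
tolerant = re.split(r'\n\s*\n', messy)
print(len(strict), len(tolerant))
```

Single newlines inside a paragraph are left alone, which is what lets the later step read the name from the first line and the speech from the rest.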

In [2]:
pars = re.split(r'\n\n+', script, maxsplit=0)
pars[50] # verifying the paragraphs split

"MASTER MIND\nAlright, let's not keep the lady\nwaiting."

### Creating a Character: \[Lines\] Dictionary
Now that we have our script split up by paragraph, we can use regular expressions to figure out where the characters' names and lines are. Looking at the script file, we can see that each name is in all caps on its own line, with that character's lines beginning on the following line. The following code is modified from the same source as above.

In [3]:
d = {} # initializes an empty dictionary to pass the names and lines into

for p in pars:
    # Capture the name (anchored to the beginning of line and all capitals)
    # and the rest of the paragraph - (.*)
    regex = re.search(r'^([A-Z]+ [A-Z]+|[A-Z]+)(.*)', p, re.S + re.M)

    if not regex:  # Avoid calling group() on null results
        continue

    name, txt = regex.group(1, 2) 

    # Each line of text (not each full sentence) becomes a list item
    if name in d:
        d[name] += txt.strip().split('\n')
    else:
        d[name] = txt.strip().split('\n')


#### Regular Expression Breakdown
The area of interest in this code involves the regular expression used to capture the name and lines. You can skip this part if you are already familiar with regular expressions or if you don't plan on using them.

`regex = re.search(r'^([A-Z]+ [A-Z]+|[A-Z]+)(.*)', p, re.S + re.M)`

We can break down the components of this search: 
* First, there's the `r` that precedes the string, which makes it a raw string: Python leaves backslashes alone, so the pattern reaches the regex engine exactly as written.
* Notice that within the quotation marks (`''`) there are two parenthesized groups: `([A-Z]+ [A-Z]+|[A-Z]+)` and `(.*)`. The first group captures the name and the second captures everything else. This allows us to later split these into the keys and values of the dictionary.
* The part `^([A-Z]+ [A-Z]+` matches one or more capital letters (`[A-Z]+`), a space character, and another run of one or more capital letters (`[A-Z]+`). The character `^` anchors the match to the start of a line. Because `re.M` is passed, that can be the start of any line in the paragraph, not only the first one.
* The character `|` means either/or, so `^([A-Z]+ [A-Z]+|[A-Z]+)` all together means "a line that starts with two words in all caps, or failing that, with one word in all caps".
* Finally, `(.*)` can be read as "any character (`.`), repeated zero or more times (`*`)". Normally `.` does not match a newline, so the match would stop at the end of the line, but the `re.S` flag makes `.` match newlines as well. Since the script is already split by paragraph, the group simply captures everything in the paragraph chunk after the name.
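
The breakdown above can be checked on a few short strings. This is a minimal sketch: `split_name` is a hypothetical helper introduced here for illustration, and the last two test strings are made up. The flags are combined with `|`, which is equivalent to the `re.S + re.M` used in the notebook cell.

```python
import re

# Same pattern as in the notebook; re.S | re.M is equivalent to re.S + re.M.
pattern = re.compile(r'^([A-Z]+ [A-Z]+|[A-Z]+)(.*)', re.S | re.M)

def split_name(paragraph):
    """Return (name, rest) for the first match, or None if nothing matches."""
    match = pattern.search(paragraph)
    return match.group(1, 2) if match else None

# A dialogue paragraph: the two-word alternative captures the full name.
dialogue = split_name("MASTER MIND\nAlright, let's not keep the lady\nwaiting.")
print(dialogue)

# An ordinary sentence: [A-Z]+ is satisfied by the first capital letter alone.
credit = split_name("Written by")
print(credit)

# Because of re.M, ^ also matches at the start of the second line.
direction = split_name("he whispers.\nCUT TO the lair")
print(direction)

# No line starts with a capital letter, so there is no match at all.
nothing = split_name("into the distance like a speeding bullet.")
print(nothing)
```

The second and third cases show how text that is not a character name can still produce a name: any line that begins with a capital letter satisfies the pattern.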


Now, we'll look at the dictionary produced by the code and see what we need to adjust accordingly.

In [4]:
print(d.keys())

dict_keys(['MEGAMIND', 'W', 'A', 'CREDITS SEQUENCE', 'NEWSPAPER HEADLINE', 'HEADLINES', 'PHOTO', 'T', 'END OF', 'EXT', 'WE PULL', 'H', 'Y', 'UBERMAN', 'U', 'G', 'INT', 'A STEEL', 'M', 'EINSTEIN', 'PLATO', 'E', 'D', 'MASTER MIND', 'DA VINCI', 'MOMENTS LATER', 'S', 'ROXANNE', 'R', 'CUT TO', 'A MECHANIZED', 'F', 'O', 'N', 'CRIME WAVE', 'SPINNING HEADLINE', 'ARMORED TRUCK', 'I', 'METRO CITY', 'A VICIOUS', 'HAL STEWART', 'HAL', 'ATTRACTIVE BLOND', 'VINNIE', 'V', 'TELEVISION', 'A WOMAN', 'WOMAN', 'BACK TO', 'TWO BRUISERS', 'P', 'FRANK', 'A SECRETARY', 'LIGHTENING FLASHES', 'REPORTER ON', 'TELEVISION NARRATOR', 'FRIEND', 'A WAITER', 'A YOUNG', 'ACROSS THE', 'POLITICIAN', 'YOUNG MOTHER', 'CRANE', 'HAL AND', 'REPORTER', 'J', 'L', 'C', 'NO', 'SEVERS', 'BOOM', 'STENWICK', 'MINUTES LATER', 'NEWS REPORTER', 'BLIND KID', 'ATTRACTIVE', 'TRUCKER', 'ATTRACTIVE WOMAN', 'BRAD HELMS', 'CUT BACK', 'A LIGHTENING', 'CAT', 'BATHROOM', 'ZOOM', 'HONK', 'VOICE', 'LATER', 'TITAN', 'JOHN', 'GEORGE', 'BANK MANAGER'

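As we can see, the code has created a dictionary whose keys include the characters' names. Looking at the list of all the keys, we can also see some stage directions and scene settings (such as `CUT TO` and `EXT`) that were misattributed as names, along with single-letter keys such as `W`, which appear because `[A-Z]+` also matches the lone capital letter at the start of an ordinary line. This is fine for now, but we want to make sure that the main characters of the movie and their lines have been captured successfully.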

In [5]:
print(d['MASTER MIND'][0:5])
print(d['ROXANNE'][0:5])

['The real Einstein once said, "God', 'does not play dice with the world."', 'He was right, because the world is', 'MY dice. Is that understood?', 'Alright, then - clean slate. Do we']
["You didn't need to turn around like", 'that. I can recognize the stench of', 'failure.', 'Looks like a real group of winners.', "At the risk of sounding cliche',"]


The code has successfully created the dictionary we want. Looking at the output, we can see that each item in the list is a piece of a line spoken by the character. However, they aren't clean, clear-cut sentences. This is because the code strips the surrounding whitespace from each paragraph and then splits it on newlines, so each item is one line of text from the script file rather than one full sentence. This is fine, since we need to combine them into one big "mega string" anyway. We need to keep in mind, however, that we must insert a space character between the strings when we merge them.
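
As a minimal sketch of that merge step, `str.join` with a single space as the separator does the job. The list below copies the first three `ROXANNE` items from the output above; `d_demo` and `merged_by_name` are illustrative names, not part of the notebook.

```python
# First three list items from the ROXANNE output above.
lines = [
    "You didn't need to turn around like",
    'that. I can recognize the stench of',
    'failure.',
]

# str.join puts exactly one space between neighbouring items.
merged = ' '.join(lines)
print(merged)

# The same idea applied to a whole name -> lines dictionary.
d_demo = {'ROXANNE': lines}
merged_by_name = {name: ' '.join(parts) for name, parts in d_demo.items()}
```

Joining with `' '` rather than `''` keeps the last word of one line from running into the first word of the next.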

### Creating a Universal Function
Now that we've successfully split up one script, we need to apply the same process to split up more scripts. To do this, we'll create a function that takes in a movie script string as input and outputs a dictionary with the characters as keys and a list of all their lines as values. 

`scriptsplit(moviescript)`

`output: {'Mary': ['Hey Joe, where did you put', 'my rain boots?', 'Thanks.'], 'Joe': ['They should be in the garage.', "You're welcome."]}`

We'll have to make sure that the scripts we choose share the same format, or one similar enough, for the function to work properly. We'll put the scripts we want to use in one folder. We'll probably also want a simple helper function that reads in a file, so we don't have to rewrite that code every time.
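
One possible shape for these two functions is sketched below, simply wrapping the steps already shown in this section. The name `read_script`, the `encoding='utf-8'` argument, and the `demo` text are assumptions made for illustration; the final version may differ once more scripts are tested.

```python
import re
from collections import defaultdict

# Same pattern as above; re.S | re.M is equivalent to re.S + re.M.
NAME_AND_LINES = re.compile(r'^([A-Z]+ [A-Z]+|[A-Z]+)(.*)', re.S | re.M)

def read_script(path):
    """Return the full text of the script file at `path` (hypothetical helper)."""
    # Assumption: the files are UTF-8 encoded; adjust if a script is not.
    with open(path, 'r', encoding='utf-8') as script_file:
        return script_file.read()

def scriptsplit(moviescript):
    """Map each all-caps name to the list of text lines that follow it."""
    lines_by_name = defaultdict(list)
    for paragraph in re.split(r'\n\n+', moviescript):
        match = NAME_AND_LINES.search(paragraph)
        if not match:  # paragraph has no line starting with a capital letter
            continue
        name, text = match.group(1, 2)
        lines_by_name[name] += text.strip().split('\n')
    return dict(lines_by_name)

demo = ("MARY\nHey Joe, where did you put\nmy rain boots?\n\n"
        "JOE\nThey should be in the garage.\n\n"
        "MARY\nThanks.")
result = scriptsplit(demo)
print(result)
```

Note that the keys come out exactly as they appear in the script (all caps here), and that this sketch inherits the same misattributed stage directions and single-letter keys seen earlier.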