# Fun with the *Star Trek* corpus, part I

As a longtime Trekkie, I thought it would be a neat idea to use the corpus of *Star Trek* scripts to get some experience with pandas and, hopefully, some natural language processing with spaCy. This is possible thanks to the efforts of chakoteya.net (which hosts HTML versions of the scripts) and Gary Broughton, who scraped them and put them in a much friendlier JSON format. I wrote a module called ```trekparser``` (available in this repo) to break them down even further into a big DataFrame of lines of dialogue. This DataFrame is also available in this repo, in CSV format, as ```table_of_lines.csv```.

For the first in what I hope will be a multi-part series of posts, I'll poke around this fairly large dataset, extract some fun facts, and possibly even gain some insights into large-scale properties of the Trek corpus.

Two notes on terminology before we start. First, we've got a collision between TV series and pandas Series; the capitalization will distinguish them. Second, since I'll be using them extensively, here are the abbreviations for the various series: TOS (The Original Series, 1966-69), TAS (The Animated Series, 1973-74), TNG (The Next Generation, 1987-94), DS9 (Deep Space Nine, 1993-99), VOY (Voyager, 1995-2001), and ENT (Enterprise, 2001-05). Of course, you could also get the years from the airdates in the episodes table!
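As a rough sketch of that last idea, here is one way to pull years out of airdate strings. It assumes every airdate looks like "3 Jan, 1993"; the three sample rows are copied from the tables in this post, and the `airdate_parsed` and `year` column names are made up for illustration.

```python
import pandas as pd

# A few rows copied from this post; the real table lives in episode_index.csv.
sample = pd.DataFrame({
    'series': ['DS9', 'DS9', 'TNG'],
    'title': ['Emissary', 'Past Prologue', 'The Best of Both Worlds Part 1'],
    'airdate': ['3 Jan, 1993', '11 Jan, 1993', '18 Jun, 1990'],
})

# Assumed format: day, abbreviated month, comma, four-digit year.
# errors='coerce' turns anything that doesn't match into NaT instead of raising.
sample['airdate_parsed'] = pd.to_datetime(
    sample['airdate'], format='%d %b, %Y', errors='coerce'
)
sample['year'] = sample['airdate_parsed'].dt.year

# Earliest and latest air year per series
year_span = sample.groupby('series')['year'].agg(['min', 'max'])
print(year_span)
```

On the full table, the same `groupby` should recover the year ranges listed above, provided the airdates all parse.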

## Preliminaries

Import all the relevant stuff, including the data:

In [1]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

episodes = pd.read_csv('episode_index.csv', index_col=['series','episode'])
lines = pd.read_csv('table_of_lines.csv')

As a sanity check, do these DataFrames look like they have the right stuff? Keep in mind that the episodes are indexed first by series lexicographically, so DS9 should come up first.

In [2]:
episodes.head(10)

series,episode,title,stardate,airdate
DS9,0,Emissary,46379.1,"3 Jan, 1993"
DS9,1,Past Prologue,Unknown,"11 Jan, 1993"
DS9,2,A Man Alone,46421.5,"17 Jan, 1993"
DS9,3,Babel,46423.7,"25 Jan, 1993"
DS9,4,Captive Pursuit,Unknown,"1 Feb, 1993"
DS9,5,Q-less,46531.2,"8 Feb, 1993"
DS9,6,Dax,46910.1,"15 Feb, 1993"
DS9,7,The Passenger,Unknown,"22 Feb, 1993"
DS9,8,Move Along Home,Unknown,"15 Mar, 1993"
DS9,9,The Nagus,Unknown,"22 Mar, 1993"


In [3]:
lines.head(10)

,series,ep_number,scene_number,line_number,ep_name,scene_loc,character,line
0,DS9,0,1,0,Emissary,Saratoga - Bridge,LOCUTUS,Resistance is futile. You will disarm your wea...
1,DS9,0,1,1,Emissary,Saratoga - Bridge,CAPTAIN,(a Vulcan) Red alert. Load all torpedo bays. R...
2,DS9,0,1,2,Emissary,Saratoga - Bridge,OPS OFFICER,(woman) They've locked on.
3,DS9,0,1,3,Emissary,Saratoga - Bridge,SISKO,Reroute auxiliary power.
4,DS9,0,1,4,Emissary,Saratoga - Bridge,OPS OFFICER,Our shields are being drained. Sixty four perc...
5,DS9,0,1,5,Emissary,Saratoga - Bridge,CAPTAIN,Recalibrate shield nutation.
6,DS9,0,1,6,Emissary,Saratoga - Bridge,TACTICAL,(Bolian) Modulation is having no effect.
7,DS9,0,1,7,Emissary,Saratoga - Bridge,OPS OFFICER,Shields have failed.
8,DS9,0,1,8,Emissary,Saratoga - Bridge,SISKO,Full reverse.
9,DS9,0,1,9,Emissary,Saratoga - Bridge,CAPTAIN,Maintain all Argh! (Everything goes BOOM)


Good so far. How about some general facts about the dataset?

In [4]:
print('There are %d total lines of dialogue across all series.' % (len(lines)))
print('There are %d distinct named characters.' % (len(lines.character.unique())))
print('The 73rd episode of TNG is %s, which first aired on %s.' %
      (episodes.loc[('TNG',72)].title, episodes.loc[('TNG',72)].airdate))
longest = lines.loc[lines.line_number==lines.line_number.max()].squeeze()
#note: this will only work if there's a unique longest scene
print('''The longest scene in the franchise by number of lines is scene %d of the %s episode %s, which takes place 
      in the %s. It has %d lines!''' %
     (longest.scene_number, longest.series, longest.ep_name, longest.scene_loc, longest.line_number+1))

There are 263978 total lines of dialogue across all series.
There are 2737 distinct named characters.
The 73rd episode of TNG is The Best of Both Worlds Part 1, which first aired on 18 Jun, 1990.
The longest scene in the franchise by number of lines is scene 17 of the TOS episode Wolf In The Fold, which takes place 
      in the Briefing Room. It has 148 lines!
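The caveat in the code comment (a unique longest scene) can be sidestepped by counting lines per scene and keeping every scene that ties for the maximum. A minimal sketch on a toy table with made-up values; only the column names match the real DataFrame:

```python
import pandas as pd

# Toy data: two scenes with three lines each, so there is a tie for "longest".
toy = pd.DataFrame({
    'series':       ['TOS', 'TOS', 'TOS', 'TNG', 'TNG', 'TNG', 'TNG'],
    'ep_number':    [0, 0, 0, 1, 1, 1, 1],
    'scene_number': [1, 1, 1, 4, 4, 4, 5],
    'line_number':  [0, 1, 2, 0, 1, 2, 0],
})

# Count rows per scene instead of relying on the largest line_number.
scene_sizes = toy.groupby(['series', 'ep_number', 'scene_number']).size()

# Keep every scene whose size equals the maximum, so ties are all reported.
longest_scenes = scene_sizes[scene_sizes == scene_sizes.max()]

for (series, ep, scene), n_lines in longest_scenes.items():
    print('%s episode %d, scene %d: %d lines' % (series, ep, scene, n_lines))
```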


What are all the times Dr. McCoy says, "He's dead, Jim"?

In [5]:
lines.loc[(lines.character=='MCCOY') & (lines.line.str.contains("e's dead, Jim"))]

,series,ep_number,scene_number,line_number,ep_name,scene_loc,character,line
170721,TOS,5,33,10,The Enemy Within,Transporter Room,MCCOY,"He's dead, Jim. Captain's Log, stardate 1673.1..."
179994,TOS,32,11,11,The Changeling,Bridge,MCCOY,"He's dead, Jim."
184112,TOS,43,2,0,Wolf In The Fold,Street,MCCOY,"She's dead, Jim."
184179,TOS,43,9,1,Wolf In The Fold,Chamber,MCCOY,"She's dead, Jim. Just like the other one."
184393,TOS,43,17,130,Wolf In The Fold,Briefing Room,MCCOY,"He's dead, Jim."
190545,TOS,60,17,14,Is There In Truth No Beauty?,Engineering,MCCOY,"He's dead, Jim. Captain's log, stardate 5630.8..."


Just to see how messy the formatting got, what fraction of lines end with something other than punctuation? It seems like a good guess that some of the lines that end on a letter got cut off somehow, and might need further scrutiny...

In [6]:
len(lines.loc[~lines.line.str.endswith(('.','!','?',')'))])/len(lines)

0.019675882081082514

Not too bad, though that is still 5194 lines of dialogue (and if you look at the value counts for ```lines.line.str.slice(-1)```, you'll see that there are some places where something wonky probably happened, like the 32 lines that end with ']' or the 2 lines that end with '}').
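For reference, the last-character tally mentioned above can be sketched like this. The toy Series stands in for `lines.line`; its entries are short strings made up for the example, and the set of "acceptable" endings is the same one used in the cell above.

```python
import pandas as pd

# Toy stand-in for lines.line
toy_lines = pd.Series([
    "He's dead, Jim.",
    "Full reverse.",
    "Red alert!",
    "Maintain all Argh! (Everything goes BOOM)",
    "Reroute auxiliary power",
    "They've locked on]",
], name='line')

# Tally the final character of every line
last_chars = toy_lines.str.slice(-1).value_counts()
print(last_chars)

# Pull out the lines whose final character is not one of the expected endings
expected_endings = ['.', '!', '?', ')']
suspicious = toy_lines[~toy_lines.str.slice(-1).isin(expected_endings)]
print(suspicious)
```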

Finally, let's see an example of a MultiIndex in use on the big DataFrame. As suggested in the readme, let's set the index to series, ep_number, scene_number, and line_number, and then pull out a line completely at random:

In [7]:
lines.set_index(['series','ep_number','scene_number','line_number'], inplace=True)
lines.loc[('TNG',72,46,16)]

ep_name                         The Best of Both Worlds Part 1
scene_loc                                               Bridge
character                                               PICARD
line         I am Locutus of Borg. Resistance is futile. Yo...
Name: (TNG, 72, 46, 16), dtype: object
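The same index also supports partial keys, which is handy for grabbing a whole scene at once. A small sketch, using four rows copied from the tables in this post rather than the full CSV; the variable names are illustrative only:

```python
import pandas as pd

# Four rows copied from this post (the last line is truncated, as in the table above).
toy = pd.DataFrame({
    'series':       ['DS9', 'DS9', 'DS9', 'TNG'],
    'ep_number':    [0, 0, 0, 72],
    'scene_number': [1, 1, 1, 46],
    'line_number':  [3, 7, 8, 16],
    'character':    ['SISKO', 'OPS OFFICER', 'SISKO', 'PICARD'],
    'line':         ['Reroute auxiliary power.', 'Shields have failed.',
                     'Full reverse.',
                     'I am Locutus of Borg. Resistance is futile. Yo...'],
})

# Sorting the MultiIndex keeps lookups fast and avoids performance warnings.
indexed = toy.set_index(
    ['series', 'ep_number', 'scene_number', 'line_number']
).sort_index()

# A partial key (series, episode, scene) selects every line in that scene.
scene = indexed.loc[('DS9', 0, 1), :]
print(scene)
```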