# Fun with the *Star Trek* corpus, part I

As a longtime Trekkie, I thought it would be neat to use the corpus of *Star Trek* scripts to get some experience with pandas and, hopefully, some natural language processing with spaCy. This is possible thanks to the efforts of chakoteya.net (which hosts HTML versions of the scripts) and Gary Broughton, who scraped them and put them in a much friendlier JSON format. I wrote a module called ```trekparser``` (available in this repo) to break them down even further into a big DataFrame of lines of dialogue. This DataFrame is also available in this repo, in CSV format, as ```table_of_lines.csv```.

For the first in what I hope to be a several-part series of posts, I'll poke around this fairly large dataset, extract some fun facts, and possibly even gain some insights into large-scale properties of the Trek corpus.

Two notes on terminology before we start. First, we've got a collision between TV series and pandas Series; the capitalization will distinguish them. Second, since they're going to be used extensively, here are the abbreviations for the various series: TOS (The Original Series, 1966-69), TAS (The Animated Series, 1973-74), TNG (The Next Generation, 1987-94), DS9 (Deep Space Nine, 1993-99), VOY (Voyager, 1995-2001), and ENT (Enterprise, 2001-05). Of course, you could also get the years from the airdates in the episodes table!
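For the curious, here is a minimal sketch of how the years could be pulled from airdates. It runs on a tiny hand-typed table rather than the real ```episode_index.csv```, and it assumes the ```airdate``` column holds strings that ```pd.to_datetime``` can parse; the real file may need a different format string.

```python
import pandas as pd

# Toy stand-in for the episodes table. The ISO date format is an assumption;
# the real airdate column may be formatted differently.
toy_episodes = pd.DataFrame({
    'series': ['TOS', 'TOS', 'TNG', 'TNG'],
    'episode': [0, 1, 0, 1],
    'airdate': ['1966-09-08', '1969-06-03', '1987-09-28', '1994-05-23'],
}).set_index(['series', 'episode'])

# Parse the dates, keep only the year, then take the earliest and latest
# year within each series (the first level of the index).
years = pd.to_datetime(toy_episodes['airdate']).dt.year
year_range = years.groupby(level='series').agg(['min', 'max'])
print(year_range)
```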

## Preliminaries

Import all the relevant stuff, including the data:

In [None]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

episodes = pd.read_csv('episode_index.csv', index_col=['series','episode'])
lines = pd.read_csv('table_of_lines.csv')

As a sanity check, do these DataFrames look like they have the right stuff? Keep in mind that the episodes are indexed first by series lexicographically, so DS9 should come up first.

In [None]:
episodes.head(10)

In [None]:
lines.head(10)

Good so far. How about some general facts about the dataset?

In [None]:
print('There are %d total lines of dialogue across all series.' % (len(lines)))
print('There are %d distinct named characters.' % (len(lines.character.unique())))
print('The 73rd episode of TNG is %s, which first aired on %s.' %
      (episodes.loc[('TNG',72)].title, episodes.loc[('TNG',72)].airdate))
longest = lines.loc[lines.line_number==lines.line_number.max()].squeeze()
# note: this will only work if there's a unique longest scene
# (line_number is presumably zero-based, hence the +1 below)
print('The longest scene in the franchise by number of lines is scene %d of the %s episode %s, '
      'which takes place in the %s. It has %d lines!' %
      (longest.scene_number, longest.series, longest.ep_name, longest.scene_loc, longest.line_number+1))
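The uniqueness caveat matters because ```.squeeze()``` only collapses a one-row DataFrame into a Series; with two tied scenes it hands back a DataFrame and the ```%d``` formatting fails. Here is a rough sketch of a tie-tolerant version on made-up data. The toy table and its values are invented for illustration, and it assumes ```line_number``` counts from zero within each scene.

```python
import pandas as pd

# Invented data: two scenes tied for the most lines (three lines each).
toy_lines = pd.DataFrame({
    'series': ['TOS', 'TOS', 'TOS', 'TNG', 'TNG', 'TNG'],
    'ep_number': [0, 0, 0, 5, 5, 5],
    'scene_number': [1, 1, 1, 2, 2, 2],
    'line_number': [0, 1, 2, 0, 1, 2],
})

# Keep every row that reaches the maximum line_number instead of squeezing.
longest_rows = toy_lines.loc[toy_lines.line_number == toy_lines.line_number.max()]

# Report each tied scene separately.
for row in longest_rows.itertuples(index=False):
    print('%s episode %d, scene %d: %d lines' %
          (row.series, row.ep_number, row.scene_number, row.line_number + 1))
```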

What are all the times Dr. McCoy says, "He's dead, Jim"?

In [None]:
lines.loc[(lines.character=='MCCOY') & (lines.line.str.contains("e's dead, Jim"))]
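Searching for ```"e's dead, Jim"``` rather than the full phrase presumably catches both "He's" and "he's". The sketch below shows the same filter on a few invented rows, with two optional tweaks that are my additions rather than part of the original query: ```regex=False``` treats the pattern as a literal string, and ```na=False``` makes rows with a missing line count as non-matches instead of propagating missing values into the mask.

```python
import pandas as pd

# Invented rows, including one with a missing line.
toy = pd.DataFrame({
    'character': ['MCCOY', 'MCCOY', 'KIRK', 'MCCOY'],
    'line': ["He's dead, Jim.", None, "He's dead, Jim?", "It's worse than that."],
})

# Literal substring match; missing lines are treated as "no match".
mask = (toy.character == 'MCCOY') & toy.line.str.contains(
    "e's dead, Jim", regex=False, na=False)
matches = toy.loc[mask]
print(matches)
```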

Just to see how messy the formatting got, how many lines end with something other than punctuation? Seems like a good guess that some of the lines that end on a letter got cut off somehow, and might need further scrutiny...

In [None]:
len(lines.loc[~lines.line.str.endswith(('.','!','?',')'))])/len(lines)

Not too bad, though that is still 5194 lines of dialogue (and if you look in the value counts for ```lines.line.str.slice(-1)```, you'll see that there are some places where something wonky probably happened, like the 32 lines that end with ']' or the 2 lines that end with '}').
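To make the value-counts check concrete, here is a small sketch on a handful of invented lines. It uses ```str.slice(-1)``` together with ```isin``` to flag the endings, which is an alternative to the ```endswith``` call above and should give the same answer for non-empty lines.

```python
import pandas as pd

# Invented lines: three end in ordinary punctuation, two do not.
toy = pd.Series(['He is dead, Jim.', 'Fascinating', 'Engage!',
                 'Make it so.', 'Red alert]'])

# Tally the final character of every line.
last_char = toy.str.slice(-1)
last_char_counts = last_char.value_counts()
print(last_char_counts)

# Fraction of lines whose last character is not one of the expected endings.
fraction = (~last_char.isin(['.', '!', '?', ')'])).mean()
print(fraction)
```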

Finally, let's see the big DataFrame's potential multi-index in action. As suggested in the readme, let's set the index to series, ep_number, scene_number, and line_number, and then pull out a line completely at random:

In [None]:
lines.set_index(['series','ep_number','scene_number','line_number'], inplace=True)
lines.loc[('TNG',72,46,16)]
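The same kind of lookup can be tried on a toy table. In this sketch the characters and dialogue are placeholders, not real script lines, and the ```sort_index()``` call is my addition: pandas generally handles partial-key lookups better on a sorted MultiIndex. A full four-part key returns a single row, while a shorter key returns everything underneath it.

```python
import pandas as pd

# Placeholder data; none of these are real lines from the scripts.
toy_lines = pd.DataFrame({
    'series': ['TNG', 'TNG', 'TNG', 'DS9'],
    'ep_number': [72, 72, 72, 0],
    'scene_number': [46, 46, 47, 1],
    'line_number': [0, 1, 0, 0],
    'character': ['PICARD', 'RIKER', 'DATA', 'SISKO'],
    'line': ['Placeholder A.', 'Placeholder B.', 'Placeholder C.', 'Placeholder D.'],
})

# Build the four-level index and sort it.
indexed = toy_lines.set_index(
    ['series', 'ep_number', 'scene_number', 'line_number']).sort_index()

# Full key: one row, returned as a Series.
one_line = indexed.loc[('TNG', 72, 46, 1)]
print(one_line)

# Partial key: every line in scene 46 of TNG episode 72, as a DataFrame.
one_scene = indexed.loc[('TNG', 72, 46)]
print(one_scene)
```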