## Capstone Technical Report

Caroline Schmitt
12/18/17

### Problem statement:

Text classification can be a difficult natural language processing task. Its applications are broad, from comparing one's prose style to that of famous authors[2](https://iwl.me/about/) to identifying speakers over wiretaps[1](https://www.osti.gov/scitech/servlets/purl/11824). For this project I attempted to build a classification model for dialog on the TV show Star Trek: Deep Space Nine. Classifying TV dialog is an especially interesting task because TV shows often have dozens of writers who come and go, some staying for seasons at a time and some writing only one or two episodes. Nonetheless, each writer is expected to make long-standing characters sound like themselves. I therefore assume that each character's language patterns remain consistent throughout all seven seasons of the series.

Text classification is a rather tricky natural language processing task. Here I attempt to classify character dialogue from Star Trek: Deep Space 9 using various machine learning models as well as several pre-processing techniques. This problem is of particular interest because long-running TV series may have dozens of writers throughout their course, but those writers are tasked with making sure recurring characters still sound like themselves. Were the writers successful enough in doing this that a model will be able to discern between characters?

### Data and assumptions:

I scraped scripts from a fan transcript website and parsed them using BeautifulSoup. I constructed a DataFrame with each sentence labeled with the character, season, and episode title that the line was taken from. I also scraped IMDb for episode ratings with an eye toward future modeling projects.

As I did not transcribe the episodes, I am assuming that the fan transcriptions are accurate to the show. This assumption may be undermined by typos or other data-entry errors.

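To transform the data for modeling, I used both a CountVectorizer and a TfidfVectorizer. These are two different bag-of-words representations for NLP tasks: the first records raw token counts per document, and the second reweights those counts by inverse document frequency so that words common to every document carry less weight.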
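
To make the difference between the two representations concrete, here is a minimal pure-Python sketch of what each one computes. It is an illustration under stated assumptions, not the code used in this project: the three example lines are hypothetical, the whitespace tokenizer is a simplification, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` with unit-length rows is intended to follow scikit-learn's documented default weighting.

```python
import math
from collections import Counter

# Hypothetical example lines, not taken from the scraped transcripts.
docs = [
    "the station is secure",
    "the wormhole is open",
    "open the airlock",
]

def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real tokenizers)."""
    return text.lower().split()

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

count_matrix = [count_vector(doc) for doc in docs]

# Document frequency: the number of documents that contain each word.
n_docs = len(docs)
doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_vector(counts):
    """Weight counts by idf, then scale the row to unit Euclidean length."""
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf_matrix = [tfidf_vector(row) for row in count_matrix]

print(vocab)
print(count_matrix[0])
print([round(v, 3) for v in tfidf_matrix[0]])
```

In the first document, "the" and "secure" each appear once, so their counts are equal. Because "the" occurs in all three documents, its tf-idf weight is lower than that of "secure", which occurs in only one.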

As this is a classification task, my outcome variable is 'predicted speaker', and I am optimizing for accuracy.
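
As a reference point for what "accuracy" means here, the sketch below computes the fraction of lines whose predicted speaker matches the true speaker. The labels are hypothetical and chosen only for illustration. It also computes the accuracy of always guessing the most frequent speaker, an assumed baseline that the text above does not mention, because a raw accuracy figure is easier to interpret next to it.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true and predicted speakers for six lines of dialog.
y_true = ["SISKO", "KIRA", "SISKO", "ODO", "SISKO", "KIRA"]
y_pred = ["SISKO", "SISKO", "SISKO", "ODO", "KIRA", "KIRA"]

model_accuracy = accuracy(y_true, y_pred)

# Baseline: predict the most common true speaker for every line.
majority_label = Counter(y_true).most_common(1)[0][0]
baseline_accuracy = accuracy(y_true, [majority_label] * len(y_true))

print(model_accuracy, baseline_accuracy)
```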


Scraping the scripts had several stages:

In [None]:
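import urllib.error
import urllib.request

scr = []  # raw HTML of each episode page
# Try page IDs 401 through 575; IDs that return a 404 are skipped.
for ep in range(401, 576):
    url = "http://www.chakoteya.net/DS9/{}.htm".format(ep)
    try:
        scr.append(urllib.request.urlopen(url).read())
    except urllib.error.HTTPError as err:
        # Ignore missing pages, but re-raise any other HTTP error.
        if err.code != 404:
            raise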

In [None]:
from bs4 import BeautifulSoup

# Parse each downloaded page with the lxml parser.
many_soups = [BeautifulSoup(ep, "lxml") for ep in scr]

In [None]:
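import re
import nltk

# Splits a block of dialog into sentences (needs NLTK's punkt tokenizer data).
sent_tokenizer = nltk.tokenize.sent_tokenize
# Speaker label: an all-caps name, optionally with one separator character
# inside it (as in O'BRIEN), followed by ":" or by a bracketed note and ":".
pattern = re.compile(r'(\b[A-Z]+|([A-Z]+.[A-Z]+))(\:|\s\[.+\]\:)')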
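
To show what the speaker pattern captures, the sketch below applies it to a few sample lines. The pattern is copied from the cell above so the example runs on its own. The sample lines and the `split_speaker` helper are hypothetical, written for illustration rather than taken from the transcripts or the project code.

```python
import re

# Same pattern as in the cell above, reproduced so this example is self-contained.
pattern = re.compile(r'(\b[A-Z]+|([A-Z]+.[A-Z]+))(\:|\s\[.+\]\:)')

# Hypothetical transcript lines written for illustration only.
samples = [
    "SISKO: Report, Major.",
    "O'BRIEN: I can fix it.",
    "KIRA [OC]: Ops to Sisko.",
    "(The station shakes.)",
]

def split_speaker(line):
    """Return (speaker, dialog) if the line contains a speaker label, else None."""
    found = pattern.search(line)
    if found is None:
        return None
    # group(1) is the name; group(0) is the whole label including the colon.
    return found.group(1), line.replace(found.group(0), '').strip()

results = [split_speaker(line) for line in samples]
for line, result in zip(samples, results):
    print(repr(line), '->', result)
```

The first alternative in the pattern handles plain names, the second handles names with an internal separator such as an apostrophe, and the bracketed branch lets a note sit between the name and the colon without becoming part of the speaker key. Lines with no label, such as stage directions, return `None` and are skipped.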

In [None]:
import pandas as pd

frames = []  # one DataFrame per character per episode

for ep in many_soups:
    # ep.b is the first <b> tag on the page, used here as the episode title.
    episode_title = ep.b.string.replace('\r\n', ' ')

    # Flatten the page into text strings, replacing embedded '\r\n' breaks.
    array_of_strings = [s.replace('\r\n', ' ') for s in ep.stripped_strings]

    char_dict = {}  # speaker -> list of that speaker's sentences

    for string in array_of_strings:
        found = pattern.search(string)
        if found is not None:
            # Drop the speaker label, then split the dialog into sentences.
            stripped_string = string.replace(found.group(0), '').strip()
            key = found.group(1)
            for each in sent_tokenizer(stripped_string):
                char_dict.setdefault(key, []).append(each)

    for key in char_dict:
        temp_df = pd.DataFrame(char_dict[key], columns=['text'])
        temp_df['character'] = key
        temp_df['ep_title'] = episode_title
        frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end.
df = pd.concat(frames, ignore_index=True)

After scraping the scripts, I converted them to a DataFrame that stores each line of dialog, the character who spoke it, and the episode it appeared in, with an eye toward future modeling.

My full EDA can be found `here`: