## Capstone Technical Report

Caroline Schmitt
12/18/17

### Problem statement:

Text classification can be a difficult natural language processing task. Its applications are broad, from comparing one's prose style to that of famous authors[1](https://iwl.me/about/) to identifying speakers over wiretaps[2](https://www.osti.gov/scitech/servlets/purl/11824). For this project I attempted to build a classification model for dialog on the TV show Star Trek: Deep Space Nine. Classifying TV dialog is an especially interesting task because TV shows often have dozens of writers who come and go, some staying for seasons at a time and some writing only one or two episodes. Each writer is nonetheless expected to make long-standing characters sound like themselves, so I assume that each character's language patterns stay consistent throughout all seven seasons of the series.

The classes I chose were the ten characters with the most lines across all seven seasons of the series. The baseline accuracy for this task (the score obtained by always predicting the most frequent speaker) is about `.20`. My best model reached an accuracy of about `.30`, and I am confident that with further testing I can improve it further.

### Data acquisition:

I used `urllib.request` and `BeautifulSoup` to scrape fan transcripts, converted them to a pandas DataFrame, and then saved that to a `.csv` file for storage. Each sentence of dialog is stored on its own row, tagged with the character, the season, and the title of the episode that the line was taken from.

As I did not transcribe the episodes myself, I am assuming that the fan transcriptions are accurate to the show. This assumption may be undermined by typos, other data-entry errors, or the transcriptionist having misheard something.

Scraping the scripts had several stages:

In [None]:
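import urllib.error
import urllib.request

scr = []
# Episode pages are numbered 401 through 575 on the transcript site
for ep in range(401, 576):
    url = "http://www.chakoteya.net/DS9/{}.htm".format(ep)
    try:
        scr.append(urllib.request.urlopen(url).read())
    except urllib.error.HTTPError as err:
        # Skip episode numbers that have no page; surface any other HTTP error
        if err.code != 404:
            raise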

In [None]:
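from bs4 import BeautifulSoup

# Parse each downloaded page into a BeautifulSoup tree
many_soups = [BeautifulSoup(ep, "lxml") for ep in scr]

In [None]:
import re

import nltk

sent_tokenizer = nltk.tokenize.sent_tokenize
# Speaker cue of the form "NAME:" or "NAME [note]:"; group(1) holds the speaker name
pattern = re.compile(r'(\b[A-Z]+|([A-Z]+.[A-Z]+))(\:|\s\[.+\]\:)')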

In [None]:
import pandas as pd

frames = []  # one DataFrame per (episode, character) pair

for ep in many_soups:
    # The episode title is taken from the first <b> element on the page
    episode_title = ep.b.string.replace('\r\n', ' ')

    # Flatten the page into text fragments; the loop variable is not called
    # `string` so that it does not shadow the `string` module used by `cleaner`
    fragments = [s.replace('\r\n', ' ') for s in ep.stripped_strings]

    char_dict = {}  # speaker name -> list of that speaker's sentences
    for fragment in fragments:
        found = re.search(pattern, fragment)
        if found is not None:
            # Drop the speaker cue, then split the remaining dialog into sentences
            dialog = fragment.replace(found.group(0), '').strip()
            key = found.group(1)
            char_dict.setdefault(key, []).extend(sent_tokenizer(dialog))

    for key, sentences in char_dict.items():
        temp_df = pd.DataFrame(sentences, columns=['text'])
        temp_df['character'] = key
        temp_df['ep_title'] = episode_title
        frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)
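
To illustrate what the speaker pattern captures, here is a minimal sketch that applies the same regular expression to a few sample fragments. The sample lines and the `split_speaker` helper are invented for illustration and are not taken from the transcripts or from the original notebook.

```python
import re

# Same pattern as in the scraping cell
pattern = re.compile(r'(\b[A-Z]+|([A-Z]+.[A-Z]+))(\:|\s\[.+\]\:)')


def split_speaker(fragment):
    """Return (speaker, dialog) if the fragment contains a speaker cue, else None."""
    found = pattern.search(fragment)
    if found is None:
        return None
    # group(0) is the whole cue (e.g. "KIRA [OC]:"); group(1) is just the name
    return found.group(1), fragment.replace(found.group(0), '').strip()


# Hypothetical fragments, made up to exercise each branch of the pattern
samples = [
    "SISKO: Report, Major.",         # plain cue
    "KIRA [OC]: Ops to Sisko.",      # cue followed by a bracketed note
    "O'BRIEN: I'll have it fixed.",  # name containing an apostrophe
    "[Ops]",                         # scene heading, no speaker
]
results = [split_speaker(s) for s in samples]
```

Fragments without a cue, such as scene headings, return `None` and are skipped, which is why only spoken lines end up in the DataFrame.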

### Data transformation:

The first thing I did with the data was limit it to the ten characters who spoke the most dialog:

In [None]:
common_chars = df['character'].value_counts()[:10].index
common_chars_df = df.loc[df['character'].isin(common_chars)]

I also constructed five subsets of the data based on sentence length to test whether longer or shorter sentences resulted in more accurate models:

In [None]:
from nltk.tokenize import word_tokenize

# Tokenize every sentence once and reuse the counts for each threshold
token_counts = common_chars_df['text'].apply(lambda line: len(word_tokenize(line)))

longer_than_5_df = common_chars_df[token_counts > 5]
longer_than_8_df = common_chars_df[token_counts > 8]
longer_than_10_df = common_chars_df[token_counts > 10]
longer_than_15_df = common_chars_df[token_counts > 15]
longer_than_20_df = common_chars_df[token_counts > 20]

The baseline accuracies for each sub-dataframe were not dramatically different:

- `0.200860378313`
- `0.205037088149`
- `0.205051348828`
- `0.205413422321`
- `0.199673202614`
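
As a minimal sketch of how a baseline like this can be computed, assuming the baseline is the share of the most frequent class (that is, the accuracy of always guessing the most common speaker), the snippet below uses a small invented label list. The labels and counts are hypothetical and do not reflect the real dataset.

```python
from collections import Counter

# Hypothetical label column, invented for illustration
labels = ["SISKO"] * 4 + ["KIRA"] * 3 + ["BASHIR"] * 2 + ["QUARK"]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right this fraction of the time
baseline_accuracy = majority_count / len(labels)
```

On a pandas column, the same quantity should be obtainable with `value_counts(normalize=True).max()`.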

I used `LabelEncoder` to convert my categorical variables into numbers. I also adapted a `cleaner` function I had used before, giving it a substantial stopword list that includes many of the most common words in the dataset as well as many of the rarest, so as to reduce overfitting:

In [None]:
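import string

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # created once rather than on every call


def cleaner(text):
    # stopwords_list is the custom stopword list described above (defined elsewhere)
    stop = stopwords_list
    # Remove punctuation and digits, then lowercase
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))
    text = text.lower().strip()
    # Drop stopwords and stem what remains; split() already strips whitespace
    final_text = [stemmer.stem(w) for w in text.split() if w not in stop]
    return ' '.join(final_text)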

The `cleaner` preprocessor was used in conjunction with `CountVectorizer` and `TfidfVectorizer`, both of which turn text into bag-of-words features: the former as raw term counts, the latter as counts reweighted by how rare each term is across documents.
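
The difference between raw counts and tf-idf weights can be sketched in plain Python. This is a simplified illustration using a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and it omits the row normalization that scikit-learn applies by default. The three documents are invented for the example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration
docs = [
    "the station is secure",
    "the wormhole is open",
    "the station is under attack",
]

tokenized = [d.split() for d in docs]
counts = [Counter(tokens) for tokens in tokenized]  # bag-of-words counts per document

n_docs = len(docs)
# Number of documents each term appears in
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(term, doc_counts):
    """Raw count times a smoothed inverse document frequency."""
    idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return doc_counts[term] * idf


common_weight = tfidf("the", counts[1])     # appears in every document
rare_weight = tfidf("wormhole", counts[1])  # appears in only one document
```

Both terms have a raw count of 1 in the second document, but the rarer term receives the larger tf-idf weight, which is the property that makes tf-idf useful for downweighting words every character says.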

### Modeling

As this is a classification task, my outcome variable is 'predicted speaker', and I am optimizing for accuracy. I tested many different models and found that most were only slightly better than baseline, even after gridsearching. The models that stood out in performance were a `LogisticRegression` model and a `MultinomialNB` model.

The `MultinomialNB` model had the best validation score, `.30252`. Multinomial naive Bayes is considered a strong model for NLP classification tasks[3](https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).

In [None]:
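from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mnb_pipe = make_pipeline(
    CountVectorizer(preprocessor=cleaner),
    MultinomialNB()
)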

`LogisticRegression` had the second-best validation score, `.28324`. Even after extensive grid searching for optimal hyperparameters, this was the best-performing logistic regression model:

In [None]:
from sklearn.linear_model import LogisticRegressionCV

lr_pipe = make_pipeline(
    CountVectorizer(preprocessor=cleaner),
    LogisticRegressionCV()
)

These two models improve on the baseline by approximately .10 and .08 respectively, which isn't bad for a 10-class classification task.
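
As a rough sketch of how a validation score for such a pipeline might be obtained, the example below encodes speaker labels, holds out part of the data, and scores a count-based naive Bayes pipeline on the held-out lines. The hold-out split, the `test_size` and `random_state` values, and the toy lines are all assumptions made for illustration; the report does not state which validation scheme was used, and the custom `cleaner` preprocessor is left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical lines and speakers, invented for illustration
lines = [
    "set a course for the wormhole", "report to ops at once",
    "this station is under my command", "dismissed",
    "the prophets will guide us", "bajor has suffered enough",
    "i fought in the resistance", "the occupation is over",
    "the bar is open for business", "profit is its own reward",
    "that will cost you extra latinum", "the rules of acquisition are clear",
]
speakers = ["SISKO"] * 4 + ["KIRA"] * 4 + ["QUARK"] * 4

# Convert speaker names to integer class labels
encoder = LabelEncoder()
y = encoder.fit_transform(speakers)

# Hold out a quarter of the lines, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    lines, y, test_size=0.25, stratify=y, random_state=42
)

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)  # fraction of held-out lines classified correctly
predicted = encoder.inverse_transform(pipe.predict(X_test))  # back to speaker names
```

With so few lines the score itself is meaningless; the point is only the shape of the workflow, which stays the same when `mnb_pipe` or `lr_pipe` is substituted for `pipe`.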

### Next steps:

My next steps for this model will be to evaluate convolutional neural network models; the size of the dataset means this will take some time. I also plan to experiment further with stop words and tokenization, and to add more variables (such as polarity).

Additionally, I would like to select the most 'generic' lines in my dataset for further EDA. There are likely some sentences that multiple characters say throughout the course of the series, such as "Get help" or other such generic lines, and it may be that identifying and removing such lines improves the predictive accuracy of my model, so long as such feature selection and engineering doesn't veer into something like p-hacking.
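
One possible way to find such generic lines, sketched here under assumptions of my own, is to normalize each sentence and count how many distinct characters say it. The rows, the `normalize` helper, and the `MIN_SPEAKERS` threshold are all hypothetical and would need tuning against the real data.

```python
from collections import defaultdict

# Hypothetical (character, sentence) pairs, invented for illustration
rows = [
    ("SISKO", "Get help."),
    ("KIRA", "Get help!"),
    ("QUARK", "Get help."),
    ("ODO", "Yes."),
    ("KIRA", "Yes."),
    ("QUARK", "That will cost you extra."),
]


def normalize(sentence):
    """Lowercase and drop punctuation so near-identical lines collapse together."""
    kept = (ch for ch in sentence.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


# Map each normalized sentence to the set of characters who say it
speakers_by_line = defaultdict(set)
for character, sentence in rows:
    speakers_by_line[normalize(sentence)].add(character)

MIN_SPEAKERS = 3  # hypothetical threshold for calling a line "generic"
generic_lines = {
    line: who for line, who in speakers_by_line.items() if len(who) >= MIN_SPEAKERS
}
```

Fixing the threshold before looking at model scores is one way to keep this kind of filtering from drifting toward p-hacking.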