# 1) Manually Guessing Output, Pick Most Common Class

## Load Dataset and Variables

In [1]:
import pandas as pd

In [2]:
major_dialog_data = pd.read_pickle("./datasets/major_dialog_data.pkl")
all_dialog_data = pd.read_pickle("./datasets/all_dialog_data.pkl")

labels = major_dialog_data.speaker

speaker_value_counts = all_dialog_data.speaker.value_counts()
major_speaker_value_counts = speaker_value_counts[speaker_value_counts > 40]
x, y = major_speaker_value_counts.index, major_speaker_value_counts.values

num_major_characters = 6

# names of the top characters, ordered from most to least frequent
major_characters = x[:num_major_characters]
# a set of those same names
major_characters_set = set(major_characters)

# finally, dicts mapping class names to ids and ids back to names
labels_to_ids = {}
ids_to_labels = {}
for i, major_character in enumerate(major_characters):
    labels_to_ids[major_character] = i
    ids_to_labels[i] = major_character
labels_to_ids

{'Rachel': 0, 'Ross': 1, 'Chandler': 2, 'Monica': 3, 'Joey': 4, 'Phoebe': 5}

## Baseline Model - Always predict the most common class

This is the simplest possible baseline: always predict the most frequent speaker. Any trained model must beat this accuracy.

In [3]:
import numpy as np

from collections import defaultdict, Counter
from nltk import word_tokenize
from w266_common import utils, vocabulary

In [4]:
# accuracy obtained by always predicting the most frequent speaker
np.mean(labels == major_characters[0])

0.18329390828794029
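
For context, uniform random guessing over six classes would score about 1/6 ≈ 0.167, so the majority-class baseline sits only slightly above chance. The same calculation can be sketched without pandas or NumPy; the helper name `majority_baseline` and the toy labels below are illustrative assumptions, not part of the notebook or its dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most_common_label, accuracy) for always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical toy labels, not drawn from the real dataset.
toy_labels = ["Rachel", "Ross", "Rachel", "Joey", "Monica", "Rachel", "Ross", "Phoebe"]
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

On the toy list, "Rachel" appears 3 times out of 8, so the baseline accuracy is 0.375.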

## Manually Guessing - Humans trying to predict who said what

Human guessing gives us a second baseline: it indicates how feasible it is to identify the speaker from a single line of text.

In [5]:
utterance_tokenized = [word_tokenize(sentence) for sentence in major_dialog_data.utterance]
vocab = vocabulary.Vocabulary(utils.canonicalize_word(w) for w in utils.flatten(utterance_tokenized))

In [6]:
human_check_df = pd.DataFrame()
human_check_df['utterance'] = major_dialog_data.utterance
human_check_df['utterance_tokenized'] = utterance_tokenized
human_check_df['speaker'] = major_dialog_data.speaker

### Sample dataset and try to manually guess

In [9]:
sample_n = 10

pd.options.display.max_colwidth = 1000

human_check_df_sample = human_check_df.sample(sample_n)

human_check_df_sample[['utterance', 'utterance_tokenized']]

# Uncomment to save the questions to CSV
# human_check_df_sample[['utterance', 'utterance_tokenized']].to_csv('./manual-guess/questions.csv')

Unnamed: 0,utterance,utterance_tokenized
18855,"So, what happens to the old guys?","[So, ,, what, happens, to, the, old, guys, ?]"
31085,"All right, he likes you back! Huh? Told ya, you should go for it!","[All, right, ,, he, likes, you, back, !, Huh, ?, Told, ya, ,, you, should, go, for, it, !]"
974,"Okay, a couple months late on the lecture, Ross.","[Okay, ,, a, couple, months, late, on, the, lecture, ,, Ross, .]"
14753,"The ones that got me the Porsche! Will you keep up! But I figured, if-if people keep seeing me just standing there, they’re gonna start to think that I don’t own it. So I figured I’ll wash it. Right? Monica, you got a bucket and some soap I can borrow?","[The, ones, that, got, me, the, Porsche, !, Will, you, keep, up, !, But, I, figured, ,, if-if, people, keep, seeing, me, just, standing, there, ,, they, ’, re, gon, na, start, to, think, that, I, don, ’, t, own, it, ., So, I, figured, I, ’, ll, wash, it, ., Right, ?, Monica, ,, you, got, a, bucket, and, some, soap, I, can, borrow, ?]"
11,Monica had lunch with Richard.,"[Monica, had, lunch, with, Richard, .]"
47705,I'm not getting you a muffin!,"[I, 'm, not, getting, you, a, muffin, !]"
23855,They don’t really talk to us about that kind of stuff. I can get you some free white out though.,"[They, don, ’, t, really, talk, to, us, about, that, kind, of, stuff, ., I, can, get, you, some, free, white, out, though, .]"
43922,"Na-uh, no, we are all responsible for our own babies.","[Na-uh, ,, no, ,, we, are, all, responsible, for, our, own, babies, .]"
32332,"Sweetie, you gotta relax. Everything’s gonna be great, okay? Come on. Come on.","[Sweetie, ,, you, got, ta, relax, ., Everything, ’, s, gon, na, be, great, ,, okay, ?, Come, on, ., Come, on, .]"
47572,Hi!,"[Hi, !]"


### Answers for the sample dataset

In [10]:
human_check_df_sample[['utterance', 'utterance_tokenized', 'speaker']]

# Uncomment to save the answers to CSV
# human_check_df_sample[['utterance', 'utterance_tokenized', 'speaker']].to_csv('./manual-guess/answers.csv')

Unnamed: 0,utterance,utterance_tokenized,speaker
18855,"So, what happens to the old guys?","[So, ,, what, happens, to, the, old, guys, ?]",Phoebe
31085,"All right, he likes you back! Huh? Told ya, you should go for it!","[All, right, ,, he, likes, you, back, !, Huh, ?, Told, ya, ,, you, should, go, for, it, !]",Joey
974,"Okay, a couple months late on the lecture, Ross.","[Okay, ,, a, couple, months, late, on, the, lecture, ,, Ross, .]",Rachel
14753,"The ones that got me the Porsche! Will you keep up! But I figured, if-if people keep seeing me just standing there, they’re gonna start to think that I don’t own it. So I figured I’ll wash it. Right? Monica, you got a bucket and some soap I can borrow?","[The, ones, that, got, me, the, Porsche, !, Will, you, keep, up, !, But, I, figured, ,, if-if, people, keep, seeing, me, just, standing, there, ,, they, ’, re, gon, na, start, to, think, that, I, don, ’, t, own, it, ., So, I, figured, I, ’, ll, wash, it, ., Right, ?, Monica, ,, you, got, a, bucket, and, some, soap, I, can, borrow, ?]",Joey
11,Monica had lunch with Richard.,"[Monica, had, lunch, with, Richard, .]",Phoebe
47705,I'm not getting you a muffin!,"[I, 'm, not, getting, you, a, muffin, !]",Ross
23855,They don’t really talk to us about that kind of stuff. I can get you some free white out though.,"[They, don, ’, t, really, talk, to, us, about, that, kind, of, stuff, ., I, can, get, you, some, free, white, out, though, .]",Chandler
43922,"Na-uh, no, we are all responsible for our own babies.","[Na-uh, ,, no, ,, we, are, all, responsible, for, our, own, babies, .]",Phoebe
32332,"Sweetie, you gotta relax. Everything’s gonna be great, okay? Come on. Come on.","[Sweetie, ,, you, got, ta, relax, ., Everything, ’, s, gon, na, be, great, ,, okay, ?, Come, on, ., Come, on, .]",Ross
47572,Hi!,"[Hi, !]",Rachel


We manually guessed the speaker for 100 utterances and got 24 right (24% accuracy).

* One-word lines were hard to guess
* The easiest lines to guess were memorable ones from the TV show (domain knowledge)
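
With only 100 guesses, the 24% figure carries noticeable sampling uncertainty. The sketch below scores a list of guesses against the true speakers and puts a 95% Wilson score interval around the accuracy. The helper names `guess_accuracy` and `wilson_interval` and the three-item example are assumptions made for illustration; the notebook itself does not show how the manual guesses were scored.

```python
import math


def guess_accuracy(guesses, answers):
    """Fraction of guesses that match the true speaker at the same position."""
    if len(guesses) != len(answers) or not answers:
        raise ValueError("guesses and answers must be non-empty and the same length")
    correct = sum(g == a for g, a in zip(guesses, answers))
    return correct / len(answers)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Hypothetical guesses versus true speakers for three utterances.
example_accuracy = guess_accuracy(["Joey", "Ross", "Monica"], ["Joey", "Rachel", "Monica"])

# The counts reported above: 24 correct out of 100.
low, high = wilson_interval(24, 100)
print(example_accuracy, round(low, 3), round(high, 3))
```

For 24 correct out of 100 the interval runs from roughly 0.17 to 0.33, which still includes the majority-class accuracy of about 0.183. A larger manual sample would be needed before concluding that human guessing clearly beats that baseline.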