# Manual Annotation
This notebook contains code for extracting utterances for manual annotation. First, the data is explored to determine suitable selection criteria for the utterances to annotate. Next, utterances that contain "[my/your/his/her] [kinship_term]" are extracted, together with their surrounding utterances, as a list of tuples of dictionaries and saved to a pickle (.pkl) file.

In [None]:
import pickle
with open("Preprocessed_html_data.pkl", "rb") as infile:
    data = pickle.load(infile)

In [None]:
my = ["you are my", "you're my", "he is my", "he's my", "she is my", "she's my", "they are my"]
your = ["he is your", "he's your", "she is your", "she's your", "they are your", "I am your", "we are your"]
her = ["he is her", "he's her", "she is her", "she's her", "they are her", "I am her", "we are her"]
his = ["he is his", "he's his", "she is his", "she's his", "they are his", "I am his", "we are his"]
kinships = ["brother", "sister", "father", "mother", "cousin", "son", "daughter", "child", "parents", "children", "baby","mom","dad"]
count = 0
for d in data:
    utterance = d["utt"]
    for kin in kinships:
        for phrases in [my, your, her, his]:
            for phrase in phrases:
                # Substring match on "[pronoun] [be] [possessive] [kinship_term]"
                if phrase + " " + kin in utterance:
                    count += 1
                    print(utterance)
print(count)

In [None]:
s = "ohofsosndf"
s[1:-1]

##### 5 instances of "[pronoun] [be] [my/your/his/her] [kinship_term]"

In [None]:
count = 0
pronouns = ["my", "your", "his", "her"]
kinships = ["brother", "sister", "father", "mother", "cousin", "son", "daughter", "child", "parents", "children", "baby","mom","dad"]
for d in data:
    source = d["source"]
    utterance = d["utt"]
    tok_utt = d["tok_utt"]
    for kin in kinships:
        for p in ["my"]:  # swap in `pronouns` to count all four possessives
            if p + " " + kin in utterance:
                print(utterance)
                count+=1
print(count)

##### 478 instances of "[my/your/his/her] [kinship_term]"; 242 of "my [kinship_term]"
This pattern should pick out the source (mention) and the kinship relation fairly reliably, so let's start there.
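
As a rough illustration of how the pattern could yield source-relation pairs, here is a minimal sketch using a regular expression instead of the substring checks in the cells above. The names `KIN_PATTERN` and `find_kinship_mentions` are hypothetical, the vocabularies are shortened copies of the lists in this notebook, and the example record is made up; only the `"source"` and `"utt"` keys come from the data.

```python
import re

# Assumed vocabularies; shortened copies of the lists used in this notebook.
PRONOUNS = ["my", "your", "his", "her"]
KINSHIPS = ["brother", "sister", "father", "mother", "cousin", "son", "daughter"]

# Word boundaries keep "my son" from matching inside "my sonic ...".
KIN_PATTERN = re.compile(
    r"\b(" + "|".join(PRONOUNS) + r")\s+(" + "|".join(KINSHIPS) + r")\b",
    flags=re.IGNORECASE,
)

def find_kinship_mentions(record):
    """Return (source, pronoun, kinship_term) triples for one utterance dict."""
    return [
        (record["source"], m.group(1).lower(), m.group(2).lower())
        for m in KIN_PATTERN.finditer(record["utt"])
    ]

# Made-up record with the same keys as the entries in `data`.
example = {"source": "Monica", "utt": "This is my brother, and that's her son."}
mentions = find_kinship_mentions(example)
print(mentions)
```

Unlike the substring test, this also matches capitalised pronouns ("My brother") and skips partial-word hits, so its counts would not necessarily agree with the numbers reported above.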

One idea now is to turn this into an ML problem: if we label dialogues with these relation-source pairs, we may be able to tease out more information. Was that the idea all along?

The other idea is to use this as a gold indication that there MIGHT be information about the target of the relation nearby. Over a whole corpus, some of these matches (definitely not all) will give us actual names/referents.

So maybe we start from something like this and then accumulate evidence. When Monica says "my brother" and "Ross" is mentioned in one of the surrounding sentences, that's some evidence. If that happens more than once (or more than twice, whatever threshold we pick), maybe that's enough. If Monica says "my brother" and Ross also speaks in the same dialogue, that's also evidence, though maybe not as strong.

If we can find other gold indicators of kinship, that gets us closer. I guess that's the idea behind using kinship terms for distant labelling.
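
The evidence-gathering idea above could be sketched roughly as follows. This is only an illustration under assumptions: `collect_evidence`, its `window` parameter, and the candidate-name list are hypothetical, and the three-line dialogue is made-up toy data rather than anything from the corpus. It keeps two separate counts, one for names mentioned near the phrase and one for candidates who speak near it, since the notes suggest these may carry different weight.

```python
from collections import Counter

def collect_evidence(dialogue, phrase, candidates, window=1):
    """Count candidate names that occur near utterances containing `phrase`.

    dialogue: list of {"source": speaker, "utt": text} dicts (assumed shape).
    Returns two Counters keyed by (speaker, candidate): names mentioned in the
    surrounding utterances, and candidates who themselves speak nearby.
    """
    mentioned, spoke = Counter(), Counter()
    for i, d in enumerate(dialogue):
        if phrase not in d["utt"].lower():
            continue
        # Context = the matching utterance plus `window` utterances either side.
        lo, hi = max(0, i - window), min(len(dialogue), i + window + 1)
        context = dialogue[lo:hi]
        for name in candidates:
            if name == d["source"]:
                continue  # a speaker is not their own relative
            if any(name.lower() in c["utt"].lower() for c in context):
                mentioned[(d["source"], name)] += 1
            if any(c["source"] == name for c in context):
                spoke[(d["source"], name)] += 1
    return mentioned, spoke

# Toy dialogue, invented for illustration only.
dialogue = [
    {"source": "Rachel", "utt": "Where is Ross?"},
    {"source": "Monica", "utt": "My brother is late again."},
    {"source": "Ross", "utt": "Sorry, traffic."},
]
mentioned, spoke = collect_evidence(
    dialogue, "my brother", ["Ross", "Rachel", "Monica"]
)
print(mentioned, spoke)
```

Note that the "spoke nearby" count also picks up Rachel here, which illustrates why that signal is probably weaker than an explicit mention; a threshold over the whole corpus would be needed before treating either count as a label.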

In [None]:
count = 0
for d in data:
    source = d["source"]
    utterance = d["utt"]
    tok_utt = d["tok_utt"]
    for kin in kinships:
        if kin in tok_utt:
            count += 1
print(count)

##### 1119 instances of kinship terms in data

In [None]:
count = 0
pronouns = ["my", "your", "his", "her"]
kinships = ["brother", "sister", "father", "mother", "cousin", "son", "daughter", "child", "parents", "children", "baby","mom","dad"]
for d in data:
    source = d["source"]
    utterance = d["utt"]
    tok_utt = d["tok_utt"]
    for i, tok in enumerate(tok_utt):
        # Possessive pronoun with a kinship term among the next two tokens
        # (allows one intervening word); slicing avoids an IndexError at the end.
        if tok in pronouns and any(t in kinships for t in tok_utt[i+1:i+3]):
            count += 1
print(count)

## Extracting data for annotation
This extracts all utterances with "[my/your/his/her] [kinship_term]", along with the two utterances before and the two utterances after each match.
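
Because the extraction loop uses a fixed index range, utterances within the first or last few positions of `data` are never considered. One possible alternative, sketched here as an assumption rather than as the method used in this notebook, is a small helper that clips the window at the list boundaries. The name `context_window` and its parameter `k` are hypothetical.

```python
def context_window(items, i, k=2):
    """Return items[i-k] .. items[i+k], clipped at the ends of the list."""
    start = max(0, i - k)
    end = min(len(items), i + k + 1)
    return items[start:end]

# Placeholder strings stand in for utterance dicts.
utterances = ["u0", "u1", "u2", "u3", "u4", "u5"]
middle = context_window(utterances, 3)  # two before, the hit, two after
first = context_window(utterances, 0)   # no earlier context exists
print(middle, first)
```

The trade-off is that windows near the edges are shorter than five items, so downstream code could no longer unpack every entry as a fixed 5-tuple.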

In [None]:
count = 0
pronouns = ["my", "your", "his", "her"]
kinships = ["brother", "sister", "cousin", "mother", "father", "mom", "dad", "son",
            "daughter", "niece", "nephew", "twin", "aunt", "uncle", "child", "parent"]

my_kinship_lines = []  # list of 5-tuples of dicts (prev_prev, prev, current, next, next_next); current contains "[poss_pronoun] [kinship_term]"

for i in range(2, len(data)-2):  # start at 2 so data[i-2] can be the first utterance
    prev_prev_utt = data[i-2]
    prev_utt = data[i-1]
    d = data[i]
    next_utt = data[i+1]
    next_next_utt = data[i+2]
    
    source = d["source"]
    utterance = d["utt"]
    tok_utt = d["tok_utt"]
    # Append each matching utterance once, even if it contains several
    # "[possessive] [kinship_term]" phrases.
    if any(p + " " + kin in utterance for kin in kinships for p in pronouns):
        my_kinship_lines.append((prev_prev_utt, prev_utt, d, next_utt, next_next_utt))


In [None]:
s = "They're my new 'I don't need a job, I don't need my parents, I've got great boots' boots!"
print()

In [None]:
len(my_kinship_lines)

In [None]:
import pickle
with open("annotation_kinship_utterances.pkl", "wb") as outfile:
    pickle.dump(my_kinship_lines, outfile)

In [None]:
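import pickle
# Reload the saved tuples (without overwriting `data`) and print each
# context window as "speaker: utterance".
with open("annotation_kinship_utterances.pkl", "rb") as infile:
    my_kinship_lines = pickle.load(infile)

for t in my_kinship_lines:
    for d in t:
        source = d["source"]
        utt = d["utt"]
        print(f"{source}: {utt}")
    print()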