Extracts verbalised instructions such as "connect Mount Basel to Montreux" and pairs each with the follow-up action, which may *match* the instruction (e.g. the other participant connects Basel to Montreux) or *mismatch* it (e.g. the other participant connects Basel to Neuchatel).

This notebook produces an annotated corpus with one tab-separated text file per team at [annotated_corpus/](../processed_data/annotated_corpus/).

In particular, the columns are:

* *team_no*: The number of the team that the event belongs to
* *attempt_no*: The number of the attempt that the event belongs to, starting from 1. An attempt is the period during which the team constructs a solution and submits it together.
* *turn_no*: The number of the turn that the event belongs to, starting from 1. A turn is the period during which one participant is in the figurative view and the other is in the abstract view.
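* *utterance_no*: The number of the utterance that the event corresponds to, starting from 0 (-1 for events that are not transcribed participant utterances, e.g. robot speech or button presses)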
* *start*: The start timestamp of the event (in seconds), from the beginning of the activity
* *end*: The end timestamp of the event (in seconds)
* *subject*: The subject that the event is executed by (*A*, *B*: the participants; *R*: the robot; *T*: the team)
* *verb*: The verb that describes the event (e.g. "presses", "adds", "removes")
* *object*: The object that the subject acts on with the verb (e.g. "submit (enabled)" for subject *A* and verb "presses")
* *instructions*: The list of instructions that are inferred for this event
* *pending_instructions*: The list of instructions that are pending/cached to be matched with an action
* *matching*: The acts for this event, including the result of matching the pending instructions with the action (if this event is an action)
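
As a minimal, dependency-free sketch of how one of these tab-separated files could be read back, the snippet below parses a single sample row with the standard `csv` module. The sample row is copied from the team 10 output shown later in this notebook; reading from an in-memory string instead of a file on disk is an assumption made to keep the example self-contained.

```python
import csv
import io

# One header line and one data row, tab-separated, in the exported column order.
sample = (
    "team_no\tattempt_no\tturn_no\tutterance_no\tstart\tend\t"
    "subject\tverb\tobject\tinstructions\tpending_instructions\tmatching\n"
    "10\t1\t1\t-1\t56.192\t56.192\tA\tadds\tZermatt-Davos (4-9)\t[]\t[]\t"
    "[DO_A(ADD(4,9)), NONMATCH_A(DO_A(ADD(4,9)))]\n"
)

# DictReader maps each row to {column name: string value}.
rows = list(csv.DictReader(io.StringIO(sample), delimiter='\t'))

for row in rows:
    print(row['subject'], row['verb'], row['object'], row['matching'])
```

Note that the list-valued columns come back as plain strings; the pickle file exported at the end of the notebook is the way to recover the original objects.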

In [1]:
import pickle

import pathlib as pl
import pandas as pd

import spacy
from spacy import displacy
from spacy.pipeline import EntityRuler

from read_utils import read_tables, read_network
from match_utils import \
    Instruction, Do, \
    Match, Mismatch, Nonmatch, \
    make_edit_action

In [2]:
# # Download spacy's English package if not done before
# !python3 -m spacy download en_core_web_sm

## Define paths.

In [3]:
# Inputs.
data_dir = pl.Path('../data')
network_file = data_dir.joinpath('metadata/network.json')

processed_data_dir = pl.Path('../processed_data')
corpus_dir = processed_data_dir.joinpath('corpus')

# Outputs.
annot_corpus_dir = processed_data_dir.joinpath('annotated_corpus')
annot_corpus_pickle_file = annot_corpus_dir.joinpath(
    'justhing19_annotated_corpus.pickle')

for d in [annot_corpus_dir]:
    if not d.exists():
        d.mkdir()
        print('Created {}'.format(d))

Created ../processed_data/annotated_corpus


## Load data.

### Load corpus tables (logs with transcripts).

In [4]:
corpus_dfs = read_tables(corpus_dir, form='transcript')

Reading transcript files from ../processed_data/corpus.
transcript 10 files found.
File justhink19_corpus_07 belongs to team  7
File justhink19_corpus_08 belongs to team  8
File justhink19_corpus_09 belongs to team  9
File justhink19_corpus_10 belongs to team 10
File justhink19_corpus_11 belongs to team 11
File justhink19_corpus_17 belongs to team 17
File justhink19_corpus_18 belongs to team 18
File justhink19_corpus_20 belongs to team 20
File justhink19_corpus_28 belongs to team 28
File justhink19_corpus_47 belongs to team 47
Transcript of  7 has 1059 utterances
Transcript of  8 has  932 utterances
Transcript of  9 has 1076 utterances
Transcript of 10 has  769 utterances
Transcript of 11 has  910 utterances
Transcript of 17 has  451 utterances
Transcript of 18 has  506 utterances
Transcript of 20 has  653 utterances
Transcript of 28 has  490 utterances
Transcript of 47 has  642 utterances


### Load the background network.

In [5]:
network = read_network(network_file)
print('Network read from {}: {} nodes, {} edges'.format(
    network_file, network.number_of_nodes(), network.number_of_edges()))

Network read from ../data/metadata/network.json: 10 nodes, 20 edges


## Prepare parsers and rulers.

### Parser for edge objects in the extended transcripts.

In [6]:
def parse_edge_object(obj, names=False):
    '''Parses an edit event object string into its node ids (or node names),
    e.g. 'Zurich-Gallen (2-8)' to [2, 8].'''
    if names:  # Parse for names.
        (u, v) = obj.split()[0].split('-')
    else:  # Parse for node indices.
        (u, v) = obj.split()[1].strip('(').strip(')').split('-')
        u = int(u)
        v = int(v)
    return [u, v]


# Try
obj = 'Zurich-Gallen (2-8)'
parse_edge_object(obj)

[2, 8]
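
A stricter variant of this parser could validate the whole string with a regular expression and return both the names and the ids at once. The sketch below is only an illustration: `parse_edge` and `EDGE_PATTERN` are hypothetical names, and the pattern assumes single-word node names without hyphens, as in the example above.

```python
import re

# Assumed format: '<name>-<name> (<id>-<id>)', e.g. 'Zurich-Gallen (2-8)'.
EDGE_PATTERN = re.compile(r'^(\w+)-(\w+) \((\d+)-(\d+)\)$')


def parse_edge(obj):
    """Return ((u_name, v_name), (u_id, v_id)) or raise ValueError."""
    m = EDGE_PATTERN.match(obj)
    if m is None:
        raise ValueError('Not an edge object: {!r}'.format(obj))
    u_name, v_name, u, v = m.groups()
    return (u_name, v_name), (int(u), int(v))


names, ids = parse_edge('Zurich-Gallen (2-8)')
print(names, ids)
```

Raising on malformed input makes it easier to spot log rows whose object is not an edge, rather than failing later with an unpacking error.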

### Define keywords to detect instruction entities.

In [7]:
entity_keywords = {
    'ADD': {
        'add',    # "adding zurich to bern ."
        'do',
        'go',
        'put',    # "putting that one there"
        'connect',
        'build',  # "i'll build mount luzern to zermatt"
    },
    'REMOVE': {
        'remove',
        "delete",  # "okay so delete that ."
        'erase',
        'cut',    # 'yeah then cut out mount basel to mount interlaken .'
        'away',   # 'take away',
        'rub',    # 'rub that out',
        # as in "it's 3 francs rub that out" ; 
        # "no wait let me rub that out again ." ; 
        # "oh then rub that out"
    },
}

for k, words in entity_keywords.items():
    print(k, words)

ADD {'connect', 'add', 'go', 'put', 'do', 'build'}
REMOVE {'erase', 'away', 'cut', 'delete', 'rub', 'remove'}
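
To illustrate how such a keyword table maps words to labels, here is a small sketch that inverts the dictionary and tags whitespace-separated tokens. It does not use spaCy; `keyword_to_label` and `tag_tokens` are hypothetical helpers, and splitting on whitespace is a simplifying assumption that works for the transcripts' pre-tokenised style (e.g. "okay so delete that .").

```python
entity_keywords = {
    'ADD': {'add', 'do', 'go', 'put', 'connect', 'build'},
    'REMOVE': {'remove', 'delete', 'erase', 'cut', 'away', 'rub'},
}

# Invert the table: keyword -> label.
keyword_to_label = {
    word: label
    for label, words in entity_keywords.items()
    for word in words
}


def tag_tokens(text):
    """Return (token, label) pairs for tokens that are exactly a keyword."""
    return [(tok, keyword_to_label[tok])
            for tok in text.lower().split()
            if tok in keyword_to_label]


print(tag_tokens("okay so delete that ."))
print(tag_tokens("then rub that out and then go , interlaken ."))
```

This lookup matches whole tokens only, so an inflected form such as "putting" is not tagged unless it is added as a keyword or the matching is done on lemmas.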


### Define function to recognise instructions from an utterance.

In [8]:
def prepare_ruler(network, entity_keywords):
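    # Note: EntityRuler(nlp) together with nlp.add_pipe(ruler) below is the
    # spaCy v2 API; spaCy v3 instead expects nlp.add_pipe("entity_ruler").
    nlp = spacy.load("en_core_web_sm", disable=["ner"])
    ruler = EntityRuler(nlp)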

    node_ids = list()
    node_patterns = list()
    for u, d in network.nodes(data=True):
        word = d['label'].split()[-1]
        identifier = str(u)
        pattern = {'id': identifier, 'label': 'NODE', "pattern": [
            {'LOWER': word.lower()}]}
        node_patterns.append(pattern)
        node_ids.append(identifier)

    entity_ids = list()
    entity_patterns = list()
    for label, words in entity_keywords.items():
        for word in words:
            identifier = str(label) 
            pattern = {'id': identifier, 'label': label, "pattern": [
                {'LOWER': word.lower()}]}
            entity_patterns.append(pattern)
            entity_ids.append(identifier)

    patterns = [
        *node_patterns,
        *entity_patterns,
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    return nlp, node_ids, entity_ids


def get_node_ids(text, doc):
    node_ids = [int(ent.ent_id_) for ent in doc.ents if ent.label_ == 'NODE']
    return node_ids


def recognise_instructions(text, node_ids, entity_ids, nlp,
                           default_entity_id='ADD'):
    '''Infers the instructions (intended edit actions) verbalised in a text.'''
    doc = nlp(text)

    instructions = list()
    # an instruction template [add/remove, [node1, node2]]
    template = [None, [None, None]]  
    for ent in doc.ents:
        ent_id = ent.ent_id_

        if ent_id in entity_ids:
            if template[0] is None:
                template[0] = ent_id
            else: # Begin inferring a new instruction.
                # Add the currently inferred instruction to the list.
                if template[1][0] is not None:
                    instruction = make_edit_action(template[0], template[1])
                    if instruction is not None:
                        instructions.append(instruction)
                # Create a new instruction.
                template = [ent_id, [None, None]]

        elif ent_id in node_ids:
            if template[1][0] is None:
                template[1][0] = ent_id
            elif template[1][1] is None:
                if ent_id != template[1][0]:
                    template[1][1] = ent_id
                # If no verb was detected, reuse the verb of the previous
                # instruction, or fall back to the default.
                if template[0] is None:
                    if len(instructions) > 0:
                        template[0] = instructions[-1].name
                    else:
                        template[0] = default_entity_id

                instruction = make_edit_action(template[0], template[1])
                if instruction is not None:
                    instructions.append(instruction)
                template = [None, [None, None]]

    if template[1][0] is not None:
        if template[0] is None:  # Assume the default verb if none was detected.
            template[0] = default_entity_id
        instruction = make_edit_action(template[0], template[1])
        if instruction is not None:
            instructions.append(instruction)

    return instructions


# Try (only the last assignment to `text` is used; the others are alternatives).
text = "go from basel to zurich and then from zurich to saint gallen ."
text = "then rub that out and then go , interlaken to mount bern ."
text = "okay rub it out and go bern to interlaken ."
text = 'is that how much that ?'
text = "how do i get off this screen ?"
text = "go from basel to zurich and then from zurich to saint gallen ."
text = 'to mount davos .'
text = "then rub that out and then go , interlaken ."
text = "no lets do mount davos to , where do you wanna go ?"
nlp, node_ids, entity_ids = prepare_ruler(network, entity_keywords)


entities = recognise_instructions(text, node_ids, entity_ids, nlp)
display(entities)

doc = nlp(text)
displacy.render(doc, style="ent")

[ADD(9,?)]
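
The grouping logic of `recognise_instructions` can be easier to follow without the spaCy plumbing. The sketch below ports it to plain Python under some assumptions: entities arrive as a pre-tagged list of `('VERB', label)` and `('NODE', id)` pairs, instructions are plain `(verb, (u, v))` tuples rather than `match_utils` objects, and `group_entities` is a hypothetical name.

```python
def group_entities(entities, default_verb='ADD'):
    """Group tagged entities into (verb, (u, v)) tuples; v may be None."""
    instructions = []
    verb, nodes = None, [None, None]
    for kind, value in entities:
        if kind == 'VERB':
            if verb is None:
                verb = value
            else:
                # A second verb starts a new instruction; keep a partial one.
                if nodes[0] is not None:
                    instructions.append((verb, tuple(nodes)))
                verb, nodes = value, [None, None]
        elif kind == 'NODE':
            if nodes[0] is None:
                nodes[0] = value
            elif nodes[1] is None:
                if value != nodes[0]:
                    nodes[1] = value
                if verb is None:
                    # Reuse the previous verb, or fall back to the default.
                    verb = instructions[-1][0] if instructions else default_verb
                instructions.append((verb, tuple(nodes)))
                verb, nodes = None, [None, None]
    # Flush a trailing partial instruction such as "to mount davos".
    if nodes[0] is not None:
        instructions.append((verb or default_verb, tuple(nodes)))
    return instructions


# "no lets do mount davos to , where do you wanna go ?" with Davos as node 9:
# do -> ADD, davos -> 9, do -> ADD, go -> ADD.
tagged = [('VERB', 'ADD'), ('NODE', 9), ('VERB', 'ADD'), ('VERB', 'ADD')]
print(group_entities(tagged))
```

With this input the sketch yields a single partial instruction for node 9, which corresponds to the `[ADD(9,?)]` result above.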

### Define function to process a row or a table to recognise instructions.

In [9]:
def recognise_instructions_for_row(sbj, verb, obj, nlp, node_ids, entity_ids):
    '''Makes instruction acts (for an utterance) or an edit act (for an edit).'''

    # Make a node list and an instruction list.
    if verb == 'says' and sbj in ['A', 'B']:
        text = obj
        nodes = get_node_ids(text, nlp(text))
        instructions = recognise_instructions(text, node_ids, entity_ids, nlp)
        if instructions is None:
            instructions = []
    else:
        nodes = []
        instructions = []

    # Make an act list.
    acts = list()
    for instruction in instructions:
        act = Instruction(instruction, agent=sbj)
        acts.append(act)

    if verb == 'adds':
        act_verb = 'ADD'
    elif verb == 'removes':
        act_verb = 'REMOVE'
    else:
        act_verb = None
    if act_verb is not None:
        action = make_edit_action(act_verb, parse_edge_object(obj))
        act = Do(action, sbj)
        acts.append(act)

    return nodes, instructions, acts


def recognise_instructions_for_table(df, network, entity_keywords, inplace=False):
    if not inplace:
        df = df.copy()

    nlp, node_ids, entity_ids = prepare_ruler(network, entity_keywords)

    node_lists, instruction_lists, act_lists = list(), list(), list()
    for i, row in df.iterrows():
        nodes, instructions, acts = recognise_instructions_for_row(
            row['subject'], row['verb'], row['object'],
            nlp, node_ids, entity_ids)

        node_lists.append(nodes)
        instruction_lists.append(instructions)
        act_lists.append(acts)

    df['nodes'] = node_lists
    df['instructions'] = instruction_lists
    df['matching'] = act_lists

    return df


# Try:
team_no = 28
df = corpus_dfs[team_no].copy()
recognise_instructions_for_table(df, network, entity_keywords).head()

Unnamed: 0,team_no,attempt_no,turn_no,utterance_no,start,end,subject,verb,object,nodes,instructions,matching
0,28,1,1,-1,0.296,0.296,R,shows,observe gesture,[],[],[]
1,28,1,1,-1,0.365,0.365,R,says,"so, ann and bob, let's start building the trac...",[],[],[]
2,28,1,1,-1,33.409,33.409,A,presses,help (enabled),[],[],[]
3,28,1,1,0,40.0,41.161,A,says,"okay , so",[],[],[]
4,28,1,1,1,40.58,45.036,B,says,so we have to connect all the places with trac...,[],[],[]


### Define function to match instructions and actions.

In [10]:
def match_instructions_and_actions(df, inplace=False, verbose=False):
    if not inplace:
        df = df.copy()

    pending_instructions_list = list()
    pending_instructions = list()
    turn_no = 1
    for i, row in df.iterrows():
        pending_instructions = list(pending_instructions)
        act_list = row['matching']
        current_turn_no = row['turn_no']

        # flush at every turn change
        if current_turn_no != -1 and current_turn_no == turn_no + 1:
            if verbose:
                print('Cleared at {} at row {}'.format(current_turn_no, i))
            pending_instructions = list()
            turn_no = current_turn_no

        instructions = [act for act in act_list if isinstance(act, Instruction)]
        edit_acts = [act for act in act_list if isinstance(act, Do)]
        assert len(edit_acts) <= 1, 'more than one edit act? at {}'.format(row)

        pending_instructions = pending_instructions + instructions

        if len(edit_acts) > 0:
            edit_act = edit_acts[0]

            if verbose and len(pending_instructions) > 0:
                print()
                print('Matching {} to {}'.format(
                    pending_instructions, edit_acts))

            # Consider only the pending instructions given by the other
            # speaker, i.e. not by the participant who performs the action.
            others_acts = [
                a for a in pending_instructions if a.agent != row['subject']]
            if len(others_acts) > 0:
                new_act = None

                for instruction in others_acts:
                    if instruction.action.partial_equals(edit_act.action):
                        new_act = Match(instruction, agent=edit_act.agent)
                        if verbose:
                            print('Matched {} to {}'.format(
                                edit_act, instruction))

                instruction = others_acts[-1]
                if new_act is None:
                    new_act = Mismatch(instruction, agent=edit_acts[0].agent)
                    if verbose:
                        print('No match {}: Create {}'.format(
                            instruction, new_act))

                if new_act is not None:
                    # remove all that match instruction.
                    l = list(pending_instructions)
                    for s in pending_instructions:
                        if s.action.partial_equals(instruction.action):
                            l.remove(s)
                    pending_instructions = l
                    act_list.append(new_act)
                    row['matching'] = act_list

            else:
                act = Nonmatch(action=edit_act, agent=row['subject'])
                act_list.append(act)
                row['matching'] = act_list

        pending_instructions_list.append(pending_instructions)

    df['pending_instructions'] = pending_instructions_list
    return df


# Try.
team_no = 10  # or 28
df = corpus_dfs[team_no].copy()
df = recognise_instructions_for_table(df, network, entity_keywords)
df = match_instructions_and_actions(df)
display(df.head())

n_instructions = df['instructions'].apply(len).sum()
n_mismatches = df['matching'].apply(lambda l: len(
    [act for act in l if isinstance(act, Mismatch)])).sum()
n_matches = df['matching'].apply(lambda l: len(
    [act for act in l if isinstance(act, Match)])).sum()
n_edits = df['matching'].apply(lambda l: len(
    [act for act in l if isinstance(act, Do)])).sum()
n_nonmatches = df['matching'].apply(lambda l: len(
    [act for act in l if isinstance(act, Nonmatch)])).sum()

c = len(df[df.verb.isin(['adds', 'removes'])])
# Order: instructions, mismatches, matches, edit acts, edit events, nonmatches.
print(n_instructions, n_mismatches, n_matches, n_edits, c, n_nonmatches)

# e.g. print number of accept acts i.e. matches.
print(df['matching'].apply(lambda l: len(
    [act for act in l if isinstance(act, Match)])).sum())

Unnamed: 0,team_no,attempt_no,turn_no,utterance_no,start,end,subject,verb,object,nodes,instructions,matching,pending_instructions
0,10,1,1,-1,0.204,0.204,R,shows,observe gesture,[],[],[],[]
1,10,1,1,-1,0.334,0.334,R,says,"so, ann and bob, let's start building the trac...",[],[],[],[]
2,10,1,1,-1,53.525,53.525,R,shows,thinking gesture,[],[],[],[]
3,10,1,1,-1,53.544,53.544,R,says,"hmm, i see.",[],[],[],[]
4,10,1,1,-1,56.192,56.192,A,adds,Zermatt-Davos (4-9),[],[],"[DO_A(ADD(4,9)), NONMATCH_A(DO_A(ADD(4,9)))]",[]


135 12 36 108 108 60
36
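
The three possible outcomes for an edit action (match, mismatch, nonmatch) can be summarised in a few lines. The sketch below is a simplified illustration, not the `match_utils` implementation: instructions and actions are plain `(verb, (u, v))` tuples, and this `partial_equals` assumes that a partial instruction equals an action when the verbs agree and every known node of the instruction is an endpoint of the (undirected) edge. The node ids 1, 2 and 3 are placeholders.

```python
def partial_equals(instruction, action):
    """Assumed semantics: same verb, and known nodes are edge endpoints."""
    (verb_i, nodes_i), (verb_a, nodes_a) = instruction, action
    if verb_i != verb_a:
        return False
    known = {n for n in nodes_i if n is not None}
    return known <= set(nodes_a)


def classify(pending, actor, action):
    """pending: list of (agent, instruction); actor: who performs action."""
    # Only instructions from the other participant can be followed.
    others = [ins for agent, ins in pending if agent != actor]
    if not others:
        return 'NONMATCH'
    if any(partial_equals(ins, action) for ins in others):
        return 'MATCH'
    return 'MISMATCH'


pending = [('A', ('ADD', (1, 2)))]                # A: "connect 1 to 2"
print(classify(pending, 'B', ('ADD', (2, 1))))    # B adds the same edge
print(classify(pending, 'B', ('ADD', (1, 3))))    # B adds a different edge
print(classify(pending, 'A', ('ADD', (1, 2))))    # A acts on own instruction
```

The last case shows why an action preceded only by the actor's own instructions counts as a nonmatch: there is no instruction from the other participant to follow or deviate from.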


### Annotate the corpus tables with instructions and follow-up actions

In [11]:
annotated_dfs = dict()
for team_no in sorted(corpus_dfs):
    print('Processing team {:2d} ...'.format(team_no))
    df = corpus_dfs[team_no].copy()

    recognise_instructions_for_table(
        df, network, entity_keywords, inplace=True)
    match_instructions_and_actions(df, inplace=True)

    annotated_dfs[team_no] = df

print('Done!')

Processing team  7 ...
Processing team  8 ...
Processing team  9 ...
Processing team 10 ...
Processing team 11 ...
Processing team 17 ...
Processing team 18 ...
Processing team 20 ...
Processing team 28 ...
Processing team 47 ...
Done!


### Export to tab-separated text files (one per team).

In [12]:
for team_no in sorted(annotated_dfs):
    df = annotated_dfs[team_no].copy()

    # Make filename.
    file = annot_corpus_dir.joinpath(
        'justhink19_annotated_corpus_{:02d}.csv'.format(team_no))
    print('Save team {:2d} to {}'.format(team_no, file))

    cols = ['team_no', 'attempt_no', 'turn_no', 'utterance_no',
            'start', 'end',
            'subject', 'verb', 'object',
            'instructions', 'pending_instructions', 
            'matching',
            ]
    df = df.loc[:, cols]
    
    # Export to file.
    df.to_csv(file, sep='\t', float_format='%.3f', index=False)

Save team  7 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_07.csv
Save team  8 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_08.csv
Save team  9 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_09.csv
Save team 10 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_10.csv
Save team 11 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_11.csv
Save team 17 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_17.csv
Save team 18 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_18.csv
Save team 20 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_20.csv
Save team 28 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_28.csv
Save team 47 to ../processed_data/annotated_corpus/justhink19_annotated_corpus_47.csv


### Export to a pickle file (to easily load in another notebook, preserving data types e.g. matches).

In [13]:
with annot_corpus_pickle_file.open('wb') as handle:
    pickle.dump(annotated_dfs, handle, protocol=pickle.HIGHEST_PROTOCOL)

print('Saved all teams to {}'.format(annot_corpus_pickle_file))

Saved all teams to ../processed_data/annotated_corpus/justhing19_annotated_corpus.pickle
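
For completeness, here is a minimal sketch of the round trip: dumping a dictionary keyed by team number and loading it back. It uses a temporary directory and a stand-in dictionary of plain Python values (both assumptions made for illustration) rather than the real annotated tables.

```python
import pathlib
import pickle
import tempfile

# Stand-in for annotated_dfs: team number -> annotated rows.
annotated = {10: [{'subject': 'A', 'verb': 'adds',
                   'object': 'Zermatt-Davos (4-9)'}]}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / 'annotated_corpus.pickle'

    # Save, as in the cell above.
    with path.open('wb') as handle:
        pickle.dump(annotated, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load in another notebook or script.
    with path.open('rb') as handle:
        restored = pickle.load(handle)

print(sorted(restored))
```

When loading the real file, the classes stored in the tables (e.g. those from `match_utils`) must be importable in the loading environment, and pickle files should only be loaded from trusted sources.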
