*David Schlangen, 2019-03-24*

# Task: Resolving Co-Reference / Predicting Model Size

The discussion of discourses in the denotations section already briefly mentioned the task of co-reference resolution. If the image (model) is available, co-reference is established exophorically, via the anchoring of each mention in an image object. If we take away the image, the task must be tackled via linguistic evidence (and common-sense knowledge about scenes) alone. (It hence becomes an inference / entailment task more than one of denotation computation.)

In [1]:
# imports

from __future__ import division
import codecs
import json
from itertools import chain, izip, permutations, combinations
from collections import Counter, defaultdict
import ConfigParser
import os
import random
from textwrap import fill
import scipy
import sys
from copy import deepcopy

from nltk.parse import CoreNLPParser
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import Latex, display

pd.set_option('max_colwidth', 250)

In [2]:
# Load up config file (needs path; adapt env var if necessary); local imports

# load config file, set up paths, make project-specific imports
config_path = os.environ.get('VISCONF')
if not config_path:
    # try default location, if not in environment
    default_path_to_config = '../Config/default.cfg'
    if os.path.isfile(default_path_to_config):
        config_path = default_path_to_config

assert config_path is not None, 'You need to specify the path to the config file via environment variable VISCONF.'        

config = ConfigParser.SafeConfigParser()
with codecs.open(config_path, 'r', encoding='utf-8') as f:
    config.readfp(f)

corpora_base = config.get('DEFAULT', 'corpora_base')
preproc_path = config.get('DSGV-PATHS', 'preproc_path')
dsgv_home = config.get('DSGV-PATHS', 'dsgv_home')


sys.path.append(dsgv_home + '/Utils')
from utils import icorpus_code, plot_labelled_bb, get_image_filename, query_by_id
from utils import plot_img_cropped, plot_img_ax, invert_dict, get_a_by_b
sys.path.append(dsgv_home + '/WACs/WAC_Utils')
from wac_utils import create_word2den, is_relational
sys.path.append(dsgv_home + '/Preproc')
from sim_preproc import load_imsim, n_most_sim

sys.path.append('../Common')
from data_utils import load_dfs, plot_rel_by_relid, get_obj_bb, compute_distance_objs
from data_utils import get_obj_key, compute_relpos_relargs_row, get_all_predicate
from data_utils import compute_distance_relargs_row, get_rel_type, get_rel_instances
from data_utils import compute_obj_sizes_row

In [3]:
# Load up preprocessed DataFrames. Slow!
# These DataFrames are the result of pre-processing the original corpus data,
# as per dsg-vision/Preprocessing/preproc.py

df_names = ['vgregdf', #'vgimgdf', 'vgobjdf', 'vgreldf',
           ]
df = load_dfs(preproc_path, df_names)

# a derived DF, containing only those region descriptions which I was able to resolve
df['vgpregdf'] = df['vgregdf'][df['vgregdf']['pphrase'].notnull() & 
                               (df['vgregdf']['pphrase'] != '')]

Co-reference resolution is the task of determining whether a referring expression introduces a new entity into the discourse or not. We can create data for this task using the Visual Genome region annotation. Turning the set of region descriptions for an image into a "discourse", we have gold-standard information about whether a region description that is added to the discourse introduces a new entity or talks about one that has previously been introduced.

This is what a model would have to predict. The result then is a set of co-reference chains, that is, lists of mentions of the same entity (in order of occurrence). From a more semantic point of view, the task entails determining the size of the intended model of the discourse; co-reference between two mentions here means that only one individual constant needs to be introduced into the model. This is how it is displayed below, with three numbers per discourse: the maximal model size is the number of entity-denoting expressions (as if we were to create a new individual constant for each); the minimal size is the number of entity types in the discourse (assuming that all mentions of the same type co-refer); and the actual size is the one indicated by the object resolution of the descriptions. A perfect resolution of the co-references would yield exactly the actual size.
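
The three sizes can be illustrated on a toy discourse. The following is only a minimal sketch: the function name `model_size_bounds` and the toy mention list are made up for illustration, and mentions are assumed to be already extracted as `(object_id, entity_type)` pairs.

```python
def model_size_bounds(mentions):
    """Return (max, min, actual) model size for a list of mentions.

    mentions: list of (object_id, entity_type) pairs, in order of occurrence.
    """
    # maximal size: one individual constant per mention
    max_size = len(mentions)
    # minimal size: all mentions of the same type are assumed to co-refer
    min_size = len({ent_type for _, ent_type in mentions})
    # actual size: number of distinct gold object identifiers
    actual_size = len({obj_id for obj_id, _ in mentions})
    return max_size, min_size, actual_size


# hypothetical toy discourse: two different shirts, one man mentioned twice
toy_mentions = [
    (1, 'man.n.01'),
    (2, 'shirt.n.01'),
    (1, 'man.n.01'),
    (3, 'shirt.n.01'),
    (4, 'collar.n.01'),
]

sizes = model_size_bounds(toy_mentions)
```

Here the five mentions cover three types but four distinct objects, so the actual size lies strictly between the two bounds.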

(Note that the example here relies on the provided object identifiers to distinguish objects, but Visual Genome seems to have insufficiently consolidated these identifiers: the same object can appear under more than one identifier. To create a cleaner dataset, the judgement whether a new object is introduced or not should instead be based on a test of overlap (intersection over union) between bounding boxes.)
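
Such an overlap test could look roughly as follows. This is a sketch under assumptions, not part of the notebook's pipeline: boxes are assumed to be `(x, y, w, h)` tuples with the origin at the top left, the helper names `iou` and `same_object` are hypothetical, and the threshold of 0.5 is an arbitrary placeholder that would need tuning on the data.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width and height of the overlapping rectangle (0 if disjoint)
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union > 0 else 0.0


def same_object(box_a, box_b, threshold=0.5):
    """Treat two annotations as the same object if their boxes overlap enough.

    The threshold is an assumption, not a value taken from the source.
    """
    return iou(box_a, box_b) >= threshold


identical = iou((0, 0, 10, 10), (0, 0, 10, 10))
half_shifted = iou((0, 0, 10, 10), (5, 0, 10, 10))
disjoint = iou((0, 0, 10, 10), (20, 20, 5, 5))
```

In practice one would probably also require the two annotations to have compatible types before merging them, since a shirt and the man wearing it can have strongly overlapping boxes.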

In [5]:
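# deep caption with co-reference on object level
def extr_disc_ref_pphr(pphr):
    """Extract (object_id, synset) pairs from a resolved phrase.

    Annotated tokens look like word|id|synset[|id|synset ...];
    tokens without '|' carry no referent and are skipped.
    """
    discourse_referents = []
    for token in pphr.split():
        subtoken = token.split('|')
        if len(subtoken) > 1:
            subtoken.pop(0)  # drop the surface word, keep the id/synset fields
            id_syn_list = zip(subtoken[::2], subtoken[1::2])
            discourse_referents.extend([(int(e[0]), e[1]) for e in id_syn_list])
    return discourse_referents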

def cond_print(instr, show):
    if show:
        print instr

def model_size_stats(df, image_id, show=False):
    """Return (max, min, actual) model size for all region descriptions of one image."""
    all_pphr = df[df['image_id'] == image_id][['phrase', 'pphrase']].values.tolist()
    all_discourse_referents = []
    all_types = set()
    n_mentions = 0
    for this_phr, this_pphr in all_pphr:
        cond_print(this_phr, show)
        this_disc_refs = extr_disc_ref_pphr(this_pphr)
        n_mentions += len(this_disc_refs)
        #print '   ', this_disc_refs
        #this_disc_refs_ids, this_disc_ref_types = zip(*this_disc_refs)
        for disc_ref, ref_type in this_disc_refs:
            if disc_ref in all_discourse_referents:
                cond_print("    OLD: %d (%s)" % (disc_ref, ref_type), show)
            else:
                cond_print("    NEW: %d (%s)" % (disc_ref, ref_type), show)
                all_discourse_referents.append(disc_ref)
                if ref_type in all_types:
                    cond_print('    old type, new instance: %d %s' % (disc_ref, ref_type), show)
                all_types.add(ref_type)
        cond_print('-' * 10, show)
    cond_print('max model size: %d || min model size: %d || actual model size: %d'\
                    % (n_mentions, len(all_types), len(all_discourse_referents)), show)
    return n_mentions, len(all_types), len(all_discourse_referents)

n_egs = 3

for _ in range(n_egs):
    print "=" * 70
    ii = df['vgpregdf'].sample()['image_id'].values[0]
    model_size_stats(df['vgpregdf'], ii, show=True)
    print ""

black tie worn by man
    NEW: 1167150 (necktie.n.01)
    NEW: 1167149 (man.n.01)
----------
logo on white shirt worn by man
    NEW: 1167151 (logo.n.01)
    NEW: 1167152 (shirt.n.01)
----------
black glasses worn by man
    NEW: 1167153 (spectacles.n.01)
    OLD: 1167149 (man.n.01)
----------
collar of white shirt worn by man
    NEW: 1167155 (collar.n.01)
    OLD: 1167152 (shirt.n.01)
----------
collar of white shirt worn by man
    OLD: 1167155 (collar.n.01)
    OLD: 1167152 (shirt.n.01)
----------
white shirt worn by man
    OLD: 1167152 (shirt.n.01)
    OLD: 1167149 (man.n.01)
----------
white shirt worn by man
    OLD: 1167152 (shirt.n.01)
    OLD: 1167149 (man.n.01)
----------
an epaulet on a shirt
    NEW: 1167160 (epaulet.n.01)
    NEW: 1167158 (shirt.n.01)
    old type, new instance: 1167158 shirt.n.01
----------
a collar on a shirt
    OLD: 1167155 (collar.n.01)
    OLD: 1167158 (shirt.n.01)
----------
glasses on a man's face
    NEW: 1167156 (spectacles.n.01)
    old type, 

As the examples here show, these aren't particularly nice discourses. Many features of real discourses are missing: real coherence, in the sense that the individual discourse units build on each other; and cohesion, in the sense that discourse-new and discourse-old entities are properly signalled. But for the purposes here, this can be seen as a feature, as it removes all cues to this task other than semantic ones. To decide whether another mention of an entity type co-refers to a previous one, a model really must reason about whether the event it occurs in is compatible, what number of entities of this type are likely to be found in a scene of this kind, and so on. This suggests that the task is still interesting from a semantic perspective, even if a model trained on this data would not be directly transferable to real, natural text. (As a final note, however, it would be possible to annotate the image paragraphs for co-reference and test the model on them, or even train on that data.)
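
To make concrete what a model has to beat, consider the naive strategy behind the minimal model size: predict "old" whenever the entity type has been seen before. The sketch below scores this strategy against the gold labels; the function names and the toy discourse are hypothetical, and this is meant as an illustration of the evaluation setup rather than as the method used in this notebook.

```python
def type_match_baseline(mentions):
    """Return (gold, predicted) NEW/OLD label lists for one discourse.

    mentions: list of (object_id, entity_type) pairs, in order of occurrence.
    Gold: a mention is OLD if its object id occurred before.
    Baseline: a mention is predicted OLD if its type occurred before.
    """
    seen_ids, seen_types = set(), set()
    gold, predicted = [], []
    for obj_id, ent_type in mentions:
        gold.append('OLD' if obj_id in seen_ids else 'NEW')
        predicted.append('OLD' if ent_type in seen_types else 'NEW')
        seen_ids.add(obj_id)
        seen_types.add(ent_type)
    return gold, predicted


def accuracy(gold, predicted):
    """Fraction of mentions whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / float(len(gold)) if gold else 0.0


# hypothetical toy discourse; object 3 is a second shirt
toy_mentions = [
    (1, 'man.n.01'),
    (2, 'shirt.n.01'),
    (1, 'man.n.01'),
    (3, 'shirt.n.01'),
    (2, 'shirt.n.01'),
]

gold, predicted = type_match_baseline(toy_mentions)
baseline_accuracy = accuracy(gold, predicted)
```

The baseline errs exactly on the "old type, new instance" cases (the second shirt in the toy data), which are the cases where reasoning about the scene is needed.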