*David Schlangen, 2019-03-24*

# Task: Predicting Semantic Well-Formedness

What is it that makes the assertion that "colourless green ideas sleep furiously" odd? It is not a *syntactic* problem, goes the famous claim \cite{chomsky:synstruc}, but rather a semantic one.

In a recent book \cite{asher_2011}, Nicholas Asher analyses the fact that certain expressions appear odd (e.g., "Tigers are financial institutions") as being due to the semantic ill-formedness of the *predications* they express, proposing that this is something that lexical semantics ought to explain. (In Asher's approach, this is done via fine-grained semantic types and type hierarchies, through which certain applications can be blocked.) (See also \cite{Vecchi2017}.)

Let's explore whether our corpora might provide material to test or even develop approaches to this phenomenon.

**Technical Note**

If you want to execute this notebook, besides the usual preconditions, you also need to have a [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) server running on localhost on port 9000. (Download the package at the link, if you don't have it.) Like so:

```
cd StanfordCORE_NLP_DIR
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 & 
```

In [1]:
# imports

from __future__ import division
import codecs
import json
from itertools import chain, izip, permutations, combinations
from collections import Counter, defaultdict
import ConfigParser
import os
import random
from textwrap import fill
import scipy
import sys
from copy import deepcopy

from nltk.parse import CoreNLPParser
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import Latex, display

pd.set_option('max_colwidth', 250)

In [2]:
# Load up config file (needs path; adapt env var if necessary); local imports

# load config file, set up paths, make project-specific imports
config_path = os.environ.get('VISCONF')
if not config_path:
    # try default location, if not in environment
    default_path_to_config = '../Config/default.cfg'
    if os.path.isfile(default_path_to_config):
        config_path = default_path_to_config

assert config_path is not None, 'You need to specify the path to the config file via environment variable VISCONF.'        

config = ConfigParser.SafeConfigParser()
with codecs.open(config_path, 'r', encoding='utf-8') as f:
    config.readfp(f)

corpora_base = config.get('DEFAULT', 'corpora_base')
preproc_path = config.get('DSGV-PATHS', 'preproc_path')
dsgv_home = config.get('DSGV-PATHS', 'dsgv_home')


sys.path.append(dsgv_home + '/Utils')
from utils import icorpus_code, plot_labelled_bb, get_image_filename, query_by_id
from utils import plot_img_cropped, plot_img_ax, invert_dict, get_a_by_b
sys.path.append(dsgv_home + '/WACs/WAC_Utils')
from wac_utils import create_word2den, is_relational
sys.path.append(dsgv_home + '/Preproc')
from sim_preproc import load_imsim, n_most_sim

sys.path.append('../Common')
from data_utils import load_dfs, plot_rel_by_relid, get_obj_bb, compute_distance_objs
from data_utils import get_obj_key, compute_relpos_relargs_row, get_all_predicate
from data_utils import compute_distance_relargs_row, get_rel_type, get_rel_instances
from data_utils import compute_obj_sizes_row

In [3]:
# Load up preprocessed DataFrames. Slow!
# These DataFrames are the result of pre-processing the original corpus data,
# as per dsg-vision/Preprocessing/preproc.py

df_names = ['vgattdf', 'vgobjdf', 'vgregdf', #'vgimgdf', 'vgobjdf', 'vgreldf',
           ]
df = load_dfs(preproc_path, df_names)

# a derived DF, containing only those region descriptions which I was able to resolve
df['vgpregdf'] = df['vgregdf'][df['vgregdf']['pphrase'].notnull() & 
                               (df['vgregdf']['pphrase'] != '')]

## Creating Data 
It is relatively straightforward to create odd expressions by manipulating attested expressions found in a corpus. (For example, by taking a source sentence like "Pierre Vinken will join the board as a nonexecutive director." and swapping in an NP taken from a distractor sentence, to yield for example "Pierre Vinken will join the board as the striped zebra.") A statistical language model, provided that it has enough capacity, could be expected to easily *detect* such manipulations (as the manipulated string as a whole will presumably be assigned a lower likelihood than the original version). It cannot, however, contribute to an *explanation* of this oddness. (For the tiger example above, such an explanation would be something like "tigers are concrete objects, financial institutions are abstract objects; nothing can be both a concrete and an abstract object", and for the Vinken example something like "being a striped zebra is not a function on an executive board".) What is required here is a *conceptual* analysis that addresses the predication failure.
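To make the idea of such an explanation concrete, here is a minimal sketch of a type check that blocks a predication and reports why. It is a toy, not Asher's formalism: the type names, the `PARENT`, `LEXICON` and `DISJOINT` tables, and the function `check_predication` are all assumptions made up for this illustration.

```python
# Toy type hierarchy: each type maps to its parent (None marks the root).
# All type names and lexical entries are invented for illustration.
PARENT = {
    'entity': None,
    'concrete': 'entity',
    'abstract': 'entity',
    'animal': 'concrete',
    'institution': 'abstract',
}

# Hypothetical lexicon assigning a type to each expression.
LEXICON = {
    'tiger': 'animal',
    'zebra': 'animal',
    'financial institution': 'institution',
    'bank': 'institution',
}

# Pairs of types declared disjoint: nothing can be an instance of both.
DISJOINT = {frozenset(['concrete', 'abstract'])}


def ancestors(t):
    """Return t and all of its supertypes, most specific first."""
    result = []
    while t is not None:
        result.append(t)
        t = PARENT[t]
    return result


def check_predication(subject, predicate):
    """Check an 'X is a Y' predication; return (ok, explanation)."""
    for s in ancestors(LEXICON[subject]):
        for p in ancestors(LEXICON[predicate]):
            if frozenset([s, p]) in DISJOINT:
                return False, '{} is {}, {} is {}; nothing can be both'.format(
                    subject, s, predicate, p)
    return True, 'no type clash found'


print(check_predication('tiger', 'financial institution'))
print(check_predication('bank', 'financial institution'))
```

The first call is blocked and returns the clash between `concrete` and `abstract` as its explanation; the second goes through. The point of the sketch is only that the failure comes with a reason attached, which a likelihood score alone does not provide.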

While manipulated expressions can be constructed from any text corpus, using the corpora described here could potentially have some advantages. First, the expressions in these corpora are relatively simple (compared to newspaper text, for example), and so offer simpler access to the phenomenon. Second, having the semantic annotation available (the segmented images) gives us more control over the distractor sentences and hence over the types of manipulation.

## Adjectival Modification

We'll start with adjectival modification / predication. As discussed above, Visual Genome provides separate annotation for attributes and names. We can use this to assemble pairs that have a certain chance of being semantically ill-formed.

Here is a random sample showing first a pairing of attribute and name (adjective and noun) as it was annotated for an object, and then the same name with a randomly sampled attribute.

In [4]:
# sampling actual and random ADJ N pairings
def sample_NP_pair(attdf, objdf, n_filter=None):
    ic, ii, oi, atts = attdf.sample()['i_corpus image_id obj_id attributes'.split()].values[0]
    name = objdf[objdf['obj_id'] == oi]['name'].values[0]
    # TODO: test that this name type has never been annotated with this attribute
    
    neg_atts = []
    if n_filter is not None:
        while len(neg_atts) == 0:
            neg_atts = attdf.sample()['attributes'].values[0]
            neg_att = np.random.choice(neg_atts)
            if neg_att in n_filter and name in n_filter[neg_att]:
                neg_atts = []
    else:
        neg_atts = attdf.sample()['attributes'].values[0]
        neg_att = np.random.choice(neg_atts)

    pos = np.random.choice(atts) + u' • ' + name
    neg = neg_att + u' • ' + name
    return pos, neg

n_pairs = 20
for pos, neg in [sample_NP_pair(df['vgattdf'], df['vgobjdf']) for _ in range(n_pairs)]:
    print u'{:>40} || {:<40}'.format(pos, neg)

                           green • trees || black • trees                           
                          lounging • cat || blue • cat                              
                            black • trim || long • trim                             
                          orange • tower || abstract • tower                        
                       signaling • light || for hotdog • light                      
                          french • fries || off • fries                             
                          blue • sticker || closed • sticker                        
                            small • tree || the man's • tree                        
                         red • red cloth || yellow • red cloth                      
                   another • snowboarder || green • snowboarder                     
                            wood • table || small • table                           
                         man made • pond || glass • pond         

(The character "•" is used here to separate attribute and name; this makes visible that sometimes the names aren't properly segmented and contain what should be split off as attributes. If this data were to be used for this task, some post-processing would need to be performed.)

As this shows, simply randomly sampling attributes from the corpus often does not lead to obviously incoherent pairings. There are attributes that seem quite generally applicable (e.g., colours). We can try to make it more likely that the constructed pairing is incoherent by at least checking that it is not attested in the corpus as a whole.
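The attestation check can be sketched on toy data as rejection sampling: keep drawing attributes until one turns up that has never been seen with the given name. The pairs below are invented, and `sample_unattested_attribute` is a hypothetical helper written for this illustration, not a function from the project code.

```python
import random

# Toy annotations: (attribute, name) pairs as they might be attested in a corpus.
# The data here is invented for illustration.
attested_pairs = [
    ('green', 'tree'), ('tall', 'tree'), ('black', 'cat'),
    ('sleeping', 'cat'), ('wooden', 'table'), ('green', 'table'),
]

# Map each attribute to the set of names it has been seen with.
attr2names = {}
for attr, name in attested_pairs:
    attr2names.setdefault(attr, set()).add(name)


def sample_unattested_attribute(name, attr2names, rng, max_tries=100):
    """Draw attributes at random until one is found that is not attested
    with `name`; return None if no such attribute turns up."""
    attributes = sorted(attr2names)
    for _ in range(max_tries):
        candidate = rng.choice(attributes)
        if name not in attr2names[candidate]:
            return candidate
    return None


rng = random.Random(0)
neg = sample_unattested_attribute('cat', attr2names, rng)
print('{} cat'.format(neg))
```

Capping the number of tries avoids an endless loop for names that occur with every attribute in the sample. Note that "not attested" only means "not attested in the data that was indexed", so with a small sample many perfectly coherent pairings will still pass the filter.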

In [5]:
# prepare mapping of attributes to attested modified Ns, for sample. SLOW!
# Would other way round be faster? Map from N to all ADJ that occur with it. Same?
n_atts_to_sample = 20000
attr2den = defaultdict(list)
sampled = df['vgattdf'].sample(n_atts_to_sample)[['obj_id', 'attributes']].values
for oi, atts in sampled:
    for a in atts:
        attr2den[a].append(oi)

attr2applicable_types = dict([(k, set([get_obj_key(df['vgobjdf'], oi, key='name') for oi in ois]))
                                for  k, ois in attr2den.items()])

In [6]:
# sampling ADJ N pairings; ensuring that neg ADJ N is not attested
n_pairs = 20
for pos, neg in [sample_NP_pair(df['vgattdf'], df['vgobjdf'],
                                n_filter=attr2applicable_types) for _ in range(n_pairs)]:
    print u'{:>40} || {:<40}'.format(pos, neg)

                            dark • water || plaid • water                           
                            black • bird || large • bird                            
                        chicken  • salad || multiple • salad                        
                            man's • nose || rustic • nose                           
                          rainbow • flag || silver • flag                           
                              tan  • hat || utility • hat                           
                           brown • shirt || off white • shirt                       
                          large • window || gray • window                           
                          yellow • train || clear • train                           
                          silver • spoon || green • spoon                           
                            white • line || whole • line                            
                            wire • fence || mud • fence          

Intuitively, these samples are somewhat better (in that they are odder). (To what degree this holds is of course an empirical question that could be answered by collecting acceptability ratings via crowdsourcing, for example.) We can at least use this data to ask where the modelling of this oddness should be located. Is it just a frequency effect? (We can assume that these are rarer pairings, as they at least don't occur in this corpus.) Is it world knowledge that makes some pairings appear off (e.g., "khaki sky")? Is there a conceptual mismatch? And if so, is that different from being unlikely based on world knowledge?

* **Dataset:** ADJ + N
* **Negative Instances:** ADJ sampled from different context
* **Source:** Visual Genome, derived
* **Uses:** predict which predications are semantically odd

These questions also arise when larger phrases are manipulated, as we show in the following.

## V, VP, PP application to NPs

The following examples first show a source expression taken from the corpus, and then a manipulated version where one NP of the source expression is replaced by an NP from a randomly sampled distractor expression.

In [7]:
# Take original phrase, randomly sampled other phrase, swap in NP from latter
parser = CoreNLPParser(url="http://localhost:9000")
def parse_phrase(parser, phrase):
    return list(parser.raw_parse(phrase))[0]

def is_cat(tree, cat='NP'):
    return tree.label() == cat

def traverse_and_copy(tree, replacement, target_np):
    # The counter is a one-element list so that NPs counted inside nested
    # subtrees stay counted after the recursive call returns.
    return _traverse_and_copy(tree, replacement, target_np, [0])

def _traverse_and_copy(tree, replacement, target_np, nps_seen):
    # Copy the children of `tree`, replacing the target_np-th NP
    # (in pre-order, counting from 1) with `replacement`.
    out_tree = []
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree):
            if is_cat(subtree):
                nps_seen[0] += 1
                if nps_seen[0] == target_np:
                    out_tree.append(replacement)
                    continue
            out_tree.append(nltk.tree.Tree(subtree.label(),
                                           _traverse_and_copy(subtree, replacement,
                                                              target_np, nps_seen)))
        else:
            out_tree.append(subtree)
    return out_tree

def swap_in_np(original, distractor):
    phrase_pd = parse_phrase(parser, original)
    n_nps = len(list(phrase_pd.subtrees(filter=is_cat)))
    start_np = 1
    if phrase_pd[0,0].label() == 'NP':
        start_np = 2
    target_np = random.randint(start_np, n_nps)

    distr_phrase_pd = parse_phrase(parser, distractor)
    distr_nps = list(distr_phrase_pd.subtrees(filter=is_cat))
    replacement = random.choice(distr_nps[1:])

    tmp = traverse_and_copy(phrase_pd, replacement, target_np)[0]
    manipulated_string = ' '.join(tmp.leaves())
    return manipulated_string

for _ in range(20):
    phrase = df['vgpregdf'].sample()['phrase'].values[0]
    rand_distr_phrase = df['vgpregdf'].sample()['phrase'].values[0]
    try:
        manipulated_phrase = swap_in_np(phrase, rand_distr_phrase)
    except Exception:
        # skip pairs where the swap fails (e.g., no suitable NP was found)
        continue
    print u'{:>40} || {:<40}'.format(phrase, manipulated_phrase)

          a garden with different plants || building with different plants          
 small white air conditioner in the wall || small white air the computer            
                    window of a building || window of the airplane                  
            the word "gassing" on a sign || the building on a sign                  
Bag on top of table with book on top of it. || Bag on top of table with book on top of two rivets .
                   Numbers on the clock. || white object on the clock .             
                     Dirt on the ground. || Dirt on a nose .                        
 The tennis racket in the player's hand. || man in the player 's hand .             
                         mane of a zebra || door of a zebra                         
                      bricks on the wall || bricks on the arm                       
                      waves in the water || waves in a motorcycle                   
             the windshield of a trolley || the wi

The semantic oddness, to the degree there is any for the samples, resides in the relation that is confronted with a non-attested pair of arguments. (Unlike in the previous section, we do not actually check for that; given the much wider range of NPs, it is very likely in any case that random sampling will lead to a non-attested pair. More control would be possible, if desired.)
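The swap operation itself can be illustrated without a running parser. The following is a minimal sketch that uses hand-written trees as nested lists in place of CoreNLP output; the helpers `count_nps`, `replace_nth_np` and `leaves` are hypothetical names introduced here, not part of NLTK or of the notebook code.

```python
# Trees are nested lists of the form [label, child, ...]; leaves are strings.
# This stands in for a parser's output; the example parses are written by hand.

def count_nps(tree):
    """Count NP nodes, including the root if it is an NP."""
    if isinstance(tree, str):
        return 0
    return (tree[0] == 'NP') + sum(count_nps(child) for child in tree[1:])


def replace_nth_np(tree, n, replacement, counter=None):
    """Return a copy of `tree` with its n-th NP (pre-order, from 1) replaced."""
    if counter is None:
        counter = [0]          # shared, mutable count of NPs seen so far
    if isinstance(tree, str):
        return tree
    if tree[0] == 'NP':
        counter[0] += 1
        if counter[0] == n:
            return replacement
    return [tree[0]] + [replace_nth_np(c, n, replacement, counter)
                        for c in tree[1:]]


def leaves(tree):
    """Collect the leaf tokens of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [tok for child in tree[1:] for tok in leaves(child)]


source = ['NP', ['NP', 'waves'], ['PP', 'in', ['NP', 'the', 'water']]]
distractor_np = ['NP', 'a', 'motorcycle']

print(count_nps(source))                   # 3
manipulated = replace_nth_np(source, 3, distractor_np)
print(' '.join(leaves(manipulated)))       # waves in a motorcycle
```

The counter is shared across recursive calls so that NPs are numbered in one consistent pre-order pass over the whole tree. The source tree is left unchanged; only the copy carries the swapped-in NP.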

To give an example, one run (remember that these examples are indeed generated randomly and will be different for each run of the notebook) resulted in the expression "white clouds in a bear". What is it that makes this at least curious, more so than the source expression "white clouds in blue sky"? We know that clouds are a meteorological phenomenon, and as such there is a very limited range of entities that they can be "in". To assign any meaning to the manipulated phrase, we must coerce "white clouds" into something like "pictures of white clouds"; even then, "in" isn't quite the right choice of preposition.

We will leave it at that for now and only note that this simple manipulation seems to create an interesting challenge that only begins with predicting *that* something is odd (which a language model should be able to do) and that in its full stage involves quite sophisticated conceptual knowledge.
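As a rough illustration of the detection half of this challenge, here is a minimal sketch of a bigram language model with add-one smoothing that scores a source phrase against a manipulated one. The training sentences are invented and far too few for a real model; the function name `log_prob` and the choice of add-one smoothing are assumptions made for this sketch.

```python
import math
from collections import Counter

# Tiny invented training corpus; a real model would be trained on far more text.
corpus = [
    'white clouds in blue sky',
    'white clouds in the sky',
    'waves in the water',
    'a bear in the water',
    'a bear in the grass',
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ['<s>'] + sentence.split() + ['</s>']
    unigrams.update(tokens[:-1])                    # counts of context tokens
    bigrams.update(zip(tokens[:-1], tokens[1:]))    # counts of adjacent pairs

# Every token that can be predicted, including the end-of-sentence marker.
vocab = {tok for sentence in corpus for tok in sentence.split()} | {'</s>'}


def log_prob(sentence):
    """Log-probability under a bigram model with add-one smoothing."""
    tokens = ['<s>'] + sentence.split() + ['</s>']
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1.0) /
                          (unigrams[prev] + len(vocab)))
    return total


original = 'white clouds in blue sky'
manipulated = 'white clouds in a bear'
print(log_prob(original) > log_prob(manipulated))   # True
```

Both phrases have the same number of tokens, so the raw log-probabilities are directly comparable. The score singles out the manipulated phrase, but it carries no information about what is wrong with clouds being in a bear, which is the part that requires conceptual knowledge.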

* **Dataset:** complex phrases with NPs swapped out
* **Negative Instances:** NPs taken from other expressions, potentially selected via image similarity
* **Source:** Visual Genome, derived
* **Uses:** predict which predications are semantically odd

# References

[<a id="cit-chomsky:synstruc" href="#call-chomsky:synstruc">1</a>] Noam Chomsky, ``_Syntactic Structures_'',  1957.

[<a id="cit-asher_2011" href="#call-asher_2011">2</a>] Nicholas Asher, ``_Lexical Meaning in Context: A Web of Words_'',  2011.

[<a id="cit-Vecchi2017" href="#call-Vecchi2017">3</a>] Eva M. Vecchi, Marco Marelli, Roberto Zamparelli <em>et al.</em>, ``_Spicy Adjectives and Nominal Donkeys: Capturing Semantic Deviance Using Compositionality in Distributional Spaces_'', Cognitive Science, 2017.

