# Resolving Ambiguity in Prepositional Phrase Attachment

This notebook shows the results of predicting prepositional phrase attachments on an annotated subset of the NLVR2 dataset.

The first group of models is trained on the output of the large uncased BERT model with whole word masking.
This model was subsequently converted to PyTorch/HuggingFace format with a command-line converter.


In [1]:
from IPython.display import Image

Blah blah blah about prepositional phrase attachments... 

Blah blah blah some interesting examples. 

Blah blah blah about NLVR2 paper and dataset

Some stuff about this dataset and how it was collected
and how it was annotated

What this notebook shows... 

(my sig)

Prelims
Imports
outline/toc
Background


## Preliminary Steps

In [2]:
# conda create -n python=3.7 ...
# pip install transformers... 

In [3]:
import sys
import os
import json
import numpy as np
import sklearn
import torch
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score as kappa
from itertools import groupby

from sklearn import svm
from collections import Counter

sys.path.append('/bridge/science/AI/nlp/bert')
from notebook_source import load_text_file, load_xml_files, generate_tuples
from notebook_source import load_folia_xml
from notebook_source import find_sentence_from_file, find_sentence_from_word_id
from notebook_source import generate_annotated4tpls, generate_sentences_from_4tpls
from notebook_source import generate_google_instances, generate_huggingface_instances
#import tokenization

from transformers import BertConfig, BertTokenizer, BertModel, BertForMaskedLM
from sklearn.neural_network import MLPClassifier




In [4]:
np.random.seed(91768)


In [5]:
anndir = "/bridge/data/compositional_semantics/folia/jblackmore/done"
spacydir = "/bridge/data/compositional_semantics/folia/dev"

In [6]:
sents, generator = load_folia_xml(anndir)

In [7]:
annotated4tpls = []
tdeps = {}
for t,dep in generator():
    tdeps[t[2]] = dep
    annotated4tpls.append(t)

In [8]:
len(tdeps)

631

In [9]:
spacy_sents, spacy_gen = load_folia_xml(spacydir)
sdeps = {}
spacy4tpls = []
for spacy_tpl,sdep in spacy_gen():
    sprep = spacy_tpl[2]
    sdeps[sprep] = sdep
    spacy4tpls.append(spacy_tpl)

In [10]:
len(sdeps)

930

In [11]:
len(annotated4tpls)

631

In [12]:
len(spacy4tpls)

930

In [13]:
annotated4tpls = [a4tpl for i,a4tpl in enumerate(annotated4tpls) if a4tpl[2] in sdeps]

In [14]:
tdeps[annotated4tpls[21][2]]

('flower', 'nlvr2_dev_002.text.s.24.w.6', 'NN', 'flower', True)

In [15]:
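def prep_attachment_class(tpl, deps):
    # '!' = preposition not in deps, 'V' = head is the verb, 'N' = head is the noun, 'O' = other head
    if tpl[2] not in deps:
        return '!'
    if deps[tpl[2]] == tpl[0]:
        return 'V'
    if deps[tpl[2]] == tpl[1]:
        return 'N'
    return 'O'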

In [16]:
labels = [prep_attachment_class(tpl,tdeps) for tpl in annotated4tpls]

In [17]:
Counter(labels)

Counter({'N': 442, 'V': 140, 'O': 47})

In [18]:
spacy_preds = [prep_attachment_class(tpl,sdeps) for tpl in annotated4tpls]

In [19]:
kappa(spacy_preds, labels)

0.275600163537006

In [20]:
Counter(spacy_preds)

Counter({'V': 94, 'N': 465, 'O': 70})

In [21]:
len(annotated4tpls)

629

In [22]:
len(spacy4tpls)

930

In [23]:
import pandas as pd
pd.DataFrame(confusion_matrix(labels, spacy_preds, labels=['N','V','O']), index=None)

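|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 361 | 50 | 31 |
| 1 | 95 | 37 | 8 |
| 2 | 9 | 7 | 31 |

Rows are the annotated labels and columns are the spaCy predictions, each in the order N, V, O.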
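
The kappa score reported earlier can be reproduced by hand from the 3x3 confusion matrix printed above. The following is a minimal pure-Python sketch of Cohen's kappa (observed agreement corrected for chance agreement); the function name `cohen_kappa_from_confusion` is introduced here for illustration and is not part of the notebook's own code.

```python
def cohen_kappa_from_confusion(matrix):
    """Cohen's kappa from a square confusion matrix given as a list of rows."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of the marginal totals for each class.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Counts copied from the confusion matrix output above (class order N, V, O).
confusion = [[361, 50, 31], [95, 37, 8], [9, 7, 31]]
kappa_value = cohen_kappa_from_confusion(confusion)
print(round(kappa_value, 4))  # 0.2756
```

The raw agreement is 429 of 629 instances (about 68%), but because both the annotations and spaCy favour the N class heavily, most of that agreement is expected by chance, which is why kappa is much lower.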


In [24]:
#sents_all = list(generate_sentences_from_4tpls(annotated4tpls,sents))

In [25]:
#from notebook_source import stext
#sents_all=[stext(find_sentence_from_word_id(t4tpl[0][1],sents)) for t4tpl in annotated4tpls]

In [26]:
#len(sents_all)

In [27]:
# Write all sentences with annotations to disk for further
# processing with BERT models. 

#with open(os.path.join(bert_datadir,'sents_all.txt'),'w') as allout:
#    for s in sents_all:
#        allout.write(s)
#        allout.write('\n')

In [28]:
# ... Wait for features from BERT ...
#export BERT_BASE_DIR=/bridge/science/AI/nlp/corpora/BERT/wwm_uncased_L-24_H-1024_A-16
#python extract_features.py --input_file=/bridge/science/AI/nlp/data/compositional_semantics/BERT/sents_all.txt --output_file=/bridge/science/AI/nlp/data/compositional_semantics/BERT/sents_all_wwmu_output.jsonl --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt --layers=-1,-2,-3,-4 --max_seq_length=128 --batch_size=8


In [29]:
#X = np.array(X_list)
test_size = int(len(annotated4tpls)/4)
randidx = list(range(len(annotated4tpls)))
np.random.shuffle(randidx)
trainidx = randidx[test_size:]
testidx = randidx[:test_size]
labels_train = [labels[i] for i in trainidx]
labels_test = [labels[i] for i in testidx]


In [30]:
sents_all = list(generate_sentences_from_4tpls(annotated4tpls,sents))

The same BERT model can be converted for use with HuggingFace by a command-line converter.

We now load that model through the HuggingFace
transformers API.

The hidden-layer outputs need to be stacked so that layers 4-3-2-1 appear for each of the 4 words, selected across word pieces.

Ex: <br>
Mary ate noodles with chopsticks. <br>
Mary ate noodles with curry. <br>

4-tuple (VNPN): ate, noodles, with, (chopsticks/curry)

The BERT tokenizer may split words into word pieces, so the 4-tuple may appear as something like
ate, noodl ##es, with, chop ##sticks / cur ##ry
For each word we take the 4 layers across up to 4 of its pieces, starting with the
4th layer of the first piece, then the
3rd layer of the second (or last) piece, and so on up to the
top layer of the fourth (or last) piece.
This gives 16 piece-layers for each attachment instance.
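
The selection described above can be sketched as follows. This is one reading of the scheme, not the notebook's actual implementation: the helper names `group_word_pieces` and `piece_layer_indices` are hypothetical, the token list is an illustrative tokenization rather than real tokenizer output, and layers are written as negative indices counted from the top of the model (so `-4` is the 4th layer from the top and `-1` is the top layer).

```python
def group_word_pieces(tokens):
    """Return (start, end) token spans, one per word, merging '##' continuation pieces."""
    spans = []
    for position, token in enumerate(tokens):
        if token.startswith("##") and spans:
            # Continuation piece: extend the span of the current word.
            spans[-1] = (spans[-1][0], position + 1)
        else:
            spans.append((position, position + 1))
    return spans


def piece_layer_indices(word_spans, layers=(-4, -3, -2, -1), max_pieces=4):
    """Pair each layer with a piece of each word, reusing the last piece when a word is short."""
    pairs = []
    for start, end in word_spans:
        last_piece = min(end, start + max_pieces) - 1
        for offset, layer in enumerate(layers):
            pairs.append((layer, min(start + offset, last_piece)))
    return pairs


# Hypothetical word-piece tokenization of "Mary ate noodles with chopsticks."
tokens = ["[CLS]", "mary", "ate", "noodl", "##es", "with", "chop", "##sticks", ".", "[SEP]"]
spans = group_word_pieces(tokens)
vnpn_spans = spans[2:6]  # ate, noodles, with, chopsticks
indices = piece_layer_indices(vnpn_spans)
print(len(indices))  # 16
```

Each `(layer, position)` pair identifies one vector in the model's hidden-state output; gathering the 16 vectors in this order and concatenating them would give one feature vector per attachment instance.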
 

In [36]:
train_data=[]
test_data=[]
instance_4tpl_labels = ["V","N","P","N2"]
word_attribute_labels = ["text","source","pos_tag","lemma","trail_space"]
for i,(ann4tpl,senttext,label) in enumerate(zip(annotated4tpls,sents_all,labels)):
    new_instance = {"sentence_text": senttext, "label": label}
    for word_label,token_tpl in zip(instance_4tpl_labels,ann4tpl):
        word_attributes = {}
        for word_attr_label,word_attr_value in zip(word_attribute_labels,token_tpl):
            word_attributes[word_attr_label] = word_attr_value
        new_instance[word_label] = word_attributes
    if i in trainidx:
        train_data.append(new_instance)
    elif i in testidx:
        test_data.append(new_instance)    
    else:
        raise ValueError("{} is out of bounds of dataset".format(i))

In [46]:
def align_sents(dataset,sents):
    new_dataset = []
    for instance in dataset:
        source = instance['V']['source']
        foundit=False
        for sent in sents:
            sent_token_sources = [tok[1] for tok in sent]
            if source in sent_token_sources:
                new_dataset.append(sent)
                foundit=True
                break  # stop at the first sentence that contains the source ID
        if not foundit:
            raise ValueError("Couldn't find it: {}".format(source))
    return new_dataset

In [32]:
import json

In [40]:
datadir = "/bridge/data/compositional_semantics"

In [41]:
with open('{}/ppa-hugging-face-train.json'.format(datadir),'w') as train_out:
    json.dump(train_data,train_out,indent=4)
with open('{}/ppa-hugging-face-test.json'.format(datadir),'w') as test_out:
    json.dump(test_data,test_out,indent=4)

In [43]:
train_data[212]

{'sentence_text': 'There is broccoli on a towel.',
 'label': 'V',
 'V': {'text': 'is',
  'source': 'nlvr2_dev_019.text.s.2.w.2',
  'pos_tag': 'VBZ',
  'lemma': 'be',
  'trail_space': True},
 'N': {'text': 'broccoli',
  'source': 'nlvr2_dev_019.text.s.2.w.3',
  'pos_tag': 'NN',
  'lemma': 'broccoli',
  'trail_space': True},
 'P': {'text': 'on',
  'source': 'nlvr2_dev_019.text.s.2.w.4',
  'pos_tag': 'IN',
  'lemma': 'on',
  'trail_space': True},
 'N2': {'text': 'towel',
  'source': 'nlvr2_dev_019.text.s.2.w.6',
  'pos_tag': 'NN',
  'lemma': 'towel',
  'trail_space': False}}

In [47]:
train_sents = align_sents(train_data, sents)

In [48]:
test_sents = align_sents(test_data, sents)

In [49]:
def zip_dataset_with_sents(dataset, sents_for_dataset):
    zipped = []
    for instance,sent in zip(dataset,sents_for_dataset):
        instance['tokenized_sentence'] = sent
        zipped.append(instance)
    return zipped

In [50]:
dataset = {}
dataset['train'] = zip_dataset_with_sents(train_data, train_sents)
dataset['test'] = zip_dataset_with_sents(test_data, test_sents)

In [51]:
with open('{}/ppa_train.json'.format(datadir),'w') as train_out:
    json.dump(dataset['train'],train_out,indent=4)
with open('{}/ppa_test.json'.format(datadir),'w') as test_out:
    json.dump(dataset['test'],test_out,indent=4)