# Resolving Ambiguity in Prepositional Phrase Attachment

Prepositional phrase (PP) attachment ambiguity remains largely unsolved in NLP, and pre-trained language models such as BERT are unlikely to help much with it. This notebook reports results for predicting PP attachments on an annotated subset of the NLVR2 dataset, using the pre-trained language model BERT (cite).

The first group of models is trained on the output (hidden layers) of BERT's large uncased model with whole word masking. Results are reported as Cohen's kappa and per-class F1 scores.

The second group of models is trained without the aid of a language model.

We expect that none of these models will perform very well on its own, and that results will be comparable between the two groups.


In [1]:
from IPython.display import Image

## Preliminary Steps

In [2]:
# conda create -n python=3.7 ...
# pip install transformers... 

In [3]:
import sys
import os
import json
import numpy as np
import torch
import spacy
import sklearn
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.metrics import cohen_kappa_score as kappa

from itertools import groupby
from collections import Counter

from transformers import BertConfig, BertTokenizer, BertModel, BertForMaskedLM




In [4]:
np.random.seed(91768)


In [5]:
datadir = "data"

In [6]:
with open(os.path.join(datadir, 'ppa_train.json')) as f:
    train_data = json.load(f)
with open(os.path.join(datadir, 'ppa_test.json')) as f:
    test_data = json.load(f)


In [7]:
from generator import generate_instances

In [8]:
labels_train = [instance['label'] for instance in train_data]
labels_test = [instance['label'] for instance in test_data]

In [9]:
bert_config = BertConfig.from_pretrained("bert-large-uncased-whole-word-masking")
# Return the hidden states of every layer, not only the final one.
bert_config.output_hidden_states = True

bert_model = BertModel.from_pretrained("bert-large-uncased-whole-word-masking",config=bert_config)
bert_model.eval()

bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking',config=bert_config)

In [10]:
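# Use the GPU only when one is available, so the cell also runs on CPU-only machines.
use_cuda = torch.cuda.is_available()
train = np.array(list(generate_instances(
    bert_model, bert_tokenizer, train_data, use_cuda=use_cuda)))
test = np.array(list(generate_instances(
    bert_model, bert_tokenizer, test_data, use_cuda=use_cuda)))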

In [11]:
clfhf = svm.SVC(gamma=0.0001, C=100., random_state=91768)

clfhf.fit(train, labels_train)

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=91768, shrinking=True,
    tol=0.001, verbose=False)

In [12]:
labels_test_hf = clfhf.predict(test)
kappa(labels_test, labels_test_hf)

0.6134147542598247
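
To make the kappa value above easier to interpret, here is a minimal sketch of how Cohen's kappa can be computed by hand: the observed agreement between gold and predicted labels, corrected for the agreement expected by chance from the label frequencies. The function name `cohen_kappa` and the toy `gold` and `pred` lists are illustrative assumptions, not part of the notebook's data; the notebook itself uses `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences (illustrative sketch)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of positions where both labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1.0:
        # Both sequences use a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels, not taken from the NLVR2 annotations.
gold = ['N', 'N', 'N', 'N', 'V', 'V', 'V', 'O', 'O', 'N']
pred = ['N', 'N', 'N', 'V', 'V', 'V', 'N', 'O', 'N', 'N']

toy_kappa = cohen_kappa(gold, pred)
print(toy_kappa)
```

On these toy labels the raw agreement is 0.7, but kappa is lower because a classifier that mostly predicts the majority label `N` would already agree with the gold labels fairly often by chance.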

In [13]:
f1_score(labels_test, labels_test_hf, labels=['N','V','O'], average=None)

array([0.90909091, 0.68656716, 0.5       ])
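
With `average=None`, `f1_score` returns one score per class in the order given by `labels`, so the three values above correspond to `N`, `V`, and `O` respectively. The sketch below shows how such per-class scores can be derived from true positives, false positives, and false negatives. The helper name `per_class_f1` and the toy label lists are assumptions made for illustration only.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed one-vs-rest for each label (illustrative sketch)."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical toy labels, not taken from the NLVR2 annotations.
gold_labels = ['N', 'N', 'N', 'N', 'V', 'V', 'V', 'O', 'O', 'N']
pred_labels = ['N', 'N', 'N', 'V', 'V', 'V', 'N', 'O', 'N', 'N']

toy_f1 = per_class_f1(gold_labels, pred_labels, labels=['N', 'V', 'O'])
for label, score in toy_f1.items():
    print(label, round(score, 3))
```

Reporting the scores per class, rather than as a single average, keeps weak performance on an infrequent class visible instead of letting the majority class mask it.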