# Resolving Ambiguity in Prepositional Phrase Attachment

Resolving ambiguity in prepositional phrase attachment remains a largely unsolved problem in NLP, and one that pre-trained language models such as BERT are unlikely to help much with. This notebook reports results for predicting prepositional phrase attachments on an annotated subset of the NLVR2 dataset, using a pre-trained language model commonly known as "BERT" (cite).

The first group of models is trained on the output (hidden layers) of BERT's large uncased model with whole word masking. Results are reported as Cohen's kappa and per-class F1 scores.

The second group of models is trained without the aid of a language model, on binarized token-count features instead.

We expect that none of these models will perform very well on its own, and that results will be comparable between the two groups.


In [1]:
from IPython.display import Image

# Preliminary Steps

In [2]:
# conda create -n python=3.7 ...
# pip install transformers... 

In [3]:
import json
import numpy as np

import sklearn
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score as kappa

import spacy

In [4]:
from generator import HuggingFaceGenerator, CountVectorizerGenerator

In [5]:
np.random.seed(91768)

## Load Dataset (train/test)

In [6]:
datadir = "data"

In [7]:
# Load the annotated splits; each instance carries its attachment label under 'label'.
with open('{}/ppa_train.json'.format(datadir)) as f:
    train_data = json.load(f)
labels_train = [instance['label'] for instance in train_data]

with open('{}/ppa_test.json'.format(datadir)) as f:
    test_data = json.load(f)
labels_test = [instance['label'] for instance in test_data]

## Load Language Models

In [8]:
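# Group 1 features: hidden layers of BERT (large, uncased, whole word masking).
bert_model_name = "bert-large-uncased-whole-word-masking"
hf_generator = HuggingFaceGenerator(bert_model_name)
# Group 2 features: binarized token counts, tokenized with spaCy (no language model).
spacy_model_name = "en_core_web_lg"
nlp = spacy.load(spacy_model_name)  #, disable=["tagger","parser","ner"])
cv_generator = CountVectorizerGenerator(binarize=True, tokenizer=nlp)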

## Transform Dataset

In [9]:
cv_train = cv_generator.fit_transform(train_data).toarray()
cv_test = cv_generator.transform(test_data).toarray()
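
`CountVectorizerGenerator` comes from the local `generator` module, which is not shown here. As a rough illustration of what binarized count features look like, the sketch below uses scikit-learn's `CountVectorizer` with `binary=True`. This assumes the generator behaves similarly; the toy sentences and variable names are hypothetical, and scikit-learn's default tokenizer stands in for spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, not taken from the NLVR2 subset.
toy_train = [
    "the dog saw the man with the telescope",
    "the man ate pizza with a fork",
]
toy_test = ["the cat saw a fork"]

# binary=True records presence/absence of a token rather than its count.
toy_vectorizer = CountVectorizer(binary=True)
toy_train_matrix = toy_vectorizer.fit_transform(toy_train).toarray()

# The vocabulary is fixed at fit time; unseen test tokens ("cat") are ignored.
toy_test_matrix = toy_vectorizer.transform(toy_test).toarray()

# Recover the column order from the fitted vocabulary (term -> column index).
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_train_matrix)
print(toy_test_matrix)
```

Fitting on the training split only and reusing the fitted vocabulary for the test split mirrors the `fit_transform` / `transform` pairing in the cell above.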

In [None]:
hf_train = hf_generator.generate_dataset(train_data)
hf_test = hf_generator.generate_dataset(test_data)
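
An SVM needs one fixed-length vector per instance, while BERT emits one hidden vector per token. How `HuggingFaceGenerator.generate_dataset` reduces the token vectors is not shown in this notebook. One common option, given here purely as an assumption, is masked mean pooling; the function name and toy arrays below are hypothetical and use NumPy only, with no model download.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  float array of shape (batch, seq_len, hidden_size)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy stand-in for a model's last hidden layer: 2 sequences, 3 tokens, 4 dims.
toy_hidden = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
toy_mask = np.array([[1, 1, 0],   # last position is padding
                     [1, 1, 1]])
toy_pooled = masked_mean_pool(toy_hidden, toy_mask)
print(toy_pooled.shape)
print(toy_pooled)
```

Other reductions, such as taking the `[CLS]` vector or the vectors at specific token positions, would fit the same interface.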

# Model Training

In [None]:
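# Use identical SVM hyperparameters for both feature sets so the comparison is like-for-like.
clfhf = svm.SVC(gamma=0.0001, C=100., random_state=91768)
clfhf.fit(hf_train, labels_train)  # BERT hidden-layer features
clfcv = svm.SVC(gamma=0.0001, C=100., random_state=91768)
clfcv.fit(cv_train, labels_train)  # binarized count features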

In [None]:
preds_test_hf = clfhf.predict(hf_test)

In [None]:
f1_score(labels_test, preds_test_hf, labels=['N','V','O'], average=None)

In [None]:
kappa(labels_test, preds_test_hf)

In [None]:
preds_test_cv = clfcv.predict(cv_test)

In [None]:
f1_score(labels_test, preds_test_cv, labels=['N','V','O'], average=None)

In [None]:
kappa(labels_test, preds_test_cv)
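
The evaluation cells above repeat the same two metric calls for each model. A small helper can make the comparison easier to read. The sketch below is one possible way to do this; the `evaluate` function and the toy labels are hypothetical and are not results from the dataset.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

def evaluate(y_true, y_pred, labels=('N', 'V', 'O')):
    """Return ({label: F1}, Cohen's kappa) for one set of predictions."""
    per_class_f1 = f1_score(y_true, y_pred, labels=list(labels), average=None)
    f1_by_label = {label: float(score) for label, score in zip(labels, per_class_f1)}
    return f1_by_label, float(cohen_kappa_score(y_true, y_pred))

# Toy labels for illustration only.
toy_true = ['N', 'N', 'V', 'V', 'O', 'O']
toy_pred = ['N', 'V', 'V', 'V', 'O', 'N']
toy_f1, toy_kappa = evaluate(toy_true, toy_pred)
print(toy_f1, toy_kappa)
```

On the toy labels, per-class F1 is 0.5 for `N`, 0.8 for `V`, and about 0.67 for `O`, and kappa is 0.5: observed agreement is 4/6, agreement expected by chance is 1/3, so kappa = (2/3 - 1/3) / (1 - 1/3). In the notebook, the same helper could be called as `evaluate(labels_test, preds_test_hf)` and `evaluate(labels_test, preds_test_cv)`.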