# Resolving Ambiguity in Prepositional Phrase Attachment

Resolving ambiguity in prepositional phrase (PP) attachment remains a largely unsolved problem in NLP, and one where pre-trained language models such as BERT are unlikely to help much. This notebook reports results of predicting prepositional phrase attachments on an annotated subset of the NLVR2 dataset, using the pre-trained language model commonly known as "BERT" (cite).

We trained an SVM classifier on the output (hidden layers) of the BERT large uncased model with whole word masking. Results are reported as Cohen's kappa and per-class F1 scores.

In [1]:
from IPython.display import Image

# Preliminary Steps

In [2]:
# conda create -n python=3.7 ...
# pip install transformers... 

In [3]:
import json
import numpy as np

import sklearn
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score as kappa

import spacy

In [4]:
from generator import HuggingFaceGenerator, CountVectorizerGenerator

In [5]:
np.random.seed(91768)

## Load Dataset (train/test)

In [6]:
datadir = "data"

In [7]:
# Use context managers so the file handles are closed after loading.
with open('{}/ppa_train.json'.format(datadir)) as f:
    train_data = json.load(f)
labels_train = [instance['label'] for instance in train_data]

with open('{}/ppa_test.json'.format(datadir)) as f:
    test_data = json.load(f)
labels_test = [instance['label'] for instance in test_data]

## Using BERT Language Model
We load a pre-trained BERT model and use it to generate the instances (feature vectors) used for model training.

In [8]:
bert_model_name = "bert-large-uncased-whole-word-masking"
hf_generator = HuggingFaceGenerator(bert_model_name)

## Transform Dataset

In [9]:
hf_train = hf_generator.generate_dataset(train_data)
hf_test = hf_generator.generate_dataset(test_data)

# Model Training

In [10]:
clfhf = svm.SVC(gamma=0.0001, C=100., random_state=91768)
clfhf.fit(hf_train, labels_train)

SVC(C=100.0, gamma=0.0001, random_state=91768)
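Since no `kernel` argument is passed, `svm.SVC` uses its default RBF kernel, so `gamma` controls how quickly similarity decays with squared distance. The following is a minimal sketch of that kernel in plain Python, meant only to illustrate why a small `gamma` can be a reasonable choice for high-dimensional inputs. The vectors are toy values, not BERT outputs, and the dimension of 1024 (the hidden size of BERT-large) is an assumption about what the generator returns. The helper `rbf_kernel` is written here for illustration and is not part of the notebook.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy vectors (illustrative only): every coordinate differs by 1,
# so the squared distance equals the dimension, 1024.
dim = 1024  # assumed feature size; BERT-large has hidden size 1024
x = [0.0] * dim
y = [1.0] * dim

k_small_gamma = rbf_kernel(x, y, gamma=0.0001)  # exp(-0.1024)
k_large_gamma = rbf_kernel(x, y, gamma=0.1)     # exp(-102.4), effectively zero
print(k_small_gamma, k_large_gamma)
```

With the larger `gamma`, these two points look almost completely dissimilar to the kernel, whereas the smaller value keeps their similarity in a usable range.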

In [11]:
preds_test_hf = clfhf.predict(hf_test)

In [12]:
f1_score(labels_test, preds_test_hf, labels=['N','V','O'], average=None)

array([0.90909091, 0.68656716, 0.5       ])
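With `average=None`, `f1_score` returns one score per class, in the order given by `labels` (here `'N'`, `'V'`, `'O'`). As a rough illustration of what each entry measures, the sketch below computes per-class F1 by hand as 2·TP / (2·TP + FP + FN). The helper `per_class_f1` and the toy labels are made up for this example and are not the notebook's data.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return one F1 score per label, in the order of `labels`."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Score a class with no true or predicted instances as 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels (illustrative only): one 'N' instance is mispredicted as 'V'.
toy_true = ['N', 'N', 'N', 'V', 'V', 'O']
toy_pred = ['N', 'N', 'V', 'V', 'V', 'O']

toy_scores = per_class_f1(toy_true, toy_pred, labels=['N', 'V', 'O'])
print(toy_scores)  # [0.8, 0.8, 1.0]
```

The single error lowers the score of both classes involved: `'N'` loses recall and `'V'` loses precision.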

In [13]:
kappa(labels_test, preds_test_hf)

0.6134147542598247
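Cohen's kappa corrects the observed agreement between gold labels and predictions for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below works through that formula on toy labels, as an illustration of the metric rather than a replacement for `cohen_kappa_score`. The helper `cohen_kappa` and the toy data are hypothetical and unrelated to the results above.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both sequences use one and the same single label.
    """
    n = len(a)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy labels (illustrative only).
toy_true = ['N', 'N', 'N', 'V', 'V', 'O']
toy_pred = ['N', 'N', 'V', 'V', 'V', 'O']

toy_kappa = cohen_kappa(toy_true, toy_pred)
print(toy_kappa)  # p_o = 5/6, p_e = 13/36, so kappa = 17/23
```

Here five of six labels agree (raw agreement of about 0.83), but kappa is lower, about 0.74, because some of that agreement would be expected by chance given the label distributions.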