This notebook is based largely on the prior work of Patrick Pantel and Dekang Lin (1999). 

The primary purpose here is to classify whether a given sentence has prepositional phrase (PP) attachment ambiguity. A syntactic parse of the sentence is required. 

A secondary purpose is to enumerate the candidate attachment sites for the sentences that are ambiguous. 

Once this work is completed, we can use it to analyze sentences with prepositional phrases, both with and without ambiguities, and look for interesting or distinguishing patterns or characteristics. Following Pantel and Lin's approach, we can use the distribution of words across the unambiguous cases to predict the ambiguous cases in an unsupervised manner. 
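
As a rough illustration of learning from unambiguous cases, the sketch below counts how often each preposition follows a verb or noun head in unambiguous (V, P, N2) and (N1, P, N2) triples, then attaches an ambiguous PP to whichever head prefers that preposition. This is a deliberately simplified stand-in, not Pantel and Lin's actual scoring: it ignores N2 and uses no word similarity. The triples and the names `build_counts`, `preference`, and `predict_attachment` are made up for the example.

```python
from collections import Counter

# Toy unambiguous triples (hypothetical data, not drawn from the notebook's corpus).
VERB_TRIPLES = [("ate", "with", "fork"), ("ate", "with", "spoon"),
                ("ate", "in", "kitchen"), ("walked", "to", "clubhouse")]
NOUN_TRIPLES = [("salad", "with", "croutons"), ("salad", "of", "greens"),
                ("bag", "of", "clubs"), ("bag", "of", "balls")]


def build_counts(triples):
    """Count (head, preposition) pairs and head totals from (head, P, N2) triples."""
    pair_counts = Counter((head, prep) for head, prep, _ in triples)
    head_counts = Counter(head for head, _, _ in triples)
    return pair_counts, head_counts


PREPS = {prep for _, prep, _ in VERB_TRIPLES + NOUN_TRIPLES}
VERB_COUNTS = build_counts(VERB_TRIPLES)
NOUN_COUNTS = build_counts(NOUN_TRIPLES)


def preference(head, prep, counts):
    """Add-one smoothed estimate of P(prep | head)."""
    pair_counts, head_counts = counts
    return (pair_counts[(head, prep)] + 1) / (head_counts[head] + len(PREPS))


def predict_attachment(verb, noun1, prep):
    """Return 'V' or 'N' for an ambiguous (V, N1, P, N2) case; N2 is ignored here."""
    verb_score = preference(verb, prep, VERB_COUNTS)
    noun_score = preference(noun1, prep, NOUN_COUNTS)
    return "V" if verb_score >= noun_score else "N"


print(predict_attachment("ate", "bag", "with"))
print(predict_attachment("walked", "bag", "of"))
```

On this toy data "ate ... with" goes to the verb and "bag of" goes to the noun. A real version would need far more data and some use of N2.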

Pantel and Lin define ambiguous cases as those of the form (V, N1, P, N2). 

They define unambiguous cases as those of the forms (N1, P, N2) and (V, P, N2). 

I'm not sure I agree with this. What about... 

(P, N2, P, N3)? Based on the above, this is said to be unambiguous. I disagree. 
    sub (V, P, N2, P, N3)? 
    sub (N1, P, N2, P, N3)? 

I carried my golf clubs with new grips on graphite shafts with cavity-backed blades over my shoulder. 

I carried my golf clubs with new grips over my shoulder. 
I carried my golf clubs with new grips from the pro shop. 

I carried my golf bag with the new irons from the pro shop. 
I carried my golf bag across the 15th fairway to the pro shop. 
I carried my golf bag with my friend Jim across the 15th fairway. 
I carried my golf bag to the pro shop with my friend Jim. 

All of the form...

N, V, N, P, N, P, N

Bah! That matches the first rule, since it contains (V, N, P, N). 

N, V, N, P, N, P, N, P, N

I'm calling that ambiguous. 
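
To put a number on how ambiguous these chains get, here is a sketch that enumerates candidate attachment sites for a verb, its object, and a run of PPs. It assumes attachments may not cross, so attaching a PP to an earlier site closes off every site after it. That assumption and the name `enumerate_attachments` are mine, not Pantel and Lin's.

```python
def enumerate_attachments(num_pps):
    """List every non-crossing attachment of num_pps PPs following 'V N1'.

    Each reading is a tuple naming the site chosen for PP1, PP2, ...
    """
    readings = []

    def extend(open_sites, chosen):
        k = len(chosen)
        if k == num_pps:
            readings.append(tuple(chosen))
            return
        for i, site in enumerate(open_sites):
            # Attaching to open_sites[i] closes every site to its right;
            # the noun inside the new PP becomes the rightmost open site.
            new_noun = "N{}".format(k + 2)
            extend(open_sites[:i + 1] + [new_noun], chosen + [site])

    extend(["V", "N1"], [])
    return readings


for n in (1, 2, 3):
    print(n, len(enumerate_attachments(n)))
```

Under this assumption one PP has 2 readings, two PPs have 5, and three PPs (the N, V, N, P, N, P, N, P, N shape) have 14.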

Hence the only unambiguous cases would be of the form... 

(V, P, N2) or maybe
(N, P, N, V)

The man with the horse offered me a job. 

These heuristics must be ranked... first match fires, done. 
(cooler way to say that?)
disjunctive heuristics? 
decision trees? 
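
An ordered rule set in which the first matching rule decides the outcome is usually called a decision list. A minimal sketch, using the patterns discussed in this notebook: the ordering below (longer patterns before the shorter ones they contain) and the names `RULES` and `classify_pattern` are my assumptions, not something fixed by Pantel and Lin.

```python
import re

# Ordered rules: (regex over the phrase pattern, is_ambiguous).
# Longer patterns come before the shorter ones they contain.
RULES = [
    (r"VNPN", True),      # (V, N, P, N)
    (r"VPNPN", True),     # (V, P, N, P, N)
    (r"PNPNPN", True),    # (P, N, P, N, P, N)
    (r"^NPNV", False),    # ^(N, P, N, V)
    (r"PNPN", False),     # (P, N, P, N)
    (r"VPN", False),      # (V, P, N)
    (r"NPN", False),      # (N, P, N)
]


def classify_pattern(phrase_pattern):
    """Return (is_ambiguous, rule) for the first rule that matches."""
    if "P" not in phrase_pattern:
        return False, "no P"
    for regex, is_ambiguous in RULES:
        if re.search(regex, phrase_pattern):
            return is_ambiguous, regex
    return False, "unknown"


print(classify_pattern("NVNPN"))
print(classify_pattern("NPNV"))
```

Keeping the rules in a list makes the ranking explicit and easy to reorder while experimenting.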

(V, N, P, N) => ambiguous (these cats)
(V, P, N) => unambiguous (yea. them cats)
They say...
N, P, N => unambiguous

I say...

(V, P, N, P, N) => ambiguous (I made this!)
I walked with my golf bag to the clubhouse. 
I walked with my golf bag in a pullcart. 

(P, N, P, N, P, N) => ambiguous

^(N, P, N, V) => unambiguous (I made this!)
The man with the beard sells tacos. 

(P, N, P, N) => unambiguous? (variation of N P N - them cats)
(P) => unknown (presumed unambiguous)
=> no P





In [1]:
import re
import os
import nltk
from itertools import groupby

In [2]:
pat1 = re.compile(r'[VNP]P')

In [3]:
ex_parse = "(S (PP (IN In) (NP (QP (RB at) (RBS least) (CD one)) (NN image))) (NP (EX there)) (VP (VBZ is) (NP (NP (DT a) (JJ single) (NN tree)) (PP (IN with) (NP (JJ orange) (NNS flowers))) (PP (IN in) (NP (NP (NN front)) (PP (IN of) (NP (DT a) (NN church))))) (PP (IN with) (S (NP (DT the) (JJ open) (NN door)) (VP (VBG facing) (RB forward) (RB left)))))) (. .))"
print(ex_parse)

(S (PP (IN In) (NP (QP (RB at) (RBS least) (CD one)) (NN image))) (NP (EX there)) (VP (VBZ is) (NP (NP (DT a) (JJ single) (NN tree)) (PP (IN with) (NP (JJ orange) (NNS flowers))) (PP (IN in) (NP (NP (NN front)) (PP (IN of) (NP (DT a) (NN church))))) (PP (IN with) (S (NP (DT the) (JJ open) (NN door)) (VP (VBG facing) (RB forward) (RB left)))))) (. .))


In [4]:
matches = re.findall(pat1,ex_parse)

In [5]:
phrase_pattern = ''.join([m[0] for m in matches])

In [6]:
phrase_pattern

'PNNVNNPNPNNPNPNV'

In [7]:
# NN -> N
new_phrase_pattern = 'N'.join(phrase_pattern.split('NN'))

In [8]:
new_phrase_pattern

'PNVNPNPNPNPNV'
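
The split/join trick works on this example, but it only rewrites `NN`, and it leaves a run of three or more `N`s partly uncollapsed. The sketch below compares it with two alternatives that collapse any repeated letter. The function names are illustrative.

```python
import re
from itertools import groupby


def collapse_split_join(pattern):
    """The approach in the cell above: only rewrites non-overlapping 'NN' pairs."""
    return "N".join(pattern.split("NN"))


def collapse_groupby(pattern):
    """Keep one character per run, for any letter and any run length."""
    return "".join(ch for ch, _ in groupby(pattern))


def collapse_regex(pattern):
    """Same result as collapse_groupby, using a backreference."""
    return re.sub(r"(.)\1+", r"\1", pattern)


for raw in ["PNNVNNPNPNNPNPNV", "PNNNVVP"]:
    print(raw, collapse_split_join(raw), collapse_groupby(raw), collapse_regex(raw))
```

For `PNNNVVP` the split/join version returns `PNNVVP`, while the other two return `PNVP`.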

In [35]:
def has_pp_attachment_ambiguity(parse):
    '''
    This is intended to be a reference implementation of Pantel & Lin (2002). 
    
    (V, N1, P, N2) => ambiguous
    Otherwise => not ambiguous
    '''
    # step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step0 = parse # no modifications
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    # Removing consecutive duplicates, e.g. NN => N
    # https://stackoverflow.com/questions/18799036/python-best-way-to-remove-duplicate-character-from-string
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3
    print(phrase_pattern)
    # If no prep phrase, let's not waste time.
    if not re.search(r'P',phrase_pattern):
        if len(phrase_pattern)>0:
            print("No preps: {}".format(phrase_pattern))
            print(parse)
        return False
    
    # (V, N, P, N) => ambiguous (Pantel&Lin) 
    # Mary ate a salad with a fork. 
    # Mary ate a salad with croutons. 
    match_obj = re.search(r'VNPN', phrase_pattern)
    if match_obj:
        return True
    return False
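
One thing to watch in the function above: `[VNP]P` matches anywhere in the string, so tags such as `ADVP`, `WHNP`, and `NNP` also contribute a `V` or `N`. The sketch below compares that pattern with a stricter one that only accepts `VP`, `NP`, or `PP` directly after an opening parenthesis. The stricter regex is my suggestion, and the example parse is hand-written for illustration rather than taken from the data files.

```python
import re
from itertools import groupby

LOOSE = re.compile(r"[VNP]P")             # the notebook's pattern
STRICT = re.compile(r"\((VP|NP|PP)\b")    # assumption: only labels right after "("


def phrase_pattern(parse, regex):
    """First letter of each matched label, with consecutive repeats collapsed."""
    letters = [match[0] for match in regex.findall(parse)]
    return "".join(ch for ch, _ in groupby(letters))


# Hand-written parse for illustration (hypothetical, not from the data files).
example = ("(S (NP (PRP She)) (VP (VBD put) (NP (DT the) (NN book)) "
           "(ADVP (RB back)) (PP (IN on) (NP (DT the) (NN shelf)))) (. .))")

print(phrase_pattern(example, LOOSE))
print(phrase_pattern(example, STRICT))
```

The loose pattern yields `NVNVPN` because `ADVP` counts as a `V`, so the `VNPN` rule does not fire. The strict pattern yields `NVNPN`, which does contain `VNPN`.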


In [10]:
def get_pp_attachment_ambiguity(parse):
    # https://stackoverflow.com/questions/18799036/python-best-way-to-remove-duplicate-character-from-string
    step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3
    
    # If no prep phrase, let's not waste time.
    if not re.search(r'P',phrase_pattern):
        if len(phrase_pattern)>0:
            print("No preps: {}".format(phrase_pattern))
            print(parse)
        return False, 'P'
    
    # (V, N, P, N) => ambiguous (Pantel&Lin) 
    # Mary ate a salad with a fork. 
    # Mary ate a salad with croutons. 
    match_obj = re.search(r'VNPN', phrase_pattern)
    if match_obj:
        return True, 'VNPN'
    #(V, P, N, P, N) => ambiguous (I made this!) 
    #I walked with my golf bag to the clubhouse. 
    #I walked with my golf bag in a pullcart.
    match_obj = re.search(r'VPNPN', phrase_pattern)
    if match_obj:
        return True, 'VPNPN'
    # (N, P, N, P, N) => ambiguous (I made this!)
    # In at least one image there is a single tree with orange flowers in front of a church with the open door facing forward left.
    # ... [tree with orange] flowers in front of a church ...
    # ... [tree with orange] flowers with five petals on thick branches... 
    # ([N, P, {N, P, N], P, N})
    match_obj = re.search(r'NPNPN', phrase_pattern)
    if match_obj:
        return True, 'NPNPN'
    # (N, V, N, V, P, N) => ambiguous (discovered empirically)
    # ... women wearing white bikinis standing next to the water. 
    # (simplification)
    # ... A woman is wearing a bikini cooking on a gas stove. 
    # ... A woman is holding a spatula cooking in a white bikini. 
    if re.search(r'NVNVPN', phrase_pattern):
        return True, 'NVNVPN'
    
    # We have now passed all the checks for ambiguity. 
    
    #^(N, P, N, V) => unambiguous (I made this!)
    # The man with the beard sells tacos. 
    match_obj = re.search(r'^NPNV', phrase_pattern)
    if match_obj:
        print("Pattern match (^NPNV=un): {}".format(phrase_pattern))
        print(parse)
        return False, '^NPNV'
    else:
        # ...in the image on the left...
        match_obj = re.search(r'PNPN', phrase_pattern)
        if match_obj:
            print("Pattern match (PNPN=un): {}".format(phrase_pattern))
            print(parse)
            return False, 'PNPN'
        else:
            print("New pattern: {}".format(phrase_pattern))
    print(parse)
    return False, phrase_pattern
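
The `SBAR (IN` substitution in `step0` changes which sentences count as having a preposition at all. The sketch below applies it to one of the parses that the reference function reports as having no preps. The helper names are illustrative.

```python
import re
from itertools import groupby


def phrase_pattern(parse):
    """Collapsed V/N/P pattern, using the same regex as the notebook."""
    letters = [m[0] for m in re.findall(r"[VNP]P", parse)]
    return "".join(ch for ch, _ in groupby(letters))


def sbar_to_pp(parse):
    """Relabel an SBAR introduced by an IN tag as a PP, as step0 does."""
    return re.sub(r"SBAR \(IN", "PP (IN", parse)


parse = ("(S (NP (DT The) (JJ right) (NN image)) (VP (VBZ contains) "
         "(NP (NP (DT a) (JJ brown) (NN dog)) (SBAR (IN with) (S (NP (PRP$ its) "
         "(NN tongue)) (VP (VBN exposed)))))) (. .))")

print(phrase_pattern(parse))
print(phrase_pattern(sbar_to_pp(parse)))
```

Without the rewrite the pattern is `NVNV` (no `P`). With it, the pattern becomes `NVNPNV`, which matches `VNPN`. Note that the same substitution would also relabel subordinators such as `(SBAR (IN while) ...`, which may not be wanted.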

In [11]:
parses = []
num_total = 0
num_found = 0
with open('../data/pp_ambiguity_parses.txt') as ppap_in:
    for line in ppap_in:
        num_total += 1
        if has_pp_attachment_ambiguity(line.strip()):
            num_found += 1

No preps: NVNVNVNVNV
(S (S (NP (DT The) (JJ left) (NN image)) (VP (VBZ contains) (NP (NP (DT a) (NN dog)) (SBAR (WHNP (WDT that)) (S (VP (VBZ is) (VP (VP (VBG standing) (ADVP (RP up))) (CC and) (VP (VBD turned) (ADVP (RB forward)))))))))) (, ,) (CC and) (S (NP (DT the) (JJ right) (NN image)) (VP (VBZ features) (NP (NP (DT a) (NN dog)) (VP (VBG wearing) (NP (DT a) (JJ blue) (NN collar)) (SBAR (IN with) (S (NP (PRP$ its) (NN head) (CC and) (NN gaze)) (VP (VBD raised)))))))) (. .))
No preps: NVN
(S (NP (PRP it)) (VP (VBZ 's) (NP (JJ blue) (CC and) (NNP red) (NNP box))) (. .))


In [12]:
print(num_found, num_total)

56 94




In [14]:
all_parses = []
with open('../data/annotated_dev.parse') as ppap_in:
    for line in ppap_in:
        all_parses.append(line.strip())

In [15]:
ppaa_parses = []
with open('../data/pp_ambiguity.parse') as ppap_in:
    for line in ppap_in:
        ppaa_parses.append(line.strip())

In [16]:
pos_ambiguous_parses = [parse for parse in all_parses if parse in ppaa_parses]

In [17]:
len(pos_ambiguous_parses)

94

In [18]:
annotated_parses = [(parse, parse in ppaa_parses) for parse in all_parses]
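
A small aside on the membership tests above: `parse in ppaa_parses` scans the whole list for every parse. Converting the annotated parses to a set gives the same labels with constant-time lookups. The strings below are toy stand-ins, not real parses.

```python
# Toy stand-ins for the two parse lists (hypothetical values).
all_parses = ["(S (NP a))", "(S (NP b))", "(S (NP c))"]
ppaa_parses = ["(S (NP b))"]

gold = set(ppaa_parses)  # set membership is O(1) on average
annotated_parses = [(parse, parse in gold) for parse in all_parses]

print(annotated_parses)
```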

In [19]:
my_predictions = [has_pp_attachment_ambiguity(parse) for parse in all_parses]

No preps: NVN
(S (NP (EX There)) (VP (VBP are) (NP (QP (RB exactly) (CD three)) (JJ juvenile) (NNS cheetahs))) (. .))
No preps: NVN
(S (NP (EX There)) (VP (VBP are) (NP (QP (DT no) (JJR more) (IN than) (CD six)) (NNS birds))) (. .))
No preps: NVN
(S (NP (CD One) (NN dog)) (VP (VBZ is) (VP (VBG wearing) (NP (DT a) (NN leash)))) (. .))
No preps: NVN
(S (NP (EX There)) (VP (VBP are) (NP (QP (DT no) (JJR more) (IN than) (CD three)) (NNS elephants))) (. .))
No preps: NVNV
(S (S (NP (CD One) (NN dog) (NN sled) (NN team)) (VP (VBZ is) (VP (VBG moving)))) (CC and) (S (NP (CD one)) (VP (VBZ is) (RB not))) (. .))
No preps: NVNV
(S (NP (DT The) (JJ right) (NN image)) (VP (VBZ contains) (NP (NP (DT a) (JJ brown) (NN dog)) (SBAR (IN with) (S (NP (PRP$ its) (NN tongue)) (VP (VBN exposed)))))) (. .))
No preps: NVNVNVN
(S (NP (DT The) (JJ right) (NN image)) (VP (VBZ contains) (NP (NP (DT a) (NN person)) (VP (VBG riding) (NP (DT a) (NN mountain) (NN bike)) (S (VP (VBG wearing) (NP (NN knee) (NNS pads))

In [20]:
their_predictions = [p[1] for p in annotated_parses]

In [21]:
import numpy as np
import sklearn


In [22]:
mine = np.array([int(boo) for boo in my_predictions])
theirs = np.array([int(boo) for boo in their_predictions])
false_positives = [all_parses[i] for i in range(len(all_parses)) if mine[i]>theirs[i]]
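
The same comparison can produce all four confusion counts in one pass, which is handy for the precision and recall figures computed later. A sketch with toy boolean lists; the function names `confusion_counts` and `scores` are illustrative.

```python
def confusion_counts(predicted, gold):
    """Return (tp, fp, fn, tn) for two equal-length boolean sequences."""
    tp = fp = fn = tn = 0
    for p, g in zip(predicted, gold):
        if p and g:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Return (precision, recall, accuracy), using 0.0 when a denominator is zero."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


predicted = [True, True, False, False, True]
gold = [True, False, True, False, False]
counts = confusion_counts(predicted, gold)
print(counts, scores(*counts))
```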

In [23]:
len(false_positives)

304

In [24]:
fp_stuff = [get_pp_attachment_ambiguity(parse) for parse in false_positives]

In [25]:
len(fp_stuff)

304

In [26]:
from collections import Counter

In [27]:
Counter([fp[1] for fp in fp_stuff])

Counter({'VNPN': 303, 'VPNPN': 1})

In [28]:
interesting = [false_positives[i] for i in range(len(false_positives)) if fp_stuff[i][1]=='VNPN']

In [29]:
len(interesting)

303

In [30]:
interesting[141]

'(S (NP (NP (CD One)) (PP (IN of) (NP (NP (DT the) (NNS images)) (PP (IN in) (NP (DT the) (NN pair)))))) (VP (VBZ shows) (NP (NP (DT a) (NN dog)) (PP (IN with) (NP (DT a) (NN tennis) (NN ball))))) (. .))'

In [31]:
moreinteresting = [false_positives[i] for i in range(len(false_positives)) if not fp_stuff[i][1]=='VNPN']

In [32]:
moreinteresting[0]

'(S (NP (NP (DT The) (NN device)) (PP (IN on) (NP (DT the) (NN right)))) (VP (VBZ is) (VP (VBN viewed) (ADVP (S (ADJP (NN head))) (HYPH -) (RP on)) (, ,) (SBAR (SBAR (IN while) (S (NP (NP (DT the) (CD one)) (PP (IN on) (NP (DT the) (NN left)))) (VP (VBZ is) (VP (VBN angled) (S (VP (VBG facing) (ADVP (RB rightward)))))))) (, ,) (CC and) (S (NP (DT both) (NNS devices)) (VP (VBP have) (S (NP (DT the) (NN screen)) (VP (VBD slid) (ADVP (RP up)) (S (VP (TO to) (VP (VB show) (NP (DT the) (NN keyboard)))))))))))) (. .))'

In [None]:
def search_parses(regex, parses):
    for parse in parses:
        if re.search(regex, parse):
            yield parse
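
A usage sketch for `search_parses`, repeated here so the block runs on its own. The two parses are copied from the outputs above. The regex is just an example query for `with`-clauses tagged as `SBAR`.

```python
import re


def search_parses(regex, parses):
    """Yield each parse string that the regex matches anywhere."""
    for parse in parses:
        if re.search(regex, parse):
            yield parse


parses = [
    "(S (NP (CD One) (NN dog)) (VP (VBZ is) (VP (VBG wearing) (NP (DT a) (NN leash)))) (. .))",
    "(S (NP (DT The) (JJ right) (NN image)) (VP (VBZ contains) (NP (NP (DT a) (JJ brown) (NN dog)) "
    "(SBAR (IN with) (S (NP (PRP$ its) (NN tongue)) (VP (VBN exposed)))))) (. .))",
]

# Parentheses are literal characters in a parse, so they must be escaped.
hits = list(search_parses(r"\(SBAR \(IN with\)", parses))
print(len(hits))
```

Because the function is a generator, wrap it in `list(...)` when a count or an index is needed.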

In [36]:
has_pp_attachment_ambiguity(moreinteresting[0])

NPNVNPNVNVNVN


True

In [37]:
len(my_predictions)

803

In [39]:
len([my for my in my_predictions if my])

360

In [46]:
len([their for their in their_predictions if their])

94

In [41]:
len([1 for (x,y) in zip(my_predictions, their_predictions) if x == y])

461

In [38]:
len(their_predictions)

803
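
The four totals above are enough to recover the full confusion matrix, because agreements = TP + TN, predicted positives = TP + FP, and annotated positives = TP + FN. A sketch of that arithmetic; the function name `confusion_from_totals` is illustrative.

```python
def confusion_from_totals(total, predicted_pos, gold_pos, agreements):
    """Recover (tp, fp, fn, tn) from aggregate counts."""
    disagreements = total - agreements                      # FP + FN
    doubled_tp = predicted_pos + gold_pos - disagreements   # 2 * TP
    if doubled_tp < 0 or doubled_tp % 2:
        raise ValueError("totals are inconsistent")
    tp = doubled_tp // 2
    return tp, predicted_pos - tp, gold_pos - tp, agreements - tp


# Totals recorded in the cells above.
print(confusion_from_totals(total=803, predicted_pos=360, gold_pos=94, agreements=461))
```

With these totals the result is TP = 56, FP = 304, FN = 38, TN = 405. The FP value agrees with `len(false_positives)`.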

In [42]:
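# Confusion counts, derived from the cells above
# (803 parses, 360 predicted ambiguous, 94 annotated ambiguous, 461 agreements, 304 false positives):
# TP = 360 - 304 = 56
# FP = 304
# FN = 94 - 56 = 38
# TN = 461 - 56 = 405

precision = 56/360
recall = 56/94
accuracy = 461/803

In [43]:
round(precision, 4)

0.1556

In [44]:
round(recall, 4)

0.5957

In [45]:
round(accuracy, 4)

0.5741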