This notebook is based largely on the prior work of Patrick Pantel and Dekang Lin (2000). 

The primary purpose here is to classify whether a given sentence has prepositional phrase (PP) attachment ambiguity. A syntactic parse of the sentence is required. 

The secondary purpose is to enumerate the candidate attachment sites for the sentences that are ambiguous. 

Once this work is completed, we can use it to analyze sentences with prepositional phrases, both with and without ambiguities, and look for interesting or distinguishing patterns or characteristics. Following Pantel and Lin's approach, we can use the distribution of words across the unambiguous cases to predict the attachments in the ambiguous cases in an unsupervised manner. 

Pantel and Lin define ambiguous cases of the form (V, N1, P, N2). 

They define unambiguous cases as those of the forms (N1, P, N2) and (V, P, N2). 
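
To make the idea concrete, here is a much-simplified sketch: count how often each (head, P, N2) triple occurs in the unambiguous tuples, then compare the verb count with the noun count for an ambiguous (V, N1, P, N2). This is not Pantel and Lin's actual algorithm (theirs relies on contextually similar words); the toy tuples and the name `guess_attachment` are made up for illustration.

```python
from collections import Counter

# Toy unambiguous tuples, invented for illustration only.
verb_attachments = [("ate", "with", "fork"), ("ate", "with", "spoon"), ("walked", "to", "shop")]
noun_attachments = [("salad", "with", "croutons"), ("soup", "with", "croutons")]

verb_counts = Counter(verb_attachments)   # counts of (V, P, N2)
noun_counts = Counter(noun_attachments)   # counts of (N1, P, N2)

def guess_attachment(v, n1, p, n2):
    """Return 'V', 'N', or None (no evidence either way) for an ambiguous (V, N1, P, N2)."""
    v_score = verb_counts[(v, p, n2)]
    n_score = noun_counts[(n1, p, n2)]
    if v_score == n_score:
        return None
    return "V" if v_score > n_score else "N"

print(guess_attachment("ate", "salad", "with", "fork"))      # V
print(guess_attachment("ate", "salad", "with", "croutons"))  # N
```

Exact triple counts are very sparse on real data, which is one reason a real system needs some form of smoothing or word similarity.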

I'm not sure I agree with this. What about... 

(P, N2, P, N3)? Based on the above, this is said to be unambiguous. I disagree. 
    sub (V, P, N2, P, N3)? 
    sub (N1, P, N2, P, N3)? 

I carried my golf clubs with new grips on graphite shafts with cavity-backed blades over my shoulder. 

I carried my golf clubs with new grips over my shoulder. 
I carried my golf clubs with new grips from the pro shop. 

I carried my golf bag with the new irons from the pro shop. 
I carried my golf bag across the 15th fairway to the pro shop. 
I carried my golf bag with my friend Jim across the 15th fairway. 
I carried my golf bag to the pro shop with my friend Jim. 

All of the form...

N, V, N, P, N, P, N

Bah! That contains (V, N, P, N), so it already matches the first rule. 

N, V, N, P, N, P, N, P, N

I'm calling that ambiguous. 

Hence the only unambiguous cases would be of the form... 

(V, P, N2) or maybe
(N, P, N, V)

The man with the horse offered me a job. 

These heuristics must be ranked: the first match fires, and we're done. 
(Is there a cooler way to say that?)
Disjunctive heuristics? 
Decision trees? 

(V, N, P, N) => ambiguous (Pantel and Lin)
(V, P, N) => unambiguous (Pantel and Lin again)
They say...
(N, P, N) => unambiguous

I say...

(V, P, N, P, N) => ambiguous (I made this!)
I walked with my golf bag to the clubhouse. 
I walked with my golf bag in a pullcart. 

(P, N, P, N, P, N) => ambiguous

^(N, P, N, V) => unambiguous (I made this!)
The man with the beard sells tacos. 

(P, N, P, N) => unambiguous? (variation of Pantel and Lin's N, P, N)
(P) => unknown (presumed unambiguous)
=> no P
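
An ordered set of rules where the first match fires is commonly called a decision list. The rules above could be sketched as one, assuming the sentence has already been reduced to a phrase-pattern string such as 'NVNPN'. The rule order, the name `classify_pattern`, and the use of `None` for "unknown" are my assumptions for illustration, not a settled design.

```python
import re

# Ordered (regex, label) rules; the first regex that matches wins.
# Ambiguous patterns come first so that, e.g., VPNPN is tried before VPN.
RULES = [
    (r'VNPN', True),
    (r'VPNPN', True),
    (r'PNPNPN', True),
    (r'^NPNV', False),
    (r'PNPN', False),
    (r'VPN', False),
    (r'NPN', False),
    (r'P', None),   # a preposition but no known pattern: unknown
]

def classify_pattern(phrase_pattern):
    """Return (label, rule) where label is True (ambiguous), False, or None (unknown)."""
    for regex, label in RULES:
        if re.search(regex, phrase_pattern):
            return label, regex
    return False, None  # no preposition at all

print(classify_pattern('NVNPN'))   # (True, 'VNPN')
print(classify_pattern('NVPNPN'))  # (True, 'VPNPN')
print(classify_pattern('NPNVN'))   # (False, '^NPNV')
```

Keeping the rules in a list makes the ranking explicit and easy to reorder when a new pattern turns up.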





In [2]:
import re
import os
import nltk
from itertools import groupby

In [1]:
pp_ambiguity_annotated_set = []
with open('../data/pp_ambiguity.txt') as ppap_in:
    for line in ppap_in:
        pp_ambiguity_annotated_set.append(line.strip())


In [3]:
import numpy as np

In [4]:
pp_ambiguity_annotated_set[1]

'The right image features a sparsely filled tall bookcase with square and rectangular compartments used as a divider in a room with a window on the right.'

In [5]:
len(pp_ambiguity_annotated_set)

92

In [6]:
np.random.seed(45678)
np.random.shuffle(pp_ambiguity_annotated_set)
random_sample = pp_ambiguity_annotated_set[:10]

In [7]:
random_sample

['In one image, a yurt sitting on post supports has an outdoor seating area on a deck with railing that encircles the yurt.',
 'In at least one image there is one man in a long sleeve blue shirt with black hair force feeding a duck through a tube.',
 'Reddish-pink illuminated letters are on a white background over the entrance of a store.',
 "One image shows multiple golf balls that aren't displayed in a vertically-stacked pyramid, and the other contains a single white ball with printing on it.",
 'there is a single fish with a white strip down its back and an opaque tail swimming near an anemone',
 'In one of the images there is one person in a book store holding  a book.',
 'An image features a drape hanging from a dark rod on at least one side of a row of white paned sliding glass doors.',
 'A marmot is standing on its hind legs behind a camera on a tripod in one image.',
 'Each image contains one non-standing spaniel, and one image features a spaniel with fur in shades of brown wit

In [8]:
with open('../data/random_sample_ppaa.txt','w') as smpout:
    for smp in random_sample:
        smpout.write('{}\n'.format(smp))
        

In [28]:
ex1 = "(S (NP (DT The) (JJ left) (NN image)) (VP (VBZ shows) (NP (NP (DT a) (NN staircase)) (SBAR (WHNP (WDT that)) (S (VP (VP (VBZ ascends) (PP (IN to) (NP (DT the) (NN left)))) (CC and) (VP (VBZ features) (NP (NP (DT a) (JJ wooden) (NN handrail)) (PP (PP (IN with) (NP (JJ black) (JJ wrought) (NN iron) (NNS bars))) (VP (VBN accented) (PP (IN by) (NP (NN scroll) (NNS shapes)))))))))))) (. .))"

In [31]:
def phrase_spitter(parse):
    '''
    Yield (phrase_type, head_word) pairs from a bracketed parse string,
    scanning left to right.

    Known limitation: an r7 match consumes the opening "(XP" of the nested
    phrase, so a phrase that is the second child of another phrase can be
    skipped (e.g. "PP to" never appears in the output for ex1).
    '''
    # r6: flat phrase whose children are all (TAG word) leaves; head = last leaf.
    r6 = r'\(([A-Z]+P)( \([A-Z]+ [A-Za-z]+\))+\)'
    # r7: phrase whose first child is a leaf and whose second child is a phrase;
    # head = that first leaf.
    r7 = r'\(([A-Z]+P) \([A-Z]+ ([A-Za-z]+)\) \([A-Z]+P'
    # Fallback: the last phrase before the sentence-final punctuation.
    re_last_phrase = r'\(([A-Z]+P) .* ([a-z]+)\)+ \(. .\)+$'
    s = parse
    while True:
        m6 = re.search(r6, s)
        m7 = re.search(r7, s)
        if m6 and (not m7 or m6.start() < m7.start()):
            # group(2) holds the last repeated leaf, e.g. " (NN image)".
            head_word = m6.group(2).split()[-1].split(')')[0]
            s = s[m6.end():].strip()
            yield (m6.group(1), head_word)
        elif m7:
            s = s[m7.end():].strip()
            yield (m7.group(1), m7.group(2))
        else:
            mlast = re.search(re_last_phrase, s)
            if mlast:
                yield (mlast.group(1), mlast.group(2))
            return


In [32]:
for (phrase_type, head_word) in phrase_spitter(ex1):
    print(phrase_type, head_word)

NP image
VP shows
NP staircase
WHNP that
VP ascends
NP left
VP features
NP handrail
PP with
VP accented
NP shapes


In [None]:
pat1 = re.compile(r'[VNP]P')

In [None]:
ex_parse = "(S (PP (IN In) (NP (QP (RB at) (RBS least) (CD one)) (NN image))) (NP (EX there)) (VP (VBZ is) (NP (NP (DT a) (JJ single) (NN tree)) (PP (IN with) (NP (JJ orange) (NNS flowers))) (PP (IN in) (NP (NP (NN front)) (PP (IN of) (NP (DT a) (NN church))))) (PP (IN with) (S (NP (DT the) (JJ open) (NN door)) (VP (VBG facing) (RB forward) (RB left)))))) (. .))"
print(ex_parse)

In [None]:
matches = re.findall(pat1,ex_parse)
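
One thing to watch: `[VNP]P` matches anywhere in the string, so it also fires inside labels such as ADVP (as VP), WHNP (as NP), and the POS tag NNP (as NP). A minimal sketch of a stricter alternative, which only accepts the labels VP, NP, and PP directly after an opening parenthesis, is below. The toy parse and the name `strict_label` are made up; whether WHNP and friends should count as N is a separate design choice.

```python
import re

# Match only the phrase labels VP, NP, PP right after "(" and before a space.
strict_label = re.compile(r'\((VP|NP|PP) ')

# Hand-written toy parse, for illustration only.
toy_parse = ("(S (NP (NNP Mary)) (ADVP (RB quickly)) "
             "(VP (VBD ate) (NP (DT a) (NN salad)) "
             "(PP (IN with) (NP (DT a) (NN fork)))) (. .))")

loose = ''.join(m[0] for m in re.findall(r'[VNP]P', toy_parse))
strict = ''.join(m[0] for m in strict_label.findall(toy_parse))

print(loose)   # NNVVNPN  (NNP and ADVP leak in)
print(strict)  # NVNPN
```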

In [None]:
phrase_pattern = ''.join([m[0] for m in matches])

In [None]:
phrase_pattern

In [None]:
# Collapse each run of N's to a single N
new_phrase_pattern = re.sub(r'N+', 'N', phrase_pattern)

In [None]:
new_phrase_pattern
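
When collapsing repeated labels, it is worth checking runs longer than two. The sketch below compares a split/join on 'NN' with `itertools.groupby` on a made-up pattern; `collapse_runs` is a hypothetical helper name.

```python
from itertools import groupby

def collapse_runs(pattern):
    """Collapse each run of repeated characters to a single character."""
    return ''.join(ch for ch, _ in groupby(pattern))

pattern = 'PNNNVNNPN'  # made-up pattern with a run of three N's

print('N'.join(pattern.split('NN')))  # PNNVNPN  (the run of three survives as NN)
print(collapse_runs(pattern))         # PNVNPN
```

Note that `groupby` collapses runs of every label (V and P as well as N), which is what the functions further down rely on.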

In [None]:
def has_pp_attachment_ambiguity(parse):
    '''
    This is intended to be a reference implementation of Pantel & Lin (2000). 
    
    (V, N1, P, N2) => ambiguous
    Otherwise => not ambiguous
    '''
    # step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step0 = parse # no modifications
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    # Removing consecutive duplicates, e.g. NN => N
    # https://stackoverflow.com/questions/18799036/python-best-way-to-remove-duplicate-character-from-string
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3
    print(phrase_pattern)
    # If no prep phrase, let's not waste time.
    if not re.search(r'P',phrase_pattern):
        if len(phrase_pattern)>0:
            print("No preps: {}".format(phrase_pattern))
            print(parse)
        return False
    
    # (V, N, P, N) => ambiguous (Pantel&Lin) 
    # Mary ate a salad with a fork. 
    # Mary ate a salad with croutons. 
    match_obj = re.search(r'VNPN', phrase_pattern)
    if match_obj:
        return True
    return False


In [None]:
def get_pp_attachment_ambiguity(parse):
    # https://stackoverflow.com/questions/18799036/python-best-way-to-remove-duplicate-character-from-string
    step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3
    
    # If no prep phrase, let's not waste time.
    if not re.search(r'P',phrase_pattern):
        if len(phrase_pattern)>0:
            print("No preps: {}".format(phrase_pattern))
            print(parse)
        return False, 'P'
    
    # (V, N, P, N) => ambiguous (Pantel&Lin) 
    # Mary ate a salad with a fork. 
    # Mary ate a salad with croutons. 
    match_obj = re.search(r'VNPN', phrase_pattern)
    if match_obj:
        return True, 'VNPN'
    #(V, P, N, P, N) => ambiguous (I made this!) 
    #I walked with my golf bag to the clubhouse. 
    #I walked with my golf bag in a pullcart.
    match_obj = re.search(r'VPNPN', phrase_pattern)
    if match_obj:
        return True, 'VPNPN'
    # (N, P, N, P, N) => ambiguous (I made this!)
    # In at least one image there is a single tree with orange flowers in front of a church with the open door facing forward left.
    # ... [tree with orange] flowers in front of a church ...
    # ... [tree with orange] flowers with five petals on thick branches... 
    # ([N, P, {N, P, N], P, N})
    match_obj = re.search(r'NPNPN', phrase_pattern)
    if match_obj:
        return True, 'NPNPN'
    # (N, V, N, V, P, N) => ambiguous (discovered empirically)
    # ... women wearing white bikinis standing next to the water. 
    # (simplification)
    # ... A woman is wearing a bikini cooking on a gas stove. 
    # ... A woman is holding a spatula cooking in a white bikini. 
    if re.search(r'NVNVPN', phrase_pattern):
        return True, 'NVNVPN'
    
    # We have now passed all the checks for ambiguity. 
    
    #^(N, P, N, V) => unambiguous (I made this!)
    # The man with the beard sells tacos. 
    match_obj = re.search(r'^NPNV', phrase_pattern)
    if match_obj:
        print("Pattern match (^NPNV=un): {}".format(phrase_pattern))
        print(parse)
        return False, '^NPNV'
    else:
        # ...in the image on the left...
        match_obj = re.search(r'PNPN', phrase_pattern)
        if match_obj:
            print("Pattern match (PNPN=un): {}".format(phrase_pattern))
            print(parse)
            return False, 'PNPN'
        else:
            print("New pattern: {}".format(phrase_pattern))
    print(parse)
    return False, phrase_pattern

In [None]:
def get_phrases_with_heads(parse):
    step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3


In [None]:
def get_pp_attachment_sites(parse):
    # https://stackoverflow.com/questions/18799036/python-best-way-to-remove-duplicate-character-from-string
    step0 = re.sub(r'SBAR \(IN', 'PP (IN', parse)
    step1 = [m[0] for m in re.findall(r'[VNP]P',step0)]
    step2 = ''.join(step1)
    step3 = ''.join(ch for ch, _ in groupby(step2))
    phrase_pattern = step3
    
    # If no prep phrase, let's not waste time.
    if not re.search(r'P',phrase_pattern):
        if len(phrase_pattern)>0:
            print("No preps: {}".format(phrase_pattern))
            print(parse)
        return False, 'P'
    
    # (V, N, P, N) => ambiguous (Pantel&Lin) 
    # Mary ate a salad with a fork. 
    # Mary ate a salad with croutons. 
    match_obj = re.search(r'VNPN', phrase_pattern)
    if match_obj:
        return True, 'VNPN'
    #(V, P, N, P, N) => ambiguous (I made this!) 
    #I walked with my golf bag to the clubhouse. 
    #I walked with my golf bag in a pullcart.
    match_obj = re.search(r'VPNPN', phrase_pattern)
    if match_obj:
        return True, 'VPNPN'
    # (N, P, N, P, N) => ambiguous (I made this!)
    # In at least one image there is a single tree with orange flowers in front of a church with the open door facing forward left.
    # ... [tree with orange] flowers in front of a church ...
    # ... [tree with orange] flowers with five petals on thick branches... 
    # ([N, P, {N, P, N], P, N})
    match_obj = re.search(r'NPNPN', phrase_pattern)
    if match_obj:
        return True, 'NPNPN'
    # (N, V, N, V, P, N) => ambiguous (discovered empirically)
    # ... women wearing white bikinis standing next to the water. 
    # (simplification)
    # ... A woman is wearing a bikini cooking on a gas stove. 
    # ... A woman is holding a spatula cooking in a white bikini. 
    if re.search(r'NVNVPN', phrase_pattern):
        return True, 'NVNVPN'
    
    # We have now passed all the checks for ambiguity. 
    
    #^(N, P, N, V) => unambiguous (I made this!)
    # The man with the beard sells tacos. 
    match_obj = re.search(r'^NPNV', phrase_pattern)
    if match_obj:
        print("Pattern match (^NPNV=un): {}".format(phrase_pattern))
        print(parse)
        return False, '^NPNV'
    else:
        # ...in the image on the left...
        match_obj = re.search(r'PNPN', phrase_pattern)
        if match_obj:
            print("Pattern match (PNPN=un): {}".format(phrase_pattern))
            print(parse)
            return False, 'PNPN'
        else:
            print("New pattern: {}".format(phrase_pattern))
    print(parse)
    return False, phrase_pattern
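
For the secondary goal of enumerating candidate attachment sites, one possible starting point is sketched below. It assumes the sentence is available as (phrase_type, head_word) pairs in sentence order, like the ones `phrase_spitter` yields. The heuristic (walk left from each PP, collect NP and VP heads, stop at the nearest VP) and the name `candidate_attachment_sites` are my own assumptions, not something taken from Pantel and Lin.

```python
def candidate_attachment_sites(phrases):
    """
    phrases: list of (phrase_type, head_word) pairs in sentence order.
    Returns a list of (preposition, [candidate head words]), nearest candidate first.
    """
    results = []
    for i, (ptype, head) in enumerate(phrases):
        if ptype != 'PP':
            continue
        candidates = []
        # Walk leftwards, collecting NP and VP heads, and stop at the nearest VP.
        for prev_type, prev_head in reversed(phrases[:i]):
            if prev_type in ('NP', 'VP'):
                candidates.append(prev_head)
            if prev_type == 'VP':
                break
        results.append((head, candidates))
    return results

# Hand-written pairs for "Mary ate a salad with a fork."
phrases = [('NP', 'Mary'), ('VP', 'ate'), ('NP', 'salad'), ('PP', 'with'), ('NP', 'fork')]
print(candidate_attachment_sites(phrases))
# [('with', ['salad', 'ate'])]
```

A PP with more than one candidate is a potential ambiguity; a PP with exactly one is not.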

In [None]:
parses = []
num_total = 0
num_found = 0
with open('../data/pp_ambiguity_parses.txt') as ppap_in:
    for line in ppap_in:
        num_total += 1
        if has_pp_attachment_ambiguity(line.strip()):
            num_found += 1

In [None]:
print(num_found, num_total)


In [None]:
all_parses = []
with open('../data/annotated_dev.parse') as ppap_in:
    for line in ppap_in:
        all_parses.append(line.strip())

In [None]:
ppaa_parses = []
with open('../data/pp_ambiguity.parse') as ppap_in:
    for line in ppap_in:
        ppaa_parses.append(line.strip())

In [None]:
pos_ambiguous_parses = [parse for parse in all_parses if parse in ppaa_parses]

In [None]:
len(pos_ambiguous_parses)

In [None]:
annotated_parses = [(parse, parse in ppaa_parses) for parse in all_parses]

In [None]:
my_predictions = [has_pp_attachment_ambiguity(parse) for parse in all_parses]

In [None]:
their_predictions = [p[1] for p in annotated_parses]

In [None]:
import numpy as np
import sklearn


In [None]:
mine = np.array([int(boo) for boo in my_predictions])
theirs = np.array([int(boo) for boo in their_predictions])
false_positives = [all_parses[i] for i in range(len(all_parses)) if mine[i]>theirs[i]]

In [None]:
len(false_positives)

In [None]:
fp_stuff = [get_pp_attachment_ambiguity(parse) for parse in false_positives]

In [None]:
len(fp_stuff)

In [None]:
from collections import Counter

In [None]:
Counter([fp[1] for fp in fp_stuff])

In [None]:
interesting = [false_positives[i] for i in range(len(false_positives)) if fp_stuff[i][1]=='VNPN']

In [None]:
len(interesting)

In [None]:
interesting[141]

In [None]:
moreinteresting = [false_positives[i] for i in range(len(false_positives)) if not fp_stuff[i][1]=='VNPN']

In [None]:
moreinteresting[0]

In [None]:
def search_parses(regex, parses):
    for parse in parses:
        if re.search(regex, parse):
            yield parse

In [None]:
has_pp_attachment_ambiguity(moreinteresting[0])

In [None]:
len(my_predictions)

In [None]:
len([my for my in my_predictions if my])

In [None]:
len([their for their in their_predictions if their])

In [None]:
len([1 for (x,y) in zip(my_predictions, their_predictions) if x == y])

In [None]:
len(their_predictions)

In [None]:
# TP = 54/92
# FP = 303/800
# TN = 

precision = 54/360
recall = 54/92
accuracy = 458/800

In [None]:
precision

In [None]:
recall

In [None]:
accuracy
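
Hand-entered counts are easy to get out of sync, so it may be safer to derive the metrics directly from the two boolean lists. A minimal sketch in plain Python follows; the helper names and the toy data are made up, and in this notebook the inputs would be `my_predictions` and `their_predictions`.

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, fn, tn) for two equal-length lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

def precision_recall_accuracy(predicted, actual):
    """Return (precision, recall, accuracy), using 0.0 where a denominator is zero."""
    tp, fp, fn, tn = confusion_counts(predicted, actual)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Toy data for illustration only.
pred = [True, True, False, False, True]
gold = [True, False, False, True, True]
print(confusion_counts(pred, gold))  # (2, 1, 1, 1)
print(precision_recall_accuracy(pred, gold))
```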