In [1]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 9
</center>
</h1>
<div class=h1_cell>
<p>
Last week we explored how to pull relation triples out of a sentence. This week, let's see if we can do something with those triples.
<p>
Your goal is to store the relations you extract from sentences in a "knowledge base". My first thought was to use a pandas dataframe as the knowledge base: store one relation per row, with 3 columns. But I don't think that is a good idea. Two of the three values are NP subtrees. The subtrees can have structure of their own, e.g., more than one leaf node. I don't see how to easily store a subtree in a dataframe.
<p>
Maybe the easiest approach is to implement the knowledge base as just a list of relations, where each relation is a triple of (NP, verb, NP).
<p>
Before we get going, here is some material from last week as reference.
</div>
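
<div class=h1_cell>
<p>
As a rough sketch of the list idea, here is what such a knowledge base could look like. To keep it runnable without nltk, each noun-phrase is a plain list of (word, pos) tuples standing in for an NP subtree. That stand-in and the names toy_kb and phrase_words are assumptions made for illustration, not part of the assignment.
</div>

```python
# Toy knowledge base: a list of (NP, verb, NP) triples.
# Each NP is a plain list of (word, pos) tuples standing in for an nltk Tree
# (an assumption for this sketch; the real notebook stores Tree objects).
toy_kb = [
    ([('Victor', 'NNP'), ('Frankenstein', 'NNP')],
     ('builds', 'VBZ'),
     [('the', 'DT'), ('creature', 'NN')]),
    (),  # a sentence that produced no relation
    ([('the', 'DT'), ('creature', 'NN')],
     ('rescues', 'VBZ'),
     [('a', 'DT'), ('peasant', 'JJ'), ('girl', 'NN')]),
]


def phrase_words(np_leaves):
    """Return just the words from a list of (word, pos) tuples."""
    return [word for word, pos in np_leaves]


for relation in toy_kb:
    if not relation:
        continue  # skip sentences with no extracted relation
    subject, verb, obj = relation
    print(phrase_words(subject), verb[0], phrase_words(obj))
```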

In [2]:
sentences = [
    'Victor Frankenstein builds the creature in his laboratory',

    'The creature is 8 feet tall',  # tricky

    'the monster wanders through the wilderness',  # tricky

    'He finds brief solace beside a remote cottage inhabited by a family of peasants',

    'Eavesdropping, the creature familiarizes himself with their lives and learns to speak',  # tricky

    "The creature eventually introduces himself to the family's blind father",

    'the creature rescues a peasant girl from a river.',

    "He finds Frankenstein's journal in the pocket of the jacket he found in the laboratory",

    "The monster kills Victor's younger brother William upon learning of the boy's relation to his hated creator.",

    "Frankenstein builds a female creature.",

    "the monster kills Frankenstein's best friend Henry Clerva.",

    "the monster boards the ship.",

    "The monster has also been analogized to an oppressed class",

    "the monster is the tragic result of uncontrolled technology."
    ]

In [3]:
import nltk
from nltk.tree import Tree

In [119]:
def build_relation(text, chunker):
    
    #chunk the text with chunker
    chunks = chunker.parse(nltk.pos_tag(nltk.word_tokenize(text)))
    
    #Now re-chunk looking for our triples. Call the chunk REL for relation
    chunker2 = nltk.RegexpParser(r'''
                   REL:
                   {<NP><VBZ><NP>}
                   ''')
    relation_chunk = chunker2.parse(chunks)
    
    # Return the first REL chunk found as a (NP, verb, NP) triple
    for t in relation_chunk:
        if not isinstance(t, Tree):
            continue
        if t.label() == 'REL':
            return (t[0], t[1], t[2])

    return ()  # no relation found

In [15]:
rel_chunker2 = nltk.RegexpParser(r'''
    NP:
    {<DT>?<JJ>*<NN>} # chunk determiner (optional), adjectives (optional) and noun
    {<NNP>+} # chunk sequences of proper nouns
   ''')

In [16]:
build_relation(sentences[0], rel_chunker2)

(S
  (NP Victor/NNP Frankenstein/NNP)
  builds/VBZ
  (NP the/DT creature/NN)
  in/IN
  his/PRP$
  (NP laboratory/NN))


(Tree('NP', [('Victor', 'NNP'), ('Frankenstein', 'NNP')]),
 ('builds', 'VBZ'),
 Tree('NP', [('the', 'DT'), ('creature', 'NN')]))

<h2>
All the sentences
</h2>
<div class=h1_cell>
<p>
Let's see how many sentences we can pull relations from. I am showing results prior to your assignment. I assume you are now seeing fewer empty tuples, i.e., you are matching more sentences.
</div>

In [120]:
all_relations = []  # will be our knowledge base
for s in sentences:
    relation = build_relation(s, rel_chunker2)
    all_relations.append(relation)
    print(relation)
    print('===============')

(Tree('NP', [('Victor', 'NNP'), ('Frankenstein', 'NNP')]), ('builds', 'VBZ'), Tree('NP', [('the', 'DT'), ('creature', 'NN')]))
()
()
()
()
()
(Tree('NP', [('the', 'DT'), ('creature', 'NN')]), ('rescues', 'VBZ'), Tree('NP', [('a', 'DT'), ('peasant', 'JJ'), ('girl', 'NN')]))
()
(Tree('NP', [('The', 'DT'), ('monster', 'NN')]), ('kills', 'VBZ'), Tree('NP', [('Victor', 'NNP')]))
(Tree('NP', [('Frankenstein', 'NNP')]), ('builds', 'VBZ'), Tree('NP', [('a', 'DT'), ('female', 'JJ'), ('creature', 'NN')]))
(Tree('NP', [('the', 'DT'), ('monster', 'NN')]), ('kills', 'VBZ'), Tree('NP', [('Frankenstein', 'NNP')]))
()
()
(Tree('NP', [('the', 'DT'), ('monster', 'NN')]), ('is', 'VBZ'), Tree('NP', [('the', 'DT'), ('tragic', 'JJ'), ('result', 'NN')]))


<h2>
Challenge 1
</h2>
<div class=h1_cell>
<p>
The goal will be to write a lookup query that looks like "Show me who built things." or "Who did the monster kill?"
Your first thought might be that this is straightforward. Just match "built" or "kill" to the verb in each relation using `==`. But the actual verbs are "builds" and "kills", so they won't literally match.
<p>
There is something we can use to help. It is called a lemmatizer and nltk has one (actually several). The general idea is that we pass in any form of a verb, like "builds" or "built", and it will return the base form "build". Let's check it out.
<p>
BTW: WordNet is kind of interesting. It is a large lexical database of English words. It is separate from nltk. However, nltk has a wrapper for it so we can use it as below.
<p>
BTW2: for the spelling police out there, see this:
<pre>
builded. Verb. (archaic or childish, nonstandard) simple past tense and past participle of build.
</pre>
</div>

In [None]:
from nltk.stem import WordNetLemmatizer  # lemmatizer backed by the WordNet database
lemmatizer = WordNetLemmatizer()  # one of the varieties to choose from in nltk

In [340]:
print(lemmatizer.lemmatize("build", pos="v"))
print(lemmatizer.lemmatize("builds", pos="v"))
print(lemmatizer.lemmatize("built", pos="v"))
print(lemmatizer.lemmatize("builded", pos="v"))  # archaic but ok
print(lemmatizer.lemmatize("builted", pos="v"))  # bogus

build
build
build
build
builted


<div class=h1_cell>
<p>
Here are a few more. Words are treated as nouns if no pos parameter is given.
</div>

In [343]:

print(lemmatizer.lemmatize("cats"))  # default to n or noun
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))  # a is adjective
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("ran"))
print(lemmatizer.lemmatize("ran",'v'))

cat
cactus
goose
rock
python
good
best
ran
run


<div class=h1_cell>
<p>
I think we are in business. If we are trying to match one form of the same verb against another, we can lemmatize both of them first then use `==`. We are almost ready to define a function, verb_match, that takes a verb we are trying to match and a relation we are matching against. It returns True if we get a match after lemmatization. But before that, let's look at a relation in a bit more detail.
</div>
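
<div class=h1_cell>
<p>
Here is a minimal sketch of the "lemmatize both, then compare" idea on two plain words. So that it runs without the WordNet data, a small hand-made table stands in for the real lemmatizer. TOY_VERB_LEMMAS, toy_lemmatize and same_lemma are hypothetical names for this illustration only. It is not verb_match itself, which still has to dig the verb out of a relation.
</div>

```python
# Hand-made stand-in for WordNetLemmatizer().lemmatize(word, pos="v").
# The table is an assumption for illustration; it only knows a few verbs.
TOY_VERB_LEMMAS = {
    'builds': 'build', 'built': 'build', 'builded': 'build',
    'kills': 'kill', 'killed': 'kill',
}


def toy_lemmatize(word):
    """Return the base form if the toy table knows it, else the word unchanged."""
    return TOY_VERB_LEMMAS.get(word, word)


def same_lemma(word1, word2, lemmatize=toy_lemmatize):
    """True if both words reduce to the same lemma (case-insensitive)."""
    return lemmatize(word1.lower()) == lemmatize(word2.lower())


print(same_lemma('built', 'builds'))   # True
print(same_lemma('build', 'builds'))   # True
print(same_lemma('builts', 'builds'))  # False: 'builts' is not a known form
```

<div class=h1_cell>
<p>
To try it with the real lemmatizer defined above, one option is to pass lemmatize=lambda w: lemmatizer.lemmatize(w, pos="v").
</div>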

In [347]:
s0 = all_relations[0]  # first relation in our knowledge base
print(s0)

(Tree('NP', [('Victor', 'NNP'), ('Frankenstein', 'NNP')]), ('builds', 'VBZ'), Tree('NP', [('the', 'DT'), ('creature', 'NN')]))


In [348]:
for item in s0:
    print((item, type(item)))

(Tree('NP', [('Victor', 'NNP'), ('Frankenstein', 'NNP')]), <class 'nltk.tree.Tree'>)
(('builds', 'VBZ'), <type 'tuple'>)
(Tree('NP', [('the', 'DT'), ('creature', 'NN')]), <class 'nltk.tree.Tree'>)


<div class=h1_cell>
You can see that the relation is a triple of (Tree, tuple, Tree). Since we are only focusing on the verb, we don't have to deal with Tree objects yet. That will come when we want to match against noun-phrases (1st and 3rd components of the triple).
<p>
We can see that the verb is a tuple of the actual verb followed by its pos tag, as seen in the output above. With that info, you should be ready to define the function.
</div>

In [345]:
def verb_match(verb_word, relation):
    """Exercise stub: return True if verb_word and the verb in relation have the same lemma."""



In [346]:
print(verb_match('built', s0))
print(verb_match('build', s0))
print(verb_match('builds', s0))
print(verb_match('builts', s0))

True
True
True
False


<h2>
Challenge 2
</h2>
<div class=h1_cell>
<p>
Cool. We have verb matching under control. Now for matching a noun-phrase. A noun-phrase as we have defined it is a Tree object with one or more leaves. A leaf is a tuple of word followed by pos. Before doing anything else, let's define a helper function that will return a list of the words on the leaves.
</div>

In [356]:
def np_to_word_list(np_tree):
    leaves = np_tree.leaves()  # Tree method
    return [tup[0] for tup in leaves]

<div class=h1_cell>
<p>
You can see I am using a method leaves() that is defined by the Tree class. It will give us a list of tuples. I then use a list comprehension to pull out the words.
</div>

In [358]:
np1 = s0[0]  # first noun-phrase
print((np1, type(np1)))
np_to_word_list(np1)

(Tree('NP', [('Victor', 'NNP'), ('Frankenstein', 'NNP')]), <class 'nltk.tree.Tree'>)


['Victor', 'Frankenstein']

In [352]:
np2 = s0[2]  # second noun-phrase
np_to_word_list(np2)

['the', 'creature']

<h2>
Matching strategy
</h2>
<div class=h1_cell>
<p>
We know we can get a list of words from a noun-phrase. We could easily check for a single word match by using the `in` operator, e.g., 'a' in ['b', 'a', 'c'] returns True. I'd like something a bit more sophisticated. I would like the match words to also be a list. So we are attempting to match a list of words against another list of words. How does this work? Let's call the two word lists target-words and np-words. I would like you to go through each word in target-words, one by one, and find a match in np-words. The tricky part is I would like you to remember where the match occurred in np-words and start the search for the next word just after that point. So the target words must appear in np-words in the same order, though not necessarily next to each other. Here are some examples. The first list is target-words and the second is np-words.
<pre>

['a', 'b', 'c'] and ['d', 'a', 'b', 'r', 'c', 'f'] match.

['a', 'b', 'c'] and ['d', 'a', 'c', 'r', 'b', 'f'] no match.

['a', 'b', 'c'] and ['d', 'a', 'b', 'r', 'b', 'f'] no match.

[] and ['d', 'a', 'b', 'r', 'b', 'f'] wildcard match.

</pre>

Also see the example calls below the function definition.
<p>
BTW: I broke out single word matching into a separate function. I did so to make it easier to do more sophisticated matching in the future.
</div>
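
<div class=h1_cell>
<p>
One possible reading of the strategy above, sketched on plain word lists rather than Tree objects. The name ordered_match is hypothetical, and the comparison here is a simple lowercase equality check. Treat it as a sketch of the bookkeeping (remembering where the last match happened), not as the required np_match.
</div>

```python
def ordered_match(target_words, np_words):
    """True if every target word is found in np_words, in order.

    After a word is matched at position i, the search for the next
    target word starts at position i + 1. An empty target list
    matches anything (wildcard).
    """
    start = 0  # where the next search begins in np_words
    for target in target_words:
        for i in range(start, len(np_words)):
            if np_words[i].lower() == target.lower():
                start = i + 1  # remember where we matched
                break
        else:
            return False  # ran off the end without finding target
    return True


print(ordered_match(['a', 'b', 'c'], ['d', 'a', 'b', 'r', 'c', 'f']))  # True
print(ordered_match(['a', 'b', 'c'], ['d', 'a', 'c', 'r', 'b', 'f']))  # False
print(ordered_match(['a', 'b', 'c'], ['d', 'a', 'b', 'r', 'b', 'f']))  # False
print(ordered_match([], ['d', 'a', 'b', 'r', 'b', 'f']))               # True
```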

In [234]:
def np_word_match(word1, word2):
    word1 = word1.lower()
    word2 = word2.lower()
    
    #literally equal
    return word1 == word2

In [235]:
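def np_match(np_tree, target_word_list):
    """Exercise stub: return True if the target words are found, in order, among the words of np_tree; an empty list matches anything."""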


In [236]:
np_match(s0[0], ['victor'])

True

In [237]:
np_match(s0[0], ['victor', 'frankenstein'])

True

In [238]:
np_match(s0[0], [ 'frankenstein', 'victor'])

False

In [239]:
np_match(s0[0], ['victor', 'victor', 'frankenstein'])

False

In [240]:
np_match(s0[2], ['the', 'creature'])

True

In [368]:
np_match(s0[2], [])  # empty list is wildcard

True

<h2>
Should we lemmatize matching?
</h2>
<div class=h1_cell>
<p>
At the moment, np_word_match is matching words literally. But we saw that for verbs, lemmatization helped us be less strict and match different forms of the same verb. How would that work with words in noun phrases? Here is an example:
<pre>
(Tree('NP', [('Frankenstein', 'NNP')]), ('builds', 'VBZ'), Tree('NP', [('a', 'DT'), ('female', 'JJ'), ('creature', 'NN')]))
</pre>
<p>
It sounds reasonable to me to match 'woman' with 'female'. Will the lemmatizer give us this?
</div>

In [245]:
lemmatizer.lemmatize("woman",'n')

'woman'

In [246]:
lemmatizer.lemmatize("female",'n')

'female'

<div class=h1_cell>
<p>
Nope. I think we are going to have to try something else. Let's consider a thesaurus-based approach. We can get the synonyms of a word and check against those. So if we are trying to match word1 against word2, we could also match word1 against the synonyms of word2 and vice versa. Does nltk give us a thesaurus to use? Yes. More accurately, it gives us access to WordNet, which can also serve as a thesaurus. Here is a function that will return the synonyms of a word using WordNet.
</div>

In [None]:
from nltk.corpus import wordnet

In [None]:
def get_syns(word):
    synonyms = []
    for syn in wordnet.synsets(word):
        for lem in syn.lemmas():
            synonyms.append(lem.name())
    return list(set(synonyms))

In [247]:
get_syns('female')

[u'distaff', u'female_person', u'female']

In [250]:
get_syns('woman')

[u'cleaning_woman',
 u'woman',
 u'womanhood',
 u'fair_sex',
 u'cleaning_lady',
 u'char',
 u'adult_female',
 u'charwoman']

<div class=h1_cell>
<p>
Uh. A little on the sexist side if you ask me. And it does not give us what we want, a match between 'female' and 'woman': 'female' does not appear in the synonyms for 'woman', nor vice versa. Let's check some others.
</div>

In [251]:
get_syns('monster')

[u'ogre',
 u'giant',
 u'devil',
 u'freak',
 u'monster',
 u'behemoth',
 u'teras',
 u'colossus',
 u'demon',
 u'fiend',
 u'monstrosity',
 u'lusus_naturae',
 u'goliath']

In [252]:
get_syns('creature')

[u'animate_being',
 u'tool',
 u'brute',
 u'beast',
 u'wight',
 u'puppet',
 u'animal',
 u'fauna',
 u'creature']

<div class=h1_cell>
<p>
Still no luck: 'monster' and 'creature' do not show up in each other's lists. But looking at some of the synonyms, it does open the door to matching 'monster' with useful synonyms, and the same for 'creature'.
</div>

In [254]:
'ogre' in get_syns('monster')

True

In [256]:
'brute' in get_syns('creature')

True

<h2>
Challenge 3
</h2>
<div class=h1_cell>
<p>
Go ahead and modify np_word_match to include a match against synonym lists. As before, return True on a literal match. But also return True if word1 is in the synonyms of word2 or vice versa.
<p>
My guess is you only need to check against one synonym list because of symmetry of synonyms. In particular, my hypothesis is that if you don't find word1 in synonyms of word2, you won't find word2 in synonyms of word1. But I have not had a chance to verify this so check against both lists for now.
</div>
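
<div class=h1_cell>
<p>
To make the two-way check concrete, here is a small sketch that runs without WordNet. TOY_THESAURUS copies a few of the synonym lists printed above, and toy_syns stands in for get_syns. Both names, along with words_match, are hypothetical and only for illustration.
</div>

```python
# A few synonym lists copied from the get_syns output shown earlier.
# This dictionary is a stand-in for WordNet, assumed for this sketch only.
TOY_THESAURUS = {
    'creature': ['animate_being', 'tool', 'brute', 'beast', 'wight',
                 'puppet', 'animal', 'fauna', 'creature'],
    'female': ['distaff', 'female_person', 'female'],
    'woman': ['cleaning_woman', 'woman', 'womanhood', 'fair_sex',
              'cleaning_lady', 'char', 'adult_female', 'charwoman'],
}


def toy_syns(word):
    """Return the toy synonym list for word, or an empty list if unknown."""
    return TOY_THESAURUS.get(word, [])


def words_match(word1, word2, syns=toy_syns):
    """True on a literal match, or if either word is a synonym of the other."""
    word1 = word1.lower()
    word2 = word2.lower()
    if word1 == word2:
        return True
    # Check both directions, since symmetry has not been verified.
    return word1 in syns(word2) or word2 in syns(word1)


print(words_match('brute', 'creature'))  # True: 'brute' is listed under 'creature'
print(words_match('creature', 'brute'))  # True: found via the reverse check
print(words_match('female', 'woman'))    # False: neither list contains the other
```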

In [259]:
#improved version

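def np_word_match(word1, word2):
    """Exercise stub: return True on a literal match, or if either word is in the other's synonym list."""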



In [260]:
np_match(s0[2], ['brute'])

True

<h2>
Challenge 4
</h2>
<div class=h1_cell>
<p>
Ok, now we have some helper functions defined and we can get to the cool stuff. I want to treat our collection of relations as a kind of database (I'll also sometimes use the more highfalutin term *knowledge base*). What can you do with a database? You can query it. I'd like you to build the function `who` to get us started. The function will take 3 arguments: (1) the verb to match on, (2) a list of words to match against the 2nd noun-phrase, and (3) the relation to check. It returns the 1st noun-phrase as a string if both match, and None otherwise.
</div>

In [264]:
#One more helper function if you need it
def np_to_string(np_tree):
    words = np_to_word_list(np_tree)
    return ' '.join(words)

In [301]:
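def who(verb, target_words, relation):
    """Exercise stub: return the 1st noun-phrase as a string if the verb and the 2nd noun-phrase match, else None."""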

    

In [364]:
for rel in all_relations:
    print(who('built', ['the', 'creature'], rel))

Victor Frankenstein
None
None
None
None
None
None
None
None
None
None
None
None
None


In [365]:
for rel in all_relations:
    print(who('rescued', ['girl'], rel))

None
None
None
None
None
None
the creature
None
None
None
None
None
None
None


In [366]:
for rel in all_relations:
    print(who('was', ['tragic'], rel))

None
None
None
None
None
None
None
None
None
None
None
None
None
the monster


In [367]:
for rel in all_relations:
    print(who('killed', [], rel))  # use of wildcard: Who killed anything?

None
None
None
None
None
None
None
None
The monster
None
the monster
None
None
None


<div class=h1_cell>
<p>
I'm going to package up the for loop into a function. I'll return a list of answers.
</div>

In [369]:
def search_for_who(verb, target_words, kb):
    who_dunit = []
    for rel in kb:
        if not rel: continue
        p = who(verb, target_words, rel)
        if p: who_dunit.append(p)
    return who_dunit

In [315]:
search_for_who('built', ['the', 'creature'], all_relations)

['Victor Frankenstein']

In [316]:
search_for_who('built', ['the', 'brute'], all_relations)

['Victor Frankenstein']

In [317]:
search_for_who('rescued', ['girl'], all_relations)

['the creature']

In [318]:
search_for_who('killed', ['frankenstein'], all_relations)

['the monster']

In [319]:
search_for_who('was', ['tragic'], all_relations)

['the monster']

In [370]:
search_for_who('killed', [], all_relations)

['The monster', 'the monster']

In [320]:
search_for_who('built', ['the', 'monster'], all_relations)  #seems like it should match but does not

[]

<h2>
Challenge 5
</h2>
<div class=h1_cell>
<p>
Pretty dang cool if you ask me. Let's do another. Define `what_done_by` that only takes 2 arguments: (1) the list of target words to match against the first noun-phrase and (2) the relation. See my example results below.
</div>

In [327]:
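def what_done_by(target_words, relation):
    """Exercise stub: return the verb and 2nd noun-phrase as a string if target_words match the 1st noun-phrase, else None."""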


In [329]:
def search_for_what_done_by(target_words, kb):
    what_done = []
    for rel in kb:
        if not rel: continue
        p = what_done_by(target_words, rel)
        if p: what_done.append(p)
    return what_done

In [331]:
search_for_what_done_by(['victor'], all_relations)

['builds the creature']

In [332]:
search_for_what_done_by(['monster'], all_relations)

['kills Victor', 'kills Frankenstein', 'is the tragic result']

In [371]:
search_for_what_done_by([], all_relations)  # wildcard

['builds the creature',
 'rescues a peasant girl',
 'kills Victor',
 'builds a female creature',
 'kills Frankenstein',
 'is the tragic result']

<h2>
Challenge 6
</h2>
<div class=h1_cell>
<p>
Last one. Define a function `what_happened_to` that takes target words to match against the 2nd noun-phrase.
</div>

In [334]:
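def what_happened_to(target_words, relation):
    """Exercise stub: return the whole relation as a string if target_words match the 2nd noun-phrase, else None."""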


In [335]:
def search_for_what_happened_to(target_words, kb):
    what_done_to = []
    for rel in kb:
        if not rel: continue
        p = what_happened_to(target_words, rel)
        if p: what_done_to.append(p)
    return what_done_to

In [336]:
search_for_what_happened_to(['creature'], all_relations)

['Victor Frankenstein builds the creature',
 'Frankenstein builds a female creature']

In [337]:
search_for_what_happened_to(['brute'], all_relations)

['Victor Frankenstein builds the creature',
 'Frankenstein builds a female creature']

In [376]:
search_for_what_happened_to(['tyke'], all_relations)

['the creature rescues a peasant girl']

In [377]:
search_for_what_happened_to([], all_relations)

['Victor Frankenstein builds the creature',
 'the creature rescues a peasant girl',
 'The monster kills Victor',
 'Frankenstein builds a female creature',
 'the monster kills Frankenstein',
 'the monster is the tragic result']

<h2>
Closing Notes
</h2>
<div class=h1_cell>
<p>
One next step would be to build something closer to an SQL language for querying. Then map that language to our functions.
<p>
Another step would be to look for contradictions, e.g., "X killed Y", "Y killed X". Or "X is 8 feet tall", "X is 3 feet tall". One of our PhD students just finished a study like this for medical papers. He tried to find contradictions in different authors' findings. And he did! He wrote to the authors and pointed out the contradictions. You might even be able to use it to detect fake news. If (a big if) you had a set of relations that you knew were true, you could search the web for text (e.g., tweets, blogs) that contradicts what you know is true.
<p>
A never-ending next step is to improve pattern-matching to extract relations. Deal with the convoluted way English sentences can be written.
</div>
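
<div class=h1_cell>
<p>
As a starting point for the contradiction idea, here is a rough sketch that looks for the "X killed Y" / "Y killed X" pattern. It assumes the relations have already been reduced to simple (subject, verb lemma, object) string triples, which is a simplification of the Tree-based triples used in this module. The name find_reversed_pairs and the toy data are made up for illustration. Note that a reversed pair is only a candidate: "X meets Y" and "Y meets X" are perfectly consistent.
</div>

```python
def find_reversed_pairs(triples):
    """Return pairs of triples where (X, verb, Y) and (Y, verb, X) both appear."""
    seen = set()
    pairs = []
    for subj, verb, obj in triples:
        reversed_triple = (obj, verb, subj)
        if reversed_triple in seen:
            pairs.append((reversed_triple, (subj, verb, obj)))
        seen.add((subj, verb, obj))
    return pairs


# Toy data, invented for this sketch (not extracted from the sentences above).
toy_triples = [
    ('monster', 'kill', 'william'),
    ('frankenstein', 'build', 'creature'),
    ('william', 'kill', 'monster'),
]

for first, second in find_reversed_pairs(toy_triples):
    print(first, 'vs', second)
```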