### LX 496/796  Introduction -- Parsing and context free grammars

### Due Monday at 11:59 PM in Gradescope

In this homework, you will become familiar with creating context free grammars in NLTK, and with parsing, drawing, and traversing syntactic trees.

I'm not going to bother with the autograder; it wasn't behaving last week anyway.  I'll give you tests to check your work instead.

# Getting started with a context-free grammar

We'll now use NLTK to do a little bit of actual theoretical linguistics.
This is at least partly based on chapter 8 of the NLTK book.

As a first step, we're going to create a context-free grammar to play with.

First, the standard `import nltk` part.

In [None]:
import nltk

Second, install `svgling`. This is a package written by Kyle Rawlins (JHU) designed to allow NLTK to draw trees properly within Colab/Jupyter notebooks (among other things). It lives here: https://github.com/rawlins/svgling .

If you don't install this, NLTK would by default want to open a new window to draw trees in.  That's not something Colab can do, and so you get an error.  Since we like trees better than we like errors, we will install this.

In [None]:
# install svgling for drawing trees with NLTK in ipynb files
!pip install svgling
import svgling
# the following will allow us to create a list or row of trees and then draw them, possibly captioned
from svgling.figure import RowByRow, SideBySide, Caption

And now, we will make a trivial context free grammar just to ensure that the tech is working.  It will only be able to parse the sentence "I left" and will do so by saying that sentences are made of a NP, "I", and a VP, "left".  We'll define it, draw a tree with it, and then move on to more elaborate grammars.

The command below defines a string containing three lines, each line with a rule of our context free grammar.

> The triple-quotation-mark syntax allows you to have a string that spans multiple lines.

In [None]:
gramileft = """
S -> NP VP
NP -> 'I'
VP -> 'left'
"""

Having defined the string, we feed it to NLTK to turn it into a proper Context Free Grammar object.

In [None]:
cfgileft = nltk.CFG.fromstring(gramileft)

> *Convention*: In this notebook, we will designate grammars with names like `ileft` and things related to those grammars with prefixes.  So:
>
> - `gram...` (e.g. `gramileft`) is the string that specifies a context free grammar.
> - `cfg...` (e.g. `cfgileft`) is the context free grammar object that NLTK creates from them.
> - `parser...` (e.g. `parserileft`) is a parser created to handle a given context-free grammar.
> - `gen...` (e.g. `genileft`) is a parse-generator created by the parser for a given sentence.
>
> We will continually define and redefine `raw` (raw text), `sent` (tokenized text), `parses` (parses generated by a parse-generator), `trees` (trees, representing parses).

We can see that it worked by telling this grammar object to list the things it can produce.  It should give you a list of the rules back.

In [None]:
cfgileft.productions()

Now, let's define the sentence we would like to have our grammar parse.  We will put this sentence in the variable `raw` (for raw text).  Then, we will define `sent` (sentence) as being the list of words in our raw text, by using `.split()` to split on "whitespace" and collect the results in a list.  This process of splitting a text into words like this is called "tokenizing."
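
As a small sketch of what `.split()` does and does not do (the example strings here are made up for illustration): with no argument it splits on any run of whitespace, but it leaves punctuation attached to the neighboring word.

```python
# .split() with no argument splits on runs of whitespace (spaces, tabs, newlines)
raw_spaced = "I   left"
print(raw_spaced.split())

# punctuation is not separated from the word it is attached to,
# so the token "left." would not match a grammar rule that mentions 'left'
raw_punct = "I left."
print(raw_punct.split())
```

This is why the sentences in this notebook are written without final periods.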

In [None]:
raw = "I left"
sent = raw.split()
print(sent)

We now ask NLTK to make a parser just for our little toy grammar.  The parser type will be a recursive descent parser, tailored to parse based on the `cfgileft` grammar.  We will call our parser `parserileft`.

In [None]:
parserileft = nltk.RecursiveDescentParser(cfgileft)

Now that we have our parser, we can tell it to parse something, specifically `sent`.  What we get back (and assign the name `genileft` to) is a "generator", which is an object that generates the next tree until we are out of possible parses.

A generator isn't a list of parses, it's a procedure that can produce a list of parses.  One way to make it produce the list is to iterate through it, another is to coerce it into a list by using `list()`.
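
To make the "procedure, not a list" idea concrete, here is a minimal sketch with a plain Python generator that has nothing to do with NLTK.  The function `count_up` is hypothetical, made up just for this illustration.

```python
def count_up(limit):
    """A made-up generator: produces 1, 2, ..., limit, one value at a time."""
    for n in range(1, limit + 1):
        yield n

gen = count_up(3)
print(gen)            # a generator object, not yet any numbers

first = next(gen)     # ask the generator for just one value
rest = list(gen)      # coerce whatever is left into a list
print(first, rest)
```

The parse-generator that NLTK gives us behaves the same way, except that what it produces are trees rather than numbers.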

In [None]:
genileft = parserileft.parse(sent)
print(genileft) # reveals that this is a parse-generator, not yet actual parses

We will use the generator by iterating through it.  Below `for t in genileft` will iterate through the parses this generator can find, and will `print()` each parse in turn.

Note that when you use a generator, it gets "consumed."  More technically, the generator knows how to move to the next thing in the list of things it is going to generate, but it does not know how to go back to a previous thing.  So, it can only move forward, and so if you ask it to keep generating more trees until it runs out, it will then be out, and asking it again will not yield anything.  The little code vignette below shows this.

In [None]:
print('Defining genileft as a parse-generator for sent.')
genileft = parserileft.parse(sent)
# first time through we get some trees
print('parse(s) from genileft, first time used:')
for t in genileft:
    print(t)
# second time "through" we get nothing because we had already reached the end of the parses
print('parse(s) from genileft, second time used:')
for t in genileft:
    print(t)
# "third" time through after having redefined the generator, gives us the trees again
print('Defining genileft as a parse-generator for sent.')
genileft = parserileft.parse(sent)
print('parse(s) from genileft, third time used after redefining it:')
for t in genileft:
    print(t)

The only way to get the trees back is to redefine the generator.  So, we'll do that now, but then we'll collect the generated trees into a list of trees.  Once they are stored in the list, we can refer to them as many times as we'd like.  We consume the generator making the list, but then we can refer to the list after that.

In [None]:
genileft = parserileft.parse(sent)
parses = list(genileft)
print(parses)

Now, let's draw our tree.

> This is the part where installing `svgling` helps us.  If you had not done that, you will probably get an error here, as it attempts to open a new window and draw a tree in it.  Note that although NLTK standard practice would be to call `draw()` on the tree, we do not want to do that here in a Colab notebook.  We will instead just let Colab reveal the tree's contents on its own terms.

To draw the tree, just let Python try to display the contents.  That's the simplest way to do this.
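
If you ever find yourself somewhere that cannot display the drawing (a plain terminal, say), a tree can also be shown as text.  A minimal sketch, using a hand-built tree as a stand-in for `parses[0]` (the name `texttree` is just for this illustration):

```python
import nltk

# build by hand the tree we expect for "I left" (a stand-in for parses[0])
texttree = nltk.Tree.fromstring("(S (NP I) (VP left))")

print(texttree)            # bracketed text form of the tree
texttree.pretty_print()    # a rough text drawing of the tree
print(texttree.leaves())   # the words at the bottom of the tree
```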

In [None]:
parses[0]

# Developing a more complex phrase structure grammar

We can start with the basic "park grammar" that comes from the NLTK book (so named I guess because it handles sentences that contain "in the park").

The first part of the grammar specification below generally is defining the possible structures of sentences in general, and then the latter part of the grammar specification is defining the words.  It is possible to use the pipe character (`|`) to separate disjunctive options (essentially like a logical "or").  The grammar given below is what we want.  It shows that there are three verbs, four prepositions, four determiners, five nouns, three proper names.  It shows that verbs can either be followed by an NP (the object) or an NP and a PP (an object and a prepositional indirect object).  Prepositions are followed by an object.  And NPs can either be a name, or be a Det plus an N and optionally plus a prepositional phrase.

```python
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> 'saw' | 'ate' | 'walked'
NP -> 'John' | 'Mary' | 'Bob' | Det N | Det N PP
Det -> 'a' | 'an' | 'the' | 'my'
N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
P -> 'in' | 'on' | 'by' | 'with'
```
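
To see that the pipe is only a shorthand, here is a small sketch with a toy grammar (not the park grammar; the names `with_pipe` and `without_pipe` are made up for this illustration).  Writing the options on one line with `|` should give NLTK the same rules as writing each option on its own line.

```python
import nltk

# the same toy grammar, written with the pipe and without it
with_pipe = nltk.CFG.fromstring("""
S -> 'yes' | 'no'
""")
without_pipe = nltk.CFG.fromstring("""
S -> 'yes'
S -> 'no'
""")

print(with_pipe.productions())
# compare the two sets of rules
print(set(with_pipe.productions()) == set(without_pipe.productions()))
```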

## TASK 1 define `grampark`

Put this grammar's specification in a string called `grampark`, following the same procedure we used for `gramileft` above.

> This is not supposed to be hard.  You can copy the grammar from the cell above, and paste it into a definition of a multi-line string called `grampark`.

In [None]:
# Answer 1: define a multi-line string called grampark with the grammar shown above



## TASK 2 define `cfgpark`

Have NLTK build a CFG object out of this specification, call it `cfgpark`, following the model of creating `cfgileft` above.

In [None]:
# Answer 2: define a CFG called cfgpark from grampark



In preparation for the next task, we will parse the sentence "Bob saw a telescope on a dog" and count how many parses it has.  (It has 2, because it is two-ways ambiguous.)  This serves as an example of what you will do in Task 3.  And a test to be sure that you've got the steps above correct.

> The code below loops through the numbers representing how many parses the sentence had (`len(parses)`) rather than simply looping through the parses as we did in the code vignette above.  My only reason for doing it that way was so I could print the labels "Parse 1" and "Parse 2" using the number of the current parse.
>
> To draw trees side by side, follow the recipe below.  Start with an empty list (`trees = []`), use `svgling.draw_tree()` to draw the tree for a parse and assign it a name (`tree`), and then `append` the tree to the list.  Here, we wrapped the tree in `Caption(tree, string)` to add a caption below the tree.
>
> At the end, when all the side-by-side trees are in the list, call `SideBySide(*trees)` -- note the asterisk before `trees`.  The asterisk is necessary: it unpacks the list, so that the trees are passed to `SideBySide()` one by one as separate arguments.
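
The asterisk is ordinary Python, not something special to `svgling`.  Here is a minimal sketch using a made-up function, `describe`, as a stand-in for any function that accepts any number of arguments:

```python
def describe(*items):
    """Made-up stand-in for a function that takes any number of arguments."""
    return "{} item(s): {}".format(len(items), ", ".join(items))

labels = ["Parse 1", "Parse 2"]

# with the asterisk, the list is unpacked into two separate arguments
print(describe(*labels))
# ...which is the same as writing the arguments out by hand
print(describe("Parse 1", "Parse 2"))
```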

In [None]:
parserpark = nltk.RecursiveDescentParser(cfgpark)
raw = "Bob saw a telescope on a dog"
sent = raw.split()
genpark = parserpark.parse(sent)
parses = list(genpark)
print("{}. {} parse(s)".format(raw, len(parses)))
trees = []
for i in range(len(parses)):
  tree = svgling.draw_tree(parses[i])
  trees.append(Caption(tree, "Parse {}".format(i+1)))
SideBySide(*trees)

## TASK 3 test `parserpark`

Show how many parses `cfgpark` can generate for each of the following sentences:
 - Mary saw Bob
 - my dog saw a cat in the park
 - a dog ate John
 - Mary saw

To do this, follow the process we used earlier.  Specifically, you'll need to have NLTK make a parser object for you, then tell that parser object to parse each of the (tokenized) sentences above, and then print the length of the list of parses each sentence gets.  You don't need to draw the trees, just count the number of parses.

> You should get 1, 2, 1, and 0 parses for the four sentences above, respectively, if it worked.
>
> You can of course follow the logic of the code we used above, that is why it is there.
> But you do not need to redefine `parserpark`, since it is already defined by the code above
> and does not need to change (since we are still working with the same grammar).

In [None]:
# Answer 3. Parse the sentences above and print how many parses each gets.



## TASK 4 `grampark2`: adjectives and PPs

In this task, we will create a grammar specification called `grampark2` that extends `grampark` to allow for (multiple) adjectives and multiple PP modifiers.
Specifically, the goal is for it to be able to handle
"my annoying little dog saw a cat by Bob in the lovely park"
and
"a vicious dog ate John."

To complete this task, just define `grampark2`, which is a multi-line string that defines a context free grammar.  We will use this in the activities below to test it out.  For more step-by-step hints, read the indented text below.


> The NLTK book (chapter 8, around example 3.3) works through the adjective part of this, though it is very concise there.
>
> When you think about the structure of "my annoying little dog", we have a determiner (*my*),
> two adjectives (*annoying* and *little*), and then the head noun (*dog*).  So, in addition to
> `NP -> Det N` (which we need to keep so we can still get "a dog"), we need an `NP` rule that
> allows for one or two adjectives between `Det` and `N`.
>
> There are two approaches here.  One is to create a grammar that allows for 0, 1, or 2 adjectives between a determiner and a noun.  That can be accomplished by adding two more rules.  But then if we ever need to handle a sentence containing "my annoying little fluffy dog" we'd need to add yet another rule.
>
> The other approach is to suppose that we can have essentially any number of adjectives between the Det and the N.  The way to do this is to create a recursive rule, one that will add an adjective to something and yield something that is suitable for adding another adjective to.  In the NLTK book, a "Nom" category is proposed, which is essentially the same as what might have been called "N-bar" in Intro to Linguistics.  For our purposes, the simplest version of this would be:
>
> ```
> N -> Adj N
> ```
> This expands our understanding of "N" from "a word that is a noun" to "one or more words that function as a noun does", but it gets what we want.  Specifically, if you have Adj "fluffy" and N "dog", you can make the N "fluffy dog", which you can then add the Adj "little" to, yielding "little fluffy dog", etc.
>
> While we're at it, PP nodes attached to Ns behave in pretty much the same way.  You can have any number of them, suggesting that they too are a good candidate for a recursive rule.
>
> ```
> N -> N PP
> ```
> However, when you add that rule above, you want to remove the rule that added a single PP within the NP (so that there is only one rule, the one just above here, that introduces a PP within an NP).
>
> In fact, let's not stop there.  The VP can also have multiple PPs attached to it, so let's revise the VP rule so that PP can modify verbs even when those verbs have no object.  That is, let's add:
>
> ```
> VP -> VP PP
> ```
> So, let's add those three recursive rules above (two for N, one for VP), as well as the rules that define the adjectives:
>
> ```
> Adj -> 'annoying' | 'little' | 'lovely' | 'vicious' | 'fluffy'
> ```
>
> And don't forget to *remove* the rules that we had introducing PPs before (remove the rule rewriting NP to Det N PP, and the rule rewriting VP to V NP PP).
>
> To define the multi-line string `grampark2`, you can just copy and paste the `grampark` string (from Task 1) and then
> modify it as needed to add support for adjectives and the recursive N rules described above.


In [None]:
# Answer 4: Define a multi-line string grampark2 that specifies the grammar, extended to handle adjectives and recursive PPs



Now we will build the CFG object from this description, and have NLTK make a parser based on that CFG.

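**NOTE** We need to use a `ChartParser` now rather than a `RecursiveDescentParser` because the left-recursive rules in the CFG we have created (like `N -> N PP` and `VP -> VP PP`, where the first symbol on the right is the same as the symbol on the left) will send a recursive descent parser into an infinite loop.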
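
The trouble comes from rules whose right side *begins* with the same symbol as the left side.  A recursive descent parser expands the leftmost symbol first, so a rule like that gets expanded again and again without ever using up a word.  A chart parser keeps a record of what it has already found and so does not get stuck.  Here is a minimal sketch with a toy grammar (not the homework grammar; the names `cfgconj` and `parserconj` are made up for this illustration):

```python
import nltk

# toy grammar: the first rule's right side starts with NP, the same symbol
# as its left side, which is what makes the grammar left-recursive
cfgconj = nltk.CFG.fromstring("""
NP -> NP 'and' NP | 'cats' | 'dogs'
""")
parserconj = nltk.ChartParser(cfgconj)

for raw_conj in ["cats and dogs", "cats and dogs and cats"]:
    found = list(parserconj.parse(raw_conj.split()))
    print("{}: {} parse(s)".format(raw_conj, len(found)))
```

The second string gets more than one parse because the two instances of "and" can be grouped in different ways.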

In [None]:
cfgpark2 = nltk.CFG.fromstring(grampark2)
# we cannot use nltk.RecursiveDescentParser(cfgpark2) with this kind of recursive CFG
parserpark2 = nltk.ChartParser(cfgpark2)

If it worked, the following code should succeed.  (Look through the code to see what it is doing,
but you should get the "Success!" message.)

In [None]:
raw = 'Mary saw a lovely cat in the park'
sent = raw.split()
genpark2 = parserpark2.parse(sent)
parses = list(genpark2)
if len(parses) > 0:
    print('\\o/ Success! The sentence was parsed (into {} parses).'.format(len(parses)))
else:
    print("D'oh! Something is amiss.")

Let's look a bit at what the actual parses were.  There are three.

In [None]:
trees = [svgling.draw_tree(t, font_size=12) for t in parses]
SideBySide(*trees)

Intuitively, this seems like it might be too many parses, but looking at the trees, we can see what happened. There are three places the PP *in the park* can be attached. It can be attached to the VP (the seeing was in the park), or it can be attached to the NP (the cat was in the park).  However, because there is an adjective (*lovely*) as well, the PP can either attach below the adjective (it is the *cat in the park* that is *lovely*) or above it (it is the *lovely cat* that is *in the park*).  This (arguably) may not make a difference to the meaning, but the structures are still distinct.

And it'll get worse if you have two adjectives and two PPs.  Try *Mary saw an annoying little cat with a telescope in the park.* How many parses does that sentence have?  Yikes.


In [None]:
raw = 'Mary saw an annoying little cat with a telescope in the park'
sent = raw.split()
genpark2 = parserpark2.parse(sent)
parses = list(genpark2)
if len(parses) > 0:
    print('\\o/ Success! The sentence was parsed (into {} parses).'.format(len(parses)))
else:
    print("D'oh! Something is amiss.")

Let's define a function that can take a
string, break it up into words, parse it, and return the trees.  That will make
it simpler to deal with this procedure.
Take a moment to understand the code, and see how it relates to what we just did.

In [None]:
def get_parses(raw, cfg):
    sent = raw.split()
    parser = nltk.ChartParser(cfg)
    treegen = parser.parse(sent)
    parses = list(treegen)
    return parses

Now, we can do the same check we did above, but making use of our `get_parses()` function.  You should still get the "Success!" message.  Look it over to understand what is happening.

In [None]:
parses = get_parses('Mary saw a lovely cat', cfgpark2)
if len(parses) > 0:
    print('\\o/ Success! The sentence was parsed (into {} parses).'.format(len(parses)))
else:
    print("D'oh! Something is amiss.")

## TASK 5 testing `get_parses`

Show how many parses the grammar specified by `grampark2` gives (as you did in Task 3) for the following sentences:
 - my annoying little dog saw a cat in the lovely park
 - a vicious dog ate John
 - a man walked in the park

> You should get 2, 1, and 0, respectively, if it worked.

In [None]:
# Answer 5: Parse the sentences above with cfgpark2 and print how many parses each gets



## TASK 6 walking in the park

This grammar will give you nothing for "a man walked in the park".  Intuitions about English suggest that this should be possible. What is it about `grampark2` that leads this not to be grammatical/parsable?

**Answer 6** (markdown)



# Traversing trees, finding subjects and objects

Next, we will try to find the subject of a sentence.  Descriptively, the subject
of a sentence is the NP that is a daughter of S.  Ultimately, in this grammar we have built
so far, it's always going to be in the same place, but let's explore this a little bit anyway.

We can take a tree that our parser has found for us and break it up into subtrees, which
will allow us to isolate NP-daughter-of-S pretty easily.  So, for the "John saw Mary" tree,
what `get_parses("John saw Mary", cfgpark2)` gives us back is a list of parses (containing just one
element), so let's look at that parse.  We'll name it `parse1`.  Good day to you, `parse1`.

In [None]:
parse1 = get_parses("John saw Mary", cfgpark2)[0]
parse1

We can ask things of type `Tree` to provide `subtrees()`, which will give us *all* the subtrees contained in it.

In [None]:
list(parse1.subtrees())

If we are looking for NP-daughter-of-S, we first want to find the Ses, and then we can look at the daughters to find an NP.  In this case, we have just the one S, but later we will look at more complex sentences where one S is contained inside another.  So, let's do this in a general way from the beginning.

In [None]:
# find all the subtrees labeled "S"
ssubtrees = [n for n in parse1.subtrees() if n.label() == 'S']
# go through each S and find the NP daughters
subjects = [d for snode in ssubtrees for d in snode if d.label() == 'NP']
# report on what we found
print(subjects)

We can actually combine these in a single (though complicated) list comprehension:

In [None]:
[d for n in parse1.subtrees() if n.label() == 'S' for d in n if d.label() == 'NP']

> This takes the NP-finder above, but adds in the computation of `ssubtrees` as well. Notice the order.  We're making a list of `d`s, which are the daughters of an S node. So we start with `[d for...` but then we are going to find the S nodes first, and then the daughters of each S node once we have one.  So, we continue with `n in parse1.subtrees() if n.label() == 'S'` meaning that `n` is going to be a subtree with label "S" that we want to then check the daughters of.  So, then we go through the daughters with `for d in n if d.label() == 'NP'`.  Put together, it looks as given above.
>
> Saying it again, slightly differently: to read this out loud in something like English, skip the `[d for` part until the end: "For each node `n` in `parse1.subtrees()` whose label is `"S"`, and for each node `d` in `n` whose label is `"NP"`, add `d` to the list."
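
If the one-line version is still hard to read, it may help to see it unrolled into ordinary nested loops.  This is a sketch that uses a hand-built tree as a stand-in for `parse1` (the names `demotree`, `subjects_short`, and `subjects_long` are made up for this illustration):

```python
import nltk

# hand-built stand-in for parse1 (the parse of "John saw Mary")
demotree = nltk.Tree.fromstring("(S (NP John) (VP (V saw) (NP Mary)))")

# the one-line list comprehension
subjects_short = [d for n in demotree.subtrees() if n.label() == 'S'
                  for d in n if d.label() == 'NP']

# the same logic written as nested loops, read from top to bottom
subjects_long = []
for n in demotree.subtrees():      # every subtree in the parse
    if n.label() == 'S':           # keep only the S nodes
        for d in n:                # each daughter of that S
            if d.label() == 'NP':  # keep only the NP daughters
                subjects_long.append(d)

print(subjects_short == subjects_long)
print(subjects_long)
```

The `for` and `if` parts appear in the same order in both versions; only the `d` that gets collected moves from the end to the front.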

If we make some assumptions about the grammar (in particular, that the first daughter of S is always going to be the subject), we can do the same thing by just gathering the first daughters of all the Ses (without bothering to check for an NP label).

In [None]:
[snode[0] for snode in ssubtrees]

## TASK 7 Locate the object

Understand how that complex list comprehension works.
It's not simple.  Even I have to stare at these for a little while before I get it.
Re-read the explanation above a couple of times and keep in mind what this is supposed
to be accomplishing.
Then, **convince yourself
that you have succeeded
by changing it so that it finds the object instead.**
(What we did above is find the subject, which is the
NP daughter of S.  So, how do we characterize the object?  Use the technique above to find it.
The answer should be "Mary", right?)

In [None]:
# Answer 7: Revise the list comprehension above to find the object instead.



Now, let's enhance the grammar by adding the ability to embed clauses, like in
"Bob thought that John saw Mary" and "Bob said that John saw Mary".

## TASK 8 Add embedded sentences

Enhance the grammar so that it can parse "Bob thought that John saw Mary" and "Bob said that John saw Mary".  Call the (multi-line string) specification of the new grammar `gramcomp` and have NLTK create a CFG object from it called `cfgcomp`.


> The idea here is to add still more to `grampark2` from Task 4.  We need to add the verbs `"said"` and `"thought"` at least, and the
> complementizer `"that"`.  So to start, just copy and paste the definition of `grampark2` from Task 4, and then revise it.  We'll call the revised version `gramcomp`.
> 
> Consider that, although "said" and "thought" are verbs, they do not take NP objects.
> So they're a different kind of a verb.  They are in the *category* of "verb" but they
> are a sub-type, a *sub-category* of verb.  So, we do not simply want to add something
> like `... | "thought" | "said"` to the `V ->` line.  We need a different kind of verb,
> the book calls them "Sentential verbs" and gives them a label of `SV`, so we can follow
> that here.
>
> ```
> SV -> 'said' | 'thought'
> ```
> 
> If the sentential verbs are category `SV`, we still want to be able to form a `VP`
> out of a `SV` and its complement.  So, we need to add that as an option to the `VP`
> rules.  In order to do this, we also need to figure out what the complement of such
> a verb is.
> 
> This is simplifying things, but let's assume that the complement of "thought" is
> basically always "that S" --- so "that" is a complementizer, we can call it category
> `C` and we can form a `CP` from `C` and `S`.  Then `SV` type verbs will have a
> `CP` as their complement.  It's pretty close to what you'd have seen in Intro
> syntax, apart from probably calling `S` "IP" instead.
>
> ```
> VP -> SV CP
> CP -> C S
> C -> 'that'
> ```
> 
> TL;DR: You will want to add rules with left sides being CP, C, SV, and another with VP.

In [None]:
# Answer 8: Define gramcomp and cfgcomp
# (allowing sentential complements of thought and said)



## TASK 9 Two complex trees

Give trees for:
 - Bob said that John saw Mary in the park
 - the annoying man thought that Bob said that my dog saw a vicious cat in the park
 
You can use the text representations of the trees that `print(tree)` provides.  Also: don't forget that we defined `get_parses(raw, cfg)` above, so you can just use that as-is to get the parses.

Print the tree for the first parse of each (don't bother printing trees of all the possible parses of the second one).

In [None]:
# Answer 9: Print trees for the first parse of sentences
# 9a: Bob said that John saw Mary in the park


In [None]:
# 9b: the annoying man thought that Bob said that my dog saw a vicious cat in the park



## TASK 10 Locate many subjects

Find the subjects of those sentences using our subject-finding procedure
from before.
It should be "Bob" and "John" in one case, "the annoying man", "Bob", and "my dog" in the other.
(Also, it is ok if your subjects when printed look like `[Tree('NP', ['Bob'])]` rather than just "Bob".)

> Here, you would just use the list comprehension for finding subjects above Task 7.

In [None]:
# Answer 10: Find (and print) the subjects of the sentences above
# first sentence


In [None]:
# second sentence


## TASK 11 Locate many objects

Find the objects of those sentences using our
object-finding procedure from before.  (Should be "Mary" in one case, "a vicious cat (in the park)" in the other.)

> Here, you can just use the object-finder from Task 7.

In [None]:
# Answer 11: Find (and print) the objects of the sentences above
# first sentence


In [None]:
# second sentence


# Relative clauses

A relative clause is something like "who saw Mary" in "the man who saw Mary". 
It is formed by adding a *wh*-question to a noun, more or less.  So the referent
of "the man who saw Mary" is the individual that is a man, and also the answer
to the question "Who saw Mary?". 

Suppose we want our parser to recognize "the man who saw Mary" as an NP.

It can already recognize "the man" and "the man in the park", so we can
simply add an extra option for the `NP` rule to allow for this.

In "the man who saw Mary", it seems like "who" is basically the
subject of "saw".  So, "who saw Mary" is a special kind of sentence
with "who" as the subject.  Let's define this kind of special case
by, first, making "who" a special kind of NP, and then making a
relative clause be a special kind of sentence with "who" as its
subject.

So, we can add these to the grammar (`RP` is the relative pronoun, `RC` is the relative clause, which is a relative pronoun and a verb phrase, and we add one more kind of `NP` that has a `RC` attached).

In [None]:
gramrcsub = gramcomp + """
RP -> 'who'
RC -> RP VP
NP -> Det N RC
"""

cfgrcsub = nltk.CFG.fromstring(gramrcsub)

In [None]:
parses = get_parses("the man who saw Mary saw Bob", cfgrcsub)

In [None]:
parses[0]

There's another form a relative clause can take, though.  You can also say
"the man who Mary saw saw Bob".  What's different here is that "who" is now 
playing the role of the object, rather than the subject.

The relative pronoun "who" generally corresponds to a gap
in the sentence.  We didn't notice the gap before, when the gap was in the subject
position, but it's obvious
when the gap is in the object position.  These relative clauses are, again,
basically *wh*-questions, and the normal way *wh*-questions are formed is to 
move the *wh*-word to the front of the clause.

And this is where parsing becomes difficult, when things move around in a sentence.

Let's try a kind of a hack to make this work.

For any transitive verb ("saw", "ate", and "walked" in our grammar), there is the
version we already have, which forms a VP with its object NP.  If any of these appear
in an "object relative", then the object NP will be "missing".  So, let's make a version
of the VP that has a "**gap**".  That is, we will define `VPG` (VP-gap) to be just `V` rather than 
`V NP`.


The line below will give us a list of all the rules in `cfgrcsub` that have `VP` on the left side. For any of these that have `V NP` in them, we want to make a `VPG` version (a VP with an object gap) with the `NP` omitted.

In [None]:
[p for p in cfgrcsub.productions() if str(p.lhs()) == 'VP']

In [None]:
gramvpg = gramrcsub + """
VPG -> V
VPG -> V PP
"""

We haven't used `VPG` yet in the grammar apart from defining it.  But conceptually,
what we want is that `VPG` should be available in a relative clause where the
object is missing.  So, we want to add an expansion for `RC`, like this:
```python
RC -> RP SWOG
SWOG -> NP VPG
```
The idea here is that `SWOG` (sentence-with-object-gap) is like a regular `S` but
has a `VP` with an object gap.

To do this replacement, it's probably easier to edit the whole specification.  So, let's print out the multi-line string that specifies our most recent version of the grammar, and then afterwards you can copy and paste that into the next version of the grammar, making the changes just mentioned (adding `SWOG` in).

In [None]:
print(gramvpg)

## TASK 12 object relatives

Define a multi-line string `gramrcobj` that specifies a grammar that can handle object relative clauses, by copying the grammar above and then adding the modifications for `SWOG` discussed above that.

In [None]:
# Answer 12: define gramrcobj as above but with modifications for SWOG.



Assuming you did that right, we should now get a tree below.

In [None]:
cfgrcobj = nltk.CFG.fromstring(gramrcobj)
parses = get_parses("the man who Mary saw in the park saw Bob", cfgrcobj)
parses[0]

Having analyzed object relatives this way really probably means that we should re-analyze subject relatives to match (so that there is a gap even in subject relatives).  We won't bother with that here though.

## TASK 13 the man who saw the man who Mary saw

Draw a tree for "the man who saw the man who Mary saw saw Bob".

In [None]:
# the man who saw the man who Mary saw saw Bob



## Advanced/optional: TASK 14 Find subjects and objects

Define a function `find_subjects` that will find subjects and a function `find_objects` that will find objects, both of which will work with relative clauses.  Specifically, the subjects of the tree above should be "the man who saw the man who Mary saw", "who", "Mary"; the objects should be "Bob", "the man who Mary saw", "who".  This will require looking not only at NP daughters of VP and S but looking also at RCs.


In [None]:
# Answer 14: define find_subjects(raw, cfg) and find_objects(raw, cfg)
# so that they will find subjects/objects in sentences with relative clauses too

def find_subjects(raw, gram):
  # revise this to actually return the right things
  return ['a', 'b']

def find_objects(raw, gram):
  # revise this to actually return the right things
  return ['a', 'b']


The code below should test your functions.

In [None]:
subjs = find_subjects('the man who saw the man who Mary saw saw Bob', cfgrcobj)
print("Found {} subjects, should be 3.".format(len(subjs)))
print("Last two should be 'who' and 'Mary' and are:")
print(subjs[-2])
print(subjs[-1])

In [None]:
objs = find_objects('the man who saw the man who Mary saw saw Bob', cfgrcobj)
print("Found {} objects, should be 3.".format(len(objs)))
print("Last two should be 'Bob' and 'who' and are:")
print(objs[-2])
print(objs[-1])

And that's it for the homework.  Feel free to play around with it, there are certainly more things one can do.