# FLIP(01):  Advanced Data Science
**(Module E: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/flip01](https://github.com/tulip-lab/flip01/issues)


Prepared by and for 
**Student Members** |
2006-2022 [TULIP Lab](http://www.tulip.org.au)

---


# Session 08 - Building Feature-Based Grammars

### Grammatical Features

In this part, we will investigate the role of features in building rule-based grammars. In contrast to feature extractors, which record features that have been automatically detected, we are now going to declare the features of words and phrases. We start off with a very simple example, using dictionaries to store features and their values.

In [None]:
kim = {'CAT': 'NP', 'ORTH': 'Kim', 'REF': 'k'}

In [None]:
chase = {'CAT': 'V', 'ORTH': 'chased', 'REL': 'chase'}

Feature structures contain various kinds of information about grammatical entities. The information need not be exhaustive, and we might want to add further properties. For example, in the case of a verb, it is often useful to know what “semantic role” is played by the arguments of the verb. In the case of chase, the subject plays the role of “agent,” whereas the object has the role of “patient.” Let’s add this information, using 'sbj' (subject) and 'obj' (object) as placeholders which will get filled once the verb combines with its grammatical arguments:

In [None]:
chase['AGT'] = 'sbj'

In [None]:
chase['PAT'] = 'obj'

If we now process the sentence *Kim chased Lee*, we want to “bind” the verb’s agent role to the subject and the patient role to the object. We do this by linking to the REF feature of the relevant NP. In the following example, we make the simple-minded assumption that the NPs immediately to the left and right of the verb are the subject and object, respectively. We also add a feature structure for Lee to complete the example.

In [None]:
sent = "Kim chased Lee"
tokens = sent.split()
lee = {'CAT': 'NP', 'ORTH': 'Lee', 'REF': 'l'}

In [None]:
def lex2fs(word):
    for fs in [kim, lee, chase]:
        if fs['ORTH'] == word:
            return fs

In [None]:
subj, verb, obj = lex2fs(tokens[0]), lex2fs(tokens[1]), lex2fs(tokens[2])
verb['AGT'] = subj['REF'] # agent of 'chase' is Kim
verb['PAT'] = obj['REF'] # patient of 'chase' is Lee

In [None]:
for k in ['ORTH', 'REL', 'AGT', 'PAT']: # check featstruct of 'chase'
    print("%-5s => %s" % (k, verb[k]))

The same approach could be adopted for a different verb, say *surprise*, though in this case the subject would play the role of “source” (SRC) and the object the role of “experiencer” (EXP):

In [None]:
surprise = {'CAT': 'V', 'ORTH': 'surprised', 'REL': 'surprise',
            'SRC': 'sbj', 'EXP': 'obj'}
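The binding step shown earlier for *chase* can be written once and reused for any verb whose roles are marked with the 'sbj' and 'obj' placeholders. The following is a minimal sketch in plain Python. The helper name `bind_roles` is hypothetical (it is not part of NLTK), and it keeps the same simple-minded assumption that the subject and object are already known.

```python
def bind_roles(verb_fs, subj_fs, obj_fs):
    """Return a copy of verb_fs with 'sbj'/'obj' placeholders replaced by REF values."""
    referents = {'sbj': subj_fs['REF'], 'obj': obj_fs['REF']}
    # Any value that is not a placeholder is copied over unchanged.
    return {feat: referents.get(val, val) for feat, val in verb_fs.items()}

kim = {'CAT': 'NP', 'ORTH': 'Kim', 'REF': 'k'}
lee = {'CAT': 'NP', 'ORTH': 'Lee', 'REF': 'l'}
surprise = {'CAT': 'V', 'ORTH': 'surprised', 'REL': 'surprise',
            'SRC': 'sbj', 'EXP': 'obj'}

bound = bind_roles(surprise, kim, lee)
for k in ['ORTH', 'REL', 'SRC', 'EXP']:  # check the bound feature structure
    print("%-5s => %s" % (k, bound[k]))
```

Because the helper returns a new dictionary, the lexical entry `surprise` keeps its placeholders and can be reused for another sentence.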

In [None]:
# Example feature-based grammar
import nltk
nltk.data.show_cfg('grammars/book_grammars/feat0.fcfg')

In [None]:
# Trace of feature-based chart parser.
tokens = 'Kim likes children'.split()

In [None]:
from nltk import load_parser
cp = load_parser('grammars/book_grammars/feat0.fcfg', trace=2)

In [None]:
for tree in cp.parse(tokens):
    print(tree)
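To see what the features contribute without relying on the downloaded grammar files, a grammar can also be defined inline. The toy grammar below is an assumption made for illustration (it is not feat0.fcfg); it uses a NUM feature and a variable ?n to enforce subject-verb agreement.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

# Hypothetical toy grammar: the variable ?n forces NP and VP to agree in NUM.
toy_grammar = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> PropN[NUM=?n]
VP[NUM=?n] -> IV[NUM=?n]
PropN[NUM=sg] -> 'Kim'
IV[NUM=sg] -> 'walks'
IV[NUM=pl] -> 'walk'
""")
toy_parser = FeatureChartParser(toy_grammar)

good = list(toy_parser.parse('Kim walks'.split()))  # singular subject, singular verb
bad = list(toy_parser.parse('Kim walk'.split()))    # singular subject, plural verb
print(len(good), len(bad))
```

The second sentence receives no parse because ?n cannot be bound to both sg and pl at once.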

### Processing Feature Structures

In this part, we will show how feature structures can be constructed and manipulated in NLTK. We will also discuss the fundamental operation of unification, which allows us to combine the information contained in two different feature structures. Feature structures in NLTK are declared with the FeatStruct() constructor. Atomic feature values can be strings or integers.

In [None]:
fs1 = nltk.FeatStruct(TENSE='past', NUM='sg')

In [None]:
print(fs1)

In [None]:
fs1 = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')

In [None]:
print(fs1['GND'])

In [None]:
fs2 = nltk.FeatStruct(POS='N', AGR=fs1)

In [None]:
print(fs2)

In [None]:
print(fs2['AGR'])

In [None]:
print(fs2['AGR']['PER'])

In [None]:
print(nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]"))

In [None]:
print(nltk.FeatStruct(name='Lee', telno='01 27 86 42 96', age=33))

In order to indicate reentrancy in our matrix-style representations, we will prefix the first occurrence of a shared feature structure with an integer in parentheses, such as (1). Any later reference to that structure will use the notation ->(1), as shown below.

In [None]:
print(nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                          SPOUSE=[NAME='Kim', ADDRESS->(1)]]"""))

The bracketed integer is sometimes called a tag or a coindex. The choice of integer is not significant. There can be any number of tags within a single feature structure.

In [None]:
print(nltk.FeatStruct("[A='a', B=(1)[C='c'], D->(1), E->(1)]"))
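One way to check what the tag notation means is to compare the Python objects behind the two paths. The sketch below assumes that NLTK represents a reentrant value as a single shared object, which is how the reader handles `(1)` and `->(1)`. The variable names `shared` and `copied` are introduced here for illustration.

```python
import nltk

# Reentrant version: both ADDRESS paths point at the structure tagged (1).
shared = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                             SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
# Non-reentrant version: two separate structures that happen to hold equal values.
copied = nltk.FeatStruct("""[NAME='Lee', ADDRESS=[NUMBER=74, STREET='rue Pascal'],
                             SPOUSE=[NAME='Kim', ADDRESS=[NUMBER=74, STREET='rue Pascal']]]""")

print(shared['ADDRESS'] is shared['SPOUSE']['ADDRESS'])  # same object?
print(copied['ADDRESS'] is copied['SPOUSE']['ADDRESS'])  # same object?
print(copied['ADDRESS'] == copied['SPOUSE']['ADDRESS'])  # equal values?
```

Equal values are not the same thing as a shared value, and the difference matters once new information is added to one of the paths.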

Merging information from two feature structures is called unification and is supported by the unify() method.



In [None]:
fs1 = nltk.FeatStruct(NUMBER=74, STREET='rue Pascal')
fs2 = nltk.FeatStruct(CITY='Paris')

In [None]:
print(fs1.unify(fs2))

Unification is formally defined as a (partial) binary operation: FS0 ⊔ FS1. Unification is symmetric, so FS0 ⊔ FS1 = FS1 ⊔ FS0. The same is true in Python:

In [None]:
print(fs2.unify(fs1))
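Rather than comparing the two printed results by eye, the symmetry can be checked directly. This is a small sketch using fresh variable names (`street`, `city`) so that it does not disturb fs1 and fs2. It also shows that unify() returns a new structure and leaves its inputs unchanged.

```python
import nltk

street = nltk.FeatStruct(NUMBER=74, STREET='rue Pascal')
city = nltk.FeatStruct(CITY='Paris')

left = street.unify(city)
right = city.unify(street)

print(left == right)          # both orders give equal feature structures
print(sorted(left.keys()))    # features contributed by both inputs
print('CITY' in street)       # the original structure is not modified
```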

Unification between FS0 and FS1 will fail if the two feature structures share a path π, but the value of π in FS0 is a distinct atom from the value of π in FS1. This is implemented by setting the result of unification to be None.

In [None]:
fs0 = nltk.FeatStruct(A='a')
fs1 = nltk.FeatStruct(A='b')
fs2 = fs0.unify(fs1)

In [None]:
print(fs2)
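The failure condition can be made concrete with a simplified version of unification over plain nested dictionaries. This is only a sketch under stated assumptions: the function `unify_dicts` is hypothetical, it is not how NLTK implements unify(), and it ignores reentrancy and variables.

```python
def unify_dicts(fs_a, fs_b):
    """Unify two nested dicts of features; return None if any path clashes."""
    result = dict(fs_a)
    for feature, value_b in fs_b.items():
        if feature not in result:
            result[feature] = value_b          # new information: just add it
            continue
        value_a = result[feature]
        if isinstance(value_a, dict) and isinstance(value_b, dict):
            merged = unify_dicts(value_a, value_b)   # shared path: recurse
            if merged is None:
                return None
            result[feature] = merged
        elif value_a != value_b:
            return None                        # distinct atoms on a shared path
    return result

print(unify_dicts({'NUMBER': 74, 'STREET': 'rue Pascal'}, {'CITY': 'Paris'}))
print(unify_dicts({'A': 'a'}, {'A': 'b'}))
print(unify_dicts({'AGR': {'NUM': 'sg'}}, {'AGR': {'PER': 3}}))
```

As with unify(), failure is signalled by returning None rather than by raising an exception.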

Things become really interesting when we look at how unification interacts with structure sharing. First, let's define in Python a version of Lee's feature structure in which Lee and Kim have identical but separate (unshared) addresses:

In [None]:
fs0 = nltk.FeatStruct("""[NAME=Lee,
                          ADDRESS=[NUMBER=74,
                                   STREET='rue Pascal'],
                          SPOUSE=[NAME=Kim,
                                  ADDRESS=[NUMBER=74,
                                           STREET='rue Pascal']]]""")

In [None]:
print(fs0)

What happens when we augment Kim's address with a specification for CITY? Notice that fs1 needs to include the whole path from the root of the feature structure down to CITY.

In [None]:
fs1 = nltk.FeatStruct("[SPOUSE = [ADDRESS = [CITY = Paris]]]")

In [None]:
print(fs1.unify(fs0))

By contrast, the result is very different if fs1 is unified with the structure-sharing version fs2:

In [None]:
fs2 = nltk.FeatStruct("""[NAME=Lee, ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                          SPOUSE=[NAME=Kim, ADDRESS->(1)]]""")

In [None]:
print(fs1.unify(fs2))

Rather than just updating what was in effect Kim's "copy" of Lee's address, we have now updated both their addresses at the same time. More generally, if a unification adds information to the value of some path π, then that unification simultaneously updates the value of any path that is equivalent to π.
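The contrast between the two results can also be checked programmatically. The sketch below rebuilds both versions under new names (`copied`, `shared`, `update`, chosen here for illustration) and tests which ADDRESS values received the CITY feature.

```python
import nltk

update = nltk.FeatStruct("[SPOUSE=[ADDRESS=[CITY='Paris']]]")

# Separate copies of the address.
copied = nltk.FeatStruct("""[NAME='Lee', ADDRESS=[NUMBER=74, STREET='rue Pascal'],
                             SPOUSE=[NAME='Kim', ADDRESS=[NUMBER=74, STREET='rue Pascal']]]""")
# One address shared by both paths.
shared = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                             SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")

res_copied = update.unify(copied)
res_shared = update.unify(shared)

# Was CITY added to Lee's address, and to Kim's address?
print('CITY' in res_copied['ADDRESS'], 'CITY' in res_copied['SPOUSE']['ADDRESS'])
print('CITY' in res_shared['ADDRESS'], 'CITY' in res_shared['SPOUSE']['ADDRESS'])
```

Only the structure-sharing version propagates the new feature to both paths, since the two paths are equivalent there.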

Structure sharing can also be stated using variables such as ?x.



In [None]:
fs1 = nltk.FeatStruct("[ADDRESS1=[NUMBER=74, STREET='rue Pascal']]")
fs2 = nltk.FeatStruct("[ADDRESS1=?x, ADDRESS2=?x]")

In [None]:
print(fs2)

In [None]:
print(fs2.unify(fs1))

### Extending a Feature-Based Grammar

In this part, we return to feature-based grammars, explore a variety of linguistic issues, and demonstrate the benefits of incorporating features into the grammar.

In [None]:
# Grammar with productions for inverted clauses and long-distance dependencies, making use of slash categories.
nltk.data.show_cfg('grammars/book_grammars/feat1.fcfg')

In [None]:
tokens = 'who do you claim that you like'.split()

In [None]:
from nltk import load_parser
cp = load_parser('grammars/book_grammars/feat1.fcfg')

In [None]:
for tree in cp.parse(tokens):
    print(tree)

In [None]:
tokens = 'you claim that you like cats'.split()

In [None]:
for tree in cp.parse(tokens):
    print(tree)

In addition, the feat1.fcfg grammar admits inverted sentences that do not involve wh-constructions:



In [None]:
tokens = 'rarely do you sing'.split()

In [None]:
for tree in cp.parse(tokens):
    print(tree)

In [None]:
# Example feature-based grammar.
nltk.data.show_cfg('grammars/book_grammars/german.fcfg')

In [None]:
tokens = 'ich folge den Katzen'.split()

In [None]:
cp = load_parser('grammars/book_grammars/german.fcfg')

In [None]:
for tree in cp.parse(tokens):
    print(tree)

In [None]:
tokens = 'ich folge den Katzen'.split()

In [None]:
cp = load_parser('grammars/book_grammars/german.fcfg', trace=2)

In [None]:
for tree in cp.parse(tokens):
    print(tree)