# 9.1 - Grammatical Features

Instead of detecting the features of words automatically, we will now *declare these features explicitly*. Let's start by storing some features and their values in dictionaries.

In [1]:
kim = {"CAT": "NP", "ORTH": "Kim", "REF": "k"}
chase = {"CAT": "V", "ORTH": "chased", "REL": "chase"}

Both `kim` and `chase` share the features **CAT** (grammatical category) and **ORTH** (orthography, i.e. spelling). Each also has a semantically oriented feature:
* `kim["REF"]` is intended to give the referent of `kim`
* `chase["REL"]` gives the relation expressed by `chase`

Such pairings of features and values are known as **feature structures**.

We may want to add further features on top of these general ones. Consider the verb `chase`.

Its subject plays the role of **agent** while its object plays the role of **patient**. Let's add these roles, using `sbj` and `obj` as placeholders for now.

In [2]:
chase["AGT"] = "sbj"
chase["PAT"] = "obj"

Now, given a sentence *Kim chased Lee*, we want to bind the **verb's agent role to the subject** and its **patient role to the object**.

In [3]:
# Add a feature structure for Lee.
lee = {"CAT": "NP", "ORTH": "Lee", "REF": "l"}
# Declare the sentence and split it into tokens.
sent = "Kim chased Lee"
tokens = sent.split()

def lex2fs(word):
    """Return the feature structure whose ORTH matches word (None if there is none)."""
    for fs in [kim, lee, chase]:
        if fs["ORTH"] == word:
            return fs

subj, verb, obj = lex2fs(tokens[0]), lex2fs(tokens[1]), lex2fs(tokens[2])
# Replace the placeholders with the referents of the subject and the object.
verb["AGT"] = subj["REF"]
verb["PAT"] = obj["REF"]
for k in ["ORTH", "REL", "AGT", "PAT"]:
    print("%-5s => %s" % (k, verb[k]))

ORTH  => chased
REL   => chase
AGT   => k
PAT   => l


The same approach works for other verbs, although the role names may differ. For the verb `surprise`, the subject is the **source** and the object is the **experiencer**.

In [4]:
surprise = {"CAT": "V", "ORTH": "surprised", "REL": "surprise", "SRC": "sbj", "EXP" : "obj"}
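
The binding step used for `chase` can be generalized so that it works for any verb whose role features hold the placeholders `sbj` and `obj`. The following is a minimal sketch under that assumption; `bind_roles` is a hypothetical helper written for illustration, not an NLTK function.

```python
def bind_roles(verb, subj, obj):
    """Return a copy of verb in which "sbj"/"obj" placeholders are replaced by referents."""
    fillers = {"sbj": subj["REF"], "obj": obj["REF"]}
    # Any value that is not a placeholder is copied unchanged.
    return {feat: fillers.get(val, val) for feat, val in verb.items()}

kim = {"CAT": "NP", "ORTH": "Kim", "REF": "k"}
lee = {"CAT": "NP", "ORTH": "Lee", "REF": "l"}
surprise = {"CAT": "V", "ORTH": "surprised", "REL": "surprise", "SRC": "sbj", "EXP": "obj"}

bound = bind_roles(surprise, kim, lee)
print(bound["SRC"], bound["EXP"])  # k l
```

Returning a copy leaves the lexical entry `surprise` untouched, so it can be reused for another sentence.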

## Syntactic Agreement

`This dog` is grammatical while `these dog` is not; the correct plural form is `these dogs`. Conversely, `this dogs` is also ungrammatical. ("this" is used with singular nouns while "these" is used with plural nouns.)

`The dog runs` is grammatical while `The dog run` is not. (In the present tense, a verb with a third person singular subject takes the suffix "-s", and a verb with a plural subject does not.) Similarly, `The dogs run` is grammatical while `The dogs runs` is not.

Morphological properties of the verb co-vary with syntactic properties of the subject noun phrase. This co-variance is called **agreement**.

We can make the morphological properties more explicit, using `3` for 3rd person, `SG` for singular and `PL` for plural:

In [5]:
print("SINGULAR, 3rd person")
print("the   dog        run-s")
print("the   dog.3.SG   run-3.SG")
print()
print("PLURAL, 3rd person")
print("the   dog-s      run")
print("the   dog.3.PL   run.3.PL")

SINGULAR, 3rd person
the   dog        run-s
the   dog.3.SG   run-3.SG

PLURAL, 3rd person
the   dog-s      run
the   dog.3.PL   run.3.PL
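
The glosses above suggest a simple check: a subject and a verb agree when their person and number values match. Here is a minimal sketch in plain Python, assuming each word is stored as a dictionary with hypothetical `PER` and `NUM` features. The helper `agrees` is illustrative only, and treating `run` as strictly third person plural is a simplification.

```python
def agrees(subj, verb, features=("PER", "NUM")):
    """Return True if subj and verb have equal values for every listed feature."""
    return all(subj[f] == verb[f] for f in features)

dog = {"ORTH": "dog", "PER": 3, "NUM": "SG"}
dogs = {"ORTH": "dogs", "PER": 3, "NUM": "PL"}
runs = {"ORTH": "runs", "PER": 3, "NUM": "SG"}
run = {"ORTH": "run", "PER": 3, "NUM": "PL"}

# Try all four subject/verb combinations.
for subj, verb in [(dog, runs), (dogs, run), (dog, run), (dogs, runs)]:
    verdict = "ok" if agrees(subj, verb) else "agreement clash"
    print("the %s %s: %s" % (subj["ORTH"], verb["ORTH"], verdict))
```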


In [6]:
#Introduce a CFG on the above example:
import nltk
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N
  VP -> V 
  Det -> 'this'
  N -> 'dog'
  V -> 'runs'
""")
# This grammar generates "this dog runs". If we simply added 'these', 'dogs' and 'run'
# to the lexical productions, it would also generate invalid sentences like "these dog runs".

In [7]:
#The most straightforward way to implement the constraints is to add new non-terminals and productions to the grammar
grammar2 = nltk.CFG.fromstring("""
  S  -> NP_SG VP_SG
  S  -> NP_PL VP_PL
  NP_SG -> Det_SG N_SG
  NP_PL -> Det_PL N_PL
  VP_SG -> V_SG
  VP_PL -> V_PL
  
  Det_SG -> 'this'
  N_SG -> 'dog'
  V_SG -> 'runs'

  Det_PL -> 'these'
  N_PL -> 'dogs'
  V_PL -> 'run'
""")

Every production that is sensitive to number now comes in **two versions**, one for singular and one for plural. Even in this small grammar the result is clumsy, and in a larger grammar that covers more agreement features the number of productions multiplies quickly.
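
To get a feel for how quickly this grows, the sketch below enumerates the atomic category names that would be needed if every feature combination had its own non-terminal. The `PER` and `GND` value lists are assumptions made for illustration, and `atomic_categories` is a hypothetical helper.

```python
from itertools import product

def atomic_categories(base, feature_values):
    """List one category name per combination of feature values, e.g. NP_SG."""
    return ["_".join((base,) + combo) for combo in product(*feature_values.values())]

# Number only: the two copies used in grammar2.
print(atomic_categories("NP", {"NUM": ["SG", "PL"]}))  # ['NP_SG', 'NP_PL']

# Adding person and gender (hypothetical value sets) multiplies the count.
features = {"NUM": ["SG", "PL"], "PER": ["1", "2", "3"], "GND": ["MASC", "FEM", "NEUT"]}
print(len(atomic_categories("NP", features)))  # 18
```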

## Using Attributes and Constraints
Linguistic categories have properties. For example, a noun can have the property of being plural:

`N[NUM=pl]`

Here the category `N` carries a feature called `NUM` whose value is `pl` (short for plural). We can add similar annotations to other categories and use them in lexical entries:

`Det[NUM=sg] -> 'this'`

`Det[NUM=pl] -> 'these'`

`N[NUM=sg] -> 'dog'`

`N[NUM=pl] -> 'dogs'`

`V[NUM=sg] -> 'runs'`

`V[NUM=pl] -> 'run'`


**Previously we only allowed feature values to be explicit. Now, we relax this rule and let values be variable.**

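```
S  -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n]
```

We use `?n` as a variable over the explicit values of `NUM`, `sg` or `pl`. Whatever value `?n` takes on the first category of a production, every other occurrence of `?n` in that production **must** take the same value. In `S  -> NP[NUM=?n] VP[NUM=?n]` this forces the subject and the verb phrase to agree. In `NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]` it forces the determiner and the noun to agree, and passes their number up to the `NP`.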

We can also *underspecify* the `NUM` value of a determiner, so that it agrees in number with whatever noun it combines with:

`Det[NUM=?n] -> 'the' | 'some' | 'several'`
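
The effect of the shared variable `?n` can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how NLTK implements it: `bind`, `accepts`, `FAIL` and `LEXICON` are hypothetical names, the lexicon is limited to the words used above plus `the`, and `None` stands for an underspecified `NUM`. It assumes that `NP` and `VP` pass their `NUM` value up so that the `S` production can compare them.

```python
FAIL = object()  # sentinel: the values cannot be made to agree

def bind(values):
    """Find one value for ?n that is consistent with every listed NUM value.

    None means "unspecified" and is compatible with anything. Returns the
    shared value, None if every value is unspecified, or FAIL on a clash.
    """
    binding = None
    for value in values:
        if value is None:
            continue
        if binding is None:
            binding = value
        elif binding != value:
            return FAIL
    return binding

LEXICON = {
    "this": ("Det", "sg"), "these": ("Det", "pl"), "the": ("Det", None),
    "dog": ("N", "sg"), "dogs": ("N", "pl"),
    "runs": ("V", "sg"), "run": ("V", "pl"),
}

def accepts(sentence):
    """Check a three-word Det N V sentence for number agreement."""
    words = sentence.lower().split()
    if len(words) != 3 or any(w not in LEXICON for w in words):
        return False
    (c1, det_num), (c2, n_num), (c3, v_num) = (LEXICON[w] for w in words)
    if (c1, c2, c3) != ("Det", "N", "V"):
        return False
    np_num = bind([det_num, n_num])           # Det and N must share ?n
    if np_num is FAIL:
        return False
    return bind([np_num, v_num]) is not FAIL  # NP and VP must share ?n

for s in ["this dog runs", "these dogs run", "the dogs run", "these dog runs", "this dog run"]:
    print(s, "=>", accepts(s))
```

The first three sentences are accepted and the last two are rejected. The underspecified `the` simply takes on the number of the noun it combines with.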

In [8]:
nltk.data.show_cfg('grammars/book_grammars/feat0.fcfg')
#In the result, notice how the variables are used for the NUM and the TENSE feature.

% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> N[NUM=?n] 
NP[NUM=?n] -> PropN[NUM=?n] 
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl] 
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg]-> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children' 
IV[TENSE=pres,  NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres,  NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'


You can see there are other features too including the `TENSE` feature.

In [9]:
#trace of feature-based chart parser.
tokens = "Kim likes children".split()
from nltk import load_parser
cp = load_parser('grammars/book_grammars/feat0.fcfg', trace=2)
for tree in cp.parse(tokens):
    print(tree)

|.Kim .like.chil.|
Leaf Init Rule:
|[----]    .    .| [0:1] 'Kim'
|.    [----]    .| [1:2] 'likes'
|.    .    [----]| [2:3] 'children'
Feature Bottom Up Predict Combine Rule:
|[----]    .    .| [0:1] PropN[NUM='sg'] -> 'Kim' *
Feature Bottom Up Predict Combine Rule:
|[----]    .    .| [0:1] NP[NUM='sg'] -> PropN[NUM='sg'] *
Feature Bottom Up Predict Combine Rule:
|[---->    .    .| [0:1] S[] -> NP[NUM=?n] * VP[NUM=?n] {?n: 'sg'}
Feature Bottom Up Predict Combine Rule:
|.    [----]    .| [1:2] TV[NUM='sg', TENSE='pres'] -> 'likes' *
Feature Bottom Up Predict Combine Rule:
|.    [---->    .| [1:2] VP[NUM=?n, TENSE=?t] -> TV[NUM=?n, TENSE=?t] * NP[] {?n: 'sg', ?t: 'pres'}
Feature Bottom Up Predict Combine Rule:
|.    .    [----]| [2:3] N[NUM='pl'] -> 'children' *
Feature Bottom Up Predict Combine Rule:
|.    .    [----]| [2:3] NP[NUM='pl'] -> N[NUM='pl'] *
Feature Bottom Up Predict Combine Rule:
|.    .    [---->| [2:3] S[] -> NP[NUM=?n] * VP[NUM=?n] {?n: 'pl'}
Feature Single Edge Fundame

## Terminology

Simple values like `sg` and `pl` are **atomic** feature values: they cannot be decomposed into subparts. A special case of atomic values is **boolean** values, which are either true or false. For example, we might want to distinguish *auxiliary verbs* like `can`, `may`, `will` and `do` with the boolean feature `AUX`. Then the production

`V[TENSE=pres, AUX=+] -> 'can'` means that `can` receives the value `pres` for `TENSE` and `+` (true) for `AUX`. By convention, `AUX=+` and `AUX=-` are abbreviated as `+AUX` and `-AUX`. Some representative productions are:

`V[TENSE=pres, +AUX] -> 'can'`

`V[TENSE=pres, +AUX] -> 'may'`

`V[TENSE=pres, -AUX] -> 'walks'`

`V[TENSE=pres, -AUX] -> 'likes'`
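
To make the boolean shorthand concrete, the sketch below turns the feature list inside the square brackets into a Python dictionary, mapping `+NAME` and `-NAME` to `True` and `False`. `parse_feature` and `parse_features` are hypothetical helpers for illustration; they are not NLTK's own grammar reader and only handle flat, comma-separated lists.

```python
def parse_feature(spec):
    """Turn 'NAME=value', '+NAME' or '-NAME' into a (name, value) pair."""
    spec = spec.strip()
    if spec.startswith("+"):
        return spec[1:], True
    if spec.startswith("-"):
        return spec[1:], False
    name, value = (part.strip() for part in spec.split("=", 1))
    if value in ("+", "-"):
        # The long form NAME=+ / NAME=- means the same as +NAME / -NAME.
        return name, value == "+"
    return name, value

def parse_features(text):
    """Parse a comma-separated feature list into a dictionary."""
    return dict(parse_feature(part) for part in text.split(","))

print(parse_features("TENSE=pres, +AUX"))  # {'TENSE': 'pres', 'AUX': True}
print(parse_features("TENSE=pres, -AUX"))  # {'TENSE': 'pres', 'AUX': False}
```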

A more radical approach is to represent the whole category as a bundle of features. For example, `N[NUM=sg]` contains part-of-speech information, which can be represented as `POS=N`. Hence an alternative notation is `[POS=N, NUM=sg]`.

We can also group ***agreement features*** as a distinguished part of a category, serving as the value of `AGR`. In this case we say `AGR` has a **complex** value. It can be expressed as an **attribute value matrix (AVM)**.
```
[POS = N           ]
[                  ]
[AGR = [PER = 3   ]]
[      [NUM = pl  ]]
[      [GND = fem ]]
```
This AVM represents a noun that is 3rd person, plural and feminine.
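
Since a complex value is just a feature structure nested inside another one, the dictionaries used at the start of this section can represent the AVM directly. The sketch below is one possible encoding; `get_path` and `compatible` are hypothetical helpers, and the two verb entries are invented for illustration.

```python
# The AVM as a nested dictionary: AGR has a complex (dictionary) value.
noun = {"POS": "N", "AGR": {"PER": 3, "NUM": "pl", "GND": "fem"}}

def get_path(fs, *path):
    """Follow a sequence of feature names into a nested feature structure."""
    for feat in path:
        fs = fs[feat]
    return fs

def compatible(agr1, agr2):
    """True if the two AGR bundles do not disagree on any feature they share."""
    return all(agr1[f] == agr2[f] for f in agr1.keys() & agr2.keys())

# Verbs typically do not mark gender, so their AGR bundle has no GND feature.
verb_pl = {"POS": "V", "AGR": {"PER": 3, "NUM": "pl"}}
verb_sg = {"POS": "V", "AGR": {"PER": 3, "NUM": "sg"}}

print(get_path(noun, "AGR", "NUM"))             # pl
print(compatible(noun["AGR"], verb_pl["AGR"]))  # True
print(compatible(noun["AGR"], verb_sg["AGR"]))  # False
```

Grouping the agreement features under `AGR` means that agreement can be checked by comparing one value instead of listing `PER`, `NUM` and `GND` separately.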