# Python Parsing with NLTK

**(C) 2017-2021 by [Damir Cavar](http://damir.cavar.me/)**

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This tutorial accompanies the discussion of grammar engineering and parsing in the classes *Alternative Syntactic Theories* and *Advanced Natural Language Processing*, taught at Indiana University in Spring 2017, Fall 2018 and 2020.

## Working with Grammars

The following examples are taken from the NLTK [parsing HOWTO](http://www.nltk.org/howto/parse.html) page.

In [1]:
from nltk import Nonterminal, nonterminals, Production, CFG

In [2]:
nt1 = Nonterminal('NP')
nt2 = Nonterminal('VP')

In [3]:
nt1.symbol()

'NP'

In [4]:
nt1 == Nonterminal('NP')

True

In [5]:
nt1 == nt2

False

In [6]:
S, NP, VP, PP = nonterminals('S, NP, VP, PP')
print(S.symbol())

S


In [7]:
N, V, P, DT = nonterminals('N, V, P, DT')

In [9]:
prod1 = Production(S, [NP, VP])

In [10]:
prod2 = Production(NP, [DT, NP])

In [11]:
prod1.lhs()

S

In [12]:
prod1.rhs()

(NP, VP)

In [13]:
prod1 == Production(S, [NP, VP])

True

In [14]:
prod1 == prod2

False
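
Productions built this way can be combined into a grammar object. The following is a minimal sketch, assuming the `CFG(start, productions)` constructor; the toy rules and the names `toy_grammar` and `lexical` are illustrative and not taken from the HOWTO.

```python
from nltk import CFG, Production, nonterminals

# Nonterminal symbols for a toy grammar (illustrative only).
S, NP, VP, DT, N, V = nonterminals('S, NP, VP, DT, N, V')

# Terminals are plain Python strings on the right-hand side.
productions = [
    Production(S, [NP, VP]),
    Production(NP, [DT, N]),
    Production(VP, [V, NP]),
    Production(DT, ['the']),
    Production(N, ['cat']),
    Production(N, ['dog']),
    Production(V, ['chased']),
]

# The first argument is the start symbol.
toy_grammar = CFG(S, productions)

# Lexical productions have at least one terminal on the right-hand side.
lexical = [p for p in toy_grammar.productions() if p.is_lexical()]
print(toy_grammar.start(), len(toy_grammar.productions()), len(lexical))
```

Building the grammar from objects rather than from a string can be convenient when the rules are generated by a program.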

In [16]:
grammar = CFG.fromstring("""
 S -> NP VP
 PP -> P NP
 NP -> 'the' N | N PP | 'the' N PP
 NP -> D N
 D -> 'a'
 VP -> V NP | V PP | V NP PP
 N -> 'cat'
 N -> 'fish'
 N -> 'alligator'
 N -> 'dog'
 N -> 'rug'
 N -> 'mouse'
 V -> 'chased'
 V -> 'sat'
 P -> 'in'
 P -> 'on'
""")

In [17]:
print(grammar)

Grammar with 20 productions (start state = S)
    S -> NP VP
    PP -> P NP
    NP -> 'the' N
    NP -> N PP
    NP -> 'the' N PP
    NP -> D N
    D -> 'a'
    VP -> V NP
    VP -> V PP
    VP -> V NP PP
    N -> 'cat'
    N -> 'fish'
    N -> 'alligator'
    N -> 'dog'
    N -> 'rug'
    N -> 'mouse'
    V -> 'chased'
    V -> 'sat'
    P -> 'in'
    P -> 'on'
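
A grammar object can also be queried for individual rules. The sketch below assumes the `productions(lhs=..., rhs=...)` filter of `CFG`, where `rhs` matches the first symbol of a right-hand side; the grammar `small_grammar` is a hypothetical reduced version of the one above.

```python
from nltk import CFG, Nonterminal

# A reduced, illustrative grammar.
small_grammar = CFG.fromstring("""
S -> NP VP
NP -> 'the' N | 'the' N PP
PP -> P NP
VP -> V NP | V PP
N -> 'cat' | 'rug'
V -> 'chased' | 'sat'
P -> 'on'
""")

# All rules that expand NP.
np_rules = small_grammar.productions(lhs=Nonterminal('NP'))

# All rules whose right-hand side starts with the terminal 'cat'.
cat_rules = small_grammar.productions(rhs='cat')

for rule in np_rules + cat_rules:
    print(rule)
```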


## Feature Structures

Complex, nested feature structures can be built from a bracketed string description:

In [18]:
import nltk

fstr = nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]")
print(fstr)

[       [ GND = 'fem' ] ]
[ AGR = [ NUM = 'pl'  ] ]
[       [ PER = 3     ] ]
[                       ]
[ POS = 'N'             ]


Shared paths (reentrancy) are also possible. A tag such as `(1)` marks a value, and `->(1)` points back to it:

In [19]:
fstr2 = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                          SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
print(fstr2)

[ ADDRESS = (1) [ NUMBER = 74           ] ]
[               [ STREET = 'rue Pascal' ] ]
[                                         ]
[ NAME    = 'Lee'                         ]
[                                         ]
[ SPOUSE  = [ ADDRESS -> (1)  ]           ]
[           [ NAME    = 'Kim' ]           ]
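
The effect of a shared path becomes visible under unification: information added through one path shows up under the other. This is a small sketch of that behavior; the structure `city` and the variable names are illustrative.

```python
import nltk

lee = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                         SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")

# New information that only mentions the spouse's address.
city = nltk.FeatStruct("[SPOUSE=[ADDRESS=[CITY='Paris']]]")

merged = lee.unify(city)
print(merged)

# Because the two ADDRESS paths are shared, both now carry the city.
own_city = merged['ADDRESS']['CITY']
spouse_city = merged['SPOUSE']['ADDRESS']['CITY']
print(own_city, spouse_city)
```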


Let us create feature structures and try out unification:

In [22]:
fs1 = nltk.FeatStruct("[AGR=[PER=3, NUM='pl', GND='fem'], POS='N']")
fs2 = nltk.FeatStruct("[POS='N', AGR=[PER=3, GND='fem']]")

print(fs1.unify(fs2))

[       [ GND = 'fem' ] ]
[ AGR = [ NUM = 'pl'  ] ]
[       [ PER = 3     ] ]
[                       ]
[ POS = 'N'             ]
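
To make explicit what unification does in this simple case, here is a minimal pure-Python sketch over nested dictionaries. It is not NLTK's implementation: the function name `unify_dicts` is hypothetical, and reentrancy and variables are deliberately ignored.

```python
def unify_dicts(fs1, fs2):
    """Unify two feature structures given as nested dicts.

    Returns a new dict with the combined information,
    or None if the two structures carry conflicting values.
    """
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            # Only one side specifies this feature: just copy it over.
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            # Both sides have a nested structure: unify recursively.
            sub = unify_dicts(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            # Atomic values clash: unification fails.
            return None
    return result


a = {'POS': 'N', 'AGR': {'PER': 3, 'NUM': 'pl', 'GND': 'fem'}}
b = {'POS': 'N', 'AGR': {'PER': 3, 'GND': 'fem'}}
c = {'AGR': {'NUM': 'sg'}}

print(unify_dicts(a, b))  # compatible: the result contains all features
print(unify_dicts(a, c))  # NUM clashes ('pl' vs. 'sg'): None
```

NLTK's `unify` likewise returns `None` when two feature structures are incompatible.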


## Chart Parser

The following examples are taken from the NLTK [parsing HOWTO](http://www.nltk.org/howto/parse.html) page.

In [23]:
import nltk

In [24]:
nltk.parse.chart.demo(2, print_times=False, trace=1,
                       sent='I saw a dog', numparses=1)

* Sentence:
I saw a dog
['I', 'saw', 'a', 'dog']

* Strategy: Bottom-up

|.    I    .   saw   .    a    .   dog   .|
|[---------]         .         .         .| [0:1] 'I'
|.         [---------]         .         .| [1:2] 'saw'
|.         .         [---------]         .| [2:3] 'a'
|.         .         .         [---------]| [3:4] 'dog'
|>         .         .         .         .| [0:0] NP -> * 'I'
|[---------]         .         .         .| [0:1] NP -> 'I' *
|>         .         .         .         .| [0:0] S  -> * NP VP
|>         .         .         .         .| [0:0] NP -> * NP PP
|[--------->         .         .         .| [0:1] S  -> NP * VP
|[--------->         .         .         .| [0:1] NP -> NP * PP
|.         >         .         .         .| [1:1] Verb -> * 'saw'
|.         [---------]         .         .| [1:2] Verb -> 'saw' *
|.         >         .         .         .| [1:1] VP -> * Verb NP
|.         >         .         .         .| [1:1] VP -> * Verb
|.         [--------->

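This is an example of how to apply top-down parsing. The demo uses its own small built-in grammar, so a sentence containing words that this grammar does not cover raises a `ValueError`: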

In [25]:
nltk.parse.chart.demo(1, print_times=True, trace=0,
                       sent='she killed the man with the tie', numparses=2)

* Sentence:
she killed the man with the tie
['she', 'killed', 'the', 'man', 'with', 'the', 'tie']

* Strategy: Top-down



ValueError: Grammar does not cover some of the input words: "'she', 'killed', 'man', 'tie'".
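
An error of this kind can be anticipated by checking the tokens against the grammar before parsing. The following sketch assumes the `check_coverage` method of `CFG`; the grammar `demo_like` and the helper `uncovered_words` are hypothetical and only loosely modeled on the demo.

```python
from nltk import CFG

# A small illustrative grammar, not the one built into the demo.
demo_like = CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det Noun
VP -> Verb NP
Det -> 'the' | 'a'
Noun -> 'dog'
Verb -> 'saw'
""")


def uncovered_words(grammar, tokens):
    """Return the error message if a token is missing from the grammar, else None."""
    try:
        grammar.check_coverage(tokens)
    except ValueError as err:
        return str(err)
    return None


ok = uncovered_words(demo_like, 'I saw a dog'.split())
problem = uncovered_words(demo_like, 'she killed the man'.split())
print(ok)
print(problem)
```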

This is how to apply bottom-up parsing. Here, too, the demo grammar does not cover every word of the sentence:

In [29]:
nltk.parse.chart.demo(2, print_times=False, trace=0,
                       sent='I saw John on the roof', numparses=2)

* Sentence:
I saw John on the roof
['I', 'saw', 'John', 'on', 'the', 'roof']

* Strategy: Bottom-up



ValueError: Grammar does not cover some of the input words: "'on', 'roof'".
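
With a grammar that covers the sentence, both strategies can be run directly through the corresponding parser classes. This is a sketch under the assumption that a hand-written grammar is acceptable here; `pp_grammar` is hypothetical and is not the demo's grammar.

```python
from nltk import CFG
from nltk.parse.chart import BottomUpChartParser, TopDownChartParser

# Illustrative grammar that covers 'on' and 'roof'.
pp_grammar = CFG.fromstring("""
S -> NP VP
PP -> Prep NP
NP -> NP PP | Det Noun | 'John' | 'I'
VP -> VP PP | Verb NP
Det -> 'a' | 'the'
Noun -> 'dog' | 'roof'
Verb -> 'saw'
Prep -> 'with' | 'on'
""")

tokens = 'I saw John on the roof'.split()

# Both strategies should find the same analyses; only the search order differs.
top_down = sorted(str(t) for t in TopDownChartParser(pp_grammar).parse(tokens))
bottom_up = sorted(str(t) for t in BottomUpChartParser(pp_grammar).parse(tokens))

for tree in top_down:
    print(tree)
```

The sentence is ambiguous in this grammar: the PP can attach to the verb phrase or to the noun phrase *John*.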

In [30]:
nltk.parse.featurechart.demo(print_times=False,
                              print_grammar=True,
                              parser=nltk.parse.featurechart.FeatureChartParser,
                              sent='I saw John with a dog')


Grammar with 18 productions (start state = S[])
    S[] -> NP[] VP[]
    PP[] -> Prep[] NP[]
    NP[] -> NP[] PP[]
    VP[] -> VP[] PP[]
    VP[] -> Verb[] NP[]
    VP[] -> Verb[]
    NP[] -> Det[pl=?x] Noun[pl=?x]
    NP[] -> 'John'
    NP[] -> 'I'
    Det[] -> 'the'
    Det[] -> 'my'
    Det[-pl] -> 'a'
    Noun[-pl] -> 'dog'
    Noun[-pl] -> 'cookie'
    Verb[] -> 'ate'
    Verb[] -> 'saw'
    Prep[] -> 'with'
    Prep[] -> 'under'

* FeatureChartParser
Sentence: I saw John with a dog
|.I.s.J.w.a.d.|
|[-] . . . . .| [0:1] 'I'
|. [-] . . . .| [1:2] 'saw'
|. . [-] . . .| [2:3] 'John'
|. . . [-] . .| [3:4] 'with'
|. . . . [-] .| [4:5] 'a'
|. . . . . [-]| [5:6] 'dog'
|[-] . . . . .| [0:1] NP[] -> 'I' *
|[-> . . . . .| [0:1] S[] -> NP[] * VP[] {}
|[-> . . . . .| [0:1] NP[] -> NP[] * PP[] {}
|. [-] . . . .| [1:2] Verb[] -> 'saw' *
|. [-> . . . .| [1:2] VP[] -> Verb[] * NP[] {}
|. [-] . . . .| [1:2] VP[] -> Verb[] *
|. [-> . . . .| [1:2] VP[] -> VP[] * PP[] {}
|[---] . . . .| [0:2] S[] ->

## Loading grammars from files and editing them

We will need the following NLTK modules in this section:

In [31]:
import nltk
from nltk import CFG
from nltk.grammar import FeatureGrammar as FCFG

We can load a *grammar* from a file located in the same folder as the current Jupyter notebook in the following way:

In [32]:
cfg = nltk.data.load('spanish1.cfg')
print(cfg)

Grammar with 31 productions (start state = S)
    S -> SN SV
    SV -> v SN
    SV -> v
    SN -> det GN
    GN -> nom_com
    GN -> nom_prop
    det -> 'el'
    det -> 'la'
    det -> 'los'
    det -> 'las'
    det -> 'un'
    det -> 'una'
    det -> 'unos'
    det -> 'unas'
    nom_com -> 'vecino'
    nom_com -> 'ladrones'
    nom_com -> 'mujeres'
    nom_com -> 'bosques'
    nom_com -> 'noche'
    nom_com -> 'flauta'
    nom_com -> 'ventana'
    nom_prop -> 'Jose'
    nom_prop -> 'Lucas'
    nom_prop -> 'Pedro'
    nom_prop -> 'Marta'
    v -> 'toca'
    v -> 'moja'
    v -> 'adoran'
    v -> 'robaron'
    v -> 'escondieron'
    v -> 'rompió'


We instantiate a ChartParser object with this grammar:

In [33]:
cp1 = nltk.parse.ChartParser(cfg)

The *ChartParser* object has a *parse* method that takes a list of tokens as its argument. The token list can be generated using a language-specific tokenizer; in this case we simply tokenize with the Python string method *split*. The *parse* method returns an iterator over parse trees. We loop through the trees and print them:

In [34]:
"los mujeres adoran la Lucas".split()

['los', 'mujeres', 'adoran', 'la', 'Lucas']

In [35]:
for x in cp1.parse("los mujeres adoran la Lucas".split()):
    print(x)

(S
  (SN (det los) (GN (nom_com mujeres)))
  (SV (v adoran) (SN (det la) (GN (nom_prop Lucas)))))
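
Because `parse` yields trees lazily, a sentence that the grammar does not license simply produces no trees. The sketch below illustrates this with a hypothetical reduced grammar, `mini_spanish`, defined inline so that it does not depend on the file `spanish1.cfg`.

```python
import nltk
from nltk import CFG

# Reduced, illustrative fragment in the style of the grammar above.
mini_spanish = CFG.fromstring("""
S -> SN SV
SV -> v SN | v
SN -> det GN
GN -> nom_com | nom_prop
det -> 'las' | 'la'
nom_com -> 'mujeres' | 'flauta'
nom_prop -> 'Marta'
v -> 'adoran' | 'toca'
""")

parser = nltk.parse.ChartParser(mini_spanish)

# Collect the iterator into a list to count the analyses.
trees = list(parser.parse('las mujeres adoran la flauta'.split()))

# All words are known, but SN requires a determiner, so no tree is found.
no_trees = list(parser.parse('mujeres adoran'.split()))

print(len(trees), len(no_trees))
```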


We can also edit a grammar directly:

In [37]:
cfg2 = CFG.fromstring("""
 S -> NP VP
 PP -> P NP
 NP -> 'the' N | N PP | 'the' N PP
 VP -> V NP | V PP | V NP PP
 N -> 'cat'
 N -> 'dog'
 N -> 'bird'
 N -> 'rug'
 N -> 'woman'
 N -> 'man'
 N -> 'tie'
 V -> 'chased'
 V -> 'killed'
 V -> 'sat'
 V -> 'bit'
 P -> 'in'
 P -> 'on'
 P -> 'with'
""")

We parse our example sentences using the same approach as above:

In [40]:
cp2 = nltk.parse.ChartParser(cfg2)
for x in cp2.parse("the cat chased the dog".split()):
    print(x)

(S (NP the (N cat)) (VP (V chased) (NP the (N dog))))
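
A grammar of this shape is ambiguous for sentences with a final prepositional phrase, in which case the parser returns more than one tree. The following sketch uses a hypothetical reduced grammar, `attach_grammar`, to show both attachments.

```python
import nltk
from nltk import CFG

# Illustrative grammar with two places to attach a PP.
attach_grammar = CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> 'the' N | 'the' N PP
VP -> V NP | V NP PP
N -> 'man' | 'woman' | 'tie'
V -> 'killed'
P -> 'with'
""")

parser = nltk.parse.ChartParser(attach_grammar)
tokens = 'the man killed the woman with the tie'.split()
trees = list(parser.parse(tokens))

# One tree attaches the PP inside the object NP, the other to the VP.
for tree in trees:
    print(tree)
```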


The previous example used a context-free grammar (CFG). In the following example we load a context-free grammar with features (FCFG), instantiate a *FeatureChartParser*, and loop through the parse trees licensed by the grammar to print them:

In [41]:
fcfg = nltk.data.load('spanish1.fcfg')
fcp1 = nltk.parse.FeatureChartParser(fcfg)
for x in fcp1.parse(u"Miguel adoró el gato".split()):
    print(x)

(S[]
  (SN[+PROP, gen=?g, num='singular'] (NP[num='singular'] Miguel))
  (SV[num='singular', tiempo='pasado']
    (VT[num='singular', tiempo='pasado'] adoró)
    (SN[-PROP, gen='masculino', num='singular']
      (DET[gen='masculino', num='singular'] el)
      (NC[gen='masculino', num='singular'] gato))))
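
Features make it possible to enforce agreement that a plain CFG cannot express. The sketch below is a hypothetical miniature grammar, `agreement_grammar`, and does not reproduce the contents of `spanish1.fcfg`; it shows that a determiner and noun that disagree in gender are rejected.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChartParser

# Illustrative grammar: the variables ?g and ?n force matching gender and number.
agreement_grammar = FeatureGrammar.fromstring("""
% start S
S -> SN[num=?n] SV[num=?n]
SN[num=?n] -> DET[gen=?g, num=?n] NC[gen=?g, num=?n]
SV[num=?n] -> V[num=?n]
DET[gen=f, num=pl] -> 'las'
DET[gen=m, num=pl] -> 'los'
NC[gen=f, num=pl] -> 'mujeres'
NC[gen=m, num=pl] -> 'ladrones'
V[num=pl] -> 'cantan'
""")

parser = FeatureChartParser(agreement_grammar)

good = list(parser.parse('las mujeres cantan'.split()))
bad = list(parser.parse('los mujeres cantan'.split()))

print(len(good), len(bad))
```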


We can edit a Feature CFG in the same way directly in this notebook and then parse with it:

In [42]:
fcfg2 = FCFG.fromstring("""
% start CP
# ############################
# Grammar Rules
# ############################
CP -> Cbar[stype=decl]
Cbar[stype=decl] -> IP[+TNS]
IP[+TNS] -> DP[num=?n,pers=?p,case=nom] VP[num=?n,pers=?p]
DP[num=?n,pers=?p,case=?k] ->  Dbar[num=?n,pers=?p,case=?k]
Dbar[num=?n,pers=?p] -> D[num=?n,DEF=?d,COUNT=?c] NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p] -> NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p,case=?k] -> D[num=?n,pers=?p,+DEF,type=pron,case=?k]
NP[num=?n,pers=?p,COUNT=?c] -> N[num=?n,pers=?p,type=prop,COUNT=?c]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=1]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=2] DP[case=acc]
PP -> P DP[num=?n,pers=?p,case=acc]
#PP -> P DP[num=?n,pers=?p,case=dat]
#
# ############################
# Lexical Rules
# ############################
D[-DEF,+COUNT,num=sg] -> 'a'
D[-DEF,+COUNT,num=sg] -> 'an'
D[+DEF] -> 'the'
D[+DEF,gen=f,num=sg,case=nom,type=pron] -> 'she'
D[+DEF,gen=m,num=sg,case=nom,type=pron] -> 'he'
D[+DEF,gen=n,num=sg,type=pron] -> 'it'
D[+DEF,gen=f,num=sg,case=acc,type=pron] -> 'her'
D[+DEF,gen=m,num=sg,case=acc,type=pron] -> 'him'
N[num=sg,pers=3,type=prop] -> 'John' | 'Sara' | 'Mary'
V[tns=pres,num=sg,pers=3,val=2] -> 'loves' | 'calls' | 'sees' | 'buys'
N[num=sg,pers=3,-COUNT] -> 'furniture' | 'air' | 'justice'
N[num=sg,pers=3] -> 'cat' | 'dog' | 'mouse'
N[num=pl,pers=3] -> 'cats' | 'dogs' | 'mice'
V[tns=pres,num=sg,pers=3,val=1] -> 'sleeps' | 'snores'
V[tns=pres,num=sg,pers=1,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=sg,pers=2,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=pl,val=1] -> 'sleep' | 'snore'
V[tns=past,val=1] -> 'slept' | 'snored'
V[tns=pres,num=sg,pers=3,val=2] -> 'calls' | 'sees' | 'loves'
V[tns=pres,num=sg,pers=1,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=sg,pers=2,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=pl,val=2] -> 'call' | 'see' | 'love'
V[tns=past,val=2] -> 'called' | 'saw' | 'loved'
""")

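We can now create a parser instance and parse with this grammar. If no parse is found, the sentence is printed with a leading `*`, the usual marker for an ungrammatical sentence: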

In [44]:
fcp2 = nltk.parse.FeatureChartParser(fcfg2, trace=1)
sentence = "John buys a furniture"
result = list(fcp2.parse(sentence.split()))
if result:
    for x in result:
        print(x)
else:
    print("*", sentence)

|.Joh.buy. a .fur.|
|[---]   .   .   .| [0:1] 'John'
|.   [---]   .   .| [1:2] 'buys'
|.   .   [---]   .| [2:3] 'a'
|.   .   .   [---]| [3:4] 'furniture'
|[---]   .   .   .| [0:1] N[num='sg', pers=3, type='prop'] -> 'John' *
|[---]   .   .   .| [0:1] NP[COUNT=?c, num='sg', pers=3] -> N[COUNT=?c, num='sg', pers=3, type='prop'] *
|[---]   .   .   .| [0:1] Dbar[num='sg', pers=3] -> NP[COUNT=?c, DEF=?d, num='sg', pers=3] *
|[---]   .   .   .| [0:1] DP[case=?k, num='sg', pers=3] -> Dbar[case=?k, num='sg', pers=3] *
|[--->   .   .   .| [0:1] IP[+TNS] -> DP[case='nom', num=?n, pers=?p] * VP[num=?n, pers=?p] {?k2: 'nom', ?n: 'sg', ?p: 3}
|.   [---]   .   .| [1:2] V[num='sg', pers=3, tns='pres', val=2] -> 'buys' *
|.   [--->   .   .| [1:2] VP[num=?n, pers=?p] -> V[num=?n, pers=?p, val=2] * DP[case='acc'] {?n: 'sg', ?p: 3}
|.   .   [---]   .| [2:3] D[+COUNT, -DEF, num='sg'] -> 'a' *
|.   .   [--->   .| [2:3] Dbar[num=?n, pers=?p] -> D[COUNT=?c, DEF=?d, num=?n] * NP[COUNT=?c, DEF=?d, num=?n, pers

Countable nouns and articles in a DP:
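
As a minimal sketch of this interaction, the following hypothetical grammar isolates the `COUNT` and `num` features from the larger grammar above. Unlike that grammar, it marks the count nouns explicitly as `+COUNT`; the helper `accepts` is illustrative.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChartParser

# Illustrative DP fragment: determiner and noun must agree in num and COUNT.
dp_grammar = FeatureGrammar.fromstring("""
% start DP
DP[num=?n] -> D[num=?n, COUNT=?c] N[num=?n, COUNT=?c]
D[-DEF, +COUNT, num=sg] -> 'a'
D[+DEF] -> 'the'
N[num=sg, -COUNT] -> 'furniture'
N[num=sg, +COUNT] -> 'cat'
N[num=pl, +COUNT] -> 'cats'
""")

dp_parser = FeatureChartParser(dp_grammar)


def accepts(parser, phrase):
    """Return True if the parser finds at least one tree for the phrase."""
    return len(list(parser.parse(phrase.split()))) > 0


for phrase in ['a cat', 'the furniture', 'a furniture', 'a cats']:
    marker = '' if accepts(dp_parser, phrase) else '* '
    print(marker + phrase)
```

The definite article is left unspecified for `num` and `COUNT`, so it combines with any of the nouns.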

DPs and pronouns
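
A minimal sketch of case-marked pronouns as DPs, assuming a hypothetical reduced grammar in the spirit of the one above: subjects must be nominative and objects accusative.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse.featurechart import FeatureChartParser

# Illustrative fragment: pronouns are determiners that project a DP.
case_grammar = FeatureGrammar.fromstring("""
% start S
S -> DP[case=nom] VP
VP -> V DP[case=acc]
DP[case=?k] -> D[case=?k, type=pron]
D[case=nom, type=pron] -> 'she' | 'he'
D[case=acc, type=pron] -> 'her' | 'him'
V -> 'sees'
""")

case_parser = FeatureChartParser(case_grammar)

# Count the analyses for a well-formed and two ill-formed sentences.
results = {
    sentence: len(list(case_parser.parse(sentence.split())))
    for sentence in ['she sees him', 'her sees he', 'she sees he']
}
print(results)
```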

CP/IP sentence structures

## Different Parsers

The following feature-based parsers are available in NLTK:

- nltk.parse.featurechart.FeatureChartParser
- nltk.parse.featurechart.FeatureTopDownChartParser
- nltk.parse.featurechart.FeatureBottomUpChartParser
- nltk.parse.featurechart.FeatureBottomUpLeftCornerChartParser
- nltk.parse.earleychart.FeatureIncrementalChartParser
- nltk.parse.earleychart.FeatureEarleyChartParser
- nltk.parse.earleychart.FeatureIncrementalTopDownChartParser
- nltk.parse.earleychart.FeatureIncrementalBottomUpChartParser
- nltk.parse.earleychart.FeatureIncrementalBottomUpLeftCornerChartParser

I do not know whether this is an exhaustive list.
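
One way to check the list against an installed NLTK version is to inspect the two modules directly. This is a sketch that only looks at `nltk.parse.featurechart` and `nltk.parse.earleychart` and relies on a naming convention (class names starting with `Feature` and ending with `ChartParser`); the helper `feature_parser_names` is hypothetical, and parsers defined elsewhere would not be found.

```python
import inspect

import nltk.parse.earleychart as earleychart
import nltk.parse.featurechart as featurechart


def feature_parser_names(module):
    """List feature chart parser classes that are defined in the given module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if name.startswith('Feature')
        and name.endswith('ChartParser')
        and obj.__module__ == module.__name__  # skip classes imported from elsewhere
    )


chart_parsers = feature_parser_names(featurechart)
earley_parsers = feature_parser_names(earleychart)

for name in chart_parsers + earley_parsers:
    print(name)
```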

(C) 2017-2021 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))