Week Six Tutorial: Constituency Parser

Sources:

http://www.nltk.org/howto/parse.html

http://www.nltk.org/howto/generate.html

https://markgw.github.io/uh-nlp19/day4/


Import the required libraries

In [1]:
import nltk
from nltk.parse.generate import generate
from nltk.parse import ViterbiParser

Example of defining a CFG (context-free grammar)

In [23]:
grammar_1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

Define an example sentence. Note that this sentence is ambiguous: "with a telescope" can attach either to "a man" or to "saw".

In [3]:
sent_1 = 'John saw a man with a telescope'.split()

Example of parsing with the top-down chart parser.

Note that more than one parse tree is produced.

In [7]:
td_parser = nltk.parse.TopDownChartParser(grammar_1)


for tree in td_parser.parse(sent_1):
    print(tree)

(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P with) (NP (Det a) (N telescope))))))
(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P with) (NP (Det a) (N telescope)))))


Example of parsing with the bottom-up chart parser.

Note that more than one parse tree is produced.

In [8]:
bu_parser = nltk.parse.BottomUpChartParser(grammar_1)


for tree in bu_parser.parse(sent_1):
    print(tree)


(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP John)
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P with) (NP (Det a) (N telescope))))))
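The two outputs above list the same two trees in a different order. A minimal sketch that checks this programmatically, assuming a reduced copy of grammar_1 (only the words needed for this sentence) and the hypothetical names `demo_grammar`, `td_trees` and `bu_trees`:

```python
import nltk

# Reduced copy of grammar_1: same rules, smaller lexicon (an assumption of this sketch).
demo_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw"
  NP -> "John" | Det N | Det N PP
  Det -> "a"
  N -> "man" | "telescope"
  P -> "with"
""")
tokens = "John saw a man with a telescope".split()

# Trees are list-like and not hashable, so compare their string form instead.
td_trees = [str(t) for t in nltk.parse.TopDownChartParser(demo_grammar).parse(tokens)]
bu_trees = [str(t) for t in nltk.parse.BottomUpChartParser(demo_grammar).parse(tokens)]

print(len(td_trees), len(bu_trees))
print(set(td_trees) == set(bu_trees))
```

Both strategies explore the same grammar, so they should agree on the set of parses; only the order in which edges are added to the chart differs.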


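Example of parsing with the shift-reduce parser.

Follow the shift (S) and reduce (R) steps in the trace. Note that in this run the parser reduces to the start symbol S before the whole input has been read and ends with S PP on the stack, so no parse tree is printed.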

In [9]:
sr_parser = nltk.ShiftReduceParser(grammar_1, trace=2)

for tree in sr_parser.parse(sent_1):
    print(tree)

Parsing 'John saw a man with a telescope'
    [ * John saw a man with a telescope]
  S [ 'John' * saw a man with a telescope]
  R [ NP * saw a man with a telescope]
  S [ NP 'saw' * a man with a telescope]
  R [ NP V * a man with a telescope]
  S [ NP V 'a' * man with a telescope]
  R [ NP V Det * man with a telescope]
  S [ NP V Det 'man' * with a telescope]
  R [ NP V Det N * with a telescope]
  R [ NP V NP * with a telescope]
  R [ NP VP * with a telescope]
  R [ S * with a telescope]
  S [ S 'with' * a telescope]
  R [ S P * a telescope]
  S [ S P 'a' * telescope]
  R [ S P Det * telescope]
  S [ S P Det 'telescope' * ]
  R [ S P Det N * ]
  R [ S P NP * ]
  R [ S PP * ]
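The trace above ends with `S PP` on the stack because the parser reduces as soon as any rule matches and never backtracks. The following pure-Python sketch imitates that greedy strategy; it is not NLTK's implementation, and `RULES` and `greedy_shift_reduce` are hypothetical names. It assumes no sentence token is spelled like a nonterminal.

```python
# Rules of grammar_1 as (lhs, rhs) pairs, in declaration order, with a reduced lexicon.
RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("V", ("saw",)),
    ("NP", ("John",)),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
    ("Det", ("a",)),
    ("N", ("man",)),
    ("N", ("telescope",)),
    ("P", ("with",)),
]


def greedy_shift_reduce(tokens):
    """Return the final stack after greedy shift-reduce parsing without backtracking."""
    stack, remaining = [], list(tokens)
    while True:
        # Reduce with the first rule whose right-hand side matches the top of the stack.
        for lhs, rhs in RULES:
            if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:
            # No rule applies: shift the next token, or stop when the input is empty.
            if not remaining:
                return stack
            stack.append(remaining.pop(0))


print(greedy_shift_reduce("John saw a man with a telescope".split()))
print(greedy_shift_reduce("John saw a man".split()))
```

A parse succeeds only when the final stack is exactly `['S']`. For the full sentence the early reduction `VP -> V NP` blocks the later `VP -> V NP PP`, which is why a chart parser finds trees that the greedy parser misses.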


Check whether the grammar is in Chomsky normal form (CNF).

Exercise: try converting the grammar to CNF!

In [12]:
print(grammar_1.is_chomsky_normal_form())

False
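As a starting point for the exercise, it may help to see which rules break the CNF shape (A -> B C or A -> 'a'). A minimal sketch, assuming a reduced copy of grammar_1 and the hypothetical helper `is_cnf_rule`; it checks only the shape of each rule and ignores the stricter textbook conditions on empty productions and the start symbol:

```python
import nltk

# Reduced copy of grammar_1: same rules, smaller lexicon (an assumption of this sketch).
demo_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw"
  NP -> "John" | Det N | Det N PP
  Det -> "a"
  N -> "man" | "telescope"
  P -> "with"
""")


def is_cnf_rule(production):
    """True for A -> B C (two nonterminals) or A -> 'a' (a single terminal)."""
    if len(production) == 2:
        return production.is_nonlexical()
    return len(production) == 1 and production.is_lexical()


violations = [p for p in demo_grammar.productions() if not is_cnf_rule(p)]
for p in violations:
    print(p)
```

The rules it reports are the ones with three symbols on the right-hand side, so those are the ones to binarize.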


Check the grammar's coverage

Note that grammar_1 does not yet contain the word 'I'.

In [13]:
sent_2 = 'I saw a man with a telescope'.split()

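# check_coverage takes the whole token list, so a single call is enough;
# it raises ValueError here because 'I' is not in the grammar
grammar_1.check_coverage(sent_2)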

ValueError: ignored
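The error alone does not always make it obvious which words are missing. A minimal sketch that lists them explicitly, assuming a reduced copy of the grammar without 'I' and the hypothetical names `demo_grammar`, `missing` and `covered`:

```python
import nltk

# Reduced copy of grammar_1 before 'I' is added (an assumption of this sketch).
demo_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw"
  NP -> "John" | Det N | Det N PP
  Det -> "a"
  N -> "man" | "telescope"
  P -> "with"
""")
tokens = "I saw a man with a telescope".split()

# productions(rhs=word) returns the rules whose right-hand side starts with `word`;
# an empty result means no rule introduces that word.
missing = [w for w in tokens if not demo_grammar.productions(rhs=w)]
print(missing)

# check_coverage signals the same problem by raising ValueError.
try:
    demo_grammar.check_coverage(tokens)
    covered = True
except ValueError as err:
    covered = False
    print("not covered:", err)
```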

Add a production rule so that the word 'I' is covered by the grammar

In [33]:
grammar_1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | "I" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
print(grammar_1)

Grammar with 26 productions (start state = S)
    S -> NP VP
    VP -> V NP
    VP -> V NP PP
    PP -> P NP
    V -> 'saw'
    V -> 'ate'
    V -> 'walked'
    NP -> 'John'
    NP -> 'Mary'
    NP -> 'Bob'
    NP -> 'I'
    NP -> Det N
    NP -> Det N PP
    Det -> 'a'
    Det -> 'an'
    Det -> 'the'
    Det -> 'my'
    N -> 'man'
    N -> 'dog'
    N -> 'cat'
    N -> 'telescope'
    N -> 'park'
    P -> 'in'
    P -> 'on'
    P -> 'by'
    P -> 'with'
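Besides printing the whole grammar, individual parts can be queried. A minimal sketch, assuming the same rules as the extended grammar_1 above under the hypothetical name `demo_grammar`:

```python
import nltk
from nltk import Nonterminal

# Same rules as the extended grammar_1; `demo_grammar` is a name used only in this sketch.
demo_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | "I" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
""")

start_symbol = demo_grammar.start()
# Filter the rules by their left-hand side.
np_rules = demo_grammar.productions(lhs=Nonterminal("NP"))

print(start_symbol)
print(len(demo_grammar.productions()))
for rule in np_rules:
    print(rule)
```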


Check whether the words of sent_2 are now covered by grammar_1

In [34]:
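for s in sent_2:
    print('cek kata:', s)  # 'cek kata' means 'checking word'
    # check one word at a time; raises ValueError if the word is not covered
    grammar_1.check_coverage([s])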

cek kata: I
cek kata: saw
cek kata: a
cek kata: man
cek kata: with
cek kata: a
cek kata: telescope


Generate sentences from grammar_1

In [36]:
for sentence in generate(grammar_1, n=10):
    print(' '.join(sentence))

John saw John
John saw Mary
John saw Bob
John saw I
John saw a man
John saw a dog
John saw a cat
John saw a telescope
John saw a park
John saw an man
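The last sentence above, "John saw an man", shows that the grammar overgenerates: a plain CFG with a single Det category has no way to enforce a/an agreement. A minimal sketch that enumerates every sentence of a tiny non-recursive grammar (the grammar and the names `tiny_grammar` and `sentences` are assumptions made for this illustration):

```python
import nltk
from nltk.parse.generate import generate

# A tiny non-recursive grammar, so the full language is finite.
tiny_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP
  NP -> "John" | Det N
  V -> "saw"
  Det -> "a" | "an"
  N -> "man" | "apple"
""")

# Without n, generate() enumerates every sentence the grammar licenses.
sentences = [" ".join(words) for words in generate(tiny_grammar)]

print(len(sentences))
print(sentences[0])
print("John saw an man" in sentences)
```

There are 5 possible noun phrases in both subject and object position, hence 25 sentences, including the ill-formed ones. For a recursive grammar such as grammar_1, passing `n` or `depth` keeps the enumeration bounded.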


Try deriving a grammar from a constituency treebank file

The example here uses the first 5 sentences of Kethu, an Indonesian constituency treebank: https://github.com/ialfina/kethu

Note that you need to **adjust the path to the location of the .mrg files**

In [46]:
from nltk.corpus import BracketParseCorpusReader

ptb = BracketParseCorpusReader(r"drive/My Drive/const_treebank", r".*/*\.mrg")

print(ptb)
print(ptb.sents())
print(ptb.parsed_sents())

<BracketParseCorpusReader in '/content/drive/My Drive/const_treebank'>
[['Kera', 'untuk', '*', 'amankan', 'pesta', 'olahraga'], ['Pemerintah', 'kota', 'Delhi', 'mengerahkan', 'monyet', 'untuk', '*', 'mengusir', 'monyet-monyet', 'lain', 'yang', '*', 'berbadan', 'lebih', 'kecil', 'dari', 'arena', 'Pesta', 'Olahraga', 'Persemakmuran', '.'], ...]
[Tree('NP', [Tree('NN', ['Kera']), Tree('SBAR', [Tree('IN', ['untuk']), Tree('S', [Tree('NP-SBJ', [Tree('-NONE-', ['*'])]), Tree('VP', [Tree('VB', ['amankan']), Tree('NP', [Tree('NP', [Tree('NN', ['pesta']), Tree('NN', ['olahraga'])])])])])])]), Tree('S', [Tree('NP-SBJ', [Tree('NN', ['Pemerintah']), Tree('NN', ['kota']), Tree('NNP', ['Delhi'])]), Tree('VP', [Tree('VB', ['mengerahkan']), Tree('NP', [Tree('NN', ['monyet'])]), Tree('SBAR', [Tree('IN', ['untuk']), Tree('S', [Tree('NP-SBJ', [Tree('-NONE-', ['*'])]), Tree('VP', [Tree('VB', ['mengusir']), Tree('NP', [Tree('NP', [Tree('NN', ['monyet-monyet']), Tree('JJ', ['lain'])]), Tree('SBAR', [Tree('I
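To experiment without the treebank files, a single bracketed tree can be parsed from a string. A minimal sketch that rebuilds the first tree shown in the output above with `Tree.fromstring`; the variable names are hypothetical and the bracketed string is typed in by hand from that output:

```python
from nltk import Tree

# First tree of the corpus output above, written in bracketed form.
bracketed = """
(NP (NN Kera)
    (SBAR (IN untuk)
          (S (NP-SBJ (-NONE- *))
             (VP (VB amankan)
                 (NP (NP (NN pesta) (NN olahraga)))))))
"""
first_tree = Tree.fromstring(bracketed)

print(first_tree.label())
print(first_tree.leaves())

# Each internal node contributes one production: parent -> children.
tree_productions = first_tree.productions()
for production in tree_productions:
    print(production)
```

These per-tree productions are the raw material from which a grammar can be built.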

Induce a PCFG (Probabilistic Context-Free Grammar) from the constituency treebank

In [53]:
from nltk import Nonterminal, nonterminals, Production, PCFG, induce_pcfg

S = Nonterminal('S')

productions = []
for t in ptb.parsed_sents():
    productions += t.productions()
grammar_3 = induce_pcfg(S, productions)
print(grammar_3)

Grammar with 105 productions (start state = S)
    NP -> NN SBAR [0.04]
    NN -> 'Kera' [0.037037]
    SBAR -> IN S [0.75]
    IN -> 'untuk' [0.222222]
    S -> NP-SBJ VP [0.538462]
    NP-SBJ -> -NONE- [0.555556]
    -NONE- -> '*' [0.428571]
    VP -> VB NP [0.285714]
    VB -> 'amankan' [0.0769231]
    NP -> NP [0.04]
    NP -> NN NN [0.16]
    NN -> 'pesta' [0.037037]
    NN -> 'olahraga' [0.037037]
    S -> NP-SBJ VP . [0.153846]
    NP-SBJ -> NN NN NNP [0.111111]
    NN -> 'Pemerintah' [0.037037]
    NN -> 'kota' [0.037037]
    NNP -> 'Delhi' [0.222222]
    VP -> VB NP SBAR [0.0714286]
    VB -> 'mengerahkan' [0.0769231]
    NP -> NN [0.04]
    NN -> 'monyet' [0.185185]
    VP -> VB NP PP [0.214286]
    VB -> 'mengusir' [0.0769231]
    NP -> NP SBAR [0.04]
    NP -> NN JJ [0.12]
    NN -> 'monyet-monyet' [0.037037]
    JJ -> 'lain' [0.2]
    IN -> 'yang' [0.111111]
    VP -> VB ADJP [0.0714286]
    VB -> 'berbadan' [0.0769231]
    ADJP -> RB JJ [1.0]
    RB -> 'lebih' [0.5]
    J
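The probabilities in brackets are relative frequencies: the count of a rule divided by the count of all rules with the same left-hand side. A minimal sketch on two hand-made toy trees (they are not from Kethu, and the names `toy_trees`, `toy_pcfg` and `probs` are hypothetical):

```python
from nltk import Nonterminal, Tree, induce_pcfg

# Two toy trees invented for this sketch.
toy_trees = [
    Tree.fromstring("(S (NP (NN kera)) (VP (VB makan)))"),
    Tree.fromstring("(S (NP (NN monyet) (NN kota)) (VP (VB makan)))"),
]

productions = []
for tree in toy_trees:
    productions += tree.productions()

toy_pcfg = induce_pcfg(Nonterminal("S"), productions)

# Index the probabilities by (lhs, rhs) so they are easy to look up.
probs = {
    (str(p.lhs()), tuple(str(sym) for sym in p.rhs())): p.prob()
    for p in toy_pcfg.productions()
}

print(probs[("NP", ("NN",))])     # NP -> NN occurs once out of 2 NP rules
print(probs[("NN", ("kera",))])   # NN -> 'kera' occurs once out of 3 NN rules
```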

Try parsing a sentence with the induced grammar

In [54]:
sent_3 = 'ribuan monyet amankan pesta'.split()
# example using the bottom-up chart parser
bu_parser = nltk.parse.BottomUpChartParser(grammar_3)

for tree in bu_parser.parse(sent_3):
    print(tree)

(S
  (NP-SBJ (NP (CD ribuan) (NN monyet)))
  (VP (VB amankan) (NP (NN pesta))))
(S
  (NP-SBJ (NP (CD ribuan) (NN monyet)))
  (VP (VB amankan) (NP (NP (NN pesta)))))
(S
  (NP-SBJ (NP (NP (CD ribuan) (NN monyet))))
  (VP (VB amankan) (NP (NN pesta))))
(S
  (NP-SBJ (NP (NP (CD ribuan) (NN monyet))))
  (VP (VB amankan) (NP (NP (NN pesta)))))


Test parsing with the Viterbi parser, which returns the single parse tree with the highest total probability

In [56]:
from nltk.parse import ViterbiParser

sent_3 = 'ribuan monyet amankan pesta'.split()
# example using the Viterbi parser
parser = ViterbiParser(grammar_3, trace=2)
for t in parser.parse(sent_3):
    t.pretty_print()

Inserting tokens into the most likely constituents table...
   Insert: |=...| ribuan
   Insert: |.=..| monyet
   Insert: |..=.| amankan
   Insert: |...=| pesta
Finding the most likely constituents spanning 1 text elements...
   Insert: |=...| CD -> 'ribuan' [0.25]
   Insert: |=...| NP -> CD [0.04]
  Discard: |=...| NP -> NP [0.04]
   Insert: |=...| NP-SBJ -> NP [0.111111]
  Discard: |=...| NP -> NP [0.04]
   Insert: |.=..| NN -> 'monyet' [0.185185]
   Insert: |.=..| NP -> NN [0.04]
  Discard: |.=..| NP -> NP [0.04]
   Insert: |.=..| NP-SBJ -> NP [0.111111]
  Discard: |.=..| NP -> NP [0.04]
   Insert: |..=.| VB -> 'amankan' [0.0769231]
   Insert: |...=| NN -> 'pesta' [0.037037]
   Insert: |...=| NP -> NN [0.04]
  Discard: |...=| NP -> NP [0.04]
   Insert: |...=| NP-SBJ -> NP [0.111111]
  Discard: |...=| NP -> NP [0.04]
Finding the most likely constituents spanning 2 text elements...
   Insert: |==..| NP -> CD NN [0.04]
  Discard: |==..| NP -> NP [0.04]
   Insert: |==..| NP-SBJ -> NP [0.
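To see how the Viterbi parser resolves an ambiguity, the telescope sentence can be parsed with a small hand-written PCFG. This is a minimal sketch: the probabilities below are invented for illustration, not estimated from any treebank, and `toy_pcfg` and `best` are hypothetical names.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# Invented probabilities; the rules for each left-hand side sum to 1.
toy_pcfg = PCFG.fromstring("""
  S -> NP VP [1.0]
  VP -> V NP [0.6] | V NP PP [0.4]
  PP -> P NP [1.0]
  V -> "saw" [1.0]
  NP -> "John" [0.3] | Det N [0.5] | Det N PP [0.2]
  Det -> "a" [1.0]
  N -> "man" [0.5] | "telescope" [0.5]
  P -> "with" [1.0]
""")
tokens = "John saw a man with a telescope".split()

# The parser yields only the most probable tree.
best_trees = list(ViterbiParser(toy_pcfg).parse(tokens))
best = best_trees[0]

print(best)
print(best.prob())
```

A tree's probability is the product of the probabilities of the rules it uses. Attaching the PP to the VP gives 0.3 × 0.4 × 0.5 × 0.5 × 0.5 × 0.5 = 0.0075, while attaching it to the NP gives 0.3 × 0.6 × 0.2 × 0.5 × 0.5 × 0.5 = 0.0045, so under these invented numbers the VP attachment is returned.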