# Chunking and Parsing

**Chunking**, or **shallow parsing**, is an analysis of a sentence that identifies its constituents (noun groups, verbs, verb groups, etc.) but specifies neither their internal structure nor their role in the main sentence.

![chunking](http://www.nltk.org/book/tree_images/ch07-tree-1.png)

**Parsing**, or **syntactic analysis**, is the process of analysing a string of symbols, in either a natural language or a computer language, according to the rules of a formal grammar.

![parsing](http://www.nltk.org/book/tree_images/ch08-tree-4.png)

---
Both chunking and parsing can be approached in two ways:

* **Grammars**
* **Machine learning**


In [1]:
from __future__ import unicode_literals

## Chunking

In [2]:
sentence = 'A geladeira BRASTEMP é uma boa companheira para os momentos mais felizes da sua cozinha'

In [3]:
# tokenize and tag the sentence
import nlpnet
nlpnet.set_data_dir(b'/usr/share/nlpnet_data/')

tagger = nlpnet.POSTagger()
tag_sentence = tagger.tag(sentence)[0]

print(tag_sentence)

[(u'A', u'ART'), (u'geladeira', u'N'), (u'BRASTEMP', u'NPROP'), (u'\xe9', u'V'), (u'uma', u'ART'), (u'boa', u'ADJ'), (u'companheira', u'N'), (u'para', u'PREP'), (u'os', u'ART'), (u'momentos', u'N'), (u'mais', u'ADV'), (u'felizes', u'ADJ'), (u'da', u'PREP+ART'), (u'sua', u'PROADJ'), (u'cozinha', u'N')]


In [4]:
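# Grammar-based chunking: an NP is an optional adjectival pronoun (PROADJ),
# a noun (N), then any number of proper nouns (NPROP) and adjectives (ADJ)
grammar = "NP: {<PROADJ>?<N><NPROP>*<ADJ>*}"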

import nltk
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag_sentence)
print(result)

# For a machine-learning chunker, see http://www.nltk.org/book/ch07.html#code-unigram-chunker
# See also the PALAVRAS parser: http://beta.visl.sdu.dk/visl/en/parsing/automatic/parse.php

(S
  A/ART
  (NP geladeira/N BRASTEMP/NPROP)
  é/V
  uma/ART
  boa/ADJ
  (NP companheira/N)
  para/PREP
  os/ART
  (NP momentos/N)
  mais/ADV
  felizes/ADJ
  da/PREP+ART
  (NP sua/PROADJ cozinha/N))
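
The machine-learning route can be sketched in the same spirit as the unigram chunker from the NLTK book: instead of writing a grammar, a tagger learns which IOB chunk tag (`B-NP`, `I-NP`, `O`) most often goes with each POS tag. The block below is only a minimal sketch, assuming Python 3 and a recent NLTK. The training data is a tiny hand-labelled set invented for illustration, and `chunk` is a hypothetical helper name; a real chunker would be trained on an annotated corpus.

```python
import nltk
from nltk.chunk import conlltags2tree

# Hypothetical hand-labelled training data (assumption, for illustration only):
# one list per sentence, each item is (POS tag, IOB chunk tag).
train = [
    [("ART", "O"), ("N", "B-NP"), ("NPROP", "I-NP"), ("V", "O"),
     ("ART", "O"), ("ADJ", "O"), ("N", "B-NP")],
    [("PREP", "O"), ("ART", "O"), ("N", "B-NP"), ("ADV", "O"), ("ADJ", "O")],
    [("PREP+ART", "O"), ("PROADJ", "B-NP"), ("N", "I-NP")],
]

# The unigram tagger memorises the most frequent chunk tag for each POS tag.
chunk_tagger = nltk.UnigramTagger(train)


def chunk(tagged_sentence):
    """Turn a list of (word, POS) pairs into a chunk tree."""
    words = [word for word, _ in tagged_sentence]
    pos_tags = [pos for _, pos in tagged_sentence]
    # Tag the POS sequence with IOB chunk tags, then rebuild a tree.
    iob_tags = [iob for _, iob in chunk_tagger.tag(pos_tags)]
    return conlltags2tree(list(zip(words, pos_tags, iob_tags)))


tree = chunk([("A", "ART"), ("geladeira", "N"), ("BRASTEMP", "NPROP"), ("é", "V")])
print(tree)
```

Because each decision depends only on the current POS tag, this sketch cannot distinguish contexts the way the grammar can; it is meant to show the shape of the approach, not to compete with it.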


## Parsing

In [5]:
# Floresta is a treebank covering both Brazilian and European Portuguese, available in NLTK
from nltk.corpus import floresta

In [6]:
sent = floresta.parsed_sents()[1]

In [7]:
print(sent)

(STA+fcl
  (SUBJ+np (>N+art O) (H+prop 7_e_Meio))
  (P+v-fin é)
  (SC+np
    (>N+art um)
    (H+n ex-libris)
    (N<+pp
      (H+prp de)
      (P<+np (>N+art a) (H+n noite) (N<+adj algarvia))))
  (. .))
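
A parsed sentence is an `nltk.tree.Tree`, so its constituents can be inspected directly. The sketch below, assuming Python 3 and a recent NLTK, rebuilds the tree printed above from its bracketed string (so that it runs without the Floresta corpus installed) and collects the noun phrases. Treating every label ending in `+np` as a noun phrase is an assumption made for this example.

```python
from nltk.tree import Tree

# The bracketed form of the sentence printed above.
sent = Tree.fromstring("""
(STA+fcl
  (SUBJ+np (>N+art O) (H+prop 7_e_Meio))
  (P+v-fin é)
  (SC+np
    (>N+art um)
    (H+n ex-libris)
    (N<+pp
      (H+prp de)
      (P<+np (>N+art a) (H+n noite) (N<+adj algarvia))))
  (. .))
""")

print(sent.label())   # label of the root node
print(sent.leaves())  # the tokens, in sentence order

# Collect every subtree whose label marks a noun phrase (assumed: "+np" suffix).
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in sent.subtrees(lambda t: t.label().endswith("+np"))
]
for phrase in noun_phrases:
    print(phrase)
```

Unlike the flat chunks from the previous section, these noun phrases nest: the phrase `a noite algarvia` sits inside the larger one headed by `ex-libris`.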


In [8]:
sent.productions()

[STA+fcl -> SUBJ+np P+v-fin SC+np .,
 SUBJ+np -> >N+art H+prop,
 >N+art -> 'O',
 H+prop -> '7_e_Meio',
 P+v-fin -> '\xe9',
 SC+np -> >N+art H+n N<+pp,
 >N+art -> 'um',
 H+n -> 'ex-libris',
 N<+pp -> H+prp P<+np,
 H+prp -> 'de',
 P<+np -> >N+art H+n N<+adj,
 >N+art -> 'a',
 H+n -> 'noite',
 N<+adj -> 'algarvia',
 . -> '.']
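
These productions are enough to illustrate the grammar-based route to parsing: they can be collected into a context-free grammar and handed to a chart parser. The following is a minimal sketch, assuming Python 3 and a recent NLTK. It again rebuilds the tree from its bracketed string so that it runs without the corpus, and a grammar read off a single sentence can only parse sentences built from those same rules and words.

```python
from nltk import CFG, ChartParser
from nltk.tree import Tree

# The same sentence as above, rebuilt from its bracketed form.
sent = Tree.fromstring("""
(STA+fcl
  (SUBJ+np (>N+art O) (H+prop 7_e_Meio))
  (P+v-fin é)
  (SC+np
    (>N+art um)
    (H+n ex-libris)
    (N<+pp
      (H+prp de)
      (P<+np (>N+art a) (H+n noite) (N<+adj algarvia))))
  (. .))
""")

# Use the left-hand side of the first production (the root) as the start symbol.
productions = sent.productions()
grammar = CFG(productions[0].lhs(), productions)

# Parse the bare token sequence with the induced grammar.
parser = ChartParser(grammar)
tokens = sent.leaves()
parses = list(parser.parse(tokens))

print(len(parses))
for tree in parses:
    print(tree)
```

Collecting productions over a whole treebank instead of one sentence is the usual starting point for inducing a broader grammar from parsed data.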

Python has few ready-made toolkits for full syntactic parsing, and training a parser yourself is a fairly complicated task.

The state of the art in parsing is the [Stanford Statistical Parser](http://nlp.stanford.edu/software/lex-parser.shtml), which is implemented in Java; NLTK provides a wrapper for it.

For Portuguese, there is a well-known grammar-based syntactic parser called PALAVRAS. More information:

http://beta.visl.sdu.dk/constraint_grammar.html

http://beta.visl.sdu.dk/visl/en/parsing/automatic/parse.php