# Experimental noun phrase chunker

EstNLTK includes an experimental noun phrase chunker, which detects non-overlapping noun phrases in text.

## Basic usage

The chunker uses Vabamorf's morphological analyses and dependency syntactic relations for detecting potential noun phrases.
The following example uses `VislTagger` to create the prerequisite syntactic analysis layer, but you can use any [dependency syntactic layer](https://github.com/estnltk/estnltk/blob/ba471626227238b2b83ef7a3479b315407c44807/tutorials/syntax/syntax.ipynb) that has `'deprel'` and `'head'` attributes marking the relations:

In [1]:
from estnltk import Text
from estnltk.taggers import VislTagger

# Create text for analysis
text = Text('Suur karvane kass nurrus punasel diivanil, väike hiir aga hiilis temast mööda.')
# Add prerequisite layers
text.tag_layer(['morph_extended'])
syntactic_parser = VislTagger()
syntactic_parser.tag(text)
print(text.layers)

{'sentences', 'words', 'morph_analysis', 'morph_extended', 'compound_tokens', 'tokens', 'visl'}


Now we can use `NounPhraseChunker`. The tagger must be initialized with the name of the syntax layer:

In [2]:
from estnltk.taggers.miscellaneous.np_chunker import NounPhraseChunker

np_chunker = NounPhraseChunker('visl')
np_chunker.tag(text)
text.np_chunks

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| np_chunks | | | words | False | 4 |

| text |
|---|
| ['Suur', 'karvane', 'kass'] |
| ['punasel', 'diivanil'] |
| ['väike', 'hiir'] |
| ['temast'] |
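
To give a feel for how `'head'` and `'deprel'` attributes can drive chunking, here is a toy sketch in plain Python. It is not the rule set that `NounPhraseChunker` applies: the function `toy_np_chunks`, the part-of-speech tags, and the `attr`/`subj`/`advl` labels are all hypothetical and hand-written for illustration.

```python
# Toy illustration only: NOT the rules used by NounPhraseChunker.
# POS tags, 'head' indices (1-based, 0 = root) and 'deprel' labels are
# hand-written assumptions, not real parser output.
TOKENS = [
    {"id": 1, "text": "Suur",     "pos": "A", "head": 3, "deprel": "attr"},
    {"id": 2, "text": "karvane",  "pos": "A", "head": 3, "deprel": "attr"},
    {"id": 3, "text": "kass",     "pos": "S", "head": 4, "deprel": "subj"},
    {"id": 4, "text": "nurrus",   "pos": "V", "head": 0, "deprel": "root"},
    {"id": 5, "text": "punasel",  "pos": "A", "head": 6, "deprel": "attr"},
    {"id": 6, "text": "diivanil", "pos": "S", "head": 4, "deprel": "advl"},
]


def is_attribute_of(candidate, noun):
    """True if candidate is a non-noun attribute governed directly by noun."""
    return (candidate["pos"] != "S"
            and candidate["head"] == noun["id"]
            and candidate["deprel"] == "attr")


def toy_np_chunks(tokens):
    """Group each noun with the contiguous attributes directly to its left."""
    chunks = []
    for i, token in enumerate(tokens):
        if token["pos"] != "S":  # 'S' marks a noun in this toy tag set
            continue
        start = i
        # Walk left while the neighbour is an attribute of this noun.
        while start > 0 and is_attribute_of(tokens[start - 1], token):
            start -= 1
        chunks.append([t["text"] for t in tokens[start:i + 1]])
    return chunks


print(toy_np_chunks(TOKENS))
```

Because every absorbed word has exactly one head and nouns are never absorbed, the chunks in this sketch cannot overlap. The real chunker handles many cases this toy ignores, such as pronouns and noun attributes.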


Use `enclosing_text` to obtain the exact strings corresponding to the chunks:

In [3]:
text = Text('Autojuhi lapitekk pälvis linna koduleheküljel paljude kodanike tähelepanu.')
text.tag_layer(['morph_extended'])
syntactic_parser.tag(text)
np_chunker.tag(text)
# Get phrase strings
[chunk.enclosing_text for chunk in text.np_chunks]

['Autojuhi lapitekk', 'linna koduleheküljel', 'paljude kodanike tähelepanu']
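
As a rough mental model, an enclosing string can be thought of as the slice of the raw text running from the start of a chunk's first word to the end of its last word. The plain-Python sketch below reproduces the strings above under that assumption. The helpers `word_offsets` and `enclosing` are hypothetical, and the whitespace tokenization and chunk indices are hard-coded to mirror the output above rather than computed by EstNLTK.

```python
raw = 'Autojuhi lapitekk pälvis linna koduleheküljel paljude kodanike tähelepanu.'


def word_offsets(raw_text, words):
    """Locate each word in raw_text, left to right, as (start, end) pairs."""
    offsets = []
    cursor = 0
    for word in words:
        start = raw_text.index(word, cursor)
        end = start + len(word)
        offsets.append((start, end))
        cursor = end
    return offsets


def enclosing(raw_text, spans):
    """Slice raw_text from the first span's start to the last span's end."""
    return raw_text[spans[0][0]:spans[-1][1]]


# Simplified tokenization: split on whitespace, drop the final period.
words = raw.rstrip('.').split()
offsets = word_offsets(raw, words)

# Word indices of each chunk, copied by hand from the chunker output above.
chunk_word_indices = [[0, 1], [3, 4], [5, 6, 7]]
phrases = [enclosing(raw, [offsets[i] for i in idxs])
           for idxs in chunk_word_indices]
print(phrases)
```

Slicing the raw text, instead of joining word strings with spaces, keeps the original spacing between words intact.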

Because `np_chunks` is an enveloping layer around `words`, you can iterate over the words of each chunk and access their lemmas via the `morph_analysis` layer:

In [4]:
text = Text('Autojuhi lapitekk pälvis linna koduleheküljel paljude kodanike tähelepanu.')
text.tag_layer(['morph_extended'])
syntactic_parser.tag(text)
np_chunker.tag(text)
# Get lemmas of the words from chunks
for chunk in text.np_chunks:
    for word in chunk:
        print(word.text, word.lemma[0])
    print()

Autojuhi autojuht
lapitekk lapitekk

linna linn
koduleheküljel kodulehekülg

paljude palju
kodanike kodanik
tähelepanu tähelepanu
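
The same iteration pattern can be wrapped in a small helper that returns one lemma-normalized string per chunk. The sketch below is an illustration, not part of EstNLTK: `lemma_phrases` and `fake_word` are hypothetical names, and the stand-in objects only imitate the `text` and `lemma` attributes used in the cell above so that the snippet runs without the library.

```python
from types import SimpleNamespace


def lemma_phrases(chunks):
    """Join the first lemma of every word in each chunk into one string."""
    return [' '.join(word.lemma[0] for word in chunk) for chunk in chunks]


def fake_word(text, lemma):
    """Stand-in for a word span: exposes .text and a list-valued .lemma."""
    return SimpleNamespace(text=text, lemma=[lemma])


# Stand-in chunks, copied by hand from the output above.
fake_chunks = [
    [fake_word('Autojuhi', 'autojuht'), fake_word('lapitekk', 'lapitekk')],
    [fake_word('linna', 'linn'), fake_word('koduleheküljel', 'kodulehekülg')],
    [fake_word('paljude', 'palju'), fake_word('kodanike', 'kodanik'),
     fake_word('tähelepanu', 'tähelepanu')],
]

print(lemma_phrases(fake_chunks))
```

On a tagged text, `lemma_phrases(text.np_chunks)` should behave the same way, assuming the word spans expose `lemma` as in the cell above. Taking only the first lemma ignores any remaining morphological ambiguity.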



### Chunking based on MaltParserTagger

The following example uses `MaltParserTagger` to provide the input layer required for chunking:

In [5]:
from estnltk import Text
from estnltk.taggers import ConllMorphTagger
from estnltk.taggers import MaltParserTagger

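# Build the 'conll_morph' layer from 'morph_analysis' instead of 'morph_extended'
conll_morph_tagger = ConllMorphTagger(no_visl=True, morph_extended_layer='morph_analysis')
# Configure MaltParser to read the 'conll_morph' layer; its output layer
# 'maltparser_syntax' is passed to the chunker below
maltparser_tagger = MaltParserTagger(input_conll_morph_layer='conll_morph',
                                     input_type='morph_analysis',
                                     version='conllu', add_parent_and_children=False)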

In [6]:
# Create text and add required layers
text = Text('Juunikuu suveseiklused ootavad Sind juba täna meie uues reisiportaalis.')
text.tag_layer('morph_analysis')
conll_morph_tagger.tag( text )
maltparser_tagger.tag( text )
# Create NP chunker based on maltparser syntactic analysis
from estnltk.taggers.miscellaneous.np_chunker import NounPhraseChunker
np_chunker = NounPhraseChunker('maltparser_syntax')
np_chunker.tag(text)
text.np_chunks

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| np_chunks | | | words | False | 4 |

| text |
|---|
| ['Juunikuu', 'suveseiklused'] |
| ['Sind'] |
| ['meie'] |
| ['uues', 'reisiportaalis'] |

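Single-word chunks such as `['Sind']` and `['meie']` may be unwanted in some applications. A minimal sketch of a post-filter follows. The helper `multiword_chunks` is hypothetical and relies only on chunks being iterable, as shown earlier. The sample data is copied by hand from the output above.

```python
def multiword_chunks(chunks, min_words=2):
    """Keep only the chunks that contain at least min_words words."""
    return [chunk for chunk in chunks
            if sum(1 for _ in chunk) >= min_words]


# Plain lists standing in for the chunks in the output above.
toy_chunks = [
    ['Juunikuu', 'suveseiklused'],
    ['Sind'],
    ['meie'],
    ['uues', 'reisiportaalis'],
]

kept = multiword_chunks(toy_chunks)
print(kept)
```

Counting words by iteration, without calling `len`, keeps the helper dependent only on behaviour this tutorial demonstrates.
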

---