## A brief introduction to Text class, layers and tools

In this tutorial, we give a bird's-eye overview of the `Text` class, layers and tools in EstNLTK 1.6.

### Text class

The `Text` class is the central component of the library. It stores the raw text, related metadata and layers of linguistic annotation. It also provides an interface for calling automatic annotators and manages the dependencies between them.

In [1]:
# Example: creating a Text based on raw text data
from estnltk.text import Text

t = Text('''Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. 
Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, 
käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma.''')
t

text
"Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma."


Once a Text has been created, it can be analysed automatically:

In [2]:
# Example: add segmentation (word, sentence and paragraph tokenization) annotations
t.analyse('segmentation')
t

text
"Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma."

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
words,normalized_form,,,True,40


Alternatively, individual layers can be automatically added using the method `tag_layer`:

In [3]:
t.tag_layer(['words', 'sentences', 'paragraphs'])
t

text
"Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma."

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
words,normalized_form,,,True,40


If metadata about the Text is available, it can be added through the special attribute `meta`:

In [4]:
# Example: add metadata about the text
t.meta['author'] = 'O. Luts'
t.meta['source'] = '"Kevade"'
t

text
"Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma."

0,1
author,O. Luts
source,"""Kevade"""

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
words,normalized_form,,,True,40


Other attributes provide access to the raw text and the annotations:

In [5]:
# Raw text (string)
t.text

'Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. \nKooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, \nkäskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma.'

In [6]:
# Texts from layer 'words' (list of strings)
t.words.text

['Kui',
 'Arno',
 'isaga',
 'koolimajja',
 'jõudis',
 ',',
 'olid',
 'tunnid',
 'juba',
 'alanud',
 '.',
 'Kooliõpetaja',
 'kutsus',
 'mõlemad',
 'oma',
 'tuppa',
 ',',
 'kõneles',
 'nendega',
 'natuke',
 'aega',
 ',',
 'käskis',
 'Arnol',
 'olla',
 'hoolas',
 'ja',
 'korralik',
 'ja',
 'seadis',
 'ta',
 'siis',
 'pinki',
 'ühe',
 'pikkade',
 'juustega',
 'poisi',
 'kõrvale',
 'istuma',
 '.']

In [7]:
# Texts from layer 'sentences' (in the output below, a flat list of word strings)
t.sentences.text

['Kui',
 'Arno',
 'isaga',
 'koolimajja',
 'jõudis',
 ',',
 'olid',
 'tunnid',
 'juba',
 'alanud',
 '.',
 'Kooliõpetaja',
 'kutsus',
 'mõlemad',
 'oma',
 'tuppa',
 ',',
 'kõneles',
 'nendega',
 'natuke',
 'aega',
 ',',
 'käskis',
 'Arnol',
 'olla',
 'hoolas',
 'ja',
 'korralik',
 'ja',
 'seadis',
 'ta',
 'siis',
 'pinki',
 'ühe',
 'pikkade',
 'juustega',
 'poisi',
 'kõrvale',
 'istuma',
 '.']

Note that the attribute `text` returns the texts of continuous snippets. For example, each word is a continuous sequence of characters.

Sentences are not continuous in this sense: they are made of words, and there are gaps (spaces) between the words. Technically, a sentence is not a `Span` but a `SpanList`. So, to get the continuous text that corresponds to a sentence, use the attribute `enclosing_text` instead of `text`:

In [8]:
# Full text corresponding to the 1st element from layer 'sentences' (string)
t.sentences[0].enclosing_text

'Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud.'

In [9]:
# Full texts of all sentences (list of strings)
[s.enclosing_text for s in t.sentences]

['Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud.',
 'Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, \nkäskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma.']
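
The difference between `text` and `enclosing_text` can be illustrated with a small plain-Python sketch. This is a toy model only, not how EstNLTK implements `Span` or `SpanList`: the regular-expression tokenizer, the full-stop sentence splitter and the helper names `texts` and `enclosing_text` are assumptions made for illustration, and the input is a shortened version of the tutorial text.

```python
import re

# Toy illustration only: plain (start, end) offsets, not EstNLTK's Span/SpanList classes.
raw = ('Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. \n'
       'Kooliõpetaja kutsus mõlemad oma tuppa.')

# Hypothetical tokenizer: a word is a run of word characters or one punctuation mark.
word_spans = [m.span() for m in re.finditer(r'\w+|[^\w\s]', raw)]

# Hypothetical sentence splitter: a sentence ends at every full stop.
sentences, current = [], []
for start, end in word_spans:
    current.append((start, end))
    if raw[start:end] == '.':
        sentences.append(current)
        current = []
if current:
    sentences.append(current)


def texts(spans):
    """Texts of the individual spans (the gaps between them are lost)."""
    return [raw[start:end] for start, end in spans]


def enclosing_text(spans):
    """Continuous raw text from the first span's start to the last span's end."""
    return raw[spans[0][0]:spans[-1][1]]


print(texts(sentences[0]))
print(repr(enclosing_text(sentences[0])))
```

In this sketch a sentence is just a list of word offsets, so the only way to recover its continuous text, including spaces and line breaks, is to slice the raw string between the first and the last word.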

### Layer class

Annotations are stored in the Text as Layers. In the following example, we show how to create a new layer from scratch and how to access its elements.

As we have already segmented the input text into words and sentences, we will now create a layer that builds upon existing annotations. More specifically, the new layer will add extra annotations to each word:

In [10]:
# Example: creating a new layer
from estnltk.text import Layer

dep = Layer(name='uppercase', # name of the layer
            parent='words',   # name of the parent layer (i.e. each element of this layer should have a parent in 'words' layer)
            attributes=['upper', 'reverse'] # list of attributes that the new layer will have
            )
t.add_layer(dep) # attach the layer to the Text
# NB! Currently, you cannot attach a layer with the same name twice (unless you delete the old layer).
t

text
"Kui Arno isaga koolimajja jõudis, olid tunnid juba alanud. Kooliõpetaja kutsus mõlemad oma tuppa, kõneles nendega natuke aega, käskis Arnol olla hoolas ja korralik ja seadis ta siis pinki ühe pikkade juustega poisi kõrvale istuma."

0,1
author,O. Luts
source,"""Kevade"""

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
words,normalized_form,,,True,40
uppercase,"upper, reverse",words,,False,0


The new layer is empty. However, we can iterate over the words (the parent layer) and, for each word, add an annotation to the new layer:

In [11]:
# Example: populating the new layer with elements
for word in t.words:
    dep.add_annotation(word, upper=word.text.upper(), reverse=word.text.upper()[::-1])

dep

layer name,attributes,parent,enveloping,ambiguous,span count
uppercase,"upper, reverse",words,,False,40

text,upper,reverse
Kui,KUI,IUK
Arno,ARNO,ONRA
isaga,ISAGA,AGASI
koolimajja,KOOLIMAJJA,AJJAMILOOK
jõudis,JÕUDIS,SIDUÕJ
",",",",","
olid,OLID,DILO
tunnid,TUNNID,DINNUT
juba,JUBA,ABUJ
alanud,ALANUD,DUNALA


In [12]:
# Example: accessing the annotations (attribute values) through the layer name ('uppercase')
# returns an AttributeList - immutable list of attribute values
t.uppercase.upper
t.uppercase.reverse
;

''

In [13]:
# Example: accessing new annotations (attribute values) through the parent layer ('words')
for word in t.words[:11]:
    print(word.uppercase.upper, word.uppercase.reverse)

KUI IUK
ARNO ONRA
ISAGA AGASI
KOOLIMAJJA AJJAMILOOK
JÕUDIS SIDUÕJ
, ,
OLID DILO
TUNNID DINNUT
JUBA ABUJ
ALANUD DUNALA
. .


In [14]:
# Example: getting a subset of new annotations (SpanList of Spans)
t.uppercase[:11]

layer name,attributes,parent,enveloping,ambiguous,span count
uppercase,"upper, reverse",words,,False,11

text,upper,reverse
Kui,KUI,IUK
Arno,ARNO,ONRA
isaga,ISAGA,AGASI
koolimajja,KOOLIMAJJA,AJJAMILOOK
jõudis,JÕUDIS,SIDUÕJ
",",",",","
olid,OLID,DILO
tunnid,TUNNID,DINNUT
juba,JUBA,ABUJ
alanud,ALANUD,DUNALA
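
The attribute values stored in the `uppercase` layer can be reproduced in plain Python, which may help when checking what a layer should contain. The sketch below is a toy model under the assumption that a child layer holds one annotation per parent word; the list-of-dicts representation and the variable names are hypothetical and are not EstNLTK's `Layer` implementation.

```python
# Toy illustration only: plain Python, not EstNLTK's Layer implementation.
words = ['Kui', 'Arno', 'isaga', 'koolimajja', 'jõudis']

# One annotation (a dict of attribute values) per parent word, in the same order.
uppercase = [
    {'text': w, 'upper': w.upper(), 'reverse': w.upper()[::-1]}
    for w in words
]

# Column-style access: read one attribute across all annotations.
upper_column = [annotation['upper'] for annotation in uppercase]
reverse_column = [annotation['reverse'] for annotation in uppercase]

for annotation in uppercase:
    print(annotation['text'], annotation['upper'], annotation['reverse'])
```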


### Tools for linguistic annotations

Linguistic annotations build upon one another. So, before an annotation layer is created automatically, all the layers it depends on must already exist. For this purpose, EstNLTK has a special class, `Resolver`, that automatically resolves the dependencies between annotation layers.

You can use `DEFAULT_RESOLVER` to get an overview of the tools applied by default and their configurations:

In [15]:
# NBVAL_IGNORE_OUTPUT
from estnltk.resolve_layer_dag import DEFAULT_RESOLVER
DEFAULT_RESOLVER.taggers

name,layer,attributes,depends_on,configuration
TokensTagger,tokens,(),(),[apply_punct_postfixes=True]
CompoundTokenTagger,compound_tokens,"(type, normalized)","(tokens,)","[custom_abbreviations=(), ignored_words=set(), tag_numbers=True, tag_units=True, tag_email_and_www=True, tag_emoticons=True, tag_xml=True, tag_initials=True, tag_abbreviations=True, tag_case_endings=True, tag_hyphenations=True, use_custom_abbreviations=False, do_not_join_on_strings=('\n\n',)]"
WordTagger,words,"(normalized_form,)","(tokens, compound_tokens)",[make_ambiguous=True]
SentenceTokenizer,sentences,(),"(words, compound_tokens)","[base_sentence_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x000001B02A869B70>, fix_paragraph_endings=True, fix_compound_tokens=True, fix_numeric=True, fix_parentheses=True, fix_double_quotes=True, fix_inner_title_punct=True, fix_repeated_ending_punct=True, fix_double_quotes_based_on_counts=False, use_emoticons_as_endings=True, record_fix_types=False]"
ParagraphTokenizer,paragraphs,(),"(sentences,)","[regex=\s*\n\n, paragraph_tokenizer=RegexpTokenizer(pattern='\\s*\n\n', gaps=True, discard_empty=True, flags=56)]"
VabamorfTagger,morph_analysis,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)","(words, sentences, compound_tokens)","[guess=True, propername=True, disambiguate=True, compound=True, phonetic=False, postanalysis_tagger=PostMorphAnalysisTagger(('compound_tokens', 'morph_analysis')->morph_analysis)]"
MorphExtendedTagger,morph_extended,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat)","(morph_analysis,)","[punctuation_type_retagger=PunctuationTypeRetagger(('morph_extended',)->morph_extended), morph_to_syntax_morph_retagger=MorphToSyntaxMorphRetagger(('morph_analysis',)->morph_extended), pronoun_type_retagger=PronounTypeRetagger(('morph_extended',)->morph_extended), letter_case_retagger=LetterCaseRetagger(('morph_extended',)->morph_extended), remove_adposition_analyses_retagger=RemoveAdpositionAnalysesRetagger(('morph_extended',)->morph_extended), finite_form_retagger=FiniteFormRetagger(('morph_extended',)->morph_extended), verb_extension_suffix_retagger=VerbExtensionSuffixRetagger(('morph_extended',)->morph_extended), subcat_retagger=SubcatRetagger(('morph_extended',)->morph_extended)]"
ClauseSegmenter,clauses,"(clause_type,)","(words, sentences, morph_analysis)","[ignore_missing_commas=False, use_normalized_word_form=True]"
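
To see what resolving dependencies means in practice, the `depends_on` column above can be turned into a creation order with a depth-first traversal. This is a minimal sketch of the general idea, not EstNLTK's `Resolver` code: the dictionary is copied by hand from the table, and the function name `creation_order` is hypothetical. It assumes the dependency table has no cycles.

```python
# Toy illustration only: not EstNLTK's Resolver implementation.
# The dependencies are copied from the `depends_on` column of the table above.
DEPENDS_ON = {
    'tokens': (),
    'compound_tokens': ('tokens',),
    'words': ('tokens', 'compound_tokens'),
    'sentences': ('words', 'compound_tokens'),
    'paragraphs': ('sentences',),
    'morph_analysis': ('words', 'sentences', 'compound_tokens'),
    'morph_extended': ('morph_analysis',),
    'clauses': ('words', 'sentences', 'morph_analysis'),
}


def creation_order(target, depends_on=DEPENDS_ON):
    """Return the layers needed for `target`, with dependencies listed first."""
    order = []

    def visit(layer):
        if layer in order:
            return  # already scheduled
        for dependency in depends_on[layer]:
            visit(dependency)
        order.append(layer)

    visit(target)
    return order


print(creation_order('paragraphs'))
print(creation_order('clauses'))
```

Each layer appears in the result only after every layer it depends on, so creating the layers in this order never hits a missing dependency.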


If you want to modify the pipeline, you can use the function `make_resolver()` to create a Resolver identical to the default one, and then call its `update()` method to replace an existing tagger with a new one:

In [16]:
# Example: Modifying the pipeline -- replacing an existing tagger
from estnltk.resolve_layer_dag import make_resolver

my_resolver = make_resolver()  # Create a copy of the default pipeline

# Create a new sentence tokenizer that does not split sentences by emoticons
from estnltk.taggers import SentenceTokenizer
new_sentence_tokenizer = SentenceTokenizer(use_emoticons_as_endings=False)

# Replace the sentence tokenizer on the pipeline with the new one
my_resolver.update( new_sentence_tokenizer )

# Test out the new tokenizer
t2 = Text('No mida teksti :) Äge!')
t2.analyse('segmentation', resolver=my_resolver)  # Use new resolver instead of the default one
t2.sentences.text

['No', 'mida', 'teksti', ':)', 'Äge', '!']
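
The copy-then-replace pattern used above can be sketched with a toy registry. Everything in this block is hypothetical (the `ToyResolver` class, the dict-based taggers, `make_toy_resolver`) and rests on the assumption that each tagger is identified by the layer it produces; it only illustrates why the default pipeline stays untouched when a copy is updated, and says nothing about EstNLTK's internals.

```python
import copy


# Toy illustration only: a hypothetical registry, not EstNLTK's Resolver.
class ToyResolver:
    def __init__(self, taggers):
        # Assumption: one tagger per output layer, keyed by the layer name.
        self.taggers = {tagger['layer']: tagger for tagger in taggers}

    def update(self, tagger):
        # Replace whichever tagger currently produces the same layer.
        self.taggers[tagger['layer']] = tagger


DEFAULT = ToyResolver([
    {'name': 'WordTagger', 'layer': 'words',
     'config': {'make_ambiguous': True}},
    {'name': 'SentenceTokenizer', 'layer': 'sentences',
     'config': {'use_emoticons_as_endings': True}},
])


def make_toy_resolver():
    # Return an independent copy so that the default stays unchanged.
    return copy.deepcopy(DEFAULT)


my_resolver = make_toy_resolver()
my_resolver.update({'name': 'SentenceTokenizer', 'layer': 'sentences',
                    'config': {'use_emoticons_as_endings': False}})

print(DEFAULT.taggers['sentences']['config'])
print(my_resolver.taggers['sentences']['config'])
```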