# Restoring tokenization of a pretokenized text

There may be situations where you need to load a pretokenized text so that its original tokenization is preserved exactly.
For instance, the original text may have manually corrected tokenization annotations, and you may want to keep them rather than create new ones automatically, which could introduce errors.

To restore a pretokenized text, first reconstruct the text artificially: join tokens with whitespace and sentences with newlines. Then use `WhiteSpaceTokensTagger`, `PretokenizedTextCompoundTokensTagger`, and a modified `SentenceTokenizer` to restore the layers `'tokens'`, `'compound_tokens'`, `'words'`, and `'sentences'`. A brief example follows.

In [1]:
# Example of a pretokenized text
pretokenized_text = '''
<s>
Maa
suurima
vulkaani
Mauna Loa
kõrgus
on
8742
meetrit
mõõdetuna
Vaikse
ookeani
põhjal
asuvalt
jalamilt
.
</s>
<s>
Mauna Loa
jalam
mahuks
parajasti
Olympus
Mons'i
kaldeerasse
!
</s>
'''

In [2]:
# 1) collect raw words, and multiword expressions
raw_words = []
multiword_expressions = []
raw_tokens = pretokenized_text.strip().split('\n')  # strip() drops the empty first and last lines
for raw_token in raw_tokens:
    if raw_token not in ['<s>', '</s>']:  # Skip sentence boundary tags
        raw_words.append(raw_token)
        if ' ' in raw_token:
            multiword_expressions.append(raw_token)
    elif raw_token == '</s>':
        raw_words[-1] += '\n'  # newline == sentence ending
        
# 2) reconstruct the text
text_str = ' '.join(raw_words)

In [3]:
# 3) create estnltk's text
from estnltk import Text
text = Text(text_str)

Now we can restore the original tokenization. First, let's split the text into tokens on whitespace:

In [4]:
from estnltk.taggers import WhiteSpaceTokensTagger
tokens_tagger = WhiteSpaceTokensTagger()
tokens_tagger.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25


Then we can use `PretokenizedTextCompoundTokensTagger` to restore the multiword (multitoken) expressions from the original text:

In [5]:
# 4) convert multiword expressions into a list of lists of strings
multiword_expressions = [mw.split() for mw in multiword_expressions]

# 5) restore the original compound tokens
from estnltk.taggers import PretokenizedTextCompoundTokensTagger
compound_tokens_tagger = PretokenizedTextCompoundTokensTagger(multiword_units=multiword_expressions)
compound_tokens_tagger.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2


* _Notes on_ `PretokenizedTextCompoundTokensTagger`: 
  
   * multiword expressions passed to `PretokenizedTextCompoundTokensTagger` must be in exactly the same order as they appear in the text;

   * if the original text has no multiword expressions or compound tokens, you still need to create the `'compound_tokens'` layer, because it is a prerequisite for the `'words'` layer. In that case, initialize `PretokenizedTextCompoundTokensTagger` without any arguments, so that it creates an empty `'compound_tokens'` layer.
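
The ordering requirement in the first note can be checked before tagging. The following is a minimal pure-Python sketch, not part of EstNLTK: the helper name `units_appear_in_order` is hypothetical, and it assumes tokens are obtained by whitespace splitting, as in the example above. It scans the tokens from left to right and reports whether every multiword unit is found, in the given order, as a run of consecutive tokens. Here `Vaikse ookeani` is treated as a multiword unit for illustration only.

```python
def units_appear_in_order(tokens, multiword_units):
    """Return True if every unit (a list of strings) occurs in `tokens` as a
    run of consecutive tokens and the units occur in the given order.
    Hypothetical helper for illustration; not an EstNLTK function."""
    position = 0  # index from which the search for the next unit starts
    for unit in multiword_units:
        found = False
        for start in range(position, len(tokens) - len(unit) + 1):
            if tokens[start:start + len(unit)] == unit:
                position = start + len(unit)  # continue after this match
                found = True
                break
        if not found:
            return False
    return True

# First sentence of the example text, split on whitespace
tokens = ('Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit '
          'mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt .').split()

in_order = units_appear_in_order(tokens, [['Mauna', 'Loa'], ['Vaikse', 'ookeani']])
out_of_order = units_appear_in_order(tokens, [['Vaikse', 'ookeani'], ['Mauna', 'Loa']])
print(in_order, out_of_order)
```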

Next, we use the default words tagger to create the `'words'` layer:

In [6]:
# 6) add words layer
text.tag_layer(['words'])

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,23


Finally, we create a customized sentence tokenizer that splits sentences at newlines only:

In [7]:
# 7) create a sentence tokenizer that splits sentences only at newlines
from estnltk.taggers import SentenceTokenizer
from nltk.tokenize.simple import LineTokenizer
newline_sentence_tokenizer = SentenceTokenizer(base_sentence_tokenizer=LineTokenizer())

# 8) split text into sentences by newlines
newline_sentence_tokenizer.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,23
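
To illustrate what splitting at newlines only amounts to in terms of character offsets, here is a small pure-Python sketch. It uses neither EstNLTK nor NLTK, and it is not a description of how `SentenceTokenizer` or `LineTokenizer` work internally; the helper name `line_spans` is hypothetical. The sample string is a shortened version of the reconstructed text, including the space that follows each newline after joining.

```python
def line_spans(text):
    """Return (start, end) character spans of the non-empty lines in `text`.
    Illustration only; not an EstNLTK or NLTK function."""
    spans = []
    offset = 0
    for line in text.split('\n'):
        if line.strip():  # skip empty or whitespace-only lines
            spans.append((offset, offset + len(line)))
        offset += len(line) + 1  # +1 accounts for the newline character
    return spans

# Shortened version of the reconstructed text: one sentence per line
sample = ('Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit .\n'
          ' Mauna Loa jalam mahuks parajasti !\n')

spans = line_spans(sample)
sentences = [sample[start:end].strip() for start, end in spans]
print(sentences)
```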


As a result, the original tokenization is restored in the `Text` object:

In [8]:
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,23

text,normalized_form
Maa,
suurima,
vulkaani,
Mauna Loa,
kõrgus,
on,
8742,
meetrit,
mõõdetuna,
Vaikse,


In [9]:
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
"['Maa', 'suurima', 'vulkaani', 'Mauna Loa', 'kõrgus', 'on', '8742', 'meetrit', ' ..., type: <class 'list'>, length: 15"
"['Mauna Loa', 'jalam', 'mahuks', 'parajasti', 'Olympus', ""Mons'i"", 'kaldeerasse', '!']"

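
One way to confirm the restoration is to derive the expected sentences and words directly from the pretokenized input and compare them with the layers. The sketch below is pure Python; the helper name `expected_tokenization` is hypothetical, and the commented-out comparison at the end assumes that iterating over `text.sentences` yields sentences whose items expose a `.text` attribute, which should be verified against your EstNLTK version.

```python
def expected_tokenization(pretokenized_text):
    """Parse <s>...</s> blocks into a list of sentences, each a list of
    word strings. Hypothetical helper for illustration."""
    sentences = []
    current = None  # words of the sentence being read, or None outside <s>...</s>
    for line in pretokenized_text.strip().split('\n'):
        if line == '<s>':
            current = []
        elif line == '</s>':
            if current is not None:
                sentences.append(current)
            current = None
        elif current is not None and line:
            current.append(line)
    return sentences

# Shortened sample in the same format as the pretokenized text above
sample = '<s>\nMauna Loa\njalam\nmahuks\n!\n</s>\n'
expected = expected_tokenization(sample)
print(expected)

# Assumption (not verified here): with the `text` object from the example,
# the restored words per sentence could be compared like this:
#   restored = [[word.text for word in sentence] for sentence in text.sentences]
#   assert restored == expected_tokenization(pretokenized_text)
```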