# Restoring tokenization of a pretokenized text

In some situations you may want to use the simplest possible tokenization (splitting words on whitespace only), or to load a pretokenized text so that its original tokenization is preserved.
This tutorial shows how to do both in EstNLTK.

**Note**: EstNLTK's tokenization is configurable, and in some situations it is more convenient to change tokenization rules instead of forcing whitespace tokenization. 
For details, see the tokenization tutorials in [nlp_pipeline/A_text_segmentation](../nlp_pipeline/A_text_segmentation).

### Whitespace tokenization for words

You can use `WhiteSpaceTokensTagger` to split the text into `tokens` by whitespace.
After that, you should also use `PretokenizedTextCompoundTokensTagger` to create an empty `compound_tokens` layer; otherwise, the default `CompoundTokensTagger` would still join some of the whitespace-separated tokens into words.
Finally, you can tag the `words` layer as usual:

In [1]:
# Create Text that needs to be tokenized by whitespaces
from estnltk import Text
text = Text('29.04-21.05 täheldati muutust vanuserühmas 25-32 ja/või 55-64 aastat')

In [2]:
from estnltk.taggers import WhiteSpaceTokensTagger
from estnltk.taggers import PretokenizedTextCompoundTokensTagger

# Initialize tools for white space tokenization
tokens_tagger = WhiteSpaceTokensTagger()
compound_tokens_tagger = PretokenizedTextCompoundTokensTagger()

# Perform word tokenization
tokens_tagger.tag(text)
compound_tokens_tagger.tag(text)
text.tag_layer('words') # join tokens and compound_tokens layers

# Browse results
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,8

text,normalized_form
29.04-21.05,
täheldati,
muutust,
vanuserühmas,
25-32,
ja/või,
55-64,
aastat,


**Be aware:** EstNLTK's analysis tools assume that punctuation (especially sentence-ending punctuation) is separated from word tokens.
The whitespace tokenization shown above does not separate punctuation from words, so the quality of downstream analysis (sentence tokenization, morphological analysis, etc.) is likely to suffer.
Apply it with care.
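
To make the caveat concrete, the sketch below uses Python's built-in `str.split()` as a stand-in for whitespace tokenization and lists the tokens that keep sentence-ending punctuation attached. It does not call EstNLTK, and the example sentence is made up for illustration.

```python
# Plain str.split() stands in for whitespace tokenization here
# (an assumption for illustration; this is not the EstNLTK tagger).
sentence = 'Mauna Loa on vulkaan. Kas see on kõrge?'
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# Tokens that still carry sentence-ending punctuation at their end
glued = [tok for tok in whitespace_tokens if len(tok) > 1 and tok[-1] in '.!?']
print(glued)
```

Tokens such as `vulkaan.` are the ones that downstream tools would have trouble with, because the word and the punctuation mark arrive as a single unit.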

### Restoring a pretokenized text

A more advanced use case is loading a pretokenized text. For instance, if the original text has manually corrected word and sentence tokenization, you may want to preserve the correct annotations instead of automatically creating new ones (which may introduce errors).

To restore a pretokenized text, you first need to reconstruct the text string artificially, joining tokens by spaces and sentences by newlines. Then use `WhiteSpaceTokensTagger`, `PretokenizedTextCompoundTokensTagger`, and a modified `SentenceTokenizer` to restore the layers `'tokens'`, `'compound_tokens'`, `'words'`, and `'sentences'`. A brief example follows.

In [3]:
# Example of a pretokenized text
pretokenized_text = '''
<s>
Maa
suurima
vulkaani
Mauna Loa
kõrgus
on
8742
meetrit
mõõdetuna
Vaikse
ookeani
põhjal
asuvalt
jalamilt
.
</s>
<s>
Mauna Loa
jalam
mahuks
parajasti
Olympus
Mons'i
kaldeerasse
!
</s>
'''

In [4]:
# 1) collect raw words and multiword expressions
raw_words = []
multiword_expressions = []
raw_tokens = pretokenized_text.split('\n')
for raw_token in raw_tokens:
    if not raw_token:
        continue  # skip the empty lines at the start and end of the string
    if raw_token not in ['<s>', '</s>']:  # skip sentence boundary tags
        raw_words.append(raw_token)
        if ' ' in raw_token:
            multiword_expressions.append(raw_token)
    elif raw_token == '</s>':
        raw_words[-1] += '\n'  # newline == sentence ending

# 2) reconstruct the text
text_str = ' '.join(raw_words)

In [5]:
# 3) create estnltk's text
from estnltk import Text
text = Text(text_str)
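
If the pretokenized data is already held as nested lists (one list of words per sentence) rather than as tagged lines, the reconstruction can be sketched as two joins. The input below is hypothetical and the variable names are chosen for illustration only. Unlike the loop above, this variant puts no space after the newline; both forms separate tokens by whitespace and sentences by newlines.

```python
# Hypothetical input: sentences as lists of words; a word may contain a space
sentences = [
    ['Mauna Loa', 'on', 'vulkaan', '.'],
    ['Kas', 'Mauna Loa', 'on', 'kõrge', '?'],
]

# Words within a sentence are joined by single spaces, sentences by newlines
reconstructed = '\n'.join(' '.join(words) for words in sentences)

# Multiword units in order of appearance, each split into its tokens
multiword_units = [word.split()
                   for words in sentences
                   for word in words if ' ' in word]

print(repr(reconstructed))
print(multiword_units)
```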

Now, we can restore the original tokenization annotation. First, let's split the text into tokens by whitespaces:

In [6]:
from estnltk.taggers import WhiteSpaceTokensTagger
tokens_tagger = WhiteSpaceTokensTagger()
tokens_tagger.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25


Then we can use `PretokenizedTextCompoundTokensTagger` to restore the multiword (multitoken) expressions from the original text:

In [7]:
# 4) convert multiword expressions into a list of lists of strings
multiword_expressions = [mw.split() for mw in multiword_expressions]

# 5) restore the original compound tokens
from estnltk.taggers import PretokenizedTextCompoundTokensTagger
compound_tokens_tagger = PretokenizedTextCompoundTokensTagger(multiword_units=multiword_expressions)
compound_tokens_tagger.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2


* _Notes on_ `PretokenizedTextCompoundTokensTagger`:

   * the multiword expressions passed to `PretokenizedTextCompoundTokensTagger` must be in exactly the same order as they appear in the text;

   * if the original text has no multiword expressions or compound tokens, you still need to create the `'compound_tokens'` layer, because it is a prerequisite of the `'words'` layer. In that case, initialize `PretokenizedTextCompoundTokensTagger` without any parameters, and it will create an empty `'compound_tokens'` layer.
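
The ordering requirement from the first note can be checked before tagging. The sketch below is plain Python with a hypothetical helper, `units_follow_text_order`, that is not part of EstNLTK; it only tests whether each unit can be matched in the token sequence from left to right. The example sentence is made up.

```python
def units_follow_text_order(tokens, multiword_units):
    """Return True if every unit can be matched in the token sequence,
    left to right, without going back (hypothetical helper)."""
    position = 0
    for unit in multiword_units:
        found = False
        while position + len(unit) <= len(tokens):
            if tokens[position:position + len(unit)] == unit:
                position += len(unit)  # continue the search after this match
                found = True
                break
            position += 1
        if not found:
            return False
    return True

tokens = 'Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit . Olympus Mons on Marsil .'.split()
in_order = [['Mauna', 'Loa'], ['Olympus', 'Mons']]
out_of_order = [['Olympus', 'Mons'], ['Mauna', 'Loa']]

print(units_follow_text_order(tokens, in_order))      # True
print(units_follow_text_order(tokens, out_of_order))  # False
```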

Next, we use the default words tagger to create the `'words'` layer:

In [8]:
# 6) add words layer
text.tag_layer(['words'])

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,23


Finally, we create a customized sentence tagger that will split sentences by newlines only:

In [9]:
# 7) create a sentence tokenizer that splits sentences only at newlines
from estnltk.taggers import SentenceTokenizer
from nltk.tokenize.simple import LineTokenizer
newline_sentence_tokenizer = SentenceTokenizer(base_sentence_tokenizer=LineTokenizer())

# 8) split text into sentences by newlines
newline_sentence_tokenizer.tag(text)

text
Maa suurima vulkaani Mauna Loa kõrgus on 8742 meetrit mõõdetuna Vaikse ookeani põhjal asuvalt jalamilt . Mauna Loa jalam mahuks parajasti Olympus Mons'i kaldeerasse !

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,25
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,23
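
Conceptually, splitting by newlines means that each non-blank line of the reconstructed string becomes one sentence span. The sketch below illustrates that idea in plain Python with a hypothetical helper, `line_spans`; it is not the NLTK or EstNLTK implementation, and the input string is made up.

```python
def line_spans(text):
    """Yield (start, end) character spans of non-blank lines
    (hypothetical helper, for illustration only)."""
    start = 0
    for line in text.splitlines(keepends=True):
        content = line.rstrip('\n')
        if content.strip():
            yield (start, start + len(content))
        start += len(line)  # advance past the line, including its newline

text_str = 'Mauna Loa on vulkaan .\n Kas see on kõrge ?\n'
spans = list(line_spans(text_str))
print(spans)
print([text_str[start:end] for start, end in spans])
```

Because sentence boundaries depend only on the newlines, punctuation inside a line never starts a new sentence.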


As a result, the original tokenization is restored in the `Text` object:

In [10]:
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,23

text,normalized_form
Maa,
suurima,
vulkaani,
Mauna Loa,
kõrgus,
on,
8742,
meetrit,
mõõdetuna,
Vaikse,


In [11]:
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
['Maa', 'suurima', 'vulkaani', 'Mauna Loa', 'kõrgus', 'on', '8742', 'meetrit', ' ..., type: <class 'list'>, length: 15
['Mauna Loa', 'jalam', 'mahuks', 'parajasti', 'Olympus', "Mons'i", 'kaldeerasse', '!']
