# <span style="color:purple">Morphological analysis with HFST analyser</span>

In addition to [Vabamorf](https://github.com/Filosoft/vabamorf/)'s morphological analysis (`VabamorfTagger`), EstNLTK also has an alternative morphological analysis model, which is based on [HFST](https://github.com/hfst/hfst) (_Helsinki Finite-State Technology_).
Currently, the HFST-based model is still under development, and it is not as complete or as thoroughly tested as Vabamorf's. 
Still, it can be a viable alternative to Vabamorf's analyser, especially from the perspective of analysing compound words.

## HfstClMorphAnalyser

Before using `HfstClMorphAnalyser`, you need to install HFST [command line tools](https://github.com/hfst/hfst/wiki/Command-Line-Tools). 
You can find installation instructions for different platforms [here](https://github.com/hfst/hfst/wiki/Download-And-Install#download-and-install-hfst).
After the installation, the location of the command line tools should be in the system's PATH variable. 
You can check whether the tools are properly installed and available by typing the following in a terminal:

In [1]:
!hfst-lookup --version

hfst-lookup 0.6 (hfst 3.15.2)
Copyright (C) 2017 University of Helsinki,
License GPLv3: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


If you see version information above, then the installation should be ok.
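
The same check can also be done from Python before creating the tagger. The snippet below is a minimal sketch that only relies on the standard library; the helper name `find_hfst_lookup` is made up for this illustration and is not part of EstNLTK:

```python
import shutil

def find_hfst_lookup(tool_name="hfst-lookup"):
    """Return the full path of the tool if it is on PATH, otherwise None.

    Note: this is a hypothetical helper for illustration, not an EstNLTK function.
    """
    return shutil.which(tool_name)

path = find_hfst_lookup()
if path is None:
    print("hfst-lookup was not found on PATH; check the HFST installation.")
else:
    print("hfst-lookup found at:", path)
```

If the tool is not found, adding the directory of the HFST command line tools to PATH and restarting the notebook kernel is usually the first thing to try.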

### Basic usage

`HfstClMorphAnalyser` uses `'words'` as the input layer, and creates the `'hfst_gt_morph_analysis'` layer on top of the `'words'` layer. 
Note that the output layer will be _ambiguous_, as morphological disambiguation is currently not available.

In [2]:
# import and initialize HfstClMorphAnalyser
from estnltk.taggers import HfstClMorphAnalyser
hfst_analyser = HfstClMorphAnalyser()

In [3]:
# create input text
from estnltk import Text
text = Text('Mäesuusatamine on üsna lõbupakkuv.')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"morphemes_lemmas, postags, forms, is_guessed, has_clitic, usage, weight",words,,True,5

text,morphemes_lemmas,postags,forms,is_guessed,has_clitic,usage,weight
Mäesuusatamine,"('mäesuusatamine',)","('N',)","('Sg+Nom',)",False,False,(),40.0
,"('mägi', 'suusatamine')","('N', 'N')","('Sg+Gen', 'Sg+Nom')",False,False,(),56.0
,"('mägi', 'suusatama', 'mine')","('N', 'V', 'N')","('Sg+Gen', 'Der', 'Sg+Nom')",False,False,(),66.0
on,"('olema',)","('V',)","('Pers+Prs+Ind+Pl3+Aff',)",False,False,(),0.0
,"('olema',)","('V',)","('Pers+Prs+Ind+Sg3+Aff',)",False,False,(),0.0
üsna,"('üsna',)","('Adv',)","('',)",False,False,(),5.0
lõbupakkuv,"('lõbu', 'pakkuv')","('N', 'A')","('Sg+Par', 'Sg+Nom')",False,False,(),58.0
,"('lõbu', 'pakkuma', 'v')","('N', 'V', 'A')","('Sg+Par', 'Der', 'Sg+Nom')",False,False,(),65.0
.,"('.',)","('CLB',)","('',)",False,False,(),0.0


In the `'hfst_gt_morph_analysis'` layer, each analysis has the following attributes:


   * **`morphemes_lemmas`** -- a tuple containing the morphemes and/or lemmas that the word consists of. The transducer's output does not mark the distinction between morphemes and lemmas, so the attribute name is also ambiguous. Linguistically, a part of a compound word can be normalised as a lemma: for instance, in `('mägi','suusatama','mine')`, `'mäe'` from the original word was normalised into the lemma `'mägi'`, and `'suusata'` was normalised into the lemma `'suusatama'`. A part of a compound word can also be a morpheme -- e.g., in `('mägi','suusatama','mine')`, `'mine'` is a morpheme (it does not stand on its own as a word or a lemma);


   * **`postags`** -- a tuple containing part of speech tags corresponding to the word parts in `morphemes_lemmas`. The tuple always has the same size as `morphemes_lemmas`, and if the part of speech tag for some morpheme/lemma is missing, then the corresponding place is filled in with an empty string. The tagset differs slightly from Vabamorf's and GT's; you can trace it from the definitions in the file https://victorio.uit.no/langtech/trunk/experiment-langs/est/src/morphology/lexlang.xfscript?p=177977;
   
   
   * **`forms`** -- a tuple containing form categories corresponding to the word parts in `morphemes_lemmas`. The tuple always has the same size as `morphemes_lemmas`, and if the form categories of a morpheme/lemma are missing or unknown, then the corresponding place is filled in with an empty string. The categories are similar to GT's, but not exactly the same; you can trace them from the definitions in the file https://victorio.uit.no/langtech/trunk/experiment-langs/est/src/morphology/lexlang.xfscript?p=177977;


   * **`is_guessed`** -- a boolean indicating whether some part of the word (some of the `morphemes_lemmas`) was guessed;
   
   
   * **`has_clitic`** -- a boolean indicating whether the word ends with a clitic (_-ki_ or _-gi_);
   

   * **`usage`** -- a tuple containing remarks about the word's usage. It is usually filled in for rare words or irregular inflections;
   
   
   * **`weight`** -- the weight of the analysis. A lower weight indicates a higher likelihood of the analysis, but please keep in mind that adjusting the weights is still a work in progress.
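
Because no disambiguation is available, one simple way to pick a single reading is to keep the analysis with the lowest weight. The following is a minimal sketch of that idea on plain dictionaries whose values are copied from the example output above. The function `lowest_weight_analyses` is a hypothetical helper, not an EstNLTK function, and since the weights are still being tuned, the result should be treated as a heuristic only:

```python
# Plain-Python stand-ins for the analyses of two words, copied from the
# example output above (only some of the attributes are reproduced).
maesuusatamine = [
    {'morphemes_lemmas': ('mäesuusatamine',), 'postags': ('N',), 'weight': 40.0},
    {'morphemes_lemmas': ('mägi', 'suusatamine'), 'postags': ('N', 'N'), 'weight': 56.0},
    {'morphemes_lemmas': ('mägi', 'suusatama', 'mine'), 'postags': ('N', 'V', 'N'), 'weight': 66.0},
]
on = [
    {'morphemes_lemmas': ('olema',), 'forms': ('Pers+Prs+Ind+Pl3+Aff',), 'weight': 0.0},
    {'morphemes_lemmas': ('olema',), 'forms': ('Pers+Prs+Ind+Sg3+Aff',), 'weight': 0.0},
]

def lowest_weight_analyses(analyses):
    """Return all analyses that share the minimal weight (ties are kept).

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    if not analyses:
        return []
    best_weight = min(a['weight'] for a in analyses)
    return [a for a in analyses if a['weight'] == best_weight]

best = lowest_weight_analyses(maesuusatamine)
tied = lowest_weight_analyses(on)
print(best[0]['morphemes_lemmas'], best[0]['weight'])
print(len(tied), 'analyses remain for "on"')
```

Note that ties are kept on purpose: for `on`, both the singular and the plural reading have the same weight, so the weight alone cannot resolve the ambiguity.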
   

### Guessed and unknown words

The boolean attribute `is_guessed` shows whether the word was guessed by the analyser.
However, if a word is unknown and the analyser was unable to guess it, all attribute values of the analysis will be set to `None`, except `weight`, which will be set to `inf`:

In [4]:
# create input text
from estnltk import Text
text = Text('Kiwikübaraga BaabaJagaa')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"morphemes_lemmas, postags, forms, is_guessed, has_clitic, usage, weight",words,,True,2

text,morphemes_lemmas,postags,forms,is_guessed,has_clitic,usage,weight
Kiwikübaraga,"('kiwi', 'kübar')","('', 'N')","('', 'Sg+Com')",True,False,(),241.0
BaabaJagaa,,,,,,,inf
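
The convention above makes it straightforward to separate known, guessed, and unknown words. Below is a minimal sketch on plain dictionaries that mirror the two rows of the output; the helper `classify` is a hypothetical name introduced for this illustration and is not part of EstNLTK:

```python
import math

def classify(annotation):
    """Label an analysis as 'unknown', 'guessed' or 'known'.

    An analysis counts as unknown when its weight is inf and its
    morphemes_lemmas value is None, following the convention described above.
    Hypothetical helper for illustration; not part of EstNLTK.
    """
    if math.isinf(annotation['weight']) and annotation['morphemes_lemmas'] is None:
        return 'unknown'
    if annotation['is_guessed']:
        return 'guessed'
    return 'known'

# Values copied from the example output above.
annotations = {
    'Kiwikübaraga': {'morphemes_lemmas': ('kiwi', 'kübar'), 'is_guessed': True, 'weight': 241.0},
    'BaabaJagaa': {'morphemes_lemmas': None, 'is_guessed': None, 'weight': float('inf')},
}

for word, annotation in annotations.items():
    print(word, classify(annotation))
```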


#### Excluding guesses

You can exclude all guesses from the output if you initialize `HfstClMorphAnalyser` with the setting `remove_guesses=True`:

In [5]:
# initialize HfstClMorphAnalyser that excludes guesses from the output
hfst_analyser_no_guesses = HfstClMorphAnalyser(remove_guesses=True)

In [6]:
# create input text
from estnltk import Text
text = Text('Kiwikübaraga BaabaJagaa')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser_no_guesses.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"morphemes_lemmas, postags, forms, is_guessed, has_clitic, usage, weight",words,,True,2

text,morphemes_lemmas,postags,forms,is_guessed,has_clitic,usage,weight
Kiwikübaraga,,,,,,,inf
BaabaJagaa,,,,,,,inf


### Output raw analyses

By default, `HfstClMorphAnalyser` tries to extract morphemes/lemmas and their corresponding postags/forms from the output (the output format called `'morphemes_lemmas'`). 
If you want to get the original output of the HFST analyser (that of the command line tool [hfst-lookup](https://github.com/hfst/hfst/wiki/HfstLookUp)), then you need to change the output format to `'raw'`:

In [7]:
# import and initialize an HfstClMorphAnalyser that outputs raw analyses
from estnltk.taggers import HfstClMorphAnalyser
hfst_analyser_raw = HfstClMorphAnalyser(output_format='raw')

In [8]:
# create input text
from estnltk import Text
text = Text('Mäesuusatamine on üsna lõbupakkuv.')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser_raw.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"raw_analysis, weight",words,,True,5

text,raw_analysis,weight
Mäesuusatamine,mäesuusatamine+N+Sg+Nom,40.0
,mägi+N+Sg+Gen#suusatamine+N+Sg+Nom,56.0
,mägi+N+Sg+Gen#suusatama+V+Der/mine+N+Sg+Nom,66.0
on,olema+V+Pers+Prs+Ind+Pl3+Aff,0.0
,olema+V+Pers+Prs+Ind+Sg3+Aff,0.0
üsna,üsna+Adv,5.0
lõbupakkuv,lõbu+N+Sg+Par#pakkuv+A+Sg+Nom,58.0
,lõbu+N+Sg+Par#pakkuma+V+Der/v+A+Sg+Nom,65.0
.,.+CLB,0.0


In this output format, there are only two attributes: `raw_analysis`, which holds the morphological analysis of the word, and `weight`, which holds the weight of the corresponding analysis.
As in the `'morphemes_lemmas'` output format, in case of an unknown word, `raw_analysis` will be `None`, and `weight` will be `inf`.
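
To illustrate how the raw strings relate to the `'morphemes_lemmas'` format, here is a rough sketch of a parser for the examples shown above. It assumes that `#` separates compound parts, `/` separates a derivational suffix from its base, and `+` separates the tags of each part. The function `parse_raw_analysis` is hypothetical: it is not EstNLTK's own conversion logic, and it ignores cases that the examples do not show, such as usage remarks, clitics, or tokens and tags that themselves contain `+`, `/` or `#`:

```python
def parse_raw_analysis(raw):
    """Split a raw analysis string into (lemma_or_morpheme, postag, form) triples.

    Simplifying assumptions (valid for the examples in this tutorial only):
    '#' separates compound parts, '/' separates a derivational suffix from
    its base, and '+' separates the tags of each part.
    Hypothetical helper for illustration; not part of EstNLTK.
    """
    parts = []
    for compound_part in raw.split('#'):
        for piece in compound_part.split('/'):
            fields = piece.split('+')
            lemma = fields[0]
            postag = fields[1] if len(fields) > 1 else ''
            form = '+'.join(fields[2:])  # empty string if no form tags follow
            parts.append((lemma, postag, form))
    return parts

for raw in ['mägi+N+Sg+Gen#suusatama+V+Der/mine+N+Sg+Nom', 'üsna+Adv']:
    print(parse_raw_analysis(raw))
```

For the first string, the three resulting triples line up with the `morphemes_lemmas`, `postags` and `forms` values of the third analysis of `Mäesuusatamine` in the default output format.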

### Stream-based vs file-based communication modes

There are two primary ways of communicating with the HFST command line tool. 
First, the stream-based communication: the tool is launched as a "persistent process", and its input and output are exchanged interactively via the STDIN and STDOUT streams.
Second, the file-based communication: the tool is launched every time `tag()` is called, its input is passed as a file, and its output is also read from a file.
File-based communication generally tends to be slower, because the HFST model is loaded again every time `tag()` is called.
However, there may be situations or configurations in which file-based communication outperforms the stream-based one, so you may want to switch between the two modes.
By default, `HfstClMorphAnalyser` uses stream-based communication, but you can change the mode via the `use_stream` flag:

```python
# Initialize HfstClMorphAnalyser in a file-based communication mode
from estnltk.taggers import HfstClMorphAnalyser
hfst_analyser = HfstClMorphAnalyser(use_stream=False)
```

### Notes about the HFST model

The HFST-based morphological analysis model currently used in EstNLTK is based on the source code that is available here: https://victorio.uit.no/langtech/trunk/experiment-langs/est/?p=177977 (source revision 177977 from 2019-03-22). To create a new model, you need to download the source, compile the HFST models, and look for the file `'src/analyser-gt-desc.hfstol'`. This is the file that can be given to `HfstClMorphAnalyser` as the transducer model:

```python
# Initialize HfstClMorphAnalyser with a custom model:
from estnltk.taggers import HfstClMorphAnalyser
hfst_analyser = HfstClMorphAnalyser(transducer_file='analyser-gt-desc.hfstol')
```
