# <span style="color:purple">Morphological analysis with HFST analyser</span>

In addition to [Vabamorf](https://github.com/Filosoft/vabamorf/)'s morphological analysis (`VabamorfTagger`), EstNLTK also has an alternative morphological analysis model, which is based on [HFST](https://github.com/hfst/hfst) (_Helsinki Finite-State Technology_).
Currently, the HFST-based model is still under development, and it is not as complete or as thoroughly tested as Vabamorf's.
Still, it can be a viable alternative to Vabamorf's analyser, especially from the perspective of analysing compound words.

##### Technical notes

Before using the HFST-based analyser, you need to install the Python package [hfst](https://pypi.org/project/hfst/).
You can install it via PyPI, as [the installation instructions recommend](https://pypi.org/project/hfst/#installation-via-pypi). However, on Windows, only 32-bit binary wheels are currently available, so a 32-bit Python is required.
If you are using an Anaconda environment with 64-bit Python on Windows, you need to create an environment with 32-bit Python and install _hfst_ into it (along with _estnltk_):

      REM force using 32-bit Python in conda
      set CONDA_FORCE_32BIT=1
      
      REM create a new environment that uses 32bit Python
      conda create -n py3.5_32bit python=3.5.5

      REM activate the environment and install hfst (and estnltk)
      ...
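
To confirm that the new environment really runs a 32-bit interpreter, you can check the pointer size from within Python. This is a minimal sketch that uses only the standard library; the helper name `python_bitness` is made up for this example and is not part of _hfst_ or _estnltk_.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the size of a C pointer in this interpreter."""
    return struct.calcsize("P") * 8


# Report which build is running in the currently activated environment
print("Python {}.{} is a {}-bit build".format(
    sys.version_info.major, sys.version_info.minor, python_bitness()))
```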
      
**(!)** Once _hfst_ is installed and you intend to use the HFST-based analyser in your code, you must import _hfst_ before you import anything from _estnltk_:

In [1]:
import hfst

The reason: there is an import conflict between _hfst_ and _estnltk.vabamorf_. If you import _estnltk_ first and _hfst_ afterwards, you will get a "segmentation fault" when using _estnltk.vabamorf_.
The conflict has not been resolved yet; until it is, we suggest sticking to this import order to avoid problems.
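
If your code is imported from several places, it can be hard to see which package gets loaded first. The sketch below is one possible guard, not something provided by _estnltk_ or _hfst_: the function name `hfst_import_order_ok` is hypothetical, and it only detects the one case that is visible in `sys.modules`, namely that _estnltk_ has been loaded while _hfst_ has not.

```python
import sys


def hfst_import_order_ok(loaded_modules=None):
    """Return False if estnltk is already loaded but hfst is not.

    `loaded_modules` defaults to sys.modules; it is a parameter only so
    that the check can be tried out without importing either package.
    """
    if loaded_modules is None:
        loaded_modules = sys.modules
    estnltk_loaded = any(
        name == "estnltk" or name.startswith("estnltk.")
        for name in loaded_modules
    )
    # Importing hfst now is only risky if estnltk got there first
    return "hfst" in loaded_modules or not estnltk_loaded


# Simulated module tables instead of real imports
print(hfst_import_order_ok({"hfst": None}))              # hfst first
print(hfst_import_order_ok({"estnltk.vabamorf": None}))  # estnltk first
```

Calling `hfst_import_order_ok()` right before `import hfst` lets you fail early with a readable message. If both packages are already loaded, the function cannot tell which one came first and returns `True`.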



## Using HfstEstMorphAnalyser

### Basic usage

The main class is `HfstEstMorphAnalyser`. 
It uses `'words'` as the input layer and creates the `'hfst_gt_morph_analysis'` layer, whose parent is the `'words'` layer.
Note that the output layer will be _ambiguous_, as morphological disambiguation is currently not available.

In [2]:
# import and initialize HfstEstMorphAnalyser
from estnltk.taggers.morph_analysis.hfst.hfst_gt_morph_analyser import HfstEstMorphAnalyser
hfst_analyser = HfstEstMorphAnalyser()

In [3]:
# create input text
from estnltk import Text
text = Text('Mäesuusatamine on üsna lõbupakkuv.')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"morphemes_lemmas, postags, forms, is_guessed, ...",words,,True,5

text,morphemes_lemmas,postags,forms,is_guessed,has_clitic,usage,weight
Mäesuusatamine,"('mäesuusatamine',)","('N',)","('Sg+Nom',)",False,False,(),30.0
,"('mägi', 'suusatamine')","('N', 'N')","('Sg+Gen', 'Sg+Nom')",False,False,(),41.0
,"('mägi', 'suusatama', 'mine')","('N', 'V', 'N')","('Sg+Gen', 'Der', 'Sg+Nom')",False,False,(),51.0
on,"('olema',)","('V',)","('Pers+Prs+Ind+Pl3+Aff',)",False,False,(),0.0
,"('olema',)","('V',)","('Pers+Prs+Ind+Sg3+Aff',)",False,False,(),0.0
üsna,"('üsna',)","('Adv',)","('',)",False,False,(),0.0
lõbupakkuv,"('lõbu', 'pakkuv')","('N', 'A')","('Sg+Par', 'Sg+Nom')",False,False,(),42.0
,"('lõbu', 'pakkuma', 'v')","('N', 'V', 'A')","('Sg+Par', 'Der', 'Sg+Nom')",False,False,(),52.0
.,"('.',)","('CLB',)","('',)",False,False,(),0.0


In the `'hfst_gt_morph_analysis'` layer, each analysis has the following attributes:


   * **`morphemes_lemmas`** -- a tuple containing the morphemes and/or lemmas that the word consists of. The transducer's output does not mark the distinction between morphemes and lemmas, which is why the attribute name is ambiguous as well. Linguistically, a part of a compound word can be normalised into a lemma: for instance, in `('mägi','suusatama','mine')`, `'mäe'` from the original word was normalised into the lemma `'mägi'`, and `'suusata'` into the lemma `'suusatama'`. A part of a compound word can also be a morpheme: in the same tuple, `'mine'` is a morpheme (it does not stand on its own as a word or a lemma);


   * **`postags`** -- a tuple containing the part of speech tags corresponding to the word parts in `morphemes_lemmas`. The tuple always has the same size as `morphemes_lemmas`; if the part of speech tag of some morpheme/lemma is missing, the corresponding place is filled with an empty string. The tagset differs slightly from those of Vabamorf and GT; you can trace it from the definitions in the file https://victorio.uit.no/langtech/trunk/experiment-langs/est/src/morphology/lexlang.xfscript;
   
   
   * **`forms`** -- a tuple containing the form categories corresponding to the word parts in `morphemes_lemmas`. The tuple always has the same size as `morphemes_lemmas`; if the form categories of a morpheme/lemma are missing or unknown, the corresponding place is filled with an empty string. The categories are similar to GT's, but not exactly the same; you can trace them from the definitions in the file https://victorio.uit.no/langtech/trunk/experiment-langs/est/src/morphology/lexlang.xfscript;


   * **`is_guessed`** -- a boolean indicating whether some part of the word (some of the `morphemes_lemmas`) was guessed;
   
   
   * **`has_clitic`** -- a boolean indicating whether the word ends with a clitic (_-ki_ or _-gi_);
   

   * **`usage`** -- a tuple containing remarks about the word's usage. It is usually filled in for rare words or irregular inflections;
   
   
   * **`weight`** -- the weight of the analysis. A lower weight indicates a higher likelihood of the analysis, but please keep in mind that adjusting the weights is still a work in progress.
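
Because the layer is ambiguous, you will often want to pick one analysis per word. The sketch below illustrates two simple heuristics on plain dictionaries: taking the analysis with the lowest weight, and taking the one that splits the word into the most parts. The data is copied from the table above (with some attributes left out for brevity); the helper names `lowest_weight` and `most_segmented` are hypothetical and not part of EstNLTK.

```python
# Analyses of 'Mäesuusatamine', copied from the output table above
analyses = [
    {"morphemes_lemmas": ("mäesuusatamine",),
     "postags": ("N",), "forms": ("Sg+Nom",), "weight": 30.0},
    {"morphemes_lemmas": ("mägi", "suusatamine"),
     "postags": ("N", "N"), "forms": ("Sg+Gen", "Sg+Nom"), "weight": 41.0},
    {"morphemes_lemmas": ("mägi", "suusatama", "mine"),
     "postags": ("N", "V", "N"), "forms": ("Sg+Gen", "Der", "Sg+Nom"),
     "weight": 51.0},
]


def lowest_weight(analyses):
    """Return the analysis with the smallest weight, or None if there are none."""
    return min(analyses, key=lambda a: a["weight"], default=None)


def most_segmented(analyses):
    """Return the analysis with the most morphemes/lemmas, or None."""
    return max(analyses, key=lambda a: len(a["morphemes_lemmas"]), default=None)


print(lowest_weight(analyses)["morphemes_lemmas"])
print(most_segmented(analyses)["morphemes_lemmas"])
```

For this word the two heuristics disagree: the lowest weight belongs to the unsegmented reading, while the most detailed compound analysis has the highest weight. Which one is preferable depends on the task, and the weights may change as the model develops.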
   

### Guessed and unknown words

The boolean attribute `is_guessed` shows whether the word was guessed by the analyser.
However, if a word is unknown and the analyser was unable to guess it, all attribute values of the analysis are set to `None`, except the weight, which is set to `inf`:

In [4]:
# create input text
from estnltk import Text
text = Text('Vannaema BaabaJagaa')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"morphemes_lemmas, postags, forms, is_guessed, has_clitic, usage, weight",words,,True,2

text,morphemes_lemmas,postags,forms,is_guessed,has_clitic,usage,weight
Vannaema,"('vanna', 'ema')","('', 'N')","('', 'Sg+Nom')",True,False,(),240.0
,"('vanna', 'ema')","('', 'N')","('', 'Sg+Gen')",True,False,(),241.0
,"('vanna', 'ema')","('', 'N')","('', 'Sg+Par')",True,False,(),242.0
BaabaJagaa,,,,,,,inf
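
When post-processing the results, unknown words have to be checked for before `is_guessed`, because `is_guessed` is `None` rather than `False` for them. The following sketch shows one way to do this on plain dictionaries. The records are abbreviated copies of rows from the tables above, and the function `classify` is a hypothetical helper; how you collect such dictionaries from the layer is left out here.

```python
import math


def classify(analysis):
    """Label an analysis as 'unknown', 'guessed' or 'known'."""
    # Unknown words are marked by an infinite weight; test this first,
    # since all other attributes are None in that case
    if math.isinf(analysis["weight"]):
        return "unknown"
    return "guessed" if analysis["is_guessed"] else "known"


# (word, analysis) pairs with values taken from the output tables
records = [
    ("on", {"morphemes_lemmas": ("olema",), "is_guessed": False, "weight": 0.0}),
    ("Vannaema", {"morphemes_lemmas": ("vanna", "ema"), "is_guessed": True,
                  "weight": 240.0}),
    ("BaabaJagaa", {"morphemes_lemmas": None, "is_guessed": None,
                    "weight": float("inf")}),
]

for word, analysis in records:
    print(word, classify(analysis))
```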


### Output raw analyses

By default, `HfstEstMorphAnalyser` tries to extract morphemes/lemmas and their corresponding postags/forms from the output (the output format called `'morphemes_lemmas'`).
If you want the original output of the HFST analyser (similar to the output of the command line tool [hfst-lookup](https://github.com/hfst/hfst/wiki/HfstLookUp)), change the output format to `'raw'`:

In [5]:
# import and initialize a HfstEstMorphAnalyser that outputs raw analyses
from estnltk.taggers.morph_analysis.hfst.hfst_gt_morph_analyser import HfstEstMorphAnalyser
hfst_analyser_raw = HfstEstMorphAnalyser(output_format='raw')

In [6]:
# create input text
from estnltk import Text
text = Text('Mäesuusatamine on üsna lõbupakkuv.')

# add prerequisite layer
text.tag_layer(['words'])

# Tag hfst morph analyses
hfst_analyser_raw.tag(text)

# Examine results
text['hfst_gt_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
hfst_gt_morph_analysis,"raw_analysis, weight",words,,True,5

text,raw_analysis,weight
Mäesuusatamine,mäesuusatamine+N+Sg+Nom,30.0
,mägi+N+Sg+Gen#suusatamine+N+Sg+Nom,41.0
,mägi+N+Sg+Gen#suusatama+V+Der/mine+N+Sg+Nom,51.0
on,olema+V+Pers+Prs+Ind+Pl3+Aff,0.0
,olema+V+Pers+Prs+Ind+Sg3+Aff,0.0
üsna,üsna+Adv,0.0
lõbupakkuv,lõbu+N+Sg+Par#pakkuv+A+Sg+Nom,42.0
,lõbu+N+Sg+Par#pakkuma+V+Der/v+A+Sg+Nom,52.0
.,.+CLB,0.0


In this output format, there are only two attributes: `raw_analysis`, which holds the morphological analysis of the word, and `weight`, which holds the weight of the corresponding analysis.
As in the `'morphemes_lemmas'` output format, in case of an unknown word `raw_analysis` will be `None` and `weight` will be `inf`.
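
Comparing the two tables suggests how the raw strings relate to the `'morphemes_lemmas'` format: `#` and `/` separate the parts of the word, and within a part the `+`-separated fields are the lemma/morpheme, the part of speech tag, and the form categories. The sketch below splits a raw string along these lines. It is a simplified illustration that reproduces the examples on this page, not the parser that `HfstEstMorphAnalyser` uses internally; the function name `split_raw_analysis` is hypothetical, and the sketch ignores clitics, usage remarks, guessed words, and tokens that themselves contain `+`, `#` or `/`.

```python
import re


def split_raw_analysis(raw):
    """Split a raw analysis string into parallel tuples (simplified)."""
    morphemes_lemmas, postags, forms = [], [], []
    # '#' and '/' mark the boundaries between the parts of the word
    for part in re.split(r"[#/]", raw):
        fields = part.split("+")
        morphemes_lemmas.append(fields[0])
        # An absent tag or form is represented by an empty string
        postags.append(fields[1] if len(fields) > 1 else "")
        forms.append("+".join(fields[2:]))
    return {
        "morphemes_lemmas": tuple(morphemes_lemmas),
        "postags": tuple(postags),
        "forms": tuple(forms),
    }


print(split_raw_analysis("mägi+N+Sg+Gen#suusatama+V+Der/mine+N+Sg+Nom"))
print(split_raw_analysis("üsna+Adv"))
```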

### The lookup function

There may be situations when you only need to look up analyses of a single word, without the need for analysing a full text.
For this purpose, `HfstEstMorphAnalyser` provides the function `lookup()`, which analyses the input word and returns the results as a list of dictionaries:

In [7]:
hfst_analyser.lookup('üleguugeldamine')

[{'forms': ('', 'Der', 'Sg+Nom'),
  'has_clitic': False,
  'is_guessed': False,
  'morphemes_lemmas': ('üle', 'guugeldama', 'mine'),
  'postags': ('Adv', 'V', 'N'),
  'usage': (),
  'weight': 50.0}]

The format of the output depends on the `output_format` parameter used in the initialization of the `HfstEstMorphAnalyser`. So, with `hfst_analyser_raw`, we get:

In [8]:
hfst_analyser_raw.lookup('üleguugeldamine')

[{'raw_analysis': 'üle+Adv#guugeldama+V+Der/mine+N+Sg+Nom', 'weight': 50.0}]

In case of an unknown word, the lookup function returns an empty list.

You can only analyse _a single word_ with this function -- if you attempt to analyse sentences or texts, you will also get an empty list.
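
Since `lookup()` handles one word at a time, a list of words has to be processed with one call per word. The sketch below shows the pattern with a stand-in function, so that it runs without _hfst_ installed: `fake_lookup` only imitates the behaviour described above (a list of dictionaries for a known word, an empty list otherwise), and `lookup_each` is a hypothetical helper. With a real analyser you would pass `hfst_analyser_raw.lookup` instead.

```python
def lookup_each(lookup, words):
    """Call `lookup` once per word and collect the results by word."""
    return {word: lookup(word) for word in words}


def fake_lookup(word):
    """Stand-in for an analyser's lookup(): knows a single word only."""
    known = {
        "üleguugeldamine": [
            {"raw_analysis": "üle+Adv#guugeldama+V+Der/mine+N+Sg+Nom",
             "weight": 50.0},
        ],
    }
    # Unknown words (and multi-word input) yield an empty list
    return known.get(word, [])


results = lookup_each(fake_lookup, ["üleguugeldamine", "BaabaJagaa"])
unknown_words = [word for word, found in results.items() if not found]
print(unknown_words)
```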

### Notes about the HFST model

The HFST-based morphological analysis model currently used in EstNLTK is built from the source code available at https://victorio.uit.no/langtech/trunk/experiment-langs/est/ (source revision 176161 from 2019-01-28). To create a new model, download the source, compile the HFST models, and look for the file `'src/analyser-gt-desc.hfstol'`. This is the file that can be given to `HfstEstMorphAnalyser` as the transducer model:

    # Initialize HfstEstMorphAnalyser with a custom model:
    from estnltk.taggers.morph_analysis.hfst.hfst_gt_morph_analyser import HfstEstMorphAnalyser
    hfst_analyser_raw = HfstEstMorphAnalyser(transducer_file = 'analyser-gt-desc.hfstol')
    
