
This page describes multilingual text analysis following the Universal Dependencies (UD) guidelines. Models for more than 60 languages are available. The following two pipelines are intended for UD analysis:

| Pipeline | Description | Input | Output* |
|---------------|--------------------------------------------------------------|------------|---------|
| deepud | Full pipeline, including tokenization and sentence segmentation | plain text | CoNLL-U |
| deepud-pretok | Pipeline that starts from pretokenized text | CoNLL-U | CoNLL-U |

* Pipelines can be configured for CoNLL-03 output (see the conllDumper processing unit configuration in lima_linguisticprocessing/conf/lima-lp-ud.xml).
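To check how the dumper is currently set up, you can simply inspect that file. A minimal sketch, assuming you run it from the root of a LIMA source checkout (the path is the one mentioned above):

# Show the conllDumper configuration referenced above (adjust the path if your
# checkout is organized differently).
$ grep -n -A 10 'conllDumper' lima_linguisticprocessing/conf/lima-lp-ud.xml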

Installation of language models

Use the lima_models.py script to download and install models into the user's home directory (LIMA follows the XDG specification to install and search for its models):

$ lima_models.py -l english

To get information about the available models, use the -i switch:

$ lima_models.py -i
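To set up several languages in one go, you can call the script in a loop. A minimal sketch (the language list is purely illustrative; the names are passed exactly as in the example above):

# Download and install models for several languages (illustrative list).
$ for lang in english french spanish; do lima_models.py -l "$lang"; done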

Alternatively, you can manually download the language packages you need from the Releases section of the lima-models repository. You can use as many language packages simultaneously as you need. Install each language package with apt, e.g.:

$ sudo apt install ./lima-deep-models-english_0.1.5_all.deb
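If you downloaded several packages, you can install them all in one apt call. A minimal sketch, assuming the .deb files were saved to the current directory (file names follow the pattern shown above; versions may differ):

# Install every downloaded language package at once.
$ sudo apt install ./lima-deep-models-*.deb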

Usage

Refer to the LIMA user manual for detailed instructions. In short, when running analyzeText, select the ud "language", the deepud or deepud-pretok pipeline, and pass the three-letter ISO 639-3 code of the required language:

$ analyzeText -l ud -p deepud --meta udlang:eng your-text-file.txt

To analyze already tokenized text (a .conllu input file), use the deepud-pretok pipeline:

$ analyzeText -l ud -p deepud-pretok --meta udlang:eng your-text-file.conllu
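To process a whole directory of plain-text files with the full pipeline, here is a minimal sketch, assuming the CoNLL-U result is written to standard output (depending on your conllDumper configuration it may instead go to a file):

# Analyze every .txt file in the current directory, one .conllu file per input.
$ for f in *.txt; do analyzeText -l ud -p deepud --meta udlang:eng "$f" > "${f%.txt}.conllu"; done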

A short command-line syntax is also available; it works for all languages except English and French:

$ analyzeText -l spa -p deepud your-text-file.txt

For English and French, the short syntax requires UD to be included as part of the language code. This form works equally well for all languages:

$ analyzeText -l ud-eng -p deepud your-text-file.txt
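Because the ud-xxx form works for every language, you can loop over several models with a single syntax. A minimal sketch (the language codes and the per-language input file names, input-eng.txt and so on, are hypothetical):

# Analyze one file per language, selecting the model through the ud-xxx code.
$ for code in eng fra spa; do analyzeText -l "ud-$code" -p deepud "input-$code.txt"; done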

Processing units used

| Stage | Processing unit | deepud | deepud-pretok |
|---------------------------------------------|-----------------|--------|---------------|
| Input | cpptftokenizer | + | |
| Input | conllureader | | + |
| RNN-based PoS tagger and dependency parser | tfmorphosyntax | + | + |
| RNN-based lemmatizer | tflemmatizer | + | + |
| Output | conllDumper | + | + |

For the up-to-date definitions of these pipelines please check the corresponding configuration files: lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml.
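A minimal sketch for locating those files, assuming a LIMA source checkout (the grep simply lists the configuration files that mention the deepud pipelines):

# List UD-related configuration files and those defining the deepud pipelines.
$ ls lima_linguisticprocessing/conf/lima-lp-*.xml
$ grep -l 'deepud' lima_linguisticprocessing/conf/lima-lp-*.xml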

Performance

The current performance of LIMA for all supported languages is reported on a dedicated page. Some of these results, for English and French only, are reproduced below.

Regarding speed, it is important to note that LIMA and UDify are multithreaded and normally use all available CPU cores, whereas UDPipe uses only one thread. This difference is ignored here.

eng English-EWT

| Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
|--------|----------|--------|-----------|-------|-------|--------|--------|-------|-------|-------------|
| lima | raw | 98.85 | 85.14 | 98.85 | 94.89 | 90.81 | 94.17 | 85.15 | 82.06 | 245 |
| lima | gold-tok | 100 | 100 | 100 | 95.95 | 91.86 | 95.09 | 87.91 | 84.65 | 254 |
| udpipe | raw | 98.9 | 86.92 | 98.9 | 93.34 | 94.3 | 95.45 | 81.83 | 78.64 | 1793 |
| udpipe | gold-tok | 100 | 100 | 100 | 94.43 | 95.37 | 96.41 | 84.4 | 81.08 | 2281 |
| udify | gold-tok | 100 | 100 | 100 | 96.29 | 96.19 | 97.39 | 91.12 | 88.53 | 92 |

fra French-Sequoia

| Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
|--------|----------|--------|-----------|-------|-------|--------|--------|-------|-------|-------------|
| lima | raw | 99.69 | 84.22 | 97.94 | 96.06 | 89.28 | 94.91 | 85.09 | 82.4 | 291 |
| lima | gold-tok | 100 | 100 | 100 | 98.25 | 91.33 | 96.91 | 89.1 | 86.48 | 300 |
| udpipe | raw | 99.79 | 87.5 | 99.09 | 96.1 | 94.93 | 96.93 | 84.85 | 82.09 | 3349 |
| udpipe | gold-tok | 100 | 100 | 100 | 97.08 | 95.84 | 97.82 | 86.83 | 84.13 | 3349 |
| udify | gold-tok | 100 | 100 | 100 | 97.93 | 89.41 | 97.24 | 92.07 | 89.22 | 86 |

Limitations

  • Speed: the current speed is around 300 w/s (80-900 w/s depending on the particular language model; see the evaluation page). This may not be acceptable for everyday use. We are still working on speed improvements.
  • RAM consumption: depending on the particular language model and the size of the word embeddings file (see details below), analysis can take up to 32 GB of RAM.
  • The Typo and Abbr features and the XPOS CoNLL-U column are not generated by LIMA.

Tradeoff between word embedding size and analysis quality

The fastText models published by Facebook include word embeddings with subword information. The original binary files are about 7 GB per language, which is too large for practical use. In the lima-models repository we distribute compressed (quantized) versions of these files (about 600 MB per language). Quantization slightly degrades analysis quality. The following table reports the average difference in metrics between the original embeddings and two compressed versions.

| Metric | 1.2 GB embeddings | 0.6 GB embeddings |
|--------|-------------------|-------------------|
| UPOS | -0.01 | -0.03 |
| UAS | -0.02 | -0.05 |
| LAS | -0.03 | -0.1 |

You can manually replace the word embeddings with the original binary files (i.e. with subword information) from fastText.cc, or with the less compressed ones we have published separately. To do this, put the downloaded file in /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/ under the name fasttext-xxx.bin, where xxx is the corresponding ISO 639-3 language code (eng, fra, ...).
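As an illustration, here is a minimal sketch of replacing the English embeddings with the original fastText binary. The download URL follows the usual fastText.cc naming scheme (cc.<lang>.300.bin.gz); check fasttext.cc for the current links. The target path and file name are the ones given above:

# Replace the quantized English embeddings with the full fastText binary.
# Verify the download URL on fasttext.cc; it may change over time.
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
$ gunzip cc.en.300.bin.gz
$ sudo cp cc.en.300.bin /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/fasttext-eng.bin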