
This page describes multilingual text analysis following the Universal Dependencies (UD) guidelines. Models for more than 60 languages are available. The following two pipelines are intended for UD analysis:

| Pipeline | Description | Input | Output* |
|---------------|--------------------------------------------------------------|------------|---------|
| deepud | Full pipeline, including tokenization and sentence segmentation | plain text | CoNLL-U |
| deepud-pretok | Pipeline that starts from pretokenized text | CoNLL-U | CoNLL-U |

* Pipelines can be configured for CoNLL-03 output (see the conllDumper processing unit configuration in lima_linguisticprocessing/conf/lima-lp-ud.xml).
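To check how the dumper is currently set up, you can simply inspect that file. A minimal sketch, assuming you run it from the root of a LIMA source checkout (the path is the one mentioned above):

# Show the conllDumper configuration referenced above (adjust the path if your
# checkout is organized differently).
$ grep -n -A 10 'conllDumper' lima_linguisticprocessing/conf/lima-lp-ud.xml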

Installation of language models

Use the lima_models.py script to download and install models into the user's home directory (LIMA follows the XDG specification to install and search for its models):

$ lima_models.py -l english

To get information about the available models, use the -i switch:

$ lima_models.py -i
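To set up several languages in one go, you can call the script in a loop. A minimal sketch (the language list is purely illustrative; the names are passed exactly as in the example above):

# Download and install models for several languages (illustrative list).
$ for lang in english french spanish; do lima_models.py -l "$lang"; done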

Alternatively, you can manually download the language packages you need from the Releases section of the lima-models repository. You can use as many language packages simultaneously as you need. Install each language package with apt, e.g.:

$ sudo apt install ./lima-deep-models-english_0.1.5_all.deb
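If you downloaded several packages, you can install them all in one apt call. A minimal sketch, assuming the .deb files were saved to the current directory (file names follow the pattern shown above; versions may differ):

# Install every downloaded language package at once.
$ sudo apt install ./lima-deep-models-*.deb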

Usage

Refer to the LIMA user manual for detailed instructions. In short, when running analyzeText, select the ud "language", the deepud or deepud-pretok pipeline, and pass the three-letter ISO 639-3 code of the required language:

$ analyzeText -l ud -p deepud --meta udlang:eng your-text-file.txt

To analyze already tokenized text (a .conllu input file), use the deepud-pretok pipeline:

$ analyzeText -l ud -p deepud-pretok --meta udlang:eng your-text-file.conllu
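To process a whole directory of plain-text files with the full pipeline, here is a minimal sketch, assuming the CoNLL-U result is written to standard output (depending on your conllDumper configuration it may instead go to a file):

# Analyze every .txt file in the current directory, one .conllu file per input.
$ for f in *.txt; do analyzeText -l ud -p deepud --meta udlang:eng "$f" > "${f%.txt}.conllu"; done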

A short command-line syntax is also available; it works for all languages except English and French:

$ analyzeText -l spa -p deepud your-text-file.txt

For English and French, the short syntax requires UD to be included as part of the language code. This form works equally well for all languages:

$ analyzeText -l ud-eng -p deepud your-text-file.txt
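Because the ud-xxx form works for every language, you can loop over several models with a single syntax. A minimal sketch (the language codes and the per-language input file names, input-eng.txt and so on, are hypothetical):

# Analyze one file per language, selecting the model through the ud-xxx code.
$ for code in eng fra spa; do analyzeText -l "ud-$code" -p deepud "input-$code.txt"; done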

Processing units used

| Stage | Processing unit | deepud | deepud-pretok |
|---------------------------------------------|-----------------|--------|---------------|
| Input | cpptftokenizer | + | |
| Input | conllureader | | + |
| RNN-based PoS tagger and dependency parser | tfmorphosyntax | + | + |
| RNN-based lemmatizer | tflemmatizer | + | + |
| Output | conllDumper | + | + |

For the up-to-date definitions of these pipelines please check the corresponding configuration files: lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml.
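A minimal sketch for locating those files, assuming a LIMA source checkout (the grep simply lists the configuration files that mention the deepud pipelines):

# List UD-related configuration files and those defining the deepud pipelines.
$ ls lima_linguisticprocessing/conf/lima-lp-*.xml
$ grep -l 'deepud' lima_linguisticprocessing/conf/lima-lp-*.xml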

Performance

The current performance of LIMA for all supported languages is reported on a dedicated page. Some of these results, for English and French only, are reproduced below.

Regarding speed, it is important to note that LIMA and UDify are multithreaded and normally use all available CPU cores, whereas UDPipe uses only one thread. This difference is ignored here.

eng English-EWT

| Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
|--------|----------|--------|-----------|-------|-------|--------|--------|-------|-------|-------------|
| lima | raw | 98.85 | 85.14 | 98.85 | 94.89 | 90.81 | 94.17 | 85.15 | 82.06 | 245 |
| lima | gold-tok | 100 | 100 | 100 | 95.95 | 91.86 | 95.09 | 87.91 | 84.65 | 254 |
| udpipe | raw | 98.9 | 86.92 | 98.9 | 93.34 | 94.3 | 95.45 | 81.83 | 78.64 | 1793 |
| udpipe | gold-tok | 100 | 100 | 100 | 94.43 | 95.37 | 96.41 | 84.4 | 81.08 | 2281 |
| udify | gold-tok | 100 | 100 | 100 | 96.29 | 96.19 | 97.39 | 91.12 | 88.53 | 92 |

fra French-Sequoia

| Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
|--------|----------|--------|-----------|-------|-------|--------|--------|-------|-------|-------------|
| lima | raw | 99.69 | 84.22 | 97.94 | 96.06 | 89.28 | 94.91 | 85.09 | 82.4 | 291 |
| lima | gold-tok | 100 | 100 | 100 | 98.25 | 91.33 | 96.91 | 89.1 | 86.48 | 300 |
| udpipe | raw | 99.79 | 87.5 | 99.09 | 96.1 | 94.93 | 96.93 | 84.85 | 82.09 | 3349 |
| udpipe | gold-tok | 100 | 100 | 100 | 97.08 | 95.84 | 97.82 | 86.83 | 84.13 | 3349 |
| udify | gold-tok | 100 | 100 | 100 | 97.93 | 89.41 | 97.24 | 92.07 | 89.22 | 86 |

Limitations

  • Speed: the current speed is around 300 w/s (80-900 w/s depending on the particular language model; see the evaluation page). This may not be acceptable for everyday use. We are still working on speed improvements.
  • RAM consumption: depending on the particular language model and the size of the word embeddings file (see details below), analysis can take up to 32 GB of RAM.
  • The Typo and Abbr features and the XPOS CoNLL-U column are not generated by LIMA.

Tradeoff between word embedding size and analysis quality

The fastText models published by Facebook include word embeddings with subword information. The original binary files are about 7 GB per language, which is too large for practical use. In the lima-models repository we distribute compressed (quantized) versions of these files (about 600 MB per language). Quantization slightly degrades analysis quality. The following table reports the average difference in metrics between the original embeddings and two compressed versions.

| Metric | 1.2 GB embeddings | 0.6 GB embeddings |
|--------|-------------------|-------------------|
| UPOS | -0.01 | -0.03 |
| UAS | -0.02 | -0.05 |
| LAS | -0.03 | -0.1 |

You can manually replace the word embeddings with the original binary files (i.e. with subword information) from fastText.cc, or with the less compressed ones we have published separately. To do this, put the downloaded file in /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/ under the name fasttext-xxx.bin, where xxx is the corresponding ISO 639-3 language code (eng, fra, ...).
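As an illustration, here is a minimal sketch of replacing the English embeddings with the original fastText binary. The download URL follows the usual fastText.cc naming scheme (cc.<lang>.300.bin.gz); check fasttext.cc for the current links. The target path and file name are the ones given above:

# Replace the quantized English embeddings with the full fastText binary.
# Verify the download URL on fasttext.cc; it may change over time.
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
$ gunzip cc.en.300.bin.gz
$ sudo cp cc.en.300.bin /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/fasttext-eng.bin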