## EstNLTK's resources

As of version 1.7.0, EstNLTK contains tools for automatically downloading the resources required by taggers and other tools.
Resources are usually large (model) files that are not included in the package and can be downloaded on demand.

### In a nutshell

To download a resource manually, use the function `download`:

In [1]:
from estnltk import download
# Download models for UDPipeTagger
download('udpipetagger')

Downloading udpipe_syntax_2021-05-29: 40.9MB [00:00, 79.9MB/s]


Unpacked resource into subfolder 'udpipe_syntax/models_2021-05-29/' of the resources dir.


True

The function returns `True` if the download was successful or if the resource already exists, and `False` otherwise.

You can download a resource either by its alias (as in the previous example: `'udpipetagger'`) or by its specific version name (`'udpipe_syntax_2021-05-29'` in the previous example).
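
Because `download` reports its outcome as a boolean, a setup script can request several resources and collect the ones that failed. The helper below is a minimal sketch and not part of EstNLTK: `download_many` and its `download_fn` parameter are hypothetical names, and the only assumption is the return value described above.

```python
from typing import Callable, Iterable, List


def download_many(names: Iterable[str],
                  download_fn: Callable[[str], bool]) -> List[str]:
    """Call download_fn for every name and return the names that failed.

    download_fn is expected to behave like estnltk.download: it returns
    True if the resource was downloaded or already exists, False otherwise.
    """
    failed = []
    for name in names:
        if not download_fn(name):
            failed.append(name)
    return failed


# Intended use with EstNLTK (an alias and a specific version name):
#   from estnltk import download
#   failed = download_many(['udpipetagger', 'stanza_syntax_2021-05-29'], download)
```

An empty result means that every requested resource is available locally.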

Use `ResourceView` to get an overview of EstNLTK's resources and their download status:

In [2]:
from estnltk import ResourceView
ResourceView()

name,description,license,downloaded
collocation_net_2022-06-07,Files needed to use CollocationNet. Files include embeddings obtained from training Latent Dirichlet Allocation (LDA) models on different collocation types and example sentences. (size: 2.8G),CC BY-SA 4.0,False
estbert_from_tartunlp_hf_2022-03-10,"Model for BertTagger and BertTransformer. BertTagger outputs token or word level embeddings. BertTransformer outputs sentence and word level embeddings. The original EstBERT model was trained on Estonian National Corpus 2017 by Hasan Tanvir, Claudia Kittask, Kairit Sirts. This version of the model is from huggingface tartuNLP/EstBERT commit e97f62c. More info: https://huggingface.co/tartuNLP/EstBERT (size: 2.4G)",CC BY 4.0,False
stanza_syntax_2021-05-29,"Models for StanzaSyntaxTagger and StanzaSyntaxEnsembleTagger. StanzaSyntaxTagger operates in three modes: - syntax prediction based on sentences - syntax prediction using morphological features generated by Vabamorf - syntax prediction using extended morphological features. Models for StanzaSyntaxEnsembleTagger are trained using extended morphological features. Corresponding models were trained by Sandra Eiche on Estonian Dependency Treebank with UD syntax annotations. Dependency parsing models and the directory 'ensemble_models' must be located in the root directory defining StanzaSyntaxTagger under the subdirectory stanza_resources/et/depparse. Pretrain model under the subdirectory stanza_resources/et/pretrain. For completeness, the file also includes original Stanford models: - directories pretrain, lemma, pos, tokenize - depparse/stanza_depparse.pt (size: 1.4G)",CC BY-SA 4.0,False
maltparser_syntax_2021-05-29,Models for MaltparserTagger. MaltparserTagger operates in three modes: - syntax prediction using morphological features generated by Vabamorf - syntax prediction using extended morphological features. - syntax prediction using extended morphological features processed with VISLCG3 Pipeline. For the first two modes two types of dependency relations can be chosen from - UD or CG. Last mode can only be used for CG-type output. Corresponding models were trained by Claudia Kittask on Estonian Dependency Treebank. Dependency parsing models must be located in the root directory defining MaltparserTagger under the subdirectory java-res/maltparser. (size: 128M),CC BY-SA 4.0 + https://www.maltparser.org/license.html,False
udpipe_syntax_2021-05-29,Models for UDPipeTagger. UDPipeTagger operates in three modes: - syntax prediction using morphological features generated by Vabamorf - syntax prediction using extended morphological features. - syntax prediction using extended morphological features processed with VISLCG3 Pipeline. For all modes two types of dependency relations can be chosen from - UD or CG. Corresponding models were trained by Claudia Kittask on Estonian Dependency Treebank. (size: 39M),CC BY-SA 4.0,True
stanza_syntax_2020-11-30,"Models for StanzaSyntaxTagger. StanzaSyntaxTagger operates in three modes: - syntax prediction based on sentences - syntax prediction using morphological features generated by Vabamorf - syntax prediction using extended morphological features. Corresponding models were trained by Sandra Eiche on Estonian Dependency Treebank with UD syntax annotations. Pretrain model under the subdirectory stanza_resources/et/pretrain. For completeness, the file also includes original Stanford models: - directories pretrain, lemma, pos, tokenize - depparse/stanza_depparse.pt (size: 470M)",CC BY-SA 4.0,False
estbert_2020-10-20,"Model for BertTagger and BertTransformer. BertTagger outputs token or word level embeddings. BertTransformer outputs sentence and word level embeddings. Corresponding EstBERT model was trained on Estonian National Corpus 2017 by Hasan Tanvir, Claudia Kittask, Kairit Sirts. When appropriate cite the article: EstBERT: A Pretrained Language-Specific BERT for Estonian https://arxiv.org/abs/2011.04784 (size: 476M)",CC BY 4.0,False
neural_morph_softmax_emb_cat_sum_2019-08-23,"Model for SoftmaxEmbCatSumTagger (NeuralMorphTagger). All neural morphological disambiguation models were trained by Kermo Saarse and Kairit Sirts in June-August 2018 using High Performance Cluster in University of Tartu. Specific requirements: Python 3.7, tensorflow version < 2.0, such as 1.15.5. (size: 355M)",CC BY-SA 4.0,False
neural_morph_softmax_emb_tag_sum_2019-08-23,"Model for SoftmaxEmbTagSumTagger (NeuralMorphTagger). All neural morphological disambiguation models were trained by Kermo Saarse and Kairit Sirts in June-August 2018 using High Performance Cluster in University of Tartu. Specific requirements: Python 3.7, tensorflow version < 2.0, such as 1.15.5. (size: 354M)",CC BY-SA 4.0,False
neural_morph_seq2seq_emb_cat_sum_2019-08-23,"Model for Seq2SeqEmbCatSumTagger (NeuralMorphTagger). All neural morphological disambiguation models were trained by Kermo Saarse and Kairit Sirts in June-August 2018 using High Performance Cluster in University of Tartu. Specific requirements: Python 3.7, tensorflow version < 2.0, such as 1.15.5. (size: 366M)",CC BY-SA 4.0,False


The information in the `ResourceView` table is based on the resources index JSON file: https://github.com/estnltk/estnltk_resources.
The index file gives detailed information about each resource, such as its description, size, URL, license, and unpacking path relative to the resources directory.

**Downloading all resources.** By default, only one version of the resource (the latest one) is downloaded, even if multiple resources are available.
However, if you set `only_latest=False`, all resources with the given alias are downloaded:

In [3]:
# Download all word2vec skip-gram models (alias: 'word2vec_sg')
download('word2vec_sg', only_latest=False)

Downloading word2vec_lemmas_sg_s100_2015-06-21: 169MB [00:01, 86.0MB/s] 
Downloading word2vec_lemmas_sg_s200_2015-06-21: 333MB [00:04, 76.7MB/s] 
Downloading word2vec_words_sg_s100_2015-06-21: 313MB [00:03, 88.7MB/s] 
Downloading word2vec_words_sg_s200_2015-06-21: 616MB [00:09, 68.4MB/s] 


True

Use `ResourceView` to see which resources have been downloaded:

In [4]:
# Browse only downloaded resources
ResourceView(downloaded=True)

name,description,license,downloaded
udpipe_syntax_2021-05-29,Models for UDPipeTagger. UDPipeTagger operates in three modes: - syntax prediction using morphological features generated by Vabamorf - syntax prediction using extended morphological features. - syntax prediction using extended morphological features processed with VISLCG3 Pipeline. For all modes two types of dependency relations can be chosen from - UD or CG. Corresponding models were trained by Claudia Kittask on Estonian Dependency Treebank. (size: 39M),CC BY-SA 4.0,True
word2vec_lemmas_sg_s100_2015-06-21,word2vec lemma-based embeddings model created by Alexander Tkachenko. More info: https://github.com/estnltk/word2vec-models (size: 174M),CC BY-SA 4.0,True
word2vec_lemmas_sg_s200_2015-06-21,word2vec lemma-based embeddings model created by Alexander Tkachenko. More info: https://github.com/estnltk/word2vec-models (size: 342M),CC BY-SA 4.0,True
word2vec_words_sg_s100_2015-06-21,word2vec word-based embeddings model created by Alexander Tkachenko. More info: https://github.com/estnltk/word2vec-models (size: 322M),CC BY-SA 4.0,True
word2vec_words_sg_s200_2015-06-21,word2vec word-based embeddings model created by Alexander Tkachenko. More info: https://github.com/estnltk/word2vec-models (size: 633M),CC BY-SA 4.0,True


### Where to find downloaded resources?

EstNLTK provides the function `get_resource_paths`, which returns a list of paths to all downloaded resources associated with the given name or alias:

In [5]:
from estnltk import get_resource_paths
# Get paths to downloaded UDPipeTagger's models
get_resource_paths('udpipetagger')

['C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\udpipe_syntax\\models_2021-05-29\\']

In [6]:
# Get paths to downloaded word2vec skip-gram models
get_resource_paths('word2vec_sg')

['C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\lemmas.sg.s100.w2v.bin',
 'C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\lemmas.sg.s200.w2v.bin',
 'C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\words.sg.s100.w2v.bin',
 'C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\words.sg.s200.w2v.bin']

The function returns an empty list if the resource has not been downloaded yet (or if there is no such resource):

In [7]:
# Get paths to downloaded stanzatagger's models
get_resource_paths('stanzatagger')

[]

If there are multiple versions of the resource, the versions are _sorted by resource date_: the latest resources come first in the list.
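
The ordering can be illustrated with a small sketch. EstNLTK's own sorting code is not shown in this tutorial, so the functions below (`resource_date`, `latest_first`) are hypothetical, and they assume that the resource date is the `YYYY-MM-DD` suffix of the resource name, as in the names listed above.

```python
import re
from datetime import date
from typing import List, Optional

# Matches a trailing date such as '2021-05-29' at the end of a resource name
_DATE_SUFFIX = re.compile(r'(\d{4})-(\d{2})-(\d{2})$')


def resource_date(name: str) -> Optional[date]:
    """Return the date encoded at the end of a resource name, or None."""
    match = _DATE_SUFFIX.search(name)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits are present, but they do not form a valid calendar date
        return None


def latest_first(names: List[str]) -> List[str]:
    """Sort names so that the most recent resource date comes first.

    Names without a date go last; the sort is stable, so names that share
    a date keep their original relative order.
    """
    return sorted(names,
                  key=lambda n: resource_date(n) or date.min,
                  reverse=True)
```

With this sketch, `latest_first(['stanza_syntax_2020-11-30', 'stanza_syntax_2021-05-29'])` puts the 2021 models first.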

You can request only a single resource (the latest resource) by setting `only_latest=True`:

In [8]:
get_resource_paths('word2vec_sg', only_latest=True)

'C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\lemmas.sg.s100.w2v.bin'

Note that this returns a string instead of a list.
If the requested resource is missing, `None` is returned.
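
Since the return type depends on `only_latest` (a list, a string, or `None`), calling code may want to normalize the result before iterating over it. The following is a minimal sketch; `as_path_list` is a hypothetical helper, not an EstNLTK function.

```python
from typing import List, Sequence, Union


def as_path_list(result: Union[None, str, Sequence[str]]) -> List[str]:
    """Turn any get_resource_paths-style result into a list of paths."""
    if result is None:
        # only_latest=True and the resource is missing
        return []
    if isinstance(result, str):
        # only_latest=True and the resource was found
        return [result]
    # only_latest=False: already a list (possibly empty)
    return list(result)


# Intended use with EstNLTK:
#   from estnltk import get_resource_paths
#   for path in as_path_list(get_resource_paths('word2vec_sg', only_latest=True)):
#       print(path)
```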

### Where is the resources directory and how to change it?

By default, EstNLTK attempts to download resources into the subdirectory `estnltk_resources` inside the installation directory of the `estnltk` package.
If that fails (e.g. due to insufficient permissions), EstNLTK creates the subdirectory `estnltk_resources` in the [user's home directory](https://docs.python.org/3/library/pathlib.html#pathlib.Path.home) and stores resources there.
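
The fallback order can be sketched as follows. This is an illustration of the behaviour described above, not EstNLTK's actual code: the function `pick_resources_dir` is hypothetical, and treating any `OSError` during directory creation as "that fails" is an assumption.

```python
import os
from pathlib import Path
from typing import Union


def pick_resources_dir(package_dir: Union[str, Path],
                       home_dir: Union[str, Path]) -> Path:
    """Return the first writable 'estnltk_resources' directory.

    The package installation directory is tried first, then the user's
    home directory. A location is skipped if the subdirectory cannot be
    created there or is not writable.
    """
    for base in (package_dir, home_dir):
        candidate = Path(base) / 'estnltk_resources'
        try:
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            # E.g. insufficient permissions: fall through to the next base
            continue
        if os.access(candidate, os.W_OK):
            return candidate
    raise PermissionError('No writable location for estnltk_resources')


# Example call with the home directory as the fallback location
# ('/path/to/site-packages/estnltk' is a placeholder):
#   pick_resources_dir('/path/to/site-packages/estnltk', Path.home())
```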

If you want to force your own resources location, set the environment variable `ESTNLTK_RESOURCES` to the full path of the new resources directory.
Note that this must be an existing directory where writing is permitted.
Naturally, the environment variable should be set _before_ downloading any resources.
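
From within Python, the variable can be set via `os.environ` before the first download. The helper below is a sketch: `use_resources_dir` is a hypothetical name, and it assumes that EstNLTK reads the variable from the environment of the running process. If in doubt, set the variable in the shell before starting Python.

```python
import os
from pathlib import Path
from typing import Union


def use_resources_dir(path: Union[str, Path]) -> Path:
    """Create the directory if needed and point ESTNLTK_RESOURCES at it."""
    resources_dir = Path(path).expanduser().resolve()
    # The directory must exist and be writable before it is used
    resources_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(resources_dir, os.W_OK):
        raise PermissionError(f'Directory is not writable: {resources_dir}')
    # The variable must hold the full (absolute) path
    os.environ['ESTNLTK_RESOURCES'] = str(resources_dir)
    return resources_dir


# Intended use, before any resource is downloaded
# ('~/my_estnltk_resources' is a placeholder):
#   use_resources_dir('~/my_estnltk_resources')
#   from estnltk import download
#   download('udpipetagger')
```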

### Removing resources

Use the function `delete_resource` to remove a downloaded resource:

In [9]:
from estnltk.resource_utils import delete_resource
delete_resource('word2vec_words_sg_s100_2015-06-21')

True

The function returns `True` if the deletion was successful.
Note that resources can be deleted only by their specific names, not by their aliases.
For example, `delete_resource('word2vec_sg')` would not have worked in the previous example.
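
A script that deletes resources can guard against accidentally passing an alias. The sketch below relies on a heuristic, not on an EstNLTK rule: every specific name shown in this tutorial ends with `_YYYY-MM-DD`, while the aliases do not. Both function names are hypothetical.

```python
import re
from typing import Callable

# Heuristic: specific names in this tutorial end with '_YYYY-MM-DD'
_VERSIONED_NAME = re.compile(r'.+_\d{4}-\d{2}-\d{2}')


def looks_like_specific_name(name: str) -> bool:
    """Return True if the name ends with an underscore and a date."""
    return _VERSIONED_NAME.fullmatch(name) is not None


def delete_by_specific_name(name: str,
                            delete_fn: Callable[[str], bool]) -> bool:
    """Call delete_fn only for names that look like specific versions."""
    if not looks_like_specific_name(name):
        raise ValueError(f'{name!r} looks like an alias, not a specific name')
    return delete_fn(name)


# Intended use with EstNLTK:
#   from estnltk.resource_utils import delete_resource
#   delete_by_specific_name('word2vec_words_sg_s100_2015-06-21', delete_resource)
```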

### Integrating automatic resource downloading (for developers)

If you are creating a tagger that needs external, downloadable resources, you can use the function `get_resource_paths` with the automatic download option.

Namely, if you set `download_missing=True` and the requested resource has not been downloaded yet, the user is prompted for permission to download the missing resource.
If the user gives permission, the resource is downloaded automatically and its path is returned as the result:

In [10]:
from estnltk.downloader import get_resource_paths
get_resource_paths('word2vec_words_sg_s100_2015-06-21', only_latest=True, download_missing=True)

This requires downloading resource 'word2vec_words_sg_s100_2015-06-21' (size: 322M). Proceed with downloading? [Y/n] Y


Downloading word2vec_words_sg_s100_2015-06-21: 313MB [00:04, 72.9MB/s] 


'C:\\Programmid\\Miniconda3\\envs\\py39_devel\\lib\\site-packages\\estnltk-1.7.0-py3.9-win-amd64.egg\\estnltk\\estnltk_resources\\word2vec\\embeddings_2015-06-21\\words.sg.s100.w2v.bin'

So, you can use `get_resource_paths` in the constructor of a tagger to get the path to a required resource regardless of its download state: if the resource is missing, it will be downloaded automatically (if the user permits it).
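
A constructor following this pattern might look like the sketch below. The class `ExampleEmbeddingTagger` and its `path_resolver` parameter are hypothetical. A real tagger would also subclass EstNLTK's tagger base class and define its layers, which is omitted here. The sketch further assumes that `None` is returned when the user declines the download.

```python
from typing import Callable, Optional


class ExampleEmbeddingTagger:
    """Hypothetical tagger skeleton that resolves its model in __init__."""

    # Specific resource name used in the example above
    RESOURCE = 'word2vec_words_sg_s100_2015-06-21'

    def __init__(self,
                 model_path: Optional[str] = None,
                 path_resolver: Optional[Callable[..., Optional[str]]] = None):
        if model_path is None:
            if path_resolver is None:
                # Imported lazily, so the class can be defined without EstNLTK
                from estnltk.downloader import get_resource_paths
                path_resolver = get_resource_paths
            # Asks the user for permission if the resource is missing
            model_path = path_resolver(self.RESOURCE,
                                       only_latest=True,
                                       download_missing=True)
        if model_path is None:
            raise FileNotFoundError(
                f'Resource {self.RESOURCE!r} is not available')
        self.model_path = model_path
```

An explicit `model_path` skips the lookup entirely, which keeps the tagger usable with models stored outside the resources directory.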