
# Importing Text objects from large Estonian corpora

EstNLTK contains modules for importing texts from the [Estonian Reference Corpus](http://www.cl.ut.ee/korpused/segakorpus/) ( _Eesti keele koondkorpus_ ), the [Estonian Web 2013 corpus](https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f) (aka _etTenTen 2013_ ), and the Estonian National Corpus [2017](https://metashare.ut.ee/repository/browse/estonian-national-corpus-2017/b616ceda30ce11e8a6e4005056b40024880158b577154c01bd3d3fcfc9b762b3/) and [2019](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2019-vrt-vormingus/be71121e733b11eaa6e4005056b4002483e6e5cdf35343e595e6ba4576d839fb/) ( _Eesti keele ühendkorpus 2017 & 2019_ ). 
In addition, it is possible to load annotated texts from [Estonian Universal Dependencies (UD) treebank files](https://github.com/UniversalDependencies/UD_Estonian-EDT).
The modules let you import corpus content as EstNLTK Text objects and, to an extent, preserve the original annotations of the corpus (e.g. paragraph and sentence boundary annotations).
The sections below introduce these modules in detail.

---

## A. Importing texts of the Estonian Reference Corpus <br/> ( _Eesti keele koondkorpus_ )

### Getting the files

The page [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) lists all the subcorpora of the Estonian Reference Corpus. From there, you can follow the links and download the (zipped) XML files of the subcorpora. 

Once you have downloaded a zipped XML corpus and unzipped it, you should see a folder structure similar to this:


        ├── Kroonika
        │   ├── bin
        │   │   ├── koondkorpus_main_header.xml
        │   │   └── tei_corpus.rng
        │   └── Kroon
        │       ├── bin
        │       │   └── header_aja_kroonika.xml
        │       └── kroonika
        │           ├── kroonika_2000
        │           │   ├── aja_kr_2000_12_08.xml
        │           │   ├── aja_kr_2000_12_15.xml
        │           │   ├── aja_kr_2000_12_22.xml
        │           │   └── aja_kr_2000_12_29.xml
        │           ├── kroonika_2001
        │           │   ├── aja_kr_2001_01_05.xml
        │           │   ├── aja_kr_2001_01_12.xml
        │           │   ├── aja_kr_2001_01_19.xml
        │           │   ├── aja_kr_2001_01_22.xml
        ...         ...     ...
        
        
<center>Example. Folder structure in _Kroonika.zip_</center>

Folders `'bin'` contain headers and corpus descriptions. The `'.xml'` files outside the `'bin'` folders are the files with the actual textual content. These files can be loaded with EstNLTK.

### Importing Texts from a single XML file

You can use function **`parse_tei_corpus`** to create Text objects based on a single XML TEI file. 
An example:

```python
from estnltk.corpus_processing.parse_koondkorpus import get_div_target
from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpus

# input file (must be with a full path)
input_xml_file = "C:\\Kroonika\\Kroon\\kroonika\\kroonika_2000\\aja_kr_2000_12_08.xml"

# find out which subsection of the XML file forms a single document
target = get_div_target( input_xml_file )   

# import documents as Text objects   
for text_obj in parse_tei_corpus( input_xml_file, target=[target] ):
    # TODO: do something with the Text object
    ...
```

Note that before using the function, you should decide which subsections of the XML file to treat as single documents (the argument `target`). If you have no clear preference, you can use the function `get_div_target` to get a reasonable default for each subcorpus of the Reference Corpus.

**_Layers._** By default, obtained Text objects do not have any annotation layers. See the section "Details of the parse_koondkorpus module" on how to add layers.

**_Metadata._** Obtained Text objects also have metadata, stored in the dictionary `text_obj.meta`.
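
The metadata keys can differ between subcorpora, so it may help to survey them before relying on any particular key. The following is a minimal sketch, not part of EstNLTK: `count_meta_keys` is a hypothetical helper that only assumes each object has a dictionary attribute `meta`, and the demo objects and their keys are made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def count_meta_keys(texts):
    """Count how often each metadata key occurs across Text-like objects."""
    counts = Counter()
    for text_obj in texts:
        counts.update(text_obj.meta.keys())
    return counts


# Stand-in objects with made-up metadata; real keys depend on the subcorpus.
demo = [
    SimpleNamespace(meta={'title': 'A', 'type': 'artikkel'}),
    SimpleNamespace(meta={'title': 'B'}),
]
key_counts = count_meta_keys(demo)
print(key_counts)
```

With real data, pass the objects produced by `parse_tei_corpus` instead of `demo`.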

### Importing Texts from a directory of XML files

You can use function **`parse_tei_corpora`** to create Text objects from all (or selected) XML TEI files in a directory and in all of its subdirectories. 
An example:

```python
from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpora

# input directory (must be with a full path)
input_xml_path = 'C:\\Kroonika\\Kroon\\kroonika'

# import documents as Text objects   
for text_obj in parse_tei_corpora( input_xml_path, target=['artikkel'] ):
    # TODO: do something with the Text object
    ...
```

Note that function `parse_tei_corpora` will traverse recursively all subdirectories of the given directory (`input_xml_path`), and will also extract Text objects from XML TEI files in the subdirectories. In addition, you can use arguments `prefix` and `suffix` to further specify which file names are suitable for extraction, e.g. `prefix="aja_kr_2000_"` will tell the function to extract Texts only from files starting with the string `'aja_kr_2000_'`.
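
To make the effect of `prefix` and `suffix` concrete, here is a small standard-library sketch of recursive file selection. It is an illustration of the idea only, not EstNLTK's implementation: `select_files` is a hypothetical helper, and the library may apply additional filtering (for instance, of header files) that is not modelled here.

```python
import os
import tempfile


def select_files(root, prefix='', suffix='.xml'):
    """Return sorted paths under root whose file names match prefix and suffix."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname.startswith(prefix) and fname.endswith(suffix):
                selected.append(os.path.join(dirpath, fname))
    return sorted(selected)


# Demo on a throwaway directory tree that mimics the Kroonika layout.
with tempfile.TemporaryDirectory() as root:
    for folder, fname in [('kroonika_2000', 'aja_kr_2000_12_08.xml'),
                          ('kroonika_2001', 'aja_kr_2001_01_05.xml'),
                          ('bin', 'tei_corpus.rng')]:
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        open(os.path.join(root, folder, fname), 'w').close()
    selected_names = [os.path.basename(path)
                      for path in select_files(root, prefix='aja_kr_2000_')]
    print(selected_names)
```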

### Details of the parse_koondkorpus module

#### Reconstruction of source text

As the original plain texts for _koondkorpus_ XML files are not available, functions `parse_tei_corpus` and `parse_tei_corpora` will reconstruct the source texts by themselves. 
By default, this reconstruction follows the original XML mark-up: paragraphs (texts between `<p>` and `</p>` tags in the original XML) will be separated by double newlines, sentences (texts between `<s>` and `</s>` tags in the XML) will be separated by newlines, and words will be separated by spaces in the constructed texts.
Optionally, you may also add tokenization layers to created `Text` objects: either by strictly following the original XML tokenization mark-up, or by adding tokenization with EstNLTK's tools.
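
The default reconstruction scheme can be sketched in plain Python. This is only an illustration under the assumption that a document is available as nested lists (paragraphs containing sentences containing words); `reconstruct_text` is a hypothetical helper, not the function EstNLTK uses internally.

```python
def reconstruct_text(paragraphs, sentence_separator='\n', paragraph_separator='\n\n'):
    """Join words with spaces, and sentences and paragraphs with the given separators."""
    return paragraph_separator.join(
        sentence_separator.join(' '.join(words) for words in sentences)
        for sentences in paragraphs
    )


# Two paragraphs: the first has two sentences, the second has one.
doc = [
    [['Tere', '!'], ['Kuidas', 'läheb', '?']],
    [['Hästi', '.']],
]
text = reconstruct_text(doc)
print(repr(text))
```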

#### Signatures of the parsing functions

  * **`parse_tei_corpus`**`(path, target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n', orig_tokenization_layer_name_prefix='')` -- reads and parses a single XML file (given with the full `path`), creates `Text` objects storing documents and metadata from the file, and returns a list of created `Text` objects;


  * **`parse_tei_corpora`**`(root, prefix='', suffix='.xml', target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n', orig_tokenization_layer_name_prefix='')` -- reads recursively all the files from the directory `root`, selects files with the given `prefix` and `suffix` for XML parsing, and creates `Text` objects storing documents and metadata from the files. Returns a list of created `Text` objects;

#### Arguments

Exact behaviour of the functions can be modified by the following common arguments:

* **`target`** -- specifies the list of types of divs, from which the textual content is to be extracted. For instance, in case of newspaper articles, the content of an article is typically between `<div3 type="artikkel">` and `</div3>`, so, you should use `target=['artikkel']`. In case of fiction writings, the content of a single work is typically between `<div1 type="tervikteos">` and `</div1>`, and you may want to use `target=['tervikteos']`.

  Which values should be used for `target` depends on the subcorpus and the goal of analysis. For example, you may want to divide a fiction text into chapters, instead of analysing it as a whole. In such a case, you should manually look up the correct type values (for chapters) from the concrete XML file.
  
  If you do not have very specific goals, you can use the function **`get_div_target()`**, which provides a reasonable default div type for the given XML file, based on the hard-coded values. Example:

```python
from estnltk.corpus_processing.parse_koondkorpus import get_div_target
from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpus

xml_file = 'C:\\Eesti_ilukirjandus\\ilukirjandus\\Eesti_ilukirjandus_1990\\'+\
           'ilu_ahasveerus.tasak.xml'
target = get_div_target( xml_file )    # note: returns a single value, not list
docs = parse_tei_corpus( xml_file, target=[target] )
```
   Note: the function `get_div_target()` needs the name of the XML file with its full path, because it uses information from directory names to determine the div type.


* **`add_tokenization`** -- specifies whether the tokenization layers (`'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'`, and `'paragraphs'`) should be added to newly created Text objects (default: False). Note that if `preserve_tokenization==False`, then the tokenization layers are added with EstNLTK's tools, otherwise the original layers from the XML mark-up are preserved;


* **`preserve_tokenization`** -- specifies if the original word, sentence and paragraph tokenization from the XML mark-up should be preserved (default: False). This only works if `add_tokenization` is switched on. Then `Text` objects are created with layers `'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'` and `'paragraphs'`, which preserve the original tokenization. This means that paragraphs are taken from between `<p>` and `</p>` tags, sentences from between `<s>` and `</s>` tags, and words are taken as space-separated tokens inside the sentences. The layer `'compound_tokens'` will always remain empty (because there is no information about token compounding in the XML mark-up), and the layer `'tokens'` will be equal to the layer `'words'`;

     _Note_: Creating tokenization layers typically takes more processing time than loading `Text` objects without layers. If you do not change parameters `sentence_separator` and `paragraph_separator`, then reconstructed texts also preserve the hints about tokenization, and you can restore the original tokenization afterwards. See the remark **C** below for details;
     

* **`record_xml_filename`** -- specifies if the name of the XML file should be recorded in the metadata of the created `Text` objects, under the key `'_xml_file'` (default: False);


* **`encoding`** -- encoding of the input file (or input files). Normally, you should go with the default value ('utf-8').


* **`sentence_separator`** -- string used as sentence separator during the reconstruction of the text (default: `'\n'`).


* **`paragraph_separator`** -- string used as paragraph separator during the reconstruction of the text (default: `'\n\n'`).


* **`orig_tokenization_layer_name_prefix`** -- string used as a prefix in names of layers of original tokenization. You can use this argument to make names of original tokenization layers distinguishable from EstNLTK's tokenization layers. (default: `''`)
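
As a sketch of how the arguments above combine, the following hypothetical wrapper loads one file while keeping the XML tokenization in separately named layers. The wrapper name, the options dictionary and the chosen prefix `'original_'` are assumptions made for illustration; only the argument names come from the signatures above. EstNLTK is imported inside the function, so it is needed only when the function is actually called.

```python
# Keyword arguments taken from the signature of parse_tei_corpus above;
# the values are one possible choice, not a recommendation.
ORIGINAL_TOKENIZATION_OPTIONS = dict(
    add_tokenization=True,        # create tokenization layers
    preserve_tokenization=True,   # ... based on the XML mark-up
    record_xml_filename=True,     # remember the source file in the metadata
    orig_tokenization_layer_name_prefix='original_',
)


def load_with_original_tokenization(xml_file):
    """Load Texts from one koondkorpus XML file, keeping the XML tokenization."""
    from estnltk.corpus_processing.parse_koondkorpus import (
        get_div_target, parse_tei_corpus)
    target = get_div_target(xml_file)   # single value, so wrap it in a list
    return parse_tei_corpus(xml_file, target=[target],
                            **ORIGINAL_TOKENIZATION_OPTIONS)
```

The prefix is meant to keep the names of the preserved layers distinguishable from layers added later with EstNLTK's own tokenizers.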



#### What to keep in mind when using the loading functions

   * **A.** Functions `parse_tei_corpus` and `parse_tei_corpora` provide a simple and general solution to loading texts from XML TEI files. The textual content (and also metadata, to an extent) can be loaded from any subcorpus of the Estonian Reference Corpus. However, this generality has a cost: typically, corpus-specific markings will not be loaded. For instance, the corpus of Estonian parliamentary transcripts also contains special markings for speaker names, but this information is not loaded. And the corpus of Internet forums also records time and user name for each forum message, but only the message itself (the textual content) is loaded.
 
   If you need to "get more out of" the XML TEI files, you'll need to create your own loading functions. You can follow the example of the functions in the module `estnltk.corpus_processing.parse_koondkorpus`, and use the library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to do the job.


   * **B.** _Tokenization_: adding EstNLTK's tokenization vs preserving the original tokenization. 
   
      1. If you use EstNLTK's tokenization instead of the original one, you get annotations of multiword tokens (`'compound_tokens'`), and this also helps to get more accurate sentence annotations. For instance, the original markup occasionally contains sentence endings inside multiword names, such as _'Ju. M. Lotman'_, and date expressions, such as _'24 . 10. 1921'_, but EstNLTK's default tokenization marks such expressions as `'compound_tokens'` and thus cancels sentence endings inside these expressions. If you use EstNLTK's tokenization, it is advisable also to set `sentence_separator=' '`, so that reconstructed text will not contain newlines inside compound tokens;
    
      2. If you preserve the original tokenization, you can get better alignment with other linguistic annotations laid on the loaded corpus, such as [Dependency Treebank annotations](https://github.com/EstSyntax/EDT), or [TimeML annotations](https://github.com/soras/EstTimeMLCorpus). Data also gets loaded faster with the original tokenization annotations. The downsides are: there will be no `'compound_tokens'` marked in loaded texts, and there may be more sentence tokenization errors;
   
   
   * **C.** _Loading texts without tokenization, and then restoring the original tokenization later._ If you use parameters `sentence_separator` and `paragraph_separator` with their default values, you can restore the original tokenization after loading the texts. For this:
   
      1. Use a `CompoundTokenTagger` with initialization parameter `do_not_join_on_strings=['\n', '\n\n']` to ensure that compound tokens will not cross sentence and paragraph endings;
      2. Use a `SentenceTokenizer` with the `base_sentence_tokenizer` set to NLTK's `LineTokenizer()`, so that texts will be split into sentences following newlines;
      3. Use the default paragraph tokenizer to create paragraphs according to the original paragraph tokenization;
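
The reason remark **C** works is that the default separators encode the original boundaries directly in the reconstructed string. The plain-Python sketch below shows this on a small made-up text; it does not use EstNLTK's taggers, and `split_reconstructed` is a hypothetical helper introduced only for illustration.

```python
def split_reconstructed(text, sentence_separator='\n', paragraph_separator='\n\n'):
    """Recover paragraphs -> sentences -> words from a reconstructed text."""
    return [
        [sentence.split(' ') for sentence in paragraph.split(sentence_separator)]
        for paragraph in text.split(paragraph_separator)
    ]


# A text reconstructed with the default separators.
reconstructed = 'Tere !\nKuidas läheb ?\n\nHästi .'
structure = split_reconstructed(reconstructed)
print(structure)
```

If the separators are changed (for example to `sentence_separator=' '`), sentence boundaries can no longer be told apart from word boundaries, and the original tokenization cannot be recovered this way.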

### Processing whole Estonian Reference Corpus with EstNLTK

If you need to process the whole Estonian Reference Corpus with EstNLTK, you can use the command-line scripts in [ **`https://github.com/estnltk/estnltk-workflows`**](https://github.com/estnltk/estnltk-workflows). 
The workflow for processing the Estonian Reference Corpus with EstNLTK and saving the results as JSON files is located at: 
[https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json](https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json). 
For details about processing, check out the `readme.md` file in the folder.

---

## B. Importing texts of the Estonian Web 2013 corpus <br/> ( _etTenTen 2013 korpus_ )

### Getting the corpus

You can download the etTenTen 2013 corpus from here [https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f](https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f) (or [https://doi.org/10.15155/1-00-0000-0000-0000-0011fl](https://doi.org/10.15155/1-00-0000-0000-0000-0011fl )). After unpacking the content, you should get one large file with the extension `vert` or `prevert` (e.g. `ettenten13.processed.prevert`).


### Importing Texts

You can use function **`parse_ettenten_corpus_file_iterator`** to iterate over the corpus file and get Text objects yielded one-by-one.
An example:

```python
from estnltk.corpus_processing.parse_ettenten import parse_ettenten_corpus_file_iterator

# input file
input_file = "ettenten13.processed.prevert"

# iterate over corpus and extract Text objects one-by-one
for text_obj in parse_ettenten_corpus_file_iterator( input_file ):
    # TODO: do something with the Text object
    ...
```

**_Layers._** By default, created Text objects have only one annotation layer: `'original_paragraphs'`. The layer contains the original paragraph markings (between `<p>` and `</p>` tags) from the input file. Unlike EstNLTK's `'paragraphs'` layer, which envelops the `'sentences'` layer, `'original_paragraphs'` is a stand-alone layer of the Text object (because sentences are not annotated in the input file).

**_Metadata._** Obtained Text objects also have metadata, stored in the dictionary `text_obj.meta`.

### Details of the parse_ettenten module

#### Reconstruction of source text

The etTenTen 2013 corpus contains documents crawled from the web. 
Most HTML (and other XML) tags have been removed from the documents, although a few tags remain here and there.
Textual content of the web pages has been preserved along with paragraph markings, and this information can be used to reconstruct the text.

The function `parse_ettenten_corpus_file_iterator` reconstructs texts in a straightforward manner: all paragraphs (texts between `<p>` and `</p>` tags in the original file) will be concatenated by a double newline to form the text. If there are any other XML/HTML tags between `<p>` and `</p>` tags, these tags will also be included in the textual content.
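
The following sketch illustrates this kind of reconstruction and how paragraph positions in the joined text can be tracked, which is the sort of information a stand-alone paragraph layer needs. It is an assumption-laden illustration, not EstNLTK's code: `join_paragraphs` and its `discard_empty` flag are hypothetical, the latter loosely mirroring the idea behind the `discard_empty_paragraphs` argument.

```python
def join_paragraphs(paragraphs, paragraph_separator='\n\n', discard_empty=True):
    """Join paragraph strings; return the text and (start, end) span of each paragraph."""
    if discard_empty:
        paragraphs = [p for p in paragraphs if p.strip()]
    spans, pieces, offset = [], [], 0
    for i, paragraph in enumerate(paragraphs):
        if i > 0:
            # The separator goes between paragraphs, never before the first one.
            pieces.append(paragraph_separator)
            offset += len(paragraph_separator)
        spans.append((offset, offset + len(paragraph)))
        pieces.append(paragraph)
        offset += len(paragraph)
    return ''.join(pieces), spans


# Leftover tags inside a paragraph stay part of the text.
text, spans = join_paragraphs(['Esimene lõik.', '', 'Teine <b>lõik</b>.'])
for start, end in spans:
    print((start, end), repr(text[start:end]))
```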

#### Signature of the parsing function

  * **`parse_ettenten_corpus_file_iterator`**`(in_file, encoding='utf-8', focus_doc_ids=None, add_tokenization=False, discard_empty_paragraphs=True, store_paragraph_attributes=False, paragraph_separator='\n\n')` -- reads and parses an etTenTen corpus file (`in_file`). In the process, creates `Text` objects storing documents and metadata, and yields the created `Text` objects one-by-one;

#### Arguments

Exact behaviour of the function can be modified by the following arguments:

* **`encoding`** -- encoding of the input file. Normally, you should go with the default value ('utf-8').


* **`focus_doc_ids`** -- set of document id-s corresponding to the documents which need to be extracted from the `in_file`. **Important:** this should be a set of strings, not a set of integers. If provided, then only documents with given id-s will be       processed, and all other documents will be skipped. If the value is `None` or empty set, then all documents in the file will be processed. (default: `None`).


* **`add_tokenization`** -- Specifies if (full) tokenization will be added to reconstructed texts (default: False). If `add_tokenization==False`, then there will be only one tokenization layer: `'original_paragraphs'`, which will be collected from XML annotations and added as a stand-alone annotation layer of a Text object. If `add_tokenization==True`, then there will be layers `'tokens'`, `'compound_tokens'`, `'words'`, and `'sentences'` (created by EstNLTK's default tokenizers) and layer `'paragraphs'` will be collected from XML annotations and enveloped around EstNLTK's `'sentences'` layer; 


* **`discard_empty_paragraphs`** -- boolean specifying if empty paragraphs (paragraphs without textual content) should be discarded (default: True).



* **`store_paragraph_attributes`** -- boolean specifying if attributes in the paragraph's XML tag will be collected and added as attributes of the corresponding layer (`'original_paragraphs'` or `'paragraphs'`) in a Text object. (default: False).


* **`paragraph_separator`** -- string used as paragraph separator during the reconstruction of the text (default: `'\n\n'`).
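
Because `focus_doc_ids` must contain strings, ids that come from elsewhere as integers need converting first. The sketch below shows one way to do that; `as_doc_id_set` and `iterate_selected_documents` are hypothetical helpers, the example ids are made up, and EstNLTK is imported only when the generator is actually consumed.

```python
def as_doc_id_set(ids):
    """Convert document ids (ints or strings) into a set of strings."""
    return {str(doc_id) for doc_id in ids}


def iterate_selected_documents(in_file, ids):
    """Yield Texts only for the selected documents of an etTenTen file."""
    from estnltk.corpus_processing.parse_ettenten import (
        parse_ettenten_corpus_file_iterator)
    yield from parse_ettenten_corpus_file_iterator(
        in_file,
        focus_doc_ids=as_doc_id_set(ids),
        store_paragraph_attributes=True,   # keep attributes of the <p> tags
    )


# Made-up ids, mixing integers and strings.
focus = as_doc_id_set([12, 345, '678'])
print(sorted(focus))
```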

### Processing whole etTenTen with EstNLTK

If you need to process the whole etTenTen with EstNLTK, you can use the command-line scripts in [ **`https://github.com/estnltk/estnltk-workflows`**](https://github.com/estnltk/estnltk-workflows). 
The workflow for processing etTenTen with EstNLTK and saving the results as JSON files is located at: 
[https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json](https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json). 
For details about processing, check out the `readme.md` file in the folder.

---

## C. Importing texts of the Estonian National Corpus 2017 and 2019<br/> ( _Eesti keele ühendkorpus 2017 & 2019_ )

### Getting the corpus
####  ENC 2017

You can download the Estonian National Corpus 2017 from here [https://metashare.ut.ee/repository/browse/estonian-national-corpus-2017/b616ceda30ce11e8a6e4005056b40024880158b577154c01bd3d3fcfc9b762b3/](https://metashare.ut.ee/repository/browse/estonian-national-corpus-2017/b616ceda30ce11e8a6e4005056b40024880158b577154c01bd3d3fcfc9b762b3/) (or [https://doi.org/10.15155/3-00-0000-0000-0000-071E7L](https://doi.org/10.15155/3-00-0000-0000-0000-071E7L)). 
There are 25 `.xz` compressed files to download.
After unpacking the files, you should get the following large corpus files:

    estonian_nc17.vert.01
    estonian_nc17.vert.02
    estonian_nc17.vert.03
    ...
    estonian_nc17.vert.24
    estonian_nc17.vert.25

The corpus has 4 subcorpora: Estonian Reference Corpus (`NC`), Estonian Web 2013 (`web13`), Estonian Web 2017 (`web17`) and Estonian Wikipedia 2017 (`wiki17`). The following table shows how subcorpora have been distributed into files:


| NC   | web13 | web17 | wiki17 |
|------|-------|-------|--------|
| estonian_nc17.vert.01 – estonian_nc17.vert.04 | estonian_nc17.vert.05 – estonian_nc17.vert.10 | estonian_nc17.vert.10 – estonian_nc17.vert.25 |  estonian_nc17.vert.04 –  estonian_nc17.vert.05|
| (4 files) | (6 files) | (16 files) | (2 files) |
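
If you only need one subcorpus, the table can be turned into a small lookup so that only the relevant files are opened. This is a convenience sketch: the names `ENC2017_FILE_RANGES` and `enc2017_files` are hypothetical, and the ranges are copied from the table above (a boundary file, such as `.04`, `.05` or `.10`, is shared by two subcorpora).

```python
# Inclusive file-number ranges per subcorpus, copied from the table above.
ENC2017_FILE_RANGES = {
    'NC': (1, 4),
    'web13': (5, 10),
    'web17': (10, 25),
    'wiki17': (4, 5),
}


def enc2017_files(subcorpus):
    """List the .vert file names that contain documents of the given subcorpus."""
    first, last = ENC2017_FILE_RANGES[subcorpus]
    return ['estonian_nc17.vert.{:02d}'.format(n) for n in range(first, last + 1)]


print(enc2017_files('wiki17'))
```

Since boundary files mix subcorpora, such a lookup only narrows down which files to read; filtering by the `src` attribute of the documents is still needed.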

#### ENC 2019

The Estonian National Corpus 2019 can be downloaded from here: [https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2019-vrt-vormingus/be71121e733b11eaa6e4005056b4002483e6e5cdf35343e595e6ba4576d839fb/](https://metashare.ut.ee/repository/browse/eesti-keele-uhendkorpus-2019-vrt-vormingus/be71121e733b11eaa6e4005056b4002483e6e5cdf35343e595e6ba4576d839fb/) (or [https://doi.org/10.15155/3-00-0000-0000-0000-08489L](https://doi.org/10.15155/3-00-0000-0000-0000-08489L)). There are 20 `.gz` compressed files to download.
After unpacking the files, you should get the following large corpus files:

    etnc19_web_2013.vert
    etnc19_web_2019.vert
    etnc19_web_2017.vert
    etnc19_doaj.vert
    etnc19_wikipedia_2019.vert
    etnc19_wikipedia_2017.vert
    etnc19_reference_corpus.vert
    etnc19_balanced_corpus.vert

So, there are the following subcorpora: `Web 2013`, `Web 2019`, `Web 2017`, `DOAJ` (Estonian Open Access Journals), `Wikipedia 2017`, `Wikipedia 2019`, `Reference Corpus` and `Balanced Corpus`.

### Importing Texts

You can use function **`parse_enc_file_iterator`** to iterate over a corpus file and get its Text objects yielded one-by-one.
An example:

```python
from estnltk.corpus_processing.parse_enc import parse_enc_file_iterator

# input file
input_file = "etnc19_balanced_corpus.vert"

# iterate over corpus and extract Text objects one-by-one
for text_obj in parse_enc_file_iterator( input_file, line_progressbar='ascii' ):
    # TODO: do something with the Text object
    ...
```

**_Layers._** By default, created Text objects will have layers `'original_tokens'`, `'original_compound_tokens'`, `'original_words'` and `'original_sentences'` preserving the original tokenization from the corpus file. In addition, there will be layer `'original_word_chunks'`, which contains words glued together by the `<g/>` tag. If the document also had paragraph annotations, there will be the layer `'original_paragraphs'` (but not all documents have paragraph annotations). You can also restore the layer of morphological annotations from the file, see the section "Details of the parse_enc module" below.

**_Metadata._** Obtained Text objects also have metadata, stored in the dictionary `text_obj.meta`.

### Details of the parse_enc module

#### Reconstruction of source text

The Estonian National Corpus 2017 (2019) contains a variety of subcorpora, and its documents have gone through different degrees of editing, normalization and automatic (XML/HTML) annotation clean-up. 
Many of the original plain texts are no longer available, so texts need to be reconstructed.

The function `parse_enc_file_iterator` reconstructs texts from available annotation clues. 
If there are paragraphs (texts between `<p>` and `</p>` tags)  in the original document, these will be concatenated by double newlines.
Sentences (texts between `<s>` and `</s>` tags) in the original document will be concatenated by a single whitespace.
And words/tokens will also be concatenated by a single whitespace, except when there are `<g/>` tags between tokens.
Tokens separated by `<g/>` will be joined together into a single word, for instance, the token sequence `['"', '<g/>', 'Õnne', '13', '<g/>', '"']` will be reconstructed as string `"Õnne 13"`.
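
The gluing rule can be sketched in a few lines of plain Python. This is an illustration of the rule as described above, not the code used by `parse_enc_file_iterator`; `join_tokens` is a hypothetical helper.

```python
GLUE = '<g/>'


def join_tokens(tokens):
    """Join tokens with single spaces, except across <g/> glue markers."""
    pieces = []
    glue_pending = False
    for token in tokens:
        if token == GLUE:
            # Remember that the next token attaches directly to the previous one.
            glue_pending = True
            continue
        if pieces and not glue_pending:
            pieces.append(' ')
        pieces.append(token)
        glue_pending = False
    return ''.join(pieces)


sentence = join_tokens(['"', '<g/>', 'Õnne', '13', '<g/>', '"'])
print(sentence)
```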

By default, only words that have morphological analyses will be used for text reconstruction. If there are XML/HTML tags inside the document, these will not be included in the reconstructed text (unless they have been, "by accident", morphologically analysed).


#### Signature of the parsing function

  * **`parse_enc_file_iterator`**`(in_file, encoding='utf-8', focus_doc_ids=None, focus_srcs=None, focus_lang=None, tokenization='preserve', original_layer_prefix='original_', restore_morph_analysis=False, vertParser=None, textReconstructor=None, line_progressbar=None, logger=None)` -- reads and parses an Estonian National Corpus .vert file (`in_file`). 
 In the process, creates `Text` objects storing documents, metadata and annotation layers, and yields the created `Text` objects one-by-one;

#### Arguments

Exact behaviour of the function can be modified by the following arguments:


* **`encoding`** -- encoding of the input file. Normally, you should go with the default value ('utf-8').


* **`focus_doc_ids`** -- set of document id-s corresponding to the documents which need to be extracted from the `in_file`. Used for filtering documents by metadata attribute `id`. **Important:** `focus_doc_ids` should be a set of strings, not a set of integers. If provided, then only documents with given id-s will be  processed, and all other documents will be skipped. If `focus_doc_ids` is `None` or empty set, then document processing will not be affected by the given filter (but it may be affected by other filters). (default: `None`).


* **`focus_srcs`** -- set of document sources corresponding to the subcorpora whose documents need to be extracted from the file. Used for filtering documents by metadata attribute `src`. Potential source values for ENC 2017 are: `'NC'`, `'web13'`, `'web17'`, `'wiki17'`. Potential values for ENC 2019 are: `'Web 2013'`, `'Web 2019'`, `'Web 2017'`, `'DOAJ'`, `'Wikipedia 2017'`, `'Wikipedia 2019'`, `'Reference Corpus'` and `'Balanced Corpus'`. If a set is provided, then only documents that have a `src` value from the set will be extracted, and all other documents will be skipped. If `focus_srcs` is `None` or empty, then document processing will not be affected by the given filter (but it may be affected by other filters).  (default: `None`).


* **`focus_lang`** -- set of allowed document languages. Used for filtering documents by metadata attribute `lang`. If provided, then only documents that have `lang` from the set will be extracted, and all other documents will be skipped. Note that if `focus_lang` is set, but the document does not have the `lang` attribute, then the document will also be skipped. If `focus_lang` is `None` or empty, then document processing will not be affected by the given filter (but it may be affected by other filters).  (default: `None`).


* **`tokenization`** -- specifies if tokenization will be added to created Texts, and if so, then how it will be added. Following options are supported: `['none', 'preserve', 'preserve_partially', 'estnltk']` (default: `'preserve'`). Details:

      'none' -- text will be created without any tokenization layers;
      
      'preserve' -- original tokenization from the input file will be preserved in 
                    layers of the text; This option creates layers 'original_tokens', 
                    'original_compound_tokens', 'original_words', 'original_word_chunks', 
                    'original_sentences', and 'original_paragraphs' (only if 
                    paragraphs were available in the original doc);

      'preserve_partially' -- original tokenization from the input file will be 
                              preserved in layers of the text, but only partially; 
                              This option creates layers 'original_words', 
                              'original_sentences', and 'original_paragraphs' 
                              (only if paragraphs were available in the original 
                              doc); Note: using this option can help to speed up 
                              the process, because creating additional layers 
                              'original_tokens' and 'original_word_chunks' takes 
                              some time;
                              
      'estnltk' -- text's original tokenization will be overwritten by estnltk's 
                   tokenization; This option creates estnltk's tokenization layers 
                   'tokens', 'compound_tokens', 'words', 'sentences', 'paragraphs'.
      

* **`original_layer_prefix`** -- string prefix to be added to names of layers of original annotations if `tokenization=='preserve'` or `tokenization=='preserve_partially'`; (default: `'original_'`)


* **`restore_morph_analysis`** -- boolean specifying if the morphological analysis layer should be created based on the morphological annotations available in the input file (default: False). The morphological analysis layer can be created only if the original tokenization is preserved. The name of the layer will be `"original_morph_analysis"` (unless `original_layer_prefix` is used to change the prefix). By default, `restore_morph_analysis` is `False`, so the morphological annotations will be discarded and only tokenization layers will be created;
           

* **`vertParser`** -- An instance of `VertXMLFileParser` to be used for parsing the input file (default: None). Normally, you should go with the default (None), and let the function `parse_enc_file_iterator` create the instance. If you create `VertXMLFileParser` yourself, there are some extra parameters you can modify; for instance, you can enable including XML/HTML tags in reconstructed text. See the source code in [parse_enc.py](https://github.com/estnltk/estnltk/blob/devel_1.6/estnltk/corpus_processing/parse_enc.py) for details;


* **`textReconstructor`** -- An instance of `ENCTextReconstructor` to be used for creating Text objects based on parsed file content (default: None).  Normally, you should go with the default (None), and let the function `parse_enc_file_iterator` create the instance. If you create `ENCTextReconstructor` yourself, there are some extra parameters you can modify; for instance, you can change paragraph and sentence separators in reconstructed text. See the source code in [parse_enc.py](https://github.com/estnltk/estnltk/blob/devel_1.6/estnltk/corpus_processing/parse_enc.py) for details;
           
           
* **`line_progressbar`** -- string specifying if and how a progressbar should be initiated. The progressbar shows lines elapsed during reading the file. Possible values: `['ascii', 'unicode', 'notebook', None]` (default: None). Details:

      None -- no progressbar will be shown while reading the file;
      
      'ascii' -- initiates a progressbar that is suitable for terminals that have 
                 limited support for printing unicode characters, so using ASCII 
                 characters is the safest option;
      
      'unicode' -- initiates a progressbar that is suitable for a terminal in which 
                   unicode characters can be safely printed;
       
      'notebook' -- initiates a graphical progressbar that can be used in Jupyter 
                    Notebook★;

    ★ Remark: progressbars 'ascii' and 'unicode' can also be used in Jupyter Notebook, but progressbar 'notebook' cannot be used outside the Jupyter Notebook. In addition, you may have to  [update](https://github.com/tqdm/tqdm/issues/360) the `ipywidgets` package to get the 'notebook' progressbar working in the Jupyter Notebook;


* **`logger`** -- an instance of `logging.Logger` that is used to report warning and debug messages during the parsing process (default: None). If the function `parse_enc_file_iterator` stumbles upon missing metadata attributes, malformed annotations or empty documents/document fragments, warnings will be logged, reporting which documents are malformed. 
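
As an illustration of the **`logger`** parameter, the following is a minimal sketch of setting up a `logging.Logger` with the Python standard library. The helper name `make_enc_logger`, the logger name `'enc_import'` and the file name in the commented-out usage are assumptions made for this example and are not part of EstNLTK; the import path in the comment follows the location of `parse_enc.py` linked above.

```python
import logging
import sys

def make_enc_logger(name='enc_import', level=logging.WARNING):
    """Create (or reuse) a logger that prints messages to stderr.

    The helper and its default name are hypothetical; any
    logging.Logger instance can be used instead.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only once, so that repeated calls
    # do not produce duplicated log lines.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
        logger.addHandler(handler)
    return logger

enc_logger = make_enc_logger()

# Sketch of the intended usage (requires EstNLTK and a downloaded ENC file;
# 'my_enc_file.vert' is a placeholder, not a real corpus file name):
#
# from estnltk.corpus_processing.parse_enc import parse_enc_file_iterator
# for text in parse_enc_file_iterator('my_enc_file.vert', logger=enc_logger):
#     pass  # process the Text object here
```

Use `level=logging.DEBUG` if you also want to see debug messages in addition to warnings.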



#### What to keep in mind when using the loading function

  * While parsing documents, the function **`parse_enc_file_iterator`** assumes that a document is made of: A) sentences `<s>...</s>` that are inside paragraphs `<p> ... </p>`, or B) sentences `<s>...</s>` that are without paragraphs. A mixed variant (some sentences of the document are inside a paragraph, while others do not have any surrounding paragraph markings) is considered a malformed annotation. In such a case, only the sentences inside paragraphs will be collected;


   * Current versions of the ENC corpora (both 2017 and 2019) contain malformed / broken annotations. The function **`parse_enc_file_iterator`** should be able to parse all corpus files regardless of the broken annotations, but some documents are delivered only partially, and some documents may be missing altogether. For instance, if a `web17` document lacks the metadata attribute `src`, then it cannot be found with the filtering criterion `focus_srcs=set(['web17'])`.
   
   If you want to get an overview of the problems encountered during parsing, set the parameter **`logger`** to receive warnings about broken annotations.
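
The distinction between the well-formed variants (A and B) and the mixed variant described above can be illustrated with a small standard-library sketch. It is not the implementation used by `parse_enc_file_iterator`; it only assumes that each `<p>`, `</p>` and `<s>` tag stands on its own line of the input, and the function name `classify_sentence_structure` is made up for this example.

```python
import re

# Opening tags, optionally with attributes, e.g. <p> or <p heading="1">
P_OPEN = re.compile(r'^<p(\s[^>]*)?>$')
S_OPEN = re.compile(r'^<s(\s[^>]*)?>$')

def classify_sentence_structure(lines):
    """Classify the lines of one document by where its sentences occur.

    Returns 'paragraphs' (variant A), 'sentences_only' (variant B),
    'mixed' (malformed) or 'empty' (no sentences at all).
    """
    inside_paragraph = False
    in_paragraphs = 0       # sentences that start inside <p> ... </p>
    outside_paragraphs = 0  # sentences that start outside any paragraph
    for raw_line in lines:
        line = raw_line.strip()
        if P_OPEN.match(line):
            inside_paragraph = True
        elif line == '</p>':
            inside_paragraph = False
        elif S_OPEN.match(line):
            if inside_paragraph:
                in_paragraphs += 1
            else:
                outside_paragraphs += 1
    if in_paragraphs and outside_paragraphs:
        return 'mixed'
    if in_paragraphs:
        return 'paragraphs'
    if outside_paragraphs:
        return 'sentences_only'
    return 'empty'

# A toy document fragment in which the second sentence has no paragraph
toy_document = ['<p>', '<s>', 'Tere', '</s>', '</p>', '<s>', 'Jah', '</s>']
structure = classify_sentence_structure(toy_document)
```

For the toy fragment, the result is `'mixed'`: according to the description above, only the first sentence of such a document would be collected.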

---

## D. Importing texts of the Estonian Universal Dependencies treebank <br/> ( _Eesti keele UD puudepank_ )

### Getting the corpus

You can download the Estonian Universal Dependencies treebank from GitHub: [https://github.com/UniversalDependencies/UD_Estonian-EDT](https://github.com/UniversalDependencies/UD_Estonian-EDT).
The corpus files are those with the `.conllu` extension.

### Importing a single Text object

You can use the function **`conll_to_text`** to load the contents of the whole file into a single `Text` object.
An example:

```python
from estnltk.converters.conll_importer import conll_to_text

# input file
input_file = "et_edt-ud-dev.conllu"

# load whole file into a Text object
text_obj = conll_to_text( input_file, syntax_layer='conll_syntax' )
```

**_Layers._** The created `Text` object will have tokenization layers `'words'` and `'sentences'`, and the syntax layer (default name: `'conll_syntax'`).

**_Metadata._** Not provided.

### Importing Text objects

The function **`conll_to_texts_list`** can be used to (heuristically) split the contents of the `.conllu` file into separate documents and return them as a list of `Text` objects.
An example:

```python
from estnltk.converters.conll_importer import conll_to_texts_list

# input file
input_file = "et_edt-ud-dev.conllu"

# try to separate documents in the file and load Text objects
texts = conll_to_texts_list( input_file, syntax_layer='conll_syntax' )
```
**_Document splitting heuristic._** The separation heuristic assumes that there are `sent_id` values in the metadata, which denote the name of the document along with the number of the sentence. 
The name of the document is assumed to be a substring of `sent_id` from the beginning to the last underscore. 
The sentence number is assumed to come after the last underscore.
For instance, if `sent_id = 'aja_ee200110_2698'`, then the name of the document is `'aja_ee200110'`, and the sentence number is `2698`.
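
The heuristic can be sketched in plain Python as follows. This is an illustration of the rule described above, not the code used inside `conll_to_texts_list`; the helper names `split_sent_id` and `group_by_document` are made up for this example, and `'doc_b_1'` is a fabricated identifier used only to show a document boundary.

```python
from itertools import groupby

def split_sent_id(sent_id):
    """Split a sent_id at its last underscore.

    Returns a tuple (document name, sentence number).
    """
    doc_name, sep, sent_nr = sent_id.rpartition('_')
    if not sep or not doc_name or not sent_nr.isdecimal():
        raise ValueError('Unexpected sent_id format: {!r}'.format(sent_id))
    return doc_name, int(sent_nr)

def group_by_document(sent_ids):
    """Group consecutive sent_id values that share a document name."""
    pairs = map(split_sent_id, sent_ids)
    return [(doc_name, [sent_nr for _, sent_nr in items])
            for doc_name, items in groupby(pairs, key=lambda pair: pair[0])]

# The first two ids follow the example above; 'doc_b_1' is hypothetical
sent_ids = ['aja_ee200110_2698', 'aja_ee200110_2699', 'doc_b_1']
documents = group_by_document(sent_ids)
# documents == [('aja_ee200110', [2698, 2699]), ('doc_b', [1])]
```

Because the split happens at the last underscore, document names that themselves contain underscores (such as `'aja_ee200110'`) are kept intact.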

**_Layers._** The created `Text` objects will have tokenization layers `'words'` and `'sentences'`, and the syntax layer (default name: `'conll_syntax'`).

**_Metadata._** `Text` objects will have document names stored under `text.meta['file_prefix']`.

### Detailed description of API

A detailed description of conll importer functions can be found here:
https://github.com/estnltk/estnltk/blob/version_1.6/tutorials/converters/conll_importer.ipynb