## Processing files of the Estonian Reference Corpus (_Eesti keele koondkorpus_)

EstNLTK contains tools created specifically for processing the Estonian Reference Corpus (_Eesti keele koondkorpus_). 
These tools import XML TEI format files and convert them to EstNLTK `Text` objects.
This tutorial gives an overview of how koondkorpus files can be processed with EstNLTK, and of the current limitations of that processing.

### Koondkorpus XML files

The page [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) lists all the subcorpora of the Estonian Reference Corpus. There, you can follow the links and download the (zipped) XML files of the corpus. 

Once you have downloaded a zipped XML corpus and unzipped it, you should see a folder structure similar to this:


        ├── Kroonika
        │   ├── bin
        │   │   ├── koondkorpus_main_header.xml
        │   │   └── tei_corpus.rng
        │   └── Kroon
        │       ├── bin
        │       │   └── header_aja_kroonika.xml
        │       └── kroonika
        │           ├── kroonika_2000
        │           │   ├── aja_kr_2000_12_08.xml
        │           │   ├── aja_kr_2000_12_15.xml
        │           │   ├── aja_kr_2000_12_22.xml
        │           │   └── aja_kr_2000_12_29.xml
        │           ├── kroonika_2001
        │           │   ├── aja_kr_2001_01_05.xml
        │           │   ├── aja_kr_2001_01_12.xml
        │           │   ├── aja_kr_2001_01_19.xml
        │           │   ├── aja_kr_2001_01_22.xml
        ...         ...     ...
        
        
<center>Example. Folder structure in _Kroonika.zip_</center>

The `'bin'` folders contain headers and corpus descriptions. The `'.xml'` files outside the `'bin'` folders hold the actual textual content, and these are the files that can be analysed with EstNLTK.
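
To check which files would actually be analysed, the content files can be listed separately from the headers. The following is a minimal sketch that uses only the Python standard library. The helper name `list_content_xml_files` is hypothetical (it is not part of EstNLTK), and the sketch assumes that every folder named `bin` holds only headers and corpus descriptions, as in the listing above.

```python
import os


def list_content_xml_files(root):
    """Collect paths of '.xml' files under `root`, skipping all 'bin' folders.

    Hypothetical helper, not part of EstNLTK.
    """
    xml_files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Removing 'bin' from dirnames in place stops os.walk from
        # descending into the header / corpus description folders.
        dirnames[:] = [d for d in dirnames if d != 'bin']
        for name in filenames:
            if name.endswith('.xml'):
                xml_files.append(os.path.join(dirpath, name))
    return sorted(xml_files)
```

For the example above, `list_content_xml_files('Kroonika')` would return the `aja_kr_*.xml` paths but none of the header files.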


#### Loading texts from XML TEI files

The module `estnltk.corpus_processing.parse_koondkorpus` imports texts and metadata from XML TEI files and stores them in `Text` objects. The following functions are available for wider usage:

  * `parse_tei_corpus(path, target=['artikkel'], encoding='utf-8', preserve_tokenization=False, record_xml_filename=False)` -- reads and parses a single XML file (given with the full `path`), creates `Text` objects storing documents and metadata from the file, and returns a list of created `Text` objects;


  * `parse_tei_corpora(root, prefix='', suffix='.xml', target=['artikkel'], encoding='utf-8', preserve_tokenization=False,  record_xml_filename=False)` -- recursively reads all the files from the directory `root`, selects files with the given `prefix` and `suffix` for XML parsing, and creates `Text` objects storing documents and metadata from the files. Returns a list of the created `Text` objects.
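
As an illustration of the signatures above, here is a minimal sketch of loading a whole folder and taking a quick look at the result. The helper names `load_koondkorpus_docs` and `describe_docs` are hypothetical, and the sketch assumes that the returned `Text` objects expose their content as `.text` and their metadata as a dict-like `.meta`.

```python
def load_koondkorpus_docs(root, target=('artikkel',)):
    """Parse every '.xml' file under `root` into EstNLTK Text objects."""
    # Imported inside the function so that the helpers in this sketch
    # can be defined even where EstNLTK is not installed.
    from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpora
    return parse_tei_corpora(root,
                             suffix='.xml',
                             target=list(target),
                             record_xml_filename=True)


def describe_docs(docs, preview_chars=60):
    """Return (XML file name, start of text) pairs for the given documents.

    Assumption: each document has `.text` and a dict-like `.meta`;
    the file name is read from the '_xml_file' metadata key.
    """
    return [(doc.meta.get('_xml_file'), doc.text[:preview_chars])
            for doc in docs]
```

For the Kroonika example, `root` could be the `Kroonika/Kroon/kroonika` folder, which in the listing above holds only content files.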
  

**Arguments**

The exact behaviour of both functions can be modified with the following common arguments:

* **`target`** -- specifies the list of div types from which the textual content is to be extracted. For instance, in newspaper articles, the content of an article is typically between `<div3 type="artikkel">` and `</div3>`, so you should use `target=['artikkel']`. In fiction writings, the content of a single work is typically between `<div1 type="tervikteos">` and `</div1>`, and you may want to use `target=['tervikteos']`.

  The correct type values for `target` depend on the goal of the analysis. For example, you may want to divide a fiction text into chapters instead of analysing it as a whole. In that case, you should manually look up the correct type values (for chapters) in the XML file in question.
  
  If you do not have very specific goals, you can use the function **`get_div_target()`**, which provides a reasonable default div type for the given XML file, based on the hard-coded values. Example:

       from estnltk.corpus_processing.parse_koondkorpus import get_div_target, parse_tei_corpus
       xml_file = "/home/siim/koond/Eesti_ilukirjandus/ilukirjandus/Eesti_ilukirjandus_1990/ilu_ahasveerus.tasak.xml"
       target = get_div_target( xml_file )    # note: returns a single value, not list
       docs = parse_tei_corpus( xml_file, target=[target] )
  
     Note: the function `get_div_target()` needs the name of the XML file with its full path, as it uses information from directory names to determine the div type.


* **`preserve_tokenization`** -- specifies if the original paragraph and sentence tokenization in the XML files should be preserved (default: False). If switched on, then not only `Text` objects are created, but they are also annotated with layers `'words'`, `'sentences'` and `'paragraphs'`, trying to preserve the original tokenization. This means that sentences are taken from between `<s>` and `</s>` tags, and paragraphs from between `<p>` and `</p>` tags.

     _Note 1_: Creating tokenization layers also means longer processing times. So, processing with `preserve_tokenization=True` takes longer than processing with `preserve_tokenization=False` (the default setting);
     
     _Note 2_: If you need to, you can still restore the original tokenization, even if you use `preserve_tokenization=False`. In the created `Text` objects, paragraphs are separated by two newlines and sentences by a single newline. You can use `SentenceTokenizer` with the `base_sentence_tokenizer` set to NLTK's `LineTokenizer()` to get the correct sentence segmentation (and the default paragraph tokenizer should be able to provide correct paragraph tokenization);


* **`record_xml_filename`** -- specifies if the name of the XML file should be recorded in the metadata of the created `Text` objects, under the key `'_xml_file'` (default: False);


* **`encoding`** -- encoding of the input file (or input files). Normally, you should go with the default value ('utf-8').
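
When the type values for `target` have to be looked up manually, it can help to count which div types a file contains. The following is a rough sketch using only the standard library. The helper names are hypothetical (not part of EstNLTK), and the sketch assumes that div elements are written as `<divN type="...">`, as in the `target` examples above.

```python
import re
from collections import Counter

# Matches opening tags such as <div3 type="artikkel"> and captures
# the element name and the value of its type attribute.
DIV_TYPE_PATTERN = re.compile(r'<(div\d+)\b[^>]*?\btype=["\']([^"\']*)["\']')


def count_div_types(xml_text):
    """Count (element name, type value) pairs in a string of XML."""
    return Counter(DIV_TYPE_PATTERN.findall(xml_text))


def count_div_types_in_file(path, encoding='utf-8'):
    """Count (element name, type value) pairs in a single XML file."""
    with open(path, encoding=encoding) as xml_file:
        return count_div_types(xml_file.read())
```

Type values that occur many times within one file are likely candidates for smaller units of analysis, but whether they suit a given goal still needs to be checked against the file itself.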

### Processing whole koondkorpus with EstNLTK

TODO
