## Importing texts of the Estonian Reference Corpus (_Eesti keele koondkorpus_)

EstNLTK contains functions for loading texts from XML TEI format files of the [Estonian Reference Corpus](http://www.cl.ut.ee/korpused/segakorpus/) (_Eesti keele koondkorpus_). 
Use these functions when you need basic loading functionality that works across all the subcorpora of the Reference Corpus. 
If you need to load _koondkorpus_ texts in a more specific manner, you can use these functions as a model for implementing your own.

### Koondkorpus XML TEI files

The page [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) lists all the subcorpora of the Estonian Reference Corpus. From there, you can follow the links and download the (zipped) XML files of each subcorpus. 

Once you have downloaded a zipped XML corpus and unzipped it, you should see a folder structure similar to this:


        ├── Kroonika
        │   ├── bin
        │   │   ├── koondkorpus_main_header.xml
        │   │   └── tei_corpus.rng
        │   └── Kroon
        │       ├── bin
        │       │   └── header_aja_kroonika.xml
        │       └── kroonika
        │           ├── kroonika_2000
        │           │   ├── aja_kr_2000_12_08.xml
        │           │   ├── aja_kr_2000_12_15.xml
        │           │   ├── aja_kr_2000_12_22.xml
        │           │   └── aja_kr_2000_12_29.xml
        │           ├── kroonika_2001
        │           │   ├── aja_kr_2001_01_05.xml
        │           │   ├── aja_kr_2001_01_12.xml
        │           │   ├── aja_kr_2001_01_19.xml
        │           │   ├── aja_kr_2001_01_22.xml
        ...         ...     ...
        
        
<center>Example. Folder structure in _Kroonika.zip_</center>

The `'bin'` folders contain headers and corpus descriptions. The `'.xml'` files outside the `'bin'` folders hold the actual textual content. These files can be loaded with EstNLTK.
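
If you want to collect the content files yourself, for example to pass them one by one to a loading function, a minimal sketch could walk the unzipped folder and skip the `'bin'` folders. The helper name `find_content_xml_files` is hypothetical (it is not part of EstNLTK), and the sketch assumes the folder layout shown above.

```python
import os


def find_content_xml_files(root):
    """Return paths of '.xml' files under root, ignoring anything inside 'bin' folders.

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune 'bin' folders in place, so that os.walk does not descend into them
        dirnames[:] = sorted(d for d in dirnames if d != 'bin')
        for fname in sorted(filenames):
            if fname.endswith('.xml'):
                found.append(os.path.join(dirpath, fname))
    return found
```

Calling `find_content_xml_files('Kroonika')` on the unzipped example would list files such as `aja_kr_2000_12_08.xml`, but not the header files in the `'bin'` folders.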

### Loading texts from XML TEI files

The module `estnltk.corpus_processing.parse_koondkorpus` allows you to import texts and metadata from XML TEI files and store them in `Text` objects. The following functions are available for wider usage:

  * **`parse_tei_corpus`**`(path, target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n')` -- reads and parses a single XML file (given with the full `path`), creates `Text` objects storing documents and metadata from the file, and returns a list of created `Text` objects;


  * **`parse_tei_corpora`**`(root, prefix='', suffix='.xml', target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n')` -- reads recursively all the files from the directory `root`, selects files with the given `prefix` and `suffix` for XML parsing, and creates `Text` objects storing documents and metadata from the files. Returns a list of created `Text` objects;

Note: As the original plain texts for _koondkorpus_ XML files are not available, functions `parse_tei_corpus` and `parse_tei_corpora` will reconstruct the source texts by themselves. 
By default, this reconstruction follows the original XML mark-up: paragraphs (texts between `<p>` and `</p>` tags in the original XML) will be separated by double newlines, sentences (texts between `<s>` and `</s>` tags in the XML) will be separated by newlines, and words will be separated by spaces in the constructed texts.
Optionally, you may also add tokenization layers to created `Text` objects: either by strictly following the original XML tokenization mark-up, or by adding tokenization with EstNLTK's tools.
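
As a rough illustration of the default reconstruction described above, the following sketch joins words, sentences and paragraphs with the same separators. It is not the code used by `parse_tei_corpus`; the helper `reconstruct_text` and the nested-list input format are assumptions made for this example.

```python
def reconstruct_text(paragraphs, sentence_separator='\n', paragraph_separator='\n\n'):
    """Join a document given as paragraphs -> sentences -> words into one string.

    Hypothetical helper for illustration; not EstNLTK's implementation.
    """
    paragraph_strings = []
    for sentences in paragraphs:
        # Words inside a sentence are separated by single spaces
        sentence_strings = [' '.join(words) for words in sentences]
        paragraph_strings.append(sentence_separator.join(sentence_strings))
    return paragraph_separator.join(paragraph_strings)


# A made-up document: two paragraphs, the first one with two sentences
doc = [
    [['Tere', '!'], ['Kuidas', 'elad', '?']],
    [['Tore', '.']],
]
text = reconstruct_text(doc)
print(repr(text))
```

Passing `sentence_separator=' '` to the same helper would put all sentences of a paragraph on one line, which mirrors the effect of the corresponding argument of the loading functions.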

#### Arguments

Exact behaviour of the functions can be modified by the following common arguments:

* **`target`** -- specifies the list of div types from which the textual content is to be extracted. For instance, in newspaper articles, the content of an article is typically between `<div3 type="artikkel">` and `</div3>`, so you should use `target=['artikkel']`. In fiction writings, the content of a single work is typically between `<div1 type="tervikteos">` and `</div1>`, and you may want to use `target=['tervikteos']`.

  Which values should be used for `target` depends on the subcorpus and the goal of the analysis. For example, you may want to divide a fiction text into chapters instead of analysing it as a whole. In that case, you should manually look up the correct type values (for chapters) in the XML file in question.
  
  If you do not have very specific goals, you can use the function **`get_div_target()`**, which provides a reasonable default div type for the given XML file, based on the hard-coded values. Example:

       from estnltk.corpus_processing.parse_koondkorpus import get_div_target, parse_tei_corpus
       xml_file = "/home/siim/Eesti_ilukirjandus/ilukirjandus/"+\
                  "Eesti_ilukirjandus_1990/ilu_ahasveerus.tasak.xml"
       target = get_div_target( xml_file )    # note: returns a single value, not list
       docs = parse_tei_corpus( xml_file, target=[target] )
  
     Note: the function `get_div_target()` needs the name of the XML file with its full path, because it uses information from the directory names to determine the div type.


* **`add_tokenization`** -- specifies whether the tokenization layers (`'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'`, and `'paragraphs'`) should be added to newly created Text objects (default: False). Note that if `preserve_tokenization==False`, then the tokenization layers are added with EstNLTK's tools, otherwise the original layers from the XML mark-up are preserved;


* **`preserve_tokenization`** -- specifies if the original word, sentence and paragraph tokenization from the XML mark-up should be preserved (default: False). This only works if `add_tokenization` is switched on. Then `Text` objects are created with layers `'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'` and `'paragraphs'`, which preserve the original tokenization. This means that paragraphs are taken from between `<p>` and `</p>` tags, sentences from between `<s>` and `</s>` tags, and words are taken as space-separated tokens inside the sentences. The layer `'compound_tokens'` will always remain empty (because there is no information about token compounding in the XML mark-up), and the layer `'tokens'` will be equal to the layer `'words'`;

     _Note_: Creating tokenization layers typically takes more processing time than loading `Text` objects without layers. If you do not change parameters `sentence_separator` and `paragraph_separator`, then reconstructed texts also preserve the hints about tokenization, and you can restore the original tokenization afterwards. See the remark **C** below for details;
     

* **`record_xml_filename`** -- specifies if the name of the XML file should be recorded in the metadata of the created `Text` objects, under the key `'_xml_file'` (default: False);


* **`encoding`** -- encoding of the input file (or input files). Normally, you should go with the default value (`'utf-8'`).


* **`sentence_separator`** -- string used as sentence separator during the reconstruction of the text (default: `'\n'`).


* **`paragraph_separator`** -- string used as paragraph separator during the reconstruction of the text (default: `'\n\n'`).
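
As noted for `target` above, the right div type sometimes has to be looked up from the XML file itself. A minimal sketch for that, using only the standard library, is to count the `type` attributes of all `div` elements in a file. The helper name `count_div_types` is hypothetical, and the sketch assumes the file is well-formed XML that `xml.etree.ElementTree` can parse; files that are not may need a more lenient parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_div_types(xml_path):
    """Count (element name, type attribute) pairs of div elements in an XML file.

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    counts = Counter()
    # Attributes are already available on 'start' events, so the file is scanned once
    for _event, elem in ET.iterparse(xml_path, events=('start',)):
        # Drop a namespace prefix such as '{http://...}div1', if there is one
        local_name = elem.tag.rsplit('}', 1)[-1]
        div_type = elem.get('type')
        if local_name.startswith('div') and div_type is not None:
            counts[(local_name, div_type)] += 1
    return counts
```

The resulting counts show which type values occur at which div level, which can help in choosing a value for `target`.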

#### What to keep in mind when using the loading functions

   * **A.** Functions `parse_tei_corpus` and `parse_tei_corpora` provide a simple and general solution for loading texts from XML TEI files. The textual content (and, to some extent, the metadata) can be loaded from any subcorpus of the Estonian Reference Corpus. However, this generality has a cost: corpus-specific markings are typically not loaded. For instance, the corpus of Estonian parliamentary transcripts contains special markings for speaker names, but this information is not loaded. Likewise, the corpus of Internet forums records the time and user name for each forum message, but only the message itself (the textual content) is loaded.
 
   If you need to "get more out of" the XML TEI files, you'll need to create your own loading functions. You can follow the example of the functions in the module `estnltk.corpus_processing.parse_koondkorpus`, and use the library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to do the job.


   * **B.** _Tokenization_: adding EstNLTK's tokenization vs preserving the original tokenization. 
   
      1. If you use EstNLTK's tokenization instead of the original one, you get annotations of multiword tokens (`'compound_tokens'`), and this also helps to get more accurate sentence annotations. For instance, the original markup occasionally contains sentence endings inside multiword names, such as _'Ju. M. Lotman'_, and date expressions, such as _'24 . 10. 1921'_, but EstNLTK's default tokenization marks such expressions as `'compound_tokens'` and thus cancels sentence endings inside these expressions. If you use EstNLTK's tokenization, it is advisable to set `sentence_separator=' '`, so that reconstructed text will not contain newlines inside compound tokens;
    
      2. If you preserve the original tokenization, you can get better alignment with other linguistic annotations laid on the loaded corpus, such as [Dependency Treebank annotations](https://github.com/EstSyntax/EDT), or [TimeML annotations](https://github.com/soras/EstTimeMLCorpus). Data also gets loaded faster with the original tokenization annotations. The downsides are: there will be no `'compound_tokens'` marked in loaded texts, and there may be more sentence tokenization errors;
   
   
   * **C.** _Loading texts without tokenization, and then restoring the original tokenization later._ If you use parameters `sentence_separator` and `paragraph_separator` with their default values, you can restore the original tokenization after loading the texts. For this:
   
      1. Use a `CompoundTokenTagger` with the initialization parameter `do_not_join_on_strings=['\n', '\n\n']` to ensure that compound tokens will not cross sentence and paragraph boundaries;
      2. Use a `SentenceTokenizer` with the `base_sentence_tokenizer` set to NLTK's `LineTokenizer()`, so that texts will be split into sentences at newlines;
      3. Use the default paragraph tokenizer to create paragraphs according to the original paragraph tokenization.
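
Why the default separators are enough to restore the original tokenization (remark **C**) can be illustrated with plain string splitting. This is only a sketch of the principle, not a replacement for the EstNLTK taggers listed above; the function name `split_reconstructed_text` and the sample text are made up for illustration.

```python
def split_reconstructed_text(text, sentence_separator='\n', paragraph_separator='\n\n'):
    """Split a reconstructed text back into paragraphs -> sentences -> words.

    Hypothetical helper for illustration; assumes the text was built with the
    given separators and with single spaces between words.
    """
    paragraphs = []
    # Split on the paragraph separator first, because it contains the sentence separator
    for paragraph in text.split(paragraph_separator):
        sentences = [sentence.split(' ') for sentence in paragraph.split(sentence_separator)]
        paragraphs.append(sentences)
    return paragraphs


# A made-up reconstructed text: two paragraphs, the first one with two sentences
text = 'Tere !\nKuidas elad ?\n\nTore .'
paragraphs = split_reconstructed_text(text)
print(len(paragraphs))     # number of paragraphs
print(paragraphs[0][1])    # words of the second sentence in the first paragraph
```

If the texts were loaded with a different `sentence_separator`, such as `' '`, the sentence boundaries can no longer be recovered this way, which is why the default values matter here.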

### Processing the whole Estonian Reference Corpus with EstNLTK

If you need to process the whole Estonian Reference Corpus with EstNLTK, you can use the command-line scripts in [**`https://github.com/estnltk/estnltk-workflows`**](https://github.com/estnltk/estnltk-workflows). 
The workflow for processing the Estonian Reference Corpus with EstNLTK and saving the results as JSON files is located at: 
[https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json](https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json). 
For details about processing, check out the `readme.md` file in the folder.