## Processing files of Estonian Reference Corpus (_Eesti keele koondkorpus_)

EstNLTK contains tools created specifically for processing the Estonian Reference Corpus (_Eesti keele koondkorpus_). 
First, there are functions for loading texts from the corpus's XML TEI files and converting them to EstNLTK `Text` objects. 
Second, there are command line scripts for processing the whole Estonian Reference Corpus with EstNLTK.

This tutorial gives an overview of how to use these functions and scripts.

### Koondkorpus XML files

The page [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) lists all the subcorpora of the Estonian Reference Corpus. From there, you can follow the links and download the (zipped) XML files of each subcorpus. 

Once you have downloaded a zipped XML corpus and unzipped it, you should see a folder structure similar to this:


        ├── Kroonika
        │   ├── bin
        │   │   ├── koondkorpus_main_header.xml
        │   │   └── tei_corpus.rng
        │   └── Kroon
        │       ├── bin
        │       │   └── header_aja_kroonika.xml
        │       └── kroonika
        │           ├── kroonika_2000
        │           │   ├── aja_kr_2000_12_08.xml
        │           │   ├── aja_kr_2000_12_15.xml
        │           │   ├── aja_kr_2000_12_22.xml
        │           │   └── aja_kr_2000_12_29.xml
        │           ├── kroonika_2001
        │           │   ├── aja_kr_2001_01_05.xml
        │           │   ├── aja_kr_2001_01_12.xml
        │           │   ├── aja_kr_2001_01_19.xml
        │           │   ├── aja_kr_2001_01_22.xml
        ...         ...     ...
        
        
<center>Example. Folder structure in _Kroonika.zip_</center>

Folders `'bin'` contain headers and corpus descriptions. The `'.xml'` files outside the `'bin'` folders are the files with the actual textual content. These files can be loaded with EstNLTK.
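
The distinction between header files and content files can be sketched in plain Python. The helper below is a hypothetical illustration (the name `find_content_xml_files` is not part of EstNLTK); it simply collects the `.xml` files that do not sit inside any `bin` folder:

```python
from pathlib import Path

def find_content_xml_files(root):
    """Return sorted paths of '.xml' files under root that are not inside a 'bin' folder."""
    root = Path(root)
    content_files = []
    for path in root.rglob("*.xml"):
        # Look only at the directories between root and the file itself
        parent_dirs = path.relative_to(root).parts[:-1]
        if "bin" not in parent_dirs:
            content_files.append(path)
    return sorted(content_files)
```

For the unzipped _Kroonika.zip_ shown above, `find_content_xml_files("Kroonika")` would skip `koondkorpus_main_header.xml` and `header_aja_kroonika.xml` and keep the `aja_kr_*.xml` files.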

#### Loading texts from XML TEI files

The module `estnltk.corpus_processing.parse_koondkorpus` allows you to import texts and metadata from XML TEI files and store them in `Text` objects. The following functions are available for general use:

  * `parse_tei_corpus(path, target=['artikkel'], encoding='utf-8', preserve_tokenization=False, record_xml_filename=False)` -- reads and parses a single XML file (given with the full `path`), creates `Text` objects storing documents and metadata from the file, and returns a list of created `Text` objects;


  * `parse_tei_corpora(root, prefix='', suffix='.xml', target=['artikkel'], encoding='utf-8', preserve_tokenization=False,  record_xml_filename=False)` -- recursively reads all the files in the directory `root`, selects the files with the given `prefix` and `suffix` for XML parsing, creates `Text` objects storing documents and metadata from the files, and returns a list of the created `Text` objects.
  

**Arguments**

Exact behaviour of the functions can be modified by the following common arguments:

* **`target`** -- specifies the list of div types from which the textual content is to be extracted. For instance, in the case of newspaper articles, the content of an article is typically between `<div3 type="artikkel">` and `</div3>`, so you should use `target=['artikkel']`. In the case of fiction writings, the content of a single work is typically between `<div1 type="tervikteos">` and `</div1>`, and you may want to use `target=['tervikteos']`.

  Which values should be used for `target` depends on the goal of analysis. For example, you may want to divide a fiction text into chapters, instead of analysing it as a whole. In such case, you should manually look up the correct type values (for chapters) from the concrete XML file.
  
  If you do not have very specific goals, you can use the function **`get_div_target()`**, which provides a reasonable default div type for the given XML file, based on the hard-coded values. Example:

       from estnltk.corpus_processing.parse_koondkorpus import get_div_target, parse_tei_corpus
       xml_file = "/home/siim/Eesti_ilukirjandus/ilukirjandus/"+\
                  "Eesti_ilukirjandus_1990/ilu_ahasveerus.tasak.xml"
       target = get_div_target( xml_file )    # note: returns a single value, not list
       docs = parse_tei_corpus( xml_file, target=[target] )
  
     Note: the function `get_div_target()` needs the name of the XML file with its full path, because it uses information from the directory names to determine the div type.


* **`preserve_tokenization`** -- specifies if the original paragraph and sentence tokenization in the XML files should be preserved (default: False). If switched on, then not only `Text` objects are created, but they are also annotated with layers `'words'`, `'sentences'` and `'paragraphs'`, trying to preserve the original tokenization. This means that sentences are taken from between `<s>` and `</s>` tags, and paragraphs from between `<p>` and `</p>` tags.

     _Note 1_: Creating tokenization layers means that actual NLP tools are applied to create the tokenizations. So, processing with `preserve_tokenization=True` takes longer (and needs more processing resources) than processing with `preserve_tokenization=False` (the default setting);
     
     _Note 2_: If you need to, you can still restore the original tokenization, even if you use `preserve_tokenization=False`. See the remark **B** below for details;


* **`record_xml_filename`** -- specifies if the name of the XML file should be recorded in the metadata of the created `Text` objects, under the key `'_xml_file'` (default: False);


* **`encoding`** -- encoding of the input file (or input files). Normally, you should go with the default value ('utf-8').
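
To make the role of `target` more concrete, here is a minimal sketch of selecting divs by their `type` attribute. It is not EstNLTK's implementation: it uses the standard library's `xml.etree.ElementTree` instead of EstNLTK's own parser, the function name `extract_documents` is hypothetical, and the toy XML is a simplified assumption (real _koondkorpus_ files are larger and their nesting varies between subcorpora).

```python
import xml.etree.ElementTree as ET

# Toy XML, assumed for illustration only
TOY_XML = """<TEI>
  <text><body>
    <div1>
      <div3 type="artikkel">
        <p><s>Esimene lause .</s><s>Teine lause .</s></p>
        <p><s>Kolmas lause .</s></p>
      </div3>
      <div3 type="artikkel">
        <p><s>Neljas lause .</s></p>
      </div3>
    </div1>
  </body></text>
</TEI>"""

def extract_documents(xml_string, target):
    """Return one string per element whose 'type' attribute is listed in target."""
    root = ET.fromstring(xml_string)
    documents = []
    for element in root.iter():
        if element.get("type") in target:
            paragraphs = []
            for p in element.iter("p"):
                # One sentence per <s> element, joined with newlines
                sentences = ["".join(s.itertext()).strip() for s in p.iter("s")]
                paragraphs.append("\n".join(sentences))
            # Paragraphs are joined with an empty line between them
            documents.append("\n\n".join(paragraphs))
    return documents

docs = extract_documents(TOY_XML, target=["artikkel"])
```

With `target=["artikkel"]`, the toy file yields two documents, one per `<div3 type="artikkel">` element. A different `target` value would select a different level of the div hierarchy.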

#### What to keep in mind when using the loading functions

   * **A.** Functions `parse_tei_corpus` and `parse_tei_corpora` provide a simple and general solution for loading texts from XML TEI files. The textual content (and, to an extent, the metadata) can be loaded from any subcorpus of the Estonian Reference Corpus. However, this generality has a cost: typically, corpus-specific markings will not be loaded. For instance, the corpus of Estonian parliamentary transcripts also contains special markings for speaker names, but this information is not loaded. Likewise, the corpus of Internet forums records the time and user name of each message, but only the message itself (the textual content) is loaded.
 
   If you need to "get more out of" the XML TEI files, you'll need to create your own loading functions. Similarly to EstNLTK's loading functions, you can use the library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing  XML.


   * **B.** As the original plain texts for _koondkorpus_ XML files are not available, functions `parse_tei_corpus` and `parse_tei_corpora` construct the source texts themselves. In the constructed texts, sentences (texts between `<s>` and `</s>` tags in the XML) are separated by newlines, and paragraphs (texts between `<p>` and `</p>` tags in the XML) are separated by two newlines. This systematic representation allows you to restore the original tokenization after loading the texts. To get the original sentence segmentation, you can use a `SentenceTokenizer` with the `base_sentence_tokenizer` set to NLTK's `LineTokenizer()`. The default paragraph tokenizer should be able to produce paragraphs that follow the original paragraph tokenization.
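
The text format described in remark **B** can be illustrated without any NLP tools. The sketch below is plain Python rather than EstNLTK's tokenizers, and the function name `split_constructed_text` is hypothetical; it only shows why the newline conventions are enough to recover the original segmentation:

```python
def split_constructed_text(text):
    """Split a text into paragraphs (separated by an empty line),
    each given as a list of sentences (one sentence per line)."""
    paragraphs = []
    for block in text.split("\n\n"):
        sentences = [line for line in block.split("\n") if line.strip()]
        if sentences:
            paragraphs.append(sentences)
    return paragraphs

# Two paragraphs: the first has two sentences, the second has one
text = "Esimene lause .\nTeine lause .\n\nKolmas lause ."
paragraphs = split_constructed_text(text)
```

Inside EstNLTK, you would normally rely on the tokenizers mentioned in remark **B** instead, so that the result is stored as `'sentences'` and `'paragraphs'` layers of the `Text` object.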

### Processing the whole Estonian Reference Corpus with EstNLTK

EstNLTK also contains scripts that can be used for processing the whole Estonian Reference Corpus. To use these scripts, please proceed in the following steps:

**1.** First, download all the (zipped) XML files from [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) to your computer. Put them into a separate folder, e.g. folder named `koond`. 

   After downloading, you should have the following files ( checked with UNIX command: `ls -1 koond` ):
     
        Agraarteadus.zip
        Arvutitehnika.zip
        Doktoritood.zip
        EestiArst.zip
        Ekspress.zip
        foorum_uudisgrupp_kommentaar.zip
        Horisont.zip
        Ilukirjandus.zip
        Kroonika.zip
        LaaneElu.zip
        Luup.zip
        Maaleht.zip
        Paevaleht.zip
        Postimees.zip
        Riigikogu.zip
        Seadused.zip
        SLOleht.tar.gz
        Teadusartiklid.zip
        Valgamaalane.zip

  (19 files in total)

**2.** Unpack the files. In UNIX, you can use commands:

        cd koond/
        unzip "*.zip"
        tar -xzf SLOleht.tar.gz
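
If the UNIX tools are not available, the unpacking step can also be sketched in Python with the standard library. The helper name `unpack_all` is hypothetical; it assumes that all archives sit directly in one folder (such as `koond`) and should be unpacked in place:

```python
import shutil
from pathlib import Path

def unpack_all(archive_dir):
    """Unpack every .zip and .tar.gz archive found directly in archive_dir, in place."""
    archive_dir = Path(archive_dir)
    unpacked = []
    for archive in sorted(archive_dir.iterdir()):
        if archive.name.endswith((".zip", ".tar.gz")):
            # shutil.unpack_archive picks the archive format from the file extension
            shutil.unpack_archive(str(archive), str(archive_dir))
            unpacked.append(archive.name)
    return unpacked
```

Calling `unpack_all("koond")` returns the names of the archives it unpacked, which makes it easy to check that none was skipped.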

**3.** Next, the XML files need to be converted into the _json_ format. First, create a new folder where the results of the conversion will be stored. Then, use the script **`convert_koondkorpus_to_json.py`** to do the conversion. The script needs a starting directory and an output directory as arguments. For the starting directory, you can pass the name of the directory into which you unpacked the zip files in the previous step. The script will recursively traverse the directory structure and find all the XML files suitable for conversion.

   Be aware that all the converted files will be put into the output directory. So, after the conversion, there will be a lot of files in the output folder (approx. 700,000 files).
   
   You can check other possible arguments of the script with the flag `-h`:

        python  convert_koondkorpus_to_json.py  -h

     
**4.** (_Optional_) Use the script **`split_large_corpus_files_into_subsets.py`** for splitting the large set of files from the previous step into N smaller subsets. This will enable parallel processing of the subsets.
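
   The idea behind splitting can be sketched as a round-robin distribution of file names over N subsets. This is only an illustration under that assumption, with a hypothetical function name; the actual script may split the files differently and has its own options:

```python
def split_into_subsets(file_names, n):
    """Distribute file names over n subsets in round-robin order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    subsets = [[] for _ in range(n)]
    # Sorting first makes the split reproducible between runs
    for index, name in enumerate(sorted(file_names)):
        subsets[index % n].append(name)
    return subsets
```

   Round-robin assignment keeps the subsets within one file of each other in size, so parallel workers receive roughly equal numbers of files.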

**5.** Use the script **`process_and_save_results.py`** to analyze the JSON format files with EstNLTK 1.6.x. The script will add linguistic annotations up to the level of morphology. Before using the script, you'll also need to create a new folder where the script can store the results of analysis. 

   Optionally, you may want to invoke N instances of **`process_and_save_results.py`** for faster processing. You can get more information about the processing options with:
   
        python  process_and_save_results.py  -h