# Importing Text objects from large corpora

EstNLTK contains modules for importing texts from the [Estonian Reference Corpus](http://www.cl.ut.ee/korpused/segakorpus/) (_Eesti keele koondkorpus_), the [Estonian Web 2013 corpus](https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f) (aka _etTenTen 2013_), and [the Estonian National Corpus 2017](https://metashare.ut.ee/repository/browse/estonian-national-corpus-2017/b616ceda30ce11e8a6e4005056b40024880158b577154c01bd3d3fcfc9b762b3/) (_Eesti keele ühendkorpus 2017_). 
These modules import corpus content as EstNLTK Text objects and, to an extent, preserve the original corpus annotations (e.g. paragraph and sentence boundary annotations).
The sections below introduce the modules in detail.

---

## A. Importing texts of the Estonian Reference Corpus (_Eesti keele koondkorpus_)

### Getting the files

The page [http://www.cl.ut.ee/korpused/segakorpus/](http://www.cl.ut.ee/korpused/segakorpus/) lists all the subcorpora of the Estonian Reference Corpus. From there, you can follow the links and download the (zipped) XML files of each subcorpus. 

Once you have downloaded a zipped XML corpus and unzipped it, you should see a folder structure similar to this:


        ├── Kroonika
        │   ├── bin
        │   │   ├── koondkorpus_main_header.xml
        │   │   └── tei_corpus.rng
        │   └── Kroon
        │       ├── bin
        │       │   └── header_aja_kroonika.xml
        │       └── kroonika
        │           ├── kroonika_2000
        │           │   ├── aja_kr_2000_12_08.xml
        │           │   ├── aja_kr_2000_12_15.xml
        │           │   ├── aja_kr_2000_12_22.xml
        │           │   └── aja_kr_2000_12_29.xml
        │           ├── kroonika_2001
        │           │   ├── aja_kr_2001_01_05.xml
        │           │   ├── aja_kr_2001_01_12.xml
        │           │   ├── aja_kr_2001_01_19.xml
        │           │   ├── aja_kr_2001_01_22.xml
        ...         ...     ...
        
        
<center>Example. Folder structure in _Kroonika.zip_</center>

Folders `'bin'` contain headers and corpus descriptions. The `'.xml'` files outside the `'bin'` folders are the files with the actual textual content. These files can be loaded with EstNLTK.

### Importing Texts from a single XML file

You can use function **`parse_tei_corpus`** to create Text objects based on a single XML TEI file. 
An example:

    from estnltk.corpus_processing.parse_koondkorpus import get_div_target
    from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpus
    
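    # input file (must be given with a full path);
    # the raw string (r"...") keeps backslashes literal, otherwise "\a" would become an escape character
    input_xml_file = r"C:\Kroonika\Kroon\kroonika\kroonika_2000\aja_kr_2000_12_08.xml"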
    
    # find out which subsection of the XML file forms a single document
    target = get_div_target( input_xml_file )   
    
    # import documents as Text objects   
    for text_obj in parse_tei_corpus( input_xml_file, target=[target] ):
        # TODO: do something with the Text object
        ...

Note that before using the function, you should decide which subsections of the XML file to treat as single documents (the argument `target`). If you do not have clear preferences, you can use the function `get_div_target` to get a reasonable default for every subcorpus of the Reference Corpus.

**_Layers._** By default, obtained Text objects do not have any annotation layers. See the section "Details of the `parse_koondkorpus` module" on how to add layers.

**_Metadata._** Obtained Text objects also have metadata, stored in the dictionary `text_obj.meta`.

### Importing Texts from a directory of XML files

You can use function **`parse_tei_corpora`** to create Text objects from all (or selected) XML TEI files in a directory and in all of its subdirectories. 
An example:

    from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpora
    
    # input directory (must be with a full path)
    input_xml_path = 'C:\\Kroonika\\Kroon\\kroonika'
   
    # import documents as Text objects   
    for text_obj in parse_tei_corpora( input_xml_path, target=['artikkel'] ):
        # TODO: do something with the Text object
        ...

Note that function `parse_tei_corpora` will traverse recursively all subdirectories of the given directory (`input_xml_path`), and will also extract Text objects from XML TEI files in the subdirectories. In addition, you can use arguments `prefix` and `suffix` to further specify which file names are suitable for extraction, e.g. `prefix="aja_kr_2000_"` will tell the function to extract Texts only from files starting with the string `'aja_kr_2000_'`.
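
The file selection described above can be pictured with a small standard-library sketch. This is not EstNLTK's implementation: the function name `select_xml_files` and the `skip_dirs` parameter are hypothetical, and skipping `'bin'` folders is an assumption based on the earlier note that those folders hold only headers and corpus descriptions.

```python
import os

def select_xml_files(root, prefix='', suffix='.xml', skip_dirs=('bin',)):
    """Return sorted paths of files under root whose names match prefix and suffix.

    Hypothetical helper for illustration only; not part of EstNLTK.
    """
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place keeps os.walk out of the skipped folders
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if name.startswith(prefix) and name.endswith(suffix):
                selected.append(os.path.join(dirpath, name))
    return sorted(selected)
```

Calling `select_xml_files(input_xml_path, prefix='aja_kr_2000_')` on the _Kroonika_ folder shown earlier would be expected to return only the files from the year 2000, which is a quick way to preview what a given `prefix` and `suffix` combination matches before starting a long import.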

### Details of the `parse_koondkorpus` module

#### Reconstruction of source text

As the original plain texts of the _koondkorpus_ XML files are not available, the functions `parse_tei_corpus` and `parse_tei_corpora` reconstruct the source texts themselves. 
By default, this reconstruction follows the original XML mark-up: paragraphs (texts between `<p>` and `</p>` tags in the original XML) will be separated by double newlines, sentences (texts between `<s>` and `</s>` tags in the XML) will be separated by newlines, and words will be separated by spaces in the constructed texts.
Optionally, you may also add tokenization layers to the created `Text` objects: either by strictly following the original XML tokenization mark-up, or by adding tokenization with EstNLTK's tools.
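
To make the default reconstruction rule concrete, here is a minimal standard-library sketch. It is not the code used by `parse_tei_corpus`: the function `reconstruct_text` is hypothetical, and `SAMPLE_XML` is a made-up, heavily simplified fragment (real corpus files carry headers and more nesting).

```python
import xml.etree.ElementTree as ET

# Made-up, simplified fragment in the spirit of the corpus mark-up
SAMPLE_XML = """<div3 type="artikkel">
  <p><s>Esimene lause .</s><s>Teine lause .</s></p>
  <p><s>Uus lõik .</s></p>
</div3>"""

def reconstruct_text(div, sentence_separator='\n', paragraph_separator='\n\n'):
    """Join words with spaces, sentences and paragraphs with the given separators."""
    paragraphs = []
    for p in div.iter('p'):
        sentences = []
        for s in p.iter('s'):
            # Collapse any whitespace inside the sentence to single spaces
            words = ''.join(s.itertext()).split()
            sentences.append(' '.join(words))
        paragraphs.append(sentence_separator.join(sentences))
    return paragraph_separator.join(paragraphs)

text = reconstruct_text(ET.fromstring(SAMPLE_XML))
print(repr(text))  # 'Esimene lause .\nTeine lause .\n\nUus lõik .'
```

Because the separators differ (one newline versus two), the sentence and paragraph boundaries of the mark-up remain recoverable from the plain text alone.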

#### Signatures of the parsing functions

  * **`parse_tei_corpus`**`(path, target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n', orig_tokenization_layer_name_prefix='')` -- reads and parses a single XML file (given with the full `path`), creates `Text` objects storing documents and metadata from the file, and returns a list of created `Text` objects;


  * **`parse_tei_corpora`**`(root, prefix='', suffix='.xml', target=['artikkel'], encoding='utf-8', add_tokenization=False, preserve_tokenization=False, record_xml_filename=False, sentence_separator='\n', paragraph_separator='\n\n', orig_tokenization_layer_name_prefix='')` -- reads recursively all the files from the directory `root`, selects files with the given `prefix` and `suffix` for XML parsing, and creates `Text` objects storing documents and metadata from the files. Returns a list of created `Text` objects;

#### Arguments

Exact behaviour of the functions can be modified by the following common arguments:

* **`target`** -- specifies the list of div types from which the textual content is to be extracted. For instance, in the case of newspaper articles, the content of an article is typically between `<div3 type="artikkel">` and `</div3>`, so you should use `target=['artikkel']`. In the case of fiction writings, the content of a single work is typically between `<div1 type="tervikteos">` and `</div1>`, and you may want to use `target=['tervikteos']`.

  Which values should be used for `target` depends on the subcorpus and the goal of the analysis. For example, you may want to divide a fiction text into chapters, instead of analysing it as a whole. In such a case, you should manually look up the correct type values (for chapters) in the concrete XML file.
  
  If you do not have very specific goals, you can use the function **`get_div_target()`**, which provides a reasonable default div type for the given XML file, based on the hard-coded values. Example:

       from estnltk.corpus_processing.parse_koondkorpus import get_div_target
       from estnltk.corpus_processing.parse_koondkorpus import parse_tei_corpus
       
       xml_file = 'C:\\Eesti_ilukirjandus\\ilukirjandus\\Eesti_ilukirjandus_1990\\'+\
                  'ilu_ahasveerus.tasak.xml'
       target = get_div_target( xml_file )    # note: returns a single value, not list
       docs = parse_tei_corpus( xml_file, target=[target] )
  
     Note: the function `get_div_target()` needs the name of the XML file with its full path, as it uses information from directory names to determine the div type.


* **`add_tokenization`** -- specifies whether the tokenization layers (`'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'`, and `'paragraphs'`) should be added to newly created Text objects (default: False). Note that if `preserve_tokenization==False`, then the tokenization layers are added with EstNLTK's tools, otherwise the original layers from the XML mark-up are preserved;


* **`preserve_tokenization`** -- specifies if the original word, sentence and paragraph tokenization from the XML mark-up should be preserved (default: False). This only works if `add_tokenization` is switched on. Then `Text` objects are created with layers `'tokens'`, `'compound_tokens'`, `'words'`, `'sentences'` and `'paragraphs'`, which preserve the original tokenization. This means that paragraphs are taken from between `<p>` and `</p>` tags, sentences from between `<s>` and `</s>` tags, and words are taken as space-separated tokens inside the sentences. The layer `'compound_tokens'` will always remain empty (because there is no information about token compounding in the XML mark-up), and the layer `'tokens'` will be equal to the layer `'words'`;

     _Note_: Creating tokenization layers typically takes more processing time than loading `Text` objects without layers. If you do not change parameters `sentence_separator` and `paragraph_separator`, then reconstructed texts also preserve the hints about tokenization, and you can restore the original tokenization afterwards. See the remark **C** below for details;
     

* **`record_xml_filename`** -- specifies if the name of the XML file should be recorded in the metadata of the created `Text` objects, under the key `'_xml_file'` (default: False);


* **`encoding`** -- encoding of the input file (or input files). Normally, you should go with the default value ('utf-8').


* **`sentence_separator`** -- string used as sentence separator during the reconstruction of the text (default: `'\n'`).


* **`paragraph_separator`** -- string used as paragraph separator during the reconstruction of the text (default: `'\n\n'`).


* **`orig_tokenization_layer_name_prefix`** -- string used as a prefix in names of layers of original tokenization. You can use this argument to make names of original tokenization layers distinguishable from EstNLTK's tokenization layers. (default: `''`)
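
When looking up div type values by hand, as suggested for the `target` argument, a short script can list which types occur in a file. The sketch below uses only the standard library and assumes well-formed XML; `count_div_types` is a hypothetical helper, and the type value `'peatykk'` in the sample is made up for illustration (check your own files for the actual chapter type).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up fragment; real files are much larger and have a TEI header
SAMPLE = """<text><body>
  <div1 type="tervikteos">
    <div2 type="peatykk"><p><s>Esimene .</s></p></div2>
    <div2 type="peatykk"><p><s>Teine .</s></p></div2>
  </div1>
</body></text>"""

def count_div_types(xml_string):
    """Count 'type' attribute values of <div>, <div1>, <div2>, ... elements."""
    counts = Counter()
    for elem in ET.fromstring(xml_string).iter():
        tag = elem.tag.rsplit('}', 1)[-1]  # drop an XML namespace, if present
        if re.fullmatch(r'div\d*', tag) and 'type' in elem.attrib:
            counts[elem.attrib['type']] += 1
    return counts

counts = count_div_types(SAMPLE)
for div_type, n in counts.most_common():
    print(div_type, n)
```

A type that occurs once per work (here `'tervikteos'`) yields one document per work, while a more frequent nested type yields smaller documents.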



#### What to keep in mind when using the loading functions

   * **A.** Functions `parse_tei_corpus` and `parse_tei_corpora` provide a simple and general solution for loading texts from XML TEI files. The textual content (and, to an extent, the metadata) can be loaded from any subcorpus of the Estonian Reference Corpus. However, this generality has a cost: typically, corpus-specific markings will not be loaded. For instance, the corpus of Estonian parliamentary transcripts also contains special markings for speaker names, but this information is not loaded. Likewise, the corpus of Internet forums records the time and user name for each forum message, but only the message itself (the textual content) is loaded.
 
   If you need to "get more out of" the XML TEI files, you'll need to create your own loading functions. You can follow the example of the functions in the module `estnltk.corpus_processing.parse_koondkorpus`, and use the library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to do the job.


   * **B.** _Tokenization_: adding EstNLTK's tokenization vs preserving the original tokenization. 
   
      1. If you use EstNLTK's tokenization instead of the original one, you get annotations of multiword tokens (`'compound_tokens'`), and this also helps to get more accurate sentence annotations. For instance, the original markup occasionally contains sentence endings inside multiword names, such as _'Ju. M. Lotman'_, and date expressions, such as _'24 . 10. 1921'_, but EstNLTK's default tokenization marks such expressions as `'compound_tokens'` and thus cancels sentence endings inside these expressions. If you use EstNLTK's tokenization, it is advisable also to set `sentence_separator=' '`, so that reconstructed text will not contain newlines inside compound tokens;
    
      2. If you preserve the original tokenization, you can get better alignment with other linguistic annotations laid on the loaded corpus, such as [Dependency Treebank annotations](https://github.com/EstSyntax/EDT), or [TimeML annotations](https://github.com/soras/EstTimeMLCorpus). Data also gets loaded faster with the original tokenization annotations. The downsides are: there will be no `'compound_tokens'` marked in loaded texts, and there may be more sentence tokenization errors;
   
   
   * **C.** _Loading texts without tokenization, and then restoring the original tokenization later._ If you use parameters `sentence_separator` and `paragraph_separator` with their default values, you can restore the original tokenization after loading the texts. For this:
   
      1. Use a `CompoundTokenTagger` with the initialization parameter `do_not_join_on_strings=['\n', '\n\n']` to ensure that compound tokens will not cross sentence and paragraph boundaries;
      2. Use a `SentenceTokenizer` with the `base_sentence_tokenizer` set to NLTK's `LineTokenizer()`, so that texts will be split into sentences at newlines;
      3. Use the default paragraph tokenizer to create paragraphs according to the original paragraph tokenization.
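
The reason the default separators are enough to restore the original segmentation can be shown without EstNLTK at all. The following plain-Python sketch is not the tagger pipeline from the steps above; the function `spans_from_separators` is hypothetical and only illustrates that sentence and paragraph character spans are fully determined by the newlines in the reconstructed text.

```python
def spans_from_separators(text, sentence_separator='\n', paragraph_separator='\n\n'):
    """Return [((p_start, p_end), [(s_start, s_end), ...]), ...] for the text."""
    paragraphs = []
    p_start = 0
    for p_text in text.split(paragraph_separator):
        sentences = []
        s_start = p_start
        for s_text in p_text.split(sentence_separator):
            sentences.append((s_start, s_start + len(s_text)))
            # The next sentence begins right after the separator
            s_start += len(s_text) + len(sentence_separator)
        paragraphs.append(((p_start, p_start + len(p_text)), sentences))
        p_start += len(p_text) + len(paragraph_separator)
    return paragraphs

# Text in the shape produced with the default separators
text = 'Esimene lause .\nTeine lause .\n\nUus lõik .'
for (p_start, p_end), sentences in spans_from_separators(text):
    print('paragraph:', repr(text[p_start:p_end]))
    for s_start, s_end in sentences:
        print('  sentence:', repr(text[s_start:s_end]))
```

If `sentence_separator=' '` was used during loading, this information is lost, because sentence boundaries can no longer be told apart from word boundaries.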

### Processing whole Estonian Reference Corpus with EstNLTK

If you need to process the whole Estonian Reference Corpus with EstNLTK, you can use the command-line scripts in [**`https://github.com/estnltk/estnltk-workflows`**](https://github.com/estnltk/estnltk-workflows). 
The workflow for processing the Estonian Reference Corpus with EstNLTK and saving the results as JSON files is located at: 
[https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json](https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json). 
For details about processing, check out the `readme.md` file in the folder.

---

## B. Importing texts of the Estonian Web 2013 corpus (_etTenTen 2013 korpus_)

### Getting the corpus

You can download the etTenTen 2013 corpus from here [https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f](https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f). After unpacking the content, you should get one large file with the extension `vert` or `prevert` (e.g. `ettenten13.processed.prevert`).


### Importing Texts

You can use function **`parse_ettenten_corpus_file_iterator`** to iterate over the corpus file and get Text objects yielded one-by-one.
An example:

    from estnltk.corpus_processing.parse_ettenten import parse_ettenten_corpus_file_iterator
    
    # input file
    input_file = "ettenten13.processed.prevert"
   
    # iterate over extracted Text objects   
    for text_obj in parse_ettenten_corpus_file_iterator( input_file ):
        # TODO: do something with the Text object
        ...

**_Layers._** By default, created Text objects have only one annotation layer: `'original_paragraphs'`. The layer contains the original paragraph markings (between `<p>` and `</p>` tags) from the input file. Unlike EstNLTK's `'paragraphs'` layer, which envelops the `'sentences'` layer, `'original_paragraphs'` is a stand-alone layer of the Text object (because sentences are not annotated in the input file).

**_Metadata._** Obtained Text objects also have metadata, stored in the dictionary `text_obj.meta`.

### Details of the `parse_ettenten` module

#### Reconstruction of source text

The etTenTen 2013 corpus contains documents crawled from the web. 
Most of the HTML (and other XML) tags have been removed from the documents, although a few tags remain here and there.
Textual content of the web pages has been preserved along with its segmentation into paragraphs.

The function `parse_ettenten_corpus_file_iterator` collects documents and creates Text objects in a straightforward manner: all paragraphs (texts between `<p>` and `</p>` tags in the original file) are joined with a double newline to reconstruct the text. If there are any other XML/HTML tags between the `<p>` and `</p>` tags, these tags will also be included in the textual content.
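
The reconstruction rule can be sketched with the standard library as follows. This is an illustration under assumptions, not the code of `parse_ettenten_corpus_file_iterator`: the sample assumes a simplified layout in which `<doc ...>`, `<p ...>`, `</p>` and `</doc>` tags each stand on their own line, the attribute names in the sample are made up, and the generator `iter_documents` is hypothetical.

```python
import re

# Made-up sample in an assumed, simplified line-oriented layout
SAMPLE = '''<doc id="1" url="http://example.org/a">
<p heading="1">
Pealkiri
</p>
<p heading="0">
Esimene lõik <b>rasvase</b> sõnaga.
</p>
</doc>
<doc id="2" url="http://example.org/b">
<p heading="0">
Teine dokument.
</p>
</doc>
'''

DOC_START = re.compile(r'<doc\s+([^>]*)>')
ATTRIB = re.compile(r'(\w+)="([^"]*)"')
P_START = re.compile(r'<p(\s[^>]*)?>')

def iter_documents(lines, paragraph_separator='\n\n'):
    """Yield (metadata dict, reconstructed text) for each document, one by one."""
    meta, paragraphs, current = None, [], None
    for line in lines:
        line = line.strip()
        if DOC_START.fullmatch(line):
            meta = dict(ATTRIB.findall(line))  # attribute values stay strings
            paragraphs = []
        elif line == '</doc>':
            yield meta, paragraph_separator.join(paragraphs)
            meta = None
        elif P_START.fullmatch(line):
            current = []
        elif line == '</p>':
            if current:  # paragraphs without content are dropped
                paragraphs.append(' '.join(current))
            current = None
        elif current is not None and line:
            # Other tags inside a paragraph are kept as part of the text
            current.append(line)

docs = list(iter_documents(SAMPLE.splitlines()))
for meta, text in docs:
    print(meta['id'], repr(text))
```

Yielding documents one at a time, rather than collecting them in a list, is what keeps memory use flat on a single very large corpus file.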

#### Signature of the parsing function

  * **`parse_ettenten_corpus_file_iterator`**`(in_file, encoding='utf-8', focus_doc_ids=None, add_tokenization=False, discard_empty_paragraphs=True, store_paragraph_attributes=False, paragraph_separator='\n\n')` -- reads and parses the etTenTen corpus file (`in_file`) and, as it progresses, creates `Text` objects storing documents and metadata, yielding the created `Text` objects one-by-one;

#### Arguments

Exact behaviour of the function can be modified by the following arguments:

* **`in_file`** -- full name of etTenTen corpus file.


* **`encoding`** -- encoding of the input file. Normally, you should go with the default value ('utf-8').


* **`focus_doc_ids`** -- set of document id-s corresponding to the documents which need to be extracted from the `in_file`. **Important:** this should be a set of strings, not a set of integers. If provided, then only documents with the given id-s will be processed, and all other documents will be skipped. If the value is `None` or an empty set, then all documents in the file will be processed (default: `None`).


* **`add_tokenization`** -- Specifies if (full) tokenization will be added to reconstructed texts (default: False). If `add_tokenization==False`, then there will be only one tokenization layer: `'original_paragraphs'`, which will be collected from XML annotations and added as a stand-alone annotation layer of a Text object. If `add_tokenization==True`, then there will be layers `'tokens'`, `'compound_tokens'`, `'words'`, and `'sentences'` (created by EstNLTK's default tokenizers) and layer `'paragraphs'` will be collected from XML annotations and enveloped around EstNLTK's `'sentences'` layer; 


* **`discard_empty_paragraphs`** -- boolean specifying if empty paragraphs (paragraphs without textual content) should be discarded (default: True).



* **`store_paragraph_attributes`** -- boolean specifying if attributes in the paragraph's XML tag will be collected and added as attributes of the corresponding layer (`'original_paragraphs'` or `'paragraphs'`) in a Text object. (default: False).


* **`paragraph_separator`** -- string used as paragraph separator during the reconstruction of the text (default: `'\n\n'`).
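
Since `focus_doc_ids` must contain strings, ids that come from elsewhere as integers need converting first. A minimal sketch, with made-up id values:

```python
# Hypothetical document ids, e.g. read from a spreadsheet as integers
wanted_ids = [686, 1024, 2048]

# Convert to strings, as the argument description requires
focus_doc_ids = {str(doc_id) for doc_id in wanted_ids}

print('686' in focus_doc_ids)   # True
print(686 in focus_doc_ids)     # False: an integer never equals a string
```

A set of integers would therefore match no documents, which is easy to mistake for an empty corpus file.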

### Processing whole etTenTen with EstNLTK

If you need to process the whole etTenTen with EstNLTK, you can use the command-line scripts in [**`https://github.com/estnltk/estnltk-workflows`**](https://github.com/estnltk/estnltk-workflows). 
The workflow for processing etTenTen with EstNLTK and saving the results as JSON files is located at: 
[https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json](https://github.com/estnltk/estnltk-workflows/tree/master/estnltk_workflows/koondkorpus_and_ettenten_to_json). 
For details about processing, check out the `readme.md` file in the folder.

---

## C. Importing texts of the Estonian National Corpus 2017 (_Eesti keele ühendkorpus 2017_)

TODO

---
