Working with Estonian and Võru Wikipedia
========================================

Wikipedia is a free-access, free-content Internet encyclopedia,
supported and hosted by the non-profit Wikimedia Foundation. Those who
can access the site can edit most of its articles, with the expectation
that they follow the website's policies. Wikipedia is ranked among the
ten most popular websites and constitutes the Internet's largest and
most popular general reference work.

The Estonian version of Wikipedia has over 130 000 articles as of 2015.
The Võru dialect also has its own version, containing about 5000 articles.

Downloading the Wikipedia dumps
-------------------------------

Latest Estonian Wikipedia:

<http://dumps.wikimedia.org/etwiki/latest/etwiki-latest-pages-articles.xml.bz2>

Latest Võru dialect Wikipedia:

<http://dumps.wikimedia.org/fiu_vrowiki/latest/fiu_vrowiki-latest-pages-articles.xml.bz2>
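
If you would rather fetch and unpack the dumps from Python than by hand, a minimal sketch could look like the one below. The helper names `dump_url`, `download` and `decompress_bz2` are invented for this illustration and are not part of Estnltk; only the URL pattern is taken from the two links above.

```python
import bz2
import shutil
import urllib.request

# URL pattern taken from the two links above; `wiki` is e.g. "etwiki" or "fiu_vrowiki".
DUMP_URL = "http://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"


def dump_url(wiki):
    """Build the URL of the latest pages-articles dump (hypothetical helper)."""
    return DUMP_URL.format(wiki=wiki)


def download(wiki, target_path):
    """Stream the compressed dump to target_path without loading it into memory."""
    with urllib.request.urlopen(dump_url(wiki)) as response, open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)


def decompress_bz2(source_path, target_path):
    """Decompress a .bz2 file to target_path, again as a stream."""
    with bz2.open(source_path, "rb") as src, open(target_path, "wb") as out:
        shutil.copyfileobj(src, out)
```

For example, `download("fiu_vrowiki", "wikidump/fiu_vrowiki-latest-pages-articles.xml.bz2")` would fetch the Võru dump, provided the `wikidump` folder already exists.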

It takes some work to turn the dumps into a usable form, so if you don't
want to do all of this yourself, you can download fully prepared (but
older) articles (see the section "Downloading the processed dumps").

Extracting articles from XML files
----------------------------------

Let's assume you have downloaded both the Estonian and Võru Wikipedia
into the `wikidump` subfolder and extracted the `.xml` files, so that you
have two files:

    wikidump/etwiki-latest-pages-articles.xml
    wikidump/fiu_vrowiki-latest-pages-articles.xml.bz2

Estnltk comes with a tool that can extract all the articles from the XML
files and store them as JSON:

    $ python3 -m estnltk.wiki.parser -h

    usage: parser.py [-h] [-v] D I

    Parse Estonian Wikipedia dump file to Article Name.json files in a specified
    folder

    positional arguments:
      D              full path to output directory for the json files
      I              wikipedia dump file full path

    optional arguments:
      -h, --help     show this help message and exit
      -v, --verbose  Print written article titles and count.

To use it, let's create separate subfolders for the Estonian and Võru
articles:

    mkdir wikidump/eesti
    mkdir wikidump/voru

And run the parser:

    python3 -m estnltk.wiki.parser wikidump/eesti/ wikidump/etwiki-latest-pages-articles.xml
    python3 -m estnltk.wiki.parser wikidump/voru/ wikidump/fiu_vrowiki-latest-pages-articles.xml.bz2
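
The same two invocations can also be scripted. The sketch below only wraps the command line shown above in `subprocess`; `parser_command`, `run_parser` and `JOBS` are hypothetical names chosen for this example, and it assumes Estnltk is installed for the interpreter that runs the script.

```python
import subprocess
import sys


def parser_command(output_dir, dump_path, verbose=False):
    """Build the command line shown above: output directory first, dump file second."""
    command = [sys.executable, "-m", "estnltk.wiki.parser"]
    if verbose:
        command.append("-v")  # the -v/--verbose flag from the help text
    command += [output_dir, dump_path]
    return command


def run_parser(output_dir, dump_path, verbose=False):
    """Run the parser in a child process; raises CalledProcessError if it fails."""
    subprocess.run(parser_command(output_dir, dump_path, verbose), check=True)


# Folder and file names exactly as used in this section.
JOBS = [
    ("wikidump/eesti/", "wikidump/etwiki-latest-pages-articles.xml"),
    ("wikidump/voru/", "wikidump/fiu_vrowiki-latest-pages-articles.xml.bz2"),
]
```

Calling `run_parser(output_dir, dump_path)` for each pair in `JOBS` then reproduces the two commands above.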

As a result, there will be many `.json` files with the structure described
in the "Json structure" section below.

NB! See the section "Converting articles to Estnltk JSON" on how
to access the articles using Estnltk.

### Json structure

The basic structure of an article.json:

### Sections

The first section is always the introduction and doesn't have a title.

Sections are nested: if a section has subsections, they can
be accessed like this:

    obj['sections'][0]['sections']
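
Because sections can nest to any depth, a recursive walk is a natural way to visit all of them. The following is a minimal sketch: the example article is hand-made rather than taken from a real dump, and the `title` and `text` key names are assumptions; only the nested `sections` key comes from the access pattern above.

```python
def iter_sections(sections, depth=0):
    """Yield (depth, section) pairs for all sections, depth-first."""
    for section in sections:
        yield depth, section
        # Subsections live under the same 'sections' key, one level down.
        yield from iter_sections(section.get("sections", []), depth + 1)


# Hand-made example data; 'title' and 'text' are assumed key names.
article = {
    "sections": [
        {"text": "Lead paragraph."},  # the introduction has no title
        {
            "title": "History",
            "sections": [{"title": "Early history"}, {"title": "Modern era"}],
        },
    ]
}

for depth, section in iter_sections(article["sections"]):
    # Indent by nesting depth; fall back to a placeholder for the untitled introduction.
    print("  " * depth + section.get("title", "(introduction)"))
```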

### Other

Other elements include objects such as Wikipedia templates, in the form of:

### References

If there are references, they are added as a top-level field:

Each section that has references also has a reference field in the
form of:

### Internal Links

Internal links point to articles under et.wikipedia.org/wiki/. Link parsing
works as long as the brackets are balanced, which they are 99.99% of the
time. On rare occasions (about 1 in 15 000 files), internal links inside
external link labels are not balanced correctly. The parser simply ignores
these.
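
To make the notion of "balanced" concrete, here is a small sketch that checks whether every `[[` in a piece of wiki markup is closed by a matching `]]`. It is an illustration only, not the check the Estnltk parser actually performs, and `brackets_balanced` is a hypothetical helper name.

```python
def brackets_balanced(wikitext):
    """Return True if every '[[' is closed by a ']]' and no ']]' comes too early."""
    depth = 0
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "[[":
            depth += 1
            i += 2
        elif pair == "]]":
            depth -= 1
            if depth < 0:
                return False  # a closing pair without a matching opener
            i += 2
        else:
            i += 1
    return depth == 0
```

A file for which this returns `False` would be one of the rare cases mentioned above.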

### Text formatting

Bold, italics and bullet lists are marked in the dump, but are reformatted
as plain text in the JSON. Quotes and newlines are preserved.

### Tables

Tables are stored under the corresponding section, separated from the text
but left unparsed (the JSON has `\n` instead of an actual newline):

### Images

Images are also stored under the corresponding section. Links (both
internal and external) are extracted from the image text:

Converting articles to Estnltk JSON
-----------------------------------

The JSON files produced by `estnltk.wiki.parser` contain more
structural data than can be represented by Estnltk's
**Text** class, so you cannot directly use this JSON to
initialize **Text** instances.

In the section "Extracting articles from XML files", we created two folders:

    wikidump/voru
    wikidump/eesti

containing article JSON files extracted from the Estonian and Võru dialect
Wikipedia. Let's create two more subfolders:

    corpora/voru
    corpora/eesti

where we will store the converted JSON files. The script
`estnltk.wiki.convert` can be used for the job:

    python3 -m estnltk.wiki.convert wikidump/voru/ corpora/voru/
    python3 -m estnltk.wiki.convert wikidump/eesti corpora/eesti/

As a result, the folders contain a large number of files in JSON format
that can be used with the Estnltk **Text** class. Note that
they contain only plain text with unique data from the article dumps. No
tokenization, named entity extraction, or any other processing has been done.
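
To process the converted files in bulk, you can iterate over a folder and load each file with the standard `json` module. The sketch below assumes the files are UTF-8 encoded and sit directly inside the folder; `iter_articles` is a hypothetical helper, not an Estnltk function.

```python
import json
from pathlib import Path


def iter_articles(folder):
    """Yield (path, article_dict) for every .json file directly inside folder."""
    for path in sorted(Path(folder).glob("*.json")):
        # Assumption: the converter writes UTF-8 encoded JSON.
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)
```

For instance, `for path, article in iter_articles("corpora/voru"): print(path.name, len(article["text"]))` would list every converted Võru article with the length of its plain text. Each loaded dictionary is the data you would then hand over to **Text**.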


### Structure

The top-level layers are: data, external\_links, internal\_links,
sections, text. Data contains the categories, the (list of) references,
infobox, timestamp, title and url.

Links are now top-level and have been recalculated to point into the whole
concatenated article text, that is, the obj[text] level.

Sections contain the start and end point of each section, its title,
images and references, but not the section text itself.

Text is a separate layer: all the sections concatenated together with
their section titles.
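
Since links and sections only store positions in the concatenated text, recovering the text they cover comes down to slicing. The following sketch uses a hand-made article rather than real converter output. It assumes the offsets are stored under the keys `start` and `end` and that `end` is exclusive, which you should verify against your own files.

```python
def span_text(article, item):
    """Return the part of the concatenated article text covered by a link or section."""
    # Assumption: 'start' is inclusive and 'end' is exclusive, like Python slices.
    return article["text"][item["start"]:item["end"]]


# Hand-made example data; 'start', 'end' and 'title' are assumed key names.
article = {
    "text": "Tartu is a city in Estonia.",
    "internal_links": [{"start": 19, "end": 26, "title": "Estonia"}],
    "sections": [{"start": 0, "end": 27}],
}

link = article["internal_links"][0]
print(span_text(article, link))
```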

Downloading the processed dumps
-------------------------------

Just in case you do not want to extract the articles yourself, here are
the links to processed files from dumps downloaded on Sep 7 2015.

Estonian Wikipedia articles:
<http://ats.cs.ut.ee/keeletehnoloogia/estnltk/wiki_articles/eesti.zip>

Võru dialect Wikipedia articles:
<http://ats.cs.ut.ee/keeletehnoloogia/estnltk/wiki_articles/voru.zip>