# Wiktionary dump parser

## Preparations
* Read the [wiki-parser](wiki-parser.ipynb) notebook first.
* Download the [ruwiktionary-latest-pages-articles.xml.bz2](https://dumps.wikimedia.org/ruwiktionary/latest/ruwiktionary-latest-pages-articles.xml.bz2) dump file.
* Open Python 3, and let's begin.

## Programming

* Let's copy the necessary code from [wiki-parser](wiki-parser.ipynb):

In [1]:
from bz2 import BZ2File
import xml.etree.ElementTree as etree

def strip_tag_name(t):
    # ElementTree reports namespaced tags as "{namespace}name"; keep only the name
    idx = t.rfind("}")
    if idx != -1:
        return t[idx + 1:]
    else:
        return t

def read_wiki_dump(bz2_dump_path):
    with BZ2File(bz2_dump_path) as xml_file:
        for event, elem in etree.iterparse(xml_file, events=("start", "end")):
            tname = strip_tag_name(elem.tag)
            if event == "start":
                # We will read only "page" nodes
                if tname == "page":
                    # Init necessary fields
                    title = ""
                    redirect = ""
                    ns = 0
                    text = ""
            else:
                # Assign fields values
                if tname == "title":
                    title = elem.text
                elif tname == "redirect":
                    redirect = elem.attrib["title"]
                elif tname == "ns":
                    ns = int(elem.text)
                elif tname == "text":
                    # An empty <text/> element has no text; fall back to "" so later substring checks work
                    text = elem.text or ""
                elif tname == "page":
                    # Yield fields
                    yield title, redirect, ns, text
                elem.clear()
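
The reason `strip_tag_name` is needed: ElementTree reports every tag of a namespaced document as `{namespace}name`. Below is a minimal sketch of that behaviour on a tiny hand-written document. The `SAMPLE` XML and its namespace URI are illustrative assumptions, not taken from the real dump.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical miniature of a MediaWiki export; the namespace URI is an assumption
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>example</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

def strip_tag_name(t):
    # Drop the "{namespace}" prefix if there is one
    idx = t.rfind("}")
    return t[idx + 1:] if idx != -1 else t

# Collect tags in the order their closing tags are seen
raw_tags = []
for event, elem in etree.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    raw_tags.append(elem.tag)

stripped_tags = [strip_tag_name(t) for t in raw_tags]
print(raw_tags[0])     # {http://www.mediawiki.org/xml/export-0.10/}title
print(stripped_tags)   # ['title', 'ns', 'page', 'mediawiki']
```

Child elements close before their parents, which is why `read_wiki_dump` can collect `title`, `ns` and `text` first and yield once the `page` end event arrives.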

* Let's read the file:

In [2]:
from itertools import islice

XML_BZ2_FILE_PATH = "J:\\ruwiktionary-latest-pages-articles.xml.bz2"
SLICE_SIZE = 30

def is_article(title, redirect, ns, text):
    return ns == 0 and len(redirect) == 0

for title, redirect, ns, text in islice(
        filter(
            lambda it: is_article(*it), 
            read_wiki_dump(XML_BZ2_FILE_PATH)
        ), SLICE_SIZE):
    print(title)

Заглавная страница
ё
Фемиксира
эбонитовый
а
Ба
в
да
ахинея
агат
Новосибирск
мальчик
публицист
Хвост
химия
агент
heavy duty
abuse
acceptance test
activity
code smell
smell
Elbonia
code review
reception
focus
workshop
review
follow up
framework
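
The `is_article` filter keeps pages in namespace 0 (the main namespace) that are not redirects, which is why "Заглавная страница" (the main page) still shows up in the list. Here is a small sketch of the filter applied to hand-made tuples. The sample pages are hypothetical and only mimic the `(title, redirect, ns, text)` shape yielded by `read_wiki_dump`.

```python
def is_article(title, redirect, ns, text):
    return ns == 0 and len(redirect) == 0

# Hypothetical pages in the (title, redirect, ns, text) shape
pages = [
    ("агат", "", 0, "wikitext of a regular entry"),       # kept
    ("Агат", "агат", 0, "wikitext of a redirect page"),   # dropped: redirect target is set
    ("Шаблон:сущ ru", "", 10, "wikitext of a template"),  # dropped: namespace 10 (templates)
]

kept = [title for title, redirect, ns, text in pages
        if is_article(title, redirect, ns, text)]
print(kept)  # ['агат']
```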


* Let's select nouns only:

In [3]:
def is_noun(title, redirect, ns, text):
    return is_article(title, redirect, ns, text) and "= {{-ru-}} =" in text and "{{сущ ru" in text

for title, redirect, ns, text in islice(
        filter(
            lambda it: is_noun(*it), 
            read_wiki_dump(XML_BZ2_FILE_PATH)
        ), SLICE_SIZE):
    print(title)

ё
Фемиксира
а
Ба
ахинея
агат
Новосибирск
мальчик
публицист
Хвост
химия
агент
день
деньги
Еда
Банда
я
Перл
Баба
Близкий
Израиль
Дама
Дева
Сон
беда
цель
объект
Ватикан
Италия
Бог
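
The noun check is a plain substring heuristic: the page must contain a Russian language section header and a template whose name starts with `сущ ru`. The sketch below runs the same two checks on shortened wikitext samples. The samples and the helper name `looks_like_russian_noun` are illustrative assumptions, not copied from the dump or from the notebook.

```python
def looks_like_russian_noun(text):
    # Same two substring checks as is_noun, without the article filter
    return "= {{-ru-}} =" in text and "{{сущ ru" in text

# Hypothetical, heavily shortened wikitext samples
samples = {
    "russian_noun": "= {{-ru-}} =\n{{сущ ru m ina 1a\n|основа=ага́т\n}}",
    "russian_adjective": "= {{-ru-}} =\n{{прил ru 1a\n}}",
    "english_noun": "= {{-en-}} =\n{{сущ en\n}}",
}

results = {name: looks_like_russian_noun(text) for name, text in samples.items()}
print(results)
```

Because the two markers are searched for independently, a page passes whenever both appear anywhere in its text, so the result is best treated as an approximation.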


* Let's save all nouns to a file (excluding single-character words):

In [4]:
NOUNS_DICT_PATH = "J:\\ru-nouns.txt"

def is_noun(title, redirect, ns, text):
    # Same check as before, plus: skip single-character titles
    return (is_article(title, redirect, ns, text)
            and len(title) > 1
            and "= {{-ru-}} =" in text
            and "{{сущ ru" in text)

# Python 3's built-in open() handles the encoding; codecs.open is not needed
with open(NOUNS_DICT_PATH, mode="w", encoding="utf-8") as fp:
    for title, redirect, ns, text in filter(
            lambda it: is_noun(*it),
            read_wiki_dump(XML_BZ2_FILE_PATH)
        ):
        # Titles are lower-cased on output
        print(title.lower(), file=fp)
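
Since titles are lower-cased when written, two pages that differ only in case would produce the same line. A minimal sketch of reading the word list back into a set, which collapses such repeats, is shown below. The `load_nouns` helper and the temporary demo file are assumptions for illustration; a real run would point at the saved noun file instead.

```python
import os
import tempfile

def load_nouns(path):
    # One word per line; a set drops repeated lines
    with open(path, encoding="utf-8") as fp:
        return {line.strip() for line in fp if line.strip()}

# Demo on a throwaway file that mimics the saved format
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ru-nouns.txt")
    with open(demo_path, "w", encoding="utf-8") as fp:
        for title in ["Хвост", "хвост", "агат"]:
            print(title.lower(), file=fp)
    nouns = load_nouns(demo_path)

print(sorted(nouns))  # ['агат', 'хвост']
```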