# RuMor: Russian Morphology project

## Corpuscula: a python NLP library for corpus processing

## Wrapper for Wikipedia

The package `wikipedia_utils` contains tools that simplify the use of Wikipedia in NLP tasks. So far, Corpuscula supports only the Russian part of Wikipedia.

### Setting a root directory for storing downloaded corpora

```python
from corpuscula import corpus_utils
corpus_utils.set_root_dir(root_dir)
```

NB: this will create or update the config file `.rumor` in your home directory.

If you don't set the root directory, Corpuscula will keep corpora in the directory where it is installed.

The following method returns the currently set root directory:

```python
root_dir = corpus_utils.get_root_dir()
```
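For example, a small sketch that ties the two calls together (the `./corpora` path below is only an illustration, not a default):

```python
from corpuscula import corpus_utils

# Store all downloaded corpora under './corpora' (an example path);
# this creates/updates the .rumor config file in your home directory.
corpus_utils.set_root_dir('./corpora')

# Confirm which root directory is currently configured.
print(corpus_utils.get_root_dir())
```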

### Downloading and removing a Wikipedia dump

```python
from corpuscula import wikipedia_utils
wikipedia_utils.download_wikipedia(lang='RU', root_dir=None, overwrite=True)
wikipedia_utils.remove_wikipedia(lang='RU', root_dir=None)
```

`lang`: the language of the Wikipedia dump to download. Only `lang='RU'` is currently supported.

`root_dir`: an alternative root directory location. By default, the path from the `.rumor` config is used; if the config does not exist, the root directory is the directory where Corpuscula is installed.

`overwrite`: if `True` (default), the corpus is downloaded even if it already exists.
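
A minimal usage sketch, assuming the root directory has already been configured as described above (the parameter values are only illustrative):

```python
from corpuscula import wikipedia_utils

# Fetch the Russian Wikipedia dump into the configured root directory;
# overwrite=False keeps an already downloaded dump instead of fetching it again.
wikipedia_utils.download_wikipedia(lang='RU', overwrite=False)

# If the dump is no longer needed, it can be removed later:
# wikipedia_utils.remove_wikipedia(lang='RU')
```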

### Wrappers for Wikipedia's parts

```python
wiki = wikipedia_utils.Wikipedia(lang='RU', fpath=None, silent=False)
titles = wiki.titles()
articles = wiki.articles()
templates = wiki.templates()
```

Parameters of the constructor:

`lang`: the language of the Wikipedia dump to use. Only `lang='RU'` is currently supported.

`fpath`: path to the Wikipedia dump. If the dump was downloaded to the default location, keep it `None`.

`silent`: if `True`, suppress output.

All methods return iterators of tuples:

* for `Wikipedia.titles()`: `(<article id>, <article title>)`;

* for `Wikipedia.articles()`: `(<article id>, <article title>, <article text>)`;

* for `Wikipedia.templates()`: `(<template id>, <template title>, <template text>)`.
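
For illustration, a minimal sketch that walks the first few articles and prints their ids and titles (the limit of 5 and `silent=True` are only example choices):

```python
from corpuscula import wikipedia_utils

wiki = wikipedia_utils.Wikipedia(lang='RU', silent=True)

# .articles() yields (article id, article title, article text) tuples.
for i, (id_, title, text) in enumerate(wiki.articles()):
    print(id_, title, len(text))
    if i >= 4:  # the dump is large, so stop after the first 5 articles
        break
```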

We expose `.templates()` in case anyone wants to build a parser for Wikipedia articles based on those templates. So far, only the most common templates were used for parsing the articles.

NB: all methods return processed clean text, not CoNLL-U, because CoNLL-U output requires tokenized text. If you need a Wikipedia wrapper with CoNLL-U tokenized output, refer to our Toxine library.