Downloads Wikipedia articles in Russian and Armenian.
Converts the text to character level for both languages (set char_level to False to skip this conversion).
Generates transliterations from the source_lang corpus into each of the target_langs (set translit to False to skip transliteration).
Each row has 150 characters for Armenian and 100 characters for Russian (rows_len).
For Armenian, count is the total number of rows to download; for Russian, count is the total number of characters to download. To switch count from rows to characters, set char: True.

output_folder: "......"
pairs:
  -
    source_lang: "hy"
    target_langs:
      - "en"
      - "ru"
    char_level: True
    translit: True
    rows_len: 150
    count: 5000
  -
    source_lang: "ru"
    target_langs:
      - "en"
    char_level: True
    translit: True
    rows_len: 100
    count: 15000
    char: True
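
The interaction between rows_len, count and char can be checked with a little arithmetic. The sketch below is not code from this repository: plan_download and split_rows are hypothetical helpers, and rounding a trailing partial row up is an assumption about how the tool behaves.

```python
import math


def plan_download(pair):
    """Return (rows, chars) implied by one entry of `pairs`.

    Assumption: with char: True, `count` is a character budget and the
    number of rows is that budget divided by rows_len, rounded up.
    """
    rows_len = pair["rows_len"]
    count = pair["count"]
    if pair.get("char", False):
        # count is measured in characters
        return math.ceil(count / rows_len), count
    # count is measured in rows
    return count, count * rows_len


def split_rows(text, rows_len):
    """Cut text into consecutive rows of at most rows_len characters."""
    return [text[i:i + rows_len] for i in range(0, len(text), rows_len)]


# The two entries from the configuration above, reduced to the relevant keys.
pairs = [
    {"source_lang": "hy", "rows_len": 150, "count": 5000},
    {"source_lang": "ru", "rows_len": 100, "count": 15000, "char": True},
]

for pair in pairs:
    rows, chars = plan_download(pair)
    print(pair["source_lang"], rows, chars)
# hy 5000 750000
# ru 150 15000
```

Under these assumptions the Armenian entry asks for far more text (750000 characters) than the Russian one (15000 characters), even though its count value is smaller.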

Language codes

Use the ISO 639-1 or 639-2 language code shown in the "WP code" column of the List of Wikipedias.
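
As an illustration of why the WP code matters: it is also the subdomain of the corresponding Wikipedia. The helper below is a hypothetical sketch, not part of this repository, and the two-or-three-letter check is an assumption based on the ISO 639-1/639-2 requirement above. How the connectors in this repository actually reach Wikipedia is not shown here.

```python
import re

# Assumption: only 2- or 3-letter lowercase codes are accepted,
# matching the ISO 639-1 / 639-2 requirement stated above.
CODE_RE = re.compile(r"^[a-z]{2,3}$")


def wikipedia_api_url(code):
    """Build the MediaWiki API endpoint for a Wikipedia language code."""
    if not CODE_RE.match(code):
        raise ValueError(f"not a 2- or 3-letter language code: {code!r}")
    return f"https://{code}.wikipedia.org/w/api.php"


print(wikipedia_api_url("hy"))  # https://hy.wikipedia.org/w/api.php
print(wikipedia_api_url("ru"))  # https://ru.wikipedia.org/w/api.php
```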

How to add a new language pair for translit generation

To generate transliterations for a language that is not supported yet, create a mapping file in JSON format. Keys must be the letters of the source language alphabet, and each value must be a list of candidate character sequences in the target language.
Here is an example for Russian -> English:

{
    "А": ["A"],
    "Б": ["B"],
    "В": ["V", "W"],
    "Г": ["G"],
    "Д": ["D"],
    "Е": ["E", "YE", "Ye"],
    "Ё": ["YO", "Yo", "E", "IO", "Io", "JO", "Jo"]
}
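
To make the role of the candidate lists concrete, here is a minimal sketch of how such a mapping could be applied. It is an assumption for illustration only: transliterate is a hypothetical function, picking a random candidate per letter is a guess, and the logic in translit_generator.py may differ.

```python
import random

# The example mapping from above (Russian -> English, first letters only).
MAPPING = {
    "А": ["A"],
    "Б": ["B"],
    "В": ["V", "W"],
    "Г": ["G"],
    "Д": ["D"],
    "Е": ["E", "YE", "Ye"],
    "Ё": ["YO", "Yo", "E", "IO", "Io", "JO", "Jo"],
}


def transliterate(text, mapping, rng=None):
    """Replace each mapped character with one of its candidates.

    Assumption: a candidate is chosen at random, and characters that
    have no entry in the mapping are copied through unchanged.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        candidates = mapping.get(ch)
        out.append(rng.choice(candidates) if candidates else ch)
    return "".join(out)


# Every letter here has a single candidate, so the result is deterministic.
print(transliterate("БАГ", MAPPING))  # BAG

# "В" has two candidates, so this prints either "VEB" style variants
# depending on the random choice for "В" and "Е".
print(transliterate("ВЕБ", MAPPING, random.Random(0)))
```

Because the example mapping only lists uppercase keys, lowercase input would pass through unchanged in this sketch; a complete mapping file would need entries for both cases.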

After creating the mapping file, move it into the following directory and run main.py.
Note: We would appreciate it if you shared your mappings with us.
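
Before moving a new mapping file into place, it may help to check that it has the shape described above. The checker below is a hypothetical sketch, not a tool shipped with this repository; the exact rules main.py enforces are not documented here, so the checks only cover what the description states (an object whose values are lists of target-language strings).

```python
import json


def validate_mapping(mapping):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(mapping, dict):
        return ["top-level JSON value must be an object"]
    problems = []
    for key, values in mapping.items():
        if not key:
            problems.append("empty key")
        if not isinstance(values, list) or not values:
            problems.append(f"value for {key!r} must be a non-empty list")
        elif not all(isinstance(v, str) and v for v in values):
            problems.append(f"candidates for {key!r} must be non-empty strings")
    return problems


# A mapping file would normally be read with json.load(open(path, encoding="utf-8"));
# a literal string is used here so the sketch runs without any file.
sample = json.loads('{"А": ["A"], "В": ["V", "W"]}')
print(validate_mapping(sample))  # []

broken = json.loads('{"Г": [], "Д": "D"}')
for problem in validate_mapping(broken):
    print(problem)
```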

deepchar - Data extraction and translit generation
