g2p ID: Indonesian Grapheme-to-Phoneme Converter

This library converts Indonesian (Bahasa Indonesia) graphemes (words) to phonemes in IPA. We followed the methods and design of the equivalent English library, g2p.

Installation

pip install g2p_id_py

How to Use

from g2p_id import G2p

texts = [
    "Apel itu berwarna merah.",
    "Rahel bersekolah di S M A Jakarta 17.",
    "Mereka sedang bermain bola di lapangan.",
]

g2p = G2p()
for text in texts:
    print(g2p(text))

>> [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'], ['m', 'e', 'r', 'a', 'h'], ['.']]
>> [['r', 'a', 'h', 'e', 'l'], ['b', 'ə', 'r', 's', 'ə', 'k', 'o', 'l', 'a', 'h'], ['d', 'i'], ['e', 's'], ['e', 'm'], ['a'], ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'], ['t', 'u', 'dʒ', 'u', 'h'], ['b', 'ə', 'l', 'a', 's'], ['.']]
>> [['m', 'ə', 'r', 'e', 'k', 'a'], ['s', 'ə', 'd', 'a', 'ŋ'], ['b', 'ə', 'r', 'm', 'a', 'i', 'n'], ['b', 'o', 'l', 'a'], ['d', 'i'], ['l', 'a', 'p', 'a', 'ŋ', 'a', 'n'], ['.']]
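As the output above shows, each sentence comes back as a list of words, and each word as a list of phoneme strings, with punctuation kept as its own entry. When a flat transcription is more convenient, the nested list can be joined. The sketch below is only an illustration: `to_phoneme_string` is a hypothetical helper, not part of g2p_id, and it works on a copy of the first output shown above so that it runs without the library installed.

```python
def to_phoneme_string(words, phoneme_sep="", word_sep=" "):
    """Join nested g2p output (a list of words, each a list of phonemes) into one string."""
    return word_sep.join(phoneme_sep.join(word) for word in words)


# Output copied from the first example sentence ("Apel itu berwarna merah.").
output = [
    ['a', 'p', 'ə', 'l'],
    ['i', 't', 'u'],
    ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'],
    ['m', 'e', 'r', 'a', 'h'],
    ['.'],
]

print(to_phoneme_string(output))                    # phonemes run together per word
print(to_phoneme_string(output, phoneme_sep=" ", word_sep=" | "))  # phonemes kept apart
```

Keeping a separator between phonemes is the safer choice if the string will be split again later, since multi-character symbols such as dʒ are otherwise ambiguous.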

Algorithm

The pipeline is heavily inspired by the English g2p:

1. Spells out Arabic numerals and some currency symbols, e.g. Rp 200,000 -> dua ratus ribu rupiah. This step is borrowed from Cahya's code.
2. Attempts to retrieve the correct pronunciation of homographs based on their POS (part-of-speech) tags.
3. Looks up non-homographs in a lexicon (pronunciation dictionary). The lexicon originally comes from ipa-dict, and we later made a modified version.
4. Predicts the pronunciation of out-of-vocabulary (OOV) words using either a BERT model or an LSTM model.
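The control flow of the four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the tables `DIGITS`, `HOMOGRAPHS` and `LEXICON` are toy data invented for this example, `toy_tagger` stands in for a real POS tagger, and `fallback_predict` is a placeholder for the BERT/LSTM model. Punctuation handling is left out.

```python
import re

# Toy data for illustration only; not taken from the library's lexicon.
DIGITS = {"0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
          "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan", "9": "sembilan"}
HOMOGRAPHS = {  # word -> {POS tag -> phonemes}; tags are assumed
    "mental": {"ADJ": ["m", "e", "n", "t", "a", "l"],
               "VERB": ["m", "ə", "n", "t", "a", "l"]},
}
LEXICON = {"dua": ["d", "u", "a"], "bola": ["b", "o", "l", "a"]}


def normalize(text):
    """Step 1 (toy version): spell out standalone single digits only."""
    spelled = re.sub(r"\b[0-9]\b", lambda m: f" {DIGITS[m.group()]} ", text.lower())
    return spelled.split()


def fallback_predict(word):
    """Placeholder for the neural OOV model: one phoneme per letter."""
    return list(word)


def g2p_sketch(text, tagger, predict_oov=fallback_predict):
    words = normalize(text)
    tags = tagger(words)  # one POS tag per word
    result = []
    for word, tag in zip(words, tags):
        if word in HOMOGRAPHS and tag in HOMOGRAPHS[word]:
            result.append(HOMOGRAPHS[word][tag])   # step 2: homograph by POS
        elif word in LEXICON:
            result.append(LEXICON[word])           # step 3: lexicon lookup
        else:
            result.append(predict_oov(word))       # step 4: OOV prediction
    return result


def toy_tagger(words):
    """Hypothetical tagger: treats 'mental' as a verb, everything else as a noun."""
    return ["VERB" if w == "mental" else "NOUN" for w in words]


print(g2p_sketch("2 bola mental", toy_tagger))
```

The order of the checks matters: homographs are resolved before the general lexicon lookup, and the model is consulted only for words that neither table covers.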

Phoneme and Grapheme Sets

graphemes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']
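Because the phoneme set contains two-character symbols (tʃ and dʒ), a flat IPA string cannot simply be split character by character. One way to recover the phoneme sequence is a greedy longest-match scan over the inventory, sketched below. `split_phonemes` is a hypothetical helper written for this example and is not provided by the library.

```python
PHONEMES = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
            'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ',
            'ʃ', 'dʒ', 'x', 'ʔ']


def split_phonemes(ipa, inventory=PHONEMES):
    """Split an IPA string into phonemes, preferring the longest matching symbol."""
    # Longer symbols first, so 'dʒ' is tried before 'd'.
    symbols = sorted(inventory, key=len, reverse=True)
    result, i = [], 0
    while i < len(ipa):
        for sym in symbols:
            if ipa.startswith(sym, i):
                result.append(sym)
                i += len(sym)
                break
        else:
            # No symbol in the inventory matches at this position.
            raise ValueError(f"unknown symbol at position {i}: {ipa[i]!r}")
    return result


print(split_phonemes("dʒakarta"))
```

Greedy matching always reads t followed by ʃ as the single phoneme tʃ, so it cannot represent a genuine t + ʃ sequence; keeping the list-of-phonemes output avoids that ambiguity altogether.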

Implementation Details

You can find more details on how we handle homographs and out-of-vocabulary (OOV) prediction on our documentation page.

References

@misc{g2pE2019,
  author = {Park, Kyubyong and Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}

@misc{TextProcessor2021,
  author = {Cahya Wirawan},
  title = {Text Processor},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cahya-wirawan/text_processor}}
}

