Permalink
Cannot retrieve contributors at this time
Transliteration | |
=============== | |
Transliteration is the conversion of a text from one script to another. | |
For instance, a Latin transliteration of the Greek phrase "Ελληνική | |
Δημοκρατία", usually translated as 'Hellenic Republic', is "Ellēnikḗ | |
Dēmokratía". | |
.. code:: python | |
from polyglot.transliteration import Transliterator | |
Languages Coverage | |
------------------ | |
.. code:: python | |
from polyglot.downloader import downloader | |
print(downloader.supported_languages_table("transliteration2")) | |
.. parsed-literal:: | |
1. Haitian; Haitian Creole 2. Tamil 3. Vietnamese | |
4. Telugu 5. Croatian 6. Hungarian | |
7. Thai 8. Kannada 9. Tagalog | |
10. Armenian 11. Hebrew (modern) 12. Turkish | |
13. Portuguese 14. Belarusian 15. Norwegian Nynorsk | |
16. Norwegian 17. Dutch 18. Japanese | |
19. Albanian 20. Bulgarian 21. Serbian | |
22. Swahili 23. Swedish 24. French | |
25. Latin 26. Czech 27. Yiddish | |
28. Hindi 29. Danish 30. Finnish | |
31. German 32. Bosnian-Croatian-Serbian 33. Slovak | |
34. Persian 35. Lithuanian 36. Slovene | |
37. Latvian 38. Bosnian 39. Gujarati | |
40. Italian 41. Icelandic 42. Spanish; Castilian | |
43. Ukrainian 44. Georgian 45. Urdu | |
46. Indonesian 47. Marathi (Marāṭhī) 48. Korean | |
49. Galician 50. Khmer 51. Catalan; Valencian | |
52. Romanian, Moldavian, ... 53. Basque 54. Macedonian | |
55. Russian 56. Azerbaijani 57. Chinese | |
58. Estonian 59. Welsh 60. Arabic | |
61. Bengali 62. Amharic 63. Irish | |
64. Malay 65. Afrikaans 66. Polish | |
67. Greek, Modern 68. Esperanto 69. Maltese | |
Downloading Necessary Models | |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
.. code:: python | |
%%bash | |
polyglot download embeddings2.en transliteration2.ar | |
.. parsed-literal:: | |
[polyglot_data] Downloading package embeddings2.en to | |
[polyglot_data] /home/rmyeid/polyglot_data... | |
[polyglot_data] Package embeddings2.en is already up-to-date! | |
[polyglot_data] Downloading package transliteration2.ar to | |
[polyglot_data] /home/rmyeid/polyglot_data... | |
[polyglot_data] Package transliteration2.ar is already up-to-date! | |
Example | |
------- | |
We tag each word in the text with one part of speech. | |
.. code:: python | |
from polyglot.text import Text | |
.. code:: python | |
blob = """We will meet at eight o'clock on Thursday morning.""" | |
text = Text(blob) | |
We can query all the tagged words | |
.. code:: python | |
for x in text.transliterate("ar"): | |
print(x) | |
.. parsed-literal:: | |
وي | |
ويل | |
ميت | |
ات | |
ييايت | |
أوكلوك | |
ون | |
ثورسداي | |
مورنينغ | |
Command Line Interface | |
~~~~~~~~~~~~~~~~~~~~~~ | |
.. code:: python | |
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en transliteration --target ar | tail -n 30 | |
.. parsed-literal:: | |
which ويكه | |
India ينديا | |
beat بيت | |
Bermuda بيرمودا | |
in ين | |
Port بورت | |
of وف | |
Spain سباين | |
in ين | |
2007 | |
, | |
which ويكه | |
was واس | |
equalled يكالليد | |
five فيفي | |
days دايس | |
ago اغو | |
by بي | |
South سووث | |
Africa افريكا | |
in ين | |
their ثير | |
victory فيكتوري | |
over وفير | |
West ويست | |
Indies يندييس | |
in ين | |
Sydney سيدني | |
. | |
Citation | |
-------- | |
This work is a direct implementation of the research being described in | |
the `False-Friend Detection and Entity Matching via Unsupervised | |
Transliteration <https://arxiv.org/abs/1611.06722>`__ paper. The author | |
of this library strongly encourage you to cite the following paper if | |
you are using this software. | |
:: | |
@article{chen2016false, | |
title={False-Friend Detection and Entity Matching via Unsupervised Transliteration}, | |
author={Chen, Yanqing and Skiena, Steven}, | |
journal={arXiv preprint arXiv:1611.06722}, | |
year={2016}} |