Skip to content

Commit

Permalink
Tamil alphabet (#582)
Browse files Browse the repository at this point in the history
* Create tamil

* Delete tamil

* initial commit for tamil alphabet

* ch formatting

* Add files via upload

* Clarify Grantha script

* Clarify docs
  • Loading branch information
KumaraDilshan authored and kylepjohnson committed Sep 26, 2017
1 parent 3d91c81 commit dfce73b
Show file tree
Hide file tree
Showing 2 changed files with 45 additions and 4 deletions.
12 changes: 8 additions & 4 deletions cltk/corpus/tamil/alphabet.py
@@ -1,10 +1,14 @@
"""The Tamil language has 41 letters 8 VOWELS and 33 CONSTANTS"""
"""The Tamil language has 41 letters 8 VOWELS and 33 CONSTANTS.
`GRANTHA_CONSONANTS` are from the Grantha script, which was used between 6th and 20th century to write Sanskrit and the classical language Manipravalam.
"""

__author__ = ['Dilshan Abeysinghe']
__license__ = 'MIT License. See LICENSE.'

VOWELS=['அ', 'ஆ', 'இ', 'ஈ', 'உ', 'ஊ',' எ', 'ஏ', 'ஐ', 'ஒ', 'ஓ', 'ஔ']

CONSTANTS=['க்', 'ங்', 'ச்', 'ஞ்', 'ட்', 'ண்', 'த்', 'ந்', 'ப்', 'ம்', 'ய்', 'ர்', 'ல்', 'வ்', 'ழ்', 'ள்', 'ற்', 'ன்']
VOWELS = ['அ', 'ஆ', 'இ', 'ஈ', 'உ', 'ஊ',' எ', 'ஏ', 'ஐ', 'ஒ', 'ஓ', 'ஔ']

CONSTANTS = ['க்', 'ங்', 'ச்', 'ஞ்', 'ட்', 'ண்', 'த்', 'ந்', 'ப்', 'ம்', 'ய்', 'ர்', 'ல்', 'வ்', 'ழ்', 'ள்', 'ற்', 'ன்']

GRANTHA_CONSONANTS=['ஜ்', 'ஶ்', 'ஷ்', 'ஸ்', 'ஹ்', 'க்ஷ்']
GRANTHA_CONSONANTS = ['ஜ்', 'ஶ்', 'ஷ்', 'ஸ்', 'ஹ்', 'க்ஷ்']
37 changes: 37 additions & 0 deletions cltk/corpus/tamil/tamil_cltk_documentation.rst
@@ -0,0 +1,37 @@
Tamil
*****

Tamil is a Dravidian language predominantly spoken by the Tamil people of India and Sri Lanka. It is one of the longest-surviving classical languages in the world. A recorded Tamil literature has been documented for over 2000 years. The earliest period of Tamil literature, Sangam literature, is dated from ca. 300 BC – AD 300. It has the oldest extant literature among Dravidian languages. The earliest epigraphic records found on rock edicts and hero stones date from around the 3rd century BC. More than 55% of the epigraphical inscriptions (about 55,000) found by the Archaeological Survey of India are in the Tamil language. Tamil language inscriptions written in Brahmi script have been discovered in Sri Lanka, and on trade goods in Thailand and Egypt. (Source: `Wikipedia <https://en.wikipedia.org/wiki/Tamil_language>`_)


Alphabet
=========

.. code-block:: python
In [1]: from cltk.corpus.tamil.alphabet import VOWELS, CONSTANTS, GRANTHA_CONSONANTS
In [2]: print(VOWELS)
Out[2]: ['', '', '', '', '', '','', '', '', '', '', '']
In [3]: print(CONSONANTS)
Out[3]: ['க்', 'ங்', 'ச்', 'ஞ்', 'ட்', 'ண்', 'த்', 'ந்', 'ப்', 'ம்', 'ய்', 'ர்', 'ல்', 'வ்', 'ழ்', 'ள்', 'ற்', 'ன்']
In [4]: print(GRANTHA_CONSONANTS)
Out[4]: ['ஜ்', 'ஶ்', 'ஷ்', 'ஸ்', 'ஹ்', 'க்ஷ்']
Corpora
=======

Use ``CorpusImporter()`` or browse the `CLTK GitHub organization <https://github.com/cltk>`_ (anything beginning with ``tamil_``) to discover available tamil corpora.

.. code-block:: python
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('tamil')
In [3]: c.list_corpora
Out[3]: ['tamil_text_ptr_tipitaka']

0 comments on commit dfce73b

Please sign in to comment.