cleanup arabic rst

kylepjohnson committed Sep 22, 2017
1 parent 101faec commit beda4454e1284930004f0105b64b3eca21096193
Showing with 92 additions and 111 deletions.
  1. +92 −111 docs/arabic.rst
@@ -1,7 +1,6 @@
Arabic
******
Classical Arabic is the form of the Arabic language used in Umayyad and Abbasid literary texts from the 7th century AD to the 9th century AD. The orthography of the Qurʾān was not developed for the standardized form of Classical Arabic; rather, it shows the attempt on the part of writers to utilize a traditional writing system for recording a non-standardized form of Classical Arabic. (Source: `Wikipedia <https://en.wikipedia.org/wiki/Classical_Arabic>`_)
Corpora
=======
@@ -27,90 +26,74 @@ The Arabic alphabet are placed in `cltk/corpus/arabic/alphabet.py <https://githu
.. code-block:: python
In [1]: from cltk.corpus.arabic.alphabet import *
# all Hamza forms
In [2]: HAMZAT
Out[2]: ('ء', 'أ', 'إ', 'آ', 'ؤ', 'ؤ', 'ٔ', 'ٕ')
# print HAMZA from hamza const and from HAMZAT list
In [3] HAMZA
Out[3] 'ء'
In [4] HAMZAT[0]
Out[4] 'ء'
# listing all Arabic letters
In [5] LETTERS
Out[5] 'ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء آ أ ؤ إ ؤ'
# Listing all shaped forms for example Beh letter
In [6] SHAPED_FORMS[BEH]
Out[6] ('', '', '', '')
# Listing all Punctuation marks
In [7] PUNCTUATION_MARKS
Out[7] ['،', '؛', '؟']
# Listing all Diacritics FatHatanً ,Dammatanٌ ,Kasratanٍ ,FatHaَ ,Dammaُ ,Kasraِ ,Sukunْ ,Shaddaّ
In [8] TASHKEEL
Out[8] ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ', 'ّ')
# Listing HARAKAT
In [9] HARAKAT
Out[9] ('ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ْ')
# Listing SHORTHARAKAT
In [10] SHORTHARAKAT
Out[10] ('َ', 'ُ', 'ِ', 'ْ')
# Listing Tanween
In [11] TANWEEN
Out[11] ('ً', 'ٌ', 'ٍ')
# Kasheeda, Tatweel
In [12] NOT_DEF_HARAKA
Out[12] 'ـ'
# WESTERN_ARABIC_NUMERALS numerals
In [13] WESTERN_ARABIC_NUMERALS
Out[13] ['0','1','2','3','4','5','6','7','8','9']
# EASTERN ARABIC NUMERALS from 0 to 9
In [14] EASTERN_ARABIC_NUMERALS
Out[14] ['۰', '۱', '۲', '۳', '٤', '۵', '٦', '۷', '۸', '۹']
# Listing The Weak letters .
In [15] WEAK
Out[15] ('ا', 'و', 'ي', 'ى')
# Listing all Ligatures Lam-Alef
In [16] LIGATURES_LAM_ALEF
Out[16] ('', '', '', '')
# listing small letters
In [17] SMALL
Out[17] ('ٰ', 'ۥ', 'ۦ')
# Import letters names in arabic
In [18] Names[ALEF]
Out[18] 'ألف'
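The diacritic constants listed above nest inside one another: ``HARAKAT`` is ``TANWEEN`` plus ``SHORTHARAKAT``, and ``TASHKEEL`` adds Shadda on top. The following is a minimal sketch of that relationship using plain Unicode escapes. It does not import cltk, and the lowercase names (``tanween``, ``short_harakat``, ``is_tashkeel``, and so on) are illustrative assumptions, not part of the library.

```python
# Illustrative stand-ins for the constants shown above, written as Unicode
# escapes so the combining marks are unambiguous.
# Assumption: these lowercase names are hypothetical and not part of cltk.
tanween = ("\u064b", "\u064c", "\u064d")                   # Fathatan, Dammatan, Kasratan
short_harakat = ("\u064e", "\u064f", "\u0650", "\u0652")  # Fatha, Damma, Kasra, Sukun
shadda = "\u0651"

# HARAKAT in the listing is the three Tanween marks followed by the short marks.
harakat = tanween + short_harakat
# TASHKEEL in the listing is HARAKAT plus Shadda.
tashkeel = harakat + (shadda,)


def is_tashkeel(char):
    """Return True if char is one of the eight diacritics above."""
    return char in tashkeel


print(len(harakat), len(tashkeel))
print(is_tashkeel(shadda), is_tashkeel("\u0628"))  # Shadda vs. the letter Beh
```

Writing the marks as escapes avoids the rendering problems that bare combining characters cause inside quotes.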
CLTK Arabic Support
===================
@@ -142,34 +125,32 @@ Specific Arabic language library for Python, provides basic functions to manipul
.. code-block:: python
In [1] from cltk.corpus.arabic.utils.pyarabic import araby
In [2] char = 'ْ'
In [3] araby.is_sukun(char) # Checks for Arabic Sukun Mark
Out[3] True
In [4] char = 'ّ'
In [5] araby.is_shadda(char) # Checks for Arabic Shadda Mark
Out[5] True
In [6] text = "الْعَرَبِيّةُ"
In [7] araby.strip_harakat(text) # Strip Harakat from an Arabic word, except Shadda
Out[7] العربيّة
In [8] text = "الْعَرَبِيّةُ"
In [9] araby.strip_lastharaka(text) # Strip the last Haraka from an Arabic word, except Shadda
Out[9] الْعَرَبِيّة
In [10] text = "الْعَرَبِيّةُ"
In [11] araby.strip_tashkeel(text) # Strip vowel marks from a text, including Shadda
Out[11] العربية
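The three stripping calls above differ only in which marks they remove. The sketch below reproduces that behaviour in plain Python so the difference is easy to see. It is not the pyarabic implementation: the function names are local stand-ins, and the sample word is built from Unicode escapes.

```python
# A sketch of diacritic stripping under stated assumptions; this is NOT the
# pyarabic implementation, only an illustration of the behaviour shown above.
HARAKAT = "\u064b\u064c\u064d\u064e\u064f\u0650\u0652"  # every mark except Shadda
SHADDA = "\u0651"


def strip_harakat(text):
    """Remove Harakat everywhere but keep Shadda."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def strip_tashkeel(text):
    """Remove Harakat and Shadda."""
    return "".join(ch for ch in text if ch not in HARAKAT + SHADDA)


def strip_last_haraka(text):
    """Remove only a trailing Haraka, leaving Shadda and inner marks alone."""
    if text and text[-1] in HARAKAT:
        return text[:-1]
    return text


# The sample word from the session above, spelled out code point by code point.
word = "\u0627\u0644\u0652\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651\u0629\u064f"

print(strip_harakat(word))
print(strip_last_haraka(word))
print(strip_tashkeel(word))
```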
Stopword Filtering
@@ -178,25 +159,26 @@ To use the CLTK's built-in stopwords list:
.. code-block:: python
In [1]: from cltk.stop.arabic.stopword_filter import stopwords_filter as ar_stop_filter
In [2]: text = 'سُئِل بعض الكُتَّاب عن الخَط، متى يَسْتحِقُ أن يُوصَف بِالجَودةِ؟'
In [3]: ar_stop_filter(text)
Out[3]: ['سئل', 'الكتاب', 'الخط', '،', 'يستحق', 'يوصف', 'بالجودة', '؟']
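The output above suggests three steps: diacritics are stripped, the text is split into words and punctuation, and stopwords are dropped. A minimal sketch of that pipeline follows. It assumes a tiny hypothetical stopword set and a regex tokenizer, and it is not CLTK's own filter; ``filter_stopwords`` and ``STOPWORDS`` are names invented for the illustration.

```python
import re

# All eight Tashkeel marks (Tanween, short Harakat, Shadda, Sukun).
TASHKEEL = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"

# Assumption: a tiny hypothetical stopword set, far smaller than CLTK's list.
STOPWORDS = {
    "\u0628\u0639\u0636",  # "some"
    "\u0639\u0646",        # "about"
    "\u0645\u062a\u0649",  # "when"
    "\u0623\u0646",        # "that"
}


def filter_stopwords(text):
    """Strip diacritics, split into words and punctuation, drop stopwords."""
    stripped = "".join(ch for ch in text if ch not in TASHKEEL)
    # \w+ matches runs of letters; [^\w\s] keeps each punctuation mark as a token.
    tokens = re.findall(r"\w+|[^\w\s]", stripped)
    return [tok for tok in tokens if tok not in STOPWORDS]


# A stopword, then a vocalized word followed by an Arabic comma.
sample = "\u0639\u0646 \u0627\u0644\u062e\u064e\u0637\u060c"
print(filter_stopwords(sample))
```

Stripping diacritics before tokenizing matters here because Python's ``\w`` does not match combining marks, so a vocalized word would otherwise be split at every mark.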
Word Tokenization
=================
.. code-block:: python
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('arabic')
In [3]: text = 'اللُّغَةُ الْعَرَبِيَّةُ جَمِيلَةٌ.'
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['اللُّغَةُ', 'الْعَرَبِيَّةُ', 'جَمِيلَةٌ', '.']
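Unlike the stopword filter, the tokenizer output above keeps every diacritic attached to its word and only separates the final full stop. One way to get that behaviour, sketched here with the standard library only, is to split on whitespace and then peel off any character whose Unicode category is punctuation. This is an illustration under that assumption, not the logic of ``WordTokenizer``; the function name ``tokenize`` is local to the example.

```python
import unicodedata


def tokenize(text):
    """Split on whitespace, then emit each punctuation mark as its own token.

    Combining marks have category 'Mn', not 'P', so diacritics stay
    attached to the letters they follow.
    """
    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


# A fully vocalized word followed directly by a full stop.
sample = "\u062c\u064e\u0645\u0650\u064a\u0644\u064e\u0629\u064c."
print(tokenize(sample))
```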
Transliteration
===============
@@ -207,26 +189,25 @@ Available Transliteration Systems
.. code-block:: python
In [1] from cltk.phonology.arabic.romanization import available_transliterate_systems
In [2] available_transliterate_systems()
Out[2] ['buckwalter', 'iso233-2', 'asmo449']
Usage
`````
.. code-block:: python
In [1] from cltk.phonology.arabic.romanization import transliterate
In [2] mode = 'buckwalter'
In [3] ar_string = 'بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ' # translate in English: In the name of Allah, the Most Merciful, the Most Compassionate
In [4] ignore = '' # Arabic characters to exclude from transliteration (none here)
In [5] reverse = True # True transliterates from Arabic script to Roman script (here Buckwalter)
In [6] transliterate(mode, ar_string, ignore, reverse)
Out[6] 'bisomi Allhi Alra~Hom`ni Alra~Hiyomi'
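Buckwalter transliteration is a one-to-one character mapping, which is why the output above has exactly one ASCII symbol per Arabic letter or diacritic (``o`` for Sukun, ``~`` for Shadda). The sketch below shows the idea with a small subset of the table. It is not CLTK's ``transliterate`` function; ``to_buckwalter`` is a hypothetical name, and its ``ignore`` parameter reflects one reading of the comment in the session above (characters to pass through unchanged).

```python
# A partial Buckwalter table: only the characters needed for the example.
# Assumption: a full implementation would cover the whole Arabic alphabet.
BUCKWALTER = {
    "\u0627": "A",  # Alef
    "\u0628": "b",  # Beh
    "\u0633": "s",  # Seen
    "\u0644": "l",  # Lam
    "\u0645": "m",  # Meem
    "\u0647": "h",  # Heh
    "\u064e": "a",  # Fatha
    "\u064f": "u",  # Damma
    "\u0650": "i",  # Kasra
    "\u0651": "~",  # Shadda
    "\u0652": "o",  # Sukun
}


def to_buckwalter(text, ignore=""):
    """Map each character through the table.

    Characters listed in `ignore`, and characters missing from the table,
    are passed through unchanged.
    """
    return "".join(
        ch if ch in ignore else BUCKWALTER.get(ch, ch)
        for ch in text
    )


# The first two words of the example string, spelled as code points.
first_word = "\u0628\u0650\u0633\u0652\u0645\u0650"
second_word = "\u0627\u0644\u0644\u0647\u0650"
print(to_buckwalter(first_word), to_buckwalter(second_word))
```

Because the mapping is one-to-one, the reverse direction only needs the inverted dictionary.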
