<a href="https://colab.research.google.com/github/farhadrahimiinfo/Natural_Language_Processing/blob/main/NLP__algorithms_for_the_Kurdish_Language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP) algorithms for the Kurdish Language (ckb: Central branch of Kurdish)


AsoSoft Library (Python)
AsoSoft Library offers the following natural language processing (NLP) algorithms for the Kurdish Language (ckb: Central branch of Kurdish):

* **Grapheme-to-Phoneme (G2P)** converter and Transliterator: converts Kurdish text into a syllabified phoneme string. It also transliterates Kurdish text from Arabic script into Latin script and vice versa.
* **Normalizer**: normalizes Kurdish text and punctuation marks, unifies numerals, replaces HTML entities, extracts and replaces URLs and emails, and more.
* **Numeral Converter**: converts any type of number into Kurdish words.
* **Sort**: sorts a list in correct Kurdish alphabetical order.
* **Poem Meter Classifier**: classifies the meter of an input Kurdish poem.


AsoSoft Library was originally written in C# by Aso Mahmudi; this library is its Python port.




In [3]:
!pip install asosoft



# Normalizer
Normalizes Kurdish text and punctuation marks, unifies numerals, replaces HTML entities, extracts and replaces URLs and emails, and more.


In [4]:
import asosoft


# Kurdish Text Normalizer
Several functions are needed for Central Kurdish text normalization:

Normalize Kurdish
Two character replacement lists are provided as resources of the library:

Deep Unicode Corrections:
* replacing deprecated Arabic Presentation Forms (FB50–FDFF and FE70–FEFF) with the corresponding standard characters
* replacing different types of dashes and spaces
* removing Unicode control characters

Additional Unicode Corrections:
* replacing special Arabic math signs with the corresponding Latin characters
* replacing similar but different letters with standard characters (e.g. ڪ,ے,ٶ with ک,ی,ؤ)

The normalization tasks performed by this function:

For all Arabic scripts (including Kurdish, Arabic, and Persian):
* Character-based replacement:
* * the replacement lists mentioned above
* * Private Use Area characters (U+E000 to U+F8FF) with the White Square character
* standardizing and removing duplicated or unnecessary zero-width characters
* removing unnecessary Tatweels (U+0640)

Only for Central Kurdish:
* standardizing Kurdish characters: ە, هـ, ی, and ک
* correcting mis-converted characters from non-Unicode fonts
* * replacing word-initial ر with ڕ
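
Three of the steps listed above can be approximated with regular expressions. This is a minimal sketch under stated assumptions, not the library's code: `sketch_normalize` is a hypothetical name, every Tatweel is removed (the library only removes "unnecessary" ones, by rules not described here), and "word-initial" is approximated with the regex word boundary `\b`.

```python
import re

WHITE_SQUARE = "\u25A1"


def sketch_normalize(text: str) -> str:
    """Hypothetical helper covering three of the listed steps."""
    # Private Use Area characters become the White Square character.
    text = re.sub("[\uE000-\uF8FF]", WHITE_SQUARE, text)
    # Simplification: drop every Tatweel (U+0640), not only unnecessary ones.
    text = text.replace("\u0640", "")
    # Word-initial ر (U+0631) becomes ڕ (U+0695).
    text = re.sub(r"\b\u0631", "\u0695", text)
    return text


print(sketch_normalize("خـــۆش رەنگ"))
```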
The simple overload:



In [6]:
#Normalize
print(asosoft.Normalize("دەقے شیَعري خـــۆش. ره‌نگه‌كاني خاك"))



دەقی شێعری خۆش. ڕەنگەکانی خاک


In [12]:
# NormalizePunctuations corrects the spacing before and after punctuation marks.
# The second argument, seprateAllPunctuations, is passed as False here.

print(asosoft.NormalizePunctuations("دەقی«کوردی » و ڕێنووس ،((خاڵبەندی )) چۆنە ؟", False))


دەقی «کوردی» و ڕێنووس، «خاڵبەندی» چۆنە؟


In [15]:
# Replace HTML entities

print(asosoft.ReplaceHtmlEntity("ئێوە &quot;دەق&quot; بە زمانی &lt;کوردی&gt; دەنووسن"))


ئێوە "دەق" بە زمانی <کوردی> دەنووسن
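
For comparison, Python's standard library can decode the same named entities with `html.unescape`. This is shown only as an alternative for illustration; it does not describe how `ReplaceHtmlEntity` is implemented, and whether the two handle every entity identically is not checked here.

```python
import html

text = "ئێوە &quot;دەق&quot; بە زمانی &lt;کوردی&gt; دەنووسن"

# html.unescape converts named and numeric character references to characters.
decoded = html.unescape(text)
print(decoded)
```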


In [18]:
# Unify numerals
# UnifyNumerals converts all numeral characters to the desired numeral type: en (0123456789) or ar (٠١٢٣٤٥٦٧٨٩)

print(asosoft.UnifyNumerals("ژمارەکانی ٤٥٦ و ۴۵۶ و 456", "en"))


ژمارەکانی 456 و 456 و 456
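
The same idea can be sketched with a character translation table. The helper name `unify_numerals_sketch` is hypothetical, and the sketch assumes only three digit sets matter: ASCII, Arabic-Indic (U+0660 to U+0669), and Extended Arabic-Indic (U+06F0 to U+06F9), which are the three that appear in the example above.

```python
ASCII_DIGITS = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))           # ٠ to ٩
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))  # ۰ to ۹


def unify_numerals_sketch(text: str, target: str = "en") -> str:
    """Hypothetical helper: map every digit variant to one numeral type."""
    if target == "en":
        table = str.maketrans(ARABIC_INDIC + EXTENDED_ARABIC_INDIC,
                              ASCII_DIGITS * 2)
    elif target == "ar":
        table = str.maketrans(ASCII_DIGITS + EXTENDED_ARABIC_INDIC,
                              ARABIC_INDIC * 2)
    else:
        raise ValueError("target must be 'en' or 'ar'")
    return text.translate(table)


print(unify_numerals_sketch("\u0664\u0665\u0666 \u06F4\u06F5\u06F6 456", "en"))
```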


In [19]:
# Separate digits from words
# SeperateDigits adds a space between joined numerals and words (e.g. replacing "12کەس" with "12 کەس"). Separating them gives cleaner tokens for language models.
print(asosoft.SeperateDigits("لە ساڵی1950دا1000دۆلاریان بە 5کەس دا"))



لە ساڵی 1950 دا 1000 دۆلاریان بە 5 کەس دا
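
A minimal regex sketch of this behaviour follows. The name `separate_digits_sketch` is hypothetical, and the sketch assumes that "word" means any run of Unicode letters; the library's own rules may differ.

```python
import re

# Any Unicode letter: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"


def separate_digits_sketch(text: str) -> str:
    """Hypothetical helper: put a space wherever a digit touches a letter."""
    text = re.sub(rf"(\d)({LETTER})", r"\1 \2", text)  # digit followed by letter
    text = re.sub(rf"({LETTER})(\d)", r"\1 \2", text)  # letter followed by digit
    return text


print(separate_digits_sketch("لە ساڵی1950دا1000دۆلاریان بە 5کەس دا"))
```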


In [21]:
# Word-to-word replacement
# Word2WordReplacement applies a "string to string" replacement dictionary to the text. It replaces only whole-word matches, not parts of words.
print(asosoft.Word2WordReplacement("مال، نووری مالیکی", {"مال": "ماڵ", "سلاو": "سڵاو"}))


ماڵ، نووری مالیکی
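
Whole-word replacement can be sketched by tokenizing on word characters and looking each token up in the dictionary. The name `word2word_sketch` is hypothetical, and the sketch assumes a word is a maximal run of `\w` characters, so a word containing a zero-width non-joiner would be split; the library may define words differently.

```python
import re


def word2word_sketch(text: str, replacements: dict) -> str:
    """Hypothetical helper: replace whole words found in the dictionary."""
    def swap(match):
        word = match.group(0)
        # Words missing from the dictionary are returned unchanged.
        return replacements.get(word, word)

    return re.sub(r"\w+", swap, text)


print(word2word_sketch("مال، نووری مالیکی", {"مال": "ماڵ", "سلاو": "سڵاو"}))
```

Because each match is a complete token, a dictionary key never matches inside a longer word, which is why the last word in the example is left alone.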


# Kurdish Numeral Converter
It converts numerals into Central Kurdish words, which is useful in text-to-speech tools. The following forms are handled:

* integers (1100)
* floats (10.11)
* negatives (-10.11)
* percentages (100% or %100)
* currency symbols ($100, £100, and €100)


In [22]:
print(asosoft.Number2Word("لە ساڵی 1999دا بڕی 40% لە پارەکەیان واتە $102.1 یان وەرگرت"))


لە ساڵی هەزار و نۆسەد و نەوەد و نۆدا بڕی چل لە سەد لە پارەکەیان واتە سەد و دوو پۆینت یەک دۆلار یان وەرگرت


# Kurdish Sort
Sorts a list of strings in the correct order of the Kurdish alphabet ("ئءاآأإبپتثجچحخدڎذرڕزژسشصضطظعغفڤقكکگلڵمنوۆۊۉهھەیێ").



In [23]:
myList = ["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"]


In [24]:
print(asosoft.KurdishSort(myList))


['ئاو', 'ئەو', 'ڕەنگ', 'فڵان', 'ڤەژین', 'یەک']


In [26]:
# CustomSort sorts the list using a character order supplied by the caller
input_list = ["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"]
input_order = list("ئءاآأإبپتثجچحخدڎڊذرڕزژسشصضطظعغفڤقكکگڴلڵمنوۆۊۉۋهھەیێ")
print(asosoft.CustomSort(input_list, input_order))

['ئاو', 'ئەو', 'ڕەنگ', 'فڵان', 'ڤەژین', 'یەک']
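
Sorting by a custom alphabet can be sketched with a sort key that maps each character to its position in the order string. The name `custom_sort_sketch` is hypothetical, and the handling of characters outside the alphabet (placed after all listed letters, by code point) is an assumption rather than the library's documented behaviour.

```python
KURDISH_ALPHABET = "ئءاآأإبپتثجچحخدڎذرڕزژسشصضطظعغفڤقكکگلڵمنوۆۊۉهھەیێ"


def custom_sort_sketch(words, order=KURDISH_ALPHABET):
    """Hypothetical helper: sort words by a caller-supplied character order."""
    rank = {ch: i for i, ch in enumerate(order)}

    def key(word):
        # Characters missing from the order sort after all listed letters.
        return [rank.get(ch, len(rank) + ord(ch)) for ch in word]

    return sorted(words, key=key)


print(custom_sort_sketch(["یەک", "ڕەنگ", "ئەو", "ئاو", "ڤەژین", "فڵان"]))
```

Comparing lists of ranks gives letter-by-letter ordering, so two words with the same first letter are ordered by their second letter, and so on.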


In [None]:
#Poem Meter Classifier
