# Arabic NLP

In [1]:
import pandas as pd
import string
import re
import nltk

This dataset was obtained from Heedzy (https://heedzy.com/). It contains 100 reviews of the HungerStation app.

In [2]:
# Read the CSV data file

df = pd.read_csv("hungerstation.csv")

In [3]:
df.head()

Unnamed: 0,Source,Date,Title,Content,Name,Rating,Version
0,HungerStation,2021-11-02,اتمنى الانتباه,تغليف المشروبات باحكام (الحار والبارد) للتاكد ...,'أميرة',2,5.13.2
1,HungerStation,2021-11-02,اعلاناتهم أكبر من واقعهم,جاهز وبس,دح 911,1,5.13.2
2,HungerStation,2021-11-02,سيء جدا,لا يوجد اي التزام بمواعيد التوصيل ، الطلب يجلس...,y_awdah,1,5.13.2
3,HungerStation,2021-11-02,فاشل,اقل من نجمه ونصابين ويزيدون فلوس فجاه بدون سبب...,ii.jem2,1,5.13.2
4,HungerStation,2021-11-02,horrible,canceled my order on me did not provide any in...,Dead Rat2,1,5.13.2


In [4]:
df.shape

(100, 7)

In [5]:
df['Content'][0]

'تغليف المشروبات باحكام (الحار والبارد) للتاكد من ان المندوب لم لم يستخدم المشروب'

In [6]:
print (df['Content'][9])

There’s is ZERO customer service. I never received an order, was charged for it and then was never refunded. To add insult to injury they marked my issue as resolved. I cannot get through to them anymore. Absolutely disgraceful and dishonest. Avoid this crooked company like the plague. If I could give -10 stars I would. Utterly horrid!


In [7]:
print (df['Content'][13])

افضل من مرسول 🤍🤍


In [8]:
# Run it first time only, to install the library.
#!pip install camel-tools

## Arabic Preprocessing Pipeline
There are many tools for processing Arabic text, including NLTK, Gensim, Farasa, MADAMIRA, Stanford CoreNLP, and CAMeL Tools. We will use CAMeL Tools, NLTK, and Gensim.

### Step 1: Dediacritization (إزالة التشكيل)

In [9]:
# import the dediacritization tool
from camel_tools.utils.dediac import dediac_ar

text  = "إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ ۚ يَا أَيُّهَا الَّذِينَ آمَنُوا صَلُّوا عَلَيْهِ وَسَلِّمُوا تَسْلِيمًا "
text2 = dediac_ar(text)
print ("before: ", text)
print ("after : ", text2 )

before:  إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ ۚ يَا أَيُّهَا الَّذِينَ آمَنُوا صَلُّوا عَلَيْهِ وَسَلِّمُوا تَسْلِيمًا 
after :  إن الله وملائكته يصلون على النبي ۚ يا أيها الذين آمنوا صلوا عليه وسلموا تسليما 


### Step 2: Normalizing Alef, Alef Maksura, and Teh Marbuta

In [10]:
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar

def ortho_normalize(text):
    text = normalize_alef_maksura_ar(text)
    text = normalize_alef_ar(text)
    text = normalize_teh_marbuta_ar(text)
    return text

In [11]:
text  = "الأم الآلام الإمرأة"
text2 = ortho_normalize(text)
print ("before: ", text)
print ("after : ", text2 )

before:  الأم الآلام الإمرأة
after :  الام الالام الامراه


### Step 3: Remove English Text

In [12]:
def remove_english(text):
    text = re.sub(r'\s*[A-Za-z]+\b', '' , text)
    return (text)

In [13]:
text = df['Content'][9]
text2 = remove_english(text)
print ("before: ", text)
print ("after : ", text2)

before:  There’s is ZERO customer service. I never received an order, was charged for it and then was never refunded. To add insult to injury they marked my issue as resolved. I cannot get through to them anymore. Absolutely disgraceful and dishonest. Avoid this crooked company like the plague. If I could give -10 stars I would. Utterly horrid!
after :  ’.,..... -10.!


### Step 4: Remove Punctuation

In [14]:
def remove_punc(text):
    # define a string of Arabic and English punctuation marks that we want to remove from the text
    punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ''' + string.punctuation
    translator = str.maketrans('', '', punctuations)
    text = text.translate(translator)
    return text

- `str.maketrans('', '', punctuations)` builds a translation table. With three arguments, every character in the third string is mapped to `None`, which means it is deleted.
- `translate()` returns a copy of the string with that table applied.
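To see the deletion behaviour in isolation, here is a minimal sketch. The sample sentence and the short three-character punctuation string are made up for illustration and are not taken from the dataset.

```python
# Three-argument form of str.maketrans: the third string lists characters to delete.
table = str.maketrans('', '', '!،؟')

sample = "مرحبا، كيف الحال؟!"   # illustrative sentence (assumption, not from the reviews)
stripped = sample.translate(table)

print("before: ", sample)
print("after : ", stripped)
```

Each deleted character appears in `table` as a key (its code point) whose value is `None`.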

In [15]:
text = df['Content'][2]
text2 = remove_punc(text)
print ("before: ", text)
print ("after : ", text2)

before:  لا يوجد اي التزام بمواعيد التوصيل ، الطلب يجلس ساعتين لين يجيك ! ويجي بارد والمندوب يالله يرد ما تطور ولا شي بهنقرستيشن من يوم طلع للحين ليتهم يتعلمون من جاهز بس
after :  لا يوجد اي التزام بمواعيد التوصيل  الطلب يجلس ساعتين لين يجيك  ويجي بارد والمندوب يالله يرد ما تطور ولا شي بهنقرستيشن من يوم طلع للحين ليتهم يتعلمون من جاهز بس


### Step 5: Remove emojis

In [16]:
# you have to install this first by: !pip install demoji

import demoji
text = df['Content'][13]
text2 = demoji.replace(text, '')
print ("before: ", text)
print ("after : ", text2)

before:  افضل من مرسول 🤍🤍
after :  افضل من مرسول 


### Step 6: Tokenization

In [17]:
from camel_tools.tokenizers.word import simple_word_tokenize

text  = "إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ ۚ يَا أَيُّهَا الَّذِينَ آمَنُوا صَلُّوا عَلَيْهِ وَسَلِّمُوا تَسْلِيمًا "
text2 = dediac_ar(text)
text2 = ortho_normalize(text2)
tokenized = simple_word_tokenize(text2)
print ("before: ", text)
print ("after : ", tokenized)

before:  إِنَّ اللَّهَ وَمَلَائِكَتَهُ يُصَلُّونَ عَلَى النَّبِيِّ ۚ يَا أَيُّهَا الَّذِينَ آمَنُوا صَلُّوا عَلَيْهِ وَسَلِّمُوا تَسْلِيمًا 
after :  ['ان', 'الله', 'وملائكته', 'يصلون', 'علي', 'النبي', 'ۚ', 'يا', 'ايها', 'الذين', 'امنوا', 'صلوا', 'عليه', 'وسلموا', 'تسليما']


### Step 7: Stop-words removal

In [18]:
from nltk.corpus import stopwords
# First run only: nltk.download("stopwords")
stop_words = stopwords.words("arabic")

In [19]:
clean_text = [word for word in tokenized if word not in stop_words]
clean_text 

['ان',
 'الله',
 'وملائكته',
 'يصلون',
 'علي',
 'النبي',
 'ۚ',
 'ايها',
 'امنوا',
 'صلوا',
 'وسلموا',
 'تسليما']
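In the output above, `ان` and `علي` are still present. One plausible reason is that the tokens were orthographically normalized in Step 2 while the stop-word list was not, so a list entry such as `إن` or `على` no longer matches. The sketch below illustrates the idea with a hand-written three-word sample list and a simplified stand-in for `ortho_normalize`. Both are assumptions for illustration, and the real NLTK list may differ.

```python
# Simplified stand-in for ortho_normalize (assumption: only the mappings from Step 2).
_NORMALIZE = str.maketrans({
    '\u0623': '\u0627',  # alef with hamza above -> bare alef
    '\u0625': '\u0627',  # alef with hamza below -> bare alef
    '\u0622': '\u0627',  # alef with madda       -> bare alef
    '\u0649': '\u064a',  # alef maksura          -> yeh
    '\u0629': '\u0647',  # teh marbuta           -> heh
})

def normalize(word):
    return word.translate(_NORMALIZE)

sample_stop_words = ['إن', 'على', 'يا']      # hypothetical sample, not the NLTK list
tokens = ['ان', 'الله', 'علي', 'النبي']      # already-normalized tokens

# Filtering against the raw list misses the normalized tokens.
raw_filtered = [t for t in tokens if t not in sample_stop_words]

# Normalizing the stop words first makes both sides comparable.
normalized_stop_words = {normalize(w) for w in sample_stop_words}
norm_filtered = [t for t in tokens if t not in normalized_stop_words]

print(raw_filtered)
print(norm_filtered)
```

In the real pipeline, the same idea would be to pass each entry of `stop_words` through `dediac_ar` and `ortho_normalize` once, before filtering.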

### Step 8: Stemming or Lemmatization

In [20]:
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()
w = 'حركات'
print(st.stem(w))

حرك


Other preprocessing steps include removing numbers, removing URLs, and so on. Which ones you need depends on the task you want to perform.
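As a rough sketch of two such steps, the helpers below strip URLs and numbers with regular expressions. The function names `remove_urls` and `remove_numbers` and the URL pattern are assumptions for illustration; a real URL pattern may need to be stricter or looser depending on the data.

```python
import re

def remove_urls(text):
    # Drop anything that starts with http://, https://, or www. up to the next whitespace.
    return re.sub(r'(https?://|www\.)\S+', '', text)

def remove_numbers(text):
    # In Python 3, \d on str patterns matches any Unicode decimal digit,
    # so it covers both 0-9 and the Arabic-Indic digits.
    return re.sub(r'\d+', '', text)

sample = "رقم الطلب 12345 على https://heedzy.com/ اليوم"   # illustrative text
print(remove_numbers(remove_urls(sample)))
```

Both helpers leave the surrounding whitespace in place, so a later tokenization step is expected to absorb the extra spaces.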

### Put them all together

In [21]:
def clean(text):
    # To reuse this function in another script or notebook,
    # import the libraries and helper functions defined above.
    cleaned = dediac_ar(text)
    cleaned = ortho_normalize(cleaned)
    cleaned = remove_english(cleaned)
    cleaned = remove_punc(cleaned)
    cleaned = demoji.replace(cleaned, '')
    tokens = simple_word_tokenize(cleaned)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [st.stem(word) for word in tokens]
    return " ".join(tokens)

In [22]:
text = " الطلب يجلس ساعتين لين يجيك !لكن افضل من مرسول 👍🏼👍🏼 مَاشِي "
text2 = clean(text)
text2

'طلب جلس ساع لين يجك فضل رسل اشي'

Let's apply it to all the reviews:

In [23]:
new_df = df.copy()  # copy, so the original df is left unchanged
new_df['Content'] = df['Content'].apply(clean)
new_df.head()

Unnamed: 0,Source,Date,Title,Content,Name,Rating,Version
0,HungerStation,2021-11-02,اتمنى الانتباه,غلف شرب حكم حار برد تكد ان ندب خدم شرب,'أميرة',2,5.13.2
1,HungerStation,2021-11-02,اعلاناتهم أكبر من واقعهم,جهز وبس,دح 911,1,5.13.2
2,HungerStation,2021-11-02,سيء جدا,وجد اي تزم مواعيد وصل طلب جلس ساع لين يجك ويج ...,y_awdah,1,5.13.2
3,HungerStation,2021-11-02,فاشل,اقل نجم نصب يزد فلس فجه بدن سبب قئم سعر و سله ...,ii.jem2,1,5.13.2
4,HungerStation,2021-11-02,horrible,,Dead Rat2,1,5.13.2


These preprocessing steps are needed for machine learning algorithms. Now you can apply sentiment analysis as we did in English NLP, but you need a labelled dataset. This one is not labelled. You could label it yourself, but it is too small to train on.

However, for text summarization we would not need all the preprocessing steps above. We will see an example.

## Text Summarization Example

We will use an algorithm called TextRank, which is based on the PageRank algorithm.
- A TextRank implementation is available in the Gensim library.
- PageRank was developed by Google co-founders Larry Page and Sergey Brin.
- The algorithm ranks pages by importance: the most important page is the one that receives the most links from pages that are themselves important.
- It is based on graph theory: each web page is a node, and the edges are the links (hyperlinks) between pages.
- In TextRank, nodes represent sentences and edges represent similarities between sentences.
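The points above can be sketched in a few lines of plain Python. This is an illustrative toy, not Gensim's implementation: the Jaccard word-overlap `similarity`, the function name `textrank_scores`, the damping factor of 0.85, the fixed iteration count, and the English sample reviews are all assumptions chosen for brevity.

```python
def similarity(a, b):
    # Jaccard overlap of the two word sets (assumption: a simple stand-in measure).
    wa, wb = set(a.split()), set(b.split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def textrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Weighted graph: nodes are sentences, edge weights are similarities.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            incoming = sum(weights[j][i] / out_sums[j] * scores[j]
                           for j in range(n) if out_sums[j] > 0)
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores

reviews = [
    "the delivery was late and the food was cold",
    "the food was cold",
    "great app",
    "delivery was late again",
]
scores = textrank_scores(reviews)
best = max(range(len(reviews)), key=scores.__getitem__)
print(reviews[best])
```

A sentence that shares no words with any other sentence receives only the baseline score `1 - damping`, which is why isolated reviews rarely make it into a summary.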

"The higher the PageRank of a link, the more authoritative it is. 
We can simplify the PageRank algorithm to describe it as a way for the importance of a webpage to be measured by analyzing the quantity and quality of the links that point to it." [2]

"TextRank – is a graph-based ranking model for text processing which can be used in order to find the most relevant sentences in text and also to find keywords. The algorithm is explained in detail in the paper at https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf" [3]

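The Gensim Python library includes an implementation of TextRank in its `gensim.summarization` module. Note that this module was removed in Gensim 4.0, so the code below needs an older 3.x release. Let's try it.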

First, we will combine all the reviews into one long text.

In [24]:
df = pd.read_csv("hungerstation.csv")
# Join all reviews, ending each one with '. ' so it counts as a sentence
sentences = ''.join(s + '. ' for s in df['Content'])

#sentences

Then we will import the library that will do all the work for us.

In [25]:
# TextRank

import gensim

In [26]:
from gensim.summarization import summarize

In [27]:
short_summary = summarize(sentences)
print(short_summary)

لا يوجد اي التزام بمواعيد التوصيل ، الطلب يجلس ساعتين لين يجيك !
my money is lost and every time i try to contact customer service it gives me a “something went wrong” message 
They always demand that we come down to take the order even though the delivery charge is very high already.
It is only normal that the order is delivered to the doorstep; that is what we pay for after all with the delivery charge.
في الطايف المحلات والمطاعم مشغولة او مغلقة ادخل مرسول كله مفتوحة ايش المشكلة كل مره زي كذا ي مغلق او مشغول كرهت البرنامج من زمان على الوضع ذا ولا تغير.
It’s worse it application delivery food  I have tried in Saudi Arabia.
سحبو الفلوس و تم إلغاء الطلب ولا يوجد طريقة تواصل ولم يتم إسترجاع المبلغ.
يعني ياخذ طلبك ويروح مطعم ثاني، ينتظر الطلب بعدين يجيك الاكل بارد.
المندوب يقبل الطلب وهو مامعه فلوس.
اسوء تطبيق توصيل ممكن تشوفه الاكل يوصل بارد و نص الاكل ماكول الله لا يسامحهم و توصيلهم اغلى من الطلب نفسه.
Best app as well as the service they provide incredible..
كل وجبة يزيد سعرها في هنقرس

In [28]:
# Summarization by ratio

summary_by_ratio=summarize(sentences,ratio=0.1)
print(summary_by_ratio)

لا يوجد اي التزام بمواعيد التوصيل ، الطلب يجلس ساعتين لين يجيك !
my money is lost and every time i try to contact customer service it gives me a “something went wrong” message 
It is only normal that the order is delivered to the doorstep; that is what we pay for after all with the delivery charge.
في الطايف المحلات والمطاعم مشغولة او مغلقة ادخل مرسول كله مفتوحة ايش المشكلة كل مره زي كذا ي مغلق او مشغول كرهت البرنامج من زمان على الوضع ذا ولا تغير.
سحبو الفلوس و تم إلغاء الطلب ولا يوجد طريقة تواصل ولم يتم إسترجاع المبلغ.
يعني ياخذ طلبك ويروح مطعم ثاني، ينتظر الطلب بعدين يجيك الاكل بارد.
اسوء تطبيق توصيل ممكن تشوفه الاكل يوصل بارد و نص الاكل ماكول الله لا يسامحهم و توصيلهم اغلى من الطلب نفسه.
Once the order delivered, this app will be deleted for good..
طلبت اكثر من مره من التطبيق والمشكله الطلب ياصل متأخر وبارد وقليلين فهمم 
تطبيق من سيء الى اسواء … السعر اغلى … دايم يتاخر ودايم الاكل يجي بارد.
It’s a really good app but it does lag from time to time and also the delivery fees are kinda

In [29]:
# Summarization by word count

summary_by_word_count=summarize(sentences,word_count=30)
print(summary_by_word_count)

اسوء تطبيق توصيل ممكن تشوفه الاكل يوصل بارد و نص الاكل ماكول الله لا يسامحهم و توصيلهم اغلى من الطلب نفسه.
تطبيق من سيء الى اسواء … السعر اغلى … دايم يتاخر ودايم الاكل يجي بارد.


In [30]:
# Summarization when both ratio and word_count are given (word_count takes precedence)

summary=summarize(sentences, ratio=0.1, word_count=30)
print(summary)

اسوء تطبيق توصيل ممكن تشوفه الاكل يوصل بارد و نص الاكل ماكول الله لا يسامحهم و توصيلهم اغلى من الطلب نفسه.
تطبيق من سيء الى اسواء … السعر اغلى … دايم يتاخر ودايم الاكل يجي بارد.


References <br>
[1] https://towardsdatascience.com/arabic-nlp-unique-challenges-and-their-solutions-d99e8a87893d <br>
[2] https://www.semrush.com/blog/pagerank/ <br>
[3] https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html