                          Arabic and English Text Preprocessing 
Arabic Steps:

1. Regular expression cleanup

2. Word tokenization

3. Stopword removal

4. Stemming

    4.1 ISRIStemmer
    4.2 SnowballStemmer

5. Lemmatization (WordNetLemmatizer)

6. POS tagging

English Steps:

1. Regular expression cleanup

2. Word tokenization

3. Stopword removal

4. Stemming

    4.1 PorterStemmer
    4.2 SnowballStemmer

5. Lemmatization (WordNetLemmatizer)

6. POS tagging
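
The first three steps are the same for both languages. As a minimal sketch using only the standard library: the notebook itself relies on NLTK's `word_tokenize` and stopword lists, so the whitespace split and the tiny `STOP_WORDS` set here are stand-ins chosen for illustration, and `clean_tokenize_filter` is a hypothetical name.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses stopwords.words('english') from NLTK instead.
STOP_WORDS = {"the", "was", "by", "and", "of", "with"}

def clean_tokenize_filter(text, stop_words=STOP_WORDS):
    """Steps 1-3: strip punctuation, split into tokens, drop stopwords."""
    # Step 1: remove everything that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text)
    # Step 2: lowercase and split on whitespace (stand-in for word_tokenize).
    tokens = cleaned.lower().split()
    # Step 3: keep only tokens that are not stopwords.
    return [tok for tok in tokens if tok not in stop_words]

sentence = ("The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, "
            "and the team consisted of 8 people with the leader.")
tokens = clean_tokenize_filter(sentence)
print(tokens)
```

Steps 4 to 6 then operate on a token list like this one.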

##### Import Libraries

In [15]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer,ISRIStemmer,PorterStemmer,SnowballStemmer
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to C:\Users\Ahmed
[nltk_data]     Ashraf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Ahmed
[nltk_data]     Ashraf\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True
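
`word_tokenize` and `nltk.pos_tag` also need tokenizer and tagger data that the cell above does not download. A minimal sketch of a helper that fetches only what is missing follows. The names `ensure_nltk_resources` and `REQUIRED_RESOURCES` are assumptions for illustration, and newer NLTK releases may ask for differently named packages (the `LookupError` message says which one).

```python
# Maps the path that nltk.data.find() looks up to the package name that
# nltk.download() expects. Names follow the classic NLTK layout (assumption:
# newer releases may use other package names for the tokenizer and tagger).
REQUIRED_RESOURCES = {
    "tokenizers/punkt": "punkt",          # used by word_tokenize
    "corpora/stopwords": "stopwords",     # used by stopwords.words
    "corpora/wordnet": "wordnet",         # used by WordNetLemmatizer
    "taggers/averaged_perceptron_tagger": "averaged_perceptron_tagger",  # used by nltk.pos_tag
}

def ensure_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download each missing resource and return the list of packages fetched."""
    import nltk  # imported here so the mapping can be inspected without NLTK installed

    fetched = []
    for path, package in resources.items():
        try:
            # Raises LookupError when the resource is not on disk yet.
            nltk.data.find(path)
        except LookupError:
            nltk.download(package, quiet=True)
            fetched.append(package)
    return fetched
```

Calling `ensure_nltk_resources()` once before the preprocessing functions avoids a `LookupError` on a fresh machine.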

                                     Arabic
ISRIStemmer                                    

In [16]:
def Arabic_text_preprocessing(text,flage=1):
    '''
    Flage Number
    1: ISRIStemmer
    2: SnowballStemmer
    3: WordNetLemmatizer
    4: POS
    '''
    text1=text
    #The regular expression [^\w\s] is used to match any character that is not a word character (\w) or a whitespace character (\s).
    text = re.sub(r'[^\w\s]', '',text)
    #splitting sentence into tokens
    text=word_tokenize(text.lower())
    #remove stopwords
    arabic_stop=set(stopwords.words('arabic'))
    text=[i for i in text if i not in arabic_stop]
    #stemming of each word
    if flage<=3:
        if flage==1:
            stem=ISRIStemmer()
            text=[stem.stem(i) for i in text]
            print('Perform ISRStemmer')
        elif flage==2:
            stem=SnowballStemmer('arabic')
            text=[stem.stem(i) for i in text]
            print('Perform SnowballStemmer')
        else:
            #WordNetLemmatizer relies on the English WordNet, so Arabic tokens come back unchanged
            lemmatizer=WordNetLemmatizer()
            text=[lemmatizer.lemmatize(i) for i in text]
            print('Perform WordNetLemmatizer')
        print("text before :",text1)
        print('text after :',' '.join(text))
        print('text after tokens :',text)
    else:
        #part of speech of each word; nltk.pos_tag uses an English model by default, so tags for Arabic tokens are unreliable
        text=nltk.pos_tag(text)
        print('Preform POS')
        print("text before :",text1)
        print('text after tokens :',text)
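
In Python 3, `\w` is Unicode-aware for `str` patterns, so the cleanup regex in the function above keeps Arabic letters and digits and removes only punctuation. A small check, with sample strings made up for illustration and a hypothetical helper name `strip_punctuation`:

```python
import re

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

# The Arabic comma and question mark are removed; the letters are kept.
arabic = strip_punctuation("مرحبا، كيف الحال؟")
# The hyphen is removed too, so "Al-Sayed" collapses into a single token.
english = strip_punctuation("Al-Sayed Ali, the leader.")

print(arabic)   # مرحبا كيف الحال
print(english)  # AlSayed Ali the leader
```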

In [17]:
@Arabic_text_preprocessing?

Signature: Arabic_text_preprocessing(text, flage=1)
Docstring:
Flage Number
1: ISRIStemmer
2: SnowballStemmer
3: WordNetLemmatizer
4: POS
File:      c:\users\ahmed ashraf\appdata\local\temp\ipykernel_7408\3698319348.py
Type:      function

In [18]:
Arabic_text_preprocessing("ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر ",flage=1)

Perform ISRStemmer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر 
text after : يدر شرع خرج حمد شرف حمد سيد وكان تيم يتك 8 شخص يدر
text after tokens : ['يدر', 'شرع', 'خرج', 'حمد', 'شرف', 'حمد', 'سيد', 'وكان', 'تيم', 'يتك', '8', 'شخص', 'يدر']


In [19]:
input_text = input("Please enter the Arabic text you want to process: ")
Arabic_text_preprocessing(input_text,flage=1)

Please enter the Arabic text you want to process:  ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر


Perform ISRStemmer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر
text after : يدر شرع خرج حمد شرف حمد سيد وكان تيم يتك 8 شخص يدر
text after tokens : ['يدر', 'شرع', 'خرج', 'حمد', 'شرف', 'حمد', 'سيد', 'وكان', 'تيم', 'يتك', '8', 'شخص', 'يدر']


SnowballStemmer

In [20]:
Arabic_text_preprocessing("ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر ",flage=2)

Perform SnowballStemmer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر 
text after : ليدر مشروع تخرج احمد اشرف احمد سيد وكا تيم يتكو 8 اشخاص ليدر
text after tokens : ['ليدر', 'مشروع', 'تخرج', 'احمد', 'اشرف', 'احمد', 'سيد', 'وكا', 'تيم', 'يتكو', '8', 'اشخاص', 'ليدر']


In [21]:
input_text = input("Please enter the Arabic text you want to process: ")
Arabic_text_preprocessing(input_text,flage=2)

Please enter the Arabic text you want to process:  ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر


Perform SnowballStemmer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر
text after : ليدر مشروع تخرج احمد اشرف احمد سيد وكا تيم يتكو 8 اشخاص ليدر
text after tokens : ['ليدر', 'مشروع', 'تخرج', 'احمد', 'اشرف', 'احمد', 'سيد', 'وكا', 'تيم', 'يتكو', '8', 'اشخاص', 'ليدر']


WordNetLemmatizer

In [22]:
Arabic_text_preprocessing("ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر ",flage=3)

Perform WordNetLemmatizer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر 
text after : ليدر مشروع التخرج احمد اشرف احمد السيد وكان تيم يتكون 8 اشخاص الليدر
text after tokens : ['ليدر', 'مشروع', 'التخرج', 'احمد', 'اشرف', 'احمد', 'السيد', 'وكان', 'تيم', 'يتكون', '8', 'اشخاص', 'الليدر']


In [23]:
input_text = input("Please enter the Arabic text you want to process: ")
Arabic_text_preprocessing(input_text,flage=3)

Please enter the Arabic text you want to process:  ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر


Perform WordNetLemmatizer
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر
text after : ليدر مشروع التخرج احمد اشرف احمد السيد وكان تيم يتكون 8 اشخاص الليدر
text after tokens : ['ليدر', 'مشروع', 'التخرج', 'احمد', 'اشرف', 'احمد', 'السيد', 'وكان', 'تيم', 'يتكون', '8', 'اشخاص', 'الليدر']
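
The lemmatizer output above is identical to the stopword-filtered input, while the ISRIStemmer output shown earlier differs for most tokens. A minimal sketch that makes this visible by listing which tokens a step actually changed. The token lists are copied from the recorded outputs in this notebook, the three dropped stopwords are inferred from those outputs, and `changed_tokens` is a hypothetical helper.

```python
def changed_tokens(before, after):
    """Return (old, new) pairs for positions where a step altered the token."""
    return [(old, new) for old, new in zip(before, after) if old != new]

sentence = "ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر"
# Stopwords inferred from the recorded output (assumption, not the full NLTK list).
dropped = {"على", "من", "مع"}
filtered = [tok for tok in sentence.split() if tok not in dropped]

# Copied from the recorded ISRIStemmer output.
isri = ['يدر', 'شرع', 'خرج', 'حمد', 'شرف', 'حمد', 'سيد', 'وكان', 'تيم', 'يتك', '8', 'شخص', 'يدر']
# Copied from the recorded WordNetLemmatizer output.
lemmatized = ['ليدر', 'مشروع', 'التخرج', 'احمد', 'اشرف', 'احمد', 'السيد', 'وكان', 'تيم', 'يتكون', '8', 'اشخاص', 'الليدر']

print(len(changed_tokens(filtered, isri)))    # 10
print(changed_tokens(filtered, lemmatized))   # []
```

An empty list for the lemmatizer confirms that this step has no effect on the Arabic sentence.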


POS

In [24]:
Arabic_text_preprocessing("ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر ",flage=4)

Preform POS
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر 
text after tokens : [('ليدر', 'JJ'), ('مشروع', 'NNP'), ('التخرج', 'NNP'), ('احمد', 'NNP'), ('اشرف', 'NNP'), ('احمد', 'NNP'), ('السيد', 'NNP'), ('وكان', 'NNP'), ('تيم', 'NNP'), ('يتكون', 'VBD'), ('8', 'CD'), ('اشخاص', 'NNS'), ('الليدر', 'VBP')]


In [25]:
input_text = input("Please enter the Arabic text you want to process: ")
Arabic_text_preprocessing(input_text,flage=4)

Please enter the Arabic text you want to process:  ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر


Preform POS
text before : ليدر مشروع التخرج احمد اشرف احمد السيد على وكان تيم يتكون من 8 اشخاص مع الليدر
text after tokens : [('ليدر', 'JJ'), ('مشروع', 'NNP'), ('التخرج', 'NNP'), ('احمد', 'NNP'), ('اشرف', 'NNP'), ('احمد', 'NNP'), ('السيد', 'NNP'), ('وكان', 'NNP'), ('تيم', 'NNP'), ('يتكون', 'VBD'), ('8', 'CD'), ('اشخاص', 'NNS'), ('الليدر', 'VBP')]


                                    English
PorterStemmer                                   

In [26]:
def English_text_preprocessing(text,flage=1):
    '''
    Flage Number
    1: PorterStemmer
    2: SnowballStemmer
    3: WordNetLemmatizer
    4: POS
    '''
    text1=text
    #The regular expression [^\w\s] is used to match any character that is not a word character (\w) or a whitespace character (\s).
    text = re.sub(r'[^\w\s]', '',text)
    #splitting sentence into tokens
    text=word_tokenize(text.lower())
    #remove stopwords
    english_stop=set(stopwords.words('english'))
    text=[i for i in text if i not in english_stop]
    #stemming of each word
    if flage<=3:
        if flage==1:
            stem=PorterStemmer()
            text=[stem.stem(i) for i in text]
            print('Perform PorterStemmer')
        elif flage==2:
            stem=SnowballStemmer('english')
            text=[stem.stem(i) for i in text]
            print('Perform SnowballStemmer')
        else:
            #lemmatize each word; lemmatize() treats words as nouns by default, so verbs such as 'consisted' stay unchanged
            lemmatizer=WordNetLemmatizer()
            text=[lemmatizer.lemmatize(i) for i in text]
            print('Perform WordNetLemmatizer')
        print("text before :",text1)
        print('text after :',' '.join(text))
        print('text after tokens :',text)
    else:
        #part of speech of each word
        text=nltk.pos_tag(text)
        print('Preform POS')
        print("text before :",text1)
        print('text after tokens :',text)
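
`Arabic_text_preprocessing` and `English_text_preprocessing` differ only in the stopword language and the stemmers they select. One possible refactoring, sketched under assumptions: a single function that takes the stopword set and a `flage`-to-step mapping as arguments and returns the result instead of printing it. The name `preprocess`, its parameters, and the toy steps used in the demo are hypothetical.

```python
import re

def preprocess(text, stop_words, steps, flage=1, tokenize=str.split):
    """Clean, tokenize and filter `text`, then apply the step selected by `flage`.

    `steps` maps a flage number to a function that takes the token list
    and returns the processed result.
    """
    if flage not in steps:
        raise ValueError(f"unknown flage {flage!r}; expected one of {sorted(steps)}")
    # Same cleanup regex as the notebook: drop non-word, non-space characters.
    cleaned = re.sub(r"[^\w\s]", "", text)
    tokens = [tok for tok in tokenize(cleaned.lower()) if tok not in stop_words]
    return steps[flage](tokens)

# Toy configuration so the sketch runs without NLTK: a crude 5-character
# truncation stands in for a stemmer, and flage 2 returns the tokens untouched.
toy_steps = {
    1: lambda tokens: [tok[:5] for tok in tokens],
    2: lambda tokens: list(tokens),
}
result = preprocess("The team consisted of 8 people.", {"the", "of"}, toy_steps, flage=1)
print(result)  # ['team', 'consi', '8', 'peopl']
```

With NLTK, the English configuration could be built from a stemmer instance such as `porter = PorterStemmer()` and a mapping like `{1: lambda toks: [porter.stem(t) for t in toks], 4: nltk.pos_tag}`, passing `tokenize=word_tokenize` and `stop_words=set(stopwords.words('english'))`.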

In [27]:
English_text_preprocessing("The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.",flage=1)

Perform PorterStemmer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduat project led ahm ashraf ahm alsay ali team consist 8 peopl leader
text after tokens : ['graduat', 'project', 'led', 'ahm', 'ashraf', 'ahm', 'alsay', 'ali', 'team', 'consist', '8', 'peopl', 'leader']


In [28]:
input_text = input("Please enter the English text you want to process: ")
English_text_preprocessing(input_text,flage=1)

Please enter the English text you want to process:  The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.


Perform PorterStemmer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduat project led ahm ashraf ahm alsay ali team consist 8 peopl leader
text after tokens : ['graduat', 'project', 'led', 'ahm', 'ashraf', 'ahm', 'alsay', 'ali', 'team', 'consist', '8', 'peopl', 'leader']


SnowballStemmer

In [29]:
English_text_preprocessing("The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.",flage=2)

Perform SnowballStemmer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduat project led ahm ashraf ahm alsay ali team consist 8 peopl leader
text after tokens : ['graduat', 'project', 'led', 'ahm', 'ashraf', 'ahm', 'alsay', 'ali', 'team', 'consist', '8', 'peopl', 'leader']


In [30]:
input_text = input("Please enter the English text you want to process: ")
English_text_preprocessing(input_text,flage=2)

Please enter the English text you want to process:  The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.


Perform SnowballStemmer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduat project led ahm ashraf ahm alsay ali team consist 8 peopl leader
text after tokens : ['graduat', 'project', 'led', 'ahm', 'ashraf', 'ahm', 'alsay', 'ali', 'team', 'consist', '8', 'peopl', 'leader']


WordNetLemmatizer

In [31]:
English_text_preprocessing("The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.",flage=3)

Perform WordNetLemmatizer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduation project led ahmed ashraf ahmed alsayed ali team consisted 8 people leader
text after tokens : ['graduation', 'project', 'led', 'ahmed', 'ashraf', 'ahmed', 'alsayed', 'ali', 'team', 'consisted', '8', 'people', 'leader']


In [32]:
input_text = input("Please enter the English text you want to process: ")
English_text_preprocessing(input_text,flage=3)

Please enter the English text you want to process:  The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.


Perform WordNetLemmatizer
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after : graduation project led ahmed ashraf ahmed alsayed ali team consisted 8 people leader
text after tokens : ['graduation', 'project', 'led', 'ahmed', 'ashraf', 'ahmed', 'alsayed', 'ali', 'team', 'consisted', '8', 'people', 'leader']


POS

In [33]:
English_text_preprocessing("The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.",flage=4)

Preform POS
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after tokens : [('graduation', 'NN'), ('project', 'NN'), ('led', 'VBD'), ('ahmed', 'JJ'), ('ashraf', 'NN'), ('ahmed', 'VBD'), ('alsayed', 'JJ'), ('ali', 'NN'), ('team', 'NN'), ('consisted', 'VBD'), ('8', 'CD'), ('people', 'NNS'), ('leader', 'NN')]


In [34]:
input_text = input("Please enter the English text you want to process: ")
English_text_preprocessing(input_text,flage=4)

Please enter the English text you want to process:  The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.


Preform POS
text before : The graduation project was led by Ahmed Ashraf Ahmed Al-Sayed Ali, and the team consisted of 8 people with the leader.
text after tokens : [('graduation', 'NN'), ('project', 'NN'), ('led', 'VBD'), ('ahmed', 'JJ'), ('ashraf', 'NN'), ('ahmed', 'VBD'), ('alsayed', 'JJ'), ('ali', 'NN'), ('team', 'NN'), ('consisted', 'VBD'), ('8', 'CD'), ('people', 'NNS'), ('leader', 'NN')]
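
In the output above, 'ahmed' is tagged once as JJ and once as VBD. One likely contributing factor is that the tagger receives lowercased tokens with the stopwords already removed, so it loses capitalization and sentence context. A possible alternative order, sketched with a hypothetical helper `tag_then_filter` and a stand-in tagger so the block runs without NLTK: tag the original tokens first, then drop the stopwords.

```python
def tag_then_filter(tokens, tagger, stop_words):
    """Tag the untouched tokens first, then drop stopwords from the tagged pairs."""
    tagged = tagger(tokens)
    return [(word, tag) for word, tag in tagged if word.lower() not in stop_words]

def fake_tagger(tokens):
    """Stand-in for nltk.pos_tag: capitalized words become NNP, the rest NN."""
    return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]

pairs = tag_then_filter(["The", "team", "met", "Ahmed"], fake_tagger, {"the"})
print(pairs)  # [('team', 'NN'), ('met', 'NN'), ('Ahmed', 'NNP')]
```

With NLTK this could be called as `tag_then_filter(word_tokenize(text), nltk.pos_tag, set(stopwords.words('english')))`; whether the tags improve for a given sentence would need to be checked.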


###### Thanks
Leader.𝔄𝔥𝔪𝔢𝔡 𝔄𝔰𝔥𝔯𝔞𝔣
