# CAMeL Tools Install

The following steps are needed if you want to run the examples in this notebook on Google Colaboratory. If you want to run this notebook on your own machine, please follow the [installation instructions](https://camel-tools.readthedocs.io/en/latest/getting_started.html#installation) instead.

First, we install the CAMeL Tools Python package.

In [None]:
%pip install camel-tools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting camel-tools
  Downloading camel_tools-1.5.2-py3-none-any.whl (124 kB)
Collecting docopt (from camel-tools)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting dill (from camel-tools)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting transformers>=3.0.2 (from camel-tools)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting emoji (from camel-tools)
  Downloading emoji-2.5.1.tar.gz (356 kB)

To use all the components provided in CAMeL Tools, we need to install the datasets they require.
In Colab, this means first creating a directory where the data will be installed. The cell below creates `/camel_tools`, which lives on the Colab virtual machine's local disk; to keep the data on Google Drive between sessions, mount the drive first and use a directory under the mount point instead.

Run the code below.

In [None]:
import os

%mkdir /camel_tools

Next, we need to tell CAMeL Tools to install the data in the newly created directory. This will take a couple of minutes to complete.

**NOTE:** You will need at least 2.3GB of available space on your Google Drive to install all the CAMeL Tools data.

In [None]:
os.environ['CAMELTOOLS_DATA'] = '/camel_tools'

!export | camel_data -i all

The following packages will be installed: 'morphology-db-msa-s31', 'disambig-ranking-cache-calima-msa-r13', 'disambig-bert-unfactored-glf', 'morphology-db-msa-r13', 'dialectid-model26', 'morphology-db-glf-01', 'sentiment-analysis-mbert', 'disambig-ranking-cache-calima-egy-r13', 'disambig-ranking-cache-calima-glf-01', 'disambig-bert-unfactored-msa', 'dialectid-model6', 'disambig-mle-calima-egy-r13', 'disambig-mle-calima-msa-r13', 'morphology-db-egy-r13', 'morphology-db-lev-01', 'disambig-bert-unfactored-egy', 'disambig-ranking-cache-calima-lev-01', 'sentiment-analysis-arabert', 'ner-arabert', 'disambig-bert-unfactored-lev'
Downloading package 'morphology-db-msa-s31': 100% 44.8M/44.8M [00:00<00:00, 67.7MB/s]
Extracting package 'morphology-db-msa-s31': 100% 44.8M/44.8M [00:00<00:00, 508MB/s]
Downloading package 'disambig-ranking-cache-calima-msa-r13': 100% 556M/556M [00:09<00:00, 58.3MB/s]
Extracting package 'disambig-ranking-cache-calima-msa-r13': 100% 556M/556M [00:03<00:00, 145MB/s]

We also provide a lightweight dataset for the Morphology and Disambiguation components **only** that can be installed by calling `camel_data -i light` instead of `camel_data -i all`.
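
Before calling `camel_data`, it can help to confirm that `CAMELTOOLS_DATA` actually points at an existing directory. The following is a minimal sketch using only the standard library; the helper name `resolve_camel_data_dir` is hypothetical and not part of CAMeL Tools.

```python
import os
from pathlib import Path


def resolve_camel_data_dir(env=os.environ):
    """Return the directory named by CAMELTOOLS_DATA, or None if unset or missing.

    Hypothetical helper for illustration; CAMeL Tools does not provide it.
    """
    value = env.get("CAMELTOOLS_DATA")
    if not value:
        return None
    path = Path(value)
    return path if path.is_dir() else None


data_dir = resolve_camel_data_dir()
if data_dir is None:
    print("CAMELTOOLS_DATA is unset or does not point to a directory")
else:
    print(f"CAMeL Tools data directory: {data_dir}")
```

Running this check right after setting the environment variable catches a mistyped path before a multi-gigabyte download starts.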


**Once the data has been installed on your Google Drive, you only need to run the following the next time you want to run this notebook.**

In [None]:
!camel_data -i all

No new packages will be installed.


In [None]:
%pip install camel-tools

import os

os.environ['CAMELTOOLS_DATA'] = '/camel_tools'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# INSTALL AND IMPORT

In [None]:
!pip install emoji aaransia langdetect deep-translator tashaphyne

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emoji
  Downloading emoji-2.5.1.tar.gz (356 kB)
  Preparing metadata (setup.py) ... done
Collecting aaransia
  Downloading aaransia-1.1.tar.gz (49 kB)
  Preparing metadata (setup.py) ... done
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting deep-translator
  Downloading deep_translator-1.11.1-py3-none-any.whl (37 kB)
Collecting tashaphyne
  Downloading Tashaphyne-0.3.6-py3-none-any.whl (251 kB)

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

''

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import re
import numpy as np
import unicodedata
import emoji
import nltk
from emoji import EMOJI_DATA
from aaransia import transliterate, SourceLanguageError
from langdetect import detect
from deep_translator import GoogleTranslator
from bs4 import BeautifulSoup
import string

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# classification data

In [None]:
FILE_NAME="FinalBalanced.csv"

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/'+FILE_NAME) #Read File

In [None]:
df=df.dropna() #drop null values

In [None]:
df.text=df.text.astype(str) #set type of all rows to string

In [None]:
df = df.drop_duplicates() #drop same rows

In [None]:
df.head()

Unnamed: 0,sentiment,text
0,positive,♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️...
1,positive,franca harmet edjdadna wa waldina wa mine ba3d...
2,positive,الله يبارك ،هدا الخير ويقولوا الجزايري ماشي خد...
3,positive,bonjour doc que ce soit sur les sites françai...
4,positive,dido الدوله حبيبي حاله ازمه ماليه الازمه باقي ...


In [None]:
df.sentiment.value_counts()

positive    7304
negative    7304
neutral     7304
Name: sentiment, dtype: int64

# PREPROCESS

In [None]:
# Remove Arabic stop words
# Load the list once and store it as a set so each lookup is fast
ARABIC_STOP_WORDS = set(nltk.corpus.stopwords.words('arabic'))

def remove_stp_words(text):
    words = text.split(" ")
    return ' '.join(word for word in words if word not in ARABIC_STOP_WORDS)
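
To see what stop-word filtering does without downloading the NLTK corpus, here is a small self-contained sketch. The function name `remove_words` and the three-word stop list are illustrative assumptions, not the NLTK Arabic list. Unlike `split(" ")`, it uses `split()`, which also collapses repeated whitespace.

```python
def remove_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    stop_set = set(stop_words)  # set membership tests are O(1) on average
    return " ".join(word for word in text.split() if word not in stop_set)


# Illustrative subset only; the notebook uses nltk.corpus.stopwords.words('arabic')
sample_stop_words = ["في", "من", "على"]
print(remove_words("الكتاب على الطاولة في البيت", sample_stop_words))
# → الكتاب الطاولة البيت
```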

In [None]:
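# Check whether a single character is a known emoji
def is_emoji(s):
    return s in EMOJI_DATA

# Insert a space before every emoji character so that each emoji becomes its own token
def add_space_emojis(text):
    result = ''
    for char in text:
        if is_emoji(char):
            result += ' '
        result += char
    return result.strip()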
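
The effect of this spacing step is easiest to see on a short string. The sketch below mirrors the loop but takes the emoji set as a parameter; `add_space_before` and the two-element `sample_emojis` set are stand-ins for illustration, whereas the notebook checks membership in `EMOJI_DATA` from the `emoji` package.

```python
def add_space_before(text, emoji_chars):
    """Insert a space before each character found in emoji_chars."""
    pieces = []
    for char in text:
        if char in emoji_chars:
            pieces.append(" ")
        pieces.append(char)
    return "".join(pieces).strip()


# Stand-in for EMOJI_DATA (assumption: only these two characters count as emoji)
sample_emojis = {"😀", "♥"}
print(add_space_before("top😀😀", sample_emojis))
# → top 😀 😀
```

Because the loop works character by character, a variation selector such as U+FE0F that follows a heart stays attached to it, and the space lands before the base character.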

In [None]:
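# Collapse any run of three or more identical word characters to a single character
# (the parameter is named `text` so it does not shadow the imported `string` module)
def remove_consecutive_duplicates(text):
    pattern = r'(\w)\1{2,}'
    return re.sub(pattern, r'\1', text)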
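
A few sample inputs make the behavior of the pattern `(\w)\1{2,}` concrete: runs of three or more identical word characters shrink to one, while pairs and punctuation are left alone. This is a standalone sketch; `collapse_runs` is just an illustrative name for the same substitution.

```python
import re


def collapse_runs(text):
    """Replace any run of 3+ identical word characters with a single one."""
    return re.sub(r'(\w)\1{2,}', r'\1', text)


for sample in ["coooool", "cool", "هههههه", "!!!!"]:
    print(sample, "->", collapse_runs(sample))
# coooool -> col
# cool -> cool
# هههههه -> ه
# !!!! -> !!!!
```

Note that a legitimate double letter survives, but a stretched word such as "coooool" ends up with one "o" rather than two.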

In [None]:
punctuation_to_remove = string.punctuation.replace('?', '') # all ASCII punctuation except '?'

# Define first level of preprocessing
def data_cleaning(text):

    #to string
    text=str(text)

    # Normalize unicode encoding
    text = unicodedata.normalize('NFC', text)

    # Collapse runs of whitespace into single spaces and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove HTML
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Remove URLs (must run before punctuation is stripped, otherwise '://' and '.'
    # are already gone and the pattern never matches)
    text = re.sub(r'https?://\S+', '', text)

    # Remove '@name' mentions, including one at the very end of the text
    text = re.sub(r'@\S+', ' ', text)

    # Remove all punctuation except the question mark ?
    text = re.sub(f'[{re.escape(punctuation_to_remove)}]', '', text)

    # Add whitespace before and after question marks
    text = re.sub(r'(\?|؟)', r' \1 ', text)

    # Convert text to lowercases
    text = text.lower()

    # Remove tashkeel (diacritics) and tatweel
    noise = re.compile(""" ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)
    text = re.sub(noise, '', text)

    # #remove repetetions
    # text = text.replace('وو', 'و')
    # text = text.replace('يي', 'ي')
    # text = text.replace('اا', 'ا')

    # Normalize Arabic letter variants (alef forms, alef maqsura, hamza carriers, ta marbuta)
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)

    # Collapse runs of 3+ repeated characters to one (pairs are left unchanged)
    text = remove_consecutive_duplicates(text)

    #remove stop words
    text = remove_stp_words(text)

    #add space between emojis
    text = add_space_emojis(text)

    return text
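
Step order matters in a cleaning pipeline like this one. As a minimal sketch using only the standard library (both function names are illustrative), the example compares stripping URLs before punctuation with doing it afterwards. Once `:`, `/` and `.` are removed, the URL pattern has nothing left to match.

```python
import re
import string

# Character class matching any ASCII punctuation character
PUNCT = f"[{re.escape(string.punctuation)}]"


def strip_urls_then_punct(text):
    text = re.sub(r"https?://\S+", "", text)  # URL is still intact here
    text = re.sub(PUNCT, "", text)
    return re.sub(r"\s+", " ", text).strip()


def strip_punct_then_urls(text):
    text = re.sub(PUNCT, "", text)  # destroys '://' and '.'
    text = re.sub(r"https?://\S+", "", text)  # no longer matches anything
    return re.sub(r"\s+", " ", text).strip()


sample = "see https://example.com/a now!"
print(strip_urls_then_punct(sample))  # → see now
print(strip_punct_then_urls(sample))  # → see httpsexamplecoma now
```

The same reasoning applies to `@name` mentions and `&amp;` entities, since `@`, `&` and `;` are all in `string.punctuation`.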

In [None]:
from langdetect import detect
from deep_translator import GoogleTranslator

# Detect the language of each word, translate French/English words to Arabic,
# and transliterate every other word to Arabic script with aaransia (source 'al')
def to_arab(text):
    text_words = []
    words = text.split(" ")
    for word in words:
        try:
            result_lang = detect(word)
            if result_lang in ("fr", "en"):
                word = GoogleTranslator(source=result_lang, target="ar").translate(text=word)
            else:
                word = transliterate(word, source='al', target='ar', universal=True)
        except Exception:
            # Detection, translation, or transliteration failed
            # (for example on an emoji or an empty token): keep the word unchanged
            pass
        text_words.append(word)

    return ' '.join(text_words)
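
Word-by-word detection and translation repeats the same work for every recurring word, and translation involves a network call. One possible mitigation, sketched below under stated assumptions, is to cache the per-word result with `functools.lru_cache`. The names `make_word_converter`, `convert_text` and the three injected callables are hypothetical; in the notebook they would wrap `detect`, `GoogleTranslator(...).translate` and `transliterate`. The stubs here only demonstrate the caching behavior.

```python
from functools import lru_cache


def make_word_converter(detect_lang, translate, transliterate_word):
    """Build a cached word -> Arabic converter from three injected callables."""

    @lru_cache(maxsize=None)
    def convert(word):
        try:
            lang = detect_lang(word)
            if lang in ("fr", "en"):
                return translate(word, lang)
            return transliterate_word(word)
        except Exception:
            return word  # on any failure, keep the word unchanged

    return convert


def convert_text(text, convert):
    return " ".join(convert(word) for word in text.split(" "))


# Stubs standing in for the real libraries (assumptions for illustration only)
detect_calls = []

def fake_detect(word):
    detect_calls.append(word)
    return "fr" if word == "bonjour" else "xx"

convert = make_word_converter(
    fake_detect,
    lambda word, lang: f"<{lang}:{word}>",
    lambda word: word.upper(),
)
print(convert_text("bonjour salam bonjour", convert))
# → <fr:bonjour> SALAM <fr:bonjour>
print(detect_calls)
# → ['bonjour', 'salam']
```

The second "bonjour" is served from the cache, so the detector is called only twice for three words.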

In [None]:
from pyarabic.araby import is_arabicrange, strip_tashkeel
from tashaphyne.stemming import ArabicLightStemmer
from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

def is_MSA(text):
    # Despite the name, this returns True for any text whose predicted region is not 'Maghreb'
    predictions = did.predict([text], 'region')
    return predictions[0].top != 'Maghreb'

def stemming(text):
    stemmer = ArabicLightStemmer()
    words = text.split()
    stemmed_words = []

    for word in words:
        if is_MSA(word):
            # Stem only words that are not identified as Maghrebi dialect
            stemmed_words.append(stemmer.light_stem(word))
        else:
            stemmed_words.append(word)

    # Join once, after the loop has processed every word
    text = ' '.join(stemmed_words)
    return remove_stp_words(text)
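
Calling the dialect identifier once per word is likely to be slow on a corpus of this size. Since `predict` accepts a list, one option is to classify all words of a text in a single call. The sketch below shows the idea with injected callables; `stem_non_maghrebi`, `predict_regions` and `stem_word` are hypothetical names, and the stubs replace the real model and stemmer so the example runs on its own.

```python
def stem_non_maghrebi(text, predict_regions, stem_word):
    """Stem every word whose predicted region is not 'Maghreb'.

    predict_regions takes a list of words and returns one region label per word.
    """
    words = text.split()
    if not words:
        return ""
    regions = predict_regions(words)  # one batched call instead of one per word
    return " ".join(
        word if region == "Maghreb" else stem_word(word)
        for word, region in zip(words, regions)
    )


# Stubs for illustration only: words starting with 'm' count as Maghrebi,
# and "stemming" just drops the last character
def fake_regions(words):
    return ["Maghreb" if w.startswith("m") else "Other" for w in words]

def fake_stem(word):
    return word[:-1]

print(stem_non_maghrebi("mab cde", fake_regions, fake_stem))
# → mab cd
```

With the objects defined in this notebook, `predict_regions` could plausibly be `lambda ws: [p.top for p in did.predict(ws, 'region')]` and `stem_word` could be `ArabicLightStemmer().light_stem`; whether batching changes throughput noticeably would need to be measured.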

# Save

Data cleaning

In [None]:
df1=pd.DataFrame()
df1['sentiment']=df['sentiment']
df1['text']=df.text.apply(data_cleaning)
df1.head()

Unnamed: 0,sentiment,text
0,positive,♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️...
1,positive,franca harmet edjdadna wa waldina wa mine ba3d...
2,positive,الله يبارك ،هدا الخير ويقولوا الجزايري ماشي خد...
3,positive,bonjour doc que ce soit sur les sites français...
4,positive,dido الدوله حبيبي حاله ازمه ماليه الازمه باقي ...


In [None]:
df1.to_csv("/content/drive/MyDrive/Colab Notebooks/preprocess1"+FILE_NAME, encoding='utf-8',index=False)

Conversion to Arabic

In [None]:
df2=pd.DataFrame()
df2['sentiment']=df1['sentiment']
df2['text']=df1.text.apply(to_arab)
df2.head()

Unnamed: 0,sentiment,text
0,positive,♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️...
1,positive,فرانكا هارمت دجدادنا وا والدينا وا مين باعد ول...
2,positive,اله يبارك ،هدا الخير ويقولوا الجزايري ماشي خدا...
3,positive,بونجور دوك قو ك سويت على ال الموقع (المواقع فر...
4,positive,ديدو الدوله حبيبي حاله ازمه ماليه الازمه باقي ...


In [None]:
df2.to_csv("/content/drive/MyDrive/Colab Notebooks/preprocess2"+FILE_NAME, encoding='utf-8',index=False)

Stemming

In [None]:
df3=pd.DataFrame()
df3['sentiment']=df2['sentiment']
df3['text']=df2.text.apply(stemming)
df3.head()

Unnamed: 0,sentiment,text
0,positive,♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥️ ♥...
1,positive,رانك هارم دجداد دين مي اعد اد هرام خيي هر قاو ...
2,positive,بار ،هد خير قول جزاير ماش خدامل مشكل كيفاش تصل...
3,positive,ونجور دوك قو سوي ال موقع (المواقع رنس جزائري و...
4,positive,ديد دوله حال زم ماليه الازمه اق عالم شراء شعب ...


In [None]:
df3.to_csv("/content/drive/MyDrive/Colab Notebooks/preprocess3"+FILE_NAME, encoding='utf-8',index=False)
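
The three save cells build their output paths by string concatenation. As a small sketch, a helper based on `pathlib` keeps the naming scheme in one place; `stage_path` and `OUTPUT_DIR` are hypothetical names, and the directory is the one used in the cells above.

```python
from pathlib import Path

# Same Drive folder the notebook writes to
OUTPUT_DIR = Path("/content/drive/MyDrive/Colab Notebooks")


def stage_path(stage, file_name, output_dir=OUTPUT_DIR):
    """Return the output path for a preprocessing stage, e.g. preprocess1<file_name>."""
    return output_dir / f"preprocess{stage}{file_name}"


print(stage_path(1, "FinalBalanced.csv").name)
# → preprocess1FinalBalanced.csv
```

A call such as `df1.to_csv(stage_path(1, FILE_NAME), encoding='utf-8', index=False)` would then replace the concatenated string.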