<a href="https://www.kaggle.com/code/fayssalelansari/notebookc111a6c6b8?scriptVersionId=115511459" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### constants

In [2]:
enc = "UTF-8"

### model files

In [3]:
import enum

class Tag(enum.Enum):
    Neutral = "NEUTRAL"
    Positive = "POSITIVE"
    Negative = "NEGATIVE"
    Not_defined = "NOT_DEFINED"

class Tweet:
    def __init__(self, text, real_tag=Tag.Not_defined, given_tag=Tag.Not_defined):
        self.text = text
        self.real_tag = real_tag
        self.given_tag = given_tag


    def __str__(self):
        txt = self.text
        return txt

    def __repr__(self):
        txt = self.text 
        return txt

## Loading data
### Using our own method and model files
In this section we define our own function that loads the tweets by traversing the folders containing the positive and the negative tweets.
We should compare this method with simply reading the `.tsv` files, which should be faster.

In [4]:
import os
import pathlib

# # current_path = pathlib.Path(__file__).parent.resolve()
# POS_COUNT = 29848
# NEG_COUNT = 28901
# tweets = []

# def import_data():
#     for i in range(POS_COUNT):
#         text = ""
#         with open("/kaggle/input/arabic-sentiment-twitter-corpus/arabic_tweets/pos/" + str(i) + '.txt', encoding=enc) as f:
#             for line in f:
#                 text += line
#         tweets.append(Tweet(text, Tag.Positive))
#     for i in range(NEG_COUNT):
#         text = ""
#         with open("/kaggle/input/arabic-sentiment-twitter-corpus/arabic_tweets/neg/" + str(i) + '.txt', encoding=enc) as f:
#             for line in f:
#                 text += line
#         tweets.append(Tweet(text, Tag.Negative))
        
# import_data()
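
A more compact variant of the folder-traversal loader could look like the sketch below. It is only an illustration: `load_labelled_folder` is a hypothetical helper name, it returns plain `(label, text)` tuples rather than `Tweet` objects, and the demo uses a temporary directory instead of the Kaggle input folders.

```python
import tempfile
from pathlib import Path


def load_labelled_folder(folder, label, encoding="UTF-8"):
    """Read every .txt file in `folder` and pair its text with `label`."""
    rows = []
    # Sort for a stable order across runs (lexicographic by file name)
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append((label, path.read_text(encoding=encoding)))
    return rows


# Demo on a throwaway directory; on Kaggle the folder would be the pos/ or neg/ directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0.txt").write_text("first tweet", encoding="UTF-8")
    Path(tmp, "1.txt").write_text("second tweet", encoding="UTF-8")
    demo_rows = load_labelled_folder(tmp, "pos")

print(demo_rows)
```

Using `glob` avoids hard-coding the number of files per folder, at the cost of reading files in name order rather than numeric order.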

In [5]:
# print(tweets)

### Using our own method and reading from `tsv` files directly
As in the previous step, we want to populate a collection of tweets with our data.
On reflection, a table is a better fit than a list of `Tweet` objects, so we use a pandas DataFrame to represent the dataset.

In [6]:
column_names = ["sentiment", "content"]
train_tweets_positive = pd.read_table("/kaggle/input/arabic-sentiment-twitter-corpus/train_Arabic_tweets_positive_20190413.tsv", names=column_names)
train_tweets_negative = pd.read_table("/kaggle/input/arabic-sentiment-twitter-corpus/train_Arabic_tweets_negative_20190413.tsv", names=column_names)
test_tweets_positive = pd.read_table("/kaggle/input/arabic-sentiment-twitter-corpus/test_Arabic_tweets_positive_20190413.tsv", names=column_names)
test_tweets_negative = pd.read_table("/kaggle/input/arabic-sentiment-twitter-corpus/test_Arabic_tweets_negative_20190413.tsv", names=column_names)

In [7]:
X_train = pd.concat([train_tweets_positive,train_tweets_negative])

from IPython.display import display, HTML
display(X_train)

Unnamed: 0,sentiment,content
0,pos,نحن الذين يتحول كل ما نود أن نقوله إلى دعاء لل...
1,pos,وفي النهاية لن يبقىٰ معك آحدإلا من رأىٰ الجمال...
2,pos,من الخير نفسه 💛
3,pos,#زلزل_الملعب_نصرنا_بيلعب كن عالي الهمه ولا ترض...
4,pos,الشيء الوحيد الذي وصلوا فيه للعالمية هو : المس...
...,...,...
22509,neg,كيف ترى أورانوس لو كان يقع مكان القمر ؟ 💙💙 كوك...
22510,neg,احسدك على الايم 💔
22511,neg,لأول مرة ما بنكون سوا 💔
22512,neg,بقله ليش يا واطي 🤔


## Preprocessing dataset
Now we need to remove special characters

## Tokenizing dataset
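We need to represent the tweet texts as a document-term matrix, with one row per tweet and one column per token. `CountVectorizer` handles basic preprocessing itself: by default it lowercases the text and keeps only tokens of two or more word characters, which drops punctuation and emoji. It does not remove stop words unless the `stop_words` parameter is set.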
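
To see what the default tokenization keeps and drops, the sketch below applies the regular expression that `CountVectorizer` uses as its default `token_pattern`, using only the standard `re` module. `default_tokens` is a hypothetical helper that approximates the default analyzer (lowercasing followed by regex tokenization); it is not scikit-learn's own code.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Approximate CountVectorizer's default analyzer: lowercase, then regex tokenization."""
    return TOKEN_PATTERN.findall(text.lower())


english_tokens = default_tokens("The cat, the #Hat & a dog!")
arabic_tokens = default_tokens("من الخير نفسه 💛")

print(english_tokens)  # punctuation and the one-letter word are gone, "the" is kept
print(arabic_tokens)   # the emoji is gone, the common word "من" is kept
```

Punctuation, emoji and single-character tokens disappear, but frequent function words survive, so any stop-word filtering has to be configured explicitly.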

### Using CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.content)
X_train_counts.shape

(45275, 70856)

### Using TfidfTransformer

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(45275, 70856)

## Training a classifier
### Using naïve Bayes

In [10]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, X_train.sentiment)

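We now turn to the test set, combining the negative and positive tweets into one pandas DataFrame. We then call `transform` without calling `fit`, so that the test tweets are encoded with the vocabulary and IDF weights learned from the training set, and use the result to make predictions.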
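
The reason for not refitting on the test set can be illustrated with a toy count vectorizer written in plain Python. This is a simplified sketch, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical names, and tokens are split on whitespace only.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Assign a column index to every distinct token seen in the training documents."""
    vocab = {}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def transform(docs, vocab):
    """Count tokens per document using a fixed vocabulary; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc.split() if t in vocab)
        rows.append([counts.get(token, 0) for token in vocab])
    return rows


train_docs = ["good day", "bad day"]
test_docs = ["good good night"]

vocab = fit_vocabulary(train_docs)         # learned from training data only
test_rows = transform(test_docs, vocab)    # same columns as the training matrix

print(vocab)
print(test_rows)
```

Fitting again on the test documents would produce different columns, and the classifier's learned weights would no longer line up with the features.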

In [11]:
X_test = pd.concat([test_tweets_positive,test_tweets_negative])

X_test_counts = count_vect.transform(X_test.content)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predicted = clf.predict(X_test_tfidf)

# for tweet, sentiment in zip(X_test.content, predicted):
#     print('%r => %s' % (tweet, sentiment))

## Calculating Accuracy Score
Now we need the accuracy of our model as a percentage. We have a list of predicted sentiments and a list of actual sentiments. Whenever the predicted sentiment differs from the actual one, we increment a counter. After going through all the tweets, we divide the counter by the total number of tweets to get the fraction of wrong predictions, and subtract that from 100% to get the fraction of right predictions.

In [12]:
wrong_predictions = 0
validity_score = 0
for predicted_sentiment, actual_sentiment in zip(predicted, X_test.sentiment):
    if predicted_sentiment != actual_sentiment:
        wrong_predictions += 1
wrong_predictions_percentage = wrong_predictions / len(X_test.sentiment)
validity_score = 1 - wrong_predictions_percentage
print("validity score: " + str(validity_score*100) + "%")

validity score: 78.4375%
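
The same calculation can be wrapped in a small helper that also tallies errors per actual label, which shows whether one class is misclassified more often than the other. This is a sketch on made-up labels; `accuracy_report` is a hypothetical name, not part of scikit-learn.

```python
from collections import Counter


def accuracy_report(predicted, actual):
    """Return overall accuracy and the number of wrong predictions per actual label."""
    errors = Counter(a for p, a in zip(predicted, actual) if p != a)
    accuracy = 1 - sum(errors.values()) / len(actual)
    return accuracy, dict(errors)


demo_predicted = ["pos", "neg", "neg", "pos"]
demo_actual = ["pos", "pos", "neg", "neg"]

demo_accuracy, demo_errors = accuracy_report(demo_predicted, demo_actual)
print(demo_accuracy, demo_errors)
```

In the notebook it could be called as `accuracy_report(predicted, X_test.sentiment)`.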


For the time being we have a validity score of `78.4375%`, so we consider the prediction model poor. We suspect the cause is that the text is in Arabic and `CountVectorizer` is unable to preprocess and tokenize it correctly. We will try another vectorizer to see whether the validity score increases.

### Building a pipeline
To simplify our training and prediction process we will build a new Pipeline

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty=None,
                          alpha=1e-3, random_state=2,
                          max_iter=5, tol=None)),
])

### Using SGDClassifier

In [14]:
text_clf.fit(X_train.content, X_train.sentiment)

predicted = text_clf.predict(X_test.content)
np.mean(predicted == X_test.sentiment)

0.7395833333333334

Now we will try to lemmatize the tweets, as we believe that inflected word forms being treated as different tokens is why our prediction model isn't producing better results.
We could try the `Farasa` lemmatizer, as it has a good reputation for outperforming other lemmatizers. But it is used through an API and is not available as a directly importable library.
So for now we will stick with NLTK's `ISRIStemmer`.

In [15]:
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()
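
One way the stemmer could be plugged into the vectorizer is through `CountVectorizer`'s `tokenizer` parameter, as in the sketch below. It rests on assumptions: `stem_tokenize` is a hypothetical helper, tokens are split with a simple `\w\w+` regular expression, and the two sample sentences stand in for the real training data. Whether this improves the score is untested here.

```python
import re

from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import CountVectorizer

_stemmer = ISRIStemmer()
_token_re = re.compile(r"\w\w+")  # runs of two or more word characters


def stem_tokenize(text):
    """Split text into word tokens and stem each one with ISRIStemmer."""
    return [_stemmer.stem(token) for token in _token_re.findall(text)]


# token_pattern=None because the custom tokenizer replaces the default pattern
stem_vect = CountVectorizer(tokenizer=stem_tokenize, token_pattern=None, lowercase=False)

sample = ["الشيء الوحيد الذي وصلوا فيه", "من الخير نفسه"]
sample_counts = stem_vect.fit_transform(sample)
print(sample_counts.shape)
```

The same `stem_vect` could replace the `'vect'` step of a `Pipeline` so that training and test tweets are stemmed identically.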

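A stemmer is closely related to a lemmatizer: both reduce inflected words to a common base form, though a stemmer does so by stripping affixes rather than looking up dictionary forms. NLTK ships several stemmers, and we could try each one and compare the results.
For now we will try the `ARLSTem` stemmer.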

In [16]:
from nltk.stem.arlstem import ARLSTem

X_test

Unnamed: 0,sentiment,content
0,pos,#الهلال_الاهلي فوز هلالي مهم الحمد لله 💙 زوران...
1,pos,صباحك خيرات ومسرات 🌸
2,pos,#تأمل قال الله ﷻ :- _*​﴿بواد غير ذي زرع ﴾*_ 💫💫...
3,pos,😂😂 يا جدعان الرجاله اللي فوق ال دول خطر ع تويت...
4,pos,رساله صباحيه : 💛 اللهم اسألك التوفيق في جميع ا...
...,...,...
5763,neg,النوم وانت مكسور ده احساس غبي اللي هو مش قادر ...
5764,neg,استشهاد_الامام_كاظم_الغيظ السلام على المعذب في...
5765,neg,انا كنت اكل الصحن بكبره 😐
5766,neg,قولوا لي ايش تشوفوا .. مع ملاحظة التلطف لأنه ا...


## NLTK tokenization
We are still getting a very low validity score for our prediction model. It could be because the text is in Arabic and scikit-learn is unable to tokenize the words correctly, or because words that share the same root aren't considered the same token. I will try preprocessing the data first, turning each tweet's text into tokens.
A problem that could arise is losing the order of the words. I'm not sure whether NLTK scrambles the words or keeps them in their original order, which we need in order to use word n-grams later on.
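
On the word-order question: `wordpunct_tokenize` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, and regex matching scans the text from start to end, so tokens come back in their original order. The sketch below reproduces that pattern with the standard `re` module and builds word bigrams from the result; `tokenize` and `ngrams` are hypothetical helper names.

```python
import re

# Same regular expression that NLTK's WordPunctTokenizer is built on
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Return tokens in the order they appear in the text."""
    return WORDPUNCT.findall(text)


def ngrams(tokens, n):
    """Return consecutive n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("من الخير نفسه 💛")
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

Because the order is preserved, n-grams can be built either this way or by setting `ngram_range` on the vectorizer.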

In [17]:
from nltk.tokenize import wordpunct_tokenize

# Keep the tokens in a separate variable so the X_train DataFrame stays intact
X_train_tokens = X_train.content.apply(wordpunct_tokenize)
display(X_train_tokens)

0        [نحن, الذين, يتحول, كل, ما, نود, أن, نقوله, إل...
1        [وفي, النهاية, لن, يبقى, ٰ, معك, آحدإلا, من, ر...
2                                     [من, الخير, نفسه, 💛]
3        [#, زلزل_الملعب_نصرنا_بيلعب, كن, عالي, الهمه, ...
4        [الشيء, الوحيد, الذي, وصلوا, فيه, للعالمية, هو...
                               ...                        
22509    [كيف, ترى, أورانوس, لو, كان, يقع, مكان, القمر,...
22510                               [احسدك, على, الايم, 💔]
22511                       [لأول, مرة, ما, بنكون, سوا, 💔]
22512                             [بقله, ليش, يا, واطي, 🤔]
22513    [قد, طال, صبري, في, النوى, إذ, تركتني, كئيبا, ...
Name: content, Length: 45275, dtype: object

`nltk`'s tokenizer has no Arabic-specific handling (note the `#` and the stray diacritic `ٰ` split off as separate tokens above), so we will use a third-party library to tokenize our text.

In [18]:
# import tkseem as tk

# tkseem_tokenizer = tk.WordTokenizer()
# X_train.content.apply(tkseem_tokenizer.tokenize)

### Parameter tuning using grid search

In [19]:
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     'vect__ngram_range': [(1, 1), (1, 2)],
#     'tfidf__use_idf': (True, False),
#     'clf__alpha': (1e-2, 1e-3),
# }

# gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
# gs_clf = gs_clf.fit(X_train.content[:400], X_train.sentiment[:400])

## Extracting features from tweets data
After loading the data, the next step is to extract features from our tweet corpus.