In [1]:
import sys
import re
import pandas as pd
import numpy as np
reload(sys)
sys.setdefaultencoding('utf-8')

Import the Arabic text dataset. It has already been cleaned of Franco-Arabic (Arabic written in Latin letters) and English text. Neither adds much weight or strength to my model; I think they act more like noise in this dataset.

So I decided to get rid of them and work only with Arabic text.

In [2]:
arabic_text = pd.read_excel("all_text2.xlsx")
arabic_text_list = arabic_text["text"].values.astype(str).tolist()

# Arabic stop words, loaded into a real set for fast membership tests
arbic_stop_words = pd.read_csv('stopwords-ar.txt', sep=" ", header=None)
arbic_stop_words_set = set(arbic_stop_words[0].values.astype(str).tolist())

arabic_text.head()

Unnamed: 0,text,sentiment
0,ياجماعة البيع بسعر الجملة كاوتش سيارات جميع ال...,0.0
1,هو مش العروض دي طول شهر رمضان ازاااي التوكيل ي...,0.0
2,لوسمحتو سعر الفوليستير كام,0.0
3,H IX35 2000 بكام واقرب مكان الى ملوى للشراء,0.0
5,مرسى جدا لللآهتمام,0.0


## Text Cleaning and Preprocessing

A helper function that removes empty lists (and empty strings) from a list of lists.

__Word2Vec Model__ expects single sentences, each one as a list of words. In other words, the input format is a list of lists.

In [3]:
def remove_empty(l):
    return tuple(filter(lambda x:not isinstance(x, (str, list, tuple)) or x,
           (remove_empty(x) if isinstance(x, (tuple, list)) else x for x in l)))
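
A quick check of what `remove_empty` does, assuming a small hand-made nested list (the sample tokens are hypothetical, not taken from the dataset). Note that it returns a tuple of tuples rather than a list of lists.

```python
def remove_empty(l):
    # Same helper as in the cell above, repeated so this snippet runs on its own
    return tuple(filter(lambda x: not isinstance(x, (str, list, tuple)) or x,
           (remove_empty(x) if isinstance(x, (tuple, list)) else x for x in l)))

# Hypothetical tokenized comments: one empty sentence and one empty token
sentences = [["سعر", "السيارة"], [], ["شكرا", ""]]
cleaned = remove_empty(sentences)

# The result is a tuple of tuples; convert back if lists are required
as_lists = [list(sentence) for sentence in cleaned]
print([len(sentence) for sentence in as_lists])
```

The empty sentence and the empty token are both dropped, so only non-empty sentences of non-empty words remain.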



First, clean the training-set corpus.

In [4]:
cleaned_arabictext_train = []
for i in range(int(len(arabic_text_list)-len(arabic_text_list)*0.4)): # train on 60% of the Corpus
    #posttext = txt.decode('utf-8')
    match = re.sub(r"http\S+"," ", arabic_text_list[i])  # remove http links
    match = re.sub(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", " ", match) # remove hashtags
    match = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', " ", match) # remove numbers
    match = re.sub(r"(?:[a-z][a-z'\-_]+[a-z])", " ", match) # remove lowercase Latin words (may contain - ' _)
    match = re.sub(r'(?:@[\w_]+)', " ", match)# remove @-mentions
    emotions_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
    # re.VERBOSE is needed so the whitespace and comments inside the pattern are ignored
    match = re.sub(emotions_str, " ", match, flags=re.VERBOSE) # remove emoticons
    match = match.replace('"', " ") # remove double quotes
    #words = [w for w in match if not w in arbic_stop_words_set] # remove arabic stop words
    match = match.lower().split() # tokenize to words
    match = [x for x in match if x] # drop empty tokens
    match = remove_empty(match)
    cleaned_arabictext_train.append(match)


Then, clean the test-set corpus.

In [5]:

cleaned_arabictext_test = []
# NOTE: this range covers the FIRST 40% of the Corpus, which overlaps the training rows;
# start the range where the training slice ends to get a held-out test set
for i in range(int(len(arabic_text_list)-len(arabic_text_list)*0.6)):
    #posttext = txt.decode('utf-8')
    match = re.sub(r"http\S+"," ", arabic_text_list[i])  # remove http links
    match = re.sub(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", " ", match) # remove hashtags
    match = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', " ", match) # remove numbers
    match = re.sub(r"(?:[a-z][a-z'\-_]+[a-z])", " ", match) # remove lowercase Latin words (may contain - ' _)
    match = re.sub(r'(?:@[\w_]+)', " ", match)# remove @-mentions
    emotions_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
    # re.VERBOSE is needed so the whitespace and comments inside the pattern are ignored
    match = re.sub(emotions_str, " ", match, flags=re.VERBOSE) # remove emoticons
    match = match.replace('"', " ") # remove double quotes
    #words = [w for w in match if not w in arbic_stop_words_set] # remove arabic stop words
    match = match.lower().split() # tokenize to words
    match = [x for x in match if x] # drop empty tokens
    cleaned_arabictext_test.append(match)
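
The training and test loops apply the same cleaning steps, so one option is to move them into a shared helper. The following is a minimal sketch, assuming the same regular expressions as the cells above; `clean_text` and `EMOTICON_RE` are hypothetical names, and the sample string is made up. The emoticon pattern is compiled with `re.VERBOSE` so that its inline comments and whitespace are ignored.

```python
import re

# Emoticon pattern from the notebook, compiled in verbose mode
EMOTICON_RE = re.compile(r"""
    (?:
        [:=;]               # eyes
        [oO\-]?             # nose (optional)
        [D\)\]\(\]/\\OpP]   # mouth
    )""", re.VERBOSE)

def clean_text(text):
    """Return the list of tokens left after the notebook's cleaning steps."""
    text = re.sub(r"http\S+", " ", text)                        # links
    text = re.sub(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", " ", text)  # hashtags
    text = re.sub(r"(?:(?:\d+,?)+(?:\.?\d+)?)", " ", text)      # numbers
    text = re.sub(r"(?:[a-z][a-z'\-_]+[a-z])", " ", text)       # lowercase Latin words
    text = re.sub(r"(?:@[\w_]+)", " ", text)                    # @-mentions
    text = EMOTICON_RE.sub(" ", text)                           # emoticons
    text = text.replace('"', " ")                               # double quotes
    return [word for word in text.lower().split() if word]

# Hypothetical comment mixing Arabic, English, a link, a number and an emoticon
sample_tokens = clean_text("hello http://example.com سعر 2000 :)")
```

With a helper like this, each loop body would reduce to a single `append(clean_text(arabic_text_list[i]))` call, and the two corpora could not drift apart.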


In [6]:
# Import the built-in logging module and configure it
# so that Word2vec creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
from gensim.models import word2vec
print "Training model...\n"
model = word2vec.Word2Vec(cleaned_arabictext_train, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "Arabic_Comment_Posts_Word2vec_Model"
model.save(model_name)
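
With `min_word_count = 40`, every word that occurs fewer than 40 times is left out of the vocabulary, which can be a large share of a small corpus. A rough way to check how many words survive, using only the standard library, might look like the sketch below; `surviving_vocab` and the toy sentences are hypothetical.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus: "a" occurs 3 times, "b" twice, "c" once
toy_sentences = [["a", "b"], ["a", "c"], ["a", "b"]]
kept = surviving_vocab(toy_sentences, min_count=2)
print(sorted(kept))
```

Running the same function on `cleaned_arabictext_train` with `min_word_count` would show how much of the corpus the model can actually represent.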

## From Words To Paragraphs, Attempt 1: Vector Averaging

One challenge with the dataset is that the texts vary in length. We need a way to turn individual word vectors into a feature set that has the same length for every text. Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each text.

One method is to simply average the word vectors of a given text. Stop words would only add noise to this average; note that the stop-word filter in the cleaning loops above is commented out, so stop words are still included here.
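
To make the averaging concrete, here is a toy sketch with hand-made 3-dimensional vectors instead of a trained model. The names `toy_vectors` and `average_vector` are hypothetical, and returning a zero vector when no word is known is an assumption of this sketch, not the notebook's behaviour.

```python
import numpy as np

# Hypothetical word vectors standing in for a trained Word2Vec model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "car": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        # Assumption: fall back to zeros instead of dividing by zero
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0).astype("float32")

avg = average_vector(["good", "car", "unknown"], toy_vectors, 3)
print(avg)
```

Every text, whatever its length, ends up as one vector of `num_features` values.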

In [7]:
def makeFeatureVec(words, model, num_features):
    """
    A function to average all of the word vectors
    in a given paragraph
    """
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    # index2word is a list that contains the names of the
    # words in the model's vocabulary. Convert it to a set
    # for speed.
    index2word_set = set(model.wv.index2word)
    # Loop over each word in the text and, if it is
    # in the model's vocabulary,
    # add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv[word])
    # Divide the result by the number of words to get the average.
    # If no word was in the vocabulary, nwords is 0 and the result is all NaN.
    featureVec = np.divide(featureVec, nwords)
    return featureVec

In [8]:
def getAvgFeatureVecs(text, model, num_features):
    """
    Given a set of texts (each one a list of words),
    calculate the average feature vector for each one
    and return a 2D numpy array
    """
    counter = 0
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(text), num_features), dtype = "float32")
    # Loop through the reviews
    for review in text:
        # print a status msg every 1000th review
        if counter%1000 == 0:
            print "text %d of %d" % (counter, len(text))
        # call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        
        counter = counter + 1
        
    return reviewFeatureVecs

Calculate average feature vectors for training and testing sets, using the functions we defined above. 

In [9]:
print "Creating average feature vecs for training reviews"
trainDataVecs = getAvgFeatureVecs(cleaned_arabictext_train, model, num_features)

print "Creating average feature vecs for test reviews"
testDataVecs = getAvgFeatureVecs(cleaned_arabictext_test, model, num_features)

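Inspecting the train and test data vectors shows that they contain a lot of NaNs. A row becomes NaN when none of its words is in the model's vocabulary (words seen fewer than `min_word_count` = 40 times are left out), so the average divides by zero. The rows with NaNs are dropped below.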


In [10]:
trainDataVecs_frame = pd.DataFrame(trainDataVecs)
testDataVecs_frame = pd.DataFrame(testDataVecs)
trainDataVecs_frame.dropna(inplace=True)
testDataVecs_frame.dropna(inplace=True)
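
Dropping NaN rows changes the number of feature rows, so the labels have to be filtered with the same selection or they no longer line up. A minimal sketch with numpy, using hypothetical toy values rather than the real feature matrix:

```python
import numpy as np

# Toy feature matrix: the second row is all NaN (no known words)
features = np.array([[0.1, 0.2],
                     [np.nan, np.nan],
                     [0.3, 0.4]], dtype="float32")
labels = np.array([0, 1, 1])

# Boolean mask of rows that contain no NaN
keep = ~np.isnan(features).any(axis=1)

# Apply the same mask to features and labels so they stay aligned
clean_features = features[keep]
clean_labels = labels[keep]
print(clean_features.shape, clean_labels.tolist())
```

With the DataFrames above, the surviving index of `trainDataVecs_frame` plays the same role as `keep`.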

Fit a random forest to the training data, using 100 trees

In [11]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )
print "Fitting a random forest to labeled training data..."
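# Select the labels by the rows that survived dropna so features and labels stay aligned
forest = forest.fit(trainDataVecs_frame, arabic_text["sentiment"].iloc[trainDataVecs_frame.index.values] )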

# Test & extract results 
result = forest.predict(testDataVecs_frame)

from sklearn.metrics import confusion_matrix
#cm = confusion_matrix(arabic_text[], result)

# Write the test results, keeping only the texts whose rows survived dropna
output = pd.DataFrame( data={"text":[cleaned_arabictext_test[i] for i in testDataVecs_frame.index], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, encoding='utf-8')

The predicted sentiment is saved in the [Word2Vec_AverageVectors.csv](https://docs.google.com/spreadsheets/d/1g6OE74YUVOljlWQyFPDu6LCu3527V0RbZ1aqTBBEKEI/edit?usp=sharing) file.
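
The notebook imports `confusion_matrix` but does not use it. As a rough sketch of what an evaluation step could compute once true test labels are lined up with the predictions, the snippet below counts (true, predicted) pairs and the accuracy with the standard library only. The function names and the label values are hypothetical, not results from this model.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

# Hypothetical labels, for illustration only
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(acc)
```

`sklearn.metrics.confusion_matrix(y_true, y_pred)` arranges the same counts in a matrix.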