Approaches to representing text like `CountVectorizer` are sometimes called bag-of-words approaches, because they don't contain any information about the sequencing of the words. We're also using a fairly small dataset, so our model doesn't necessarily learn any general information about the meaning of the words.

The next few notebooks will introduce more advanced techniques that try to address this problem. First, we'll take a look at word embeddings by focusing on the implementation of Word2Vec available in the `gensim` library.

In [0]:
import gensim
import pandas as pd
import numpy as np
import sklearn

In [14]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

#Again, we'll limit ourselves to 20 classes for now.
train = train[train.label <= 20]
test = test[test.label <= 20]

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


We need to add a new preprocessing step here - tokenizing our text into individual words. `nltk`, which we used in a previous notebook for loading stopwords, has a nice utility for this. We're using a naive tokenizer that just looks for spaces between words, but `nltk` offers more advanced approaches as well, for example making use of regular expressions.

We'll also remove stopwords again.

In [15]:
import nltk

nltk.download('stopwords')
# Store the stopwords in a set so that membership checks are fast.
stop_words = set(nltk.corpus.stopwords.words('arabic'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()

train_words = [tokenizer.tokenize(t) for t in train.text]
test_words = [tokenizer.tokenize(t) for t in test.text]

In [0]:
train_words = [[t for t in text if t not in stop_words] for text in train_words]
test_words = [[t for t in text if t not in stop_words] for text in test_words]
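
To see why the choice of tokenizer matters, here is a small sketch comparing whitespace splitting with a simple regular-expression tokenizer. It uses Python's built-in `re` module rather than `nltk` so that it runs on its own. The sample sentence and the one-word stopword set are made up for illustration; they are not taken from our dataset or from `nltk`'s stopword list.

```python
import re

# A made-up sentence with punctuation attached to some of the words.
sample = "قالت المعارضة، اليوم: لا للحرب."

# Whitespace tokenization keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# A simple regex tokenizer keeps only runs of word characters, dropping punctuation.
regex_tokens = re.findall(r"\w+", sample)

# Hypothetical one-word stopword set, for illustration only.
toy_stop_words = {"لا"}
filtered_tokens = [t for t in regex_tokens if t not in toy_stop_words]

print(whitespace_tokens)
print(regex_tokens)
print(filtered_tokens)
```

With whitespace splitting, a word followed by a comma and the same word on its own count as two different vocabulary entries, which fragments the vocabulary the embedding model has to learn.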

Now we're going to train a set of word embeddings using the popular Word2Vec approach.

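Word embeddings are dense vector representations of words, typically with a few hundred dimensions at most. In previous notebooks, we were also providing vector representations of our text, but those were sparse vectors: one dimension per vocabulary word, with most values equal to zero ("this word appears zero times in this document"). Sparse vectors like that say nothing about how words relate to each other, whereas embeddings are trained so that words used in similar contexts end up with similar vectors.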
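
A toy comparison can make the contrast clearer. The vocabulary, the count vectors, and the three-dimensional "embeddings" below are invented for illustration; real embeddings are learned from data and usually have 100 or more dimensions.

```python
import numpy as np

# Toy vocabulary: a bag-of-words vector has one slot per vocabulary word.
vocab = ["war", "conflict", "peace", "economy", "football", "weather"]

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    return np.array([tokens.count(w) for w in vocab])

doc_war = bag_of_words(["war"])
doc_conflict = bag_of_words(["conflict"])

# The two sparse vectors share no non-zero slot, so their dot product is 0:
# the representation cannot tell us that "war" and "conflict" are related.
sparse_overlap = int(doc_war @ doc_conflict)

# Hand-made dense vectors (assumed values, not learned): every dimension is used,
# and related words can point in similar directions.
dense = {
    "war": np.array([0.9, 0.1, -0.3]),
    "conflict": np.array([0.8, 0.2, -0.2]),
    "football": np.array([-0.5, 0.7, 0.4]),
}
dense_overlap = float(dense["war"] @ dense["conflict"])

print(doc_war, doc_conflict, sparse_overlap)
print(dense_overlap)
```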

Word2Vec is a specific approach for learning word embeddings with a shallow neural network. You can read more about it, and word embeddings in general, here: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa.

We pass a few arguments when creating our model: `min_count` sets the minimum number of times a word must appear to be included in the model, and `sg=1` selects the skip-gram variant of word2vec (the other being CBOW, or continuous bag of words, which tends to do better on datasets larger than ours).

In [0]:
model = gensim.models.Word2Vec(train_words, min_count=1, sg=1)
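
To give a feel for what `sg=1` asks the model to learn, here is a rough sketch of how skip-gram turns a sentence into (center word, context word) training pairs. This is a simplified illustration, not `gensim`'s actual implementation; the helper name `skipgram_pairs`, the example sentence, and the fixed window of 2 are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "opposition", "rejected", "the", "proposal"]
pairs = skipgram_pairs(sentence, window=2)
for center, context in pairs[:6]:
    print(center, "->", context)
```

Skip-gram trains the network to predict each context word from the center word; CBOW reverses this and predicts the center word from its surrounding context.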

Now we should be able to get the embedding for an individual word out of our trained model. Let's give it a try:

In [21]:
model.wv['المعارضة']

array([ 0.14549054,  0.599306  , -0.4804341 ,  0.03285765, -0.5277679 ,
        0.18579623, -0.3790516 , -0.16893376, -0.0919387 , -0.69887435,
       -0.47992244,  0.5604452 ,  1.7088275 , -0.39851496, -0.93221676,
        0.38124117, -0.34682694, -0.10315377,  0.34234753, -0.22387849,
        0.36565235, -1.7127385 ,  0.3991669 ,  0.4639502 ,  0.16955788,
       -0.08727596,  0.68151236, -0.533671  , -0.9149401 ,  0.6349782 ,
        1.0968806 ,  0.09654019,  0.24191156, -0.11690563,  0.196531  ,
        0.03272146, -0.0201999 , -0.09860993, -0.41145647, -1.1043358 ,
        0.09548824,  0.30044132, -0.2053842 ,  0.1468742 ,  0.10847885,
       -0.6022765 ,  0.90401655,  0.276518  ,  0.06019887, -0.8702403 ,
       -0.33587086, -0.4497403 , -0.5549428 ,  0.4569512 ,  0.55942935,
        0.39943692,  0.09723862, -0.0251064 ,  0.37965676, -0.16792785,
       -0.26170108,  0.12032393, -0.30160433,  0.58695364, -0.5198097 ,
       -0.32774565,  0.1265672 , -0.8171582 ,  0.8184151 , -0.03

A brief tangent: one appealing characteristic of word embeddings is that similar words should be near to each other in embedding space. `gensim` allows us to test this by computing the similarity between two words and by returning the most similar words to a given word.

So let's give it a try!

In [22]:
model.wv.similarity('الحرب', 'الصراع')

0.6769255

So the words 'war' and 'conflict' are fairly similar - makes sense.
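
As a reference for how such a score can be computed, here is a minimal NumPy sketch of cosine similarity, the measure `gensim` uses for `similarity` and `most_similar`. The three-dimensional vectors are made up for illustration and are not our trained embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only.
war = np.array([0.9, 0.1, -0.3])
conflict = np.array([0.8, 0.2, -0.2])
football = np.array([-0.5, 0.7, 0.4])

print(cosine_similarity(war, conflict))
print(cosine_similarity(war, football))
```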

Now, we'll look at the five words most similar to "opposition." This is also built into the `gensim` implementation of word2vec.

In [23]:
model.wv.most_similar('المعارضة', topn=5)

[('المعتدلة', 0.8478207588195801),
 ('للمعارضة', 0.8072715997695923),
 ('الفصائل', 0.7820077538490295),
 ('معارضة', 0.7800770401954651),
 ('المسلحة', 0.7489530444145203)]

This looks decent! Two of these are variants on the word opposition and the others are "armed," "factions," and "moderate" -- all words that Arab media outlets might use to describe an opposition movement.

These similarity scores are cosine similarities: `gensim` compares the direction of two word vectors in embedding space rather than their lengths. We could experiment with other distance measures, but for now, let's return to our classification task.

Right now, we have vectors for individual words, but we want a single vector for each piece of text in our dataset. One way to solve this problem is by averaging the vectors of individual words. Let's create a simple function that does this.

In [0]:
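#Remember that we've already tokenized our text, creating a 'list of lists.' Each inner list is one document, and that is what we feed into our function.
def doc_vectorizer(text, model):
  # Collect the vectors of all tokens that are in the model's vocabulary.
  word_vecs = [model.wv[t] for t in text if t in model.wv]

  # If no token is known (for example, an empty document), return a zero vector
  # so that every document still gets a vector of the same length.
  if len(word_vecs) == 0:
    return np.zeros(model.vector_size)

  # Average the word vectors to get one vector per document.
  return np.mean(word_vecs, axis=0)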
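
To see what this averaging does on a small scale, here is a toy version that swaps the trained model for a plain dictionary. The words and three-dimensional vectors in `toy_vectors` are invented for illustration, and `average_vectors` is a hypothetical helper, not part of `gensim`.

```python
import numpy as np

# Invented 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "opposition": np.array([1.0, 0.0, 2.0]),
    "armed": np.array([3.0, 2.0, 0.0]),
}

def average_vectors(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc = ["armed", "opposition", "unknownword"]
print(average_vectors(doc, toy_vectors))  # the unknown word is skipped
print(average_vectors([], toy_vectors))   # an empty document falls back to zeros
```

Averaging throws away word order, so in that respect this is still a bag-of-words representation. What changes is that each word now contributes a dense, learned vector instead of a single count.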

In [25]:
X_train = [doc_vectorizer(t, model) for t in train_words]
X_test = [doc_vectorizer(t, model) for t in test_words]


For simplicity and for better comparison, let's feed our new vectors into the same `sklearn` classifier that we've used previously.

In [26]:
from sklearn.linear_model import LogisticRegression

Y_train = train.label
Y_test = test.label

classifier = LogisticRegression(max_iter = 5000, multi_class='multinomial').fit(X_train, Y_train)
preds = classifier.predict(X_test)
pd.crosstab(Y_test, preds)

true label \ predicted label,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,16,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,4,1,0,0,0
1,0,14,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0
2,0,0,37,0,0,0,0,0,1,0,12,0,0,0,2,0,0,2,0,0,0
3,0,0,0,28,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,3
4,0,0,0,0,37,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
5,0,0,0,0,0,44,7,0,0,0,0,0,0,1,0,0,4,0,0,1,1
6,0,0,0,2,0,5,280,2,0,0,0,0,31,7,0,0,5,0,0,2,0
7,0,0,0,1,0,0,2,45,0,0,0,0,0,0,0,0,0,0,0,3,1
8,0,0,6,0,0,0,0,0,51,0,0,0,0,0,5,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,13,1,0,0,0,0,0,0,2,0,1,0


In [27]:
from sklearn.metrics import f1_score

f1_score(Y_test, preds, average = 'weighted')

0.780616644380058
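
For reference, `average='weighted'` computes an F1 score per class and then averages those scores, weighting each class by its number of true examples. Here is a hand-rolled sketch of that calculation in plain Python on made-up labels. The function name `weighted_f1` is our own, not `sklearn`'s.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class size, large classes such as category 6 in our data dominate the weighted score.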

This looks good. Notice that the model is getting confused between categories 6 and 12 less often. The F1 score is also slightly improved.

But wait! There's another approach we can try here. `gensim` also offers a doc2vec implementation, for cases like ours where we really want document-level rather than word-level vectors. Each training document is wrapped in a `TaggedDocument`, which pairs its text with a unique tag (here, its index).

In [0]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train.text)]
doc_model = Doc2Vec(documents, min_count=1)

In [0]:
X_train = [doc_model.infer_vector(t) for t in train.text]
X_test = [doc_model.infer_vector(t) for t in test.text]

In [30]:

classifier = LogisticRegression(max_iter = 5000, multi_class='multinomial').fit(X_train, Y_train)
preds = classifier.predict(X_test)
f1_score(Y_test, preds, average = 'weighted')

0.43112670116544183

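These results are much worse! The size of our training set is probably part of the reason, since doc2vec has to learn a separate vector for every document from relatively little text. There is also a catch in the cells above: `TaggedDocument` and `infer_vector` expect each document as a list of tokens, but we passed the raw strings in `train.text` and `test.text`, so the model most likely treated every character as a "word". Passing the tokenized `train_words` and `test_words` instead would give doc2vec a fairer test. With more data and word-level input, doc2vec may well be competitive with averaged word2vec vectors on this kind of problem.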
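
One thing worth checking when building `TaggedDocument` objects is what gets passed as the words. `TaggedDocument` is a named tuple with `words` and `tags` fields, and the words are expected to be a list of tokens. The sketch below uses a stand-in named tuple, `ToyTaggedDocument`, so that it runs without `gensim`, and shows what iterating over the words yields when a raw string is passed instead of a token list. The sample text is made up.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (an assumption made
# only so that this example runs without gensim installed).
ToyTaggedDocument = namedtuple("ToyTaggedDocument", "words tags")

raw_text = "armed opposition factions"

from_string = ToyTaggedDocument(raw_text, [0])
from_tokens = ToyTaggedDocument(raw_text.split(), [0])

# Iterating over a string yields single characters, not words.
print(list(from_string.words)[:5])
# Iterating over a token list yields the words themselves.
print(list(from_tokens.words))
```

In this notebook, the already tokenized `train_words` and `test_words` lists are in the second form.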

In the next notebook, we'll explore how to take advantage of large training sets when we don't have access to them ourselves.