Previously, we trained a set of word embeddings and used them to improve our text classification model, but we ran into problems related to the small size of our dataset. How might we overcome this? One solution is pre-trained word embeddings. These are word embedding models trained on a very large corpus, such as all of Wikipedia. We can get the vectors out of such a model in the same way that we did for our own smaller model. If the domain of the training data overlaps with the domain of our data, we should get good results.

One source of pre-trained word embeddings is fastText. The `fasttext` library provides a well-performing approach to training embeddings, and the project also publishes pre-trained embeddings for 157 languages, trained on Wikipedia and Common Crawl data (https://fasttext.cc/docs/en/crawl-vectors.html). **The pre-trained Arabic embeddings that we use here take a while to download, and even longer to upload to Google Drive, so you might prefer to run this notebook locally.**
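
For readers running locally, the download step might look like the sketch below. The URL pattern is an assumption based on the download links on the fastText page above, so check that page if it has changed. `vectors_url` and `download_vectors` are hypothetical helper names, not part of any library.

```python
import gzip
import shutil
import urllib.request

# Assumed URL pattern for the fastText Common Crawl vectors; verify it
# against the fastText download page before relying on it.
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl"


def vectors_url(lang, fmt="bin"):
    """Build the download URL for a language code such as 'ar'."""
    return f"{BASE_URL}/cc.{lang}.300.{fmt}.gz"


def download_vectors(lang, fmt="bin"):
    """Download and decompress the vectors into the working directory.

    Returns the local filename, e.g. 'cc.ar.300.bin'. The file is very
    large, so this can take a long time.
    """
    filename = f"cc.{lang}.300.{fmt}"
    with urllib.request.urlopen(vectors_url(lang, fmt)) as response:
        with gzip.GzipFile(fileobj=response) as unzipped:
            with open(filename, "wb") as out:
                shutil.copyfileobj(unzipped, out)
    return filename


# Only the URL is printed here; call download_vectors("ar") to fetch the file.
print(vectors_url("ar"))
```

Streaming the response through `gzip` avoids holding the compressed archive in memory or writing it to disk first.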

In [0]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

#Again, we'll limit ourselves to 20 classes for now.
train = train[train.label <= 20]
test = test[test.label <= 20]

Mounted at /content/gdrive


Once we've downloaded the `cc.ar.300.bin` file, we load it using `gensim`.

In [0]:
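from gensim.models.fasttext import load_facebook_vectors

# In recent gensim versions, load_facebook_vectors replaces the deprecated
# FastText.load_fasttext_format. It returns only the word vectors, which is
# all we need for similarity queries and vector lookups.
path = 'gdrive/My Drive/cc.ar.300.bin'
model = load_facebook_vectors(path)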

We have access to the same word similarity utilities we used previously, so let's take a look at those.

In [14]:
model.similarity('الحرب', 'الصراع')

0.4434066
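
The score above is the cosine similarity between the two word vectors. A minimal sketch of that computation with NumPy follows, using made-up three-dimensional vectors rather than real embeddings; `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors for illustration only; the fastText vectors have 300 dimensions.
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

print(cosine_similarity(a, a))  # same direction, so close to 1
print(cosine_similarity(a, b))  # partially aligned, so between 0 and 1
```

Because the measure depends only on direction, not length, frequent and rare words can be compared on the same scale.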

In [15]:
model.most_similar('المعارضة')[:5]

[('والمعارضة', 0.7426829934120178),
 ('للمعارضة', 0.7048141360282898),
 ('بالمعارضة', 0.6725300550460815),
 ('فالمعارضة', 0.6373220682144165),
 ('معارضة', 0.6321747303009033)]

Interesting! "War" and "conflict" are less similar here, perhaps because the words appear in many more contexts in this broader training data. For "opposition", though, the most similar words are all variants of the same word, mostly with a prefix attached (for example, "and the opposition" and "for the opposition"). So this model might have a better grasp of Arabic word forms and grammar than the small model we trained from scratch.
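
One likely reason for these neighbours is that fastText builds each word vector from the vectors of its character n-grams, so words that share a long substring share many components. The sketch below extracts n-grams with the `<` and `>` boundary markers that fastText uses. `char_ngrams` is a hypothetical helper, and n = 5 is assumed here because the fastText page describes the released vectors as using n-grams of length 5.

```python
def char_ngrams(word, n=5):
    """Return the set of character n-grams of `word`, with boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}


base = char_ngrams("المعارضة")
prefixed = char_ngrams("والمعارضة")

# N-grams that both word forms have in common.
shared = base & prefixed
print(f"{len(shared)} of {len(base)} n-grams are shared")
```

Almost all of the n-grams of the base word also appear in the prefixed form, which is consistent with the two forms ending up close together in the embedding space.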

Now we'll use the same function we built previously to get document-level embeddings, and pass them into a logistic regression.

In [0]:
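def doc_vectorizer(text, model):
  """Average the word vectors of the tokens in `text` into one document vector."""
  doc_vec = np.zeros(model.vector_size)
  count = 0

  for t in text:
    try:
      doc_vec = doc_vec + model[t]
      count += 1
    except KeyError:
      # Skip tokens the model has no vector for.
      pass

  # Documents with no usable tokens get the zero vector (avoids dividing by 0).
  if count == 0:
    return doc_vec
  return doc_vec / count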
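
To see what this averaging produces, here is a small self-contained sketch that uses a plain dictionary of made-up two-dimensional vectors in place of the fastText model. `toy_vectors` and `mean_vector` are hypothetical names used only for illustration.

```python
import numpy as np

# Made-up 2-dimensional "embeddings"; the real model maps words to 300 dimensions.
toy_vectors = {
    "war": np.array([1.0, 0.0]),
    "peace": np.array([0.0, 1.0]),
    "talks": np.array([1.0, 1.0]),
}


def mean_vector(tokens, vectors, size=2):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


print(mean_vector(["war", "peace", "unknown"], toy_vectors))
print(mean_vector(["unknown"], toy_vectors))
```

Every document ends up as a vector of the same length regardless of how many tokens it contains, which is what lets us stack the results into a feature matrix for a classifier.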

In [17]:
import nltk
from nltk.tokenize import WhitespaceTokenizer

nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('arabic'))

tokenizer = WhitespaceTokenizer()

train_words = [tokenizer.tokenize(t) for t in train.text]
test_words = [tokenizer.tokenize(t) for t in test.text]

train_words = [[t for t in text if t not in stop_words] for text in train_words]
test_words = [[t for t in text if t not in stop_words] for text in test_words]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [18]:
X_train = [doc_vectorizer(t, model) for t in train_words]
X_test = [doc_vectorizer(t, model) for t in test_words]

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

Y_train = train.label
Y_test = test.label

classifier = LogisticRegression(max_iter = 5000, multi_class='multinomial').fit(X_train, Y_train)
preds = classifier.predict(X_test)

f1_score(Y_test, preds, average = 'weighted')


0.5151575854701926

In [20]:

from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

Y_train = train.label
Y_test = test.label

classifier = LinearSVC(max_iter = 5000, multi_class='ovr').fit(X_train, Y_train)
preds = classifier.predict(X_test)

f1_score(Y_test, preds, average = 'weighted')

0.7640660411073206
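
For reference, `average='weighted'` computes an F1 score per class and averages them, weighting each class by its share of the true labels. A rough pure-Python sketch of that calculation on made-up labels follows; `weighted_f1` is a hypothetical helper, not scikit-learn's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(y_true)


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 3))
```

With this weighting, large classes dominate the score, so a model can score well overall while still doing poorly on rare classes.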

With the logistic regression we used previously, our weighted F1 score is much worse, at about 0.52! Switching to a linear SVC raises it to about 0.76, but that is still around the same as it was previously. So we aren't getting a significant benefit from using these pre-trained embeddings.

You might wonder if we could combine the knowledge contained in these pre-trained embeddings with the specific context of our dataset. What a convenient question! We sure can. This is called fine-tuning, and we'll get to it soon.