<a href="https://colab.research.google.com/github/ajw-42/arabic-nlp/blob/master/05_laser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's look at one more set of pre-trained embeddings before moving on. They aren't necessarily the best fit for our problem, but they represent an interesting and relatively new approach.

[LASER (Language-Agnostic SEntence Representations)](https://github.com/facebookresearch/LASER) is, as the name indicates, an approach for producing a single set of sentence-level representations across languages. This is great for multilingual datasets, and it also avoids the problem we encountered previously, where we had word vectors that we then averaged to get sentence- or document-level vectors. As with `fasttext`, pre-trained embeddings are made available in addition to utilities for training your own embeddings from scratch.
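
To make the averaging issue concrete, here is a small sketch with made-up three-dimensional word vectors. The vocabulary and values are purely illustrative and don't come from any real model, and this is only one possible drawback of averaging, not necessarily the one that mattered most in the earlier notebooks. Because a mean ignores word order, two sentences that use the same words in a different order end up with the same vector.

```python
import numpy as np

# Hypothetical toy word vectors; real fasttext vectors have hundreds of dimensions.
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    words = sentence.split()
    return np.mean([toy_vectors[w] for w in words], axis=0)

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# Word order is lost: both sentences map to the same averaged vector.
print(np.allclose(a, b))
```

A sentence encoder like LASER reads the whole sequence, so it is at least able to tell such sentences apart.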

The `laserembeddings` library is a nice wrapper for `LASER` that makes the process of getting sentence representations a bit easier. In Google Colab, we'll need to pip install it before importing.



In [1]:
!pip install laserembeddings

Collecting laserembeddings
  Downloading https://files.pythonhosted.org/packages/c5/6b/93843d90080666571a79f8eb195fa58aa5e45cf24d36158b9c01dba306e2/laserembeddings-1.0.1-py3-none-any.whl
Collecting transliterate==1.10.2
  Downloading https://files.pythonhosted.org/packages/a1/6e/9a9d597dbdd6d0172427c8cc07c35736471e631060df9e59eeb87687f817/transliterate-1.10.2-py2.py3-none-any.whl (45kB)
Collecting subword-nmt<0.4.0,>=0.3.6
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8cae49b837f93f5842a41fe513a694912/subword_nmt-0.3.7-py2.py3-none-any.whl
Collecting sacremoses==0.0.35
  Downloading https://files.pytho

In [0]:
import pandas as pd
import numpy as np
import sklearn
import laserembeddings

In [3]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

#Again, we'll limit ourselves to 20 classes for now.
train = train[train.label <= 20]
test = test[test.label <= 20]

Mounted at /content/gdrive


In [4]:
#Now we need to download the pre-trained models
!python -m laserembeddings download-models


Downloading models into /usr/local/lib/python3.6/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


We're also going to assume that each of our documents is one sentence. This isn't exactly true -- most are 3-4 sentences -- but it's a reasonable shortcut that will make things a bit easier here.

The `lang` argument is only used to determine how your text will be tokenized. For a multilingual dataset you could either not 

In [0]:
#This will take a while to run!
from laserembeddings import Laser

laser = Laser()

X_train = [laser.embed_sentences([text], lang='ar') for text in train.text]
X_test = [laser.embed_sentences([text], lang='ar') for text in test.text]
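
Since `embed_sentences()` takes a list, one alternative to calling it once per document is to pass the texts in chunks. The helper below is a sketch of that idea, not what this notebook does: `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names introduced for illustration, and the stand-in encoder is there only so the example runs without downloading any models. Whether batching is actually faster will depend on your hardware and document lengths.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts in chunks and stack the results into one (N, dim) array.

    embed_fn is any callable that maps a list of strings to a 2-D array
    with one row per string. texts must be non-empty.
    """
    texts = list(texts)
    chunks = [
        embed_fn(texts[i:i + batch_size])
        for i in range(0, len(texts), batch_size)
    ]
    return np.vstack(chunks)

# Stand-in encoder (hypothetical): returns a 4-dimensional row per input string.
def fake_embed(batch):
    return np.array([[len(s), 0.0, 0.0, 0.0] for s in batch])

demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo.shape)  # (3, 4)
```

With the real encoder, you would pass something like `lambda batch: laser.embed_sentences(batch, lang='ar')` as `embed_fn`. The result is already a two-dimensional matrix with one row per document, so it would not need the per-document flattening used in the rest of this notebook.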


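`embed_sentences()` returns a two-dimensional array with one row per input sentence. Since we passed in one document at a time, the first dimension of each result is 1, so we flatten each result into a one-dimensional vector before we pass everything into sklearn.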


In [0]:
X_train = [np.concatenate(x) for x in X_train]
X_test = [np.concatenate(x) for x in X_test]
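
To see what the flattening step does to the shapes, here is a sketch with stand-in arrays instead of real embeddings. It assumes each per-document result has shape `(1, 1024)`, which is my understanding of what `laserembeddings` returns for a single sentence; `fake_outputs` is a hypothetical placeholder.

```python
import numpy as np

# Stand-ins for per-document results: each has shape (1, 1024),
# i.e. one row (one "sentence") of 1024 features.
fake_outputs = [np.zeros((1, 1024)) for _ in range(3)]

# np.concatenate on a (1, 1024) array joins its rows, giving shape (1024,).
flat = [np.concatenate(x) for x in fake_outputs]

# np.vstack builds the equivalent feature matrix in one step.
matrix = np.vstack(fake_outputs)

print(flat[0].shape, matrix.shape)  # (1024,) (3, 1024)
```

Either form works as input to sklearn: a list of equal-length one-dimensional vectors is converted to the same `(n_documents, 1024)` matrix internally.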

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

Y_train = train.label
Y_test = test.label

classifier = LogisticRegression(max_iter=5000, multi_class='multinomial').fit(X_train, Y_train)
preds = classifier.predict(X_test)

f1_score(Y_test, preds, average='weighted')

0.7594143755886662

Not bad! It makes sense that LASER would perform slightly worse in our case than language-specific embeddings, but again, it's a powerful tool for multilingual datasets.
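
For reference, the score above uses `average='weighted'`, which computes an F1 score per class and then averages them weighted by each class's support (the number of true examples of that class). The pure-Python sketch below spells out that calculation on toy labels. `weighted_f1` is a hypothetical helper written for illustration, not sklearn's implementation, and the labels are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted count for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its share of the true labels.
        score += f1 * n_true / total
    return score

# Toy labels, purely illustrative.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.833
```

Because larger classes carry more weight, a weighted F1 can look healthy even when a few small classes are predicted poorly, so it can be worth checking per-class scores as well.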