## Use AraVec as a spaCy model

This notebook shows how to load an AraVec model with the spaCy library.

## Steps

- Install the required modules
- Load the AraVec model
- Export the model and gzip it
- Initialize the spaCy model
- Load the model into spaCy
- Add the preprocessing function
- Enjoy AraVec from within spaCy!

## Install the required modules

In [1]:
!pip install gensim spacy nltk



## Load aravec
Download any model you like from the [AraVec repository](https://github.com/bakrianoo/aravec), then load it with gensim as shown below. Adjust the path to wherever you extracted the model.

In [2]:
import gensim
import re

model = gensim.models.Word2Vec.load("aravec/WWW-CBOW/WWW-CBOW")

### Don't forget about the preprocessing function

In [3]:
def clean_str(text):
    search = ["أ", "إ", "آ", "ة", "_", "-", "/", ".", "،", " و ", " يا ",
              '"', "ـ", "'", "ى", "\\", '\n', '\t', '&quot;', '?', '؟', '!']
    replace = ["ا", "ا", "ا", "ه", " ", " ", "", "", "", " و",
               " يا", "", "", "", "ي", "", ' ', ' ', ' ', ' ? ', ' ؟ ', ' ! ']

    # remove tashkeel (Arabic diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel, "", text)

    # remove elongation: collapse any run of a repeated character to two
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)

    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')

    # apply the character substitutions pairwise
    for old, new in zip(search, replace):
        text = text.replace(old, new)

    # trim leading and trailing whitespace
    text = text.strip()

    return text
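
To see what the two regular expressions in `clean_str` do on their own, here is a minimal sketch that applies them to two sample words. The helper names `strip_tashkeel` and `collapse_repeats` and the sample words are ours, chosen for illustration; they are not part of AraVec.

```python
import re

# The same two patterns that clean_str compiles.
TASHKEEL = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
REPEATS = re.compile(r'(.)\1+')


def strip_tashkeel(text):
    """Remove Arabic diacritic marks and keep the base letters."""
    return TASHKEEL.sub("", text)


def collapse_repeats(text):
    """Cut every run of a repeated character down to exactly two."""
    return REPEATS.sub(r"\1\1", text)


# "kataba" written with a fatha mark (U+064E) after each of its three letters
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(strip_tashkeel(vocalized))    # only the three base letters remain

# an elongated spelling with the letter ي repeated four times
print(collapse_repeats("جمييييل"))  # the run of ي is reduced to two
```

Inside `clean_str`, the remaining `يي` is then reduced to a single `ي` by the `replace` calls that follow the regular expressions.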

In [4]:
word1 = clean_str('يعشق')

# find and print the most similar terms to a word
most_similar = model.wv.most_similar(word1)
for term, score in most_similar:
    print(term, score)


يحب 0.6261304020881653
تعشق 0.56817626953125
ويعشق 0.5610277652740479
يهوي 0.5600417256355286
لايحب 0.5520501136779785
يعشقه 0.5470657348632812
يعشقها 0.5283931493759155
اعشق 0.5102519989013672
يعشقون 0.509077787399292
يلاعب 0.4903143346309662


Everything works as expected. Next, we create a folder to export the model to and build the spaCy model.

## Export the model and gzip it

In [5]:
%mkdir spacyModel

In [6]:
model.wv.save_word2vec_format("./spacyModel/aravec.txt")

In [7]:
!gzip ./spacyModel/aravec.txt 
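
If the `gzip` command is not available (on Windows, for example), the same step can be done from Python. The sketch below uses only the standard library. `gzip_file` and `read_word2vec_header` are hypothetical helper names, and the header check relies on the word2vec text format, whose first line holds the vocabulary size and the vector dimension.

```python
import gzip
import shutil


def gzip_file(src, dst=None):
    """Compress src into dst (default: src + '.gz') and return the output path."""
    dst = dst or src + ".gz"
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def read_word2vec_header(path):
    """Return (vocab_size, vector_size) from a gzipped word2vec text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        vocab_size, vector_size = f.readline().split()
    return int(vocab_size), int(vector_size)


# Possible usage in this notebook (assumes the export cell above has run):
# gz_path = gzip_file("./spacyModel/aravec.txt")
# print(read_word2vec_header(gz_path))
```

Unlike the `gzip` command, `gzip_file` leaves the uncompressed file in place, so you may want to delete it afterwards.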

## Initialize the spaCy model 

In [8]:
!python -m spacy init-model ar spacy.aravec.model --vectors-loc ./spacyModel/aravec.txt.gz

✔ Successfully created model
146273it [00:14, 10104.67it/s]
✔ Loaded vectors from spacyModel/aravec.txt.gz
✔ Sucessfully compiled vocab
146466 entries, 146273 vectors


> This creates a folder named `spacy.aravec.model` inside your current directory. Note that `init-model` is a spaCy v2 command; spaCy v3 replaced it with `spacy init vectors`.

## Load it into spaCy

In [9]:
import spacy
nlp = spacy.load("./spacy.aravec.model/")

In [10]:
nlp("قطة").vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

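Every component of the vector is zero, which means spaCy found no entry for this token. That is because the text was never passed through `clean_str`: the raw spelling `قطة` does not match the normalized form (with `ة` replaced by `ه`) that the preprocessing produces. Let's add the preprocessing step to the model.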
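
To make the mismatch concrete, here is a toy sketch of a vector lookup that falls back to zeros for unknown words. The two-dimensional `toy_vectors` table and the `lookup` helper are made up for illustration and are not how spaCy stores vectors internally; the sketch also assumes the vocabulary contains the normalized spelling.

```python
# Hypothetical vector table keyed by the normalized spelling (ending in ه).
toy_vectors = {"قطه": [0.5, -0.25]}


def lookup(word, dim=2):
    """Return the stored vector, or an all-zero vector for unknown words."""
    return toy_vectors.get(word, [0.0] * dim)


raw = "قطة"
normalized = raw.replace("ة", "ه")  # one of the substitutions clean_str applies

print(lookup(raw))         # the raw spelling is not in the table
print(lookup(normalized))  # the normalized spelling is
```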

## Add the preprocessor function

In [11]:
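class Preprocessor:
    """Tokenizer wrapper that runs clean_str on the text before tokenizing."""

    def __init__(self, tokenizer, **cfg):
        self.tokenizer = tokenizer  # the original spaCy tokenizer

    def __call__(self, text):
        # normalize the raw text before it reaches the tokenizer
        preprocessed = clean_str(text)
        return self.tokenizer(preprocessed)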

In [12]:
nlp.tokenizer = Preprocessor(nlp.tokenizer)

In [13]:
nlp("قطة").vector

array([ 2.8480022 , -0.92140836, -0.32840294,  1.9604828 , -3.9761736 ,
       -0.63699883,  0.52591836, -0.49163073, -0.9315454 ,  1.4118265 ,
       -1.8157891 ,  0.02102472,  2.465533  , -0.60494626, -0.8643424 ,
       -1.8974758 ,  2.8115346 ,  1.7856901 ,  0.60246867, -2.5311856 ,
        0.6646705 , -0.02800057,  2.6766527 ,  1.1444625 , -2.3182938 ,
       -0.23770235, -0.95880896, -2.370305  ,  0.21067968, -0.9180316 ,
        1.0837697 , -1.339369  ,  1.0470312 , -0.85747755, -0.3612395 ,
       -0.04006149,  0.6157008 , -2.3505592 ,  1.0696003 ,  0.18124492,
       -0.02356418, -2.4026926 ,  0.29490086, -1.3148676 , -2.0697622 ,
        0.3080379 ,  1.6946433 ,  1.5146562 ,  3.6492367 ,  1.0488898 ,
       -1.1743469 ,  1.4775808 , -1.3201469 ,  0.7495463 ,  1.5690784 ,
        1.9071016 , -2.5420423 ,  1.9932128 , -0.24944896,  0.26473418,
       -1.0503093 , -3.358672  ,  4.476831  , -1.9207832 , -0.13164501,
        1.6348315 ,  2.1712348 , -0.93513036, -0.13917194, -2.48

In [14]:
cat = nlp("قطة")
dog = nlp("كلب")
book = nlp("كتاب")

In [15]:
dog.similarity(cat)

0.5383209354534503

In [16]:
dog.similarity(book)

0.04397916193487976
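
By default, spaCy's `similarity` is the cosine similarity between the two documents' vectors, where a document vector is the average of its token vectors. The following standard-library sketch reproduces that computation on made-up three-dimensional vectors. The function names and the numbers are illustrative assumptions, not values taken from AraVec.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; defined here as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical single-token "documents": the mean of one vector is that vector.
toy_cat = mean_vector([[1.0, 2.0, 0.0]])
toy_dog = mean_vector([[2.0, 3.0, 1.0]])
toy_book = mean_vector([[-1.0, 0.5, 3.0]])

print(cosine(toy_dog, toy_cat))   # vectors pointing in similar directions
print(cosine(toy_dog, toy_book))  # vectors pointing in different directions
```

The zero-vector guard also shows why the preprocessing step matters: a token with an all-zero vector cannot produce a meaningful similarity score.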

## Done!

Congratulations, you now have your AraVec model running in spaCy.