## Integrate [AraVec](https://github.com/bakrianoo/aravec) with [spaCy](https://spacy.io/)

This notebook demonstrates how to integrate an [AraVec](https://github.com/bakrianoo/aravec) model with [spaCy](https://spacy.io/).

## Outline

- Install/Load the required modules
- Load AraVec
- Export the Word2Vec format + gzip it.
- Initialize the spaCy model using AraVec vectors
- Run Your AraVec spaCy Model
- Test the Model

## Install/Load the required modules

In [None]:
!pip install gensim spacy nltk

In [None]:
import gensim
import re
import spacy

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','&quot;','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
    # remove tashkeel (Arabic diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel,"", text)
    
    # remove elongation: collapse any run of a repeated character to two copies
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    
    # apply the character replacements pairwise, in order
    for old, new in zip(search, replace):
        text = text.replace(old, new)
    
    # trim leading and trailing whitespace
    text = text.strip()

    return text
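
To see what the two regular expressions in `clean_str` do on their own, the sketch below applies them to short sample strings. The patterns are copied from the function above; the helper names `strip_tashkeel` and `squeeze_repeats` are illustrative and not part of the notebook.

```python
import re

# Patterns copied from clean_str; the helper names below are hypothetical.
P_TASHKEEL = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
P_REPEATS = re.compile(r'(.)\1+')

def strip_tashkeel(text):
    """Remove Arabic diacritics (tashkeel)."""
    return P_TASHKEEL.sub("", text)

def squeeze_repeats(text):
    """Collapse any run of a repeated character down to two copies."""
    return P_REPEATS.sub(r"\1\1", text)

print(strip_tashkeel("مَرْحَبًا"))      # diacritics removed
print(squeeze_repeats("جمييييييل"))   # long run of ي reduced to two
print(squeeze_repeats("heeeello"))    # heello
```

Because a run is squeezed to two characters rather than one, `clean_str` follows up with explicit replacements for `وو`, `يي`, and `اا`.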

## Load AraVec
Download a model from the [AraVec Repository](https://github.com/bakrianoo/aravec), then follow the steps below to load it.

In [None]:
# Download via terminal commands
!wget "https://archive.org/download/aravec2.0/tweet_cbow_300.zip"
!unzip "tweet_cbow_300.zip"

In [None]:
# load the AraVec model
model = gensim.models.Word2Vec.load("tweets_cbow_300")

In [None]:
print("The vocabulary contains", len(model.wv.index_to_key), "words")

## Export the Word2Vec format + gzip it.

In [None]:
# make a directory called "spacyModel"
%mkdir spacyModel

In [None]:
# export the model in word2vec text format to the directory
model.wv.save_word2vec_format("./spacyModel/aravec.txt")

In [None]:
# using `gzip` to compress the .txt file
!gzip ./spacyModel/aravec.txt
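
The exported file is plain text: the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its vector components. The sketch below compresses a file with Python's standard `gzip` module (a portable alternative where the `gzip` command is not available) and reads that header back. The function names and the tiny two-word file are made up for illustration.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(src_path):
    """Write a gzip-compressed copy of src_path next to it and return its path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

def read_word2vec_header(gz_path):
    """Return (vocab_size, vector_size) from the first line of a gzipped word2vec text file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        vocab_size, vector_size = f.readline().split()
    return int(vocab_size), int(vector_size)

# Toy file standing in for ./spacyModel/aravec.txt (all values are invented).
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "toy_vectors.txt")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("قطة 0.1 0.2 0.3\n")
    f.write("كلب 0.4 0.5 0.6\n")

gz_path = gzip_file(txt_path)
header = read_word2vec_header(gz_path)
print(header)  # (2, 3)
```

Unlike the shell command, this helper leaves the original `.txt` file in place.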

## Initialize the spaCy model using AraVec vectors

- This will create a folder called `spacy.aravec.model` in your current working directory.
- This step can take several minutes to complete.

In [None]:
!python -m spacy init vectors ar ./spacyModel/aravec.txt.gz spacy.aravec.model 

## Run Your AraVec spaCy Model


In [None]:
# load the AraVec spaCy model
nlp = spacy.load("./spacy.aravec.model/")

In [None]:
# Define the preprocessing class: it cleans the text, then calls the wrapped tokenizer
class Preprocessor:
    def __init__(self, tokenizer, **cfg):
        self.tokenizer = tokenizer

    def __call__(self, text):
        preprocessed = clean_str(text)
        return self.tokenizer(preprocessed)

In [None]:
# Replace the pipeline's tokenizer with the `Preprocessor` wrapper
nlp.tokenizer = Preprocessor(nlp.tokenizer)
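
For in-memory use, spaCy treats the tokenizer as a callable that takes a string and returns a `Doc`, which is why a thin wrapper can stand in for it. The sketch below shows the same pattern without spaCy: a stand-in tokenizer (plain `str.split`) and a stand-in cleaner are wrapped so that cleaning always runs first. `fake_tokenizer` and `fake_clean` are hypothetical placeholders, not spaCy or AraVec APIs, and unlike the notebook's class this version takes the cleaner as an argument so that the example is self-contained.

```python
class Preprocessor:
    """Run a cleaning function before delegating to the wrapped tokenizer."""

    def __init__(self, tokenizer, cleaner):
        self.tokenizer = tokenizer
        self.cleaner = cleaner

    def __call__(self, text):
        return self.tokenizer(self.cleaner(text))

def fake_clean(text):
    # Hypothetical stand-in for clean_str: turn underscores into spaces and trim.
    return text.replace("_", " ").strip()

def fake_tokenizer(text):
    # Hypothetical stand-in for nlp.tokenizer: split on whitespace.
    return text.split()

tokenize = Preprocessor(fake_tokenizer, fake_clean)
tokens = tokenize("  hello_world  ")
print(tokens)  # ['hello', 'world']
```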

## Test the Model

In [None]:
# Test your model
nlp("قطة").vector

In [None]:
egypt = nlp("مصر بلد")
tunisia = nlp("تونس")
apple = nlp("تفاح")

print("egypt vs. tunisia =", egypt.similarity(tunisia))
print("egypt vs. apple =", egypt.similarity(apple))
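
By default, spaCy computes `Doc.similarity` as the cosine similarity of the two documents' vectors, and a `Doc` vector is the average of its token vectors, so `egypt` above is the mean of the vectors for its two tokens. A minimal sketch of that arithmetic in plain Python, using invented 3-dimensional vectors rather than real AraVec values:

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented token vectors, for illustration only.
token_1 = [1.0, 0.0, 0.0]
token_2 = [0.0, 1.0, 0.0]
other = [1.0, 1.0, 0.0]

doc_vector = mean_vector([token_1, token_2])  # [0.5, 0.5, 0.0]
score = cosine_similarity(doc_vector, other)
print(doc_vector, round(score, 6))
```

Cosine similarity ignores vector length, so averaging two tokens changes the direction of the document vector but not the scale of the score.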

In [None]:
egypt.vector

## Done!

Congratulations, now you have your AraVec model running on spaCy.