This notebook creates an **Arabic ULMFiT** language model using **Fastai V1.x** and scripts in the same local directory (copied or adapted from [n-waves](https://github.com/n-waves/ulmfit-multilingual/blob/refactor/ulmfit/)).   
It is a draft notebook that worked as of Dec. 7, 2018; there is no guarantee that it will be maintained or that it applies to your own use case.  
The code was run on free Google Colab instances. Some code cells have brief comments. 
If you have questions or need support, please use the [fastai forum](https://forums.fast.ai/t/multilingual-ulmfit/28117/37). Please do not open issues - this is not a repository!

In [0]:
# install latest torch and fastai for google colab or see https://github.com/fastai/fastai#installation

!curl https://course-v3.fast.ai/setup/colab | bash 

# tokenizer Moses (also needed for xnli)
# https://github.com/alvations/sacremoses
!pip install sacremoses

# cupy needs to be installed for QRNN (the wheel must match the CUDA version; cupy-cuda92 targets CUDA 9.2)
!pip install cupy-cuda92

# fire to run modules as command line interface
!pip install fire       

In [0]:
#---------- if the installed fastai version is broken, uninstall it and pin a known working release
#!pip uninstall -y fastai 
#!pip install fastai==1.0.32


In [0]:
from fastai import *
from fastai.text import * 

# only needed for diagnostics
import fastai 
fastai.show_install(1) 


# for reproducibility (from fastai forum)
# see https://forums.fast.ai/t/solved-reproducibility-where-is-the-randomness-coming-in/31628/5
# explicit imports, so the function does not rely on the star imports above
import random
import numpy as np
import torch

def random_seed(seed_value, use_cuda):
    np.random.seed(seed_value) # NumPy RNG
    torch.manual_seed(seed_value) # PyTorch CPU RNG
    random.seed(seed_value) # Python RNG
    if use_cuda: 
        torch.cuda.manual_seed(seed_value) # current GPU
        torch.cuda.manual_seed_all(seed_value) # all GPUs
        torch.backends.cudnn.deterministic = True  # needed for deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False # disable the non-deterministic autotuner

random_seed(42, True)

In [0]:
# this shell script calls the create_wikitext.py module, downloaded below
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/prepare_wiki.sh
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/create_wikitext.py   

# these scripts are also used (some renamed from the original to work in the same folder)
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/fastai_contrib_utils.py
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/postprocess_wikitext.py 
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/fastai_contrib_data.py
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/fastai_contrib_learner.py
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/fastai_contrib_models.py
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/pretrain_lm.py
    
# for classifier training    
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/train_clas.py  
# xnli dataset
!wget https://raw.githubusercontent.com/abedkhooli/ds2/master/ulmfit2/prepare_xnli.sh    

In [0]:
# run the shell script silently (the logger in the wiki extractor is also suppressed)
# creates data/wiki_dumps (.bz2) and, for Arabic:
#   data/wiki_extr/ar (folders AA - AL, .., with 100 wiki_xx files in each)
#   data/wiki/ar-100, data/wiki/ar-2, data/wiki/ar-all
# warnings come from WikiExtractor.py (cloned by the shell script). See:
# https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py#L666

# lots of info to stdout, turned off with > /dev/null
!bash prepare_wiki.sh > /dev/null
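
Since the script runs silently, it can be worth confirming that the expected folders were created before copying anything to Drive. The following is only a minimal sketch, assuming the folder names listed in the comments above; `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, not part of the downloaded scripts.

```python
from pathlib import Path

# Folder names taken from the comments in the cell above (Arabic, 'ar').
# This list is an assumption; adjust it if your layout differs.
EXPECTED_DIRS = ["wiki_dumps", "wiki_extr/ar", "wiki/ar-2", "wiki/ar-100", "wiki/ar-all"]

def missing_dirs(root="data", expected=EXPECTED_DIRS):
    """Return the expected sub-folders of `root` that do not exist."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]

# An empty list means every expected folder is present.
print(missing_dirs())
```

If the list is not empty, rerunning `prepare_wiki.sh` without the `> /dev/null` redirect should show where the extraction stopped.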

In [0]:
# mount Google Drive to store files: sessions can die or disconnect, and wiki extraction takes time
from google.colab import drive
drive.mount('/gdrive')

In [0]:
#------------ token files to drive
!cp -a data/wiki/. '/gdrive/My Drive/ulmfit/ar2/'
#------------ raw files to drive
!cp -a data/wiki_extr/. '/gdrive/My Drive/ulmfit/ar2/'

In [0]:
# ----------- create ar-unk files from small corpus ---------
!python -m postprocess_wikitext "data/wiki/ar-2" 'ar'
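
The exact rule used by `postprocess_wikitext` is not shown in this notebook. As a rough illustration of the general idea behind "unk" files, the sketch below replaces tokens that occur fewer than `min_freq` times with `<unk>`. The function name, the threshold, and the `<unk>` string are all assumptions made for illustration, not the script's actual implementation.

```python
from collections import Counter

def replace_rare_tokens(tokens, min_freq=3, unk="<unk>"):
    """Map every token seen fewer than `min_freq` times to `unk` (illustrative only)."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_freq else unk for tok in tokens]

# Toy example: 'a' and 'b' occur at least twice, 'c' and 'd' only once.
sample = "a b a c a b d".split()
print(replace_rare_tokens(sample, min_freq=2))
# → ['a', 'b', 'a', '<unk>', 'a', 'b', '<unk>']
```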

In [0]:
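# save the unk tokens (source and target may differ - check your env) 
# make sure the target folder exists, otherwise cp fails
!mkdir -p '/gdrive/My Drive/ulmfit/ar2/unk/'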
!cp data/wiki-unk/ar.wiki.test.tokens '/gdrive/My Drive/ulmfit/ar2/unk/'
!cp data/wiki-unk/ar.wiki.train.tokens '/gdrive/My Drive/ulmfit/ar2/unk/'
!cp data/wiki-unk/ar.wiki.valid.tokens '/gdrive/My Drive/ulmfit/ar2/unk/'

# remove the local data, assuming you will work from Drive after a disconnect
!rm -rf data/

In [0]:
# get unk wiki files from g drive
!cp -a '/gdrive/My Drive/ulmfit/ar2/unk/.' data/

In [0]:
# I was training with a 30k vocab here. Run this if you need to start over and clear the folders
!rm -rf tmp
!rm -rf data/models/v30k
!rm -rf models/*

In [0]:
#----- qrnn (quasi rnn, faster) and bidir (sentencepiece) not working at this time
#----- qrnn trains but model not read by fastai, bidir causes batch matching error

import pretrain_lm
import train_clas

exp = pretrain_lm.LMHyperParams(dataset_path='data', 
                                base_lm_path=None, bidir=False, 
                                qrnn=False, tokenizer='v', max_vocab=30000, 
                                emb_sz=400, nh=1150, nl=3, clip=0.2, 
                                bptt=70, bs=64, lang='ar', name='Arabic')


Batch size: 64
Max vocab: 30000
Cache dir: data/models/v30k
Model dir: data/models/v30k/lstm_Arabic.m
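
To make the `bs` and `bptt` settings above more concrete, here is a simplified pure-Python sketch of how a long stream of token ids is typically cut into language-model batches: the stream is split into `bs` parallel rows, and each batch covers up to `bptt` consecutive positions, with targets shifted by one token. This is an illustration only; `lm_batches` is a hypothetical helper, and the fastai data loader differs in details (for example, I believe it varies the window length around `bptt`).

```python
def lm_batches(token_ids, bs, bptt):
    """Yield (inputs, targets); each is a list of `bs` rows of up to `bptt` ids."""
    n = len(token_ids) // bs          # tokens per row; any remainder is dropped
    rows = [token_ids[i * n:(i + 1) * n] for i in range(bs)]
    for start in range(0, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]   # next-token targets
        yield x, y

# Toy stream of 20 ids, 2 rows, windows of 4 tokens.
for x, y in lm_batches(list(range(20)), bs=2, bptt=4):
    print(x, y)
```

With `bs=64` and `bptt=70`, each full batch would hold 64 rows of 70 tokens under this scheme.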


In [0]:
#===============
#---- Runtime > Restart runtime may help if you run out of memory
#---- qrnn sometimes causes memory issues, keep it off for now
# the learning rate (set in the code, not a parameter) is probably too small.
#===============

# use this to train model
#exp.train_lm(num_epochs=10, drop_mult=0.3) 

# use this instead if you want to report logloss and perplexity.
# The method was brought back into the class and is hard-wired for the Moses tokenizer (no SentencePiece).
exp.validate_lm(num_epochs=10, drop_mult=0.3) # this calls train_lm

# for Arabic small corpus:
# Test logloss: 3.2080156803131104 perplexity: 24.729965209960938
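
The two numbers reported above are related: perplexity is the exponential of the log loss, assuming the loss is the mean cross-entropy per token in nats. A quick check with the reported value (the `perplexity` helper is just for illustration):

```python
import math

def perplexity(logloss):
    """Perplexity from a mean per-token cross-entropy given in nats."""
    return math.exp(logloss)

reported_logloss = 3.2080156803131104
# Prints roughly 24.73, in line with the perplexity reported above.
print(perplexity(reported_logloss))
```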
