# **Pre-training of an Icelandic version of LUKE**

Heavily based on the code available at https://github.com/studio-ousia/luke.


These instructions should work for most languages listed [here](https://en.wikipedia.org/wiki/List_of_Wikipedias).



---


## **Creation of Icelandic pretraining dataset**

1. Download the relevant Wikipedia dump [here](https://dumps.wikimedia.org).
   The dump file used in this experiment can be found [here](https://dumps.wikimedia.org/iswiki/20220101/).
2. Process the Wikipedia dump file.
3. Changes made to the original code for encoding. *Explain and reference the GitHub repository where the code is stored.*
4. Create the pretraining dataset.
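
The download in step 1 can also be scripted. The sketch below assumes the usual layout of the Wikimedia dump server, `<lang>wiki/<date>/<lang>wiki-<date>-pages-articles.xml.bz2`, which matches the file name used in this notebook. The helper names `dump_url` and `download` are illustrative, and older dumps may no longer be hosted.

```python
import shutil
import urllib.request


def dump_url(lang: str, date: str) -> str:
    """Build the URL of the pages-articles dump for a language code and dump date.

    Assumption: the dump server keeps the <lang>wiki/<date>/ directory layout.
    """
    wiki = f"{lang}wiki"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{wiki}-{date}-pages-articles.xml.bz2"


def download(url: str, dest: str) -> str:
    """Stream the file at `url` to the local path `dest` and return `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

For the Icelandic dump used here, `download(dump_url("is", "20220101"), "iswiki-20220101-pages-articles.xml.bz2")` would fetch the file, provided it is still available. Changing the language code is all that should be needed for another Wikipedia.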

### Build DB dump file

```
Usage: python -m luke.cli build-dump-db [OPTIONS] DUMP_FILE OUT_FILE

Options:
  --pool-size INTEGER
  --chunk-size INTEGER
  --help                Show this message and exit.
```

In [None]:
!python -m luke.cli build-dump-db \
iswiki-20220101-pages-articles.xml.bz2 \
iswiki-20220101

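Note: `python -m luke.cli` only works when the `luke` package is importable, for example when the command is run from the root of the LUKE repository with the requirements installed. Otherwise Python reports `ModuleNotFoundError: No module named 'luke'`.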
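
Before launching the build steps, it can help to confirm that Python can find the `luke` package. A minimal sketch using only the standard library (the helper name `module_available` is illustrative):

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` (e.g. "luke" or "luke.cli") can be located for import."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package of a dotted name is missing.
        return False
```

If `module_available("luke.cli")` returns `False`, the working directory or the installation should be checked before running any `python -m luke.cli` command.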


### Build entity vocabulary file

```
Usage: python -m luke.cli build-entity-vocab [OPTIONS] DUMP_DB_FILE OUT_FILE

Options:
  --vocab-size INTEGER
  -w, --white-list FILENAME
  --white-list-only
  --pool-size INTEGER
  --chunk-size INTEGER
  --help                     Show this message and exit.
```

In [None]:
!python -m luke.cli build-entity-vocab \
iswiki-20220101 \
is-entity-vocab



### Build pre-training dataset

```
Usage: python -m luke.cli build-wikipedia-pretraining-dataset [OPTIONS] DUMP_DB_FILE TOKENIZER_NAME ENTITY_VOCAB_FILE OUTPUT_DIR
```

In [None]:
!python -m luke.cli build-wikipedia-pretraining-dataset \
iswiki-20220101 \
mideind/IceBERT-igc \
is-entity-vocab \
./is-wikipedia-pretrain-dataset/

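
The three build steps feed into each other: the dump DB is the input to the entity vocabulary, and both are inputs to the pretraining dataset. Since each step can take a while, one possible sketch is to chain them and skip any step whose output already exists. The function names `dataset_steps` and `run_missing` are illustrative and not part of LUKE. A partially written output would also be skipped, so interrupted runs should be cleaned up by hand.

```python
import os
import subprocess
import sys


def dataset_steps(dump_file, db_file, tokenizer, vocab_file, out_dir):
    """Return (output_path, argv) pairs for the three build steps, in order."""
    cli = [sys.executable, "-m", "luke.cli"]
    return [
        (db_file, cli + ["build-dump-db", dump_file, db_file]),
        (vocab_file, cli + ["build-entity-vocab", db_file, vocab_file]),
        (out_dir, cli + ["build-wikipedia-pretraining-dataset",
                         db_file, tokenizer, vocab_file, out_dir]),
    ]


def run_missing(steps, runner=subprocess.check_call):
    """Run every step whose output path does not exist yet.

    Returns the argv lists that were actually run.
    """
    executed = []
    for output, argv in steps:
        if os.path.exists(output):
            continue  # output already present, assume the step was done
        runner(argv)
        executed.append(argv)
    return executed
```

With the file names used in this notebook, the call would be `run_missing(dataset_steps("iswiki-20220101-pages-articles.xml.bz2", "iswiki-20220101", "mideind/IceBERT-igc", "is-entity-vocab", "./is-wikipedia-pretrain-dataset/"))`.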




---


## **Requirements**

### Install requirements

In [None]:
%cd /content/drive/MyDrive/ice-luke/
!pip install -r requirements.txt

### Install apex

Note: the installation might take some time. If apex has already been downloaded, it is sufficient to copy the files to the relevant locations.

In [None]:
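import os

path = './apex/apex/'

# Clone and build apex only if it has not been downloaded already.
if not os.path.isdir(path):
  !git clone https://github.com/NVIDIA/apex.git
  # `!cd` runs in a throwaway subshell, so use the %cd magic to change directory.
  %cd apex
  !git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
  !pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
  # Return to the project root so the relative paths below resolve.
  %cd ..

# Copy the apex package to the Python package directories.
!cp -r apex/apex/ /usr/local/lib/python3.7/dist-packages/
!cp -r apex/apex/ /usr/local/lib/python3.7/site-packages/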
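
The copy step above hard-codes the Python 3.7 directories of the runtime used in this experiment. On another runtime, a sketch like the following could look up the package directory of the running interpreter instead. The helper names `package_dir` and `copy_package` are illustrative, and this only mirrors the plain file copy above.

```python
import shutil
import sysconfig
from pathlib import Path


def package_dir() -> Path:
    """Directory where the running interpreter installs pure-Python packages."""
    return Path(sysconfig.get_paths()["purelib"])


def copy_package(src, dest_root=None) -> Path:
    """Copy the package directory `src` into `dest_root` (default: package_dir())."""
    src = Path(src)
    dest = Path(dest_root or package_dir()) / src.name
    if not dest.exists():  # leave an existing copy untouched
        shutil.copytree(str(src), str(dest))
    return dest
```

Calling `copy_package("apex/apex/")` would then place the package next to the other installed packages, whatever the Python version.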

### Check if CUDA and apex are available

In [None]:
import torch
from apex import amp

print(torch.cuda.is_available())



---


## **Pretraining LUKE**

As described in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/pdf/2010.01057.pdf) by Yamada et al.


Experiments for Icelandic were conducted on a Tesla V100-SXM2-16GB GPU, and pretraining took roughly 16 hours.

### Make sure that a folder exists for the model output

In [None]:
import os

path = 'luke-icebert-base-768'

if not os.path.isdir(path):
  os.mkdir(path)
  print('{} created!'.format(path))
else:
  print('{} exists!'.format(path))

### Pretraining options

```
Options:
  --multilingual
  --sampling-smoothing FLOAT
  --parallel
  --cpu
  --bert-model-name TEXT
  --entity-emb-size INTEGER
  --batch-size INTEGER
  --gradient-accumulation-steps INTEGER
  --learning-rate FLOAT
  --lr-schedule [warmup_constant|warmup_linear]
  --warmup-steps INTEGER
  --adam-b1 FLOAT
  --adam-b2 FLOAT
  --adam-eps FLOAT
  --weight-decay FLOAT
  --max-grad-norm FLOAT
  --masked-lm-prob FLOAT
  --masked-entity-prob FLOAT
  --whole-word-masking / --subword-masking
  --unmasked-word-prob FLOAT
  --random-word-prob FLOAT
  --unmasked-entity-prob FLOAT
  --random-entity-prob FLOAT
  --mask-words-in-entity-span
  --fix-bert-weights
  --grad-avg-on-cpu / --grad-avg-on-gpu
  --num-epochs INTEGER
  --global-step INTEGER
  --fp16
  --fp16-opt-level [O1|O2]
  --fp16-master-weights / --fp16-no-master-weights
  --fp16-min-loss-scale INTEGER
  --fp16-max-loss-scale INTEGER
  --local-rank, --local_rank INTEGER
  --num-nodes INTEGER
  --node-rank INTEGER
  --master-addr TEXT
  --master-port TEXT
  --log-dir PATH
  --model-file PATH
  --optimizer-file PATH
  --scheduler-file PATH
  --amp-file PATH
  --save-interval-sec INTEGER
  --save-interval-steps INTEGER
  --help                          Show this message and exit.
```

### Pretraining

We use IceBERT-base as our base model.

In [None]:
!python -m luke.cli pretrain ./is-wikipedia-pretrain-dataset/ ./luke-icebert-base-768/ \
--bert-model-name mideind/IceBERT-igc \
--entity-emb-size 768 \
--batch-size 8 \
--gradient-accumulation-steps 1 \
--learning-rate 1e-5 \
--warmup-steps 100 \
--log-dir ./logs/luke-icebert-base-768/ \
--weight-decay=0.01 \
--adam-b1=0.9 \
--adam-b2=0.999 \
--adam-eps=1e-6 \
--lr-schedule='warmup_linear' \
--masked-entity-prob=0.15 \
--masked-lm-prob=0.15
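
To illustrate how `--learning-rate`, `--warmup-steps` and `--lr-schedule='warmup_linear'` interact, the sketch below computes the learning rate for linear warmup followed by linear decay to zero. This is the common definition of such a schedule (as in `get_linear_schedule_with_warmup` from `transformers`). That LUKE's `warmup_linear` follows exactly this shape is an assumption here, and `total_steps` is a hypothetical value that in practice depends on the dataset size and the number of epochs.

```python
def warmup_linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With the settings above (`base_lr=1e-5`, `warmup_steps=100`), the rate climbs during the first 100 steps and then falls steadily for the rest of training.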



---


## **Pretraining LUKE for Entity Disambiguation (ED)**

As described in [Global Entity Disambiguation with Pretrained Contextualized Embeddings of Words and Entities](https://arxiv.org/pdf/1909.00426.pdf) by Yamada et al.


Experiments for Icelandic were conducted on a Tesla V100-SXM2-16GB GPU, and pretraining took roughly 16 hours.

In [None]:
!python -m examples.cli entity-disambiguation create-redirect-tsv iswiki-20220101 iswiki-20220101-redirect.tsv

In [None]:
!wikipedia2vec build-dump-db iswiki-20220101-pages-articles.xml.bz2 iswiki-20220101

### Make sure that a folder exists for the model output

In [None]:
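import os

# Output folders used by the first- and second-stage commands below.
paths = ['ed-luke-icebert-base-768-first-stage',
         'ed-luke-icebert-base-768-second-stage']

for path in paths:
  if not os.path.isdir(path):
    os.mkdir(path)
    print('{} created!'.format(path))
  else:
    print('{} exists!'.format(path))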



### Pretraining



---



Pretraining LUKE for ED is a two-stage process. In the first stage, the pretrained BERT parameters are kept fixed by passing the `--fix-bert-weights` flag, and the model is trained with a learning rate of 5e-4 for one epoch. In the second stage, pretraining continues by updating all parameters with a learning rate of 5e-5 for six epochs. The second stage starts from the model trained in the first stage, which is loaded by pointing `--model-file` at the model file saved in stage one.



---



Note that `--masked-lm-prob` must be set to 0, since the masked language model objective is not used in this training. In addition, `--masked-entity-prob` is set to 0.3 for the experiments.
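
The two-stage setup can be summarized as a small configuration sketch. The `Stage` class and `pretrain_args` function are illustrative helpers, not part of LUKE. They only assemble the stage-specific arguments for `python -m luke.cli` from the values described above, and the remaining options (batch size, warmup, optimizer settings) are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    output_dir: str
    learning_rate: str
    num_epochs: int
    fix_bert_weights: bool = False      # stage one keeps the BERT weights fixed
    model_file: Optional[str] = None    # stage two starts from the stage-one model


def pretrain_args(dataset_dir: str, stage: Stage) -> List[str]:
    """Assemble the stage-specific arguments for `python -m luke.cli`."""
    args = ["pretrain", dataset_dir, stage.output_dir,
            "--learning-rate", stage.learning_rate,
            "--num-epochs", str(stage.num_epochs),
            "--masked-lm-prob", "0",         # no masked language modeling for ED
            "--masked-entity-prob", "0.30"]
    if stage.fix_bert_weights:
        args.append("--fix-bert-weights")
    if stage.model_file:
        args += ["--model-file", stage.model_file]
    return args


first = Stage("./ed-luke-icebert-base-768-first-stage/", "5e-4", 1,
              fix_bert_weights=True)
second = Stage("./ed-luke-icebert-base-768-second-stage/", "5e-5", 6,
               model_file=first.output_dir + "model_epoch1.bin")
```

Keeping both stages in one place makes the dependency explicit: the second stage cannot start until the first stage has written its model file.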

#### First Stage

In [None]:
!python -m luke.cli pretrain ./is-wikipedia-pretrain-dataset/ ./ed-luke-icebert-base-768-first-stage/ \
--bert-model-name mideind/IceBERT-igc \
--entity-emb-size 768 \
--batch-size 8 \
--gradient-accumulation-steps 1 \
--learning-rate 5e-4 \
--warmup-steps 100 \
--log-dir ./logs/ed-luke-icebert-base-768-first-stage/ \
--weight-decay=0.01 \
--adam-b1=0.9 \
--adam-b2=0.999 \
--adam-eps=1e-6 \
--lr-schedule='warmup_linear' \
--masked-entity-prob=0.30 \
--masked-lm-prob=0 \
--max-grad-norm=1.0 \
--num-epochs 1 \
--fix-bert-weights

#### Second Stage

In [None]:
!python -m luke.cli pretrain ./is-wikipedia-pretrain-dataset/ ./ed-luke-icebert-base-768-second-stage/ \
--bert-model-name mideind/IceBERT-igc \
--model-file ./ed-luke-icebert-base-768-first-stage/model_epoch1.bin \
--entity-emb-size 768 \
--batch-size 8 \
--gradient-accumulation-steps 1 \
--learning-rate 5e-5 \
--warmup-steps 100 \
--log-dir ./logs/ed-luke-icebert-base-768-second-stage/ \
--weight-decay=0.01 \
--adam-b1=0.9 \
--adam-b2=0.999 \
--adam-eps=1e-6 \
--lr-schedule='warmup_linear' \
--masked-entity-prob=0.30 \
--masked-lm-prob=0 \
--max-grad-norm=1.0 \
--num-epochs 6