# The BERT Cross-Lingual Transferability

The idea and some preliminary results are described [here](https://towardsdatascience.com/bert-based-cross-lingual-question-answering-with-deeppavlov-704242c2ac6f?source=friends_link&sk=b7aef1c29b8a8f067fe62e3bfbea2292).

Basically, we have a BERT-based QA model trained on the English SQuAD dataset. The model is based on multilingual BERT (M-BERT), which allows cross-lingual transfer learning. Our task is to measure to what extent BERT-based QA (and not only QA) models transfer to other languages (Russian? Chinese?).

Exploring the cross-lingual transferability of BERT is currently a very competitive field, so we need to think carefully before starting anything. So far, the most promising idea is to build learning curves: train an M-BERT model on freely available English data, then gradually add training instances from the scarce language-specific datasets (for Russian and Chinese), and compare this model with one trained solely on the same limited number of instances.
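
The learning-curve idea above can be sketched as a plan of paired runs. This is only a rough illustration in plain Python: the `Run` class, the `learning_curve_plan` function, the regime names, and all sizes are hypothetical placeholders, not part of DeepPavlov or of any existing experiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """One point on a learning curve (hypothetical helper, not a DeepPavlov class)."""
    regime: str     # "transfer" or "target_only"
    n_english: int  # number of English training instances
    n_target: int   # number of target-language training instances


def learning_curve_plan(n_english: int, target_sizes: List[int]) -> List[Run]:
    """Pair every target-set size with both training regimes."""
    plan = []
    for n in sorted(set(target_sizes)):
        # English data plus n target-language instances
        plan.append(Run("transfer", n_english, n))
        # Baseline: the same n target-language instances only
        plan.append(Run("target_only", 0, n))
    return plan


# Illustrative numbers only; the English count is a placeholder.
plan = learning_curve_plan(n_english=80000, target_sizes=[100, 1000, 10000])
for run in plan:
    print(run)
```

Plotting the QA metric of the two regimes against `n_target` would then give the two curves to compare.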

Useful material:

0) https://arxiv.org/pdf/1910.04659.pdf 

1) https://arxiv.org/abs/1906.01502

2) https://arxiv.org/abs/1904.09077

3) https://arxiv.org/abs/1806.00920

4) https://arxiv.org/abs/1810.04805

5) https://arxiv.org/abs/1912.09723

This is a starter notebook that shows how to set up experiments with DeepPavlov. It should be ready to run on Colab. It might contain some issues, so please feel free to improve it.

In [None]:
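# Clone over HTTPS: the SSH URL (git@github.com:...) requires an SSH key, which Colab does not have by default
!git clone --single-branch --branch squad_multilingual_configs_size https://github.com/deepmipt/DeepPavlov.git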

In [None]:
!pip install -e DeepPavlov/.

In [None]:
!python -m deeppavlov install trans_en_mbert

In [None]:
# Remove previously downloaded data and models so that training starts from the beginning
!rm -rf ~/.deeppavlov

In [None]:
from deeppavlov.core.common.file import read_json
from deeppavlov import build_model, configs, evaluate_model, train_model
from deeppavlov.core.commands.train import train_evaluate_model_from_config

# MBERT-based model for English squad
cfg_mbert_ensquad = read_json(configs.squad.trans_en_mbert)

# MBERT-based model for Russian squad
cfg_mbert_rusquad = read_json(configs.squad.trans_ru_mbert)

# MBERT-based model for Chinese squad
cfg_mbert_zhsquad = read_json(configs.squad.squad_zh_bert_mult)

# Chinese-BERT-based model for Chinese squad
cfg_zhbert_zhquad = read_json(configs.squad.squad_zh_bert_zh)

# RuBERT-based model for Russian SQuAD (inference and training configs)
cfg_rubert_rusquad_infer = read_json(configs.squad.squad_ru_rubert_infer)
cfg_rubert_rusquad = read_json(configs.squad.squad_ru_rubert)

# define how many instances to use
cfg_mbert_rusquad['dataset_iterator']['port'] = 10000

train_evaluate_model_from_config(cfg_mbert_rusquad, to_train=True, download=True)

# model = train_model(cfg_mbert_rusquad, download=True)
# model = train_model(configs.classifiers.insults_kaggle_bert, download=True)
# model = build_model(configs.squad.ru_on_mbert_1000, download=True, load_trained=False)
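
To get several points of a learning curve, the same config can be copied once per training-set size. Below is a minimal sketch, assuming the `port` key set above really controls the number of instances in this branch (that meaning is taken from the comment in the cell, not verified). `configs_for_sizes` and `demo_cfg` are hypothetical names, and a stand-in dict is used so the snippet runs without DeepPavlov.

```python
import copy


def configs_for_sizes(base_cfg, sizes, key="port"):
    """Return {size: config copy} with dataset_iterator[key] set to that size."""
    out = {}
    for size in sizes:
        cfg = copy.deepcopy(base_cfg)  # leave the base config untouched
        cfg.setdefault("dataset_iterator", {})[key] = size
        out[size] = cfg
    return out


# Stand-in for a config loaded with read_json(...); the contents are placeholders
demo_cfg = {"dataset_iterator": {"class_name": "placeholder"}}
sweep = configs_for_sizes(demo_cfg, [1000, 5000, 10000])
print(sorted(sweep))
```

In the notebook, each resulting config could then be passed to `train_evaluate_model_from_config(cfg, to_train=True, download=True)`, as in the cell above.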

In [1]:
from deeppavlov import build_model
# Load the trained model (assumes the training cell above has finished)
model = build_model(cfg_mbert_rusquad)
# Spanish context, English question
context = (
    'Su área de distribución comprende casi toda Sudamérica al este de los Andes en las '
    'cuencas del río Orinoco, del Amazonas y del Río de la Plata; cubriendo desde el este '
    'de Venezuela y la Guyana hasta Uruguay y el norte y centro de Argentina. Pueden vivir '
    'en diferentes tipos de hábitat, pero muestran preferencia por algunos en concreto. '
    'Suelen encontrarse cerca de lagos, ríos, marismas o manglares.'
)
model([context], ['What countries do capybara live in?'])