# XLM-R Tutorial

In this tutorial, we will train (fine-tune), evaluate, and export an XLM-R model, starting from a large pre-trained model.

We will create dummy datasets and use a test config so that the model can be trained in a few minutes for demo purposes. To train your own model, update the config with your own datasets and tune its parameters (e.g. increase epochs).

See the full paper and introduction at https://pytext.readthedocs.io/en/master/xlm_r.html

## Install PyText from source code

In [1]:
# please ignore the warning about tensorboard if it appears
!pip install --quiet git+https://github.com/facebookresearch/pytext

## Download a pre-trained model

We provide two pre-trained models, "xlmr.base.v0" and "xlmr.large.v0", at https://pytext.readthedocs.io/en/master/xlm_r.html#pre-trained-models

Let's download "xlmr.large.v0" and extract the files.

In [2]:
!wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.v0.tar.gz
!tar xzf ./xlmr.large.v0.tar.gz

--2020-04-06 16:01:21--  https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.v0.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2606:4700:10::6816:4a8e, 2606:4700:10::6816:4b8e, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2606:4700:10::6816:4a8e|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5116367334 (4.8G) [application/x-tar]
Saving to: ‘xlmr.large.v0.tar.gz’


2020-04-06 16:30:21 (2.81 MB/s) - ‘xlmr.large.v0.tar.gz’ saved [5116367334/5116367334]



## Create dummy datasets

In [3]:
import json
import os

import torch

from pytext import workflow
from pytext.config.serialize import pytext_config_from_json
from pytext.models.roberta import RoBERTa
from pytext.task.serialize import load

Install apex from https://github.com/NVIDIA/apex/.


In [4]:
dummy_train_filename = "dummy_train_file.txt"
dummy_test_filename = "dummy_test_file.txt"
dummy_eval_filename = "dummy_eval_file.txt"

dummy_dataset = """neutral	Conceptually cream skimming has two basic dimensions - product and geography.	Product and geography are what make cream skimming work. 
entailment	you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him	You lose the things to the following level if the people recall.
entailment	One of our number will carry out your instructions minutely.	A member of my team will execute your orders with immense precision.
entailment	How do you know? All this is their information again.	This information belongs to them.
neutral	yeah i tell you what though if you go price some of those tennis shoes i can see why now you know they're getting up in the hundred dollar range	The tennis shoes have a range of prices.
entailment	my walkman broke so i'm upset now i just have to turn the stereo up real loud	I'm upset that my walkman broke and now I have to turn the stereo up really loud.
neutral	But a few Christian mosaics survive above the apse is the Virgin with the infant Jesus, with the Archangel Gabriel to the right (his companion Michael, to the left, has vanished save for a few feathers from his wings).	Most of the Christian mosaics were destroyed by Muslims.  
entailment	(Read  for Slate 's take on Jackson's findings.)	Slate had an opinion on Jackson's findings.
contradiction	Gays and lesbians.	Heterosexuals.
contradiction	At the end of Rue des Francs-Bourgeois is what many consider to be the city's most handsome residential square, the Place des Vosges, with its stone and red brick facades.	Place des Vosges is constructed entirely of gray marble."""

for filename in (dummy_train_filename, dummy_test_filename, dummy_eval_filename):
    with open(filename, "w") as f:
        f.write(dummy_dataset)
        print(f"Created dummy dataset: {filename}")

Created dummy dataset: dummy_train_file.txt
Created dummy dataset: dummy_test_file.txt
Created dummy dataset: dummy_eval_file.txt
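
Each row of the dummy dataset has three tab-separated fields: a label followed by two text columns. Before training on your own data, it can help to confirm that every row has this shape. The following is a minimal sketch using only the standard library. The helper name `summarize_tsv` and the sample file are illustrative assumptions, not part of PyText.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_tsv(path, expected_columns=3):
    """Count labels and check that every row has the expected number of fields.

    Hypothetical helper for illustration; it is not a PyText API.
    """
    label_counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside sentences as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            if len(row) != expected_columns:
                raise ValueError(
                    f"{path}:{line_number}: expected {expected_columns} "
                    f"fields, got {len(row)}"
                )
            label_counts[row[0]] += 1
    return label_counts


# Small made-up sample so the sketch runs on its own.
sample = (
    "neutral\tA premise.\tA hypothesis.\n"
    "entailment\tAnother premise.\tAnother hypothesis.\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.tsv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)
    counts = summarize_tsv(sample_path)

print(dict(counts))
```

In the notebook, the same helper could be called on `dummy_train_filename` or on your own files.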


In [5]:
# To train a real model, set your own datasets here
TRAIN_FILENAME = dummy_train_filename
TEST_FILENAME = dummy_test_filename
EVAL_FILENAME = dummy_eval_filename

PRE_TRAIN_MODEL_DIR = "xlmr.large.v0"

## Create a PyText Config

In [6]:
config_json = """
{
  "version": 18,
  "task": {
    "DocumentClassificationTask": {
      "data": {
        "Data": {
          "source": {
            "TSVDataSource": {
              "train_filename": "{TRAIN_FILENAME}",
              "test_filename": "{TEST_FILENAME}",
              "eval_filename": "{EVAL_FILENAME}",
              "field_names": [
                "label",
                "text1",
                "text2"
              ]
            }
          },
          "batcher": {
            "Batcher": {
              "train_batch_size": 8,
              "eval_batch_size": 8,
              "test_batch_size": 8
            }
          },
          "sort_key": "tokens"
        }
      },
      "trainer": {
        "TaskTrainer": {
          "epochs": 1,
          "early_stop_after": 0,
          "max_clip_norm": null,
          "report_train_metrics": true,
          "target_time_limit_seconds": null,
          "do_eval": true,
          "num_samples_to_log_progress": 10,
          "num_accumulated_batches": 1,
          "optimizer": {
            "Adam": {
              "lr": 0.000005,
              "weight_decay": 0
            }
          },
          "scheduler": null,
          "sparsifier": null,
          "fp16_args": {
            "FP16OptimizerApex": {
              "init_loss_scale": null,
              "min_loss_scale": null
            }
          }
        }
      },
      "model": {
        "RoBERTa": {
          "inputs": {
            "tokens": {
              "columns": [
                "text1",
                "text2"
              ],
              "vocab_file": "{VOCAB_PATH}",
              "tokenizer": {
                "SentencePieceTokenizer": {
                  "sp_model_path": "{SP_MODEL_PATH}"
                }
              },
              "max_seq_len": 256
            },
            "labels": {
              "LabelTensorizer": {
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
              }
            }
          },
          "encoder": {
            "RoBERTaEncoder": {
              "load_path": null,
              "save_path": "encoder.pt",
              "shared_module_key": null,
              "embedding_dim": 1024,
              "vocab_size": 250002,
              "num_encoder_layers": 24,
              "num_attention_heads": 16,
              "model_path": "{PRE_TRAIN_MODEL_PATH}",
              "is_finetuned": false
            }
          },
          "decoder": {
            "load_path": null,
            "save_path": "decoder.pt",
            "freeze": false,
            "shared_module_key": "DECODER",
            "hidden_dims": [],
            "out_dim": null,
            "activation": "gelu"
          },
          "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
              "CrossEntropyLoss": {}
            },
            "label_weights": null
          }
        }
      },
      "metric_reporter": {
        "ClassificationMetricReporter": {
          "model_select_metric": "accuracy",
          "target_label": null,
          "text_column_names": [
            "text1",
            "text2"
          ],
          "recall_at_precision_thresholds": []
        }
      }
    }
  }
}
"""

In [7]:
config = pytext_config_from_json(json.loads(config_json))
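
The quoted values such as "{TRAIN_FILENAME}" in the config are plain placeholder strings. This tutorial overwrites them on the parsed config object in the next section. An alternative, sketched here under assumptions, is to substitute them in the JSON text before parsing. `fill_placeholders` is a hypothetical helper, and the shortened template below is for illustration only. `str.format` is avoided because the JSON itself is full of braces.

```python
import json


def fill_placeholders(template, values):
    """Replace each "{NAME}" token in a JSON template with a JSON-escaped value.

    Hypothetical helper for illustration; it is not a PyText API.
    """
    for name, value in values.items():
        # json.dumps escapes backslashes and quotes; strip its outer quotes
        # because the placeholder already sits inside a quoted JSON string.
        escaped = json.dumps(value)[1:-1]
        template = template.replace("{" + name + "}", escaped)
    return template


# Shortened stand-in for the full config string.
template = (
    '{"source": {"train_filename": "{TRAIN_FILENAME}", '
    '"vocab_file": "{VOCAB_PATH}"}}'
)
filled = fill_placeholders(
    template,
    {
        "TRAIN_FILENAME": "dummy_train_file.txt",
        "VOCAB_PATH": "xlmr.large.v0/dict.txt",
    },
)
parsed = json.loads(filled)
print(parsed["source"]["train_filename"])
```

Either approach should give the same config. Setting attributes after parsing, as this tutorial does, keeps the JSON template unchanged.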



## Update Config

In [8]:
config.task.data.source.train_filename = TRAIN_FILENAME
config.task.data.source.test_filename = TEST_FILENAME
config.task.data.source.eval_filename = EVAL_FILENAME

config.task.model.inputs.tokens.tokenizer.sp_model_path = os.path.join(
    PRE_TRAIN_MODEL_DIR, "sentencepiece.bpe.model"
)
config.task.model.inputs.tokens.vocab_file = os.path.join(PRE_TRAIN_MODEL_DIR, "dict.txt")
config.task.model.encoder.model_path = os.path.join(PRE_TRAIN_MODEL_DIR, "model.pt")
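
The three paths set above all point into the extracted model directory. Since the download is several gigabytes, a quick check that the files exist can save a failed training start. The following is a minimal sketch. `missing_model_files` is a hypothetical helper, and the file names are the ones used in the cell above.

```python
import os
import tempfile

# File names taken from the config updates in this tutorial.
REQUIRED_FILES = ("sentencepiece.bpe.model", "dict.txt", "model.pt")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir.

    Hypothetical helper for illustration; it is not a PyText API.
    """
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Demonstrate on a temporary directory that lacks model.pt.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("sentencepiece.bpe.model", "dict.txt"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("")
    missing = missing_model_files(tmp_dir)

print(missing)
```

In the notebook, calling `missing_model_files(PRE_TRAIN_MODEL_DIR)` should return an empty list once the archive has been extracted.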

## Train Model

In [9]:
trained_model, best_metric = workflow.train_model(config)


Parameters: PyTextConfig:
    auto_resume_from_snapshot: False
    debug_path: /tmp/model.debug
    distributed_world_size: 1
    export_caffe2_path: None
    export_onnx_path: /tmp/model.onnx
    export_torchscript_path: None
    gpu_streams_for_distributed_training: 1
    include_dirs: None
    load_snapshot_path: 
    modules_save_dir: 
    random_seed: None
    report_eval_results: False
    save_all_checkpoints: False
    save_module_checkpoints: False
    save_snapshot_path: /tmp/model.pt
    task: DocumentClassificationTask.Config:
        data: Data.Config:
            batcher: Batcher.Config:
                eval_batch_size: 8
                test_batch_size: 8
                train_batch_size: 8
            in_memory: True
            sort_key: tokens
            source: TSVDataSource.Config:
                column_mapping: {}
                delimiter: 	
                drop_incomplete_rows: False
                eval_filename: dummy_eval_file.txt
                field_name

## Load Model

In [10]:
model_file = "/tmp/model.pt"
task, config, _ = load(model_file)
loaded_model = task.model

Loaded checkpoint...
Use config saved in snapshot
Creating task: DocumentClassificationTask...
PyText data schema: {'text1': <class 'str'>, 'text2': <class 'str'>, 'label': <class 'str'>}.
Skipped initializing tensorizers since they are loaded from a previously saved state.
Loading model from model state dict...
Loaded!


In [11]:
# loaded_model is a torch.nn.Module
isinstance(loaded_model, torch.nn.Module)


True