# XLM-R Tutorial

In this tutorial, we will train (fine-tune), evaluate, and export an XLM-R model, starting from a large pre-trained model.

We use a dummy dataset with a test config so that the model trains in a few minutes.

See the full paper and introduction at https://pytext.readthedocs.io/en/master/xlm_r.html

## Install PyText from source code

Follow the instructions at https://pytext.readthedocs.io/en/master/installation.html#install-from-source

## Download pre-trained models

We provide two pre-trained models, "xlmr.base.v0" and "xlmr.large.v0", at https://pytext.readthedocs.io/en/master/xlm_r.html#pre-trained-models

Let's download "xlmr.large.v0" and extract the files.

In [None]:
!wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.v0.tar.gz
!tar xzf ./xlmr.large.v0.tar.gz

## Update Config and Train Model

In [4]:
import json
import os

import torch

from pytext import workflow
from pytext.config.serialize import pytext_config_from_json
from pytext.models.roberta import RoBERTa
from pytext.task.serialize import load

In [12]:
dummy_dataset = """contradiction\t▁Well , ▁I ▁wasn ' t ▁even ▁thinking ▁about ▁that , ▁but ▁I ▁was ▁so ▁frustra ted , ▁and , ▁I ▁ended ▁up ▁talking ▁to ▁him ▁again .\t▁I ▁have nt ▁spoke n ▁to ▁him ▁again .
entailment\t▁Well , ▁I ▁wasn ' t ▁even ▁thinking ▁about ▁that , ▁but ▁I ▁was ▁so ▁frustra ted , ▁and , ▁I ▁ended ▁up ▁talking ▁to ▁him ▁again .\t▁I ▁was ▁so ▁up set ▁that ▁I ▁just ▁started ▁talking ▁to ▁him ▁again .
neutral\t▁Well , ▁I ▁wasn ' t ▁even ▁thinking ▁about ▁that , ▁but ▁I ▁was ▁so ▁frustra ted , ▁and , ▁I ▁ended ▁up ▁talking ▁to ▁him ▁again .\t▁We ▁had ▁a ▁great ▁talk .
neutral\t▁And ▁I ▁thought ▁that ▁was ▁a ▁privilege , ▁and ▁it ' s ▁still , ▁it ' s ▁still , ▁I ▁was ▁the ▁only ▁ni ne ▁two - two ▁Ex - O ▁which ▁was ▁my ▁AFF C ▁Air ▁Force ▁Care er ▁field .\t▁I ▁was ▁not ▁aware ▁that ▁I ▁was ▁not ▁the ▁only ▁person ▁to ▁be ▁at ▁the ▁field ▁that ▁day .
entailment\t▁And ▁I ▁thought ▁that ▁was ▁a ▁privilege , ▁and ▁it ' s ▁still , ▁it ' s ▁still , ▁I ▁was ▁the ▁only ▁ni ne ▁two - two ▁Ex - O ▁which ▁was ▁my ▁AFF C ▁Air ▁Force ▁Care er ▁field .\t▁I ▁was ▁under ▁the ▁impression ▁that ▁I ▁was ▁the ▁only ▁one ▁with ▁that ▁number ▁at ▁the ▁AFF C ▁Air ▁Force ▁Care er ▁field .
contradiction\t▁And ▁I ▁thought ▁that ▁was ▁a ▁privilege , ▁and ▁it ' s ▁still , ▁it ' s ▁still , ▁I ▁was ▁the ▁only ▁ni ne ▁two - two ▁Ex - O ▁which ▁was ▁my ▁AFF C ▁Air ▁Force ▁Care er ▁field .\t▁We ▁all ▁were ▁given ▁the ▁same ▁exact ▁number ▁no ▁matter ▁what ▁privilege s ▁we ▁were ▁promise d ▁to ▁be ▁gran ted , ▁it ▁was ▁all ▁a ▁lie .
contradiction\t▁They ▁told ▁me ▁that , ▁uh , ▁that ▁I ▁would ▁be ▁called ▁in ▁a ▁guy ▁at ▁the ▁end ▁for ▁me ▁to ▁meet .\t▁I ▁was ▁never ▁told ▁anything ▁about ▁meeting ▁anyone .
entailment\t▁They ▁told ▁me ▁that , ▁uh , ▁that ▁I ▁would ▁be ▁called ▁in ▁a ▁guy ▁at ▁the ▁end ▁for ▁me ▁to ▁meet .\t▁I ▁was ▁told ▁a ▁guy ▁would ▁be ▁called ▁in ▁for ▁me ▁to ▁meet .
neutral\t▁They ▁told ▁me ▁that , ▁uh , ▁that ▁I ▁would ▁be ▁called ▁in ▁a ▁guy ▁at ▁the ▁end ▁for ▁me ▁to ▁meet .\t▁The ▁guy ▁showed ▁up ▁a ▁bit ▁late .
contradiction\t▁There ' s ▁so ▁much ▁you ▁could ▁talk ▁about ▁on ▁that ▁I ' ll ▁just ▁skip ▁that .\t▁I ▁want ▁to ▁tell ▁you ▁everything ▁I ▁know ▁about ▁that !
"""

dummy_train_filename = os.path.join(os.getcwd(), "dummy_train_file.txt")
dummy_test_filename = os.path.join(os.getcwd(), "dummy_test_file.txt")
dummy_eval_filename = os.path.join(os.getcwd(), "dummy_eval_file.txt")
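# Write the same dummy rows to the train, test, and eval files.
for filename in (dummy_train_filename, dummy_test_filename, dummy_eval_filename):
    # The rows contain the non-ASCII SentencePiece marker "▁", so write UTF-8 explicitly.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(dummy_dataset)
    print(f"Created: {filename}")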

Created: /data/users/stevenliu/notebooks/dummy_train_file.txt
Created: /data/users/stevenliu/notebooks/dummy_test_file.txt
Created: /data/users/stevenliu/notebooks/dummy_eval_file.txt
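
Before training, it can help to confirm that every row has exactly three tab-separated fields, matching the `label`, `text1`, `text2` columns declared in the config's `field_names`. The following is a minimal sketch; `check_tsv_rows` is a hypothetical helper written for this tutorial, not part of PyText.

```python
def check_tsv_rows(text, expected_fields=3):
    """Return (line_number, field_count) for each row with an unexpected field count."""
    bad_rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # ignore empty lines, e.g. after a trailing newline
        field_count = len(line.split("\t"))
        if field_count != expected_fields:
            bad_rows.append((line_number, field_count))
    return bad_rows


# Tiny illustration: the second row is missing its third column.
sample = "neutral\tfirst sentence\tsecond sentence\nentailment\tonly one text column\n"
print(check_tsv_rows(sample))  # [(2, 2)]
```

In this notebook, `check_tsv_rows(dummy_dataset)` returns an empty list when every row is well formed.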


In [13]:
# To train a real model, point these at your own dataset files
TRAIN_FILENAME = dummy_train_filename
TEST_FILENAME = dummy_test_filename
EVAL_FILENAME = dummy_eval_filename

PRE_TRAIN_MODEL_DIR = os.path.join(os.getcwd(), "xlmr.large.v0")
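
If the download or extraction step did not complete, the failure only surfaces later, when training tries to read the files. A small pre-flight check can catch that early. This is a sketch under the assumption that the extracted directory holds the three files this tutorial reads from it (`sentencepiece.bpe.model`, `dict.txt`, `model.pt`); `missing_model_files` is a hypothetical helper, not a PyText function.

```python
import os

# File names taken from the paths this tutorial reads out of the extracted archive.
EXPECTED_FILES = ("sentencepiece.bpe.model", "dict.txt", "model.pt")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

Calling `missing_model_files(PRE_TRAIN_MODEL_DIR)` should return an empty list once the archive has been extracted into the working directory.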

In [14]:
config_json = """
   {
    "version": 18,
	"task": {
		"DocumentClassificationTask": {
			"data": {
				"Data": {
					"source": {
						"TSVDataSource": {
							"train_filename": "{TRAIN_FILENAME}",
							"test_filename": "{TEST_FILENAME}",
							"eval_filename": "{EVAL_FILENAME}",
							"field_names": [
								"label",
								"text1",
								"text2"
							]
						}
					},
					"batcher": {
						"Batcher": {
							"train_batch_size": 8,
							"eval_batch_size": 8,
							"test_batch_size": 8
						}
					},
					"sort_key": "tokens"
				}
			},
			"trainer": {
				"TaskTrainer": {
					"epochs": 1,
					"early_stop_after": 0,
					"max_clip_norm": null,
					"report_train_metrics": true,
					"target_time_limit_seconds": null,
					"do_eval": true,
					"num_samples_to_log_progress": 10,
					"num_accumulated_batches": 1,
					"num_batches_per_epoch": 1,
					"optimizer": {
						"Adam": {
							"lr": 0.000005,
							"weight_decay": 0
						}
					},
					"scheduler": null,
					"sparsifier": null,
					"fp16_args": {
						"FP16OptimizerApex": {
							"init_loss_scale": null,
							"min_loss_scale": null
						}
					}
				}
			},
			"model": {
				"RoBERTa": {
					"inputs": {
						"tokens": {
							"columns": [
								"text1",
								"text2"
							],
							"vocab_file": "{VOCAB_PATH}",
							"tokenizer": {
								"SentencePieceTokenizer": {
									"sp_model_path": "{SP_MODEL_PATH}"
								}
							},
							"max_seq_len": 256
						},
						"labels": {
							"LabelTensorizer": {
								"column": "label",
								"allow_unknown": false,
								"pad_in_vocab": false,
								"label_vocab": null
							}
						}
					},
					"encoder": {
						"RoBERTaEncoder": {
							"load_path": null,
							"save_path": "encoder.pt",
							"shared_module_key": null,
							"embedding_dim": 1024,
							"vocab_size": 250002,
							"num_encoder_layers": 24,
							"num_attention_heads": 16,
							"model_path": "{PRE_TRAIN_MODEL_PATH}",
							"is_finetuned": false
						}
					},
					"decoder": {
						"load_path": null,
						"save_path": "decoder.pt",
						"freeze": false,
						"shared_module_key": "DECODER",
						"hidden_dims": [],
						"out_dim": null,
						"activation": "gelu"
					},
					"output_layer": {
						"load_path": null,
						"save_path": null,
						"freeze": false,
						"shared_module_key": null,
						"loss": {
							"CrossEntropyLoss": {}
						},
						"label_weights": null
					}
				}
			},
			"metric_reporter": {
				"ClassificationMetricReporter": {
					"model_select_metric": "accuracy",
					"target_label": null,
					"text_column_names": [
						"text1",
						"text2"
					],
					"recall_at_precision_thresholds": []
				}
			}
		}
	}
 }

"""

In [15]:
config_dict = json.loads(config_json)
config_obj = pytext_config_from_json(config_dict)

* Update Config

In [16]:
config_obj.task.data.source.train_filename = TRAIN_FILENAME
config_obj.task.data.source.test_filename = TEST_FILENAME
config_obj.task.data.source.eval_filename = EVAL_FILENAME

config_obj.task.model.inputs.tokens.tokenizer.sp_model_path = os.path.join(
    PRE_TRAIN_MODEL_DIR, "sentencepiece.bpe.model"
)
config_obj.task.model.inputs.tokens.vocab_file = os.path.join(PRE_TRAIN_MODEL_DIR, "dict.txt")
config_obj.task.model.encoder.model_path = os.path.join(PRE_TRAIN_MODEL_DIR, "model.pt")
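
The JSON above contains placeholder strings such as `"{TRAIN_FILENAME}"`, which the cell above overwrites through attribute assignment on the parsed config object. An alternative, sketched below, is to substitute the placeholders in the plain dict before it is handed to `pytext_config_from_json`. `fill_placeholders` is a hypothetical helper for illustration and is not part of PyText; the sample template is a cut-down stand-in for the real config.

```python
import json


def fill_placeholders(node, values):
    """Recursively replace strings of the form "{NAME}" with values["NAME"]."""
    if isinstance(node, dict):
        return {key: fill_placeholders(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(child, values) for child in node]
    if isinstance(node, str) and node.startswith("{") and node.endswith("}"):
        # Unknown placeholders are left untouched so they stay easy to spot.
        return values.get(node[1:-1], node)
    return node


template = json.loads(
    '{"source": {"train_filename": "{TRAIN_FILENAME}", "field_names": ["label"]}}'
)
filled = fill_placeholders(template, {"TRAIN_FILENAME": "/tmp/train.tsv"})
print(filled["source"]["train_filename"])  # /tmp/train.tsv
```

Either approach ends with the same values in the config; attribute assignment keeps the change visible step by step, while substitution keeps the template and the values in one place.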

* Train Model

In [None]:
trained_model, best_metric = workflow.train_model(config_obj)


Parameters: PyTextConfig:
    auto_resume_from_snapshot: False
    debug_path: /tmp/model.debug
    distributed_world_size: 1
    export_caffe2_path: None
    export_onnx_path: /tmp/model.onnx
    export_torchscript_path: None
    include_dirs: None
    load_snapshot_path: 
    modules_save_dir: 
    random_seed: None
    report_eval_results: False
    save_all_checkpoints: False
    save_module_checkpoints: False
    save_snapshot_path: /tmp/model.pt
    task: DocumentClassificationTask.Config:
        data: Data.Config:
            batcher: Batcher.Config:
                eval_batch_size: 8
                test_batch_size: 8
                train_batch_size: 8
            in_memory: True
            sort_key: tokens
            source: TSVDataSource.Config:
                column_mapping: {}
                delimiter: 	
                drop_incomplete_rows: False
                eval_filename: /data/users/stevenliu/notebooks/dummy_eval_file.txt
                field_names: ['label',

## Load Model

In [None]:
model_file = "/tmp/model.pt"
task, config, _ = load(model_file)

In [None]:
model = task.model
# model is a torch.nn.Module
isinstance(model, torch.nn.Module)
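
Because the loaded model is a regular `torch.nn.Module`, the usual PyTorch inspection tools apply to it. As a minimal sketch, the hypothetical helper below counts parameters using only the standard `parameters()`, `numel()`, and `requires_grad` members; it is not part of PyText.

```python
def count_parameters(module):
    """Return (total, trainable) parameter counts for a torch.nn.Module-like object."""
    total = 0
    trainable = 0
    for param in module.parameters():
        size = param.numel()
        total += size
        if param.requires_grad:
            trainable += size
    return total, trainable
```

In this notebook, call `model.eval()` first to switch off dropout for inference, then `count_parameters(model)` to see how large the fine-tuned model is.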