<a href="https://colab.research.google.com/github/clayton-summitt/w266-final/blob/main/XLM_T_Fine_tuning_on_custom_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# !pip install --upgrade pip
!pip install sentencepiece
!pip install datasets
!pip install transformers

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[?25l[K     |▎                               | 10 kB 31.8 MB/s eta 0:00:01[K     |▌                               | 20 kB 35.4 MB/s eta 0:00:01[K     |▉                               | 30 kB 42.7 MB/s eta 0:00:01[K     |█                               | 40 kB 30.6 MB/s eta 0:00:01[K     |█▍                              | 51 kB 16.7 MB/s eta 0:00:01[K     |█▋                              | 61 kB 14.7 MB/s eta 0:00:01[K     |██                              | 71 kB 13.7 MB/s eta 0:00:01[K     |██▏                             | 81 kB 15.2 MB/s eta 0:00:01[K     |██▍                             | 92 kB 16.8 MB/s eta 0:00:01[K     |██▊                             | 102 kB 12.8 MB/s eta 0:00:01[K     |███                             | 112 kB 12.8 MB/s eta 0:00:01[K     |███▎                            | 122 kB 12.8 MB/s eta 0:00:01[K     |██

# Fine-tuning XLM-T

This notebook describes a simple case of finetuning. You can finetune either the `XLM-T` language model, or XLM-T sentiment, which has already been fine-tuned on sentiment analysis data, in 8 languages (this could be useful to do sentiment transfer learning on new languages).,

This notebook was modified from https://huggingface.co/transformers/custom_datasets.html

In [3]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

import numpy as np
from sklearn.metrics import classification_report

## Parameters

In [4]:
LR = 2e-5
EPOCHS = 1
BATCH_SIZE = 32
# MODEL = "cardiffnlp/twitter-xlm-roberta-base" # use this to finetune the language model
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = -1 # set this to -1 if you want to use the whole training set

## Data

We download the xml-t sentiment dataset (`UMSAB`) but you can use your own.
If you use the same files structures as [TweetEval](https://github.com/cardiffnlp/tweeteval) (`train_text.txt`, `train_labels.txt`, `val_text.txt`, `...`), you do not need to change anything in the code.

---



In [5]:
from google.colab import files
from google.colab import drive
drive.mount('/content/drive' ,force_remount=True)
import glob
import os
os.chdir("drive/MyDrive/vaccine/data/fine_tune_sentimnet/")

Mounted at /content/drive


In [7]:
# loading dataset for UMSAB's all 8 languages

files = """test_labels.txt
test_text.txt
train_labels.txt
train_text.txt
val_labels.txt
val_text.txt""".split('\n')

for f in files:
  # p = f"https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/{f}"
  !wget $f

--2021-12-01 17:06:48--  http://test_labels.txt/
Resolving test_labels.txt (test_labels.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘test_labels.txt’
--2021-12-01 17:06:48--  http://test_text.txt/
Resolving test_text.txt (test_text.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘test_text.txt’
--2021-12-01 17:06:48--  http://train_labels.txt/
Resolving train_labels.txt (train_labels.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘train_labels.txt’
--2021-12-01 17:06:49--  http://train_text.txt/
Resolving train_text.txt (train_text.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘train_text.txt’
--2021-12-01 17:06:49--  http://val_labels.txt/
Resolving val_labels.txt (val_labels.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘val_labels.txt’
--2021-12-01 17:06:49--  http://val_text.txt/
Resolving val_text.txt (val_text.txt

In [6]:
os.listdir()

['mapping.txt',
 'test_text.txt',
 'test_labels.txt',
 'train_text.txt',
 'val_labels.txt',
 'val_text.txt',
 'train_labels.txt']

In [7]:

dataset_dict = {}
for i in ['train','val','test']:
  dataset_dict[i] = {}
  for j in ['text','labels']:
    dataset_dict[i][j] = open(f"{i}_{j}.txt").read().split('\n')
    dataset_dict[i][j].pop()
    if j == 'labels':
      
      dataset_dict[i][j] = [int(x) for x in dataset_dict[i][j]]

if MAX_TRAINING_EXAMPLES > 0:
  dataset_dict['train']['text']=dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES]
  dataset_dict['train']['labels']=dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES]

''

In [8]:

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

Downloading:   0%|          | 0.00/841 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

In [9]:
train_encodings = tokenizer(dataset_dict['train']['text'], truncation=True, padding=True)
val_encodings = tokenizer(dataset_dict['val']['text'], truncation=True, padding=True)
test_encodings = tokenizer(dataset_dict['test']['text'], truncation=True, padding=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [10]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(train_encodings, dataset_dict['train']['labels'])
val_dataset = MyDataset(val_encodings, dataset_dict['val']['labels'])
test_dataset = MyDataset(test_encodings, dataset_dict['test']['labels'])

## Fine-tuning

The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments` and
instantiate a `Trainer`/`TFTrainer`.

In [16]:
training_args = TrainingArguments(
    output_dir='./results',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    warmup_steps=100,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=10,                         # when to print log
    save_strategy = "steps",
    evaluation_strategy="steps",                        
    
    load_best_model_at_end=True              # load or not best model at the end
)

num_labels = len(set(dataset_dict["train"]["labels"]))
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)

In [17]:
trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset                  # evaluation dataset
)

trainer.train()

***** Running training *****
  Num examples = 4200
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 132


Step,Training Loss,Validation Loss
10,0.906,0.814919
20,0.76,0.734425
30,0.6999,0.714322
40,0.6481,0.669997
50,0.7231,0.628164
60,0.602,0.613263
70,0.5321,0.575895
80,0.6176,0.587809
90,0.6644,0.605106
100,0.7044,0.546445


***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32
***** Running Evaluation *****
  Num examples = 840
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=132, training_loss=0.6669176988529436, metrics={'train_runtime': 51.9304, 'train_samples_per_second': 80.878, 'train_steps_per_second': 2.542, 'total_flos': 254685566066400.0, 'train_loss': 0.6669176988529436, 'epoch': 1.0})

In [18]:
trainer.save_model("./results/best_model") # save best model

Saving model checkpoint to ./results/best_model
Configuration saved in ./results/best_model/config.json
Model weights saved in ./results/best_model/pytorch_model.bin


## Evaluate on Test set

In [19]:
test_preds_raw, test_labels , _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)
print(classification_report(test_labels, test_preds, digits=3))

***** Running Prediction *****
  Num examples = 1800
  Batch size = 32


              precision    recall  f1-score   support

           0      0.500     0.231     0.316       143
           1      0.785     0.791     0.788      1078
           2      0.662     0.741     0.699       579

    accuracy                          0.731      1800
   macro avg      0.649     0.588     0.601      1800
weighted avg      0.723     0.731     0.722      1800



<a id='ft_native'></a>