In [1]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install datasets
!pip install transformers

Collecting pip
  Downloading pip-21.3.1-py3-none-any.whl (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 5.1 MB/s 
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-21.3.1
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 5.2 MB/s            
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96
Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 5.0 MB/s            
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 519 kB/s             
[?25hCollecting aiohttp
  Downloading aiohttp-3.8.1

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
cd "/content/drive/MyDrive/ChiSquareX_NLP_Tutoring/Multi_Lingual/CrisisNLP_labeled_data/CrisisNLP_labeled_data_crowdflower/2014_Hurricane_Odile_Mexico_en"

/content/drive/MyDrive/ChiSquareX_NLP_Tutoring/Multi_Lingual/CrisisNLP_labeled_data/CrisisNLP_labeled_data_crowdflower/2014_Hurricane_Odile_Mexico_en


# Fine-tuning XLM-T

This notebook describes a simple case of finetuning. You can finetune either the `XLM-T` language model, or XLM-T sentiment, which has already been fine-tuned on sentiment analysis data, in 8 languages (this could be useful to do sentiment transfer learning on new languages).,

This notebook was modified from https://huggingface.co/transformers/custom_datasets.html

In [4]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

import numpy as np
from sklearn.metrics import classification_report

## Parameters
Modify according to your need- may play with LR it often helps

In [5]:
LR = 2e-5 #Standard LR for Adam Optimizer is 3e-4
EPOCHS = 3
BATCH_SIZE = 4
MODEL = "cardiffnlp/twitter-xlm-roberta-base" # use this to finetune the language model
#MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = -1 # set this to -1 if you want to use the whole training set

## Data

We download the xml-t sentiment dataset (`UMSAB`) but you can use your own.
If you use the same files structures as [TweetEval](https://github.com/cardiffnlp/tweeteval) (`train_text.txt`, `train_labels.txt`, `val_text.txt`, `...`), you do not need to change anything in the code.

---



In [6]:
# loading dataset for UMSAB's all 8 languages
'''
files = """test_labels.txt
test_text.txt
train_labels.txt
train_text.txt
val_labels.txt
val_text.txt""".split('\n')

for f in files:
  p = f"https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/{f}"
  !wget $p #downloading the corresponding datasets
'''

'\nfiles = """test_labels.txt\ntest_text.txt\ntrain_labels.txt\ntrain_text.txt\nval_labels.txt\nval_text.txt""".split(\'\n\')\n\nfor f in files:\n  p = f"https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/{f}"\n  !wget $p #downloading the corresponding datasets\n'

If any other dataset, you need to modify this part to handle data accordingly

In [7]:
#defining dataset dictionary in the format dict(dict(text:label))
dataset_dict = {} #dict()
for i in ['train','val','test']:
  dataset_dict[i] = {} #dict(dict())
  for j in ['text','labels']:
    dataset_dict[i][j] = open(f"{i}_{j}.txt").read().split('\n')
    dataset_dict[i][j].pop()
    if j == 'labels':
      dataset_dict[i][j] = [int(x) for x in dataset_dict[i][j]] #dict(dict(text:label))

if MAX_TRAINING_EXAMPLES > 0:
  dataset_dict['train']['text']=dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want
  dataset_dict['train']['labels']=dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want

In [8]:
'''
#defining dataset dictionary in the format dict(dict(text:label))
dataset_dict = {} #dict()
for i in ['train','val','test']:
  dataset_dict[i] = {} #dict(dict())
  for j in ['text','labels']:
    dataset_dict[i][j] = open(f"{i}_{j}.txt").read().split('\n')
    if j == 'labels':
      if dataset_dict[i][j] == '':
        continue
      else:
        dataset_dict[i][j] = [int(x) for x in dataset_dict[i][j]] #dict(dict(text:label))

if MAX_TRAINING_EXAMPLES > 0:
  dataset_dict['train']['text']=dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want
  dataset_dict['train']['labels']=dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want

'''

'\n#defining dataset dictionary in the format dict(dict(text:label))\ndataset_dict = {} #dict()\nfor i in [\'train\',\'val\',\'test\']:\n  dataset_dict[i] = {} #dict(dict())\n  for j in [\'text\',\'labels\']:\n    dataset_dict[i][j] = open(f"{i}_{j}.txt").read().split(\'\n\')\n    if j == \'labels\':\n      if dataset_dict[i][j] == \'\':\n        continue\n      else:\n        dataset_dict[i][j] = [int(x) for x in dataset_dict[i][j]] #dict(dict(text:label))\n\nif MAX_TRAINING_EXAMPLES > 0:\n  dataset_dict[\'train\'][\'text\']=dataset_dict[\'train\'][\'text\'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want\n  dataset_dict[\'train\'][\'labels\']=dataset_dict[\'train\'][\'labels\'][:MAX_TRAINING_EXAMPLES] #set MAX_TRAINING_EXAMPLES to -1 if you want to use the whole training set elso upto the index you want\n\n'

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

Downloading:   0%|          | 0.00/652 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/8.68M [00:00<?, ?B/s]

Creating the dataset encodings, encode only the dataset texts

In [10]:
train_encodings = tokenizer(dataset_dict['train']['text'], truncation=True, padding=True)
val_encodings = tokenizer(dataset_dict['val']['text'], truncation=True, padding=True)
test_encodings = tokenizer(dataset_dict['test']['text'], truncation=True, padding=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [11]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

#passing the encodings and labels as parameters
train_dataset = MyDataset(train_encodings, dataset_dict['train']['labels'])
val_dataset = MyDataset(val_encodings, dataset_dict['val']['labels'])
test_dataset = MyDataset(test_encodings, dataset_dict['test']['labels'])

## Fine-tuning

The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments` and
instantiate a `Trainer`/`TFTrainer`.

In [12]:
training_args = TrainingArguments(
    output_dir='./results_6classes',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    warmup_steps=100,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=10,                         # when to print log
    #load_best_model_at_end=True,              # load or not best model at the end
)

num_labels = len(set(dataset_dict["train"]["labels"]))
print(num_labels)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)

6


Downloading:   0%|          | 0.00/1.04G [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-xlm-roberta-base and are newly initialized: ['classifier.out

In [13]:
trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset                  # evaluation dataset
)

trainer.train()

***** Running training *****
  Num examples = 816
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 612


Step,Training Loss
10,1.7874
20,1.8088
30,1.7988
40,1.7626
50,1.7319
60,1.7677
70,1.877
80,1.6632
90,1.5816
100,1.5444


Saving model checkpoint to ./results_6classes/checkpoint-500
Configuration saved in ./results_6classes/checkpoint-500/config.json
Model weights saved in ./results_6classes/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=612, training_loss=0.9266447550525853, metrics={'train_runtime': 80.0626, 'train_samples_per_second': 30.576, 'train_steps_per_second': 7.644, 'total_flos': 71708560082496.0, 'train_loss': 0.9266447550525853, 'epoch': 3.0})

In [14]:
trainer.save_model("./results_6classes/best_model") # save best model

Saving model checkpoint to ./results_6classes/best_model
Configuration saved in ./results_6classes/best_model/config.json
Model weights saved in ./results_6classes/best_model/pytorch_model.bin


## Evaluate on Test set

In [16]:
test_preds_raw, test_labels , _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)

***** Running Prediction *****
  Num examples = 234
  Batch size = 4


[1 4 3 3 0 0 4 1 5 2 3 4 5 1 5 5 0 5 1 0 2 4 2 3 0 4 3 3 2 0 4 1 3 0 2 3 0
 1 2 4 3 2 5 0 1 2 5 4 1 5 5 4 0 5 3 4 1 4 5 1 2 3 2 4 2 1 0 1 4 4 1 0 0 3
 1 3 3 5 2 1 5 0 3 2 4 3 3 0 3 3 3 5 1 0 0 0 3 4 1 3 4 5 3 5 2 1 3 1 1 3 4
 1 0 1 0 5 2 0 4 4 0 4 2 5 2 1 0 1 1 4 1 3 1 5 2 4 4 1 0 5 2 2 4 1 4 2 4 0
 5 3 1 4 2 0 3 5 4 1 3 2 1 4 0 5 5 1 1 0 1 2 2 1 5 2 5 0 3 1 5 1 0 2 2 0 1
 4 0 3 3 4 2 5 3 0 4 1 2 5 3 3 1 2 2 5 2 1 2 4 0 2 4 5 5 4 4 5 3 3 4 0 4 2
 2 4 4 4 0 3 5 0 4 5 1 4]


In [29]:
print(test_labels), print(test_preds)

[1 4 3 3 0 0 4 1 5 2 3 4 5 1 5 5 0 5 1 0 2 4 2 3 0 4 3 3 2 0 4 1 3 0 2 3 0
 1 2 4 3 2 5 0 1 2 5 4 1 5 5 4 0 5 3 4 1 4 5 1 2 3 2 4 2 1 0 1 4 4 1 0 0 3
 1 3 3 5 2 1 5 0 3 2 4 3 3 0 3 3 3 5 1 0 0 0 3 4 1 3 4 5 3 5 2 1 3 1 1 3 4
 1 0 1 0 5 2 0 4 4 0 4 2 5 2 1 0 1 1 4 1 3 1 5 2 4 4 1 0 5 2 2 4 1 4 2 4 0
 5 3 1 4 2 0 3 5 4 1 3 2 1 4 0 5 5 1 1 0 1 2 2 1 5 2 5 0 3 1 5 1 0 2 2 0 1
 4 0 3 3 4 2 5 3 0 4 1 2 5 3 3 1 2 2 5 2 1 2 4 0 2 4 5 5 4 4 5 3 3 4 0 4 2
 2 4 4 4 0 3 5 0 4 5 1 4]
[1 3 3 3 0 0 4 1 4 2 3 4 5 1 2 2 0 1 1 0 5 3 2 3 0 4 0 3 2 0 5 1 3 0 4 3 1
 1 2 4 3 2 5 0 1 2 2 4 5 5 2 4 0 2 3 1 1 4 2 1 2 0 4 4 5 5 1 1 4 4 4 0 0 4
 1 3 3 2 2 5 5 0 3 2 4 3 3 0 3 3 3 5 1 0 0 0 4 4 1 3 4 5 3 5 2 1 3 1 1 3 4
 1 0 4 0 5 2 5 4 4 0 4 5 2 2 1 0 1 1 4 4 5 5 4 2 4 4 1 0 2 5 1 4 1 4 2 4 0
 5 3 1 5 5 5 2 2 4 1 3 2 1 4 0 2 5 1 3 0 1 2 5 1 5 2 5 0 3 1 5 1 0 2 2 0 5
 4 0 3 3 4 2 5 4 0 4 1 2 2 3 5 4 2 4 2 2 1 2 4 0 2 3 5 5 4 4 3 1 3 4 0 4 2
 2 1 4 4 0 3 5 0 4 2 1 4]


(None, None)

In [20]:
print(classification_report(test_labels, test_preds, digits=3)) #for all 6 classes

              precision    recall  f1-score   support

           0      0.943     0.892     0.917        37
           1      0.825     0.767     0.795        43
           2      0.643     0.730     0.684        37
           3      0.853     0.763     0.806        38
           4      0.755     0.841     0.796        44
           5      0.500     0.486     0.493        35

    accuracy                          0.752       234
   macro avg      0.753     0.746     0.748       234
weighted avg      0.758     0.752     0.753       234



In [21]:
test_labels_binarize = [1 if x!=0 else x for x in test_labels]


test_preds_binarize = [1 if x!=0 else x for x in test_preds]

In [30]:
print(test_labels_binarize), print(test_preds_binarize)

[1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,

(None, None)

In [26]:
print(classification_report(test_labels_binarize, test_preds_binarize, digits=3)) #for binarized classes- relevant(1) and irrelevant(0)

              precision    recall  f1-score   support

           0      0.943     0.892     0.917        37
           1      0.980     0.990     0.985       197

    accuracy                          0.974       234
   macro avg      0.961     0.941     0.951       234
weighted avg      0.974     0.974     0.974       234

