# Domain Adaptation

In this notebook we start from the pretrained [bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model and continue its masked language model training on the article bodies. We then use this domain-adapted model, which has been exposed to the news vocabulary, to train a sentence classification model that predicts the label of each ADU from its tokens.

In [None]:
!pip install pandas
!pip install datasets
!pip install transformers
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

## Loading model and tokenizer

In [None]:
model_name = "neuralmind/bert-large-portuguese-cased"

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(model_name)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at neuralmind/bert-large-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load files

In [None]:
import pandas as pd

In [None]:
oparticles = pd.read_excel('/content/drive/Shareddrives/PLN/Assignment 2/data/OpArticles.xlsx')
oparticles = oparticles.drop(columns=['article_id', 'title', 'authors', 'meta_description','keywords', 'topics', 'publish_date', 'url_canonical'])

print(oparticles.info())
print(oparticles.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    373 non-null    object
dtypes: object(1)
memory usage: 3.0+ KB
None
                                                body
0  O poeta espanhol António Machado escrevia, uns...
1  “O mais excelente quadro posto a uma luz logo ...
2  1. As sociedades humanas parecem ser regidas p...
3  Este foi um Mundial incrível. Vimos actuações ...
4  O futebol sempre foi um jogo aparentemente sim...


In [None]:
adus = pd.read_excel('/content/drive/Shareddrives/PLN/Assignment 2/data/OpArticles_ADUs.xlsx')
adus = adus.drop(columns=['article_id', 'annotator', 'node','ranges'])
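# Encode the string labels as integer class ids (assigning back avoids in-place replacement on a column)
adus['label'] = adus['label'].replace({'Value': 0, 'Value(+)': 1, 'Value(-)': 2, 'Fact': 3, 'Policy': 4})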

print(adus.info())
print(adus.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16743 entries, 0 to 16742
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tokens  16743 non-null  object
 1   label   16743 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 261.7+ KB
None
                                              tokens  label
0           O facto não é apenas fruto da ignorância      0
1  havia no seu humor mais jornalismo (mais inves...      0
2                              É tudo cómico na FIFA      0
3  o que todos nós permitimos que esta organizaçã...      0
4            não nos fazem rir à custa dos poderosos      0


### Create datasets and split data

In [None]:
from datasets import Dataset
from datasets import DatasetDict

In [None]:
oparticles_hf = Dataset.from_pandas(oparticles)

# 90% train, 10% validation
train_test = oparticles_hf.train_test_split(test_size=0.1, shuffle=True, seed=42)

# Gather the splits into a single DatasetDict
train_valid_test_oparticles = DatasetDict({
    'train': train_test['train'],
    'validation': train_test['test']
})
train_valid_test_oparticles

DatasetDict({
    train: Dataset({
        features: ['body'],
        num_rows: 335
    })
    validation: Dataset({
        features: ['body'],
        num_rows: 38
    })
})

In [None]:
adus_hf = Dataset.from_pandas(adus)

# 90% train, 10% test+validation
train_test = adus_hf.train_test_split(test_size=0.1, shuffle=True, seed=42)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5, shuffle=True, seed=42)

# Gather the splits into a single DatasetDict
train_valid_test_adus = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})
train_valid_test_adus

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 838
    })
})
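
The split sizes printed above can be checked by hand. The following is a minimal sketch, assuming the held-out share is rounded up to a whole number of rows (an assumption that happens to reproduce the counts shown); `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (train_rows, held_out_rows), rounding the held-out share up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# Opinion articles: 373 rows, 10% held out for validation
articles_train, articles_valid = split_sizes(373, 0.1)

# ADUs: 16743 rows, 10% held out, then split in half into validation and test
adus_train, adus_holdout = split_sizes(16743, 0.1)
adus_valid, adus_test = split_sizes(adus_holdout, 0.5)

print(articles_train, articles_valid)     # 335 38
print(adus_train, adus_valid, adus_test)  # 15068 837 838
```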

## Domain Opinion Articles Pre-Training



### Tokenize domain

In [None]:
def tokenize_domain(sample):
    return tokenizer(sample["body"])

tokenized_oparticles = train_valid_test_oparticles.map(tokenize_domain, batched=True, remove_columns=["body"])
tokenized_oparticles

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 335
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 38
    })
})

Now that we've tokenized the opinion articles, the next step is to concatenate them and split the result into fixed-size chunks of `chunk_size` tokens.

In [None]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
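
To see what the grouping does, here is a small self-contained sketch that applies the same logic to a toy batch with a chunk size of 4. The token ids are made up for illustration, and `group_texts_toy` is a hypothetical copy of the function above that takes `chunk_size` as a parameter.

```python
def group_texts_toy(examples, chunk_size):
    # Concatenate every column across all documents in the batch
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For masked language modelling the labels start as a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result

# Two "documents" of 4 and 6 tokens (ids are illustrative only)
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 10, 11, 12, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
chunks = group_texts_toy(toy_batch, chunk_size=4)
print(chunks["input_ids"])  # [[101, 7, 8, 102], [101, 9, 10, 11]]
```

The last two tokens are discarded because they do not fill a complete chunk, and a chunk may start in the middle of a document, as the decoded example further down shows.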

In [None]:
domain_dataset = tokenized_oparticles.map(group_texts, batched=True)
domain_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2963
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 296
    })
})

In [None]:
tokenizer.decode(domain_dataset["train"][5]["input_ids"])

'e investidores estrangeiros. Falta agora transformar Lisboa numa cidade atractiva para os lisboetas. [SEP] [CLS] 1. Estamos a assistir ao nascimento de genuínos processos políticos na União Europeia e isso é bom. Vou procurar explicar porquê. E também vou explicar as razões pelas quais isso desagrada a muitos europeístas que supostamente defendem a União. Até agora toda a construção europeia — e de uma maneira também muito evidente o Parlamento Europeu — foi fundamentalmente marcada pela lógica da despolitização. Um mundo à parte da política democrática usual dos Estados - membros. Esta última é feita de dissensões, controvérsias agudas e oposição'

### Train domain

In [None]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    seed=42,
    data_seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=domain_dataset["train"],
    eval_dataset=domain_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
    )

Using amp half precision backend
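
The data collator is what turns each chunk into a masked language modelling example. The sketch below illustrates the idea in plain Python, assuming the usual BERT recipe (each non-special token is selected with probability 0.15; a selected token becomes the mask token 80% of the time, a random token 10% of the time, and stays unchanged otherwise). It is an illustration only, not the library's implementation: `mask_tokens` and all ids here are hypothetical.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores

def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected with p = 0.15
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                     # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)   # replace with a random token
        # otherwise the token is left unchanged
    return masked, labels

# Toy sequence: id 1 and id 2 stand in for [CLS] and [SEP], id 3 for [MASK]
toy_ids = [1] + list(range(10, 40)) + [2]
masked_ids, mlm_labels = mask_tokens(toy_ids, special_ids={1, 2}, mask_id=3, vocab_size=100)
```

Only the selected positions contribute to the loss, because every other label is set to `-100`.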


In [None]:
trainer.train()

***** Running training *****
  Num examples = 2963
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 930
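
The total number of optimization steps follows from the dataset size, batch size, and number of epochs. A minimal sketch, assuming a single device and no gradient accumulation (as in the log above); `total_steps` is a hypothetical helper.

```python
import math

def total_steps(num_examples, batch_size, num_epochs):
    # The last batch of each epoch may be smaller, so round up
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Domain training: 2963 chunks, batch size 32, 10 epochs
domain_steps = total_steps(2963, 32, 10)
print(domain_steps)  # 930

# ADU fine-tuning later in the notebook: 15068 examples, batch size 32, 5 epochs
adus_steps = total_steps(15068, 32, 5)
print(adus_steps)  # 2355
```

This also explains the checkpoint names: with one checkpoint saved per epoch, they land at multiples of 93 steps here and 471 steps in the fine-tuning run.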


Epoch,Training Loss,Validation Loss
1,No log,1.589626
2,No log,1.481335
3,No log,1.399243
4,No log,1.388155
5,No log,1.488503
6,1.588700,1.460368
7,1.588700,1.39847
8,1.588700,1.4237
9,1.588700,1.43494
10,1.588700,1.429116
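
A common way to read these validation losses is as perplexity, the exponential of the cross-entropy loss. The sketch below picks the epoch with the lowest validation loss from the table above, which is the checkpoint `load_best_model_at_end=True` would restore if the validation loss is used as the selection metric (an assumption, since no `metric_for_best_model` is set in this run).

```python
import math

# Validation loss per epoch, copied from the table above
eval_losses = {
    1: 1.589626, 2: 1.481335, 3: 1.399243, 4: 1.388155, 5: 1.488503,
    6: 1.460368, 7: 1.39847, 8: 1.4237, 9: 1.43494, 10: 1.429116,
}

best_epoch = min(eval_losses, key=eval_losses.get)
perplexity = math.exp(eval_losses[best_epoch])
print(best_epoch, round(perplexity, 2))  # 4 4.01
```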


***** Running Evaluation *****
  Num examples = 296
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-93
Configuration saved in ./results/checkpoint-93/config.json
Model weights saved in ./results/checkpoint-93/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-93/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-93/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 296
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-186
Configuration saved in ./results/checkpoint-186/config.json
Model weights saved in ./results/checkpoint-186/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-186/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-186/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 296
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-279
Configuration saved in ./results/checkpoint-279/config.json
Model weights s

TrainOutput(global_step=930, training_loss=1.527977071782594, metrics={'train_runtime': 995.3461, 'train_samples_per_second': 29.769, 'train_steps_per_second': 0.934, 'total_flos': 6903959515837440.0, 'train_loss': 1.527977071782594, 'epoch': 10.0})

In [None]:
trainer.save_model('/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained')

Saving model checkpoint to /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained
Configuration saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model weights saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/special_tokens_map.json


In [None]:
!rm -rf ./results/

## Fine-tuning for ADU labelling

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = '/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained'

def get_model():
  model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
  model.cuda()

  return model

In [None]:
def tokenize_adus(sample):
    return tokenizer(sample["tokens"], truncation=True, max_length=81, padding="max_length")

tokenized_adus = train_valid_test_adus.map(tokenize_adus, batched=True, remove_columns=["tokens"])

### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model on our training data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np
from transformers import DataCollatorWithPadding
from torch import nn, tensor

class CustomTrainer(Trainer):
    # Per-class loss weights, in label order: Value, Value(+), Value(-), Fact, Policy
    class_weights = tensor([0.41, 2.37, 1.15, 0.92, 5.01])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers releases pass in
        labels = inputs.get("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross-entropy; the weights must live on the same device as the logits
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

def get_trainingArgs():
    return TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        fp16=True,
        learning_rate=2e-5,
        num_train_epochs=5,
        weight_decay=0.01,
        seed=42,
        data_seed=42,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1"
    )

def get_trainer(model_init_, args_, dataset_, tokenizer_, data_collator_, compute_metrics_):
    return CustomTrainer(
        model_init=model_init_,
        args=args_,
        train_dataset=dataset_["train"],
        eval_dataset=dataset_["validation"],
        tokenizer=tokenizer_,
        data_collator=data_collator_,
        compute_metrics=compute_metrics_
    )
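
The class weights in `CustomTrainer` are hard-coded, and the notebook does not record how they were obtained. One common heuristic for imbalanced labels is inverse-frequency ("balanced") weighting, where each class gets `n_samples / (n_classes * class_count)`. The sketch below shows that heuristic on made-up labels; it is an assumption for illustration, not necessarily how the values above were computed, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights; assumes every class appears at least once."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (num_classes * counts[c]) for c in range(num_classes)]

# Toy label distribution: class 0 dominates, class 4 is rare
toy_labels = [0] * 10 + [1] * 2 + [2] * 4 + [3] * 3 + [4] * 1
weights = balanced_class_weights(toy_labels, num_classes=5)
print([round(w, 2) for w in weights])  # [0.4, 2.0, 1.0, 1.33, 4.0]
```

Frequent classes end up with weights below 1 and rare classes with weights above 1, so mistakes on rare classes cost more in the loss.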

In [None]:
trainer = get_trainer(
    get_model,
    get_trainingArgs(),
    tokenized_adus,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )

# Train Model
display(trainer.train())

# Check performance on the validation set
display(trainer.evaluate())

# Check how the model fares on our test set
display(trainer.predict(test_dataset=tokenized_adus["test"]))

# Save model for future use
trainer.save_model('/content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned')

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LA

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.874422,0.574671,0.575295,0.534646,0.681538
2,0.993900,0.963483,0.587814,0.581288,0.549014,0.686961
3,0.674800,1.042538,0.58184,0.591904,0.555941,0.674898
4,0.509700,1.208726,0.592593,0.603391,0.56663,0.66991
5,0.406000,1.328075,0.58184,0.591231,0.557613,0.646951


***** Running Evaluation *****
  Num examples = 837
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-471
Configuration saved in ./results/checkpoint-471/config.json
Model weights saved in ./results/checkpoint-471/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-471/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-471/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 837
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 837
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1413
Configuration saved in ./results/checkpoint-1413/config.json
Model we

TrainOutput(global_step=2355, training_loss=0.6015410502245472, metrics={'train_runtime': 1521.2226, 'train_samples_per_second': 49.526, 'train_steps_per_second': 1.548, 'total_flos': 1.110782435351076e+16, 'train_loss': 0.6015410502245472, 'epoch': 5.0})

***** Running Evaluation *****
  Num examples = 837
  Batch size = 32


{'epoch': 5.0,
 'eval_accuracy': 0.5925925925925926,
 'eval_f1': 0.6033914267974125,
 'eval_loss': 1.2087256908416748,
 'eval_precision': 0.5666297649919372,
 'eval_recall': 0.6699099983636067,
 'eval_runtime': 4.432,
 'eval_samples_per_second': 188.853,
 'eval_steps_per_second': 6.092}

***** Running Prediction *****
  Num examples = 838
  Batch size = 32


PredictionOutput(predictions=array([[ 0.967 ,  4.254 , -3.611 , -1.0205, -1.713 ],
       [ 4.418 , -2.367 , -0.1798,  2.367 , -3.787 ],
       [ 4.336 , -2.64  ,  1.313 ,  0.56  , -2.361 ],
       ...,
       [ 3.264 , -4.61  ,  2.744 ,  2.43  , -3.33  ],
       [ 1.4375, -3.94  ,  4.38  ,  0.881 , -2.207 ],
       [ 2.69  , -4.363 ,  3.65  ,  2.143 , -3.697 ]], dtype=float16), label_ids=array([0, 0, 0, 0, 2, 0, 0, 0, 3, 1, 3, 2, 3, 2, 3, 1, 0, 0, 2, 0, 0, 2,
       1, 0, 2, 0, 2, 2, 0, 0, 3, 0, 3, 0, 0, 0, 4, 2, 0, 0, 0, 3, 3, 0,
       0, 2, 0, 3, 3, 2, 2, 4, 0, 0, 1, 3, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0,
       0, 0, 0, 0, 1, 2, 0, 0, 2, 4, 3, 4, 0, 0, 1, 0, 0, 0, 3, 0, 2, 0,
       0, 2, 1, 2, 3, 4, 0, 4, 0, 3, 3, 0, 0, 0, 0, 2, 0, 2, 3, 0, 0, 0,
       0, 0, 1, 3, 3, 0, 2, 0, 2, 2, 0, 1, 3, 0, 0, 0, 0, 0, 4, 0, 2, 0,
       3, 3, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 0, 3, 3, 0, 0, 1, 0, 3, 0, 0,
       0, 3, 0, 0, 0, 0, 0, 0, 2, 0, 3, 0, 0, 0, 1, 3, 0, 1, 0, 0, 0, 2,
       0, 1, 0, 2, 3, 3,
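
The `predictions` array holds one row of five logits per ADU. To turn a row into a label, take the index of the largest logit and map it back through the encoding used when loading the ADUs. A minimal sketch using the first three rows printed above; `ID2LABEL` and `predict_label` are hypothetical names introduced here.

```python
# Inverse of the label encoding applied when the ADUs were loaded
ID2LABEL = {0: "Value", 1: "Value(+)", 2: "Value(-)", 3: "Fact", 4: "Policy"}

def predict_label(logits_row):
    # Index of the largest logit is the predicted class id
    best_id = max(range(len(logits_row)), key=logits_row.__getitem__)
    return ID2LABEL[best_id]

# First three rows of the predictions array shown above
first_rows = [
    [0.967, 4.254, -3.611, -1.0205, -1.713],
    [4.418, -2.367, -0.1798, 2.367, -3.787],
    [4.336, -2.64, 1.313, 0.56, -2.361],
]
predicted = [predict_label(row) for row in first_rows]
print(predicted)  # ['Value(+)', 'Value', 'Value']
```

The first three entries of `label_ids` are all 0 (`Value`), so in this small sample the first prediction is wrong and the next two are correct.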

Saving model checkpoint to /content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned
Configuration saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned/config.json
Model weights saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/domain/finetuned/special_tokens_map.json


In [None]:
!rm -rf ./results/