# Augmented and No Duplicates

In this notebook we load a preprocessed dataset in which conflicting labels (cases where different annotators assigned different labels to the same ADU) have been resolved by majority voting. We then performed a train-test split and translated the training data to English and back to Portuguese (back-translation) in order to augment the training data.
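
As an illustration of the majority-voting step, here is a minimal sketch. It is not the preprocessing code used to build the dataset: the `majority_label` helper, the ADU ids and the label names are all hypothetical, and returning `None` on a tie is an assumption about how unresolved conflicts might be handled.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority: the caller decides whether to drop the ADU
    return counts[0][0]

# Hypothetical annotations: ADU id -> labels given by different annotators
annotations = {
    "adu_1": ["label_a", "label_a", "label_b"],
    "adu_2": ["label_b", "label_c"],
    "adu_3": ["label_c"],
}
resolved = {adu: majority_label(labels) for adu, labels in annotations.items()}
```

ADUs that resolve to `None` have no majority and would need to be dropped or adjudicated separately.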

In [None]:
!pip install pandas
!pip install datasets
!pip install transformers
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install optuna

## Loading a dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


For ease of use with Transformer models, we load the pre-split CSV files as Hugging Face datasets, then split the held-out file in half to obtain the validation and test sets.

In [3]:
import pandas as pd
from datasets import load_dataset, DatasetDict
import numpy as np

# Load pre-split dataset
train = load_dataset('csv', data_files='/content/drive/Shareddrives/PLN/Assignment 2/data/augmented/OpArticles_ADUs_train_final.csv', split='train')
test_valid = load_dataset('csv', data_files='/content/drive/Shareddrives/PLN/Assignment 2/data/augmented/OpArticles_ADUs_test_final.csv', split='train')

# Split the held-out 10% in half: one half for validation, one half for test
valid_test = test_valid.train_test_split(test_size=0.5)

# Gather the three splits into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train,
    'validation': valid_test['train'],
    'test': valid_test['test']
})

Using custom data configuration default-a4c5285da80c97b7
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-a4c5285da80c97b7/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)
Using custom data configuration default-8bf195b38240fc95
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-8bf195b38240fc95/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)
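
The `train_test_split` call above does not fix a seed, so which rows land in validation and which in test can change between runs. `train_test_split` accepts a `seed` argument for this purpose. The idea can be sketched without the library as follows; `split_in_half` is a hypothetical helper, and the 2401 stand-in row ids are only chosen to match the 1200 and 1201 examples reported in the logs further down.

```python
import random

def split_in_half(rows, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut it into two halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

held_out = list(range(2401))  # stand-in row ids, not the real data
validation, test = split_in_half(held_out)
```

Calling `split_in_half(held_out)` again returns the same two halves, which is what makes later comparisons between models fair.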


## Tokenizer

We first define a helper that loads the tokenizer for a given model:

In [4]:
from transformers import AutoTokenizer

def get_tokenizer(name):
    return AutoTokenizer.from_pretrained(name)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data.

We then compute the length, in whitespace-separated words, of the longest sequence in each data split:

In [5]:
def find_max_length(dataset):
    return len(max(dataset, key=lambda x: len(x.split())).split())

train_max_length = find_max_length(train_valid_test_dataset["train"]["tokens"])
val_max_length = find_max_length(train_valid_test_dataset["validation"]["tokens"])
test_max_length = find_max_length(train_valid_test_dataset["test"]["tokens"])

print(f"Longest sequence in train set has {train_max_length} words")
print(f"Longest sequence in val set has {val_max_length} words")
print(f"Longest sequence in test set has {test_max_length} words")

Longest sequence in train set has 79 words
Longest sequence in val set has 80 words
Longest sequence in test set has 82 words
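
Note that these lengths are word counts, whereas the tokenizer's `max_length` counts subword tokens, so a sequence usually has at least as many tokens as words. The measurement itself can be written more directly, as in this sketch with made-up sample sentences:

```python
def find_max_length(texts):
    """Length, in whitespace-separated words, of the longest string in texts."""
    return max(len(text.split()) for text in texts)

# Made-up sample sentences, not taken from the dataset
samples = ["O governo deve agir", "Isto é um facto", "Sim"]
longest = find_max_length(samples)
```

This returns the same value as the version above without splitting the longest string twice.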


We tokenize the entire dataset, truncating or padding every sequence to 81 tokens:

In [6]:
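# Set later by train_model(), once the model name is known
tokenizer = None

def tokenize_dataset(sample):
    # max_length counts subword tokens (special tokens included), not words
    return tokenizer(sample["tokens"], truncation=True, max_length=81, padding="max_length")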

def get_tokenized_data(dataset):
    return dataset.map(tokenize_dataset, batched=True, remove_columns=["tokens"])
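
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a simplified sketch of what happens to one sequence of token ids. `pad_or_truncate` is a hypothetical helper and `pad_id=0` is an assumption; the real tokenizer also inserts special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length or right-pad it; also return the attention mask."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding  # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
```

Every sequence ends up with the same length, at the cost of losing the tail of any sequence longer than `max_length`.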

## Loading model

In [7]:
from transformers import AutoModelForSequenceClassification

# Set in the Testing section below, before calling train_model()
model_name = None

def get_model():
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5, ignore_mismatched_sizes=True)
    model.cuda()

    return model

### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

#### Custom training to use a weighted loss

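A weighted loss is useful for our imbalanced training set: errors on rare classes are penalized more heavily than errors on frequent ones.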
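
The following sketch shows what a class-weighted cross-entropy computes, written in plain Python for clarity. It mirrors the behavior of `nn.CrossEntropyLoss(weight=...)` with the default mean reduction, where the weighted per-example losses are divided by the sum of the weights of the true classes. The logits, labels and weights below are made up.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example cross-entropy; each example is weighted by its true class."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[label]  # negative log-softmax of the true class
        total += weights[label] * nll
        weight_sum += weights[label]
    return total / weight_sum

# Made-up batch: the first example is confidently right, the second is a coin flip
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
plain = weighted_cross_entropy(logits, labels, weights=[1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, weights=[1.0, 3.0])
```

With class 1 weighted three times as much, the uncertain prediction on the class 1 example dominates the loss, so `weighted` is larger than `plain`.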

In [8]:
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_valid_test_dataset['train']['label']),
    y=train_valid_test_dataset['train']['label']
    )
class_weights

array([0.76430425, 0.42116612, 2.84216778, 1.34850299, 4.4672112 ])
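
The `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * count)`, so a class with an average share of the data gets weight 1. A minimal re-implementation on made-up labels, with `balanced_class_weights` as a hypothetical helper name:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Made-up labels: class 0 is frequent, class 2 is rare
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(toy_labels)
```

Read in reverse, the weights printed above suggest that class 1 (weight 0.42) makes up roughly 1 / (5 × 0.42) ≈ 47% of the training set, while class 4 (weight 4.47) makes up only about 4%.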

In [9]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from torch import nn, tensor
from IPython.display import display

class CustomTrainer(Trainer):
    """Trainer whose loss is a class-weighted cross-entropy."""
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted loss over the 5 labels (weights are the rounded class_weights computed above)
        loss_fct = nn.CrossEntropyLoss(weight=tensor([0.76, 0.42, 2.84, 1.35, 4.47]))
        loss_fct.cuda()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

def get_trainingArgs():
    return TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        fp16=True,
        learning_rate=2e-5,
        num_train_epochs=3,
        weight_decay=0.01,
        seed=42,
        data_seed=42,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_f1"
    )

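def get_trainer(model_init_, args_, dataset_, tokenizer_, data_collator_, compute_metrics_):
    # Note: this builds the stock Trainer, so the weighted loss in CustomTrainer is not applied;
    # replace Trainer with CustomTrainer here to train with it.
    return Trainer(
        model_init=model_init_,
        args=args_,
        train_dataset=dataset_["train"],
        eval_dataset=dataset_["validation"],
        tokenizer=tokenizer_,
        data_collator=data_collator_,
        compute_metrics=compute_metrics_
    )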

def train_model(model_name):
    # tokenize_dataset() reads the module-level tokenizer, so set it here
    global tokenizer
    tokenizer = get_tokenizer(model_name)
    tokenized_dataset = get_tokenized_data(train_valid_test_dataset)

    trainer = get_trainer(
        get_model,
        get_trainingArgs(),
        tokenized_dataset,
        tokenizer,
        DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics
    )

    # Train the model
    display(trainer.train())

    # Check performance on the validation set
    display(trainer.evaluate())

    # Check how the model fares on our test set
    display(trainer.predict(test_dataset=tokenized_dataset["test"]))

    # Save the model for future use.
    # Note: model_name is appended verbatim, so an absolute local path yields a nested directory.
    trainer.save_model('/content/drive/Shareddrives/PLN/Assignment 2/models/augmented/' + model_name)
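
`compute_metrics` reports macro-averaged scores, which give every class the same weight regardless of its frequency. The sketch below recomputes macro F1 by hand on made-up predictions to show why it can sit well below accuracy on imbalanced data; it treats an undefined precision or recall as 0.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Made-up case: the classifier always predicts the majority class
labels = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
score = macro_f1(labels, preds)
```

Here accuracy is 0.8 while macro F1 is only about 0.44, because the minority class contributes an F1 of 0.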


## Testing

In [10]:
model_name = "neuralmind/bert-large-portuguese-cased"

In [11]:
train_model(model_name)

loading configuration file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-large-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
creating metadata file for /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
loading weights file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
Some weights of the model checkpoint at neuralmind/bert-large-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictio

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.9725 | 0.917506 | 0.595833 | 0.567888 | 0.582544 | 0.561988 |
| 2 | 0.6314 | 1.093188 | 0.5975 | 0.573598 | 0.589873 | 0.562411 |
| 3 | 0.3916 | 1.356288 | 0.586667 | 0.567018 | 0.565777 | 0.570969 |


***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-599
Configuration saved in ./results/checkpoint-599/config.json
Model weights saved in ./results/checkpoint-599/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-599/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-599/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1198
Configuration saved in ./results/checkpoint-1198/config.json
Model weights saved in ./results/checkpoint-1198/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1198/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1198/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1797
Configuration saved in ./results/checkpoint-1797/config.json


TrainOutput(global_step=1797, training_loss=0.601668721381067, metrics={'train_runtime': 1142.6204, 'train_samples_per_second': 50.258, 'train_steps_per_second': 1.573, 'total_flos': 8466656773622364.0, 'train_loss': 0.601668721381067, 'epoch': 3.0})

***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32


{'epoch': 3.0,
 'eval_accuracy': 0.5975,
 'eval_f1': 0.5735976290392364,
 'eval_loss': 1.093187689781189,
 'eval_precision': 0.5898728187528886,
 'eval_recall': 0.5624111112712293,
 'eval_runtime': 6.304,
 'eval_samples_per_second': 190.356,
 'eval_steps_per_second': 6.028}
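
Although the dictionary reports `'epoch': 3.0`, its metrics match epoch 2 of the training table. This is consistent with `load_best_model_at_end=True` and `metric_for_best_model="eval_f1"`, which restore the checkpoint with the highest validation F1 before evaluating. The selection can be reproduced from the table values:

```python
# (epoch, validation loss, macro F1) copied from the training table above
history = [
    (1, 0.917506, 0.567888),
    (2, 1.093188, 0.573598),
    (3, 1.356288, 0.567018),
]
best_by_f1 = max(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])
```

Selecting by F1 picks epoch 2, whereas selecting by validation loss would have picked epoch 1. The validation loss rises after the first epoch while the training loss keeps falling, which suggests the model starts to overfit early.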

***** Running Prediction *****
  Num examples = 1201
  Batch size = 32


PredictionOutput(predictions=array([[ 1.192   ,  4.26    , -1.529   , -1.288   , -2.447   ],
       [ 2.963   ,  1.047   , -2.596   ,  1.291   , -3.523   ],
       [-0.5127  ,  2.273   ,  2.877   , -1.967   , -2.545   ],
       ...,
       [-0.009674,  1.51    , -3.078   ,  3.447   , -2.898   ],
       [ 3.697   ,  1.086   , -1.954   , -0.6206  , -1.987   ],
       [ 1.783   ,  1.455   , -2.781   ,  2.336   , -3.531   ]],
      dtype=float16), label_ids=array([0, 0, 3, ..., 3, 0, 1]), metrics={'test_loss': 1.065811038017273, 'test_accuracy': 0.607826810990841, 'test_f1': 0.5761504089001257, 'test_precision': 0.5998510006105106, 'test_recall': 0.5573621758266, 'test_runtime': 6.333, 'test_samples_per_second': 189.641, 'test_steps_per_second': 6.0})

Saving model checkpoint to /content/drive/Shareddrives/PLN/Assignment 2/models/augmented/neuralmind/bert-large-portuguese-cased
Configuration saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented/neuralmind/bert-large-portuguese-cased/config.json
Model weights saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented/neuralmind/bert-large-portuguese-cased/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented/neuralmind/bert-large-portuguese-cased/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented/neuralmind/bert-large-portuguese-cased/special_tokens_map.json


In [12]:
!rm -rf ./results/

### With the domain-adapted model

In [13]:
model_name = '/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained'

In [14]:
train_model(model_name)

Didn't find file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/added_tokens.json. We won't load it.
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/vocab.txt
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer.json
loading file None
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/special_tokens_map.json
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer_config.json


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LA

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.9786 | 0.923761 | 0.608333 | 0.582616 | 0.597287 | 0.574875 |
| 2 | 0.6318 | 1.076979 | 0.606667 | 0.583337 | 0.587054 | 0.580872 |
| 3 | 0.3931 | 1.371055 | 0.595 | 0.571012 | 0.565913 | 0.577561 |


***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-599
Configuration saved in ./results/checkpoint-599/config.json
Model weights saved in ./results/checkpoint-599/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-599/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-599/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1198
Configuration saved in ./results/checkpoint-1198/config.json
Model weights saved in ./results/checkpoint-1198/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1198/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1198/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1797
Configuration saved in ./results/checkpoint-1797/config.json


TrainOutput(global_step=1797, training_loss=0.6023351194862532, metrics={'train_runtime': 1145.3249, 'train_samples_per_second': 50.139, 'train_steps_per_second': 1.569, 'total_flos': 8466656773622364.0, 'train_loss': 0.6023351194862532, 'epoch': 3.0})

***** Running Evaluation *****
  Num examples = 1200
  Batch size = 32


{'epoch': 3.0,
 'eval_accuracy': 0.6066666666666667,
 'eval_f1': 0.5833371045128058,
 'eval_loss': 1.076979398727417,
 'eval_precision': 0.5870539442171665,
 'eval_recall': 0.5808719335231988,
 'eval_runtime': 6.2818,
 'eval_samples_per_second': 191.029,
 'eval_steps_per_second': 6.049}

***** Running Prediction *****
  Num examples = 1201
  Batch size = 32


PredictionOutput(predictions=array([[ 0.677 ,  3.67  ,  0.3188, -2.463 , -2.346 ],
       [ 3.037 , -0.1995, -2.256 ,  1.571 , -2.355 ],
       [ 0.362 ,  2.428 ,  2.96  , -2.611 , -2.736 ],
       ...,
       [-0.4675,  1.9375, -3.35  ,  3.875 , -3.055 ],
       [ 3.527 ,  1.43  , -2.88  ,  0.4988, -3.09  ],
       [ 1.275 ,  2.016 , -3.604 ,  2.58  , -3.223 ]], dtype=float16), label_ids=array([0, 0, 3, ..., 3, 0, 1]), metrics={'test_loss': 1.0366098880767822, 'test_accuracy': 0.6244796003330558, 'test_f1': 0.6070919120448623, 'test_precision': 0.6386950313301535, 'test_recall': 0.5842018761051871, 'test_runtime': 6.3399, 'test_samples_per_second': 189.436, 'test_steps_per_second': 5.994})

Saving model checkpoint to /content/drive/Shareddrives/PLN/Assignment 2/models/augmented//content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained
Configuration saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented//content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model weights saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented//content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented//content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/augmented//content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/special_tokens_map.json
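
The doubled path in the log above (`.../models/augmented//content/drive/...`) comes from concatenating the output directory with a `model_name` that is itself an absolute path. One possible way to avoid it is sketched below; `model_save_dir` is a hypothetical helper, and keeping only the last path component for local models is an assumption about the desired layout.

```python
import posixpath

def model_save_dir(base_dir, model_name):
    """Build the output directory for a hub id or a local model path."""
    if posixpath.isabs(model_name):
        # Local model: keep only the last component instead of nesting the full path
        model_name = posixpath.basename(posixpath.normpath(model_name))
    return posixpath.join(base_dir, model_name)

base = "/content/drive/Shareddrives/PLN/Assignment 2/models/augmented"
hub_dir = model_save_dir(base, "neuralmind/bert-large-portuguese-cased")
local_dir = model_save_dir(base, "/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained")
```

Hub ids keep their `organisation/model` layout, while the local domain-adapted model would be saved under `.../models/augmented/pretrained`.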


In [15]:
!rm -rf ./results/