## Loading Data, Importing Packages

Let's first load the packages we need and the dataset.

In [4]:
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel
from transformers import BertForTokenClassification, pipeline
import torch
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, BertTokenizer, AutoModelForTokenClassification

We will import the full dataset, preprocess it, and then split it into the three datasets mentioned in the task description.

In [5]:
full_data = load_dataset('polyglot_ner', name='nl', split="train[:6000]")

Downloading and preparing dataset polyglot_ner/nl to /root/.cache/huggingface/datasets/polyglot_ner/nl/1.0.0/bb2e45c90cd345c87dfd757c8e2b808b78b0094543b511ac49bc0129699609c1...


Dataset polyglot_ner downloaded and prepared to /root/.cache/huggingface/datasets/polyglot_ner/nl/1.0.0/bb2e45c90cd345c87dfd757c8e2b808b78b0094543b511ac49bc0129699609c1. Subsequent calls will reuse this data.


Let's look at the tags we have, and how many of them there are. Then, we'll create a dictionary to map the tags to integers.

In [6]:
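ner_tags = full_data['ner']
# flatten the per-sentence tag lists into one long list of tags
ner_tags = [item for sublist in ner_tags for item in sublist]
# keep every distinct tag once; note that the order of a set can differ between sessions
ner_tags = list(set(ner_tags))
# map every tag to an integer id
ner_to_ix = {tag: i for i, tag in enumerate(ner_tags)}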
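
As a small illustration of the step above, the following sketch counts how often each tag occurs and builds the tag-to-integer mapping on a toy list. The toy data and the names `toy_ner`, `tag_counts`, `toy_tags` and `toy_ner_to_ix` are assumptions made for illustration only. Sorting the tags is a design choice that keeps the mapping identical across runs; it will generally give different ids than the `set`-based cell above.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for full_data['ner'] (hypothetical values, not taken from the dataset)
toy_ner = [
    ["O", "O", "LOC"],
    ["PER", "O", "O", "ORG"],
]

# Flatten the per-sentence tag lists and count each tag
tag_counts = Counter(chain.from_iterable(toy_ner))

# Sort the distinct tags so that the mapping is reproducible between sessions
toy_tags = sorted(tag_counts)
toy_ner_to_ix = {tag: ix for ix, tag in enumerate(toy_tags)}

print(tag_counts)
print(toy_ner_to_ix)
```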

As we'll work with the Dutch language, it makes sense to use a BERT model that has been trained specifically on Dutch text. This is what we do in the next step.

In [7]:
tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForTokenClassification.from_pretrained("GroNLP/bert-base-dutch-cased", num_labels=5)


Some weights of the model checkpoint at GroNLP/bert-base-dutch-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GroNLP/bert-base-dutch-cased

Let's look at the structure of the data. 'id' is an identifier for every sentence in the dataset. 'lang' is the language, which is the same for all sentences here. 'words' holds the already tokenized words; note that upper- and lowercase are preserved. 'ner' holds the named entity recognition tag of each word, which is what we want to predict.

In [8]:
print(full_data)
print(full_data[0])

Dataset({
    features: ['id', 'lang', 'words', 'ner'],
    num_rows: 6000
})
{'id': '0', 'lang': 'nl', 'words': ['De', 'rustige', 'omgeving', 'en', 'centrale', 'ligging', 'hebben', 'de', 'plaats', 'populair', 'gemaakt', 'bij', 'de', 'wat', 'rijkere', 'bevolking', ';', 'Swieqi', 'staat', 'dan', 'ook', 'bekend', 'om', 'de', 'moeizaamheid', 'van', 'het', 'er', 'vinden', 'van', 'een', 'beschikbare', 'woning', '.'], 'ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}


### Preprocessing the datasets
After preprocessing, the dataset should have "input_ids", "token_type_ids" and "attention_mask" as additional columns. The AutoTokenizer does this job for us. The data is already split into words, so we join each sentence back together and then apply our tokenizer. The tokenizer is still needed, because the model expects subword ids from its own vocabulary rather than raw words.

In [9]:
encoded_dataset = [tokenizer(" ".join(item['words']), return_tensors="pt", padding='max_length', truncation=True, max_length=128) for item in full_data]     

In [10]:
MAX_LEN = 128
PAD_LABEL = 4  # extra class id used to fill the label list up to MAX_LEN
for enc_item, item in zip(encoded_dataset, full_data):
    # translate the tags of this sentence into their integer ids
    l1 = [ner_to_ix[tag] for tag in item['ner']]
    # pad with PAD_LABEL; nothing is added if the list is already longer than MAX_LEN
    l1 = l1 + [PAD_LABEL] * (MAX_LEN - len(l1))
    if len(l1) == MAX_LEN:  # skip the elements where the length of 'ner' is > 128
        enc_item['labels'] = torch.tensor(l1)
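
The cell above assigns one label per word, while the tokenizer output contains special tokens and may split a word into several subword tokens, so label positions and token positions may not line up. One common way to handle this is sketched below, under the assumption that the tokenizer is called on `item['words']` with `is_split_into_words=True` and that the word index of every token is read from the encoding's `word_ids()` method (available for fast tokenizers). The helper `align_labels` and the example values are hypothetical and not part of the notebook; `-100` is used because PyTorch's cross-entropy loss ignores that target value by default.

```python
IGNORE_INDEX = -100  # target value that PyTorch's CrossEntropyLoss ignores by default


def align_labels(word_labels, word_ids, ignore_index=IGNORE_INDEX):
    """Give each token the label of the word it came from.

    word_labels: one integer label per word.
    word_ids: one entry per token holding the index of its source word,
              or None for special and padding tokens.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # special or padding token: exclude it from the loss
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
    return aligned


# Hypothetical example: 3 words, the second one split into two subword tokens,
# wrapped in [CLS] and [SEP] and followed by two padding positions.
example_word_ids = [None, 0, 1, 1, 2, None, None, None]
example_labels = [2, 3, 2]
print(align_labels(example_labels, example_word_ids))
# -> [-100, 2, 3, 3, 2, -100, -100, -100]
```

With labels built this way, no separate padding class is needed, since padded positions carry the ignore value instead.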

Next we remove the entries that did not receive any labels because their 'ner' list is longer than 128. This could be done in different ways, but dropping them is convenient, as only a few examples are that long.

In [12]:
# entries that were skipped above only have the three tokenizer keys, so 'labels' is missing
encoded_dataset = [item for item in encoded_dataset if 'labels' in item.keys()]

In [13]:
# 'unpack' every entry in our dataset: return_tensors="pt" adds a leading
# batch dimension of size 1, which we remove before splitting the data
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])

We will randomly choose the entries. In the end, we should have 3 datasets: train1, train2, and test, in the ratio 1000 : 3000 : 2000.

In [None]:
import numpy as np
import random
i1, i2, i3 = int(1000/6000 * len(encoded_dataset)), int(3000/6000 * len(encoded_dataset)), int(2000/6000 * len(encoded_dataset))

In [19]:
random.shuffle(encoded_dataset)
train1, train2, test = encoded_dataset[0:i1], encoded_dataset[i1:i1+i2], encoded_dataset[i1+i2:i1+i2+i3]
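
Because `int()` rounds down, the three sizes computed above can add up to slightly less than the dataset length, in which case the last few shuffled entries end up in no split. The sketch below shows the effect and one possible fix. The function `split_sizes` and the length `5993` are hypothetical; the length is only chosen so that the floored sizes match the 998, 2996 and 1997 examples reported in the logs further down.

```python
def split_sizes(n_items, fractions):
    """Return integer sizes that follow `fractions` and sum to n_items."""
    sizes = [int(n_items * f) for f in fractions]
    # hand the items lost to rounding down to the last split
    sizes[-1] += n_items - sum(sizes)
    return sizes


n = 5993  # hypothetical dataset length after filtering
floored = [int(1000 / 6000 * n), int(3000 / 6000 * n), int(2000 / 6000 * n)]
print(floored, sum(floored))            # the sum is smaller than n
print(split_sizes(n, [1 / 6, 3 / 6, 2 / 6]))
```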

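Now we can get to the actual training!
All models use the same hyperparameters but different training datasets to ensure comparability, and the test dataset is the same for all 3 models. Note that, as written, the cells below pass the same `model` object to every Trainer, so unless the model is reloaded in between, each run starts from the weights left by the previous one.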

### Model 1: Fine-tuned with 1000 sentences

In [22]:
training_args = TrainingArguments(
    num_train_epochs=5,
    weight_decay=0.01,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train1,
)

In [23]:
trainer.train()

***** Running training *****
  Num examples = 998
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 625
  Number of trainable parameters = 108550661
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.0489


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=625, training_loss=0.0420844274520874, metrics={'train_runtime': 149.5348, 'train_samples_per_second': 33.37, 'train_steps_per_second': 4.18, 'total_flos': 325976545190400.0, 'train_loss': 0.0420844274520874, 'epoch': 5.0})
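
The number of optimization steps in the log can be checked by hand. The sketch below assumes what the log reports: a total batch size of 8, no gradient accumulation, and a final partial batch in each epoch that still counts as a step. The helper name `total_steps` is hypothetical.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    # one optimizer step per batch; the last batch of an epoch may be partial
    return math.ceil(num_examples / batch_size) * epochs


print(total_steps(998, 8, 5))   # 625, as reported for this run
print(total_steps(2996, 8, 5))  # 1875, as reported for the runs on the larger training set
```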

In [24]:
preds = trainer.predict(test)

***** Running Prediction *****
  Num examples = 1997
  Batch size = 8


In [25]:
#print(preds.predictions[:2])
#print(preds.predictions[:2].argmax(-1))
print(preds.label_ids[0])
print(preds.metrics)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
{'test_loss': 0.045674361288547516, 'test_runtime': 17.7648, 'test_samples_per_second': 112.413, 'test_steps_per_second': 14.073}


In [26]:
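from sklearn.metrics import r2_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
# MultiLabelBinarizer encodes each sentence as the set of label ids that occur in it,
# so the F1 scores below compare label sets per sentence rather than individual tokens
m = MultiLabelBinarizer().fit(preds.label_ids)

# pick the most likely class for every position
predictions = preds.predictions.argmax(-1)
# r2_score treats the integer label ids as numeric values; padding positions (label 4) are included
print("R2-Score: ", r2_score(preds.label_ids, predictions))
print("F1-Score (micro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='micro'))
print("F1-Score (macro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='macro'))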

R2-Score:  0.5907441352014564
F1-Score (micro):  0.9430238726790451
F1-Score (macro):  0.7552959097049979
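
The F1 scores above are computed on per-sentence label sets and include the padding class. As a complementary view, the following sketch computes micro and macro F1 per token and skips positions whose true label is the padding id. It is a plain-Python illustration, not the evaluation used in this notebook; the function `token_f1` and the toy arrays are hypothetical, and the padding id 4 is taken from the preprocessing cell.

```python
def token_f1(y_true, y_pred, pad_label=4):
    """Micro and macro F1 over tokens, skipping positions whose true label is padding."""
    pairs = [
        (t, p)
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
        if t != pad_label
    ]
    labels = sorted({t for t, _ in pairs} | {p for _, p in pairs})
    if not labels:
        return 0.0, 0.0

    # per-class true positives, false positives and false negatives
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in pairs:
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with two short "sentences" padded with label 4
toy_true = [[2, 3, 2, 4], [2, 2, 4, 4]]
toy_pred = [[2, 2, 2, 4], [2, 3, 4, 4]]
micro, macro = token_f1(toy_true, toy_pred)
print(micro, macro)
```

With the real predictions, the call would take `preds.label_ids` and `preds.predictions.argmax(-1)` as its two arguments.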


### Model 2: Fine-tuned with 3000 sentences

In [27]:
training_args = TrainingArguments(
    num_train_epochs=5,
    weight_decay=0.01,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train2,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [28]:
trainer.train()

***** Running training *****
  Num examples = 2996
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1875
  Number of trainable parameters = 108550661


Step,Training Loss
500,0.0381
1000,0.0236
1500,0.014


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1000
Configuration saved in results/checkpoint-1000/config.json
Model weights saved in results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1500
Configuration saved in results/checkpoint-1500/config.json
Model weights saved in results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in results/checkpoint-1500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1875, training_loss=0.02200989990234375, metrics={'train_runtime': 436.0421, 'train_samples_per_second': 34.354, 'train_steps_per_second': 4.3, 'total_flos': 978582895180800.0, 'train_loss': 0.02200989990234375, 'epoch': 5.0})

In [29]:
preds = trainer.predict(test)

***** Running Prediction *****
  Num examples = 1997
  Batch size = 8


In [30]:
m = MultiLabelBinarizer().fit(preds.label_ids)

predictions = preds.predictions.argmax(-1)
print("R2-Score: ", r2_score(preds.label_ids, predictions))
print("F1-Score (micro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='micro'))
print("F1-Score (macro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='macro'))

R2-Score:  0.6619770933745397
F1-Score (micro):  0.9537190082644628
F1-Score (macro):  0.8283957450258548


### Model 3: Fine-tuned with 3000 sentences and frozen embeddings

In [33]:
# make every parameter trainable again before freezing selectively
for name, param in model.named_parameters():
    param.requires_grad = True

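Freeze all encoder layers of BERT. Note that this leaves the embedding layer (`bert.embeddings`) and the classification head trainable, so the embeddings themselves are not frozen here.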

In [34]:
for name, param in model.named_parameters():
    if name.startswith("bert.encoder"):  # all encoder layers; choose another prefix to freeze something else
        param.requires_grad = False

training_args = TrainingArguments(
    num_train_epochs=5,
    weight_decay=0.01,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train2,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [35]:
trainer.train()

***** Running training *****
  Num examples = 2996
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1875
  Number of trainable parameters = 23496197


Step,Training Loss
500,0.0067
1000,0.0055
1500,0.0042


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1000
Configuration saved in results/checkpoint-1000/config.json
Model weights saved in results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1500
Configuration saved in results/checkpoint-1500/config.json
Model weights saved in results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in results/checkpoint-1500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=1875, training_loss=0.005303391901652018, metrics={'train_runtime': 339.9833, 'train_samples_per_second': 44.061, 'train_steps_per_second': 5.515, 'total_flos': 978582895180800.0, 'train_loss': 0.005303391901652018, 'epoch': 5.0})
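
The log for this run reports 23496197 trainable parameters, compared with 108550661 in the two earlier runs. The difference can be reproduced with a short calculation, sketched here under the assumption that the model follows the standard BERT-base layout (hidden size 768, feed-forward size 3072, 12 encoder layers). These three values are assumptions and are not read from the model config in this notebook.

```python
HIDDEN = 768          # assumed hidden size of a BERT-base model
INTERMEDIATE = 3072   # assumed feed-forward size
NUM_LAYERS = 12       # assumed number of encoder layers


def linear(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value and attention output projections
    + layer_norm                      # LayerNorm after attention
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward projection back
    + layer_norm                      # LayerNorm after the feed-forward block
)
encoder_params = NUM_LAYERS * per_layer

total_reported = 108550661      # trainable parameters in the first two runs
trainable_reported = 23496197   # trainable parameters in this run

print(encoder_params)
print(total_reported - trainable_reported)
```

Under these assumptions both printed numbers agree, which is consistent with exactly the encoder layers being frozen while the embeddings and the classifier stay trainable.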

In [36]:
preds = trainer.predict(test)

***** Running Prediction *****
  Num examples = 1997
  Batch size = 8


In [37]:
m = MultiLabelBinarizer().fit(preds.label_ids)

predictions = preds.predictions.argmax(-1)
print("R2-Score: ", r2_score(preds.label_ids, predictions))
print("F1-Score (micro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='micro'))
print("F1-Score (macro): ", f1_score( m.transform(preds.label_ids), m.transform(predictions), average='macro'))

R2-Score:  0.6684060069641861
F1-Score (micro):  0.9537351880473982
F1-Score (macro):  0.8295866427882548
