## Loading Data, Importing Packages

Let's first load the packages we need and the dataset.

In [1]:
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel
from transformers import BertForTokenClassification, pipeline
import torch
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, BertTokenizer, AutoModelForTokenClassification

We will import the full dataset, preprocess it, and then split it into the three datasets mentioned in the task description.

In [2]:
full_data = load_dataset('polyglot_ner', name='nl', split="train[:6000]")

Found cached dataset polyglot_ner (C:/Users/Damja/.cache/huggingface/datasets/polyglot_ner/nl/1.0.0/bb2e45c90cd345c87dfd757c8e2b808b78b0094543b511ac49bc0129699609c1)


In [3]:
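# Flatten the per-sentence tag lists and collect the distinct tags.
ner_tags = list({tag for sentence in full_data['ner'] for tag in sentence})
# Map each tag to an integer id. The ids follow set iteration order,
# which is not guaranteed to be the same in every Python session.
ner_to_ix = {tag: ix for ix, tag in enumerate(ner_tags)}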

KeyError: 'PAD'
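
The mapping above depends on set iteration order, and the `KeyError: 'PAD'` output suggests that a `'PAD'` tag was looked up without being in the mapping. Below is a minimal sketch of one way to address both. The helper `build_tag_index` and the padding tag name `'PAD'` are assumptions for illustration, not part of the notebook's code, and the example tag lists are hand-written.

```python
def build_tag_index(tag_sequences, pad_tag="PAD"):
    """Map each tag to an integer id, with the padding tag placed last.

    Hypothetical helper: the name and the pad_tag default are assumptions.
    """
    # sorted() makes the mapping reproducible; iterating a set of strings
    # can give a different order from one Python session to the next.
    tags = sorted({tag for seq in tag_sequences for tag in seq} - {pad_tag})
    tag_to_ix = {tag: ix for ix, tag in enumerate(tags)}
    # Reserve the last id for padding so a lookup of pad_tag cannot fail.
    tag_to_ix[pad_tag] = len(tag_to_ix)
    return tag_to_ix


# Hand-written example sequences, not taken from the dataset.
example = [["O", "O", "LOC"], ["PER", "O"], ["ORG", "O", "O"]]
tag_to_ix = build_tag_index(example)
print(tag_to_ix)
# {'LOC': 0, 'O': 1, 'ORG': 2, 'PER': 3, 'PAD': 4}
```

With four real tags this gives five ids in total, which would match the `num_labels=5` used when the model is created.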

In [4]:
tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForTokenClassification.from_pretrained("GroNLP/bert-base-dutch-cased", num_labels=5)


Some weights of the model checkpoint at GroNLP/bert-base-dutch-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GroNLP/bert-base-dutch-cased

Let's look at the structure of the data we have. 'id' is just an identifier for every sentence in the dataset. 'lang' is the language, which is the same for all sentences here. 'words' holds the words of the sentence, which are already tokenized. Note that the original capitalization is preserved. 'ner' holds the named entity recognition tag for each word. This should be what we want to predict.

In [5]:
print(full_data)
print(full_data[0])

Dataset({
    features: ['id', 'lang', 'words', 'ner'],
    num_rows: 6000
})
{'id': '0', 'lang': 'nl', 'words': ['De', 'rustige', 'omgeving', 'en', 'centrale', 'ligging', 'hebben', 'de', 'plaats', 'populair', 'gemaakt', 'bij', 'de', 'wat', 'rijkere', 'bevolking', ';', 'Swieqi', 'staat', 'dan', 'ook', 'bekend', 'om', 'de', 'moeizaamheid', 'van', 'het', 'er', 'vinden', 'van', 'een', 'beschikbare', 'woning', '.'], 'ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}


### Preprocessing the datasets
After preprocessing, the dataset should have "input_ids", "token_type_ids" and "attention_mask" as additional columns. The AutoTokenizer should do this job for us. The data is already split into words, so we can join the words back into a whole sentence and then apply our tokenizer. Do we even need a tokenizer? We do, because the model expects subword ids from its own vocabulary rather than raw words.

In [6]:
encoded_dataset = [tokenizer(" ".join(item['words']), return_tensors="pt", padding='max_length', truncation=True, max_length=128) for item in full_data]     

In [7]:
PAD_LABEL = 4     # label id for padding positions (the model is created with num_labels=5)
MAX_LENGTH = 128  # must match the max_length passed to the tokenizer above

for enc_item, item in zip(encoded_dataset, full_data):
    # Convert the word-level tags to ids and pad them to MAX_LENGTH.
    labels = [ner_to_ix[tag] for tag in item['ner']]
    labels += [PAD_LABEL] * (MAX_LENGTH - len(labels))
    # Sentences with more than MAX_LENGTH tags stay too long and get no labels;
    # they are removed in the next step.
    if len(labels) == MAX_LENGTH:
        enc_item['labels'] = torch.tensor(labels)
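
The cell above assigns labels by word position, while the tokenizer output also contains special tokens and may split one word into several subword pieces, so label positions and token positions can drift apart. A common remedy is to expand the word-level labels to token level. The sketch below assumes a list of word indices per token, as returned by `BatchEncoding.word_ids()` of a fast tokenizer called with `is_split_into_words=True`. The helper name `align_labels` is hypothetical, and the example input is hand-written rather than produced by the tokenizer. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions would not contribute to the loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to one label per token.

    word_ids: for each token, the index of the word it came from, or None
              for special tokens such as [CLS], [SEP] and padding.
    word_labels: one integer label per word.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # Special and padding tokens get a label the loss can skip.
            aligned.append(ignore_index)
        else:
            # Every subword piece inherits the label of its word.
            aligned.append(word_labels[word_id])
    return aligned


# Hand-written example: three words, the second split into two pieces,
# wrapped in [CLS] ... [SEP] and followed by two padding tokens.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [2, 1, 2]
aligned = align_labels(word_ids, word_labels)
print(aligned)
# [-100, 2, 1, 1, 2, -100, -100, -100]
```

This variant would not need a separate padding class, since padded positions are ignored instead of being predicted.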

Next, we remove the entries that did not receive labels because their tag sequence was longer than 128.

In [9]:
# An entry that received labels has 4 keys (input_ids, token_type_ids,
# attention_mask, labels); an entry without labels has only 3.
to_delete = []
for idx, item in enumerate(encoded_dataset):
    if len(item) < 4:
        print("HERE: ", idx)
        to_delete.append(idx)

# Delete from the back so the remaining indices stay valid.
for index in sorted(to_delete, reverse=True):
    del encoded_dataset[index]

HERE:  85
HERE:  144
HERE:  1425
HERE:  2691
HERE:  3353
HERE:  3762
HERE:  4260


In [11]:
# Before splitting, 'unpack' every entry: the tokenizer returned tensors of
# shape (1, 128), so drop the leading batch dimension to get 1-D tensors.
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])

We will randomly choose the entries. In the end, we should have 3 datasets.

In [55]:
import random
import numpy as np

# Split sizes in the ratio 1000 : 3000 : 2000 of the original 6000 sentences,
# scaled to the number of entries that remain after filtering.
n = len(encoded_dataset)
i1, i2, i3 = int(1000/6000 * n), int(3000/6000 * n), int(2000/6000 * n)

# Random permutation of the indices, cut into the three index sets.
# (np.random is used explicitly; `import random` is the standard-library module.)
arr1 = np.random.permutation(n)
ids1, ids2, ids3 = arr1[0:i1], arr1[i1:i1+i2], arr1[i1+i2:i1+i2+i3]

In [73]:
random.shuffle(encoded_dataset)
train1, train2, test = encoded_dataset[0:i1], encoded_dataset[i1:i1+i2], encoded_dataset[i1+i2:i1+i2+i3]
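
Because `random.shuffle` is unseeded here, every run produces a different split. The following is a minimal sketch of a reproducible three-way split. The helper name `three_way_split`, the `seed` parameter, and the integer ratio `(1, 3, 2)` are assumptions for illustration. Unlike the cell above, it gives all leftover entries to the last split instead of dropping them through rounding.

```python
import random


def three_way_split(items, parts=(1, 3, 2), seed=0):
    """Shuffle a copy of items and cut it into three lists in the given ratio."""
    rng = random.Random(seed)   # private generator, so the split is repeatable
    shuffled = list(items)      # copy, so the caller's data stays untouched
    rng.shuffle(shuffled)

    total = sum(parts)
    n = len(shuffled)
    n1 = n * parts[0] // total  # integer arithmetic avoids float rounding
    n2 = n * parts[1] // total
    # The last split takes everything that is left.
    return shuffled[:n1], shuffled[n1:n1 + n2], shuffled[n1 + n2:]


# 5993 = 6000 loaded sentences minus the 7 entries removed earlier.
train1, train2, test = three_way_split(range(5993))
print(len(train1), len(train2), len(test))
# 998 2996 1999
```

Calling the function again with the same `seed` returns the same three lists, which makes later comparisons between models easier to repeat.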

Now we can get to the actual training!

### Model 1: Fine-tuned with 1000 sentences

In [102]:
# Small subsets, just for a quick test run
train_set = encoded_dataset[0:100]
test_set = encoded_dataset[100:200]

In [112]:
train_set

[{'input_ids': tensor([    1,  7020,   426, 10548,   392,  8952,  2058,  7559,   117, 20255,
         10537, 12211,  5169, 24919, 10669, 17370,    16, 13930,   393, 13644,
         11130, 12214,   117, 13213, 13903, 18523, 25439, 20255, 10537, 12211,
          3409,   131,  4147,    13,     2,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,     3,     3,     3,     3,     3,     3,
             3,     3,     3,     3,  

In [110]:
training_args = TrainingArguments(
    num_train_epochs=1,
    weight_decay=0.01,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_set,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [111]:
trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 13
  Number of trainable parameters = 108550661


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=13, training_loss=0.09525174361008865, metrics={'train_runtime': 90.1319, 'train_samples_per_second': 1.109, 'train_steps_per_second': 0.144, 'total_flos': 6532596096000.0, 'train_loss': 0.09525174361008865, 'epoch': 1.0})

In [105]:
preds = trainer.predict(test_set)

***** Running Prediction *****
  Num examples = 100
  Batch size = 8


In [106]:
#print(preds.predictions[:2])
#print(preds.predictions[:2].argmax(-1))
print(preds.label_ids[0])
print(preds.metrics)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
{'test_loss': 0.13727156817913055, 'test_runtime': 27.134, 'test_samples_per_second': 3.685, 'test_steps_per_second': 0.479}


In [108]:
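from sklearn.metrics import r2_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# MultiLabelBinarizer treats each row as the set of label ids that occur in it,
# so the F1 scores below compare which labels appear per sentence; they are
# not per-token scores.
m = MultiLabelBinarizer().fit(preds.label_ids)

predictions = preds.predictions.argmax(-1)
# r2_score is a regression metric and treats the label ids as numbers.
print(r2_score(preds.label_ids, predictions))
print(f1_score(m.transform(preds.label_ids), m.transform(predictions), average='micro'))
print(f1_score(m.transform(preds.label_ids), m.transform(predictions), average='macro'))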

0.1707459915837319
0.9153318077803204
0.4
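
The F1 scores above are computed after `MultiLabelBinarizer` has reduced each sentence to the set of labels it contains, and the padding label is counted like any other class. A per-token score that skips padding may be more informative. The sketch below is written in plain Python under stated assumptions: the helper name `token_f1` is hypothetical, `pad_label=4` is taken from the label-padding cell, and the example rows are hand-written rather than model output.

```python
def token_f1(true_rows, pred_rows, pad_label=4):
    """Return (micro F1, macro F1) over all tokens whose true label is not padding."""
    # Flatten to (true, predicted) pairs and drop padded positions.
    pairs = [
        (t, p)
        for true_row, pred_row in zip(true_rows, pred_rows)
        for t, p in zip(true_row, pred_row)
        if t != pad_label
    ]
    labels = sorted({t for t, _ in pairs} | {p for _, p in pairs})

    per_class = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn

    micro_denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / micro_denom if micro_denom else 0.0
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return micro, macro


# Hand-written example: two sentences of length 5, label 4 marks padding.
true_rows = [[2, 2, 1, 4, 4], [2, 0, 2, 4, 4]]
pred_rows = [[2, 2, 2, 4, 4], [2, 0, 1, 2, 2]]
micro, macro = token_f1(true_rows, pred_rows)
print(round(micro, 4), round(macro, 4))
# 0.6667 0.5833
```

Applied to the notebook's results, the call would be `token_f1(preds.label_ids, preds.predictions.argmax(-1))`.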
