# 1. Information about the submission

## 1.1 Name and number of the assignment

1. Semantic role labelling

## 1.2 Student name

Denis Isaev

## 1.3 Codalab user ID / nickname / username

denzelito

## 1.4 Additional comments

# 2. Technical Report

https://github.com/s-nlp/semantic-role-labelling

https://codalab.lisn.upsaclay.fr/competitions/531


## 2.1 Methodology



*   A distilled version of the multilingual bert-base model for Russian and English, **rubert-tiny2**, was used.
*   During training, layers were frozen and unfrozen, and the number of training epochs was varied.
*   Since the model's quality kept improving with every epoch, I increased the number of epochs while unfreezing layers along the way.
*   For inference, a pipeline was used that selects the NER tag with the maximum model score.
*   The key metric for evaluating the model in Codalab is the F1 score.
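
To make the F1 metric concrete, here is a minimal sketch of an entity-level F1 over BIO tag sequences, in the spirit of what `seqeval` reports. The helper names `extract_spans` and `span_f1` are hypothetical, and the exact scorer used by the competition may differ in details.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if not continues:
            if kind is not None:
                spans.add((kind, start, i))
                start, kind = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, kind = i, tag[2:]
    return spans


def span_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the prediction misses the predicate but finds both objects.
gold = [["B-Object", "O", "B-Predicate", "O", "B-Object"]]
pred = [["B-Object", "O", "O", "O", "B-Object"]]
precision, recall, f1 = span_f1(gold, pred)
print(precision, recall, f1)
```

A span counts as correct only when its type and both boundaries match, so a multi-word entity whose continuation is missed counts as an error.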


## 2.2 Discussion of results

Confusion matrix built from the model's predictions on a held-out 10% of the training set.

True \ Predicted | O | B-Aspect | B-Object | B-Predicate | I-Aspect | I-Object | I-Predicate
--- |--- |--- |--- |--- |--- |--- |---
O | 4799 | 53 | 80 | 29 | 0 | 0 | 1
B-Aspect | 102 | 78 | 9 | 2 | 0 | 0 | 2
B-Object | 32 | 0 | 592 | 1 | 0 | 0 | 0
B-Predicate | 18 | 2 | 1 | 285 | 0 | 0 | 1
I-Aspect | 43 | 12 | 1 | 0 | 0 | 0 | 1
I-Object | 12 | 0 | 9 | 0 | 0 | 0 | 0
I-Predicate | 21 | 4 | 1 | 5 | 0 | 0 | 8

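As the table above shows, the model detects the beginning of named entities well (B- tags) but cannot identify their continuation at all: the I-Aspect and I-Object columns contain only zeros, and I-Predicate is predicted just 13 times.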
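
The observation can be checked numerically. The sketch below recomputes per-class recall and the number of times each tag was predicted from the matrix in this section. It assumes that rows are true labels and columns are predictions, which is the convention of `sklearn.metrics.confusion_matrix` used in the code part of this report.

```python
labels = ["O", "B-Aspect", "B-Object", "B-Predicate", "I-Aspect", "I-Object", "I-Predicate"]

# Values copied from the confusion matrix above (assumed: rows = true, columns = predicted).
matrix = [
    [4799, 53, 80, 29, 0, 0, 1],
    [102, 78, 9, 2, 0, 0, 2],
    [32, 0, 592, 1, 0, 0, 0],
    [18, 2, 1, 285, 0, 0, 1],
    [43, 12, 1, 0, 0, 0, 1],
    [12, 0, 9, 0, 0, 0, 0],
    [21, 4, 1, 5, 0, 0, 8],
]

# Recall of a class = diagonal cell divided by the row total.
recall = {
    label: (row[i] / sum(row) if sum(row) else 0.0)
    for i, (label, row) in enumerate(zip(labels, matrix))
}

# How often each tag was predicted = column total.
predicted_counts = {label: sum(row[j] for row in matrix) for j, label in enumerate(labels)}

for label in labels:
    print(f"{label:12s} recall={recall[label]:.3f} predicted={predicted_counts[label]}")
```

In this matrix the B-Object and B-Predicate recalls are above 0.9, while I-Aspect and I-Object have zero recall because those tags are never predicted.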

# 3. Code

## 3.1 Requirements

In [None]:
pip install transformers sentencepiece datasets transformers[torch] seqeval

In [2]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from IPython.display import clear_output

import torch
from datasets import Dataset, DatasetDict, load_metric
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import DataCollatorForTokenClassification
from transformers import pipeline

torch.manual_seed(73)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric = load_metric("seqeval")

model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"


## 3.2 Download the data

In [3]:
!git clone https://github.com/s-nlp/semantic-role-labelling
!ls semantic-role-labelling

Cloning into 'semantic-role-labelling'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 20 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (20/20), 174.73 KiB | 2.69 MiB/s, done.
Resolving deltas: 100% (3/3), done.
baseline.ipynb	    evaluation	test_no_answers.tsv
dev_no_answers.tsv  README.md	train.tsv


In [4]:
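def read_dataset(filename, splitter="\t"):
    """Read a CoNLL-style file into a list of (words, tags) tuples.

    Each non-blank line holds one word and its tag separated by `splitter`;
    a blank line ends a sentence. A sentence is stored only when a blank line
    follows it, so the file is assumed to end with a blank line.
    For unlabeled files, pass the newline character as `splitter`:
    every tag is then an empty string.
    """
    data = []
    sentence = []
    tags = []
    with open(filename) as f:
        for line in f:
            if not line.isspace():
                word, tag = line.split(splitter)
                sentence.append(word)
                tags.append(tag.strip())
            else:
                # Blank line: the current sentence is complete.
                data.append((sentence, tags))
                sentence = []
                tags = []
    return data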

In [5]:
train_data = read_dataset('semantic-role-labelling/train.tsv')
dev_data = read_dataset('semantic-role-labelling/dev_no_answers.tsv', splitter="\n")
test_data = read_dataset('semantic-role-labelling/test_no_answers.tsv', splitter="\n")

print('train len:', len(train_data))
print('dev len:', len(dev_data))
print('test  len:', len(test_data), '\n')

train len: 2334
dev len: 283
test  len: 360 



In [6]:
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=1)

train_df = pd.DataFrame(train_data, columns=['tokens', 'tags'])
val_df = pd.DataFrame(val_data, columns=['tokens', 'tags'])
dev_df = pd.DataFrame(dev_data, columns=['tokens', 'tags'])
test_df = pd.DataFrame(test_data, columns=['tokens', 'tags'])
display(train_df.head(2))

_ | tokens | tags
--- |--- |---
0 | [acrylic, or, some, form, of, plastic, will, w... | [B-Object, O, O, O, O, B-Object, O, O, B-Predi...
1 | [ford, is, faster, than, bmw, but, in, handlin... | [B-Object, O, B-Predicate, O, B-Object, O, O, ...


## 3.3 Preprocessing

In [7]:
label_list = sorted({label for item in train_df['tags'] for label in item })
if 'O' in label_list:
    label_list.remove('O')
    label_list = ['O'] + label_list
label_list

['O',
 'B-Aspect',
 'B-Object',
 'B-Predicate',
 'I-Aspect',
 'I-Object',
 'I-Predicate']

In [8]:
ner_data = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(val_df)
})
ner_data

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 2100
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 234
    })
})

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [10]:
def tokenize_and_align_labels(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
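
The alignment rule in `tokenize_and_align_labels` can be illustrated without a tokenizer. The sketch below is a simplified stand-in that applies the same three cases to a hand-written `word_ids` list; the helper name `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False, ignore_index=-100):
    """Spread word-level label ids over sub-tokens, following the rule used above."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. <s>, </s>) are ignored by the loss.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens are either labeled too or ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous = word_id
    return aligned


# Hypothetical sentence of three words; the second word is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [2, 0, 3]  # e.g. B-Object, O, B-Predicate as indices into label_list

first_only = align_labels(word_ids, word_labels)
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
print(first_only)
print(all_tokens)
```

With the default `label_all_tokens=False`, only one position per word contributes to the loss and to the metrics.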

In [11]:
tokenized_datasets = ner_data.map(tokenize_and_align_labels, batched=True)


## 3.4 My method of text processing

In [12]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,
                                                        num_labels=len(label_list),
                                                        ignore_mismatched_sizes=True)
model.config.id2label = dict(enumerate(label_list))
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

Some weights of the model checkpoint at xlm-roberta-large-finetuned-conll03-english were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-large-finetuned-conll03-english and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([8, 1024]) in the checkpoint and torch.Size

In [13]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [14]:
def compute_metrics(p):
    label_ids, input_ids = p.label_ids, p.inputs
    predictions = np.argmax(p.predictions, axis=2)

    # Send only the first token of each word to the evaluation: positions labeled -100
    # (special tokens, padding, non-first sub-tokens) are skipped.
    # The '##' check targets WordPiece continuation pieces; a SentencePiece tokenizer
    # such as XLM-R's does not use that prefix, so there the -100 check does the filtering.
    true_predictions = []
    true_labels = []
    for prediction, label, tokens in zip(predictions, label_ids, input_ids):
        true_predictions.append([])
        true_labels.append([])
        for pred_id, label_id, token_id in zip(prediction, label, tokens):
            if label_id != -100 and not tokenizer.convert_ids_to_tokens(int(token_id)).startswith('##'):
                true_predictions[-1].append(label_list[pred_id])
                true_labels[-1].append(label_list[label_id])

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [15]:
batch_size = 16
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [16]:
trainer.evaluate()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 1.303774118423462,
 'eval_precision': 0.1955128205128205,
 'eval_recall': 0.05422222222222222,
 'eval_f1': 0.0848990953375087,
 'eval_accuracy': 0.7770793036750484,
 'eval_runtime': 6.418,
 'eval_samples_per_second': 36.46,
 'eval_steps_per_second': 2.337}

In [17]:
for param in model.roberta.parameters():
    param.requires_grad = False

In [18]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
        print(param)

classifier.weight
Parameter containing:
tensor([[ 0.0037,  0.0112,  0.0235,  ...,  0.0281,  0.0107, -0.0087],
        [ 0.0056, -0.0295, -0.0287,  ...,  0.0132, -0.0269, -0.0025],
        [ 0.0003,  0.0275, -0.0017,  ..., -0.0138,  0.0008,  0.0119],
        ...,
        [-0.0169,  0.0095, -0.0295,  ...,  0.0230, -0.0146, -0.0261],
        [-0.0006,  0.0002, -0.0123,  ...,  0.0063, -0.0187,  0.0265],
        [-0.0044, -0.0093, -0.0266,  ..., -0.0117,  0.0206, -0.0189]],
       device='cuda:0', requires_grad=True)
classifier.bias
Parameter containing:
tensor([-0.0261, -0.0173, -0.0266, -0.0106,  0.0177, -0.0186,  0.0073],
       device='cuda:0', requires_grad=True)


In [19]:
trainer.train()



Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy
--- |--- |--- |--- |--- |--- |---
1 | No log | 0.706961 | 0.653722 | 0.179556 | 0.281729 | 0.817054
2 | No log | 0.655469 | 0.693182 | 0.271111 | 0.389776 | 0.830432
3 | No log | 0.635002 | 0.686404 | 0.278222 | 0.395952 | 0.830754
4 | 0.744400 | 0.622115 | 0.689655 | 0.284444 | 0.402769 | 0.832044
5 | 0.744400 | 0.612931 | 0.690171 | 0.287111 | 0.405524 | 0.832527
6 | 0.744400 | 0.605483 | 0.690832 | 0.288 | 0.406524 | 0.832689
7 | 0.744400 | 0.599919 | 0.687898 | 0.288 | 0.406015 | 0.832527
8 | 0.641600 | 0.596178 | 0.689873 | 0.290667 | 0.409006 | 0.833172
9 | 0.641600 | 0.594103 | 0.690678 | 0.289778 | 0.408265 | 0.833333
10 | 0.641600 | 0.593247 | 0.690678 | 0.289778 | 0.408265 | 0.833333


TrainOutput(global_step=1320, training_loss=0.676998115308357, metrics={'train_runtime': 305.2456, 'train_samples_per_second': 68.797, 'train_steps_per_second': 4.324, 'total_flos': 2509982029081560.0, 'train_loss': 0.676998115308357, 'epoch': 10.0})
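
The step count and the "No log" entries in the table above can be cross-checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps; none of these is set explicitly in the cells above.

```python
import math

num_train_examples = 2100  # size of the train split shown earlier
batch_size = 16            # per_device_train_batch_size
num_epochs = 10            # num_train_epochs
logging_steps = 500        # assumption: Trainer default, not set in this notebook

# The last, smaller batch of an epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# The training loss column is filled at the first epoch end that falls after a logging step.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
second_logged_epoch = math.ceil(2 * logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch, second_logged_epoch)
```

Under these assumptions the total matches `global_step=1320`, and the training loss first appears at epoch 4 and changes at epoch 8, as in the table.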

In [20]:
trainer.evaluate()

{'eval_loss': 0.5932469964027405,
 'eval_precision': 0.690677966101695,
 'eval_recall': 0.2897777777777778,
 'eval_f1': 0.4082654978083908,
 'eval_accuracy': 0.8333333333333334,
 'eval_runtime': 3.1228,
 'eval_samples_per_second': 74.934,
 'eval_steps_per_second': 4.803,
 'epoch': 10.0}

In [21]:
# Unfreeze all layers
for param in model.parameters():
    param.requires_grad = True

In [22]:
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

In [23]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [24]:
trainer.train()

Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy
--- |--- |--- |--- |--- |--- |---
1 | No log | 0.198305 | 0.740075 | 0.878222 | 0.803252 | 0.932946
2 | No log | 0.175889 | 0.801925 | 0.888889 | 0.84317 | 0.941006
3 | No log | 0.17007 | 0.836134 | 0.884444 | 0.859611 | 0.94568
4 | 0.158600 | 0.176699 | 0.839496 | 0.888 | 0.863067 | 0.94697
5 | 0.158600 | 0.177303 | 0.832632 | 0.88 | 0.855661 | 0.945358
6 | 0.158600 | 0.205115 | 0.823673 | 0.896889 | 0.858723 | 0.944552
7 | 0.158600 | 0.2132 | 0.834586 | 0.888 | 0.860465 | 0.946647
8 | 0.053600 | 0.238552 | 0.854671 | 0.878222 | 0.866287 | 0.949549
9 | 0.053600 | 0.249938 | 0.841026 | 0.874667 | 0.857516 | 0.945841
10 | 0.053600 | 0.275932 | 0.841484 | 0.887111 | 0.863695 | 0.947614


KeyboardInterrupt: ignored

In [25]:
trainer.evaluate()


{'eval_loss': 0.3547682464122772,
 'eval_precision': 0.8327814569536424,
 'eval_recall': 0.8942222222222223,
 'eval_f1': 0.8624089155593656,
 'eval_accuracy': 0.9471308833010961}

In [26]:
output = trainer.predict(tokenized_datasets["test"])

labels, inputs = output.label_ids, tokenized_datasets["test"]['input_ids']
predictions = np.argmax(output.predictions, axis=2)

# Send only the first token of each word to the evaluation:
# positions labeled -100 (special tokens, padding, non-first sub-tokens) are skipped.
true_predictions = []
true_labels = []
for prediction, label, tokens in zip(predictions, labels, inputs):
    true_predictions.append([])
    true_labels.append([])
    for pred_id, label_id, token_id in zip(prediction, label, tokens):
        if label_id != -100 and not tokenizer.convert_ids_to_tokens(int(token_id)).startswith('##'):
            true_predictions[-1].append(label_list[pred_id])
            true_labels[-1].append(label_list[label_id])

# Per-entity-type and overall scores from seqeval.
results = metric.compute(predictions=true_predictions, references=true_labels)
results


{'Aspect': {'precision': 0.524390243902439,
  'recall': 0.6683937823834197,
  'f1': 0.5876993166287016,
  'number': 193},
 'Object': {'precision': 0.9276729559748428,
  'recall': 0.944,
  'f1': 0.9357652656621729,
  'number': 625},
 'Predicate': {'precision': 0.8803680981595092,
  'recall': 0.9348534201954397,
  'f1': 0.9067930489731437,
  'number': 307},
 'overall_precision': 0.8327814569536424,
 'overall_recall': 0.8942222222222223,
 'overall_f1': 0.8624089155593656,
 'overall_accuracy': 0.9471308833010961}

In [27]:
cm = pd.DataFrame(
    confusion_matrix(sum(true_labels, []), sum(true_predictions, []), labels=label_list),
    index=label_list,
    columns=label_list
)
cm

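True \ Predicted | O | B-Aspect | B-Object | B-Predicate | I-Aspect | I-Object | I-Predicate
--- |--- |--- |--- |--- |--- |--- |---
O | 4788 | 84 | 39 | 26 | 22 | 1 | 2
B-Aspect | 43 | 139 | 0 | 1 | 7 | 0 | 3
B-Object | 29 | 3 | 592 | 0 | 0 | 1 | 0
B-Predicate | 7 | 3 | 1 | 294 | 0 | 0 | 2
I-Aspect | 18 | 8 | 0 | 0 | 27 | 0 | 4
I-Object | 3 | 0 | 4 | 0 | 0 | 14 | 0
I-Predicate | 10 | 3 | 0 | 2 | 2 | 0 | 22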


In [28]:
# for i in cm.index:
#   # ' | '.join(cm.columns.tolist())
#   print(i,'|',' | '.join([str(j) for j in cm.loc[i].tolist()]))

In [29]:
model.save_pretrained('ner_bert.bin')
tokenizer.save_pretrained('ner_bert.bin')

('ner_bert.bin/tokenizer_config.json',
 'ner_bert.bin/special_tokens_map.json',
 'ner_bert.bin/sentencepiece.bpe.model',
 'ner_bert.bin/added_tokens.json',
 'ner_bert.bin/tokenizer.json')

## 3.5 Inference

In [30]:
# Two views of the same model: per-token tags (with B-/I- prefixes) and
# word-level groups aggregated by the maximum token score.
# device=0 assumes a CUDA GPU is available.
pipe_none = pipeline(model=model, tokenizer=tokenizer, task='ner', aggregation_strategy='none', device=0)
pipe_max = pipeline(model=model, tokenizer=tokenizer, task='ner', aggregation_strategy='max', device=0)

def xsr_prediction(current_token_list: list) -> list:
    """Tag every tokenized sentence; returns one list of 'word<TAB>tag' strings per sentence."""
    answer_list = []
    for words in tqdm(current_token_list):
        text = ' '.join(words)

        groups = pd.DataFrame(pipe_max(text))
        if len(groups) > 0:
            per_token = pd.DataFrame(pipe_none(text))
            # Attach the B-/I- tag of the token that starts each group;
            # groups without such a token fall back to 'O'.
            groups = groups.merge(per_token[['entity', 'start']], on='start', how='left')
            groups['entity'] = groups['entity'].fillna('O')

        answer_list_i = []
        start = 0
        for word in words:
            # Character span of this word inside the space-joined text;
            # len(word) gives the correct end for every word, including the last one.
            end = start + len(word)
            tag = 'O'
            if len(groups) > 0:
                # Only a group whose span is exactly this word is used.
                match = groups[(groups.start == start) & (groups.end == end)]
                if len(match) > 0:
                    tag = match.entity.iloc[0]
            start = end + 1  # skip the separating space
            answer_list_i.append(word + '\t' + tag)
        assert len(words) == len(answer_list_i)
        answer_list.append(answer_list_i)
    return answer_list
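
The inference function relies on matching character offsets of words in the space-joined sentence against the `start`/`end` offsets returned by the pipeline. The sketch below isolates that matching step without a model. The helper names `word_offsets` and `tag_words` and the entity dictionaries are hypothetical stand-ins for pipeline output.

```python
def word_offsets(words):
    """Character (start, end) span of each word in ' '.join(words)."""
    offsets = []
    start = 0
    for word in words:
        end = start + len(word)
        offsets.append((start, end))
        start = end + 1  # skip the separating space
    return offsets


def tag_words(words, entities):
    """Give each word the tag of the entity with exactly the same span, else 'O'."""
    by_span = {(e["start"], e["end"]): e["entity"] for e in entities}
    return [by_span.get(span, "O") for span in word_offsets(words)]


words = ["ford", "is", "faster", "than", "bmw"]

# Hypothetical entities, shaped like the dictionaries a token-classification pipeline returns.
entities = [
    {"start": 0, "end": 4, "entity": "B-Object"},
    {"start": 8, "end": 14, "entity": "B-Predicate"},
    {"start": 20, "end": 23, "entity": "B-Object"},
]

offsets = word_offsets(words)
tags = tag_words(words, entities)
print(list(zip(words, tags)))
```

Because the match requires an exact span, an entity that covers several words is not assigned to any single word under this rule.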

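Validation check: gold tags for one held-out sentence, followed by the pipeline's predictions for the same sentence.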

In [31]:
for x,y in zip(val_df.loc[4,'tokens'], val_df.loc[4,'tags']):
  print(x.ljust(10), y)

but        O
,          O
i          O
suspect    O
you        O
might      O
have       O
fewer      B-Predicate
problems   B-Aspect
with       O
postgresql B-Object
over       O
mysql      B-Object
,          O
as         O
both       O
postgre    O
and        O
ms         O
sql        O
are        O
"          O
acid       O
"          O
compliant  O
.          O


In [32]:
val_answer_list = xsr_prediction(val_df['tokens'].tolist())
clear_output(wait=True)

for i in val_answer_list[4]:
  print(i.split('\t')[0].ljust(10), i.split('\t')[1])

but        O
,          O
i          O
suspect    O
you        O
might      O
have       O
fewer      B-Predicate
problems   B-Aspect
with       O
postgresql B-Object
over       O
mysql      B-Object
,          O
as         O
both       O
postgre    O
and        O
ms         O
sql        O
are        O
"          O
acid       O
"          O
compliant  O
.          O


In [33]:
dev_answer_list = xsr_prediction(dev_df['tokens'].tolist())
clear_output()

In [34]:
test_answer_list = xsr_prediction(test_df['tokens'].tolist())
clear_output()

## 3.6 Zip-file saving

In [35]:
with open("out_dev.tsv", "w") as w:
    with torch.no_grad():
        for sentence_i in tqdm(dev_answer_list):
            for i in sentence_i:
                w.write(f"{i}\n")
            w.write("\n")

In [36]:
!zip out_dev.zip out_dev.tsv

  adding: out_dev.tsv (deflated 74%)


In [37]:
with open("out_test.tsv", "w") as w:
    with torch.no_grad():
        for sentence_i in tqdm(test_answer_list):
            for i in sentence_i:
                w.write(f"{i}\n")
            w.write("\n")

In [38]:
!zip out_test.zip out_test.tsv

  adding: out_test.tsv (deflated 75%)
