# 1. Information about the submission

## 1.1 Name and number of the assignment

1. Semantic role labelling

## 1.2 Student name

Denis Isaev

## 1.3 Codalab user ID / nickname / username

denzelito

## 1.4 Additional comments

# 2. Technical Report

https://github.com/s-nlp/semantic-role-labelling

https://codalab.lisn.upsaclay.fr/competitions/531


## 2.1 Methodology

*   The data preparation pipeline is as follows: <br>
tsv file -> list -> pandas -> Dataset -> DatasetDict -> map function
*   The model used is **rubert-tiny2**, a distilled version of the multilingual bert-base model for Russian and English <br>
> bert-base models are the encoder part of the Transformer, pre-trained on masked token prediction and the NSP (next sentence prediction) task; <br>
> when the model was instantiated, the number of predicted classes was specified and the label dictionaries (`id2label`, `label2id`) were passed
*   10% of the train set was held out as a validation set
*   During training, layers were frozen and unfrozen and the number of epochs was varied: <br>
while the model kept improving with each epoch, I increased the number of epochs; once the metrics stopped improving, I unfroze the layers
*   The model was trained with the Hugging Face training pipeline, assembled as follows: <br>
DataCollatorForTokenClassification, metric, TrainingArguments -> Trainer
*   For inference, a NER pipeline that selects the tag with the maximum model score was used
*   The key metric for evaluating the model on Codalab is the F1 score
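
The F1 score computed by seqeval (the metric loaded in section 3.1) is calculated over whole entity spans rather than over individual tags. The sketch below is a simplified, dependency-free illustration of that idea, not the seqeval implementation: the helper names `extract_spans` and `span_f1` are hypothetical, and an `I-` tag without a matching preceding tag is simply assumed to open a new span.

```python
def extract_spans(tags):
    """Return a set of (label, start, end) spans from one BIO tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if prefix in ("B", "I"):
                start, label = i, name
            else:
                start, label = None, None
    return spans


def span_f1(true_seqs, pred_seqs):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# The prediction misses the continuation of the two-word Aspect.
true_seqs = [["B-Object", "O", "B-Predicate", "O", "B-Aspect", "I-Aspect"]]
pred_seqs = [["B-Object", "O", "B-Predicate", "O", "B-Aspect", "O"]]
precision, recall, f1 = span_f1(true_seqs, pred_seqs)
```

In this example the truncated Aspect counts both as a missed span and as a spurious one, so a model that loses `I-` tags is penalised twice at the span level.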


## 2.2 Discussion of results

Confusion matrix built from the model's predictions on the held-out 10% validation split (rows are true labels, columns are predicted labels).

_ | O | B-Aspect | B-Object | B-Predicate | I-Aspect | I-Object | I-Predicate
--- |--- |--- |--- |--- |--- |--- |---
O | 4799 | 53 | 80 | 29 | 0 | 0 | 1
B-Aspect | 102 | 78 | 9 | 2 | 0 | 0 | 2
B-Object | 32 | 0 | 592 | 1 | 0 | 0 | 0
B-Predicate | 18 | 2 | 1 | 285 | 0 | 0 | 1
I-Aspect | 43 | 12 | 1 | 0 | 0 | 0 | 1
I-Object | 12 | 0 | 9 | 0 | 0 | 0 | 0
I-Predicate | 21 | 4 | 1 | 5 | 0 | 0 | 8

As the table above shows, the model detects the beginning of entities well (most clearly for B-Object and B-Predicate) but almost never detects their continuation: I-Aspect and I-Object are never predicted, and I-Predicate is predicted correctly in only 8 of 39 cases. <br>
The **xlm-roberta-large-finetuned-conll03-english** model handled this task, but Codalab was broken and did not accept submissions on 10.08 and 11.08 (the format was correct). For reference, its confusion matrix is given below. The notebook with this model is available on request.

_ | O | B-Aspect | B-Object | B-Predicate | I-Aspect | I-Object | I-Predicate
--- |--- |--- |--- |--- |--- |--- |---
O | 4788 | 84 | 39 | 26 | 22 | 1 | 2
B-Aspect | 43 | 139 | 0 | 1 | 7 | 0 | 3
B-Object | 29 | 3 | 592 | 0 | 0 | 1 | 0
B-Predicate | 7 | 3 | 1 | 294 | 0 | 0 | 2
I-Aspect | 18 | 8 | 0 | 0 | 27 | 0 | 4
I-Object | 3 | 0 | 4 | 0 | 0 | 14 | 0
I-Predicate | 10 | 3 | 0 | 2 | 2 | 0 | 22

Method | dev_f1 | test_f1
--- | --- | ---
baseline | 0.43 | 0.50
rubert_tiny2 | 0.51 | 0.63

The model outperforms the baseline on both the dev and test sets.
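
To make the comparison of the two confusion matrices above more concrete, per-label recall can be derived from each row. The following is a minimal sketch using only the numbers from the tables; the function name `per_class_recall` is hypothetical, and it assumes rows are true labels and columns are predicted labels, as produced by `sklearn.metrics.confusion_matrix`.

```python
labels = ["O", "B-Aspect", "B-Object", "B-Predicate", "I-Aspect", "I-Object", "I-Predicate"]

# Rows are true labels, columns are predicted labels (values copied from the tables above).
cm_rubert_tiny2 = [
    [4799, 53, 80, 29, 0, 0, 1],
    [102, 78, 9, 2, 0, 0, 2],
    [32, 0, 592, 1, 0, 0, 0],
    [18, 2, 1, 285, 0, 0, 1],
    [43, 12, 1, 0, 0, 0, 1],
    [12, 0, 9, 0, 0, 0, 0],
    [21, 4, 1, 5, 0, 0, 8],
]
cm_xlm_roberta = [
    [4788, 84, 39, 26, 22, 1, 2],
    [43, 139, 0, 1, 7, 0, 3],
    [29, 3, 592, 0, 0, 1, 0],
    [7, 3, 1, 294, 0, 0, 2],
    [18, 8, 0, 0, 27, 0, 4],
    [3, 0, 4, 0, 0, 14, 0],
    [10, 3, 0, 2, 2, 0, 22],
]


def per_class_recall(cm):
    """Recall per label: diagonal count divided by the row total."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(labels, cm))
    }


recall_tiny = per_class_recall(cm_rubert_tiny2)
recall_xlm = per_class_recall(cm_xlm_roberta)
for label in labels:
    print(f"{label:12s} rubert-tiny2={recall_tiny[label]:.2f} xlm-roberta={recall_xlm[label]:.2f}")
```

The `I-` rows are where the two models differ most, which matches the observation made in the discussion.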

# 3. Code

## 3.1 Requirements

In [None]:
!pip install "transformers[torch]" sentencepiece datasets seqeval

In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from IPython.display import clear_output

import torch
from datasets import Dataset, DatasetDict, load_metric
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import DataCollatorForTokenClassification
from transformers import pipeline

torch.manual_seed(73)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
metric = load_metric("seqeval")

model_checkpoint = "cointegrated/rubert-tiny2"

## 3.2 Download the data

In [None]:
!git clone https://github.com/s-nlp/semantic-role-labelling
!ls semantic-role-labelling

Cloning into 'semantic-role-labelling'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 20 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (20/20), 174.73 KiB | 3.49 MiB/s, done.
Resolving deltas: 100% (3/3), done.
baseline.ipynb	    evaluation	test_no_answers.tsv
dev_no_answers.tsv  README.md	train.tsv


In [None]:
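def read_dataset(filename, splitter="\t"):
    """Read a file with one "word<splitter>tag" pair per line and a blank line between sentences."""
    data = []
    sentence = []
    tags = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            if not line.isspace():
                # For unlabeled files, splitter="\n" yields an empty tag.
                word, tag = line.split(splitter)
                sentence.append(word)
                tags.append(tag.strip())
            else:
                data.append((sentence, tags))
                sentence = []
                tags = []
    # Keep the last sentence if the file does not end with a blank line.
    if sentence:
        data.append((sentence, tags))
    return data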

In [None]:
train_data = read_dataset('semantic-role-labelling/train.tsv')
dev_data = read_dataset('semantic-role-labelling/dev_no_answers.tsv', splitter="\n")
test_data = read_dataset('semantic-role-labelling/test_no_answers.tsv', splitter="\n")

print('train len:', len(train_data))
print('dev len:', len(dev_data))
print('test  len:', len(test_data), '\n')

train len: 2334
dev len: 283
test  len: 360 



In [None]:
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=1)

train_df = pd.DataFrame(train_data, columns=['tokens', 'tags'])
val_df = pd.DataFrame(val_data, columns=['tokens', 'tags'])
dev_df = pd.DataFrame(dev_data, columns=['tokens', 'tags'])
test_df = pd.DataFrame(test_data, columns=['tokens', 'tags'])
display(train_df.head(2))

_ | tokens | tags
--- | --- | ---
0 | [acrylic, or, some, form, of, plastic, will, w... | [B-Object, O, O, O, O, B-Object, O, O, B-Predi...
1 | [ford, is, faster, than, bmw, but, in, handlin... | [B-Object, O, B-Predicate, O, B-Object, O, O, ...


## 3.3 Preprocessing

In [None]:
label_list = sorted({label for item in train_df['tags'] for label in item })
if 'O' in label_list:
    label_list.remove('O')
    label_list = ['O'] + label_list
label_list

['O',
 'B-Aspect',
 'B-Object',
 'B-Predicate',
 'I-Aspect',
 'I-Object',
 'I-Predicate']

In [None]:
ner_data = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'test': Dataset.from_pandas(val_df)
})
ner_data

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 2100
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 234
    })
})

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
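
The alignment rule in the cell above can be illustrated without loading a tokenizer. This is a minimal sketch: `align_labels` is a hypothetical helper, and the hand-written `word_ids` list stands in for the output of `tokenized_inputs.word_ids(...)`; the subword split shown in the comment is made up for illustration.

```python
label_list = ["O", "B-Aspect", "B-Object", "B-Predicate", "I-Aspect", "I-Object", "I-Predicate"]


def align_labels(word_ids, word_tags, label_all_tokens=False):
    """Map word-level tags onto subword positions; -100 marks ignored positions."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:                  # special token such as [CLS] or [SEP]
            label_ids.append(-100)
        elif word_idx != previous_word_idx:   # first subword of a word
            label_ids.append(label_list.index(word_tags[word_idx]))
        elif label_all_tokens:                # later subword, labelled like its word
            label_ids.append(label_list.index(word_tags[word_idx]))
        else:                                 # later subword, ignored in the loss
            label_ids.append(-100)
        previous_word_idx = word_idx
    return label_ids


# Hypothetical split: [CLS] post ##gres ##ql is faster [SEP]
word_ids = [None, 0, 0, 0, 1, 2, None]
word_tags = ["B-Object", "O", "B-Predicate"]
aligned = align_labels(word_ids, word_tags)
aligned_all = align_labels(word_ids, word_tags, label_all_tokens=True)
```

With the default `label_all_tokens=False`, only the first subword of each word contributes to the loss and to the evaluation.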

In [None]:
tokenized_datasets = ner_data.map(tokenize_and_align_labels, batched=True)


## 3.4 My method of text processing

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
model.config.id2label = dict(enumerate(label_list))
model.config.label2id = {v: k for k, v in model.config.id2label.items()}


Some weights of BertForTokenClassification were not initialized from the model checkpoint at cointegrated/rubert-tiny2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
def compute_metrics(p):
    labels, inputs = p.label_ids, p.inputs
    predictions = np.argmax(p.predictions, axis=2)

    # send only the first token of each word to the evaluation
    true_predictions = []
    true_labels = []
    for prediction, label, tokens in zip(predictions, labels, inputs):
        true_predictions.append([])
        true_labels.append([])
        # distinct loop names so that the EvalPrediction argument `p` is not shadowed
        for (pred_id, label_id, token_id) in zip(prediction, label, tokens):
            if label_id != -100 and not tokenizer.convert_ids_to_tokens(int(token_id)).startswith('##'):
                true_predictions[-1].append(label_list[pred_id])
                true_labels[-1].append(label_list[label_id])

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [None]:
batch_size = 16
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.evaluate()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 1.9973313808441162,
 'eval_precision': 0.048113207547169815,
 'eval_recall': 0.18133333333333335,
 'eval_f1': 0.07604846225535881,
 'eval_accuracy': 0.11476466795615732,
 'eval_runtime': 4.4744,
 'eval_samples_per_second': 52.297,
 'eval_steps_per_second': 3.352}

In [None]:
for param in model.bert.parameters():
    param.requires_grad = False

In [None]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
        print(param)

classifier.weight
Parameter containing:
tensor([[ 0.0051,  0.0054, -0.0366,  ...,  0.0061, -0.0284, -0.0108],
        [-0.0210, -0.0219, -0.0226,  ..., -0.0361, -0.0256, -0.0429],
        [ 0.0106, -0.0167, -0.0068,  ..., -0.0333,  0.0084,  0.0269],
        ...,
        [ 0.0284, -0.0129, -0.0188,  ..., -0.0240, -0.0188, -0.0123],
        [-0.0075, -0.0060, -0.0349,  ...,  0.0018,  0.0023, -0.0303],
        [-0.0061, -0.0038, -0.0315,  ...,  0.0020, -0.0069,  0.0053]],
       device='cuda:0', requires_grad=True)
classifier.bias
Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0.], device='cuda:0', requires_grad=True)


In [None]:
trainer.train()



Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy
--- | --- | --- | --- | --- | --- | ---
1 | No log | 1.662726 | 0.082832 | 0.159111 | 0.108947 | 0.525951
2 | No log | 1.399876 | 0.187029 | 0.110222 | 0.138702 | 0.745164
3 | No log | 1.201725 | 0.404 | 0.089778 | 0.146909 | 0.794487
4 | 1.483200 | 1.057556 | 0.537931 | 0.069333 | 0.122835 | 0.802547
5 | 1.483200 | 0.955209 | 0.589744 | 0.061333 | 0.111111 | 0.803675
6 | 1.483200 | 0.884239 | 0.601942 | 0.055111 | 0.100977 | 0.803514
7 | 1.483200 | 0.836345 | 0.631579 | 0.053333 | 0.098361 | 0.804159
8 | 0.950500 | 0.805792 | 0.62766 | 0.052444 | 0.096801 | 0.803997
9 | 0.950500 | 0.788866 | 0.641304 | 0.052444 | 0.09696 | 0.80432
10 | 0.950500 | 0.783437 | 0.641304 | 0.052444 | 0.09696 | 0.80432


TrainOutput(global_step=1320, training_loss=1.1223943768125593, metrics={'train_runtime': 41.8325, 'train_samples_per_second': 502.002, 'train_steps_per_second': 31.554, 'total_flos': 18248927495760.0, 'train_loss': 1.1223943768125593, 'epoch': 10.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.7834369540214539,
 'eval_precision': 0.6413043478260869,
 'eval_recall': 0.052444444444444446,
 'eval_f1': 0.09695973705834018,
 'eval_accuracy': 0.8043197936814958,
 'eval_runtime': 0.6087,
 'eval_samples_per_second': 384.43,
 'eval_steps_per_second': 24.643,
 'epoch': 10.0}
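
With the encoder frozen, precision is fairly high while recall is very low, and F1 stays close to the lower of the two because it is their harmonic mean. The short check below reproduces the reported `eval_f1` from the reported precision and recall; the helper name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported by trainer.evaluate() above.
frozen_f1 = f1_score(0.6413043478260869, 0.052444444444444446)
print(round(frozen_f1, 6))
```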

In [None]:
# unfreeze all layers
for param in model.parameters():
    param.requires_grad = True

In [None]:
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy
--- | --- | --- | --- | --- | --- | ---
1 | No log | 0.463185 | 0.694132 | 0.494222 | 0.577362 | 0.857511
2 | No log | 0.38524 | 0.69175 | 0.678222 | 0.684919 | 0.883462
3 | No log | 0.338526 | 0.756833 | 0.713778 | 0.734675 | 0.900709
4 | 0.442300 | 0.315578 | 0.748883 | 0.744889 | 0.746881 | 0.90619
5 | 0.442300 | 0.29457 | 0.771505 | 0.765333 | 0.768407 | 0.914571
6 | 0.442300 | 0.281964 | 0.770186 | 0.771556 | 0.77087 | 0.916667
7 | 0.442300 | 0.273932 | 0.768028 | 0.785778 | 0.776801 | 0.918762
8 | 0.292000 | 0.2638 | 0.787744 | 0.788444 | 0.788094 | 0.922469
9 | 0.292000 | 0.258678 | 0.786972 | 0.794667 | 0.790801 | 0.923759
10 | 0.292000 | 0.255324 | 0.787533 | 0.797333 | 0.792403 | 0.924404


TrainOutput(global_step=2640, training_loss=0.2839001207640677, metrics={'train_runtime': 77.517, 'train_samples_per_second': 541.816, 'train_steps_per_second': 34.057, 'total_flos': 36420815393976.0, 'train_loss': 0.2839001207640677, 'epoch': 20.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.24197769165039062,
 'eval_precision': 0.7897435897435897,
 'eval_recall': 0.8213333333333334,
 'eval_f1': 0.8052287581699347,
 'eval_accuracy': 0.928755641521599,
 'eval_runtime': 0.2794,
 'eval_samples_per_second': 837.473,
 'eval_steps_per_second': 53.684,
 'epoch': 20.0}

In [None]:
pred_output = trainer.predict(tokenized_datasets["test"])


labels, inputs = pred_output.label_ids, tokenized_datasets["test"]['input_ids']
predictions = np.argmax(pred_output.predictions, axis=2)

# send only the first token of each word to the evaluation
true_predictions = []
true_labels = []
for prediction, label, tokens in zip(predictions, labels, inputs):
    true_predictions.append([])
    true_labels.append([])
    # distinct loop names so that the prediction output is not overwritten
    for (pred_id, label_id, token_id) in zip(prediction, label, tokens):
        if label_id != -100 and not tokenizer.convert_ids_to_tokens(int(token_id)).startswith('##'):
            true_predictions[-1].append(label_list[pred_id])
            true_labels[-1].append(label_list[label_id])

results = metric.compute(predictions=true_predictions, references=true_labels)
results

{'Aspect': {'precision': 0.42953020134228187,
  'recall': 0.3316062176165803,
  'f1': 0.37426900584795325,
  'number': 193},
 'Object': {'precision': 0.8455988455988456,
  'recall': 0.9376,
  'f1': 0.889226100151745,
  'number': 625},
 'Predicate': {'precision': 0.8353658536585366,
  'recall': 0.8925081433224755,
  'f1': 0.8629921259842518,
  'number': 307},
 'overall_precision': 0.7897435897435897,
 'overall_recall': 0.8213333333333334,
 'overall_f1': 0.8052287581699347,
 'overall_accuracy': 0.928755641521599}

In [None]:
cm = pd.DataFrame(
    confusion_matrix(sum(true_labels, []), sum(true_predictions, []), labels=label_list),
    index=label_list,
    columns=label_list
)
cm

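_ | O | B-Aspect | B-Object | B-Predicate | I-Aspect | I-Object | I-Predicate
--- |--- |--- |--- |--- |--- |--- |---
O | 4799 | 53 | 80 | 29 | 0 | 0 | 1
B-Aspect | 102 | 78 | 9 | 2 | 0 | 0 | 2
B-Object | 32 | 0 | 592 | 1 | 0 | 0 | 0
B-Predicate | 18 | 2 | 1 | 285 | 0 | 0 | 1
I-Aspect | 43 | 12 | 1 | 0 | 0 | 0 | 1
I-Object | 12 | 0 | 9 | 0 | 0 | 0 | 0
I-Predicate | 21 | 4 | 1 | 5 | 0 | 0 | 8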


In [None]:
# for i in cm.index:
#   # ' | '.join(cm.columns.tolist())
#   print(i,'|',' | '.join([str(j) for j in cm.loc[i].tolist()]))

In [None]:
model.save_pretrained('ner_bert.bin')
tokenizer.save_pretrained('ner_bert.bin')

('ner_bert.bin/tokenizer_config.json',
 'ner_bert.bin/special_tokens_map.json',
 'ner_bert.bin/vocab.txt',
 'ner_bert.bin/added_tokens.json',
 'ner_bert.bin/tokenizer.json')

## 3.5 Inference

In [None]:
# Token-level predictions (with B-/I- prefixes) and word-level predictions
# aggregated by the maximum model score; fall back to CPU when no GPU is available.
pipe_device = 0 if torch.cuda.is_available() else -1
pipe_none = pipeline(model=model, tokenizer=tokenizer, task='ner', aggregation_strategy='none', device=pipe_device)
pipe_max = pipeline(model=model, tokenizer=tokenizer, task='ner', aggregation_strategy='max', device=pipe_device)

def xsr_prediction(current_token_list: list) -> list:
    """Return, for every sentence, a list of 'word<TAB>tag' strings."""
    answer_list = []
    for words in tqdm(current_token_list):
        text = ' '.join(words)

        a = pd.DataFrame(pipe_max(text))
        a_2 = pd.DataFrame(pipe_none(text))
        if len(a) > 0:
            # Recover the B-/I- prefix from the token-level prediction that starts
            # at the same character offset; spans without a match become 'O'.
            a = a.merge(a_2[['entity', 'start']], on='start', how='left')
            a['entity'] = a['entity'].fillna('O')

        answer_list_i = []
        start = 0
        for word in words:
            # Character span of this word in `text`; using len(word) also handles
            # the last word, which has no trailing space.
            end = start + len(word)
            ner = 'O'
            if len(a) > 0:
                match = a[(a.start == start) & (a.end == end)]
                if len(match) > 0:
                    ner = match.entity.iloc[0]
            start = end + 1  # skip the joining space
            answer_list_i.append(word + '\t' + ner)
        assert len(words) == len(answer_list_i)
        answer_list.append(answer_list_i)
    return answer_list
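
The pipelines return entities with character offsets (`start`, `end`) into the space-joined text, and the function above maps them back onto the original tokens. The sketch below isolates that mapping step so it can be checked without a model. The helper names `word_offsets` and `tag_words` are hypothetical, and the entity dictionaries are hand-written examples that only imitate the shape of the pipeline output.

```python
def word_offsets(words):
    """Character (start, end) span of each word in ' '.join(words)."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word) + 1  # +1 for the joining space
    return spans


def tag_words(words, entities):
    """Give each word the tag of the entity with exactly the same span, else 'O'."""
    by_span = {(e["start"], e["end"]): e["entity"] for e in entities}
    return [by_span.get(span, "O") for span in word_offsets(words)]


words = ["ford", "is", "faster", "than", "bmw"]
# Hypothetical entities, shaped like token-classification pipeline output.
entities = [
    {"entity": "B-Object", "start": 0, "end": 4},
    {"entity": "B-Predicate", "start": 8, "end": 14},
    {"entity": "B-Object", "start": 20, "end": 23},
]
tags = tag_words(words, entities)
```

Because the match is on exact spans, an entity that covers several words at once would leave all of those words tagged `O` in this sketch.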

validation check

In [None]:
for x,y in zip(val_df.loc[4,'tokens'], val_df.loc[4,'tags']):
  print(x.ljust(10), y)

but        O
,          O
i          O
suspect    O
you        O
might      O
have       O
fewer      B-Predicate
problems   B-Aspect
with       O
postgresql B-Object
over       O
mysql      B-Object
,          O
as         O
both       O
postgre    O
and        O
ms         O
sql        O
are        O
"          O
acid       O
"          O
compliant  O
.          O


In [None]:
val_answer_list = xsr_prediction(val_df['tokens'].tolist())
clear_output(wait=True)

for i in val_answer_list[4]:
  print(i.split('\t')[0].ljust(10), i.split('\t')[1])

but        O
,          O
i          O
suspect    O
you        O
might      O
have       O
fewer      B-Predicate
problems   O
with       O
postgresql B-Object
over       O
mysql      B-Object
,          O
as         O
both       O
postgre    B-Object
and        O
ms         B-Object
sql        O
are        O
"          O
acid       O
"          O
compliant  O
.          O


In [None]:
dev_answer_list = xsr_prediction(dev_df['tokens'].tolist())
clear_output()

In [None]:
test_answer_list = xsr_prediction(test_df['tokens'].tolist())
clear_output()

## 3.6 Zip-file saving

In [None]:
with open("out_dev.tsv", "w") as w:
    # One "word<TAB>tag" line per token, blank line between sentences.
    for sentence_i in tqdm(dev_answer_list):
        for i in sentence_i:
            w.write(f"{i}\n")
        w.write("\n")


In [None]:
!zip out_dev.zip out_dev.tsv

  adding: out_dev.tsv (deflated 73%)


In [None]:
with open("out_test.tsv", "w") as w:
    # One "word<TAB>tag" line per token, blank line between sentences.
    for sentence_i in tqdm(test_answer_list):
        for i in sentence_i:
            w.write(f"{i}\n")
        w.write("\n")


In [None]:
!zip out_test.zip out_test.tsv

  adding: out_test.tsv (deflated 74%)
