**Assignment**
1.  Fine-tune BERT for NER;
2.  Fine-tune GPT for text generation;
3*. Fine-tune T5 for text summarization.

## 1. NER

In [None]:
!pip install -q datasets transformers seqeval corus razdel

In [None]:
!pip install evaluate

In [None]:
import numpy as np
import pandas as pd
import evaluate
import logging
import torch
from datasets import load_dataset, load_metric
from datasets import Dataset, DatasetDict
from corus import load_rudrec
from collections import Counter, defaultdict
from razdel import tokenize
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers.trainer import logger as noisy_logger
from sklearn.metrics import confusion_matrix

In [None]:
model_checkpoint = "cointegrated/rubert-tiny2"
batch_size = 16

## Loading the dataset

In [None]:
!wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json

--2023-09-24 11:30:24--  https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cimm-kzn/RuDReC/master/data/rudrec_annotated.json [following]
--2023-09-24 11:30:24--  https://raw.githubusercontent.com/cimm-kzn/RuDReC/master/data/rudrec_annotated.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1773014 (1.7M) [text/plain]
Saving to: ‘rudrec_annotated.json.1’


2023-09-24 11:30:25 (36.1 MB/s) - ‘rudrec_annotated.json.1’ saved [1773014/1773014]



In [None]:
drugs = list(load_rudrec('rudrec_annotated.json'))
print(len(drugs))

4809


In [None]:
# Look at a sample document
drugs[13]

RuDReCRecord(
    file_name='2535409.tsv',
    text='Как только мой муж видит признаки простуды он постоянно покупает себе "Амизон".\n',
    sentence_id=3,
    entities=[RuDReCEntity(
         entity_id='*[2]_se',
         entity_text='простуды',
         entity_type='DI',
         start=34,
         end=42,
         concept_id='C0009443',
         concept_name=nan
     ),
     RuDReCEntity(
         entity_id='*[3]_se',
         entity_text='Амизон',
         entity_type='Drugname',
         start=71,
         end=77,
         concept_id='C0915256',
         concept_name='Amizon'
     )]
)

In [None]:
# Count the entity types and the most frequent mentions of each type
type2text = defaultdict(Counter)
ents = Counter()

for item in drugs:
    for e in item.entities:
        ents[e.entity_type] += 1
        type2text[e.entity_type][e.entity_text] += 1

for k, v in ents.most_common():
    print(k, v)
    print(type2text[k].most_common(3))

DI 1401
[('простуды', 64), ('ОРВИ', 47), ('профилактики', 42)]
Drugname 1043
[('Виферон', 33), ('Анаферон', 25), ('Циклоферон', 24)]
Drugform 836
[('таблетки', 154), ('таблеток', 79), ('свечи', 63)]
ADR 720
[('аллергия', 16), ('слабость', 13), ('диарея', 12)]
Drugclass 330
[('противовирусный', 21), ('противовирусное', 18), ('противовирусных', 13)]
Finding 236
[('аллергии', 12), ('температуры', 6), ('сонливости', 5)]


In [None]:
drugs[13].text

'Как только мой муж видит признаки простуды он постоянно покупает себе "Амизон".\n'

Let's write a function that moves the entity annotations from character spans to the word level. We will use [IOB](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)) notation so that several consecutive entities of the same type can be told apart.

In [None]:
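def extract_labels(item):
    # Tokenize the text into words; razdel keeps character offsets for each token
    raw_toks = list(tokenize(item.text))
    words = [tok.text for tok in raw_toks]
    word_labels = ['O'] * len(raw_toks)
    # Map every character position to the index of the word covering it
    char2word = [None] * len(item.text)
    for i, word in enumerate(raw_toks):
        char2word[word.start:word.stop] = [i] * len(word.text)

    for e in item.entities:
        e_words = sorted({idx for idx in char2word[e.start:e.end] if idx is not None})
        if not e_words:
            # Skip entities that do not overlap any token
            continue
        # The first word of the entity gets B-, the rest get I-
        word_labels[e_words[0]] = 'B-' + e.entity_type
        for idx in e_words[1:]:
            word_labels[idx] = 'I-' + e.entity_type

    return {'tokens': words, 'tags': word_labels}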

In [None]:
print(extract_labels(drugs[13]))

{'tokens': ['Как', 'только', 'мой', 'муж', 'видит', 'признаки', 'простуды', 'он', 'постоянно', 'покупает', 'себе', '"', 'Амизон', '"', '.'], 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'B-DI', 'O', 'O', 'O', 'O', 'O', 'B-Drugname', 'O', 'O']}
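
To see why the `B-`/`I-` distinction matters, here is a minimal, self-contained sketch of the same span-to-word projection. It is not the notebook's function: it assumes a simple regex tokenizer in place of `razdel`, and the helper name `spans_to_iob`, the sample sentence, and the spans are made up for illustration.

```python
import re

def spans_to_iob(text, entities):
    """Project character-level entity spans onto word-level IOB tags.

    `entities` is a list of (start, end, type) tuples with `end` exclusive.
    Words are found with a simple regex (an assumption of this sketch);
    the notebook itself uses razdel.tokenize.
    """
    words = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(words)
    for start, end, etype in entities:
        # Indices of the words that overlap the entity span
        covered = [i for i, (_, ws, we) in enumerate(words) if ws < end and we > start]
        for pos, i in enumerate(covered):
            tags[i] = ("B-" if pos == 0 else "I-") + etype
    return [w for w, _, _ in words], tags

# Two single-word entities in a row, then one two-word entity
tokens, tags = spans_to_iob(
    "cough fever and sore throat",
    [(0, 5, "DI"), (6, 11, "DI"), (16, 27, "DI")],
)
print(tokens)  # ['cough', 'fever', 'and', 'sore', 'throat']
print(tags)    # ['B-DI', 'B-DI', 'O', 'B-DI', 'I-DI']
```

Without the `B-` prefix, "cough" and "fever" would merge into a single two-word entity.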


In [None]:
ner_data = [extract_labels(item) for item in drugs]
ner_train, ner_test = train_test_split(ner_data, test_size=0.1, random_state=13)

In [None]:
# Sample rows
pd.options.display.max_colwidth = 300
pd.DataFrame(ner_train).sample(3)

Unnamed: 0,tokens,tags
3306,"[Я, уже, стараюсь, не, расчесываться, в, течение, дня, и, голову, не, сушить, феном, сильно, .]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
1886,"[Пришлось, купить, его, .]","[O, O, O, O]"
1937,"[Разрешены, к, применению, как, взрослыми, и, детьми, ,, так, и, кормящими, и, беременными, женщинами, .]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"


In [None]:
# Collect all label types into a list, with 'O' placed last
label_list = sorted({label for item in ner_train for label in item['tags']})
if 'O' in label_list:
    label_list.remove('O')
    label_list += ['O']
label_list

['B-ADR',
 'B-DI',
 'B-Drugclass',
 'B-Drugform',
 'B-Drugname',
 'B-Finding',
 'I-ADR',
 'I-DI',
 'I-Drugclass',
 'I-Drugform',
 'I-Drugname',
 'I-Finding',
 'O']

Let's put our data into a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) object, which is native to Hugging Face.

In [None]:
ner_data = DatasetDict({
    'train': Dataset.from_pandas(pd.DataFrame(ner_train)),
    'test': Dataset.from_pandas(pd.DataFrame(ner_test))
})
ner_data

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 4328
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 481
    })
})

## Preprocessing the data

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
example = ner_train[13]
print(example["tokens"])

['К', 'сожалению', ',', 'но', 'в', 'нашем', 'случае', 'дочке', 'это', 'средство', 'не', 'помогло', '.']


In [None]:
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'К', 'сожалению', ',', 'но', 'в', 'нашем', 'случае', 'до', '##чке', 'это', 'средство', 'не', 'помогло', '.', '[SEP]']


In [None]:
# The word-level tags and the subword tokens differ in length, so the labels must be re-aligned to the subword level
len(example["tags"]), len(tokenized_input["input_ids"])

(13, 16)

We're now ready to write the function that will preprocess our samples. We feed them to the `tokenizer` with the argument `truncation=True` (to truncate texts that are bigger than the maximum size allowed by the model) and `is_split_into_words=True` (as seen above). Then we align the labels with the token ids using the strategy we picked:

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
tokenize_and_align_labels(ner_data['train'][22:23])

{'input_ids': [[2, 24121, 16394, 30, 23, 735, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 12, 12, 12, 12, 12, -100]]}
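
With `label_all_tokens=True`, the function above copies a word's tag to every one of its sub-word pieces, so the pieces of a `B-DI` word all receive `B-DI`. One possible variant, sketched below, gives continuation pieces the matching `I-` tag instead. The helper `align_labels` and the sample `word_ids` are hypothetical and work on plain lists rather than on tokenizer output.

```python
def align_labels(word_ids, word_tags, label_all_tokens=True):
    """Expand word-level IOB tags to sub-word tokens.

    `word_ids` mimics `tokenized_inputs.word_ids()`: None for special
    tokens, otherwise the index of the source word. Continuation pieces
    of a `B-X` word get `I-X`, so a split word still reads as one entity.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                 # special token, ignored by the loss
        elif word_id != previous:
            aligned.append(word_tags[word_id])   # first piece keeps the word tag
        elif not label_all_tokens:
            aligned.append(-100)                 # later pieces are ignored
        else:
            tag = word_tags[word_id]
            aligned.append("I-" + tag[2:] if tag.startswith("B-") else tag)
        previous = word_id
    return aligned

# Layout: [CLS] word0 word1-piece1 word1-piece2 word2 [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_tags = ["O", "B-DI", "O"]
aligned = align_labels(word_ids, word_tags)
print(aligned)  # [-100, 'O', 'B-DI', 'I-DI', 'O', -100]
```

The string tags would still have to be mapped to integer ids with `label_list.index`, as in the function above.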

To apply this function to all the sentences in our dataset, we just use the `map` method of the `ner_data` object we created earlier. This applies the function to all the elements of all the splits in `ner_data`, so our training and test data will be preprocessed in one single command.

In [None]:
tokenized_datasets = ner_data.map(tokenize_and_align_labels, batched=True)


## Fine-tuning the model

In [None]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
model.config.id2label = dict(enumerate(label_list))
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

Some weights of BertForTokenClassification were not initialized from the model checkpoint at cointegrated/rubert-tiny2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `classifier.weight` and `classifier.bias` of the token-classification head) are newly, randomly initialized. This is absolutely normal in this case: the checkpoint was pretrained without this head, so we are attaching a new head for which we don't have pretrained weights, and the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
# !pip install accelerate -U

In [None]:
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

Then we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each batch will be padded to the length of its longest example). There is a data collator for this task in the Transformers library that pads not only the inputs but also the labels:

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)
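
To make the padding concrete, here is a rough pure-Python sketch of what such a collator does to a batch. It is not the library's implementation: `pad_batch` is a hypothetical helper, `pad_token_id=0` is an assumption, and the second example is made up. Labels are padded with `-100` so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad `input_ids` and `labels` to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padding gets 0
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

features = [
    {"input_ids": [2, 24121, 16394, 30, 23, 735, 3],
     "labels": [-100, 12, 12, 12, 12, 12, -100]},
    {"input_ids": [2, 735, 3], "labels": [-100, 12, -100]},  # made-up short example
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [2, 735, 3, 0, 0, 0, 0]
print(batch["labels"][1])     # [-100, 12, -100, -100, -100, -100, -100]
```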

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. Here we will load the [`seqeval`](https://github.com/chakki-works/seqeval) metric (which is commonly used to evaluate results on the CoNLL dataset) via the `evaluate` library.

In [None]:
# Hint: Indeed, metrics are loaded from hf.co. E.g. seqeval is loaded from here: https://huggingface.co/spaces/evaluate-metric/seqeval/tree/main
metric = evaluate.load("seqeval")

In [None]:
example = ner_train[4]
labels = example['tags']
metric.compute(predictions=[labels], references=[labels])



{'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 1.0}

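The metric takes lists of string labels for the predictions and the references. It scores whole entities, so for this example, which contains only `O` tags, precision, recall and F1 are 0 while token accuracy is 1. We will therefore need to do a bit of post-processing on our predictions: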
- select the predicted index (with the maximum logit) for each token
- convert it to its string label
- ignore everywhere we set a label of -100

The following function does all this post-processing on the `EvalPrediction` object that the `Trainer` passes to `compute_metrics` (it unpacks into predictions and labels) before applying the metric:

In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Note that we drop the precision/recall/f1 computed for each category and only focus on the overall precision/recall/f1/accuracy.
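
The overall precision, recall and F1 are computed over whole entities, not over tokens, which is why they can be far lower than accuracy. The following is a simplified sketch of exact-match span scoring on IOB tags. It is not `seqeval`'s actual code, and `iob_spans` and `entity_prf` are hypothetical helpers.

```python
def iob_spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def entity_prf(references, predictions):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    n_true = n_pred = n_correct = 0
    for ref, pred in zip(references, predictions):
        true_spans, pred_spans = iob_spans(ref), iob_spans(pred)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
        n_correct += len(true_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

refs = [["O", "B-DI", "I-DI", "O", "B-Drugname"]]
preds = [["O", "B-DI", "O", "O", "B-Drugname"]]
print(entity_prf(refs, preds))  # (0.5, 0.5, 0.5)
```

In this example four of the five tokens are tagged correctly, yet only one of the two entities matches exactly, so every entity-level score is 0.5.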

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(model,
                  args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["test"],
                  data_collator=data_collator,
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

In [None]:
trainer.evaluate()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 2.7640469074249268,
 'eval_precision': 0.01683093252463988,
 'eval_recall': 0.1252821670428894,
 'eval_f1': 0.029675177115358914,
 'eval_accuracy': 0.025841115517461943,
 'eval_runtime': 4.287,
 'eval_samples_per_second': 112.2,
 'eval_steps_per_second': 7.231}

At the start of training, let's try freezing all the parameters of the model except the last layer (the classification head) and see how well the model learns.

In [None]:
for param in model.bert.parameters():
    param.requires_grad = False

In [None]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
        print(param, '\n')

classifier.weight
Parameter containing:
tensor([[-0.0326,  0.0119, -0.0192,  ..., -0.0204, -0.0110,  0.0019],
        [-0.0212, -0.0157, -0.0160,  ...,  0.0103,  0.0227, -0.0148],
        [ 0.0241,  0.0314, -0.0028,  ..., -0.0051, -0.0238,  0.0208],
        ...,
        [-0.0200,  0.0136,  0.0352,  ...,  0.0041,  0.0044, -0.0366],
        [ 0.0284, -0.0063, -0.0161,  ..., -0.0398,  0.0106, -0.0212],
        [-0.0100, -0.0064,  0.0264,  ..., -0.0092, -0.0088, -0.0149]],
       device='cuda:0', requires_grad=True) 

classifier.bias
Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0',
       requires_grad=True) 



In [None]:
noisy_logger.setLevel(logging.WARNING)

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.574741,0.340426,0.072235,0.119181,0.853652
2,1.667600,1.179205,0.767442,0.037246,0.071044,0.859025
3,1.667600,0.926128,1.0,0.030474,0.059146,0.858769
4,1.019200,0.776914,1.0,0.030474,0.059146,0.858769
5,1.019200,0.691925,1.0,0.032731,0.063388,0.859025
6,0.748200,0.643409,0.954545,0.047404,0.090323,0.86056
7,0.748200,0.615336,0.945455,0.058691,0.110521,0.86184
8,0.657900,0.599353,0.901639,0.062077,0.116156,0.862095
9,0.657900,0.591057,0.878788,0.065463,0.121849,0.862479
10,0.614400,0.588464,0.882353,0.06772,0.125786,0.862735


TrainOutput(global_step=2710, training_loss=0.9151002919102067, metrics={'train_runtime': 31.963, 'train_samples_per_second': 1354.066, 'train_steps_per_second': 84.786, 'total_flos': 30836082346560.0, 'train_loss': 0.9151002919102067, 'epoch': 10.0})

The model is underfit: after 10 epochs recall is below 0.07 and F1 is about 0.13, so it seems that more layers need to be trained. Let's unfreeze all of them (though it might be more correct to unfreeze only a few of the top ones) and train for another 20 epochs.

In [None]:
# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True
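
The cell above unfreezes everything. The alternative mentioned earlier, unfreezing only a few top encoder layers, could be sketched by filtering on parameter names. The helper `is_trainable` is hypothetical, assumes the usual Hugging Face BERT naming (`bert.encoder.layer.<i>.`), and the sample names and the layer count of 3 are illustrative only.

```python
import re

def is_trainable(param_name, num_layers, top_k):
    """Keep the classifier head and the `top_k` highest encoder layers trainable.

    Assumes Hugging Face BERT parameter names: `bert.encoder.layer.<i>....`.
    """
    if param_name.startswith("classifier."):
        return True
    match = re.match(r"bert\.encoder\.layer\.(\d+)\.", param_name)
    return bool(match) and int(match.group(1)) >= num_layers - top_k

names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.2.output.dense.weight",
    "classifier.weight",
]
flags = [is_trainable(n, num_layers=3, top_k=1) for n in names]
print(flags)  # [False, False, True, True]

# With the notebook's model this could be applied as follows (not run here):
# for name, param in model.named_parameters():
#     param.requires_grad = is_trainable(name, model.config.num_hidden_layers, top_k=1)
```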

In [None]:
args = TrainingArguments("ner",
                         evaluation_strategy = "epoch",
                         learning_rate=1e-5,
                         per_device_train_batch_size=batch_size,
                         per_device_eval_batch_size=batch_size,
                         num_train_epochs=20,
                         weight_decay=0.01,
                         save_strategy='no',
                         report_to='none')

In [None]:
trainer = Trainer(model,
                  args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["test"],
                  data_collator=data_collator,
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.405937,0.502035,0.417607,0.455946,0.890751
2,0.417600,0.353462,0.580263,0.497743,0.535844,0.903544
3,0.417600,0.330278,0.585106,0.558691,0.571594,0.908149
4,0.316800,0.305939,0.592179,0.598194,0.595171,0.914417
5,0.316800,0.291262,0.600892,0.608352,0.604599,0.916208
6,0.275100,0.282568,0.608407,0.620767,0.614525,0.917999
7,0.275100,0.274493,0.592284,0.641084,0.615718,0.919151
8,0.242900,0.267703,0.609989,0.647856,0.628352,0.921325
9,0.242900,0.263862,0.602911,0.654628,0.627706,0.921837
10,0.220400,0.259939,0.6091,0.664786,0.635726,0.924268


TrainOutput(global_step=5420, training_loss=0.23401196469240082, metrics={'train_runtime': 168.7981, 'train_samples_per_second': 512.802, 'train_steps_per_second': 32.109, 'total_flos': 55995481000320.0, 'train_loss': 0.23401196469240082, 'epoch': 20.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.24920159578323364,
 'eval_precision': 0.6186612576064908,
 'eval_recall': 0.6884875846501128,
 'eval_f1': 0.6517094017094016,
 'eval_accuracy': 0.9278495586542151,
 'eval_runtime': 0.7484,
 'eval_samples_per_second': 642.724,
 'eval_steps_per_second': 41.423,
 'epoch': 20.0}

To get the precision/recall/f1 computed for each category now that we have finished training, we can apply the same function as before on the result of the `predict` method:

In [None]:
predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results



{'ADR': {'precision': 0.2962962962962963,
  'recall': 0.32323232323232326,
  'f1': 0.30917874396135264,
  'number': 99},
 'DI': {'precision': 0.3644578313253012,
  'recall': 0.5377777777777778,
  'f1': 0.43447037701974867,
  'number': 225},
 'Drugclass': {'precision': 0.8604651162790697,
  'recall': 0.8705882352941177,
  'f1': 0.8654970760233918,
  'number': 85},
 'Drugform': {'precision': 0.8448275862068966,
  'recall': 0.8448275862068966,
  'f1': 0.8448275862068967,
  'number': 116},
 'Drugname': {'precision': 0.8284883720930233,
  'recall': 0.9405940594059405,
  'f1': 0.8809891808346214,
  'number': 303},
 'Finding': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 58},
 'overall_precision': 0.6186612576064908,
 'overall_recall': 0.6884875846501128,
 'overall_f1': 0.6517094017094016,
 'overall_accuracy': 0.9278495586542151}

In [None]:
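from itertools import chain

# Token-level confusion matrix: rows are true labels, columns are predictions
cm = pd.DataFrame(
    confusion_matrix(list(chain.from_iterable(true_labels)),
                     list(chain.from_iterable(true_predictions)),
                     labels=label_list),
    index=label_list,
    columns=label_list)
cm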

Unnamed: 0,B-ADR,B-DI,B-Drugclass,B-Drugform,B-Drugname,B-Finding,I-ADR,I-DI,I-Drugclass,I-Drugform,I-Drugname,I-Finding,O
B-ADR,37,36,0,0,0,0,4,0,0,0,0,0,22
B-DI,9,145,4,2,7,0,4,5,0,0,0,0,49
B-Drugclass,0,3,74,1,0,0,0,0,0,0,0,0,7
B-Drugform,0,4,1,98,3,0,0,0,0,0,0,0,10
B-Drugname,0,3,0,1,290,0,0,0,0,0,0,0,9
B-Finding,8,32,0,0,0,0,0,0,0,0,0,0,18
I-ADR,11,12,0,0,1,0,25,19,0,0,0,0,32
I-DI,4,26,1,0,11,0,8,18,0,0,0,0,43
I-Drugclass,0,0,0,0,0,0,0,0,0,0,0,0,0
I-Drugform,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
model.save_pretrained('ner_bert.bin')
tokenizer.save_pretrained('ner_bert.bin')

('ner_bert.bin/tokenizer_config.json',
 'ner_bert.bin/special_tokens_map.json',
 'ner_bert.bin/vocab.txt',
 'ner_bert.bin/added_tokens.json',
 'ner_bert.bin/tokenizer.json')

## Test the model

In [None]:
# Take a sentence from the held-out test split
text = ' '.join(ner_test[5]['tokens'])
text

'Ребенок пошел в детский сад , а это постоянные сопли , кашель , постоянные больничные ( 2-4 дня ходим , 2 недели болеем ) , и вот в аптеке мне посоветовали препарат " Иммунал " , как прекрасное средство для повышения иммунитета .'

In [None]:
tokens = tokenizer(text, return_tensors='pt')
tokens = {k: v.to(model.device) for k, v in tokens.items()}

with torch.no_grad():
    pred = model(**tokens)
pred.logits.shape

torch.Size([1, 53, 13])

In [None]:
indices = pred.logits.argmax(dim=-1)[0].cpu().numpy()
token_text = tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])
for t, idx in zip(token_text, indices):
    print(f'{t:15s} {label_list[idx]:10s}')

[CLS]           O         
Ребенок         O         
пошел           O         
в               O         
детский         O         
сад             O         
,               O         
а               O         
это             O         
постоянные      O         
соп             B-DI      
##ли            B-DI      
,               O         
кашель          B-DI      
,               O         
постоянные      O         
больни          O         
##чные          B-DI      
(               O         
2               O         
-               O         
4               O         
дня             O         
ход             O         
##им            O         
,               O         
2               O         
недели          O         
более           O         
##м             O         
)               O         
,               O         
и               O         
вот             O         
в               O         
аптеке          O         
мне             O         
п
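
The per-token output above is hard to read because words are split into WordPiece fragments, and the pieces of one word can even disagree ("больни" is `O` while "##чные" is `B-DI`). A minimal sketch of gluing the pieces back together, keeping the label of the first piece, might look like this. `merge_wordpieces` is a hypothetical helper, and taking the first piece's label is only one of several possible strategies.

```python
def merge_wordpieces(tokens, labels, special=("[CLS]", "[SEP]", "[PAD]")):
    """Glue WordPiece tokens back into words, keeping the first piece's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special:
            continue
        if token.startswith("##") and words:
            words[-1][0] += token[2:]    # continuation: extend the previous word
        else:
            words.append([token, label])
    return [tuple(w) for w in words]

# A fragment of the output above
tokens = ["[CLS]", "постоянные", "соп", "##ли", ",", "больни", "##чные", "[SEP]"]
labels = ["O", "O", "B-DI", "B-DI", "O", "O", "B-DI", "O"]
merged = merge_wordpieces(tokens, labels)
print(merged)
# [('постоянные', 'O'), ('сопли', 'B-DI'), (',', 'O'), ('больничные', 'O')]
```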

Let's check the model with the [`huggingface.co`](https://huggingface.co) `pipeline`:

In [None]:
from transformers import pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='ner', aggregation_strategy='average', device=0)

In [None]:
print(text)
print(pipe(text))

Ребенок пошел в детский сад , а это постоянные сопли , кашель , постоянные больничные ( 2-4 дня ходим , 2 недели болеем ) , и вот в аптеке мне посоветовали препарат " Иммунал " , как прекрасное средство для повышения иммунитета .
[{'entity_group': 'DI', 'score': 0.69857335, 'word': 'сопли', 'start': 47, 'end': 52}, {'entity_group': 'DI', 'score': 0.644177, 'word': 'кашель', 'start': 55, 'end': 61}, {'entity_group': 'Drugname', 'score': 0.95609707, 'word': 'Иммунал', 'start': 167, 'end': 174}, {'entity_group': 'DI', 'score': 0.4951239, 'word': 'повышения', 'start': 207, 'end': 216}, {'entity_group': 'DI', 'score': 0.31575242, 'word': 'иммунитета', 'start': 217, 'end': 227}]
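
The two lowest-scoring entities in this output ("повышения" and "иммунитета") look like false positives. One simple way to trade recall for precision is to drop predictions below a score threshold. The sketch below reuses the scores printed above; `filter_entities` is a hypothetical helper, and the threshold of 0.6 is an arbitrary assumption that would need tuning on held-out data.

```python
# Entities copied from the pipeline output above (character offsets omitted)
entities = [
    {"entity_group": "DI", "score": 0.69857335, "word": "сопли"},
    {"entity_group": "DI", "score": 0.644177, "word": "кашель"},
    {"entity_group": "Drugname", "score": 0.95609707, "word": "Иммунал"},
    {"entity_group": "DI", "score": 0.4951239, "word": "повышения"},
    {"entity_group": "DI", "score": 0.31575242, "word": "иммунитета"},
]

def filter_entities(entities, min_score=0.6):
    """Keep only the entities whose score reaches `min_score`."""
    return [e for e in entities if e["score"] >= min_score]

kept = [(e["word"], e["entity_group"]) for e in filter_entities(entities)]
print(kept)
# [('сопли', 'DI'), ('кашель', 'DI'), ('Иммунал', 'Drugname')]
```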


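Conclusion: the result is decent. Overall F1 on the test split is about 0.65; Drugname, Drugclass and Drugform reach an F1 of 0.84 to 0.88, while ADR and DI remain weak and Finding is never recognized (F1 of 0).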