## Seminar 09: Neural Retrieval Models. Part II.

In this seminar we will:
- review the basics of PyTorch;
- get to know the [transformers](https://huggingface.co/docs/transformers/index) library;
- load a pretrained XLM-RoBERTa and fine-tune it for a ranking task.

In [None]:
%pip install --upgrade pip
%pip install -r requirements.txt

### Torch: recap

In this section we review the basics of the `torch` library.

For self-study, there is a [short overview](https://kharshit.github.io/blog/2021/12/03/pytorch-basics-tutorial) of the core concepts.

How to install PyTorch: https://pytorch.org/get-started/locally/

In [3]:
import torch

The core of torch is `torch.Tensor` (an analogue of `np.ndarray`), which can live on CPU or GPU and supports automatic differentiation (autograd).

In [4]:
# C, H, W
a = torch.empty(3, 28, 28)  # uninitialized float32 tensor
print(a.dtype, a.type(), a.shape, a.device)
# a.reshape(-1, 56) gives the same shape; view() never copies the data
print(a.view(-1, 56).shape)

torch.float32 torch.FloatTensor torch.Size([3, 28, 28]) cpu
torch.Size([42, 56])


We can convert NumPy arrays to torch tensors and back:

In [5]:
b = torch.tensor([[1, 1], [1, 1]])
# tensor -> np array
b = b.numpy()
print(type(b))
# np array -> tensor
b = torch.tensor(b)  # torch.from_numpy(b)
print(type(b))

<class 'numpy.ndarray'>
<class 'torch.Tensor'>


We can move tensors to the GPU and run any operations there:

In [6]:
# check if CUDA available
print(torch.cuda.is_available())
# check if tensor on GPU
print(b.is_cuda)
# move tensor to GPU
print(b.cuda())  # defaults to cuda:0; equivalent to b.to('cuda')
# move tensor to CPU
print(b.cpu())  # equivalent to b.to('cpu')
# check tensor device
print(b.device)

True
False
tensor([[1, 1],
        [1, 1]], device='cuda:0')
tensor([[1, 1],
        [1, 1]])
cpu


Torch computes gradients automatically via autograd:

In [7]:
x = torch.randn(2,2, requires_grad=True)
y = x**2
#y.retain_grad()  # retain gradient
# each tensor has a .grad_fn attribute that references a Function that created it
print(f'y.grad_fn: {y.grad_fn}')
z = y.mean()

print(f'x.grad: {x.grad}')
z.backward()
print(f'x.grad: {x.grad}\n\
x/2: {x/2}\n\
y.grad: {y.grad}')  # dz/dy

y.grad_fn: <PowBackward0 object at 0x7f2b72341190>
x.grad: None
x.grad: tensor([[-0.4718,  0.4012],
        [-0.6341, -0.1710]])
x/2: tensor([[-0.4718,  0.4012],
        [-0.6341, -0.1710]], grad_fn=<DivBackward0>)
y.grad: None
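
The match between `x.grad` and `x/2` in the output above follows from the chain rule. For the 2x2 input, $z = \frac{1}{4} \sum_{i,j} y_{ij}$ with $y_{ij} = x_{ij}^2$, so:

$$ \frac{\partial z}{\partial x_{ij}} = \frac{\partial z}{\partial y_{ij}} \cdot \frac{\partial y_{ij}}{\partial x_{ij}} = \frac{1}{4} \cdot 2 x_{ij} = \frac{x_{ij}}{2}. $$

`y.grad` is `None` because `y` is a non-leaf tensor: autograd does not keep gradients of intermediate results unless `y.retain_grad()` is called before `backward()`.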


Later we will need a way to turn differentiation off:

In [8]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

print(x.requires_grad)
y = x.detach()
# best way to copy a tensor
# y = x.detach().clone()
print(y.requires_grad)

True
True
False
True
False


### Data preparation

This time we again work with the MS MARCO dataset. It still contains a set of queries (= sessions) and the corresponding passages (= documents).

The preprocessing has not changed, so we simply download the prepared dataset.

In [9]:
import os
import numpy as np
import pandas as pd

from IPython.display import clear_output
from tqdm.notebook import tqdm

In [10]:
DATA_DIR = os.path.expanduser("./data")
if not os.path.exists(DATA_DIR):
    os.mkdir(DATA_DIR)

In [None]:
!source download_data.sh https://cloud.mail.ru/public/WQ3d/XccSPdk1Z ./data/ms_marco_tokenized.tsv.gz

In [13]:
data = pd.read_csv(os.path.join(DATA_DIR, "ms_marco_tokenized.tsv.gz"), sep='\t', compression="gzip")

Let's look at the resulting dataset. It turns out that, on average, about one in every 29 documents for a query is relevant, i.e. roughly 1 relevant passage per 28 non-relevant ones.

In [38]:
data.head()

Unnamed: 0,qid,query,doc,label,query_tokens,doc_tokens
0,0,where is whitemarsh island,"Whitemarsh Island, Georgia. Whitemarsh Island ...",1.0,"['where', 'is', 'whitemarsh', 'island']","['whitemarsh', 'island', ',', 'georgia', '.', ..."
1,0,where is whitemarsh island,the strategy of island hopping was used by the...,0.0,"['where', 'is', 'whitemarsh', 'island']","['the', 'strategy', 'of', 'island', 'hopping',..."
2,0,where is whitemarsh island,"For the island near Dunedin, see White Island,...",0.0,"['where', 'is', 'whitemarsh', 'island']","['for', 'the', 'island', 'near', 'dunedin', ',..."
3,0,where is whitemarsh island,"Jekyll Island, at 5,700 acres, is the smallest...",0.0,"['where', 'is', 'whitemarsh', 'island']","['jekyll', 'island', ',', 'at', '5', ',', '700..."
4,0,where is whitemarsh island,Sibu Island. A scuba diver at Sibu Island. Sib...,0.0,"['where', 'is', 'whitemarsh', 'island']","['sibu', 'island', '.', 'a', 'scuba', 'diver',..."


In [39]:
1.0 / data['label'].mean()

28.968417792647475

Split the dataset into train / val / test. The split is grouped by session (query), so all passages of a query land in the same part.

In [40]:
TEST_SIZE=3_000
test_data = data[(400_000  - TEST_SIZE < data['qid']) & (data['qid'] <= 400_000)]
val_data = data[(400_000 - 2 * TEST_SIZE < data['qid']) & (data['qid'] <= 400_000 - TEST_SIZE)]
train_data = data[data['qid'] <= 400_000 - 2 * TEST_SIZE]
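
To make the boundaries of this split explicit, here is a small sketch that maps a query id to its part. `split_name` and `LAST_QID` are hypothetical names introduced only for illustration; the thresholds are the ones used in the cell above.

```python
TEST_SIZE = 3_000
LAST_QID = 400_000  # upper qid bound used in the split above


def split_name(qid: int) -> str:
    """Return the part a query id falls into under the rule above."""
    if qid <= LAST_QID - 2 * TEST_SIZE:
        return "train"
    if qid <= LAST_QID - TEST_SIZE:
        return "val"
    if qid <= LAST_QID:
        return "test"
    return "unused"


# Probe the boundaries: every qid belongs to exactly one part.
for qid in (0, 394_000, 394_001, 397_000, 397_001, 400_000, 400_001):
    print(qid, split_name(qid))
```

Because the rule depends only on `qid`, no query can contribute passages to two parts at once.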

In [41]:
# Collect tokens and texts for the experiments.

query_texts, query_tokens = train_data.drop_duplicates(subset=["query"])[["query", "query_tokens"]].values.T
doc_texts, doc_tokens = train_data.drop_duplicates(subset=["doc"])[["doc", "doc_tokens"]].values.T

train_tokens = np.hstack([query_tokens, doc_tokens])
train_texts = np.hstack([query_texts, doc_texts])

### Task and metric

We will solve the task of re-ranking text passages for queries.

Each query comes with a set of relevant and non-relevant passages.

The goal is to rank the passages with respect to the query so that the relevant passage ends up above the non-relevant ones.

As in the previous seminar, we use the [Mean Reciprocal Rank](https://www.evidentlyai.com/ranking-metrics/mean-reciprocal-rank-mrr) (MRR) metric. It is defined as:

$$ MRR = \frac{1}{|Q|} \sum_{q_i} \frac{1}{rank_{i}},$$

where $ rank_i $ is the position of the __first relevant__ document for query $q_i$, and $ |Q| $ is the number of queries in the sample.
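
As a worked example of the definition, here is a minimal pure-Python sketch of MRR. The function `mean_reciprocal_rank` and the toy data are hypothetical and exist only for illustration. The sketch assumes that a query with no relevant document inside the optional `top_k` cutoff contributes 0, and it is not meant to reproduce any library implementation exactly.

```python
from collections import defaultdict


def mean_reciprocal_rank(scores, labels, qids, top_k=None):
    """MRR over queries; a query with no relevant document in top_k counts as 0."""
    by_query = defaultdict(list)
    for score, label, qid in zip(scores, labels, qids):
        by_query[qid].append((score, label))

    total = 0.0
    for docs in by_query.values():
        # Rank documents of one query by descending score.
        ranked = sorted(docs, key=lambda d: d[0], reverse=True)
        if top_k is not None:
            ranked = ranked[:top_k]
        for rank, (_, label) in enumerate(ranked, start=1):
            if label > 0:
                total += 1.0 / rank  # only the first relevant document counts
                break
    return total / len(by_query)


toy_scores = [0.9, 0.2, 0.1, 0.3, 0.8, 0.5]
toy_labels = [1, 0, 0, 0, 0, 1]
toy_qids = [0, 0, 0, 1, 1, 1]
print(mean_reciprocal_rank(toy_scores, toy_labels, toy_qids))  # (1/1 + 1/2) / 2 = 0.75
```

In the toy data the relevant document is ranked first for query 0 and second for query 1, which gives the average of 1 and 1/2.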

In [42]:
from torchmetrics.retrieval import RetrievalMRR

# More on the implementation:
# https://github.com/Lightning-AI/torchmetrics/blob/master/src/torchmetrics/functional/retrieval/reciprocal_rank.py
mrr = RetrievalMRR(top_k=10)

def MRR(preds, target, qids):
    assert isinstance(preds, np.ndarray)
    assert isinstance(target, np.ndarray)
    assert isinstance(qids, np.ndarray)
    score = mrr(torch.Tensor(preds), torch.Tensor(target), indexes=torch.LongTensor(qids - min(qids)))
    return score.item()

In [43]:
results = {}

test_doc_texts, test_doc_tokens = test_data[["doc", "doc_tokens"]].values.T
test_query_texts, test_query_tokens = test_data.drop_duplicates(subset=["query"])[["query", "query_tokens"]].values.T

#### Baseline: Random

Let's measure the quality of random relevance predictions:

In [44]:
results["random"] =  MRR(np.random.random(len(test_data)), test_data['label'].values, test_data['qid'].values)

#### Baseline: BM25

Now let's apply BM25. Before transformers appeared, it was a consistently strong baseline for ranking.

In [45]:
# Code from the previous seminar: running it is optional.
# We can simply copy the result.

assert False, "You really want to execute the cell?"

from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(list(test_doc_tokens))
bm25_preds = np.zeros(len(test_data))
for q_text, q_tokens in tqdm(zip(test_query_texts, test_query_tokens), total=len(test_query_texts)):
    doc_scores = bm25.get_scores(q_tokens)
    mask = test_data['query'] == q_text
    bm25_preds[mask] = doc_scores[mask]

results["bm25"] = MRR(bm25_preds, test_data['label'].values, test_data['qid'].values)

AssertionError: You really want to execute the cell?

In [46]:
if "bm25" not in results:
    results["bm25"] = 0.60042

In [123]:
# Also add the results for the simple embeddings from the previous seminar:
results["word2vec"] = 0.22002
results["fasttext"] = 0.24425

### The transformers library

In [47]:
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, XLMRobertaTokenizer
from sklearn.metrics import roc_auc_score

# PyTorch TensorBoard support
from torch.utils.tensorboard import SummaryWriter

We take [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the pretrained model. Load the model and the tokenizer using the generic `Auto*` classes:

In [None]:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', cache_dir=DATA_DIR)
model = AutoModel.from_pretrained("xlm-roberta-base", cache_dir=DATA_DIR)

In [51]:
# Prepare a sample.
sample_query = "IV КРОССОВОК REEBOK МУЖСКОЙ ANSWER".lower()
sample_title = "РОССИЯ БЕСПЛАТНЫЙ ЦЕНА ОТЗЫВ REEBOK V55619 МУЖСКОЙ FOOTBOX КУПИТЬ STEPOVER ПРИМЕРКА IV АРТИКУЛ ИНТЕРНЕТ ДОСТАВКА КРОССОВОК ANSWER МАГАЗИН".lower()

In [52]:
encoded_input = tokenizer(sample_query, return_tensors='pt')
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [53]:
# split the text into tokens (subwords)
tokenizer.tokenize(sample_query)

['▁i', 'v', '▁крос', 'сов', 'ок', '▁re', 'e', 'bok', '▁муж', 'ской', '▁answer']

In [54]:
encoded_input = tokenizer(sample_query, sample_title, return_tensors='pt')

In [55]:
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2,      2,  86856,  31126,  11271,  33681,
           2192,  21013, 100414,    456,     13,  12720,     81, 163406,   2947,
          30300,   5902,  57616,  11728,  78297,  29954,   5465,  12049,    415,
             17,    334, 234764,   9727,  86478, 204090,  38920,   2297,  35166,
          21246,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Let's look at the tokens with id = 0 and id = 2: they are the special beginning-of-sequence and end-of-sequence tokens.

In [57]:
tokenizer.decode([0, 2])

'<s></s>'

In [58]:
tokenizer.decode(tokenizer(sample_query, sample_title)['input_ids'])

'<s> iv кроссовок reebok мужской answer</s></s> россия бесплатный цена отзыв reebok v55619 мужской footbox купить stepover примерка iv артикул интернет доставка кроссовок answer магазин</s>'

In [59]:
encoded_input = tokenizer([sample_query, sample_title], return_tensors='pt', padding=True, truncation=True)

In [60]:
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1],
        [     0,  86856,  31126,  11271,  33681,   2192,  21013, 100414,    456,
             13,  12720,     81, 163406,   2947,  30300,   5902,  57616,  11728,
          78297,  29954,   5465,  12049,    415,     17,    334, 234764,   9727,
          86478, 204090,  38920,   2297,  35166,  21246,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [61]:
# padding token
tokenizer.decode([1])

'<pad>'
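
To make the role of the padding token and the attention mask concrete, here is a small pure-Python sketch of right-padding a batch. `pad_batch` and the toy id lists are hypothetical; the sketch only mimics the layout of the tokenizer output shown above (pad id 1, mask 0 on padded positions) and is not the tokenizer's actual implementation.

```python
PAD_ID = 1  # id of '<pad>' in the tokenizer output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


toy_ids, toy_mask = pad_batch([[0, 17, 334, 2], [0, 86856, 31126, 11271, 33681, 2]])
print(toy_ids)
print(toy_mask)
```

The mask tells the model which positions hold real tokens, so padded positions can be ignored by attention.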

We got an input with batch size BS=2. Let's feed it to the model and look at its output.

What does our model look like?

In [None]:
print(model)

In [73]:
model.encoder.layer[0]

XLMRobertaLayer(
  (attention): XLMRobertaAttention(
    (self): XLMRobertaSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): XLMRobertaSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): XLMRobertaIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
    (intermediate_act_fn): GELUActivation()
  )
  (output): XLMRobertaOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

In [74]:
# forward pass
output = model(**encoded_input)

In [75]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0910,  0.0780,  0.0613,  ..., -0.0556,  0.0469, -0.0158],
         [-0.0765, -0.0358, -0.0316,  ...,  0.3345,  0.0167, -0.1170],
         [-0.0198, -0.0243, -0.0061,  ...,  0.0950, -0.0217, -0.0441],
         ...,
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561]],

        [[ 0.0660,  0.0676,  0.0653,  ..., -0.0466,  0.0372,  0.0063],
         [-0.1194,  0.0219, -0.0149,  ...,  0.0499, -0.0863,  0.1584],
         [-0.0698,  0.0572,  0.0582,  ..., -0.0559, -0.0245,  0.0068],
         ...,
         [-0.1252, -0.0771,  0.0626,  ...,  0.0576, -0.0283,  0.0953],
         [-0.0365,  0.0628,  0.0632,  ..., -0.1079, -0.0211,  0.0636],
         [ 0.0506,  0.0556,  0.0053,  ..., -0.1313, -0.0339,  0.0464]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

In [76]:
# outputs of the last layer
output.last_hidden_state

tensor([[[ 0.0910,  0.0780,  0.0613,  ..., -0.0556,  0.0469, -0.0158],
         [-0.0765, -0.0358, -0.0316,  ...,  0.3345,  0.0167, -0.1170],
         [-0.0198, -0.0243, -0.0061,  ...,  0.0950, -0.0217, -0.0441],
         ...,
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561]],

        [[ 0.0660,  0.0676,  0.0653,  ..., -0.0466,  0.0372,  0.0063],
         [-0.1194,  0.0219, -0.0149,  ...,  0.0499, -0.0863,  0.1584],
         [-0.0698,  0.0572,  0.0582,  ..., -0.0559, -0.0245,  0.0068],
         ...,
         [-0.1252, -0.0771,  0.0626,  ...,  0.0576, -0.0283,  0.0953],
         [-0.0365,  0.0628,  0.0632,  ..., -0.1079, -0.0211,  0.0636],
         [ 0.0506,  0.0556,  0.0053,  ..., -0.1313, -0.0339,  0.0464]]],
       grad_fn=<NativeLayerNormBackward0>)

In [77]:
output.last_hidden_state.shape

torch.Size([2, 34, 768])

In [78]:
output.last_hidden_state[:, 0, :]

tensor([[ 0.0910,  0.0780,  0.0613,  ..., -0.0556,  0.0469, -0.0158],
        [ 0.0660,  0.0676,  0.0653,  ..., -0.0466,  0.0372,  0.0063]],
       grad_fn=<SliceBackward0>)

We got a tensor of shape (BS=2, seq_len=34, hidden_size=768).

Position 0 of every sequence holds the `<s>` token, which plays the role of BERT's `[CLS]`.

In [None]:
for name, par in model.named_parameters():
    print(name)

In [81]:
model.config

XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.44.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

### Cross-Encoder for ranking

Let's assemble a cross-encoder for ranking.

In [84]:
class RankBert(nn.Module):
    def __init__(self, train_layers_count=2):
        super(RankBert, self).__init__()

        self.bert = AutoModel.from_pretrained("xlm-roberta-base")
        self.config = self.bert.config

        # freeze all parameters except biases and LayerNorm
        for name, par in self.bert.named_parameters():
            if 'bias' in name or 'LayerNorm' in name:
                continue
            par.requires_grad = False

        # unfreeze some of the layers
        layer_count = self.config.num_hidden_layers
        for i in range(train_layers_count):
            for par in self.bert.encoder.layer[layer_count - 1 - i].parameters():
                par.requires_grad = True

        # map cls token embedding to relevance score
        self.head = nn.Linear(self.config.hidden_size, 1)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        x = self.bert(input_ids=input_ids,
                      token_type_ids=token_type_ids,
                      attention_mask=attention_mask
                      )[0][:, 0, :] #hidden_state of [CLS]
        x = self.head(x)
        return x

Here is what a Bi-Encoder version of the re-ranker would look like:

In [83]:
class BiRankBert(nn.Module):
    def __init__(self, emb_size=64, train_layers_count=2):
        super().__init__()

        self.bert = AutoModel.from_pretrained("xlm-roberta-base")
        self.config = self.bert.config

        # freeze all parameters except biases and LayerNorm
        for name, par in self.bert.named_parameters():
            if 'bias' in name or 'LayerNorm' in name:
                continue
            par.requires_grad = False

        # unfreeze some of the layers
        layer_count = self.config.num_hidden_layers
        for i in range(train_layers_count):
            for par in self.bert.encoder.layer[layer_count - 1 - i].parameters():
                par.requires_grad = True

        # map cls token emb to low dimension emb
        self.head = nn.Linear(self.config.hidden_size, emb_size)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        x = self.bert(input_ids=input_ids,
                      token_type_ids=token_type_ids,
                      attention_mask=attention_mask
                      )[0][:, 0, :] #hidden_state of [CLS]
        x = self.head(x)
        return x

But such a model needs its data in a different format:

- there must be two separate tensors, one with query tokens and one with document tokens
- Bi Encoder: preds = (model(q_tokens) * model(doc_tokens)).sum(-1)
- Cross Encoder: preds = model(q_doc_tokens)
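
The two scoring schemes can be sketched with a toy encoder. This is only an illustration under stated assumptions: `ToyEncoder` is a hypothetical stand-in for the transformer (mean of token embeddings plus a linear head), the token ids are random, and plain concatenation replaces real pair tokenization.

```python
import torch
from torch import nn

torch.manual_seed(0)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a transformer: mean token embedding + linear head."""

    def __init__(self, vocab_size=100, hidden=16, out_dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, input_ids):
        return self.head(self.emb(input_ids).mean(dim=1))


toy_q_tokens = torch.randint(0, 100, (4, 5))     # 4 queries, 5 tokens each
toy_doc_tokens = torch.randint(0, 100, (4, 12))  # 4 documents, 12 tokens each

# Bi-encoder: encode each side separately, score with a dot product.
bi_model = ToyEncoder(out_dim=8)
bi_preds = (bi_model(toy_q_tokens) * bi_model(toy_doc_tokens)).sum(-1)

# Cross-encoder: one pass over the joined pair, the head outputs the score directly.
cross_model = ToyEncoder(out_dim=1)
toy_q_doc_tokens = torch.cat([toy_q_tokens, toy_doc_tokens], dim=1)
cross_preds = cross_model(toy_q_doc_tokens).squeeze(-1)

print(bi_preds.shape, cross_preds.shape)
```

Both variants return one score per (query, document) pair. The bi-encoder allows document embeddings to be computed once and reused, while the cross-encoder has to process every pair jointly.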

### Preparing data for training

In [85]:
class RankDataset(Dataset):
    def __init__(self, data, neg_p=1.0):
        self.neg_p = neg_p
        if self.neg_p < 1.:
            self.data = pd.concat([data[data['label'] == 1],
                                   data[data['label'] == 0].sample(frac=self.neg_p)])
        else:
            self.data = data

    def __getitem__(self, index):
        query, text, label = self.data.iloc[index, [1, 2, 3]]

        return [query.lower(), text.lower()], label

    def __len__(self):
        return len(self.data)

In [87]:
dataset_train = RankDataset(train_data, neg_p=0.3)
dataset_valid = RankDataset(val_data, neg_p=1.)
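
A quick arithmetic sketch of what `neg_p=0.3` does to the class balance. It assumes that the ratio computed earlier on the whole dataset (`1.0 / data['label'].mean()`) also holds approximately for the training part; the variable names are introduced only for this illustration.

```python
docs_per_positive = 28.968417792647475  # 1 / data['label'].mean() from the earlier cell
neg_p = 0.3                              # fraction of negatives kept in dataset_train

pos_rate = 1.0 / docs_per_positive
neg_rate = 1.0 - pos_rate

# Share of positives once only neg_p of the negatives are kept.
pos_share = pos_rate / (pos_rate + neg_p * neg_rate)
negatives_per_positive = neg_p * neg_rate / pos_rate

print(f"positives before sampling: {pos_rate:.3f}")
print(f"positives after sampling:  {pos_share:.3f}")
print(f"negatives per positive:    {negatives_per_positive:.1f}")
```

Under this assumption, downsampling raises the share of positives from about 3.5% to roughly 10.6%, which also shortens an epoch accordingly.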

In [88]:
dataset_train[1]

(['where is your perineum',
  'that part of the floor of the pelvis that lies between the tops of the thighs. in the male, the perineum lies between the anus and the scrotum. in the female, it includes the external genitalia. the area between the opening of the vagina and the anus in a woman, or the area between the scrotum and the anus in a man.'],
 1.0)

In [89]:
def compose_batch(batch):
    texts = [x for x, _ in batch]
    ys = torch.tensor([y for _, y in batch]).reshape((-1, 1)).float()
    tokens = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='pt')
    return tokens, ys

In [90]:
train_dataloader = DataLoader(dataset_train, shuffle=True, batch_size=128, collate_fn=compose_batch, num_workers=2)
valid_dataloader = DataLoader(dataset_valid, shuffle=False, batch_size=128, collate_fn=compose_batch)

In [91]:
len(dataset_train), len(train_dataloader)

(3934724, 30741)

In [92]:
sample_batch = next(iter(train_dataloader))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [93]:
sample_batch[0]

{'input_ids': tensor([[     0,   2750,     83,  ...,    808,     27,      2],
        [     0,  72382,    111,  ...,      6, 199293,      2],
        [     0,   2367, 113660,  ...,      1,      1,      1],
        ...,
        [     0,   2367,  10644,  ...,   1651,     64,      2],
        [     0,   3642,   4989,  ...,  78574,      5,      2],
        [     0,   2367, 113660,  ...,  15991,    194,      2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}

Let's detokenize a few batch elements:

In [94]:
tokenizer.decode(sample_batch[0]['input_ids'][0])

'<s> who is burnside street named after in portland oregon</s></s> just like in portland, oregon, sioux falls wants you to #recycleright too and only put plastic bottles, tubs, buckets, and jugs in the recycling bin. most one t...</s>'

In [95]:
tokenizer.decode(sample_batch[0]['input_ids'][1])

'<s> benefits of drinking lemon water</s></s> tea preparation. most research on the medicinal benefits of tulsi were done with extracts, but you can receive benefits from drinking the tea, according to the university of maryland medical center. to prepare the tea, the authors of the way of ayurved</s>'

### Training setup

##### Initialize the model

In [96]:
model = RankBert(train_layers_count=2)

In [97]:
!nvidia-smi

Thu Aug 22 14:00:07 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   20C    P8     8W / 250W |    482MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      2MiB / 11264MiB |      0%      Default |
|       

In [None]:
# Move the model to the GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

In [99]:
!nvidia-smi

Thu Aug 22 14:00:28 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   26C    P2    54W / 250W |   1596MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      2MiB / 11264MiB |      0%      Default |
|       

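GPU 0 now shows about 1.6 GB in use, up from 482 MiB before the model was moved, so the model together with the CUDA context takes roughly 1.1 GB.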

##### Config and TensorBoard

In [106]:
!mkdir -p cross_encоder_checkpoint

In [101]:
class config:
    EPOCHS=1
    LR=1e-4
    WD=0.01
    SAVE_DIR="cross_encоder_checkpoint"
    SAVE_INTERVAL=1000
    BATCH_SIZE=64
    ACCUM_BS=1
    DEVICE=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    LOG_INTERVAL=250

writer = SummaryWriter('cross_encоder_checkpoint/ms_marco_v2')

loss_fn = nn.BCEWithLogitsLoss()
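
`nn.BCEWithLogitsLoss` applies binary cross-entropy to raw scores (logits), so the model head needs no sigmoid. Below is a pure-Python sketch of the per-example loss in a numerically stable form, checked against the naive formula. The helper names are hypothetical, and the sketch illustrates the math only; it makes no claim about how PyTorch computes the loss internally.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw score: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit: float, target: float) -> float:
    """Same loss written through an explicit sigmoid (less stable for large |x|)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


for logit, target in [(-0.46, 1.0), (-5.54, 0.0), (3.0, 1.0), (3.0, 0.0)]:
    print(f"logit={logit:+.2f} target={target:.0f} "
          f"stable={bce_with_logits(logit, target):.4f} naive={naive_bce(logit, target):.4f}")
```

A confident correct score (a large negative logit for a non-relevant passage) gives a loss near zero, while a confident wrong score is penalized roughly linearly in the logit.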

In [102]:
#!rm -rf cross_encоder_checkpoint

Docs on running TensorBoard with PyTorch: https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

To launch it for our example, run the following command in a terminal inside the environment:

In [None]:
# tensorboard --logdir=./cross_encоder_checkpoint/ --port=9999
# ... and open the link it prints

##### Scheduler and optimizer

In [103]:
optimizer = torch.optim.AdamW(model.parameters(), lr=config.LR, weight_decay=config.WD)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                pct_start=0.1,
                                                max_lr=config.LR,
                                                epochs=config.EPOCHS,
                                                steps_per_epoch=len(train_dataloader))

Optimization in torch: https://pytorch.org/docs/stable/optim.html

The Adam optimizer in torch (the code above uses its AdamW variant): https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

The OneCycle scheduler: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html#torch.optim.lr_scheduler.OneCycleLR
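
To see the shape of the schedule, one can trace the learning rate on a dummy parameter. This is a sketch under assumptions: `total_steps=100` is a toy value (the notebook uses `epochs * len(train_dataloader)`), all other `OneCycleLR` settings are left at the library defaults, and the `toy_*` names are chosen so as not to overwrite the real `optimizer` and `scheduler`.

```python
import torch

toy_param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, only to build an optimizer
toy_optimizer = torch.optim.AdamW([toy_param], lr=1e-4, weight_decay=0.01)
toy_total_steps = 100                            # toy value, see the assumptions above
toy_scheduler = torch.optim.lr_scheduler.OneCycleLR(
    toy_optimizer, max_lr=1e-4, pct_start=0.1, total_steps=toy_total_steps
)

lrs = []
for _ in range(toy_total_steps):
    lrs.append(toy_scheduler.get_last_lr()[0])
    if len(lrs) < toy_total_steps:
        toy_optimizer.step()   # no gradients here, so the parameter stays unchanged
        toy_scheduler.step()

peak_step = max(range(toy_total_steps), key=lrs.__getitem__)
print(f"start={lrs[0]:.2e} peak={lrs[peak_step]:.2e} at step {peak_step} end={lrs[-1]:.2e}")
```

The learning rate warms up to `max_lr` during roughly the first `pct_start` share of the steps and then decays for the rest of training.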

##### Training loops

In [105]:
# The model instance and the data must be on the same device for the forward / backward pass.
def move_batch_to_device(batch, device):
    batch_x, y = batch
    for key in batch_x:
        batch_x[key] = batch_x[key].to(device)
    y = y.to(device)
    return batch_x, y

In [108]:
import gc

def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    running_auc = 0.
    last_loss = 0.
    last_auc = 0.

    device = config.DEVICE
    # Here, we use enumerate(training_loader) instead of
    # iter(training_loader) so that we can track the batch
    # index and do some intra-epoch reporting
    for i, batch in enumerate(train_dataloader):
        # Every data instance is an input + label pair
        batch_x, y = move_batch_to_device(batch, device)

        # Zero your gradients for every batch!
        optimizer.zero_grad()

        # Make predictions for this batch
        outputs = model(**batch_x)

        # Compute the loss and its gradients
        loss = loss_fn(outputs, y)
        loss.backward()

        # Adjust learning weights
        optimizer.step()
        scheduler.step()

        # Gather data and report
        running_loss += loss.item()

        y = y.cpu().int().numpy()
        if 0 < y.sum() < len(y):
            # compute metric; detach() is required before converting to numpy
            auc = roc_auc_score(y,
                                outputs.detach().cpu().numpy(),
                                labels=np.array([0, 1]))
        else:
            # AUC is undefined when the batch holds a single class
            auc = 1.0
        running_auc += auc

        #logging to tb
        tb_x = epoch_index * len(train_dataloader) + i + 1
        tb_writer.add_scalar('lr', scheduler.get_last_lr()[0], tb_x)
        tb_writer.add_scalar('Train/auc', auc, tb_x)
        tb_writer.add_scalar('Train/loss', loss.item(), tb_x)

        if i % config.LOG_INTERVAL == config.LOG_INTERVAL - 1:
            last_loss = running_loss / config.LOG_INTERVAL # loss per batch
            last_auc = running_auc / config.LOG_INTERVAL # AUC per batch
            print('  batch {} loss: {}, auc: {}'.format(i + 1, last_loss, last_auc))

            tb_writer.add_scalar('Train/running_loss', last_loss, tb_x)
            tb_writer.add_scalar('Train/running_auc', last_auc, tb_x)
            running_loss = 0.
            running_auc = 0.

        if i % 10 == 0: #clean up memory
            gc.collect()
            torch.cuda.empty_cache()

    return last_loss, last_auc

In [None]:
# Training can be stopped after batch 1500

epoch_number = 0

best_vloss = 1_000_000.

for epoch in range(config.EPOCHS):
    print('EPOCH {}:'.format(epoch_number + 1))

    # Make sure gradient tracking is on, and do a pass over the data
    model.train(True)
    avg_loss, avg_auc = train_one_epoch(epoch_number, writer)

    running_vloss = 0.0
    running_vauc = 0.0
    # Set the model to evaluation mode, disabling dropout and using population
    # statistics for batch normalization.
    model.eval()

    # Disable gradient computation and reduce memory consumption.
    with torch.no_grad():
        preds = []
        for i, batch in enumerate(valid_dataloader):
            batch_x, y = move_batch_to_device(batch, config.DEVICE)
            voutputs = model(**batch_x)
            vloss = loss_fn(voutputs, y)
            running_vloss += vloss.item()

            y = y.cpu().int().numpy()
            if 0 < y.sum() < len(y):
                # compute metric (we are already inside torch.no_grad())
                auc = roc_auc_score(y,
                                    voutputs.cpu().numpy(),
                                    labels=np.array([0, 1]))
                running_vauc += auc
            else:
                running_vauc += 1

            preds.append(voutputs)

    #compute valid mrr
    preds = torch.cat(preds).view(-1).cpu().numpy()  # MRR() expects np.ndarray
    val_mrr = MRR(preds, val_data['label'].values, val_data['qid'].values)

    avg_vloss = running_vloss / (i + 1)
    avg_vauc = running_vauc / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))
    print('AUC train {} valid {}'.format(avg_auc, avg_vauc))
    print('MRR valid {}'.format(val_mrr))

    writer.add_scalar('Valid/mrr', val_mrr, epoch)
    writer.add_scalar('Valid/loss', avg_vloss, epoch)
    writer.add_scalar('Valid/auc', avg_vauc, epoch)

    # Track best performance, and save the model's state
    if avg_vloss < best_vloss:
        best_vloss = avg_vloss

        checkpoint = {'epoch': epoch, 'model_state_dict': model.state_dict(),
                      'optimizer_state_dict': optimizer.state_dict(),
                      'scheduler_state_dict': scheduler.state_dict(),
                      'best_vloss': best_vloss}

        torch.save(checkpoint, f'{config.SAVE_DIR}/ckpt_epoch_{epoch}_loss{best_vloss}.pt')

    epoch_number += 1

### Inference

In [110]:
test_dataset = RankDataset(test_data, neg_p=1.)

In [111]:
test_dataloader = DataLoader(test_dataset, shuffle=False, batch_size=1024, collate_fn=compose_batch)

In [None]:
def get_test_perds(model):
    model.eval() #eval mode
    y_test = []

    for i, batch in enumerate(tqdm(test_dataloader, position=0, leave=True, desc=f"Iteration: {'test'}")): # iterate over batches
        batch_x, y = move_batch_to_device(batch, config.DEVICE)
        with torch.no_grad():
            preds = model(**batch_x)
            y_test += [preds]

    y_test = torch.cat(y_test).view(-1).cpu().numpy()
    return y_test

y_test = get_test_perds(model.float())

In [114]:
y_test[:5]

array([-0.4610981, -5.541357 , -6.0795994, -4.716203 , -3.0710356],
      dtype=float32)

In [115]:
print(roc_auc_score(test_dataset.data['label'].values, y_test, labels=np.array([0, 1])))

0.9388990054149596


In [116]:
# make model float16 precision
y_test = get_test_perds(model.half())

Iteration: test:   0%|          | 0/91 [00:00<?, ?it/s]

In [117]:
y_test[:5]

array([-0.4373, -5.54  , -6.082 , -4.715 , -3.066 ], dtype=float16)
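
To judge how much half precision changes the predictions, we can compare the first five scores printed above. A pure-Python sketch on those printed values only; it says nothing about the rest of the test set.

```python
fp32_scores = [-0.4610981, -5.541357, -6.0795994, -4.716203, -3.0710356]  # float32 output above
fp16_scores = [-0.4373, -5.54, -6.082, -4.715, -3.066]                    # float16 output above

max_abs_diff = max(abs(a - b) for a, b in zip(fp32_scores, fp16_scores))


def ranking(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


print(f"max abs diff: {max_abs_diff:.4f}")
print("same order:", ranking(fp32_scores) == ranking(fp16_scores))
```

For ranking, only the order of the scores within a query matters, so small absolute shifts like these are harmless as long as the order is preserved.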

### Computing MRR

In [118]:
test_data = test_data.assign(pred=np.array(y_test))


In [119]:
test_data.head()

Unnamed: 0,qid,query,doc,label,query_tokens,doc_tokens,pred
12230454,397001,highest point in kentucky,"Harlan County, Kentucky, U.S. Black Mountain i...",1.0,"['highest', 'point', 'in', 'kentucky']","['harlan', 'county', ',', 'kentucky', ',', 'u'...",-0.437256
12230455,397001,highest point in kentucky,"It is here, where the land is dominated by the...",0.0,"['highest', 'point', 'in', 'kentucky']","['it', 'is', 'here', ',', 'where', 'the', 'lan...",-5.539062
12230456,397001,highest point in kentucky,KENTUCKY ANCESTORS. GENEALOGICAL QUARTERLY. OF...,0.0,"['highest', 'point', 'in', 'kentucky']","['kentucky', 'ancestors', '.', 'genealogical',...",-6.082031
12230457,397001,highest point in kentucky,The outcome of the political struggle in Kentu...,0.0,"['highest', 'point', 'in', 'kentucky']","['the', 'outcome', 'of', 'the', 'political', '...",-4.714844
12230458,397001,highest point in kentucky,"As a division of Circuit Court, which is the h...",0.0,"['highest', 'point', 'in', 'kentucky']","['as', 'a', 'division', 'of', 'circuit', 'cour...",-3.066406


In [125]:
# Cross-Encoder Rank Bert
results["xlm-roberta-base"] = MRR(y_test, test_data['label'].values, test_data['qid'].values)

In [128]:
for name, value in sorted(results.items(), key=lambda x: -x[1]):
    print(f'{value:.5f}\t', name)

0.77125	 xlm-roberta-base
0.60042	 bm25
0.24425	 fasttext
0.22002	 word2vec
0.10755	 random


Hooray, we got a solution that beats our baseline!

How to improve the solution:
- try pairwise / listwise losses
- unfreeze a larger part of the network
- train longer / on more data
- try other pretrained checkpoints (a valid option for English)
- optimize training speed (so that more data can be processed in the same time)
- extend the context / add new text fields

__What else to look at:__
- How to speed up training with mixed precision
    - https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
    - https://pytorch.org/docs/stable/notes/amp_examples.html
- How to train models on several GPUs (usable, for example, on Kaggle)
    - https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
    - https://pytorch.org/docs/stable/notes/ddp.html
- Libraries for more convenient network training
    - https://github.com/Lightning-AI/pytorch-lightning (general case)
    - https://huggingface.co/docs/transformers/main/en/trainer (transformers)
    - https://huggingface.co/docs/transformers/main_classes/trainer