# Seminar: Neural Network Rerankers

How to install PyTorch: https://pytorch.org/get-started/locally/

In [8]:
#!pip install transformers

In [None]:
!pip show torch
!pip show transformers

### Torch: quick guide

https://kharshit.github.io/blog/2021/12/03/pytorch-basics-tutorial

In [1]:
import torch


libgomp: Invalid value for environment variable OMP_NUM_THREADS


The core of torch is torch.Tensor (an analogue of np.array), which supports running on CPU / GPU and automatic differentiation (autograd).

In [2]:
# C, H, W
a = torch.Tensor(size=(3, 28, 28))
print(a.dtype, a.type(), a.shape, a.device)
# a.reshape() is an alternative to a.view()
print(a.view(-1, 56).shape)

torch.float32 torch.FloatTensor torch.Size([3, 28, 28]) cpu
torch.Size([42, 56])


We can convert NumPy arrays to torch tensors and back

In [3]:
b = torch.tensor([[1, 1], [1, 1]])
# tensor -> np array
b = b.numpy()
print(type(b))
# np array -> tensor
b = torch.tensor(b)  # torch.from_numpy(b)
print(type(b))

<class 'numpy.ndarray'>
<class 'torch.Tensor'>
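The cell above mentions both `torch.tensor(b)` and `torch.from_numpy(b)`. As a minimal sketch of how they differ (the array `arr` is made up for illustration): `torch.from_numpy` shares memory with the NumPy array, while `torch.tensor` copies the data.

```python
import numpy as np
import torch

arr = np.ones((2, 2), dtype=np.float32)

shared = torch.from_numpy(arr)   # shares memory with `arr`
copied = torch.tensor(arr)       # makes an independent copy

arr[0, 0] = 5.0                  # modify the NumPy array in place

print(shared[0, 0].item())  # 5.0: the change is visible through the shared tensor
print(copied[0, 0].item())  # 1.0: the copy is unaffected
```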


We can move tensors to the GPU and run any operations there

In [4]:
# check if CUDA available
print(torch.cuda.is_available())
# check if tensor on GPU
print(b.is_cuda)
# move tensor to GPU
print(b.cuda()) # defaults to gpu:0 # or b.to('cuda')
# move tensor to CPU
print(b.cpu()) # or b.to('cpu')
# check tensor device
print(b.device)

True
False


tensor([[1, 1],
        [1, 1]], device='cuda:0')
tensor([[1, 1],
        [1, 1]])
cpu
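Note that `b.device` still prints `cpu` above: `.cuda()` and `.to(...)` return a new tensor and leave the original where it was. A small device-agnostic sketch (the names `t`, `t_dev` and `result` are illustrative), which also runs on a machine without a GPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.ones(2, 2)            # created on the CPU
t_dev = t.to(device)            # returns a tensor on `device`; `t` itself is not moved

print(t.device, t_dev.device)
result = (t_dev @ t_dev).cpu()  # compute on `device`, then bring the result back to the CPU
```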


Autograd

In [5]:
x = torch.randn(2,2, requires_grad=True)
y = x**2
#y.retain_grad()  # retain gradient
# each tensor has a .grad_fn attribute that references a Function that created it
print(f'y.grad_fn: {y.grad_fn}')
z = y.mean()

print(f'x.grad: {x.grad}')
z.backward()
print(f'x.grad: {x.grad}\n\
x/2: {x/2}\n\
y.grad: {y.grad}')  # dz/dy

y.grad_fn: <PowBackward0 object at 0x7fccec2c5ac0>
x.grad: None
x.grad: tensor([[ 3.4375e-01, -4.1741e-04],
        [ 7.8012e-01,  1.5559e-01]])
x/2: tensor([[ 3.4375e-01, -4.1741e-04],
        [ 7.8012e-01,  1.5559e-01]], grad_fn=<DivBackward0>)
y.grad: None


  return self._grad
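In the output above `x.grad` coincides with `x/2`, and `y.grad` is `None` because `y` is not a leaf tensor. A short sketch that checks both facts, assuming we enable `retain_grad()` on `y` (the fixed seed is only for reproducibility):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 2, requires_grad=True)
y = x ** 2
y.retain_grad()          # keep the gradient of the non-leaf tensor y
z = y.mean()
z.backward()

# z = mean(x^2) over 4 elements, so dz/dx = 2x / 4 = x / 2 and dz/dy = 1 / 4
grad_x_ok = torch.allclose(x.grad, x.detach() / 2)
grad_y_ok = torch.allclose(y.grad, torch.full_like(y, 0.25))
print(grad_x_ok, grad_y_ok)
```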


Disabling gradient tracking

In [6]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

print(x.requires_grad)
y = x.detach()
# best way to copy a tensor
# y = x.detach().clone()
print(y.requires_grad)

True
True
False
True
False
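The comment in the cell recommends `x.detach().clone()` for copying. A minimal sketch of why, using a made-up tensor: `detach()` alone returns a tensor that still shares storage with the original.

```python
import torch

x = torch.ones(3, requires_grad=True)

view = x.detach()           # no grad tracking, but shares storage with x
copy = x.detach().clone()   # no grad tracking and independent storage

view[0] = 10.0              # also changes x, because the storage is shared

print(x[0].item(), copy[0].item())
```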


### Data preparation

In [75]:
import pandas as pd
import numpy as np
import tqdm
import os

np.random.seed(42)

We will solve the task of reranking text passages for queries.

Each query comes with a set of relevant and non-relevant passages.

The goal is to rank the passages for a query so that the relevant passage is placed above the non-relevant ones.

On average, each query has about 1 relevant and ~30 non-relevant passages.

Dataset: https://huggingface.co/datasets/Tevatron/msmarco-passage

In [76]:
from datasets import load_dataset

msmarco_dataset = load_dataset("Tevatron/msmarco-passage", cache_dir="/home/jovyan/ndermolaev/lectures/cache")

Found cached dataset msmarco-passage (/home/jovyan/ndermolaev/lectures/cache/Tevatron___msmarco-passage/default/0.0.1/300947ae554083632b487251f17ce2100425fd1135048532fb20afa1d66e9e62)


  0%|          | 0/4 [00:00<?, ?it/s]

In [77]:
def dataset_pandas(dataset):
    rows = []
    for i, row in enumerate(dataset):
        current_row = []
        for pos_sample in row['positive_passages']:
            current_row = []
            current_row.append(i) # qid
            current_row.append(row['query']) # query
            current_row.append(pos_sample['text']) # text
            current_row.append(1.) # label
            rows.append(current_row)

        for neg_sample in row['negative_passages']:
            current_row = []
            current_row.append(i) # qid
            current_row.append(row['query']) # query
            current_row.append(neg_sample['text']) # text
            current_row.append(0.) # label
            rows.append(current_row)
    print(len(rows))

    return pd.DataFrame(rows, columns=['qid', 'query', 'text', 'label'])

In [78]:
data = dataset_pandas(msmarco_dataset['train'])

12346948


In [79]:
data

Unnamed: 0,qid,query,text,label
0,0,where is whitemarsh island,"Whitemarsh Island, Georgia. Whitemarsh Island ...",1.0
1,0,where is whitemarsh island,the strategy of island hopping was used by the...,0.0
2,0,where is whitemarsh island,"For the island near Dunedin, see White Island,...",0.0
3,0,where is whitemarsh island,"Jekyll Island, at 5,700 acres, is the smallest...",0.0
4,0,where is whitemarsh island,Sibu Island. A scuba diver at Sibu Island. Sib...,0.0
...,...,...,...,...
12346943,400781,where is vernon tx,of the Lamar County. Chamber of Commerce Stree...,0.0
12346944,400781,where is vernon tx,"Honda Civic near Midland, TX; Honda Civic in L...",0.0
12346945,400781,where is vernon tx,"Driving distance from Dallas, TX to Houston, T...",0.0
12346946,400781,where is vernon tx,"Distance to cities nearby McKinney, TX and Pla...",0.0


Split the dataset into train / val / test. The split is grouped by session (query), so all passages of one query end up in the same part.

In [14]:
TEST_SIZE=3_000
test_data = data[(400_000  - TEST_SIZE < data['qid']) & (data['qid'] <= 400_000)]
val_data = data[(400_000 - 2 * TEST_SIZE < data['qid']) & (data['qid'] <= 400_000 - TEST_SIZE)]
train_data = data[data['qid'] <= 400_000 - 2 * TEST_SIZE]
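A grouped split means that no query id is shared between the parts. A toy sketch of that check (the DataFrame `toy` and its thresholds are hypothetical, not the seminar data):

```python
import pandas as pd

toy = pd.DataFrame({
    "qid":   [0, 0, 1, 1, 2, 2, 3, 3],
    "label": [1., 0., 1., 0., 1., 0., 1., 0.],
})

# hypothetical thresholds for the toy data: qid 3 -> test, qid 2 -> val, the rest -> train
toy_test = toy[toy["qid"] == 3]
toy_val = toy[toy["qid"] == 2]
toy_train = toy[toy["qid"] <= 1]

splits = [set(part["qid"]) for part in (toy_train, toy_val, toy_test)]
# no query id may appear in more than one split
no_overlap = all(splits[i].isdisjoint(splits[j])
                 for i in range(3) for j in range(i + 1, 3))
print(no_overlap)
```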

In [15]:
test_data

Unnamed: 0,qid,query,text,label
12230454,397001,highest point in kentucky,"Harlan County, Kentucky, U.S. Black Mountain i...",1.0
12230455,397001,highest point in kentucky,"It is here, where the land is dominated by the...",0.0
12230456,397001,highest point in kentucky,KENTUCKY ANCESTORS. GENEALOGICAL QUARTERLY. OF...,0.0
12230457,397001,highest point in kentucky,The outcome of the political struggle in Kentu...,0.0
12230458,397001,highest point in kentucky,"As a division of Circuit Court, which is the h...",0.0
...,...,...,...,...
12323107,400000,what does inclement weather mean,When there is heavy snow or when inclement wea...,0.0
12323108,400000,what does inclement weather mean,The SUV also offers two versions of its Quadra...,0.0
12323109,400000,what does inclement weather mean,"Todayâs and tonightâs Evansville, IN weath...",0.0
12323110,400000,what does inclement weather mean,"(of the weather, the elements, etc.) severe, r...",0.0


In [16]:
train_data

Unnamed: 0,qid,query,text,label
0,0,where is whitemarsh island,"Whitemarsh Island, Georgia. Whitemarsh Island ...",1.0
1,0,where is whitemarsh island,the strategy of island hopping was used by the...,0.0
2,0,where is whitemarsh island,"For the island near Dunedin, see White Island,...",0.0
3,0,where is whitemarsh island,"Jekyll Island, at 5,700 acres, is the smallest...",0.0
4,0,where is whitemarsh island,Sibu Island. A scuba diver at Sibu Island. Sib...,0.0
...,...,...,...,...
12138018,394000,can your ejection fraction be improved,Many different heart and vascular diseases can...,0.0
12138019,394000,can your ejection fraction be improved,1 Make sure the decimal point is in the right ...,0.0
12138020,394000,can your ejection fraction be improved,Ensure vs CCK: A HIDA scan is used to assess b...,0.0
12138021,394000,can your ejection fraction be improved,Not Everyone With an Ejection Fraction â¤30% ...,0.0


In [17]:
train_data['label'].mean()

0.03452168446212369

So on average there are about 28 negatives per positive: (1 - 0.0345) / 0.0345 ≈ 28.

# Baselines


To evaluate ranking quality we will use the MRR (mean reciprocal rank) metric.

What MRR is: https://www.evidentlyai.com/ranking-metrics/mean-reciprocal-rank-mrr

$ |Q| $ is the number of queries in the sample

$ MRR = \frac{1}{|Q|} \sum_{q_i \in Q} \frac{1}{rank_{i}} $

$ rank_i $ is the position of the first relevant document for query $q_i$
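To make the definition concrete, here is a minimal NumPy sketch of MRR. The function name `mrr_by_hand` and the toy inputs are illustrative, and no top-k cutoff is applied here.

```python
import numpy as np

def mrr_by_hand(preds, target, qids):
    """Mean reciprocal rank of the first relevant document per query."""
    preds, target, qids = map(np.asarray, (preds, target, qids))
    reciprocal_ranks = []
    for q in np.unique(qids):
        mask = qids == q
        order = np.argsort(-preds[mask])          # best score first
        ranked_labels = target[mask][order]
        hits = np.flatnonzero(ranked_labels > 0)  # positions of relevant docs
        # queries with no relevant document contribute 0 in this sketch
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(reciprocal_ranks))

# query 0: the relevant doc is ranked 1st; query 1: the relevant doc is ranked 2nd
score = mrr_by_hand(preds=[0.9, 0.1, 0.8, 0.7], target=[1, 0, 0, 1], qids=[0, 0, 1, 1])
print(score)  # (1/1 + 1/2) / 2 = 0.75
```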

In [18]:
import torch
from torchmetrics.retrieval import RetrievalMRR

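def MRR(preds, target, qids):
    # MRR@10: only the top 10 documents of each query are considered
    mrr = RetrievalMRR(top_k=10)

    return mrr(torch.Tensor(preds), 
               torch.Tensor(target), 
               indexes=torch.LongTensor(qids - min(qids)))  # shift qids so that they start at 0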

Random Baseline

In [19]:
MRR(np.random.random(len(test_data)), test_data['label'].values, test_data['qid'].values)

tensor(0.1334)

##### BM25 baseline

In [20]:
#!pip install rank-bm25

In [21]:
from rank_bm25 import BM25Okapi

corpus = test_data['text'].values
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

In [22]:
def get_bm25_scores():
    queries = test_data['query'].unique()
    bm25_preds = np.zeros(len(test_data))
    for q in tqdm.tqdm(queries):
        tokenized_query = q.split(" ")
        doc_scores = bm25.get_scores(tokenized_query)
        mask = test_data['query'] == q
        bm25_preds[mask] = doc_scores[mask]
    return bm25_preds

In [23]:
bm25_preds = get_bm25_scores()

100%|██████████| 3000/3000 [06:20<00:00,  7.88it/s]


In [24]:
bm25_preds

array([ 8.55475333, 12.53993883,  4.0933996 , ...,  8.82491291,
        0.        ,  0.        ])

In [25]:
MRR(bm25_preds, test_data['label'].values, test_data['qid'].values)

tensor(0.3118)

BM25 beats the random baseline by about 0.18 MRR@10 (0.31 vs 0.13)!

## Working with the transformers library

In [31]:
import torch
import transformers
from torch import nn
from torch.utils import data  # note: this rebinds the name `data`, which held the DataFrame above
from torch.utils.data import Dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel, XLMRobertaTokenizer
from sklearn.metrics import roc_auc_score
# PyTorch TensorBoard support
from torch.utils.tensorboard import SummaryWriter

We take xlm-roberta-base as the pretrained model: https://huggingface.co/FacebookAI/xlm-roberta-base

In [32]:
sample_query = "IV КРОССОВОК REEBOK МУЖСКОЙ ANSWER".lower()
sample_title = "РОССИЯ БЕСПЛАТНЫЙ ЦЕНА ОТЗЫВ REEBOK V55619 МУЖСКОЙ FOOTBOX КУПИТЬ STEPOVER ПРИМЕРКА IV АРТИКУЛ ИНТЕРНЕТ ДОСТАВКА КРОССОВОК ANSWER МАГАЗИН".lower()

In [33]:
sample_query, sample_title

('iv кроссовок reebok мужской answer',
 'россия бесплатный цена отзыв reebok v55619 мужской footbox купить stepover примерка iv артикул интернет доставка кроссовок answer магазин')

Load the model and the tokenizer

In [29]:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModel.from_pretrained("xlm-roberta-base")

In [34]:
# prepare input
text = sample_query
encoded_input = tokenizer(text, return_tensors='pt')

In [35]:
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [36]:
tokenizer.tokenize(text) # split the text into tokens (subwords)

['▁i', 'v', '▁крос', 'сов', 'ок', '▁re', 'e', 'bok', '▁муж', 'ской', '▁answer']

In [37]:
encoded_input = tokenizer(sample_query, sample_title, return_tensors='pt')

In [38]:
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2,      2,  86856,  31126,  11271,  33681,
           2192,  21013, 100414,    456,     13,  12720,     81, 163406,   2947,
          30300,   5902,  57616,  11728,  78297,  29954,   5465,  12049,    415,
             17,    334, 234764,   9727,  86478, 204090,  38920,   2297,  35166,
          21246,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Let's see what tokens 0 and 2 are

In [39]:
tokenizer.decode([0, 2]) # these are the special beginning-of-sequence and end-of-sequence tokens

'<s></s>'

In [40]:
tokenizer.decode(tokenizer(sample_query, sample_title)['input_ids'])

'<s> iv кроссовок reebok мужской answer</s></s> россия бесплатный цена отзыв reebok v55619 мужской footbox купить stepover примерка iv артикул интернет доставка кроссовок answer магазин</s>'

In [41]:
encoded_input = tokenizer([sample_query, sample_title], return_tensors='pt', padding=True, truncation=True)

In [42]:
encoded_input

{'input_ids': tensor([[     0,     17,    334, 204090,  38920,   2297,    456,     13,  12720,
          30300,   5902,  35166,      2,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1],
        [     0,  86856,  31126,  11271,  33681,   2192,  21013, 100414,    456,
             13,  12720,     81, 163406,   2947,  30300,   5902,  57616,  11728,
          78297,  29954,   5465,  12049,    415,     17,    334, 234764,   9727,
          86478, 204090,  38920,   2297,  35166,  21246,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [43]:
tokenizer.decode([1]) # padding token

'<pad>'
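What `padding=True` does to a batch can be sketched in plain Python. The helper `pad_batch` and the toy ids are hypothetical; the real tokenizer does more (truncation, tensors), but the padding ids and the attention mask follow the same idea.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# toy ids (not real tokenizer output): 0 = <s>, 2 = </s>, 1 = <pad>
ids, mask = pad_batch([[0, 17, 2], [0, 17, 334, 456, 2]])
print(ids)   # [[0, 17, 2, 1, 1], [0, 17, 334, 456, 2]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```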

We got an input with batch size BS=2. Let's feed it to the model and look at its output.

What does our model look like?

In [44]:
print(model)

XLMRobertaModel(
  (embeddings): XLMRobertaEmbeddings(
    (word_embeddings): Embedding(250002, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): XLMRobertaEncoder(
    (layer): ModuleList(
      (0): XLMRobertaLayer(
        (attention): XLMRobertaAttention(
          (self): XLMRobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): XLMRobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
     

In [45]:
# forward pass
output = model(**encoded_input)

In [46]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0910,  0.0780,  0.0613,  ..., -0.0556,  0.0469, -0.0158],
         [-0.0765, -0.0358, -0.0316,  ...,  0.3345,  0.0167, -0.1170],
         [-0.0198, -0.0243, -0.0061,  ...,  0.0950, -0.0217, -0.0441],
         ...,
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561]],

        [[ 0.0660,  0.0676,  0.0653,  ..., -0.0466,  0.0372,  0.0063],
         [-0.1194,  0.0219, -0.0149,  ...,  0.0499, -0.0863,  0.1584],
         [-0.0698,  0.0572,  0.0582,  ..., -0.0559, -0.0245,  0.0068],
         ...,
         [-0.1252, -0.0771,  0.0626,  ...,  0.0576, -0.0283,  0.0953],
         [-0.0365,  0.0628,  0.0632,  ..., -0.1079, -0.0211,  0.0636],
         [ 0.0506,  0.0556,  0.0053,  ..., -0.1313, -0.0339,  0.0464]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

In [47]:
output.last_hidden_state # outputs of the last layer

tensor([[[ 0.0910,  0.0780,  0.0613,  ..., -0.0556,  0.0469, -0.0158],
         [-0.0765, -0.0358, -0.0316,  ...,  0.3345,  0.0167, -0.1170],
         [-0.0198, -0.0243, -0.0061,  ...,  0.0950, -0.0217, -0.0441],
         ...,
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561],
         [ 0.0318,  0.0126,  0.0411,  ..., -0.0326, -0.0391, -0.0561]],

        [[ 0.0660,  0.0676,  0.0653,  ..., -0.0466,  0.0372,  0.0063],
         [-0.1194,  0.0219, -0.0149,  ...,  0.0499, -0.0863,  0.1584],
         [-0.0698,  0.0572,  0.0582,  ..., -0.0559, -0.0245,  0.0068],
         ...,
         [-0.1252, -0.0771,  0.0626,  ...,  0.0576, -0.0283,  0.0953],
         [-0.0365,  0.0628,  0.0632,  ..., -0.1079, -0.0211,  0.0636],
         [ 0.0506,  0.0556,  0.0053,  ..., -0.1313, -0.0339,  0.0464]]],
       grad_fn=<NativeLayerNormBackward0>)

In [48]:
output.last_hidden_state.shape

torch.Size([2, 34, 768])

In [49]:
# output.last_hidden_state[:, 0, :]

We got a tensor with dimensions (BS=2, seq_len=34, hidden_size=768)

Position 0 along the seq_len axis of every sequence holds the \<s\> token (the XLM-R counterpart of [CLS])
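A sketch of how a single vector per sequence can be taken from such a tensor. A random tensor stands in for `last_hidden_state`, and the masked mean pooling is shown only as a common alternative, not something this notebook uses.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)                 # stand-in for last_hidden_state: (BS, seq_len, hidden)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

cls_emb = hidden[:, 0, :]                     # vector at position 0 (<s>) of each sequence

mask = attention_mask.unsqueeze(-1).float()   # (BS, seq_len, 1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average over real tokens only

print(cls_emb.shape, mean_emb.shape)
```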

In [50]:
for name, par in model.named_parameters():
    print(name)

embeddings.word_embeddings.weight
embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight
embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias
encoder.layer.0.attention.output.dense.weight
encoder.layer.0.attention.output.dense.bias
encoder.layer.0.attention.output.LayerNorm.weight
encoder.layer.0.attention.output.LayerNorm.bias
encoder.layer.0.intermediate.dense.weight
encoder.layer.0.intermediate.dense.bias
encoder.layer.0.output.dense.weight
encoder.layer.0.output.dense.bias
encoder.layer.0.output.LayerNorm.weight
encoder.layer.0.output.LayerNorm.bias
encoder.layer.1.attention.self.query.weight
encoder.layer.1.attention.self.query.bias
encoder.layer.1.attention.self.key.weight
encoder.layer.1.attention.self.key

In [51]:
model.config

XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

## Cross-Encoder for ranking

Let's build a cross-encoder for ranking

In [52]:
class RankBert(nn.Module):
    def __init__(self, train_layers_count=2):
        super().__init__()

        self.bert = AutoModel.from_pretrained("xlm-roberta-base")
        self.config = self.bert.config

        # freeze all parameters except biases and LayerNorm weights
        for name, par in self.bert.named_parameters():
            if 'bias' in name or 'LayerNorm' in name:
                continue
            par.requires_grad = False

        layer_count = self.config.num_hidden_layers
        for i in range(train_layers_count):  # fully unfreeze the last few encoder layers
            for par in self.bert.encoder.layer[layer_count - 1 - i].parameters():
                par.requires_grad = True

        # map the <s> ([CLS]) token embedding to a relevance score
        self.head = nn.Linear(self.config.hidden_size, 1)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        x = self.bert(input_ids=input_ids,
                      token_type_ids=token_type_ids,
                      attention_mask=attention_mask
                      )[0][:, 0, :]  # last hidden state of the <s> ([CLS]) token
        x = self.head(x)
        return x
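The freezing scheme used in the class can be checked on a toy module without downloading any weights. `ToyLayer` and `ToyEncoder` are hypothetical stand-ins whose parameter names merely resemble those of the real encoder.

```python
from torch import nn

class ToyLayer(nn.Module):
    """Hypothetical stand-in for one transformer layer."""
    def __init__(self, dim=4):
        super().__init__()
        self.dense = nn.Linear(dim, dim)
        self.LayerNorm = nn.LayerNorm(dim)

class ToyEncoder(nn.Module):
    def __init__(self, n_layers=3):
        super().__init__()
        self.layer = nn.ModuleList([ToyLayer() for _ in range(n_layers)])

encoder = ToyEncoder()

# freeze everything except biases and LayerNorm parameters
for name, par in encoder.named_parameters():
    if "bias" in name or "LayerNorm" in name:
        continue
    par.requires_grad = False

# fully unfreeze the last layer
for par in encoder.layer[-1].parameters():
    par.requires_grad = True

trainable = [name for name, par in encoder.named_parameters() if par.requires_grad]
frozen = [name for name, par in encoder.named_parameters() if not par.requires_grad]
print(frozen)  # only the dense weights of the first two layers stay frozen
```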

Now all that is left is to prepare the data properly

#### What a Bi-Encoder version of the reranker would look like

In [53]:
class BiRankBert(nn.Module):
    def __init__(self, emb_size=64, train_layers_count=2):
        super().__init__()

        self.bert = AutoModel.from_pretrained("xlm-roberta-base")
        self.config = self.bert.config

        # freeze all parameters except biases and LayerNorm weights
        for name, par in self.bert.named_parameters():
            if 'bias' in name or 'LayerNorm' in name:
                continue
            par.requires_grad = False

        layer_count = self.config.num_hidden_layers
        for i in range(train_layers_count):  # fully unfreeze the last few encoder layers
            for par in self.bert.encoder.layer[layer_count - 1 - i].parameters():
                par.requires_grad = True

        # map the <s> ([CLS]) token embedding to a low-dimensional embedding
        self.head = nn.Linear(self.config.hidden_size, emb_size)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        x = self.bert(input_ids=input_ids,
                      token_type_ids=token_type_ids,
                      attention_mask=attention_mask
                      )[0][:, 0, :]  # last hidden state of the <s> ([CLS]) token
        x = self.head(x)
        return x

But such a model needs its input in a different format:

- there must be 2 separate tensors, one with query tokens and one with document tokens

- Bi Encoder: preds = (model(q_tokens) * model(doc_tokens)).sum(-1)
- Cross Encoder: preds = model(q_doc_tokens)
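The bi-encoder formula above is a row-wise dot product. In this sketch random tensors stand in for the outputs of `model(q_tokens)` and `model(doc_tokens)`, and the matrix product shows why document embeddings can be precomputed and reused across queries.

```python
import torch

torch.manual_seed(0)
emb_size = 64
q_emb = torch.randn(3, emb_size)      # stand-in for model(q_tokens), one row per query
doc_emb = torch.randn(3, emb_size)    # stand-in for model(doc_tokens), one row per document

# score for aligned (query, document) rows: a dot product
preds = (q_emb * doc_emb).sum(-1)     # shape (3,)

# scoring every query against every document is a single matrix product
all_scores = q_emb @ doc_emb.T        # shape (3, 3)

same = torch.allclose(preds, all_scores.diagonal(), atol=1e-5)
print(same)
```

A cross-encoder has no such shortcut: every (query, document) pair requires its own forward pass, which is why it is typically used only to rerank a short candidate list.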

## Preparing data for training

In [54]:
class RankDataset(Dataset):
    def __init__(self, data, neg_p=1.0):
        self.neg_p = neg_p
        if self.neg_p < 1.:
            self.data = pd.concat([data[data['label'] == 1], 
                                   data[data['label'] == 0].sample(frac=self.neg_p)])
        else:
            self.data = data
        
    def __getitem__(self, index):
        query, text, label = self.data.iloc[index, [1, 2, 3]]

        return [query.lower(), text.lower()], label

    def __len__(self):
        return len(self.data)

In [55]:
def compose_batch(batch):
    texts = [x for x, _ in batch]
    ys = torch.tensor([y for _, y in batch]).reshape((-1, 1)).float()

    tokens = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='pt')

    return tokens, ys

In [56]:
dataset_train = RankDataset(train_data, neg_p=0.3)
dataset_valid = RankDataset(val_data, neg_p=1.)

In [57]:
dataset_train[1]

(['where is your perineum',
  'that part of the floor of the pelvis that lies between the tops of the thighs. in the male, the perineum lies between the anus and the scrotum. in the female, it includes the external genitalia. the area between the opening of the vagina and the anus in a woman, or the area between the scrotum and the anus in a man.'],
 1.0)

In [58]:
train_dataloader = data.DataLoader(dataset_train, shuffle=True, batch_size=128, collate_fn=compose_batch, num_workers=2)
valid_dataloader = data.DataLoader(dataset_valid, shuffle=False, batch_size=128, collate_fn=compose_batch)

In [59]:
len(dataset_train), len(train_dataloader)

(3934724, 30741)
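Downsampling negatives with `neg_p=0.3` changes the class balance of the training set. A small arithmetic sketch, assuming every positive is kept and each negative is kept with probability `neg_p`:

```python
pos_share = 0.03452168446212369   # train_data['label'].mean() computed earlier
neg_p = 0.3                       # fraction of negatives kept by RankDataset

# positives are all kept, negatives are kept with probability neg_p
new_pos_share = pos_share / (pos_share + neg_p * (1 - pos_share))
negatives_per_positive = neg_p * (1 - pos_share) / pos_share

print(round(new_pos_share, 3), round(negatives_per_positive, 1))
```

So the share of positives rises from about 3.5% to roughly 11%, i.e. a bit more than 8 negatives per positive in the sampled training set.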

In [60]:
sample_batch = next(iter(train_dataloader))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [61]:
sample_batch[0]

{'input_ids': tensor([[     0,   2367,  30482,  ...,   1221,  69255,      2],
        [     0,    831,    398,  ...,    831,  28123,      2],
        [     0,   3229,    831,  ...,     54,     10,      2],
        ...,
        [     0,  45484,     85,  ...,  52490,      7,      2],
        [     0,  72382,    111,  ...,   5791, 190534,      2],
        [     0,   2367,     83,  ...,   1821, 175457,      2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}

Detokenize the batch elements

In [62]:
tokenizer.decode(sample_batch[0]['input_ids'][0])

'<s> what makes a vibration in leg and back</s></s> while the unit is weight sensing, the tub will spin slowly back and forth and some vibration will occur. this is normal. while the unit is still spinning at a slow speed (400 rpm or less), the unit will vibra</s>'

In [63]:
tokenizer.decode(sample_batch[0]['input_ids'][1])

'<s> can you use shaving cream to make slime</s></s> however, since slime molds in garden mulch or other areas are not harmful, removal is not necessary. for this reason, slime mold control with chemicals is more trouble than it is worth. few chemicals can permanent</s>'

### Training setup

In [64]:
model = RankBert(train_layers_count=2)

In [65]:
!nvidia-smi



Wed Apr 10 20:44:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:89:00.0 Off |                    0 |
| N/A   25C    P0              67W / 400W |   1323MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [66]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

RankBert(
  (bert): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0): XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm):

In [67]:
!nvidia-smi



Wed Apr 10 20:45:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:89:00.0 Off |                    0 |
| N/A   25C    P0              67W / 400W |   2437MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Optimization in torch: https://pytorch.org/docs/stable/optim.html

Adam in torch: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

OneCycle scheduler: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html#torch.optim.lr_scheduler.OneCycleLR

In [57]:
!mkdir -p cross_encoder_checkpoint

In [68]:
class config:
    EPOCHS=1
    LR=1e-4
    WD=0.01
    SAVE_DIR="cross_encoder_checkpoint"
    SAVE_INTERVAL=1000
    BATCH_SIZE=64  # note: the DataLoaders above were created with batch_size=128
    ACCUM_BS=1
    DEVICE=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    LOG_INTERVAL=250

writer = SummaryWriter('cross_encoder_checkpoint/ms_marco_v2')

loss_fn = nn.BCEWithLogitsLoss()

In [1]:
#!rm -rf cross_encoder_checkpoint
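The loss used here, `nn.BCEWithLogitsLoss`, takes raw scores (logits), so the model head needs no sigmoid. A minimal sketch with made-up logits and labels, showing that it matches `nn.BCELoss` applied after an explicit sigmoid:

```python
import torch
from torch import nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])   # raw model outputs, shape (batch, 1)
labels = torch.tensor([[1.0], [0.0], [0.0]])    # relevance labels, same shape

loss_with_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), labels)

same_loss = torch.allclose(loss_with_logits, loss_manual, atol=1e-6)
print(same_loss)
```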

In [69]:
optimizer = torch.optim.AdamW(model.parameters(), lr=config.LR, weight_decay=config.WD)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                pct_start=0.1,
                                                max_lr=config.LR,
                                                epochs=config.EPOCHS, 
                                                steps_per_epoch=len(train_dataloader))
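To see the shape of the OneCycle schedule without training anything, here is a sketch on a hypothetical tiny model (`toy_model`, `toy_opt`, `toy_sched` and `total_steps=100` are illustrative). With `pct_start=0.1` the learning rate warms up during the first 10% of the steps and then decays.

```python
import torch
from torch import nn

toy_model = nn.Linear(4, 1)                      # hypothetical tiny model
toy_opt = torch.optim.AdamW(toy_model.parameters(), lr=1e-4, weight_decay=0.01)
total_steps = 100
toy_sched = torch.optim.lr_scheduler.OneCycleLR(
    toy_opt, max_lr=1e-4, total_steps=total_steps, pct_start=0.1
)

lrs = []
for _ in range(total_steps):
    lrs.append(toy_sched.get_last_lr()[0])       # learning rate used for this step
    toy_opt.step()                               # no gradients here, so the weights do not change
    toy_sched.step()

peak_step = max(range(total_steps), key=lambda i: lrs[i])
print(peak_step, lrs[0], max(lrs), lrs[-1])
```

The scheduler must be stepped once per batch, which is why the real setup passes `steps_per_epoch=len(train_dataloader)`.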

In [70]:
from tqdm.notebook import tqdm as tqdm_note
import gc

In [71]:
def move_batch_to_device(batch, device):
    batch_x, y = batch
    for key in batch_x:
        batch_x[key] = batch_x[key].to(device)
    y = y.to(device)
    return batch_x, y

In [72]:
move_batch_to_device(sample_batch, device)

({'input_ids': tensor([[     0,   2367,  30482,  ...,   1221,  69255,      2],
         [     0,    831,    398,  ...,    831,  28123,      2],
         [     0,   3229,    831,  ...,     54,     10,      2],
         ...,
         [     0,  45484,     85,  ...,  52490,      7,      2],
         [     0,  72382,    111,  ...,   5791, 190534,      2],
         [     0,   2367,     83,  ...,   1821, 175457,      2]],
        device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')},
 tensor([[0.],
         [0.],
         [0.],
         [1.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [1.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         

How to use TensorBoard: https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

In [73]:
def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    running_auc = 0.
    last_loss = 0.

    device = config.DEVICE
    # Here, we use enumerate(training_loader) instead of
    # iter(training_loader) so that we can track the batch
    # index and do some intra-epoch reporting
    for i, batch in enumerate(train_dataloader):
        # Every data instance is an input + label pair
        batch_x, y = move_batch_to_device(batch, device)

        # Zero your gradients for every batch!
        optimizer.zero_grad()

        # Make predictions for this batch
        outputs = model(**batch_x)

        # Compute the loss and its gradients
        loss = loss_fn(outputs, y)
        loss.backward()

        # Adjust learning weights
        optimizer.step()
        scheduler.step()

        # Gather data and report
        running_loss += loss.item()

        y = y.cpu().int().numpy()
        if y.sum() > 0:
            # Compute the metric on detached predictions
            auc = roc_auc_score(y,
                                outputs.detach().cpu().numpy(),
                                labels=np.array([0, 1]))
        else:
            # No positive labels in this batch: count it as a perfect score
            auc = 1.
        running_auc += auc
        
        # Log to TensorBoard
        tb_x = epoch_index * len(train_dataloader) + i + 1
        tb_writer.add_scalar('lr', scheduler.get_last_lr()[0], tb_x)
        tb_writer.add_scalar('Train/auc', auc, tb_x)
        tb_writer.add_scalar('Train/loss', loss.item(), tb_x)
        
        if i % config.LOG_INTERVAL == config.LOG_INTERVAL - 1:
            last_loss = running_loss / config.LOG_INTERVAL # loss per batch
            last_auc = running_auc / config.LOG_INTERVAL # AUC per batch
            print('  batch {} loss: {}, auc: {}'.format(i + 1, last_loss, last_auc))
            
            tb_writer.add_scalar('Train/running_loss', last_loss, tb_x)
            tb_writer.add_scalar('Train/running_auc', last_auc, tb_x)
            running_loss = 0.
            running_auc = 0.

        if i % 10 == 0: #clean up memory
            gc.collect()
            torch.cuda.empty_cache()

    return last_loss, last_auc

In [74]:
epoch_number = 0

best_vloss = 1_000_000.

for epoch in range(config.EPOCHS):
    print('EPOCH {}:'.format(epoch_number + 1))

    # Make sure gradient tracking is on, and do a pass over the data
    model.train(True)
    avg_loss, avg_auc = train_one_epoch(epoch_number, writer)
    
    running_vloss = 0.0
    running_vauc = 0.0
    # Set the model to evaluation mode, disabling dropout and using population
    # statistics for batch normalization.
    model.eval()

    # Disable gradient computation and reduce memory consumption.
    with torch.no_grad():
        preds = []
        for i, batch in enumerate(valid_dataloader):
            batch_x, y = move_batch_to_device(batch, config.DEVICE)
            voutputs = model(**batch_x)
            vloss = loss_fn(voutputs, y)
            running_vloss += vloss.item()  # accumulate a Python float, not a GPU tensor

            y = y.cpu().int().numpy()
            if y.sum() > 0:
                # Compute the metric (gradients are already disabled here)
                auc = roc_auc_score(y,
                                    voutputs.cpu().numpy(),
                                    labels=np.array([0, 1]))
                running_vauc += auc
            else:
                # No positive labels in this batch: count it as a perfect score
                running_vauc += 1
                
            preds.append(voutputs)

    #compute valid mrr
    preds = torch.cat(preds).view(-1).cpu()
    val_mrr = MRR(preds, val_data['label'].values, val_data['qid'].values)

    avg_vloss = running_vloss / (i + 1)
    avg_vauc = running_vauc / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))
    print('AUC train {} valid {}'.format(avg_auc, avg_vauc))
    print('MRR valid {}'.format(val_mrr))

    writer.add_scalar('Valid/mrr', val_mrr, epoch)
    writer.add_scalar('Valid/loss', avg_vloss, epoch)
    writer.add_scalar('Valid/auc', avg_vauc, epoch)
    
    # Track best performance, and save the model's state
    if avg_vloss < best_vloss:
        best_vloss = avg_vloss

        checkpoint = {'epoch': epoch, 'model_state_dict': model.state_dict(),
                      'optimizer_state_dict': optimizer.state_dict(),
                      'scheduler_state_dict': scheduler.state_dict(),
                      'best_vloss': best_vloss}

        torch.save(checkpoint, f'{config.SAVE_DIR}/ckpt_epoch_{epoch}_loss{best_vloss}.pt')

    epoch_number += 1

EPOCH 1:


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  batch 250 loss: 0.3889099364876747, auc: 0.5043006947204248
  batch 500 loss: 0.3470394178032875, auc: 0.5552770926764914
  batch 750 loss: 0.3170865324139595, auc: 0.690314798881317
  batch 1000 loss: 0.2549331806898117, auc: 0.8437926432325951
  batch 1250 loss: 0.22242624154686927, auc: 0.892422261370732
  batch 1500 loss: 0.20495209580659865, auc: 0.9158962879196262
  batch 1750 loss: 0.1951291216313839, auc: 0.9200777550000614
  batch 2000 loss: 0.187299045920372, auc: 0.9288148783152385
  batch 2250 loss: 0.17793370378017426, auc: 0.9365735948471829
  batch 2500 loss: 0.17562542206048964, auc: 0.9400550511514238
  batch 2750 loss: 0.17605843663215637, auc: 0.9386838210694142
  batch 3000 loss: 0.16815103495121003, auc: 0.9459537652181457
  batch 3250 loss: 0.17135702481865883, auc: 0.9418520492223905
  batch 3500 loss: 0.16708818045258522, auc: 0.9470855737048376
  batch 3750 loss: 0.16372986966371536, auc: 0.9481513280546877
  batch 4000 loss: 0.16252191585302353, auc: 0.94666
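
The checkpoints written by the loop above bundle the model, optimizer, and scheduler state together with the epoch number. Below is a minimal sketch of how training could be resumed from such a file, assuming the same dictionary keys as in `checkpoint`. The tiny `nn.Linear` model, the SGD optimizer, the `StepLR` scheduler, the helper `build_training_objects`, and the temporary file path are hypothetical stand-ins for illustration, not the seminar's actual objects.

```python
import os
import tempfile

import torch
from torch import nn


def build_training_objects():
    # Hypothetical stand-ins for the seminar's model / optimizer / scheduler.
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


# "Train" for one step and save a checkpoint with the same keys as above.
model, optimizer, scheduler = build_training_objects()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
scheduler.step()

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt_example.pt")
torch.save({"epoch": 0,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
            "best_vloss": loss.item()}, ckpt_path)

# Restore into freshly built objects, e.g. in a new session.
restored_model, restored_optimizer, restored_scheduler = build_training_objects()
checkpoint = torch.load(ckpt_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
restored_scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch
```

Restoring the optimizer and scheduler along with the weights keeps the learning-rate schedule and any optimizer statistics consistent with the interrupted run; for inference alone, loading `model_state_dict` is enough.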

## Inference

In [293]:
test_dataset = RankDataset(test_data, neg_p=1.)

In [294]:
test_dataloader = data.DataLoader(test_dataset, shuffle=False, batch_size=1024, collate_fn=compose_batch)

In [295]:
def get_test_perds(model):
    model.eval()  # evaluation mode
    y_test = []

    # Iterate over batches
    for i, batch in enumerate(tqdm.tqdm(test_dataloader, position=0, leave=True, desc="Iteration: test")):
        batch_x, y = move_batch_to_device(batch, config.DEVICE)
        with torch.no_grad():
            preds = model(**batch_x)
            y_test.append(preds)

    y_test = torch.cat(y_test).view(-1).cpu().numpy()
    return y_test

y_test = get_test_perds(model.float())

Iteration: test: 100%|██████████| 91/91 [00:20<00:00,  4.52it/s]


In [296]:
y_test[:5]

array([5.5373108e-01, 2.8407093e-02, 2.0300053e-04, 2.7058361e-04,
       2.9450431e-02], dtype=float32)

In [297]:
print(roc_auc_score(test_dataset.data['label'].values, y_test, labels=np.array([0, 1])))

0.9692076540926228


In [298]:
# model.half() casts the model to float16 precision in place and returns it
y_test = get_test_perds(model.half())

Iteration: test: 100%|██████████| 91/91 [00:20<00:00,  4.37it/s]


In [299]:
y_test[:5]

array([5.522e-01, 2.844e-02, 2.034e-04, 2.716e-04, 2.948e-02],
      dtype=float16)

## Computing MRR

In [300]:
test_data['pred'] = np.array(y_test)

In [301]:
test_data

| Unnamed: 0 | qid | query | text | label | pred |
|---|---|---|---|---|---|
| 12230454 | 397001 | highest point in kentucky | Harlan County, Kentucky, U.S. Black Mountain i... | 1.0 | 0.552246 |
| 12230455 | 397001 | highest point in kentucky | It is here, where the land is dominated by the... | 0.0 | 0.028442 |
| 12230456 | 397001 | highest point in kentucky | KENTUCKY ANCESTORS. GENEALOGICAL QUARTERLY. OF... | 0.0 | 0.000203 |
| 12230457 | 397001 | highest point in kentucky | The outcome of the political struggle in Kentu... | 0.0 | 0.000272 |
| 12230458 | 397001 | highest point in kentucky | As a division of Circuit Court, which is the h... | 0.0 | 0.029480 |
| ... | ... | ... | ... | ... | ... |
| 12323107 | 400000 | what does inclement weather mean | When there is heavy snow or when inclement wea... | 0.0 | 0.000191 |
| 12323108 | 400000 | what does inclement weather mean | The SUV also offers two versions of its Quadra... | 0.0 | 0.000255 |
| 12323109 | 400000 | what does inclement weather mean | Today's and tonight's Evansville, IN weath... | 0.0 | 0.000185 |
| 12323110 | 400000 | what does inclement weather mean | (of the weather, the elements, etc.) severe, r... | 0.0 | 0.255615 |


In [302]:
# Cross-Encoder Rank Bert
MRR(y_test, test_data['label'].values, test_data['qid'].values)

tensor(0.8183)
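
For reference, the metric can be sketched in plain Python. The function below is a hypothetical `mrr_sketch`, not the `MRR` helper used above: it assumes the same argument order (predictions, labels, query ids), no rank cutoff, and that a query without any relevant document contributes 0.

```python
from collections import defaultdict


def mrr_sketch(preds, labels, qids):
    """Mean reciprocal rank of the first relevant document per query."""
    # Group (score, label) pairs by query id.
    by_query = defaultdict(list)
    for score, label, qid in zip(preds, labels, qids):
        by_query[qid].append((float(score), float(label)))

    reciprocal_ranks = []
    for docs in by_query.values():
        # Rank the query's documents by descending predicted score.
        ranked = sorted(docs, key=lambda pair: pair[0], reverse=True)
        rr = 0.0  # assumption: queries without a relevant document contribute 0
        for rank, (_, label) in enumerate(ranked, start=1):
            if label > 0:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Toy data: query 1 ranks its relevant document first, query 2 ranks it second.
preds = [0.9, 0.1, 0.3, 0.4, 0.8]
labels = [1, 0, 0, 1, 0]
qids = [1, 1, 1, 2, 2]
print(mrr_sketch(preds, labels, qids))  # (1/1 + 1/2) / 2 = 0.75
```

Because only the position of the first relevant document matters, MRR depends on the ordering of scores within a query and not on their absolute values.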

# What else to look at

How to improve the solution:

- try pairwise / listwise losses
- unfreeze a larger part of the network
- train longer / on more data
- try other pretrained checkpoints (applicable to English)
- optimize training speed (so that more data can be processed in the same amount of time)
- extend the context / add new text fields
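
As an illustration of the first item in this list, here is a minimal sketch of a pairwise (RankNet-style) logistic loss. It assumes the model returns raw, unbounded scores rather than probabilities and that query ids are available in the batch as an integer tensor; the function name `pairwise_logistic_loss` and the toy tensors are hypothetical and not part of the seminar code.

```python
import torch
import torch.nn.functional as F


def pairwise_logistic_loss(scores, labels, qids):
    """RankNet-style loss over (positive, negative) pairs within each query."""
    scores, labels, qids = scores.view(-1), labels.view(-1), qids.view(-1)
    # diff[i, j] = s_i - s_j for every pair of documents in the batch.
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)
    # Keep only pairs where i is relevant, j is not, and both share a query.
    same_query = qids.unsqueeze(1) == qids.unsqueeze(0)
    pos_neg = (labels.unsqueeze(1) > 0) & (labels.unsqueeze(0) == 0)
    mask = same_query & pos_neg
    if not mask.any():
        # No valid pairs in this batch: return a zero that still supports backward().
        return scores.sum() * 0.0
    # softplus(-d) = log(1 + exp(-d)): small when the positive outscores the negative.
    return F.softplus(-diff[mask]).mean()


# Toy batch with two queries (ids 7 and 8).
scores = torch.tensor([2.0, 0.5, -1.0, 0.1, 0.3], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0])
qids = torch.tensor([7, 7, 7, 8, 8])
loss = pairwise_logistic_loss(scores, labels, qids)
loss.backward()
print(loss.item())
```

Unlike the pointwise loss, this objective only penalizes wrong orderings inside a query, which is closer to what MRR measures. It needs batches that contain both relevant and non-relevant documents for the same query.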

- How to speed up training with mixed precision
    - *https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html*
    - https://pytorch.org/docs/stable/notes/amp_examples.html
- How to train models on multiple GPUs (available, for example, on Kaggle)
    - https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
    - https://pytorch.org/docs/stable/notes/ddp.html
- Libraries that make training networks more convenient
    - https://github.com/Lightning-AI/pytorch-lightning (general purpose)
    - https://huggingface.co/docs/transformers/main/en/trainer (transformers)
    - https://huggingface.co/docs/transformers/main_classes/trainer
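
For the mixed-precision item in the list above, the following is a minimal sketch of a training step with `torch.autocast` and a gradient scaler. The small `nn.Sequential` model, the AdamW optimizer, the random data, and the use of `BCEWithLogitsLoss` are assumptions made for illustration; the seminar's own model and loss may differ. The sketch falls back to bfloat16 autocast with scaling disabled when no GPU is available. Recent PyTorch releases also expose the scaler as `torch.amp.GradScaler`.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"
# float16 on GPU; bfloat16 is the usual autocast dtype on CPU.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

# Hypothetical stand-ins for the seminar's model, optimizer and loss.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # operates on raw logits
# Loss scaling only matters for float16; the scaler is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(64, 16, device=device)
y = (torch.rand(64, 1, device=device) > 0.5).float()

losses = []
for step in range(5):
    optimizer.zero_grad()
    # Forward pass in reduced precision where autocast considers it safe.
    with torch.autocast(device_type=device.type, dtype=amp_dtype):
        logits = model(x)
    # Compute the loss in float32 for numerical stability.
    loss = loss_fn(logits.float(), y)
    # Scale the loss to avoid float16 gradient underflow, then step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())
```

In contrast to calling `model.half()`, autocast keeps the master weights in float32 and only runs selected operations in reduced precision, which tends to be the safer choice for training.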