<a href="https://colab.research.google.com/github/VitalyGladyshev/LLM-engineering/blob/main/LLM_GLVV_HW10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework: Ranking for RAG in Russian

**Dataset:** [ai-forever/ria-news-retrieval](https://huggingface.co/datasets/ai-forever/ria-news-retrieval)

**Goal:** Adapt the ranking pipeline to Russian-language news

**Pipeline:** BM25 → Dense → RRF → MMR → Cross-encoder rerank

**Baseline models:**
- BM25: simple tokenization by splitting on `\W+`
- Dense: `sentence-transformers/paraphrase-MiniLM-L3-v2`
- Cross-encoder: `cross-encoder/ms-marco-MiniLM-L-6-v2`

**Grading:** 3 (BM25) + 3 (Dense) + 3 (Rerank) + 1 (bonus) = **10 points**



In [None]:
# Configuration
STUDENT_NAME = 'Гладышев'  # CHANGE TO YOUR SURNAME
PIPELINE_NAME = 'baseline'
QUICK_RUN = True  # set to False for a full run
MAX_CORPUS = 1000 if QUICK_RUN else 7044
MAX_QUERIES = 200 if QUICK_RUN else 10000
SEED = 42
TOP_K = 100

from pathlib import Path
BASE = Path.cwd()
DATA_DIR = BASE/'dataset'/'data'
RUNS_DIR = BASE/'runs'/STUDENT_NAME/PIPELINE_NAME
for d in [DATA_DIR, RUNS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f'👤 Student: {STUDENT_NAME}')
print(f'🔧 Pipeline: {PIPELINE_NAME}')
print(f'⚡ QUICK_RUN: {QUICK_RUN} (corpus={MAX_CORPUS}, queries={MAX_QUERIES})')
print(f'📁 DATA_DIR: {DATA_DIR}')
print(f'📁 RUNS_DIR: {RUNS_DIR}')


👤 Student: Гладышев
🔧 Pipeline: baseline
⚡ QUICK_RUN: True (corpus=1000, queries=200)
📁 DATA_DIR: /content/dataset/data
📁 RUNS_DIR: /content/runs/Гладышев/baseline


In [None]:
import os, json, math, time, random, re, csv
import numpy as np
import pandas as pd
from tqdm import tqdm
from statistics import mean

random.seed(SEED)
np.random.seed(SEED)

# Baseline tokenization
TOKEN_SPLIT = re.compile(r"[^\w]+", flags=re.UNICODE)

def tokenize(text):
    """Baseline: simple split on non-word characters"""
    if not isinstance(text, str):
        text = str(text) if text else ''
    return [t for t in TOKEN_SPLIT.split(text) if t]

# IO helpers
def read_jsonl(path):
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def write_jsonl(path, records):
    with open(path, 'w', encoding='utf-8') as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + '\n')

print('Imports loaded')
print(f'Tokenize test: {tokenize("Привет, мир! Это тест.")}')


Imports loaded
Tokenize test: ['Привет', 'мир', 'Это', 'тест']


## 📦 Dataset: ai-forever/ria-news-retrieval

The dataset is loaded with the Hugging Face `datasets` library.

**Structure:**
- `corpus` (split `corpus`): ~704k documents (`_id`, `title`, `text`)
- `queries` (split `queries`): ~10k queries (`_id`, `text`)
- `default` (split `test`): ~10k qrels (`query-id`, `corpus-id`, `score` 0/1)


In [None]:
from datasets import load_dataset

print('Loading dataset ai-forever/ria-news-retrieval...')
print('May take 1-2 minutes on the first run (cached afterwards)\n')

# Load the three subsets
corpus_ds = load_dataset("ai-forever/ria-news-retrieval", "corpus", split="corpus")
queries_ds = load_dataset("ai-forever/ria-news-retrieval", "queries", split="queries")
qrels_ds = load_dataset("ai-forever/ria-news-retrieval", "default", split="test")

# Convert to lists
corpus_all = list(corpus_ds)
queries_all = list(queries_ds)
qrels_all = list(qrels_ds)

print('Loaded:')
print(f'Corpus: {len(corpus_all)} documents')
print(f'Queries: {len(queries_all)} queries')
print(f'Qrels: {len(qrels_all)} pairs')


Loading dataset ai-forever/ria-news-retrieval...
May take 1-2 minutes on the first run (cached afterwards)



README.md:   0%|          | 0.00/343 [00:00<?, ?B/s]

data/corpus.jsonl.zip:   0%|          | 0.00/380M [00:00<?, ?B/s]

Generating corpus split:   0%|          | 0/704344 [00:00<?, ? examples/s]

data/queries.jsonl:   0%|          | 0.00/1.43M [00:00<?, ?B/s]

Generating queries split:   0%|          | 0/10000 [00:00<?, ? examples/s]

test.jsonl: 0.00B [00:00, ?B/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Loaded:
Corpus: 704344 documents
Queries: 10000 queries
Qrels: 10000 pairs


In [None]:
# Sampling for QUICK_RUN
print(f'Sampling (QUICK_RUN={QUICK_RUN})...')

# 1. Select queries
if QUICK_RUN and len(queries_all) > MAX_QUERIES:
    random.Random(SEED).shuffle(queries_all)
    queries_sel = queries_all[:MAX_QUERIES]
else:
    queries_sel = queries_all

sel_qids = set(q['_id'] for q in queries_sel)
print(f'Queries: {len(queries_sel)}')

# 2. Filter qrels by the selected queries
qrels_sel = [q for q in qrels_all if q['query-id'] in sel_qids]
needed_docids = set(q['corpus-id'] for q in qrels_sel if q['score'] > 0)
print(f'Qrels: {len(qrels_sel)} pairs')
print(f'Required documents (rel>0): {len(needed_docids)}')

# 3. Select the corpus: required documents first, then the rest
corpus_needed = [d for d in corpus_all if d['_id'] in needed_docids]
corpus_other = [d for d in corpus_all if d['_id'] not in needed_docids]
random.Random(SEED).shuffle(corpus_other)

corpus_sel = (corpus_needed + corpus_other)[:MAX_CORPUS]
corpus_docids = set(d['_id'] for d in corpus_sel)
print(f'Corpus: {len(corpus_sel)} documents')

# 4. Final qrels filtering
qrels_final = [q for q in qrels_sel if q['corpus-id'] in corpus_docids]
print(f'Final qrels: {len(qrels_final)} pairs')


Sampling (QUICK_RUN=True)...
Queries: 200
Qrels: 200 pairs
Required documents (rel>0): 200
Corpus: 1000 documents
Final qrels: 200 pairs


In [None]:
# Write the unified schema
corpus_out = DATA_DIR/'corpus.jsonl'
queries_out = DATA_DIR/'queries.eval.jsonl'
qrels_out = DATA_DIR/'qrels.eval.tsv'

print('Writing unified files...')

# Corpus
write_jsonl(corpus_out, [
    {'docid': d['_id'], 'title': d.get('title', ''), 'text': d['text']}
    for d in corpus_sel
])

# Queries
write_jsonl(queries_out, [
    {'qid': q['_id'], 'text': q['text']}
    for q in queries_sel
])

# Qrels (TSV format)
with open(qrels_out, 'w', encoding='utf-8') as f:
    for q in qrels_final:
        f.write(f"{q['query-id']}\t{q['corpus-id']}\t{q['score']}\n")

print('Files written:')
print(f'{corpus_out}')
print(f'{queries_out}')
print(f'{qrels_out}')


Writing unified files...
Files written:
/content/dataset/data/corpus.jsonl
/content/dataset/data/queries.eval.jsonl
/content/dataset/data/qrels.eval.tsv


In [None]:
# Load data for the pipeline
print('\nLoading data for the pipeline...')

docids = []
corpus_text_map = {}
for o in read_jsonl(corpus_out):
    did = o['docid']
    docids.append(did)
    # Concatenate title and text
    corpus_text_map[did] = f"{o.get('title', '')}. {o['text']}".strip()

qids = []
query_texts = []
for o in read_jsonl(queries_out):
    qids.append(o['qid'])
    query_texts.append(o['text'])

print('Loaded:')
print(f'{len(docids)} documents')
print(f'{len(qids)} queries')
print(f'\n Sample query:')
print(f'   qid: {qids[0]}')
print(f'   text: {query_texts[0]}')



Loading data for the pipeline...
Loaded:
1000 documents
200 queries

 Sample query:
   qid: 3771
   text: подъезд дома в оренбуржье на время выселят из-за разлива ртути


## 🔍 BM25 Retrieval (Baseline)

**Baseline configuration:**
- Tokenization: simple regex split on `\W+`
- Normalization: none (no lowercasing)
- No stemming or lemmatization
- No stop-word removal
- Parameters: k1=1.2, b=0.75

### 📝 Task 1 (3 points): Improve BM25

**What to try:**
1. Tokenizers: `nltk`, `razdel`, `spacy`
2. Normalization: stemming (Snowball) or lemmatization (pymorphy2, mystem)
3. Cleaning: stop-word removal (NLTK Russian)
4. Parameters: k1, b

**Criteria:**
- Any improvement in NDCG@10: +1 point
- Any improvement in NDCG@5: +1 point
- Rationale for the choices: +1 point
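
Since the criteria are stated in terms of NDCG@k, a small reference implementation helps make the target concrete. The evaluation code used by this notebook is not shown in this part, so the sketch below assumes the standard definition (gain `rel`, discount `log2(rank + 1)`); the function names `dcg_at_k` and `ndcg_at_k` are illustrative, not taken from the notebook.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_docids, relevant, k=10):
    """NDCG@k for a single query.

    ranked_docids: docids in retrieval order.
    relevant: dict docid -> relevance label (0/1 in this dataset).
    """
    gains = [relevant.get(d, 0) for d in ranked_docids]
    ideal = sorted(relevant.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Toy check: the single relevant document is found at rank 1, then at rank 3.
qrels_toy = {"d7": 1}
print(ndcg_at_k(["d7", "d2", "d3"], qrels_toy, k=10))  # 1.0
print(ndcg_at_k(["d2", "d3", "d7"], qrels_toy, k=10))  # 0.5
```

With one relevant document per query, as in the sampled qrels here, NDCG@k reduces to `1 / log2(rank + 1)` of that document, so moving it from rank 3 to rank 1 doubles the score.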


In [None]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
# =========================
# BM25 RETRIEVAL FUNCTIONS
# =========================

import re, json, csv, time
import numpy as np
from tqdm import tqdm
from rank_bm25 import BM25Okapi

TOKEN_SPLIT = re.compile(r"[^\w]+", flags=re.UNICODE)

def tokenize(text: str):
    """Simple lexical tokenization"""
    if not isinstance(text, str):
        text = '' if text is None else str(text)
    return [t for t in TOKEN_SPLIT.split(text) if t]

def build_bm25_index(corpus_tokens, k1=1.2, b=0.75):
    """Build the BM25 index"""
    return BM25Okapi(corpus_tokens, k1=k1, b=b)

def bm25_search(bm25_index, query_tokens, docids, top_k=100):
    """Retrieve the top-K documents by BM25 score"""
    if not query_tokens:
        return []

    scores = bm25_index.get_scores(query_tokens)
    k = min(top_k, len(scores))
    if k == 0:
        return []

    idx = np.argpartition(scores, -k)[-k:]
    top_idx = idx[np.argsort(scores[idx])[::-1]]
    return [docids[i] for i in top_idx]

def run_bm25_retrieval(corpus_tokens, query_tokens_list, docids, qids,
                       top_k=100, k1=1.2, b=0.75):
    """
    Full BM25 retrieval cycle

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
        per_query_ms: list of (qid, latency_ms)
    """
    bm25 = build_bm25_index(corpus_tokens, k1=k1, b=b)

    outputs = []
    per_query_ms = []

    for qid, qtok in tqdm(list(zip(qids, query_tokens_list)), desc='BM25'):
        t0 = time.perf_counter()
        top_docids = bm25_search(bm25, qtok, docids, top_k=top_k)
        t1 = time.perf_counter()

        per_query_ms.append((qid, (t1 - t0) * 1000.0))
        outputs.append({'qid': qid, 'docids': top_docids})

    return outputs, per_query_ms


In [None]:
# =========================
# BM25 BASELINE
# =========================

bm25_dir = RUNS_DIR/'bm25'
bm25_dir.mkdir(parents=True, exist_ok=True)

print('BM25 Retrieval (Baseline)')
print(f'k1=1.2, b=0.75\n')

# Tokenize the corpus and the queries
corpus_tokens = [tokenize(corpus_text_map[d]) for d in docids]
query_tokens_list = [tokenize(text) for text in query_texts]

# Run BM25
bm25_outputs, bm25_latency = run_bm25_retrieval(
    corpus_tokens=corpus_tokens,
    query_tokens_list=query_tokens_list,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    k1=1.2,
    b=0.75
)

# Save
write_jsonl(bm25_dir/f'top{TOP_K}.jsonl', bm25_outputs)

print(f'\nBM25 done!')
print(f'Saved to: {bm25_dir}')


BM25 Retrieval (Baseline)
k1=1.2, b=0.75



BM25: 100%|██████████| 200/200 [00:01<00:00, 194.63it/s]


BM25 done!
Saved to: /content/runs/Гладышев/baseline/bm25





In [None]:
!pip install -q razdel pymystem3 nltk

In [None]:
import re
import time
import numpy as np
from tqdm import tqdm
from rank_bm25 import BM25Okapi

import razdel
from pymystem3 import Mystem
import nltk

In [None]:
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words('russian'))
except LookupError:
    # The stopwords corpus has not been downloaded yet
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words('russian'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print(f'Loaded {len(STOPWORDS)} stop words')
print(f'Sample stop words: {list(STOPWORDS)[:10]}')

Loaded 151 stop words
Sample stop words: ['тут', 'когда', 'никогда', 'тот', 'со', 'и', 'один', 'их', 'потому', 'про']


In [None]:
mystem = Mystem()

Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz


#### Stemming

In [None]:
from nltk.stem.snowball import SnowballStemmer

try:
    stemmer = SnowballStemmer("russian")
except LookupError:
    nltk.download('snowball_data')
    stemmer = SnowballStemmer("russian")

def tokenize_with_stemming(text: str, remove_stopwords=True):
    """
    Tokenization with Snowball stemming
    """
    if not isinstance(text, str):
        text = '' if text is None else str(text)

    text = text.lower()
    tokens = [token.text for token in razdel.tokenize(text)]
    tokens = [t for t in tokens if t.isalpha() and len(t) > 1]

    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]

    # Stemming
    tokens = [stemmer.stem(t) for t in tokens]

    return tokens

#### Lemmatization

In [None]:

def tokenize_improved(text: str, use_lemmatization=True, remove_stopwords=True):
    """
    Improved tokenization for Russian using Mystem

    Args:
        text: input text
        use_lemmatization: apply lemmatization
        remove_stopwords: remove stop words

    Returns:
        list of tokens
    """
    if not isinstance(text, str):
        text = '' if text is None else str(text)

    text = text.lower()

    if use_lemmatization:
        # Lemmatization via Mystem
        # Mystem returns a list in which tokens alternate with whitespace
        lemmas = mystem.lemmatize(text)
        # Keep only alphabetic tokens longer than one character
        stripped = (t.strip() for t in lemmas)
        tokens = [t for t in stripped if t.isalpha() and len(t) > 1]
    else:
        # Tokenization with razdel
        tokens = [token.text for token in razdel.tokenize(text)]
        # Keep only alphabetic tokens longer than one character
        tokens = [t for t in tokens if t.isalpha() and len(t) > 1]

    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]

    return tokens

In [None]:
def build_bm25_index(corpus_tokens, k1=1.5, b=0.75):
    """Build the BM25 index"""
    return BM25Okapi(corpus_tokens, k1=k1, b=b)


def bm25_search(bm25_index, query_tokens, docids, top_k=100):
    """Retrieve the top-K documents by BM25 score"""
    if not query_tokens:
        return []

    scores = bm25_index.get_scores(query_tokens)
    k = min(top_k, len(scores))
    if k == 0:
        return []

    idx = np.argpartition(scores, -k)[-k:]
    top_idx = idx[np.argsort(scores[idx])[::-1]]
    return [docids[i] for i in top_idx]


def run_bm25_retrieval_improved(corpus_tokens, query_tokens_list, docids, qids,
                                 top_k=100, k1=1.5, b=0.75):
    """
    Full BM25 retrieval cycle (improved version)

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
        per_query_ms: list of (qid, latency_ms)
    """
    bm25 = build_bm25_index(corpus_tokens, k1=k1, b=b)

    outputs = []
    per_query_ms = []

    for qid, qtok in tqdm(list(zip(qids, query_tokens_list)), desc='BM25 Improved'):
        t0 = time.perf_counter()
        top_docids = bm25_search(bm25, qtok, docids, top_k=top_k)
        t1 = time.perf_counter()

        per_query_ms.append((qid, (t1 - t0) * 1000.0))
        outputs.append({'qid': qid, 'docids': top_docids})

    return outputs, per_query_ms


In [None]:
test_text = query_texts[0]
test_tokens = tokenize_improved(test_text, use_lemmatization=True, remove_stopwords=True)
print('Tokenization test:')
print(f'   Original: {test_text}')
print(f'   Tokens: {test_tokens[:15]}')

Tokenization test:
   Original: подъезд дома в оренбуржье на время выселят из-за разлива ртути
   Tokens: ['подъезд', 'дом', 'оренбуржье', 'время', 'выселять', 'разлив', 'ртуть']


In [None]:
bm25_lemm_dir = RUNS_DIR/'bm25_lemmatization'
bm25_lemm_dir.mkdir(parents=True, exist_ok=True)

# Note: with use_lemmatization=True, tokenize_improved splits the text via Mystem;
# razdel is only used in the non-lemmatized branch.
print('BM25 with Mystem lemmatization + stop words')
print('Tokenizer: razdel')
print('Normalization: Mystem')
print(f'Stop words: NLTK Russian ({len(STOPWORDS)} words)')
print('Parameters: k1=1.5, b=0.75\n')

# Tokenize the corpus
corpus_tokens_improved = []
for d in tqdm(docids, desc='Corpus'):
    tokens = tokenize_improved(corpus_text_map[d], use_lemmatization=True, remove_stopwords=True)
    corpus_tokens_improved.append(tokens)

# Tokenize the queries
query_tokens_improved = []
for text in tqdm(query_texts, desc='Queries'):
    tokens = tokenize_improved(text, use_lemmatization=True, remove_stopwords=True)
    query_tokens_improved.append(tokens)

# BM25
bm25_outputs_improved, bm25_latency_improved = run_bm25_retrieval_improved(
    corpus_tokens=corpus_tokens_improved,
    query_tokens_list=query_tokens_improved,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    k1=1.5,
    b=0.75
)

# Save the results
output_file = bm25_lemm_dir / f'top{TOP_K}.jsonl'
write_jsonl(output_file, bm25_outputs_improved)

print('\nBM25 with lemmatization')
print(f'Saved to: {bm25_lemm_dir}')
print(f'Results: {output_file}')


BM25 with Mystem lemmatization + stop words
Tokenizer: razdel
Normalization: Mystem
Stop words: NLTK Russian (151 words)
Parameters: k1=1.5, b=0.75



Corpus: 100%|██████████| 1000/1000 [00:16<00:00, 59.46it/s]
Queries: 100%|██████████| 200/200 [00:00<00:00, 971.06it/s]
BM25 Improved: 100%|██████████| 200/200 [00:01<00:00, 178.36it/s]



BM25 with lemmatization
Saved to: /content/runs/Гладышев/baseline/bm25_lemmatization
Results: /content/runs/Гладышев/baseline/bm25_lemmatization/top100.jsonl

In [None]:
print('BM25 - Stemming (Snowball) + stop words')

bm25_stem_dir = RUNS_DIR/'bm25_stemming'
bm25_stem_dir.mkdir(parents=True, exist_ok=True)

corpus_tokens_stem = []
for d in tqdm(docids, desc='Corpus (stem)'):
    tokens = tokenize_with_stemming(corpus_text_map[d], remove_stopwords=True)
    corpus_tokens_stem.append(tokens)

query_tokens_stem = []
for text in tqdm(query_texts, desc='Queries (stem)'):
    tokens = tokenize_with_stemming(text, remove_stopwords=True)
    query_tokens_stem.append(tokens)

bm25_outputs_stem, bm25_latency_stem = run_bm25_retrieval_improved(
    corpus_tokens=corpus_tokens_stem,
    query_tokens_list=query_tokens_stem,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    k1=1.5,
    b=0.75
)

output_file_stem = bm25_stem_dir / f'top{TOP_K}.jsonl'
write_jsonl(output_file_stem, bm25_outputs_stem)

print('\nBM25 with stemming')
print(f'Saved to: {bm25_stem_dir}')
print(f'Results: {output_file_stem}')

BM25 - Stemming (Snowball) + stop words


Corpus (stem): 100%|██████████| 1000/1000 [00:15<00:00, 64.09it/s]
Queries (stem): 100%|██████████| 200/200 [00:00<00:00, 595.23it/s]
BM25 Improved: 100%|██████████| 200/200 [00:01<00:00, 148.74it/s]


BM25 with stemming
Saved to: /content/runs/Гладышев/baseline/bm25_stemming
Results: /content/runs/Гладышев/baseline/bm25_stemming/top100.jsonl





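#### Rationale for the choices

Two ways of shrinking the vocabulary were tested to improve BM25: Snowball stemming and Mystem lemmatization. In both cases the text is lowercased and stop words are filtered out. Both approaches map the many inflected forms of a Russian word to a single term, which is critical for the term-frequency (TF) and inverse-document-frequency (IDF) statistics that enter the BM25 formula:
$$\mathrm{score}(D, Q) = \sum_{i} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{TF}(q_i, D)\,(k_1 + 1)}{\mathrm{TF}(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Note that the improved runs also use k1=1.5 instead of the baseline k1=1.2, so the gains reflect the preprocessing and the parameter change together. Both approaches improved retrieval quality: NDCG@10 rose from 0.930 (baseline) to 0.952 with stemming and to 0.962 with lemmatization: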
```
                  stage  precision@5  precision@10  recall@100   mrr@10   ndcg@5  ndcg@10  num_queries
     bm25_lemmatization        0.194        0.0975       0.980 0.958214 0.960773 0.962440          200
          bm25_stemming        0.192        0.0965       0.970 0.948214 0.950773 0.952440          200
                   bm25        0.189        0.0955       0.960 0.921500 0.926360 0.929586          200
```
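
To make the role of TF, IDF, `k1` and `b` in the formula above tangible, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration, not the `rank_bm25` implementation used in this notebook: the function name `bm25_scores` is hypothetical, and the smoothed IDF `log(1 + (N - n + 0.5) / (n + 0.5))` is an assumption (libraries differ in the exact IDF variant).

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization: long documents are penalized when b > 0.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            # Assumed IDF variant (smoothed, always positive).
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus of already lemmatized tokens.
docs = [
    ["разлив", "ртуть", "подъезд"],
    ["ртуть", "термометр"],
    ["погода", "оренбуржье", "прогноз"],
]
print(bm25_scores(["разлив", "ртуть"], docs))
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third scores zero. This also shows why lemmatization matters: without it, "ртути" in a query would not match "ртуть" in a document at all.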


## Dense Retrieval (Baseline)

**Baseline configuration:**
- Model: `sentence-transformers/paraphrase-MiniLM-L3-v2`

### Task 2 (3 points): Find the best embedding model

**What to try:**

**Russian-language models:**
- ai-forever/FRIDA

**Multilingual models:**
- intfloat/multilingual-e5-large
- intfloat/multilingual-e5-large-instruct

**Optional: fine-tuning**
- Use the qrels to build triplets: (query, pos_doc, neg_doc)
- Loss: MultipleNegativesRankingLoss or TripletLoss
- Validation: hold out 20% of the queries

**Criteria:**
- Any improvement in NDCG@10: +1 point
- Comparison of 3+ models with analysis: +1 point
- Fine-tuning with a description: +1 point (instead of the comparison)
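
The optional fine-tuning step above needs (query, pos_doc, neg_doc) triplets and a 20% hold-out of queries. The following is a minimal sketch of how they could be built from qrels rows; `build_triplets` and `split_queries` are hypothetical helper names, and negatives are sampled at random (an assumption: hard negatives mined from BM25 or dense results would usually be stronger).

```python
import random

def build_triplets(qrels, query_text, doc_text, n_neg=1, seed=42):
    """Build (query, positive, negative) text triplets.

    qrels: iterable of (qid, docid, score) rows; score > 0 means relevant.
    query_text / doc_text: dicts mapping id -> text.
    Negatives are drawn uniformly from documents not marked relevant for the query.
    """
    rng = random.Random(seed)
    positives = {}
    for qid, docid, score in qrels:
        if score > 0:
            positives.setdefault(qid, set()).add(docid)

    all_docids = sorted(doc_text)
    triplets = []
    for qid, pos_ids in positives.items():
        candidates = [d for d in all_docids if d not in pos_ids]
        if not candidates:
            continue
        for pos in sorted(pos_ids):
            for neg in rng.sample(candidates, min(n_neg, len(candidates))):
                triplets.append((query_text[qid], doc_text[pos], doc_text[neg]))
    return triplets

def split_queries(qids, holdout=0.2, seed=42):
    """Shuffle query ids and hold out a fraction of them for validation."""
    qids = sorted(qids)
    random.Random(seed).shuffle(qids)
    n_val = max(1, int(len(qids) * holdout))
    return qids[n_val:], qids[:n_val]

# Toy data standing in for the real qrels, queries and corpus.
toy_qrels = [("q1", "d1", 1), ("q2", "d3", 1), ("q2", "d2", 0)]
toy_queries = {"q1": "query one", "q2": "query two"}
toy_docs = {"d1": "doc one", "d2": "doc two", "d3": "doc three"}

train_qids, val_qids = split_queries(toy_queries, holdout=0.2)
train_qrels = [row for row in toy_qrels if row[0] in train_qids]
triplets = build_triplets(train_qrels, toy_queries, toy_docs)
print(len(train_qids), len(val_qids), len(triplets))
```

Splitting by query rather than by triplet keeps validation queries entirely unseen during training, which is what the hold-out in the task description asks for.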


In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [None]:
# =========================
# DENSE RETRIEVAL FUNCTIONS
# =========================

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from tqdm import tqdm

def load_dense_model(model_name):
    """Load the embedding model"""
    return SentenceTransformer(model_name)

def encode_texts(model, texts, batch_size=256, normalize=True, desc='Encoding'):
    """Encode texts into embeddings"""
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=normalize
    )
    return embeddings.astype('float32')

def build_faiss_index(embeddings, metric='ip'):
    """
    Build a FAISS index

    Args:
        embeddings: numpy array (N, dim)
        metric: 'ip' (inner product) or 'l2' (euclidean)
    """
    dim = embeddings.shape[1]
    if metric == 'ip':
        index = faiss.IndexFlatIP(dim)
    elif metric == 'l2':
        index = faiss.IndexFlatL2(dim)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    index.add(embeddings)
    return index

def faiss_search(index, query_embeddings, top_k=100):
    """Top-K search with FAISS"""
    D, I = index.search(query_embeddings, top_k)
    return D, I

def run_dense_retrieval(model_name, corpus_texts, query_texts, docids, qids,
                       top_k=100, batch_size=256, normalize=True):
    """
    Full dense retrieval cycle

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
        doc_embs: numpy array (corpus embeddings)
        q_embs: numpy array (query embeddings)
    """
    # Load the model
    model = load_dense_model(model_name)

    # Encode
    doc_embs = encode_texts(model, corpus_texts, batch_size=batch_size,
                           normalize=normalize, desc='Encoding corpus')
    q_embs = encode_texts(model, query_texts, batch_size=batch_size,
                         normalize=normalize, desc='Encoding queries')

    # FAISS index and search
    index = build_faiss_index(doc_embs, metric='ip')
    D, I = faiss_search(index, q_embs, top_k=top_k)

    # Build the outputs; FAISS pads with -1 when fewer than top_k hits exist
    outputs = []
    for qi, qid in enumerate(qids):
        top_docids = [docids[j] for j in I[qi] if j >= 0]
        outputs.append({'qid': qid, 'docids': top_docids})

    return outputs, doc_embs, q_embs


In [None]:
# =========================
# DENSE RETRIEVAL BASELINE
# =========================

dense_dir = RUNS_DIR/'dense'
dense_dir.mkdir(parents=True, exist_ok=True)

# BASELINE: change the model to improve results
DENSE_MODEL = 'sentence-transformers/paraphrase-MiniLM-L3-v2'
BATCH_SIZE = 256

print(f'Dense Retrieval (Baseline)')
print(f'Model: {DENSE_MODEL}\n')

# Prepare the texts
corpus_texts = [corpus_text_map[d] for d in docids]

# Run dense retrieval
dense_outputs, doc_embs, q_embs = run_dense_retrieval(
    model_name=DENSE_MODEL,
    corpus_texts=corpus_texts,
    query_texts=query_texts,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    batch_size=BATCH_SIZE,
    normalize=True
)

# Save
write_jsonl(dense_dir/f'top{TOP_K}.jsonl', dense_outputs)

print(f'\nDense done!')
print(f'Model: {DENSE_MODEL}')
print(f'Dimension: {doc_embs.shape[1]}')
print(f'Saved to: {dense_dir}')


Dense Retrieval (Baseline)
Model: sentence-transformers/paraphrase-MiniLM-L3-v2



modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/69.6M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Dense done!
Model: sentence-transformers/paraphrase-MiniLM-L3-v2
Dimension: 384
Saved to: /content/runs/Гладышев/baseline/dense


In [None]:
# =========================
# DENSE RETRIEVAL intfloat/multilingual-e5-large
# =========================

dense_e5_large_dir = RUNS_DIR/'dense-e5-large'
dense_e5_large_dir.mkdir(parents=True, exist_ok=True)

# Candidate model replacing the baseline
DENSE_MODEL = 'intfloat/multilingual-e5-large'
BATCH_SIZE = 256

print(f'Dense Retrieval (Baseline)')
print(f'Model: {DENSE_MODEL}\n')

# Prepare the texts
corpus_texts = [corpus_text_map[d] for d in docids]

# Run dense retrieval
dense_outputs, doc_embs, q_embs = run_dense_retrieval(
    model_name=DENSE_MODEL,
    corpus_texts=corpus_texts,
    query_texts=query_texts,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    batch_size=BATCH_SIZE,
    normalize=True
)

# Save
write_jsonl(dense_e5_large_dir/f'top{TOP_K}.jsonl', dense_outputs)

print(f'\nDense done!')
print(f'Model: {DENSE_MODEL}')
print(f'Dimension: {doc_embs.shape[1]}')
print(f'Saved to: {dense_e5_large_dir}')


Dense Retrieval (Baseline)
Model: intfloat/multilingual-e5-large



modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/690 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/201 [00:00<?, ?B/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Dense done!
Model: intfloat/multilingual-e5-large
Dimension: 1024
Saved to: /content/runs/Гладышев/baseline/dense-e5-large


In [None]:
# =========================
# DENSE RETRIEVAL intfloat/multilingual-e5-large-instruct
# =========================

dense_e5_large_instruct_dir = RUNS_DIR/'dense-e5-large-instruct'
dense_e5_large_instruct_dir.mkdir(parents=True, exist_ok=True)

# Candidate model replacing the baseline
DENSE_MODEL = 'intfloat/multilingual-e5-large-instruct'
BATCH_SIZE = 256

print(f'Dense Retrieval (Baseline)')
print(f'Model: {DENSE_MODEL}\n')

# Prepare the texts
corpus_texts = [corpus_text_map[d] for d in docids]

# Run dense retrieval
dense_outputs, doc_embs, q_embs = run_dense_retrieval(
    model_name=DENSE_MODEL,
    corpus_texts=corpus_texts,
    query_texts=query_texts,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    batch_size=BATCH_SIZE,
    normalize=True
)

# Save
write_jsonl(dense_e5_large_instruct_dir/f'top{TOP_K}.jsonl', dense_outputs)

print(f'\nDense done!')
print(f'Model: {DENSE_MODEL}')
print(f'Dimension: {doc_embs.shape[1]}')
print(f'Saved to: {dense_e5_large_instruct_dir}')


Dense Retrieval (Baseline)
Model: intfloat/multilingual-e5-large-instruct



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/128 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_xlm-roberta_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/690 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Dense done!
Model: intfloat/multilingual-e5-large-instruct
Dimension: 1024
Saved to: /content/runs/Гладышев/baseline/dense-e5-large-instruct


In [None]:
# del model

In [None]:
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

In [None]:
# =========================
# DENSE RETRIEVAL ai-forever/FRIDA
# =========================

dense_frida_dir = RUNS_DIR/'dense-frida'
dense_frida_dir.mkdir(parents=True, exist_ok=True)

# Candidate model replacing the baseline
DENSE_MODEL = 'ai-forever/FRIDA'
BATCH_SIZE = 32

print(f'Dense Retrieval (Baseline)')
print(f'Model: {DENSE_MODEL}\n')

# Prepare the texts
corpus_texts = [corpus_text_map[d] for d in docids]

# Run dense retrieval
dense_outputs, doc_embs, q_embs = run_dense_retrieval(
    model_name=DENSE_MODEL,
    corpus_texts=corpus_texts,
    query_texts=query_texts,
    docids=docids,
    qids=qids,
    top_k=TOP_K,
    batch_size=BATCH_SIZE,
    normalize=True
)

# Save
write_jsonl(dense_frida_dir/f'top{TOP_K}.jsonl', dense_outputs)

print(f'\nDense done!')
print(f'Model: {DENSE_MODEL}')
print(f'Dimension: {doc_embs.shape[1]}')
print(f'Saved to: {dense_frida_dir}')

Dense Retrieval (Baseline)
Model: ai-forever/FRIDA



Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches:   0%|          | 0/7 [00:00<?, ?it/s]


Dense done!
Model: ai-forever/FRIDA
Dimension: 1536
Saved to: /content/runs/Гладышев/baseline/dense-frida


#### Model comparison

Three encoder models were tested for building embeddings, two multilingual e5 variants and the Russian-oriented FRIDA:
- intfloat/multilingual-e5-large
- intfloat/multilingual-e5-large-instruct
- ai-forever/FRIDA

The e5 models produce 1024-dimensional embeddings and are relatively compact (checkpoints of roughly 1 to 2 GB), so they run on commonly available GPUs and gave the best combination of speed and quality here. FRIDA (1536-dimensional embeddings) is noticeably more resource-hungry: to fit it into the memory of a Tesla T4, the batch size had to be reduced from 256 to 32.

All three models improve dramatically on the baseline (NDCG@10 of 0.333) and differ from each other by less than 0.003 NDCG@10; `multilingual-e5-large` scores highest at 0.992:
```
                  stage  precision@5  precision@10  recall@100   mrr@10   ndcg@5  ndcg@10  num_queries
         dense-e5-large        0.199        0.1000       1.000 0.989881 0.990655 0.992321          200
            dense-frida        0.198        0.0995       0.995 0.988125 0.988155 0.989732          200
dense-e5-large-instruct        0.199        0.0995       1.000 0.987917 0.989653 0.989653          200
                  dense        0.081        0.0445       0.695 0.297312 0.319911 0.333158          200
```
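
One caveat about the comparison: the runs above pass raw text to every model. As far as I can tell from the public model cards, these models are meant to receive task prefixes (`query: ` / `passage: ` for e5, an `Instruct: ...\nQuery: ...` template for the instruct variant, `search_query: ` / `search_document: ` for FRIDA). The sketch below shows one way to add them; the helper `add_prefixes`, the `PREFIXES` table and the task description string are assumptions for illustration and should be checked against the model cards before relying on them.

```python
# (query prefix, document prefix) per model. These values reflect my reading
# of the public model cards and are an assumption to verify before use.
PREFIXES = {
    "intfloat/multilingual-e5-large": ("query: ", "passage: "),
    "ai-forever/FRIDA": ("search_query: ", "search_document: "),
}

# Hypothetical task description for the instruct variant (not from the notebook).
E5_INSTRUCT_TASK = "Given a news headline, retrieve the matching news article"

def add_prefixes(model_name, queries, documents, task=E5_INSTRUCT_TASK):
    """Return (queries, documents) with model-specific prompts prepended."""
    if model_name == "intfloat/multilingual-e5-large-instruct":
        # Instruct variant: only queries get an instruction, documents stay raw.
        queries = [f"Instruct: {task}\nQuery: {q}" for q in queries]
        return queries, list(documents)
    # Unknown models are returned unchanged.
    q_prefix, d_prefix = PREFIXES.get(model_name, ("", ""))
    return ([q_prefix + q for q in queries],
            [d_prefix + d for d in documents])

queries, documents = add_prefixes(
    "intfloat/multilingual-e5-large",
    ["разлив ртути в оренбуржье"],
    ["В подъезде жилого дома обнаружили разлив ртути."],
)
print(queries[0])
print(documents[0])
```

The prefixed lists could then be passed to `run_dense_retrieval` as `query_texts` and `corpus_texts`. Whether this changes the ranking of the three models on this dataset is an open question that only a rerun can answer.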

## RRF Fusion

Reciprocal Rank Fusion combines the BM25 and Dense rankings:

score(doc) = sum_over_rankers 1/(k + rank(doc))


The parameter k=60 (the standard value) controls damping: the larger k is, the smaller the score gap between adjacent ranks.
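
A small worked example of the RRF score on two toy rankings, as a sketch of the formula above (the helper name `rrf_scores` and the document ids are illustrative):

```python
from collections import defaultdict

def rrf_scores(rankings, k=60):
    """Sum 1 / (k + rank) over all rankings in which a document appears."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    return dict(scores)

bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

scores = rrf_scores([bm25_ranking, dense_ranking], k=60)
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Here `d1` (ranks 1 and 2) scores 1/61 + 1/62 and edges out `d3` (ranks 3 and 1) with 1/63 + 1/61, while documents seen by only one ranker fall to the bottom. Only ranks are used, so the incomparable BM25 and cosine scores never need to be normalized.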


In [None]:
# =========================
# RRF FUSION FUNCTIONS
# =========================

def rrf_fuse(rankings_list, k=60, top_k=100):
    """
    Reciprocal Rank Fusion over several rankings

    Args:
        rankings_list: list of [docid1, docid2, ...] (one list per ranker)
        k: RRF parameter (typically 60)
        top_k: number of documents to return

    Returns:
        fused_ranking: [docid1, docid2, ...]
    """
    scores = {}
    for ranking in rankings_list:
        for i, docid in enumerate(ranking, start=1):
            scores[docid] = scores.get(docid, 0.0) + 1.0 / (k + i)

    # Sort by score (descending)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [docid for docid, _ in ranked]

def run_rrf_fusion(rankings_dict_list, qids, k=60, top_k=100):
    """
    Apply RRF fusion to all queries

    Args:
        rankings_dict_list: list of dict {qid -> [docids]} (one per system)
        qids: list of qids
        k: RRF parameter
        top_k: number of documents to return

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
    """
    outputs = []
    for qid in qids:
        # Collect the rankings of all systems for this query
        rankings = [rd.get(qid, []) for rd in rankings_dict_list]
        fused = rrf_fuse(rankings, k=k, top_k=top_k)
        outputs.append({'qid': qid, 'docids': fused})

    return outputs


In [None]:
# =========================
# RRF FUSION
# =========================

rrf_dir = RUNS_DIR/'fusion_rrf'
rrf_dir.mkdir(parents=True, exist_ok=True)

RRF_K = 60

print(f'RRF Fusion (k={RRF_K})\n')

# Load the BM25 and Dense results (the baseline runs stored in bm25_dir and dense_dir)
bm25_map = {o['qid']: o['docids'] for o in read_jsonl(bm25_dir/f'top{TOP_K}.jsonl')}
dense_map = {o['qid']: o['docids'] for o in read_jsonl(dense_dir/f'top{TOP_K}.jsonl')}

# Fusion
fused_outputs = run_rrf_fusion(
    rankings_dict_list=[bm25_map, dense_map],
    qids=qids,
    k=RRF_K,
    top_k=TOP_K
)

# Save
write_jsonl(rrf_dir/f'top{TOP_K}.jsonl', fused_outputs)

print(f'\nRRF done!')
print(f'k={RRF_K}')
print(f'Saved to: {rrf_dir}')


RRF Fusion (k=60)


RRF done!
k=60
Saved to: /content/runs/Гладышев/baseline/fusion_rrf


## MMR Diversification

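Maximal Marginal Relevance selects the top-K documents by trading off relevance against diversity:

$$\mathrm{score}(d) = \lambda \cdot \mathrm{sim}(q, d) - (1 - \lambda) \cdot \max_{s \in S} \mathrm{sim}(d, s)$$

where $q$ is the query, $d$ is a candidate document, and $S$ is the set of documents already selected.

- λ=1.0: relevance only (same as plain search)
- λ=0.0: diversity only
- λ=0.5: balance (baseline)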
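
The effect of λ is easiest to see on a toy case. The sketch below is illustrative only: the 2-D unit vectors, the document ids, and the helper names `mmr`, `dot`, and `unit` are assumptions made for this example and are separate from the pipeline functions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(angle_deg):
    """Unit vector at the given angle, so dot products are cosine similarities."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def mmr(query, docs, lam, k):
    """Greedy MMR over {docid: unit vector}; returns docids in selection order."""
    selected, remaining = [], list(docs)
    while remaining and len(selected) < k:
        def score(d):
            # Penalty: similarity to the closest already selected document
            redundancy = max((dot(docs[d], docs[s]) for s in selected), default=0.0)
            return lam * dot(query, docs[d]) - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = unit(0)
# "a" and "a_dup" are near-duplicates close to the query; "b" is less relevant but different
docs = {"a": unit(10), "a_dup": unit(12), "b": unit(60)}

print(mmr(query, docs, lam=1.0, k=2))  # relevance only: keeps the near-duplicate
print(mmr(query, docs, lam=0.3, k=2))  # diversity-heavy: picks "b" second
```

With λ=1.0 the second pick is the near-duplicate, while a low λ replaces it with the less relevant but more distinct document.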





In [None]:
# =========================
# MMR DIVERSIFICATION FUNCTIONS
# =========================

def mmr_select(query_vec, candidate_docids, doc_embs, doc_index,
               lam=0.5, k=10):
    """
    Maximal Marginal Relevance selection

    Args:
        query_vec: query vector (numpy array)
        candidate_docids: list of candidate docids
        doc_embs: document embedding matrix (numpy array)
        doc_index: dict docid -> row index in doc_embs
        lam: relevance/diversity balance (0-1)
        k: number of documents to select

    Returns:
        selected: list of selected docids
    """
    selected = []
    remaining = candidate_docids.copy()

    while remaining and len(selected) < k:
        best_d = None
        best_score = -1e9

        for d in remaining:
            i = doc_index.get(d)
            if i is None:
                continue

            # Similarity to query
            sim = float(np.dot(query_vec, doc_embs[i]))

            # Max similarity to already selected
            if selected:
                sims = [float(np.dot(doc_embs[i], doc_embs[doc_index[s]]))
                       for s in selected if s in doc_index]
                div = max(sims) if sims else 0.0
            else:
                div = 0.0

            # MMR score
            score = lam * sim - (1.0 - lam) * div

            if score > best_score:
                best_score = score
                best_d = d

        # No remaining candidate has an embedding: stop, otherwise the loop never ends
        if best_d is None:
            break
        selected.append(best_d)
        remaining.remove(best_d)

    return selected

def run_mmr_diversification(candidate_outputs, q_embs, doc_embs, docids, qids,
                            lam=0.5, select_k=10, candidate_pool=20):
    """
    Apply MMR to every query

    Args:
        candidate_outputs: list of {'qid': str, 'docids': [str]}
        q_embs: query embedding matrix
        doc_embs: document embedding matrix
        docids: list of all docids (used to build the index)
        qids: list of qids
        lam: relevance/diversity balance
        select_k: number of documents to select
        candidate_pool: number of top candidates to choose from

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
    """
    # Build the docid -> position index
    doc_index = {d: i for i, d in enumerate(docids)}

    outputs = []
    for qi, qid in enumerate(tqdm(qids, desc='MMR')):
        # Take the top candidate_pool candidates
        cands = candidate_outputs[qi]['docids'][:candidate_pool]
        qv = q_embs[qi]

        # MMR selection
        mmr_list = mmr_select(qv, cands, doc_embs, doc_index, lam=lam, k=select_k)
        outputs.append({'qid': qid, 'docids': mmr_list})

    return outputs


In [None]:
# =========================
# MMR DIVERSIFICATION
# =========================

mmr_dir = RUNS_DIR/'mmr'
mmr_dir.mkdir(parents=True, exist_ok=True)

MMR_LAMBDA = 0.5
MMR_SELECT_K = 100  # upper bound; at most CANDIDATE_POOL documents can be returned
CANDIDATE_POOL = 20

print(f'MMR Diversification')
print(f'Lambda: {MMR_LAMBDA}')
print(f'Select top-{MMR_SELECT_K} from top-{CANDIDATE_POOL} candidates\n')

# Apply MMR
mmr_outputs = run_mmr_diversification(
    candidate_outputs=fused_outputs,
    q_embs=q_embs,
    doc_embs=doc_embs,
    docids=docids,
    qids=qids,
    lam=MMR_LAMBDA,
    select_k=MMR_SELECT_K,
    candidate_pool=CANDIDATE_POOL
)

# Save
write_jsonl(mmr_dir/f'top{MMR_SELECT_K}.jsonl', mmr_outputs)

print(f'\nMMR done!')
print(f'Lambda: {MMR_LAMBDA}')
print(f'Saved to: {mmr_dir}')


MMR Diversification
Lambda: 0.5
Select top-100 from top-20 candidates



MMR: 100%|██████████| 200/200 [00:00<00:00, 230.45it/s]


MMR done!
Lambda: 0.5
Saved to: /content/runs/Гладышев/baseline/mmr





## Cross-encoder Rerank (Baseline)

**Baseline configuration:**
- Model: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Trained on English MS MARCO (not multilingual), lightweight (~90 MB)
- Max length: 256 tokens
- Device: auto (GPU if available, else CPU)

### Task 3 (3 points): Improve reranking

**What to try:**

**Off-the-shelf models:**
- `cross-encoder/ms-marco-MiniLM-L-12-v2` (more layers)
- `amberoad/bert-multilingual-passage-reranking-msmarco`
- `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`

**Fine-tuning:**
- Build (query, doc, label) pairs from qrels
- Negative sampling: documents from the BM25/Dense top-100 that are not relevant
- Loss: CrossEntropyLoss or BCEWithLogitsLoss
- Validation: hold out 20% of the queries

**Optimization:**
- ONNX export + INT8 quantization
- Batch size tuning
- Latency measurement

**Grading criteria:**
- Any improvement in NDCG@10: +1 point
- Comparison of 2+ models or fine-tuning: +1 point
- ONNX/INT8 optimization: +1 point
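
The data-preparation part of the fine-tuning item can be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure this notebook ran: `build_training_pairs`, `split_by_query`, and the toy `queries`, `corpus`, `qrels`, and `candidates` dictionaries are hypothetical names, and the actual training loop is left out.

```python
import random

def build_training_pairs(qrels, candidates, queries, corpus,
                         negatives_per_query=4, seed=42):
    """Build (query_text, doc_text, label) triples with sampled hard negatives."""
    rng = random.Random(seed)
    pairs = []
    for qid, rels in qrels.items():
        positives = [d for d, r in rels.items() if r > 0]
        # Hard negatives: retrieved by the first stage but not labelled relevant
        pool = [d for d in candidates.get(qid, []) if rels.get(d, 0) <= 0]
        negatives = rng.sample(pool, min(negatives_per_query, len(pool)))
        pairs += [(queries[qid], corpus[d], 1.0) for d in positives if d in corpus]
        pairs += [(queries[qid], corpus[d], 0.0) for d in negatives if d in corpus]
    return pairs

def split_by_query(qids, holdout=0.2, seed=42):
    """Split query ids so validation queries never appear in training."""
    qids = sorted(qids)
    random.Random(seed).shuffle(qids)
    n_val = max(1, int(len(qids) * holdout))
    return qids[n_val:], qids[:n_val]

# Toy data (hypothetical)
queries = {"q1": "query one", "q2": "query two"}
corpus = {"d1": "doc 1", "d2": "doc 2", "d3": "doc 3", "d4": "doc 4"}
qrels = {"q1": {"d1": 1}, "q2": {"d3": 1}}
candidates = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d3", "d1"]}

pairs = build_training_pairs(qrels, candidates, queries, corpus, negatives_per_query=2)
train_qids, val_qids = split_by_query(["q1", "q2", "q3", "q4", "q5"])
print(len(pairs), len(train_qids), len(val_qids))
```

Splitting by query rather than by pair keeps every pair of a held-out query out of training, which is what the 20% hold-out item asks for.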


In [None]:
# =========================
# CROSS-ENCODER RERANK FUNCTIONS
# =========================

import torch
import time
import numpy as np
from sentence_transformers import CrossEncoder
from tqdm import tqdm

def load_cross_encoder(model_name, device='auto', max_length=256):
    """
    Load a cross-encoder model

    Args:
        model_name: model name
        device: 'cuda', 'cpu', or 'auto' (auto-detect)
        max_length: maximum sequence length in tokens
    """
    if device == 'auto':
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

    ce = CrossEncoder(model_name, device=device, max_length=max_length)

    # FP16 on GPU for faster inference
    if device == 'cuda':
        try:
            ce.model.half()
            print('Using FP16 on GPU')
        except Exception:
            print('FP16 not supported, using FP32')

    return ce, device

def batch_pairs(pairs, batch_size):
    """Yield batches of (query, doc) pairs"""
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i+batch_size]


In [None]:
def rerank_candidates(ce, query_text, candidate_docids, corpus_text_map,
                     batch_size=128):
    """
    Rerank the candidates for a single query

    Returns:
        ranked_docids: list of docids sorted by relevance
        scores: list of scores in the same order
    """
    pairs = [(query_text, corpus_text_map[d]) for d in candidate_docids]

    scores = []
    for chunk in batch_pairs(pairs, batch_size):
        s = ce.predict(chunk, batch_size=len(chunk),
                      show_progress_bar=False, convert_to_numpy=True)
        scores.extend(list(s))

    # Sort by score, descending
    order = np.argsort(np.array(scores))[::-1]
    ranked_docids = [candidate_docids[i] for i in order]
    ranked_scores = [scores[i] for i in order]

    return ranked_docids, ranked_scores


In [None]:
def run_cross_encoder_rerank(model_name, candidate_outputs, query_texts,
                             corpus_text_map, qids, top_k=100,
                             batch_size=128, device='auto', max_length=256):
    """
    Full cross-encoder reranking loop

    Args:
        model_name: cross-encoder model name
        candidate_outputs: list of {'qid': str, 'docids': [str]} (candidates to rerank)
        query_texts: list of query texts
        corpus_text_map: dict docid -> text
        qids: list of qids
        top_k: number of candidates to rerank
        batch_size: batch size
        device: 'cuda', 'cpu', or 'auto'
        max_length: maximum sequence length in tokens

    Returns:
        outputs: list of {'qid': str, 'docids': [str]}
        per_query_ms: list of (qid, latency_ms)
        device: the device that was used
    """
    # Load the model
    ce, device = load_cross_encoder(model_name, device=device, max_length=max_length)

    outputs = []
    per_query_ms = []

    for qi, qid in enumerate(tqdm(qids, desc='Reranking')):
        # Take the top_k candidates
        cands = candidate_outputs[qi]['docids'][:top_k]

        # Rerank and measure the time taken
        t0 = time.perf_counter()
        ranked_docids, scores = rerank_candidates(
            ce, query_texts[qi], cands, corpus_text_map, batch_size=batch_size
        )
        t1 = time.perf_counter()

        per_query_ms.append((qid, (t1 - t0) * 1000.0))
        outputs.append({'qid': qid, 'docids': ranked_docids})

    return outputs, per_query_ms, device

In [None]:
# =========================
# CROSS-ENCODER RERANK (BASELINE)
# =========================

rerank_dir = RUNS_DIR/'rerank'
rerank_dir.mkdir(parents=True, exist_ok=True)

# BASELINE: change the model to improve the results
RERANK_MODEL = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
MAX_LEN = 256
RERANK_BATCH = 128

print(f'Cross-encoder Rerank (Baseline)')
print(f'Model: {RERANK_MODEL}\n')

# Reranking
rerank_outputs, rerank_latency, device = run_cross_encoder_rerank(
    model_name=RERANK_MODEL,
    candidate_outputs=fused_outputs,
    query_texts=query_texts,
    corpus_text_map=corpus_text_map,
    qids=qids,
    top_k=TOP_K,
    batch_size=RERANK_BATCH,
    device='auto',
    max_length=MAX_LEN
)

# Save
write_jsonl(rerank_dir/f'top{TOP_K}.jsonl', rerank_outputs)

# Latency statistics
vals = [ms for _, ms in rerank_latency]
p50 = float(np.percentile(vals, 50))
p95 = float(np.percentile(vals, 95))

print(f'\nRerank done!')
print(f'Device: {device}')
print(f'Batch size: {RERANK_BATCH}')
print(f'Latency p50: {p50:.2f} ms')
print(f'Latency p95: {p95:.2f} ms')
print(f'Saved to: {rerank_dir}')


Cross-encoder Rerank (Baseline)
Model: cross-encoder/ms-marco-MiniLM-L-6-v2



Using FP16 on GPU


Reranking: 100%|██████████| 200/200 [01:17<00:00,  2.60it/s]


Rerank done!
Device: cuda
Batch size: 128
Latency p50: 295.20 ms
Latency p95: 870.74 ms
Saved to: /content/runs/Гладышев/baseline/rerank





In [None]:
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

In [None]:
# =========================
# CROSS-ENCODER RERANK cross-encoder/ms-marco-MiniLM-L-12-v2
# =========================

rerank_12_dir = RUNS_DIR/'rerank-12'
rerank_12_dir.mkdir(parents=True, exist_ok=True)

# Model under test
RERANK_MODEL = 'cross-encoder/ms-marco-MiniLM-L-12-v2'
MAX_LEN = 256
RERANK_BATCH = 128

print(f'Cross-encoder Rerank ms-marco-MiniLM-L-12-v2')
print(f'Model: {RERANK_MODEL}\n')

# Reranking
rerank_outputs, rerank_latency, device = run_cross_encoder_rerank(
    model_name=RERANK_MODEL,
    candidate_outputs=fused_outputs,
    query_texts=query_texts,
    corpus_text_map=corpus_text_map,
    qids=qids,
    top_k=TOP_K,
    batch_size=RERANK_BATCH,
    device='auto',
    max_length=MAX_LEN
)

# Save
write_jsonl(rerank_12_dir/f'top{TOP_K}.jsonl', rerank_outputs)

# Latency statistics
vals = [ms for _, ms in rerank_latency]
p50 = float(np.percentile(vals, 50))
p95 = float(np.percentile(vals, 95))

print(f'\nRerank done!')
print(f'Device: {device}')
print(f'Batch size: {RERANK_BATCH}')
print(f'Latency p50: {p50:.2f} ms')
print(f'Latency p95: {p95:.2f} ms')
print(f'Saved to: {rerank_12_dir}')

Cross-encoder Rerank ms-marco-MiniLM-L-12-v2
Model: cross-encoder/ms-marco-MiniLM-L-12-v2



Using FP16 on GPU


Reranking: 100%|██████████| 200/200 [01:05<00:00,  3.05it/s]


Rerank done!
Device: cuda
Batch size: 128
Latency p50: 302.62 ms
Latency p95: 473.35 ms
Saved to: /content/runs/Гладышев/baseline/rerank-12





In [None]:
# =========================
# CROSS-ENCODER RERANK cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
# =========================

rerank_mMiniLMv2_dir = RUNS_DIR/'rerank-mMiniLMv2'
rerank_mMiniLMv2_dir.mkdir(parents=True, exist_ok=True)

# Model under test
RERANK_MODEL = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
MAX_LEN = 256
RERANK_BATCH = 128

print(f'Cross-encoder Rerank mmarco-mMiniLMv2-L12-H384-v1')
print(f'Model: {RERANK_MODEL}\n')

# Reranking
rerank_outputs, rerank_latency, device = run_cross_encoder_rerank(
    model_name=RERANK_MODEL,
    candidate_outputs=fused_outputs,
    query_texts=query_texts,
    corpus_text_map=corpus_text_map,
    qids=qids,
    top_k=TOP_K,
    batch_size=RERANK_BATCH,
    device='auto',
    max_length=MAX_LEN
)

# Save
write_jsonl(rerank_mMiniLMv2_dir/f'top{TOP_K}.jsonl', rerank_outputs)

# Latency statistics
vals = [ms for _, ms in rerank_latency]
p50 = float(np.percentile(vals, 50))
p95 = float(np.percentile(vals, 95))

print(f'\nRerank done!')
print(f'Device: {device}')
print(f'Batch size: {RERANK_BATCH}')
print(f'Latency p50: {p50:.2f} ms')
print(f'Latency p95: {p95:.2f} ms')
print(f'Saved to: {rerank_mMiniLMv2_dir}')

Cross-encoder Rerank mmarco-mMiniLMv2-L12-H384-v1
Model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1



Using FP16 on GPU


Reranking: 100%|██████████| 200/200 [00:49<00:00,  4.01it/s]


Rerank done!
Device: cuda
Batch size: 128
Latency p50: 227.30 ms
Latency p95: 347.32 ms
Saved to: /content/runs/Гладышев/baseline/rerank-mMiniLMv2





In [None]:
def rerank_candidates(ce, query_text, candidate_docids, corpus_text_map,
                     batch_size=128):
    """
    Rerank the candidates for a single query

    Returns:
        ranked_docids: list of docids sorted by relevance
        scores: list of scores in the same order
    """
    pairs = [(query_text, corpus_text_map[d]) for d in candidate_docids]

    scores = []
    for chunk in batch_pairs(pairs, batch_size):
        s = ce.predict(chunk, batch_size=len(chunk),
                      show_progress_bar=False, convert_to_numpy=True)
        scores.extend(list(s))

    # Convert to NumPy arrays for efficient indexing
    scores_np = np.array(scores)
    candidate_docids_np = np.array(candidate_docids)

    # Sort by score, descending
    order = np.argsort(scores_np)[::-1]
    ranked_docids = candidate_docids_np[order].tolist()
    ranked_scores = scores_np[order].tolist()

    return ranked_docids, ranked_scores

In [None]:
# =========================
# CROSS-ENCODER RERANK amberoad/bert-multilingual-passage-reranking-msmarco
# =========================

rerank_bert_multilingual_dir = RUNS_DIR/'rerank-bert-multilingual'
rerank_bert_multilingual_dir.mkdir(parents=True, exist_ok=True)

# Model under test
RERANK_MODEL = 'amberoad/bert-multilingual-passage-reranking-msmarco'
MAX_LEN = 256
RERANK_BATCH = 128

print(f'Cross-encoder Rerank bert-multilingual-passage-reranking-msmarco')
print(f'Model: {RERANK_MODEL}\n')

# Reranking
rerank_outputs, rerank_latency, device = run_cross_encoder_rerank(
    model_name=RERANK_MODEL,
    candidate_outputs=fused_outputs,
    query_texts=query_texts,
    corpus_text_map=corpus_text_map,
    qids=qids,
    top_k=TOP_K,
    batch_size=RERANK_BATCH,
    device='auto',
    max_length=MAX_LEN
)

# Save
write_jsonl(rerank_bert_multilingual_dir/f'top{TOP_K}.jsonl', rerank_outputs)

# Latency statistics
vals = [ms for _, ms in rerank_latency]
p50 = float(np.percentile(vals, 50))
p95 = float(np.percentile(vals, 95))

print(f'\nRerank done!')
print(f'Device: {device}')
print(f'Batch size: {RERANK_BATCH}')
print(f'Latency p50: {p50:.2f} ms')
print(f'Latency p95: {p95:.2f} ms')
print(f'Saved to: {rerank_bert_multilingual_dir}')

Cross-encoder Rerank bert-multilingual-passage-reranking-msmarco
Model: amberoad/bert-multilingual-passage-reranking-msmarco

Using FP16 on GPU


Reranking: 100%|██████████| 200/200 [01:17<00:00,  2.56it/s]


Rerank done!
Device: cuda
Batch size: 128
Latency p50: 368.01 ms
Latency p95: 500.23 ms
Saved to: /content/runs/Гладышев/baseline/rerank-bert-multilingual





#### Model comparison

To improve reranking quality on Russian-language text, three cross-encoder models were tried:

- `cross-encoder/ms-marco-MiniLM-L-12-v2` (more layers)
- `amberoad/bert-multilingual-passage-reranking-msmarco`
- `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`

`cross-encoder/ms-marco-MiniLM-L-12-v2` is a deeper version of the baseline model. With `amberoad/bert-multilingual-passage-reranking-msmarco` I ran into a mismatch in how its inputs and outputs are represented, so it is left out of the results below.
`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` gave the largest improvement in the metrics.

A more substantial improvement, however, would require fine-tuning.

```
                  stage  precision@5  precision@10  recall@100   mrr@10   ndcg@5  ndcg@10  num_queries
       rerank_mMiniLMv2        0.195        0.0975       0.980 0.964500 0.967023 0.967023          200
              rerank_12        0.125        0.0720       0.980 0.544540 0.554891 0.586128          200
                 rerank        0.126        0.0720       0.980 0.525982 0.542850 0.572089          200
```
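
One plausible cause of the input/output mismatch noted for `amberoad/bert-multilingual-passage-reranking-msmarco` is offered here as an assumption, not something this notebook verified: the checkpoint appears to be a two-class classifier, so each (query, doc) pair would yield two logits instead of a single score, and sorting such rows directly does not produce a ranking. Under that assumption, the sketch below collapses each row into one relevance score with a softmax. The logit values and the choice of index 1 as the "relevant" class are illustrative.

```python
import math

def relevance_from_logits(logits_row):
    """Softmax probability of the 'relevant' class, assumed to sit at index 1."""
    m = max(logits_row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits_row]
    return exps[1] / sum(exps)

# Hypothetical raw outputs for three (query, doc) pairs: [not_relevant, relevant]
raw = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]]

scores = [relevance_from_logits(row) for row in raw]
order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(order)  # candidate indices from most to least relevant
```

The resulting one-dimensional `scores` list can then be sorted exactly like the single-score output of the other cross-encoders.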

## Evaluation

Every pipeline stage is evaluated with the following metrics:
- Precision@5, Precision@10
- Recall@100
- MRR@10
- NDCG@5, NDCG@10
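
As a quick sanity check of how these metrics behave, here is a worked example for one query with a single relevant document returned at rank 2. It is a standalone sketch: the helper names and the toy relevance list are made up, while the gain `2^rel - 1` and the `log2(rank + 1)` discount follow the evaluation code of this notebook.

```python
import math

def dcg(rels, k):
    """DCG@k with gain 2^rel - 1 and log2(rank + 1) discount (rank is 1-based)."""
    return sum((2.0 ** r - 1.0) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, judged, k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    best = dcg(sorted(judged, reverse=True), k)
    return dcg(rels, k) / best if best > 0 else 0.0

def reciprocal_rank(rels, k):
    for i, r in enumerate(rels[:k]):
        if r > 0:
            return 1.0 / (i + 1)
    return 0.0

# Relevance of the returned documents, top to bottom: the only relevant one is at rank 2
ranked_rels = [0, 1, 0, 0, 0]

precision_at_5 = sum(ranked_rels[:5]) / 5
rr = reciprocal_rank(ranked_rels, 10)
ndcg_10 = ndcg(ranked_rels, [1], 10)
print(precision_at_5, rr, round(ndcg_10, 4))
```

With only one relevant document for a query, Precision@5 cannot exceed 0.2 and Precision@10 cannot exceed 0.1, so values close to those ceilings would indicate near-perfect retrieval under that assumption.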


In [None]:
# =========================
# EVALUATION FUNCTIONS
# =========================

import math
import csv
import json
from collections import defaultdict

def load_qrels(path):
    """Load qrels from a TSV file"""
    qrels = defaultdict(dict)
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            parts = line.split('\t') if '\t' in line else line.split()
            if len(parts) < 3:
                continue
            try:
                rel = int(float(parts[-1]))
            except (ValueError, OverflowError):
                # Skip header rows and rows with a malformed relevance value
                continue

            # Support 3-col and TREC 4-col
            if len(parts) >= 4 and parts[1].lower() in {'0', 'q0'}:
                qid, docid = parts[0], parts[2]
            else:
                qid, docid = parts[0], parts[1]
            qrels[qid][docid] = rel

    return qrels

def precision_at_k(binary_rels, k):
    """Precision@K: relevant hits in the top K divided by K (missing positions count as misses)"""
    return 0.0 if k <= 0 else sum(binary_rels[:k]) / float(k)

def recall_at_k(binary_rels, total_rel, k):
    """Recall@K"""
    if total_rel <= 0:
        return 0.0
    k = min(k, len(binary_rels))
    return sum(binary_rels[:k]) / float(total_rel)

def reciprocal_rank_at_k(binary_rels, k):
    """MRR@K"""
    k = min(k, len(binary_rels))
    for i in range(k):
        if binary_rels[i] > 0:
            return 1.0 / float(i + 1)
    return 0.0

def dcg_at_k(graded_rels, k):
    """DCG@K"""
    k = min(k, len(graded_rels))
    dcg = 0.0
    for i in range(k):
        rel = graded_rels[i]
        gain = (2.0 ** rel - 1.0)
        denom = math.log2(i + 2)
        dcg += gain / denom
    return dcg

def ndcg_at_k(graded_rels, ideal, k):
    """NDCG@K"""
    a = dcg_at_k(graded_rels, k)
    b = dcg_at_k(ideal, k)
    return 0.0 if b <= 0 else a / b

def evaluate_ranking(qrels, predictions, p_at=(5, 10), r_at=100,
                    mrr_at=10, ndcg_at=(5, 10)):
    """
    Full evaluation of a ranking

    Args:
        qrels: dict {qid -> {docid -> relevance}}
        predictions: dict {qid -> [docid1, docid2, ...]}
        p_at: tuple of K values for Precision@K
        r_at: K for Recall@K
        mrr_at: K for MRR@K
        ndcg_at: tuple of K values for NDCG@K

    Returns:
        aggregated: dict of metrics averaged over queries
        per_query: dict {qid -> metrics}
    """
    all_qids = set(qrels.keys()) | set(predictions.keys())
    per_q = {}
    zero_rel_q, missing_pred = 0, 0

    for qid in all_qids:
        qrels_for_q = qrels.get(qid, {})
        ranked_docids = predictions.get(qid, [])

        if not ranked_docids:
            missing_pred += 1

        total_rel = sum(1 for r in qrels_for_q.values() if r > 0)
        if total_rel == 0:
            zero_rel_q += 1

        graded_seq = [int(qrels_for_q.get(d, 0)) for d in ranked_docids]
        binary_seq = [1 if r > 0 else 0 for r in graded_seq]
        ideal_graded = sorted(qrels_for_q.values(), reverse=True)

        p5 = precision_at_k(binary_seq, p_at[0])
        p10 = precision_at_k(binary_seq, p_at[1])
        r100 = recall_at_k(binary_seq, total_rel, r_at)
        mrr10 = reciprocal_rank_at_k(binary_seq, mrr_at)
        ndcg5 = ndcg_at_k(graded_seq, ideal_graded, ndcg_at[0]) if ideal_graded else 0.0
        ndcg10 = ndcg_at_k(graded_seq, ideal_graded, ndcg_at[1]) if ideal_graded else 0.0

        per_q[qid] = {
            'precision@5': p5,
            'precision@10': p10,
            'recall@100': r100,
            'mrr@10': mrr10,
            'ndcg@5': ndcg5,
            'ndcg@10': ndcg10,
            'total_relevant': float(total_rel),
        }

    def safe_mean(vs):
        vs = list(vs)
        return 0.0 if not vs else sum(vs) / float(len(vs))

    agg = {
        'precision@5': safe_mean(m['precision@5'] for m in per_q.values()),
        'precision@10': safe_mean(m['precision@10'] for m in per_q.values()),
        'recall@100': safe_mean(m['recall@100'] for m in per_q.values()),
        'mrr@10': safe_mean(m['mrr@10'] for m in per_q.values()),
        'ndcg@5': safe_mean(m['ndcg@5'] for m in per_q.values()),
        'ndcg@10': safe_mean(m['ndcg@10'] for m in per_q.values()),
        'counts': {
            'total_qids': len(all_qids),
            'qids_with_rels': sum(1 for q in per_q.values() if q['total_relevant'] > 0),
            'qids_zero_rels': zero_rel_q,
            'qids_missing_predictions': missing_pred,
        },
    }

    return agg, per_q

def save_metrics(output_dir, aggregated, per_query):
    """Save metrics to JSON and CSV"""
    # metrics.json
    with open(output_dir/'metrics.json', 'w', encoding='utf-8') as f:
        json.dump(aggregated, f, indent=2, ensure_ascii=False)

    # per_query.csv
    with open(output_dir/'per_query.csv', 'w', encoding='utf-8', newline='') as f:
        w = csv.writer(f)
        w.writerow(['qid', 'precision@5', 'precision@10', 'recall@100',
                   'mrr@10', 'ndcg@5', 'ndcg@10', 'total_relevant'])
        for qid, m in sorted(per_query.items()):
            w.writerow([
                qid,
                f"{m['precision@5']:.6f}",
                f"{m['precision@10']:.6f}",
                f"{m['recall@100']:.6f}",
                f"{m['mrr@10']:.6f}",
                f"{m['ndcg@5']:.6f}",
                f"{m['ndcg@10']:.6f}",
                int(m['total_relevant'])
            ])

print('Evaluation functions loaded')


Evaluation functions loaded


In [None]:
# =========================
# EVALUATION OF ALL STAGES
# =========================

print('Evaluating all stages...\n')

# Load qrels
qrels = load_qrels(qrels_out)

# Stages to evaluate
stages = [
    ('bm25', bm25_dir, TOP_K),
    ('bm25_stemming', bm25_stem_dir, TOP_K),
    ('bm25_lemmatization', bm25_lemm_dir, TOP_K),
    ('dense', dense_dir, TOP_K),
    ('dense-e5-large', dense_e5_large_dir, TOP_K),
    ('dense-e5-large-instruct', dense_e5_large_instruct_dir, TOP_K),
    ('dense-frida', dense_frida_dir, TOP_K),
    ('fusion_rrf', rrf_dir, TOP_K),
    ('mmr', mmr_dir, MMR_SELECT_K),
    ('rerank', rerank_dir, TOP_K),
    ('rerank_12', rerank_12_dir, TOP_K),
    # ('rerank_bert_multilingual', rerank_bert_multilingual_dir, TOP_K),
    ('rerank_mMiniLMv2', rerank_mMiniLMv2_dir, TOP_K)
]

stage_metrics = {}

for stage_name, stage_dir, k in stages:
    print(f'Evaluating {stage_name}...')

    # Load predictions
    preds_path = stage_dir / f'top{k}.jsonl'
    if not preds_path.exists():
        print(f'  Skipped (file not found)')
        continue

    preds = {o['qid']: o['docids'] for o in read_jsonl(preds_path)}

    # Evaluate
    agg, per_q = evaluate_ranking(qrels, preds)

    # Save
    save_metrics(stage_dir, agg, per_q)

    stage_metrics[stage_name] = agg

    # Report
    print(f"  NDCG@10: {agg['ndcg@10']:.4f}")
    print(f"  MRR@10:  {agg['mrr@10']:.4f}")
    print(f"  P@5:     {agg['precision@5']:.4f}")
    print()

print('Evaluation complete!')


Evaluating all stages...

Evaluating bm25...
  NDCG@10: 0.9296
  MRR@10:  0.9215
  P@5:     0.1890

Evaluating bm25_stemming...
  NDCG@10: 0.9524
  MRR@10:  0.9482
  P@5:     0.1920

Evaluating bm25_lemmatization...
  NDCG@10: 0.9624
  MRR@10:  0.9582
  P@5:     0.1940

Evaluating dense...
  NDCG@10: 0.3332
  MRR@10:  0.2973
  P@5:     0.0810

Evaluating dense-e5-large...
  NDCG@10: 0.9923
  MRR@10:  0.9899
  P@5:     0.1990

Evaluating dense-e5-large-instruct...
  NDCG@10: 0.9897
  MRR@10:  0.9879
  P@5:     0.1990

Evaluating dense-frida...
  NDCG@10: 0.9897
  MRR@10:  0.9881
  P@5:     0.1980

Evaluating fusion_rrf...
  NDCG@10: 0.6763
  MRR@10:  0.6338
  P@5:     0.1360

Evaluating mmr...
  NDCG@10: 0.9522
  MRR@10:  0.9513
  P@5:     0.1910

Evaluating rerank...
  NDCG@10: 0.5721
  MRR@10:  0.5260
  P@5:     0.1260

Evaluating rerank_12...
  NDCG@10: 0.5861
  MRR@10:  0.5445
  P@5:     0.1250

Evaluating rerank_mMiniLMv2...
  NDCG@10: 0.9670
  MRR@10:  0.9645
  P@5:     0.1950

Evaluation complete!

## Leaderboard

Comparison of all pipeline stages by metric


In [None]:
# =========================
# LEADERBOARD BY STAGE
# =========================

import pandas as pd

print('Building leaderboard...\n')

# Collect metrics
leaderboard_data = []
for stage_name, metrics in stage_metrics.items():
    row = {
        'stage': stage_name,
        'precision@5': metrics['precision@5'],
        'precision@10': metrics['precision@10'],
        'recall@100': metrics['recall@100'],
        'mrr@10': metrics['mrr@10'],
        'ndcg@5': metrics['ndcg@5'],
        'ndcg@10': metrics['ndcg@10'],
        'num_queries': metrics['counts']['total_qids'],
    }

    # Add latency if available
    stage_dir = RUNS_DIR / stage_name
    lat_path = stage_dir / 'latency.csv'
    if lat_path.exists():
        try:
            with open(lat_path, 'r', encoding='utf-8') as f:
                rows = list(csv.reader(f))
            for r in rows:
                if len(r) >= 2:
                    if r[0] == 'p50_ms':
                        row['latency_p50_ms'] = float(r[1])
                    elif r[0] == 'p95_ms':
                        row['latency_p95_ms'] = float(r[1])
        except (OSError, ValueError):
            # Latency is optional: skip unreadable or malformed files
            pass

    leaderboard_data.append(row)

# Build the DataFrame
df = pd.DataFrame(leaderboard_data)

# Sort by NDCG@10 (desc), then MRR@10, then P@5
df = df.sort_values(
    ['ndcg@10', 'mrr@10', 'precision@5'],
    ascending=[False, False, False]
).reset_index(drop=True)

# Save
leaderboard_csv = RUNS_DIR / 'leaderboard_per_stage.csv'
df.to_csv(leaderboard_csv, index=False)

print(f'Leaderboard saved to: {leaderboard_csv}\n')

# Display
print('=' * 100)
print('LEADERBOARD (sorted by NDCG@10)')
print('=' * 100)
print(df.to_string(index=False))
print('=' * 100)


Building leaderboard...

Leaderboard saved to: /content/runs/Гладышев/baseline/leaderboard_per_stage.csv

LEADERBOARD (sorted by NDCG@10)
                  stage  precision@5  precision@10  recall@100   mrr@10   ndcg@5  ndcg@10  num_queries
         dense-e5-large        0.199        0.1000       1.000 0.989881 0.990655 0.992321          200
            dense-frida        0.198        0.0995       0.995 0.988125 0.988155 0.989732          200
dense-e5-large-instruct        0.199        0.0995       1.000 0.987917 0.989653 0.989653          200
       rerank_mMiniLMv2        0.195        0.0975       0.980 0.964500 0.967023 0.967023          200
     bm25_lemmatization        0.194        0.0975       0.980 0.958214 0.960773 0.962440          200
          bm25_stemming        0.192        0.0965       0.970 0.948214 0.950773 0.952440          200
                    mmr        0.191        0.0955       0.955 0.951250 0.952153 0.952153          200
                   bm25        0.189  