# Baseline

1. Load the data
2. Truncate each text to a maximum length
3. Compute text embeddings
4. Compute the similarity measure
5. Select the top 5 texts most relevant to each question

- Preprocessing: **truncating text to a maximum length**
- Embedding model: **ai-forever/ru-en-RoSBERTa**
- Similarity measure: **cosine similarity**
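
As a reference for step 4, the similarity measure can be written out explicitly. The symbols $q$ (question embedding) and $d$ (page embedding) are notation introduced here for illustration:

$$
\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
$$

When both embeddings are L2-normalized (the model ends with a `Normalize()` layer and `encode` is called with `normalize_embeddings=True`), the denominator equals 1 and the similarity reduces to the dot product $q \cdot d$. Cosine distance is $1 - \operatorname{sim}(q, d)$, so picking the five largest similarities is the same as picking the five smallest distances.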

# Import

In [24]:
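import os

# Walk up the directory tree until the project root "alfa-hack-rag" is the working directory.
while os.path.basename(os.getcwd()) != "alfa-hack-rag":
    parent = os.path.dirname(os.getcwd())
    if parent == os.getcwd():  # reached the filesystem root without finding the project
        raise FileNotFoundError('Project root "alfa-hack-rag" not found')
    os.chdir(parent)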

In [25]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# Data

In [26]:
df_websites = pd.read_csv("data/websites_updated.csv")
df_questions = pd.read_csv("data/questions_clean.csv")
df_sample = pd.read_csv("data/sample_submission.csv")

In [27]:
df_websites

Unnamed: 0,web_id,url,kind,title,text
0,1,https://alfabank.ru/,html,"Альфа-Банк - кредитные и дебетовые карты, кред...",Рассчитайте выгоду\nРасчёт калькулятора предва...
1,2,https://alfabank.ru/a-club/,html,А-Клуб. Деньги имеют значение,Брокерские услуги\nОткрытие брокерского счёта ...
2,3,https://alfabank.ru/a-club/ultimate/,html,А-Клуб. Деньги имеют значение,Хотите получить больше информации?\nПозвоните ...
3,4,https://alfabank.ru/actions/rules/,html,Скидки по картам,Правила проведения Акции «Альфа Пятница. Бараб...
4,5,https://alfabank.ru/alfafuture/,html,Альфа‑Будущее: Платформа для развития студенто...,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...
...,...,...,...,...,...
1933,1934,https://alfabank.ru/help/t/retail/alfaforbusin...,html,Как вернуть деньги покупателю и как рассчитыва...,Возврат денег покупателю можно оформить через ...
1934,1935,https://alfabank.ru/help/articles/investments/...,html,Как вывести деньги с брокерского счёта — Альфа...,Вывести деньги с брокерского счёта можно на ка...
1935,1936,https://alfabank.ru/make-money/investments/hel...,html,Пополнение и вывод средств — Альфа-Инвестиции,Вывести деньги с брокерского счёта можно на сл...
1936,1937,https://alfabank.ru/everyday/smart/,html,Альфа-Смарт — подписка Альфа-Банка,"Альфа-Смарт — семейная подписка, запущенная в ..."


# Preprocessing

In [28]:
def shorten(text, max_length=2048):
    # Truncate by characters, not tokens; the model applies its own token limit later.
    return text[:max_length]
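
Slicing at a fixed character offset can cut the last word in half. A minimal sketch of a variant that backs off to the previous whitespace is shown below; `shorten_at_word` is a hypothetical helper and not part of the baseline, and whether it changes retrieval quality is untested here.

```python
import re


def shorten_at_word(text: str, max_length: int = 2048) -> str:
    """Truncate to at most max_length characters without splitting the last word."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # If the next character is whitespace, the cut already ends on a word boundary.
    if text[max_length].isspace():
        return cut.rstrip()
    # Otherwise drop the trailing partial word, if there is any whitespace to back off to.
    match = re.search(r"\s+\S*$", cut)
    return cut[:match.start()] if match else cut


print(shorten_at_word("hello world", max_length=8))
```

With no whitespace inside the cut (a single very long token), the function falls back to the plain character slice.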

# Compute sentence embeddings

In [29]:
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

Some weights of RobertaModel were not initialized from the model checkpoint at ai-forever/ru-en-RoSBERTa and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
model.parameters

<bound method Module.parameters of SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)>

In [31]:
model.get_sentence_embedding_dimension()

1024

In [6]:
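def embed_text(df, model_name, target_col, batch_size=32):
    """Return a copy of df with one embedding_<i> column per embedding dimension."""
    model = SentenceTransformer(model_name)

    # Replace missing values with empty strings and truncate before encoding.
    texts = df[target_col].fillna('').astype(str).apply(shorten).tolist()

    result = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding"):
        batch = texts[i:i+batch_size]
        emb = model.encode(batch,
                          convert_to_numpy=True,
                          normalize_embeddings=True)
        result.append(emb)

    result = np.vstack(result)

    # Attach all embedding columns in a single concat; inserting them into an
    # existing frame fragments it and triggers a pandas PerformanceWarning.
    embedding_columns = [f'embedding_{i}' for i in range(result.shape[1])]
    df_embeddings = pd.DataFrame(result, columns=embedding_columns, index=df.index)

    return pd.concat([df, df_embeddings], axis=1)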
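
One possible refinement, stated as an assumption to verify against the ai-forever/ru-en-RoSBERTa model card: the model is, as far as I recall, intended to be used with task prefixes such as `search_query: ` for questions and `search_document: ` for pages. The sketch below only shows how such prefixes could be prepended before encoding; `add_prefix` is a hypothetical helper, the document text is made up, and the baseline above does not use prefixes.

```python
from typing import Iterable, List

# Assumed prefixes; check them against the model card before relying on them.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def add_prefix(texts: Iterable[str], prefix: str) -> List[str]:
    """Prepend a task prefix to every text (hypothetical helper)."""
    return [prefix + text for text in texts]


queries = add_prefix(["Номер счета"], QUERY_PREFIX)
# Made-up page text, for illustration only.
documents = add_prefix(["Реквизиты счёта можно найти в приложении"], DOCUMENT_PREFIX)
print(queries[0])
print(documents[0])
```

If prefixes are used, it is probably safer to add them after `shorten` so that truncation never removes part of the prefix.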

In [7]:
sites = embed_text(df_websites, 'ai-forever/ru-en-RoSBERTa', 'text')
sites

Encoding: 100%|██████████| 61/61 [01:41<00:00,  1.66s/it]

Unnamed: 0,web_id,url,kind,title,text,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,...,embedding_1014,embedding_1015,embedding_1016,embedding_1017,embedding_1018,embedding_1019,embedding_1020,embedding_1021,embedding_1022,embedding_1023
0,1,https://alfabank.ru/,html,"Альфа-Банк - кредитные и дебетовые карты, кред...",Рассчитайте выгоду\nРасчёт калькулятора предва...,0.013509,0.006796,0.009510,0.048672,-0.007688,...,-0.000641,0.038122,-0.028294,0.028959,-0.000162,-0.014146,0.017700,-0.022414,0.006091,-0.044830
1,2,https://alfabank.ru/a-club/,html,А-Клуб. Деньги имеют значение,Брокерские услуги\nОткрытие брокерского счёта ...,-0.041107,-0.001472,0.026361,0.047391,0.001708,...,-0.014570,0.025882,-0.030591,0.038744,-0.025966,0.009511,0.025547,0.022768,0.004238,-0.004135
2,3,https://alfabank.ru/a-club/ultimate/,html,А-Клуб. Деньги имеют значение,Хотите получить больше информации?\nПозвоните ...,-0.009065,0.000051,0.015965,0.033519,0.024208,...,-0.008801,0.053359,0.002976,0.037233,-0.009688,0.003119,0.036557,-0.028921,0.013264,-0.026199
3,4,https://alfabank.ru/actions/rules/,html,Скидки по картам,Правила проведения Акции «Альфа Пятница. Бараб...,-0.058902,0.019763,-0.004363,0.019863,0.008751,...,-0.013358,0.035795,0.018340,0.012307,-0.024570,0.010721,0.039160,-0.008672,-0.030079,-0.055393
4,5,https://alfabank.ru/alfafuture/,html,Альфа‑Будущее: Платформа для развития студенто...,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...,-0.007574,-0.011450,-0.009981,0.013079,-0.035314,...,0.003785,0.007196,-0.043528,0.034571,-0.039345,0.013316,0.050854,-0.004953,0.013118,-0.021082
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933,1934,https://alfabank.ru/help/t/retail/alfaforbusin...,html,Как вернуть деньги покупателю и как рассчитыва...,Возврат денег покупателю можно оформить через ...,0.022462,0.067073,-0.015545,0.043786,-0.013131,...,-0.009115,0.041333,0.004006,0.003709,-0.017756,-0.011623,0.008138,0.001285,0.033338,-0.015912
1934,1935,https://alfabank.ru/help/articles/investments/...,html,Как вывести деньги с брокерского счёта — Альфа...,Вывести деньги с брокерского счёта можно на ка...,-0.022150,0.036436,0.019438,0.052357,0.005636,...,0.007010,0.026344,0.014384,0.004413,-0.008854,-0.006162,0.014489,-0.003948,0.010612,-0.030656
1935,1936,https://alfabank.ru/make-money/investments/hel...,html,Пополнение и вывод средств — Альфа-Инвестиции,Вывести деньги с брокерского счёта можно на сл...,-0.028034,0.040042,0.013712,0.034763,-0.003354,...,-0.001437,0.036733,0.023656,0.027383,-0.023155,-0.003702,-0.003167,0.002208,-0.001758,-0.033719
1936,1937,https://alfabank.ru/everyday/smart/,html,Альфа-Смарт — подписка Альфа-Банка,"Альфа-Смарт — семейная подписка, запущенная в ...",-0.045740,0.027985,-0.025107,0.047942,0.012523,...,-0.025771,0.019342,-0.014856,-0.012152,0.005634,-0.016059,0.019125,-0.005863,0.016272,-0.009939


In [8]:
questions = embed_text(df_questions, 'ai-forever/ru-en-RoSBERTa', 'query')
questions

Encoding: 100%|██████████| 219/219 [00:28<00:00,  7.73it/s]

Unnamed: 0,q_id,query,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,...,embedding_1014,embedding_1015,embedding_1016,embedding_1017,embedding_1018,embedding_1019,embedding_1020,embedding_1021,embedding_1022,embedding_1023
0,1,Номер счета,-0.012877,0.027439,0.023905,0.036862,-0.000816,0.006027,0.022327,-0.005227,...,-0.019661,0.034770,0.031024,-0.002712,-0.010064,0.013632,0.023903,0.000271,-0.011161,-0.040607
1,2,Где узнать бик и счёт,0.000214,0.024718,0.014103,0.019982,-0.006372,-0.017389,0.003129,0.005741,...,-0.007185,0.010509,0.007659,-0.046301,-0.034807,0.017573,0.001267,-0.010014,-0.009135,-0.021278
2,3,Мне не приходят коды для подтверждения данной ...,0.014703,0.041297,0.016088,0.008821,0.074290,-0.019158,-0.040166,-0.024152,...,-0.065829,0.044673,0.062866,0.000840,-0.009016,0.011175,0.031374,0.015261,0.028817,0.004347
3,4,"Оформила рассрочку ,но уведомлений никаких не ...",0.028114,-0.004120,0.015379,0.035578,-0.015421,0.057478,0.017070,-0.017116,...,0.002153,0.035717,-0.003090,-0.032001,-0.043637,0.012864,0.056545,0.013781,0.030093,-0.014530
4,5,"Здравствуйте, когда смогу пользоваться кредитн...",0.008561,0.012064,0.013061,0.023878,0.014897,0.014463,-0.010186,0.026527,...,0.015925,0.020984,0.030830,-0.024779,0.002765,-0.036021,0.055920,-0.047593,0.062412,-0.009168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6972,6973,"Здравствуйте, оплатил вчера ЖКХ а кэшбек не на...",0.012439,0.067618,0.017861,0.019968,-0.013472,0.028462,-0.031961,-0.033061,...,0.002918,0.004825,0.034997,-0.050951,-0.042346,-0.006414,-0.025997,0.019675,-0.009169,-0.064054
6973,6974,"Здравствуйте, можно ли заказать реквизиты бан...",0.039468,0.014551,0.026186,0.056659,-0.040782,-0.045271,-0.000353,0.000982,...,0.008389,0.042277,0.019400,-0.000584,-0.035703,0.016717,0.016312,-0.013048,0.008009,-0.008127
6974,6975,"Здравствуйте, подскажите пожалуйста где я могу...",0.009843,-0.020083,-0.014177,0.062453,0.014680,0.038489,0.020040,-0.036071,...,0.015107,0.037006,-0.002694,-0.019032,-0.036190,0.042568,0.055310,0.002818,0.024842,-0.044323
6975,6976,Реквизиты для оплаты номера карты,0.031896,0.024385,0.020896,0.060969,0.010456,-0.017141,0.003675,0.007107,...,-0.017462,0.059177,0.033081,-0.009632,-0.015507,0.012784,0.024969,0.004433,-0.036257,-0.020282


# Similarity

In [9]:
def cosine_sim(df_sites, df_question):
    cols_sites = [col for col in df_sites.columns if col.startswith('embedding_')]
    cols_question = [col for col in df_question.columns if col.startswith('embedding_')]

    site_embeddings = df_sites[cols_sites].values
    question_embeddings = df_question[cols_question].values

    cosine_sim_matrix = cosine_similarity(question_embeddings, site_embeddings)

    results = []
    
    for i, (q_idx, question_row) in enumerate(tqdm(df_question.iterrows(), total=len(df_question), desc="Processing")):
        for j, (s_idx, site_row) in enumerate(df_sites.iterrows()):
            cosine_sim = cosine_sim_matrix[i, j]
            
            results.append({
                'q_id': question_row['q_id'],
                'web_id': site_row['web_id'],
                'query': question_row['query'],
                'site_text': site_row['text'],
                'cosine_similarity': cosine_sim
            })
    
    # Build the final DataFrame
    cosine_df = pd.DataFrame(results)

    return cosine_df

In [10]:
cosine_df = cosine_sim(sites, questions)

Processing: 100%|██████████| 6977/6977 [06:50<00:00, 16.99it/s]
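
The nested `iterrows` loops take several minutes (6:50 in the run shown) because they build about 13.5 million dictionaries one at a time. A minimal sketch of a vectorized alternative follows: it flattens the similarity matrix with `np.repeat` and `np.tile`. The function name `pairwise_long_frame` and the toy values are illustrative, and the `query` and `site_text` columns are left out here on the assumption that they can be merged back by `q_id` and `web_id` when needed.

```python
import numpy as np
import pandas as pd


def pairwise_long_frame(q_ids, web_ids, sim_matrix):
    """Flatten a (n_questions, n_sites) similarity matrix into long format."""
    sim_matrix = np.asarray(sim_matrix)
    n_questions, n_sites = sim_matrix.shape
    return pd.DataFrame({
        # Each question id is repeated once per site: 1, 1, 1, 2, 2, 2, ...
        "q_id": np.repeat(np.asarray(q_ids), n_sites),
        # The full list of site ids is repeated once per question: 10, 20, 30, 10, ...
        "web_id": np.tile(np.asarray(web_ids), n_questions),
        # Row-major flattening matches the order of the two columns above.
        "cosine_similarity": sim_matrix.ravel(),
    })


toy = pairwise_long_frame([1, 2], [10, 20, 30],
                          [[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]])
print(toy)
```

With the real data this would be called with the `q_id` and `web_id` columns and the matrix returned by `cosine_similarity`.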


In [11]:
cosine_df

Unnamed: 0,q_id,web_id,query,site_text,cosine_similarity
0,1,1,Номер счета,Рассчитайте выгоду\nРасчёт калькулятора предва...,0.585289
1,1,2,Номер счета,Брокерские услуги\nОткрытие брокерского счёта ...,0.540874
2,1,3,Номер счета,Хотите получить больше информации?\nПозвоните ...,0.545769
3,1,4,Номер счета,Правила проведения Акции «Альфа Пятница. Бараб...,0.509613
4,1,5,Номер счета,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...,0.445588
...,...,...,...,...,...
13521421,6977,1934,Можно ли отключить автопополнение брокерского ...,Возврат денег покупателю можно оформить через ...,0.570023
13521422,6977,1935,Можно ли отключить автопополнение брокерского ...,Вывести деньги с брокерского счёта можно на ка...,0.589133
13521423,6977,1936,Можно ли отключить автопополнение брокерского ...,Вывести деньги с брокерского счёта можно на сл...,0.576352
13521424,6977,1937,Можно ли отключить автопополнение брокерского ...,"Альфа-Смарт — семейная подписка, запущенная в ...",0.517069


In [12]:
top5_df = cosine_df.groupby('q_id').apply(
    lambda x: x.nlargest(5, 'cosine_similarity')
).reset_index(drop=True)
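
The top 5 can also be taken straight from the similarity matrix, without materializing every question-page pair. The sketch below uses `np.argpartition` to find the k largest entries per row and then sorts only those k; `top_k_indices` and the toy matrix are illustrative and not part of the baseline.

```python
import numpy as np


def top_k_indices(sim_matrix, k=5):
    """Return, for each row, the column indices of the k largest values, best first."""
    sim_matrix = np.asarray(sim_matrix)
    k = min(k, sim_matrix.shape[1])
    # argpartition puts the k largest values of each row first, in arbitrary order.
    part = np.argpartition(-sim_matrix, k - 1, axis=1)[:, :k]
    # Sort only those k candidates by descending similarity.
    order = np.argsort(-np.take_along_axis(sim_matrix, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)


sims = np.array([[0.1, 0.9, 0.5, 0.3],
                 [0.7, 0.2, 0.3, 0.8]])
top2 = top_k_indices(sims, k=2)
print(top2)
```

The result holds column positions, so indexing an array of `web_id` values with it (for example `sites['web_id'].to_numpy()[top5]`) would give the page ids per question.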


In [13]:
top5_df

Unnamed: 0,q_id,web_id,query,site_text,cosine_similarity
0,1,1567,Номер счета,Ещё нужен номер карты\nНомер карты\nНомер карт...,0.682799
1,1,372,Номер счета,Альфа-Банк\nПолезное о продуктах\nВ статье раз...,0.669153
2,1,135,Номер счета,Код валюты\nНаименование банка\nи адрес\nНомер...,0.667427
3,1,92,Номер счета,Образцы заполнения кассовых документов\nВыдача...,0.662159
4,1,1825,Номер счета,НАКОПИТЕЛЬНЫЙ СЧЕТ «МОЙ СЕЙФ» \nС 01.11.201...,0.652185
...,...,...,...,...,...
34880,6977,403,Можно ли отключить автопополнение брокерского ...,Проценты на остаток брокерского счёта,0.714328
34881,6977,1928,Можно ли отключить автопополнение брокерского ...,Инвестор может в любое время расторгнуть догов...,0.642905
34882,6977,425,Можно ли отключить автопополнение брокерского ...,Подключите опцию и оплачивайте покупки по карт...,0.627746
34883,6977,1043,Можно ли отключить автопополнение брокерского ...,Можно ли отменить рассрочку Держателям кредитн...,0.625677


In [14]:
web_list_df = top5_df.groupby('q_id')['web_id'].apply(
    lambda x: "[" + ", ".join(map(str, x.tolist())) + "]"
).reset_index()
web_list_df.columns = ['q_id', 'web_list']
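
A quick sanity check before writing the file can catch malformed rows. The sketch assumes the expected format is a bracketed list of five distinct `web_id` values per question; this is inferred from the formatting code above, since the contents of `sample_submission.csv` are not shown. `check_submission` is a hypothetical helper.

```python
import json

import pandas as pd


def check_submission(df, k=5):
    """Return the q_ids whose web_list does not parse to exactly k distinct ids."""
    bad = []
    for q_id, web_list in zip(df["q_id"], df["web_list"]):
        ids = json.loads(web_list)  # "[1, 2, 3]" is valid JSON
        if len(ids) != k or len(set(ids)) != k:
            bad.append(q_id)
    return bad


toy = pd.DataFrame({
    "q_id": [1, 2],
    "web_list": ["[1, 2, 3, 4, 5]", "[7, 7, 8]"],
})
print(check_submission(toy))
```

An empty list from `check_submission(web_list_df)` would mean every question has the assumed shape.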

In [15]:
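from datetime import datetime

# Use a timestamp without spaces or colons so the filename is valid on every OS.
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
web_list_df.to_csv(f'ranking_results/submit_{timestamp}.csv', index=False)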