# Baseline

1. Load the data
2. Truncate each text to a maximum length
3. Compute text embeddings
4. Compute the similarity measure
5. Select the top 5 texts best suited to answer each question

- Preprocessing: **truncating the text to a maximum length**
- Embedding model: **ai-forever/FRIDA**
- Similarity measure: **cosine similarity**
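
The embeddings in this notebook are L2-normalized at encoding time (`normalize_embeddings=True`), so cosine similarity reduces to a plain dot product. A minimal NumPy sketch with made-up vectors (not real FRIDA embeddings) illustrates the equivalence; the helper name `cosine` is only for this example:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, chosen so the norms are exact
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# After L2 normalization, the plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(cosine(a, b))            # 0.96
print(float(a_unit @ b_unit))  # same value up to floating-point rounding
```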

# Import

In [1]:
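import os

# Walk up until the working directory is the project root "alfa-hack-rag"
while os.path.basename(os.getcwd()) != "alfa-hack-rag":
    parent = os.path.dirname(os.getcwd())
    if parent == os.getcwd():
        raise RuntimeError("Project root 'alfa-hack-rag' not found")
    os.chdir(parent)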

In [2]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# Data

In [3]:
df_websites = pd.read_csv("data/websites_updated.csv")
df_questions = pd.read_csv("data/questions_clean.csv")
df_sample = pd.read_csv("data/sample_submission.csv")

In [4]:
df_websites

Unnamed: 0,web_id,url,kind,title,text
0,1,https://alfabank.ru/,html,"Альфа-Банк - кредитные и дебетовые карты, кред...",Рассчитайте выгоду\nРасчёт калькулятора предва...
1,2,https://alfabank.ru/a-club/,html,А-Клуб. Деньги имеют значение,Брокерские услуги\nОткрытие брокерского счёта ...
2,3,https://alfabank.ru/a-club/ultimate/,html,А-Клуб. Деньги имеют значение,Хотите получить больше информации?\nПозвоните ...
3,4,https://alfabank.ru/actions/rules/,html,Скидки по картам,Правила проведения Акции «Альфа Пятница. Бараб...
4,5,https://alfabank.ru/alfafuture/,html,Альфа‑Будущее: Платформа для развития студенто...,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...
...,...,...,...,...,...
1933,1934,https://alfabank.ru/help/t/retail/alfaforbusin...,html,Как вернуть деньги покупателю и как рассчитыва...,Возврат денег покупателю можно оформить через ...
1934,1935,https://alfabank.ru/help/articles/investments/...,html,Как вывести деньги с брокерского счёта — Альфа...,Вывести деньги с брокерского счёта можно на ка...
1935,1936,https://alfabank.ru/make-money/investments/hel...,html,Пополнение и вывод средств — Альфа-Инвестиции,Вывести деньги с брокерского счёта можно на сл...
1936,1937,https://alfabank.ru/everyday/smart/,html,Альфа-Смарт — подписка Альфа-Банка,"Альфа-Смарт — семейная подписка, запущенная в ..."


# Preprocessing

In [None]:
def shorten(text, max_length=2048):
    # Keep at most max_length characters (characters, not tokens)
    return text[:max_length]
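
Because `shorten` counts characters, a cut can land in the middle of a word. The sketch below contrasts it with a hypothetical variant, `shorten_at_word` (not part of the baseline), that backs off to the last space before the limit:

```python
def shorten(text, max_length=2048):
    # Same character-level truncation as in the notebook
    return text[:max_length]

def shorten_at_word(text, max_length=2048):
    """Hypothetical variant: cut at the last space before max_length."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    head, sep, _ = cut.rpartition(" ")
    # Fall back to a hard cut when the prefix contains no space
    return head if sep else cut

sample = "Open a brokerage account online"
print(shorten(sample, 10))          # Open a bro
print(shorten_at_word(sample, 10))  # Open a
```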

# Compute sentence embeddings

In [6]:
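def embed_text(df, model_name, target_col, batch_size=32):
    model = SentenceTransformer(model_name)

    # Replace missing values, cast to str, and truncate each text before encoding
    texts = df[target_col].fillna('').astype(str).apply(shorten).tolist()

    result = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding"):
        batch = texts[i:i+batch_size]
        # L2-normalized embeddings, so cosine similarity equals the dot product
        emb = model.encode(batch,
                          convert_to_numpy=True,
                          normalize_embeddings=True)
        result.append(emb)

    result = np.vstack(result)

    embedding_columns = [f'embedding_{i}' for i in range(result.shape[1])]
    # Build all embedding columns in one frame and concatenate once;
    # assigning them in place triggers pandas' fragmentation PerformanceWarning
    df_embeddings = pd.DataFrame(result, columns=embedding_columns, index=df.index)
    df_with_embeddings = pd.concat([df, df_embeddings], axis=1)

    return df_with_embeddings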
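
`SentenceTransformer.encode` also accepts `batch_size` and `show_progress_bar`, so the manual batching loop can probably be delegated to the library. A sketch under that assumption; `encode_texts` is a hypothetical helper, and `model` is any object with a compatible `encode` method:

```python
def encode_texts(model, texts, batch_size=32):
    """Hypothetical helper: let the model batch internally instead of looping."""
    return model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    )

# Usage (downloads the model on first run):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('ai-forever/FRIDA')
# embeddings = encode_texts(model, ["Номер счета"])
```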

In [7]:
sites = embed_text(df_websites, 'ai-forever/FRIDA', 'text')
sites

Encoding: 100%|██████████| 61/61 [04:39<00:00,  4.58s/it]

Unnamed: 0,web_id,url,kind,title,text,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,...,embedding_1526,embedding_1527,embedding_1528,embedding_1529,embedding_1530,embedding_1531,embedding_1532,embedding_1533,embedding_1534,embedding_1535
0,1,https://alfabank.ru/,html,"Альфа-Банк - кредитные и дебетовые карты, кред...",Рассчитайте выгоду\nРасчёт калькулятора предва...,-0.003249,-0.020401,0.027337,0.006725,-0.036608,...,0.006631,0.026930,-0.010927,0.015409,-0.028900,-0.001167,-0.024208,-0.011017,0.008020,0.007052
1,2,https://alfabank.ru/a-club/,html,А-Клуб. Деньги имеют значение,Брокерские услуги\nОткрытие брокерского счёта ...,-0.009440,-0.061582,0.026435,0.044492,-0.002672,...,-0.000069,0.009618,-0.028911,0.027790,-0.015526,-0.012378,0.016891,-0.026201,0.023016,0.015856
2,3,https://alfabank.ru/a-club/ultimate/,html,А-Клуб. Деньги имеют значение,Хотите получить больше информации?\nПозвоните ...,-0.002939,-0.027929,0.029933,0.023813,-0.004421,...,0.010756,0.008974,0.000365,0.054321,-0.037606,-0.008107,0.009505,-0.007027,0.013637,0.013283
3,4,https://alfabank.ru/actions/rules/,html,Скидки по картам,Правила проведения Акции «Альфа Пятница. Бараб...,-0.033699,-0.011261,-0.005979,0.071950,0.002548,...,-0.018337,0.014755,0.032974,0.010965,-0.024063,-0.011818,-0.019787,-0.019710,0.031310,-0.002010
4,5,https://alfabank.ru/alfafuture/,html,Альфа‑Будущее: Платформа для развития студенто...,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...,-0.045007,-0.034232,-0.002173,0.037662,-0.008874,...,-0.022496,0.006195,0.001555,0.019326,-0.014310,-0.022655,0.007760,-0.013740,0.036208,0.038437
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933,1934,https://alfabank.ru/help/t/retail/alfaforbusin...,html,Как вернуть деньги покупателю и как рассчитыва...,Возврат денег покупателю можно оформить через ...,-0.015971,0.021874,0.009625,0.013757,-0.014239,...,-0.002150,-0.013522,0.027452,-0.002157,0.002773,0.028878,-0.017706,0.013381,0.021673,0.019196
1934,1935,https://alfabank.ru/help/articles/investments/...,html,Как вывести деньги с брокерского счёта — Альфа...,Вывести деньги с брокерского счёта можно на ка...,0.004135,-0.028117,0.017499,0.054837,-0.055319,...,0.019717,0.010931,-0.000260,0.021390,-0.020282,0.018530,-0.013940,-0.006937,0.046657,-0.027553
1935,1936,https://alfabank.ru/make-money/investments/hel...,html,Пополнение и вывод средств — Альфа-Инвестиции,Вывести деньги с брокерского счёта можно на сл...,0.003172,-0.027826,0.008401,0.062487,-0.057488,...,-0.001974,0.017078,0.002795,0.009026,-0.038530,0.016653,-0.012292,-0.001809,0.044300,-0.014507
1936,1937,https://alfabank.ru/everyday/smart/,html,Альфа-Смарт — подписка Альфа-Банка,"Альфа-Смарт — семейная подписка, запущенная в ...",-0.004382,0.014563,0.013223,0.053914,-0.043583,...,-0.043578,0.000444,-0.014084,-0.005074,-0.023519,0.004564,-0.023168,-0.024929,0.043034,0.020864


In [8]:
questions = embed_text(df_questions, 'ai-forever/FRIDA', 'query')
questions

Encoding: 100%|██████████| 219/219 [01:06<00:00,  3.30it/s]

Unnamed: 0,q_id,query,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,...,embedding_1526,embedding_1527,embedding_1528,embedding_1529,embedding_1530,embedding_1531,embedding_1532,embedding_1533,embedding_1534,embedding_1535
0,1,Номер счета,-0.009606,0.022116,-0.015717,0.021197,-0.000418,-0.001497,0.010700,0.056437,...,-0.005985,-0.014645,0.036027,-0.043435,-0.023419,0.034221,-0.018146,-0.052293,-0.019528,0.016299
1,2,Где узнать бик и счёт,-0.027913,0.001696,0.002724,0.016252,0.016995,0.007613,0.024111,0.058602,...,-0.010818,-0.035081,0.019946,-0.025626,-0.020042,0.028988,0.014868,-0.025700,-0.056265,0.016393
2,3,Мне не приходят коды для подтверждения данной ...,0.006544,-0.023364,-0.026909,-0.030549,0.014789,0.047667,0.001782,0.036237,...,-0.027244,-0.005069,-0.010190,-0.006286,0.028107,0.027441,0.017797,0.018892,0.012521,0.020908
3,4,"Оформила рассрочку ,но уведомлений никаких не ...",-0.034723,0.006657,-0.009752,-0.003755,0.000713,0.003430,0.004350,0.034954,...,-0.043031,-0.009955,-0.024805,-0.045295,0.008682,-0.010122,-0.000699,0.003615,0.028302,0.014569
4,5,"Здравствуйте, когда смогу пользоваться кредитн...",-0.033890,0.011957,0.014870,0.008979,-0.042151,0.018168,-0.013802,-0.011115,...,-0.025627,-0.011822,0.011554,-0.019789,0.012919,0.035903,0.035126,0.031286,0.017035,0.066977
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6972,6973,"Здравствуйте, оплатил вчера ЖКХ а кэшбек не на...",-0.015995,-0.012645,-0.026982,-0.007115,-0.036015,0.017858,0.013707,0.024371,...,-0.039250,-0.047196,-0.018864,-0.034330,-0.008868,-0.004614,0.031712,-0.009220,-0.001613,-0.017158
6973,6974,"Здравствуйте, можно ли заказать реквизиты бан...",-0.008595,0.012027,0.004211,0.002909,-0.004248,-0.028359,-0.025351,0.001484,...,-0.032040,0.006886,-0.004590,-0.013969,-0.009959,-0.011138,0.023577,-0.014513,-0.008708,0.071997
6974,6975,"Здравствуйте, подскажите пожалуйста где я могу...",-0.019299,0.014888,-0.001857,0.002746,-0.012152,0.016654,0.005697,0.072246,...,-0.025471,-0.020309,0.008900,-0.071408,-0.008846,-0.000405,0.027210,0.000220,0.018858,0.022620
6975,6976,Реквизиты для оплаты номера карты,-0.003291,-0.026090,0.004332,0.013851,0.025512,-0.003967,0.009249,0.062804,...,0.032208,-0.048678,0.021592,-0.038068,-0.003432,-0.008183,0.006754,-0.053036,0.005197,0.022635


# Similarity

In [9]:
def cosine_sim(df_sites, df_question):
    cols_sites = [col for col in df_sites.columns if col.startswith('embedding_')]
    cols_question = [col for col in df_question.columns if col.startswith('embedding_')]

    site_embeddings = df_sites[cols_sites].values
    question_embeddings = df_question[cols_question].values

    cosine_sim_matrix = cosine_similarity(question_embeddings, site_embeddings)

    results = []
    
    for i, (q_idx, question_row) in enumerate(tqdm(df_question.iterrows(), total=len(df_question), desc="Processing")):
        for j, (s_idx, site_row) in enumerate(df_sites.iterrows()):
            # Similarity between question i and site j
            similarity = cosine_sim_matrix[i, j]
            
            results.append({
                'q_id': question_row['q_id'],
                'web_id': site_row['web_id'],
                'query': question_row['query'],
                'site_text': site_row['text'],
                'cosine_similarity': similarity
            })
    
    # Build the final DataFrame
    cosine_df = pd.DataFrame(results)

    return cosine_df
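
The nested `iterrows` loops visit every question-site pair one at a time (about 13.5 million rows here). The same long-format table can be assembled with array operations. The sketch below is a hypothetical alternative, `cosine_sim_vectorized`, not the baseline's method: it swaps scikit-learn's `cosine_similarity` for explicit NumPy normalization and assumes no embedding row is all zeros.

```python
import numpy as np
import pandas as pd

def cosine_sim_vectorized(df_sites, df_question):
    """Hypothetical vectorized variant producing the same long-format table."""
    cols_sites = [c for c in df_sites.columns if c.startswith('embedding_')]
    cols_question = [c for c in df_question.columns if c.startswith('embedding_')]

    q = df_question[cols_question].to_numpy(dtype=float)
    s = df_sites[cols_sites].to_numpy(dtype=float)

    # Normalize rows so that the dot product is the cosine similarity
    # (assumes no all-zero rows, which would divide by zero)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sim = q @ s.T  # shape: (n_questions, n_sites)

    n_q, n_s = sim.shape
    # Row-major flattening pairs each question with every site, in order
    return pd.DataFrame({
        'q_id': np.repeat(df_question['q_id'].to_numpy(), n_s),
        'web_id': np.tile(df_sites['web_id'].to_numpy(), n_q),
        'query': np.repeat(df_question['query'].to_numpy(), n_s),
        'site_text': np.tile(df_sites['text'].to_numpy(), n_q),
        'cosine_similarity': sim.ravel(),
    })
```

It takes the same two frames as `cosine_sim`, for example `cosine_sim_vectorized(sites, questions)`.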

In [10]:
cosine_df = cosine_sim(sites, questions)

Processing: 100%|██████████| 6977/6977 [08:19<00:00, 13.96it/s]


In [11]:
cosine_df

Unnamed: 0,q_id,web_id,query,site_text,cosine_similarity
0,1,1,Номер счета,Рассчитайте выгоду\nРасчёт калькулятора предва...,0.167539
1,1,2,Номер счета,Брокерские услуги\nОткрытие брокерского счёта ...,0.245613
2,1,3,Номер счета,Хотите получить больше информации?\nПозвоните ...,0.149656
3,1,4,Номер счета,Правила проведения Акции «Альфа Пятница. Бараб...,0.108419
4,1,5,Номер счета,Образование\nМагистратуры\nМагистратура ВШЭ\nМ...,0.139368
...,...,...,...,...,...
13521421,6977,1934,Можно ли отключить автопополнение брокерского ...,Возврат денег покупателю можно оформить через ...,0.186377
13521422,6977,1935,Можно ли отключить автопополнение брокерского ...,Вывести деньги с брокерского счёта можно на ка...,0.417687
13521423,6977,1936,Можно ли отключить автопополнение брокерского ...,Вывести деньги с брокерского счёта можно на сл...,0.373985
13521424,6977,1937,Можно ли отключить автопополнение брокерского ...,"Альфа-Смарт — семейная подписка, запущенная в ...",0.316801


In [18]:
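# Sort by similarity within each question, then keep the top 5 rows per q_id
top5_df = cosine_df.sort_values(
    ['q_id', 'cosine_similarity'], ascending=[True, False]
).groupby('q_id').head(5).reset_index(drop=True)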


In [19]:
top5_df

Unnamed: 0,q_id,web_id,query,site_text,cosine_similarity
0,1,1157,Номер счета,Альфа-Банк\nПолезное о продуктах\nЧтобы получа...,0.422728
1,1,372,Номер счета,Альфа-Банк\nПолезное о продуктах\nВ статье раз...,0.422688
2,1,1896,Номер счета,"Номер карты можно посмотреть на самой карте, в...",0.422145
3,1,593,Номер счета,31.01.2025\nПодробнее: «Альфа‑Счёт»,0.407634
4,1,1098,Номер счета,Альфа-Банк\nПолезное о продуктах\nIBAN (Intern...,0.407426
...,...,...,...,...,...
34880,6977,1304,Можно ли отключить автопополнение брокерского ...,Самый лёгкий способ начать копить\nАвтоматичес...,0.456107
34881,6977,438,Можно ли отключить автопополнение брокерского ...,Как открыть брокерский счёт? Сколько времени н...,0.431020
34882,6977,432,Можно ли отключить автопополнение брокерского ...,Откройте брокерский счёт в Альфа‑Банке\nСчёт б...,0.419225
34883,6977,1935,Можно ли отключить автопополнение брокерского ...,Вывести деньги с брокерского счёта можно на ка...,0.417687
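
The top 5 can also be read straight off the similarity matrix, without first materializing every question-site pair. A minimal sketch, assuming a matrix of shape `(n_questions, n_sites)`; `top_k_sites` and the toy scores are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

def top_k_sites(sim, q_ids, web_ids, k=5):
    """Hypothetical helper: top-k web_ids per question from a similarity matrix.

    sim has shape (n_questions, n_sites); q_ids and web_ids label its rows/columns.
    """
    k = min(k, sim.shape[1])
    # Stable argsort on the negated scores: highest similarity first,
    # ties keep the original site order
    order = np.argsort(-sim, axis=1, kind='stable')[:, :k]
    return pd.DataFrame({
        'q_id': np.repeat(np.asarray(q_ids), k),
        'web_id': np.asarray(web_ids)[order].ravel(),
        'cosine_similarity': np.take_along_axis(sim, order, axis=1).ravel(),
    })

# Toy example with made-up scores
sim = np.array([[0.1, 0.9, 0.5],
                [0.7, 0.2, 0.8]])
print(top_k_sites(sim, q_ids=[1, 2], web_ids=[10, 20, 30], k=2))
```

In the toy example, question 1 is matched to sites 20 and 30, and question 2 to sites 30 and 10.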


In [None]:
web_list_df = top5_df.groupby('q_id')['web_id'].apply(
    lambda x: "[" + ", ".join(map(str, x.tolist())) + "]"
).reset_index()
web_list_df.columns = ['q_id', 'web_list']
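
To make the resulting `web_list` strings concrete, here is the same aggregation applied to a toy stand-in for `top5_df` (the frame `toy` and its values are for illustration only):

```python
import pandas as pd

# Toy stand-in for top5_df, two hits per question instead of five
toy = pd.DataFrame({'q_id': [1, 1, 2, 2], 'web_id': [1157, 372, 1304, 438]})

web_list = toy.groupby('q_id')['web_id'].apply(
    lambda x: "[" + ", ".join(map(str, x.tolist())) + "]"
).reset_index()
web_list.columns = ['q_id', 'web_list']

print(web_list['web_list'].tolist())  # ['[1157, 372]', '[1304, 438]']
```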

In [17]:
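from datetime import datetime

# Create the output directory if needed and use a filesystem-safe timestamp
os.makedirs('ranking_results', exist_ok=True)
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
web_list_df.to_csv(f'ranking_results/submit{timestamp}.csv', index=False)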