# Overview

- [Train Notebook](https://www.kaggle.com/code/sinchir0/fine-tuning-bge-train)

- Generate 25 retrieval candidates with `bge-large-en-v1.5`
- Fine-tune `bge-large-en-v1.5` on the retrieval data
  - `anchor`: `ConstructName` + `SubjectName` + `QuestionText` + `Answer[A-D]Text`
  - `positive`: Correct MisconceptionName
  - `negative`: Wrong MisconceptionName

ref: https://sbert.net/docs/sentence_transformer/training_overview.html#trainer
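
The triplet layout listed above can be sketched in plain Python. This is a minimal illustration, not the training notebook's code: the helper name `build_triplets`, the toy rows, and the choice to draw negatives from the retrieved candidates are all assumptions.

```python
# Hypothetical helper and toy data; only the column names come from the notebook.
def build_triplets(rows, misconceptions, candidates):
    """Return (anchor, positive, negative) text triplets.

    rows: dicts holding the question fields and the correct MisconceptionId
    misconceptions: dict mapping MisconceptionId -> MisconceptionName
    candidates: dict mapping QuestionId_Answer -> retrieved MisconceptionIds
    """
    triplets = []
    for row in rows:
        # anchor: ConstructName + SubjectName + QuestionText + answer text
        anchor = " ".join(
            [row["ConstructName"], row["SubjectName"], row["QuestionText"], row["AnswerText"]]
        )
        positive = misconceptions[row["MisconceptionId"]]
        for cand_id in candidates[row["QuestionId_Answer"]]:
            if cand_id == row["MisconceptionId"]:
                continue  # the correct misconception cannot serve as a negative
            triplets.append((anchor, positive, misconceptions[cand_id]))
    return triplets


rows = [
    {
        "QuestionId_Answer": "0_B",
        "ConstructName": "Use the order of operations",
        "SubjectName": "BIDMAS",
        "QuestionText": "3 x 2 + 4 - 5",
        "AnswerText": "Does not need brackets",
        "MisconceptionId": 1,
    }
]
misconceptions = {
    0: "Adds instead of multiplies",
    1: "Carries out operations from left to right",
    2: "Confuses factors and multiples",
}
candidates = {"0_B": [1, 2, 0]}  # assumed output of the retrieval step

triplets = build_triplets(rows, misconceptions, candidates)
```

Each retrieved candidate that is not the correct misconception yields one triplet, so a question-answer pair with 25 candidates contributes at most 25 training examples.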

# Setting

In [39]:
DATA_PATH = "/kaggle/input/eedi-mining-misconceptions-in-mathematics"
MODEL_PATH = "/kaggle/input/bge_large_pandas/pytorch/default/1/content" + "/trained_model"

# Install

In [41]:
!python -m pip install -qq --no-index --find-links=/kaggle/input/eedi-library \
    sentence-transformers \
    pandas

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


# Import

In [42]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Data Load

In [45]:
test = pd.read_csv(f"{DATA_PATH}/test.csv")
misconception_mapping = pd.read_csv(f"{DATA_PATH}/misconception_mapping.csv")

# Preprocess

In [47]:
common_col = [
    "QuestionId",
    "ConstructName",
    "SubjectName",
    "QuestionText",
    "CorrectAnswer",
]

# Select specific columns
ans_cols = [f"Answer{alpha}Text" for alpha in ["A", "B", "C", "D"]]
columns_to_select = common_col + ans_cols
test_selected = test[columns_to_select]

# Unpivot (melt) the DataFrame
test_long = test_selected.melt(
    id_vars=common_col, 
    value_vars=ans_cols,
    var_name="AnswerType", 
    value_name="AnswerText"
)

# Add the "AllText" column by concatenating selected columns
test_long["AllText"] = (
    test_long["ConstructName"] + " " +
    test_long["SubjectName"] + " " +
    test_long["QuestionText"] + " " +
    test_long["AnswerText"]
)

# Extract the answer alphabet (A, B, C, D) from "AnswerType"
test_long["AnswerAlphabet"] = test_long["AnswerType"].str.extract(r"Answer([A-D])Text$")

# Add the "QuestionId_Answer" column by concatenating "QuestionId" and "AnswerAlphabet"
test_long["QuestionId_Answer"] = (
    test_long["QuestionId"].astype("string") + "_" + test_long["AnswerAlphabet"]
)

# Sort the DataFrame by "QuestionId_Answer"
test_long = test_long.sort_values(by="QuestionId_Answer")

# Display the first few rows
print(test_long.head())

   QuestionId                                      ConstructName  \
0        1869  Use the order of operations to carry out calcu...   
3        1869  Use the order of operations to carry out calcu...   
6        1869  Use the order of operations to carry out calcu...   
9        1869  Use the order of operations to carry out calcu...   
1        1870  Simplify an algebraic fraction by factorising ...   

                       SubjectName  \
0                           BIDMAS   
3                           BIDMAS   
6                           BIDMAS   
9                           BIDMAS   
1  Simplifying Algebraic Fractions   

                                        QuestionText CorrectAnswer  \
0  \[\n3 \times 2+4-5\n\]\nWhere do the brackets ...             A   
3  \[\n3 \times 2+4-5\n\]\nWhere do the brackets ...             A   
6  \[\n3 \times 2+4-5\n\]\nWhere do the brackets ...             A   
9  \[\n3 \times 2+4-5\n\]\nWhere do the brackets ...             A   
1  Simplify 
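
The wide-to-long reshape in the preprocessing cell turns each question into four rows, one per answer option. A dependency-free sketch of the same idea is shown here; the helper name `melt_answers` and the answer texts are made up for illustration, and only the column names follow the notebook.

```python
# Hypothetical helper mirroring what DataFrame.melt plus the key columns produce.
def melt_answers(question, letters=("A", "B", "C", "D")):
    """Expand one wide question record into one row per answer option."""
    long_rows = []
    for letter in letters:
        long_rows.append(
            {
                "QuestionId": question["QuestionId"],
                "CorrectAnswer": question["CorrectAnswer"],
                "AnswerAlphabet": letter,
                "AnswerText": question[f"Answer{letter}Text"],
                "QuestionId_Answer": f"{question['QuestionId']}_{letter}",
            }
        )
    return long_rows


# Toy record: the answer texts are placeholders, not competition data.
question = {
    "QuestionId": 1869,
    "CorrectAnswer": "A",
    "AnswerAText": "option a",
    "AnswerBText": "option b",
    "AnswerCText": "option c",
    "AnswerDText": "option d",
}

long_rows = melt_answers(question)

# Only the incorrect options need a misconception prediction.
wrong = [r["QuestionId_Answer"] for r in long_rows if r["AnswerAlphabet"] != r["CorrectAnswer"]]
```

This is why a test set with three questions produces 12 long rows, of which the rows for the correct answers are dropped before submission.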

# BGE

In [48]:
# Load the fine-tuned model
model = SentenceTransformer(MODEL_PATH)

# Encode the question-answer texts and the misconception names as unit-length vectors
test_long_vec = model.encode(
    test_long["AllText"].to_list(), normalize_embeddings=True
)
misconception_mapping_vec = model.encode(
    misconception_mapping["MisconceptionName"].to_list(), normalize_embeddings=True
)
print(test_long_vec.shape)
print(misconception_mapping_vec.shape)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/81 [00:00<?, ?it/s]

(12, 1024)
(2587, 1024)


In [49]:
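# Similarity of every question-answer row to every misconception
test_cos_sim_arr = cosine_similarity(test_long_vec, misconception_mapping_vec)
# Row positions in `misconception_mapping`, ordered from most to least similar
test_sorted_indices = np.argsort(-test_cos_sim_arr, axis=1)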
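
Because both sets of embeddings are encoded with `normalize_embeddings=True`, cosine similarity reduces to a plain dot product, and sorting the negated scores ranks misconceptions from most to least similar. A minimal, dependency-free sketch of that ranking step on made-up 2-D vectors follows; the helper names `normalize`, `dot`, and `top_k` are illustrative and not part of the notebook.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, corpus, k):
    """Indices of the k corpus vectors most similar to the query."""
    scores = [dot(query, doc) for doc in corpus]
    # Sorting by the negated score gives descending similarity, like np.argsort(-scores)
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]


# Toy vectors standing in for the 1024-dimensional embeddings
query = normalize([1.0, 0.0])
corpus = [normalize(v) for v in ([0.0, 1.0], [1.0, 1.0], [2.0, 0.0])]

ranking = top_k(query, corpus, k=2)
```

In the notebook the same ranking is done for all rows at once with `np.argsort`, and the first 25 columns are kept.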

# Make Submission File

In [51]:
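# `test_long` is a pandas DataFrame and `test_sorted_indices` is a NumPy array

# Step 1: Add a new column "MisconceptionId" holding the top 25 predictions.
# `test_sorted_indices` contains row positions in `misconception_mapping`,
# so look up the actual ids rather than relying on position == id.
misconception_ids = misconception_mapping["MisconceptionId"].to_numpy()
test_long["MisconceptionId"] = misconception_ids[test_sorted_indices[:, :25]].tolist()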

# Step 2: Convert the list of misconceptions to a space-separated string
test_long["MisconceptionId"] = test_long["MisconceptionId"].apply(
    lambda x: " ".join(map(str, x))
)

# Step 3: Filter rows where "CorrectAnswer" != "AnswerAlphabet"
filtered = test_long[test_long["CorrectAnswer"] != test_long["AnswerAlphabet"]]

# Step 4: Select only the required columns
submission = filtered[["QuestionId_Answer", "MisconceptionId"]]

# Step 5: Sort by "QuestionId_Answer"
submission = submission.sort_values(by="QuestionId_Answer").reset_index(drop=True)


In [54]:
submission.to_csv("submission.csv", index=False)
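
Before uploading, it can help to sanity-check the rows that were written. The sketch below is a hypothetical helper, `check_submission_rows`, whose rules (unique keys, 25 distinct integer ids per row) are inferred from the steps in this notebook, not taken from an official specification.

```python
# Hypothetical validator; rows are (QuestionId_Answer, MisconceptionId) pairs.
def check_submission_rows(rows, k=25):
    """Return a list of problems found; an empty list means the rows look well-formed."""
    problems = []
    seen = set()
    for key, ids in rows:
        if key in seen:
            problems.append(f"{key}: duplicate key")
        seen.add(key)
        parts = ids.split()
        if len(parts) != k:
            problems.append(f"{key}: expected {k} ids, found {len(parts)}")
        if not all(p.isdigit() for p in parts):
            problems.append(f"{key}: non-integer id")
        if len(set(parts)) != len(parts):
            problems.append(f"{key}: repeated ids")
    return problems


# Toy rows: the ids are placeholders, not real predictions.
good_rows = [("1869_B", " ".join(str(i) for i in range(25)))]
bad_rows = [("1869_C", "1 2 2")]

good_problems = check_submission_rows(good_rows)
bad_problems = check_submission_rows(bad_rows)
```

With the real file, the pairs could be taken from `submission.itertuples(index=False)`.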