# Overview

- [Train Notebook](https://www.kaggle.com/code/sinchir0/fine-tuning-bge-train)

- Build the retrieval data (25 candidates) with `bge-small-en-v1.5`
- Fine-tune `bge-large-en-v1.5` on the retrieval data
  - `anchor`: `ConstructName` + `SubjectName` + `QuestionText` + `Answer[A-D]Text`
  - `positive`: the correct `MisconceptionName`
  - `negative`: a wrong `MisconceptionName`

ref: https://sbert.net/docs/sentence_transformer/training_overview.html#trainer
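The anchor/positive/negative layout listed above can be sketched in plain Python. This is only an illustration: the helper names `build_anchor` and `build_triplets`, the toy misconception names, and the choice to take negatives from the retrieved candidates minus the correct one are assumptions, not code from the train notebook.

```python
def build_anchor(row):
    """Join the question fields into one anchor string (space-separated)."""
    fields = ["ConstructName", "SubjectName", "QuestionText", "AnswerText"]
    return " ".join(row[f] for f in fields)


def build_triplets(row, correct_id, retrieved_ids, id_to_name):
    """Return one triplet per retrieved candidate that is not the correct misconception."""
    anchor = build_anchor(row)
    positive = id_to_name[correct_id]
    return [
        {"anchor": anchor, "positive": positive, "negative": id_to_name[mid]}
        for mid in retrieved_ids
        if mid != correct_id  # the correct misconception must not be used as a negative
    ]


# Toy data; the misconception names below are invented for illustration.
id_to_name = {
    0: "Adds instead of multiplies",
    1: "Ignores brackets",
    2: "Works left to right",
}
row = {
    "ConstructName": "Use the order of operations",
    "SubjectName": "BIDMAS",
    "QuestionText": "Where do the brackets need to go?",
    "AnswerText": "Does not need brackets",
}
triplets = build_triplets(row, correct_id=1, retrieved_ids=[2, 1, 0], id_to_name=id_to_name)
```

Each dictionary has the three keys named in the list above, so a list like this could be turned into a training dataset for a triplet-style loss.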

# Setting

In [1]:
DATA_PATH = "/kaggle/input/eedi-mining-misconceptions-in-mathematics"
MODEL_PATH = "/kaggle/input/bge-small-train-models/pytorch/default/1" + "/trained_model"

# Install

In [2]:
!python -m pip install -qq --no-index --find-links=/kaggle/input/eedi-library sentence-transformers

# Import

In [3]:
import os

# import polars as pl
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

import sentence_transformers


# Data Load

In [4]:
test = pd.read_csv(f"{DATA_PATH}/test.csv")
misconception_mapping = pd.read_csv(f"{DATA_PATH}/misconception_mapping.csv")

# Preprocess

In [5]:
# Define the list of common columns
common_col = [
    "QuestionId",
    "ConstructName",
    "SubjectName",
    "QuestionText",
    "CorrectAnswer",
]

# Select the required columns from the DataFrame
test_selected = test[common_col + [f"Answer{alpha}Text" for alpha in ["A", "B", "C", "D"]]]

# Unpivot the DataFrame using melt
test_melted = test_selected.melt(
    id_vars=common_col, 
    var_name="AnswerType", 
    value_name="AnswerText"
)

# Create the 'AllText' column by concatenating the specified columns
test_melted["AllText"] = (
    test_melted["ConstructName"] + " " +
    test_melted["SubjectName"] + " " +
    test_melted["QuestionText"] + " " +
    test_melted["AnswerText"]
)

# Extract the alphabet (A, B, C, D) from the 'AnswerType' column and create 'AnswerAlphabet' column
test_melted["AnswerAlphabet"] = test_melted["AnswerType"].str.extract(r"Answer([A-D])Text$")[0]

# Create the 'QuestionId_Answer' column by concatenating 'QuestionId' and 'AnswerAlphabet'
test_melted["QuestionId_Answer"] = test_melted["QuestionId"].astype(str) + "_" + test_melted["AnswerAlphabet"]

# Sort the DataFrame by 'QuestionId_Answer'
test_long = test_melted.sort_values("QuestionId_Answer")

# Display the first few rows
test_long.head()

| | QuestionId | ConstructName | SubjectName | QuestionText | CorrectAnswer | AnswerType | AnswerText | AllText | AnswerAlphabet | QuestionId_Answer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1869 | Use the order of operations to carry out calcu... | BIDMAS | `\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...` | A | AnswerAText | `\( 3 \times(2+4)-5 \)` | Use the order of operations to carry out calcu... | A | 1869_A |
| 3 | 1869 | Use the order of operations to carry out calcu... | BIDMAS | `\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...` | A | AnswerBText | `\( 3 \times 2+(4-5) \)` | Use the order of operations to carry out calcu... | B | 1869_B |
| 6 | 1869 | Use the order of operations to carry out calcu... | BIDMAS | `\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...` | A | AnswerCText | `\( 3 \times(2+4-5) \)` | Use the order of operations to carry out calcu... | C | 1869_C |
| 9 | 1869 | Use the order of operations to carry out calcu... | BIDMAS | `\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...` | A | AnswerDText | Does not need brackets | Use the order of operations to carry out calcu... | D | 1869_D |
| 1 | 1870 | Simplify an algebraic fraction by factorising ... | Simplifying Algebraic Fractions | `Simplify the following, if possible: \( \frac{...` | D | AnswerAText | `\( m+1 \)` | Simplify an algebraic fraction by factorising ... | A | 1870_A |


# BGE

In [None]:
model = SentenceTransformer(MODEL_PATH)

test_long_vec = model.encode(
    test_long["AllText"].to_list(), normalize_embeddings=True
)
misconception_mapping_vec = model.encode(
    misconception_mapping["MisconceptionName"].to_list(), normalize_embeddings=True
)
print(test_long_vec.shape)
print(misconception_mapping_vec.shape)

In [None]:
test_cos_sim_arr = cosine_similarity(test_long_vec, misconception_mapping_vec)
test_sorted_indices = np.argsort(-test_cos_sim_arr, axis=1)
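The ranking step in the cell above can be illustrated on toy data. For unit-length vectors, cosine similarity reduces to a dot product, and `np.argsort` on the negated matrix gives candidate indices from most to least similar. The array shapes, random vectors, and `top_k` value below are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the encoded queries and misconception names.
queries = rng.normal(size=(3, 8))
candidates = rng.normal(size=(10, 8))

# L2-normalise each row, as normalize_embeddings=True does for real embeddings.
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

# For unit vectors, cosine similarity is just the dot product.
sim = queries @ candidates.T  # shape (3, 10)

# Negate so that ascending argsort yields descending similarity.
top_k = 5
ranked = np.argsort(-sim, axis=1)[:, :top_k]  # shape (3, top_k)
```

Row `i` of `ranked` holds the column indices of the `top_k` most similar candidates for query `i`, best first.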

# Make Submit File

In [None]:
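# Add the 'MisconceptionId' column with the top 25 sorted indices for each row.
# test_long was sorted, so its index is not 0..n-1. Reuse that index so each row
# lines up by position with its encoded text instead of being realigned by label.
test_long["MisconceptionId"] = pd.Series(test_sorted_indices[:, :25].tolist(), index=test_long.index)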

# Convert each list of 'MisconceptionId' values to a space-separated string
test_long["MisconceptionId"] = test_long["MisconceptionId"].apply(lambda x: " ".join(map(str, x)))
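The rankings are computed in the row order of the sorted frame, so they have to be attached by position. pandas aligns a `Series` on index labels when it is assigned as a column, which matters whenever the frame's index is no longer `0..n-1`. A toy sketch of the difference (the frame and the `score` values are invented for illustration):

```python
import pandas as pd

# Sorting keeps the original labels, so the index becomes [1, 0, 2].
df = pd.DataFrame({"key": ["b", "a", "c"]}).sort_values("key")

# Values computed in the sorted row order: a -> 10, b -> 20, c -> 30.
scores = [10, 20, 30]

# A Series with the default RangeIndex is matched on labels 0, 1, 2,
# so the first two values end up on the wrong rows.
by_label = df.assign(score=pd.Series(scores))

# Reusing the frame's own index keeps the values in positional order.
by_position = df.assign(score=pd.Series(scores, index=df.index))
```

Here `by_position` pairs `a`, `b`, `c` with 10, 20, 30, while `by_label` swaps the first two values. Calling `reset_index(drop=True)` after sorting is another way to avoid the mismatch.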

In [None]:
# Filter rows where 'CorrectAnswer' is not equal to 'AnswerAlphabet'
submission = test_long[test_long["CorrectAnswer"] != test_long["AnswerAlphabet"]]

# Select only the 'QuestionId_Answer' and 'MisconceptionId' columns
submission = submission[["QuestionId_Answer", "MisconceptionId"]]

In [None]:
# Sort the DataFrame by 'QuestionId_Answer'
submission = submission.sort_values("QuestionId_Answer")

# Display the first few rows
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)