# MisMine: A Data-Driven Approach to Understanding Student Learning
### CSCE-689 Programming LLMs Final Project Report
### Yu-Sheng Chen (334003776)

Using the dataset from the Eedi "Mining Misconceptions in Mathematics" Kaggle competition, I build a model that predicts the MisconceptionId behind each incorrect answer to a given question.


## Data Preprocessing

In [11]:
import pandas as pd
training_path = "/kaggle/input/eedi-mining-misconceptions-in-mathematics/train.csv"
df = pd.read_csv(training_path)
df.head()

Unnamed: 0,QuestionId,ConstructId,ConstructName,SubjectId,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText,MisconceptionAId,MisconceptionBId,MisconceptionCId,MisconceptionDId
0,0,856,Use the order of operations to carry out calcu...,33,BIDMAS,A,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),\( 3 \times 2+(4-5) \),\( 3 \times(2+4-5) \),Does not need brackets,,,,1672.0
1,1,1612,Simplify an algebraic fraction by factorising ...,1077,Simplifying Algebraic Fractions,D,"Simplify the following, if possible: \( \frac{...",\( m+1 \),\( m+2 \),\( m-1 \),Does not simplify,2142.0,143.0,2142.0,
2,2,2774,Calculate the range from a list of data,339,Range and Interquartile Range from a List of Data,B,Tom and Katie are discussing the \( 5 \) plant...,Only\nTom,Only\nKatie,Both Tom and Katie,Neither is correct,1287.0,,1287.0,1073.0
3,3,2377,Recall and use the intersecting diagonals prop...,88,Properties of Quadrilaterals,C,The angles highlighted on this rectangle with ...,acute,obtuse,\( 90^{\circ} \),Not enough information,1180.0,1180.0,,1180.0
4,4,3387,Substitute positive integer values into formul...,67,Substitution into Formula,A,The equation \( f=3 r^{2}+3 \) is used to find...,\( 30 \),\( 27 \),\( 51 \),\( 24 \),,,,1818.0


In [12]:
df = df.drop(['QuestionId', 'ConstructId', 'SubjectId'], axis=1)
df.head()

Unnamed: 0,ConstructName,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText,MisconceptionAId,MisconceptionBId,MisconceptionCId,MisconceptionDId
0,Use the order of operations to carry out calcu...,BIDMAS,A,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),\( 3 \times 2+(4-5) \),\( 3 \times(2+4-5) \),Does not need brackets,,,,1672.0
1,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,D,"Simplify the following, if possible: \( \frac{...",\( m+1 \),\( m+2 \),\( m-1 \),Does not simplify,2142.0,143.0,2142.0,
2,Calculate the range from a list of data,Range and Interquartile Range from a List of Data,B,Tom and Katie are discussing the \( 5 \) plant...,Only\nTom,Only\nKatie,Both Tom and Katie,Neither is correct,1287.0,,1287.0,1073.0
3,Recall and use the intersecting diagonals prop...,Properties of Quadrilaterals,C,The angles highlighted on this rectangle with ...,acute,obtuse,\( 90^{\circ} \),Not enough information,1180.0,1180.0,,1180.0
4,Substitute positive integer values into formul...,Substitution into Formula,A,The equation \( f=3 r^{2}+3 \) is used to find...,\( 30 \),\( 27 \),\( 51 \),\( 24 \),,,,1818.0


In [13]:
# Map CorrectAnswer (A, B, C, D) to the corresponding AnswerText column dynamically
def get_correct_answer_text(row):
    # Dictionary to map 'A', 'B', 'C', 'D' to their respective AnswerText column names
    answer_columns = {
        'A': 'AnswerAText',
        'B': 'AnswerBText',
        'C': 'AnswerCText',
        'D': 'AnswerDText'
    }
    # Retrieve the correct column and return its value for the current row
    correct_column = answer_columns.get(row['CorrectAnswer'])
    return row[correct_column] if correct_column else None

# Create the CorrectAnswerText column by applying the function row by row
df['CorrectAnswerText'] = df.apply(get_correct_answer_text, axis=1)

# Display the updated DataFrame
df.head()

Unnamed: 0,ConstructName,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText,MisconceptionAId,MisconceptionBId,MisconceptionCId,MisconceptionDId,CorrectAnswerText
0,Use the order of operations to carry out calcu...,BIDMAS,A,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),\( 3 \times 2+(4-5) \),\( 3 \times(2+4-5) \),Does not need brackets,,,,1672.0,\( 3 \times(2+4)-5 \)
1,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,D,"Simplify the following, if possible: \( \frac{...",\( m+1 \),\( m+2 \),\( m-1 \),Does not simplify,2142.0,143.0,2142.0,,Does not simplify
2,Calculate the range from a list of data,Range and Interquartile Range from a List of Data,B,Tom and Katie are discussing the \( 5 \) plant...,Only\nTom,Only\nKatie,Both Tom and Katie,Neither is correct,1287.0,,1287.0,1073.0,Only\nKatie
3,Recall and use the intersecting diagonals prop...,Properties of Quadrilaterals,C,The angles highlighted on this rectangle with ...,acute,obtuse,\( 90^{\circ} \),Not enough information,1180.0,1180.0,,1180.0,\( 90^{\circ} \)
4,Substitute positive integer values into formul...,Substitution into Formula,A,The equation \( f=3 r^{2}+3 \) is used to find...,\( 30 \),\( 27 \),\( 51 \),\( 24 \),,,,1818.0,\( 30 \)


In [14]:
# Step 2: Define the function to create new rows
def create_transformed_rows(row):
    new_rows = []
    answers = ['AnswerAText', 'AnswerBText', 'AnswerCText', 'AnswerDText']
    misconceptions = ['MisconceptionAId', 'MisconceptionBId', 'MisconceptionCId', 'MisconceptionDId']
    
    for answer_col, misconception_col in zip(answers, misconceptions):
        # Skip the correct answer and NaN MisconceptionID
        if row['CorrectAnswerText'] != row[answer_col] and not pd.isna(row[misconception_col]):
            new_row = {
                'ConstructName': row['ConstructName'],
                'SubjectName': row['SubjectName'],
                'QuestionText': row['QuestionText'],
                'CorrectAnswerText': row['CorrectAnswerText'],
                'AnswerText': row[answer_col],
                'MisconceptionId': row[misconception_col]
            }
            new_rows.append(new_row)
    return new_rows

# Step 3: Apply the transformation
transformed_data = []
for _, row in df.iterrows():
    transformed_data.extend(create_transformed_rows(row))

# Step 4: Create the new DataFrame
transformed_df = pd.DataFrame(transformed_data)
transformed_df.head()

Unnamed: 0,ConstructName,SubjectName,QuestionText,CorrectAnswerText,AnswerText,MisconceptionId
0,Use the order of operations to carry out calcu...,BIDMAS,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),Does not need brackets,1672.0
1,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+1 \),2142.0
2,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+2 \),143.0
3,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m-1 \),2142.0
4,Calculate the range from a list of data,Range and Interquartile Range from a List of Data,Tom and Katie are discussing the \( 5 \) plant...,Only\nKatie,Only\nTom,1287.0
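
The row-by-row loop above can also be written as a column-wise reshape. The following is only a sketch on a made-up toy frame, not the code used in this report: `toy`, `to_long`, `QuestionIdx`, and `Option` are hypothetical names. Unlike the loop, it identifies the correct option by its letter rather than by comparing answer text.

```python
import pandas as pd

nan = float("nan")

# Toy frame with the same column layout as train.csv (all values are made up).
toy = pd.DataFrame({
    "QuestionText": ["q0", "q1"],
    "CorrectAnswer": ["A", "D"],
    "AnswerAText": ["a0", "a1"],
    "AnswerBText": ["b0", "b1"],
    "AnswerCText": ["c0", "c1"],
    "AnswerDText": ["d0", "d1"],
    "MisconceptionAId": [nan, 2142.0],
    "MisconceptionBId": [nan, 143.0],
    "MisconceptionCId": [nan, 2142.0],
    "MisconceptionDId": [1672.0, nan],
})

def to_long(df):
    """Return one row per (question, distractor) pair that has a misconception label."""
    parts = []
    for letter in "ABCD":
        part = df[["QuestionText", "CorrectAnswer"]].copy()
        part["QuestionIdx"] = df.index            # remember the source question
        part["Option"] = letter
        part["AnswerText"] = df[f"Answer{letter}Text"]
        part["MisconceptionId"] = df[f"Misconception{letter}Id"]
        parts.append(part)
    long_df = pd.concat(parts, ignore_index=True)
    # Keep distractors only, and only those with a labelled misconception.
    keep = (long_df["Option"] != long_df["CorrectAnswer"]) & long_df["MisconceptionId"].notna()
    long_df = long_df[keep].sort_values(["QuestionIdx", "Option"])
    return long_df.reset_index(drop=True)

long_df = to_long(toy)
print(long_df[["QuestionText", "Option", "AnswerText", "MisconceptionId"]])
```

Sorting by question and option keeps the rows in the same order the loop would produce, so downstream cells that rely on row order should be unaffected.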


In [15]:
# Load the misconception ID to text mapping CSV
misconception_mapping_file = '/kaggle/input/eedi-mining-misconceptions-in-mathematics/misconception_mapping.csv'  # Update to the file path of the mapping
misconception_mapping_df = pd.read_csv(misconception_mapping_file)

# The mapping file has two columns: 'MisconceptionId' and 'MisconceptionName'
# Left-merge so that every transformed row gains the text of its misconception
merged_df = pd.merge(transformed_df, misconception_mapping_df, on='MisconceptionId', how='left')

# Drop the numeric MisconceptionId; the MisconceptionName text column is kept for embedding
merged_df = merged_df.drop(columns=['MisconceptionId'])

merged_df.head()

Unnamed: 0,ConstructName,SubjectName,QuestionText,CorrectAnswerText,AnswerText,MisconceptionName
0,Use the order of operations to carry out calcu...,BIDMAS,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),Does not need brackets,"Confuses the order of operations, believes add..."
1,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+1 \),Does not know that to factorise a quadratic ex...
2,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+2 \),Thinks that when you cancel identical terms fr...
3,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m-1 \),Does not know that to factorise a quadratic ex...
4,Calculate the range from a list of data,Range and Interquartile Range from a List of Data,Tom and Katie are discussing the \( 5 \) plant...,Only\nKatie,Only\nTom,Believes if you changed all values by the same...
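
Because later cells rely on the merged frame keeping the same row order and row count as `transformed_df`, it may be worth guarding the merge. The sketch below uses small made-up frames (`long_df` and `mapping` are stand-ins, not the real data) and the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Stand-ins for transformed_df and the mapping file (values are made up).
long_df = pd.DataFrame({"AnswerText": ["x", "y", "z"],
                        "MisconceptionId": [2.0, 0.0, 2.0]})
mapping = pd.DataFrame({"MisconceptionId": [0, 1, 2],
                        "MisconceptionName": ["m0", "m1", "m2"]})

merged = pd.merge(
    long_df, mapping,
    on="MisconceptionId",
    how="left",
    validate="many_to_one",   # raises if an id appears twice in the mapping
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Count rows whose id had no match in the mapping, then drop the helper column.
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns=["_merge"])
print(f"rows without a misconception name: {unmatched}")
```

With `many_to_one` validated, a left merge cannot duplicate rows, so `len(merged) == len(long_df)` holds.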


### Embedding

In [16]:
import torch
import math
from transformers import AutoTokenizer, AutoModel

# Check if CUDA is available (i.e., GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

# Load the MathBERTa tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

# Move the model to the GPU
model.to(device)

# Function for sinusoidal positional encoding
def positional_encoding(seq_len, dim, device):
    pe = torch.zeros(seq_len, dim, device=device)
    position = torch.arange(0, seq_len, device=device).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, device=device) * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Function to convert text to embeddings with positional encoding
def get_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        token_embeddings = outputs.last_hidden_state  # Shape: (batch_size, seq_len, hidden_dim)
    
    seq_len, hidden_dim = token_embeddings.shape[1], token_embeddings.shape[2]
    pe = positional_encoding(seq_len, hidden_dim, device)
    
    # Add positional encodings to token embeddings
    token_embeddings_with_pe = token_embeddings + pe
    
    # Pooling: Average over the sequence dimension
    embeddings = token_embeddings_with_pe.mean(dim=1)
    return embeddings.squeeze().cpu().numpy()

# Function to embed every column of the DataFrame, cell by cell
def convert_cells_to_embeddings(df):
    # Snapshot the column names first, because new columns are added inside the loop
    for column in list(df.columns):
        df[column + '_embedding'] = df[column].apply(lambda x: get_embeddings(str(x)))
    return df


""" About 5 mins"""
# Convert each cell's text in the DataFrame to embeddings
df_with_embeddings = convert_cells_to_embeddings(merged_df)

# Load the data and convert text to embeddings
map_df = pd.read_csv("/kaggle/input/eedi-mining-misconceptions-in-mathematics/misconception_mapping.csv")
map_df_with_embeddings = convert_cells_to_embeddings(map_df)

device: cuda



Some weights of RobertaModel were not initialized from the model checkpoint at witiko/mathberta and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
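
One property of `get_embeddings` is worth spelling out. Because the mean is linear, adding the sinusoidal encoding before mean pooling shifts the pooled vector by an offset that depends only on the sequence length, not on the text. The NumPy sketch below illustrates this with random stand-in token vectors; `tokens`, `seq_len`, and `dim` are made-up values, and the function is a NumPy port of `positional_encoding` written for illustration.

```python
import math
import numpy as np

def positional_encoding(seq_len, dim):
    """NumPy port of the sinusoidal encoding used in the notebook."""
    pe = np.zeros((seq_len, dim))
    position = np.arange(seq_len)[:, None]
    div_term = np.exp(np.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

rng = np.random.default_rng(0)
seq_len, dim = 6, 8                          # toy sizes; the notebook uses dim=768
tokens = rng.normal(size=(seq_len, dim))     # stand-in for last_hidden_state of one text

pooled_with_pe = (tokens + positional_encoding(seq_len, dim)).mean(axis=0)
pooled_plain = tokens.mean(axis=0)
offset = positional_encoding(seq_len, dim).mean(axis=0)  # depends on seq_len only

print(np.allclose(pooled_with_pe, pooled_plain + offset))
```

Two texts with the same token count therefore receive the same offset, so the encoding mostly adds length information to the pooled embedding.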


In [17]:
df_with_embeddings.head()

Unnamed: 0,ConstructName,SubjectName,QuestionText,CorrectAnswerText,AnswerText,MisconceptionName,ConstructName_embedding,SubjectName_embedding,QuestionText_embedding,CorrectAnswerText_embedding,AnswerText_embedding,MisconceptionName_embedding
0,Use the order of operations to carry out calcu...,BIDMAS,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,\( 3 \times(2+4)-5 \),Does not need brackets,"Confuses the order of operations, believes add...","[0.04959985, 0.014729749, 0.034920234, 0.01752...","[-0.055310056, 0.17073403, -0.022181133, 0.155...","[-0.0013330285, -0.03429978, -0.020202065, 0.0...","[-0.018493686, 0.011237886, 0.0022131717, 0.07...","[-0.021918857, 0.045115843, -0.021832122, 0.02...","[-0.005899658, 0.066089705, -0.02102672, 0.002..."
1,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+1 \),Does not know that to factorise a quadratic ex...,"[0.020621149, 0.017256275, 0.03514646, 0.00250...","[0.060939286, 0.027644148, 0.08008756, -0.0312...","[0.04532544, -0.022641057, 0.060456403, 0.0008...","[0.036091924, -0.009930244, 0.06971537, -0.027...","[0.12924047, 0.09619799, 0.14670639, 0.2118297...","[0.035504494, 0.02632645, -0.009300704, 0.0450..."
2,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m+2 \),Thinks that when you cancel identical terms fr...,"[0.020621149, 0.017256275, 0.03514646, 0.00250...","[0.060939286, 0.027644148, 0.08008756, -0.0312...","[0.04532544, -0.022641057, 0.060456403, 0.0008...","[0.036091924, -0.009930244, 0.06971537, -0.027...","[0.13379285, 0.09730019, 0.14354198, 0.2083104...","[0.023413155, 0.057533238, -0.01655458, 0.0303..."
3,Simplify an algebraic fraction by factorising ...,Simplifying Algebraic Fractions,"Simplify the following, if possible: \( \frac{...",Does not simplify,\( m-1 \),Does not know that to factorise a quadratic ex...,"[0.020621149, 0.017256275, 0.03514646, 0.00250...","[0.060939286, 0.027644148, 0.08008756, -0.0312...","[0.04532544, -0.022641057, 0.060456403, 0.0008...","[0.036091924, -0.009930244, 0.06971537, -0.027...","[0.13783681, 0.113175966, 0.15296637, 0.242789...","[0.035504494, 0.02632645, -0.009300704, 0.0450..."
4,Calculate the range from a list of data,Range and Interquartile Range from a List of Data,Tom and Katie are discussing the \( 5 \) plant...,Only\nKatie,Only\nTom,Believes if you changed all values by the same...,"[0.03937878, 0.05364158, 0.023004703, 0.049579...","[0.019607667, 0.09191745, -0.020040216, 0.0133...","[0.009799295, 0.01622148, -0.014125508, 0.0351...","[0.17989315, 0.10336397, 0.1624305, 0.07264595...","[0.19660223, 0.19645533, 0.15364648, 0.1399345...","[0.04475698, 0.038474713, 0.014422012, 0.03455..."


In [18]:
map_df_with_embeddings.head()

Unnamed: 0,MisconceptionId,MisconceptionName,MisconceptionId_embedding,MisconceptionName_embedding
0,0,Does not know that angles in a triangle sum to...,"[0.5126569, 0.3801039, 0.5945794, 0.37628245, ...","[0.076011285, -0.027505102, 0.046373926, -0.04..."
1,1,Uses dividing fractions method for multiplying...,"[0.5426599, 0.38091144, 0.5986686, 0.42581233,...","[-0.017427994, 0.053378757, -0.01871304, 0.012..."
2,2,Believes there are 100 degrees in a full turn,"[0.5473151, 0.3583281, 0.5907435, 0.4749967, 0...","[0.020258954, 0.052719492, -0.038508724, 0.015..."
3,3,Thinks a quadratic without a non variable term...,"[0.5362339, 0.34793526, 0.58784366, 0.4273047,...","[-0.009375696, 0.015626298, -0.013452273, 0.01..."
4,4,Believes addition of terms and powers of terms...,"[0.524827, 0.3573501, 0.6060773, 0.36942467, 0...","[-0.01730884, 0.042884838, -0.018736564, -0.04..."


In [19]:
""" About 2 mins"""
output_file_embeddings = 'transformed_with_embeddings_pos_mberta.csv'
df_with_embeddings.to_csv(output_file_embeddings, index=False)

In [20]:
output_file_embeddings = 'map_df_embeddings_pos_mberta.csv'
map_df_with_embeddings.to_csv(output_file_embeddings, index=False)

## Train the model

In [22]:
import pandas as pd

training_path = "/kaggle/working/transformed_with_embeddings_pos_mberta.csv"
df = pd.read_csv(training_path)
df = df.iloc[:, 6:12]
df.head()

Unnamed: 0,ConstructName_embedding,SubjectName_embedding,QuestionText_embedding,CorrectAnswerText_embedding,AnswerText_embedding,MisconceptionName_embedding
0,[ 4.95998487e-02 1.47297494e-02 3.49202342e-...,[-5.53100556e-02 1.70734033e-01 -2.21811328e-...,[-1.33302854e-03 -3.42997797e-02 -2.02020649e-...,[-1.84936859e-02 1.12378858e-02 2.21317168e-...,[-2.19188575e-02 4.51158434e-02 -2.18321215e-...,[-5.89965796e-03 6.60897046e-02 -2.10267194e-...
1,[ 2.06211489e-02 1.72562748e-02 3.51464599e-...,[ 6.09392859e-02 2.76441481e-02 8.00875574e-...,[ 4.53254394e-02 -2.26410571e-02 6.04564026e-...,[ 3.60919237e-02 -9.93024372e-03 6.97153732e-...,[ 1.29240468e-01 9.61979926e-02 1.46706387e-...,[ 3.55044939e-02 2.63264496e-02 -9.30070411e-...
2,[ 2.06211489e-02 1.72562748e-02 3.51464599e-...,[ 6.09392859e-02 2.76441481e-02 8.00875574e-...,[ 4.53254394e-02 -2.26410571e-02 6.04564026e-...,[ 3.60919237e-02 -9.93024372e-03 6.97153732e-...,[ 1.33792847e-01 9.73001868e-02 1.43541977e-...,[ 2.34131552e-02 5.75332381e-02 -1.65545791e-...
3,[ 2.06211489e-02 1.72562748e-02 3.51464599e-...,[ 6.09392859e-02 2.76441481e-02 8.00875574e-...,[ 4.53254394e-02 -2.26410571e-02 6.04564026e-...,[ 3.60919237e-02 -9.93024372e-03 6.97153732e-...,[ 1.37836814e-01 1.13175966e-01 1.52966365e-...,[ 3.55044939e-02 2.63264496e-02 -9.30070411e-...
4,[ 3.93787809e-02 5.36415800e-02 2.30047032e-...,[ 1.96076669e-02 9.19174477e-02 -2.00402159e-...,[ 9.79929511e-03 1.62214804e-02 -1.41255083e-...,[ 1.79893151e-01 1.03363968e-01 1.62430495e-...,[ 1.96602225e-01 1.96455330e-01 1.53646484e-...,[ 4.47569788e-02 3.84747125e-02 1.44220116e-...


In [23]:
training_path = "/kaggle/working/map_df_embeddings_pos_mberta.csv"
map_df = pd.read_csv(training_path)
map_df.head()

Unnamed: 0,MisconceptionId,MisconceptionName,MisconceptionId_embedding,MisconceptionName_embedding
0,0,Does not know that angles in a triangle sum to...,[ 5.12656927e-01 3.80103886e-01 5.94579399e-...,[ 7.60112852e-02 -2.75051016e-02 4.63739261e-...
1,1,Uses dividing fractions method for multiplying...,[ 5.42659879e-01 3.80911440e-01 5.98668575e-...,[-1.74279939e-02 5.33787571e-02 -1.87130403e-...
2,2,Believes there are 100 degrees in a full turn,[ 5.47315121e-01 3.58328104e-01 5.90743482e-...,[ 2.02589538e-02 5.27194925e-02 -3.85087244e-...
3,3,Thinks a quadratic without a non variable term...,[ 5.36233902e-01 3.47935259e-01 5.87843657e-...,[-9.37569607e-03 1.56262983e-02 -1.34522729e-...
4,4,Believes addition of terms and powers of terms...,[ 5.24827003e-01 3.57350111e-01 6.06077313e-...,[-1.73088405e-02 4.28848378e-02 -1.87365636e-...


In [24]:
import numpy as np
import pandas as pd

# Parse the string form of an array (as written to the CSV) back into a NumPy array
def str_to_array(x):
    return np.fromstring(x.strip('[]'), sep=' ')

# Apply the conversion to each embedding column of the DataFrame
for col in df.columns:
    df[col] = df[col].apply(str_to_array)

map_df["MisconceptionName_embedding"] = map_df["MisconceptionName_embedding"].apply(str_to_array)
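
The CSV round trip works because each 768-dimensional vector is written as its printed string form and then parsed back. The sketch below checks that round trip on a made-up vector (`vec` is illustrative only). It relies on NumPy's default print threshold of 1000 elements: longer arrays would be abbreviated with `...` when converted to a string and could not be recovered this way.

```python
import numpy as np

def str_to_array(x):
    # Same parser as in the notebook cell above.
    return np.fromstring(x.strip('[]'), sep=' ')

# Made-up float32 vector with the same length as a MathBERTa embedding.
vec = np.linspace(-1.0, 1.0, 768).astype(np.float32)

as_text = str(vec)                 # the string form that ends up in the CSV cell
restored = str_to_array(as_text)   # parsed back as float64

print(restored.shape, np.allclose(restored, vec, atol=1e-6))
```

A binary format such as `np.save` would avoid the string parsing entirely; the CSV route is kept here because it is what the notebook uses.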

In [25]:
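# Add back the MisconceptionId. This relies on transformed_df still being in memory
# and on df having the same row order as transformed_df.
df["MisconceptionId"] = transformed_df["MisconceptionId"]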
df.head()

Unnamed: 0,ConstructName_embedding,SubjectName_embedding,QuestionText_embedding,CorrectAnswerText_embedding,AnswerText_embedding,MisconceptionName_embedding,MisconceptionId
0,"[0.0495998487, 0.0147297494, 0.0349202342, 0.0...","[-0.0553100556, 0.170734033, -0.0221811328, 0....","[-0.00133302854, -0.0342997797, -0.0202020649,...","[-0.0184936859, 0.0112378858, 0.00221317168, 0...","[-0.0219188575, 0.0451158434, -0.0218321215, 0...","[-0.00589965796, 0.0660897046, -0.0210267194, ...",1672.0
1,"[0.0206211489, 0.0172562748, 0.0351464599, 0.0...","[0.0609392859, 0.0276441481, 0.0800875574, -0....","[0.0453254394, -0.0226410571, 0.0604564026, 0....","[0.0360919237, -0.00993024372, 0.0697153732, -...","[0.129240468, 0.0961979926, 0.146706387, 0.211...","[0.0355044939, 0.0263264496, -0.00930070411, 0...",2142.0
2,"[0.0206211489, 0.0172562748, 0.0351464599, 0.0...","[0.0609392859, 0.0276441481, 0.0800875574, -0....","[0.0453254394, -0.0226410571, 0.0604564026, 0....","[0.0360919237, -0.00993024372, 0.0697153732, -...","[0.133792847, 0.0973001868, 0.143541977, 0.208...","[0.0234131552, 0.0575332381, -0.0165545791, 0....",143.0
3,"[0.0206211489, 0.0172562748, 0.0351464599, 0.0...","[0.0609392859, 0.0276441481, 0.0800875574, -0....","[0.0453254394, -0.0226410571, 0.0604564026, 0....","[0.0360919237, -0.00993024372, 0.0697153732, -...","[0.137836814, 0.113175966, 0.152966365, 0.2427...","[0.0355044939, 0.0263264496, -0.00930070411, 0...",2142.0
4,"[0.0393787809, 0.05364158, 0.0230047032, 0.049...","[0.0196076669, 0.0919174477, -0.0200402159, 0....","[0.00979929511, 0.0162214804, -0.0141255083, 0...","[0.179893151, 0.103363968, 0.162430495, 0.0726...","[0.196602225, 0.19645533, 0.153646484, 0.13993...","[0.0447569788, 0.0384747125, 0.0144220116, 0.0...",1287.0


In [27]:
map_df.head()

Unnamed: 0,MisconceptionId,MisconceptionName,MisconceptionId_embedding,MisconceptionName_embedding
0,0,Does not know that angles in a triangle sum to...,[ 5.12656927e-01 3.80103886e-01 5.94579399e-...,"[0.0760112852, -0.0275051016, 0.0463739261, -0..."
1,1,Uses dividing fractions method for multiplying...,[ 5.42659879e-01 3.80911440e-01 5.98668575e-...,"[-0.0174279939, 0.0533787571, -0.0187130403, 0..."
2,2,Believes there are 100 degrees in a full turn,[ 5.47315121e-01 3.58328104e-01 5.90743482e-...,"[0.0202589538, 0.0527194925, -0.0385087244, 0...."
3,3,Thinks a quadratic without a non variable term...,[ 5.36233902e-01 3.47935259e-01 5.87843657e-...,"[-0.00937569607, 0.0156262983, -0.0134522729, ..."
4,4,Believes addition of terms and powers of terms...,[ 5.24827003e-01 3.57350111e-01 6.06077313e-...,"[-0.0173088405, 0.0428848378, -0.0187365636, -..."


### Customized Attention Model

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# Stack the five input embedding columns: ConstructName, SubjectName, QuestionText, CorrectAnswerText, AnswerText
X = np.array([np.stack(df.iloc[i, :5].values) for i in range(len(df))])  # Shape: (4368, 5, 768)

# Use the second-to-last column (MisconceptionName_embedding) as the regression target
y = np.array([df.iloc[i, -2] for i in range(len(df))])  # Shape: (4368, 768)

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

# Prepare the data
X_flattened = X  # (4368, 5, 768)
y_flattened = y  # Target embeddings

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_flattened, y_flattened, test_size=0.2, random_state=42, shuffle=False)

# Convert to torch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).to(device)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).to(device)

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=32)
test_dataloader = DataLoader(test_dataset, batch_size=32)

# Attention Mechanism (Scaled Dot-Product Attention)
class AttentionLayer(nn.Module):
    def __init__(self, embed_size):
        super(AttentionLayer, self).__init__()
        self.query_fc = nn.Linear(embed_size, embed_size)
        self.key_fc = nn.Linear(embed_size, embed_size)
        self.value_fc = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        # Q, K, V
        query = self.query_fc(x)
        key = self.key_fc(x)
        value = self.value_fc(x)

        # Scaled dot-product attention
        scores = torch.matmul(query, key.transpose(-2, -1)) / np.sqrt(query.size(-1))
        attention_weights = torch.nn.functional.softmax(scores, dim=-1)
        attended_output = torch.matmul(attention_weights, value)
        
        return attended_output

# Custom Model with Attention Mechanism
# (single-head attention; output_dim and num_attention_heads are accepted but currently unused)
class CustomAttentionModel(nn.Module):
    def __init__(self, input_dim, output_dim, embed_size=768, num_attention_heads=8):
        super(CustomAttentionModel, self).__init__()
        
        self.attention = AttentionLayer(embed_size)
        self.dense1 = nn.Linear(input_dim, 2048)
        self.dense2 = nn.Linear(2048, 1024)
        self.dense3 = nn.Linear(1024, embed_size)  # Output layer to match the target dimension
        
    def forward(self, x):
        # Apply attention to the sequence of embeddings
        x = self.attention(x)  # Shape: (batch_size, seq_len, embed_size)
        
        # Flatten the sequence embeddings into one dimension for the dense layers
        x = x.view(x.size(0), -1)  # Shape: (batch_size, seq_len * embed_size)
        
        # Feed-forward network
        x = torch.relu(self.dense1(x))
        x = torch.relu(self.dense2(x))
        x = self.dense3(x)  # Final output
        return x

# Initialize the model
input_dim = X_train.shape[1] * X_train.shape[2]  # (5 * 768) = 3840
output_dim = y_train.shape[1]  # 768
model = CustomAttentionModel(input_dim, output_dim).to(device)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-5)

# Training loop
num_epochs = 10000
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_data, labels = batch
        input_data = input_data.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(input_data)  # Output shape: (batch_size, 768)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss / len(train_dataloader)}")


MODEL_PATH = "model_weights_v1.pth"
torch.save(model.state_dict(), MODEL_PATH)
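
To make the tensor shapes in `AttentionLayer` and `CustomAttentionModel` concrete, here is a NumPy walk-through of the same scaled dot-product step on random data. It is a sketch only: the weights are random, biases are omitted, and `embed` is shrunk from 768 to 16 to keep it small. The five "positions" are the five input fields, so each field attends over the other fields of the same sample.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, fields, embed = 2, 5, 16               # notebook: fields=5, embed=768
x = rng.normal(size=(batch, fields, embed))

# Random stand-ins for the query/key/value projection weights (no bias terms).
W_q, W_k, W_v = (rng.normal(size=(embed, embed)) * 0.1 for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = q @ k.transpose(0, 2, 1) / np.sqrt(embed)   # (batch, fields, fields)
weights = softmax(scores, axis=-1)                   # each row sums to 1
attended = weights @ v                               # (batch, fields, embed)
flat = attended.reshape(batch, -1)                   # (batch, fields * embed)

print(scores.shape, attended.shape, flat.shape)
```

With the notebook's sizes, `flat` has 5 * 768 = 3840 features per sample, which matches `input_dim` for the first dense layer.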

### Mixture of Experts (MoE) Model

In [28]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# Stack the five input embedding columns: ConstructName, SubjectName, QuestionText, CorrectAnswerText, AnswerText
X = np.array([np.stack(df.iloc[i, :5].values) for i in range(len(df))])  # Shape: (4368, 5, 768)

# Use the second-to-last column (MisconceptionName_embedding) as the regression target
y = np.array([df.iloc[i, -2] for i in range(len(df))])  # Shape: (4368, 768)

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

# Prepare the data
X_flattened = X  # (4368, 5, 768)
y_flattened = y  # Target embeddings

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_flattened, y_flattened, test_size=0.2, random_state=42, shuffle=False)

# Convert to torch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).to(device)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).to(device)

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=32)
test_dataloader = DataLoader(test_dataset, batch_size=32)

# Attention Mechanism (Scaled Dot-Product Attention)
class AttentionLayer(nn.Module):
    def __init__(self, embed_size):
        super(AttentionLayer, self).__init__()
        self.query_fc = nn.Linear(embed_size, embed_size)
        self.key_fc = nn.Linear(embed_size, embed_size)
        self.value_fc = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        # Q, K, V
        query = self.query_fc(x)
        key = self.key_fc(x)
        value = self.value_fc(x)

        # Scaled dot-product attention
        scores = torch.matmul(query, key.transpose(-2, -1)) / np.sqrt(query.size(-1))
        attention_weights = torch.nn.functional.softmax(scores, dim=-1)
        attended_output = torch.matmul(attention_weights, value)
        
        return attended_output

# Custom Model with Attention Mechanism
class CustomAttentionModel(nn.Module):
    def __init__(self, input_dim, output_dim, embed_size=768, num_attention_heads=8):
        super(CustomAttentionModel, self).__init__()
        
        self.attention = AttentionLayer(embed_size)
        self.dense1 = nn.Linear(input_dim, 2048)
        self.dense2 = nn.Linear(2048, 1024)
        self.dense3 = nn.Linear(1024, embed_size)  # Output layer to match the target dimension
        
    def forward(self, x):
        # Apply attention to the sequence of embeddings
        x = self.attention(x)  # Shape: (batch_size, seq_len, embed_size)
        
        # Flatten the sequence embeddings into one dimension for the dense layers
        x = x.view(x.size(0), -1)  # Shape: (batch_size, seq_len * embed_size)
        
        # Feed-forward network
        x = torch.relu(self.dense1(x))
        x = torch.relu(self.dense2(x))
        x = self.dense3(x)  # Final output
        return x

class ExpertNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(ExpertNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
    
    def forward(self, x):
        return self.net(x)

class MixtureOfExperts(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts=8, k=2):
        super(MixtureOfExperts, self).__init__()
        self.input_dim = input_dim
        self.num_experts = num_experts
        self.k = k  # Top-k experts to use
        
        # Expert networks
        self.experts = nn.ModuleList([
            ExpertNetwork(input_dim, hidden_dim, output_dim) 
            for _ in range(num_experts)
        ])
        
        # Gating network (simpler than before)
        self.gate = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_experts)
        )
        
    def forward(self, x):
        batch_size = x.shape[0]
        
        # Flatten input if needed
        if len(x.shape) > 2:
            x = x.view(batch_size, -1)
        
        # Get gates
        gate_logits = self.gate(x)  # [batch_size, num_experts]
        
        # Select top-k experts
        gates, indices = torch.topk(gate_logits, self.k, dim=-1)  # [batch_size, k]
        gates = torch.softmax(gates, dim=-1)
        
        # Initialize output tensor
        # Allocate the output with the same dtype and device as the input
        final_output = x.new_zeros(batch_size, self.experts[0].net[-1].out_features)
        
        # Compute only selected experts
        for i in range(self.k):
            # Get the expert indices for this position
            expert_indices = indices[:, i]  # [batch_size]
            
            # Compute expert outputs for selected experts
            for j in range(self.num_experts):
                # Create a mask for this expert
                mask = (expert_indices == j)
                if mask.any():
                    # Only compute for samples that use this expert
                    expert_input = x[mask]
                    expert_output = self.experts[j](expert_input)
                    final_output[mask] += gates[mask, i].unsqueeze(-1) * expert_output
        
        return final_output

# Initialize the improved MoE model
input_dim = X_train.shape[1] * X_train.shape[2]  # (5 * 768) = 3840
hidden_dim = 1024
output_dim = y_train.shape[1]  # 768
model = MixtureOfExperts(
    input_dim=input_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    num_experts=8,  # Increased number of experts
    k=2  # Only use top 2 experts per input
).to(device)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

# Training loop with modifications
num_epochs = 10000
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
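        input_data, labels = batch
        # Make sure the batch lives on the same device as the model (no-op if it already does)
        input_data, labels = input_data.to(device), labels.to(device)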
        
        # Forward pass
        outputs = model(input_data)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_dataloader)
    scheduler.step(avg_loss)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.6f}")


MODEL_PATH = "model_weights_moe_v1.pth"
torch.save(model.state_dict(), MODEL_PATH)

device: cuda
Epoch 10/10000, Loss: 0.010625
Epoch 20/10000, Loss: 0.010137
Epoch 30/10000, Loss: 0.009741
Epoch 40/10000, Loss: 0.009399
Epoch 50/10000, Loss: 0.009114
Epoch 60/10000, Loss: 0.008900
Epoch 70/10000, Loss: 0.008674
Epoch 80/10000, Loss: 0.008441
Epoch 90/10000, Loss: 0.008156
Epoch 100/10000, Loss: 0.007851
Epoch 110/10000, Loss: 0.007538
Epoch 120/10000, Loss: 0.007350
Epoch 130/10000, Loss: 0.007270
Epoch 140/10000, Loss: 0.006903
Epoch 150/10000, Loss: 0.006623
Epoch 160/10000, Loss: 0.006386
Epoch 170/10000, Loss: 0.006154
Epoch 180/10000, Loss: 0.005972
Epoch 190/10000, Loss: 0.005771
Epoch 200/10000, Loss: 0.005591
Epoch 210/10000, Loss: 0.005444
Epoch 220/10000, Loss: 0.005293
Epoch 230/10000, Loss: 0.005179
Epoch 240/10000, Loss: 0.005102
Epoch 250/10000, Loss: 0.005343
Epoch 260/10000, Loss: 0.005056
Epoch 270/10000, Loss: 0.004925
Epoch 280/10000, Loss: 0.004821
Epoch 290/10000, Loss: 0.004729
Epoch 300/10000, Loss: 0.004643
Epoch 310/10000, Loss: 0.004563
Epoc
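
The top-k routing step inside `MixtureOfExperts.forward` can be looked at in isolation. The sketch below is only an illustration, not part of the training pipeline: the gate logits are random stand-ins for the output of `self.gate(x)`, and `dense_gates` is a hypothetical helper variable that is not in the original code. It shows that the softmax is taken over the k selected logits only, so each sample's gate weights sum to 1 and every unselected expert gets weight 0.

```python
import torch

torch.manual_seed(0)

batch_size, num_experts, k = 4, 8, 2

# Hypothetical gate logits standing in for self.gate(x)
gate_logits = torch.randn(batch_size, num_experts)

# Keep the k largest logits per sample, then normalise over those k only
top_logits, indices = torch.topk(gate_logits, k, dim=-1)  # both [batch_size, k]
gates = torch.softmax(top_logits, dim=-1)                 # each row sums to 1

# Scatter the k weights back into a dense [batch_size, num_experts] matrix;
# experts that were not selected keep weight 0
dense_gates = torch.zeros_like(gate_logits).scatter(-1, indices, gates)

print(indices)                  # which experts each sample is routed to
print(dense_gates.sum(dim=-1))  # total gate weight per sample
```

Because the weights are renormalised after selection, the combined output is a convex combination of the k chosen experts rather than of all `num_experts`.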

## Evaluation

In [29]:
# Evaluation
model.eval()
with torch.no_grad():
    predictions = []
    true_values = []
    for batch in test_dataloader:
        input_data, labels = batch
        input_data = input_data.to(device)

        # Forward pass
        outputs = model(input_data)  # Predicted embeddings
        predictions.append(outputs.cpu().numpy())
        true_values.append(labels.cpu().numpy())

# Flatten predictions and true values
predictions = np.concatenate(predictions, axis=0)
true_values = np.concatenate(true_values, axis=0)

In [None]:
# Compute accuracy by finding closest prediction in the embedding space
def calculate_top_k_accuracy(y_true, y_pred, map_embeddings, map_ids, k=5):
    correct_predictions = 0
    total_predictions = len(y_true)

    for i in range(total_predictions):
        predicted_embedding = y_pred[i].reshape(1, -1)  # Reshape to 2D for distance computation

        # Compute Euclidean distances between predicted embedding and all misconception embeddings
        distances = euclidean_distances(predicted_embedding, map_embeddings).flatten()

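        # Find row indices of the top-K closest embeddings and map them back to misconception IDs
        top_k_indices = np.argsort(distances)[:k]
        top_k_ids = map_ids[top_k_indices]

        # The true ID is read from the global `df`; the hard-coded offset 3494 is
        # assumed to be the position of the first test row
        if df.iloc[3494+i, -1] in top_k_ids:  # True misconception ID is in the top K
            correct_predictions += 1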

    accuracy = correct_predictions / total_predictions
    return accuracy

# Compute Top-25 accuracy using the misconception IDs from map_df
misconception_ids = map_df['MisconceptionId'].values  # Extract Misconception IDs
misconception_embeddings = np.stack(map_df['MisconceptionName_embedding'].values)  # Parse the embeddings
top_k_accuracy = calculate_top_k_accuracy(true_values, predictions, misconception_embeddings, misconception_ids, k=25)
print(f"Top-k Test Accuracy: {top_k_accuracy * 100:.2f}%")

Top-k Test Accuracy: 14.42%
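
The nearest-neighbour lookup above can also be written without the per-sample loop. The following is a minimal sketch on toy data, assuming the true misconception IDs are available as an array; `top_k_hit_rate` and all of the toy values are hypothetical and not part of the notebook. It uses plain NumPy instead of `euclidean_distances`, and it translates row positions into IDs before comparing, which matters whenever the IDs are not simply 0..N-1 in row order.

```python
import numpy as np

def top_k_hit_rate(pred_embeddings, true_ids, map_embeddings, map_ids, k=25):
    """Fraction of rows whose true ID is among the k nearest map embeddings."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        (pred_embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * pred_embeddings @ map_embeddings.T
        + (map_embeddings ** 2).sum(axis=1)
    )
    # Row positions of the k nearest embeddings, translated to misconception IDs
    top_k_rows = np.argsort(d2, axis=1)[:, :k]
    top_k_ids = map_ids[top_k_rows]
    hits = (top_k_ids == true_ids[:, None]).any(axis=1)
    return hits.mean()

# Toy data: three misconceptions whose IDs differ from their row positions
map_ids = np.array([10, 20, 30])
map_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
preds = np.array([[0.1, 0.0], [0.9, 0.1], [0.4, 0.6]])
true_ids = np.array([10, 20, 10])

rate_at_1 = top_k_hit_rate(preds, true_ids, map_embeddings, map_ids, k=1)
rate_at_2 = top_k_hit_rate(preds, true_ids, map_embeddings, map_ids, k=2)
print(rate_at_1, rate_at_2)
```

Ranking by squared distance gives the same ordering as ranking by Euclidean distance, so the square root can be skipped.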
