## Prepping Files for Model Jupyter Notebook
This notebook combines the per-student files in the EdNet-KT1 dataset into a single table, removing some columns and transforming the values in others. The result is written to a file called 'combined_dataset.csv'.

Contributor: Will Sessoms

### Data Merging Preparation
The original dataset consists of one CSV file for each of its roughly 784k students, which results in massive per-file overhead when reading in data. To avoid this, we take all of the relevant data and merge it into one unified dataset that can be easily read, navigated, and edited.

In [None]:
import os
import polars as pl # Using polars instead of pandas for speed. >9 million lines in 784k csv files.
from tqdm import tqdm
import pandas as pd
import pyarrow as pa # Needed for conversion from polars to pandas
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:

# Load important columns of questions file (finding correct_answer and tags)
# This only needs to be done once but we'll reference it multiple times per student interaction
questions_fname = "Data/contents/questions.csv"
kt1_dir = "Data/KT1"

questions = (
    pl.read_csv(questions_fname)
    .with_columns([
        pl.col("question_id").str.replace("q", "").cast(pl.Int32),
        pl.col("bundle_id").str.replace("b", "").cast(pl.Int32),
        pl.col("tags").cast(pl.Utf8)
    ])
    .select(["question_id", "correct_answer", "bundle_id", "tags"])
)


student_files = [os.path.join(kt1_dir, f) for f in os.listdir(kt1_dir) if f.endswith(".csv")]

dfs = []

### Data Fetching and Merging
Here, we take all of the information that we need from each KT1 file and combine it into a single CSV file.

In [None]:
# For each interaction, we take the student_id, question_id, bundle_id, tags, elapsed_time, and whether they answered correctly
for file in tqdm(student_files, desc="Progress"):
    # Take student_id from filename, remove 'u' prefix to make it int
    student_id = int(os.path.basename(file).replace("u", "").replace(".csv", ""))

    df = (
        pl.read_csv(file)
        .with_columns([
            pl.lit(student_id).alias("student_id"),
            pl.col("question_id").str.replace("q", "").cast(pl.Int32), # Remove 'q' prefix to make question_id int
        ])
        .join(questions, on="question_id", how="left")
        .with_columns([
            # Adds 'correct' column, which details if student got the question correct
            (pl.col("user_answer").str.strip_chars().str.to_lowercase() == pl.col("correct_answer").str.strip_chars().str.to_lowercase())
            .cast(pl.Int8)
            .alias("correct")
        ])
        .select(["student_id", "timestamp", "question_id", "bundle_id", "tags", "elapsed_time", "correct"])
    )

    # Tags are currently in a list; flatten them into a ';'-separated string so they work in CSV
    df = df.with_columns(
        pl.col("tags")
        .cast(pl.Utf8)
        .str.replace_all(r"\[|\]|\s", "")
        .str.replace_all(",", ";")
        .alias("tags")
    )

    dfs.append(df)

# Sort by student_id, then timestamp
final_df = pl.concat(dfs, how="vertical").sort(["student_id", "timestamp"])

fname = "combined_dataset.csv"
final_df.write_csv(fname)
print(f"Saved {fname}")

### Splitting Data

Next, we split the data into training and validation sets. Our priority is to keep students separate: no student in the training set appears in the validation set, and vice versa. We also stratify the split by student activity level (the total number of interactions logged) so that neither set is skewed toward more or less active students.
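
As a small-scale illustration of this student-level stratified split, the sketch below uses 40 hypothetical students with made-up interaction counts. It uses the same `pd.qcut` and `StratifiedShuffleSplit` calls as the cell that follows, but the data, the `random_state`, and the variable names are assumptions for demonstration only.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 40 students with made-up interaction counts.
students = pd.DataFrame(
    {
        "student_id": list(range(40)),
        "n_interactions": [10 * (i + 1) for i in range(40)],
    }
)

# Quartile bins of activity level; each bin acts as one stratum.
students["activity_bin"] = pd.qcut(
    students["n_interactions"], q=4, labels=False, duplicates="drop"
)

# One 80/20 split over students (not interactions), stratified by bin.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(students, students["activity_bin"]))

train_ids = set(students.iloc[train_idx]["student_id"])
val_ids = set(students.iloc[val_idx]["student_id"])

print(f"Training students: {len(train_ids)}")
print(f"Validation students: {len(val_ids)}")
print(f"No shared students: {train_ids.isdisjoint(val_ids)}")
```

Splitting the per-student table rather than the interaction table is what guarantees that a student's interactions never end up on both sides of the split.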

In [None]:
dataset = "combined_dataset.csv"
data = pl.read_csv(dataset)

# Make a dataframe that counts the number of interactions per student
n_interactions = data.group_by("student_id").len().rename({"len": "n_interactions"})
n_interactions_df = n_interactions.to_pandas()
# print(n_interactions.head())

# Put students into 4 bins based on their number of interactions
# We do this to make sure that heavily active and inactive students are represented in both training and validation sets
bins = pd.qcut(n_interactions_df["n_interactions"], q=4, labels=False, duplicates="drop")
n_interactions_df["activity_bin"] = bins

# 80/20 split based on the activity bins
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=865)
train_idx, val_idx = next(splitter.split(n_interactions_df, n_interactions_df["activity_bin"]))

# Map the stratified split indices back to student IDs for the training and validation sets
train_students = n_interactions_df.iloc[train_idx]["student_id"].tolist()
val_students = n_interactions_df.iloc[val_idx]["student_id"].tolist()

# Sanity check
print(f"Training students: {len(train_students)}")
print(f"Validation students: {len(val_students)}")

# Filters the original data to create separate dataframes for training and validation students
train_df = data.filter(pl.col("student_id").is_in(train_students))
val_df = data.filter(pl.col("student_id").is_in(val_students))

# Save as Parquet files to reduce file size and speed up loading
train_df.write_parquet("train_data.parquet")
val_df.write_parquet("val_data.parquet")