# Research Problem

Educational researchers and data scientists find it difficult to identify struggling students in online learning environments early enough to intervene. Although extensive data are available, extracting meaningful predictive signals from student interactions remains hard, particularly in the early portion of an assignment, when intervention would be most valuable. [The ASSISTments dataset](https://osf.io/59shv/files/osfstorage) offers an opportunity to address this challenge: it contains data from 88 distinct assignment-level randomized controlled experiments conducted within [the ASSISTments platform](https://www.assistments.org/). This collection, first analyzed in [Prihar et al.'s 2022 paper *Exploring Common Trends in Online Educational Experiments*](https://osf.io/f58dz), includes detailed clickstream data that captures temporal aspects of student engagement across diverse educational interventions. This multi-level interaction data makes it possible to develop and evaluate early warning systems that identify struggling students before they fall significantly behind.

# Research Question

How can temporal engagement features derived from clickstream data in the ASSISTments experimental dataset predict student performance drops across different intervention types, and which feature selection methods most effectively identify at-risk students within the first 25% of an assignment? This research will leverage the dataset's granular student interaction logs to extract time-based engagement patterns, analyze how these patterns correlate with performance outcomes, and determine which combinations of features provide the earliest reliable signals of academic struggle across different intervention conditions.
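
The "first 25% of an assignment" can be read in more than one way (elapsed time, problems attempted, actions logged). The sketch below illustrates the elapsed-time reading in plain Python. The function name `early_window_features`, the feature names, and the sample action labels are illustrative assumptions, not part of the dataset or of the pipeline that follows.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def early_window_features(events, start, end, fraction=0.25):
    """Summarise actions logged in the first `fraction` of [start, end].

    `events` is an iterable of (timestamp, action) pairs. The window is
    defined by elapsed time, which is only one possible reading of
    "the first 25% of an assignment".
    """
    cutoff = start + (end - start) * fraction
    early = sorted((ts, a) for ts, a in events if start <= ts <= cutoff)
    # Seconds between consecutive early actions.
    gaps = [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(early, early[1:])
    ]
    return {
        "cutoff": cutoff,
        "n_actions": len(early),
        "action_counts": dict(Counter(a for _, a in early)),
        "mean_gap_seconds": sum(gaps) / len(gaps) if gaps else None,
    }


# Hypothetical events for one student on one 40-minute assignment.
t0 = datetime(2021, 3, 1, 14, 0, tzinfo=timezone.utc)
sample_events = [
    (t0 + timedelta(minutes=1), "problem_started"),
    (t0 + timedelta(minutes=4), "hint_requested"),
    (t0 + timedelta(minutes=9), "answer_submitted"),
    (t0 + timedelta(minutes=25), "answer_submitted"),
]
features = early_window_features(sample_events, t0, t0 + timedelta(minutes=40))
print(features)
```

With these sample values the cutoff falls 10 minutes after the start, so only the first three events count toward the early-window features.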

# Load, Merge, and Clean Data

In [1]:
# Setup and Configuration
import polars as pl
from pathlib import Path
import gc

# --- Enable Global String Cache for Categoricals ---
pl.enable_string_cache()

# --- Configuration ---
BASE_DATA_PATH = Path('/Users/john/Downloads/osfstorage-archive')
EXPERIMENT_IDS_PATH = BASE_DATA_PATH / 'experiment_dataset_2021-09-23'

# Output path for the cleaned data
SAVE_CLEANED_PATH_POLARS_PARQUET = BASE_DATA_PATH / 'merged_experiment_data_cleaned_polars.parquet'
SAVE_CLEANED_PATH_POLARS_CSV = BASE_DATA_PATH / 'merged_experiment_data_cleaned_polars.csv'

print(f"Polars version: {pl.__version__}")
print(f"Base data path: {BASE_DATA_PATH}")
print(f"Experiment IDs path: {EXPERIMENT_IDS_PATH}")
print(f"Global String Cache enabled: {pl.using_string_cache()}")

Polars version: 1.29.0
Base data path: /Users/john/Downloads/osfstorage-archive
Experiment IDs path: /Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23
Global String Cache enabled: True


In [2]:
# Generate File Paths
try:
    if not EXPERIMENT_IDS_PATH.is_dir():
        raise FileNotFoundError(f"Error: Directory not found at {EXPERIMENT_IDS_PATH}")
    experiment_ids = [d.name for d in EXPERIMENT_IDS_PATH.iterdir() if d.is_dir()]
    print(f"Found {len(experiment_ids)} experiment ID directories.")
except FileNotFoundError as e:
    print(e)
    experiment_ids = []

performance_file_paths = [str(EXPERIMENT_IDS_PATH / exp_id / 'exp_alogs.csv') for exp_id in experiment_ids]
problems_file_paths = [str(EXPERIMENT_IDS_PATH / exp_id / 'exp_plogs.csv') for exp_id in experiment_ids]
actions_file_paths = [str(EXPERIMENT_IDS_PATH / exp_id / 'exp_slogs.csv') for exp_id in experiment_ids]
metrics_file_paths = [str(EXPERIMENT_IDS_PATH / exp_id / 'priors.csv') for exp_id in experiment_ids]

print("Sample performance file paths:", performance_file_paths[:2])
print("Sample problems file paths:", problems_file_paths[:2])
print("Sample actions file paths:", actions_file_paths[:2])
print("Sample metrics file paths:", metrics_file_paths[:2])

Found 88 experiment ID directories.
Sample performance file paths: ['/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAU85Y/exp_alogs.csv', '/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAXD6K/exp_alogs.csv']
Sample problems file paths: ['/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAU85Y/exp_plogs.csv', '/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAXD6K/exp_plogs.csv']
Sample actions file paths: ['/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAU85Y/exp_slogs.csv', '/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAXD6K/exp_slogs.csv']
Sample metrics file paths: ['/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAU85Y/priors.csv', '/Users/john/Downloads/osfstorage-archive/experiment_dataset_2021-09-23/PSAXD6K/priors.csv']


In [3]:
# Define Schemas and Date Parsing Information

actions_schema = {
    'experiment_id': pl.Categorical,
    'student_id': pl.Categorical,
    'problem_id': pl.Categorical,
    'problem_part': pl.Categorical, 
    'scaffold_id': pl.Categorical,  
    'experiment_tag_path': pl.Utf8,
    'action': pl.Categorical,
    'timestamp': pl.Utf8,
    'assistments_reference_action_log_id': pl.UInt64
}
actions_parse_dates = ['timestamp']

problems_schema = {
    'experiment_id': pl.Categorical,
    'student_id': pl.Categorical,
    'problem_id': pl.Categorical,
    'problem_part': pl.Categorical, 
    'scaffold_id': pl.Categorical,  
    'problem_condition': pl.Categorical,
    'start_time': pl.Utf8,
    'end_time': pl.Utf8,
    'session_count': pl.UInt16,
    'time_on_task': pl.Float32,
    'first_response_or_request_time': pl.Float32,
    'first_answer': pl.Utf8,
    'correct': pl.Boolean,
    'reported_score': pl.Float32,
    'answer_before_tutoring': pl.Boolean,
    'attempt_count': pl.UInt16,
    'hints_available': pl.UInt16,
    'hints_given': pl.UInt16,
    'scaffold_problems_available': pl.UInt16,
    'scaffold_problems_given': pl.UInt16,
    'explanation_available': pl.Boolean,
    'explanation_given': pl.Boolean,
    'answer_given': pl.Boolean,
    'assistments_reference_problem_log_id': pl.UInt64
}
problems_parse_dates = ['start_time', 'end_time']

performance_schema = {
    'experiment_id': pl.Categorical,
    'student_id': pl.Categorical,
    'release_date': pl.Utf8,
    'due_date': pl.Utf8,
    'start_time': pl.Utf8,
    'end_time': pl.Utf8,
    'assignment_session_count': pl.Float32,
    'pretest_problem_count': pl.Float32,
    'pretest_correct': pl.Float32,
    'pretest_time_on_task': pl.Float32,
    'pretest_average_first_response_time': pl.Float32,
    'pretest_session_count': pl.Float32,
    'assigned_condition': pl.Categorical,
    'condition_time_on_task': pl.Float32,
    'condition_average_first_response_or_request_time': pl.Float32,
    'condition_problem_count': pl.Float32,
    'condition_total_correct': pl.Float32,
    'condition_total_correct_after_wrong_response': pl.Float32,
    'condition_total_correct_after_tutoring': pl.Float32,
    'condition_total_answers_before_tutoring': pl.Float32,
    'condition_total_attempt_count': pl.Float32,
    'condition_total_hints_available': pl.Float32,
    'condition_total_hints_given': pl.Float32,
    'condition_total_scaffold_problems_available': pl.Float32,
    'condition_total_scaffold_problems_given': pl.Float32,
    'condition_total_explanations_available': pl.Float32,
    'condition_total_explanations_given': pl.Float32,
    'condition_total_answers_given': pl.Float32,
    'condition_session_count': pl.Float32,
    'posttest_problem_count': pl.Float32,
    'posttest_correct': pl.Float32,
    'posttest_time_on_task': pl.Float32,
    'posttest_average_first_response_time': pl.Float32,
    'posttest_session_count': pl.Float32,
    'assistments_reference_assignment_log_id': pl.UInt64
}
performance_parse_dates = ['release_date', 'due_date', 'start_time', 'end_time']

metrics_schema = {
    'experiment_id': pl.Categorical,
    'student_id': pl.Categorical,
    'student_prior_started_skill_builder_count': pl.UInt32,
    'student_prior_completed_skill_builder_count': pl.UInt32,
    'student_prior_started_problem_set_count': pl.UInt32,
    'student_prior_completed_problem_set_count': pl.UInt32,
    'student_prior_completed_problem_count': pl.UInt32,
    'student_prior_median_first_response_time': pl.Float32,
    'student_prior_median_time_on_task': pl.Float32,
    'student_prior_average_correctness': pl.Float32,
    'student_prior_average_attempt_count': pl.Float32,
    'class_id': pl.Categorical,
    'class_creation_date': pl.Utf8,
    'class_student_count': pl.UInt16,
    'class_prior_skill_builder_count': pl.UInt32,
    'class_prior_problem_set_count': pl.UInt32,
    'class_prior_skill_builder_percent_started': pl.Float32,
    'class_prior_skill_builder_percent_completed': pl.Float32,
    'class_prior_problem_set_percent_started': pl.Float32,
    'class_prior_problem_set_percent_completed': pl.Float32,
    'class_prior_completed_problem_count': pl.UInt32,
    'class_prior_median_time_on_task': pl.Float32,
    'class_prior_median_first_response_time': pl.Float32,
    'class_prior_average_correctness': pl.Float32,
    'class_prior_average_attempt_count': pl.Float32,
    'teacher id': pl.Categorical, 
    'teacher_account_creation_date': pl.Utf8,
    'district_id': pl.Categorical,
    'location': pl.Categorical,
    'opportunity_zone': pl.Categorical,
    'locale_description': pl.Categorical
}
metrics_parse_dates = ['class_creation_date', 'teacher_account_creation_date']
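
Because a complete `schema` is passed to `scan_csv` later on, column names appear to come from the schema rather than from each file's header. Treat that as an assumption to verify against the Polars documentation. If it holds, a column-order difference in any one file would go unnoticed. The sketch below compares a CSV header against the expected column list using only the standard library. The helper name `header_mismatches` and the sample file are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def header_mismatches(csv_path, expected_columns):
    """Return (position, expected, found) for every header cell that differs."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    expected = list(expected_columns)
    width = max(len(header), len(expected))
    # Pad the shorter list so missing or extra columns are reported too.
    padded_found = header + [None] * (width - len(header))
    padded_expected = expected + [None] * (width - len(expected))
    return [
        (i, exp, got)
        for i, (exp, got) in enumerate(zip(padded_expected, padded_found))
        if exp != got
    ]


# Hypothetical three-column file whose header uses 'teacher id' (with a space).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "priors_sample.csv"
    sample.write_text(
        "experiment_id,student_id,teacher id\nPSA1,s1,t9\n", encoding="utf-8"
    )
    mismatches = header_mismatches(
        sample, ["experiment_id", "student_id", "teacher_id"]
    )

print(mismatches)
```

Running a check like this over every file before scanning, for example with `list(metrics_schema.keys())` as the expected list, would flag header differences such as the `'teacher id'` spelling.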

In [4]:
# Helper Function for Memory-Efficient CSV Concatenation

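def combine_polars_csvs(file_paths, schema=None, parse_dates_list=None,
                        known_date_format_str: str = None,
                        date_time_unit='us'):
    """Lazily scan each CSV, parse the listed date columns to UTC, and
    concatenate the results into one DataFrame (returns None on failure)."""
    lazy_frames = []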
    print(f"\nScanning {len(file_paths)} files...")

    common_columns_from_first_file = None
    if file_paths and schema:
        try:
            common_columns_from_first_file = pl.scan_csv(
                file_paths[0], infer_schema_length=100, n_rows=10
            ).collect_schema().names()
        except Exception as e:
            print(f"  Warning: Could not determine common columns from first file {file_paths[0]}: {e}")
            common_columns_from_first_file = list(schema.keys())

    problematic_file_for_date_parse = None
    current_col_for_date_parse = "unknown"

    for i, file_path_str in enumerate(file_paths):
        file_path = Path(file_path_str)
        if i % 10 == 0:
            print(f"  Scanning file {i+1}/{len(file_paths)}: {file_path.parent.name}/{file_path.name}")

        try:
            lf = pl.scan_csv(file_path,
                             schema=schema,
                             infer_schema_length=100,
                             null_values=["", "NA", "NaN", "null"])

            if parse_dates_list:
                date_parsing_expressions = []
                columns_to_check_for_dates = common_columns_from_first_file if common_columns_from_first_file else lf.collect_schema().names()

                for col_name in parse_dates_list:
                    current_col_for_date_parse = col_name
                    if col_name in columns_to_check_for_dates:
                        if col_name not in lf.collect_schema().names():
                            continue

                        date_expr = pl.col(col_name).cast(pl.Utf8, strict=False)

                        if known_date_format_str:
                            date_expr = date_expr.str.to_datetime(
                                format=known_date_format_str,
                                strict=False,
                                time_unit=date_time_unit
                            )
                        else:
                            date_expr = date_expr.str.to_datetime(
                                strict=False,
                                time_unit=date_time_unit
                            )
                        date_parsing_expressions.append(
                            date_expr.dt.convert_time_zone("UTC").alias(col_name)
                        )
                if date_parsing_expressions:
                    lf = lf.with_columns(date_parsing_expressions)

            lazy_frames.append(lf)
            problematic_file_for_date_parse = None

        except FileNotFoundError:
            print(f"  Warning: File not found, skipping: {file_path}")
        except pl.exceptions.NoDataError:
            print(f"  Warning: File is empty, skipping: {file_path}")
        except Exception as e:
            problematic_file_for_date_parse = file_path
            if "strptime" in str(e).lower() or "conversion" in str(e).lower() or "datetime" in str(e).lower():
                print(f"  Potential date parsing error for {problematic_file_for_date_parse} (column likely '{current_col_for_date_parse}'): {e}")
            else:
                print(f"  Error scanning {file_path} or applying initial transforms: {e}")


    if not lazy_frames:
        print("  No lazy frames were created from scanning files.")
        return None

    print(f"Concatenating {len(lazy_frames)} lazy frames...")
    try:
        combined_lf = pl.concat(lazy_frames, how="vertical_relaxed")
        print("Collecting data into DataFrame (streaming enabled)...")
        # Reverted to engine="streaming" as per deprecation warning for Polars 1.29.0
        collected_df = combined_lf.collect(engine="streaming")
        print("Concatenation and collection complete.")
        return collected_df
    except Exception as e:
        print(f"Error during lazy concatenation or collection: {e}")
        if problematic_file_for_date_parse:
            print(f"  This might be related to an earlier issue in file: {problematic_file_for_date_parse}")
        return None

In [5]:
# Load DataFrames

COMMON_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S%.f%z" 

print("Combining Actions Data (exp_slogs)...")
actions_df = combine_polars_csvs(
    actions_file_paths, 
    schema=actions_schema, 
    parse_dates_list=actions_parse_dates,
    known_date_format_str=COMMON_DATETIME_FORMAT 
)
if actions_df is not None:
    print(f"Actions DataFrame shape: {actions_df.shape}")

print("\nCombining Problems Data (exp_plogs)...")
problems_df = combine_polars_csvs(
    problems_file_paths, 
    schema=problems_schema, 
    parse_dates_list=problems_parse_dates,
    known_date_format_str=COMMON_DATETIME_FORMAT 
)
if problems_df is not None:
    print(f"Problems DataFrame shape: {problems_df.shape}")

print("\nCombining Performance Data (exp_alogs)...")
performance_df = combine_polars_csvs(
    performance_file_paths, 
    schema=performance_schema, 
    parse_dates_list=performance_parse_dates,
    known_date_format_str=COMMON_DATETIME_FORMAT
)
if performance_df is not None:
    print(f"Performance DataFrame shape: {performance_df.shape}")

print("\nCombining Metrics Data (priors)...")
if 'teacher id' in metrics_schema: 
    metrics_schema_corrected = metrics_schema.copy()
    metrics_schema_corrected['teacher_id'] = metrics_schema_corrected.pop('teacher id')
else:
    metrics_schema_corrected = metrics_schema

metrics_df = combine_polars_csvs(
    metrics_file_paths, 
    schema=metrics_schema_corrected, 
    parse_dates_list=metrics_parse_dates,
    known_date_format_str=COMMON_DATETIME_FORMAT
)
if metrics_df is not None:
    if 'teacher id' in metrics_df.columns: 
        metrics_df = metrics_df.rename({'teacher id': 'teacher_id'})
    print(f"Metrics DataFrame shape: {metrics_df.shape}")

gc.collect()

Combining Actions Data (exp_slogs)...

Scanning 88 files...
  Scanning file 1/88: PSAU85Y/exp_slogs.csv
  Scanning file 11/88: PSAYCFH/exp_slogs.csv
  Scanning file 21/88: PSAZ2G4/exp_slogs.csv
  Scanning file 31/88: PSAQJFP/exp_slogs.csv
  Scanning file 41/88: PSAJVP8/exp_slogs.csv
  Scanning file 51/88: PSA9XWV/exp_slogs.csv
  Scanning file 61/88: PSAM4NK/exp_slogs.csv
  Scanning file 71/88: PSATP2Z/exp_slogs.csv
  Scanning file 81/88: PSASDZY/exp_slogs.csv
Concatenating 88 lazy frames...
Collecting data into DataFrame (streaming enabled)...
Concatenation and collection complete.
Actions DataFrame shape: (3708299, 9)

Combining Problems Data (exp_plogs)...

Scanning 88 files...
  Scanning file 1/88: PSAU85Y/exp_plogs.csv
  Scanning file 11/88: PSAYCFH/exp_plogs.csv
  Scanning file 21/88: PSAZ2G4/exp_plogs.csv
  Scanning file 31/88: PSAQJFP/exp_plogs.csv
  Scanning file 41/88: PSAJVP8/exp_plogs.csv
  Scanning file 51/88: PSA9XWV/exp_plogs.csv
  Scanning file 61/88: PSAM4NK/exp_plogs.c

In [6]:
# Merge DataFrames into One

merged_df = None
merge_successful = True

print("\n--- Starting Merge Operations ---")

if actions_df is None or actions_df.is_empty():
    print("Actions DataFrame is empty or None. Cannot proceed with merge.")
    merge_successful = False
else:
    merged_df = actions_df.clone()
    print(f"Starting with actions_df: {merged_df.shape}")

    # Merge Problems Data
    if problems_df is not None and not problems_df.is_empty() and merge_successful:
        try:
            print("Merging problems_df...")
            problem_keys = ['experiment_id', 'student_id', 'problem_id', 'problem_part', 'scaffold_id']
            merged_df = merged_df.join(problems_df, on=problem_keys, how="left", suffix="_problem")
            print(f"After merging problems_df: {merged_df.shape}")
            del problems_df 
            gc.collect()
        except Exception as e:
            print(f"Error merging problems_df: {e}")
            merge_successful = False
    elif merge_successful: 
        print("Skipping problems_df merge (not loaded or empty).")

    # Merge Performance Data
    if performance_df is not None and not performance_df.is_empty() and merge_successful:
        try:
            print("Merging performance_df...")
            perf_keys = ['experiment_id', 'student_id']
            merged_df = merged_df.join(performance_df, on=perf_keys, how="left", suffix="_perf")
            print(f"After merging performance_df: {merged_df.shape}")
            del performance_df
            gc.collect()
        except Exception as e:
            print(f"Error merging performance_df: {e}")
            merge_successful = False
    elif merge_successful:
        print("Skipping performance_df merge (not loaded or empty).")

    # Merge Metrics Data
    if metrics_df is not None and not metrics_df.is_empty() and merge_successful:
        try:
            print("Merging metrics_df...")
            metrics_keys = ['experiment_id', 'student_id']
            merged_df = merged_df.join(metrics_df, on=metrics_keys, how="left", suffix="_metrics")
            print(f"After merging metrics_df: {merged_df.shape}")
            del metrics_df
            gc.collect()
        except Exception as e:
            print(f"Error merging metrics_df: {e}")
            merge_successful = False
    elif merge_successful:
        print("Skipping metrics_df merge (not loaded or empty).")

    if merged_df is not None and merge_successful:
        print("\n--- Merge Complete ---")
        print("Final Merged DataFrame Info:")
        print(f"Shape: {merged_df.shape}")
        print("Columns in merged_df:", merged_df.columns)
        print(merged_df.head())
        print(merged_df.schema)
        
        if 'actions_df' in locals() and actions_df is not merged_df: 
            del actions_df
            gc.collect()
            
    elif merged_df is not None: 
        print("\n--- Merge Partially Complete or Some DataFrames Skipped ---")
        print("Columns in partially merged_df:", merged_df.columns)
    else: 
        print("\n--- Merge Failed or Base DataFrame (actions_df) was not suitable ---")


--- Starting Merge Operations ---
Starting with actions_df: (3708299, 9)
Merging problems_df...
After merging problems_df: (3708299, 28)
Merging performance_df...
After merging performance_df: (3711215, 61)
Merging metrics_df...
After merging metrics_df: (3711215, 90)

--- Merge Complete ---
Final Merged DataFrame Info:
Shape: (3711215, 90)
Columns in merged_df: ['experiment_id', 'student_id', 'problem_id', 'problem_part', 'scaffold_id', 'experiment_tag_path', 'action', 'timestamp', 'assistments_reference_action_log_id', 'problem_condition', 'start_time', 'end_time', 'session_count', 'time_on_task', 'first_response_or_request_time', 'first_answer', 'correct', 'reported_score', 'answer_before_tutoring', 'attempt_count', 'hints_available', 'hints_given', 'scaffold_problems_available', 'scaffold_problems_given', 'explanation_available', 'explanation_given', 'answer_given', 'assistments_reference_problem_log_id', 'release_date', 'due_date', 'start_time_perf', 'end_time_perf', 'assignment_

# Data Cleaning

In [7]:
# Data Cleaning

if 'merged_df' in locals() and merged_df is not None and merge_successful: 
    print("\n--- Starting Data Cleaning ---")
    print(f"Initial merged_df shape for cleaning: {merged_df.shape}")

    print("\n--- Renaming Columns ---")
    rename_map = {
        'assistments_reference_action_log_id': 'action_log_id',
        'start_time_perf': 'assignment_start_time',
        'end_time_perf': 'assignment_end_time',
        'assistments_reference_assignment_log_id': 'assignment_log_id'
    }
    actual_renames = {k: v for k, v in rename_map.items() if k in merged_df.columns}
    if actual_renames:
        print(f"Applying renames: {actual_renames}")
        merged_df = merged_df.rename(actual_renames)
    else:
        print("No columns matched for renaming based on the current rename_map.")

    column_transformations = []

    datetime_cols_final_check = [
        'timestamp', 'start_time', 'end_time', 'release_date', 'due_date',
        'assignment_start_time', 'assignment_end_time',
        'class_creation_date', 'teacher_account_creation_date'
    ]
    for col_name in datetime_cols_final_check:
        if col_name in merged_df.columns:
            current_dtype = merged_df[col_name].dtype
            if current_dtype == pl.Utf8:
                print(f"Scheduled for datetime re-parsing (UTF8 found): {col_name}")
                column_transformations.append(
                    pl.col(col_name).str.to_datetime(format=COMMON_DATETIME_FORMAT, strict=False, time_unit='us')
                    .dt.convert_time_zone("UTC")
                    .alias(col_name)
                )
            elif isinstance(current_dtype, pl.Datetime):
                current_tz = current_dtype.time_zone
                if current_tz is None:
                    print(f"Info: Datetime column '{col_name}' is naive. Localizing to UTC.")
                    column_transformations.append(
                        pl.col(col_name).dt.replace_time_zone("UTC", ambiguous='earliest').alias(col_name)
                    )
                elif current_tz != "UTC":
                    print(f"Scheduled for UTC conversion (already datetime, was '{current_tz}'): {col_name}")
                    column_transformations.append(
                        pl.col(col_name).dt.convert_time_zone("UTC").alias(col_name)
                    )

    cols_to_category_polars = [
        'experiment_id', 'student_id', 'problem_id', 'problem_part', 'scaffold_id',
        'experiment_tag_path', 'action', 'problem_condition', 'assigned_condition',
        'class_id', 'district_id', 'location', 'opportunity_zone',
        'locale_description', 'teacher_id'
    ]
    for col_name in cols_to_category_polars:
        if col_name in merged_df.columns and merged_df[col_name].dtype != pl.Categorical:
            column_transformations.append(pl.col(col_name).cast(pl.Categorical).alias(col_name))
            print(f"Scheduled for categorical conversion: {col_name}")

    float_to_int_casts = {
        'assignment_session_count': pl.UInt16, 'pretest_problem_count': pl.UInt16,
        'pretest_correct': pl.UInt16, 'pretest_session_count': pl.UInt16,
        'condition_problem_count': pl.UInt16, 'condition_total_correct': pl.UInt16,
        'condition_total_correct_after_wrong_response': pl.UInt16,
        'condition_total_correct_after_tutoring': pl.UInt16,
        'condition_total_answers_before_tutoring': pl.UInt16,
        'condition_total_attempt_count': pl.UInt32,
        'condition_total_hints_available': pl.UInt32, 'condition_total_hints_given': pl.UInt32,
        'condition_total_scaffold_problems_available': pl.UInt32,
        'condition_total_scaffold_problems_given': pl.UInt32,
        'condition_total_explanations_available': pl.UInt32,
        'condition_total_explanations_given': pl.UInt32,
        'condition_total_answers_given': pl.UInt32,
        'condition_session_count': pl.UInt16, 'posttest_problem_count': pl.UInt16,
        'posttest_correct': pl.UInt16, 'posttest_session_count': pl.UInt16,
    }
    for col_name, target_int_type in float_to_int_casts.items():
        if col_name in merged_df.columns:
            if merged_df[col_name].dtype == pl.Float32:
                column_transformations.append(
                    pl.col(col_name)
                      # Note: filling with 0 makes a missing count indistinguishable from a true zero.
                      .fill_null(0)
                      .cast(target_int_type, strict=False)
                      .alias(col_name)
                )
                print(f"Scheduled '{col_name}' for Float32 to {target_int_type} conversion.")
            elif merged_df[col_name].dtype != target_int_type:
                print(f"Warning: Column '{col_name}' was expected to be Float32 for int conversion, but found {merged_df[col_name].dtype}. Skipping specific int cast.")

    # General Float64 to Float32 pass
    float64_cols = [col_name for col_name, dtype in merged_df.schema.items() if dtype == pl.Float64]
    for col_name in float64_cols:
        if col_name in merged_df.columns:
            column_transformations.append(pl.col(col_name).cast(pl.Float32).alias(col_name))
            print(f"Scheduled for Float64 to Float32 conversion: {col_name}")

    if column_transformations:
        print("\nApplying column type transformations...")
        merged_df = merged_df.with_columns(column_transformations)
        print("Type transformations applied.")

    print("\n--- Specific Value Cleaning ---")
    specific_value_cleaning_expressions = []

    if 'opportunity_zone' in merged_df.columns:
        if merged_df['opportunity_zone'].dtype != pl.Categorical:
            merged_df = merged_df.with_columns(pl.col('opportunity_zone').cast(pl.Categorical))
        specific_value_cleaning_expressions.append(
            pl.when(pl.col('opportunity_zone').cast(pl.Utf8) == "Yes").then(True)
              .when(pl.col('opportunity_zone').cast(pl.Utf8) == "No").then(False)
              .otherwise(None)
              .cast(pl.Boolean)
              .alias('opportunity_zone_bool')
        )
        print("Scheduled 'opportunity_zone' to boolean 'opportunity_zone_bool' conversion.")

    cat_cols_to_fill_info = {
        'district_id': 'Unknown_District',
        'location': 'Unknown_Location',
        'locale_description': 'Unknown_Locale'
    }
    for col_name, fill_val in cat_cols_to_fill_info.items():
        if col_name in merged_df.columns:
            if merged_df[col_name].dtype != pl.Categorical:
                merged_df = merged_df.with_columns(pl.col(col_name).cast(pl.Categorical))
                print(f"Casted '{col_name}' to Categorical before fill_null.")
            specific_value_cleaning_expressions.append(pl.col(col_name).fill_null(pl.lit(fill_val).cast(pl.Categorical)).alias(col_name))
            print(f"Scheduled fill_null for categorical {col_name} with '{fill_val}'.")

    if specific_value_cleaning_expressions:
        print("\nApplying specific value cleaning expressions...")
        merged_df = merged_df.with_columns(specific_value_cleaning_expressions)
        print("Specific value cleaning applied.")

    pandas_identified_empty_cols = [
         'problem_condition', 'start_time', 'end_time', 'session_count', 'time_on_task',
         'first_response_or_request_time', 'first_answer', 'correct', 'reported_score',
         'answer_before_tutoring', 'attempt_count', 'hints_available', 'hints_given',
         'scaffold_problems_available', 'scaffold_problems_given', 'explanation_available',
         'explanation_given', 'answer_given',
         'assistments_reference_problem_log_id'
    ]
    actual_empty_cols_to_drop = []
    if not merged_df.is_empty():
        for col_name in pandas_identified_empty_cols:
            if col_name in merged_df.columns and merged_df[col_name].is_null().all():
                actual_empty_cols_to_drop.append(col_name)
            elif col_name in merged_df.columns:
                null_count = merged_df[col_name].is_null().sum()
                if null_count > 0 :
                    print(f"Info: Column '{col_name}' (candidate for empty drop) was not fully null. Nulls: {null_count}/{merged_df.height}")

    if actual_empty_cols_to_drop:
        print(f"\nDropping fully empty columns: {actual_empty_cols_to_drop}")
        merged_df = merged_df.drop(actual_empty_cols_to_drop)
    else:
        print("\nNo fully empty columns (from the predefined list) identified for dropping.")

    if 'opportunity_zone' in merged_df.columns and 'opportunity_zone_bool' in merged_df.columns:
        print("Dropping original opportunity zone column: 'opportunity_zone'")
        merged_df = merged_df.drop('opportunity_zone')

    print(f"\nShape after Cleaning: {merged_df.shape}")
    print("Columns after cleaning:", merged_df.columns)
    gc.collect()

else:
    print("Skipping Cell 7 cleaning: merged_df not available, previous merge failed, or merge_successful flag is False.")


--- Starting Data Cleaning ---
Initial merged_df shape for cleaning: (3711215, 90)

--- Renaming Columns ---
Applying renames: {'assistments_reference_action_log_id': 'action_log_id', 'start_time_perf': 'assignment_start_time', 'end_time_perf': 'assignment_end_time', 'assistments_reference_assignment_log_id': 'assignment_log_id'}
Scheduled for categorical conversion: experiment_tag_path
Scheduled 'assignment_session_count' for Float32 to UInt16 conversion.
Scheduled 'pretest_problem_count' for Float32 to UInt16 conversion.
Scheduled 'pretest_correct' for Float32 to UInt16 conversion.
Scheduled 'pretest_session_count' for Float32 to UInt16 conversion.
Scheduled 'condition_problem_count' for Float32 to UInt16 conversion.
Scheduled 'condition_total_correct' for Float32 to UInt16 conversion.
Scheduled 'condition_total_correct_after_wrong_response' for Float32 to UInt16 conversion.
Scheduled 'condition_total_correct_after_tutoring' for Float32 to UInt16 conversion.
Scheduled 'condition_tot

In [8]:
# Create Reduced DataFrame and Save

if 'merged_df' in locals() and merged_df is not None and merge_successful:
    print("\n--- Reducing DataFrame to Essential Columns ---")

    essential_cols_to_keep_polars = [
        # Keys / Base Info from actions_df
        'experiment_id', 
        'student_id', 
        'timestamp', 
        'action', 
        'action_log_id',
        
        # From performance_df 
        'assignment_start_time', 
        'assignment_end_time',   
        'assignment_log_id',     
        'assignment_session_count', 
        'condition_problem_count',  
        'condition_time_on_task',   
        'condition_average_first_response_or_request_time', 
        'condition_total_correct',  
        'condition_total_attempt_count', 
        'condition_total_hints_given', 
        'condition_total_explanations_given', 
        
        # From metrics_df 
        'student_prior_average_correctness', 
        
        'opportunity_zone_bool', 
    ]
    
    # Filter to only include columns that actually exist in the cleaned merged_df
    final_essential_columns = [col for col in essential_cols_to_keep_polars if col in merged_df.columns]
    
    print(f"Attempting to select these {len(final_essential_columns)} essential columns: {final_essential_columns}")
    missing_essentials_for_reduction = [col for col in essential_cols_to_keep_polars if col not in final_essential_columns]

    if missing_essentials_for_reduction:
        print(f"Warning: these essential columns were not found in merged_df and will be skipped: {missing_essentials_for_reduction}")
        print("Check that the names in 'essential_cols_to_keep_polars' match the columns produced by Cell 7.")
    
    if not final_essential_columns:
        print("Error: No essential columns available for selection based on your list. Cannot create reduced DataFrame.")
        merged_df_reduced = None
    else:
        try:
            merged_df_reduced = merged_df.select(final_essential_columns)
            print(f"\nReduced DataFrame Info: Shape {merged_df_reduced.shape}")
            print(merged_df_reduced.head())
            print(merged_df_reduced.schema)

            print(f"\nAttempting to save cleaned and reduced DataFrame to: {SAVE_CLEANED_PATH_POLARS_PARQUET}")
            merged_df_reduced.write_parquet(SAVE_CLEANED_PATH_POLARS_PARQUET) 
            print(f"Successfully saved to {SAVE_CLEANED_PATH_POLARS_PARQUET}")
            
        except Exception as e:
            print(f"Error during final select or save: {e}")
            merged_df_reduced = None
            
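    # merged_df is guaranteed to exist in this branch, so no extra guard is needed.
    # Free the full frame; only merged_df_reduced is used from here on.
    del merged_df
    gc.collect()
    print("\nFull cleaned merged_df deleted from memory.")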

else:
    print("Skipping Cell 8 (reduction and save): merged_df not available from Cell 7 or previous steps failed.")
    merged_df_reduced = None


--- Reducing DataFrame to Essential Columns ---
Attempting to select these 18 essential columns: ['experiment_id', 'student_id', 'timestamp', 'action', 'action_log_id', 'assignment_start_time', 'assignment_end_time', 'assignment_log_id', 'assignment_session_count', 'condition_problem_count', 'condition_time_on_task', 'condition_average_first_response_or_request_time', 'condition_total_correct', 'condition_total_attempt_count', 'condition_total_hints_given', 'condition_total_explanations_given', 'student_prior_average_correctness', 'opportunity_zone_bool']

Reduced DataFrame Info: Shape (3711215, 18)
shape: (5, 18)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ experimen ┆ student_i ┆ timestamp ┆ action    ┆ … ┆ condition ┆ condition ┆ student_p ┆ opportun │
│ t_id      ┆ d         ┆ ---       ┆ ---       ┆   ┆ _total_hi ┆ _total_ex ┆ rior_aver ┆ ity_zone │
│ ---       ┆ ---       ┆ datetime[ ┆ cat       ┆   ┆ nts_given ┆ planatio
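
The column check in Cell 8 builds two lists: the requested columns that exist in the frame and the ones that do not. As a minimal sketch, that logic could be factored into a small reusable helper. The function name `partition_columns` and the example column lists are hypothetical and are not part of the notebook; the helper works on plain lists of names, so it would be called with `merged_df.columns` as the second argument.

```python
from typing import Iterable, List, Tuple


def partition_columns(
    wanted: Iterable[str], available: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split `wanted` into (present, missing) relative to `available`.

    The order of `wanted` is preserved in both returned lists.
    """
    available_set = set(available)  # set membership keeps the check cheap
    present: List[str] = []
    missing: List[str] = []
    for col in wanted:
        if col in available_set:
            present.append(col)
        else:
            missing.append(col)
    return present, missing


# Illustrative names only; 'not_a_real_column' is a made-up placeholder.
present, missing = partition_columns(
    ["experiment_id", "student_id", "not_a_real_column"],
    ["student_id", "experiment_id", "timestamp"],
)
print(present)
print(missing)
```

Returning both lists from one pass keeps the "select" step and the "warn about missing names" step in agreement, since they are derived from the same comparison.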

In [9]:
# Load Cleaned Data

if 'merged_df_reduced' in locals() and merged_df_reduced is not None and not merged_df_reduced.is_empty() and SAVE_CLEANED_PATH_POLARS_PARQUET.exists():
    print("\n--- Loading Cleaned Parquet File ---")
    try:
        df_reloaded_polars = pl.read_parquet(SAVE_CLEANED_PATH_POLARS_PARQUET)
        print(f"Successfully reloaded: {SAVE_CLEANED_PATH_POLARS_PARQUET}")
        print(f"Reloaded DataFrame Shape: {df_reloaded_polars.shape}")
        print("\nReloaded DataFrame Head (from Parquet):")
        print(df_reloaded_polars.head())
        print("\nReloaded DataFrame Schema (from Parquet):")
        print(df_reloaded_polars.schema)
    except Exception as e:
        print(f"An error occurred while reloading the cleaned Parquet file: {e}")
elif SAVE_CLEANED_PATH_POLARS_PARQUET.exists():
    print(f"Cleaned Parquet file found at {SAVE_CLEANED_PATH_POLARS_PARQUET}, but merged_df_reduced is missing or empty (the notebook may have been re-run starting from this cell). Reload the file manually if needed.")
else:
    print(f"\nCleaned Parquet file not found at {SAVE_CLEANED_PATH_POLARS_PARQUET} or reduction/save step failed.")


--- Loading Cleaned Parquet File ---
Successfully reloaded: /Users/john/Downloads/osfstorage-archive/merged_experiment_data_cleaned_polars.parquet
Reloaded DataFrame Shape: (3711215, 18)

Reloaded DataFrame Head (from Parquet):
shape: (5, 18)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ experimen ┆ student_i ┆ timestamp ┆ action    ┆ … ┆ condition ┆ condition ┆ student_p ┆ opportun │
│ t_id      ┆ d         ┆ ---       ┆ ---       ┆   ┆ _total_hi ┆ _total_ex ┆ rior_aver ┆ ity_zone │
│ ---       ┆ ---       ┆ datetime[ ┆ cat       ┆   ┆ nts_given ┆ planation ┆ age_corre ┆ _bool    │
│ cat       ┆ cat       ┆ μs, UTC]  ┆           ┆   ┆ ---       ┆ s_g…      ┆ ctn…      ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆ u32       ┆ ---       ┆ ---       ┆ bool     │
│           ┆           ┆           ┆           ┆   ┆           ┆ u32       ┆ f32       ┆          │
╞═══════════╪═══════════╪═══════════╪═══════════╪