# Notebook 1:<br> **Dataset Exploration and Preprocessing**

## Problem Framing and Dataset Discovery

### 1. Problem Definition

As defined in the `Problem Definition` document:<br>

Predict whether a student in a given course will fall into the Low, Medium, or High risk category for their end-of-course outcome, using only data recorded up to Week 8 (Day 56), so that course staff can prioritize early supportive interventions for the learners most likely to need help.

Classification:<br>

1. **Low** risk: final result is *Pass* or *Distinction*
2. **Medium** risk: final result is *Fail*
3. **High** risk: final result is *Withdrawn*

For more details, see `/reports/01_problem_definition.md` and `README.md` in this repository.
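
The class mapping above can be sketched in code. This is a minimal illustration, not the project's actual implementation: it assumes the outcome column is `final_result` (as named in this notebook) and holds exactly the four values listed. The names `RISK_BY_RESULT` and `to_risk_label` are hypothetical.

```python
import pandas as pd

# Mapping taken from the classification list above.
RISK_BY_RESULT = {
    "Pass": "Low",
    "Distinction": "Low",
    "Fail": "Medium",
    "Withdrawn": "High",
}

# Ordered dtype so that Low < Medium < High is preserved in sorting and plots.
RISK_DTYPE = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)


def to_risk_label(final_result: pd.Series) -> pd.Series:
    """Map final_result values to ordered Low/Medium/High risk labels.

    Hypothetical helper: raises if a non-missing value has no mapping,
    so unexpected outcome codes are not silently turned into NaN.
    """
    risk = final_result.map(RISK_BY_RESULT)
    unknown = final_result[risk.isna() & final_result.notna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped final_result values: {list(unknown)}")
    return risk.astype(RISK_DTYPE)
```

Raising on unmapped values is a design choice for this sketch; a lenient variant could leave them as missing instead.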


### Step 1 — Define the classification problem, unit, horizon, and prediction time
- **Cell 1 (Markdown):** Notebook title + problem framing  
  - Unit: **student–course enrolment**  
  - Prediction time: **Week 8 (Day 56)**  
  - Horizon: end-of-course outcome (`final_result`)
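
As a rough sketch of what the prediction time implies for feature building, the snippet below keeps only records dated on or before Day 56 and aggregates them per student-course enrolment. The column names `student_id`, `course_id`, and `day` are placeholders, not the dataset's actual schema, and treating Day 56 as inclusive is an assumption.

```python
import pandas as pd

CUTOFF_DAY = 56  # Week 8, as stated in the problem definition

# Hypothetical key columns identifying one student-course enrolment.
UNIT_KEYS = ["student_id", "course_id"]


def restrict_to_cutoff(
    events: pd.DataFrame, day_col: str = "day", cutoff: int = CUTOFF_DAY
) -> pd.DataFrame:
    """Keep rows recorded on or before the cutoff day (inclusive by assumption)."""
    return events.loc[events[day_col] <= cutoff].copy()


def count_events_per_unit(events: pd.DataFrame) -> pd.DataFrame:
    """Count pre-cutoff records per enrolment; later records are never seen."""
    early = restrict_to_cutoff(events)
    return (
        early.groupby(UNIT_KEYS)
        .size()
        .rename("n_events_to_cutoff")
        .reset_index()
    )
```

Enrolments with no records before the cutoff drop out of the grouped result, so a left join back onto the full enrolment table would be needed to keep them with a count of zero.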

### 2. Setup

Initial setup for the notebook:

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hashlib
import joblib

from sklearn.model_selection import StratifiedGroupKFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier

RANDOM_STATE = 42  # Seed for reproducibility
CUTOFF_DAY = 56    # Prediction cutoff: end of Week 8 (Day 56)

# Directories
RAW_DIR = Path("inputs/raw")    # raw data directory
OUT_DIR = Path("outputs")       # output directory  
FIG_DIR = OUT_DIR / "figures"   # figure output directory
TAB_DIR = OUT_DIR / "tables"    # table output directory
DATA_DIR = OUT_DIR / "data"     # processed data output directory

# Set pandas display options
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

ModuleNotFoundError: No module named 'pandas'

In [None]:
# Install requirements into the current kernel, then ensure output directories exist.
# Run this cell if the setup cell above raised ModuleNotFoundError, then re-run the
# setup cell: it defines OUT_DIR, FIG_DIR, TAB_DIR, and DATA_DIR, which are used below.
import sys
import subprocess

print("Installing requirements from requirements.txt (this may take a minute)...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
print("Done. Re-run the setup cell; if imports still fail, restart the kernel (Kernel -> Restart).")

# Ensure output directories exist (requires the setup cell to have run successfully)
for d in [OUT_DIR, FIG_DIR, TAB_DIR, DATA_DIR]:
    d.mkdir(parents=True, exist_ok=True)
print("Output directories created/verified.")
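
To see which packages are missing before running the setup imports, a check like the one below may help. It is only a sketch: `REQUIRED_MODULES` is an assumed list based on the imports in the setup cell, and `missing_modules` is a hypothetical helper. Note that the import name `sklearn` is provided by the `scikit-learn` package.

```python
from importlib.util import find_spec

# Top-level import names used by the setup cell (assumed list).
REQUIRED_MODULES = ["pandas", "numpy", "matplotlib", "joblib", "sklearn"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

`find_spec` only locates a module without importing it, so the check is cheap and does not raise when a package is absent.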