# Notebook 1:<br> **Dataset Exploration and Preprocessing**

## Problem Framing and Dataset Discovery

### 1. Problem Definition

As defined in the `Problem Definition` document:<br>

Predict whether a student enrolled in a given course falls into the Low, Medium, or High risk class for their end-of-course outcome, using only data recorded up to Week 8 (Day 56), so that course staff can prioritize early supportive interventions for the learners most likely to need help.

Classification:<br>

1. **Low** risk for *Pass* or *Distinction*
2. **Medium** risk for *Fail*  
3. **High** risk for *Withdrawn*
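
The mapping above can be sketched in pandas as follows. This is a minimal illustration, not necessarily how later notebooks implement it: the names `RISK_MAP` and `to_risk_label` are hypothetical, and the toy series is made up for demonstration.

```python
import pandas as pd

# Mapping from final_result labels to the three risk classes listed above.
RISK_MAP = {
    "Pass": "Low",
    "Distinction": "Low",
    "Fail": "Medium",
    "Withdrawn": "High",
}

def to_risk_label(final_result: pd.Series) -> pd.Series:
    """Map final_result values to Low/Medium/High; raise on unmapped values."""
    risk = final_result.map(RISK_MAP)
    unknown = final_result[risk.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped final_result values: {list(unknown)}")
    return risk

# Toy data for illustration only (not taken from the dataset).
toy = pd.Series(["Pass", "Withdrawn", "Fail", "Distinction"], name="final_result")
risk_toy = to_risk_label(toy)
print(risk_toy.tolist())
```

Raising on unmapped values is a deliberate choice in this sketch: a silent `NaN` target would otherwise surface only later, during model training.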

For more details, see `/reports/01_problem_definition.md` and `README.md` in this repository.


### Step 1: Define the classification problem, unit, horizon, and prediction time
Problem framing for this notebook:
- Unit: **student–course enrolment**  
- Prediction time: **Week 8 (Day 56)**  
- Horizon: end-of-course outcome (`final_result`)
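
To make the unit and prediction time concrete, the sketch below restricts a toy interaction log to events on or before Day 56 and aggregates it to one row per enrolment. It rests on several assumptions: the key columns `id_student`, `code_module`, `code_presentation` and the `date` and `sum_click` columns are assumed names for the raw interaction table, Day 56 is treated as inclusive, and all values are invented for illustration.

```python
import pandas as pd

CUTOFF_DAY = 56  # Week 8, the prediction time

# Assumed key columns identifying one student-course enrolment.
ENROLMENT_KEYS = ["id_student", "code_module", "code_presentation"]

# Toy interaction log for illustration only; column names are assumptions.
toy_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "code_module": ["AAA"] * 5,
    "code_presentation": ["2013J"] * 5,
    "date": [-5, 30, 70, 56, 57],   # day relative to course start
    "sum_click": [3, 10, 8, 4, 6],
})

# Keep only events observed on or before the prediction time (inclusive),
# so that no information from after Day 56 leaks into the features.
observed = toy_vle[toy_vle["date"] <= CUTOFF_DAY]

# Aggregate to one row per enrolment (the unit of prediction).
clicks_to_cutoff = (
    observed.groupby(ENROLMENT_KEYS, as_index=False)["sum_click"].sum()
    .rename(columns={"sum_click": "clicks_to_day_56"})
)
print(clicks_to_cutoff)
```

Applying the cutoff before any aggregation keeps later events out of the features by construction, rather than relying on each feature to filter correctly on its own.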

### 2. Setup

Install the project requirements into the current kernel:

In [9]:
# Install requirements into the current kernel
import sys
import subprocess
print("Installing requirements from requirements.txt (this may take a minute)...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "../requirements.txt"])
print("Done. If imports still fail, restart the kernel (Kernel -> Restart).")

Installing requirements from requirements.txt (this may take a minute)...
Done. If imports still fail, restart the kernel (Kernel -> Restart).


Imports, constants, output directories, and display options:

In [None]:
# Imports
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hashlib
import joblib

# scikit-learn imports
from sklearn.model_selection import StratifiedGroupKFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier

RANDOM_STATE = 42  # Fixed seed for reproducibility
CUTOFF_DAY = 56    # Prediction cutoff: end of Week 8 of the course (Day 56)

# Directories (relative to notebook location in notebooks/)
RAW_DIR = Path("../inputs/raw")    # raw data directory
OUT_DIR = Path("../outputs")       # output directory  
FIG_DIR = OUT_DIR / "figures"   # figure output directory
TAB_DIR = OUT_DIR / "tables"    # table output directory
DATA_DIR = OUT_DIR / "data"     # processed data output directory

# Ensure output directories exist
for d in [OUT_DIR, FIG_DIR, TAB_DIR, DATA_DIR]:
    d.mkdir(parents=True, exist_ok=True)

# Set pandas display options
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

### 3. Identify Dataset and Integrity

Data validation:

1. Verify that all expected files exist.
2. Compute MD5 checksums so that results are traceable and reproducible.

In [13]:
# Verify expected raw files exist
EXPECTED_FILES = [
    "courses.csv",
    "assessments.csv",
    "vle.csv",
    "studentInfo.csv",
    "studentRegistration.csv",
    "studentAssessment.csv",
    "studentVle.csv",
]

# Function to compute MD5 checksum of a file
def md5_file(path: Path, chunk_size: int = 2**20) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

missing = [f for f in EXPECTED_FILES if not (RAW_DIR / f).exists()]
if missing:
    raise FileNotFoundError(
        f"Missing expected CSV(s) in {RAW_DIR}:\n- " + "\n- ".join(missing)
    )

# Create inventory of raw files with sizes and MD5 checksums
inventory = []
for f in EXPECTED_FILES:
    p = RAW_DIR / f
    inventory.append({
        "file": f,
        "bytes": p.stat().st_size,
        "md5": md5_file(p)
    })

inv_df = pd.DataFrame(inventory).sort_values("file")
inv_df.to_csv(TAB_DIR / "raw_file_inventory_md5.csv", index=False)
inv_df

FileNotFoundError: Missing expected CSV(s) in inputs/raw:
- courses.csv
- assessments.csv
- vle.csv
- studentInfo.csv
- studentRegistration.csv
- studentAssessment.csv
- studentVle.csv
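
Once the raw files are in place and the inventory has been written, the recorded checksums can be used to detect whether any raw file has changed since then. The following is a minimal sketch under stated assumptions: `changed_files` is a hypothetical helper, not part of this notebook, and the demonstration uses a temporary directory with a made-up file rather than the real data.

```python
import hashlib
import tempfile
from pathlib import Path

import pandas as pd

def md5_file(path: Path, chunk_size: int = 2**20) -> str:
    """Stream a file through MD5 in fixed-size chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def changed_files(raw_dir: Path, inventory_csv: Path) -> list[str]:
    """Hypothetical helper: list files that are missing or whose MD5 differs."""
    recorded = pd.read_csv(inventory_csv, dtype=str)  # keep checksums as text
    return [
        row.file
        for row in recorded.itertuples(index=False)
        if not (raw_dir / row.file).exists()
        or md5_file(raw_dir / row.file) != row.md5
    ]

# Demonstration on a throwaway directory (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    target = raw / "courses.csv"
    target.write_text("a,b\n1,2\n")

    inv = raw / "inventory.csv"
    pd.DataFrame([{"file": "courses.csv", "md5": md5_file(target)}]).to_csv(
        inv, index=False
    )

    unchanged = changed_files(raw, inv)   # nothing modified yet
    target.write_text("a,b\n1,3\n")       # simulate an edited raw file
    after_edit = changed_files(raw, inv)

print(unchanged, after_edit)
```

MD5 is used here only as a change detector for traceability, matching the inventory above; it is not a safeguard against deliberate tampering.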