# Overall EDA Summary

## Dataset Overview

The training data contains **630,000 rows** and **13 columns**. The target variable is **`exam_score`** (continuous, on a 0-100 scale). The **`id`** column is a unique sequential identifier (0 to 629,999) with no inherent predictive value, so it should be dropped before modeling.

- **Data quality:**
  - **No missing values** across any features or the target.
  - Data types are clean and appropriate after casting:
    - **Numeric:** `age`, `study_hours`, `class_attendance`, `sleep_hours`, `exam_score`
    - **Categorical:** `gender`, `course`, `internet_access`, `sleep_quality`, `study_method`, `facility_rating`, `exam_difficulty`
- **Typical student profile (central tendency):**
  - **Age:** mean ~**20.55** years (median **21**, range **17-24**)
  - **Study hours:** mean ~**4.00** (median **4.00**, range **0.08-7.91**)
  - **Class attendance (%):** mean ~**71.99** (median **72.6**, range **40.6-99.4**)
  - **Sleep hours:** mean ~**7.07** (median **7.1**, range **4.1-9.9**)
  - **Exam score:** mean ~**62.51** (median **62.6**, range **19.6-100**)
- **Spread / variability:**
  - `exam_score` has **substantial variability** (std ~**18.92**), suggesting meaningful separation between low- and high-performing students.
  - `class_attendance` also varies widely (std ~**17.43**), while `age` is comparatively tight (std ~**2.26**).
- **Notable extremes & sanity checks:**
  - Very low study time values exist (down to **0.08 hours**), and attendance can be as low as **~40%**, which may represent legitimately low-engagement students rather than data issues (since missingness is zero and ranges look plausible).
  - Scores span most of the 0-100 scale (**19.6 to 100**), with no obvious signs of clipping.
- **Model-readiness implications:**
  - This is a **mixed-type regression problem**: four numeric and seven categorical predictors, where the categoricals will require **encoding** (one-hot, ordinal, or target encoding, depending on the feature).
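
The checks summarized above can be reproduced with a few lines of pandas. This is a minimal sketch on a small synthetic stand-in frame: the column names come from the dataset, but the category levels and the generated values are assumptions for illustration only, and the real data would be loaded from disk instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (assumption: the real frame is
# loaded elsewhere). Category levels below are made up for illustration.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "id": np.arange(n),
    "age": rng.integers(17, 25, n),
    "gender": rng.choice(["female", "male", "other"], n),
    "study_hours": rng.uniform(0.08, 7.91, n).round(2),
    "class_attendance": rng.uniform(40.6, 99.4, n).round(1),
    "sleep_quality": rng.choice(["poor", "average", "good"], n),
    "exam_score": rng.uniform(19.6, 100.0, n).round(1),
})

# Cast string columns to pandas' categorical dtype.
cat_cols = ["gender", "sleep_quality"]
df = df.astype({c: "category" for c in cat_cols})

# Missing values per column, and whether `id` is a unique 0..n-1 sequence.
missing = df.isna().sum()
id_is_sequential = df["id"].is_unique and (df["id"].to_numpy() == np.arange(len(df))).all()

# Central tendency, spread, and range for the numeric columns.
num_cols = ["age", "study_hours", "class_attendance", "exam_score"]
summary = df[num_cols].agg(["mean", "median", "std", "min", "max"]).T

print("total missing:", int(missing.sum()), "| id sequential:", bool(id_is_sequential))
print(summary.round(2))
```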

## Feature Engineering TODO

### 1) Split + leakage control

- [ ] Create train/validation split **before** fitting any preprocessing steps
- [ ] Use a single preprocessing + model **pipeline** (fit on train only, apply to val/test)
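
One way to satisfy both items is to split first and then wrap all preprocessing in a scikit-learn `Pipeline`, so that scaler and encoder statistics come from the training rows only. The sketch below uses synthetic data; the category levels, split ratio, and `LinearRegression` model are illustrative assumptions, not choices made in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption); "self"/"group"/"online" are made-up levels.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 8, n),
    "class_attendance": rng.uniform(40, 100, n),
    "study_method": rng.choice(["self", "group", "online"], n),
})
y = 5 * X["study_hours"] + 0.3 * X["class_attendance"] + rng.normal(0, 5, n)

# 1) Split BEFORE any preprocessing is fitted.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 2) One pipeline holding preprocessing and model together.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["study_hours", "class_attendance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["study_method"]),
])
model = Pipeline([("preprocess", preprocess), ("reg", LinearRegression())])

# fit() learns scaling/encoding statistics from the training rows only;
# score() re-applies those fitted statistics to the validation rows.
model.fit(X_train, y_train)
val_r2 = model.score(X_val, y_val)
print(f"validation R^2: {val_r2:.3f}")
```

Because the scaler lives inside the pipeline, cross-validation utilities such as `cross_val_score` refit it on each training fold, which keeps the leakage control intact.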

### 2) Basic cleaning

- [ ] Drop `id`
- [ ] Verify numeric ranges (attendance 0–100, score 0–100, sleep hours plausible)
- [ ] Outlier handling (only if needed): clip numeric features to sensible bounds or train-quantiles
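
A minimal sketch of these cleaning steps, assuming pandas frames named `train` and `val` (hypothetical names) and using the 1st and 99th training percentiles as clip bounds. The percentile choice is an assumption; the summary statistics above suggest clipping may not be needed at all.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frames (assumption). The validation study_hours are
# deliberately drawn from a wider range to make the clipping visible.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "id": np.arange(200),
    "study_hours": rng.uniform(0.08, 7.91, 200),
    "class_attendance": rng.uniform(40.6, 99.4, 200),
})
val = pd.DataFrame({
    "id": np.arange(200, 250),
    "study_hours": rng.uniform(0.0, 12.0, 50),
    "class_attendance": rng.uniform(40.6, 99.4, 50),
})

# Drop the identifier column.
train = train.drop(columns="id")
val = val.drop(columns="id")

# Range check: attendance is a percentage, so it must lie in [0, 100].
attendance_ok = bool(train["class_attendance"].between(0, 100).all())

# Clip bounds are computed from the training rows only, then applied to both.
num_cols = ["study_hours", "class_attendance"]
lower = train[num_cols].quantile(0.01)
upper = train[num_cols].quantile(0.99)
train_clipped = train[num_cols].clip(lower=lower, upper=upper, axis=1)
val_clipped = val[num_cols].clip(lower=lower, upper=upper, axis=1)

print("attendance in range:", attendance_ok)
print("max val study_hours before/after:",
      round(val["study_hours"].max(), 2), round(val_clipped["study_hours"].max(), 2))
```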

### 3) Encoding (keep it simple)

- [ ] One-hot encode nominal categoricals: `gender`, `course`, `internet_access`, `study_method`
- [ ] Ordinal encode `sleep_quality`, `facility_rating`, `exam_difficulty` only if their levels are truly ordered; otherwise one-hot encode them as well

### 4) Scaling (for stability)

- [ ] Standardize numeric features: `age`, `study_hours`, `class_attendance`, `sleep_hours`
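
Standardization matters mainly for penalized models, since a Ridge or Lasso penalty treats all coefficients on the same scale. The sketch below, on synthetic values standing in for `age` and `class_attendance`, shows that validation rows are scaled with the training mean and standard deviation rather than their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric columns (assumption).
rng = np.random.default_rng(2)
train_num = np.column_stack([rng.uniform(17, 24, 300), rng.uniform(40, 100, 300)])
val_num = np.column_stack([rng.uniform(17, 24, 100), rng.uniform(40, 100, 100)])

# Fit on train only, then transform both splits with the same statistics.
scaler = StandardScaler().fit(train_num)
train_scaled = scaler.transform(train_num)
val_scaled = scaler.transform(val_num)

# Equivalent manual z-score using the TRAIN mean and (population) std.
manual_val = (val_num - train_num.mean(axis=0)) / train_num.std(axis=0)

print("train mean after scaling:", train_scaled.mean(axis=0).round(6))
print("val mean after scaling:  ", val_scaled.mean(axis=0).round(3))
```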

### 5) Minimal “linear-friendly” feature creation (optional, small set)

- [ ] Add 1–2 interaction terms. Starting points:
  - `study_hours * class_attendance`
  - `study_hours * exam_difficulty` (after encoding)
- [ ] Add 1 curvature term if residuals suggest nonlinearity:
  - `study_hours^2` (or `sleep_hours^2`)
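
These terms are plain column arithmetic. A minimal sketch, assuming `exam_difficulty` has already been ordinal-encoded to integer codes (the codes `0, 1, 2` and the new column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "study_hours": [1.0, 4.0, 7.5],
    "class_attendance": [50.0, 72.6, 95.0],
    "exam_difficulty": [0, 1, 2],  # hypothetical ordinal codes
})

features = df.assign(
    # Interaction terms: the effect of study time may depend on attendance/difficulty.
    study_x_attendance=df["study_hours"] * df["class_attendance"],
    study_x_difficulty=df["study_hours"] * df["exam_difficulty"],
    # Curvature term: note that Python uses ** for powers, not ^.
    study_hours_sq=df["study_hours"] ** 2,
)
print(features)
```

Centering `study_hours` before squaring or multiplying typically reduces the correlation between the new term and the original column, which helps with the multicollinearity check in the next step.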

### 6) Multicollinearity + regularization

- [ ] Check multicollinearity (VIF or condition number) after encoding
- [ ] Prefer **Ridge** as first regularized baseline (stable with many one-hot features)
- [ ] Optionally try **Lasso/ElasticNet** for sparsity
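
The collinearity checks can be done with NumPy alone: the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, and the condition number comes from `np.linalg.cond`. The sketch below uses synthetic features with one deliberately near-duplicate column; `alpha=1.0` is an arbitrary starting value that would normally be tuned (for example with `RidgeCV`).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic features (assumption); `near_copy` is deliberately collinear with `study`.
rng = np.random.default_rng(3)
n = 500
study = rng.normal(size=n)
attendance = rng.normal(size=n)
near_copy = study + rng.normal(scale=0.05, size=n)
X = np.column_stack([study, attendance, near_copy])
y = 3 * study + attendance + rng.normal(size=n)

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal of the inverse correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

# Condition number of the standardized design matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cond = np.linalg.cond(Xs)

# Ridge stays stable despite the near-duplicate column.
ridge = Ridge(alpha=1.0).fit(Xs, y)

print("VIF:", vif.round(1))
print("condition number:", round(float(cond), 1))
print("ridge coefficients:", ridge.coef_.round(2))
```

A common rule of thumb flags VIF values above roughly 5 to 10; here the two collinear columns exceed that by a wide margin while the independent one stays near 1.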

### 7) Diagnostics + iteration loop

- [ ] Check residual plots (nonlinearity, heteroscedasticity)
- [ ] Evaluate metrics (MAE/RMSE/R²) and compare against a naive baseline (predict mean)
- [ ] Iterate: only add interactions/polynomials if diagnostics show systematic error
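
A sketch of the evaluation step, comparing a model against the predict-the-mean baseline on held-out rows. The data is synthetic, and `report` is a hypothetical helper introduced here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic regression data (assumption), split into train and validation rows.
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 3))
y = 60 + 10 * X[:, 0] + 4 * X[:, 1] + rng.normal(scale=5, size=600)
X_train, X_val, y_train, y_val = X[:480], X[480:], y[:480], y[480:]


def report(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of predictions."""
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": float(r2_score(y_true, y_pred)),
    }


# Naive baseline: always predict the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = Ridge(alpha=1.0).fit(X_train, y_train)

baseline_scores = report(y_val, baseline.predict(X_val))
model_scores = report(y_val, model.predict(X_val))

# Residuals for diagnostic plots (e.g. residuals vs. fitted values).
fitted = model.predict(X_val)
residuals = y_val - fitted

print("baseline:", baseline_scores)
print("model:   ", model_scores)
```

Plotting `residuals` against `fitted` is the usual next step: a curved band hints at missing nonlinearity, and a funnel shape hints at heteroscedasticity.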