In [1]:
# import sys
# sys.path.append("/kaggle/input/kaggle-utils")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

%run ../../utils/stat_funcs.py
# %run /kaggle/input/kaggle-utils/stat_funcs.py
print("custom functions are now available in the notebook namespace!")
print("Libraries loaded successfully!")

train_df = pd.read_csv("../../data-raw/train.csv")
test_df = pd.read_csv("../../data-raw/test.csv")

# train_df = pd.read_csv("/kaggle/input/playground-series-s6e1/train.csv")
# test_df = pd.read_csv("/kaggle/input/playground-series-s6e1/test.csv")

custom functions are now available in the notebook namespace!
Libraries loaded successfully!


# Overall EDA Summary

## Dataset overview 

The training data contains **630,000 rows** and **13 columns**, with the target variable **`exam_score`** (continuous, 0-100 scale). The **`id`** column is a sequential unique identifier (0 to 629,999) that carries no predictive information, so it is dropped during feature engineering.

- **Data quality:**
  - **No missing values** across any features or the target.
  - Data types are clean and appropriate after casting:
    - **Numeric:** `age`, `study_hours`, `class_attendance`, `sleep_hours`, `exam_score`
    - **Categorical:** `gender`, `course`, `internet_access`, `sleep_quality`, `study_method`, `facility_rating`, `exam_difficulty`
- **Typical student profile (central tendency):**
  - **Age:** ~**20.55** years (median **21**, range **17-24**)
  - **Study hours:** mean ~**4.00** (median **4.00**, range **0.08-7.91**)
  - **Class attendance (%):** mean ~**71.99** (median **72.6**, range **40.6-99.4**)
  - **Sleep hours:** mean ~**7.07** (median **7.1**, range **4.1-9.9**)
  - **Exam score:** mean ~**62.51** (median **62.6**, range **19.6-100**)
- **Spread / variability:**
  - `exam_score` has **substantial variability** (std ~**18.92**), suggesting meaningful separation between low- and high-performing students.
  - `class_attendance` also varies widely (std ~**17.43**), while `age` is comparatively tight (std ~**2.26**).
- **Notable extremes & sanity checks:**
  - Very low study times exist (down to **0.08 hours**), and attendance drops as low as **~40%**. Because there are no missing values and all ranges look plausible, these most likely reflect genuinely low-engagement students rather than data errors.
  - Scores span almost the entire possible scale (**~20 to 100**), indicating no obvious clipping problems.
- **Model-readiness implications:**
  - This is a **mixed-type regression problem** with several categorical predictors that will require **encoding** (one-hot or target/ordinal encoding depending on the feature).
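
The checks behind this overview can be reproduced with a few pandas calls. The sketch below is a minimal illustration on a toy frame: `toy_df` and its values are made up, and `quick_overview` is a hypothetical helper, not something from `stat_funcs.py`.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [21, 18, 20, 19],
    "study_hours": [7.91, 4.95, 4.68, 2.00],
    "gender": ["female", "other", "female", "male"],
    "exam_score": [70.0, 55.0, 60.0, 45.0],
})


def quick_overview(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Collect the basic data-quality facts reported in an EDA overview."""
    return {
        "shape": df.shape,
        # Total number of missing cells across all columns.
        "n_missing": int(df.isna().sum().sum()),
        # An identifier column should have no duplicates.
        "id_is_unique": bool(df[id_col].is_unique),
        "numeric_cols": df.select_dtypes(include="number").columns.tolist(),
        "categorical_cols": df.select_dtypes(exclude="number").columns.tolist(),
    }


overview = quick_overview(toy_df)
print(overview)
```

Calling `quick_overview(train_df)` on the real data should give the row count, missing-value total, and the numeric/categorical split listed above (with `id` showing up among the numeric columns until it is dropped).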

## Feature Engineering

### 1) Split + leakage control

- [x] Create train/validation split before fitting any preprocessing steps
- [x] Drop `id` column

In [2]:
from sklearn.model_selection import train_test_split

TARGET = "exam_score"

In [3]:
X = train_df.drop(columns=[TARGET, "id"])
y = train_df[TARGET]

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

### 2) Basic cleaning

- [x] Verify numeric ranges (attendance 0–100, score 0–100, sleep hours plausible)
- [x] Outlier handling (only if needed): clip numeric features to sensible bounds or train-quantiles
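
One possible way to implement both checklist items is sketched below. The bounds in `BOUNDS` are assumptions about what counts as plausible, the helper names are hypothetical, and the toy frames are illustrative only. The clipping limits are learned from the training rows and then applied to other data, in line with the leakage rule in step 1.

```python
import pandas as pd

# Plausible bounds per feature. These limits are assumptions, not taken
# from any data dictionary.
BOUNDS = {
    "class_attendance": (0.0, 100.0),
    "sleep_hours": (0.0, 24.0),
    "study_hours": (0.0, 24.0),
}


def out_of_range_counts(df, bounds):
    """Count rows falling outside [low, high] for each bounded column."""
    return {
        col: int((~df[col].between(lo, hi)).sum())
        for col, (lo, hi) in bounds.items()
    }


def fit_quantile_clipper(train, cols, lower=0.01, upper=0.99):
    """Learn clipping limits from the training rows only."""
    return {
        col: (train[col].quantile(lower), train[col].quantile(upper))
        for col in cols
    }


def apply_clipper(df, limits):
    """Return a copy of df with each column clipped to its learned limits."""
    out = df.copy()
    for col, (lo, hi) in limits.items():
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


# Toy frames for illustration; the 120.0 attendance value is deliberately invalid.
toy_train = pd.DataFrame({
    "class_attendance": [40.6, 57.0, 72.6, 87.2, 99.4],
    "sleep_hours": [4.1, 5.6, 7.1, 8.6, 9.9],
    "study_hours": [0.08, 1.97, 4.0, 6.05, 7.91],
})
toy_val = pd.DataFrame({
    "class_attendance": [65.0, 120.0],
    "sleep_hours": [7.0, 8.0],
    "study_hours": [3.0, 5.0],
})

bad_counts = out_of_range_counts(toy_val, BOUNDS)
limits = fit_quantile_clipper(toy_train, list(BOUNDS))
toy_val_clipped = apply_clipper(toy_val, limits)
print(bad_counts)
```

If every count is zero on the real data, as the `describe()` output in this section suggests, the clipping step can be skipped.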

In [5]:
X.describe()

| | age | study_hours | class_attendance | sleep_hours |
|---|---|---|---|---|
| count | 630000.0 | 630000.0 | 630000.0 | 630000.0 |
| mean | 20.545821 | 4.002337 | 71.987261 | 7.072758 |
| std | 2.260238 | 2.35988 | 17.430098 | 1.744811 |
| min | 17.0 | 0.08 | 40.6 | 4.1 |
| 25% | 19.0 | 1.97 | 57.0 | 5.6 |
| 50% | 21.0 | 4.0 | 72.6 | 7.1 |
| 75% | 23.0 | 6.05 | 87.2 | 8.6 |
| max | 24.0 | 7.91 | 99.4 | 9.9 |


### 3) Encoding (keep it simple)

- [x] One-hot encode nominal categoricals: `gender`, `course`, `internet_access`, `study_method`
- [x] Ordinal encode only if truly ordered: `sleep_quality`, `facility_rating`, `exam_difficulty` (the cells that follow one-hot encode these three as well, which keeps the baseline simple)
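
If the three ordered features are to be treated as ranks instead, a minimal sketch with `pd.Categorical` could look like the following. The level names come from the data shown in this notebook, but the orderings in `ORDINAL_LEVELS` and the helper `ordinal_encode` are assumptions, and integer codes imply evenly spaced levels, which a linear model will take literally.

```python
import pandas as pd

# Assumed orderings, lowest to highest. Treating the levels as evenly
# spaced integer ranks is a modelling assumption.
ORDINAL_LEVELS = {
    "sleep_quality": ["poor", "average", "good"],
    "facility_rating": ["low", "medium", "high"],
    "exam_difficulty": ["easy", "moderate", "hard"],
}


def ordinal_encode(df, levels):
    """Replace each ordered categorical column with integer codes 0..k-1."""
    out = df.copy()
    for col, order in levels.items():
        cat = pd.Categorical(out[col], categories=order, ordered=True)
        # Labels missing from `order` would be encoded as -1.
        out[col] = cat.codes
    return out


# The first five rows of the three ordered features, as shown by X.head().
toy = pd.DataFrame({
    "sleep_quality": ["average", "poor", "poor", "average", "good"],
    "facility_rating": ["low", "medium", "high", "high", "high"],
    "exam_difficulty": ["easy", "moderate", "moderate", "moderate", "easy"],
})

encoded = ordinal_encode(toy, ORDINAL_LEVELS)
print(encoded)
```

One-hot encoding makes no spacing assumption at the cost of more columns, so comparing both encodings on the validation set is a reasonable way to choose.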

In [18]:
X.head()

| | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | female | b.sc | 7.91 | 98.8 | no | 4.9 | average | online videos | low | easy |
| 1 | 18 | other | diploma | 4.95 | 94.8 | yes | 4.7 | poor | self-study | medium | moderate |
| 2 | 20 | female | b.sc | 4.68 | 92.6 | yes | 5.8 | poor | coaching | high | moderate |
| 3 | 19 | male | b.sc | 2.0 | 49.5 | yes | 8.3 | average | group study | high | moderate |
| 4 | 23 | male | bca | 7.65 | 86.9 | yes | 9.6 | good | self-study | high | easy |


In [29]:
encode_cols = ["gender", "course", "internet_access", "study_method", "sleep_quality", "facility_rating", "exam_difficulty"]

In [30]:
X_enc = pd.get_dummies(X, columns=encode_cols, dtype=int)

In [31]:
X_enc.head()

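| | age | study_hours | class_attendance | sleep_hours | gender_female | gender_male | gender_other | course_b.com | course_b.sc | course_b.tech | ... | study_method_self-study | sleep_quality_average | sleep_quality_good | sleep_quality_poor | facility_rating_high | facility_rating_low | facility_rating_medium | exam_difficulty_easy | exam_difficulty_hard | exam_difficulty_moderate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | 7.91 | 98.8 | 4.9 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 18 | 4.95 | 94.8 | 4.7 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 20 | 4.68 | 92.6 | 5.8 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 19 | 2.0 | 49.5 | 8.3 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 23 | 7.65 | 86.9 | 9.6 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |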


In [33]:
X_enc.columns

Index(['age', 'study_hours', 'class_attendance', 'sleep_hours',
       'gender_female', 'gender_male', 'gender_other', 'course_b.com',
       'course_b.sc', 'course_b.tech', 'course_ba', 'course_bba', 'course_bca',
       'course_diploma', 'internet_access_no', 'internet_access_yes',
       'study_method_coaching', 'study_method_group study',
       'study_method_mixed', 'study_method_online videos',
       'study_method_self-study', 'sleep_quality_average',
       'sleep_quality_good', 'sleep_quality_poor', 'facility_rating_high',
       'facility_rating_low', 'facility_rating_medium', 'exam_difficulty_easy',
       'exam_difficulty_hard', 'exam_difficulty_moderate'],
      dtype='object')

In [35]:
X_enc.dtypes

age                             int64
study_hours                   float64
class_attendance              float64
sleep_hours                   float64
gender_female                   int64
gender_male                     int64
gender_other                    int64
course_b.com                    int64
course_b.sc                     int64
course_b.tech                   int64
course_ba                       int64
course_bba                      int64
course_bca                      int64
course_diploma                  int64
internet_access_no              int64
internet_access_yes             int64
study_method_coaching           int64
study_method_group study        int64
study_method_mixed              int64
study_method_online videos      int64
study_method_self-study         int64
sleep_quality_average           int64
sleep_quality_good              int64
sleep_quality_poor              int64
facility_rating_high            int64
facility_rating_low             int64
facility_rat

In [37]:
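# internet_access is binary, so one dummy carries all the information;
# drop the redundant complement column.
X_enc = X_enc.drop(columns="internet_access_no")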

In [38]:
X_enc.dtypes

age                             int64
study_hours                   float64
class_attendance              float64
sleep_hours                   float64
gender_female                   int64
gender_male                     int64
gender_other                    int64
course_b.com                    int64
course_b.sc                     int64
course_b.tech                   int64
course_ba                       int64
course_bba                      int64
course_bca                      int64
course_diploma                  int64
internet_access_yes             int64
study_method_coaching           int64
study_method_group study        int64
study_method_mixed              int64
study_method_online videos      int64
study_method_self-study         int64
sleep_quality_average           int64
sleep_quality_good              int64
sleep_quality_poor              int64
facility_rating_high            int64
facility_rating_low             int64
facility_rating_medium          int64
exam_difficu

### 5) Scaling (for stability)

- [x] Standardize numeric features: `age`, `study_hours`, `class_attendance`, `sleep_hours`

In [39]:
from sklearn.preprocessing import StandardScaler

In [40]:
scaler = StandardScaler()

In [41]:
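num_cols = ["age", "study_hours", "class_attendance", "sleep_hours"]
# NOTE: this fits the scaler on every row of X_enc, so validation rows
# influence the learned mean/std. To honor the split from step 1, fit on
# the training rows only and reuse that scaler for validation and test data.
X_enc[num_cols] = scaler.fit_transform(X_enc[num_cols])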

In [42]:
X_enc.head()

| | age | study_hours | class_attendance | sleep_hours | gender_female | gender_male | gender_other | course_b.com | course_b.sc | course_b.tech | ... | study_method_self-study | sleep_quality_average | sleep_quality_good | sleep_quality_poor | facility_rating_high | facility_rating_low | facility_rating_medium | exam_difficulty_easy | exam_difficulty_hard | exam_difficulty_moderate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.200943 | 1.655875 | 1.538302 | -1.245269 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | -1.126352 | 0.401573 | 1.308814 | -1.359895 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | -0.241488 | 0.28716 | 1.182595 | -0.729454 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | -0.68392 | -0.848492 | -1.290141 | 0.703367 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1.085807 | 1.545699 | 0.855575 | 1.448434 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
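
To keep scaling consistent with the "split before fitting" rule from step 1, the scaler can be fit on the training rows alone and then applied to the validation rows. The sketch below shows that pattern on randomly generated toy data; `toy_X`, `toy_train`, `toy_val`, and `toy_scaler` are illustrative names, and the value ranges only mimic those reported by `X.describe()`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

NUM_COLS = ["age", "study_hours", "class_attendance", "sleep_hours"]

# Random toy data for illustration only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "age": rng.integers(17, 25, size=200).astype(float),
    "study_hours": rng.uniform(0.08, 7.91, size=200),
    "class_attendance": rng.uniform(40.6, 99.4, size=200),
    "sleep_hours": rng.uniform(4.1, 9.9, size=200),
})

toy_train, toy_val = train_test_split(toy_X, test_size=0.2, random_state=42)
toy_train, toy_val = toy_train.copy(), toy_val.copy()

# Learn mean and standard deviation from the training rows only.
toy_scaler = StandardScaler().fit(toy_train[NUM_COLS])

# Apply the same training statistics to both partitions.
toy_train[NUM_COLS] = toy_scaler.transform(toy_train[NUM_COLS])
toy_val[NUM_COLS] = toy_scaler.transform(toy_val[NUM_COLS])

print(toy_train[NUM_COLS].mean().round(6))
print(toy_val[NUM_COLS].mean().round(3))
```

The training columns end up with mean 0 and unit variance by construction, while the validation columns are only approximately centered. That small mismatch is expected and is what an honest validation estimate should see.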


### 6) Minimal “linear-friendly” feature creation (optional, small set)

- [ ] Add 1–2 interaction terms. Starting points:
  - `study_hours * class_attendance`
  - `study_hours * exam_difficulty` (after encoding)
- [ ] Add 1 curvature term if residuals suggest nonlinearity:
  - `study_hours^2` (or `sleep_hours^2`)
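
The candidate terms listed above could be built roughly as follows. This is a sketch under assumptions: `add_linear_friendly_features` and the new column names are hypothetical, and the difficulty interaction assumes `exam_difficulty` was one-hot encoded into columns prefixed with `exam_difficulty_`, as in the encoding step.

```python
import pandas as pd


def add_linear_friendly_features(df):
    """Add a small set of interaction and curvature terms to a copy of df."""
    out = df.copy()
    # The column list is built before any new columns are added.
    difficulty_cols = [c for c in out.columns if c.startswith("exam_difficulty_")]

    # Interaction: effort in hours combined with attendance.
    out["study_x_attendance"] = out["study_hours"] * out["class_attendance"]
    # Curvature: lets a linear model capture diminishing returns.
    out["study_hours_sq"] = out["study_hours"] ** 2
    # Interaction with each difficulty dummy gives a separate slope per level.
    for col in difficulty_cols:
        out[f"study_x_{col}"] = out["study_hours"] * out[col]
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "study_hours": [2.0, 4.0],
    "class_attendance": [50.0, 90.0],
    "exam_difficulty_easy": [1, 0],
    "exam_difficulty_hard": [0, 1],
})

toy_fe = add_linear_friendly_features(toy)
print(toy_fe)
```

If these terms are built from standardized columns, the products mix signs around zero, so it is usually easier to interpret them when they are created from the raw values and scaled afterwards.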

### 7) Multicollinearity + regularization

- [ ] Check multicollinearity (VIF or condition number) after encoding
- [ ] Prefer **Ridge** as first regularized baseline (stable with many one-hot features)
- [ ] Optionally try **Lasso/ElasticNet** for sparsity
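
A minimal sketch of the multicollinearity check and a Ridge baseline is shown below. It computes VIFs with NumPy as the diagonal of the inverse correlation matrix, which is a standard identity. The toy data, the helper `vif_table`, and `alpha=1.0` are illustrative assumptions, not tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def vif_table(df):
    """VIF of column j is the j-th diagonal entry of the inverse correlation matrix."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")


# Toy data: "a_noisy_copy" is almost a duplicate of "a", so both should
# show a high VIF while "b" stays near 1.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({"a": a, "b": b, "a_noisy_copy": a + 0.1 * rng.normal(size=n)})
toy_y = 2.0 * a - 1.0 * b + rng.normal(scale=0.5, size=n)

vifs = vif_table(toy)
cond_number = np.linalg.cond(toy.to_numpy())

# Ridge shrinks coefficients, which stabilizes the fit under collinearity.
ridge = Ridge(alpha=1.0).fit(toy, toy_y)

print(vifs.round(1))
print(round(cond_number, 1))
print(ridge.coef_.round(3))
```

Note that a full set of one-hot columns for a feature sums to 1 in every row, which makes the correlation matrix singular. Dropping one level per encoded feature before calling `vif_table` avoids that.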

### 8) Diagnostics + iteration loop

- [ ] Check residual plots (nonlinearity, heteroscedasticity)
- [ ] Evaluate metrics (MAE/RMSE/R²) and compare against a naive baseline (predict mean)
- [ ] Iterate: only add interactions/polynomials if diagnostics show systematic error
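
The metric comparison against a predict-the-mean baseline can be sketched with plain NumPy. The arrays below are toy values for illustration, and `regression_report` is a hypothetical helper. The baseline uses the training mean, since that is the only quantity a naive model is allowed to learn.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R² for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(resid))),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Toy values for illustration only.
toy_y_train = np.array([50.0, 60.0, 70.0])
toy_y_val = np.array([40.0, 60.0, 80.0])
toy_pred = np.array([45.0, 60.0, 75.0])

# Naive baseline: always predict the training mean.
baseline_pred = np.full_like(toy_y_val, toy_y_train.mean())

baseline = regression_report(toy_y_val, baseline_pred)
model = regression_report(toy_y_val, toy_pred)
print("baseline:", baseline)
print("model:   ", model)
```

Plotting `y_val - y_pred` against the predictions, and against each numeric feature, then covers the residual checks in the first item.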