# 🌟 Zero to Hero: Mastering Student Score Prediction
We will adopt an iterative development philosophy: first, securing a reliable baseline, and then progressively optimizing our architecture. This approach ensures that every layer of complexity added serves a measurable purpose in driving performance.

<img src="https://raw.githubusercontent.com/cviejom/kaggle_images/refs/heads/main/Gemini_Generated_Image_ooik58ooik58ooik.png" width="800px" height="400px">

### Index
1. Level 1 Basic Model, Linear Regression
    * Loading Datasets
2. Level 2 Random Forest Model
3. Level 3 Extreme Gradient Boosted Trees
    * Baseline Model
    * Advanced Feature Engineering, Mean Encoding and KNN Features (Work in Progress)
4. Level 4 Neural Network (Work in Progress)
5. Level 5 Meta Model (Work in Progress)

---

In [1]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/exam-score-prediction-dataset/Exam_Score_Prediction.csv
/kaggle/input/playground-series-s6e1/sample_submission.csv
/kaggle/input/playground-series-s6e1/train.csv
/kaggle/input/playground-series-s6e1/test.csv
CPU times: user 245 ms, sys: 84.8 ms, total: 330 ms
Wall time: 1.41 s


In [2]:
import random  # Python's built-in module for generating pseudo-random numbers and selecting random elements
import warnings  # Provides a way to control the display of warning messages (e.g., filter out deprecation warnings)
def configure_notebook(seed=548, float_precision=3, max_columns=15, max_rows=25):
    """
    Configure notebook settings:
      - Disables warnings for cleaner output.
      - Sets pandas display options for better table formatting.
      - Returns a seed value for reproducibility.
    
    Parameters:
      seed (int): Random seed (default 548).
      float_precision (int): Number of decimal places for floats (default 3).
      max_columns (int): Maximum number of columns to display (default 15).
      max_rows (int): Maximum number of rows to display (default 25).

    Returns:
      int: The provided seed.
    """
    # Disable all warnings
    warnings.filterwarnings("ignore")
    
    # Set pandas display options for nicer output
    pd.options.display.float_format = f"{{:,.{float_precision}f}}".format
    pd.set_option("display.max_columns", max_columns)
    pd.set_option("display.max_rows", max_rows)

    # Set seeds for reproducibility in numpy and the standard random module
    np.random.seed(seed)
    random.seed(seed)
    
    return seed

# Apply configuration and set random seeds for reproducibility
seed = configure_notebook()

# 🌟 Level 1: Basic Model, Linear Regressor

## 1.1 Loading Datasets Using Pandas
In this step we load the competition files and the original dataset into pandas DataFrames.

In [3]:
%%time
# Load the information into pandas DataFrames
trn_df = pd.read_csv('/kaggle/input/playground-series-s6e1/train.csv')
tst_df = pd.read_csv('/kaggle/input/playground-series-s6e1/test.csv')
sub_df = pd.read_csv('/kaggle/input/playground-series-s6e1/sample_submission.csv')
org_df = pd.read_csv('/kaggle/input/exam-score-prediction-dataset/Exam_Score_Prediction.csv')


CPU times: user 986 ms, sys: 228 ms, total: 1.21 s
Wall time: 1.43 s


## 1.2 Exploring the Dataset
In this step we explore the loaded DataFrames to assess what is needed for model development.

In [4]:
%%time
# Reviewing the fields available.
trn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                630000 non-null  int64  
 1   age               630000 non-null  int64  
 2   gender            630000 non-null  object 
 3   course            630000 non-null  object 
 4   study_hours       630000 non-null  float64
 5   class_attendance  630000 non-null  float64
 6   internet_access   630000 non-null  object 
 7   sleep_hours       630000 non-null  float64
 8   sleep_quality     630000 non-null  object 
 9   study_method      630000 non-null  object 
 10  facility_rating   630000 non-null  object 
 11  exam_difficulty   630000 non-null  object 
 12  exam_score        630000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 62.5+ MB
CPU times: user 197 ms, sys: 3.62 ms, total: 201 ms
Wall time: 218 ms


In [5]:
org_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   student_id        20000 non-null  int64  
 1   age               20000 non-null  int64  
 2   gender            20000 non-null  object 
 3   course            20000 non-null  object 
 4   study_hours       20000 non-null  float64
 5   class_attendance  20000 non-null  float64
 6   internet_access   20000 non-null  object 
 7   sleep_hours       20000 non-null  float64
 8   sleep_quality     20000 non-null  object 
 9   study_method      20000 non-null  object 
 10  facility_rating   20000 non-null  object 
 11  exam_difficulty   20000 non-null  object 
 12  exam_score        20000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 2.0+ MB


In [6]:
%%time
# Checking a few rows of information to get a sense of the data provided.
trn_df.head()

CPU times: user 63 µs, sys: 11 µs, total: 74 µs
Wall time: 77.5 µs


|   | id | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 21 | female | b.sc | 7.91 | 98.8 | no | 4.9 | average | online videos | low | easy | 78.3 |
| 1 | 1 | 18 | other | diploma | 4.95 | 94.8 | yes | 4.7 | poor | self-study | medium | moderate | 46.7 |
| 2 | 2 | 20 | female | b.sc | 4.68 | 92.6 | yes | 5.8 | poor | coaching | high | moderate | 99.0 |
| 3 | 3 | 19 | male | b.sc | 2.0 | 49.5 | yes | 8.3 | average | group study | high | moderate | 63.9 |
| 4 | 4 | 23 | male | bca | 7.65 | 86.9 | yes | 9.6 | good | self-study | high | easy | 100.0 |


In [7]:
%%time
# Checking some statistical information on Numerical fields. Trying to pinpoint some outliers.
trn_df.describe()

CPU times: user 108 ms, sys: 1.18 ms, total: 109 ms
Wall time: 114 ms


|   | id | age | study_hours | class_attendance | sleep_hours | exam_score |
|---|---|---|---|---|---|---|
| count | 630000.0 | 630000.0 | 630000.0 | 630000.0 | 630000.0 | 630000.0 |
| mean | 314999.5 | 20.546 | 4.002 | 71.987 | 7.073 | 62.507 |
| std | 181865.479 | 2.26 | 2.36 | 17.43 | 1.745 | 18.917 |
| min | 0.0 | 17.0 | 0.08 | 40.6 | 4.1 | 19.599 |
| 25% | 157499.75 | 19.0 | 1.97 | 57.0 | 5.6 | 48.8 |
| 50% | 314999.5 | 21.0 | 4.0 | 72.6 | 7.1 | 62.6 |
| 75% | 472499.25 | 23.0 | 6.05 | 87.2 | 8.6 | 76.3 |
| max | 629999.0 | 24.0 | 7.91 | 99.4 | 9.9 | 100.0 |


In [8]:
%%time
# Checking for any missing information.
trn_df.isnull().sum()

CPU times: user 193 ms, sys: 118 µs, total: 193 ms
Wall time: 192 ms


id                  0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
exam_score          0
dtype: int64

In [9]:
%%time
# Removing unnecessary fields
fields_to_remove = ['id']
trn_df = trn_df.drop(columns = fields_to_remove)
tst_df = tst_df.drop(columns = fields_to_remove)
org_df = org_df.drop(columns = 'student_id')

trn_df = pd.concat([trn_df, org_df[trn_df.columns]], axis=0, ignore_index=True)

CPU times: user 98.7 ms, sys: 37.1 ms, total: 136 ms
Wall time: 139 ms


## 1.3 Feature Engineering
In this step we build a few simple features to show how new variables can give the model more signal.

In [10]:
%%time
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

def create_poly_features(df, variable_list, degree=2):
    # Initialize the transformer
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    
    # Transform specific columns
    poly_data = poly.fit_transform(df[variable_list])
    
    # Get proper feature names (e.g., 'age^2', 'income^2', 'age income')
    feature_names = poly.get_feature_names_out(variable_list)
    
    # Create DF and concat
    poly_df = pd.DataFrame(poly_data, columns=feature_names, index=df.index)
    
    # Drop original columns from poly_df to avoid duplicates if needed
    # (PolynomialFeatures includes the degree 1 terms by default)
    poly_df = poly_df.drop(columns=variable_list)
    
    return pd.concat([df, poly_df], axis=1)

CPU times: user 912 ms, sys: 215 ms, total: 1.13 s
Wall time: 2.23 s


In [11]:
%%time
variable_list = ['age', 'study_hours', 'class_attendance', 'sleep_hours']
trn_df = create_poly_features(trn_df, variable_list, 2)
tst_df = create_poly_features(tst_df, variable_list, 2)

#trn_df = create_poly_features(trn_df, variable_list, 3)
#tst_df = create_poly_features(tst_df, variable_list, 3)

#trn_df = create_poly_features(trn_df, variable_list, 4)
#tst_df = create_poly_features(tst_df, variable_list, 4)

# Remove duplicate columns, keeping the first occurrence
trn_df = trn_df.loc[:, ~trn_df.columns.duplicated()]
tst_df = tst_df.loc[:, ~tst_df.columns.duplicated()]

CPU times: user 398 ms, sys: 147 ms, total: 546 ms
Wall time: 545 ms


In [12]:
%%time
# Utilizing the Formula
# https://www.kaggle.com/competitions/playground-series-s6e1/discussion/665915

def formula_feat(df):
    LUT = {
        'sleep_quality': {'good': 5, 'average': 0, 'poor': -5},
        'facility_rating': {'high': 4, 'medium': 0, 'low': -4},
        'study_method': {
            'coaching': 10,
            'mixed': 5,
            'group study': 2,
            'online videos': 1,
            'self-study': 0
                        }
        }
    
    df['formula'] = (
        6.0 * df['study_hours']
        + 0.35 * df['class_attendance']
        + 1.5 * df['sleep_hours']
        + df['sleep_quality'].map(LUT['sleep_quality'])
        + df['study_method'].map(LUT['study_method'])
        + df['facility_rating'].map(LUT['facility_rating'])
    )
    
    return df

trn_df = formula_feat(trn_df)
tst_df = formula_feat(tst_df)

CPU times: user 149 ms, sys: 0 ns, total: 149 ms
Wall time: 151 ms
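
As a quick sanity check, the formula can be worked by hand for row 0 of `trn_df.head()` shown earlier. The sketch below copies that row's values and the lookup weights from `formula_feat`; the variable names are illustrative only.

```python
# Values copied from row 0 of trn_df.head(): study_hours=7.91,
# class_attendance=98.8, sleep_hours=4.9, sleep_quality='average',
# study_method='online videos', facility_rating='low'.
numeric_part = 6.0 * 7.91 + 0.35 * 98.8 + 1.5 * 4.9   # 47.46 + 34.58 + 7.35

# Lookup weights from the LUT in formula_feat.
sleep_quality_w = 0      # 'average'
study_method_w = 1       # 'online videos'
facility_rating_w = -4   # 'low'

formula_row0 = numeric_part + sleep_quality_w + study_method_w + facility_rating_w
print(round(formula_row0, 2))  # 86.39
```

The recorded `exam_score` for that row is 78.3, so the formula is used here as one input feature among many rather than as a prediction in its own right.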


## 1.4 Data Pre-Processing
In this step we remove unnecessary fields and outliers, and transform the categorical fields into numerical ones.

In [13]:
%%time
# One-hot encoding categorical variables
def encode_categorical_features(train_df, test_df, cat_cols):
    """
    One-Hot Encodes categorical features for both Train and Test sets.
    
    Args:
        train_df: Training DataFrame
        test_df: Test DataFrame
        cat_cols: List of column names to encode
        
    Returns:
        train_encoded, test_encoded
    """
    
    # 1. Mark the split point so we can separate them later
    train_len = len(train_df)
    
    # 2. Concatenate temporarily to ensure column alignment
    # (This ensures 'Gender_Male' exists in both, even if one set has no males)
    combined = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    
    # 3. Apply One-Hot Encoding
    # drop_first=True avoids the "Dummy Variable Trap" (multicollinearity),
    # which matters for linear models such as the Linear Regression used in this level.
    combined_encoded = pd.get_dummies(combined, columns=cat_cols, drop_first=True)
    
    # 4. Split back into Train and Test
    train_encoded = combined_encoded.iloc[:train_len].copy()
    test_encoded = combined_encoded.iloc[train_len:].reset_index(drop=True).copy()
    test_encoded = test_encoded.drop(columns = ['exam_score'])
    
    return train_encoded, test_encoded

# --- Example Usage ---
# cat_cols = ['Gender', 'Location', 'Smoking_Status']
# train_enc, test_enc = encode_categorical_features(train, test, cat_cols)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.63 µs


In [14]:
%%time
cat_cols = ['gender', 'course', 'internet_access', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty']
trn_enc, tst_enc = encode_categorical_features(trn_df, tst_df, cat_cols)

CPU times: user 689 ms, sys: 174 ms, total: 863 ms
Wall time: 865 ms
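
The reason for concatenating train and test before calling `pd.get_dummies` is easier to see on a toy example. The sketch below uses made-up frames in which the test set contains a category that the training set lacks, and compares encoding each frame separately with encoding them together.

```python
import pandas as pd

# Hypothetical toy data: 'other' appears only in test, 'male' only in train.
train_toy = pd.DataFrame({"gender": ["female", "male"], "exam_score": [70.0, 60.0]})
test_toy = pd.DataFrame({"gender": ["other", "female"]})

# Encoding separately: each frame only gets columns for the categories it contains.
sep_train = pd.get_dummies(train_toy, columns=["gender"], drop_first=True)
sep_test = pd.get_dummies(test_toy, columns=["gender"], drop_first=True)

# Encoding jointly: both halves share one set of dummy columns.
combined = pd.concat([train_toy, test_toy], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, columns=["gender"], drop_first=True)
joint_train = encoded.iloc[: len(train_toy)]
joint_test = encoded.iloc[len(train_toy):].drop(columns=["exam_score"])

print(list(sep_train.columns), list(sep_test.columns))
print(list(joint_train.columns), list(joint_test.columns))
```

Encoded separately, the two frames end up with different columns, so a model fitted on one could not score the other. The joint encoding gives both the same dummy columns.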


In [15]:
%%time
# Feature Standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize_dataframe(df, exclude_cols=None):
    """
    Standardizes the numerical features of a dataframe (Z-score normalization).
    
    Args:
        df (pd.DataFrame): The input dataframe.
        exclude_cols (list or str): Column(s) to exclude from scaling 
                                    (e.g., target variable, IDs).
    
    Returns:
        pd.DataFrame: A new dataframe with standardized features.
        scaler: The fitted StandardScaler object (useful for inverse_transform later).
    """
    # 1. Create a copy to avoid modifying the original dataframe in place
    df_scaled = df.copy()
    
    # 2. Handle exclude_cols (ensure it's a list)
    if exclude_cols:
        if isinstance(exclude_cols, str):
            exclude_cols = [exclude_cols]
    else:
        exclude_cols = []
        
    # 3. Select only numerical columns to scale
    # We filter for 'number' types and remove any excluded columns
    numeric_cols = df_scaled.select_dtypes(include=['number']).columns
    cols_to_scale = [c for c in numeric_cols if c not in exclude_cols]
    
    # 4. Initialize and fit the scaler
    scaler = StandardScaler()
    
    # 5. Apply the transformation
    if cols_to_scale:
        df_scaled[cols_to_scale] = scaler.fit_transform(df_scaled[cols_to_scale])
        print(f"Standardized {len(cols_to_scale)} columns.")
    else:
        print("No numerical columns found to standardize.")
        
    return df_scaled, scaler

# --- Example Usage ---
# df_normalized, scaler = standardize_dataframe(df_train, exclude_cols=['target', 'id'])

CPU times: user 11 µs, sys: 0 ns, total: 11 µs
Wall time: 14.3 µs


In [16]:
%%time
trn_std, _ = standardize_dataframe(trn_enc, exclude_cols=['exam_score'])
tst_std, _ = standardize_dataframe(tst_enc, exclude_cols=['exam_score'])

Standardized 15 columns.
Standardized 15 columns.
CPU times: user 275 ms, sys: 118 ms, total: 393 ms
Wall time: 392 ms
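
Note that the cell above fits a separate scaler on each frame, so train and test are centred and scaled with slightly different statistics. A common alternative is to fit the scaler on the training data only and reuse it for the test data. The helper below is a hypothetical sketch of that variant, not part of the notebook, and adopting it would change the metrics reported later.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize_train_test(train_df, test_df, exclude_cols=()):
    """Fit a StandardScaler on train only and apply it to both frames (illustrative helper)."""
    numeric_cols = train_df.select_dtypes(include=["number"]).columns
    cols = [c for c in numeric_cols if c not in exclude_cols and c in test_df.columns]

    scaler = StandardScaler().fit(train_df[cols])

    train_out, test_out = train_df.copy(), test_df.copy()
    train_out[cols] = scaler.transform(train_df[cols])
    test_out[cols] = scaler.transform(test_df[cols])  # reuses the train mean and std
    return train_out, test_out, scaler

# Toy usage with made-up values.
toy_train = pd.DataFrame({"age": [18.0, 20.0, 22.0], "exam_score": [50.0, 60.0, 70.0]})
toy_test = pd.DataFrame({"age": [20.0, 24.0]})
toy_train_std, toy_test_std, toy_scaler = standardize_train_test(
    toy_train, toy_test, exclude_cols=["exam_score"]
)
print(toy_test_std)
```

In the toy example the test value 20 maps to 0 because it equals the training mean, regardless of what else is in the test frame.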


In [17]:
trn_std.head().T

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| age | 0.202 | -1.125 | -0.240 | -0.683 | 1.086 |
| study_hours | 1.657 | 0.402 | 0.287 | -0.849 | 1.547 |
| class_attendance | 1.542 | 1.312 | 1.186 | -1.287 | 0.859 |
| sleep_hours | -1.244 | -1.359 | -0.728 | 0.705 | 1.450 |
| exam_score | 78.300 | 46.700 | 99.000 | 63.900 | 100.000 |
| ... | ... | ... | ... | ... | ... |
| study_method_self-study | False | True | False | False | True |
| facility_rating_low | True | False | False | False | False |
| facility_rating_medium | False | True | False | False | False |
| exam_difficulty_hard | False | False | False | False | False |


## 1.5 Model Development, Linear Regressor

In [18]:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def train_and_predict_linear(train_df, test_df, target_col, test_size=0.2, random_state=42):
    """
    Trains a linear regressor, validates it, and prints a performance report.
    """
    # 1. Prepare Data
    X = train_df.drop(columns=[target_col])
    y = train_df[target_col]
    X_test_inference = test_df[X.columns]
    
    # 2. Split Data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # 3. Train Model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # 4. Validate and Calculate Metrics
    val_predictions = model.predict(X_val)
    val_mse = mean_squared_error(y_val, val_predictions)
    val_rmse = np.sqrt(val_mse)  # Calculate RMSE
    val_r2 = r2_score(y_val, val_predictions)

    # --- REPORTING BLOCK ---
    print("--- Validation Metrics ---")
    print(f"RMSE: {val_rmse:.4f}")
    print(f"R^2 Score: {val_r2:.4f}")
    print("") # Empty line for spacing

    # 5. Inference
    test_predictions = model.predict(X_test_inference)

    print("--- Inference Complete ---")
    print(f"Generated {len(test_predictions)} predictions.")
    
    # Return objects for further use if needed
    return {
        "test_predictions": test_predictions,
        "rmse": val_rmse,
        "r2": val_r2,
        "model": model
    }

CPU times: user 47.6 ms, sys: 18.3 ms, total: 65.9 ms
Wall time: 264 ms


In [19]:
%%time
target_col = 'exam_score'
predictions = train_and_predict_linear(trn_std, tst_std, target_col)

--- Validation Metrics ---
RMSE: 8.9082
R^2 Score: 0.7769

--- Inference Complete ---
Generated 270000 predictions.
CPU times: user 2.08 s, sys: 256 ms, total: 2.33 s
Wall time: 1.15 s
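
The two validation metrics are closely related: R² equals one minus the MSE divided by the variance of the validation target. The sketch below computes both with NumPy on made-up numbers, then uses that identity to back out the target standard deviation implied by the RMSE and R² reported above. The function name is illustrative.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    r2 = 1.0 - mse / np.var(y_true)  # population variance, as in sklearn's r2_score
    return float(np.sqrt(mse)), float(r2)

# Toy check with made-up scores.
toy_rmse, toy_r2 = rmse_and_r2([50, 60, 70, 80], [52, 58, 71, 79])
print(f"RMSE: {toy_rmse:.4f}  R^2: {toy_r2:.4f}")

# Target std implied by the metrics reported above (RMSE 8.9082, R^2 0.7769).
implied_std = 8.9082 / np.sqrt(1.0 - 0.7769)
print(f"Implied target std: {implied_std:.2f}")
```

The implied standard deviation comes out at roughly 18.9, in line with the `std` of `exam_score` shown by `describe()`. This suggests the two reported numbers are mutually consistent.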


In [20]:
%%time
sub_df['exam_score'] = predictions['test_predictions']
sub_df.to_csv('submission.csv', index = False)

CPU times: user 732 ms, sys: 10.3 ms, total: 743 ms
Wall time: 514 ms


In [21]:
# --- Validation Metrics ---
# RMSE: 8.8865
# R^2 Score: 0.7780

# --- Inference Complete ---
# Generated 270000 predictions.

In [22]:
# --- Validation Metrics ---
# RMSE: 8.8723
# R^2 Score: 0.7787

# --- Inference Complete ---
# Generated 270000 predictions.

# 🌟 Level 2: Random Forest Model

In [23]:
%%time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

def train_and_predict_tree(train_df, test_df, target_col):
    """
    Trains a Random Forest Regressor using a train/val split and 
    predicts on a test dataset.
    
    Args:
        train_df (pd.DataFrame): The training dataset containing features and target.
        test_df (pd.DataFrame): The test dataset containing only features.
        target_col (str): The name of the target column in train_df.
        
    Returns:
        np.array: Predictions for the test dataset.
        model: The trained RandomForestRegressor object.
    """
    
    # 1. Separate Features and Target
    # We drop the target from X, and ensure we only use columns present in both datasets 
    # (intersecting columns) to avoid shape mismatch errors.
    feature_cols = [c for c in train_df.columns if c != target_col and c in test_df.columns]
    
    X = train_df[feature_cols]
    y = train_df[target_col]
    X_test = test_df[feature_cols]

    # 2. Split Train into Train and Validation sets (80/20 split)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # 3. Initialize and Train the Model
    # You can adjust max_depth to prevent overfitting
    model = RandomForestRegressor(n_estimators=100, random_state=42) 
    model.fit(X_train, y_train)
    
    # 4. Evaluate on Validation Set
    val_preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, val_preds))
    r2 = r2_score(y_val, val_preds)
    
    print("--- Validation Metrics ---")
    print(f"RMSE: {rmse:.4f}")
    print(f"R^2 Score: {r2:.4f}")
    
    # 5. Inference on Test Dataset
    test_predictions = model.predict(X_test)
    
    print("\n--- Inference Complete ---")
    print(f"Generated {len(test_predictions)} predictions.")
    
    return test_predictions, model

# --- Example Usage ---
# Assuming you have pandas DataFrames 'df_train' and 'df_test'
# predictions, trained_model = train_and_predict_tree(df_train, df_test, 'target_variable_name')

CPU times: user 166 ms, sys: 20.3 ms, total: 186 ms
Wall time: 368 ms
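
The comment in the function mentions adjusting `max_depth` to limit overfitting. The sketch below illustrates the idea on synthetic data, not the competition data, by comparing a default forest with a constrained one. The specific values (`max_depth=6`, `min_samples_leaf=20`) are arbitrary assumptions for illustration, not tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(2000, 3))
y = 6.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 5, size=2000)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

settings = {
    "default": {},
    "constrained": {"max_depth": 6, "min_samples_leaf": 20},  # illustrative values
}
results = {}
for name, params in settings.items():
    model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1, **params)
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_rmse": float(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))),
        "val_rmse": float(np.sqrt(mean_squared_error(y_va, model.predict(X_va)))),
    }
    print(name, results[name])
```

A large gap between training and validation RMSE for the default forest is the usual sign that the trees are memorising noise. Comparing both numbers is more informative than looking at validation RMSE alone.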


In [24]:
%%time
# target_col = 'exam_score'
# test_predictions, rf_model = train_and_predict_tree(trn_std, tst_std, target_col)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


In [25]:
# Square Features
#--- Validation Metrics ---
#RMSE: 12.8053
#R^2 Score: 0.5389

#--- Inference Complete ---
#Generated 270000 predictions.

In [26]:
# Square and Cubic Features
#--- Validation Metrics ---
#RMSE: 12.7606
#R^2 Score: 0.5422

#--- Inference Complete ---
#Generated 270000 predictions.

In [27]:
# Square, Cubic and Fourth Power Features
#--- Validation Metrics ---
#RMSE: 12.7490
#R^2 Score: 0.5430

#--- Inference Complete ---
#Generated 270000 predictions.

In [28]:
# No Additional Features
#--- Validation Metrics ---
#RMSE: 12.8884
#R^2 Score: 0.5329

#--- Inference Complete ---
#Generated 270000 predictions.

# 🌟 Level 3: Extreme Gradient Boosted Trees, XGBoost, CatBoost and LightGBM

## 3.1 XGBoost

In [29]:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

def train_and_predict_xgb(train_df, test_df, target_col, test_size=0.2, random_state=42):
    """
    Trains an XGBoost Regressor, validates it, and prints a performance report.
    """
    
    # 1. Prepare Data
    X = train_df.drop(columns=[target_col])
    y = train_df[target_col]
    X_test_inference = test_df[X.columns]

    # 2. Split Data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # 3. Train Model
    # XGBoost handles missing values (NaNs) internally, unlike Linear Regression
    model = XGBRegressor(
        objective='reg:squarederror', 
        n_estimators=2048, 
        learning_rate=0.01, 
        random_state=random_state
    )
    model.fit(X_train, y_train)

    # 4. Validate and Calculate Metrics
    val_predictions = model.predict(X_val)
    val_mse = mean_squared_error(y_val, val_predictions)
    val_rmse = np.sqrt(val_mse) 
    val_r2 = r2_score(y_val, val_predictions)

    # --- REPORTING BLOCK ---
    print("--- Validation Metrics ---")
    print(f"RMSE: {val_rmse:.4f}")
    print(f"R^2 Score: {val_r2:.4f}")
    print("") 

    # 5. Inference
    test_predictions = model.predict(X_test_inference)

    print("--- Inference Complete ---")
    print(f"Generated {len(test_predictions)} predictions.")
    
    return {
        "test_predictions": test_predictions,
        "rmse": val_rmse,
        "r2": val_r2,
        "model": model
    }

CPU times: user 76 ms, sys: 13.6 ms, total: 89.6 ms
Wall time: 362 ms


In [30]:
%%time
#predictions = train_and_predict_xgb(trn_std, tst_std, 'exam_score', test_size=0.2, random_state=42)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs


In [31]:
# --- Validation Metrics ---
# RMSE: 8.7950
# R^2 Score: 0.7825

# --- Inference Complete ---
# Generated 270000 predictions.

In [32]:
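%%time
# Note: the XGBoost call in In [30] is commented out, so `predictions` still holds
# the result of the last executed training cell (In [19] in this run).
# Re-run In [30] before writing this file.
sub_df['exam_score'] = predictions['test_predictions']
sub_df.to_csv('submission_xgb.csv', index = False)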

CPU times: user 481 ms, sys: 14.1 ms, total: 495 ms
Wall time: 494 ms


## 3.2 CatBoost

## 3.3 LightGBM

# 🌟 Level 4: Advanced Feature Engineering

## 4.1 Mean Encoded Features

In [33]:
import pandas as pd
import numpy as np

def mean_encode_features(train_df, test_df, target_col, cat_features, alpha=10):
    """
    Performs mean encoding on specified categorical features with smoothing.

    Args:
        train_df (pd.DataFrame): Training data containing features and target.
        test_df (pd.DataFrame): Test data containing features.
        target_col (str): The name of the target variable column.
        cat_features (list): List of categorical feature names to encode.
        alpha (float): Smoothing parameter (regularization) to prevent overfitting 
                       on rare categories. Higher alpha = more smoothing towards global mean.

    Returns:
        tuple: (train_df_encoded, test_df_encoded)
               New DataFrames with mean encoded features appended.
    """
    # Create copies to avoid modifying original dataframes in place
    train_encoded = train_df.copy()
    test_encoded = test_df.copy()
    
    # Calculate Global Mean of the target (used for smoothing and filling missing values)
    global_mean = train_df[target_col].mean()

    for feature in cat_features:
        # Define new column name
        new_col_name = f"{feature}_mean_enc"
        
        # Calculate count and sum of the target for each category in the training set
        agg = train_df.groupby(feature)[target_col].agg(['count', 'sum'])
        counts = agg['count']
        sums = agg['sum']
        
        # Apply Smoothing Formula
        # Smoothed Mean = (Sum of Target + Global Mean * alpha) / (Count + alpha)
        smoothed_means = (sums + global_mean * alpha) / (counts + alpha)
        
        # Map the smoothed means to the training and test datasets
        train_encoded[new_col_name] = train_encoded[feature].map(smoothed_means)
        test_encoded[new_col_name] = test_encoded[feature].map(smoothed_means)
        
        # Handle unseen categories (NaNs) in train/test by filling with global mean
        train_encoded[new_col_name] = train_encoded[new_col_name].fillna(global_mean)
        test_encoded[new_col_name] = test_encoded[new_col_name].fillna(global_mean)
        
    return train_encoded, test_encoded
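
The smoothing formula is easiest to follow with a small worked example. The sketch below applies the same expression as `mean_encode_features` to a made-up four-row frame with `alpha=2`, showing how a rare category is pulled towards the global mean more strongly than a frequent one.

```python
import pandas as pd

# Hypothetical toy data: 'coaching' has three rows, 'mixed' only one.
toy = pd.DataFrame({
    "study_method": ["coaching", "coaching", "coaching", "mixed"],
    "exam_score": [90.0, 80.0, 70.0, 40.0],
})

alpha = 2
global_mean = toy["exam_score"].mean()                      # 280 / 4 = 70
agg = toy.groupby("study_method")["exam_score"].agg(["count", "sum"])
raw_means = agg["sum"] / agg["count"]                       # coaching 80, mixed 40
smoothed = (agg["sum"] + global_mean * alpha) / (agg["count"] + alpha)

print(pd.DataFrame({"raw": raw_means, "smoothed": smoothed}))
# coaching: (240 + 70 * 2) / (3 + 2) = 76
# mixed:    ( 40 + 70 * 2) / (1 + 2) = 60
```

The single-row category moves from 40 to 60, halfway to the global mean and beyond, while the three-row category only moves from 80 to 76. With categories containing tens of thousands of rows, as in this dataset, a small `alpha` barely changes the raw means.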

In [34]:
target_col = 'exam_score'
cat_features = ['gender', 'course', 'internet_access', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty']
trn_mean_enc, tst_mean_enc = mean_encode_features(trn_df, tst_df, target_col, cat_features, alpha=2)

mean_enc_feat  = ['gender_mean_enc', 'course_mean_enc', 'internet_access_mean_enc', 'sleep_quality_mean_enc', 'study_method_mean_enc', 'facility_rating_mean_enc','exam_difficulty_mean_enc']

trn_std = pd.merge(trn_std, trn_mean_enc[mean_enc_feat], left_index = True, right_index=True)
tst_std = pd.merge(tst_std, tst_mean_enc[mean_enc_feat], left_index = True, right_index=True)

In [35]:
trn_std.columns

Index(['age', 'study_hours', 'class_attendance', 'sleep_hours', 'exam_score',
       'age^2', 'age study_hours', 'age class_attendance', 'age sleep_hours',
       'study_hours^2', 'study_hours class_attendance',
       'study_hours sleep_hours', 'class_attendance^2',
       'class_attendance sleep_hours', 'sleep_hours^2', 'formula',
       'gender_male', 'gender_other', 'course_b.sc', 'course_b.tech',
       'course_ba', 'course_bba', 'course_bca', 'course_diploma',
       'internet_access_yes', 'sleep_quality_good', 'sleep_quality_poor',
       'study_method_group study', 'study_method_mixed',
       'study_method_online videos', 'study_method_self-study',
       'facility_rating_low', 'facility_rating_medium', 'exam_difficulty_hard',
       'exam_difficulty_moderate', 'gender_mean_enc', 'course_mean_enc',
       'internet_access_mean_enc', 'sleep_quality_mean_enc',
       'study_method_mean_enc', 'facility_rating_mean_enc',
       'exam_difficulty_mean_enc'],
      dtype='object')
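
One caveat with the mean encoding above is that each training row's encoded value is computed from statistics that include that row's own target. A common way to reduce this leakage is to compute the encoding out-of-fold. The helper below is a hypothetical sketch of that idea for a single column, assuming a unique index. It is not the notebook's method and has not been evaluated on this dataset.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(train_df, feature, target_col, n_splits=5, alpha=10, seed=42):
    """Out-of-fold smoothed mean encoding for one column (illustrative helper)."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(train_df)) % n_splits   # random, balanced fold labels
    encoded = pd.Series(np.nan, index=train_df.index, dtype=float)

    for k in range(n_splits):
        fit_part = train_df[fold_id != k]    # rows used to compute the statistics
        held_out = train_df[fold_id == k]    # rows that receive the encoding
        prior = fit_part[target_col].mean()
        agg = fit_part.groupby(feature)[target_col].agg(["count", "sum"])
        smoothed = (agg["sum"] + prior * alpha) / (agg["count"] + alpha)
        values = held_out[feature].map(smoothed).fillna(prior)
        encoded.loc[held_out.index] = values.to_numpy()
    return encoded

# Toy usage with made-up data.
toy = pd.DataFrame({
    "course": ["bca", "b.sc"] * 10,
    "exam_score": np.arange(20, dtype=float),
})
toy_enc = oof_mean_encode(toy, "course", "exam_score", n_splits=5, alpha=2)
print(toy_enc.head())
```

The test set would still be encoded with statistics from the full training set, as in `mean_encode_features`. Only the training rows need the out-of-fold treatment.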

In [36]:
import pandas as pd

def frequency_encode_features(train_df, test_df, cat_features, target=None):
    """
    Performs frequency encoding: replaces categories with their percentage 
    representation in the training set.

    Args:
        train_df (pd.DataFrame): Training data.
        test_df (pd.DataFrame): Test data.
        cat_features (list): List of categorical feature names.
        target (str, optional): Not used for Frequency encoding, included for API consistency.

    Returns:
        tuple: (train_df_encoded, test_df_encoded)
               New DataFrames with frequency encoded features appended.
    """
    train_encoded = train_df.copy()
    test_encoded = test_df.copy()

    for feature in cat_features:
        new_col_name = f"{feature}_freq"

        # 1. Calculate frequency (proportion) based on TRAINING data only
        # normalize=True gives a proportion (e.g. 0.5), normalize=False gives the raw count (e.g. 50)
        freq_map = train_df[feature].value_counts(normalize=True)

        # 2. Map the frequencies to both Train and Test
        train_encoded[new_col_name] = train_encoded[feature].map(freq_map)
        test_encoded[new_col_name] = test_encoded[feature].map(freq_map)

        # 3. Handle unseen categories in Test
        # If a category is in Test but not Train, its frequency is technically 0
        test_encoded[new_col_name] = test_encoded[new_col_name].fillna(0)

    return train_encoded, test_encoded
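
A small toy example shows what the frequency map looks like and how an unseen test category is handled. The frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'diploma' appears in test but not in train.
train_toy = pd.DataFrame({"course": ["bca", "bca", "b.sc", "ba"]})
test_toy = pd.DataFrame({"course": ["bca", "diploma"]})

# Proportion of each category in the training data only.
freq_map = train_toy["course"].value_counts(normalize=True)

# Map onto test; categories never seen in train get frequency 0.
test_freq = test_toy["course"].map(freq_map).fillna(0)
print(test_freq.tolist())  # [0.5, 0.0]
```

Categories that happen to share the same frequency become indistinguishable under this encoding, which is one reason it is combined with the one-hot and mean-encoded columns here rather than used alone.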

In [37]:
target_col = 'exam_score'
cat_features = ['gender', 'course', 'internet_access', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty']
trn_freq_enc, tst_freq_enc = frequency_encode_features(trn_df, tst_df, cat_features, target_col)

freq_enc_feat  = ['gender_freq', 'course_freq', 'internet_access_freq', 'sleep_quality_freq', 'study_method_freq', 'facility_rating_freq','exam_difficulty_freq']

trn_std = pd.merge(trn_std, trn_freq_enc[freq_enc_feat], left_index = True, right_index=True)
tst_std = pd.merge(tst_std, tst_freq_enc[freq_enc_feat], left_index = True, right_index=True)

In [38]:
%%time
#predictions = train_and_predict_xgb(trn_std, tst_std, 'exam_score', test_size=0.2, random_state=42)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.68 µs


In [39]:
# --- Validation Metrics ---
# RMSE: 8.7915
# R^2 Score: 0.7827

# --- Inference Complete ---
# Generated 270000 predictions.
# CPU times: user 9min 53s, sys: 809 ms, total: 9min 54s
# Wall time: 2min 30s

In [40]:
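%%time
# Note: the XGBoost call in In [38] is commented out, so `predictions` still holds
# the result of the last executed training cell. Re-run In [38] before writing this file.
sub_df['exam_score'] = predictions['test_predictions']
sub_df.to_csv('submission_xgb_adv.csv', index = False)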

CPU times: user 482 ms, sys: 5.18 ms, total: 487 ms
Wall time: 486 ms


# 🌟 Level 5: Advanced Training, Cross-Validation Loop

In [41]:
def train_regressor(train_df, test_df, target_column, feature_columns=None, model_type="xgboost", param_file=None, 
                n_splits=5, categorical_features=None):
    """
    Train a machine learning REGRESSOR using the provided training and test datasets with K-Fold cross-validation,
    utilizing GPU support where applicable, and return predicted values.

    Parameters:
        train_df (pandas.DataFrame): Training DataFrame.
        test_df (pandas.DataFrame): Testing DataFrame.
        target_column (str): Name of the target column.
        feature_columns (list): List of feature column names to use. If None, all columns except target_column are used.
        model_type (str): "xgboost", "catboost", "lgbm", or "hgb".
        param_file (dict): Dictionary of hyperparameters for the model.
        n_splits (int): Number of folds for cross-validation.
        categorical_features (list): List of categorical column names. If None, they are inferred from the features.

    Returns:
        tuple: (test_predictions, oof_predictions, model)
            - test_predictions: Final predicted values for the test set (averaged across folds).
            - oof_predictions: Out-of-fold predicted values for the training set.
            - model: The final fitted model from the last fold.
    """
    from sklearn.model_selection import KFold
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    import numpy as np

    # Determine feature set: use specified feature_columns if provided, otherwise use all except target_column
    if feature_columns is not None:
        # Ensure target_column is not included in feature_columns
        feature_columns = [col for col in feature_columns if col != target_column]
        X = train_df[feature_columns].copy()
        X_test = test_df[feature_columns].copy()
    else:
        X = train_df.drop(columns=[target_column])
        X_test = test_df.copy()

    # Extract target values
    y = train_df[target_column]

    # Infer categorical features from the selected features if not provided
    if categorical_features is None:
        categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

    # Import model-specific REGRESSOR classes
    if model_type == "xgboost":
        from xgboost import XGBRegressor as Model
    elif model_type == "catboost":
        from catboost import CatBoostRegressor as Model
    elif model_type == "lgbm":
        from lightgbm import LGBMRegressor as Model
    elif model_type == "hgb":
        from sklearn.ensemble import HistGradientBoostingRegressor as Model
    else:
        raise ValueError("Unsupported model_type. Choose from 'xgboost', 'catboost', 'lgbm', or 'hgb'.")

    # Set default parameters if none are provided (Updated for Regression)
    if param_file is None:
        if model_type == "xgboost":
            param_file = {
                'n_estimators': 8192,
                'learning_rate': 0.01,
                'max_depth': 8,
                'subsample': 0.80,
                'colsample_bytree': 0.80,
                'min_child_weight': 10,
                'objective': 'reg:squarederror', # Key change for regression
                'eval_metric': 'rmse',           # Key change for regression
                'tree_method': 'hist',
                'device': 'cuda',
                'random_state': 42
            }
        elif model_type == "catboost":
            param_file = {
                'iterations': 2048,
                'learning_rate': 0.01,
                'depth': 6,
                'loss_function': 'RMSE', # Key change for regression
                'task_type': 'GPU',
                'random_seed': 42,
                'verbose': False
            }
        elif model_type == "lgbm":
            param_file = {
                'n_estimators': 4096,
                'learning_rate': 0.01,
                'max_depth': -1,
                'subsample': 0.75,
                'colsample_bytree': 0.80,
                'objective': 'regression', # Key change for regression
                'metric': 'rmse',          # Key change for regression
                'device': 'gpu',
                'random_state': 42
            }
        elif model_type == "hgb":
            param_file = {
                'max_iter': 200,
                'learning_rate': 0.05,
                'max_leaf_nodes': 31,
                'early_stopping': True,
                'random_state': 42
            }

    # Initialize the model with the specified parameters
    model = Model(**param_file)

    # Set up K-Fold cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    # Lists to store metrics and test-set predictions
    rmse_scores = []
    mae_scores = []
    r2_scores = []
    test_preds = []  # To collect test-set predictions from each fold
    
    # Initialize oof_predictions as a 1D array for regression
    oof_predictions = np.zeros(len(train_df))

    for train_index, val_index in kf.split(X):
        # Create fold-specific training and validation data
        X_train, X_val = X.iloc[train_index].copy(), X.iloc[val_index].copy()
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        X_test_fold = X_test.copy()

        # --- PROCESS CATEGORICAL FEATURES ---
        if model_type == "catboost":
            # CatBoost handles categorical features internally if passed as strings.
            X_train[categorical_features] = X_train[categorical_features].fillna('Missing').astype(str)
            X_val[categorical_features] = X_val[categorical_features].fillna('Missing').astype(str)
            X_test_fold[categorical_features] = X_test_fold[categorical_features].fillna('Missing').astype(str)
            model.fit(X_train, y_train, cat_features=categorical_features)
        else:
            # For other models, ordinal-encode categorical features.
            # The encoder is fitted on the training fold only; unseen categories map to -1.
            from sklearn.preprocessing import OrdinalEncoder
            encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
            X_train[categorical_features] = encoder.fit_transform(
                X_train[categorical_features].fillna('Missing').astype(str)
            )
            X_val[categorical_features] = encoder.transform(
                X_val[categorical_features].fillna('Missing').astype(str)
            )
            X_test_fold[categorical_features] = encoder.transform(
                X_test_fold[categorical_features].fillna('Missing').astype(str)
            )
            model.fit(X_train, y_train)

        # --- PREDICTION ON VALIDATION SET ---
        # Regression models use .predict(), not .predict_proba()
        y_val_pred = model.predict(X_val)
        
        # Store OOF predictions
        oof_predictions[val_index] = y_val_pred

        # Calculate Metrics
        rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
        mae = mean_absolute_error(y_val, y_val_pred)
        r2 = r2_score(y_val, y_val_pred)
        
        rmse_scores.append(rmse)
        mae_scores.append(mae)
        r2_scores.append(r2)

        # --- PREDICTION ON TEST SET ---
        if target_column in X_test_fold.columns:
            X_test_fold = X_test_fold.drop(columns=[target_column], errors='ignore')
        
        test_preds.append(model.predict(X_test_fold))

    # Calculate and print average metrics
    avg_rmse = np.mean(rmse_scores)
    avg_mae = np.mean(mae_scores)
    avg_r2 = np.mean(r2_scores)
    
    print("Model Performance Metrics (Cross-Validation):")
    print("..................")
    print(f"Average RMSE: {avg_rmse:.4f}")
    print(f"Average MAE:  {avg_mae:.4f}")
    print(f"Average R2:   {avg_r2:.4f}")

    # Average the test-set predictions from each fold
    test_preds = np.array(test_preds)
    mean_preds = np.mean(test_preds, axis=0)

    # Optionally, print the feature names used in the last training fold
    print("Features used in the last fold:")
    print(X_train.columns)

    return mean_preds, oof_predictions, model
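
The function above prints the mean of the per-fold metrics, while the returned `oof_predictions` also allow a single pooled score over all training rows. The two numbers are usually close but not identical, because RMSE takes a square root after averaging. A minimal sketch using only NumPy and made-up numbers (the helpers `rmse` and `pooled_vs_fold_rmse` and the `fold_ids` array are illustrative and not part of the notebook):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error for two 1D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pooled_vs_fold_rmse(y_true, oof_pred, fold_ids):
    """Return (pooled OOF RMSE, mean of per-fold RMSEs).

    fold_ids is a hypothetical array giving the validation fold of each row.
    """
    y_true = np.asarray(y_true, dtype=float)
    oof_pred = np.asarray(oof_pred, dtype=float)
    fold_ids = np.asarray(fold_ids)
    fold_scores = [
        rmse(y_true[fold_ids == f], oof_pred[fold_ids == f])
        for f in np.unique(fold_ids)
    ]
    return rmse(y_true, oof_pred), float(np.mean(fold_scores))

# Toy data: fold 0 has errors of 1, fold 1 has errors of 3.
y_toy = np.array([0.0, 0.0, 0.0, 0.0])
oof_toy = np.array([1.0, 1.0, 3.0, 3.0])
folds_toy = np.array([0, 0, 1, 1])

pooled, fold_mean = pooled_vs_fold_rmse(y_toy, oof_toy, folds_toy)
print(f"Pooled OOF RMSE: {pooled:.4f}")
print(f"Mean fold RMSE:  {fold_mean:.4f}")
```

On the notebook's data, the pooled figure would come from something like `rmse(trn_std['exam_score'], oof_predictions)`, which makes the score directly comparable with the overall OOF RMSE reported for the neural network later on.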

In [42]:
mean_preds, oof_predictions, model = train_regressor(trn_std, tst_std, 'exam_score', feature_columns=None, model_type="xgboost", param_file=None, n_splits=5, categorical_features=None)

Model Performance Metrics (Cross-Validation):
..................
Average RMSE: 8.7870
Average MAE:  6.9903
Average R2:   0.7842
Features used in the last fold:
Index(['age', 'study_hours', 'class_attendance', 'sleep_hours', 'age^2',
       'age study_hours', 'age class_attendance', 'age sleep_hours',
       'study_hours^2', 'study_hours class_attendance',
       'study_hours sleep_hours', 'class_attendance^2',
       'class_attendance sleep_hours', 'sleep_hours^2', 'formula',
       'gender_male', 'gender_other', 'course_b.sc', 'course_b.tech',
       'course_ba', 'course_bba', 'course_bca', 'course_diploma',
       'internet_access_yes', 'sleep_quality_good', 'sleep_quality_poor',
       'study_method_group study', 'study_method_mixed',
       'study_method_online videos', 'study_method_self-study',
       'facility_rating_low', 'facility_rating_medium', 'exam_difficulty_hard',
       'exam_difficulty_moderate', 'gender_mean_enc', 'course_mean_enc',
       'internet_access_mean_enc', 

In [43]:
# Model Performance Metrics (Cross-Validation):
# ..................
# Average RMSE: 8.8102
# Average MAE:  7.0132
# Average R2:   0.7831
# Features used in the last fold:
# Index(['age', 'study_hours', 'class_attendance', 'sleep_hours', 'gender_male',
#        'gender_other', 'course_b.sc', 'course_b.tech', 'course_ba',
#        'course_bba', 'course_bca', 'course_diploma', 'internet_access_yes',
#        'sleep_quality_good', 'sleep_quality_poor', 'study_method_group study',
#        'study_method_mixed', 'study_method_online videos',
#        'study_method_self-study', 'facility_rating_low',
#        'facility_rating_medium', 'exam_difficulty_hard',
#        'exam_difficulty_moderate', 'gender_mean_enc', 'course_mean_enc',
#        'internet_access_mean_enc', 'sleep_quality_mean_enc',
#        'study_method_mean_enc', 'facility_rating_mean_enc',
#        'exam_difficulty_mean_enc', 'gender_freq', 'course_freq',
#        'internet_access_freq', 'sleep_quality_freq', 'study_method_freq',
#        'facility_rating_freq', 'exam_difficulty_freq'],
#       dtype='object')

In [44]:
%%time
sub_df['exam_score'] = mean_preds
sub_df.to_csv('submission_xgb_cv.csv', index = False)

CPU times: user 330 ms, sys: 4.01 ms, total: 334 ms
Wall time: 333 ms


In [45]:
#mean_preds, oof_predictions, model = train_regressor(trn_std, tst_std, 'exam_score', feature_columns=None, model_type="catboost", param_file=None, n_splits=5, categorical_features=None)

In [46]:
# Model Performance Metrics (Cross-Validation):
# ..................
# Average RMSE: 8.8461
# Average MAE:  7.0636
# Average R2:   0.7813
# Features used in the last fold:
# Index(['age', 'study_hours', 'class_attendance', 'sleep_hours', 'gender_male',
#        'gender_other', 'course_b.sc', 'course_b.tech', 'course_ba',
#        'course_bba', 'course_bca', 'course_diploma', 'internet_access_yes',
#        'sleep_quality_good', 'sleep_quality_poor', 'study_method_group study',
#        'study_method_mixed', 'study_method_online videos',
#        'study_method_self-study', 'facility_rating_low',
#        'facility_rating_medium', 'exam_difficulty_hard',
#        'exam_difficulty_moderate', 'gender_mean_enc', 'course_mean_enc',
#        'internet_access_mean_enc', 'sleep_quality_mean_enc',
#        'study_method_mean_enc', 'facility_rating_mean_enc',
#        'exam_difficulty_mean_enc', 'gender_freq', 'course_freq',
#        'internet_access_freq', 'sleep_quality_freq', 'study_method_freq',
#        'facility_rating_freq', 'exam_difficulty_freq'],
#       dtype='object')

In [47]:
# %%time
# sub_df['exam_score'] = mean_preds
# sub_df.to_csv('submission_catboost_cv.csv', index = False)

# 🌟 Level 6: Neural Networks + XGB Pseudo Labels

In [48]:
%%capture
!pip install pytabkit

In [49]:
trn_std['xgboost_pred'] = oof_predictions
tst_std['xgboost_pred'] = mean_preds
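
The cell above passes XGBoost's predictions to the next model as an extra column, which is closer to stacking than to classic pseudo-labeling. Using the out-of-fold predictions for the training rows matters: in-sample predictions would look far more accurate than anything the test rows receive, and the next model would learn to trust the column too much. A small sketch on synthetic data, assuming scikit-learn is available (the decision tree and the random data are stand-ins, not the notebook's model or dataset):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=1.0, size=300)

base = DecisionTreeRegressor(random_state=0)  # unpruned tree, deliberately overfits
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold: each row is predicted by a model that never saw it.
oof_feature = cross_val_predict(base, X_demo, y_demo, cv=cv)

# In-sample: the model predicts rows it was trained on (leaky).
leaky_feature = base.fit(X_demo, y_demo).predict(X_demo)

def rmse(a, b):
    """Root mean squared error between two 1D arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f"RMSE of OOF feature vs target:       {rmse(y_demo, oof_feature):.3f}")
print(f"RMSE of in-sample feature vs target: {rmse(y_demo, leaky_feature):.3f}")
```

The in-sample column reproduces the target almost exactly, while the out-of-fold column carries a realistic error, which is the behaviour the test-set column will also have.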

In [50]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

# Import the specific model classes you want to support
try:
    from pytabkit import TabM_D_Regressor, RealMLP_TD_Regressor
except ImportError as exc:
    raise ImportError("Please install pytabkit first: pip install pytabkit") from exc

def train_pytabkit_regressor(train_df, test_df, target_col, 
                             model_type='TabM', n_splits=5, 
                             batch_size=256, max_epochs=200, 
                             device='cuda', seed=42):
    """
    Trains a PyTabKit Regressor (TabM or RealMLP) using K-Fold CV.
    PyTabKit handles categorical encoding and scaling internally.
    
    Args:
        train_df (pd.DataFrame): Training data.
        test_df (pd.DataFrame): Test data.
        target_col (str): Name of target column.
        model_type (str): 'TabM' or 'RealMLP'.
        n_splits (int): Number of CV folds.
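        batch_size (int): Batch size for training (RealMLP only; TabM uses the value in param_grid_TabM).
        max_epochs (int): Maximum epochs (RealMLP only; TabM uses n_epochs from param_grid_TabM).
            Early stopping is used automatically.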
        device (str): 'cuda' or 'cpu'.
        seed (int): Random seed.

    Returns:
        tuple: (test_predictions, oof_predictions)
    """
    
    # 1. Preprocessing (PyTabKit handles encoding, but we must impute NaNs)
    X = train_df.drop(columns=[target_col]).copy()
    y = train_df[target_col].copy().values
    X_test = test_df.copy()
    
    if target_col in X_test.columns:
        X_test = X_test.drop(columns=[target_col])

    # Identify features
    cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
    num_features = X.select_dtypes(exclude=['object', 'category']).columns.tolist()

    print(f"Features: {len(num_features)} Numerical, {len(cat_features)} Categorical")

    # Impute Missing Values (PyTabKit currently requires no NaNs in Numericals)
    # Categoricals are handled if they are 'object'/'category' type
    if num_features:
        imp = SimpleImputer(strategy='mean')
        X[num_features] = imp.fit_transform(X[num_features])
        X_test[num_features] = imp.transform(X_test[num_features])
        
    if cat_features:
        # Fill missing categoricals with a string placeholder
        X[cat_features] = X[cat_features].fillna('MISSING').astype(str)
        X_test[cat_features] = X_test[cat_features].fillna('MISSING').astype(str)

    
    # 2. TabM parameters
    # Note: batch_size and n_epochs are fixed here, so the batch_size and
    # max_epochs arguments only affect the RealMLP branch.
    param_grid_TabM = {
        'device': device,
        'random_state': seed,
        'verbosity': 0,

        # Architecture
        'arch_type': 'tabm-mini-normal',
        'tabm_k': 32,              # Number of implicit ensemble members
        'n_blocks': 6,             # Depth
        'd_block': 320,            # Capacity

        # Embeddings
        'num_emb_type': 'pwl',
        'd_embedding': 24,         # Richer numeric embeddings (was 16)

        # Optimization
        'batch_size': 512,         # More stable gradients
        'lr': 7e-4,                # Slower learning
        'weight_decay': 5e-3,      # Less aggressive regularization

        # Regularization
        'dropout': 0.07,           # Light dropout; you already regularize via Ridge
        'patience': 6,             # Allow convergence
        'n_epochs': 128,           # Allow longer training
    }

    # 3. K-Fold Cross Validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof_preds = np.zeros(len(train_df))
    test_fold_preds = []

    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        print(f"\n--- Fold {fold+1}/{n_splits} ---")
        
        # Split Data
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        # Select Model Class
        # TD = "Tuned Defaults" (Recommended by authors)
        if model_type == 'TabM':
            model = TabM_D_Regressor(**param_grid_TabM)
            
        elif model_type == 'RealMLP':
            model = RealMLP_TD_Regressor(
                device=device,
                random_state=seed,
                batch_size=batch_size,
                n_epochs=max_epochs,
                verbosity=0
            )
        else:
            raise ValueError("model_type must be 'TabM' or 'RealMLP'")

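        # Train (PyTabKit accepts a validation set for early stopping).
        # Caveat: the same fold drives early stopping and scoring,
        # so the reported RMSE can be slightly optimistic.
        # Note: PyTabKit automatically handles the cat/num split based on dtypes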
        model.fit(X_train, y_train, X_val=X_val, y_val=y_val)
        
        # Prediction
        val_preds = model.predict(X_val)
        oof_preds[val_idx] = val_preds
        
        test_preds = model.predict(X_test)
        test_fold_preds.append(test_preds)
        
        rmse = np.sqrt(mean_squared_error(y_val, val_preds))
        print(f"Fold {fold+1} RMSE: {rmse:.4f}")

    # Average Test Predictions
    final_test_preds = np.mean(test_fold_preds, axis=0)
    overall_rmse = np.sqrt(mean_squared_error(y, oof_preds))
    
    print(f"\nTraining Complete. Overall OOF RMSE: {overall_rmse:.4f}")
    
    return final_test_preds, oof_preds
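
One detail in the function above: the `SimpleImputer` is fitted on the full training set before the folds are split, so each validation fold contributes to the means used to fill its own gaps. The effect is typically small, but fitting the imputer inside each fold removes it. A minimal sketch of that variant on toy data (the helper name `impute_within_fold` and the toy frames are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold

def impute_within_fold(X_train, X_val, X_test, num_features):
    """Fit a mean imputer on the training fold only, then apply it to all three sets."""
    X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()
    imp = SimpleImputer(strategy='mean')
    X_train[num_features] = imp.fit_transform(X_train[num_features])
    X_val[num_features] = imp.transform(X_val[num_features])
    X_test[num_features] = imp.transform(X_test[num_features])
    return X_train, X_val, X_test

# Toy frames with a few missing values.
X_toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
                      'b': [10.0, 20.0, np.nan, 40.0, 50.0, 60.0]})
X_test_toy = pd.DataFrame({'a': [np.nan, 2.0], 'b': [5.0, np.nan]})

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    X_tr, X_va, X_te = impute_within_fold(
        X_toy.iloc[train_idx], X_toy.iloc[val_idx], X_test_toy, ['a', 'b']
    )
    n_missing = int(X_tr.isna().sum().sum()
                    + X_va.isna().sum().sum()
                    + X_te.isna().sum().sum())
    print(f"Fold {fold + 1}: NaNs left = {n_missing}")
```

In `train_pytabkit_regressor`, the same call would sit at the top of the fold loop, replacing the imputation done once before the split. The test set is then imputed separately for each fold, which matches how the per-fold test predictions are already averaged.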

In [51]:
%%time
final_test_preds, oof_preds = train_pytabkit_regressor(trn_std, tst_std, 'exam_score', model_type='TabM', n_splits=5, batch_size=256, max_epochs=128, device='cuda', seed=42)

Features: 49 Numerical, 0 Categorical

--- Fold 1/5 ---
Fold 1 RMSE: 8.7680

--- Fold 2/5 ---
Fold 2 RMSE: 8.7929

--- Fold 3/5 ---
Fold 3 RMSE: 8.7713

--- Fold 4/5 ---
Fold 4 RMSE: 8.7872

--- Fold 5/5 ---
Fold 5 RMSE: 8.7841

Training Complete. Overall OOF RMSE: 8.7807
CPU times: user 20min 6s, sys: 12min 48s, total: 32min 55s
Wall time: 32min 59s


In [52]:
# Training Complete. Overall OOF RMSE: 8.8666
# CPU times: user 27min 54s, sys: 11min 44s, total: 39min 38s
# Wall time: 39min 59s

In [53]:
%%time
sub_df['exam_score'] = final_test_preds
sub_df.to_csv('submission_tabm_cv.csv', index = False)

CPU times: user 517 ms, sys: 16.9 ms, total: 534 ms
Wall time: 533 ms


# 🌟 Level 7: Hero Mode, Meta Model

In [54]:
final_test_preds

array([70.13445129, 69.64476166, 87.54426575, ..., 91.37497559,
       55.23769836, 67.42386017])