## 🔧 Version 2: Feature Engineering + Label Encoding

This version prepares the dataset for machine learning models.  
Steps included:
- Importing necessary libraries
- Adding engineered features like BMI, Intensity, Age × BMI, etc.
- Encoding categorical variables (`Sex`)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

In [2]:
# Load training and test datasets
train = pd.read_csv("/kaggle/input/playground-series-s5e5/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s5e5/test.csv")
sample_submission = pd.read_csv("/kaggle/input/playground-series-s5e5/sample_submission.csv")

### 📐 Feature Engineering

New features created:
- **BMI** = Weight / (Height in meters)^2
- **Intensity** = Duration × Heart Rate
- **Age_BMI** = Age × BMI
- **Cardio_Effort** = Heart Rate / (Duration + 1), where the +1 avoids division by zero

In [3]:
def add_features(df):
    df['BMI'] = df['Weight'] / ((df['Height'] / 100) ** 2)
    df['Intensity'] = df['Duration'] * df['Heart_Rate']
    df['Age_BMI'] = df['Age'] * df['BMI']
    df['Cardio_Effort'] = df['Heart_Rate'] / (df['Duration'] + 1)
    return df

# Apply to both datasets
train = add_features(train)
test = add_features(test)
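As a quick sanity check, the formulas can be run on a single hand-made row. The sketch below uses a hypothetical non-mutating variant, `add_features_copy`, which copies the frame first (the original `add_features` modifies its argument in place). The sample values are those of the first row in the training preview.

```python
import pandas as pd

def add_features_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical non-mutating variant of add_features."""
    out = df.copy()
    out['BMI'] = out['Weight'] / ((out['Height'] / 100) ** 2)      # height in cm -> m
    out['Intensity'] = out['Duration'] * out['Heart_Rate']
    out['Age_BMI'] = out['Age'] * out['BMI']
    out['Cardio_Effort'] = out['Heart_Rate'] / (out['Duration'] + 1)
    return out

sample = pd.DataFrame({
    'Age': [36], 'Height': [189.0], 'Weight': [82.0],
    'Duration': [26.0], 'Heart_Rate': [101.0],
})
features = add_features_copy(sample)
print(features.round(6).to_dict('records')[0])
```

Because the input frame is copied, `sample` keeps its original five columns after the call.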

### 🔢 Label Encoding for 'Sex'

The 'Sex' column contains categorical values: "male" and "female".  
We convert them into numerical format using LabelEncoder, which assigns codes in sorted label order:
- female → 0
- male → 1


In [4]:
le = LabelEncoder()
train['Sex'] = le.fit_transform(train['Sex'])
test['Sex'] = le.transform(test['Sex'])  # Important: use the same encoder
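`LabelEncoder` assigns integer codes in sorted order of the distinct labels, which is why `female` maps to 0 and `male` to 1. The NumPy-only sketch below mirrors that behaviour with `np.unique` on made-up values. It illustrates the mapping and is not a replacement for the encoder used above.

```python
import numpy as np

sex = np.array(["male", "female", "female", "male"])  # made-up sample values

# np.unique returns the sorted distinct labels plus, for each element,
# its index into that sorted array
classes, codes = np.unique(sex, return_inverse=True)
mapping = {str(label): int(i) for i, label in enumerate(classes)}

print(mapping)
print(codes)
```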


### 📊 Preview the Updated Dataset


In [5]:
train.head()

|   | id | Sex | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories | BMI | Intensity | Age_BMI | Cardio_Effort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 150.0 | 22.955684 | 2626.0 | 826.404636 | 3.740741 |
| 1 | 1 | 0 | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 34.0 | 22.582709 | 680.0 | 1445.293387 | 9.444444 |
| 2 | 2 | 0 | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 29.0 | 24.690405 | 588.0 | 1259.210679 | 10.5 |
| 3 | 3 | 1 | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 140.0 | 24.414062 | 2625.0 | 488.28125 | 4.038462 |
| 4 | 4 | 0 | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 146.0 | 22.13674 | 2550.0 | 841.19611 | 3.923077 |


## 🚀 Version 3: XGBoost Model with Feature Engineering

This version includes:
- Train/Validation split
- Log transformation on target (`Calories`)
- XGBoost training and validation
- Prediction on test set
- Saving the submission file

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [7]:
# Drop ID and target from input features
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

In [8]:
# Split data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply log transformation to target (log1p to avoid log(0))
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
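Because the model is trained on `log1p(Calories)`, its predictions have to be mapped back with `np.expm1` before submission. The sketch below checks the round trip on made-up target values, including 0, where a plain `np.log` would return `-inf`.

```python
import numpy as np

y = np.array([0.0, 1.0, 29.0, 150.0])   # made-up calorie values
y_log = np.log1p(y)                      # log(1 + y), defined at y = 0
y_back = np.expm1(y_log)                 # exp(y_log) - 1, the inverse transform

print(y_log)
print(np.allclose(y, y_back))
```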


## 🚀 Version 4: XGBoost Model Training (with log1p target)

This version continues from Version 3 and uses the features engineered in Version 2.

Steps:
- Define input features `X` and target `y`
- Split the training data for validation
- Apply log transformation to the target
- Train XGBoost regressor
- Evaluate performance using RMSE on the log1p-transformed target
- Predict on the test set
- Save the predictions in `submission_xgb_fe.csv`


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [13]:
# Drop ID and Calories from training features
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

In [14]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

### 🧠 Train XGBoost Regressor

We use:
- 500 estimators
- Learning rate = 0.05
- max_depth = 6
- 80% row subsample per tree (`subsample=0.8`)
- 80% column subsample per tree (`colsample_bytree=0.8`)

In [16]:
model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train_log)


In [17]:
y_val_pred_log = model.predict(X_val)
rmse_log = np.sqrt(mean_squared_error(y_val_log, y_val_pred_log))
print(f"Validation RMSE (log): {rmse_log:.4f}")

Validation RMSE (log): 0.0602
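The RMSE computed on log1p-transformed values is the same quantity as the root mean squared logarithmic error (RMSLE) of the back-transformed predictions. A minimal NumPy sketch with made-up numbers follows. The helper name `rmsle` is illustrative and not part of the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([150.0, 34.0, 29.0, 140.0])                        # made-up targets
pred_log = np.log1p(y_true) + np.array([0.05, -0.02, 0.03, -0.04])   # made-up log-space predictions

# RMSE measured directly in log space, as in the validation cell
rmse_in_log_space = np.sqrt(np.mean((pred_log - np.log1p(y_true)) ** 2))
# RMSLE measured after mapping predictions back with expm1
rmsle_original_scale = rmsle(y_true, np.expm1(pred_log))

print(rmse_in_log_space, rmsle_original_scale)
```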


### 📤 Make Predictions on Test Set and Save Submission

In [18]:
X_test = test.drop(columns=['id'])
test_preds_log = model.predict(X_test)
test_preds = np.expm1(test_preds_log)
test_preds = np.maximum(0, test_preds)  # Calories cannot be negative; keeps the log1p-based MSLE well defined

# Save predictions
sample_submission['Calories'] = test_preds
sample_submission.to_csv("submission_xgb_fe.csv", index=False)

## 🐱 Version 5: CatBoost Model Training
In this version:
- We train a CatBoostRegressor on the same features
- Log1p transformation is applied to the target
- Predictions are saved in `submission_cb_fe.csv`

In [19]:
from catboost import CatBoostRegressor

# Prepare training and validation sets
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Log transform the target
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)


In [20]:
cat_model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=0,
    random_state=42
)
cat_model.fit(X_train, y_train_log)

<catboost.core.CatBoostRegressor at 0x7f42b1a12290>

In [21]:
val_preds_log = cat_model.predict(X_val)
rmse_log = np.sqrt(mean_squared_error(y_val_log, val_preds_log))
print(f"CatBoost Validation RMSE (log): {rmse_log:.4f}")

CatBoost Validation RMSE (log): 0.0603


In [22]:
X_test = test.drop(columns=['id'])
test_preds_log = cat_model.predict(X_test)
test_preds = np.expm1(test_preds_log)
test_preds = np.maximum(0, test_preds)

sample_submission['Calories'] = test_preds
sample_submission.to_csv("submission_cb_fe.csv", index=False)

## 🤖 Version 5: MLP Model (Neural Network with Keras)

In this version:
- Standardize features with StandardScaler
- Use TensorFlow/Keras to train a simple MLP
- Apply log1p transformation to the target
- Save predictions as `submission_nn_fe.csv`


In [23]:
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

# Define features and target
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Prepare test set as well
X_test = test.drop(columns=['id'])
X_test_scaled = scaler.transform(X_test)

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Log transform the target
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
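The cell above fits the scaler on all training rows before splitting, so the validation rows contribute to the mean and standard deviation. One common alternative, sketched here with plain NumPy on made-up data, is to compute the statistics on the training split only and reuse them for validation and test. This is an assumption about what may be preferable, not what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(42)
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # made-up feature matrix
X_tr, X_va = X_all[:80], X_all[80:]                       # simple 80/20 split

# Statistics come from the training split only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)               # population std (ddof=0), as StandardScaler uses

X_tr_scaled = (X_tr - mean) / std
X_va_scaled = (X_va - mean) / std    # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

With scikit-learn the equivalent pattern would be `scaler.fit_transform` on the training split followed by `scaler.transform` on the validation and test sets.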


In [24]:
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train the model
model.fit(X_train, y_train_log, epochs=20, batch_size=512, validation_data=(X_val, y_val_log), verbose=1)


Epoch 1/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - loss: 1.5163 - val_loss: 0.0108
Epoch 2/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 7s 6ms/step - loss: 0.2368 - val_loss: 0.0101
Epoch 3/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 7s 6ms/step - loss: 0.1727 - val_loss: 0.0048
Epoch 4/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 6s 5ms/step - loss: 0.1319 - val_loss: 0.0074
Epoch 5/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 6s 5ms/step - loss: 0.0928 - val_loss: 0.0093
Epoch 6/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 6s 5ms/step - loss: 0.0608 - val_loss: 0.0168
Epoch 7/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 6s 5ms/step - loss: 0.0391 - val_loss: 0.0168
Epoch 8/20
1172/1172 ━━━━━━━━━━━━━━━━━━━━ 6s 5ms/step - loss: 0.0276 - val_loss: 0.0214
Epoch 9/20
1172/1172

<keras.src.callbacks.history.History at 0x7f41fbe20510>

In [25]:
val_preds_log = model.predict(X_val).flatten()
rmse_log = np.sqrt(mean_squared_error(y_val_log, val_preds_log))
print(f"MLP Validation RMSE (log): {rmse_log:.4f}")

4688/4688 ━━━━━━━━━━━━━━━━━━━━ 5s 1ms/step
MLP Validation RMSE (log): 0.1869


In [26]:
test_preds_log = model.predict(X_test_scaled).flatten()
test_preds = np.expm1(test_preds_log)
test_preds = np.maximum(0, test_preds)

sample_submission['Calories'] = test_preds
sample_submission.to_csv("submission_nn_fe.csv", index=False)

7813/7813 ━━━━━━━━━━━━━━━━━━━━ 8s 1ms/step


## 🧮 Version 6: Model Ensemble (XGBoost + CatBoost + MLP)

This version combines predictions from:
- XGBoost (`submission_xgb_fe.csv`)
- CatBoost (`submission_cb_fe.csv`)
- MLP (`submission_nn_fe.csv`)

Weights:
- 0.45 XGB
- 0.45 CatBoost
- 0.10 MLP

In [27]:
xgb = pd.read_csv("/kaggle/working/submission_xgb_fe.csv")
cat = pd.read_csv("/kaggle/working/submission_cb_fe.csv")
mlp = pd.read_csv("/kaggle/working/submission_nn_fe.csv")

In [28]:
ensemble = pd.DataFrame()
ensemble['id'] = xgb['id']
ensemble['Calories'] = (
    0.45 * xgb['Calories'] +
    0.45 * cat['Calories'] +
    0.10 * mlp['Calories']
)

# Clip predictions to valid range [1, 314] for MSLE
ensemble['Calories'] = np.clip(ensemble['Calories'], 1, 314)

In [29]:
ensemble.to_csv("submission_ensemble.csv", index=False)
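
The blend is a weighted arithmetic mean whose weights sum to 1. The sketch below reproduces the same arithmetic on made-up predictions and adds a check on the weights. The helper name `blend` is hypothetical and the clipping bounds are taken from the cell above.

```python
import numpy as np

def blend(preds, weights, lo=1.0, hi=314.0):
    """Weighted average of prediction arrays, clipped to [lo, hi]."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    stacked = np.vstack(preds)               # shape: (n_models, n_rows)
    return np.clip(weights @ stacked, lo, hi)

xgb_pred = np.array([100.0, 0.5, 320.0])     # made-up predictions
cat_pred = np.array([110.0, 0.7, 330.0])
mlp_pred = np.array([90.0, 0.2, 300.0])

result = blend([xgb_pred, cat_pred, mlp_pred], [0.45, 0.45, 0.10])
print(result)
```

Checking the weights up front guards against a silently rescaled blend if one weight is changed without adjusting the others.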