## 🔧 Version 2: Feature Engineering + Label Encoding

This version prepares the dataset for machine learning models.  
Steps included:
- Importing necessary libraries
- Adding engineered features: BMI, Intensity, Age × BMI, and Cardio Effort
- Encoding categorical variables (`Sex`)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

In [2]:
# Load training and test datasets
train = pd.read_csv("/kaggle/input/playground-series-s5e5/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s5e5/test.csv")
sample_submission = pd.read_csv("/kaggle/input/playground-series-s5e5/sample_submission.csv")

### 📐 Feature Engineering

New features created:
- **BMI** = Weight / (Height in meters)^2
- **Intensity** = Duration × Heart Rate
- **Age_BMI** = Age × BMI
- **Cardio_Effort** = Heart Rate / (Duration + 1), where the +1 in the denominator avoids division by zero

In [3]:
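def add_features(df):
    """Return a copy of df with the engineered features added."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df['BMI'] = df['Weight'] / ((df['Height'] / 100) ** 2)  # Height is in cm
    df['Intensity'] = df['Duration'] * df['Heart_Rate']
    df['Age_BMI'] = df['Age'] * df['BMI']
    df['Cardio_Effort'] = df['Heart_Rate'] / (df['Duration'] + 1)  # +1 avoids division by zero
    return df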

# Apply to both datasets
train = add_features(train)
test = add_features(test)
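
As a quick sanity check, the formulas above can be applied to a single hand-made row and compared with values worked out by hand. This is only a minimal sketch: the `sample` frame and the `add_features_copy` helper are hypothetical names introduced for illustration, and the row values are chosen to resemble one record of the dataset.

```python
import pandas as pd

def add_features_copy(df):
    """Hypothetical non-mutating variant of the feature step above."""
    out = df.copy()
    out['BMI'] = out['Weight'] / ((out['Height'] / 100) ** 2)  # cm -> m
    out['Intensity'] = out['Duration'] * out['Heart_Rate']
    out['Age_BMI'] = out['Age'] * out['BMI']
    out['Cardio_Effort'] = out['Heart_Rate'] / (out['Duration'] + 1)
    return out

# One illustrative row: 36 years, 189 cm, 82 kg, 26 min, 101 bpm
sample = pd.DataFrame({
    'Age': [36], 'Height': [189.0], 'Weight': [82.0],
    'Duration': [26.0], 'Heart_Rate': [101.0],
})
features = add_features_copy(sample).iloc[0]
print(features[['BMI', 'Intensity', 'Age_BMI', 'Cardio_Effort']].round(4))
```

By hand, BMI is 82 / 1.89² ≈ 22.96, Intensity is 26 × 101 = 2626, and Cardio_Effort is 101 / 27 ≈ 3.74, so the printed values should agree with those.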

### 🔢 Label Encoding for 'Sex'

The `Sex` column contains two categorical values: "male" and "female".  
`LabelEncoder` assigns integer codes to the classes in sorted order, which gives:
- female → 0
- male → 1


In [4]:
le = LabelEncoder()
train['Sex'] = le.fit_transform(train['Sex'])
test['Sex'] = le.transform(test['Sex'])  # Important: reuse the encoder fitted on train so both sets share one mapping
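
To confirm which integer each category received, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses small hypothetical series (`train_sex`, `test_sex`) rather than the competition data, so it only illustrates the behaviour.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the train/test 'Sex' columns
train_sex = pd.Series(['male', 'female', 'female', 'male'])
test_sex = pd.Series(['female', 'male'])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_sex)  # learn the mapping on train
test_codes = encoder.transform(test_sex)        # reuse it on test

# classes_ is sorted, so the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Calling `transform` on a value that was not seen during `fit` raises an error, which is a useful early warning if the test set contains an unexpected category.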


### 📊 Preview the Updated Dataset


In [5]:
train.head()

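|   | id | Sex | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories | BMI | Intensity | Age_BMI | Cardio_Effort |
|---|----|-----|-----|--------|--------|----------|------------|-----------|----------|-----|-----------|---------|---------------|
| 0 | 0 | 1 | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 150.0 | 22.955684 | 2626.0 | 826.404636 | 3.740741 |
| 1 | 1 | 0 | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 34.0 | 22.582709 | 680.0 | 1445.293387 | 9.444444 |
| 2 | 2 | 0 | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 29.0 | 24.690405 | 588.0 | 1259.210679 | 10.5 |
| 3 | 3 | 1 | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 140.0 | 24.414062 | 2625.0 | 488.28125 | 4.038462 |
| 4 | 4 | 0 | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 146.0 | 22.13674 | 2550.0 | 841.19611 | 3.923077 |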


## 🚀 Version 3: XGBoost Model with Feature Engineering

This version includes:
- Train/Validation split
- Log transformation on target (`Calories`)
- XGBoost training and validation
- Prediction on test set
- Saving the submission file

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [7]:
# Drop ID and target from input features
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

In [8]:
# Split data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply log transformation to target (log1p to avoid log(0))
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
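
The transform is reversible: `np.expm1` undoes `np.log1p`, which is what allows predictions made on the log scale to be mapped back to calories later. A minimal sketch with made-up target values:

```python
import numpy as np

# Illustrative target values, including 0 to show that log1p is defined there
calories = np.array([0.0, 1.0, 29.0, 150.0])

calories_log = np.log1p(calories)       # log(1 + y)
calories_back = np.expm1(calories_log)  # exp(z) - 1, the inverse of log1p

print(calories_log.round(4))
print(np.allclose(calories, calories_back))
```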

## 🚀 Version 4: XGBoost Model Training (with log1p target)

This version builds on the feature engineering from Version 2 and repeats the setup from Version 3 before training the model.

Steps:
- Define input features `X` and target `y`
- Split the training data for validation
- Apply log transformation to the target
- Train XGBoost regressor
- Evaluate performance using RMSE on the log1p-transformed target (which corresponds to RMSLE on the original scale)
- Predict on the test set
- Save the predictions in `submission_xgb_fe.csv`


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [13]:
# Drop ID and Calories from training features
X = train.drop(columns=['id', 'Calories'])
y = train['Calories']

In [14]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

### 🧠 Train XGBoost Regressor

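We use:
- 500 estimators (`n_estimators`)
- Learning rate = 0.05
- max_depth = 6
- 80% row subsample per tree (`subsample`)
- 80% column subsample per tree (`colsample_bytree`)
- random_state = 42 for reproducibility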

In [16]:
model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train_log)


In [17]:
y_val_pred_log = model.predict(X_val)
rmse_log = np.sqrt(mean_squared_error(y_val_log, y_val_pred_log))
print(f"Validation RMSE (log): {rmse_log:.4f}")

Validation RMSE (log): 0.0602
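
The RMSE computed on log1p-transformed values is the same quantity as the RMSLE computed on the original scale, as long as predictions are mapped back with `expm1` and not clipped. The sketch below checks this numerically; the `rmsle` helper and the perturbed `pred_log` values are hypothetical and stand in for real model output.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([29.0, 34.0, 140.0, 150.0])

# Hypothetical log-scale predictions: the true log values plus small errors
pred_log = np.log1p(y_true) + np.array([0.05, -0.02, 0.01, -0.04])

rmse_on_log = np.sqrt(np.mean((np.log1p(y_true) - pred_log) ** 2))
rmsle_on_original = rmsle(y_true, np.expm1(pred_log))

print(np.isclose(rmse_on_log, rmsle_on_original))
```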


### 📤 Make Predictions on Test Set and Save Submission

In [18]:
X_test = test.drop(columns=['id'])
test_preds_log = model.predict(X_test)
test_preds = np.expm1(test_preds_log)
test_preds = np.maximum(0, test_preds)  # clip at 0: calories cannot be negative, and MSLE-style metrics expect non-negative values

# Save predictions
sample_submission['Calories'] = test_preds
sample_submission.to_csv("submission_xgb_fe.csv", index=False)
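
Before uploading, it can help to run a few basic checks on the submission frame. The helper below is a hypothetical sketch, not part of the original notebook; it assumes the submission has exactly the columns `id` and `Calories`, and it runs on a toy frame.

```python
import numpy as np
import pandas as pd

def check_submission(submission, n_test_rows):
    """Hypothetical helper: basic validity checks before uploading."""
    assert list(submission.columns) == ['id', 'Calories']  # assumed layout
    assert len(submission) == n_test_rows                   # one row per test record
    assert submission['Calories'].notna().all()             # no missing predictions
    assert (submission['Calories'] >= 0).all()              # clipped at zero
    return True

# Toy example: the second raw prediction is negative and gets clipped to 0
toy = pd.DataFrame({
    'id': [0, 1],
    'Calories': np.maximum(0, [27.5, -0.3]),
})
print(check_submission(toy, n_test_rows=2))
```

With the real data, the call would be along the lines of `check_submission(sample_submission, n_test_rows=len(test))`.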