# Feature Engineering: Predict Calorie Expenditure

This section explains the most important part of my notebook: **feature engineering**.

## Why is this the most important part?

Feature engineering is the process of deriving new features from the original data so the model can pick up patterns more easily. In this competition, the raw inputs used here are few: `Sex`, `Age`, `Height`, `Weight`, `Duration`, `Heart_Rate`, and `Body_Temp`. Applying domain knowledge and careful transformations to them exposes relationships that the raw columns only imply, which helped the model make better predictions.

Performance improved significantly after this step. Most of the model’s predictive power came from the features provided, not from the choice of algorithm.

## About the process

I applied the feature engineering process separately for males and females, because the calorie expenditure and BMR formulas used below have different coefficients for each sex. Splitting also helps the model focus on more consistent patterns within each group.
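
As a minimal sketch of this split-and-recombine pattern, the toy example below separates rows by `Sex` with boolean masks, adds one feature per group, and then restores the original row order. The `toy` frame and its values are made up for illustration, and the recombination step is an assumption about how per-group results might be merged, not something taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Heart_Rate": [100, 95, 110],
    "Duration": [10, 20, 15],
})

# Split with boolean masks; .copy() avoids modifying a view of `toy`.
male = toy[toy["Sex"] == "male"].copy()
female = toy[toy["Sex"] == "female"].copy()

# Engineer features independently within each group.
for part in (male, female):
    part["HR_Duration"] = part["Heart_Rate"] * part["Duration"]

# Each subset keeps its original index labels, so sorting the
# concatenated frame by index restores the original row order.
recombined = pd.concat([male, female]).sort_index()
print(recombined)
```

Keeping the original index on each subset is what makes it possible to line per-group outputs back up with the input rows.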

### What kinds of features were created?

- Physiological features like:
  - Body Mass Index (BMI)
  - Body Surface Area (BSA)
  - Ponderal Index
- Heart-related features like:
  - Heart Rate × Duration
  - Heart Rate as a percentage of Max Heart Rate
- Temperature effects like:
  - Deviation from normal body temperature
  - Heart Rate × Temperature
- Interaction terms like:
  - Weight × Duration
  - Age × Heart Rate
- Logarithmic, square root, and squared versions of selected variables
- Binned variables for age, duration, and heart zones
- Group-based averages and differences:
  - Individual heart rate minus the group average (grouped by age or by duration)
- Scientific formulas:
  - Estimated calories burned (from medical literature)
  - BMR (Basal Metabolic Rate)
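
To make a few of these concrete, here is a small sketch that computes BMI, BSA, heart rate as a fraction of an age-predicted maximum, and a group-relative heart rate on made-up rows. Column names follow the notebook's code; the sample values are hypothetical, and heights are assumed to be in centimeters and weights in kilograms.

```python
import pandas as pd

# Hypothetical sample rows (Height in cm, Weight in kg).
df = pd.DataFrame({
    "Age": [30, 30, 40],
    "Height": [180.0, 165.0, 170.0],
    "Weight": [75.0, 60.0, 70.0],
    "Heart_Rate": [100.0, 110.0, 90.0],
})

# BMI: weight divided by height (in meters) squared.
df["BMI"] = df["Weight"] / (df["Height"] / 100) ** 2

# BSA via the Du Bois formula, in square meters.
df["BSA"] = 0.007184 * df["Height"] ** 0.725 * df["Weight"] ** 0.425

# Heart rate relative to the "220 minus age" estimate of maximum heart rate.
df["HR_pct_max"] = df["Heart_Rate"] / (220 - df["Age"])

# Group-relative heart rate: each row minus the mean of its age group.
df["delta_HR_age"] = df["Heart_Rate"] - df.groupby("Age")["Heart_Rate"].transform("mean")

print(df.round(3))
```

The two 30-year-olds share a group mean of 105, so their deltas are -5 and +5, while the single 40-year-old has a delta of 0.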

## Important note: No target leakage

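All features are created only from input variables available at prediction time.  
`Calories_per_min`, which is computed from the training labels, is the training target rather than a feature: it is split off from the feature matrix, and the raw `Calories` column is dropped. The `Calories_Burned` feature is a formula-based estimate computed from heart rate, weight, age, and duration, not from the label.  
No test labels or future data are used when creating these features.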
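
One way to guard against accidental leakage is a quick check that no label-derived column reaches the feature matrix. The helper below is a hypothetical sketch (`find_target_columns` and `TARGET_COLUMNS` are names introduced here, not part of the notebook). It only catches leakage by column name, not features computed indirectly from the label.

```python
import pandas as pd

# Assumed names of label-derived columns, taken from the notebook's code.
TARGET_COLUMNS = {"Calories", "Calories_per_min"}


def find_target_columns(X: pd.DataFrame) -> list:
    """Return the label-derived column names present in X, sorted."""
    return sorted(TARGET_COLUMNS.intersection(X.columns))


# Made-up feature matrices: one clean, one with the target left in.
X_ok = pd.DataFrame({"Duration": [10.0], "Heart_Rate": [100.0]})
X_bad = X_ok.assign(Calories_per_min=[5.0])

print(find_target_columns(X_ok))
print(find_target_columns(X_bad))
```

Applied to the `X_male` and `X_female` frames returned by `feature_engineering`, this check should return an empty list.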

## Summary

This section transforms the dataset into a much richer form, allowing the model to focus on real physical and physiological patterns. By handling males and females separately and using both domain-based and statistical features, this step forms the core of my approach.

In [1]:
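import numpy as np
import pandas as pd


def process_sex_group(df, is_train):
    """Build engineered features for one sex group and return the feature frame."""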
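    if is_train:
        # Training target: calories burned per minute (split off as y later).
        df["Calories_per_min"] = df["Calories"] / df["Duration"]
    # Constant within a single-sex frame; only informative if groups are recombined.
    df["Sex_encoded"] = df["Sex"].map({"male": 0, "female": 1})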
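    # Body-composition features (Height is divided by 100 to convert cm to m).
    df["BMI"] = df["Weight"] / ((df["Height"] / 100) ** 2)
    # Body surface area via the Du Bois formula.
    df["BSA"] = 0.007184 * (df["Height"] ** 0.725) * (df["Weight"] ** 0.425)
    df["Ponderal_Index"] = df["Weight"] / ((df["Height"] / 100) ** 3)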
    df["HR_Duration"] = df["Heart_Rate"] * df["Duration"]
    df["HR_per_min"] = df["Heart_Rate"] / df["Duration"]
    # Age-predicted maximum heart rate ("220 minus age" rule of thumb).
    df["Max_HR"] = 220 - df["Age"]
    df["HR_pct_max"] = df["Heart_Rate"] / df["Max_HR"]
    df["HR_pct_max_x_Duration_x_BMI"] = df["HR_pct_max"] * df["Duration"] * df["BMI"]
    df["Temp_Elevation"] = df["Body_Temp"] - 37.0
    df["HR_Temp"] = df["Heart_Rate"] * df["Body_Temp"]
    df["HR_Temp_Duration"] = df["Heart_Rate"] * df["Body_Temp"] * df["Duration"]
    df["Weight_Duration"] = df["Weight"] * df["Duration"]
    df["Age_HR"] = df["Age"] * df["Heart_Rate"]
    df["Age_Duration"] = df["Age"] * df["Duration"]
    df["Age_Temp"] = df["Age"] * df["Body_Temp"]
    df["Weight_Temp"] = df["Weight"] * df["Body_Temp"]
    df["Height_Temp"] = df["Height"] * df["Body_Temp"]
    df["log_Duration"] = np.log1p(df["Duration"])
    df["Duration_squared"] = df["Duration"] ** 2
    df["sqrt_HR"] = np.sqrt(df["Heart_Rate"])
    # labels=False stores the integer bin index instead of an interval label.
    df["Age_bin"] = pd.cut(df["Age"], bins=[0, 25, 35, 45, 55, 65, 100], labels=False)
    df["Duration_bin"] = pd.cut(df["Duration"], bins=[0, 10, 20, 30, 45, 60, 90], labels=False)
    df["heart_zone"] = pd.cut(df["Heart_Rate"], bins=[0, 90, 110, 130, 300], labels=["rest", "light", "moderate", "intense"])
    df["is_overheated"] = (df["Body_Temp"] > 37.5).astype(int)
    df["Age_group"] = pd.cut(df["Age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
    df = pd.get_dummies(df, columns=["Age_group"], drop_first=True)
    df["dur_x_middle_age"] = df["Duration"] * df.get("Age_group_middle", 0)
    df["dur_x_senior"] = df["Duration"] * df.get("Age_group_senior", 0)
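    # Group means are computed within the frame passed in (one sex, train or
    # test) and use heart rate only, never the target.
    df["group_avg_HR_Age"] = df.groupby("Age")["Heart_Rate"].transform("mean")
    df["group_avg_HR_Duration"] = df.groupby("Duration")["Heart_Rate"].transform("mean")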
    df["delta_HR_age"] = df["Heart_Rate"] - df["group_avg_HR_Age"]
    df["delta_HR_dur"] = df["Heart_Rate"] - df["group_avg_HR_Duration"]
    df["exertion_index"] = (df["Heart_Rate"] - df["group_avg_HR_Age"]) * df["Duration"]
    df["physio_load"] = df["Body_Temp"] * df["Heart_Rate"] * df["Duration"] * df["BMI"]

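    # Heart-rate-based energy estimate; the coefficients appear to follow the
    # Keytel et al. (2005) equations, with 4.184 converting kJ to kcal.
    df["Calories_Burned"] = np.where(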
        df["Sex"] == "male",
        (-55.0969 + 0.6309 * df["Heart_Rate"] + 0.1988 * df["Weight"] + 0.2017 * df["Age"]) / 4.184 * df["Duration"],
        (-20.4022 + 0.4472 * df["Heart_Rate"] - 0.1263 * df["Weight"] + 0.074 * df["Age"]) / 4.184 * df["Duration"]
    )

    # Basal metabolic rate (Mifflin-St Jeor equation, kcal/day).
    df["BMR"] = np.where(
        df["Sex"] == "male",
        10 * df["Weight"] + 6.25 * df["Height"] - 5 * df["Age"] + 5,
        10 * df["Weight"] + 6.25 * df["Height"] - 5 * df["Age"] - 161
    )

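    # Drop the raw label and the Sex column (constant within this group).
    drop_cols = ["Sex", "Calories"] if is_train else ["Sex"]
    df = df.drop(columns=drop_cols, errors="ignore")
    # Age_bin and Duration_bin hold plain integer codes, so the dummy columns
    # created here depend on which bins actually occur in this frame.
    df = pd.get_dummies(df, columns=["Age_bin", "Duration_bin", "heart_zone"], drop_first=True)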
    return df

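def feature_engineering(df):
    """Return (X_male, y_male, X_female, y_female) for training data, else (df_male, df_female)."""
    # A frame is treated as training data when it carries the Calories label.
    is_train = "Calories" in df.columns
    if is_train:
        # Collapse rows with identical inputs into one row with the mean Calories.
        cols = [col for col in df.columns if col != "Calories"]
        df = df.groupby(cols, as_index=False)["Calories"].mean()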
    df_male = df[df["Sex"] == "male"].copy()
    df_female = df[df["Sex"] == "female"].copy()
    df_male = process_sex_group(df_male, is_train)
    df_female = process_sex_group(df_female, is_train)
    if is_train:
        # Calories_per_min is the target, so it is kept out of the feature matrix.
        X_male = df_male.drop(columns=["Calories_per_min"], errors="ignore")
        y_male = df_male["Calories_per_min"]
        X_female = df_female.drop(columns=["Calories_per_min"], errors="ignore")
        y_female = df_female["Calories_per_min"]
        return X_male, y_male, X_female, y_female
    else:
        return df_male, df_female
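
Because `Age_bin` and `Duration_bin` are stored as integer codes, `pd.get_dummies` only creates columns for the bins that actually occur in a given frame, so the four outputs (male/female, train/test) can end up with different columns. Below is a minimal sketch of one common way to align them, assuming the training columns are taken as the reference. The toy frames are made up, and this step is not part of the notebook's code.

```python
import pandas as pd

# Made-up frames: the test frame is missing bin 2.
train = pd.DataFrame({"Duration_bin": [0, 1, 2]})
test = pd.DataFrame({"Duration_bin": [0, 1]})

X_train = pd.get_dummies(train, columns=["Duration_bin"], drop_first=True)
X_test = pd.get_dummies(test, columns=["Duration_bin"], drop_first=True)

# Reorder the test columns to match training, adding any missing
# dummy columns filled with 0 and dropping any extras.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test_aligned.columns))
```

After alignment both frames expose the same columns in the same order, which is what most estimators expect at prediction time.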