In [1]:
pip install pandas xgboost catboost lightgbm tabPFN

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor

Declaring the DataFrame and splitting the data into the feature matrix X and the target vector y

A dataset for a regression task; the target is the EnergyConsumption column.
5000 rows, 12 columns, 7 numerical features.
You can find the dataset here: https://www.kaggle.com/datasets/ajinilpatel/energy-consumption-prediction

In [3]:
df_reg = pd.read_csv('Energy_consumption_dataset.csv')
df_reg

Unnamed: 0,Month,Hour,DayOfWeek,Holiday,Temperature,Humidity,SquareFootage,Occupancy,HVACUsage,LightingUsage,RenewableEnergy,EnergyConsumption
0,1,0,Saturday,No,25.139433,43.431581,1565.693999,5,On,Off,2.774699,75.364373
1,1,1,Saturday,No,27.731651,54.225919,1411.064918,1,On,On,21.831384,83.401855
2,1,2,Saturday,No,28.704277,58.907658,1755.715009,2,Off,Off,6.764672,78.270888
3,1,3,Saturday,No,20.080469,50.371637,1452.316318,1,Off,On,8.623447,56.519850
4,1,4,Saturday,No,23.097359,51.401421,1094.130359,9,On,Off,3.071969,70.811732
...,...,...,...,...,...,...,...,...,...,...,...,...
4995,12,6,Sunday,Yes,26.338718,52.580000,1563.567259,7,On,On,20.591717,70.270344
4996,12,17,Monday,No,20.007565,42.765607,1999.982252,5,Off,On,7.536319,73.943071
4997,12,13,Thursday,Yes,26.226253,30.015975,1999.982252,5,Off,On,28.162193,85.784613
4998,12,8,Saturday,Yes,24.673206,50.223939,1240.811298,2,On,On,20.918483,63.784001


In [4]:
df_reg.dtypes

Month                  int64
Hour                   int64
DayOfWeek             object
Holiday               object
Temperature          float64
Humidity             float64
SquareFootage        float64
Occupancy              int64
HVACUsage             object
LightingUsage         object
RenewableEnergy      float64
EnergyConsumption    float64
dtype: object

### 8:
Creating a preprocessing function:
* I decided to create sub functions first and call them from a 'main' function.

A: CHECKING FOR MISSING VALUES (THERE ARE NONE) AND CATEGORICAL COLUMNS

In [5]:
df_reg.isnull().sum()

Month                0
Hour                 0
DayOfWeek            0
Holiday              0
Temperature          0
Humidity             0
SquareFootage        0
Occupancy            0
HVACUsage            0
LightingUsage        0
RenewableEnergy      0
EnergyConsumption    0
dtype: int64

In [6]:
def handle_missing_values(df):
    print("Checking for missing values:")
    print(df.isnull().sum())
    # df = df.fillna(...)  # not needed: this dataset has no null values
    return df

def remove_high_cardinality_columns(df, max_unique=4):
    cat_cols = df.select_dtypes(include=['object', 'category','string']).columns
    removed_cols = []

    for col in cat_cols:
        if df[col].nunique() > max_unique:
            df.drop(columns=[col], inplace=True)
            removed_cols.append(col)

    print(f"Removed high-cardinality columns (>{max_unique} unique values): {removed_cols}")
    return df

def encode_categorical_variables(df):
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
    print(f"Encoded categorical columns: {list(cat_cols)}")
    return df_encoded
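
To make the encoding step concrete, here is a minimal sketch on a toy frame. The three rows are invented for illustration; only the column names come from the dataset. With `drop_first=True`, each binary column becomes a single indicator, which is why 7 numerical features plus 3 encoded columns give the 10 features reported after preprocessing.

```python
import pandas as pd

# Toy frame with two of the binary categorical columns (values are made up).
toy = pd.DataFrame({
    "Holiday": ["No", "Yes", "No"],
    "HVACUsage": ["On", "Off", "On"],
    "Temperature": [25.1, 27.7, 20.0],
})

# drop_first=True keeps one indicator per binary column instead of two,
# which avoids perfectly collinear dummy columns in linear models.
encoded = pd.get_dummies(toy, columns=["Holiday", "HVACUsage"], drop_first=True)
print(list(encoded.columns))
```

The dropped category (here "No" and "Off") becomes the baseline that the remaining indicator is measured against.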


B: SPLITTING FOR TRAIN/TEST:

In [7]:
def split_data_reg(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
    return X_train, X_test, y_train, y_test


C: STANDARDIZE THE DATA:

In [8]:
def standardize_features_reg(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    print("Standardization complete.")
    return X_train_scaled, X_test_scaled


In [9]:
def preprocess_regression_data(df, target_column):
    X_reg = df.drop(columns=[target_column])
    y_reg = df[target_column]
    
    X_reg = handle_missing_values(X_reg)
    X_reg = remove_high_cardinality_columns(X_reg)
    X_reg = encode_categorical_variables(X_reg)

    X_train_reg, X_test_reg, y_train_reg, y_test_reg = split_data_reg(X_reg, y_reg)

    X_train_scaled_reg, X_test_scaled_reg = standardize_features_reg(X_train_reg, X_test_reg)

    return X_train_scaled_reg, X_test_scaled_reg, y_train_reg, y_test_reg


In [10]:
X_train_scaled_reg, X_test_scaled_reg, y_train_reg, y_test_reg = preprocess_regression_data(df_reg, 'EnergyConsumption')

Checking for missing values:
Month              0
Hour               0
DayOfWeek          0
Holiday            0
Temperature        0
Humidity           0
SquareFootage      0
Occupancy          0
HVACUsage          0
LightingUsage      0
RenewableEnergy    0
dtype: int64
Removed high-cardinality columns (>4 unique values): ['DayOfWeek']
Encoded categorical columns: ['Holiday', 'HVACUsage', 'LightingUsage']
Training set: (4000, 10), Test set: (1000, 10)
Standardization complete.


### 9:
Train a linear regression model:

In [11]:
lr_model = LinearRegression()

lr_model.fit(X_train_scaled_reg, y_train_reg)

### 10: 
Making Predictions and evaluating the model:

In [12]:
# Custom implementation of MAPE 
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


In [13]:
# Predictions on test set and train set
y_train_pred_lr = lr_model.predict(X_train_scaled_reg)
y_test_pred_lr = lr_model.predict(X_test_scaled_reg)

# Metrics (mean_squared_error returns the MSE, so take the square root for RMSE):
r2_train = r2_score(y_train_reg, y_train_pred_lr)
rmse_train = np.sqrt(mean_squared_error(y_train_reg, y_train_pred_lr))
mae_train = mean_absolute_error(y_train_reg, y_train_pred_lr)
mape_train = mean_absolute_percentage_error(y_train_reg, y_train_pred_lr)

r2_test = r2_score(y_test_reg, y_test_pred_lr)
rmse_test = np.sqrt(mean_squared_error(y_test_reg, y_test_pred_lr))
mae_test = mean_absolute_error(y_test_reg, y_test_pred_lr)
mape_test = mean_absolute_percentage_error(y_test_reg, y_test_pred_lr)

print("Linear Regression Evaluation Metrics:\n")

print("Training Set:")
print(f"R^2: {r2_train:.4f}")
print(f"RMSE: {rmse_train:.4f}")
print(f"MAE: {mae_train:.4f}")
print(f"MAPE: {mape_train:.2f}%\n")

print("Test Set:")
print(f"R^2: {r2_test:.4f}")
print(f"RMSE: {rmse_test:.4f}")
print(f"MAE: {mae_test:.4f}")
print(f"MAPE: {mape_test:.2f}%")


Linear Regression Evaluation Metrics:

Training Set:
R^2: 0.3261
RMSE: 7.5969
MAE: 6.0637
MAPE: 8.10%

Test Set:
R^2: 0.2682
RMSE: 7.8156
MAE: 6.1669
MAPE: 8.24%


### 11:
Polynomial Regression

Transforms the features into polynomial features of the given degree.
Returns a new NumPy array X_poly.

In [14]:
def create_polynomial_features(X, degree):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    print(f"Polynomial features created for degree {degree}. Shape: {X_poly.shape}")
    return X_poly
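
As a quick sanity check on how fast the feature count grows, the sketch below compares scikit-learn's own count with the closed-form number of monomials, C(n + d, d) - 1 for n input features and degree d (the constant term is excluded because `include_bias=False`). The helper name `n_poly_features` and the dummy matrix are assumptions made for illustration; n = 10 matches the preprocessed training matrix.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    # Number of monomials of total degree 1..degree in n_features variables
    # (no constant term, matching include_bias=False).
    return comb(n_features + degree, degree) - 1


X_dummy = np.zeros((1, 10))  # 10 columns, like the preprocessed training matrix
for d in (2, 3, 4):
    poly = PolynomialFeatures(degree=d, include_bias=False).fit(X_dummy)
    print(d, poly.n_output_features_, n_poly_features(10, d))
```

For 10 input features this gives 65, 285, and 1000 columns for degrees 2, 3, and 4.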


Splitting the data that is passed in into train/test subsets and training Ridge and Lasso models on each polynomial degree:

In [15]:
def evaluate_polynomial_models(X, y, degrees=(2, 3, 4)):
    results = []

    for d in degrees:
        print(f"\nEvaluating Polynomial Degree: {d}")

        # Create polynomial features
        X_poly = create_polynomial_features(X, d)

        # Split and scale
        X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
        X_train_scaled, X_test_scaled = standardize_features_reg(X_train, X_test)

        # Ridge Regression with cross-validation
        ridge_model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5)
        ridge_model.fit(X_train_scaled, y_train)
        y_pred_ridge = ridge_model.predict(X_test_scaled)
        r2_ridge = r2_score(y_test, y_pred_ridge)

        # Lasso Regression with cross-validation
        lasso_model = LassoCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5, max_iter=20000)
        lasso_model.fit(X_train_scaled, y_train)
        y_pred_lasso = lasso_model.predict(X_test_scaled)
        r2_lasso = r2_score(y_test, y_pred_lasso)

        results.append({
            'degree': d,
            'ridge_alpha': ridge_model.alpha_,
            'lasso_alpha': lasso_model.alpha_,
            'ridge_r2': r2_ridge,
            'lasso_r2': r2_lasso,
            'best_model': 'Ridge' if r2_ridge > r2_lasso else 'Lasso',
            'best_r2': max(r2_ridge, r2_lasso)
        })

    return results


Run the models and retrieve the best model:

In [16]:
results = evaluate_polynomial_models(X_train_scaled_reg, y_train_reg)

print("\nSummary of Polynomial Models:")
for res in results:
    print(f"Degree {res['degree']}:")
    print(f"  Ridge - Alpha: {res['ridge_alpha']} | R²: {res['ridge_r2']:.4f}")
    print(f"  Lasso - Alpha: {res['lasso_alpha']} | R²: {res['lasso_r2']:.4f}")
    print(f"  Best Model: {res['best_model']} with R²: {res['best_r2']:.4f}\n")

best = max(results, key=lambda x: x['best_r2'])
print("🏆 Best Overall Model:")
print(f"Degree: {best['degree']}")
print(f"Model: {best['best_model']}")
print(f"R²: {best['best_r2']:.4f}")
print(f"Alpha: {best['ridge_alpha'] if best['best_model']=='Ridge' else best['lasso_alpha']}")



Evaluating Polynomial Degree: 2
Polynomial features created for degree 2. Shape: (4000, 65)
Standardization complete.

Evaluating Polynomial Degree: 3
Polynomial features created for degree 3. Shape: (4000, 285)
Standardization complete.

Evaluating Polynomial Degree: 4
Polynomial features created for degree 4. Shape: (4000, 1000)
Standardization complete.

Summary of Polynomial Models:
Degree 2:
  Ridge - Alpha: 100.0 | R²: 0.3238
  Lasso - Alpha: 0.1 | R²: 0.3340
  Best Model: Lasso with R²: 0.3340

Degree 3:
  Ridge - Alpha: 100.0 | R²: 0.2945
  Lasso - Alpha: 0.1 | R²: 0.3333
  Best Model: Lasso with R²: 0.3333

Degree 4:
  Ridge - Alpha: 100.0 | R²: 0.1521
  Lasso - Alpha: 0.1 | R²: 0.3105
  Best Model: Lasso with R²: 0.3105

🏆 Best Overall Model:
Degree: 2
Model: Lasso
R²: 0.3340
Alpha: 0.1


### 12:
What can be the problem with creating variables in degree 2+ and how can you solve it?
* The main problem is that the number of features grows rapidly with the degree (here from 10 original features to 65, 285, and 1000 for degrees 2, 3, and 4), which can lead to overfitting and reduces the model's ability to generalize.
* We can mitigate it with regularization methods such as Lasso and Ridge, which are explained in the next question.

What are Lasso and Ridge, when to use and what's the main difference between them?
* Lasso and Ridge are regularized linear models that penalize large coefficients.
* Use Lasso when you believe that some features are irrelevant: it shrinks their coefficients to exactly zero, acting as a form of automatic **feature selection**.
* Use Ridge when you believe that all features contribute to the outcome but you want to prevent any single feature from dominating. Ridge shrinks all coefficients without removing them.
* Main differences:
* Lasso can eliminate features by setting their coefficients to zero (L1 regularization)
* Ridge reduces the magnitude of all coefficients (L2 regularization)
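
The difference between the two penalties can be seen on a small synthetic problem. This is only a sketch under stated assumptions: the data are randomly generated (two informative features, eight pure-noise features), and the alpha values simply reuse the ones selected by cross-validation in step 11.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence the target; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

Lasso is expected to zero out most of the noise features, while Ridge keeps every coefficient non-zero and only shrinks them.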

### 13:
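Making predictions on the train set and test set for the best model from 11 and displaying the metrics. Note that both sets here come from re-splitting the training data of step 8 (80/20), so the "test" set below is a hold-out from the training data, not the test set from step 8: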

🏆 Best Overall Model:
Degree: 2
Model: Lasso
R²: 0.3340
Alpha: 0.1

In [17]:
# Recreate polynomial features for degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train_scaled_reg)

# Split and scale again
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y_train_reg, test_size=0.2, random_state=42
)

X_train_poly_scaled, X_test_poly_scaled = standardize_features_reg(X_train_poly, X_test_poly)


Standardization complete.


In [18]:
best_lasso = LassoCV(alphas=[0.1], cv=5, max_iter=20000)
best_lasso.fit(X_train_poly_scaled, y_train_poly)

y_train_pred_best_lasso = best_lasso.predict(X_train_poly_scaled)
y_test_pred_best_lasso = best_lasso.predict(X_test_poly_scaled)


In [19]:
def evaluate_model(y_true, y_pred, label="Set"):
    r2 = r2_score(y_true, y_pred)
    # mean_squared_error returns the MSE; take the square root to get RMSE
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print(f"{label} Evaluation:")
    print(f"R^2: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"MAPE: {mape:.2f}%\n")


In [20]:
# Evaluate on both sets
evaluate_model(y_train_poly, y_train_pred_best_lasso, "Training Set")
evaluate_model(y_test_poly, y_test_pred_best_lasso, "Test Set")


Training Set Evaluation:
R^2: 0.3376
RMSE: 7.5740
MAE: 6.0261
MAPE: 8.06%

Test Set Evaluation:
R^2: 0.3340
RMSE: 7.3804
MAE: 5.9139
MAPE: 7.83%



### 14: 
Train models with GridSearchCV on the training set from step 8

In [21]:
def train_model_with_gridsearch(model, param_grid, X_train, y_train):
    grid = GridSearchCV(estimator=model, param_grid=param_grid, 
                        cv=5, scoring='r2', n_jobs=-1)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_

In [22]:
XGB_model = XGBRegressor()
param_grid_XGB = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5]
}

best_XGBRegressor, best_XGB_params = train_model_with_gridsearch(
    XGB_model, param_grid_XGB, X_train_scaled_reg, y_train_reg
)
print("Best XGBRegressor params:", best_XGB_params)


Best XGBRegressor params: {'max_depth': 3, 'n_estimators': 100}


In [23]:
CatBoost_model = CatBoostRegressor(verbose=0)
param_grid_CatBoost = {
    'depth': [4, 6],
    'learning_rate': [0.01, 0.1]
}

best_CatBoostRegressor, best_CatBoost_params = train_model_with_gridsearch(
    CatBoost_model, param_grid_CatBoost, X_train_scaled_reg, y_train_reg
)
print("Best CatBoostRegressor params:", best_CatBoost_params)

Best CatBoostRegressor params: {'depth': 4, 'learning_rate': 0.01}


In [24]:
LGBM_model = LGBMRegressor()
param_grid_LGBM = {
    'num_leaves': [31, 64],
    'learning_rate': [0.01, 0.1]
}

best_LGBMRegressor, best_LGBM_params = train_model_with_gridsearch(
    LGBM_model, param_grid_LGBM, X_train_scaled_reg, y_train_reg
)
print("Best LGBMRegressor params:", best_LGBM_params)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000719 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1078
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 10
[LightGBM] [Info] Start training from score 76.828375
Best LGBMRegressor params: {'learning_rate': 0.1, 'num_leaves': 31}



In [25]:
RF_model = RandomForestRegressor()
param_grid_RF = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10]
}

best_RFRegressor, best_RF_params = train_model_with_gridsearch(
    RF_model, param_grid_RF, X_train_scaled_reg, y_train_reg
)
print(best_RF_params)

{'max_depth': 10, 'n_estimators': 100}


In [26]:
from tabpfn import TabPFNRegressor
import torch


# Initialize the model
tabpfn_model = TabPFNRegressor(device='cuda' if torch.cuda.is_available() else 'cpu', ignore_pretraining_limits=True)

# Fit on the training set (already scaled from step 8)
tabpfn_model.fit(X_train_scaled_reg, y_train_reg)


### 15:
Making predictions on train set and test set for each of the models and displaying the metrics:

In [27]:
def print_metrics(y_train, y_train_pred, y_test, y_test_pred, model_name):
    # mean_squared_error returns the MSE; take the square root to get RMSE
    print(f"\n{model_name} Evaluation")

    print("Training Set:")
    print(f"R^2:   {r2_score(y_train, y_train_pred):.4f}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.4f}")
    print(f"MAE:  {mean_absolute_error(y_train, y_train_pred):.4f}")
    print(f"MAPE: {mean_absolute_percentage_error(y_train, y_train_pred):.2f}%")

    print("\nTest Set:")
    print(f"R^2:   {r2_score(y_test, y_test_pred):.4f}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")
    print(f"MAE:  {mean_absolute_error(y_test, y_test_pred):.4f}")
    print(f"MAPE: {mean_absolute_percentage_error(y_test, y_test_pred):.2f}%")


In [28]:
y_train_pred_xgb = best_XGBRegressor.predict(X_train_scaled_reg)
y_test_pred_xgb = best_XGBRegressor.predict(X_test_scaled_reg)

print_metrics(y_train_reg, y_train_pred_xgb, y_test_reg, y_test_pred_xgb, "XGBoost")


XGBoost Evaluation
Training Set:
R^2:   0.5234
RMSE: 6.3884
MAE:  5.0793
MAPE: 6.77%

Test Set:
R^2:   0.2228
RMSE: 8.0541
MAE:  6.2978
MAPE: 8.41%


In [29]:
y_train_pred_cat = best_CatBoostRegressor.predict(X_train_scaled_reg)
y_test_pred_cat = best_CatBoostRegressor.predict(X_test_scaled_reg)

print_metrics(y_train_reg, y_train_pred_cat, y_test_reg, y_test_pred_cat, "CatBoost")


CatBoost Evaluation
Training Set:
R^2:   0.3908
RMSE: 7.2227
MAE:  5.7540
MAPE: 7.68%

Test Set:
R^2:   0.2809
RMSE: 7.7475
MAE:  6.1010
MAPE: 8.15%


In [30]:
y_train_pred_lgbm = best_LGBMRegressor.predict(X_train_scaled_reg)
y_test_pred_lgbm = best_LGBMRegressor.predict(X_test_scaled_reg)

print_metrics(y_train_reg, y_train_pred_lgbm, y_test_reg, y_test_pred_lgbm, "LGBM")


LGBM Evaluation
Training Set:
R^2:   0.6599
RMSE: 5.3970
MAE:  4.2708
MAPE: 5.70%

Test Set:
R^2:   0.2424
RMSE: 7.9521
MAE:  6.2699
MAPE: 8.37%


In [31]:
y_train_pred_rf = best_RFRegressor.predict(X_train_scaled_reg)
y_test_pred_rf = best_RFRegressor.predict(X_test_scaled_reg)

print_metrics(y_train_reg, y_train_pred_rf, y_test_reg, y_test_pred_rf, "Random Forest")


Random Forest Evaluation
Training Set:
R^2:   0.6968
RMSE: 5.0954
MAE:  4.1046
MAPE: 5.48%

Test Set:
R^2:   0.2439
RMSE: 7.9439
MAE:  6.2868
MAPE: 8.40%


Due to computational limitations, TabPFN predictions were made on a 1000-sample subset of the training set (the test set already has exactly 1000 rows). While this doesn't allow a fully fair comparison, it offers insight into TabPFN's performance characteristics on limited data.

In [32]:
# The X arrays are NumPy arrays; the y Series are sliced by position with .iloc
X_train_small_reg = X_train_scaled_reg[:1000]
y_train_small_reg = y_train_reg.iloc[:1000]

X_test_small_reg = X_test_scaled_reg[:1000]
y_test_small_reg = y_test_reg.iloc[:1000]

y_train_tabpfn = tabpfn_model.predict(X_train_small_reg)
y_test_tabpfn = tabpfn_model.predict(X_test_small_reg)
print_metrics(y_train_small_reg, y_train_tabpfn, y_test_small_reg, y_test_tabpfn, "tabPFN")


tabPFN Evaluation
Training Set:
R^2:   0.3651
RMSE: 7.6869
MAE:  6.1431
MAPE: 8.20%

Test Set:
R^2:   0.2828
RMSE: 7.7369
MAE:  6.0756
MAPE: 8.12%


However, I also ran the predictions in VS Code on my local GPU on the full data, using print_metrics(y_train_reg, y_train_pred_tabpfn, y_test_reg, y_test_pred_tabpfn, "tabPFN"), and I recorded those results in the Markdown table below so that they are consistent with my answer in 17.a.

In [33]:
# Run this on a GPU; on CPU it gets stuck.
# y_train_pred_tabpfn = tabpfn_model.predict(X_train_scaled_reg)
# y_test_pred_tabpfn = tabpfn_model.predict(X_test_scaled_reg)
# print_metrics(y_train_reg, y_train_pred_tabpfn, y_test_reg, y_test_pred_tabpfn, "tabPFN")

### 16:
Creating summary table with the 5 models:

In [35]:
def rmse(y_true, y_pred):
    # mean_squared_error returns the MSE; RMSE is its square root
    return np.sqrt(mean_squared_error(y_true, y_pred))

summary = {
    'Model': ['XGBoost', 'CatBoost', 'LGBM', 'Random Forest', 'tabPFN','LinearRegression', 'Best Lasso'],
    'RMSE Train': [
        rmse(y_train_reg, y_train_pred_xgb),
        rmse(y_train_reg, y_train_pred_cat),
        rmse(y_train_reg, y_train_pred_lgbm),
        rmse(y_train_reg, y_train_pred_rf),
        rmse(y_train_small_reg, y_train_tabpfn),
        rmse(y_train_reg, y_train_pred_lr),
        rmse(y_train_poly, y_train_pred_best_lasso)
    ],
    
    'RMSE Test': [
        rmse(y_test_reg, y_test_pred_xgb),
        rmse(y_test_reg, y_test_pred_cat),
        rmse(y_test_reg, y_test_pred_lgbm),
        rmse(y_test_reg, y_test_pred_rf),
        rmse(y_test_reg, y_test_tabpfn),
        rmse(y_test_reg, y_test_pred_lr),
        rmse(y_test_poly, y_test_pred_best_lasso)
    ]
}

df_summary = pd.DataFrame(summary).round(4)
print(df_summary)


              Model  RMSE Train  RMSE Test
0           XGBoost      6.3884     8.0541
1          CatBoost      7.2227     7.7475
2              LGBM      5.3970     7.9521
3     Random Forest      5.0954     7.9439
4            tabPFN      7.6869     7.7369
5  LinearRegression      7.5969     7.8156
6        Best Lasso      7.5740     7.3804


The table is written out in Markdown below, using the results from my local run, so that it is consistent with my answer in 17.a.


| Model             | RMSE Train | RMSE Test |
|-------------------|------------|-----------|
| XGBoost           | 6.3884     | 8.0541    |
| CatBoost          | 7.2227     | 7.7475    |
| LGBM              | 5.3970     | 7.9521    |
| Random Forest     | 5.0619     | 7.9152    |
| tabPFN            | 7.3502     | 7.7369    |
| Linear Regression | 7.5969     | 7.8156    |

### 17:

17.a: 


Who is the best model?
* Based on the RMSE values in the summary table, the best-performing model on the test set is tabPFN, with the lowest RMSE of 7.74. Best Lasso's lower value in the printed table is not comparable: it was measured on a hold-out split of the training data (step 13), not on this test set.

Explain shortly the models' results
* XGBoostRegressor: The model performs well on training but significantly worse on test data, indicating overfitting - it learned the training set patterns too well, but doesn't generalize effectively.
 
* CatBoostRegressor: A balanced result - the test RMSE is close to the training RMSE. This suggests good generalization and a solid trade-off between bias and variance. One of the more stable models I got.

* LGBMRegressor: Very low training error but much higher error on the test set - a classic case of overfitting. The model likely memorized the training data and doesn't generalize well.

* RFRegressor: Extremely low training error and a large gap with test error - strong overfitting. While it excels on the training data, it performs poorly on unseen data.

* tabPFNRegressor: Balanced performance. The small gap between train and test RMSE shows that tabPFN generalizes well. Among all models, it has the lowest test RMSE, making it the best performer overall in terms of predictive accuracy.

* LinearRegression: This model is very stable, with a small difference between training and test errors. It overfits only slightly, but its test RMSE is higher than that of tabPFN and CatBoost, meaning it is reliable but less accurate.

Explain each of the 4 evaluation metrics.
* R^2 measures how well the model explains the variance in the target variable. Its maximum is 1 (a perfect fit); 0 means the model does no better than always predicting the mean, and the value can be negative on unseen data when the model does worse than that. Higher values indicate a better fit.
In the context of energy consumption prediction, a high R² means the model successfully captures the patterns in features like temperature and occupancy to explain fluctuations in energy usage. 
* RMSE is measured in the same units as the target variable (in this case, kilowatt-hours, kWh), which makes it highly interpretable. It is the square root of the average squared difference between predicted and actual values, which means it penalizes larger errors more heavily than smaller ones. This makes RMSE especially useful in my case: when predicting energy usage, a few large mistakes (e.g., underestimating consumption during peak hours) can have operational or financial consequences. Therefore, RMSE is a key metric in this task, as it reflects both average error and sensitivity to outliers.
* MAE is also measured in the same units as the target variable and represents the average absolute difference between the predicted and actual values. Unlike RMSE, MAE weights all errors in proportion to their size, without extra emphasis on large ones, making it more robust to outliers. In the context of this project, MAE tells you, on average, by how many kWh (EnergyConsumption) the predictions deviate from the true values. This is useful when you want a straightforward, reliable estimate of how far off the model tends to be, especially if both small and large errors are equally important to your stakeholders.

* MAPE expresses error as a percentage of the actual values, making it unitless and easy to interpret across different datasets or use cases. For example, a MAPE of 8% means your model is off by 8% on average, or conversely, 92% accurate. This is particularly helpful for communicating model performance to non-technical audiences. In this project, MAPE allows you to quantify how reliable the energy predictions are in relative terms - a key consideration if different buildings or systems have widely varying levels of energy consumption.
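
The four metrics can be checked by hand on a tiny example. The numbers below are invented for illustration and are not taken from the dataset; the second prediction vector is deliberately bad to show that R² can drop below zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([70.0, 80.0, 90.0, 100.0])
y_pred = np.array([72.0, 78.0, 93.0, 95.0])

errors = y_true - y_pred                        # [-2, 2, -3, 5]
mse = mean_squared_error(y_true, y_pred)        # (4 + 4 + 9 + 25) / 4 = 10.5
rmse = np.sqrt(mse)                             # square root of the MSE
mae = mean_absolute_error(y_true, y_pred)       # (2 + 2 + 3 + 5) / 4 = 3.0
mape = np.mean(np.abs(errors / y_true)) * 100   # mean relative error, in percent
r2 = r2_score(y_true, y_pred)                   # 1 - 42 / 500 = 0.916

# Predictions that are worse than always predicting the mean give a negative R^2.
y_bad = np.array([100.0, 90.0, 80.0, 70.0])
r2_bad = r2_score(y_true, y_bad)                # 1 - 2000 / 500 = -3.0

print(f"RMSE={rmse:.4f} MAE={mae:.4f} MAPE={mape:.2f}% R2={r2:.3f} R2_bad={r2_bad:.1f}")
```

RMSE is always at least as large as MAE on the same errors, so an "RMSE" far above the MAE is often a sign that the square root was skipped.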

17.b:
Related to this context, which is the best suited metric?

* RMSE is the most appropriate and informative metric. Here's why:

RMSE is measured in the same units as the target variable - kilowatt-hours (kWh) - and it penalizes large errors more heavily than smaller ones. This is particularly important in energy forecasting, where a few significant underestimations or overestimations can lead to real-world problems, such as power shortages, overproduction, or inefficient energy distribution. Unlike MAE, which treats all errors equally, RMSE gives more weight to large deviations - and in this context, such deviations can have operational and financial consequences.

Additionally, RMSE is commonly used in energy analytics and engineering applications for its sensitivity to peak error scenarios, making it a natural fit when modeling and optimizing for resource consumption.
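
The claim that RMSE weights large deviations more heavily than MAE can be illustrated with two hypothetical error profiles that share the same MAE. The helper names `rmse_of` and `mae_of` and the error values are assumptions made for this sketch.

```python
import numpy as np


def rmse_of(errors):
    # Root of the mean squared error.
    return float(np.sqrt(np.mean(np.square(errors))))


def mae_of(errors):
    # Mean of the absolute errors.
    return float(np.mean(np.abs(errors)))


even_errors = np.array([5.0, 5.0, 5.0, 5.0])     # four moderate misses
spiky_errors = np.array([0.0, 0.0, 0.0, 20.0])   # one large miss, e.g. a peak hour

print("even :", mae_of(even_errors), rmse_of(even_errors))
print("spiky:", mae_of(spiky_errors), rmse_of(spiky_errors))
```

Both profiles have an MAE of 5, but the RMSE doubles from 5 to 10 when the same total error is concentrated in a single prediction.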


