# **DESCRIPTION**

This project consist in a forecasting of cost for different industrial supplies from a company

# **Project Develop**

## *Install Libraries*

In [1]:
!pip install pandas
!pip install numpy
!pip install openpyxl
!pip install scikit-learn
!pip install lightgbm

## *Import Libraries*

In [2]:
import pandas as pd
import numpy as np
import openpyxl
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



## *Import data*

Import data from an Excel file provided by the company

In [3]:
df_purchases = pd.read_excel("COMPRAS.xlsx", index_col=0)

In [4]:
df_purchases.head(10)

Unnamed: 0_level_0,supplier_order_id,order_date,supplier_name,position_supply,supply_id,supply_reference,unit_value,discount,delivery_date,quantity,pending,deliv_date_1,deliv_quant_1,deliv_note_1,deliv_date_2,deliv_quant_2,deliv_note_2,deliv_date_3,deliv_quant_3,deliv_note_3
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
5252,64,2013-05-16,OFFICINE SANTAFEDE,1.0,16556,"BWN 6""900RTJ ID142,88mm F44",1020.0,0.0,2013-11-18,50.0,0.0,,28.0,,,22.0,,,0.0,
5253,64,2013-05-16,OFFICINE SANTAFEDE,2.0,16557,FORGING ROUND F44,2560.0,0.0,2013-11-18,25.0,0.0,,25.0,,,0.0,,,0.0,
5589,1,2013-09-02,Aceros y Equipos S.L.,1.0,16548,"BRE 76,2MM HAST C276",44.5,0.0,2013-09-04,1.6,0.0,,1.6,,,0.0,,,0.0,
5590,2,2013-09-03,UTILES Y MAQUINAS INDUSTRIALES,1.0,15728,"TP1R 0,5mm",420.6,0.0,2013-09-10,3.0,0.0,,3.0,,,0.0,,,0.0,
5591,2,2013-09-03,UTILES Y MAQUINAS INDUSTRIALES,2.0,16383,VAR 4.5X2X1.2X1000 A710,19.6,0.0,2013-09-10,5.0,0.0,,5.0,,,0.0,,,0.0,
5592,3,2013-09-18,OFFICINE SANTAFEDE,1.0,15774,"TPI 4-1/16""X4""2500",5200.0,0.0,2013-10-21,1.0,0.0,,1.0,,,0.0,,,0.0,
5502,4,2013-09-20,ThyssenKrupp Materials Ibérica,3.0,12029,BRE 165 AISI-304L,2.55,0.0,2013-10-08,112.0,0.0,,112.0,,,0.0,,,0.0,
5503,5,2013-09-20,Aceros y Equipos S.L.,1.0,11819,BRE 55 AISI-321,4.7,0.0,2013-09-24,112.0,0.0,,112.0,,,0.0,,,0.0,
5504,6,2013-09-20,"Empresa Santa Lucía, S.A.",1.0,11715,BRE 35 A5,14.6,0.0,2013-09-27,132.0,0.0,,132.0,,,0.0,,,0.0,
5505,6,2013-09-20,"Empresa Santa Lucía, S.A.",2.0,11742,BRE 35 ALLOY400,27.5,0.0,2013-09-27,91.0,0.0,,91.0,,,0.0,,,0.0,


## *Data Resume*

In [5]:
df_purchases.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10775 entries, 5252 to 10818
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   supplier_order_id  10775 non-null  int64         
 1   order_date         10775 non-null  datetime64[ns]
 2   supplier_name      10775 non-null  object        
 3   position_supply    10763 non-null  float64       
 4   supply_id          10775 non-null  int64         
 5   supply_reference   10775 non-null  object        
 6   unit_value         10775 non-null  float64       
 7   discount           10775 non-null  float64       
 8   delivery_date      10420 non-null  datetime64[ns]
 9   quantity           10775 non-null  float64       
 10  pending            10775 non-null  float64       
 11  deliv_date_1       574 non-null    float64       
 12  deliv_quant_1      10692 non-null  float64       
 13  deliv_note_1       574 non-null    object        
 14  deliv_da

In [6]:
df_purchases.describe()

Unnamed: 0,supplier_order_id,order_date,position_supply,supply_id,unit_value,discount,delivery_date,quantity,pending,deliv_date_1,deliv_quant_1,deliv_date_2,deliv_quant_2,deliv_date_3,deliv_quant_3
count,10775.0,10775,10763.0,10775.0,10775.0,10775.0,10420,10775.0,10775.0,574.0,10692.0,248.0,10692.0,31.0,10692.0
mean,1466.358515,2019-04-04 08:59:46.969837568,8.807582,15884.09884,19344.75,0.103745,2019-02-21 04:31:00.115163136,17076570000.0,0.663651,45467.355401,17209130000.0,45354.991935,7.826178,45334.16129,1.83605
min,1.0,2013-05-16 00:00:00,1.0,11200.0,0.0,0.0,2012-11-30 00:00:00,0.0,0.0,45295.0,0.0,45295.0,0.0,45294.0,0.0
25%,818.5,2016-10-03 12:00:00,1.0,13938.5,4.26,0.0,2016-09-05 00:00:00,2.0,0.0,45404.0,1.0,45310.0,0.0,45294.0,0.0
50%,1485.0,2019-01-29 00:00:00,3.0,15834.0,19.5,0.0,2018-11-12 00:00:00,6.0,0.0,45455.0,4.0,45334.0,0.0,45348.0,0.0
75%,2151.5,2022-01-07 00:00:00,9.0,17876.5,100.0,0.0,2021-09-07 00:00:00,30.0,0.0,45544.0,24.0,45370.0,0.0,45369.0,0.0
max,2903.0,2024-12-16 00:00:00,105.0,20060.0,10000000.0,40.0,2024-12-02 00:00:00,184000000000000.0,1500.0,45628.0,184000000000000.0,45635.0,5000.0,45373.0,11700.0
std,824.794745,,13.565559,2375.611046,290580.7,1.417428,,1772594000000.0,16.886231,84.396032,1779460000000.0,64.945009,87.111424,33.843854,116.367555


Show number of rows for each product

In [7]:
product_counts = df_purchases['supply_reference'].value_counts()
product_counts.head(50)

Unnamed: 0_level_0,count
supply_reference,Unnamed: 1_level_1
TRANSPORTE,413
CORTE,256
TERMO BIME,155
PORTES,122
CERTIFICADO 3.1,79
2016.JL.2L.S.XX,76
PACKING,50
EXTRA COST,49
MECANIZADO VARIOS,45
CALIBRACION BIMETA,44


## *Data Engineering*

Columns that do not provide data of interest are eliminated.

In [8]:
df_purchases = df_purchases.drop(columns=["supplier_order_id","position_supply","supply_id","discount","pending",
                    "deliv_date_1","deliv_quant_1","deliv_note_1",
                    "deliv_date_2","deliv_quant_2","deliv_note_2",
                    "deliv_date_3","deliv_quant_3","deliv_note_3"])

Fill data for items not delivered with the last day of working before christmas holidays

In [9]:
df_purchases['delivery_date'] = df_purchases['delivery_date'].fillna(pd.Timestamp('2024-12-20'))

Change the order of columns in dataframe

In [10]:
new_column_order = ["order_date", "delivery_date", "supplier_name", "supply_reference","unit_value","quantity"]
df_purchases = df_purchases[new_column_order]


Calculation of the relative change in the unit price of a product compared to previous purchases

In [11]:
df_purchases = df_purchases.sort_values(by=['supply_reference', 'order_date'])

# Calculate the previous unit price for each product
df_purchases['previous_unit_value'] = df_purchases.groupby('supply_reference')['unit_value'].shift(1)

# Calculate the rate of change in the unit price
df_purchases['price_change_rate'] = ((df_purchases['unit_value'] - df_purchases['previous_unit_value']) / df_purchases['previous_unit_value']) * 100

# Fill the NaN values (which appear for the first purchase of each product) with 0 or an appropriate value
df_purchases['price_change_rate'] = df_purchases['price_change_rate'].fillna(0)

Verify if infinite or NaN values in new colum

In [12]:
num_infinite_values = np.isinf(df_purchases['price_change_rate']).sum()
num_nan_values = df_purchases['price_change_rate'].isnull().sum()

print(f"Infinites values: {num_infinite_values}; NaN values: {num_nan_values}")

Infinites values: 46; NaN values: 0


Replacing infinite values

In [13]:
df_purchases['price_change_rate'].replace([np.inf, -np.inf], np.nan, inplace=True)
mean_value = df_purchases['price_change_rate'].mean()
df_purchases['price_change_rate'].fillna(mean_value, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_purchases['price_change_rate'].replace([np.inf, -np.inf], np.nan, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_purchases['price_change_rate'].fillna(mean_value, inplace=True)


In [14]:
num_infinite_values = np.isinf(df_purchases['price_change_rate']).sum()
num_nan_values = df_purchases['price_change_rate'].isnull().sum()

print(f"Infinites values: {num_infinite_values}; NaN values: {num_nan_values}")

Infinites values: 0; NaN values: 0


Coding cathegorical variables. Using Target Encoding to establish to each category the mean of target variable

In [15]:
supplier_avg_cost = df_purchases.groupby("supplier_name")["unit_value"].mean()
supply_ref_avg_cost = df_purchases.groupby("supply_reference")["unit_value"].mean()

df_purchases["supplier_encoded"] = df_purchases["supplier_name"].map(supplier_avg_cost)
df_purchases["supply_ref_encoded"] = df_purchases["supply_reference"].map(supply_ref_avg_cost)

Creation of new categories for time series

In [16]:
df_purchases["lead_time"] = (df_purchases["delivery_date"] - df_purchases["order_date"]).dt.days  # Delivery time in days
df_purchases["month"] = df_purchases["order_date"].dt.month  # Month of order
df_purchases["year"] = df_purchases["order_date"].dt.year # Year of order

Standarization of numeric columns

In [17]:
# Negative values can be obtained of this transformation
scaler = StandardScaler()

df_purchases['quantity'] = scaler.fit_transform(df_purchases[['quantity']])
df_purchases['unit_value'] = scaler.fit_transform(df_purchases[['unit_value']])
df_purchases['lead_time'] = scaler.fit_transform(df_purchases[['lead_time']])

df_purchases.head(5)

Unnamed: 0_level_0,order_date,delivery_date,supplier_name,supply_reference,unit_value,quantity,previous_unit_value,price_change_rate,supplier_encoded,supply_ref_encoded,lead_time,month,year
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
9362,2018-01-25,2018-02-28,CODESOL,"1/2 CLAMP ORBIWELD 76S ø25,40MM",-0.066097,-0.009634,,0.0,3204.848571,139.09,0.1269,1,2018
4771,2024-02-29,2024-06-18,"Officine Orobiche, S.r.l.",2006.PF.PF.V.XX,14.800898,-0.009634,,0.0,544962.573576,4320000.0,1.721152,2,2024
3476,2023-09-21,2024-12-20,"Officine Orobiche, S.r.l.",2016.825.TI.S.XX,13.753085,-0.009634,,0.0,544962.573576,1333652.0,8.979195,9,2023
3477,2023-09-21,2024-12-20,"Officine Orobiche, S.r.l.",2016.825.TI.S.XX,14.447141,-0.009634,4015540.0,5.022239,544962.573576,1333652.0,8.979195,9,2023
4313,2023-09-21,2024-12-20,"Officine Orobiche, S.r.l.",2016.825.TI.S.XX,15.765872,-0.009634,4217210.0,9.086102,544962.573576,1333652.0,8.979195,9,2023


Get X and Y variables droping those columns without interesting data

In [18]:
X = df_purchases[['quantity', 'price_change_rate', 'supplier_encoded', 'supply_ref_encoded', 'lead_time', 'month', 'year']]
y = df_purchases['unit_value']

TimeSeriesSplit configuration for time series

In [19]:
n_splits = 5  # Number of divisions (folds)
tscv = TimeSeriesSplit(n_splits=n_splits)

Divide data in train and test. Train and prediction each model looking for the best

In [20]:
# Models to evaluate
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "LightGBM": lgb.LGBMRegressor(random_state=42),
}

# Initialise a dictionary to store the average metrics for each model.
metrics = {name: {"MAE": [], "RMSE": [], "R2": []} for name in models.keys()}

# Cross-Validation
for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {fold + 1}/{n_splits}")

    # Divide data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train and evaluate each model
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Calculate metrics
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)

        # Store metrics
        metrics[name]["MAE"].append(mae)
        metrics[name]["RMSE"].append(rmse)
        metrics[name]["R2"].append(r2)

Fold 1/5
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000855 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 899
[LightGBM] [Info] Number of data points in the train set: 1800, number of used features: 7
[LightGBM] [Info] Start training from score 0.217356
Fold 2/5
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000400 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 964
[LightGBM] [Info] Number of data points in the train set: 3595, number of used features: 7
[LightGBM] [Info] Start training from score 0.112555
Fold 3/5
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000378 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1001
[LightGBM] [Info] Number of data po

In [21]:
# Average metrics by model
print("\nResultados Promedio:")
for name, model_metrics in metrics.items():
    avg_mae = np.mean(model_metrics["MAE"])
    avg_rmse = np.mean(model_metrics["RMSE"])
    avg_r2 = np.mean(model_metrics["R2"])
    print(f"{name}: MAE = {avg_mae:.2f}, RMSE = {avg_rmse:.2f}, R² = {avg_r2:.2f}")


Resultados Promedio:
Linear Regression: MAE = 0.28, RMSE = 0.53, R² = -14.46
Decision Tree: MAE = 0.02, RMSE = 0.27, R² = 0.32
Random Forest: MAE = 0.02, RMSE = 0.17, R² = 0.69
Gradient Boosting: MAE = 0.02, RMSE = 0.21, R² = 0.27
LightGBM: MAE = 0.04, RMSE = 0.31, R² = -1.08


Select Random Forest as best model and use GridSearch to hyperparameter tuning

In [22]:
from sklearn.model_selection import GridSearchCV

param_grids = {
    "Random Forest": {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 3],
    },
    "Decision Tree": {
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['auto', 'sqrt', 'log2'],
    },
    "Gradient Boosting": {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.05, 0.1],
        'max_depth': [3, 5, 7],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2, 4],
        'subsample': [0.8, 1.0],
    }
}

# Initialise a dictionary to store the average metrics for each model.
best_models = {}

for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {fold + 1}/{n_splits}")

    # Divide data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train and evaluate each model
    for name, model in models.items():
        if name in param_grids:
          print(name)  # If model has defined hyperparameters
          grid_search = GridSearchCV(estimator=model, param_grid=param_grids[name],
                                      cv=3, n_jobs=-1, verbose=1, scoring='neg_mean_absolute_error')
          grid_search.fit(X_train, y_train)
          best_model = grid_search.best_estimator_
          best_models[name] = best_model
        # else:
        #   model.fit(X_train, y_train)
        #   best_models[name] = model  # Save original model

          y_pred = best_models[name].predict(X_test)

          # Calculate metrics
          mae = mean_absolute_error(y_test, y_pred)
          rmse = np.sqrt(mean_squared_error(y_test, y_pred))
          r2 = r2_score(y_test, y_pred)

          # Store metrics
          metrics[name]["MAE"].append(mae)
          metrics[name]["RMSE"].append(rmse)
          metrics[name]["R2"].append(r2)

# Calculate mean metrics
average_metrics = {name: {metric: np.mean(values) for metric, values in metrics[name].items()} for name in metrics}

for name in metrics:
  for metric, values in metrics[name].items():
    print(f"{name}: {metric}: {np.mean(values)}")

Fold 1/5
Decision Tree
Fitting 3 folds for each of 108 candidates, totalling 324 fits


108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
38 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Gradient Boosting
Fitting 3 folds for each of 216 candidates, totalling 648 fits
Fold 2/5
Decision Tree
Fitting 3 folds for each of 108 candidates, totalling 324 fits


108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
68 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Gradient Boosting
Fitting 3 folds for each of 216 candidates, totalling 648 fits
Fold 3/5
Decision Tree
Fitting 3 folds for each of 108 candidates, totalling 324 fits


108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
67 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Gradient Boosting
Fitting 3 folds for each of 216 candidates, totalling 648 fits
Fold 4/5
Decision Tree
Fitting 3 folds for each of 108 candidates, totalling 324 fits


108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
83 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Gradient Boosting
Fitting 3 folds for each of 216 candidates, totalling 648 fits
Fold 5/5
Decision Tree
Fitting 3 folds for each of 108 candidates, totalling 324 fits


108 fits failed out of a total of 324.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
41 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
skl

Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Gradient Boosting
Fitting 3 folds for each of 216 candidates, totalling 648 fits
Linear Regression: MAE: 0.2835461224978958
Linear Regression: RMSE: 0.529064356792243
Linear Regression: R2: -14.45721793920022
Decision Tree: MAE: 0.019739779936352987
Decision Tree: RMSE: 0.28204163752188977
Decision Tree: R2: 0.19641430163353196
Random Forest: MAE: 0.0157492470230347
Random Forest: RMSE: 0.17453433043175418
Random Forest: R2: 0.6875774318207586
Gradient Boosting: MAE: 0.02000674188993392
Gradient Boosting: RMSE: 0.1987487415889055
Gradient Boosting: R2: 0.39837711233157885
LightGBM: MAE: 0.03802677250892415
LightGBM: RMSE: 0.30655102957715263
LightGBM: R2: -1.0751590232378025


**Finally the best model is Random Forest with the following metrics**

*   MAE: 0.0157
*   RMSE: 0.174
*   R2: 0.687

