# Fuel Pairs Combinations Train-Test Split
In this experiment, each possible combination of fuel types is held out in turn as the test set and excluded from the training set. Each selected model is trained on the remaining fuels and scored on the held-out combination. The model scores from every trial are saved to a CSV file at the end.

File loc: "C:\Users\demir\OneDrive\Desktop\MSc Thesis\Data\!Exp_data\experiment2_results.csv"
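
The hold-out scheme described above can be sketched on toy data using only the standard library. The fuel names, sample ids, and the group size `k` below are hypothetical and chosen for illustration; the notebook itself works on a pandas DataFrame and passes its own group size to `combinations`.

```python
from itertools import combinations
from math import comb

# Hypothetical toy records: (sample id, fuel type). Names are made up.
rows = [
    ("s1", "coal"), ("s2", "coal"),
    ("s3", "wood"), ("s4", "straw"),
    ("s5", "peat"), ("s6", "peat"),
]

fuel_types = sorted({fuel for _, fuel in rows})
k = 2  # assumed number of fuel types held out per trial

splits = []
for held_out in combinations(fuel_types, k):
    # Rows of the held-out fuels form the test set; everything else trains.
    test = [r for r in rows if r[1] in held_out]
    train = [r for r in rows if r[1] not in held_out]
    splits.append((held_out, train, test))

for held_out, train, test in splits:
    print(held_out, "train:", len(train), "test:", len(test))

# The number of trials equals the binomial coefficient C(n, k).
n_trials = comb(len(fuel_types), k)
```

Because the split is made by fuel type rather than by row, the test set size varies from trial to trial, which is why the notebook records the train and test fractions alongside the scores.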

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from itertools import combinations

# Load data
clean_data = pd.read_csv(r"C:\Users\demir\OneDrive\Desktop\MSc Thesis\Data\!Exp_data\experiment_batch_data.csv")

# Get unique fuel types
fuel_types = clean_data['fuel_type'].unique()

# Store results
results = []

# Iterate over all combinations of 3 fuel types
for test_comb in combinations(fuel_types, 3):

    # Split data
    train_data = clean_data[~clean_data['fuel_type'].isin(test_comb)].drop(columns=['fuel_type']).reset_index(drop=True)
    test_data = clean_data[clean_data['fuel_type'].isin(test_comb)].drop(columns=['fuel_type']).reset_index(drop=True)

    # Extract features and target
    X_train = train_data.drop(columns=['sample', 'devol_yield'])
    y_train = train_data['devol_yield']
    X_test = test_data.drop(columns=['sample', 'devol_yield'])
    y_test = test_data['devol_yield']

    # Fraction of all rows that fall in the training and test sets for this trial
    train_ratio = len(X_train) / (len(X_train) + len(X_test))
    test_ratio = 1 - train_ratio

    # Imputation and scaling
    knn_imputer = KNNImputer(n_neighbors=3)
    X_train_imputed = knn_imputer.fit_transform(X_train)
    X_test_imputed = knn_imputer.transform(X_test)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_imputed)
    X_test_scaled = scaler.transform(X_test_imputed)

    X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

    # Models
    models = {
        "Dummy Mean": DummyRegressor(strategy="mean"),
        "Dummy Median": DummyRegressor(strategy="median"),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "Linear": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.1),
        "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
        "Decision Tree": DecisionTreeRegressor(max_depth=5),
        "Random Forest": RandomForestRegressor(n_estimators=100, max_depth=5),
        "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1),
        "XGBoost": xgb.XGBRegressor(n_estimators=100, learning_rate=0.1),
        "LightGBM": lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1),
        "Gaussian Process": GaussianProcessRegressor(),
        "SVR": SVR(kernel='rbf', C=1.0, epsilon=0.1),
        "MLP": MLPRegressor(hidden_layer_sizes=(100,), activation='relu', max_iter=2000)
    }

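    # Train each model and evaluate it on the held-out fuels
    # (score() returns the coefficient of determination, R^2, for these regressors)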
    scores = {}
    for model_name, model in models.items():
        model.fit(X_train_scaled, y_train)
        scores[model_name] = model.score(X_test_scaled, y_test)

    # Store results
    results.append({
        "Test Fuel Types": test_comb,
        "Train-Test Ratio": (train_ratio, test_ratio),
        **scores
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Save results to CSV
results_df.to_csv(r"C:\Users\demir\OneDrive\Desktop\MSc Thesis\Data\!Exp_data\experiment3.0_results.csv", index=False)

print("Experiment completed. Results saved.")
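
In the loop above, the imputer and scaler are fit on the training rows only and then applied to the held-out rows. As a sketch of an alternative layout (not what the notebook does), the same leak-free pattern can be bundled into a scikit-learn `Pipeline`. The synthetic data and the choice of `Ridge` here are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), with a few missing values.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
X_train[::7, 1] = np.nan  # knock out some training values after building y
X_test = rng.normal(size=(10, 3))
X_test[0, 1] = np.nan

# fit() learns the imputer and scaler statistics from X_train only;
# predict() reuses those statistics on X_test.
pipe = make_pipeline(KNNImputer(n_neighbors=3), StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print(pred.shape)
```

A pipeline keeps the preprocessing attached to each model, so no test-set statistics can leak into training even if the loop is later restructured.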

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000164 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1570
[LightGBM] [Info] Number of data points in the train set: 1749, number of used features: 21
[LightGBM] [Info] Start training from score 52.099957
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000195 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1479
[LightGBM] [Info] Number of data points in the train set: 1746, number of used features: 21
[LightGBM] [Info] Start training from score 52.767294
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000256 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1449
[LightGBM] [Info] Number of data points in the train set

  model = cd_fast.enet_coordinate_descent(


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000299 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1478
[LightGBM] [Info] Number of data points in the train set: 1608, number of used features: 21
[LightGBM] [Info] Start training from score 52.448348
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000163 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 343
[LightGBM] [Info] Number of data points in the train set: 1584, number of used features: 21
[LightGBM] [Info] Start training from score 52.891371
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000195 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1322
[LightGBM] [Info] Number of data points in the train set:

  model = cd_fast.enet_coordinate_descent(


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000204 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1456
[LightGBM] [Info] Number of data points in the train set: 1711, number of used features: 21
[LightGBM] [Info] Start training from score 52.475571
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000209 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1434
[LightGBM] [Info] Number of data points in the train set: 1699, number of used features: 21
[LightGBM] [Info] Start training from score 51.876212
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000139 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 340
[LightGBM] [Info] Number of data points in the train set:



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000205 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1328
[LightGBM] [Info] Number of data points in the train set: 1541, number of used features: 20
[LightGBM] [Info] Start training from score 54.228458
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000198 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1317
[LightGBM] [Info] Number of data points in the train set: 1545, number of used features: 20
[LightGBM] [Info] Start training from score 54.753802
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000299 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1471
[LightGBM] [Info] Number of data points in the train set: 1426, number of used features: 21
[LightGBM] [Info] Start trai
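
Once the per-combination scores are saved, a per-model summary makes them easier to compare. The sketch below uses made-up numbers in the same column layout as `results_df`; the values are hypothetical and are not results from this experiment.

```python
import pandas as pd

# Hypothetical scores shaped like the notebook's results table.
results_df = pd.DataFrame({
    "Test Fuel Types": ["('a', 'b')", "('a', 'c')", "('b', 'c')"],
    "Dummy Mean": [-0.10, -0.30, -0.05],
    "Ridge": [0.40, -1.20, 0.55],
    "Random Forest": [0.60, 0.10, 0.70],
})

# Everything except the bookkeeping columns is a model score.
score_cols = results_df.columns.difference(["Test Fuel Types", "Train-Test Ratio"])

# One row per model, ranked by median R^2 across held-out combinations.
summary = (
    results_df[score_cols]
    .agg(["mean", "median", "min"])
    .T.sort_values("median", ascending=False)
)
print(summary)
```

R^2 on unseen fuels can be strongly negative for a single trial, so the median and minimum are shown next to the mean; a single poor trial can otherwise dominate the average.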