# Baseline Model

This notebook establishes a simple baseline against which more advanced models can be compared. The baseline model does not use any input features: it predicts each class uniformly at random (scikit-learn's `DummyClassifier` with `strategy='uniform'`). This sets a minimum benchmark for classification performance; any meaningful model should outperform it.

In [1]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import numpy as np
import pandas as pd

In [2]:
df = pd.read_excel('combined_data_binary.xlsx', index_col=0)

In [3]:
# Separate features and target variable
X = df.drop('booked_energy_consultation', axis=1)
y = df['booked_energy_consultation']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# One-hot encode the categorical variables
encoder = OneHotEncoder(sparse_output=False)
categorical_encoded = encoder.fit_transform(X[categorical_cols])
categorical_encoded_df = pd.DataFrame(categorical_encoded, columns=encoder.get_feature_names_out(categorical_cols))

X = pd.concat([X[numerical_cols].reset_index(drop=True), categorical_encoded_df.reset_index(drop=True)], axis=1)

In [4]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Feature Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
# Baseline that ignores the features and predicts each class uniformly at random
baseline_model = DummyClassifier(strategy='uniform', random_state=42)
baseline_model.fit(X_train_scaled, y_train)
baseline_predictions = baseline_model.predict(X_test_scaled)
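
`DummyClassifier` offers several feature-agnostic strategies, and which one is chosen changes what "baseline" means. The following is a minimal sketch on hypothetical toy data (the arrays `X_toy` and `y_toy` are illustrative assumptions, not the notebook's dataset) comparing `most_frequent` with the `uniform` strategy used above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 65 negatives and 35 positives.
y_toy = np.array([False] * 65 + [True] * 35)
# DummyClassifier ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

results = {}
for strategy in ("most_frequent", "uniform"):
    model = DummyClassifier(strategy=strategy, random_state=42)
    model.fit(X_toy, y_toy)
    results[strategy] = accuracy_score(y_toy, model.predict(X_toy))

print(results)
```

On imbalanced data, `most_frequent` scores the majority-class share, while `uniform` hovers around 50% regardless of the class balance.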

In [7]:
print(classification_report(y_test, baseline_predictions, zero_division=1))

              precision    recall  f1-score   support

       False       0.65      0.51      0.57       968
        True       0.36      0.50      0.42       532

    accuracy                           0.50      1500
   macro avg       0.50      0.50      0.49      1500
weighted avg       0.55      0.50      0.52      1500



As expected, the baseline model performs at chance level: accuracy is 50%, and the positive class (`True`) reaches a precision of only 0.36 and a recall of 0.50. These figures are the floor that any feature-based model needs to clearly exceed.
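
The reported numbers can be sanity-checked with a little arithmetic. The sketch below uses only the class counts from the `support` column of the report above and computes the values a uniform random guesser is expected to reach, along with the accuracy a majority-class predictor would obtain on the same test set (the variable names are illustrative):

```python
# Class counts taken from the support column of the classification report.
n_false, n_true = 968, 532
n_total = n_false + n_true

# Uniform random guessing: each class is predicted with probability 0.5,
# independently of the true label.
uniform_accuracy = 0.5
uniform_recall_true = 0.5
# Among samples predicted True, the expected share of actual positives
# equals the overall share of positives.
uniform_precision_true = n_true / n_total

# A majority-class predictor would always output the larger class.
majority_accuracy = max(n_false, n_true) / n_total

print(f"uniform: accuracy={uniform_accuracy:.2f}, "
      f"precision(True)={uniform_precision_true:.2f}, "
      f"recall(True)={uniform_recall_true:.2f}")
print(f"majority class: accuracy={majority_accuracy:.3f}")
```

The expected precision of about 0.35 and recall of 0.50 agree with the report to within sampling noise. A majority-class predictor would reach about 64.5% accuracy on this split, so it is the stricter bar when accuracy is the metric of interest.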