# Logistic Regression

This notebook applies logistic regression as the first predictive model beyond the baseline. Logistic regression is a linear classifier that estimates the probability of a binary outcome by passing a weighted sum of the features through the logistic (sigmoid) function. It is a useful starting point for gauging how much predictive signal the features carry, and its coefficients are interpretable: each one gives the change in the log-odds of the positive class per unit change in the corresponding feature.
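
As a minimal sketch of the idea (not the model trained in this notebook), the snippet below shows how logistic regression turns a weighted sum of features into a probability, and how a coefficient can be read as an odds ratio. The weights, intercept, and feature values are made-up illustrative numbers, not values fitted to this dataset.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of the positive class for a single sample."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients and one hypothetical, already-scaled sample
weights = [1.5, -0.8]
intercept = -0.2
sample = [0.6, 0.3]

p = predict_proba(sample, weights, intercept)
predicted_label = p >= 0.5  # default decision threshold of 0.5

# exp(coefficient) is the multiplicative change in the odds
# per unit increase of the corresponding feature
odds_ratios = [math.exp(w) for w in weights]

print("P(positive):", round(p, 3))
print("Predicted label:", predicted_label)
print("Odds ratios:", [round(o, 3) for o in odds_ratios])
```

A positive weight gives an odds ratio above 1 (the feature raises the odds of the positive class), and a negative weight gives one below 1.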

In [3]:
import pandas as pd
import numpy as np
import random as python_random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score

In [4]:
df = pd.read_excel('combined_data_binary.xlsx', index_col=0)

In [5]:
# Separate features and target variable
X = df.drop('booked_energy_consultation', axis=1)
y = df['booked_energy_consultation']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# One-hot encode the categorical variables
# (drop='first' removes one redundant, perfectly collinear column per feature)
encoder = OneHotEncoder(sparse_output=False, drop='first')
categorical_encoded = encoder.fit_transform(X[categorical_cols])
categorical_encoded_df = pd.DataFrame(categorical_encoded, columns=encoder.get_feature_names_out(categorical_cols))

# Recombine numerical and encoded categorical features; both indexes are reset so rows align by position
X = pd.concat([X[numerical_cols].reset_index(drop=True), categorical_encoded_df.reset_index(drop=True)], axis=1)

In [6]:
#X = X.drop('Unnamed: 0', axis=1)

### Splitting the data into training (70%) and test (30%) sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Scaling the features

In [9]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
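
For reference, min-max scaling maps each feature to (x - min) / (max - min), where min and max come from the training data only. The pure-Python sketch below illustrates this on a made-up feature column; it is a simplified illustration of the formula, not scikit-learn's implementation, and the helper names are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from training values."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Apply (x - lo) / (hi - lo) using previously learned parameters."""
    span = hi - lo
    if span == 0:
        # Simplification for this sketch: a constant feature is mapped to 0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up values for a single feature column
train_feature = [10.0, 20.0, 30.0, 50.0]
test_feature = [5.0, 40.0, 60.0]

# Parameters are learned on the training data only ...
lo, hi = fit_min_max(train_feature)

# ... and then reused unchanged for the test data
train_scaled = transform_min_max(train_feature, lo, hi)
test_scaled = transform_min_max(test_feature, lo, hi)

print(train_scaled)
print(test_scaled)
```

Because the test set is transformed with the training minimum and maximum, test values that lie outside the training range end up outside [0, 1]. This is expected and keeps information about the test set from leaking into the preprocessing.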

### Initializing the Logistic Regression Model

In [10]:
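# The target is binary. Note that `multi_class` is deprecated as of scikit-learn 1.5,
# so newer versions emit a FutureWarning for this argument.
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)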

# Train the model
log_reg.fit(X_train_scaled, y_train)

In [11]:
y_pred = log_reg.predict(X_test_scaled)

# Evaluate the model
print("Classification Report Logistic Regression:")
print(classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

Classification Report Logistic Regression:
              precision    recall  f1-score   support

       False       0.92      0.94      0.93      1474
        True       0.88      0.84      0.86       776

    accuracy                           0.90      2250
   macro avg       0.90      0.89      0.89      2250
weighted avg       0.90      0.90      0.90      2250

Accuracy Score: 0.9048888888888889


Logistic regression reaches an overall accuracy of 90% on the test set of 2,250 samples, well above the roughly 65.5% that always predicting the majority (negative) class would achieve. Precision and recall are 92% and 94% for the negative class and 88% and 84% for the positive class, so the model misses about 16% of the homeowners who actually booked a consultation. The F1-scores (0.93 and 0.86) show good performance on both classes, though it is somewhat weaker on the less frequent positive class. Overall, the model separates homeowners who are likely to book an energy consultation from those who are not, which suggests that the features carry meaningful predictive signal.
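
As a quick consistency check on the report above, the F1-score is the harmonic mean of precision and recall, and the accuracy of a majority-class predictor follows directly from the class supports. The sketch below recomputes both from the rounded figures printed in the report, so the results can differ from scikit-learn's in the last digit.

```python
def f1_from(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above
report = {
    "False": {"precision": 0.92, "recall": 0.94, "support": 1474},
    "True": {"precision": 0.88, "recall": 0.84, "support": 776},
}

f1_neg = f1_from(report["False"]["precision"], report["False"]["recall"])
f1_pos = f1_from(report["True"]["precision"], report["True"]["recall"])

# Accuracy of a trivial model that always predicts the larger class
total = sum(row["support"] for row in report.values())
majority_baseline = max(row["support"] for row in report.values()) / total

print("F1 (False):", round(f1_neg, 2))
print("F1 (True):", round(f1_pos, 2))
print("Majority-class accuracy:", round(majority_baseline, 3))
```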