# k-NN

This notebook applies a k-nearest neighbors (k-NN) classifier. k-NN is a non-parametric, instance-based algorithm: it stores the training data and classifies a new data point by the majority label among its k closest neighbors in the feature space. Because closeness is measured with a distance metric, the method is particularly sensitive to feature scaling, and it works best when instances of the same class lie close together in the input space.
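The majority-vote rule described above can be sketched in a few lines of plain Python. This is only an illustration on made-up 2-D points, assuming Euclidean distance; the helper `knn_predict` is hypothetical and is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most frequent label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
                (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_labels = [False, False, False, True, True, True]

pred_low = knn_predict(train_points, train_labels, (0.2, 0.2))
pred_high = knn_predict(train_points, train_labels, (0.7, 0.8))
print(pred_low, pred_high)
```

There is no training step beyond storing the points, which is what "instance-based" means here: all the work happens at prediction time.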

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score

In [2]:
df = pd.read_excel('combined_data_binary.xlsx', index_col=0)

In [3]:
# Separate features and target variable
X = df.drop('booked_energy_consultation', axis=1)
y = df['booked_energy_consultation']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns

# One-hot encode the categorical variables
encoder = OneHotEncoder(sparse_output=False)
categorical_encoded = encoder.fit_transform(X[categorical_cols])
categorical_encoded_df = pd.DataFrame(categorical_encoded, columns=encoder.get_feature_names_out(categorical_cols))

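# Recombine numerical and encoded categorical features; both indices are reset so the rows align positionally
X = pd.concat([X[numerical_cols].reset_index(drop=True), categorical_encoded_df.reset_index(drop=True)], axis=1)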

#### Splitting the data into training and test sets (70% training, 30% test)


In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Scaling the Data (Min-Max Normalization)

In [5]:
# Feature scaling: fit the scaler on the training set only, then apply the same transformation to the test set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
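To see why scaling matters for a distance-based model, the sketch below compares Euclidean distances before and after min-max scaling. The three rows and the column meanings (`annual_income`, `owns_home_flag`) are made up for illustration and are not taken from the dataset; the hand-written `min_max` helper applies the formula (x - min) / (max - min), which is what `MinMaxScaler` computes with its default range of 0 to 1.

```python
import numpy as np

# Made-up rows: [annual_income, owns_home_flag] (hypothetical columns)
X_raw = np.array([
    [30_000.0, 0.0],  # row 0
    [31_000.0, 1.0],  # row 1: similar income, different flag
    [90_000.0, 0.0],  # row 2: very different income, same flag
])


def min_max(X):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)


def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))


X_scaled = min_max(X_raw)

raw_01, raw_02 = dist(X_raw[0], X_raw[1]), dist(X_raw[0], X_raw[2])
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

On the raw values the income column dominates and the flag contributes almost nothing to the distance. After scaling, a difference in the flag weighs as much as the full income range, so both features can influence which neighbors are selected.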

In [6]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
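In the cells above, the one-hot encoder is fitted on the full dataset before the split, while the scaler is fitted on the training set only. One possible way to fit every preprocessing step on the training rows alone is to wrap the steps in a scikit-learn `Pipeline` with a `ColumnTransformer`. The following is a minimal sketch on made-up data; the column names `living_area`, `heating_type`, and `booked` are hypothetical, and this is an alternative arrangement rather than the approach used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up toy data with hypothetical column names
toy = pd.DataFrame({
    "living_area": [80, 120, 95, 150, 60, 200, 110, 130, 70, 170],
    "heating_type": ["gas", "oil", "gas", "heat_pump", "gas",
                     "oil", "heat_pump", "gas", "oil", "gas"],
    "booked": [False, True, False, True, False,
               True, True, False, False, True],
})
X_toy = toy.drop("booked", axis=1)
y_toy = toy["booked"]

# Scale numerical columns and one-hot encode categorical ones;
# handle_unknown="ignore" tolerates categories unseen during fitting
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["living_area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["heating_type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# fit() learns the scaling range and the categories from the training rows only
model.fit(X_tr, y_tr)
toy_pred = model.predict(X_te)
print(toy_pred)
```

With this arrangement the test rows never influence the encoder or the scaler, and the same fitted pipeline can be applied to new data with a single `predict` call.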

In [7]:
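# Class labels seen during training, in sorted order
class_labels = knn.classes_

# Making predictions
y_pred = knn.predict(X_test_scaled)

# Per-class F1 and precision (average=None returns one score per class, in sorted label order)
f1 = f1_score(y_test, y_pred, average=None)
precision = precision_score(y_test, y_pred, average=None)
accuracy = accuracy_score(y_test, y_pred)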


# Evaluate the model
print("Classification Report k-NN: \n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy)

Classification Report k-NN: 
               precision    recall  f1-score   support

       False       0.92      0.95      0.93      1474
        True       0.89      0.83      0.86       776

    accuracy                           0.91      2250
   macro avg       0.90      0.89      0.90      2250
weighted avg       0.91      0.91      0.91      2250

Accuracy Score: 0.9066666666666666


The k-NN model achieved an overall accuracy of 91% on the test set, making it one of the best-performing models in this project. Precision for the positive class is 89%, so roughly one in nine homeowners flagged as interested is a false positive. Recall for the positive class is lower at 83%, which means that about 17% of the homeowners who actually booked a consultation (roughly 130 of 776) are missed. The negative class, which makes up about two thirds of the test set, is predicted more reliably (precision 92%, recall 95%). Overall, the model offers a good balance between precision and recall.
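The percentages in the report can be translated back into approximate counts of correct and incorrect predictions. The sketch below does this arithmetic for the positive class using only the values printed in the classification report. Because those values are rounded to two decimals, the resulting counts are estimates, not the exact confusion matrix.

```python
# Values copied from the classification report above (rounded to two decimals)
support_pos, support_neg = 776, 1474
precision_pos, recall_pos = 0.89, 0.83

# recall = TP / (TP + FN), and TP + FN is the support of the positive class
tp = round(recall_pos * support_pos)
fn = support_pos - tp

# precision = TP / (TP + FP), so TP / precision estimates all predicted positives
predicted_pos = round(tp / precision_pos)
fp = predicted_pos - tp

# Remaining negatives are true negatives
tn = support_neg - fp

approx_accuracy = (tp + tn) / (support_pos + support_neg)

print("estimated TP, FN, FP, TN:", tp, fn, fp, tn)
print("estimated accuracy:", round(approx_accuracy, 3))
```

The estimated accuracy agrees with the reported 0.907 to within rounding error, which suggests the counts are a reasonable approximation. The exact values could be obtained with `sklearn.metrics.confusion_matrix(y_test, y_pred)`.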