# KNN Regression

There was a lot of discussion in this competition's forums concluding that the dataset is purely random, with no signal. One user claimed that 5% of the original data was duplicated. If that is true, a KNN regressor can match those duplicated rows against the original dataset and reuse their prices as predictions.
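Before modelling, the duplication claim can be checked by counting rows whose feature values repeat. The sketch below runs on a small made-up frame; `toy` and `feature_cols` are illustrative names and the values are not taken from the competition data.

```python
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Brand": ["Nike", "Adidas", "Nike", "Puma"],
    "Compartments": [7.0, 2.0, 7.0, 1.0],
    "Price": [112.2, 39.2, 112.2, 80.6],
})

# Compare rows on the features only; `id` is unique, so it would hide every duplicate.
feature_cols = ["Brand", "Compartments"]

# keep=False marks every member of a duplicated group, not just the later repeats.
dup_mask = toy.duplicated(subset=feature_cols, keep=False)
dup_fraction = dup_mask.mean()
print(f"{dup_fraction:.0%} of rows share their features with another row")
```

On the real data, the same call with the nine feature columns would give a rough estimate of how many rows a nearest-neighbor lookup could hope to match.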

### Load Data

In [65]:
import pandas as pd, numpy as np
df = pd.read_csv('./../../data/processed/train_dropped.csv')
test_df = pd.read_csv('./../../data/raw/test.csv')
print("Original Data shape",df.shape)
df.head()

Original Data shape (246686, 11)


|   | id | Brand | Material | Size | Compartments | Laptop Compartment | Waterproof | Style | Color | Weight Capacity (kg) | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Jansport | Leather | Medium | 7.0 | Yes | No | Tote | Black | 11.611723 | 112.15875 |
| 1 | 1 | Jansport | Canvas | Small | 10.0 | Yes | Yes | Messenger | Green | 27.078537 | 68.88056 |
| 2 | 2 | Under Armour | Leather | Small | 2.0 | Yes | No | Messenger | Red | 16.64376 | 39.1732 |
| 3 | 3 | Nike | Nylon | Small | 8.0 | Yes | No | Messenger | Green | 12.93722 | 80.60793 |
| 4 | 4 | Adidas | Canvas | Medium | 1.0 | Yes | Yes | Messenger | Green | 17.749338 | 86.02312 |


### Encode

In [66]:
# ----------------------------------------------------------------------------
# LABEL ENCODE (OR FACTORIZE) CATEGORICAL COLUMNS AND PRESERVE MISSING DATA
# ----------------------------------------------------------------------------

from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer


COLS = ['Brand', 'Material', 'Size', 'Laptop Compartment', 'Waterproof', 'Style', 'Color']

for c in COLS:
    # We'll create a LabelEncoder for each column
    le = LabelEncoder()
    
    # Replace NaNs with a sentinel label (e.g., "missing") before encoding
    # so that all rows have some string representation
    df[c] = df[c].fillna('missing').astype(str)
    test_df[c] = test_df[c].fillna('missing').astype(str)
    
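    # Fit the encoder on train and test together so both share one label-to-code mapping
    # (fitting separately on test could assign different codes to the same label)
    le.fit(pd.concat([df[c], test_df[c]]))
    df[c] = le.transform(df[c])
    test_df[c] = le.transform(test_df[c])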
    
    # Convert to float32 (optional, saves space)
    df[c] = df[c].astype('float32')
    test_df[c] = test_df[c].astype('float32')

# List numerical columns that need imputation (excluding the target if you prefer)
num_cols = ['Compartments', 'Weight Capacity (kg)']

# Create an imputer instance, here using the mean
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training data, then apply the same column means to the test set
df[num_cols] = imputer.fit_transform(df[num_cols])
test_df[num_cols] = imputer.transform(test_df[num_cols])

print("Data after label encoding...")
df.head()


Data after label encoding...


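|   | id | Brand | Material | Size | Compartments | Laptop Compartment | Waterproof | Style | Color | Weight Capacity (kg) | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 1.0 | 1.0 | 7.0 | 1.0 | 0.0 | 2.0 | 0.0 | 11.611723 | 112.15875 |
| 1 | 1 | 1.0 | 0.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 3.0 | 27.078537 | 68.88056 |
| 2 | 2 | 4.0 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 1.0 | 5.0 | 16.64376 | 39.1732 |
| 3 | 3 | 2.0 | 2.0 | 2.0 | 8.0 | 1.0 | 0.0 | 1.0 | 3.0 | 12.93722 | 80.60793 |
| 4 | 4 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 17.749338 | 86.02312 |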
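For the encoded values to be comparable between train and test, both frames have to go through one fitted mapping. As a hedged alternative to `LabelEncoder`, the sketch below uses scikit-learn's `OrdinalEncoder`, which is fitted on the training column and maps labels it has never seen to a sentinel code. The frames `train_toy` and `test_toy` and the sentinel `-1` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy columns (made up); the test column contains a label absent from train.
train_toy = pd.DataFrame({"Brand": ["Nike", "Adidas", None, "Puma"]})
test_toy = pd.DataFrame({"Brand": ["Puma", "Reebok", None]})

# Keep missing values as their own category, as in the cell above.
train_cat = train_toy[["Brand"]].fillna("missing").astype(str)
test_cat = test_toy[["Brand"]].fillna("missing").astype(str)

# Fit once on train; labels unseen at fit time are encoded as -1 instead of raising.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train_cat)
test_codes = enc.transform(test_cat)

print(dict(zip(train_cat["Brand"], train_codes.ravel())))
print(dict(zip(test_cat["Brand"], test_codes.ravel())))
```

With a single fitted encoder, a label such as `Puma` receives the same code in both frames, which is what a distance-based model needs.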


### Split

In [67]:
# The target column is "Price"
target_column = "Price"

# 1) Use the first 90% of rows for training and the remaining 10% for validation
train_size = int(0.9 * len(df))

# 2) Slice your DataFrame into train_df and valid_df
train_df = df.iloc[:train_size].copy()
valid_df = df.iloc[train_size:].copy()

print("Original subset train shape", train_df.shape)
print("Original subset valid shape", valid_df.shape)

# 3) Separate features (X) and target (y) for training and validation
X_train = train_df.drop(columns=[target_column])
y_train = train_df[target_column]

X_valid = valid_df.drop(columns=[target_column])
y_valid = valid_df[target_column]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_valid shape:", X_valid.shape)
print("y_valid shape:", y_valid.shape)


Original subset train shape (222017, 11)
Original subset valid shape (24669, 11)
X_train shape: (222017, 10)
y_train shape: (222017,)
X_valid shape: (24669, 10)
y_valid shape: (24669,)


### Predict with Constant Train Mean
If the target carries no signal, predicting the training mean for every row is the best constant prediction under squared error. We therefore score a mean-only baseline first and compare the KNN regressor against it.

In [68]:
# Baseline: Predict the mean of the training target for every validation sample
train_mean = y_train.mean()
print("Train mean:", train_mean)

# Get the true target values from the validation set
true = y_valid.values

# Create a prediction array filled with a single value (the train mean)
pred = np.full(len(y_valid), train_mean)
print("First 10 predictions:", pred[:10])

# Compute the error metric (e.g., RMSE)
m = np.sqrt(np.nanmean((true - pred) ** 2))
print("Using Constant Prediction - Validation Score (RMSE) =", m)

# Create submission file for constant mean prediction
submission = pd.DataFrame({
    "id": test_df["id"],
    "Price": train_mean
})
submission.to_csv("./../../submissions/constant_mean.csv", index=False)


Train mean: 81.56795501704825
First 10 predictions: [81.56795502 81.56795502 81.56795502 81.56795502 81.56795502 81.56795502
 81.56795502 81.56795502 81.56795502 81.56795502]
Using Constant Prediction - Validation Score (RMSE) = 39.14995083675154
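The claim that the mean is the best constant prediction can be checked numerically. The sketch below draws a synthetic, signal-free target (the distribution parameters are arbitrary assumptions, not fitted to the competition data) and compares the RMSE of several constant predictions; `rmse_of_constant` is a helper introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic target with no relation to any feature.
y = rng.normal(loc=80.0, scale=40.0, size=10_000)

def rmse_of_constant(y_true, c):
    """RMSE obtained by predicting the constant c for every row."""
    return float(np.sqrt(np.mean((y_true - c) ** 2)))

# Constants spaced 1 apart, centred on the sample mean (index 20 is the mean itself).
candidates = np.linspace(y.mean() - 20, y.mean() + 20, 41)
scores = [rmse_of_constant(y, c) for c in candidates]
best = candidates[int(np.argmin(scores))]

print("sample mean:", y.mean())
print("best constant among candidates:", best)
print("RMSE at the mean:", rmse_of_constant(y, y.mean()), "| std of y:", y.std())
```

The RMSE of the mean prediction equals the standard deviation of the target on the same sample, so the baseline score is essentially a measure of how spread out the prices are.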


### Predict With KNN Regressor

In [75]:
# 1-nearest-neighbor regression with scikit-learn's KNeighborsRegressor

from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Halve 'Compartments' and 'Weight Capacity (kg)' so they weigh less in the Euclidean distance
# (this rescales the columns; it does not remove noise)
X_train['Compartments'] = X_train['Compartments'] / 2.0
X_valid['Compartments'] = X_valid['Compartments'] / 2.0
X_train['Weight Capacity (kg)'] = X_train['Weight Capacity (kg)'] / 2.0
X_valid['Weight Capacity (kg)'] = X_valid['Weight Capacity (kg)'] / 2.0

# Optionally convert to numpy arrays if not already (scikit-learn accepts DataFrames too)
X_train_np = X_train.values if hasattr(X_train, 'values') else X_train
X_valid_np = X_valid.values if hasattr(X_valid, 'values') else X_valid

# Fit a 1-nearest-neighbor regressor
knn = KNeighborsRegressor(n_neighbors=1, algorithm='auto')
knn.fit(X_train_np, y_train)

# Predict on the validation set
pred = knn.predict(X_valid_np)

# Obtain the distance and index of the nearest neighbor for each validation sample
distances, indices = knn.kneighbors(X_valid_np, n_neighbors=1)

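# If the nearest neighbor is far (distance >= 1), fall back to the training mean.
# NOTE: X_train still contains the unique `id` column, so every validation row is at
# least 1 away from every training row and this fallback applies to all of them.
global_mean = np.nanmean(y_train)
pred[distances.flatten() >= 1] = global_mean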

print("First 10 predictions:", pred[:10])

# Compute the error metric (e.g., RMSE)
m = np.sqrt(np.nanmean((y_valid - pred) ** 2))
print("Using KNN Regressor - Validation score =", m)


First 10 predictions: [81.56795502 81.56795502 81.56795502 81.56795502 81.56795502 81.56795502
 81.56795502 81.56795502 81.56795502 81.56795502]
Using KNN Regressor - Validation score = 39.14995083675154
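The KNN score above is identical to the constant-mean score, which is consistent with the `id` column still being part of the feature matrix: ids are unique integers, so no validation row can be closer than 1 to any training row and every prediction falls back to the mean. The sketch below reproduces this on made-up data and shows what changes when the id is left out; `predict_with_fallback` and all arrays are hypothetical, and the real scores after such a change are not known from this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: validation row 0 duplicates the features of training row 1.
train_feats = np.array([[1.0, 7.0], [4.0, 2.0], [2.0, 8.0]])
train_price = np.array([112.0, 39.0, 80.0])
valid_feats = np.array([[4.0, 2.0], [0.0, 5.0]])

# Unique ids, as in the notebook's data.
train_ids = np.array([[0.0], [1.0], [2.0]])
valid_ids = np.array([[3.0], [4.0]])

def predict_with_fallback(X_tr, y_tr, X_va, max_dist=1.0):
    """1-NN prediction; rows whose neighbor is max_dist or farther get the train mean."""
    knn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)
    pred = knn.predict(X_va)
    dist, _ = knn.kneighbors(X_va, n_neighbors=1)
    pred[dist.ravel() >= max_dist] = y_tr.mean()
    return pred

# With the id column, the duplicate is never found and both rows get the mean.
with_id = predict_with_fallback(
    np.hstack([train_ids, train_feats]), train_price,
    np.hstack([valid_ids, valid_feats]),
)
# Without it, the duplicated row is matched exactly and inherits its neighbor's price.
without_id = predict_with_fallback(train_feats, train_price, valid_feats)

print("with id:   ", with_id)
print("without id:", without_id)
```

In the notebook, the equivalent change would be to drop `id` from `X_train`, `X_valid`, and the test features before fitting.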


### Generate Submission

In [74]:
# Apply the same scaling used for the training features, then convert to a numpy array
test_X = test_df.copy()
test_X['Compartments'] = test_X['Compartments'] / 2.0
test_X['Weight Capacity (kg)'] = test_X['Weight Capacity (kg)'] / 2.0
test_np = test_X.values

# Predict on the test set
pred = knn.predict(test_np)

# Obtain the distance and index of the nearest neighbor for each test sample
distances, indices = knn.kneighbors(test_np, n_neighbors=1)

# If the nearest neighbor is far (distance >= 1), fall back to the training mean
global_mean = np.nanmean(y_train)
pred[distances.flatten() >= 1] = global_mean

# Create submission DataFrame with two columns: "id" and "Price"
submission = pd.DataFrame({
    "id": test_df["id"],
    "Price": pred
})

# Save the submission to the specified CSV file
submission.to_csv("./../../submissions/knn_regressor.csv", index=False)
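Because a nearest-neighbor model compares raw feature values, the test features need the same columns, column order, and scaling as the training features. One way to keep the two in step is a single helper applied to every frame. The sketch below is an assumption about how that could look; `prepare_features` and the toy frames are not part of the notebook.

```python
import pandas as pd

def prepare_features(frame, feature_cols, scaled_cols, scale=2.0):
    """Select features in a fixed order and divide the scaled columns by `scale`."""
    out = frame[feature_cols].copy()
    out[scaled_cols] = out[scaled_cols] / scale
    return out

# Toy frames (made up); the test frame lists its columns in a different order.
train_toy = pd.DataFrame({"Brand": [1.0, 4.0], "Compartments": [7.0, 2.0], "Price": [112.0, 39.0]})
test_toy = pd.DataFrame({"Compartments": [10.0], "Brand": [2.0]})

feature_cols = ["Brand", "Compartments"]
X_train_toy = prepare_features(train_toy, feature_cols, ["Compartments"])
X_test_toy = prepare_features(test_toy, feature_cols, ["Compartments"])

print(X_train_toy)
print(X_test_toy)
```

Routing train, validation, and test through the same function makes it harder for one of them to miss a step such as the division by 2.0.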
