In [15]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

In [2]:
df=pd.read_csv('train.csv',usecols=['Age','Pclass','Fare','Survived'])

In [3]:
df.head()

Unnamed: 0,Survived,Pclass,Age,Fare
0,0,3,22.0,7.25
1,1,1,38.0,71.2833
2,1,3,26.0,7.925
3,1,1,35.0,53.1
4,0,3,35.0,8.05


In [6]:
df.isnull().mean()*100

Survived     0.00000
Pclass       0.00000
Age         19.86532
Fare         0.00000
dtype: float64

In [7]:
X=df.drop(columns=['Survived'])
y=df['Survived']

In [9]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=2)

In [10]:
X_train.head()

Unnamed: 0,Pclass,Age,Fare
30,1,40.0,27.7208
10,3,4.0,16.7
873,3,47.0,9.0
182,3,9.0,31.3875
876,3,20.0,9.8458


In [11]:
knn=KNNImputer()

X_train_trf=knn.fit_transform(X_train)
X_test_trf=knn.transform(X_test)

In [None]:
In KNNImputer, the metric parameter controls how the distance between rows is measured.
Because KNN imputation is distance-based, this choice determines which neighbors are selected and therefore which values get imputed.

1. Distance parameter in sklearn.impute.KNNImputer
KNNImputer(
    n_neighbors=5,
    weights="uniform",
    metric="nan_euclidean"
)


🔹 The distance parameter is metric
🔹 Default value: "nan_euclidean"
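
As a quick check, the defaults listed above can be read back from a freshly constructed imputer. This is only a minimal sketch; the variable name params is illustrative.

```python
from sklearn.impute import KNNImputer

# Read the default settings back from an imputer built with no arguments.
params = KNNImputer().get_params()
print(params["n_neighbors"], params["weights"], params["metric"])
```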

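2. Default distance: nan_euclidean ⭐
What it does

- Computes Euclidean distance over the features that are present in both rows
- Ignores NaN values
- Scales the squared distance by (total features / features present in both rows), so pairs compared on fewer features are not favored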

Formula

$$
d(x, y) = \sqrt{\frac{n}{m} \sum_{i \in \text{valid}} (x_i - y_i)^2}
$$

Where:

- n = total number of features
- m = number of features that are non-NaN in both rows
- valid = the set of those m features
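
The formula can be written out by hand and compared with scikit-learn's own nan_euclidean_distances function. The helper nan_euclidean below is a hypothetical name used only for illustration, and the two rows are made-up values.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def nan_euclidean(x, y):
    """Hypothetical helper: the formula above for two 1-D rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)   # features present in both rows
    n, m = x.size, valid.sum()
    if m == 0:
        return np.nan                     # no shared feature: distance undefined
    return np.sqrt(n / m * np.sum((x[valid] - y[valid]) ** 2))


x = [1.0, np.nan, 3.0]
y = [4.0, 5.0, 7.0]

manual = nan_euclidean(x, y)                        # sqrt(3/2 * (9 + 16))
library = nan_euclidean_distances([x], [y])[0, 0]   # scikit-learn's version
print(manual, library)
```

Both values should agree, since the second feature is skipped and the remaining squared differences are scaled by 3/2.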

Why it is the default

✔ Works with missing values
✔ Fair comparison even when rows are missing different features
✔ Prevents rows with many NaNs from appearing closer

3. Why standard Euclidean distance is NOT used
Standard Euclidean distance

$$
d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}
$$

❌ Evaluates to NaN if any feature is NaN in either row
❌ No normalization → simply skipping the NaN terms would make rows that share few features look closer than they are

👉 Hence, it cannot be the default
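
Both problems can be seen with plain NumPy. The three rows below are made-up values chosen so that b shares one feature with a and c shares two; the variable names are illustrative.

```python
import numpy as np

a = np.array([2.0, np.nan, 4.0])
b = np.array([3.0, 6.0, np.nan])   # shares only the first feature with a
c = np.array([3.0, 3.0, 5.0])      # shares the first and third features with a

# Plain Euclidean distance: the NaN propagates through the sum.
plain = np.sqrt(np.sum((a - b) ** 2))

# Naive fix: drop the NaN terms but do not rescale.
skip_ab = np.sqrt(np.nansum((a - b) ** 2))
skip_ac = np.sqrt(np.nansum((a - c) ** 2))
print(plain, skip_ab, skip_ac)
```

Every shared feature differs by exactly 1 in both pairs, yet the unscaled version ranks b as closer than c purely because fewer features were compared.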

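4. Can we change the distance metric?
❌ By name in KNNImputer: no

"nan_euclidean" is the only built-in string value for metric

You cannot pass these by name:

- Manhattan
- Minkowski
- Cosine

✅ Alternatives

- Pass a callable as metric: a function that takes two rows, accepts a missing_values keyword, returns a scalar distance, and handles NaN itself
- Use sklearn.impute.IterativeImputer, which is model-based rather than distance-based and is still marked experimental in scikit-learn
- Write a custom KNN implementation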
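
A minimal sketch of the callable route, assuming a NaN-aware Manhattan distance is wanted. The function nan_manhattan is a hypothetical helper, not part of scikit-learn, and the small array is made-up data.

```python
import numpy as np
from sklearn.impute import KNNImputer


def nan_manhattan(x, y, *, missing_values=np.nan):
    """Hypothetical NaN-aware Manhattan distance between two 1-D rows."""
    valid = ~np.isnan(x) & ~np.isnan(y)   # features present in both rows
    if not valid.any():
        return np.nan                     # nothing to compare
    # Rescale by (total features / shared features), as nan_euclidean does.
    return x.size / valid.sum() * np.abs(x[valid] - y[valid]).sum()


X = np.array([
    [1.0, 2.0, np.nan],   # third feature to be imputed
    [1.0, 2.0, 10.0],
    [8.0, 9.0, 20.0],
    [9.0, 9.0, 30.0],
])

imputer = KNNImputer(n_neighbors=1, metric=nan_manhattan)
X_filled = imputer.fit_transform(X)
print(X_filled[0])
```

With a single neighbor, the missing value is copied from the closest row that has the third feature. A Python callable is evaluated pair by pair, so this is likely to be much slower than the built-in metric on large data.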

5. Distance vs Weights (important distinction)

| Parameter | Role |
|---|---|
| metric | How distance is computed |
| weights | How neighbors influence imputation |

weights options:

- "uniform" (default) → all neighbors equal
- "distance" → closer neighbors have more influence

⚠️ Distance metric ≠ weights
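
The effect of weights can be seen on a tiny made-up array where the two nearest neighbors sit at different distances from the incomplete row. The variable names are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [4.0, 40.0],
    [1.2, np.nan],   # second feature to be imputed
])

# Same metric and same two neighbors (rows 0 and 1); only the averaging differs.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)
print(uniform[3, 1], weighted[3, 1])
```

The uniform result is the plain average of 10 and 20, while the distance-weighted result is pulled toward 10 because row 0 is four times closer than row 1.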

6. Comparison: Default vs Hypothetical Alternatives

| Metric | Works with NaN | Normalized | Built into KNNImputer |
|---|---|---|---|
| Euclidean | ❌ No | ❌ No | ❌ |
| Manhattan | ❌ No | ❌ No | ❌ |
| Cosine | ❌ No | ❌ No | ❌ |
| nan_euclidean | ✅ Yes | ✅ Yes | ✅ Default |

7. Practical Example
Data

| Row | A | B | C |
|---|---|---|---|
| 1 | 2 | NaN | 4 |
| 2 | 3 | 6 | NaN |

Valid feature: A only (m = 1)

Total features: n = 3

Normalized distance:

$$
d = \sqrt{\frac{3}{1} (2 - 3)^2} = \sqrt{3} \approx 1.73
$$

Without normalization → distance = 1 (misleading, because it rests on a single shared feature)
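
The same two rows can be passed to scikit-learn's nan_euclidean_distances to confirm the hand calculation. The names row1 and row2 are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

row1 = [2.0, np.nan, 4.0]
row2 = [3.0, 6.0, np.nan]

# Only feature A is present in both rows, so the scale factor is 3/1.
d = nan_euclidean_distances([row1], [row2])[0, 0]
print(d, np.sqrt(3))
```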

8. When default distance is BEST

✔ Mixed missing patterns
✔ Numerical features
✔ Real-world datasets
✔ Fair neighbor selection needed

9. Exam-ready one-liners

Distance parameter in KNN Imputer: metric

Default metric: nan_euclidean

Why default: Handles missing values and normalizes distance

Key advantage: Prevents bias due to unequal feature availability

Final takeaway

KNN Imputer uses nan_euclidean distance by default because it ignores missing values and rescales distances for fair neighbor selection, something standard Euclidean distance cannot do.

In [17]:
X_train_trf

array([[  1.    ,  40.    ,  27.7208],
       [  3.    ,   4.    ,  16.7   ],
       [  3.    ,  47.    ,   9.    ],
       ...,
       [  1.    ,  71.    ,  49.5042],
       [  1.    ,  28.    , 221.7792],
       [  1.    ,  42.8   ,  25.925 ]], shape=(712, 3))

In [18]:
pd.DataFrame(X_train_trf,columns=X_train.columns)

Unnamed: 0,Pclass,Age,Fare
0,1.0,40.0,27.7208
1,3.0,4.0,16.7000
2,3.0,47.0,9.0000
3,3.0,9.0,31.3875
4,3.0,20.0,9.8458
...,...,...,...
707,3.0,30.0,8.6625
708,3.0,24.2,8.7125
709,1.0,71.0,49.5042
710,1.0,28.0,221.7792


In [19]:
lr=LogisticRegression()
lr.fit(X_train_trf,y_train)
y_pred=lr.predict(X_test_trf)
accuracy_score(y_test,y_pred)

0.7039106145251397

In [30]:
# comparison with SimpleImputer (default strategy: mean)
si=SimpleImputer()
X_train_trf2=si.fit_transform(X_train)
X_test_trf2=si.transform(X_test)

In [31]:
lr=LogisticRegression()
lr.fit(X_train_trf2,y_train)
y_pred=lr.predict(X_test_trf2)
accuracy_score(y_test,y_pred)

0.6927374301675978
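
The two accuracies above come from a single split of 179 test rows, so they differ by only two predictions. A cross-validated comparison gives a steadier picture. The sketch below uses synthetic data from make_classification as a stand-in, since train.csv is not bundled here; the names X_demo, y_demo and scores are illustrative, and the numbers it prints say nothing about the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Knock out roughly 20% of the first feature, similar to the Age column.
rng = np.random.default_rng(0)
X_demo[rng.random(300) < 0.2, 0] = np.nan

scores = {}
for name, imputer in [("knn", KNNImputer()), ("mean", SimpleImputer())]:
    # Keeping the imputer inside the pipeline refits it on each training fold.
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5)
    print(name, scores[name].mean())
```

To run this on the notebook's data, X and y from the earlier cells could be passed in place of X_demo and y_demo.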