Question 1: What is KNN and how does it work?

Answer:
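K-Nearest Neighbors (KNN) is a simple, non-parametric, instance-based machine learning algorithm used for classification and regression. It has no explicit training phase: it stores the training data and defers all computation to prediction time.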

How it works in Classification:

A new data point is given.

Find its K nearest neighbors in the training data using a distance metric (Euclidean, Manhattan, etc.).

Check which class is most common among these neighbors.

Assign that class to the new point.
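
The classification steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the function name `knn_classify` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_classify(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # 1
```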

How it works in Regression:

Find K nearest neighbors.

Take the average (or weighted average) of their numerical target values.

That becomes the prediction.
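
A minimal sketch of the regression variant, under the same assumptions (Euclidean distance, made-up toy data, hypothetical function name `knn_regress`). The `weighted` option uses inverse-distance weights, which is one common choice for the weighted average.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3, weighted=False):
    """Predict a numeric target as the (optionally weighted) mean of the k nearest targets."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return float(np.mean(y_train[nearest]))
    # Inverse-distance weights; the small epsilon avoids division by zero
    weights = 1.0 / (distances[nearest] + 1e-12)
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))

# Toy data (hypothetical): one feature, target is 10 times the feature
X_toy_r = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy_r = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest neighbors of 2.0 have targets 10, 20 and 30
print(knn_regress(X_toy_r, y_toy_r, np.array([2.0]), k=3))  # 20.0
```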

Question 2: What is the Curse of Dimensionality and how does it affect KNN?

Answer:
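The Curse of Dimensionality refers to problems that arise as the number of features grows: the volume of the feature space increases exponentially, so a fixed number of samples covers it more and more sparsely.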

How it affects KNN:

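Distances between points become less meaningful in high dimensions.

The nearest and farthest points end up at almost the same distance, so the "nearest" neighbors are unreliable.

KNN becomes slower because every prediction computes distances to the training points across all dimensions.

Accuracy tends to drop because a fixed number of samples covers the feature space ever more sparsely.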
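
The loss of distance contrast can be observed on random data. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative gap between the farthest and nearest point from a random query; `relative_contrast` is a name chosen for this illustration. The gap should shrink as the number of dimensions grows.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(dims), 3))
```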

Question 3: What is PCA? How is it different from Feature Selection?

Answer:
PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms original features into new uncorrelated features called principal components.

PCA vs Feature Selection
Feature Selection	PCA
Removes some features	Creates new transformed features
Keeps original meaning	Components lose original interpretation
Selects best subset	Creates linear combinations
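
The difference can be seen side by side on the Wine dataset. This sketch uses `SelectKBest` with the ANOVA F-score as one possible feature selection method (an assumption; the text does not name a specific method) and keeps three features or components purely for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 3 of the 13 original columns, names unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: builds 3 new features, each a linear combination of all 13 columns
pca_demo = PCA(n_components=3).fit(X_std)
print("Component weight matrix shape:", pca_demo.components_.shape)  # (3, 13)
```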

Question 4: What are Eigenvalues and Eigenvectors in PCA? Why important?

Answer:

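Eigenvectors (of the data's covariance matrix) → orthogonal directions in feature space that define the principal components; the eigenvector with the largest eigenvalue is the direction of maximum variance.

Eigenvalues → amount of variance captured along each eigenvector.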

Importance:

Larger eigenvalue → more important component

Helps decide how many components to keep

Determines how PCA compresses the data
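
To make the link concrete, the sketch below computes the eigenvalues of the covariance matrix of the standardized Wine data with NumPy and compares them with what scikit-learn's `PCA` reports. Standardizing first is a choice made for this illustration; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (rowvar=False: columns are features)
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance captured by each eigenvector
explained_ratio = eigenvalues / eigenvalues.sum()

pca_check = PCA().fit(X_std)
print(np.allclose(eigenvalues, pca_check.explained_variance_))
print(np.allclose(explained_ratio, pca_check.explained_variance_ratio_))
```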

Question 5: How do KNN and PCA complement each other?

Answer:

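KNN tends to perform poorly in high dimensions, where distances lose their contrast.

PCA reduces the number of dimensions and discards low-variance directions, which often carry noise; this can improve KNN accuracy.

Fewer dimensions also make KNN's distance computations faster and mitigate (but do not eliminate) the curse of dimensionality.

Together, with feature scaling applied first, they form a simple and stable machine learning pipeline.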
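
One way to combine the two is a scikit-learn pipeline, sketched below on the Wine dataset. The choices of two components, K=5 and a stratified split are assumptions for illustration. Because the scaler and PCA sit inside the pipeline, they are fitted on the training split only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_w, y_w, test_size=0.2, random_state=42, stratify=y_w
)

# Scale -> project onto 2 principal components -> classify with KNN
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_tr, y_tr)
pipe_accuracy = pipe.score(X_te, y_te)
print("Pipeline test accuracy:", pipe_accuracy)
```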

In [1]:
#Question 6: KNN with & without Scaling on Wine Dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_wine()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without Scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred1 = knn.predict(X_test)
acc_without = accuracy_score(y_test, pred1)

# With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(X_train_scaled, y_train)
pred2 = knn2.predict(X_test_scaled)
acc_with = accuracy_score(y_test, pred2)

print("Accuracy without scaling:", acc_without)
print("Accuracy with scaling:", acc_with)


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


In [2]:
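# Question 7: PCA Explained Variance Ratio
from sklearn.decomposition import PCA

# Note: X is the raw, unscaled Wine data, so the features with the largest
# numeric scale dominate the first component (about 99.8% in the output below).
pca = PCA()  # keep all 13 components
pca.fit(X)

# Fraction of total variance captured by each component, in descending order
print(pca.explained_variance_ratio_)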


[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
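
For comparison, a sketch of the same analysis on standardized features, where no single feature dominates because of its units. It also shows one way to read off how many components reach a 95% cumulative variance threshold (the threshold and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_wine().data
X_std = StandardScaler().fit_transform(X_raw)

ratio_std = PCA().fit(X_std).explained_variance_ratio_
cumulative = np.cumsum(ratio_std)

print("First component share (standardized):", ratio_std[0])
print("Cumulative explained variance:", cumulative)

# Smallest number of components whose cumulative share reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_95)
```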


In [5]:
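# Question 8: KNN on PCA (2 Components)
# Note: PCA is fitted here on the full, unscaled dataset before the split,
# so the test rows influence the components. Fitting on the training split
# only (and scaling first) would avoid this leakage.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)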

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_pca, y, test_size=0.2, random_state=42)

knn_p = KNeighborsClassifier(n_neighbors=5)
knn_p.fit(X_train_p, y_train_p)
pred_p = knn_p.predict(X_test_p)

acc_pca = accuracy_score(y_test_p, pred_p)

print("Accuracy with PCA (2 components):", acc_pca)


Accuracy with PCA (2 components): 0.7222222222222222


In [4]:
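# Question 9: KNN with Euclidean vs Manhattan Distance
# Scaling (note: the scaler is fitted on the full dataset before the split,
# so the test rows contribute to the mean and standard deviation)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)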

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Euclidean
knn_e = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_e.fit(X_train_s, y_train_s)
acc_e = accuracy_score(y_test_s, knn_e.predict(X_test_s))

# Manhattan
knn_m = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_m.fit(X_train_s, y_train_s)
acc_m = accuracy_score(y_test_s, knn_m.predict(X_test_s))

print("Euclidean Accuracy:", acc_e)
print("Manhattan Accuracy:", acc_m)


Euclidean Accuracy: 0.9444444444444444
Manhattan Accuracy: 0.9444444444444444


Question 10: Cancer Gene Expression Dataset – PCA + KNN Pipeline

Answer:

1. Using PCA for Dimensionality Reduction

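Gene expression datasets typically have thousands of features (genes) but far fewer samples.

PCA reduces noise and removes correlation between features.

It retains most of the variance in the data, which is a proxy for biologically relevant signal rather than a guarantee of it.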

2. Selecting Number of Components

Plot cumulative explained variance.

Choose the smallest number of components that covers about 90–95% of the variance.

3. KNN After PCA

Use the reduced PCA output as input to KNN, with the scaler and PCA fitted on the training data only.

KNN often performs better because there are fewer noisy dimensions.

4. Model Evaluation

Use a stratified train-test split or cross-validation

Metrics: accuracy, F1-score, confusion matrix

5. Why This Pipeline Is Robust

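Can reduce overfitting, which is common in biomedical datasets with many more features than samples

Improves model speed and stability

Remains usable with small sample sizes, provided performance is estimated with cross-validation

The pipeline itself is simple to explain to doctors/stakeholders, although individual principal components do not map to single genes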
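
The whole workflow can be sketched as a single scikit-learn pipeline. No real gene expression data is used here: `make_classification` generates a synthetic stand-in with few samples and many features, and the sample count, feature count, 95% variance threshold and K=5 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gene expression matrix: 120 samples, 2000 features
X_gene, y_gene = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_redundant=0, n_classes=2, random_state=0,
)

# A float n_components keeps enough components to explain 95% of the variance
gene_pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)

# Scaler and PCA are refitted inside every training fold, so no test data leaks in
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(gene_pipe, X_gene, y_gene, cv=cv,
                        scoring=["accuracy", "f1_macro"])

print("Mean accuracy:", scores["test_accuracy"].mean())
print("Mean macro F1:", scores["test_f1_macro"].mean())
```

Keeping every preprocessing step inside the pipeline is the main design choice: cross-validation then evaluates the full procedure rather than only the final classifier.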