**Question 1:** What is K-Nearest Neighbors (KNN) and how does it work in classification and regression?


**Answer:** K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm.

It does not fit explicit model parameters during training (it is a "lazy learner")

It stores all training data

Predictions are made only when a new data point arrives

**How KNN works**

Choose a value of K (number of neighbors)

Compute distance between the new point and all training points

Select the K closest points

Aggregate their outputs
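
The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_classify` and the toy points are made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Step 3: indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Step 4: aggregate the neighbors' labels by majority vote
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, query=(2.0, 2.0), k=3))  # A
print(knn_classify(points, labels, query=(6.0, 7.0), k=3))  # B
```

Note that all the work happens at prediction time: there is no separate training step beyond storing `points` and `labels`.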

**KNN for Classification**

Uses majority voting

The most common class among neighbors becomes the prediction

Example:
If K = 5 and neighbors are [Class A, A, B, A, B] → Prediction = Class A

**KNN for Regression**

Uses average (or weighted average) of neighbors’ values

Example:
Neighbors’ values = [10, 12, 14] → Prediction = 12
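
The difference between the plain and the weighted average can be seen in a short sketch. The neighbor distances below are hypothetical (the example above gives only the values), and inverse-distance weighting is one common choice, assumed here for illustration.

```python
def knn_regress(distances, values, weighted=False):
    """Aggregate neighbors' target values; `distances` are their distances to the query."""
    if not weighted:
        return sum(values) / len(values)
    # Inverse-distance weights: closer neighbors count more.
    # The small epsilon avoids division by zero for exact matches.
    weights = [1.0 / (d + 1e-9) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

values = [10, 12, 14]
distances = [1.0, 2.0, 4.0]  # hypothetical distances, for illustration only

plain = knn_regress(distances, values)
weighted = knn_regress(distances, values, weighted=True)

print(plain)               # 12.0
print(round(weighted, 2))  # 11.14
```

The weighted prediction is pulled toward 10 because that neighbor is the closest one.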

**Question 2:** What is the Curse of Dimensionality and how does it affect KNN?

**Answer:** The curse of dimensionality refers to what happens as the number of features (dimensions) increases:

Data becomes sparse, because a fixed number of samples has to cover an exponentially larger space

Distances between points become less meaningful

**Why this hurts KNN**

KNN relies completely on distance

In high dimensions:

Nearest and farthest neighbors become almost equally distant

Noise dominates meaningful patterns
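
The claim that nearest and farthest neighbors become almost equally distant can be checked with a small simulation. This sketch assumes uniformly random points in the unit hypercube, which is an idealized setting; the helper name `relative_contrast` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows, the relative contrast shrinks toward zero, so the ranking of neighbors by distance carries less and less information.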

**Impact**

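Poor accuracy, because the "nearest" neighbors are barely closer than any other point

High computation cost, since every distance calculation runs over all dimensions

Overfitting to noise in irrelevant features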

**Question 3**: What is PCA? How is it different from feature selection?

**Answer**: Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that:

Converts the original features into new, mutually orthogonal (uncorrelated) features, each a linear combination of the originals

These new features are called principal components

Captures maximum variance with fewer dimensions

**How PCA works (intuition)**

Finds directions where data varies the most

Projects data onto those directions

Removes redundant information
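
The intuition above can be sketched with NumPy on synthetic data. This is a minimal illustration via the covariance matrix, not scikit-learn's implementation; the two correlated features are generated for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2-D data whose two features are strongly correlated (illustrative only)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Find the directions of maximum variance (eigenvectors of the covariance matrix)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order

# 3. Sort directions by variance, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the first direction only, dropping the redundant one
X_reduced = X_centered @ eigenvectors[:, :1]

explained = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained.round(3))
print("Reduced shape:", X_reduced.shape)
```

Because the second feature is almost a multiple of the first, a single component retains nearly all of the variance.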

**Difference from Feature Selection**

PCA creates new features

Feature selection keeps original features

PCA components are harder to interpret than the original features, but working in fewer dimensions improves efficiency
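
The contrast can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` stands in for feature selection in general; it is one of several possible selectors, chosen here only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# Feature selection: keeps 2 of the original 13 columns (supervised, uses the labels)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, data.target)
kept = [data.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted mix of all 13 originals (unsupervised)
pca = PCA(n_components=2).fit(X_scaled)
print("Component weights shape:", pca.components_.shape)
```

The selected features keep their original names and meaning, while each row of `pca.components_` holds 13 weights, one per original feature.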

**Question 4**: What are eigenvalues and eigenvectors in PCA and why are they important?

**Answer:**

**Eigenvectors**

Directions of maximum variance in the data (eigenvectors of the covariance matrix)

Become the principal components

**Eigenvalues**

Amount of variance captured by each eigenvector

**Why they matter**

Larger eigenvalue → more variance (information) captured along that direction

PCA keeps components with highest eigenvalues

Helps decide how many components to retain
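
The link between eigenvalues, eigenvectors, and what scikit-learn reports can be checked directly. This sketch assumes the standardized Wine data and compares a manual eigendecomposition of the covariance matrix against `PCA.explained_variance_` and `PCA.components_`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Eigendecomposition of the sample covariance matrix
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
eigenvalues = eigenvalues[::-1]                  # largest first
eigenvectors = eigenvectors[:, ::-1]

pca = PCA().fit(X_scaled)

# Eigenvalues correspond to the variance captured by each component
print(np.allclose(eigenvalues, pca.explained_variance_))

# The leading eigenvector matches the first principal component up to sign
print(np.allclose(np.abs(eigenvectors[:, 0]), np.abs(pca.components_[0]), atol=1e-6))
```

The comparison uses absolute values because an eigenvector and its negation describe the same direction.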

**Question 5:** How do KNN and PCA complement each other in a pipeline?

**Answer**:

**Problem with KNN alone**

Sensitive to noise

Slow in high dimensions

Suffers from curse of dimensionality

**How PCA helps**

Reduces dimensions

Decorrelates features (the principal components are uncorrelated with each other)

Improves distance reliability

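**Combined Pipeline**

Scale features

Apply PCA

Train KNN on reduced data

Result: often a faster and more stable model; accuracy may improve or drop slightly, depending on how much useful variance the discarded components carried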

**Question 6**: KNN on Wine dataset with and without feature scaling

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred))

# With scaling (fit the scaler on the training split only, then reuse it on the test split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))

OUTPUT:
Accuracy without scaling: 0.7222
Accuracy with scaling: 0.9722
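
The gap between the two accuracies comes from feature scale: without scaling, Euclidean distance is dominated by the features with the largest numeric range. A quick look at the per-feature spread makes this visible; the variable names below are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = data.data

# Per-feature standard deviation before scaling
stds = X.std(axis=0)
largest = data.feature_names[int(np.argmax(stds))]
smallest = data.feature_names[int(np.argmin(stds))]
print(f"Largest spread:  {largest} (std = {stds.max():.2f})")
print(f"Smallest spread: {smallest} (std = {stds.min():.2f})")

# After standardization every feature has unit standard deviation
X_scaled = StandardScaler().fit_transform(X)
print("Std after scaling:", X_scaled.std(axis=0).round(2))
```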


**Question 7**: PCA explained variance ratio (Wine dataset)


In [None]:
from sklearn.decomposition import PCA

# Fit PCA with all components on the scaled training data from Question 6
pca = PCA()
X_pca = pca.fit_transform(X_train_scaled)

print("Explained variance ratio:")
print(pca.explained_variance_ratio_)


OUTPUT:
[0.36, 0.19, 0.11, 0.07, 0.06, ...]
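
The individual ratios are easier to act on once they are accumulated. The sketch below repeats the same split and scaling so it runs on its own, then finds the smallest number of components that reaches a chosen threshold; the 0.95 threshold is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # illustrative choice
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print(f"Components needed for {threshold:.0%}:", n_keep)
```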


**Question 8**: KNN on PCA-transformed dataset (top 2 components)

In [None]:
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

y_pred_pca = knn.predict(X_test_pca)
print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred_pca))


OUTPUT:
Accuracy with PCA (2 components): 0.9444


**Question 9**: KNN with different distance metrics

In [None]:
# n_neighbors is left at its default of 5, matching the earlier cells
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

print("Euclidean accuracy:",
      accuracy_score(y_test, knn_euclidean.predict(X_test_scaled)))

print("Manhattan accuracy:",
      accuracy_score(y_test, knn_manhattan.predict(X_test_scaled)))

OUTPUT:
Euclidean accuracy: 0.9722
Manhattan accuracy: 0.9444
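
For reference, the two metrics differ only in how per-feature differences are combined. A small hand computation on two made-up vectors shows the difference:

```python
import numpy as np

# Two made-up points, for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # L2: straight-line distance
manhattan = np.sum(np.abs(a - b))          # L1: sum of per-feature differences

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```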


**Question 10:** High-dimensional gene expression dataset – Complete Strategy

**Answer:**

**Step 1**: Use PCA

Scale data

Apply PCA to remove noise and correlation

Reduce thousands of genes to meaningful components

**Step 2**: Decide number of components

Use cumulative explained variance (retain ~95%)

Scree plot elbow method

Balance performance and complexity

**Step 3**: Apply KNN

Train KNN on PCA-reduced data

Choose optimal K using cross-validation
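
Choosing K by cross-validation can be sketched with `GridSearchCV` over a pipeline. No gene expression data is available here, so `make_classification` generates a synthetic stand-in with few samples and many features; the candidate K values and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gene expression matrix: few samples, many features
X, y = make_classification(n_samples=120, n_features=1000, n_informative=20,
                           n_redundant=50, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# Scaling and PCA are refit inside each fold, so validation data never leaks into them
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

print("Best K:", search.best_params_["knn__n_neighbors"])
print("Mean CV accuracy:", round(search.best_score_, 3))
```

Keeping the scaler and PCA inside the pipeline matters most in the small-sample setting, where fitting them on the full dataset before splitting would give optimistic scores.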

**Step 4**: Evaluate the model

Accuracy

Precision & recall

Cross-validation stability

Confusion matrix                
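
The evaluation items above map onto standard scikit-learn utilities. This sketch uses the Wine dataset as a stand-in for gene expression data and the same pipeline settings as the cell at the end of this answer; with real data only the loading step would change.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in dataset
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=7),
)

# Accuracy and cross-validation stability: per-fold scores and their spread
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean / std:", round(scores.mean(), 3), round(scores.std(), 3))

# Confusion matrix plus per-class precision and recall from out-of-fold predictions
y_pred = cross_val_predict(model, X, y, cv=5)
cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred))
```
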

**Step 5**: Stakeholder justification

Reduces overfitting

Handles small-sample, high-feature data

Supports system-level interpretation: components summarize groups of co-varying genes, at the cost of per-gene interpretability

Computationally efficient

Widely accepted in biomedical research

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier(n_neighbors=7))
])

pipeline.fit(X_train, y_train)
print("Pipeline accuracy:", pipeline.score(X_test, y_test))

OUTPUT:
Pipeline accuracy: 0.91
