Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
- K-Nearest Neighbors (KNN) is a supervised, instance-based algorithm. It stores the training data and, at prediction time:

1. takes K, a hyperparameter chosen in advance

2. computes distances from the query point to all training points (e.g., Euclidean, Manhattan)

3. takes the K closest neighbors

4. predicts:

  - Classification: majority class among neighbors (optionally distance-weighted votes)

  - Regression: average (or distance-weighted average) of neighbor targets

  KNN is a lazy learner: no explicit model is fit, and the computation happens at query time.
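
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, the toy data, and the unweighted vote are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using plain (unweighted) KNN."""
    X_train = np.asarray(X_train, dtype=float)
    # Step 2: Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Step 3: indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Hypothetical toy data: two well-separated groups in 2-D
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y_class = ["a", "a", "a", "b", "b", "b"]
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_predict(X_toy, y_class, [0.5, 0.5], k=3, task="classification")
value = knn_predict(X_toy, y_value, [0.5, 0.5], k=3, task="regression")
print(label, value)  # prints: a 2.0
```

The query point sits among the first three training points, so the vote returns their class and the regression returns the mean of their targets.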

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?
- As features (dimensions) grow:

  - Data becomes sparse, and distances between points become less discriminative (nearest and farthest start to look similar).

  - KNN’s distance comparisons become noisy → lower accuracy and higher variance.

  Mitigations: scale features, remove irrelevant ones, and apply dimensionality reduction (e.g., PCA).
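
The loss of contrast between near and far points can be seen in a small simulation. The sketch below assumes uniformly random points in a unit hypercube, and the helper `relative_contrast` is a name introduced here for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The ratio shrinks as the number of dimensions grows
for d in (2, 10, 100, 1000):
    print(f"dims={d:>4}  relative contrast={relative_contrast(d):.3f}")
```

A ratio close to zero means the nearest and farthest neighbors are almost equally far away, which is the situation in which KNN's ranking of neighbors carries little information.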

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
- Principal Component Analysis (PCA) is an unsupervised, linear dimensionality reduction technique that:

  - Finds orthogonal directions (principal components) capturing maximal variance.

  - Projects data onto these directions to reduce dimensions while preserving most variance.

- PCA vs Feature Selection

  - PCA (feature extraction): creates new features (linear combinations of originals).

  - Feature selection: keeps a subset of original features (e.g., filter/wrapper methods).
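
The difference shows up directly in what each method returns. The sketch below uses the Wine dataset with three output columns, an arbitrary choice for illustration, and `SelectKBest` with an ANOVA F-test as one example of a filter method (others would work equally well here).

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature extraction: each new column is a mix of all 13 original features
pca = PCA(n_components=3).fit(X_std)
X_pca = pca.transform(X_std)
print("PCA output shape:", X_pca.shape)
print("Weights per component:", pca.components_.shape)

# Feature selection: keep 3 of the original columns, chosen by a filter score
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)
```

Both outputs have three columns, but only the selected features keep their original names and meaning. Each principal component is a weighted combination of all thirteen inputs.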

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?
- Each principal component has an eigenvector (direction in feature space) and an eigenvalue (amount of variance captured along that direction).

- Eigenvalue magnitude ⇒ importance of the component.

- Sorting eigenvalues descending gives component order; explained variance ratio = eigenvalue / total variance.
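
These relationships can be checked by hand with NumPy. The sketch below builds a small synthetic dataset (hypothetical, with two strongly correlated features and one independent one), eigendecomposes its covariance matrix, and derives the explained variance ratios from the sorted eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features, the first two correlated
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and form the sample covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Explained variance ratio = eigenvalue / sum of all eigenvalues
explained_ratio = eigvals / eigvals.sum()
for i, (val, ratio) in enumerate(zip(eigvals, explained_ratio), start=1):
    print(f"PC{i}: eigenvalue={val:.4f}, explained variance ratio={ratio:.4f}")
```

Each column of `eigvecs` is one principal direction, and projecting the centered data onto the leading columns gives the reduced representation.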

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
- KNN relies on meaningful distances; high-dimensional, differently-scaled features can distort them.

- Scaling → PCA → KNN often improves performance by:

   - removing redundancy among correlated features,

   - suppressing noise by dropping low-variance directions,

   - speeding up neighbor searches in fewer dimensions.
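
A quick way to check whether the combination helps on a given dataset is to cross-validate the variants side by side. The sketch below does this on the Wine data. Keeping five components and using K = 5 are arbitrary choices for illustration, and the variable names are introduced here.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)

candidates = {
    "KNN only": make_pipeline(KNeighborsClassifier(n_neighbors=5)),
    "scale + KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "scale + PCA(5) + KNN": make_pipeline(
        StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=5)
    ),
}

# Each pipeline is refit inside every fold, so scaling and PCA never see the held-out data
cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_w, y_w, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:<22} mean CV accuracy = {cv_means[name]:.4f}")
```

Whether the PCA step improves on scaling alone depends on the data, so the comparison is worth running rather than assuming.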

In [2]:
# Q-6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load data
wine = load_wine()
X, y = wine.data, wine.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# KNN without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn_no_scale.predict(X_test))

# KNN with scaling
pipe_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
pipe_scaled.fit(X_train, y_train)
acc_scaled = accuracy_score(y_test, pipe_scaled.predict(X_test))

print("Q6 Results:")
print(f"Accuracy (no scaling):   {acc_no_scale:.4f}")
print(f"Accuracy (with scaling): {acc_scaled:.4f}")


Q6 Results:
Accuracy (no scaling):   0.8056
Accuracy (with scaling): 0.9722


In [3]:
# Q-7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
from sklearn.decomposition import PCA

# Scale features first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# PCA (all components)
pca_full = PCA()
pca_full.fit(X_train_scaled)

print("Q7 Results: PCA Explained Variance Ratios")
for i, ratio in enumerate(pca_full.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}, Cumulative: {pca_full.explained_variance_ratio_[:i+1].sum():.4f}")


Q7 Results: PCA Explained Variance Ratios
PC1: 0.3579, Cumulative: 0.3579
PC2: 0.1927, Cumulative: 0.5506
PC3: 0.1102, Cumulative: 0.6608
PC4: 0.0727, Cumulative: 0.7335
PC5: 0.0672, Cumulative: 0.8008
PC6: 0.0513, Cumulative: 0.8521
PC7: 0.0438, Cumulative: 0.8959
PC8: 0.0250, Cumulative: 0.9209
PC9: 0.0228, Cumulative: 0.9437
PC10: 0.0188, Cumulative: 0.9624
PC11: 0.0178, Cumulative: 0.9803
PC12: 0.0126, Cumulative: 0.9928
PC13: 0.0072, Cumulative: 1.0000


In [4]:
# Q-8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
# PCA with top 2 components
pca2 = PCA(n_components=2)
X_train_pca2 = pca2.fit_transform(X_train_scaled)
X_test_pca2 = pca2.transform(scaler.transform(X_test))

# KNN on PCA data
knn_pca2 = KNeighborsClassifier(n_neighbors=5)
knn_pca2.fit(X_train_pca2, y_train)
acc_pca2 = accuracy_score(y_test, knn_pca2.predict(X_test_pca2))

print("Q8 Results:")
print(f"Accuracy (original scaled data): {acc_scaled:.4f}")
print(f"Accuracy (PCA top-2 comps):      {acc_pca2:.4f}")


Q8 Results:
Accuracy (original scaled data): 0.9722
Accuracy (PCA top-2 comps):      0.9167


In [5]:
# Q-9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
# Scale features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Euclidean (p=2)
knn_euclid = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_euclid.fit(X_train_scaled, y_train)
acc_euclid = accuracy_score(y_test, knn_euclid.predict(X_test_scaled))

# Manhattan (p=1)
knn_manhat = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
knn_manhat.fit(X_train_scaled, y_train)
acc_manhat = accuracy_score(y_test, knn_manhat.predict(X_test_scaled))

print("Q9 Results:")
print(f"Accuracy (Euclidean): {acc_euclid:.4f}")
print(f"Accuracy (Manhattan): {acc_manhat:.4f}")


Q9 Results:
Accuracy (Euclidean): 0.9722
Accuracy (Manhattan): 1.0000


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

- To handle a high-dimensional gene expression dataset with KNN, I would build the pipeline as follows:

1. Preprocessing:
First, I will standardize the gene features (zero mean, unit variance) so that no feature dominates KNN's distance calculations because of its scale.

2. Dimensionality Reduction with PCA:
Since the dataset has thousands of features but few samples, I will apply PCA to reduce dimensions.
I will keep enough principal components to explain about 90–95% of the variance, or decide the number of PCs using cross-validation.

3. KNN Classification:
After PCA transformation, I will train a KNN classifier.
I will tune the number of neighbors (K = 3, 5, 7, etc.) and also try different distance metrics like Euclidean and Manhattan.

4. Evaluation:
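I will use stratified cross-validation to estimate accuracy and F1 score, fitting the scaler and PCA inside each fold so that no information leaks from the validation data.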
I will also compare results for different numbers of components and K values.

5. Justification:

- PCA helps remove noise and reduces overfitting in high-dimensional data.

- KNN is simple, easy to explain (each prediction can be traced to its nearest training samples), and tends to benefit from reduced dimensions.

- Cross-validation gives a more reliable performance estimate than a single train/test split.

- The pipeline is efficient and easy to update when new samples are added.
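
Steps 2 to 4 can be combined into one cross-validated search over the number of components, K, and the distance metric. The sketch below is one possible setup, not a tuned recipe. The simulated data is a small hypothetical stand-in for gene expression data, and the candidate values in `param_grid` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for gene expression data: few samples, many features
X_sim, y_sim = make_classification(
    n_samples=120, n_features=500, n_informative=20,
    n_classes=3, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative, not recommendations
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1_macro",
)
search.fit(X_sim, y_sim)

print("Best parameters:", search.best_params_)
print(f"Best mean CV macro-F1: {search.best_score_:.4f}")
```

Because the best score is chosen from many candidates, it is an optimistic estimate. A held-out test set or nested cross-validation gives a fairer final number.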

In [7]:
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report  # accuracy_score was imported in an earlier cell

# Simulate high-dimensional dataset: 200 samples, 5000 features
X_gene, y_gene = make_classification(
    n_samples=200, n_features=5000, n_informative=50,
    n_classes=3, random_state=42
)

# Split
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_gene, y_gene, test_size=0.2, stratify=y_gene, random_state=42
)

# Pipeline: scaling + PCA (retain 95% variance) + KNN
pipe_gene = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2))
])

pipe_gene.fit(Xg_train, yg_train)
yg_pred = pipe_gene.predict(Xg_test)

print("Q10 Results:")
print(f"Test Accuracy: {accuracy_score(yg_test, yg_pred):.4f}")
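# zero_division=0 reports 0.00 for a class that is never predicted instead of emitting warnings
print("Classification Report:\n", classification_report(yg_test, yg_pred, zero_division=0))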

# Cross-validation for robustness
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe_gene, X_gene, y_gene, cv=cv, scoring="accuracy")
print(f"5-Fold CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Q10 Results:
Test Accuracy: 0.3750
Classification Report:
               precision    recall  f1-score   support

           0       0.41      0.64      0.50        14
           1       0.33      0.46      0.39        13
           2       0.00      0.00      0.00        13

    accuracy                           0.38        40
   macro avg       0.25      0.37      0.30        40
weighted avg       0.25      0.38      0.30        40




5-Fold CV Accuracy: 0.2950 ± 0.0400
