## Question 1
**What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

**Answer:**

K-Nearest Neighbors (KNN) is a non-parametric, instance-based supervised learning algorithm used for both classification and regression. It makes predictions by finding the `k` training samples closest to a query point (according to a distance metric such as Euclidean or Manhattan) and combining their labels (classification) or target values (regression).

- **Classification:** The predicted class is typically the majority class among the k nearest neighbors (voting). Optionally, neighbors can be weighted by inverse distance so nearer neighbors contribute more.
- **Regression:** The predicted value is usually the mean (or weighted mean) of the target values of the k nearest neighbors.
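
The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration rather than how scikit-learn implements KNN; the helper name `knn_predict` and the toy arrays are made up for the example, and it uses unweighted Euclidean neighbors only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for a single query point with brute-force Euclidean KNN."""
    # Distance from the query to every stored training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
pred_class = knn_predict(X_toy, y_class, query, k=3, task="classification")
pred_value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print(pred_class, pred_value)
```

The query sits inside the first cluster, so its three nearest neighbors all carry label 0 and the regression output is the mean of their three target values.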

**Key properties:**
- Simple, intuitive, and requires no explicit training (training is just storing the dataset).
- Sensitive to feature scales: features with large numeric ranges dominate the distance, so feature scaling (standardization or normalization) is important.
- Choice of `k` influences bias-variance: small `k` → low bias, high variance; large `k` → high bias, low variance.
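- Computationally expensive at prediction time on large datasets: a brute-force query compares the point against every stored sample, so cost grows linearly with the number of samples and features. KD-trees, Ball-trees, or approximate nearest-neighbor methods mitigate this, although tree-based indexes lose much of their advantage in high dimensions.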
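
To make the scale-sensitivity point concrete, here is a small sketch with hypothetical data (ages in years, incomes in dollars). The array `X_people` and the helper `nearest_to_first` are invented for illustration; the point is only that the nearest neighbor of the first row changes once both columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are [age in years, income in dollars]
X_people = np.array([
    [25.0, 50000.0],   # row 0: the query person
    [60.0, 50300.0],   # row 1: very different age, similar income
    [26.0, 51500.0],   # row 2: similar age, somewhat different income
    [45.0, 48000.0],   # row 3
])

def nearest_to_first(X):
    """Return the row index (1..n-1) closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

# Raw features: income differences (hundreds of dollars) swamp age differences
nearest_raw = nearest_to_first(X_people)
# Standardized features: both columns contribute on a comparable scale
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X_people))
print(nearest_raw, nearest_scaled)
```

On the raw data the 60-year-old (row 1) is the nearest neighbor because incomes are close; after standardization the 26-year-old (row 2) becomes nearest.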


## Question 2
**What is the Curse of Dimensionality and how does it affect KNN performance?**

**Answer:**

The "Curse of Dimensionality" refers to phenomena that arise when working with high-dimensional data. As dimensionality (number of features) increases:

- Data points become sparse; the volume of the space increases exponentially and points are far apart on average.
- Distance metrics become less informative: distances between nearest and farthest neighbors tend to concentrate, reducing contrast.
- Models that rely on locality (like KNN) degrade because the concept of "neighborhood" becomes less meaningful.
- Overfitting risk grows when sample size is small relative to dimensionality.

**Impact on KNN:**
- KNN relies on distance to define similarity; with many dimensions, distances lose discriminative power and nearest neighbors may not be truly similar.
- Performance often worsens; KNN can become noisy and unstable.
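
The loss of contrast between near and far neighbors can be observed with a short simulation. This sketch assumes points drawn uniformly from the unit hypercube, which is only one of many possible data distributions; the function name `relative_contrast` and the sample sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:4d}  relative contrast={c:.3f}")
```

In low dimensions the farthest point is many times farther away than the nearest one; as the dimension grows the ratio shrinks toward zero, which is the sense in which "nearest" stops being informative.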

**Mitigations:**
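- Dimensionality reduction (e.g., PCA) or feature selection. t-SNE and UMAP are mainly visualization tools; scikit-learn's t-SNE in particular cannot project unseen test points, so it does not fit a train/test KNN workflow.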
- Increase sample size if possible.
- Use distance metrics or learned embeddings that better reflect similarity for the problem domain.


## Question 3
**What is Principal Component Analysis (PCA)? How is it different from feature selection?**

**Answer:**

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto a lower-dimensional orthogonal basis (the principal components) chosen to maximize the retained variance. The steps are:
1. Center the data (subtract the mean of each feature); standardize as well if features are on different scales.
2. Compute the covariance matrix of the centered data.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue (variance explained) and project the data onto the top components.
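
The four steps can be sketched directly with NumPy. This is an illustrative eigendecomposition version on made-up synthetic data, not the SVD-based routine that scikit-learn's `PCA` uses internally; all variable names here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 3 features (made up for illustration)
latent = rng.normal(size=(200, 1))
X_demo = np.hstack([latent, 2 * latent, rng.normal(size=(200, 1))])
X_demo += 0.1 * rng.normal(size=X_demo.shape)

# 1. Center the data
X_centered = X_demo - X_demo.mean(axis=0)
# 2. Covariance matrix (features are columns)
cov = np.cov(X_centered, rowvar=False)
# 3. Eigendecomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort in descending order and project onto the top 2 components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_proj = X_centered @ eigvecs[:, :2]

print(X_proj.shape)
print("Eigenvalues (variance per component):", eigvals)
```

The sample variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".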

**Differences from feature selection:**
- **PCA (feature extraction):** Constructs new features (linear combinations of original features). The new features (principal components) are orthogonal and ordered by variance explained. Original features are not preserved directly.
- **Feature selection:** Chooses a subset of original features without creating new ones (e.g., filter methods, wrapper methods, embedded methods). Interpretability is usually better for feature selection because original measurements remain.

PCA reduces dimensionality while retaining maximum variance (for linear projections), whereas feature selection keeps or discards original features based on some criterion.
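
A short comparison on the Wine data may help show the difference. The sketch below uses `SelectKBest` with the ANOVA F-score as one example of a filter method (an arbitrary choice, not the only option) and keeps three features or components in each case.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 3 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
X_selected = selector.transform(X_std)
kept_idx = selector.get_support(indices=True)
kept = [wine.feature_names[i] for i in kept_idx]
print("Selected original features:", kept)

# Feature extraction: 3 new columns, each a weighted mix of all 13 features
pca_demo = PCA(n_components=3).fit(X_std)
X_extracted = pca_demo.transform(X_std)
print("PCA loadings shape (components x original features):",
      pca_demo.components_.shape)
```

The selected columns are literally columns of the input, so they keep their names and units, while each PCA column is defined by a row of 13 loadings.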


## Question 4
**What are eigenvalues and eigenvectors in PCA, and why are they important?**

**Answer:**

In PCA, eigenvectors (principal components) of the covariance matrix define directions in feature space along which the data varies. Eigenvalues correspond to the amount of variance captured along each eigenvector.

- **Eigenvector:** A direction (unit vector) in the original feature space. Projecting data onto this vector gives coordinates along that principal axis.
- **Eigenvalue:** Scalar that quantifies the variance of the data along its eigenvector.

**Importance:**
- Sorting by eigenvalue gives an ordering of principal components by importance (explained variance).
- Choosing the top eigenvectors (highest eigenvalues) yields a low-dimensional subspace that preserves most of the variance.
- Eigenvalues let us compute the *explained variance ratio* to decide how many components to keep.
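
As a sanity check on the last point, the explained variance ratio is each eigenvalue divided by the sum of all eigenvalues. The sketch below computes it by hand on the standardized Wine data and compares it with scikit-learn's `explained_variance_ratio_`; variable names are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_w = StandardScaler().fit_transform(load_wine().data)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigvals_w = np.linalg.eigvalsh(np.cov(X_w, rowvar=False))[::-1]
ratio_manual = eigvals_w / eigvals_w.sum()

# The same quantity as reported by scikit-learn
ratio_sklearn = PCA().fit(X_w).explained_variance_ratio_

print(np.round(ratio_manual[:3], 4))
print(np.round(ratio_sklearn[:3], 4))
```

Both arrays agree up to floating-point error, and the cumulative sum of either one is what is typically thresholded (for example at 90%) to pick the number of components.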


## Question 5
**How do KNN and PCA complement each other when applied in a single pipeline?**

**Answer:**

PCA and KNN are often combined:
- **PCA reduces dimensionality**, which alleviates the curse of dimensionality, discards low-variance directions that are often noisy or redundant, and makes distance-based algorithms more reliable.
- **KNN benefits from lower-dimensional, decorrelated features** because distances become more meaningful, computation is faster, and the risk of overfitting is reduced.

Typical pipeline: scale → PCA (retain components capturing most variance) → KNN (tune `k`, distance metric). This often yields better generalization and faster inference.
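
A minimal sketch of such a pipeline on the Wine data is shown below. The 90% variance threshold and `k=5` are placeholder choices for illustration, not tuned values; wrapping the steps in a pipeline means the scaler and PCA are refit on each training fold, which avoids leaking test information.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# scale -> PCA (keep enough components for 90% of the variance) -> KNN
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validated accuracy; each fold refits all three steps
scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

Passing a float between 0 and 1 as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction.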


---

# Practical section — Wine dataset experiments

We use `sklearn.datasets.load_wine()` and run the following experiments:

- Q6: KNN with and without feature scaling
- Q7: PCA and explained variance ratios
- Q8: KNN on PCA-transformed data (top 2 components)
- Q9: KNN with different distance metrics
- Q10: Discussion + example pipeline for high-dimensional gene expression data (with demo code)

Run the code cells sequentially to reproduce results.


In [2]:
# Q6-Q9 — experiments with the Wine dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = load_wine()
X = data.data
y = data.target
feature_names = data.feature_names

print('Wine dataset shape:', X.shape)
print('Classes:', np.unique(y))

# create a DataFrame for easy viewing
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)


Wine dataset shape: (178, 13)
Classes: [0 1 2]


In [3]:
# Q6: Train KNN classifier WITH and WITHOUT feature scaling and compare accuracy

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)

print('Accuracy without scaling: {:.4f}'.format(acc_no_scaling))
print('Accuracy with StandardScaler: {:.4f}'.format(acc_with_scaling))

# Show brief classification report for scaled model
print('\nClassification report (scaled model):')
print(classification_report(y_test, y_pred_scaled))


Accuracy without scaling: 0.7778
Accuracy with StandardScaler: 0.9333

Classification report (scaled model):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.94      0.89      0.91        18
           2       0.85      0.92      0.88        12

    accuracy                           0.93        45
   macro avg       0.93      0.94      0.93        45
weighted avg       0.94      0.93      0.93        45



In [4]:
# Q7: Train PCA and print explained variance ratio of each component
pca = PCA()
X_scaled_full = StandardScaler().fit_transform(X)  # PCA on scaled features
pca.fit(X_scaled_full)
explained_ratios = pca.explained_variance_ratio_

print('Number of components:', len(explained_ratios))
for i, ratio in enumerate(explained_ratios, start=1):
    cum = explained_ratios[:i].sum()  # cumulative variance up to this component
    print(f'PC{i}: {ratio:.4f} (cumulative {cum:.4f})')

# Also show cumulative explained variance and components to retain 90% variance
cum_var = np.cumsum(explained_ratios)
components_90 = np.searchsorted(cum_var, 0.90) + 1
print('\nComponents needed to retain >=90% variance:', components_90)


Number of components: 13
PC1: 0.3620 (cumulative 0.3620)
PC2: 0.1921 (cumulative 0.5541)
PC3: 0.1112 (cumulative 0.6653)
PC4: 0.0707 (cumulative 0.7360)
PC5: 0.0656 (cumulative 0.8016)
PC6: 0.0494 (cumulative 0.8510)
PC7: 0.0424 (cumulative 0.8934)
PC8: 0.0268 (cumulative 0.9202)
PC9: 0.0222 (cumulative 0.9424)
PC10: 0.0193 (cumulative 0.9617)
PC11: 0.0174 (cumulative 0.9791)
PC12: 0.0130 (cumulative 0.9920)
PC13: 0.0080 (cumulative 1.0000)

Components needed to retain >=90% variance: 8


In [5]:
# Q8: KNN on PCA-transformed dataset (top 2 components)

# Fit the scaler on the training split only (same split as in Q6), then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

pca2 = PCA(n_components=2)
X_train_pca2 = pca2.fit_transform(X_train_s)
X_test_pca2 = pca2.transform(X_test_s)

knn_pca2 = KNeighborsClassifier(n_neighbors=5)
knn_pca2.fit(X_train_pca2, y_train)
y_pred_pca2 = knn_pca2.predict(X_test_pca2)
acc_pca2 = accuracy_score(y_test, y_pred_pca2)

print('Accuracy on original scaled data (k=5): {:.4f}'.format(acc_with_scaling))
print('Accuracy on PCA (2 components) data (k=5): {:.4f}'.format(acc_pca2))

print('\nClassification report (PCA 2 components):')
print(classification_report(y_test, y_pred_pca2))


Accuracy on original scaled data (k=5): 0.9333
Accuracy on PCA (2 components) data (k=5): 0.9333

Classification report (PCA 2 components):
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        15
           1       0.89      0.94      0.92        18
           2       1.00      0.92      0.96        12

    accuracy                           0.93        45
   macro avg       0.94      0.93      0.94        45
weighted avg       0.94      0.93      0.93        45



In [6]:
# Q9: KNN with different distance metrics on the scaled Wine dataset

# Fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

for metric in ['euclidean', 'manhattan']:
    knn_m = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_m.fit(X_train_s, y_train)
    y_pred_m = knn_m.predict(X_test_s)
    acc = accuracy_score(y_test, y_pred_m)
    print(f'Metric: {metric:9s} -> Accuracy: {acc:.4f}')


Metric: euclidean -> Accuracy: 0.9333
Metric: manhattan -> Accuracy: 0.9778


## Question 10

**Scenario:** High-dimensional gene expression dataset (many features, few samples) causing overfitting.

**Explain how you would:**

- **Use PCA to reduce dimensionality**
  - Standardize features (zero mean, unit variance), computing the means and variances on the training data only.
  - Fit PCA on training data (not on entire dataset including test) to avoid data leakage.
  - Project data onto top principal components that capture most variance.

- **Decide how many components to keep**
  - Use explained variance ratio and cumulative explained variance (e.g., keep components that capture 90–95% of variance).
  - Alternatively, use domain knowledge, scree plot (elbow), cross-validation performance with different component counts, or downstream classification performance.

- **Use KNN for classification post-dimensionality reduction**
  - Use the PCA-transformed features as input to KNN.
  - Tune `k` and distance metric via cross-validation.
  - Use distance-weighted voting if useful.

- **Evaluate the model**
  - Use stratified train-test split or nested cross-validation if sample size permits.
  - Evaluate with accuracy, precision, recall, F1-score, ROC-AUC (for binary or one-vs-rest multi-class), and confusion matrices.
  - Use permutation tests or repeated CV for robust estimates given small sample size.

- **Justify to stakeholders**
  - PCA reduces noise and stabilizes distance-based classifiers, lowering overfitting risk.
  - The pipeline (scaling → PCA → cross-validated KNN) is transparent and reproducible.
  - Components can be inspected through their loadings (the weight each gene contributes to a component) to relate results back to genes when interpretability is needed.
  - Provide robust validation (e.g., nested CV) and confidence intervals for metrics.
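
The nested cross-validation mentioned above can be sketched as follows. This is an illustrative setup on a small synthetic dataset (sizes, grid values, and fold counts are arbitrary assumptions chosen to run quickly), not a recommendation for real gene expression data: the inner loop tunes the number of components and `k`, and the outer loop estimates how well the whole tuning procedure generalizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for a "many features, few samples" problem
X_syn, y_syn = make_classification(n_samples=120, n_features=300,
                                   n_informative=10, random_state=0)

pipe_nested = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
# Component counts stay well below the smallest inner training fold (64 samples)
grid = {"pca__n_components": [5, 10, 20], "knn__n_neighbors": [3, 5, 7]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate
inner_search = GridSearchCV(pipe_nested, grid, cv=inner_cv, scoring="accuracy")
outer_scores = cross_val_score(inner_search, X_syn, y_syn,
                               cv=outer_cv, scoring="accuracy")

print("Nested CV accuracy: {:.3f} +/- {:.3f}".format(
    outer_scores.mean(), outer_scores.std()))
```

Reporting the outer-fold mean and spread, rather than the best inner-loop score, avoids the optimistic bias that comes from selecting hyperparameters and evaluating on the same folds.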


In [7]:
# Demo: synthetic high-dimensional gene expression-like data
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Create synthetic dataset: 100 samples, 2000 features, 10 informative
X_hd, y_hd = make_classification(n_samples=100, n_features=2000, n_informative=10, n_redundant=50, n_classes=3, random_state=42)
print('Synthetic high-dim data shape:', X_hd.shape)

# Pipeline: scaler -> PCA -> KNN with cross-validation to choose n_components and k
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    # PCA cannot return more components than there are training samples.
    # Each 5-fold training split holds 80 of the 100 samples, so every
    # candidate must be <= 80 (a value of 100 makes those fits fail).
    'pca__n_components': [10, 20, 50],
    'knn__n_neighbors': [3, 5, 7],
    'knn__metric': ['euclidean', 'manhattan']
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy', n_jobs=-1)
search.fit(X_hd, y_hd)

print('Best params:', search.best_params_)
print('Best CV accuracy: {:.4f}'.format(search.best_score_))

# Show how many components were chosen and why
best_pca_n = search.best_params_['pca__n_components']
print('\nSelected number of components:', best_pca_n)

# Fit PCA separately to show explained variance for that number
scaler = StandardScaler().fit(X_hd)
X_hd_s = scaler.transform(X_hd)
pca_best = PCA(n_components=best_pca_n).fit(X_hd_s)
print('Explained variance ratio sum for selected components: {:.4f}'.format(pca_best.explained_variance_ratio_.sum()))


Synthetic high-dim data shape: (100, 2000)

Best params: {'knn__metric': 'manhattan', 'knn__n_neighbors': 7, 'pca__n_components': 10}
Best CV accuracy: 0.4800

Selected number of components: 10
Explained variance ratio sum for selected components: 0.1410
