Q1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression problems. It is called a lazy learning algorithm because it does not build an explicit model during training. Instead, it stores the training data and defers all computation until a prediction is requested.

KNN is based on the idea that similar data points lie close to each other in the feature space.

Working of KNN:
1. Choose the value of K (number of neighbors).
2. Calculate the distance between the new data point and all training data points.
3. Select the K nearest data points.
4. Make a prediction based on the nearest neighbors.

In classification problems, KNN predicts the class label by using majority voting. The class that appears most frequently among the K nearest neighbors is assigned to the new data point.

In regression problems, KNN predicts a continuous value by taking the average of the values of the K nearest neighbors.

Thus, KNN works by comparing a new data point with the closest data points and making predictions for both classification and regression tasks.
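
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict`, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 2: distance from the query point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Toy data: two well-separated groups (made up for illustration)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

pred_class = knn_predict(X_toy, labels, [1.1, 1.0], k=3, task="classification")
pred_value = knn_predict(X_toy, values, [1.1, 1.0], k=3, task="regression")
print(pred_class, pred_value)
```

The query point sits inside the first group, so its three nearest neighbors all carry label "A" and target values 10, 12 and 11, whose mean is 11.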


Q2. What is the Curse of Dimensionality and how does it affect KNN performance?

The Curse of Dimensionality is a problem that occurs when the number of features (dimensions) in a dataset becomes very large. As the dimensions increase, the data points become more spread out in the feature space. This makes it difficult for machine learning algorithms to find meaningful patterns.

In K-Nearest Neighbors (KNN), the algorithm works by calculating the distance between data points. When the number of dimensions increases, the distance between data points becomes less meaningful because most points appear to be at similar distances from each other.

Effects of Curse of Dimensionality on KNN:
1. KNN becomes slower because it has to calculate distances in many dimensions.
2. The accuracy of KNN decreases because the nearest neighbors may not be truly similar.
3. More data is required to maintain good performance as dimensions increase.

Because of this problem, KNN tends to perform poorly on high-dimensional data and usually works better when the number of features is small or has been reduced first.
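
The claim that distances become less meaningful can be checked with a small simulation. The sketch below assumes uniformly random points in a unit hypercube and uses a hypothetical helper, `relative_contrast`, that measures how much farther the farthest point is than the nearest one, relative to the nearest distance. Real datasets have more structure than uniform noise, so treat this only as an illustration of the trend.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at nearly the same distance, so "nearest" carries less information.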


Q3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

Principal Component Analysis (PCA) is an unsupervised technique used for dimensionality reduction. It transforms the original features into a new set of uncorrelated features called principal components. The components are ordered so that the first captures the largest possible variance in the data (variance is used as a proxy for information), and each subsequent component captures the largest remaining variance.

PCA reduces the number of features while keeping most of the important information in the dataset. It does this by creating new features that are combinations of the old features.

Difference between PCA and Feature Selection:

PCA:
1. Creates new features (principal components) from the original features.
2. Features are transformed, not selected.
3. It is a feature extraction method.

Feature Selection:
1. Selects a subset of the original features.
2. Does not create new features.
3. Keeps the original meaning of each feature, so the result is easier to interpret.

In short, PCA creates new transformed features while feature selection simply chooses the best existing features from the dataset.
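
The difference can be made concrete on a toy matrix. In this sketch, feature selection keeps the two original columns with the highest variance (a deliberately simple criterion chosen for illustration; real feature selection usually uses the target as well), while PCA, computed here via SVD, builds two new columns that mix all of the originals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Toy data with 4 features; feature 1 is nearly a copy of feature 0
f0 = rng.normal(size=n)
X = np.column_stack([
    f0,
    f0 + 0.1 * rng.normal(size=n),
    0.5 * rng.normal(size=n),
    0.2 * rng.normal(size=n),
])
Xc = X - X.mean(axis=0)  # center the data

# Feature selection: keep 2 of the original columns unchanged
keep = np.sort(np.argsort(Xc.var(axis=0))[-2:])
X_selected = Xc[:, keep]

# Feature extraction (PCA via SVD): project onto the top 2 principal directions
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]            # each row holds weights over all 4 original features
X_pca = Xc @ components.T

print("Selected original columns:", keep)
print("Weights of PC1 on the original features:", np.round(components[0], 3))
```

Both outputs have two columns, but the selected columns are still original measurements, whereas the first principal component is a weighted combination (here mostly of the two correlated features).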


Q4. What are eigenvalues and eigenvectors in PCA, and why are they important?

In Principal Component Analysis (PCA), eigenvalues and eigenvectors are mathematical concepts that help in finding the most important directions in the data.

Eigenvectors of the data's covariance matrix are the directions (or axes) of the principal components. The eigenvector with the largest eigenvalue points along the direction in which the data varies the most, and the data is projected onto these directions.

Eigenvalues are numbers that tell how much variance (information) is captured along each eigenvector. A larger eigenvalue means that the corresponding eigenvector captures more of the variance in the data.

Importance in PCA:

1. Eigenvectors decide the direction of the new principal components.
2. Eigenvalues decide the importance of each principal component.
3. PCA selects the eigenvectors with the largest eigenvalues to reduce the dimensionality of the dataset while keeping maximum information.

Thus, eigenvectors and eigenvalues help PCA identify the most informative directions in the data and reduce its dimensionality effectively.
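
The roles of eigenvalues and eigenvectors can be sketched directly with NumPy by decomposing a covariance matrix. The data here is synthetic (2-D points stretched along the diagonal), chosen only so the dominant direction is easy to see.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy 2-D data: large spread along one axis, then rotated by 45 degrees
z = rng.normal(size=(300, 2)) * np.array([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = z @ R.T
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
X_1d = Xc @ eigvecs[:, :1]               # project onto the top eigenvector

print("Eigenvalues:", np.round(eigvals, 3))
print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Top eigenvector:", np.round(eigvecs[:, 0], 3))
```

The top eigenvector points roughly along the diagonal, and the variance of the projected data equals the largest eigenvalue, which is what "variance captured by a component" means.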


Q5. How do KNN and PCA complement each other when applied in a single pipeline?

K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) complement each other very well when used together in a single machine learning pipeline.

PCA is used first to reduce the number of features (dimensions) in the dataset. It discards low-variance directions, which often correspond to noise, and keeps the directions that carry most of the variance. This makes the dataset smaller and easier to handle.

After PCA, KNN is applied to the reduced dataset. Since KNN works by calculating distances between data points, having fewer features makes the distance calculations faster and more meaningful.

How they work together:
1. PCA reduces dimensionality by discarding low-variance directions.
2. The transformed data is then given to KNN.
3. KNN often performs better because distances are less affected by noise and computation is faster.

Benefits of combining PCA and KNN:
- Faster execution of the KNN algorithm.
- Often improved accuracy, because noisy low-variance directions are dropped.
- Better performance on high-dimensional data.

Thus, PCA can improve the efficiency and performance of KNN when both are used together in a single pipeline.


In [1]:
# Q5: PCA + KNN Pipeline using Wine Dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


Q6. Train a KNN Classifier on the Wine dataset with and without feature scaling.
Compare model accuracy in both cases.

Feature scaling is very important for KNN because KNN depends on distance calculation.
If features are not scaled, features with large values dominate the distance.

In this experiment, we train two KNN models on the Wine dataset:
1. Without feature scaling
2. With feature scaling using StandardScaler

Then we compare the accuracy of both models to see the effect of feature scaling.


In [2]:
# Q6: KNN with and without Feature Scaling on Wine Dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ----------- Model 1: Without Feature Scaling -----------

knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

print("Accuracy without scaling:", accuracy_no_scaling)

# ----------- Model 2: With Feature Scaling -----------

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy with scaling:", accuracy_scaled)


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


Q7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

PCA (Principal Component Analysis) is used to reduce the dimensionality of a dataset while keeping most of the important information.
The explained variance ratio shows how much variance (information) each principal component captures from the original dataset.

In this task, PCA is trained on the Wine dataset and the explained variance ratio for each principal component is printed.


In [3]:
# Q7: PCA Explained Variance Ratio on Wine Dataset

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
data = load_wine()
X = data.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
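
The ratios printed above can be accumulated to see how many components are needed for a given share of the variance. This short sketch copies the printed values into an array; the 95% threshold is an assumption chosen for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratios)
# First index where the cumulative ratio reaches the threshold, plus 1 for a count
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("Cumulative explained variance:", np.round(cumulative, 4))
print("Variance captured by first two components:", round(float(cumulative[1]), 4))
print("Components needed for 95% variance:", n_for_95)
```

The first two components together capture about 55% of the variance, and 10 of the 13 components are needed to reach 95%.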


Q8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components).
Compare the accuracy with the original dataset.

In this task, we compare the performance of KNN on:
1. The original Wine dataset
2. The PCA-transformed dataset using only the top 2 principal components

PCA reduces the number of features, which makes the model simpler and faster.
By comparing the accuracies, we can see how dimensionality reduction affects model performance.


In [4]:
# Q8: KNN on Original Data vs PCA-Reduced Data

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ----------- KNN on Original Data -----------

scaler_original = StandardScaler()
X_train_scaled = scaler_original.fit_transform(X_train)
X_test_scaled = scaler_original.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)
print("Accuracy of KNN on Original Dataset:", accuracy_original)

# ----------- PCA Transformation -----------

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# ----------- KNN on PCA Data -----------

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)
print("Accuracy of KNN on PCA-Reduced Dataset:", accuracy_pca)


Accuracy of KNN on Original Dataset: 0.9444444444444444
Accuracy of KNN on PCA-Reduced Dataset: 1.0


Q9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

KNN uses distance metrics to calculate how close data points are to each other.
Different distance metrics can affect the performance of the model.

In this task, we train two KNN models on the scaled Wine dataset:
1. Using Euclidean distance
2. Using Manhattan distance

Then we compare their accuracies to understand the effect of different distance metrics.
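
As a small worked example of the two metrics, the sketch below computes both distances between two made-up points. Euclidean distance is the straight-line distance, while Manhattan distance sums the absolute differences along each axis.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16)
euclidean = float(np.sqrt(np.sum((a - b) ** 2)))
# Manhattan: |4-1| + |6-2|
manhattan = float(np.sum(np.abs(a - b)))

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```

For these points the Euclidean distance is 5 and the Manhattan distance is 7. Because the two metrics can rank neighbors differently, the choice of metric can change which points count as nearest.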


In [5]:
# Q9: KNN with Euclidean and Manhattan Distance on Scaled Wine Dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ----------- KNN with Euclidean Distance -----------

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

print("Accuracy with Euclidean distance:", accuracy_euclidean)

# ----------- KNN with Manhattan Distance -----------

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy with Manhattan distance:", accuracy_manhattan)


Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9444444444444444


Q10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

Answer:

Background:
High-dimensional gene-expression datasets commonly have thousands of features (genes) but only a small number of samples (patients). This imbalance causes models to overfit, because they can memorize noise. Dimensionality reduction using PCA plus a simple classifier like KNN is a common, interpretable, and effective pipeline when used with careful validation.

1) Use PCA to reduce dimensionality
- Preprocessing: always standardize the features (e.g., StandardScaler) because PCA and KNN are sensitive to feature scales.
- Fit PCA on the training set only (never fit on test data). PCA finds orthogonal directions (principal components) that capture maximal variance.
- Transform both training and test data using the PCA fitted on training set.

2) Decide how many components to keep
- Use explained variance ratio: compute cumulative explained variance and choose the smallest number of components that capture a target percentage of variance (commonly 90–99%). Example: choose the number of components needed to reach 95% cumulative variance.
- Practical constraint: retaining 95% variance may still result in many components (and risk overfitting). Also consider:
  - Use a small fixed number (e.g., 10–50) chosen by cross-validation performance to balance bias-variance.
  - Use a scree plot (look for "elbow") to find diminishing returns.
  - Use nested cross-validation to select the number of components robustly.
- For genomics, prefer a conservative approach: keep enough components to retain biological signal but small enough to avoid overfitting; verify via cross-validation.

3) Use KNN for classification post-dimensionality reduction
- After PCA transform, train a KNN classifier on the reduced features.
- Typical steps:
  - Use a stratified train/test split or nested CV (outer loop for final evaluation, inner loop for hyperparameter tuning).
  - Tune K (n_neighbors) and distance metric (euclidean, manhattan) using cross-validation on training folds.
  - Use StandardScaler before PCA or inside a pipeline.
- Rationale: PCA reduces noise and correlation among original gene features; KNN then uses distances in the lower-dimensional, more meaningful subspace.
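
One way to keep scaling, PCA and KNN tuning inside the training folds is to wrap them in a scikit-learn Pipeline and tune it with GridSearchCV. The sketch below is a minimal example on simulated data; the dataset sizes and the parameter grid are assumptions chosen for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Simulated stand-in for a gene-expression matrix (few samples, many features)
X, y = make_classification(n_samples=100, n_features=300, n_informative=30,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaler and PCA are refit on each training fold, so validation folds stay unseen
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# The hold-out test set is touched once, after tuning is finished
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best setting:", round(search.best_score_, 4))
print("Hold-out test accuracy:", round(test_accuracy, 4))
```

Tuning the number of components together with K and the metric treats the PCA size as just another hyperparameter, which matches the cross-validation-based selection described in section 2.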

4) Evaluate the model
- Use a held-out test set for final evaluation (never used for training or hyperparameter tuning).
- Use stratified k-fold cross-validation (or nested CV) on the training set for hyperparameter selection and to estimate generalization performance.
- Report multiple metrics: accuracy, precision, recall, F1-score, and confusion matrix (class imbalance matters in biomedical data).
- For robust claims, use repeated CV or nested CV and report mean ± std of metrics.
- If possible, validate on an independent external cohort (best practice in biomedical studies).
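
Nested cross-validation with several metrics can be sketched as follows. An inner GridSearchCV tunes the pipeline, and an outer loop estimates how well the whole tuning procedure generalizes. The data is simulated, and the fold counts and grid are assumptions kept small so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=90, n_features=200, n_informative=20,
                           n_redundant=0, n_classes=3, random_state=1)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=1)),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"pca__n_components": [5, 10], "knn__n_neighbors": [3, 5]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="f1_macro")
scores = cross_validate(search, X, y, cv=outer_cv,
                        scoring=["accuracy", "f1_macro"])

for name in ("accuracy", "f1_macro"):
    vals = scores[f"test_{name}"]
    print(f"{name}: {vals.mean():.3f} +/- {vals.std():.3f}")
```

Reporting the mean and standard deviation across outer folds gives a more honest picture than a single split, which matters when the test set contains only a handful of patients.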

5) Justify this pipeline to stakeholders
- Simplicity & interpretability: PCA + KNN is easy to explain. PCA reduces the features to a few principal axes, and KNN classifies each patient by similarity to known patients.
- Overfitting control: dimensionality reduction shrinks the feature space in which distances are computed, decreasing the chance of overfitting on small samples.
- Computational efficiency: PCA reduces storage and compute for downstream models.
- Reproducibility & validation: emphasize that we use strict train/test splits and cross-validation, with hyperparameter tuning only inside training folds.
- Biological validation: show that the genes with the highest loadings on the top components correlate with known biology (e.g., loadings that point to known gene sets or pathways), and propose follow-up wet-lab validation.
- Risk mitigation: communicate limitations (PCA is linear, may miss nonlinear structure) and propose alternatives (nonlinear methods, feature selection by biological priors, regularized models) as needed.
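
For the biological-validation point, the genes that contribute most to a component can be read from the PCA loadings. The sketch below uses synthetic data with hypothetical gene names (`gene_0`, `gene_1`, ...) and plants a shared signal in the first five genes so there is something to recover. With real data, the top-loading genes would then be checked against known pathways.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
gene_names = np.array([f"gene_{i}" for i in range(n_genes)])  # hypothetical names

# Independent noise for every gene, plus a shared signal in the first 5 genes
X = rng.normal(size=(n_samples, n_genes))
signal = rng.normal(size=(n_samples, 1))
X[:, :5] += 3.0 * signal

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3, random_state=0).fit(X_scaled)

# Loadings of the first component: one weight per gene
loadings_pc1 = pca.components_[0]
top_idx = np.argsort(np.abs(loadings_pc1))[::-1][:5]
top_genes = gene_names[top_idx]

print("Top-loading genes on PC1:", top_genes.tolist())
print("Their loadings:", np.round(loadings_pc1[top_idx], 3))
```

The sign of a loading is arbitrary, so genes are ranked by absolute value.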

Summary:
- Pipeline: StandardScaler → PCA (fit on training) → select n_components (via explained variance + CV) → KNN (tune k & metric via CV) → final evaluation on hold-out test set and (if available) independent cohort.
- Validate thoroughly and report multiple metrics and confidence intervals; provide biological interpretation of top loadings where possible.


In [6]:
# Q10: PCA + KNN pipeline for high-dimensional gene-expression-like data
# This code simulates a gene-expression dataset, demonstrates PCA-based dimensionality reduction,
# shows ways to decide number of components, trains KNN after PCA, and evaluates the model.
# Paste into Google Colab and run. Outputs will be printed.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# -------------------------
# 1) Simulate dataset (replace with real data loading in practice)
RANDOM_STATE = 42
n_samples = 100         # small sample size (typical in genomics)
n_features = 500        # high dimensionality (genes)
n_informative = 50      # a small set of informative features
n_classes = 3

X, y = make_classification(n_samples=n_samples,
                           n_features=n_features,
                           n_informative=n_informative,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=n_classes,
                           class_sep=1.5,
                           random_state=RANDOM_STATE)

print("Dataset shape:", X.shape)
print("Class counts:", {i: int(sum(y==i)) for i in np.unique(y)})

# -------------------------
# 2) Train/test split (final hold-out to evaluate generalization)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)
print("\nTrain/test shapes:", X_train.shape, X_test.shape)

# -------------------------
# 3) Preprocess: scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit only on training
X_test_scaled  = scaler.transform(X_test)

# -------------------------
# 4) PCA: fit on training and inspect explained variance
# Use randomized SVD for speed on high-dim data
pca_full = PCA(svd_solver='randomized', random_state=RANDOM_STATE)
pca_full.fit(X_train_scaled)

explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)

print("\nExplained variance ratio (first 10):", np.round(explained[:10], 4))
print("Cumulative explained variance (first 10):", np.round(cum_explained[:10], 4))

# Decide components: example by threshold (95%) and also a practical small fixed number
threshold = 0.95
n_components_95 = int(np.searchsorted(cum_explained, threshold) + 1)
print(f"\nComponents to reach {int(threshold*100)}% variance: {n_components_95}")

# Often we choose a smaller practical number and validate by CV:
n_small = 20
n_medium = 50
n_small = min(n_small, X_train.shape[0]-1)
n_medium = min(n_medium, X_train.shape[0]-1)

# -------------------------
# 5) Transform with PCA (trained on training set)
pca_small = PCA(n_components=n_small, svd_solver='randomized', random_state=RANDOM_STATE)
pca_med   = PCA(n_components=n_medium, svd_solver='randomized', random_state=RANDOM_STATE)

X_train_pca_small = pca_small.fit_transform(X_train_scaled)
X_test_pca_small  = pca_small.transform(X_test_scaled)

X_train_pca_med = pca_med.fit_transform(X_train_scaled)
X_test_pca_med  = pca_med.transform(X_test_scaled)

print("\nShapes after PCA (small):", X_train_pca_small.shape, X_test_pca_small.shape)
print("Shapes after PCA (medium):", X_train_pca_med.shape, X_test_pca_med.shape)

# -------------------------
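# 6) KNN training and quick CV-based comparison of k and metric
# Note: the scaler and PCA above were fit on the full training set, so the CV
# scores below are slightly optimistic; wrapping all steps in a Pipeline avoids this.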
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=RANDOM_STATE)
ks = [3, 5, 7]
metrics = ['euclidean', 'manhattan']

def evaluate_options(X_tr, y_tr, X_te, y_te, label):
    print(f"\nEvaluating K options on: {label}")
    results = []
    for k in ks:
        for m in metrics:
            knn = KNeighborsClassifier(n_neighbors=k, metric=m)
            cv_scores = cross_val_score(knn, X_tr, y_tr, cv=cv, scoring='accuracy', n_jobs=1)
            mean_cv = cv_scores.mean()
            # Fit on full training and evaluate on hold-out test
            # (test accuracy is printed for reference only; selection uses CV accuracy)
            knn.fit(X_tr, y_tr)
            y_pred = knn.predict(X_te)
            test_acc = accuracy_score(y_te, y_pred)
            print(f"k={k}, metric={m} -> CV acc: {mean_cv:.4f}, Test acc: {test_acc:.4f}")
            results.append((mean_cv, test_acc, k, m))
    results.sort(reverse=True, key=lambda x: x[0])  # sort by CV acc
    best = results[0]
    print("Best (by CV): CV_acc={:.4f}, Test_acc={:.4f}, k={}, metric={}".format(best[0], best[1], best[2], best[3]))
    return best

best_small = evaluate_options(X_train_pca_small, y_train, X_test_pca_small, y_test, "PCA small ({} comps)".format(n_small))
best_med   = evaluate_options(X_train_pca_med, y_train, X_test_pca_med, y_test, "PCA medium ({} comps)".format(n_medium))

# Also evaluate on original scaled data as baseline (no PCA)
print("\nBaseline: KNN on original scaled data")
baseline_best = evaluate_options(X_train_scaled, y_train, X_test_scaled, y_test, "Original scaled data")

# -------------------------
# 7) Final evaluation: pick whichever configuration has best CV performance (as example)
# Here we demonstrate final test evaluation using 'best_small' config on PCA-small
best_k, best_metric = int(best_small[2]), best_small[3]
final_knn = KNeighborsClassifier(n_neighbors=best_k, metric=best_metric)
final_knn.fit(X_train_pca_small, y_train)
y_final = final_knn.predict(X_test_pca_small)

print("\nFinal evaluation on hold-out test (PCA small):")
print("Accuracy:", accuracy_score(y_test, y_final))
print("Classification report:\n", classification_report(y_test, y_final, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_test, y_final))

# -------------------------
# Notes for real data:
# - Replace the simulated dataset with real gene-expression matrix and labels.
# - Consider nested CV for hyperparameter tuning and an external validation cohort if possible.
# - For interpretability, examine PCA components' loadings and check if top genes map to known pathways.


Dataset shape: (100, 500)
Class counts: {np.int64(0): 34, np.int64(1): 33, np.int64(2): 33}

Train/test shapes: (80, 500) (20, 500)

Explained variance ratio (first 10): [0.0254 0.023  0.0221 0.0214 0.0212 0.0209 0.0208 0.0203 0.0198 0.0194]
Cumulative explained variance (first 10): [0.0254 0.0484 0.0705 0.0918 0.113  0.1339 0.1547 0.175  0.1948 0.2142]

Components to reach 95% variance: 71

Shapes after PCA (small): (80, 20) (20, 20)
Shapes after PCA (medium): (80, 50) (20, 50)

Evaluating K options on: PCA small (20 comps)
k=3, metric=euclidean -> CV acc: 0.3875, Test acc: 0.6000
k=3, metric=manhattan -> CV acc: 0.4625, Test acc: 0.3500
k=5, metric=euclidean -> CV acc: 0.5375, Test acc: 0.5000
k=5, metric=manhattan -> CV acc: 0.4750, Test acc: 0.6500
k=7, metric=euclidean -> CV acc: 0.4500, Test acc: 0.4500
k=7, metric=manhattan -> CV acc: 0.4500, Test acc: 0.5000
Best (by CV): CV_acc=0.5375, Test_acc=0.5000, k=5, metric=euclidean

Evaluating K options on: PCA medium (50 comps)
k=3, 