# Assignment Code: DA-AG-016
# KNN & PCA | Assignment

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:

K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm used for classification and regression. It builds no explicit model during training: it stores the training data and defers all computation to prediction time.

How it works:

Given a query point, the algorithm finds the K nearest neighbors in the training dataset based on a distance metric (e.g., Euclidean).

In classification, the majority label among the K neighbors is assigned.

In regression, the output is the average (or weighted average) of the neighbors' values.
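
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for scikit-learn's implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: unweighted mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data (hypothetical values, two well-separated groups)
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]
targets = [1.0, 1.2, 1.1, 5.0, 5.5, 5.2]

predicted_label = knn_classify(X, labels, (2.0, 2.0), k=3)
predicted_value = knn_regress(X, targets, (2.0, 2.0), k=3)
print(predicted_label, round(predicted_value, 2))
```

For the query (2.0, 2.0) the three nearest points all carry label A, so the vote returns A, and the regression estimate is the mean of their targets, 1.1.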

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:

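The Curse of Dimensionality refers to the problems that arise when data has many features: the volume of the feature space grows exponentially with the number of dimensions, so a fixed number of samples covers it ever more thinly.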

In the context of KNN:

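As dimensions increase, data points become sparse, making the concept of "nearness" less meaningful.

Distances to the nearest and farthest neighbors become nearly equal, so ranking points by distance carries less information and KNN accuracy tends to drop.

Computational cost increases, because every distance calculation touches every feature.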
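
The loss of contrast between near and far points can be checked empirically. The sketch below is an illustration under assumed settings (uniform random points in the unit cube, 200 points, 10 repetitions, a hypothetical helper `relative_contrast`); it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(p, query) for p in points]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so runs are repeatable
contrasts = {}
for dims in (2, 10, 100, 1000):
    # Average over several queries to smooth out sampling noise
    trials = [relative_contrast(200, dims, rng) for _ in range(10)]
    contrasts[dims] = sum(trials) / len(trials)
    print(f"{dims:>5} dims: relative contrast = {contrasts[dims]:.2f}")
```

The contrast shrinks as the dimension grows, which is the sense in which "nearest" stops being informative for KNN.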

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:

PCA is a dimensionality reduction technique that transforms the original features into new, uncorrelated features (principal components), ordered by how much of the data's variance each one captures.

Difference from Feature Selection:

PCA creates new features (linear combinations), while feature selection selects existing features.

PCA is unsupervised, whereas feature selection can be supervised or unsupervised.
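
The contrast can be made concrete with NumPy. This is a sketch on synthetic data (the latent-factor setup and all values are assumptions for illustration): feature selection keeps two of the original columns unchanged, while PCA builds two new columns that each mix all five original features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 correlated features driven by 2 hidden factors plus noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)  # center each feature

# Feature selection: keep the 2 original columns with the largest variance
keep = np.argsort(Xc.var(axis=0))[::-1][:2]
X_selected = Xc[:, keep]

# PCA: project onto the top 2 right singular vectors of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T

total_var = Xc.var(axis=0).sum()
print("Columns kept by selection:", sorted(keep.tolist()))
print("PCA loadings (each row mixes all 5 features):")
print(np.round(Vt[:2], 2))
print(f"Variance kept by selection: {X_selected.var(axis=0).sum() / total_var:.2f}")
print(f"Variance kept by PCA:       {X_pca.var(axis=0).sum() / total_var:.2f}")
```

Selecting by variance is only one simple, unsupervised selection rule; supervised selectors would use the labels instead.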

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:

Eigenvectors of the data's covariance matrix give the directions (principal components) of the new feature space.

Eigenvalues give the amount of variance captured along each eigenvector.

They are crucial because:

The eigenvector with the highest eigenvalue captures the most variance.

Selecting top components based on eigenvalues helps in reducing dimensionality.
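
As a rough sketch of this relationship, the eigen-decomposition can be carried out directly with NumPy. The data here is synthetic and the variable names are illustrative; scikit-learn's `PCA` computes the same quantities through an SVD rather than this explicit route.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 3-feature data: two correlated columns and one independent column
latent = rng.normal(size=(300, 1))
X = np.hstack([latent, 0.8 * latent, rng.normal(size=(300, 1))])
X = X + 0.1 * rng.normal(size=X.shape)

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # re-sort to descending variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

explained_ratio = eigenvalues / eigenvalues.sum()
X_reduced = Xc @ eigenvectors[:, :2]    # keep the top 2 components

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by variance captured.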


Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Answer:

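PCA reduces dimensionality and can discard low-variance noise, which mitigates the curse of dimensionality.

With fewer, decorrelated features, KNN's distance computations are cheaper and often more meaningful.

Together, PCA + KNN can improve model generalization, especially on high-dimensional datasets, provided the features are standardized first, since both methods are sensitive to feature scale.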
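
One way to combine the two steps is scikit-learn's `Pipeline`, sketched below on the Wine dataset. The step names, the choice of 2 components, and `n_neighbors=5` are assumptions made for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Each step is fitted on the training split only, then applied to the test split
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),   # 2 components is an arbitrary choice here
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping the steps in one estimator keeps the scaler and PCA from seeing the test rows, and lets the whole chain be passed to cross-validation or a grid search as a single object.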

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.  
(Include your Python code and output in the code box below.)  
Dataset:  
Use the Wine Dataset from sklearn.datasets.load_wine().

Answer:

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_wine()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN without scaling
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
acc_no_scaling = knn.score(X_test, y_test)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = knn_scaled.score(X_test_scaled, y_test)

print(f"Accuracy without scaling: {acc_no_scaling:.2f}")
print(f"Accuracy with scaling: {acc_scaled:.2f}")


Accuracy without scaling: 0.71
Accuracy with scaling: 0.96


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.  
(Include your Python code and output in the code box below.)  
Answer:

In [2]:
from sklearn.decomposition import PCA

# Scale the full dataset with a fresh scaler (leaves the scaler fitted in In [1] untouched)
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)
explained_variance = pca.explained_variance_ratio_

print("Explained variance ratio:")
print(explained_variance)


Explained variance ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
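
The ratios printed above can also be accumulated to see how many components are needed for a given share of the variance. The helper `components_for_variance` is a hypothetical name introduced for this sketch, and the 95% threshold is a common heuristic rather than a rule.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above
ratios = [0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
          0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
          0.01736836, 0.01298233, 0.00795215]

def components_for_variance(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    for count, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return count
    return len(ratios)

cumulative = list(accumulate(ratios))
print(f"First 2 components: {cumulative[1]:.3f} of the variance")
print(f"Components needed for 95%: {components_for_variance(ratios, 0.95)}")
```

With these ratios the first two components hold about 55% of the variance, and ten components are needed to reach 95%.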


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.  
(Include your Python code and output in the code box below.)  
Answer:

In [3]:
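# PCA with 2 components
# Note: the scaler and PCA are fitted on all samples before the split,
# so the test rows influence the projection
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X_scaled)

# Train/test split (same random_state as In [1], so the test rows match Question 6)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca_2, y, random_state=42)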

# Train KNN on PCA data
knn_pca = KNeighborsClassifier()
knn_pca.fit(X_train_pca, y_train_pca)
acc_pca = knn_pca.score(X_test_pca, y_test_pca)

print(f"Accuracy with top 2 PCA components: {acc_pca:.2f}")


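Accuracy with top 2 PCA components: 0.98

For comparison, KNN on all 13 scaled features reached 0.96 and on the unscaled features 0.71 (Question 6), so on this split two principal components were enough to match the full feature set.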


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.  
(Include your Python code and output in the code box below.)  
Answer:  

In [4]:
# Euclidean (equivalent to the default metric, minkowski with p=2)
knn_euc = KNeighborsClassifier(metric='euclidean')
knn_euc.fit(X_train_scaled, y_train)
acc_euc = knn_euc.score(X_test_scaled, y_test)

# Manhattan
knn_man = KNeighborsClassifier(metric='manhattan')
knn_man.fit(X_train_scaled, y_train)
acc_man = knn_man.score(X_test_scaled, y_test)

print(f"Euclidean accuracy: {acc_euc:.2f}")
print(f"Manhattan accuracy: {acc_man:.2f}")


Euclidean accuracy: 0.96
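Manhattan accuracy: 0.96

Both metrics reach the same accuracy on this split, so the choice between them makes no measurable difference for the scaled Wine data here.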


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.  
Due to the large number of features and a small number of samples, traditional models
overfit.  
Explain how you would:  
● Use PCA to reduce dimensionality  
● Decide how many components to keep  
● Use KNN for classification post-dimensionality reduction  
● Evaluate the model  
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data  
(Include your Python code and output in the code box below.)  
Answer:  
When working with high-dimensional gene expression data, where the number of features (genes) is much larger than the number of patient samples, models like KNN can easily overfit due to noise and sparsity. Here's how we can build a robust and interpretable machine learning pipeline using PCA + KNN, especially suited for biomedical use cases.

Strategy Overview:  
🔹 1. Use PCA to Reduce Dimensionality

Why? Gene expression datasets often contain thousands of features, many of which are correlated or irrelevant.

What? PCA helps by transforming the data into a smaller set of uncorrelated variables (principal components) that retain most of the variance.

🔹 2. Decide How Many Components to Keep

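Use the cumulative explained variance ratio to choose the smallest number of components that captures a target share of the variance (95% is a common heuristic).

This keeps most of the variation in the data. High variance does not guarantee relevance to the cancer type, so the component count is best confirmed by cross-validation.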

🔹 3. Use KNN for Classification Post-Dimensionality Reduction

KNN is suitable when features are on a comparable scale and dimensionality is low.

Reducing dimensionality with PCA lowers the risk of overfitting to noise, which helps KNN when samples are limited.

🔹 4. Evaluate the Model

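Use stratified k-fold cross-validation to get reliable performance estimates, ideally fitting the scaler and PCA on the training folds only so that no information leaks from the held-out fold.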

Measure accuracy, and optionally other metrics like precision/recall if classes are imbalanced.

🔹 5. Justify to Stakeholders

The pipeline is:

Scientifically sound: PCA suppresses low-variance noise while retaining the dominant patterns of variation in the expression data.

Interpretable: Easy to understand how data is transformed and classified.

Efficient: Reduces training and prediction time and can improve generalization.

Proven: Widely used in genomic and biomedical research.  
Python Code:

In [5]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# 1. Simulate high-dimensional gene expression dataset
# 100 samples, 1000 features (like gene expression), 2 cancer types
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=2, random_state=42)

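# 2. Standardize the data
# Note: for simplicity the scaler and PCA below are fitted on all samples
# before cross-validation, which can make the scores optimistic
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)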

print(f"Original feature count: {X.shape[1]}")
print(f"Reduced feature count after PCA: {X_pca.shape[1]}")
print(f"Total explained variance by selected components: {np.sum(pca.explained_variance_ratio_):.2f}")

# 4. Train and evaluate KNN with 5-fold cross-validation
# (an integer cv uses stratified folds for classifiers)
knn = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(knn, X_pca, y, cv=5)

print(f"\nCross-validated accuracy scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.2f}")
print(f"Standard deviation: {cv_scores.std():.2f}")


Original feature count: 1000
Reduced feature count after PCA: 90
Total explained variance by selected components: 0.95

Cross-validated accuracy scores: [0.7  0.65 0.7  0.6  0.6 ]
Mean accuracy: 0.65
Standard deviation: 0.04
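
The evaluation step can also be run with the preprocessing inside the cross-validation loop, so that the scaler and PCA are refitted on the training folds of each split. The sketch below assumes the same synthetic data as above; the shuffled `StratifiedKFold`, its seed, and the choice of macro-averaged precision and recall are illustrative assumptions, and its scores will not necessarily match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in for gene expression data as in the cell above
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_classes=2, random_state=42)

# Scaler and PCA are refitted on the training folds of every split
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=cv,
                         scoring=["accuracy", "precision_macro", "recall_macro"])

for metric in ("accuracy", "precision_macro", "recall_macro"):
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

Reporting precision and recall alongside accuracy follows the evaluation plan above and becomes more important when the cancer types are imbalanced.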
