1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

- K-Nearest Neighbors (KNN) is a simple, non-parametric, instance-based algorithm: it stores the training data and predicts the outcome for a new point from the k closest training points under a chosen distance metric. In classification it assigns the majority class among those neighbors; in regression it averages their target values.
- KNN in Classification
  - Compute distances between the new point and all training points.
  - Select the k nearest neighbors.
  - Assign the class label by majority vote among neighbors.
- KNN in Regression
  - Compute distances to all training points.
  - Select the k nearest neighbors.
  - Predict the target as the average (or weighted average) of neighbors’ values.
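
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `knn_predict`, its `task` argument, and the toy data are made up for this example, and it assumes Euclidean distance with unweighted votes and averages.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training points (Euclidean)."""
    # Step 1: distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Step 3a: majority vote among the neighbors' labels
        labels, counts = np.unique(neighbor_targets, return_counts=True)
        return labels[np.argmax(counts)]
    # Step 3b: unweighted mean of the neighbors' target values
    return neighbor_targets.mean()

# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

x_new = np.array([1.2, 1.4])
pred_class = knn_predict(X_train, y_class, x_new, k=3)
pred_value = knn_predict(X_train, y_value, x_new, k=3, task="regression")
print("Predicted class:", pred_class)
print("Predicted value:", pred_value)
```

The only difference between the two tasks is the last step: a vote over labels versus a mean over numeric targets.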



2. What is the Curse of Dimensionality and how does it affect KNN
performance?

- The Curse of Dimensionality describes the problems that arise when a dataset has a very large number of features (dimensions), especially relative to the number of samples.
- As the number of dimensions increases:
  - Data points become sparse (spread out), because the volume of the feature space grows exponentially.
  - Distances lose meaning: in high dimensions, all points start to look almost equally far from each other.
  - Algorithms that rely on distance or density (like KNN) struggle.
- KNN works by finding the nearest neighbors to a point using a distance measure (like Euclidean distance), so in high dimensions it suffers in two ways:
  - The gap between the nearest and farthest neighbor shrinks
    - In 2D, the closest neighbor is clearly nearer than the farthest.
    - In 100D, the relative difference between “closest” and “farthest” becomes very small.
    - Result: KNN can’t reliably tell which points are truly close.
  - Noise dominates
    - Extra irrelevant features add random variation to every distance.
    - Unweighted KNN treats all features equally, so important signals get drowned out.
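
The shrinking gap between nearest and farthest neighbors can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, defined here as (farthest − nearest) / nearest distance from a random query point; real datasets will behave differently in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Larger values mean the nearest neighbor stands out more from the farthest
contrast = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, c in contrast.items():
    print(f"{dims:>4} dims: relative contrast = {c:.2f}")
```

The contrast falls sharply as the dimension grows, which is the effect that makes "nearest" less informative for KNN.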




3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?

- PCA is a dimensionality reduction technique (a form of feature extraction).
- It transforms your original features into a new set of variables called principal components.
- These components are:
  - Linear combinations of the original features.
  - Ordered so that the first component captures the most variance in the data, the second captures the next most, and so on.
- The goal: reduce the number of dimensions while keeping as much information (variance) as possible.
- PCA is unsupervised (it ignores the target variable), and its components are harder to interpret because each one mixes all of the original features.
- Feature Selection
  - Chooses a subset of the original features and discards the rest.
  - Goal: Keep only the most relevant predictors for the target variable (or reduce redundancy).
  - Nature: Can be supervised (using target labels) or unsupervised.
  - Interpretability: Easier, because you’re still working with the original features.
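
To make the contrast concrete, the sketch below applies both approaches to the Wine dataset. The choice of `SelectKBest` with `f_classif` as the feature selector and of keeping two dimensions in each case is an assumption made for illustration; other selectors and sizes would work equally well.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: each of the 2 components mixes all 13 original features
pca = PCA(n_components=2).fit(X)
print("PCA loadings shape (components x original features):", pca.components_.shape)

# Feature selection: keep 2 original columns, chosen using the labels (supervised)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)
```

PCA returns a weight for every original feature in each component, while the selector simply returns the names of the columns it kept.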



4. What are eigenvalues and eigenvectors in PCA, and why are they
important?

- Eigenvectors
  - They are the eigenvectors of the data’s covariance matrix and represent the directions of maximum variance in the data.
  - In PCA, each eigenvector corresponds to a principal component.
- Eigenvalues
  - They represent the amount of variance captured along each eigenvector.
  - Larger eigenvalue = more information (variance) explained by that component.
- Why they are important
  - Identify principal components
    - Eigenvectors define the new axes (principal components) onto which data is projected.
  - Rank importance of components
    - Eigenvalues tell us how much variance each component explains.
    - This helps decide how many components to keep.
  - Dimensionality reduction
    - By keeping only the top k eigenvectors, we reduce dimensions while preserving most of the information.
  - Noise filtering
    - Components with very small eigenvalues capture little variance. Dropping them can improve model efficiency and generalization.
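
The roles of eigenvalues and eigenvectors can be traced by running PCA by hand with NumPy. This is a minimal sketch on synthetic data invented for the example (two correlated features plus one independent one); it is not how scikit-learn computes PCA internally, which relies on an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): features 1 and 2 share a latent factor
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)),
               2 * latent + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)  # 3 x 3 covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # reorder to descending variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance explained by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 4))

# Keep the top-k eigenvectors and project the data onto them
k = 2
X_projected = X_centered @ eigenvectors[:, :k]
print("Projected shape:", X_projected.shape)
```

The variance of the data along each retained axis equals the corresponding eigenvalue, which is why eigenvalues can be used to rank components.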




5. How do KNN and PCA complement each other when applied in a single
pipeline?

- KNN relies on distance (Euclidean, Manhattan, etc.) to find nearest neighbors.
- In high-dimensional data, distances become unreliable due to the Curse of Dimensionality (all points start to look equally far apart).
- This hurts KNN’s performance, especially when many features are noisy or redundant.
- PCA reduces dimensionality by projecting data onto fewer principal components that capture most of the variance.
- This has several benefits for KNN:
  - Removes noise & redundancy: KNN works with the directions that carry most of the variance.
  - Improves distance reliability: distances in lower dimensions are more discriminative.
  - Speeds up computation: fewer dimensions mean faster distance calculations.
  - Mitigates overfitting: KNN often generalizes better when noisy, low-variance directions are dropped.
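
One way to combine the two is a scikit-learn `Pipeline`, which refits the scaler and PCA on each training fold before KNN sees the data. The sketch below is an illustration under assumed settings (Wine dataset, 2 components, k = 5, 5-fold stratified cross-validation), not a tuned configuration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, chained so every step is fit on training folds only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Mean CV accuracy (scale + PCA + KNN):", pipe_scores.mean())
```

Wrapping the steps in a pipeline keeps test-fold information out of the preprocessing, so the reported accuracy is a fairer estimate.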



In [1]:
'''
6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset (178 samples, 13 features, 3 classes)
wine = load_wine()
X, y = wine.data, wine.target

# Stratified 70/30 split so class proportions match in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: KNN on the raw, unscaled features
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize features; the scaler is fit on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same KNN configuration, trained on the standardized features
knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

print("Accuracy WITHOUT scaling:", accuracy_no_scaling)
print("Accuracy WITH scaling   :", accuracy_scaling)

Accuracy WITHOUT scaling: 0.7222222222222222
Accuracy WITH scaling   : 0.9444444444444444


In [2]:
'''
7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
'''

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


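wine = load_wine()
X = wine.data

# Standardize first: PCA is variance-based, so large-scale features would dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA keeping all components (13 for this dataset)
pca = PCA()
pca.fit(X_scaled)

# Fraction of total variance explained by each component, in descending order
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")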

Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


In [3]:
'''
8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


# Standardize features; the scaler is fit on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline ("original" dataset): KNN on all 13 standardized features
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# Project onto the top 2 principal components (PCA fit on training data only)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Same KNN configuration on the 2-dimensional PCA representation
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on Original Dataset:", accuracy_original)
print("Accuracy on PCA (2 components):", accuracy_pca)

Accuracy on Original Dataset: 0.9444444444444444
Accuracy on PCA (2 components): 0.9444444444444444


In [4]:
'''
9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.
'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target


# Stratified 70/30 split so class proportions match in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features; the scaler is fit on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with Euclidean (L2) distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan (L1) distance; everything else is unchanged
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy with Euclidean distance:", accuracy_euclidean)
print("Accuracy with Manhattan distance:", accuracy_manhattan)

Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9814814814814815


In [6]:
'''
10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
'''
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification


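# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=50, n_classes=3, random_state=42)

# Standardize so that no single feature dominates the variance.
# NOTE: the scaler and PCA below are fit on the full dataset before
# cross-validation, so information from each test fold leaks into the
# preprocessing. Fitting them inside each fold (e.g. with a Pipeline) avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit a full PCA to inspect how variance is spread across components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Keep the smallest number of components whose cumulative variance reaches 95%
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum_var >= 0.95) + 1
print(f"Number of components to retain (95% variance): {n_components}")

# Refit PCA with the chosen number of components
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

# KNN classifier on the reduced representation
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# Evaluate with stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(knn, X_reduced, y, cv=cv, scoring='accuracy')

print("Cross-validation accuracies:", scores)
print("Mean accuracy:", scores.mean())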

Number of components to retain (95% variance): 93
Cross-validation accuracies: [0.4  0.4  0.2  0.3  0.35]
Mean accuracy: 0.32999999999999996
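
The run above scores about 0.33, roughly chance level for three balanced classes, and the 95% variance rule keeps 93 components, so very little reduction takes place on this synthetic data. A possible variant, sketched below, puts scaling and PCA inside a `Pipeline` so they are refit on each training fold, and compares a few smaller component counts by cross-validation. The candidate values (5, 10, 20, 40) and the helper name `cv_accuracy` are assumptions made for illustration, and no particular accuracy is claimed for them.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in for gene expression data as in the cell above
X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=50, n_classes=3, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def cv_accuracy(n_components):
    """Mean CV accuracy with the scaler and PCA refit inside every fold."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_components, svd_solver="full")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    return cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()

# Candidate component counts are illustrative, not tuned recommendations
results = {k: cv_accuracy(k) for k in (5, 10, 20, 40)}
for k, acc in results.items():
    print(f"n_components={k:>2}: mean CV accuracy = {acc:.3f}")
```

If the component count is chosen from these scores, the winning score is an optimistic estimate; a held-out test set or nested cross-validation would give a fairer number to report to stakeholders.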
