## KNN & PCA | Assignment

## QUESTIONS & ANSWERS:

1. Q. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

>> A. K-Nearest Neighbors (KNN) is a simple, non-parametric, supervised machine learning algorithm used for both classification and regression tasks. It is also known as an instance-based or lazy learning algorithm because it does not build an explicit model during training; instead, it stores the entire training dataset and makes predictions only at query time by comparing the new data point to the stored instances.

**How KNN works (general steps):**

1. Choose a value for K: the number of nearest neighbors to consider (e.g., K=5).

2. Calculate distances: for a new (test) data point, compute its distance to all points in the training dataset. The most common distance metric is Euclidean distance, but others like Manhattan, Minkowski, or weighted distances can also be used.

3. Identify the K nearest neighbors: select the K training points that are closest to the test point (smallest distances).

4. Make a prediction: aggregate the information from these K neighbors to predict the output for the test point.

**In Classification**:

> KNN predicts the class label of the test point.

> It uses majority voting (or plurality voting when there are more than two classes): the class that appears most frequently among the K nearest neighbors is assigned to the test point.

> Ties can be broken randomly or by giving more weight to closer neighbors (weighted voting, where nearer neighbors contribute more).

> Example: if K = 5 and, among the 5 nearest points, 3 belong to class "Positive" and 2 to "Negative", the prediction is "Positive".
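
The voting rule can be sketched from scratch as below. This is a minimal illustration rather than how scikit-learn implements it; the helper name `knn_classify` and the toy 1-D data are made up for this example.

```python
from collections import Counter
import numpy as np

def knn_classify(X_train, y_train, x_query, k=5):
    """Predict a label for x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_query), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): the 5 points nearest to 1.1 are 3 "Positive" and 2 "Negative"
X_train = np.array([[1.0], [1.2], [0.8], [1.5], [1.6], [9.0], [9.5]])
y_train = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Negative"]

prediction = knn_classify(X_train, y_train, x_query=[1.1], k=5)
print(prediction)  # Positive
```

With K = 5 the two far-away "Negative" points at 9.0 and 9.5 never get a vote, which is the local behavior the example describes.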

**In Regression**:

> KNN predicts a continuous numerical value.

> It takes the average (mean) of the target values of the K nearest neighbors.

> Optionally, it can use a weighted average, where closer neighbors have higher weights (e.g., inverse of distance).

> Example: if K = 3 and the target values of the 3 nearest neighbors are 4.2, 4.8, and 5.0, the unweighted prediction is (4.2 + 4.8 + 5.0) / 3 ≈ 4.67.
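
The same arithmetic, plus the inverse-distance variant, can be checked with a few lines of NumPy. The neighbor distances below are hypothetical values chosen only to show how weighting shifts the prediction toward the closest neighbor.

```python
import numpy as np

# Target values of the 3 nearest neighbors (from the example above)
neighbor_targets = np.array([4.2, 4.8, 5.0])
# Hypothetical distances from the query point to those neighbors
neighbor_distances = np.array([0.5, 1.0, 2.0])

# Plain KNN regression: unweighted mean of the neighbor targets
uniform_pred = neighbor_targets.mean()

# Weighted variant: each neighbor's weight is 1 / distance
weights = 1.0 / neighbor_distances
weighted_pred = np.sum(weights * neighbor_targets) / np.sum(weights)

print(round(uniform_pred, 2))   # 4.67
print(round(weighted_pred, 2))  # 4.49
```

The weighted prediction is pulled toward 4.2 because that neighbor is assumed to be the closest one.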

KNN is very intuitive and works well when the data has clear local structure (similar points tend to have similar outputs), but it can be computationally expensive for large datasets because each prediction requires calculating distances to all training points.



2. Q. What is the Curse of Dimensionality and how does it affect KNN
performance?


>> A. The Curse of Dimensionality refers to various phenomena that arise when working with data in high-dimensional spaces. As the number of dimensions increases, the volume of the space grows exponentially, causing data points to become increasingly sparse and to behave counterintuitively.

Key aspects of the curse include:

> Most of the volume of a high-dimensional space is concentrated near its boundaries (the center is comparatively empty).

> The distance between points becomes less meaningful: points tend to be almost equidistant from each other.

> The number of data points needed to maintain a given density grows exponentially with the number of dimensions.

**How it affects KNN performance:**

> Distance concentration / loss of discriminative power
In high dimensions, the difference between the distance to the nearest neighbor and the distance to the farthest neighbor becomes very small relative to the distances themselves. This undermines KNN's core assumption that nearby points are more similar/relevant than distant ones. All points start to look roughly equally distant, so the "nearest" neighbors are no longer meaningfully close.

> Sparsity of data
To cover the space adequately, exponentially more training examples are required as dimensions increase. With a fixed data size, neighborhoods become empty or contain very few points, leading to unreliable neighbors and poor predictions.

> Increased computational cost
Calculating distances becomes slower (more features to process), and the algorithm scales poorly with both data size and dimensionality.

> Degraded accuracy
KNN performance often drops significantly in high dimensions unless the dataset is extremely large or dimensionality reduction (e.g., PCA, feature selection) is applied first. Irrelevant/noisy dimensions further worsen the problem by adding randomness to distances.
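
Distance concentration can be observed with a small simulation. The sketch below assumes uniformly random points in the unit hypercube, which is an illustrative choice and not a property of any real dataset; `relative_contrast` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The relative contrast should shrink as the number of dimensions grows, meaning the nearest and farthest points become harder to tell apart by distance alone.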

**Mitigation techniques for KNN**:

> Dimensionality reduction (PCA, t-SNE, feature selection).

> Feature scaling/normalization (very important for distance-based models).

> Choosing a good distance metric or using adaptive metrics.

> Using approximate nearest neighbor search for large/high-dimensional data.

3. Q. What is Principal Component Analysis (PCA)? How is it different from
feature selection?

>> A. Principal Component Analysis (PCA) is an unsupervised statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional space by creating a new set of uncorrelated variables called principal components (PCs). These components are linear combinations of the original features and are ordered such that:

> The first principal component captures the maximum possible variance in the data.

> Each subsequent component captures the maximum remaining variance while being orthogonal (uncorrelated) to the previous ones.

PCA works by:

1. Standardizing the data (mean = 0, variance = 1 per feature; important because PCA is variance-based).

2. Computing the covariance (or correlation) matrix.

3. Finding its eigenvectors (directions of maximum variance) and eigenvalues (amount of variance explained by each direction).

4. Sorting components by decreasing eigenvalue and projecting the original data onto the top k eigenvectors to obtain the reduced dataset.
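
The four steps above can be sketched directly in NumPy. This is a teaching sketch on synthetic data, not a replacement for `sklearn.decomposition.PCA` (which uses an SVD-based solver); the function name `pca_project` and the data are made up for illustration.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # 1. Standardize each feature to mean 0 and variance 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by decreasing eigenvalue and keep the top k directions
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    return X_std @ eigenvectors[:, :k], eigenvalues

# Synthetic data (made up): the second feature is nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_reduced, eigenvalues = pca_project(X, k=2)
print(X_reduced.shape)                   # (200, 2)
print(eigenvalues / eigenvalues.sum())   # share of variance per component
```

Because two of the three features are almost redundant, most of the variance should land in the first two components, which is why dropping the third loses little information here.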

PCA helps remove redundancy (correlated features), mitigate the curse of dimensionality, reduce noise, improve model performance and computation speed, and enable visualization (e.g., projecting to 2D/3D).

**Difference from Feature Selection**

**PCA (Feature Extraction / Dimensionality Reduction)**

> Creates new synthetic features (principal components) as linear combinations of all original features.

> Does not select or discard original features; it transforms them.

> Unsupervised: does not use the target/label information.

> Aims to maximize explained variance (preserves overall data structure).

> The new features are usually not interpretable (they mix the original variables).

> Good when features are highly correlated or when preserving global variance is important.

**Feature Selection**

> Selects a subset of the original features and discards the rest.

> Keeps features unchanged and interpretable.

> Can be supervised (uses the target variable, e.g., mutual information, recursive feature elimination, LASSO) or unsupervised (e.g., variance threshold).

> Aims to remove irrelevant/noisy features or those with low predictive power for the specific task.

> Better for model interpretability and when domain knowledge suggests certain features are meaningless.
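
The contrast can be made concrete on the Wine dataset. The sketch below uses `SelectKBest` with the ANOVA F-score as one possible supervised selector; the choice of selector and of 3 features/components is arbitrary and only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keep 3 of the original columns (supervised, uses y)
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: build 3 new columns, each a weighted mix of all 13 originals (ignores y)
pca = PCA(n_components=3).fit(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # (3, 13)
```

The selector returns names of existing columns, while each PCA component carries a loading for every one of the 13 original features.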



4. Q.  What are eigenvalues and eigenvectors in PCA, and why are they
important?

>> A. In PCA, eigenvectors and eigenvalues come from the eigendecomposition of the data's covariance (or correlation) matrix.

> Eigenvector: Represents the direction (axis) in the feature space along which the data varies the most (or next most, etc.).
Each eigenvector defines one principal component: it gives the weights (loadings) that tell how much each original feature contributes to that component.

> Eigenvalue: Represents the amount of variance explained by the corresponding eigenvector (principal component). A larger eigenvalue means a more important direction: that component captures more of the data's variability.

**Why they are important**:

> Sorting and selection: Principal components are ranked by decreasing eigenvalue. We keep only the top k components with the largest eigenvalues (those explaining the most variance) and discard the rest; this achieves dimensionality reduction with minimal information loss.

> Variance maximization: The first eigenvector (with the largest eigenvalue) defines the direction of maximum variance. The second is orthogonal to it and captures the next highest variance, and so on.

> Explained variance ratio: Eigenvalues allow computing how much of the total variance each component explains (eigenvalue / sum of all eigenvalues). This helps decide how many components to retain (e.g., keep enough to explain 95% of the variance).

> Orthogonality: Eigenvectors of a symmetric matrix (like the covariance matrix) are orthogonal, so the principal components are uncorrelated, which removes multicollinearity issues.
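
In symbols, these points can be summarized as follows. The notation is introduced here for illustration only: $\Sigma$ is the covariance matrix of the (standardized) data, $p$ is the number of features, and $v_i$, $\lambda_i$ are the $i$-th eigenvector and eigenvalue, with the eigenvectors chosen to be mutually orthogonal.

```latex
\Sigma v_i = \lambda_i v_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \qquad
v_i^\top v_j = 0 \ \text{for } i \ne j, \qquad
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}
```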



5. Q. How do KNN and PCA complement each other when applied in a single
pipeline?

>> A. KNN (K-Nearest Neighbors) and PCA complement each other very well in a machine learning pipeline, especially for classification or regression tasks on high-dimensional data. Their combination addresses key weaknesses of each method.

**How they complement each other:**

> PCA mitigates the curse of dimensionality for KNN

- KNN relies heavily on distance metrics (e.g., Euclidean). In high dimensions, distances become less meaningful (curse of dimensionality): all points appear roughly equidistant, degrading KNN performance.

- PCA reduces dimensionality by projecting data onto the top principal components (directions of highest variance), removing noise/redundancy and making distances more reliable and discriminative. The result is a better neighborhood structure and more accurate KNN predictions.

> Faster and more scalable KNN

- High-dimensional data makes KNN slow (distance computation to every training point).

- After PCA, the dataset has far fewer features, so distance calculations become much faster, especially on large datasets.

> Noise reduction improves KNN robustness

- Irrelevant/noisy dimensions hurt KNN (they add randomness to distances).

- PCA discards low-variance (often noisy) components, giving cleaner data so that KNN focuses on meaningful patterns.

> Typical pipeline order

- Standardize features (mean = 0, std = 1): crucial for both PCA and distance-based KNN.

- Apply PCA (fit on training data only, then transform both train and test).

- Train KNN on the reduced-dimensional data.

- Predict using the same PCA transformation on new data.
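
One way to enforce this order is scikit-learn's `Pipeline`, sketched below on the Wine dataset. The step names and the choice of 2 components and K = 5 are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaler and PCA from the training split only; score() and
# predict() then reuse those fitted transformations on new data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),   # 2 components is an arbitrary choice here
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Keeping all three steps in one object makes it harder to accidentally fit the scaler or PCA on test data.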


**Benefits of the combination:**

- Improved accuracy (especially when original features are correlated/high-dimensional).

- Reduced overfitting risk in KNN (fewer dimensions, less complexity).

- Faster training and prediction.

- Often better generalization than raw KNN or KNN on selected features alone.

**Limitations to note:**

- PCA is linear and may miss non-linear patterns (consider alternatives like t-SNE/UMAP if needed, though they are not meant for direct modeling).

- PCA is unsupervised and may discard some task-specific information (supervised alternatives like LDA exist for classification).

- Requires choosing the number of components (via explained variance or cross-validation).
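
Choosing the number of components by cross-validation could look like the sketch below, which searches over PCA size and K together with `GridSearchCV`. The candidate grids are illustrative guesses, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative; "step__param" addresses a pipeline step
param_grid = {
    "pca__n_components": [2, 3, 5, 8, 13],
    "knn__n_neighbors": [3, 5, 7],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Because the search wraps the whole pipeline, scaling and PCA are refitted inside every training fold for every candidate setting.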

6. Q. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)

Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().

In [1]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data (stratified split for balanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Case 1: KNN without feature scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no)

# Case 2: KNN with feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy without feature scaling : {:.4f} ({:.1f}%)".format(acc_no_scale, acc_no_scale*100))
print("Accuracy with feature scaling    : {:.4f} ({:.1f}%)".format(acc_scaled, acc_scaled*100))
print(f"Improvement due to scaling      : {acc_scaled - acc_no_scale:.4f} ({(acc_scaled - acc_no_scale)*100:.1f}%)")

Accuracy without feature scaling : 0.7222 (72.2%)
Accuracy with feature scaling    : 0.9444 (94.4%)
Improvement due to scaling      : 0.2222 (22.2%)
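
A likely reason for the gap is that the Wine features live on very different scales, so unscaled Euclidean distance is dominated by the features with the largest spread. The sketch below simply lists the per-feature standard deviations to make that visible; it is a diagnostic added for illustration and not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Features sorted from largest to smallest standard deviation
for idx in np.argsort(stds)[::-1]:
    print(f"{wine.feature_names[idx]:30s} std = {stds[idx]:10.3f}")
```

After `StandardScaler`, every feature has unit standard deviation and contributes on a comparable footing to the distance.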


7. Q. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
(Include your Python code and output in the code box below.)

In [2]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
import numpy as np

# Load the wine dataset
wine = load_wine()
X = wine.data

# Apply PCA (without target, unsupervised)
pca = PCA()
pca.fit(X)

# Get explained variance ratios
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# Print results nicely
print("Explained variance ratio per principal component:")
print("-" * 60)
for i, ratio in enumerate(explained_variance_ratio, 1):
    print(f"PC{i:2d}: {ratio:.6f}    (cumulative: {cumulative_variance[i-1]:.6f})")
print("-" * 60)
print(f"Total variance explained by all 13 components: {cumulative_variance[-1]:.6f}")

Explained variance ratio per principal component:
------------------------------------------------------------
PC 1: 0.998091    (cumulative: 0.998091)
PC 2: 0.001736    (cumulative: 0.999827)
PC 3: 0.000095    (cumulative: 0.999922)
PC 4: 0.000050    (cumulative: 0.999972)
PC 5: 0.000012    (cumulative: 0.999985)
PC 6: 0.000008    (cumulative: 0.999993)
PC 7: 0.000003    (cumulative: 0.999996)
PC 8: 0.000002    (cumulative: 0.999997)
PC 9: 0.000001    (cumulative: 0.999999)
PC10: 0.000001    (cumulative: 0.999999)
PC11: 0.000000    (cumulative: 1.000000)
PC12: 0.000000    (cumulative: 1.000000)
PC13: 0.000000    (cumulative: 1.000000)
------------------------------------------------------------
Total variance explained by all 13 components: 1.000000
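
The run above fits PCA on the raw, unscaled features, so the first component mostly reflects whichever feature has the largest numeric range. A variant that standardizes first, as the PCA steps earlier in this assignment recommend, might look like the sketch below; no output is reproduced here because it was not part of the original run.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Standardize so every feature has mean 0 and variance 1 before PCA
X_scaled = StandardScaler().fit_transform(X)

pca_scaled = PCA().fit(X_scaled)
ratios = pca_scaled.explained_variance_ratio_

for i, (ratio, cum) in enumerate(zip(ratios, np.cumsum(ratios)), start=1):
    print(f"PC{i:2d}: {ratio:.4f}  (cumulative: {cum:.4f})")
```

With standardized inputs the variance should be spread over several components instead of being concentrated almost entirely in PC1.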


8. Q. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load and scale data
wine = load_wine()
X = wine.data
y = wine.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# KNN on original scaled (13 features)
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
orig_scores = cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy')
orig_mean = np.mean(orig_scores)
orig_std = np.std(orig_scores)

# PCA to top 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
knn_pca = KNeighborsClassifier(n_neighbors=k)
pca_scores = cross_val_score(knn_pca, X_pca, y, cv=5, scoring='accuracy')
pca_mean = np.mean(pca_scores)
pca_std = np.std(pca_scores)

# Results
print('Original scaled (13 features):', f'{orig_mean:.4f} (+/- {2*orig_std:.4f})')
print('PCA top 2 components:', f'{pca_mean:.4f} (+/- {2*pca_std:.4f})')

Original scaled (13 features): 0.9551 (+/- 0.0580)
PCA top 2 components: 0.9663 (+/- 0.0219)


9. Q.  Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load and scale data
wine = load_wine()
X = wine.data
y = wine.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train KNN with different metrics
k = 5
metrics = ['euclidean', 'manhattan']
results = {}
for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
    scores = cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy')
    results[metric] = {'mean': np.mean(scores), 'std': np.std(scores)}

# Print results
print('KNN Results with k=5 on scaled Wine dataset:')
for metric, res in results.items():
    print(f'{metric.capitalize()}: {res["mean"]:.4f} (+/- {2*res["std"]:.4f})')

KNN Results with k=5 on scaled Wine dataset:
Euclidean: 0.9551 (+/- 0.0580)
Manhattan: 0.9552 (+/- 0.0569)
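
For reference, the two metrics differ only in the exponent of the Minkowski distance. The sketch below computes both by hand on two made-up vectors; the helper `minkowski` is defined here for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

def minkowski(u, v, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

manhattan = minkowski(a, b, p=1)   # |3| + |4| + |0|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2 + 0^2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```

Manhattan distance adds up per-feature differences, while Euclidean distance squares them first and so gives more influence to the largest single difference.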


10. Q. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)

>> A. **PCA-KNN Pipeline for High-Dimensional Gene Data**
The Breast Cancer Wisconsin dataset (`load_breast_cancer`) is used here as a stand-in for gene expression data: it has 30 numeric features and 569 samples (patients), with a binary malignant/benign label. Real gene expression data would typically have far more features than samples. The PCA-KNN pipeline retains 95% of the variance (10 components) to counter the curse of dimensionality.

**Approach**
• Standardize features to handle scale differences.

• Apply PCA to reduce dimensions while preserving 95% of the variance, choosing the number of components from the cumulative explained variance.

• Train KNN (K=5) on the reduced data and evaluate with 5-fold cross-validated accuracy.

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load data
data = load_breast_cancer()
X = data.data
y = data.target
print(f'Dataset shape: {X.shape}')
print(f'Classes: {len(np.unique(y))}')

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Optimal components for 95% variance
pca = PCA()
pca.fit(X_scaled)
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
n_comp_95 = np.argmax(cumsum_var >= 0.95) + 1
print(f'Components for 95% variance: {n_comp_95}')

# KNN original
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
orig_scores = cross_val_score(knn, X_scaled, y, cv=5)
print(f'Original accuracy: {np.mean(orig_scores):.4f} (+/- {np.std(orig_scores)*2:.4f})')

# PCA-KNN
pca_opt = PCA(n_components=n_comp_95)
X_pca = pca_opt.fit_transform(X_scaled)
knn_pca = KNeighborsClassifier(n_neighbors=k)
pca_scores = cross_val_score(knn_pca, X_pca, y, cv=5)
print(f'PCA accuracy: {np.mean(pca_scores):.4f} (+/- {np.std(pca_scores)*2:.4f})')

Dataset shape: (569, 30)
Classes: 2
Components for 95% variance: 10
Original accuracy: 0.9649 (+/- 0.0192)
PCA accuracy: 0.9613 (+/- 0.0265)
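
The cell above fits the scaler and PCA on the full dataset before cross-validation, so each validation fold has already influenced the preprocessing. A stricter evaluation, sketched below, wraps all steps in a pipeline so they are refitted inside every training fold. The shuffled `StratifiedKFold`, the extra metrics, and the float `n_components=0.95` are choices made for this sketch; its scores are not reproduced here and may differ slightly from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A float n_components tells PCA to keep enough components to reach that
# fraction of explained variance, decided separately in each training fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "balanced_accuracy", "roc_auc"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    scores = results[f"test_{name}"]
    print(f"{name:17s}: {scores.mean():.4f} (+/- {2 * scores.std():.4f})")
```

Reporting balanced accuracy and ROC AUC alongside plain accuracy gives stakeholders a fuller picture when the two classes are not equally frequent.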
