# **KNN AND PCA ASSIGNMENT**

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Ans1. K-Nearest Neighbors (KNN) is a simple supervised machine learning algorithm that makes predictions based on the closest data points in the training set.

It stores all training examples and, for a new input, finds the K nearest neighbors using a distance metric like Euclidean distance.

In classification, it predicts the class that most of the K neighbors belong to (majority vote).

In regression, it predicts the average of the neighbors’ values.
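
To make the two prediction rules concrete, here is a minimal from-scratch sketch using NumPy. The helper name `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    # Euclidean distance from the query to every stored training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (hypothetical): two well-separated clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)  # 0 11.0
```

The same neighbor search serves both tasks; only the final aggregation step (vote versus mean) changes.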

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Ans2. The Curse of Dimensionality refers to problems that arise when data has a very large number of features (dimensions). As the number of dimensions increases, the volume of the feature space grows exponentially, so a fixed number of data points becomes extremely sparse and the distances between them become less informative. This makes it harder for models to find real patterns in the data.

How it affects KNN performance:

Distance loses meaning: In high dimensions, distances between points tend to become similar, so KNN cannot reliably tell which points are truly “nearest.”

Degraded accuracy: Since KNN depends on distances to find neighbors, its ability to correctly classify or predict drops.

More computation: Calculating distances in many dimensions requires more time and resources.
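
The "distance loses meaning" effect can be observed with a small simulation. The sketch below assumes uniformly random points (an illustrative choice, not real data) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero: the nearest and farthest points end up at almost the same distance, so the "K nearest" neighbors are barely closer than any other point.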

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Ans3. Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a new set of principal components: uncorrelated linear combinations of the original features, ordered by how much of the data's variance each one captures. Projecting the data onto the first few components retains most of the variance with fewer dimensions. PCA is therefore a feature extraction (transformation) method, in which every component mixes all of the original features. Feature selection, by contrast, keeps a subset of the original features unchanged and discards the rest, so the retained features stay directly interpretable.
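
The contrast can be seen side by side in scikit-learn. This is a small sketch on the Wine dataset; `SelectKBest` with `f_classif` is just one possible feature selection method chosen for illustration, and keeping 2 features is an arbitrary assumption.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature extraction: each new column is a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # (2, 13)

# Feature selection: keeps 2 of the original columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)
```

Each row of `pca.components_` holds one weight per original feature, whereas the selector simply returns the names of the columns it kept.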

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Ans4. Eigenvectors of the data's covariance matrix are mutually orthogonal directions (new axes) in feature space. The first points along the direction of greatest variance, and each subsequent one captures the most remaining variance. Each eigenvector defines one principal component.
Eigenvalues are the scalars that tell you how much variance in the data is captured along each eigenvector direction. A larger eigenvalue means that its corresponding eigenvector explains more of the data’s variability.
You rank principal components by their eigenvalues, from highest to lowest, because those with larger eigenvalues capture the most variance.

By selecting top eigenvectors (with largest eigenvalues), PCA reduces dimensionality while retaining most of the original variance.

This process lets you simplify data and remove noise or redundant dimensions with minimal information loss.
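
The steps above can be sketched directly with NumPy. The synthetic two-feature data below is an assumption made for illustration (the second feature is built to be strongly correlated with the first), so nearly all variance should fall on one eigenvector.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 2-D data (hypothetical): feature 2 is mostly a scaled copy of feature 1
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # re-sort so the largest comes first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)

# Keep only the top eigenvector: project the 2-D data down to 1-D
X_reduced = X_centered @ eigenvectors[:, :1]
print("Reduced shape:", X_reduced.shape)  # (200, 1)
```

The variance of the projected data equals the largest eigenvalue, which is why dividing each eigenvalue by their sum gives the explained variance ratio.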

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Ans5. PCA first reduces dimensionality by transforming the data into a smaller set of principal components, keeping most of the variance while discarding low-variance directions that are often noise or redundancy. This helps mitigate the curse of dimensionality that can weaken KNN’s distance-based predictions.
KNN then runs on the PCA-transformed data, where distances are more meaningful and computations are faster because there are fewer features to compare. This can improve KNN’s accuracy and efficiency compared to using all original high-dimensional features, although the gain is not guaranteed because PCA ignores the class labels when choosing its directions.
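
One way to chain the two steps is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. The choices of 2 components, 5 neighbors, and 5 folds are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify, fitted as a single estimator
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold re-fits the scaler and PCA on its own training portion only
scores = cross_val_score(pca_knn, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in one pipeline means the scaler and PCA never see the held-out fold, which keeps the accuracy estimate honest.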



In [6]:
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
#ans6.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# load data
X, y = load_wine(return_X_y=True)

# split once so both models are evaluated on the same test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# without scaling
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print("Accuracy without scaling:", accuracy_score(y_test, knn.predict(X_test)))

# with scaling: fit the scaler on the training split only, so no test statistics leak in
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_s, y_train)
print("Accuracy with scaling:", accuracy_score(y_test, knn_scaled.predict(X_test_s)))


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629


In [7]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
#Ans7.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. load wine data
wine = load_wine()
X = wine.data  # 178 samples, 13 features

# 2. standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. fit PCA with all components
pca = PCA()
pca.fit(X_scaled)

# 4. print explained variance ratio
print("Explained variance ratio for each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")


Explained variance ratio for each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


In [8]:
#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
#Ans8.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# load wine data
X, y = load_wine(return_X_y=True)

# split raw data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# scale raw features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# train KNN on original scaled data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
orig_acc = accuracy_score(y_test, knn.predict(X_test_scaled))

# PCA reduce to top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# train KNN on PCA data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
pca_acc = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print("Accuracy (original):", orig_acc)
print("Accuracy (PCA 2 components):", pca_acc)


Accuracy (original): 0.9629629629629629
Accuracy (PCA 2 components): 0.9814814814814815


In [9]:
#Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
#Ans9.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# load data
X, y = load_wine(return_X_y=True)

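# scale features (important for a fair comparison of distance metrics)
# note: the scaler is fit on the full dataset here; for a strict evaluation,
# fit it on the training split only so no test statistics leak into training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)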

# split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# KNN with Euclidean distance
knn_euc = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euc.fit(X_train, y_train)
acc_euc = accuracy_score(y_test, knn_euc.predict(X_test))

# KNN with Manhattan distance
knn_man = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_man.fit(X_train, y_train)
acc_man = accuracy_score(y_test, knn_man.predict(X_test))

print("Euclidean Distance Accuracy:", acc_euc)
print("Manhattan Distance Accuracy:", acc_man)


Euclidean Distance Accuracy: 0.9629629629629629
Manhattan Distance Accuracy: 0.9629629629629629


In [10]:
#Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.
#Explain how you would:
#● Use PCA to reduce dimensionality
#● Decide how many components to keep
#● Use KNN for classification post-dimensionality reduction
#● Evaluate the model
#● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data
#Ans10.
import numpy as np
from sklearn.datasets import load_wine  # use wine dataset as stand-in for gene data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# --- 1) Load and scale the dataset ---
X, y = load_wine(return_X_y=True)  # substitute with your gene expression data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 2) Train/test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3) Baseline KNN (no PCA) ---
knn_baseline = KNeighborsClassifier(n_neighbors=5)
knn_baseline.fit(X_train, y_train)
y_pred_base = knn_baseline.predict(X_test)

print("=== Baseline KNN (No PCA) ===")
print("Test Accuracy:", accuracy_score(y_test, y_pred_base))
print(classification_report(y_test, y_pred_base))

# --- 4) PCA dimensionality reduction ---
pca = PCA()
pca.fit(X_train)

# show explained variance ratio
print("\nExplained Variance Ratio (all components):")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

# choose number of components to retain e.g., 95% variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cum_var >= 0.95) + 1
print(f"\n# components for ~95% variance: {n_components_95}")

# apply PCA with top components
pca = PCA(n_components=n_components_95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# --- 5) KNN on PCA features ---
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

print("\n=== KNN after PCA ===")
print("Test Accuracy:", accuracy_score(y_test, y_pred_pca))
print(classification_report(y_test, y_pred_pca))

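# --- 6) Optional: cross-validation comparison ---
# pipelines re-fit the scaler and PCA inside each fold, so no held-out data leaks in
from sklearn.pipeline import make_pipeline
pipe_base = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=n_components_95),
                         KNeighborsClassifier(n_neighbors=5))
cv_scores_base = cross_val_score(pipe_base, X, y, cv=5, scoring='accuracy')
cv_scores_pca = cross_val_score(pipe_pca, X, y, cv=5, scoring='accuracy')

print("\nCross-val Accuracies (no PCA):", cv_scores_base.mean())
print("Cross-val Accuracies (with PCA):", cv_scores_pca.mean())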


=== Baseline KNN (No PCA) ===
Test Accuracy: 0.9444444444444444
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      0.86      0.92        21
           2       0.83      1.00      0.91        15

    accuracy                           0.94        54
   macro avg       0.94      0.95      0.94        54
weighted avg       0.95      0.94      0.94        54


Explained Variance Ratio (all components):
PC1: 0.3498
PC2: 0.1972
PC3: 0.1108
PC4: 0.0777
PC5: 0.0691
PC6: 0.0525
PC7: 0.0438
PC8: 0.0247
PC9: 0.0203
PC10: 0.0193
PC11: 0.0168
PC12: 0.0110
PC13: 0.0069

# components for ~95% variance: 10

=== KNN after PCA ===
Test Accuracy: 0.9444444444444444
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      0.86      0.92        21
           2       0.83      1.00      0.91        15

    accuracy                           0