# KNN & PCA Assignment

**Question 1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**
- K-Nearest Neighbors (KNN) is a lazy, distance-based machine learning algorithm.

- It works by:
  - finding the K closest data points to a new data point using a distance metric (usually Euclidean distance)
  - making predictions based on those neighbors

- In classification:
  - the class is decided by majority voting among the K neighbors

- In regression:
  - the prediction is the average value of the K neighbors

- KNN does not build a model during training; it stores the data and computes distances during prediction.
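The steps above can be sketched in plain Python. This is a minimal illustration, not a replacement for `KNeighborsClassifier`: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbours)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbours) / k

# Toy data (hypothetical): two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_classify(points, labels, (2.0, 2.0)))  # A
print(knn_regress(points, values, (2.0, 2.0)))   # 11.0
```

There is no training step: all the work happens at prediction time, which is what "lazy" refers to.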

**Question 2. What is the Curse of Dimensionality and how does it affect KNN performance?**
- The Curse of Dimensionality refers to problems that occur when the number of features (dimensions) becomes very large.

- As dimensions increase:
  - distances between data points become less meaningful
  - all points appear almost equally far away
  - KNN struggles to find true “nearest” neighbors

- Effect on KNN:
  - reduced accuracy
  - increased computation time
  - poor generalization

- This is why KNN performs best with low-dimensional data.
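A small simulation can make the "equally far away" effect visible. The sketch below assumes uniformly random points in a unit hypercube, which is only a stand-in for real data. The function name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, so the "nearest" neighbor is barely closer than any other point.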

**Question 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?**
- Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a smaller set of new features called principal components.

- Key differences:

- PCA:
  - creates new features
  - combines original features
  - focuses on maximizing variance

- Feature Selection:
  - selects a subset of original features
  - does not create new features
  - focuses on relevance or importance

- In short: PCA transforms features, while feature selection chooses features.
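The contrast can be seen on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` is just one possible feature selection method, chosen here for illustration, and keeping 3 columns is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 3 of the original 13 columns (ranked by ANOVA F-score).
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: build 3 new columns; each row of components_ holds 13 weights,
# one per original feature.
pca_demo = PCA(n_components=3).fit(X_std)
print("Shape of PCA loadings:", pca_demo.components_.shape)
```

The selected features keep their original names and meaning, while each principal component is a weighted combination of all 13 inputs.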

**Question 4. What are eigenvalues and eigenvectors in PCA, and why are they important?**
- Eigenvectors of the covariance matrix of the data give the directions (principal axes) of the new feature space; the first one points along the direction of maximum variance.

- Eigenvalues give the amount of variance captured along each eigenvector.

- Importance in PCA:
  - eigenvectors define the new feature space
  - eigenvalues tell how much information each component holds
  - components with larger eigenvalues are kept
  - components with smaller eigenvalues are discarded

- This helps reduce dimensions while retaining maximum information.
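To make the roles of eigenvalues and eigenvectors concrete, here is a minimal NumPy sketch that performs PCA by eigen-decomposition of the covariance matrix. The 2-D synthetic data and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): feature 2 is strongly correlated with feature 1.
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # re-sort: largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)

# Keep only the first eigenvector: 2 features -> 1 principal component.
projected = centered @ eigenvectors[:, :1]
print("Projected shape:", projected.shape)
```

The variance of the projected data equals the largest eigenvalue, which is why eigenvalues can be read as "variance captured" by each component.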

**Question 5. How do KNN and PCA complement each other when applied in a single pipeline?**
- PCA helps reduce the number of dimensions, which:
  - reduces noise
  - removes redundant features
  - makes distances more meaningful

- KNN benefits from PCA because:
  - distance calculations become more reliable
  - computation becomes faster
  - accuracy often improves

- Typical pipeline:
  - Apply PCA to reduce dimensions
  - Train KNN on transformed data

- Together, PCA improves KNN’s efficiency and performance.
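One way to express this pipeline in scikit-learn is `make_pipeline`, sketched below on the Wine dataset. The choice of 5 components and 5 neighbors is arbitrary, and the variable names are chosen for this example only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, chained so every step is fit on training folds only.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

cv_scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Mean CV accuracy:", cv_scores.mean())
```

Wrapping the steps in a single estimator means the scaler and PCA never see the held-out fold, which avoids leaking test information into the transformation.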

**Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# KNN without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred)

# KNN with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)

Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 1.0


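- **Conclusion:** On this split, feature scaling raises test accuracy from about 0.72 to 1.0. Without scaling, features with large numeric ranges dominate the distance calculation.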
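A quick look at the per-feature spread helps explain the gap. The sketch below prints the standard deviation of each Wine feature. The variable names are chosen for this example.

```python
from sklearn.datasets import load_wine

wine = load_wine()
stds = dict(zip(wine.feature_names, wine.data.std(axis=0)))

# Print features from the largest to the smallest spread.
for name, std in sorted(stds.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} std = {std:10.3f}")
```

`proline` varies over hundreds of units while features such as `hue` vary by a fraction of a unit, so unscaled Euclidean distance is driven almost entirely by `proline`.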

**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

In [2]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)

Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]


- **Interpretation:** The first three components capture about 66.5% of the variance and the first five about 80%, so the variance is concentrated in the leading components.
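The printed ratios can be turned into a cumulative sum to see how many components are needed for a given variance target. This sketch copies the values from the output above. The helper `components_needed` is a hypothetical name.

```python
from itertools import accumulate

# Explained variance ratios copied from the printed output above.
ratios = [0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
          0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
          0.00795215]

cumulative = list(accumulate(ratios))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    return next(i + 1 for i, total in enumerate(cumulative) if total >= threshold)

for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

For these values, 5 components reach 80%, 8 reach 90%, and 10 reach 95% of the variance.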

**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

In [3]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# PCA with 2 components
pca = PCA(n_components=2)
X_pca_2 = pca.fit_transform(X_scaled)

# Train-test split
X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca_2, y, test_size=0.3, random_state=0
)

# KNN on PCA data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on PCA-transformed data:", acc_pca)

Accuracy on PCA-transformed data: 0.9814814814814815


- **Comparison:**
  - Original scaled data (13 features): 1.0
  - PCA (2 components): ~0.98

    Slight accuracy loss, but much lower dimensionality.

**Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

In [4]:
# KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))

# KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Euclidean Distance Accuracy:", acc_euclidean)
print("Manhattan Distance Accuracy:", acc_manhattan)

Euclidean Distance Accuracy: 1.0
Manhattan Distance Accuracy: 1.0


- **Conclusion:** Both metrics reach the same test accuracy (1.0) on this split, so neither outperforms the other on this dataset.
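Both metrics are special cases of the Minkowski distance, with exponent p = 1 for Manhattan and p = 2 for Euclidean. The sketch below shows how they differ on a single pair of points. The function name `minkowski` and the example points are chosen for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

origin, point = (0.0, 0.0), (3.0, 4.0)

print("Manhattan:", minkowski(origin, point, p=1))  # 7.0
print("Euclidean:", minkowski(origin, point, p=2))  # 5.0
```

Manhattan distance sums the per-feature differences, whereas Euclidean distance squares them first, so a single large difference weighs more heavily under the Euclidean metric.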

**Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.**

**Due to the large number of features and a small number of samples, traditional models overfit.**

Explain how you would:
 - Use PCA to reduce dimensionality
 - Decide how many components to keep
 - Use KNN for classification post-dimensionality reduction
 - Evaluate the model
 - Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data



In [5]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Simulated gene expression data
X, y = make_classification(
    n_samples=200,
    n_features=500,
    n_informative=50,
    random_state=0
)

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA(n_components=0.95)  # keep 95% variance
X_pca = pca.fit_transform(X_scaled)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_pca, y, cv=5)

print("Number of PCA components:", X_pca.shape[1])
print("Cross-validated Accuracy:", scores.mean())

Number of PCA components: 161
Cross-validated Accuracy: 0.6199999999999999


**Step-by-Step Pipeline Explanation**

- PCA for Dimensionality Reduction
  - Scale features
  - Apply PCA to remove noise and redundant features

- Choosing Number of Components
  - Use explained variance ratio
  - Retain components explaining ~90–95% variance

- KNN for Classification
  - Apply KNN on reduced feature space
  - Lower dimensions → better distance calculations

- Model Evaluation
  - Use cross-validation
  - Metrics: accuracy, F1-score, ROC-AUC (important for medical data)

- Business & Scientific Justification
  - Reduces overfitting
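  - Gives a compact representation that can be visualized, although each component mixes many genes and is harder to interpret than a single feature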
  - Faster computation
  - More reliable predictions for biomedical decisions
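The steps above can also be combined into a single scikit-learn `Pipeline` and tuned with `GridSearchCV`, so that scaling and PCA are refit inside every cross-validation fold. This is a sketch under assumptions: it reuses the same simulated data as the cell above, and the candidate values in `param_grid` are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated stand-in for gene expression data as in the cell above.
X_sim, y_sim = make_classification(
    n_samples=200, n_features=500, n_informative=50, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search space: number of components and number of neighbours.
param_grid = {
    "pca__n_components": [5, 10, 20, 50],
    "knn__n_neighbors": [3, 5, 11],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe, param_grid, cv=cv,
    scoring=["accuracy", "f1", "roc_auc"], refit="accuracy",
)
search.fit(X_sim, y_sim)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

The best score from a grid search is optimistic because the same folds were used to pick the parameters. A held-out test set or nested cross-validation gives a fairer estimate to report to stakeholders.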