 **KNN & PCA**

# 📘 PW Skills - KNN & PCA Assignment (Colab Ready)

# ---
# ✅ Q1. What is K-Nearest Neighbors (KNN) and how does it work?
# KNN is a supervised, instance-based algorithm used for classification & regression.
# It stores the training data and defers all computation to prediction time.
# - Classification: Predicts the class label by majority vote among the k nearest neighbors.
# - Regression: Predicts the value by averaging the targets of the k nearest neighbors.
# A distance metric (e.g., Euclidean, Manhattan) determines which training points count as nearest.
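#
# Sketch (not part of the original answer): a minimal from-scratch version of the
# classification rule described above, assuming Euclidean distance and a tiny made-up
# 2-D dataset. The names toy_X, toy_y and knn_predict_sketch are hypothetical.
import numpy as np
from collections import Counter

def knn_predict_sketch(train_X, train_y, query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print("Predicted class for [1.1, 1.0]:", knn_predict_sketch(toy_X, toy_y, np.array([1.1, 1.0])))
print("Predicted class for [5.1, 5.0]:", knn_predict_sketch(toy_X, toy_y, np.array([5.1, 5.0])))
# For regression, the vote would be replaced by train_y[nearest].mean().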

# ---
# ✅ Q2. What is the Curse of Dimensionality and how does it affect KNN?
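# Curse of Dimensionality = problems that arise when data has very many features.
# - Distances concentrate: the nearest and farthest points end up almost equally far away,
#   so the "nearest" neighbors carry little information.
# - KNN becomes less reliable (irrelevant features add noise to every distance),
#   slower, and more prone to overfitting that noise.
# Mitigation: dimensionality reduction (e.g., PCA) or feature selection.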
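#
# Sketch (not part of the original answer): one way to see the "distances lose meaning"
# effect, assuming uniformly random points in the unit hypercube. The helper name
# distance_contrast_sketch and the sample sizes are illustrative choices.
import numpy as np

def distance_contrast_sketch(n_features, n_points=500, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest one
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"{d:>5} features -> relative contrast {distance_contrast_sketch(d):.2f}")
# The contrast shrinks as the number of features grows, which is why the nearest
# neighbor is barely closer than any other point in very high dimensions.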

# ---
# ✅ Q3. What is PCA and how is it different from feature selection?
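# PCA = Principal Component Analysis, a dimensionality reduction technique.
# - Creates new, uncorrelated features (principal components): linear combinations of the
#   original features, ordered by how much variance each captures.
# - Unlike feature selection, which keeps a subset of the original columns unchanged,
#   PCA transforms the features, so individual components are harder to interpret.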
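#
# Sketch (not part of the original answer): contrasting the two approaches on the Wine
# data. SelectKBest with f_classif is just one possible feature selector, chosen here
# as an assumption; the *_q3 names are illustrative.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine_q3 = load_wine()
X_q3 = StandardScaler().fit_transform(wine_q3.data)

# Feature selection: keeps 2 of the original 13 columns unchanged
selector_q3 = SelectKBest(f_classif, k=2).fit(X_q3, wine_q3.target)
kept_q3 = [name for name, keep in zip(wine_q3.feature_names, selector_q3.get_support()) if keep]
print("Selected original features:", kept_q3)

# PCA: each component is a weighted combination of all 13 original columns
pca_q3 = PCA(n_components=2).fit(X_q3)
print("PC1 weights over all features:", pca_q3.components_[0].round(2))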

# ---
# ✅ Q4. Eigenvalues & Eigenvectors in PCA
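# - Eigenvectors (of the data's covariance matrix): the directions of the principal components;
#   the first one points along the direction of maximum variance.
# - Eigenvalues: Amount of variance explained along each of those directions.
# Sorting components by eigenvalue ranks their importance; eigenvalue / sum(eigenvalues)
# is the explained variance ratio.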
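#
# Sketch (not part of the original answer): recovering the explained variance ratio by
# hand from the eigenvalues of the covariance matrix, assuming standardized Wine features.
# The *_q4 names are illustrative.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_q4 = StandardScaler().fit_transform(load_wine().data)

cov_q4 = np.cov(X_q4, rowvar=False)              # 13 x 13 covariance matrix
eigvals_q4, eigvecs_q4 = np.linalg.eigh(cov_q4)  # eigh returns eigenvalues in ascending order
order_q4 = np.argsort(eigvals_q4)[::-1]          # re-sort: largest variance first
eigvals_q4, eigvecs_q4 = eigvals_q4[order_q4], eigvecs_q4[:, order_q4]

ratio_q4 = eigvals_q4 / eigvals_q4.sum()
pca_q4 = PCA().fit(X_q4)
print("Manual ratio matches sklearn PCA:",
      np.allclose(ratio_q4, pca_q4.explained_variance_ratio_))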

# ---
# ✅ Q5. How do KNN & PCA complement each other?
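# - PCA reduces dimensions → mitigates the curse of dimensionality.
# - KNN on the reduced feature space is faster and often (not always) more accurate,
#   since low-variance directions that mostly add noise to distances are dropped.
# - Together (usually with scaling first): a reasonable pipeline for high-dimensional data.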
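#
# Sketch (not part of the original answer): chaining scaling, PCA and KNN in a single
# scikit-learn pipeline. The choice of 5 components and 5 folds is an arbitrary
# assumption for illustration; the *_q5 names are illustrative.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_q5, y_q5 = load_wine(return_X_y=True)

# Scaler and PCA are re-fit on the training part of every fold, so no test data leaks in
pipe_q5 = make_pipeline(StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=5))
scores_q5 = cross_val_score(pipe_q5, X_q5, y_q5, cv=5)
print("5-fold CV accuracy (scale -> PCA -> KNN):", scores_q5.mean())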

# ---
# ✅ Q6. KNN on Wine Dataset (with & without scaling)
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:", acc_with_scaling)

# ---
# ✅ Q7. PCA on Wine Dataset (Explained Variance Ratio)
from sklearn.decomposition import PCA

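# Standardize first: PCA is variance-based, so unscaled features with large
# ranges (e.g., proline) would otherwise dominate the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
pca.fit(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)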
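
# Sketch (not part of the original answer): turning the explained variance ratio into a
# cumulative curve and reading off how many components reach a target, assuming
# standardized features and a 95% threshold. The *_q7 names are illustrative.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_q7 = StandardScaler().fit_transform(load_wine().data)
cum_var_q7 = np.cumsum(PCA().fit(X_q7).explained_variance_ratio_)

# First index where the cumulative ratio reaches 0.95, converted to a component count
n_95_q7 = int(np.argmax(cum_var_q7 >= 0.95)) + 1
print("Cumulative explained variance:", cum_var_q7.round(3))
print("Components needed for 95% variance:", n_95_q7)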

# ---
# ✅ Q8. KNN on PCA-transformed dataset (top 2 components)
# Fit PCA on the scaled training split only (avoids test-set leakage) and
# reuse the Q6 split so both accuracies are measured on the same test set.
pca2 = PCA(n_components=2)
X_train_pca = pca2.fit_transform(X_train_scaled)
X_test_pca = pca2.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on original (scaled) dataset:", acc_with_scaling)
print("Accuracy on PCA (2 components):", acc_pca)

# ---
# ✅ Q9. KNN with different distance metrics (Euclidean vs Manhattan)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)
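
# Sketch (not part of the original answer): the two metrics compared above differ only
# in how coordinate differences are aggregated. The points below are made-up values.
import numpy as np

point_a_q9 = np.array([1.0, 2.0, 3.0])
point_b_q9 = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared differences -> sqrt(9 + 16 + 0)
euclid_q9 = np.sqrt(((point_a_q9 - point_b_q9) ** 2).sum())
# Manhattan: sum of absolute differences -> 3 + 4 + 0
manhat_q9 = np.abs(point_a_q9 - point_b_q9).sum()

print("Euclidean distance:", euclid_q9)
print("Manhattan distance:", manhat_q9)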

# ---
# ✅ Q10. High-dimensional gene expression dataset use-case
# Step 1: Standardize features, then use PCA to reduce dimensionality (retain components explaining ~95% variance).
# Step 2: Decide #components using a cumulative explained variance plot.
# Step 3: Train KNN on the reduced data.
# Step 4: Evaluate using accuracy, precision, recall, F1, with cross-validation
#         (fit the scaler and PCA inside each fold to avoid leakage).
# Step 5: Business Justification:
# - PCA can reduce noise and lowers the overfitting risk when features far outnumber samples.
# - KNN is simple, and each prediction can be explained by its nearest samples.
# - The pipeline balances accuracy & generalization for biomedical data.

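# Example Code (Synthetic high-dim simulation)
# Note: features and labels are random, so accuracy should hover around chance (~0.5);
# the code only demonstrates the workflow.
import numpy as np

np.random.seed(42)  # reproducible synthetic data
X_highdim = np.random.rand(100, 500)  # 100 samples, 500 features
y_highdim = np.random.randint(0, 2, 100)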

pca_hd = PCA(n_components=50)
X_reduced = pca_hd.fit_transform(X_highdim)

X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(X_reduced, y_highdim, test_size=0.3, random_state=42)
knn_hd = KNeighborsClassifier(n_neighbors=5)
knn_hd.fit(X_train_hd, y_train_hd)

acc_hd = accuracy_score(y_test_hd, knn_hd.predict(X_test_hd))
print("Accuracy on high-dimensional synthetic dataset (PCA+KNN):", acc_hd)
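
# Sketch (not part of the original answer): Steps 1 to 4 of Q10 as one cross-validated
# pipeline, assuming random synthetic data in place of real gene expression values.
# PCA(n_components=0.95) keeps as many components as needed for ~95% variance.
# The *_q10 names, fold count and seed are illustrative choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng_q10 = np.random.default_rng(42)
X_q10 = rng_q10.random((100, 500))     # 100 samples, 500 features
y_q10 = rng_q10.integers(0, 2, 100)    # random binary labels

pipe_q10 = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                         KNeighborsClassifier(n_neighbors=5))
cv_q10 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics_q10 = ["accuracy", "precision", "recall", "f1"]
results_q10 = cross_validate(pipe_q10, X_q10, y_q10, cv=cv_q10, scoring=metrics_q10)

for metric in metrics_q10:
    print(f"{metric}: {results_q10['test_' + metric].mean():.3f}")
# Because the labels are random, these scores only show the mechanics of the evaluation.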
