**Q1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

Answer:
KNN is a supervised learning algorithm that makes predictions based on the similarity between data points.

In classification, it assigns a class label to a new point by looking at the majority class among its k nearest neighbors.

In regression, it predicts a continuous value by taking the average (or weighted average) of the target values of its k nearest neighbors.

In both cases, neighbors are found with a distance metric such as Euclidean, Manhattan, or Minkowski.
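
The two prediction rules can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the helper name `knn_predict`, the toy data, and the choice of unweighted Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with plain Euclidean KNN (illustrative helper)."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' targets
    return float(np.mean(neighbor_targets))

# Toy 1-D data: two well-separated groups
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

label = knn_predict(X_toy, y_class, np.array([2.5]), k=3)
value = knn_predict(X_toy, y_reg, np.array([2.5]), k=3, task="regression")
print(label, value)  # 0 2.0
```

The query 2.5 has neighbors 2.0, 3.0, and 1.0, so the vote returns class 0 and the regression mean is 2.0.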

**Q2. What is the Curse of Dimensionality and how does it affect KNN performance?**

Answer:
The Curse of Dimensionality refers to the problems that arise when data has too many features (high dimensions).

In high dimensions, distances between points become less meaningful: the nearest and farthest neighbors of a point tend to be almost equally far away.

For KNN, this means neighbors are not truly “close,” reducing classification/regression accuracy.

It also increases computation cost, because every distance calculation scales with the number of features.
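
The loss of distance contrast can be seen in a small simulation. The sketch below assumes uniformly random points in a unit hypercube, which is a simplification of real data; the helper name `relative_contrast` and the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min distance from one random query to random uniform points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# A large value means the nearest neighbor is much closer than the farthest point;
# a value near zero means all points look roughly equally far away.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>5} dims: relative contrast = {c:.2f}")
```

The contrast should shrink sharply as the number of dimensions grows, which is the effect that makes "nearest" neighbors less informative for KNN.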

**Q3. What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Answer:

PCA is a dimensionality reduction technique that transforms data into a new coordinate system, where the new features (principal components) capture the maximum variance in the data.

It is different from feature selection because PCA creates new features (linear combinations of original ones), whereas feature selection simply chooses a subset of existing features.
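
The difference shows up clearly on the Wine dataset. The sketch below uses `SelectKBest` with the ANOVA F-score as one possible feature selection method; that choice, and keeping 2 features, are assumptions for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 2 of the original 13 columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
kept = [name for name, keep in zip(wine.feature_names, selector.get_support()) if keep]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_std)
print("PCA loadings shape:", pca.components_.shape)  # (2, 13)
```

The selected features keep their original names and meaning, while each row of `pca.components_` holds one weight per original feature.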

**Q4. What are eigenvalues and eigenvectors in PCA, and why are they important?**

Answer:

Eigenvectors define the direction of the new feature axes (principal components).

Eigenvalues represent the amount of variance captured by each eigenvector.

Importance: By ranking eigenvalues, we know which components capture the most information, allowing us to reduce dimensionality effectively.
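
As a rough check of this relationship, the eigen-decomposition of the covariance matrix can be computed directly and compared with scikit-learn's PCA. The sketch assumes standardized Wine features; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # re-sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 explained variance ratios:", np.round(explained_ratio[:3], 3))

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_std)
print("Matches sklearn:", np.allclose(eigenvalues, pca.explained_variance_))
```

The columns of `eigenvectors` correspond to the rows of `pca.components_`, up to a possible sign flip per component.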

**Q5. How do KNN and PCA complement each other when applied in a single pipeline?**

Answer:

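PCA reduces dimensionality by discarding low-variance directions, which often carry noise or redundant information.

KNN then operates on the reduced dataset, where distance computations are cheaper and often more meaningful. This usually improves efficiency and can improve accuracy.

Together, they handle high-dimensional data better and reduce the risk of overfitting.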
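
One way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. The values `n_components=5` and `n_neighbors=5` are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> project onto 5 principal components -> classify with KNN
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training split only
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in one pipeline keeps the scaler and PCA from seeing the validation fold, which avoids an optimistic accuracy estimate.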

**Q6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare accuracy.**

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_without = accuracy_score(y_test, y_pred)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
acc_with = accuracy_score(y_test, y_pred_scaled)

print("Accuracy without scaling:", acc_without)
print("Accuracy with scaling:", acc_with)


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


**Q8. Train KNN on PCA-transformed dataset (top 2 components). Compare accuracy.**

In [None]:
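from sklearn.decomposition import PCA

# Project the scaled data onto the top 2 principal components
pca_2 = PCA(n_components=2)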
X_train_pca = pca_2.fit_transform(X_train_scaled)
X_test_pca = pca_2.transform(X_test_scaled)

knn.fit(X_train_pca, y_train)
y_pred_pca = knn.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy with scaled data (all features):", acc_with)
print("Accuracy with PCA (2 components):", acc_pca)


**Q9. Train KNN with different distance metrics (euclidean, manhattan). Compare results.**

In [3]:
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    print(f"Accuracy with {metric} distance:", accuracy_score(y_test, y_pred))


Accuracy with euclidean distance: 0.9444444444444444
Accuracy with manhattan distance: 0.9444444444444444


**Q10. High-dimensional gene dataset pipeline explanation**

Answer:

Standardize the features, then use PCA to reduce dimensionality → Keep only the components that explain most of the variance, removing noise.

Decide components to keep → Use explained variance ratio or cumulative variance plot (e.g., retain 95% variance).

Use KNN for classification → After PCA transformation, apply KNN with an appropriate distance metric.

Evaluate the model → Use cross-validation and metrics like accuracy, precision, recall, and F1-score.

Justification:

PCA reduces risk of overfitting in high dimensions.

KNN is simple and interpretable for stakeholders.

Pipeline is computationally efficient and robust for biomedical datasets with few samples and many features.
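
The steps above can be sketched as a single cross-validated pipeline. No real gene dataset is given here, so the example uses `make_classification` to generate a synthetic stand-in with few samples and many features; the sample counts, `n_neighbors=5`, and the 95% variance threshold are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gene-expression matrix: 100 samples, 2000 features
X_genes, y_genes = make_classification(
    n_samples=100, n_features=2000, n_informative=20, random_state=0
)

gene_pipe = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps enough components to explain 95% of the variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(gene_pipe, X_genes, y_genes, cv=cv, scoring=metrics)

for name in metrics:
    print(f"{name}: {results['test_' + name].mean():.3f}")
```

Because scaling and PCA sit inside the pipeline, they are refit on the training portion of each fold, so the reported metrics are not inflated by information from the held-out samples.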