# **KNN & PCA**
---

## Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**Answer:**

- **Definition:**  
  K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression. It is a non-parametric, instance-based learning method, meaning it does not assume any underlying data distribution and makes predictions based on the entire training dataset.

- **Working Principle:**  
  The core idea is that similar data points tend to be close to each other in the feature space. For a new input sample, KNN finds the \( K \) closest training samples (neighbors) based on a distance metric (commonly Euclidean distance).

- **In Classification:**  
  - The algorithm identifies the \( K \) nearest neighbors of the new data point.  
  - It then assigns the class label that is most frequent among these neighbors (majority voting).  
  - For example, if among 5 neighbors, 3 belong to class A and 2 to class B, the new point is classified as class A.

- **In Regression:**  
  - Instead of voting, the algorithm takes the average (or weighted average) of the target values of the \( K \) nearest neighbors.  
  - This average is used as the predicted continuous value for the new data point.

- **Advantages:**  
  - Simple to understand and implement.  
  - No explicit training phase: the model simply stores the training data, so setup is fast and the cost shifts to prediction time.  
  - Naturally handles multi-class problems.

- **Disadvantages:**  
  - Computationally expensive during prediction for large datasets.  
  - Sensitive to irrelevant or redundant features and feature scaling.  
  - Performance degrades with high-dimensional data due to the curse of dimensionality.
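
The classification and regression rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the function name `knn_predict` and the toy data are hypothetical, and ties in the vote are broken by whichever label the counter encounters first.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5, task="classification"):
    """Predict for one sample using its k nearest training points (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (hypothetical): two well-separated clusters in 2-D
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

label_pred = knn_predict(X_toy, labels, [1.1, 1.0], k=3)
value_pred = knn_predict(X_toy, values, [1.1, 1.0], k=3, task="regression")
print(label_pred)   # class voted by the 3 nearest neighbors
print(value_pred)   # mean target of the 3 nearest neighbors
```

Because every prediction scans the whole training set, the cost grows with the number of stored samples, which is the prediction-time expense listed under the disadvantages.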

---

## Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

**Answer:**

- **Definition:**  
  The Curse of Dimensionality refers to the various challenges that arise when working with data in high-dimensional spaces. As the number of features (dimensions) increases, the volume of the space increases exponentially, causing data points to become sparse.

- **Implications for KNN:**  
  - **Distance Concentration:** In high dimensions, the gap between the nearest and farthest neighbor distances shrinks relative to the distances themselves. KNN then struggles to distinguish close points from distant ones, which reduces its effectiveness.  
  - **Sparsity:** With many dimensions, data points are spread thinly, so neighbors may not be truly "close," leading to noisy or unreliable predictions.  
  - **Increased Computation:** More dimensions mean more calculations for distance, increasing computational cost.

- **Effect on Model Performance:**  
  - KNN’s reliance on distance metrics means that as dimensionality grows, the notion of "closeness" loses meaning, causing poor classification or regression accuracy.  
  - Overfitting risk increases because the model may fit noise rather than meaningful patterns.

- **Mitigation Strategies:**  
  - Dimensionality reduction techniques like PCA to reduce feature space.  
  - Feature selection to remove irrelevant features.  
  - Using distance metrics or algorithms designed for high-dimensional data.
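
Distance concentration is easy to observe empirically. The sketch below assumes uniformly random points in the unit hypercube (an assumption made only for illustration) and measures the relative contrast, defined here as (farthest - nearest) / nearest distance from a random query point. The helper name `relative_contrast` is hypothetical.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

The contrast should fall sharply as the number of dimensions grows: in 2 dimensions the nearest point is many times closer than the farthest, while in 1000 dimensions all points sit at nearly the same distance from the query.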

---

## Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Answer:**

- **Principal Component Analysis (PCA):**  
  PCA is a statistical technique used for dimensionality reduction by transforming the original correlated features into a smaller set of uncorrelated variables called principal components. These components capture the maximum variance in the data.

- **How PCA Works:**  
  - Centers the data (and usually standardizes it), then computes the covariance matrix to capture how features vary together.  
  - Calculates eigenvalues and eigenvectors of the covariance matrix.  
  - Eigenvectors define directions (principal components) in the feature space, and eigenvalues quantify the variance along these directions.  
  - Projects the original data onto the top principal components, reducing dimensionality while preserving most of the variance.

- **Difference from Feature Selection:**  
  - **PCA (Feature Extraction):** Creates new features by combining original features linearly. The new features (principal components) are orthogonal and capture variance. Original features are transformed and not directly interpretable.  
  - **Feature Selection:** Selects a subset of the original features based on criteria like correlation, importance, or statistical tests. The original features remain unchanged and interpretable.

- **Summary:**  
  PCA reduces dimensionality by creating new features, while feature selection reduces dimensionality by choosing existing features.
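
The PCA steps listed above can be sketched directly with NumPy. This is an illustrative version under simple assumptions (dense data, full eigendecomposition); `pca_eig` and the synthetic `X_demo` are hypothetical names, and library implementations such as scikit-learn's `PCA` typically use an SVD instead.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)           # center each feature
    cov = np.cov(X_centered, rowvar=False)    # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]    # top directions, one per column
    projected = X_centered @ components       # new features: linear mixes of all originals
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return projected, components, explained_ratio

# Synthetic data (hypothetical): the second feature is nearly a multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

Z, W, ratio = pca_eig(X_demo, n_components=2)
print("Explained variance ratio:", ratio)
print("Loadings of the first component:", W[:, 0])
```

Each column of `W` mixes all three original features, which is the contrast with feature selection: a selector would keep two of the original columns unchanged.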

---

## Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

**Answer:**

- **Eigenvectors:**  
  In PCA, eigenvectors represent the directions in the feature space along which the data varies the most. Each eigenvector corresponds to a principal component, which is a new axis in the transformed space.

- **Eigenvalues:**  
  Each eigenvector has an associated eigenvalue that measures the amount of variance in the data along that eigenvector’s direction. Larger eigenvalues indicate that the corresponding eigenvector captures more variance.

- **Importance in PCA:**  
  - Eigenvectors define the new coordinate system for the data after transformation.  
  - Eigenvalues help rank the principal components by importance. Components with higher eigenvalues are retained because they explain more variance.  
  - By selecting components with the largest eigenvalues, PCA reduces dimensionality while preserving the most significant information.
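
These relationships can be written compactly. As a sketch, assume a centered data matrix \( X_c \) with \( n \) samples and \( d \) features; the symbols \( X_c \), \( \Sigma \), \( v_i \), and \( \lambda_i \) are notation introduced here for illustration:

\[
\Sigma = \frac{1}{n-1} X_c^{\top} X_c, \qquad
\Sigma v_i = \lambda_i v_i, \qquad
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}
\]

With the eigenvalues sorted so that \( \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \), the first \( k \) eigenvectors give the projection \( Z = X_c \, [v_1, \dots, v_k] \), and the fraction of variance retained is the sum of the first \( k \) ratios.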

---

## Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

**Answer:**

- **Complementarity:**  
  - PCA reduces the dimensionality of the dataset by extracting the most informative features, which helps alleviate the curse of dimensionality.  
  - KNN relies on distance calculations, which become unreliable in high-dimensional spaces. PCA transforms the data into a lower-dimensional space where distances are more meaningful.  
  - PCA can also reduce noise and redundancy by discarding low-variance components and decorrelating features, which often improves KNN’s accuracy and efficiency.

- **Pipeline Workflow:**  
  1. **Data Preprocessing:** Scale features to zero mean and unit variance, fitting the scaler on the training data only.  
  2. **Dimensionality Reduction:** Apply PCA to reduce the number of features while retaining most variance.  
  3. **Model Training:** Train KNN on the PCA-transformed data.  
  4. **Prediction:** Use the trained KNN model to classify or regress new data points in the reduced space.

- **Benefits:**  
  - Improved computational efficiency due to fewer features.  
  - Enhanced model generalization and reduced overfitting.  
  - Easier visualization of the data and decision regions when only 2 or 3 components are kept.
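
The workflow above maps naturally onto a scikit-learn pipeline. The sketch below uses the Wine dataset with 2 components and `n_neighbors=5` as illustrative choices; the variable names (`X_wine`, `pipe`, and so on) are picked here to avoid clashing with other cells.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42, stratify=y_wine
)

# Steps run in order: scale -> reduce -> classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
# fit() learns the scaler and PCA from the training split only;
# score() applies those fitted transforms to the test split before predicting
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Wrapping the steps in one object keeps test data out of the scaling and PCA fits, and lets the whole chain be passed to cross-validation or grid search as a single estimator.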

---



In [1]:
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load dataset
data = load_wine()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# KNN without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)
# KNN with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy without scaling: {acc_no_scaling:.4f}")
print(f"Accuracy with scaling: {acc_scaling:.4f}")

'''
Explanation:
Scaling raises test accuracy from 0.7407 to 0.9630. KNN relies on distances, so without scaling the features with the largest numeric ranges (for example, proline) dominate the distance and drown out the others.
'''

Accuracy without scaling: 0.7407
Accuracy with scaling: 0.9630


In [2]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.decomposition import PCA
# Scale data before PCA
X_scaled = StandardScaler().fit_transform(X)
# Apply PCA
pca = PCA()
pca.fit(X_scaled)
# Explained variance ratio
print("Explained variance ratio of each component:")
print(pca.explained_variance_ratio_)

Explained variance ratio of each component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
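
A common follow-up is to turn these ratios into a cumulative curve and read off how many components reach a target. The sketch below recomputes the same PCA in a self-contained way; the 95% threshold and the names `pca_full` and `n_for_95` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
pca_full = PCA().fit(StandardScaler().fit_transform(X_wine))

# Running total of variance explained by the first 1, 2, ... components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 0.95
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(4))
print("Components needed for 95% variance:", n_for_95)
```

For the ratios printed above, the cumulative total first exceeds 0.95 at the tenth component, so 10 of the 13 components are needed at that threshold.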


In [3]:
#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

# Retain top 2 components
pca_2 = PCA(n_components=2)
X_pca = pca_2.fit_transform(X_scaled)
# Split PCA data
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)
# Train KNN on PCA data
knn.fit(X_train_pca, y_train)
y_pred_pca = knn.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)
print(f"Accuracy with top 2 PCA components: {acc_pca:.4f}")

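'''
Explanation:
The first 2 components explain only about 55% of the variance, yet test accuracy is 0.9815, slightly higher than the 0.9630 obtained with all 13 scaled features. On a 54-sample test set that gap is a single sample, so the two results are effectively equivalent.
Note that the scaler and PCA were fit on the full dataset before splitting, which can make this estimate slightly optimistic.
'''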

Accuracy with top 2 PCA components: 0.9815


In [4]:
#Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

# KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)
# KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)
print(f"Accuracy with Euclidean distance: {acc_euclidean:.4f}")
print(f"Accuracy with Manhattan distance: {acc_manhattan:.4f}")

'''
Explanation:
Both distance metrics reach the same test accuracy (0.9630) on this split, so neither has an advantage here.
'''

Accuracy with Euclidean distance: 0.9630
Accuracy with Manhattan distance: 0.9630
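
For reference, both metrics are special cases of the Minkowski distance, which is also the default metric of `KNeighborsClassifier` (with `p=2`). The helper below is a hypothetical illustration of the formula, not scikit-learn's internal code.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7
d_euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16) = 5
print(d_manhattan, d_euclidean)
```

Manhattan distance sums per-feature differences, so a single large difference counts less than it does under Euclidean distance, where differences are squared first.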


---

## Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:

- Use PCA to reduce dimensionality  
- Decide how many components to keep  
- Use KNN for classification post-dimensionality reduction  
- Evaluate the model  
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

**Answer:**

- **Use PCA:**  
  Apply PCA to reduce thousands of gene expression features to a smaller set of principal components that capture most variance, reducing noise and redundancy.

- **Decide number of components:**  
  Select components explaining 90-95% cumulative variance using explained variance ratio or scree plot.

- **Use KNN:**  
  Train KNN on PCA-transformed data. Reduced dimensions improve distance metric reliability and reduce overfitting.

- **Evaluate model:**  
  Use cross-validation and metrics like accuracy, precision, recall, F1-score, ROC-AUC. Validate on independent datasets if possible.

- **Justification:**  
  - PCA mitigates curse of dimensionality and noise.  
  - KNN is simple, interpretable, and effective in reduced space.  
  - The pipeline is computationally cheap, and reducing the feature count lowers the risk of overfitting when samples are few.  
  - PCA and KNN are standard, well-documented methods, which makes the pipeline easy to reproduce and audit.

**Example code snippet:**

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume X_gene, y_gene are gene expression data and labels
X_scaled = StandardScaler().fit_transform(X_gene)
pca = PCA(n_components=0.95)  # Retain 95% variance
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y_gene, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
```
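
With few samples, a single train/test split gives a noisy estimate, and fitting the scaler and PCA before splitting lets test information leak into the transforms. A possible variant, sketched below, wraps all steps in a pipeline and evaluates it with stratified cross-validation. The data here is synthetic (`make_classification` as a hypothetical stand-in for gene expression data), and the fold count and scoring choices are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for gene expression data: few samples, many features
X_syn, y_syn = make_classification(
    n_samples=100, n_features=2000, n_informative=20, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)

pipe_cv = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),               # keep components explaining 95% of variance
    KNeighborsClassifier(n_neighbors=5),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler and PCA are refit inside every training fold,
# so no held-out fold influences the transforms
scores = cross_validate(pipe_cv, X_syn, y_syn, cv=cv, scoring=["accuracy", "f1_macro"])
print("Accuracy per fold:", scores["test_accuracy"].round(3))
print("Macro F1 per fold:", scores["test_f1_macro"].round(3))
```

Reporting the spread across folds, rather than one number, gives stakeholders a more honest picture of how stable the model is on small cohorts.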

---

The same code, run on the Wine dataset (`X`, `y` from the earlier cells) as a stand-in for gene expression data:

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Wine data (X, y) stands in here for the gene expression data and labels
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # Retain 95% variance
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9629629629629629
