

### **Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?**

**Answer:**  
- **Euclidean Distance** measures the straight-line distance between two points:
  $$
  d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
  $$

- **Manhattan Distance** sums the absolute differences:
  $$
  d = \sum_{i=1}^{n} |x_i - y_i|
  $$

**Impact on performance:**
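- **Euclidean** squares each coordinate difference, so one large gap in a single dimension can dominate the distance. It is a natural default for dense, continuous features on comparable scales.
- **Manhattan** adds absolute differences, so each coordinate gap contributes linearly and a single large gap dominates less. It is often preferred for high-dimensional or sparse data (like text) and for grid-based structures (e.g., city blocks).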
- The choice of metric can significantly impact neighbor selection and, therefore, model accuracy.
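
To make the effect on neighbor selection concrete, here is a minimal sketch in plain Python. The points `query`, `a`, and `b` are made up for illustration; they are chosen so that the nearest neighbor changes depending on the metric.

```python
import math

def euclidean(x, y):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(p - q) for p, q in zip(x, y))

# Hypothetical points, chosen only to illustrate the difference.
query = (0.0, 0.0)
a = (3.0, 0.0)  # the whole difference sits in one dimension
b = (2.0, 2.0)  # the difference is spread over two dimensions

d_euc = {"a": euclidean(query, a), "b": euclidean(query, b)}
d_man = {"a": manhattan(query, a), "b": manhattan(query, b)}

# The candidate with the smallest distance is the nearest neighbor.
nearest_euclidean = min(d_euc, key=d_euc.get)
nearest_manhattan = min(d_man, key=d_man.get)
```

Under Euclidean distance `b` is closer (about 2.83 vs. 3), while under Manhattan distance `a` is closer (3 vs. 4), so a 1-NN model would pick a different neighbor in each case.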

---

### **Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?**

**Answer:**  
- Use **Cross-Validation** (e.g., k-fold CV) to evaluate a range of candidate values for **k** and select the one with the best average validation score. Note that the number of folds in k-fold CV is unrelated to the number of neighbors k.
- Common techniques:
  - **Grid Search** for hyperparameter tuning.
  - **Elbow Method**: plot the validation error against k and choose the "elbow" point, where increasing k further stops giving a meaningful improvement.
  - **Leave-one-out cross-validation (LOOCV)** for small datasets.
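
As a rough sketch of how LOOCV can be used to pick k, the following plain-Python example scores a few candidate values on a tiny toy dataset. The data (two clusters plus one deliberately mislabeled point) and the helper names `knn_predict` and `loocv_accuracy` are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy: each point is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

# Toy data (assumed): two clusters and one noisy label inside the first cluster.
X = [(1, 1), (1, 2), (2, 1), (2, 2),
     (6, 6), (6, 7), (7, 6), (7, 7),
     (1.5, 1.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Score each candidate k and keep the first one with the highest accuracy.
scores = {k: loocv_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

With k = 1 the single noisy point misleads all of its neighbors, while k = 3 and k = 5 outvote it, which is the kind of pattern the validation curve is meant to reveal.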

---

### **Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?**

**Answer:**  
- The **distance metric** determines how neighbors are selected.
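- Both metrics are sensitive to feature scales. **Euclidean** is additionally more sensitive to outliers, because it squares each difference.
- **Manhattan** is often more robust when features have outliers or heavy-tailed distributions, and is frequently preferred for high-dimensional data.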
- Choose **Euclidean** when features are continuous and well-scaled.  
- Choose **Manhattan** for high-dimensional, sparse, or grid-like data.

Other distance metrics:

- **Minkowski Distance** (generalized form):

  $$
  d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
  $$

  Special cases:

  $$
  p = 1 \Rightarrow \text{Manhattan Distance}
  $$

  $$
  p = 2 \Rightarrow \text{Euclidean Distance}
  $$
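
The special cases can be checked numerically. This is a minimal sketch in plain Python; the function name `minkowski` and the sample points are assumptions made for illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical sample points.
x, y = (0.0, 0.0), (3.0, 4.0)

d1 = minkowski(x, y, 1)  # p = 1: Manhattan, |3| + |4|
d2 = minkowski(x, y, 2)  # p = 2: Euclidean, sqrt(9 + 16)
d3 = minkowski(x, y, 3)  # larger p puts more weight on the largest gap
```

For a fixed pair of points, the distance does not increase as p grows, so `d1 >= d2 >= d3` here.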

---

### **Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?**

**Answer:**  
**Common hyperparameters:**
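- **k (number of neighbors)** – Controls model complexity: a small k gives flexible but noise-sensitive predictions (high variance), while a large k gives smoother predictions that can miss local structure (high bias).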
- **Distance metric** – Affects how similarity is measured.
- **Weight function** – `'uniform'` (equal weight) or `'distance'` (closer neighbors have more influence).

**Tuning methods:**
- **Grid Search / Random Search** with cross-validation.
- Use **scikit-learn’s GridSearchCV** or **RandomizedSearchCV**.
- Apply **feature scaling** so that no single feature dominates the distance, ideally inside a pipeline so the scaler is fit only on the training folds during cross-validation.
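
A minimal sketch of this tuning setup with scikit-learn's `GridSearchCV` follows. The Iris dataset, the candidate grid, and the step names `"scale"` and `"knn"` are choices made here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (assumed); substitute your own features and labels.
X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for the three hyperparameters listed above.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # best combination found on this grid
best_score = search.best_score_    # its mean cross-validated accuracy
```

`RandomizedSearchCV` accepts a similar setup and samples a fixed number of combinations instead of trying all of them, which can help when the grid is large.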

---

### **Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?**

**Answer:**  
- Larger training sets generally improve generalization, but increase **memory use and computational cost** at prediction time.
- KNN is a **lazy learner**: training only stores the data (and optionally builds a search index), so most of the cost is paid during inference.
- Too small a dataset → sparse, unrepresentative neighborhoods and **noisy predictions**; too large → **slow prediction**.

**Optimization techniques:**
- Use **efficient neighbor search**: exact index structures (e.g., KD-Tree, Ball Tree) or approximate nearest neighbor methods (e.g., Locality Sensitive Hashing).
- **Data sampling**: Use representative subsets.
- Apply **dimensionality reduction** (e.g., PCA) to shrink the feature space and reduce prediction time.
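
In scikit-learn, the search structure is selected with the `algorithm` parameter. The sketch below, on synthetic data generated here purely for illustration, compares brute-force search with the two tree-based indexes and checks that they find the same neighbors; the trees change how the search is done, not its result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data (assumed): 2000 training points and 10 queries in 5 dimensions.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 5))
X_query = rng.normal(size=(10, 5))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    index = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_train)
    distances, indices = index.kneighbors(X_query)
    results[algorithm] = (distances, indices)

# Compare the neighbor indices returned by each tree with brute force.
same_neighbors = all(
    np.array_equal(results["brute"][1], results[name][1])
    for name in ("kd_tree", "ball_tree")
)
```

Tree-based indexes tend to help most in low to moderate dimensions; in very high dimensions their advantage over brute force usually shrinks.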

---

### **Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?**

**Answer:**  
**Drawbacks:**
- **Computational inefficiency** during prediction.
- **Sensitive to irrelevant features and outliers**.
- **Struggles with high-dimensional data** (curse of dimensionality).
- **Requires feature scaling** to work correctly.

**Solutions:**
- **Normalize or standardize** features.
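- Use **feature selection** or **dimensionality reduction** (e.g., PCA). t-SNE is mainly a visualization technique and is usually a poor fit as a preprocessing step, because it does not readily embed new, unseen points.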
- Use **distance-weighted KNN**.
- Apply **efficient search techniques** (KD-Trees, Ball Trees).
- Use **hybrid models** or **ensemble methods** if needed.
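
Several of these remedies can be combined in one scikit-learn pipeline. The following is a sketch under stated assumptions: the Wine dataset, `n_neighbors=5`, and `n_components=5` are arbitrary choices for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (assumed) whose features live on very different scales.
X, y = load_wine(return_X_y=True)

# Baseline: KNN on raw, unscaled features.
raw_knn = KNeighborsClassifier(n_neighbors=5)

# Standardize, reduce dimensionality, then use distance-weighted voting.
tuned_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)

raw_score = cross_val_score(raw_knn, X, y, cv=5).mean()
tuned_score = cross_val_score(tuned_knn, X, y, cv=5).mean()
```

On this dataset the scaled pipeline should score noticeably higher than the raw baseline, mostly because standardization stops the largest-valued feature from dominating the distance.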

---
