

## Assignment

### Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?

The **Euclidean distance** is the straight-line distance between two points, calculated as:

\[
d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}
\]

The **Manhattan distance** is the sum of absolute differences between the points, calculated as:

\[
d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
\]

**Main differences**:
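- **Euclidean distance** (the L2 norm) measures the straight-line distance. Because each difference is squared, a single large difference on one feature can dominate the total.
- **Manhattan distance** (the L1 norm) adds up the per-feature differences along axis-aligned, grid-like paths (e.g., city blocks), so each feature contributes in proportion to its difference.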

**Effect on KNN**:
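- Euclidean distance is sensitive to large feature differences due to squaring, so an outlier on a single feature can push a point out of the neighborhood. It works well when features are continuous and scaled properly.
- Manhattan distance is less dominated by a single large difference and may perform better on high-dimensional data. Like Euclidean distance, it still needs features on comparable scales.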
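
As a small illustration of the two formulas, the sketch below computes both distances for a hypothetical query and two made-up candidate neighbors (`spread` and `spiky` are names chosen for this example). It is meant only to show that the two metrics can rank the same neighbors differently.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points, chosen so the two metrics disagree.
query = (0.0, 0.0)
spread = (3.0, 3.0)  # moderate difference on both features
spiky = (0.0, 5.0)   # one large difference on a single feature

# Manhattan: spread is 6, spiky is 5, so spiky is the nearer neighbor.
# Euclidean: spread is sqrt(18) (about 4.24), spiky is 5, so spread is nearer.
for name, dist in (("euclidean", euclidean), ("manhattan", manhattan)):
    print(name, dist(query, spread), dist(query, spiky))
```

With k = 1, a KNN model would therefore pick a different neighbor depending on the metric, which is how the choice can change predictions.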

---

### Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

The optimal value of **k** can be determined using:

1. **Cross-validation**: Split the dataset into several folds (e.g., 5 or 10). For each candidate k, train on all but one fold, evaluate on the held-out fold, and average the score across folds, using metrics such as accuracy (for classification) or mean squared error (for regression).

2. **Grid Search**: Perform an exhaustive search over a range of k values, scoring each one with cross-validation, and select the one that yields the best performance. For binary classification, odd values of k avoid tied votes.

3. **Elbow method**: Plot the validation error rate as a function of k and select the k value where the error rate stabilizes (i.e., the "elbow" point).
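
The techniques above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`knn_predict`, `cv_accuracy`), the interleaved fold scheme, and the synthetic two-class data are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, n_folds=5):
    """Mean accuracy of k-NN over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        val = data[fold::n_folds]  # every n_folds-th row is held out
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in val)
        scores.append(hits / len(val))
    return sum(scores) / n_folds

# Hypothetical data: two overlapping Gaussian clusters, seeded for repeatability.
rng = random.Random(0)
data = []
for _ in range(60):
    label = rng.choice(["a", "b"])
    center = 0.0 if label == "a" else 2.0
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))

# Odd candidates avoid tied votes with two classes.
candidates = [1, 3, 5, 7, 9, 11]
scores = {k: cv_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Plotting `1 - scores[k]` against k would give the curve used by the elbow method.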

---

### Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?

The choice of **distance metric** directly impacts how neighbors are determined:

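- **Euclidean distance** is more appropriate when features are continuous, of similar scale, and a large deviation on a single feature should count heavily.
- **Manhattan distance** might be better when the data has many dimensions, contains outliers, or when movement between points is naturally axis-aligned (grid-like).

You might choose one over the other based on:
- **Scale of data**: Both metrics are sensitive to feature scaling, so standardize features first. Euclidean is affected more strongly because differences are squared.
- **Dimensionality**: Manhattan may handle high-dimensional data better, since the contrast between near and far points tends to degrade more slowly under L1 than under L2.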
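
To make the scaling point concrete, here is a small sketch with hypothetical rows of (income, age). The values and helper names are invented for illustration. On raw data the income column dominates both metrics; after standardizing with the training statistics, the nearest neighbor changes.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows: (income in dollars, age in years).
train = [(50_000.0, 25.0), (50_500.0, 60.0), (52_000.0, 26.0)]
query = (50_400.0, 27.0)

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def standardize(rows, reference):
    """Rescale each column to zero mean and unit variance using `reference` statistics.

    Assumes no column in `reference` is constant (its standard deviation is nonzero).
    """
    cols = list(zip(*reference))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def nearest_index(points, q, metric):
    """Index of the point in `points` closest to `q` under `metric`."""
    return min(range(len(points)), key=lambda i: metric(points[i], q))

scaled_train = standardize(train, train)
scaled_query = standardize([query], train)[0]

results = {}
for name, metric in (("euclidean", math.dist), ("manhattan", manhattan)):
    raw = nearest_index(train, query, metric)
    scaled = nearest_index(scaled_train, scaled_query, metric)
    results[name] = (raw, scaled)
    print(name, "raw ->", raw, "standardized ->", scaled)
```

On the raw data both metrics pick the 60-year-old (index 1) because a 100-dollar income gap outweighs a 33-year age gap. After standardization both pick the 25-year-old (index 0).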

---

### Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?

Common hyperparameters in KNN:
- **k (number of neighbors)**: Controls the trade-off between bias and variance. A smaller k can lead to overfitting (low bias, high variance), while a larger k can underfit (high bias, low variance).
- **Distance metric**: Affects how neighbors are selected. Popular choices are Euclidean and Manhattan distances.
- **Weighting of neighbors**: Neighbors can be weighted by distance, with closer neighbors having more influence.
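
The effect of neighbor weighting can be shown with a short sketch. The inverse-distance weight `1 / distance` used here is one common choice and is an assumption of this example, as are the function name and the toy points.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k, weighted=True, eps=1e-9):
    """Vote among the k nearest neighbors; optionally weight each vote by 1/distance."""
    ranked = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for dist, label in ranked:
        # eps guards against division by zero when a neighbor coincides with the query.
        votes[label] += 1.0 / (dist + eps) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical points: one close "a" neighbor, two distant "b" neighbors.
train = [((1.0, 0.0), "a"), ((0.0, 4.0), "b"), ((4.0, 0.0), "b")]
query = (0.0, 0.0)

# Uniform votes: "b" wins 2 to 1.
print(weighted_knn_predict(train, query, k=3, weighted=False))
# Distance weights: "a" scores about 1.0, "b" scores about 0.25 + 0.25 = 0.5.
print(weighted_knn_predict(train, query, k=3, weighted=True))
```

With distance weighting, the prediction is less sensitive to the exact value of k because far neighbors contribute little.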

**Tuning techniques**:
- **Grid Search/Random Search**: Explore combinations of k, distance metrics, and weighting methods.
- **Cross-validation**: Score each candidate configuration across multiple folds, so the chosen hyperparameters reflect generalization rather than a fit to one particular split.
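
A grid search over these hyperparameters might look like the following sketch. It uses leave-one-out cross-validation to keep the code short. The parameter names (`k`, `metric`, `weights`), the helper functions, and the synthetic data are assumptions for illustration, not a specific library's API.

```python
import itertools
import math
import random
from collections import defaultdict

METRICS = {
    "euclidean": math.dist,
    "manhattan": lambda p, q: sum(abs(a - b) for a, b in zip(p, q)),
}

def predict(train, query, k, metric, weights):
    """k-NN vote with a configurable metric and 'uniform' or 'distance' weights."""
    dist = METRICS[metric]
    ranked = sorted(((dist(x, query), y) for x, y in train), key=lambda t: t[0])[:k]
    votes = defaultdict(float)
    for d, y in ranked:
        votes[y] += 1.0 if weights == "uniform" else 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

def loo_accuracy(data, **params):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = sum(predict(data[:i] + data[i + 1:], x, **params) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

# Hypothetical data: 25 points per class from two overlapping Gaussian clusters.
rng = random.Random(1)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), lab)
        for lab, c in (("a", 0.0), ("b", 2.0)) for _ in range(25)]

grid = {"k": [1, 3, 5, 7], "metric": list(METRICS), "weights": ["uniform", "distance"]}
results = {}
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid, combo))
    results[combo] = loo_accuracy(data, **params)

best = max(results, key=results.get)
print(dict(zip(grid, best)), results[best])
```

A random search would sample combinations from `grid` instead of enumerating all of them, which helps when the grid is large.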

---

### Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?

KNN stores the **training set** and searches it at prediction time, so its size affects both accuracy and cost. A larger training set:
- **Usually improves accuracy**: With denser sampling, the k nearest neighbors lie closer to the query and are more representative of it.
- **Increases computational cost**: A brute-force search computes the distance from each query to every stored point, so prediction time and memory grow with the dataset.

**Optimization techniques**:
- **Data sampling**: Use techniques like stratified sampling to reduce dataset size while preserving class distribution.
- **Dimensionality reduction**: Apply PCA (Principal Component Analysis) or feature selection to reduce the number of features. This does not reduce the number of rows, but it makes each distance calculation cheaper.
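
Stratified sampling can be sketched as follows. The function name, the `fraction` parameter, and the imbalanced toy dataset are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, fraction, seed=0):
    """Keep roughly `fraction` of the rows from each class, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append((features, label))
    sample = []
    for label, members in by_class.items():
        # Keep at least one row per class so rare classes do not disappear.
        n_keep = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n_keep))
    return sample

# Hypothetical imbalanced dataset: 90 rows of class "a", 10 rows of class "b".
rows = [((float(i),), "a") for i in range(90)] + [((float(i),), "b") for i in range(10)]
subset = stratified_sample(rows, fraction=0.2)
print(Counter(label for _, label in subset))
```

The subset keeps the original 9:1 class ratio (18 rows of "a", 2 of "b"), whereas a plain random sample of 20 rows could easily miss class "b" entirely.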

---

### Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?

**Potential drawbacks**:
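- **Computational inefficiency**: KNN does no work at training time. A brute-force search computes the distance from each query to every training point at prediction time, which can be slow and memory-intensive for large datasets.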
- **Curse of dimensionality**: High-dimensional data can make distance metrics less informative, as all points tend to become equidistant.
- **Sensitivity to noise**: KNN can be overly affected by noisy data points or outliers.
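
The curse of dimensionality can be observed with a short experiment. The sketch below draws uniform random points (an assumption chosen for simplicity) and reports the ratio of the nearest to the farthest distance from a random query. A ratio near 1 means the nearest and farthest points are almost equally far away.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, [rng.random() for _ in range(n_dims)])
             for _ in range(n_points)]
    return min(dists) / max(dists)

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

As the number of dimensions grows, the ratio moves toward 1, so "nearest" carries less and less information about similarity.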

**Solutions**:
- **Dimensionality reduction**: Use PCA or feature selection to reduce the dimensionality of the data.
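- **Efficient algorithms**: Use spatial index structures like KD-trees or Ball-trees to speed up neighbor search. These help most in low to moderate dimensions and lose their advantage as dimensionality grows.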
- **Preprocessing**: Apply data normalization, outlier detection, and noise reduction techniques before applying KNN.

---
