## Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?

|Aspect|Euclidean Distance|Manhattan Distance|
|---|---|---|
|**Definition**|Straight-line distance between two points.|Sum of the absolute differences of coordinates.|
|**Formula**|$\sqrt{\sum_{i=1}^{n}(x_{i} - y_{i})^2}$|$\sum_{i=1}^{n}\lvert x_{i} - y_{i}\rvert$|
|**Distance Metric Type**|L2 norm (also known as Euclidean norm)|L1 norm (also known as Taxicab or City Block distance)|
|**Sensitivity to Outliers**|More sensitive to outliers because squared differences amplify the impact.|Less sensitive to outliers because it considers absolute differences.|
|**Geometric Interpretation**|Represents the shortest distance over a straight line in Euclidean space.|Represents the distance if only vertical and horizontal movements are allowed.|
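
As a small worked example of the two formulas in the table, the sketch below computes both distances in plain Python. The function names `euclidean` and `manhattan` are illustrative and not taken from any particular library.

```python
import math

def euclidean(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, since the straight line is the shortest path.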


##### Impact of Euclidean Distance on KNN:

1. **Sensitivity to Magnitude:** Euclidean distance takes into account the magnitude of the differences between feature values. This means that features with larger ranges can dominate the distance calculation, potentially overshadowing smaller features unless proper normalization is applied.


2. **Sensitivity to Outliers:** Because it squares the differences between coordinates, Euclidean distance is more sensitive to outliers. Large differences in any single dimension will disproportionately affect the overall distance.


3. **Correlation with Actual Distance:** For features that are continuous and naturally aligned in Euclidean space, Euclidean distance can accurately reflect the actual proximity between data points.


##### Impact of Manhattan Distance on KNN:

1. **Sensitivity to Magnitude:** Manhattan distance also depends on the magnitude of differences, but each dimension contributes linearly rather than quadratically. A single large difference therefore dominates the total less than it does under Euclidean distance, although features with larger ranges still carry more weight unless the data is normalized.


2. **Robustness to Outliers:** Manhattan distance is less sensitive to outliers than Euclidean distance because it does not square the differences. This can be beneficial in datasets with noisy or extreme values.


3. **Alignment with Grid-like Data:** For data where the important relationships are along the axes, such as grid-like city layouts or high-dimensional spaces with sparse data, Manhattan distance might better capture the true nearest neighbors.
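
The sensitivity to magnitude described for both metrics can be illustrated with a small sketch. The records below are hypothetical (age in years, income in dollars), and `min_max_scaler` and `nearest` are illustrative helper names. In this example the income feature dominates both metrics until the features are rescaled to a common range.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def min_max_scaler(rows):
    """Return a function that rescales each feature to [0, 1] using the ranges in `rows`.

    Assumes every feature takes at least two distinct values (no zero range).
    """
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))

    return scale

def nearest(query, points, dist):
    """Name of the entry in the dict `points` that is closest to `query` under `dist`."""
    return min(points, key=lambda name: dist(query, points[name]))

# Hypothetical (age in years, income in dollars) records.
train = {"A": (31, 52000), "B": (60, 50500), "C": (45, 90000), "D": (20, 20000)}
query = (30, 50000)

scale = min_max_scaler(list(train.values()))
scaled_train = {name: scale(row) for name, row in train.items()}

for dist in (euclidean, manhattan):
    raw = nearest(query, train, dist)
    scaled = nearest(scale(query), scaled_train, dist)
    print(dist.__name__, "raw:", raw, "scaled:", scaled)
```

On the raw data both metrics pick `B`, whose income is closest even though the age differs by 30 years. After min-max scaling both pick `A`, which is close on both features.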

## Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

Choosing the optimal value of $k$ in K-Nearest Neighbors (KNN) is crucial for the performance of the algorithm. The right $k$ balances the trade-off between overfitting and underfitting. Here are some methods and considerations for selecting $k$:

##### Methods for Choosing $k$:

1. **Cross-Validation:** With K-fold cross-validation, split the training dataset into $K$ folds (here $K$ is the number of folds, not the number of neighbors $k$). For each candidate $k$, train the model on $K-1$ folds and validate it on the remaining fold, repeating $K$ times so that each fold serves as the validation set once. Average the performance across the folds, and select the $k$ with the best average performance.

2. **Grid Search:** Perform an exhaustive search over a range of $k$ values, evaluating the performance of the model for each $k$. The $k$ with the best performance is chosen. This is often combined with cross-validation to ensure robustness.
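
The two methods above can be combined as in the following minimal sketch, which scores a list of candidate neighbor counts with 5-fold cross-validation. It uses plain Python so each step is visible. The synthetic two-blob dataset, the helper names `knn_predict` and `cv_accuracy`, and the candidate list are all assumptions made for illustration. The neighbor count is called `n_neighbors` in the code to keep it apart from the number of folds.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, n_neighbors):
    """Majority vote among the n_neighbors training points closest to query (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, n_neighbors, n_folds=5):
    """Mean validation accuracy of KNN over n_folds folds."""
    fold_scores = []
    for fold in range(n_folds):
        # Interleaved folds: every n_folds-th sample goes to the validation set.
        val_idx = set(range(fold, len(X), n_folds))
        train_idx = [i for i in range(len(X)) if i not in val_idx]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i] for i in val_idx
        )
        fold_scores.append(correct / len(val_idx))
    return sum(fold_scores) / n_folds

# Synthetic two-class data (an assumption for illustration): two overlapping Gaussian blobs.
rng = random.Random(0)
X, y = [], []
for label, center in enumerate([(0.0, 0.0), (2.0, 2.0)]):
    for _ in range(50):
        X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
        y.append(label)

candidates = [1, 3, 5, 7, 9, 11, 15]  # odd values avoid tied votes with two classes
scores = {k: cv_accuracy(X, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores)
print("best n_neighbors:", best_k)
```

Very small values tend to follow noise in individual points and very large values tend to blur the class boundary, so the best score usually sits somewhere in between. The exact winner depends on the data.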

## Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?

The choice of distance metric in a K-Nearest Neighbors (KNN) classifier or regressor can profoundly influence the performance of the model by determining how similarity between data points is measured. This in turn affects which neighbors are considered closest, and thus which points influence the classification or regression result.

##### Situations to Choose One Over the Other

1. **Data Characteristics:**

    - **Homogeneous Features:** When features are on similar scales and have a natural Euclidean relationship, Euclidean distance is often preferred.
    - **Heterogeneous Features:** When features have different scales or types, Manhattan distance can be more appropriate because no single large difference is squared. The features should still be scaled, since larger-range features dominate either metric.
    

2. **Nature of the Problem:**

    - **Spatial Data:** For problems involving physical space or geographic locations, Euclidean distance might better capture the true distance (e.g., distance between points on a plane).
    - **Grid-like Structures:** For grid-like data or cases where movement is constrained to orthogonal paths (like in a city grid), Manhattan distance is more suitable.
    
    
3. **Presence of Outliers:**

    - **Noisy Data:** If the data contains significant outliers, Manhattan distance can provide more robust performance.
    - **Clean Data:** In datasets with fewer outliers, Euclidean distance can be more effective, especially if relationships are naturally Euclidean.
    
    
4. **Dimensionality of Data:**

    - **Low to Moderate Dimensionality:** Euclidean distance can be effective in lower-dimensional spaces where distances are more meaningful.
    - **High Dimensionality:** Manhattan distance often holds up better in high-dimensional spaces because it tends to preserve more contrast between near and far points than Euclidean distance does.
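
To make concrete how the metric determines which neighbor counts as closest, here is a minimal sketch with two hand-picked points (illustrative values, not from any dataset). The same query gets a different nearest neighbor under each metric.

```python
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

query = (0, 0)
points = {"A": (3, 3), "B": (0, 5)}

nearest_l2 = min(points, key=lambda name: math.dist(query, points[name]))
nearest_l1 = min(points, key=lambda name: manhattan(query, points[name]))

print(nearest_l2)  # A  (Euclidean: A is about 4.24, B is 5.0)
print(nearest_l1)  # B  (Manhattan: A is 6, B is 5)
```

Point `A` lies on the diagonal, where Manhattan distance is largest relative to Euclidean distance, which is why the ranking flips.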
    
    
##### Practical Examples

- **Image Recognition:** For image data, where pixel intensity differences are important, Euclidean distance is typically used. Normalized Euclidean distance can improve performance by reducing the impact of varying pixel intensities.


- **Text Data:** For text data represented as high-dimensional vectors (e.g., TF-IDF vectors), Manhattan distance might be more effective due to its robustness in high-dimensional, sparse data scenarios.


- **Urban Planning:** In applications like route planning within a city grid, Manhattan distance accurately reflects the real-world constraints and distances.

## Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?

In K-Nearest Neighbors (KNN) classifiers and regressors, several hyperparameters can significantly influence the performance of the model. The most common hyperparameters include:

1. **Number of Neighbors (k):**

    - **Effect on Performance:** The number of neighbors (k) determines how many nearby data points are considered when making a prediction.
        - **Small k:** A small k (e.g., 1 or 3) can lead to a model that is sensitive to noise and outliers, potentially resulting in high variance and overfitting.
        - **Large k:** A large k can smooth out predictions, reducing the influence of noise but potentially leading to high bias and underfitting.
    - **Tuning Approach:** Typically, k is tuned by evaluating model performance across a range of k values (e.g., 1 to 30) using cross-validation to find the optimal balance between bias and variance.
    
    
2. **Distance Metric:**

    - **Common Metrics:** Euclidean, Manhattan, Minkowski, Chebyshev, and others.
    - **Effect on Performance:** The choice of distance metric affects how distances between points are calculated and thus which neighbors are considered closest.
        - **Euclidean:** Works well with continuous and normalized data.
        - **Manhattan:** Better for high-dimensional or grid-like data.
    - **Tuning Approach:** Experiment with different distance metrics and use cross-validation to determine which metric provides the best performance for the specific dataset.
    
    
3. **Weights:**

    - **Options:** Uniform weights (all neighbors have equal weight) and distance weights (closer neighbors have more weight).
    - **Effect on Performance:** Weighting by distance can improve performance when closer neighbors are more relevant to the prediction.
    - **Tuning Approach:** Compare the performance of uniform and distance-weighted approaches using cross-validation.


4. **Algorithm:**

    - **Options:** `auto`, `ball_tree`, `kd_tree`, `brute`.
    - **Effect on Performance:** The choice of algorithm affects the speed and efficiency of finding the nearest neighbors, not which neighbors are found.
        - **kd_tree:** Efficient for low-dimensional data.
        - **ball_tree:** Usually more efficient than a KD-tree as the number of dimensions grows.
    - **Tuning Approach:** Generally, the `auto` option lets the implementation choose a method based on the dataset size and dimensions, but manual tuning might be necessary for very large or complex datasets.
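
The difference between uniform and distance weights can be seen in a small sketch. The function `knn_vote` is a hypothetical helper, and inverse-distance weighting (each vote counts as `1 / distance`) is assumed here as the distance-weighting scheme. The three training points are chosen by hand so that the two options disagree.

```python
import math
from collections import defaultdict

def knn_vote(train, query, n_neighbors, weights="uniform"):
    """Classify query from (point, label) pairs using uniform or inverse-distance votes."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:n_neighbors]
    tally = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(point, query)
        if weights == "distance":
            tally[label] += 1.0 / (d + 1e-12)  # small constant avoids division by zero
        else:
            tally[label] += 1.0
    return max(tally, key=tally.get)

train = [((0.1, 0.0), "red"), ((2.0, 0.0), "blue"), ((0.0, 2.1), "blue")]
query = (0.0, 0.0)

print(knn_vote(train, query, 3, "uniform"))   # blue: two of the three neighbors are blue
print(knn_vote(train, query, 3, "distance"))  # red: the single red point is much closer
```

With uniform weights the two distant blue points outvote the nearby red point. With distance weights the red point's vote (about 10) outweighs the combined blue votes (about 0.98).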
    
    
##### To tune these hyperparameters, the following techniques can be used:

1. Grid Search
2. Random Search
3. Cross Validation
4. Automated Hyperparameter Optimization
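
As a rough sketch of grid search combined with cross-validation, the code below tries every combination of neighbor count, metric, and weighting and scores each with leave-one-out validation. Everything here is an assumption for illustration: the synthetic data, the helper names `predict` and `loo_accuracy`, and the grid values. In practice a library routine would normally do this work.

```python
import itertools
import math
import random
from collections import defaultdict

METRICS = {
    "euclidean": lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b))),
    "manhattan": lambda a, b: sum(abs(u - v) for u, v in zip(a, b)),
}

def predict(train, query, n_neighbors, metric, weights):
    """KNN vote over (point, label) pairs with a configurable metric and weighting."""
    dist = METRICS[metric]
    scored = sorted(((dist(p, query), label) for p, label in train), key=lambda t: t[0])
    tally = defaultdict(float)
    for d, label in scored[:n_neighbors]:
        tally[label] += 1.0 / (d + 1e-12) if weights == "distance" else 1.0
    return max(tally, key=tally.get)

def loo_accuracy(data, **params):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = sum(
        predict(data[:i] + data[i + 1:], x, **params) == label
        for i, (x, label) in enumerate(data)
    )
    return hits / len(data)

# Synthetic two-class data (illustrative only).
rng = random.Random(1)
data = [
    ((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
    for label, c in (("a", 0.0), ("b", 2.0))
    for _ in range(30)
]

grid = {
    "n_neighbors": [1, 3, 5, 7],
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}

results = {}
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    results[combo] = loo_accuracy(data, **params)

best = max(results, key=results.get)
print(dict(zip(grid.keys(), best)), results[best])
```

Random search would follow the same pattern but sample a fixed number of combinations from the grid instead of enumerating all of them.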

## Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?

The size of the training set can significantly impact the performance of a K-Nearest Neighbors (KNN) classifier or regressor. Here’s how it affects performance and some techniques to optimize the training set size:

##### Impact of Training Set Size on KNN Performance

1. **Model Accuracy:**

    - **Small Training Set:** A small training set may not capture the underlying data distribution effectively, leading to poor generalization and higher variance. This can result in overfitting, where the model performs well on the training data but poorly on unseen data.
    - **Large Training Set:** A larger training set provides more examples for the model to learn from, which can improve generalization and reduce variance. However, it also increases computational cost and memory usage.
    
    
2. **Computational Complexity:**

    - **Training Phase:** KNN has minimal training time complexity, as it essentially involves storing the training data.
    - **Prediction Phase:** The computational cost during prediction is proportional to the size of the training set, as the model needs to compute distances between the query point and all training points. Larger training sets result in longer prediction times.
    
    
3. **Memory Usage:**

    - Larger training sets require more memory to store the data, which can be a limiting factor, especially for very large datasets.
    
    
    
##### Techniques to Optimize the Training Set Size

1. **Dimensionality Reduction:**

    - **Principal Component Analysis (PCA):** Reduce the number of features while retaining most of the variance in the data. This can make distance calculations more efficient and potentially improve performance.
    - **t-SNE or UMAP:** For visualization or reducing dimensions in a way that preserves local structures.
    

2. **Instance Selection:**

    - **Condensed Nearest Neighbor (CNN):** Aims to reduce the training set by retaining a subset of instances that are necessary to maintain the decision boundary.
    - **Edited Nearest Neighbor (ENN):** Removes instances that are misclassified by their k-nearest neighbors, which can help clean the training set and improve generalization.
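    - **Reduced Nearest Neighbor (RNN):** An extension of CNN that starts from the condensed subset and removes further instances whenever their removal does not cause any of the original training instances to be misclassified.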
    
    
3. **Sampling Techniques:**

    - **Random Sampling:** Randomly selects a subset of the data, although the subset might not be representative of the entire dataset.
    - **Stratified Sampling:** Ensures that the sampled subset preserves the distribution of the classes, which is important for maintaining the balance in classification tasks.
    - **Bootstrapping:** Creates multiple training sets by sampling with replacement and aggregates the results, which can provide robustness against overfitting.
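
As an illustration of instance selection, here is a minimal sketch in the spirit of Condensed Nearest Neighbor. It keeps a growing subset and adds a training sample only when the current subset misclassifies it with 1-NN, repeating until a full pass adds nothing. The helper names and the toy data are assumptions, and the sketch assumes there are no identical points with conflicting labels (otherwise the loop would not terminate).

```python
import math

def nearest_label(store, query):
    """Label of the stored (point, label) pair closest to query (1-NN, Euclidean)."""
    return min(store, key=lambda pl: math.dist(pl[0], query))[1]

def condense(data):
    """Keep only the samples that the current subset fails to classify correctly."""
    store = [data[0]]
    changed = True
    while changed:
        changed = False
        for point, label in data:
            if nearest_label(store, point) != label:
                store.append((point, label))
                changed = True
    return store

# Two well-separated clusters on a line (illustrative data).
data = [((float(x), 0.0), "left") for x in range(0, 5)] + [
    ((float(x), 0.0), "right") for x in range(10, 15)
]

subset = condense(data)
print(len(data), "->", len(subset))  # 10 -> 2
```

Because the clusters are far apart, one representative per class is enough to classify every original sample correctly. Overlapping classes would keep many more points, mostly near the decision boundary.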

## Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?

K-Nearest Neighbors (KNN) is a simple and intuitive algorithm for classification and regression tasks, but it comes with several potential drawbacks. Here are some common issues and strategies to overcome them to improve the model's performance:

##### Potential Drawbacks of KNN

1. **Computational Complexity:**

    - **Training:** KNN has negligible training time since it simply stores the training data.
    - **Prediction:** The prediction phase can be computationally expensive because it involves calculating the distance between the query point and all training points. This becomes problematic with large datasets.
    
    
2. **Memory Usage:**

    - KNN requires storing the entire training dataset in memory, which can be impractical for very large datasets.
    
    
3. **Curse of Dimensionality:**

    - As the number of dimensions increases, the distance between points becomes less meaningful due to the sparse distribution of data in high-dimensional spaces. This can lead to poor performance.
    
    
4. **Sensitivity to Irrelevant Features:**

    - KNN considers all features equally, which can be problematic if some features are irrelevant or noisy, leading to incorrect distance calculations and poor predictions.
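
The curse of dimensionality can be observed numerically. The sketch below draws uniformly random points (an assumption for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` is illustrative.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random uniform points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 2))
```

As the number of dimensions grows, the contrast shrinks toward zero: the nearest and farthest points end up at nearly the same distance, so "nearest" carries less information.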



##### Strategies to Overcome Drawbacks


1. Dimensionality reduction (PCA, t-SNE)
2. Cross-validating the choice of distance metric
3. Scaling and normalization of the data
4. Using efficient data structures (KD-tree, Ball tree)
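
To show how an efficient data structure reduces prediction cost, here is a minimal KD-tree sketch in plain Python. It is a simplified illustration rather than a production implementation: the dictionary-based node layout and the names `build_kd_tree` and `kd_nearest` are assumptions, and only single nearest-neighbor search with Euclidean distance is covered.

```python
import math
import random

def build_kd_tree(points, depth=0):
    """Recursively split points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def kd_nearest(node, query, best=None):
    """Return (distance, point) of the nearest neighbor, pruning subtrees that cannot win."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kd_nearest(near, query, best)
    if abs(diff) < best[0]:  # the splitting plane is closer than the best match so far
        best = kd_nearest(far, query, best)
    return best

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
query = (0.5, 0.5)

tree = build_kd_tree(points)
kd_dist, kd_point = kd_nearest(tree, query)

# Brute-force search over all points, for comparison.
brute_point = min(points, key=lambda p: math.dist(p, query))
print(kd_dist, math.dist(brute_point, query))
```

The tree search returns the same nearest distance as the brute-force scan but visits only the branches whose region could contain a closer point. This pruning becomes less effective as dimensionality grows, which is why dimensionality reduction and tree structures are often used together.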