# KNN

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The steps for using the K-Nearest Neighbors (KNN) algorithm, in brief:

```
1. Preprocess data (handle missing values, normalize/standardize features, split into train/test)
2. Choose the value of K and distance metric
3. For each query point:
    - Calculate its distance to every training point
    - Sort the distances and select the K nearest neighbors
4. For classification:
    - Assign query point to the most common class among K nearest neighbors
5. For regression:
    - Predict target value as mean/median of K nearest neighbors
6. Evaluate model performance on test set
7. Optionally, tune hyperparameters (K, distance metric, feature scaling)
8. Make predictions on new data using trained model
```

These steps cover data preprocessing, prediction, evaluation, and tuning. KNN has no separate training phase: "fitting" only stores the training data, and the distance computations in step 3 happen at prediction time.
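The core of steps 3 to 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, Euclidean distance is assumed, and preprocessing is skipped.

```python
import math
from collections import Counter
from statistics import mean


def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()  # sort once, after all distances are computed
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return mean(train_y[i] for i in neighbors)


# Toy data (made up for illustration): two well-separated groups.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, labels, (2.0, 2.0), k=3))   # -> a
print(knn_regress(X, targets, (8.2, 8.4), k=3))   # -> 11.0
```

Note that nothing is "trained" here: every prediction scans the full training set, which is why prediction cost grows with the amount of stored data.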

**Hyperparameters in KNN**

The K-Nearest Neighbors (KNN) algorithm has two main hyperparameters that can significantly impact its performance:

1. **Number of Neighbors (K)**: This is the most critical hyperparameter in KNN. A larger value of K can lead to smoother decision boundaries and a more stable model, but it may also result in underfitting and poor capture of local patterns. A smaller value of K can lead to more complex decision boundaries and better capture of local patterns, but it may also result in overfitting and sensitivity to noise.

2. **Distance Metric**: The choice of distance metric (e.g., Euclidean, Manhattan, Minkowski) can affect the algorithm's performance, especially when dealing with different types of features or data distributions.

These hyperparameters are typically tuned with grid search or random search, scoring each candidate setting by cross-validation or on a held-out validation set and keeping the values that perform best.
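As a rough sketch of such tuning, the snippet below runs a small grid search over K and the distance metric, scoring each combination by leave-one-out cross-validation. Everything here is an assumption for illustration: the helper names (`predict`, `loo_accuracy`), the candidate grid, and the toy data, which contains one deliberately mislabeled point. In practice a library routine (for example scikit-learn's `GridSearchCV`) would usually replace the hand-written loops.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.dist(a, b)


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def predict(train_X, train_y, query, k, metric):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: metric(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def loo_accuracy(X, y, k, metric):
    """Leave-one-out accuracy: hold out each point once, predict it from the rest."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += predict(rest_X, rest_y, X[i], k, metric) == y[i]
    return hits / len(X)


# Toy data (made up): two groups, with (1.5, 1.5) mislabeled "b" to act as noise.
X = [(1.0, 1.1), (1.2, 2.0), (2.1, 0.9), (2.0, 2.2), (1.5, 1.5),
     (6.0, 6.1), (6.2, 7.0), (7.1, 5.9), (7.0, 7.2), (6.5, 6.5)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

# Candidate grid: every combination of K and distance metric.
grid = [(k, name, fn) for k in (1, 3, 5)
        for name, fn in (("euclidean", euclidean), ("manhattan", manhattan))]
scores = {(k, name): loo_accuracy(X, y, k, fn) for k, name, fn in grid}
best = max(scores, key=scores.get)  # ties go to the earliest grid entry

for params, acc in scores.items():
    print(params, round(acc, 2))
print("best:", best)
```

On this data K = 1 scores lower than K = 3 because the single noisy point becomes the nearest neighbor of several correctly labeled points, which matches the overfitting behavior described for small K.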

**Advantages of KNN**

1. **Simple and Intuitive**: KNN is a straightforward algorithm that is easy to understand and implement.

2. **Non-parametric**: KNN makes no assumptions about the underlying data distribution, making it versatile and applicable to various types of data.

3. **Effective for Non-linear Problems**: KNN can handle non-linear decision boundaries and complex relationships between features and target variables.

4. **Lazy Learning**: KNN is an instance-based learner, meaning it doesn't require extensive training or model building. Instead, it defers the computational work until a new query point needs to be classified or predicted.

5. **Versatile**: KNN can be used for both classification and regression tasks.

**Disadvantages of KNN**

1. **Computationally Expensive**: KNN requires computing the distances between the query point and all training points, which can be computationally expensive, especially for large datasets.

2. **Curse of Dimensionality**: KNN can suffer from the "curse of dimensionality," where the distances between points become less meaningful in high-dimensional spaces, leading to poor performance.

3. **Sensitive to Feature Scaling**: KNN is sensitive to the scale of features, as features with larger ranges can dominate the distance calculations. Feature scaling or normalization is often necessary.

4. **Noisy Data and Outliers**: KNN is susceptible to the influence of noisy data and outliers, as they can distort the distance calculations and affect the selection of nearest neighbors.

5. **Memory Limitations**: KNN requires storing the entire training dataset, which can be memory-intensive for large datasets.

6. **Imbalanced Classes**: KNN can struggle with imbalanced classification problems, as the majority class can dominate the nearest neighbors.

While KNN has some limitations, it remains a popular and useful algorithm, particularly for smaller datasets, exploratory data analysis, and applications where interpretability and simplicity are important.
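To make the feature-scaling sensitivity listed above concrete, the following sketch finds the nearest neighbor of one row before and after standardization. The data (age in years, income in dollars) and the helper names `standardize` and `nearest` are hypothetical, chosen only so that the two features have very different ranges.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


def nearest(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))


# Hypothetical rows: (age in years, income in dollars).
people = [(25, 50_000), (26, 58_000), (60, 50_500), (45, 90_000)]

print(nearest(people, 0))               # -> 2: raw units, income dominates the distance
print(nearest(standardize(people), 0))  # -> 1: after scaling, age matters too
```

In raw units the 25-year-old's nearest neighbor is the 60-year-old with a similar income; after standardization it is the 26-year-old, because the 35-year age gap is no longer drowned out by differences in dollars.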

# K-Means

![image.png](attachment:image.png)

![image.png](attachment:image.png)

K-Means Clustering works as follows:

1. Randomly initialize K centroids (cluster centers).
2. Assign each data point to the nearest centroid.
3. Update centroids by calculating the mean of all data points in each cluster.
4. Repeat steps 2 and 3 until convergence (centroids no longer move significantly).
5. Final centroids represent the centers of the K clusters.

The objective is to minimize the sum of squared distances between data points and their assigned centroids.
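The loop above (often called Lloyd's algorithm) can be sketched in plain Python as follows. This is an illustrative version under several assumptions not stated in the notes: initial centroids are sampled from the data points, an empty cluster keeps its previous centroid, and the optional `init` argument, the `tol` threshold, and the toy data are additions made for this example.

```python
import math
import random


def assign(points, centroids):
    """Step 2: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, wcss)."""
    rng = random.Random(seed)
    # Step 1: start from k randomly chosen data points unless `init` is given.
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 3: move each centroid to the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # empty cluster: leave its centroid in place
        # Step 4: stop once no centroid moved by more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    labels = assign(points, centroids)
    # Objective: within-cluster sum of squared distances (WCSS).
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Toy data (made up): two tight groups around (1, 1) and (5, 5).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]

centroids, labels, wcss = kmeans(points, k=2)
print(centroids, labels, round(wcss, 3))
```

Changing `seed` can change the outcome: if both starting centroids fall in the same group, the loop may settle on a worse partition with a higher WCSS. Running several initializations and keeping the lowest WCSS is a common remedy.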

Key points:
- Initialization of centroids impacts the final result.
- Works well for spherical clusters but struggles with non-spherical shapes, outliers, and high dimensions.
- Alternatives like DBSCAN or Hierarchical Clustering may be better in certain cases.

![image.png](attachment:image.png)

The elbow method for choosing the number of clusters (K) in K-Means Clustering works as follows:

1. Run K-Means for different values of K and calculate the within-cluster sum of squares (WCSS) for each value.
2. Plot WCSS vs. K.
3. Identify the "elbow point" in the plot where the curve starts to flatten out.
4. Take the value of K at the elbow point as the number of clusters.

The elbow method aims to find the value of K where adding more clusters does not significantly reduce the WCSS, forming an "elbow" in the plot.

Key points:
- It is a heuristic: the elbow is not always sharp, so the choice of K can be subjective.
- Assumes WCSS decreases rapidly up to a point, then flattens out.
- Should be complemented with other techniques like silhouette analysis or domain knowledge.
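Steps 1 and 2 of the elbow method can be sketched as below. The snippet computes WCSS for a range of K values and prints the curve as a table instead of plotting it. The function name `wcss_for_k`, the `n_init` restarts (the lowest WCSS over several random initializations is kept), and the toy data with three groups are assumptions made for this example.

```python
import math
import random


def wcss_for_k(points, k, n_init=10, max_iter=100, seed=0):
    """Lowest WCSS found over `n_init` random K-Means runs with k clusters."""
    rng = random.Random(seed)
    best = math.inf
    for _run in range(n_init):
        centroids = rng.sample(points, k)  # random data points as initial centroids
        for _step in range(max_iter):
            # Assign every point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                clusters[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its old centroid.
            updated = [
                tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
                for members, old in zip(clusters, centroids)
            ]
            if updated == centroids:  # assignments stable, so the run has converged
                break
            centroids = updated
        wcss = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
        best = min(best, wcss)
    return best


# Toy data (made up): three tight groups, so the elbow is expected near K = 3.
points = [(1.0, 1.0), (1.3, 0.8), (0.8, 1.2),
          (5.0, 5.1), (5.2, 4.8), (4.7, 5.3),
          (9.0, 1.0), (9.2, 1.3), (8.8, 0.7)]

curve = {k: wcss_for_k(points, k) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 3))
```

Reading the printed values top to bottom plays the role of the plot: the K after which the drop in WCSS becomes small is the candidate elbow, to be checked against silhouette analysis or domain knowledge as noted above.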

# DBSCAN

![image.png](attachment:image.png)

![image.png](attachment:image.png)