## K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple, intuitive supervised learning algorithm. It predicts the label of a new data point by finding the *k* most similar (nearest) examples in the training data and taking a majority vote among their labels; for regression, it averages their values instead.

#### How KNN Works (Step by Step)

KNN follows a straightforward "lazy learning" process: there is no model-fitting step. It stores the data and does all of its computation at prediction time.

##### 1. Training Phase (Store Data)
- Simply save all labeled training examples (features + class labels).
- No parameters are learned here, which is why KNN is called "lazy."

##### 2. Prediction Phase (For a New Point)
- **Calculate distances**: Measure distance from the new point to *every* training point (Euclidean distance is common: straight-line distance in feature space).
- **Find k nearest**: Sort distances and pick the *k* smallest (your k neighbors).
- **Vote or average**:
  - **Classification**: Majority class among the k neighbors wins (e.g., if 3/5 say "spam," predict "spam").
  - **Regression**: Average the values of the k neighbors.
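
The two phases above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`euclidean`, `k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and it uses a brute-force search over every stored point.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Return the labels/values of the k training points closest to query."""
    # Distance from the query to every stored training point.
    distances = [(euclidean(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only and keep the k smallest.
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean of the k nearest target values."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Toy data, invented for illustration: two features per point.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
values = [10.0, 12.0, 11.0, 50.0, 55.0, 52.0]

print(knn_classify(X, labels, (6.2, 6.4), k=3))  # spam
print(knn_regress(X, values, (1.2, 1.1), k=3))   # 11.0
```

The "training phase" here is just holding on to `X` and the labels; all the work happens inside `k_nearest` at prediction time.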

**Analogy**: Imagine classifying a fruit by size and color. Compare it to known fruits in a basket, grab the 3 closest ones, and go with whatever type most of them are (e.g., 2 apples → predict apple).

#### Key Parameter: Choosing k

k controls the "neighborhood size" and directly impacts model behavior:

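- **Small k** (e.g., k=1): Fits training data too closely → overfitting (sensitive to noise and outliers).
- **Large k**: Smooths too much → underfitting (ignores local patterns; in the limit where k equals the training set size, every prediction is simply the overall majority class).
- **Best practice**: Try odd values (this avoids tied votes in binary classification) and pick k by cross-validation, for example by plotting validation error against k. Values between 3 and 10 are a common starting range, not a rule.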

There is no universal "best" k; it depends on your data. Start small and tune via validation.
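
One way to tune k by validation is a small cross-validation loop in scikit-learn. The sketch below assumes the bundled Iris dataset as stand-in data and an arbitrary list of odd candidate values; both are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data for illustration; substitute your own X and y.
X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11, 13, 15]  # odd values, chosen arbitrarily
mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation; the default score for a classifier is accuracy.
    scores = cross_val_score(knn, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Several values of k often score within noise of each other, so it is reasonable to treat `best_k` as a starting point and look at the whole curve before settling on a value.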

#### Strengths and Limitations

| Aspect | Details |
|--------|---------|
| **Pros** | Simple to understand/implement; no assumptions about data distribution; works for classification *and* regression; handles multi-class naturally. |
| **Cons** | Prediction is slow on large datasets (a brute-force search computes the distance to every training point); sensitive to feature scaling (scale features first!); struggles in high dimensions ("curse of dimensionality"); stores all data (memory-heavy). |
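
To make the "curse of dimensionality" entry concrete, the sketch below draws uniformly random points and compares the nearest and farthest distance from a random query as the number of dimensions grows. The helper name `nearest_to_farthest_ratio`, the point count, and the dimension list are all made up for this illustration.

```python
import math
import random

def nearest_to_farthest_ratio(n_points, n_dims, rng):
    """Ratio of nearest to farthest Euclidean distance from one random query."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(
            math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        )
    return min(distances) / max(distances)

rng = random.Random(0)  # fixed seed so the run is repeatable
ratios = {d: nearest_to_farthest_ratio(500, d, rng) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"dims={d:4d}  nearest/farthest={ratio:.3f}")
```

As the ratio approaches 1, the "nearest" neighbors are barely closer than the farthest points, so the vote among them carries less information. Real data is rarely uniform, so treat this as an illustration of the trend rather than a prediction for any particular dataset.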

#### Practical Tips (Beginner-Friendly)
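- **Scale features**: Use `StandardScaler` or a similar scaler. Distances add up raw feature differences, so a feature with a large numeric range dominates unless the features are scaled.
- **Distance metric**: Euclidean is the usual default; Manhattan is a common alternative. Both are special cases of Minkowski distance (p=2 and p=1 respectively).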
- **In scikit-learn** (Python):
  ```python
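  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier
  X, y = load_iris(return_X_y=True)  # example data; substitute your own
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  knn = KNeighborsClassifier(n_neighbors=5)  # k=5
  knn.fit(X_train, y_train)  # stores (and indexes) the training data
  y_pred = knn.predict(X_test)  # majority vote among the 5 nearest neighbors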
  ```
- Real-world: Great for small/medium datasets, prototyping, or when interpretability matters (e.g., "these 5 similar patients had this outcome").
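
The scaling and distance-metric tips can be combined in a scikit-learn pipeline, which fits the scaler on the training data only and applies it again at prediction time. The sketch below assumes the bundled Wine dataset, whose features have very different ranges; the dataset, the split, and the variable names are choices made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data for illustration; substitute your own X and y.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Baseline: KNN on raw, unscaled features (default Euclidean distance).
raw_knn = KNeighborsClassifier(n_neighbors=5)
raw_knn.fit(X_train, y_train)

# Same model, but features are standardized first.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

# Scaled features with Manhattan distance instead of Euclidean.
manhattan_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
)
manhattan_knn.fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
manhattan_acc = manhattan_knn.score(X_test, y_test)
print(f"raw: {raw_acc:.3f}  scaled: {scaled_acc:.3f}  manhattan: {manhattan_acc:.3f}")
```

On data like this the scaled pipelines typically score noticeably higher than the raw baseline, though exact numbers depend on the split. Which metric works better is data-dependent, so it is worth comparing them with the same validation procedure used to choose k.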

In short: KNN finds similar examples by proximity, predicts the dominant class among them, and relies on tuning k to balance sensitivity to noise against over-smoothing.
