<a href="https://colab.research.google.com/github/angekonan715/Data-Science-Project/blob/main/DSAI_ML_Bootcamp_KNN_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K-Nearest Neighbors Machine Learning Model 💻 🧠

---



### 🔴 PLEASE COPY THIS NOTEBOOK INTO YOUR OWN GITHUB OR GOOGLE DRIVE. DO NOT MODIFY THIS VERSION. 🔴

## Overview

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm commonly used for classification and regression tasks. It’s one of the simplest and most intuitive models in machine learning.

 <br>

---

<br>

**Definition**
KNN works by finding the 'k' closest data points (neighbors) to a given input based on some distance metric (e.g., Euclidean distance). The predicted value or class is determined by these neighbors:

* For classification, the input is assigned the class most common among its neighbors (majority vote).
* For regression, the predicted value is the average (or sometimes weighted average) of the neighbors' values.
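
The two prediction rules above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query`."""
    # Pair every distance with its row index, then sort by distance.
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: unweighted average of the k nearest neighbors' values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# Toy data, invented for illustration: two well-separated clusters.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_classify(X, labels, (1.2, 1.4), k=3)
predicted_value = knn_regress(X, values, (1.2, 1.4), k=3)
print(predicted_class, predicted_value)
```

The query point sits inside the first cluster, so its three nearest neighbors all carry the label `"a"` and the regression output is the mean of their three values.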

<br>

---

<br>

**Key Concepts**
1. K (Number of Neighbors): The algorithm uses the 'k' nearest neighbors to make each prediction. Choosing the right 'k' is crucial:
   * A small 'k' (e.g., 1 or 3) makes the model sensitive to noise.
   * A large 'k' smooths out predictions but may overlook local patterns.

2. Distance Metrics: These determine how "close" two points are. Common metrics include:
   * Euclidean Distance: Straight-line distance between points.
   * Manhattan Distance: The sum of the absolute differences along each axis.
   * Cosine Similarity: Measures the cosine of the angle between two vectors (useful for text or high-dimensional data). Because it is a similarity rather than a distance, KNN typically uses cosine distance, defined as 1 minus the similarity.

3. Laziness: KNN is a lazy learner, meaning it builds no explicit model during training. Instead, it stores the training data and defers the computation to prediction time. This is why it’s called a "memory-based" approach.
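
To make the three metrics concrete, here is a small sketch in plain Python. The function names are illustrative, and `cosine_distance` assumes the common convention of 1 minus cosine similarity with non-zero input vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


p, q = (0.0, 0.0), (3.0, 4.0)
d_euclidean = euclidean(p, q)   # 3-4-5 right triangle
d_manhattan = manhattan(p, q)   # 3 along one axis plus 4 along the other

u, v = (1.0, 0.0), (0.0, 2.0)
d_cosine = cosine_distance(u, v)  # orthogonal vectors: similarity is 0

print(d_euclidean, d_manhattan, d_cosine)
```

Note that cosine distance ignores vector length: `(1, 1)` and `(2, 2)` point in the same direction, so their cosine distance is approximately zero even though their Euclidean distance is not.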

<br>

---

<br>

**Pros**
1. Simplicity: Easy to understand and implement.
2. No Training Phase: Since KNN doesn’t explicitly learn a model, "training" amounts to storing the data, which is fast.
3. Versatility: Can handle classification and regression problems.
4. Adaptability: Works well for multi-class classification.
5. Non-parametric: No assumptions about the underlying data distribution.

<br>

---

<br>

**Cons**
1. Computationally Expensive: With a brute-force search, each prediction requires calculating the distance to every training point, which can be slow for large datasets.
2. Memory Intensive: The model needs to store all the training data.
3. Sensitive to 'k': Choosing the right number of neighbors is critical and often requires experimentation.
4. Sensitive to Noise: Outliers or irrelevant features can negatively affect performance.
5. Scaling Required: Features should be scaled (e.g., using standardization or normalization) before applying distance metrics like Euclidean distance. Otherwise, features with large numeric ranges dominate the distance.
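
Two of the points above, sensitivity to 'k' and the need for scaling, can be handled together. The sketch below is one possible approach, not the required solution for this notebook: it wraps `StandardScaler` and `KNeighborsClassifier` in a scikit-learn pipeline and compares a few values of 'k' with 5-fold cross-validation on the iris dataset. The candidate grid and the variable names are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary illustrative grid
mean_accuracy = {}

for k in candidate_ks:
    # Putting the scaler inside the pipeline means it is refit on the
    # training part of every fold, so held-out data never leaks into it.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)

for k, accuracy in mean_accuracy.items():
    print(f"k={k:2d}  mean CV accuracy={accuracy:.3f}")
print("best k on this grid:", best_k)
```

The best 'k' found this way is specific to the dataset and the grid; on other data the ranking may differ, so it is worth rerunning the comparison rather than reusing the value.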

<br>

---

<br>

**Tips**
* Use Weighted KNN: Assign weights to neighbors so closer neighbors have more influence. This can improve accuracy, especially for larger 'k', but confirm the gain on validation data before relying on it.
* Use Efficient Searches: KD-Trees or Ball Trees can speed up neighbor searches for large datasets, particularly when the number of features is small.
* Use Feature Selection: Remove irrelevant or noisy features to improve performance.
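
The first two tips map onto the `weights` and `algorithm` parameters of scikit-learn's `KNeighborsClassifier`. The sketch below uses a tiny one-dimensional dataset, invented for this example, in which uniform and distance-weighted voting disagree.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: one class-0 point right next to
# the query and two class-1 points farther away.
X_train = [[0.0], [2.0], [2.1]]
y_train = [0, 1, 1]
query = [[0.1]]

# weights="uniform": each of the 3 neighbors gets one vote.
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform", algorithm="kd_tree")
uniform_knn.fit(X_train, y_train)

# weights="distance": each vote is weighted by 1 / distance to the query.
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights="distance", algorithm="kd_tree")
weighted_knn.fit(X_train, y_train)

uniform_pred = int(uniform_knn.predict(query)[0])
weighted_pred = int(weighted_knn.predict(query)[0])
print("uniform vote:", uniform_pred, "| distance-weighted vote:", weighted_pred)
```

With uniform voting the two distant class-1 points outvote the single nearby class-0 point; with distance weighting the nearby point dominates. `algorithm` also accepts `"ball_tree"`, `"brute"`, and `"auto"` (the default, which picks a strategy based on the data); on a dataset this small the choice has no practical effect on speed.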

<br>

---

<br>

**Useful Articles and Videos**
* https://www.w3schools.com/python/python_ml_knn.asp
* https://realpython.com/knn-python/
* https://www.geeksforgeeks.org/k-nearest-neighbor-algorithm-in-python/
* https://www.youtube.com/watch?v=CQveSaMyEwM
* https://www.youtube.com/watch?v=b6uHw7QW_n4
* https://www.youtube.com/watch?v=w6bOBZX-1kY

<br>

## Import Data/Libraries

```python
# needed libraries for KNN models
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# foundation dataset
from sklearn.datasets import load_iris

# stretch dataset (header=None because the files are read without a header row)
cleveland_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', header=None)
hungarian_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data', header=None)
switzerland_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data', header=None)
```

## Explore, Visualize and Understand the Data

## Feature Engineering and Data Augmentation

### **Data Augmentation**  
**Definition:**
Data augmentation is the process of artificially expanding the size and diversity of a training dataset by applying transformations or modifications to the existing data while preserving the underlying labels or structure. It is commonly used in machine learning, especially in computer vision and natural language processing, to improve model performance and robustness.
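
For tabular data like the datasets in this notebook, one simple form of augmentation is adding small random noise ("jitter") to copies of the training rows while keeping their labels. The sketch below is a hypothetical helper, not part of any library; `jitter_augment`, `copies`, and `noise_scale` are names and defaults chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable


def jitter_augment(X, y, copies=2, noise_scale=0.05):
    """Return X and y extended with `copies` noisy duplicates of every row.

    Noise is Gaussian with a standard deviation of `noise_scale` times each
    feature's own standard deviation; labels are copied unchanged.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    feature_std = X.std(axis=0)

    augmented_X, augmented_y = [X], [y]
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale * feature_std, size=X.shape)
        augmented_X.append(X + noise)
        augmented_y.append(y)
    return np.vstack(augmented_X), np.concatenate(augmented_y)


# Toy data, invented for illustration.
X_small = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_small = [0, 1, 0]

X_aug, y_aug = jitter_augment(X_small, y_small, copies=2)
print(X_aug.shape, y_aug.shape)
```

If you try something like this, apply it to the training split only, so that the test set still contains nothing but real observations.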

### **Feature Engineering**  
**Definition:**
Feature engineering is the process of creating, modifying, or selecting relevant features (input variables) from raw data to improve the performance of a machine learning model. It involves transforming raw data into a format that makes it more suitable for algorithms to learn patterns.
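
As a small illustration of creating features, the sketch below derives two new columns from existing ones with pandas. The column names (`sepal_length`, `sepal_width`, `sepal_area`, `sepal_ratio`) and the values are hypothetical, chosen to resemble the iris measurements.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "sepal_length": [5.0, 6.0],
    "sepal_width": [2.5, 3.0],
})

# Derived features: a product and a ratio of two raw measurements.
df["sepal_area"] = df["sepal_length"] * df["sepal_width"]
df["sepal_ratio"] = df["sepal_length"] / df["sepal_width"]

print(df)
```

Because KNN relies on distances, any engineered feature changes which points count as neighbors. New columns should go through the same scaling step as the original ones, and it is worth checking on validation data whether they actually help.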


## Machine Learning Model

### Split the data

### Create the model

### Train the model

### Make predictions

## Evaluate the Model