# Supervised Learning Algorithms

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Linear Regression

**Linear Regression** is one of the simplest and most widely used supervised machine learning algorithms. It is used for predicting a continuous target variable (also called the dependent variable) based on one or more input features (independent variables). The relationship between the input features and the target variable is assumed to be **linear**.

---

### Key Concepts of Linear Regression:
1. **Equation of a Line**:
   - In simple linear regression (with one feature), the relationship is represented by the equation:
     $
     y = mx + b
     $
     - $ y $: Target variable (output).
     - $ x $: Feature (input).
     - $ m $: Slope (weight of the feature).
     - $ b $: Intercept (bias term).

   - In multiple linear regression (with multiple features), the equation becomes:
     $
     y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n
     $
     - $ b_0 $: Intercept.
     - $ b_1, b_2, \dots, b_n $: Coefficients (weights) for each feature.
     - $ x_1, x_2, \dots, x_n $: Input features.

2. **Objective**:
   - The goal of linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual values.

3. **Cost Function**:
   - The most common cost function used in linear regression is the **Mean Squared Error (MSE)**:
     $
     \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
     $
     - $ y_i $: Actual value.
     - $ \hat{y}_i $: Predicted value.
     - $ N $: Number of data points.

4. **Optimization**:
   - The coefficients ($ b_0, b_1, \dots, b_n $) are optimized using techniques like:
     - **Ordinary Least Squares (OLS)**: A closed-form solution that minimizes the MSE.
     - **Gradient Descent**: An iterative optimization algorithm used for large datasets or complex models.
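
As a minimal sketch of the cost function and the least-squares fit described above, the snippet below fits a simple linear regression and evaluates its MSE. The synthetic data (true slope 3, intercept 2), the random seed, and all variable names are assumptions made for illustration; `np.linalg.lstsq` is used here as a numerically stable stand-in for the OLS closed-form solution.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the intercept b is estimated as well
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution of A @ [b, m] ~ y
(b, m), *_ = np.linalg.lstsq(A, y, rcond=None)

# Mean squared error of the fitted line
y_hat = m * x + b
mse = np.mean((y - y_hat) ** 2)
print(f"slope={m:.3f}, intercept={b:.3f}, MSE={mse:.3f}")
```

With this much data the estimated slope and intercept should land close to the values used to generate it, and the MSE should be close to the noise variance.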

---

### Assumptions of Linear Regression:
1. **Linearity**: The relationship between the features and the target variable is linear.
2. **Independence**: Observations are independent of each other (no autocorrelation).
3. **Homoscedasticity**: The residuals (errors) have constant variance across all levels of the independent variables.
4. **Normality**: The residuals are normally distributed (important for confidence intervals and hypothesis testing).
5. **No Multicollinearity**: The independent variables are not highly correlated with each other.

---

### Types of Linear Regression:
1. **Simple Linear Regression**:
   - Only one input feature is used to predict the target variable.
   - Example: Predicting house prices based on square footage.

2. **Multiple Linear Regression**:
   - Multiple input features are used to predict the target variable.
   - Example: Predicting house prices based on square footage, number of bedrooms, and location.

3. **Polynomial Regression**:
   - A form of linear regression where the relationship between the independent variable and the dependent variable is modeled as an $ n $-degree polynomial.
   - Example: $ y = b_0 + b_1x + b_2x^2 + \dots + b_nx^n $.

---

### Advantages of Linear Regression:
- Simple to understand and interpret.
- Computationally efficient.
- Works well when the relationship between variables is linear.

---

### Disadvantages of Linear Regression:
- Assumes a linear relationship, which may not hold in real-world scenarios.
- Sensitive to outliers.
- Cannot capture nonlinear relationships or feature interactions unless they are added explicitly as transformed features.


In [None]:
url = 'MachineLearning/Datasets/CSVs/lr.csv'
data = pd.read_csv(url)

# data

# Drop the missing values
data = data.dropna()

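# Training inputs and targets (first 500 rows)
inputData = np.array(data.x[0:500]).reshape(-1, 1)
outputData = np.array(data.y[0:500]).reshape(-1, 1)

# Held-out test inputs and targets (rows 500 up to 700).
# reshape(-1, 1) infers the row count, so the code does not break
# if dropna() removed a different number of rows.
test_input = np.array(data.x[500:700]).reshape(-1, 1)
test_output = np.array(data.y[500:700]).reshape(-1, 1)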


In [None]:
from matplotlib.animation import FuncAnimation


class LinearRegression: 
    def __init__(self): 
        self.parameters = {} 

    def forwardPropagation(self, inputData): 
        m = self.parameters['m'] 
        c = self.parameters['c'] 
        predictions = np.multiply(m, inputData) + c 
        return predictions 

    def costFunction(self, predictions, outputData): 
        cost = np.mean((outputData - predictions) ** 2) 
        return cost 

    def backwardPropagation(self, inputData, outputData, predictions): 
        derivatives = {} 
        df = predictions - outputData
        # dm = (2/n) * sum((predictions - actual) * input), i.e. twice the mean
        dm = 2 * np.mean(np.multiply(inputData, df))
        # dc = (2/n) * sum(predictions - actual), i.e. twice the mean
        dc = 2 * np.mean(df)
        derivatives['dm'] = dm 
        derivatives['dc'] = dc 
        return derivatives 

    def updateParameters(self, derivatives, learningRate): 
        self.parameters['m'] = self.parameters['m'] - learningRate * derivatives['dm'] 
        self.parameters['c'] = self.parameters['c'] - learningRate * derivatives['dc'] 

    def train(self, inputData, outputData, learningRate, iters): 
        # Initialize random parameters 
        self.parameters['m'] = np.random.uniform(0, 1) * -1
        self.parameters['c'] = np.random.uniform(0, 1) * -1

        # Initialize loss 
        self.loss = [] 

        # Initialize figure and axis for animation 
        fig, ax = plt.subplots() 
        xValues = np.linspace(min(inputData), max(inputData), 100) 
        line, = ax.plot(xValues, self.parameters['m'] * xValues +
                        self.parameters['c'], color='red', label='Regression Line') 
        ax.scatter(inputData, outputData, marker='o', 
                color='green', label='Training Data') 

        # Set y-axis limits to exclude negative values 
        ax.set_ylim(0, max(outputData) + 1) 

        def update(frame): 
            # Forward propagation 
            predictions = self.forwardPropagation(inputData) 

            # Cost function 
            cost = self.costFunction(predictions, outputData) 

            # Back propagation 
            derivatives = self.backwardPropagation( 
                inputData, outputData, predictions) 

            # Update parameters 
            self.updateParameters(derivatives, learningRate) 

            # Update the regression line 
            line.set_ydata(self.parameters['m'] 
                        * xValues + self.parameters['c']) 

            # Append loss and print 
            self.loss.append(cost) 
            print("Iteration = {}, Loss = {}".format(frame + 1, cost)) 

            return line, 
        # Create animation 
        ani = FuncAnimation(fig, update, frames=iters, interval=200, blit=True) 

        # Save the animation as a GIF using the ffmpeg writer (ffmpeg must be installed)
        ani.save('linear_regression_A.gif', writer='ffmpeg')

        plt.xlabel('Input') 
        plt.ylabel('Output') 
        plt.title('Linear Regression') 
        plt.legend() 
        plt.show() 

        return self.parameters, self.loss 


lg = LinearRegression()
parameters, loss = lg.train(inputData, outputData, 0.0001, 20)
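
The training loop above only runs while the animation is rendered. As a rough sketch of the same gradient descent updates without any plotting, the snippet below trains on synthetic data. The data (true slope 4, intercept 1), the learning rate of 0.1, and the 2000 iterations are assumptions chosen for this toy problem, not the settings used above.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 4x + 1 plus small noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=(500, 1))
y = 4.0 * x + 1.0 + rng.normal(0, 0.1, size=(500, 1))

m, c = 0.0, 0.0          # start from zero instead of random values
learning_rate = 0.1      # assumed value; suitable because x lies in [0, 1]
losses = []

for _ in range(2000):
    predictions = m * x + c
    error = predictions - y
    losses.append(np.mean(error ** 2))   # MSE before this update

    # Gradients of the MSE with respect to m and c
    dm = 2 * np.mean(error * x)
    dc = 2 * np.mean(error)

    m -= learning_rate * dm
    c -= learning_rate * dc

print(f"m={m:.3f}, c={c:.3f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

Keeping the optimization separate from the plotting code makes it easier to test the update rule on its own before adding visualization.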

## Polynomial Regression

**Polynomial Regression** is a form of regression analysis in which the relationship between the independent variable $ x $ and the dependent variable $ y $ is modeled as an $ n $-th degree polynomial. Unlike simple linear regression, which fits a straight line, polynomial regression can capture nonlinear relationships while remaining linear in its coefficients.

---

### Key Concepts of Polynomial Regression:

1. **Polynomial Equation**:
   - The general form of a polynomial regression model is:
     $
     y = b_0 + b_1x + b_2x^2 + \dots + b_nx^n
     $
     - $ y $: Dependent variable (target).
     - $ x $: Independent variable (feature).
     - $ b_0, b_1, \dots, b_n $: Coefficients.
     - $ n $: Degree of the polynomial.

2. **Nonlinear Relationships**:
   - Polynomial regression can model curves, making it more flexible than linear regression.
   - For example, a quadratic polynomial ($ n = 2 $) can model a parabolic relationship.

3. **Higher Degrees**:
   - Higher-degree polynomials can fit the training data more closely, but they may also overfit, capturing noise rather than the underlying pattern.

4. **Applications**:
   - Used in scenarios where the relationship between variables is inherently nonlinear, such as:
     - Predicting growth rates.
     - Modeling economic trends.
     - Analyzing biological data.

---

### How Polynomial Regression Works:
1. **Feature Transformation**:
   - Polynomial regression is essentially a special case of **multiple linear regression**.
   - The original feature $ x $ is transformed into polynomial features ($ x, x^2, x^3, \dots, x^n $).
   - These transformed features are then used as inputs to a linear regression model.

2. **Model Training**:
   - The coefficients ($ b_0, b_1, \dots, b_n $) are learned using the **Ordinary Least Squares (OLS)** method or gradient descent.

3. **Prediction**:
   - Once the model is trained, it can predict $ y $ for new values of $ x $.
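
The three steps above can be sketched with NumPy alone. The noise-free quadratic data and the variable names below are assumptions for illustration; `np.vander` builds the polynomial features and `np.linalg.lstsq` performs the least-squares fit.

```python
import numpy as np

# Noise-free quadratic (assumed for illustration): y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

degree = 2

# Step 1: feature transformation. Columns are x^0, x^1, ..., x^degree;
# the x^0 column of ones plays the role of the intercept b_0.
X_poly = np.vander(x, degree + 1, increasing=True)

# Step 2: ordinary least squares on the transformed features
coefficients, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print("Estimated coefficients:", np.round(coefficients, 6))

# Step 3: prediction reuses the same transformation on new inputs
x_new = np.array([3.0])
y_new = np.vander(x_new, degree + 1, increasing=True) @ coefficients
print("Prediction at x = 3:", y_new)
```

Because the data is an exact quadratic, the fit recovers the generating coefficients up to floating-point error; with noisy data the estimates would only be approximate.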

---

### Advantages of Polynomial Regression:
- Can model nonlinear relationships.
- Flexible and can fit a wide range of data patterns.
- Easy to implement using libraries like Scikit-learn.

---

### Disadvantages of Polynomial Regression:
- **Overfitting**: Higher-degree polynomials can fit the training data too closely, leading to poor generalization on unseen data.
- **Sensitive to Outliers**: Polynomial models can be heavily influenced by outliers.
- **Computationally Expensive**: Higher-degree polynomials require more computation.
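
To make the overfitting risk concrete, the sketch below fits polynomials of increasing degree to a small noisy sample and compares training error with error on held-out points. The sine-shaped target, noise level, sample sizes, and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_function(x):
    # Assumed underlying relationship for this toy example
    return np.sin(np.pi * x)

# Small noisy training set and a larger held-out set (both synthetic)
x_train = rng.uniform(-1, 1, 15)
y_train = true_function(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = true_function(x_test) + rng.normal(0, 0.2, x_test.size)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    X_tr = np.vander(x_train, degree + 1, increasing=True)
    X_te = np.vander(x_test, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    train_mse = np.mean((X_tr @ coef - y_train) ** 2)
    test_mse = np.mean((X_te @ coef - y_test) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (1, 3, 6, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error cannot increase as the degree grows, because each higher-degree model contains the lower-degree ones. Held-out error often stops improving or gets worse once the polynomial starts fitting noise, which is the pattern to look for in the printed output.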

---


In [None]:
data = pd.read_csv("E:/Projects/Notebook/MachineLearning/Datasets/CSVs/data.csv")
X = data.iloc[:, 1:2].values
y = data.iloc[:, 2].values
 
# Degree of the polynomial
degree = 2
xPoly = np.column_stack([X ** i for i in range(1, degree + 1)])  # Create polynomial features
xPoly = np.hstack([np.ones((xPoly.shape[0], 1)), xPoly])  # Add a column of ones for the intercept

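# Step 3: Implement Polynomial Regression using Ordinary Least Squares (OLS)
def polynomial_regression(X, y):
    # Solve the normal equations (X^T X) b = X^T y.
    # np.linalg.solve is numerically safer than forming the explicit inverse.
    coefficients = np.linalg.solve(X.T @ X, X.T @ y)
    return coefficients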

# Train the model
coefficients = polynomial_regression(xPoly, y)

# Step 4: Make predictions
def predict(X, coefficients):
    return X @ coefficients

yPred = predict(xPoly, coefficients)

# Step 5: Evaluate the model
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

mse = mean_squared_error(y, yPred)
print(f"Coefficients: {coefficients}")
print(f"Predictions: {yPred}")
print(f"Mean Squared Error: {mse}")

# Step 6: Visualize the results
plt.scatter(X, y, color='blue', label='Actual Data')  # Plot actual data points
order = np.argsort(X[:, 0])  # Sort by x so the curve is drawn left to right
plt.plot(X[order], yPred[order], color='red', label='Polynomial Regression')  # Plot regression curve
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression (Degree = 2)')
plt.legend()
plt.show()

## Ridge and Lasso Regression

**Ridge Regression** and **Lasso Regression** are two popular regularization techniques used in linear regression to prevent overfitting and improve model generalization. Both methods add a penalty term to the loss function, but they differ in the type of penalty applied.

---

### 1. **Ridge Regression (L2 Regularization)**:
Ridge Regression adds an **L2 penalty** (squared magnitude of coefficients) to the loss function. This penalty shrinks the coefficients but does not set them to zero.

#### Key Features:
- **Loss Function**:
  $
  \text{Loss} = \text{Mean Squared Error (MSE)} + \lambda \sum_{i=1}^n b_i^2
  $
  - $ \lambda $: Regularization parameter (controls the strength of the penalty).
  - $ b_i $: Coefficients of the model.

- **Effect**:
  - Shrinks the coefficients toward zero but does not eliminate them entirely.
  - Stabilizes the coefficient estimates when features are highly correlated (multicollinearity).

- **Use Case**:
  - When you have many features and want to keep all of them in the model.

---

### 2. **Lasso Regression (L1 Regularization)**:
Lasso Regression adds an **L1 penalty** (absolute magnitude of coefficients) to the loss function. This penalty can shrink some coefficients to exactly zero, effectively performing feature selection.

#### Key Features:
- **Loss Function**:
  $
  \text{Loss} = \text{Mean Squared Error (MSE)} + \lambda \sum_{i=1}^n |b_i|
  $
  - $ \lambda $: Regularization parameter.
  - $ b_i $: Coefficients of the model.

- **Effect**:
  - Shrinks some coefficients to zero, effectively removing those features from the model.
  - Performs feature selection, making the model simpler and more interpretable.

- **Use Case**:
  - When you have many features and want to select only the most important ones.
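
The reason an L1 penalty produces exact zeros is the soft-thresholding operation that appears when a single coefficient is optimized at a time. The sketch below is an illustration under that assumption; the function name `soft_threshold` and the sample values are hypothetical.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; anything within the threshold becomes 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Unpenalized coefficient values (made-up numbers for illustration)
z = np.array([-3.0, -0.5, 0.2, 1.5])

lasso_like = soft_threshold(z, 1.0)   # L1-style update: small values become exactly zero
ridge_like = z / (1.0 + 1.0)          # L2-style update: values shrink but stay nonzero

print("soft-thresholded:", lasso_like)
print("scaled:          ", ridge_like)
```

The two small entries are set to zero by soft-thresholding, while the L2-style scaling only halves them. This is the mechanism behind Lasso's feature selection.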

---

### Key Differences Between Ridge and Lasso:

| **Aspect**              | **Ridge Regression**                          | **Lasso Regression**                        |
|-------------------------|-----------------------------------------------|---------------------------------------------|
| **Penalty Term**        | L2 penalty ($ \sum b_i^2 $)                   | L1 penalty ($ \sum \lvert b_i \rvert $)     |
| **Effect on Coefficients** | Shrinks coefficients but does not set to zero | Shrinks coefficients and can set to zero    |
| **Feature Selection**   | No                                            | Yes                                         |
| **Use Case**            | When all features are relevant                | When some features are irrelevant           |

---

### Key Parameters:
- **$ \lambda $** (named `alpha` in the code): Controls the strength of regularization.
  - Higher $ \lambda $: More regularization (smaller coefficients).
  - Lower $ \lambda $: Less regularization (closer to ordinary least squares).
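
The effect of $ \lambda $ can be checked numerically. The sketch below is a hedged illustration on synthetic data: the data, the values of `lam`, and the helper `ridge_coefficients` are assumptions, and the intercept is omitted to keep the example short.

```python
import numpy as np

# Synthetic data (assumed for illustration) with known coefficients [2, -1, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

def ridge_coefficients(X, y, lam):
    # Minimizes ||y - Xb||^2 + lam * ||b||^2. Using the sum of squared errors
    # instead of the MSE only rescales lam by the number of samples.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = (0.0, 1.0, 10.0, 100.0, 1000.0)
norms = []
for lam in lambdas:
    b = ridge_coefficients(X, y, lam)
    norms.append(np.linalg.norm(b))
    print(f"lambda={lam:>7}: coefficients={np.round(b, 3)}, norm={norms[-1]:.3f}")
```

With `lam = 0` the result is the ordinary least-squares fit; as `lam` grows, the coefficient vector shrinks toward zero without any entry becoming exactly zero.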

---

### When to Use Ridge vs. Lasso:
- Use **Ridge Regression** when:
  - You have many features, and all of them are potentially relevant.
  - You want to reduce the impact of multicollinearity on the coefficient estimates.
- Use **Lasso Regression** when:
  - You have many features, and you suspect some are irrelevant.
  - You want a simpler, more interpretable model.

---


In [None]:

# Example dataset
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 4, 5, 6],
    'y': [2, 4, 5, 4, 5]
}
df = pd.DataFrame(data)

# Extract features (X) and target (y)
X = df[['X1', 'X2']].values
y = df['y'].values

# Add a column of ones for the intercept term
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Ridge Regression implementation
def ridgeRegression(X, y, alpha):
    identityMatrix = np.eye(X.shape[1])  # Identity matrix
    identityMatrix[0, 0] = 0  # Do not penalize the intercept term
    coefficients = np.linalg.inv(X.T @ X + alpha * identityMatrix) @ X.T @ y
    return coefficients

# Train the model
alpha = 1.0  # Regularization parameter
coefficients = ridgeRegression(X, y, alpha)

# Output results
print("Ridge Regression Coefficients:", coefficients)

In [None]:
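def lassoRegression(X, y, alpha, max_iter=1000, tol=1e-4):
    """Coordinate descent for 0.5 * ||y - Xb||^2 + alpha * ||b||_1."""
    nSamples, nFeatures = X.shape
    coefficients = np.zeros(nFeatures)  # Initialize coefficients
    for _ in range(max_iter):
        previous = coefficients.copy()
        for j in range(nFeatures):
            # Update each coefficient using coordinate descent.
            # Note: every coefficient is penalized, including the intercept in column 0.
            X_j = X[:, j]
            y_pred = X @ coefficients
            # Partial residual: what remains after removing the other features' contribution
            r = y - y_pred + coefficients[j] * X_j
            rho = r.T @ X_j
            # Soft-thresholding step
            coefficients[j] = np.sign(rho) * max(np.abs(rho) - alpha, 0) / (X_j.T @ X_j)
        # Check for convergence: stop when a full sweep barely changes the coefficients
        if np.linalg.norm(coefficients - previous) < tol:
            break
    return coefficients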

# Train the model
alpha = 0.1  # Regularization parameter
coefficients = lassoRegression(X, y, alpha)

# Output results
print("Lasso Regression Coefficients:", coefficients)

## **Logistic Regression in Machine Learning**

Logistic Regression is a **supervised learning algorithm** used for **binary classification** tasks. It predicts the probability that an instance belongs to a particular class using the **sigmoid function**. Unlike linear regression, which outputs continuous values, logistic regression maps predictions to probabilities between **0 and 1**.

---

## **Mathematical Formulation**  

### **1. Hypothesis Function**
Logistic Regression models the probability of a binary outcome as:

$
h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}
$

Where:  
- $ h_{\theta}(x) $ is the predicted probability that $ y = 1 $.  
- $ \theta $ is the weight vector (parameters).  
- $ x $ is the input feature vector.  
- The function $ \frac{1}{1 + e^{-z}} $ is called the **sigmoid function**.
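
As a small worked example of the hypothesis function, the sketch below evaluates the sigmoid for one input. The parameter vector `theta` and the feature vector `x` are hypothetical values chosen so that $ \theta^T x = 0 $.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0])   # hypothetical parameters: bias and one weight
x = np.array([1.0, 0.5])        # leading 1 multiplies the bias term

z = theta @ x                   # -1 * 1 + 2 * 0.5 = 0
probability = sigmoid(z)        # sigmoid(0) = 0.5: the decision boundary

print("theta^T x =", z)
print("P(y = 1 | x) =", probability)
print("sigmoid(-10), sigmoid(10) =", sigmoid(-10.0), sigmoid(10.0))
```

Large positive values of $ \theta^T x $ push the probability toward 1, large negative values push it toward 0, and a value of 0 sits exactly on the decision boundary.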

---

### **2. Cost Function (Log Loss)**
The cost function for logistic regression is derived from the **likelihood function** and is given by:

$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]
$

Where:  
- $ m $ is the number of training examples.  
- $ y^{(i)} $ is the actual label (0 or 1) of the $ i $-th training sample.  
- $ h_{\theta}(x^{(i)}) $ is the predicted probability.

This function is minimized using **Gradient Descent**.
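
To see how the cost behaves, the sketch below evaluates the log loss for confident-correct and confident-wrong predictions. The labels and probabilities are made-up values, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Average log loss; probabilities are clipped to keep log() finite."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the correct class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class

loss_right = log_loss(y_true, mostly_right)
loss_wrong = log_loss(y_true, mostly_wrong)
loss_unsure = log_loss([1], [0.5])    # equals ln(2)

print(f"right={loss_right:.4f}, wrong={loss_wrong:.4f}, unsure={loss_unsure:.4f}")
```

The loss grows sharply when the model is confident and wrong, which is why log loss is preferred over squared error for probability outputs.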

---

### **3. Gradient Descent Update Rule**
To update the parameters $ \theta $, we compute the gradient:

$
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}
$

Where:  
- $ \alpha $ is the **learning rate**.
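
The update rule can be applied to all parameters at once in vectorized form. The sketch below is a minimal illustration on synthetic one-dimensional data; the data, the learning rate of 0.1, and the 1000 iterations are assumptions for this toy problem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed): class 0 centered at -2, class 1 centered at +2
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
X = np.column_stack([np.ones_like(feature), feature])   # bias column + feature
y = np.concatenate([np.zeros(100), np.ones(100)])

theta = np.zeros(X.shape[1])
alpha = 0.1          # learning rate (assumed value)
m = len(y)

for _ in range(1000):
    h = sigmoid(X @ theta)                 # predicted probabilities
    gradient = (X.T @ (h - y)) / m         # (1/m) * sum((h - y) * x_j) for every j
    theta = theta - alpha * gradient       # simultaneous update of all theta_j

predictions = (sigmoid(X @ theta) >= 0.5).astype(int)
accuracy = np.mean(predictions == y)
print("theta =", np.round(theta, 3), "training accuracy =", accuracy)
```

Computing the gradient as a single matrix product updates every $ \theta_j $ from the same predictions, which matches the simultaneous update the formula describes.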

---

## **Assumptions of Logistic Regression**
1. **Linear Relationship**: The independent variables are assumed to have a linear relationship with the **log odds** of the dependent variable.
2. **Independent Observations**: Observations should be independent of each other.
3. **No Multicollinearity**: Features should not be highly correlated.
4. **Large Sample Size**: Logistic Regression performs better when there is sufficient data.
5. **Absence of Outliers**: Outliers can impact the model’s performance.

---

---

## **Summary**
1. **Logistic Regression** is a classification algorithm that predicts probabilities using the **sigmoid function**.
2. The model is trained using **Gradient Descent**, minimizing the **log loss function**.
3. Assumptions include **linear relationships in log-odds**, **no multicollinearity**, and **independent observations**.
4. The from-scratch implementation that follows covers the **sigmoid function, cost function, gradient descent, and prediction**.


In [None]:
# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def costFunction(x, y, theta):
    # Log loss averaged over the m training examples
    m = len(y)
    h = sigmoid(x.dot(theta))
    cost = (-1/m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cost


def gradientDescent(x, y, theta, alpha, iterations=100):
    # alpha is the learning rate
    m = len(y)
    costHistory = []
    for _ in range(iterations):
        # Use the function arguments (x), not a global X
        h = sigmoid(x.dot(theta))
        gradient = (1/m) * x.T.dot(h - y)
        theta = theta - alpha * gradient
        costHistory.append(costFunction(x, y, theta))
    # Return only after all iterations have run
    return theta, costHistory


def predict(X, theta):
    probabilities = sigmoid(X.dot(theta))
    return (probabilities >= 0.5).astype(int)

# Example usage (X: feature matrix with a bias column, y: 0/1 labels):
# theta, costHistory = gradientDescent(X, y, np.zeros(X.shape[1]), alpha=0.1)
# yPred = predict(X, theta)

# accuracy = np.mean(yPred == y) * 100
# print(f"Model Accuracy: {accuracy:.2f}%")



## **Support Vector Machines (SVM) in Machine Learning**

A Support Vector Machine (SVM) is a **supervised learning algorithm** used for both **classification** and **regression** tasks, although it is most commonly applied to **binary classification** problems.

SVM finds an **optimal decision boundary** (hyperplane) that maximizes the **margin** between the two classes.

---

## **1. Mathematical Formulation of SVM**  

### **1.1 Hyperplane Equation**  
For an **n-dimensional** feature space, a hyperplane is represented as:  

$
w^T x + b = 0
$

Where:  
- $ w $ is the weight vector (parameters of the model).  
- $ x $ is the input feature vector.  
- $ b $ is the bias term.  

A **decision boundary** in SVM separates the two classes such that:

$
w^T x + b > 0 \quad \text{for class } +1
$

$
w^T x + b < 0 \quad \text{for class } -1
$

---

### **1.2 Margin Maximization**  
SVM tries to **maximize the margin** between the closest data points from both classes. These closest points are called **support vectors**.

The margin is defined as:

$
\frac{2}{\|w\|}
$

The goal is to **maximize** $ \frac{2}{\|w\|} $, which is equivalent to **minimizing** $ \frac{1}{2} \|w\|^2 $.

Thus, the optimization problem becomes:

$
\min \frac{1}{2} \|w\|^2
$

subject to:

$
y_i (w^T x_i + b) \geq 1 \quad \forall i
$

where $ y_i $ is the class label ($+1$ or $-1$).
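
As a small numeric check of the constraints and the margin width, the sketch below evaluates a candidate hyperplane on four points. The weight vector, bias, and data points are hypothetical values chosen by hand so that the constraints hold.

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 with w = (1, 1) and b = -3
w = np.array([1.0, 1.0])
b = -3.0

# Four hand-picked points with labels in {-1, +1}
X = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# Left-hand side of the constraint y_i (w^T x_i + b) >= 1 for every point
constraint_values = y * (X @ w + b)
all_satisfied = bool(np.all(constraint_values >= 1))

# Points where the constraint holds with equality lie on the margin (support vectors)
support_vector_indices = np.flatnonzero(np.isclose(constraint_values, 1.0))

margin_width = 2.0 / np.linalg.norm(w)

print("constraint values:", constraint_values)
print("all constraints satisfied:", all_satisfied)
print("support vectors:", support_vector_indices, "margin width:", round(margin_width, 4))
```

Here the points $(1, 1)$ and $(2, 2)$ sit exactly on the margin boundaries, and the distance between those boundaries is $ 2 / \|w\| = \sqrt{2} $.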

---

### **1.3 Soft Margin SVM (Handling Misclassifications)**  
For **non-linearly separable data**, we introduce **slack variables** $ \xi_i $ to allow some misclassifications.

The modified optimization problem:

$
\min \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i
$

subject to:

$
y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0
$

Where:  
- $ C $ is a regularization parameter that controls the trade-off between **maximizing the margin** and **minimizing misclassifications**.
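
For a fixed $ w $ and $ b $, the smallest slack that satisfies both constraints is $ \xi_i = \max(0, 1 - y_i (w^T x_i + b)) $. The sketch below evaluates the soft-margin objective this way; the hyperplane, the data points, and the values of `C` are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return the soft-margin objective and the slack of each point."""
    # Smallest slack that satisfies y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    objective = 0.5 * np.dot(w, w) + C * np.sum(slack)
    return objective, slack

w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.5, 1.0], [1.0, 1.5]])
y = np.array([-1, 1, -1, 1])   # the last two points are on the wrong side

objective_small_c, slack = soft_margin_objective(w, b, X, y, C=1.0)
objective_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)

print("slack:", slack)
print("objective (C=1):", objective_small_c, " objective (C=10):", objective_large_c)
```

Points that satisfy the margin have zero slack. A larger `C` makes the same violations more expensive, pushing the optimizer toward fewer violations at the cost of a narrower margin.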

---

### **1.4 Kernel Trick for Non-Linearly Separable Data**  
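For data that is not linearly separable in the original feature space, **SVM uses kernel functions** to implicitly map the data to a **higher-dimensional space**, where a linear separator may exist. The kernel computes inner products in that space directly, so the mapping never has to be computed explicitly.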

Common kernel functions:
- **Linear Kernel**: $ K(x_i, x_j) = x_i^T x_j $
- **Polynomial Kernel**: $ K(x_i, x_j) = (x_i^T x_j + c)^d $
- **Radial Basis Function (RBF) Kernel**: $ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) $
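
The kernels listed above can be written in a few lines each. The sketch below also checks, for one pair of 2-D vectors, that the degree-2 polynomial kernel equals an inner product of explicit quadratic features. The default parameter values and the helper `quadratic_feature_map` are assumptions for illustration.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, c=1.0, d=2):
    return (xi @ xj + c) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def quadratic_feature_map(x):
    # Explicit 6-D features whose inner product equals (xi^T xj + 1)^2 for 2-D inputs
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.0])

k_linear = linear_kernel(xi, xj)                 # 1*3 + 2*0 = 3
k_poly = polynomial_kernel(xi, xj)               # (3 + 1)^2 = 16
k_rbf = rbf_kernel(xi, xj)                       # exp(-0.5 * 8) = exp(-4)
k_explicit = quadratic_feature_map(xi) @ quadratic_feature_map(xj)

print(k_linear, k_poly, k_rbf, k_explicit)
```

The polynomial kernel gives the same number as the explicit six-dimensional inner product while only ever working with the original two-dimensional vectors, which is the point of the kernel trick.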

---

## **2. Assumptions of SVM**
1. **Binary Classification**: SVM is primarily designed for binary classification tasks.
2. **Independent Features**: Features should be **independent** and **not highly correlated**.
3. **Linearly Separable Data (For Linear SVM)**: Works best when data is **linearly separable**.
4. **Proper Kernel Selection**: For non-linear data, an appropriate **kernel function** must be chosen.

---

## **3. Summary**
1. **SVM** finds the best **hyperplane** to separate data while maximizing the **margin**.
2. **Mathematics**:  
   - Finds the optimal $ w, b $ by solving a **convex optimization problem**.  
   - Uses **Lagrange multipliers** to handle constraints.  
3. **Implementation Steps**:  
   - **Data handling with pandas**  
   - **Feature standardization**  
   - **Gradient-based optimization** for finding the best decision boundary  
4. **Advantages of SVM**:
   - Works well for **small to medium datasets**.
   - Effective in **high-dimensional spaces**.
   - Handles **non-linearly separable data** using **kernels**.


In [None]:
class SVM:
    def __init__(self, learningRate=0.001, lambdaParam=0.01, iterations=1000):
        self.lr = learningRate
        self.lambda_param = lambdaParam  # Regularization
        self.iterations = iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0
        y = y.flatten()  # Ensure y is a 1D array

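        for _ in range(self.iterations):
            for i in range(m):
                # Labels are expected to be -1 or +1
                condition = y[i] * (np.dot(X[i], self.w) + self.b) >= 1
                if condition:
                    # Margin satisfied: only the regularization term contributes
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Margin violated: the hinge loss gradient is -y_i * x_i for w and -y_i for b
                    self.w -= self.lr * (2 * self.lambda_param * self.w - y[i] * X[i])
                    self.b += self.lr * y[i]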

    def predict(self, X):
        approx = np.dot(X, self.w) + self.b
        return np.sign(approx)


## **k-Nearest Neighbors (k-NN) Algorithm: Instance-Based Learning Using Distance Metrics**  

### **1. Introduction to k-NN**
k-Nearest Neighbors (k-NN) is a **supervised learning algorithm** used for **classification** and **regression**. It is an **instance-based** (or **lazy learning**) algorithm, meaning it **does not build a model during training**. Instead, it **memorizes the training data** and makes predictions based on the **k nearest data points**.

---

### **2. How k-NN Works**
1. **Store Training Data**: k-NN does not learn a model but keeps all training data in memory.
2. **Calculate Distance**: When a new test point needs classification, k-NN computes the distance between this point and all training points.
3. **Find k Nearest Neighbors**: It selects the **k closest training points** based on a chosen distance metric.
4. **Vote (For Classification) or Average (For Regression)**:
   - **Classification**: Assigns the **most common label** among the k neighbors.
   - **Regression**: Takes the **mean (or weighted mean)** of the k neighbors' values.

---

### **3. Distance Metrics Used in k-NN**
To determine the nearest neighbors, k-NN uses various distance metrics:

1. **Euclidean Distance** (Most Common):
   $
   d(x, y) = \sqrt{\sum (x_i - y_i)^2}
   $
   - Works well for **continuous numerical features**.

2. **Manhattan Distance**:
   $
   d(x, y) = \sum |x_i - y_i|
   $
   - Suitable when **data has high dimensions**.

3. **Minkowski Distance** (Generalization of Euclidean and Manhattan):
   $
   d(x, y) = \left(\sum |x_i - y_i|^p \right)^{\frac{1}{p}}
   $
   - When $ p=2 $, it is **Euclidean Distance**.
   - When $ p=1 $, it is **Manhattan Distance**.

4. **Cosine Similarity** (for Text & High-Dimensional Data):
   $
   \cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}
   $
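   - Measures the **angle** between vectors rather than absolute distance. Because it is a similarity (larger means closer), the corresponding distance is $ 1 - \cos(\theta) $.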
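
The four metrics above can be compared on a single pair of vectors. The vectors below are made-up values, and `minkowski` is a hypothetical helper name.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])   # differences are (-3, -4, 0)

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

euclidean = np.sqrt(np.sum((x - y) ** 2))        # sqrt(9 + 16 + 0) = 5
manhattan = np.sum(np.abs(x - y))                # 3 + 4 + 0 = 7
cosine_similarity = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cosine_distance = 1.0 - cosine_similarity        # turn the similarity into a distance

print("Euclidean:", euclidean, " Minkowski p=2:", minkowski(x, y, 2))
print("Manhattan:", manhattan, " Minkowski p=1:", minkowski(x, y, 1))
print("Cosine similarity:", round(cosine_similarity, 4))
```

The Minkowski helper reproduces the Euclidean and Manhattan values for $ p = 2 $ and $ p = 1 $, which confirms that it generalizes both.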

---

### **4. Assumptions in k-NN**
1. **Locally Similar Data**: Assumes that similar points are close together.
2. **No Explicit Model**: k-NN is a **lazy learning algorithm** and does not create a mathematical model.
3. **Feature Scaling Matters**: Since k-NN uses distances, **features should be scaled** (for example standardized) so that no single feature dominates the distance.
4. **Computational Complexity**: k-NN requires storing the full dataset, making it **slow for large datasets**.
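
The effect of feature scaling is easy to demonstrate. In the sketch below, the nearest neighbor of the first row changes once both features are standardized. The age and income values are hypothetical, and `nearest_to_first` is a helper written for this example.

```python
import numpy as np

# Hypothetical features: age in years, income in dollars
X = np.array([
    [25.0, 50_000.0],   # query point (row 0)
    [60.0, 51_000.0],   # similar income, very different age
    [26.0, 58_000.0],   # similar age, somewhat different income
    [45.0, 90_000.0],
])

def nearest_to_first(X):
    """Index of the row closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(X[1:] - X[0], axis=0 + 1)
    return 1 + int(np.argmin(distances))

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("nearest neighbor on raw features:   ", nearest_to_first(X))
print("nearest neighbor on scaled features:", nearest_to_first(X_scaled))
```

On the raw features, income differences in the thousands swamp age differences, so the 60-year-old is considered closest. After standardization both features contribute on a comparable scale and the 26-year-old becomes the nearest neighbor.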


---

## **5. Summary**
### **Key Features of k-NN**
1. **Lazy Learning**: No training phase; the model just stores data.
2. **Distance-Based**: Predictions are based on distance metrics like **Euclidean or Manhattan**.
3. **Highly Interpretable**: Easy to understand but computationally expensive for large datasets.

### **Pros and Cons**
| Pros | Cons |
|------|------|
| Simple to implement | Slow for large datasets |
| Works well for low-dimensional data | Sensitive to irrelevant features |
| No assumptions about data distribution | Requires proper feature scaling |


In [None]:
from collections import Counter


class KNN:
    def __init__(self, k=3, distance_metric="euclidean"):
        self.k = k
        self.distance_metric = distance_metric

    def _compute_distance(self, x1, x2):
        if self.distance_metric == "euclidean":
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.distance_metric == "manhattan":
            return np.sum(np.abs(x1 - x2))
        else:
            raise ValueError("Unsupported distance metric")

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for x in X_test:
            # Compute distances from x to all training samples
            distances = [self._compute_distance(x, x_train) for x_train in self.X_train]

            # Get the indices of k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]

            # Get labels of the k nearest neighbors
            k_nearest_labels = [self.y_train[i] for i in k_indices]

            # Majority vote for classification
            most_common = Counter(k_nearest_labels).most_common(1)[0][0]
            predictions.append(most_common)

        return np.array(predictions)


### **Extending k-NN: Weighted k-NN and k-NN Regression**

We will now extend our **k-NN implementation** to include:  
1. **Weighted k-NN**: Assigns **higher weight to closer neighbors**.  
2. **k-NN Regression**: Predicts **continuous values** instead of discrete classes.  

---

## **1. Weighted k-NN (Classification)**
In standard k-NN, all neighbors have **equal influence** on classification. In **weighted k-NN**, closer neighbors have a **higher influence** using an inverse distance weighting:

$
w_i = \frac{1}{d(x, x_i) + \epsilon}
$

Where:  
- $ w_i $ is the weight for neighbor $ i $.  
- $ d(x, x_i) $ is the distance to neighbor $ i $.  
- $ \epsilon $ is a small constant to prevent division by zero.


## **2. k-NN Regression**
Instead of **voting**, k-NN regression **averages** the k nearest values:

$
\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i
$

For **weighted k-NN regression**, the formula is:

$
\hat{y} = \frac{\sum w_i y_i}{\sum w_i}
$

where $ w_i $ is the weight (inverse of distance).
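
A short worked example shows how weighting changes both kinds of prediction. The distances, labels, and target values below are made-up numbers for $ k = 3 $ neighbors.

```python
import numpy as np
from collections import Counter

distances = np.array([0.5, 1.0, 2.0])      # distances to the 3 nearest neighbors
labels = ["A", "B", "B"]                    # class labels of those neighbors
values = np.array([10.0, 20.0, 60.0])       # regression targets of those neighbors
eps = 1e-5                                  # prevents division by zero

weights = 1.0 / (distances + eps)           # roughly 2, 1, and 0.5

# Classification: plain majority vote vs. weighted vote
majority_label = Counter(labels).most_common(1)[0][0]
weighted_votes = {}
for label, weight in zip(labels, weights):
    weighted_votes[label] = weighted_votes.get(label, 0.0) + weight
weighted_label = max(weighted_votes, key=weighted_votes.get)

# Regression: plain mean vs. weighted mean
plain_mean = values.mean()
weighted_mean = np.sum(weights * values) / np.sum(weights)

print("majority vote:", majority_label, " weighted vote:", weighted_label)
print("plain mean:", plain_mean, " weighted mean:", round(weighted_mean, 3))
```

The single closest neighbor outweighs the two farther ones in the weighted vote, and the weighted mean is pulled toward the values of the nearer neighbors.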


---

## **3. Summary**
| Variant | Description | Pros | Cons |
|---------|-------------|------|------|
| **Standard k-NN Classification** | Majority voting among k neighbors | Simple, interpretable | Sensitive to irrelevant features |
| **Weighted k-NN Classification** | Closer neighbors have higher weight | Better performance on noisy data | Sensitive to outliers |
| **k-NN Regression** | Predicts mean of k nearest neighbors | Works well for smooth functions | Struggles with non-uniform data |
| **Weighted k-NN Regression** | Uses inverse distance weighting | More accurate for non-uniform data | Prone to overfitting |


In [None]:
class WeightedKNN(KNN):  # Inherit from our previous KNN class
    def predict(self, X_test):
        predictions = []
        for x in X_test:
            distances = [self._compute_distance(x, x_train) for x_train in self.X_train]
            k_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = [self.y_train[i] for i in k_indices]
            k_nearest_distances = [distances[i] for i in k_indices]

            # Compute weights (inverse distance)
            weights = [1 / (d + 1e-5) for d in k_nearest_distances]

            # Weighted vote for classification
            weighted_votes = {}
            for label, weight in zip(k_nearest_labels, weights):
                if label not in weighted_votes:
                    weighted_votes[label] = 0
                weighted_votes[label] += weight

            # Get label with highest weighted vote
            most_common = max(weighted_votes, key=weighted_votes.get)
            predictions.append(most_common)

        return np.array(predictions)



class KNNRegressor:
    def __init__(self, k=3, distance_metric="euclidean", weighted=False):
        self.k = k
        self.distance_metric = distance_metric
        self.weighted = weighted

    def _compute_distance(self, x1, x2):
        if self.distance_metric == "euclidean":
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.distance_metric == "manhattan":
            return np.sum(np.abs(x1 - x2))
        else:
            raise ValueError("Unsupported distance metric")

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for x in X_test:
            distances = [self._compute_distance(x, x_train) for x_train in self.X_train]
            k_indices = np.argsort(distances)[:self.k]
            k_nearest_values = [self.y_train[i] for i in k_indices]

            if self.weighted:
                k_nearest_distances = [distances[i] for i in k_indices]
                weights = [1 / (d + 1e-5) for d in k_nearest_distances]
                prediction = np.sum(np.array(weights) * np.array(k_nearest_values)) / np.sum(weights)
            else:
                prediction = np.mean(k_nearest_values)

            predictions.append(prediction)

        return np.array(predictions)


# **Naïve Bayes: Probabilistic Classifier with Feature Independence Assumption**  

## **1. Introduction to Naïve Bayes**  
Naïve Bayes is a **probabilistic classifier** based on **Bayes' Theorem**. It is called "naïve" because it **assumes that features are conditionally independent given the class**, which is often not true in real-world data but simplifies computations.  

Naïve Bayes is widely used in **text classification, spam filtering, sentiment analysis, and medical diagnosis**.

---

## **2. Bayes' Theorem**  
Bayes’ theorem provides a way to update probabilities based on new evidence:

$
P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}
$

Where:
- $ P(Y | X) $ = **Posterior probability** (probability of class $ Y $ given features $ X $).
- $ P(X | Y) $ = **Likelihood** (probability of features $ X $ given class $ Y $).
- $ P(Y) $ = **Prior probability** (probability of class $ Y $ before seeing data).
- $ P(X) $ = **Evidence** (probability of features $ X $ occurring).
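
As a small worked example of the formula above, the sketch below computes a posterior for a one-feature spam filter. All probabilities are hypothetical values chosen for illustration, and the evidence $ P(X) $ is obtained from the law of total probability over the two classes.

```python
# Hypothetical numbers for a one-feature spam filter (illustration only).
p_spam = 0.2                 # prior P(Y = spam)
p_ham = 1.0 - p_spam         # prior P(Y = ham)
p_word_given_spam = 0.6      # likelihood P(X = "offer" | spam)
p_word_given_ham = 0.05      # likelihood P(X = "offer" | ham)

# Evidence P(X): total probability of seeing the word in any message.
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Posterior P(spam | X) from Bayes' theorem.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```

Even though only 20% of messages are spam under these assumed numbers, seeing the word raises the spam probability to 75%, because the word is much more likely under the spam class.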

---

## **3. Assumptions of Naïve Bayes**
1. **Feature Independence**: Given the class, each feature is independent of the others, so $ P(X | Y) = \prod_i P(x_i | Y) $.
2. **Equal Importance**: All features have the same weight in classification.
3. **Fixed Probabilities**: Probabilities are estimated from training data and do not change dynamically.

---

## **4. Types of Naïve Bayes Classifiers**
1. **Gaussian Naïve Bayes**: Assumes features are **normally distributed** (for continuous data).
2. **Multinomial Naïve Bayes**: Used for **text classification** (word counts in documents).
3. **Bernoulli Naïve Bayes**: Used for **binary features** (e.g., presence/absence of a word).

---


## **5. Gaussian Naïve Bayes**
Gaussian Naïve Bayes assumes that, within each class, every feature $ x_i $ follows a normal distribution with a class-specific mean $ \mu $ and variance $ \sigma^2 $:
$
P(x_i | Y) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2 \sigma^2}\right)
$
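
The density above is usually evaluated in log space, since products of many small densities underflow quickly. The following is a minimal sketch under that assumption; the function name `gaussian_log_likelihood` and the `eps` guard are illustrative choices, not part of any library.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var, eps=1e-9):
    """Sum of per-feature Gaussian log-densities under the independence assumption."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float) + eps  # eps is an assumed guard against zero variance
    # log of the normal density: -0.5 * log(2*pi*var) - (x - mean)^2 / (2*var)
    log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    return float(np.sum(log_pdf))

# Two standard-normal features evaluated at their means.
ll_at_mean = gaussian_log_likelihood([0.0, 0.0], mean=[0.0, 0.0], var=[1.0, 1.0])

# A point far from the mean still yields a finite (very negative) value,
# whereas exponentiating first and then taking the log would give -inf.
ll_far = gaussian_log_likelihood([100.0], mean=[0.0], var=[1.0])
```

Here `ll_at_mean` equals $ -\log(2\pi) $ up to the tiny `eps` correction, since each feature contributes $ -\tfrac{1}{2}\log(2\pi) $.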

---

## **6. Summary**
| Variant | Description | Use Cases |
|---------|-------------|------------|
| **Gaussian Naïve Bayes** | Assumes normal distribution of features | Medical diagnosis, image classification |
| **Multinomial Naïve Bayes** | Used for word frequency data | Text classification, spam filtering |
| **Bernoulli Naïve Bayes** | Works with binary features | Sentiment analysis, fraud detection |

### **Pros and Cons**
| Pros | Cons |
|------|------|
| Simple, fast, and scalable | Assumes feature independence |
| Works well with high-dimensional data | Poor performance if features are correlated |
| Requires very little training data | Sensitive to zero probabilities |

---


In [None]:
class GaussianNaiveBayes:
    def __init__(self):
        self.classes = None
        self.class_priors = {}
        self.means = {}
        self.variances = {}

    def fit(self, X_train, y_train):
        self.classes = np.unique(y_train)  # Get unique class labels
        
        for cls in self.classes:
            X_c = X_train[y_train == cls]  # Subset data for each class
            
            # Calculate mean and variance for each feature
            self.means[cls] = np.mean(X_c, axis=0)
            self.variances[cls] = np.var(X_c, axis=0)
            
            # Calculate prior probability P(Y)
            self.class_priors[cls] = X_c.shape[0] / X_train.shape[0]

    def _calculate_log_likelihood(self, x, mean, var):
        """Compute the per-feature Gaussian log-likelihood log P(x_i | Y)."""
        eps = 1e-9  # Avoid division by zero when a feature has zero variance
        var = var + eps
        # Working in log space avoids np.log(0) when the density underflows
        return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

    def _calculate_posterior(self, x):
        """Return the class with the highest log P(Y) + sum_i log P(x_i | Y)."""
        posteriors = {}
        
        for cls in self.classes:
            prior = np.log(self.class_priors[cls])  # Log prior for numerical stability
            likelihoods = np.sum(self._calculate_log_likelihood(x, self.means[cls], self.variances[cls]))
            posteriors[cls] = prior + likelihoods  # Unnormalized log-posterior; P(X) is the same for every class
            
        return max(posteriors, key=posteriors.get)  # Return class with max probability

    def predict(self, X_test):
        return np.array([self._calculate_posterior(x) for x in X_test])


### **1. Multinomial Naïve Bayes (For Text Classification)**  
In Multinomial Naïve Bayes, we assume that **features represent counts** (e.g., word frequencies in text classification). The likelihood is calculated as:

$
P(X | Y) = \prod_{i=1}^{n} P(x_i | Y)
$

Where $ P(x_i | Y) $ is estimated as:

$
P(x_i | Y) = \frac{count(x_i, Y) + \alpha}{\sum_{j} count(x_j, Y) + \alpha V}
$

- $ count(x_i, Y) $ = number of times feature $ x_i $ appears in class $ Y $  
- $ V $ = vocabulary size  
- $ \alpha $ = Laplace smoothing parameter (usually $ \alpha = 1 $)  

---

### **3. Laplace Smoothing**
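Without smoothing, a word that never appears in class $ Y $ during training gets $ P(x_i | Y) = 0 $, which zeroes out the entire product. Laplace smoothing prevents this by adding $ \alpha $ to every count. The formula for the **adjusted probability** is: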

$
P(x_i | Y) = \frac{count(x_i, Y) + \alpha}{\sum_{j} count(x_j, Y) + \alpha V}
$

Where $ \alpha > 0 $ ensures all probabilities are non-zero.
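
To make the effect of smoothing concrete, here is a short sketch that applies the formula to a toy vocabulary. The words and counts are hypothetical and stand in for the word counts of a single class.

```python
# Hypothetical word counts observed in one class (e.g. "spam").
counts = {"free": 3, "meeting": 0, "offer": 1}
alpha = 1.0                      # Laplace smoothing parameter
V = len(counts)                  # vocabulary size
total = sum(counts.values())     # total word occurrences in the class

# Smoothed estimate: (count + alpha) / (total + alpha * V)
smoothed = {w: (c + alpha) / (total + alpha * V) for w, c in counts.items()}

for word, prob in smoothed.items():
    print(f"{word}: {prob:.3f}")
```

The unseen word "meeting" receives probability $ 1/7 $ instead of zero, and the smoothed probabilities still sum to one.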


In [None]:
from collections import defaultdict

class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.class_priors = {}
        self.word_counts = {}
        self.vocab = set()
        self.class_word_totals = {}

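    def fit(self, X_train, y_train):
        y_train = np.asarray(y_train)  # Ensures `y_train == cls` compares element-wise, even for lists
        self.classes = np.unique(y_train)
        self.class_priors = {cls: np.mean(y_train == cls) for cls in self.classes}
        self.word_counts = {cls: defaultdict(int) for cls in self.classes}
        self.class_word_totals = {cls: 0 for cls in self.classes}
        self.vocab = set()  # Reset so repeated calls to fit() do not mix vocabularies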

        for text, cls in zip(X_train, y_train):
            words = text.split()
            for word in words:
                self.word_counts[cls][word] += 1
                self.class_word_totals[cls] += 1
                self.vocab.add(word)

    def _calculate_posterior(self, text, cls):
        words = text.split()
        log_prob = np.log(self.class_priors[cls])
        V = len(self.vocab)

        for word in words:
            word_freq = self.word_counts[cls].get(word, 0)
            prob = (word_freq + self.alpha) / (self.class_word_totals[cls] + self.alpha * V)
            log_prob += np.log(prob)

        return log_prob

    def predict(self, X_test):
        predictions = []
        for text in X_test:
            posteriors = {cls: self._calculate_posterior(text, cls) for cls in self.classes}
            predictions.append(max(posteriors, key=posteriors.get))
        return np.array(predictions)


## **Decision Trees: Hierarchical Classification Based on Feature Splits**

### **1. Mathematical Foundation**  

A **Decision Tree** is a hierarchical structure where data is recursively split based on feature values to form a tree-like structure. The goal is to **maximize information gain** at each split.

---

### **2. Splitting Criteria**  

At each node, we choose the best feature to split using **Impurity Measures**:

#### **(a) Gini Impurity (for Classification)**
$
Gini(D) = 1 - \sum_{i=1}^{c} p_i^2
$
Where:
- $ p_i $ is the proportion of class $ i $ in dataset $ D $.  
- Lower **Gini** means purer nodes.

#### **(b) Entropy (for Classification)**
$
Entropy(D) = -\sum_{i=1}^{c} p_i \log_2 p_i
$
- Measures **disorder** in the dataset.
- Lower entropy means purer splits.

#### **(c) Mean Squared Error (for Regression)**
$
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y})^2
$
- $ \hat{y} $ is the mean target value of the samples in the node.
- Used in **Regression Trees**.
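
The three impurity measures can be sketched in a few lines of NumPy. The helper names `gini`, `entropy`, and `node_mse` are chosen here for illustration, and the label arrays are toy data.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits; np.unique only returns classes with p > 0."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def node_mse(targets):
    """Mean squared deviation from the node mean (regression impurity)."""
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - targets.mean()) ** 2)

mixed = np.array([0, 0, 1, 1])   # evenly split between two classes
pure = np.array([1, 1, 1, 1])    # a single class

print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
```

For the pure node both classification measures are zero, which is the minimum either can take. For two classes, the evenly split node gives the maximum of each measure.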

---

### **3. Information Gain (IG)**
The best split is chosen by **maximizing Information Gain**:

$
IG = I(D) - \sum_{j} \frac{|D_j|}{|D|} I(D_j)
$

Where:
- $ I(D) $ = Impurity of the parent node.
- $ D_j $ = Subsets after the split.
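
As an illustration of the formula, the sketch below computes the information gain of two candidate splits on a toy label array, using Gini impurity for $ I(\cdot) $. The function names and data are assumptions made for this example.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

y = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes perfectly.
perfect_gain = information_gain(y, [y[:3], y[3:]])

# A split whose children are as mixed as the parent.
poor_gain = information_gain(y, [y[[0, 3]], y[[1, 2, 4, 5]]])
```

The perfect split recovers the full parent impurity of 0.5, while the second split leaves both children at impurity 0.5 and so gains nothing.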

---


### **6. Summary**

| **Criterion** | **Formula** | **Use Case** |
|--------------|------------|-------------|
| **Gini Impurity** | $ 1 - \sum p_i^2 $ | Faster computation, common in classification |
| **Entropy** | $ -\sum p_i \log_2 p_i $ | Grounded in information theory; slightly slower because of the logarithm |


In [None]:

class DecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2, criterion="gini"):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.tree = None

    def _gini(self, y):
        classes, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1 - np.sum(p ** 2)

    def _entropy(self, y):
        classes, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-9))

    def _best_split(self, X, y):
        best_gain, best_feature, best_value = 0, None, None
        parent_impurity = self._gini(y) if self.criterion == "gini" else self._entropy(y)
        
        for feature in range(X.shape[1]):
            values = np.unique(X[:, feature])
            for value in values:
                left_mask, right_mask = X[:, feature] <= value, X[:, feature] > value
                n_left, n_right = left_mask.sum(), right_mask.sum()
                # Note: min_samples_split is applied here as the minimum size of each child
                if n_left < self.min_samples_split or n_right < self.min_samples_split:
                    continue
                
                left_impurity = self._gini(y[left_mask]) if self.criterion == "gini" else self._entropy(y[left_mask])
                right_impurity = self._gini(y[right_mask]) if self.criterion == "gini" else self._entropy(y[right_mask])
                
                weighted_impurity = (n_left * left_impurity + n_right * right_impurity) / len(y)
                info_gain = parent_impurity - weighted_impurity

                if info_gain > best_gain:
                    best_gain, best_feature, best_value = info_gain, feature, value
        
        return best_feature, best_value

    def _build_tree(self, X, y, depth=0):
        if len(np.unique(y)) == 1 or depth >= self.max_depth:
            return np.argmax(np.bincount(y))

        feature, value = self._best_split(X, y)
        if feature is None:
            return np.argmax(np.bincount(y))

        left_mask, right_mask = X[:, feature] <= value, X[:, feature] > value
        left_branch = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_branch = self._build_tree(X[right_mask], y[right_mask], depth + 1)

        return {"feature": feature, "value": value, "left": left_branch, "right": right_branch}

    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

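    def _predict_sample(self, x, node):
        # Leaves are NumPy integers (from np.argmax), which are not instances of
        # Python's int, so check for internal (dict) nodes instead.
        if not isinstance(node, dict):
            return node
        return self._predict_sample(x, node["left"] if x[node["feature"]] <= node["value"] else node["right"])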

    def predict(self, X):
        return np.array([self._predict_sample(x, self.tree) for x in X])


## **Random Forest: Ensemble of Decision Trees for Improved Accuracy and Robustness**  

### **1. Mathematical Foundation**  

**Random Forest** is an **ensemble learning method** that combines multiple **Decision Trees** to improve accuracy and reduce overfitting. It uses **Bagging (Bootstrap Aggregation)** and **Random Feature Selection** to build diverse trees.

---

### **2. Key Concepts**  

#### **(a) Bagging (Bootstrap Aggregation)**
Each tree is trained on a **bootstrap sample**: $ |D| $ points drawn from the training data with replacement, so some points appear more than once and others are left out.

$
D_i \sim \text{Bootstrap Sample from } D
$

#### **(b) Random Feature Selection**
At each split, only a **random subset of features** is considered as split candidates. If there are $ M $ total features, a common choice for the subset size is $ \sqrt{M} $ (for classification) or $ M/3 $ (for regression).
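
The two sources of randomness can be sketched as follows. This is a minimal illustration on toy data that shows a single draw of each. The seed, the array `X`, and the variable names are assumptions for the example, and how often the feature draw is repeated depends on the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen only for reproducibility
X = np.arange(40).reshape(10, 4)   # toy data: 10 samples, M = 4 features
n_samples, n_features = X.shape

# Bagging: draw n_samples row indices with replacement.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Random feature selection: sqrt(M) distinct features (classification heuristic).
k = max(1, int(np.sqrt(n_features)))
feat_idx = rng.choice(n_features, size=k, replace=False)

# Data seen by one tree (or one split) under these two draws.
X_boot = X[row_idx][:, feat_idx]
print(X_boot.shape)  # (10, 2)
```

Because rows are drawn with replacement, `row_idx` may contain duplicates, whereas `feat_idx` never does.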

#### **(c) Final Prediction (Voting/Averaging)**
For **classification**, the majority vote is taken:

$
\hat{Y} = \arg\max_k \sum_{i=1}^{T} \mathbb{1}(h_i(X) = k)
$

For **regression**, the average prediction is taken:

$
\hat{Y} = \frac{1}{T} \sum_{i=1}^{T} h_i(X)
$

Where $ T $ is the number of trees, and $ h_i(X) $ is the prediction from the $ i $-th tree.
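
Both aggregation rules reduce to a column-wise operation on a matrix of per-tree predictions. The sketch below uses hypothetical predictions from $ T = 3 $ trees, and `majority_vote` is a helper name chosen for this example.

```python
import numpy as np

# Hypothetical class predictions: rows are T = 3 trees, columns are 4 samples.
class_preds = np.array([[0, 1, 1, 2],
                        [0, 1, 0, 2],
                        [1, 1, 0, 2]])

def majority_vote(column):
    """Most frequent non-negative integer label in one column."""
    return np.bincount(column).argmax()

votes = np.apply_along_axis(majority_vote, axis=0, arr=class_preds)
print(votes)  # [0 1 0 2]

# Hypothetical regression predictions: 3 trees, 2 samples.
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 5.0],
                      [6.0, 7.0]])
avg = reg_preds.mean(axis=0)
print(avg)  # [4. 5.]
```

Note that `np.bincount` requires non-negative integer labels, and `argmax` breaks ties in favor of the smallest label.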


---

### **6. Summary**
| **Concept** | **Explanation** |
|------------|----------------|
| **Bagging** | Trains each tree on a different random sample of data |
| **Random Feature Selection** | Each split considers only a random subset of features |
| **Voting (Classification)** | Final prediction is based on majority vote |
| **Averaging (Regression)** | Final prediction is the average output of all trees |


In [None]:
class RandomForest:
    def __init__(self, n_trees=10, max_depth=5, min_samples_split=2, criterion="gini"):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.trees = []

    def _bootstrap_sample(self, X, y):
        indices = np.random.choice(len(X), size=len(X), replace=True)
        return X[indices], y[indices]

    def fit(self, X, y):
        self.trees = []
        for _ in range(self.n_trees):
            X_sample, y_sample = self._bootstrap_sample(X, y)
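            # Each tree is the DecisionTree defined above, fit on its own bootstrap sample.
            # Note: this simplified version does not perform random feature selection.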
            tree = DecisionTree(max_depth=self.max_depth, min_samples_split=self.min_samples_split, criterion=self.criterion)
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=predictions)
