# <center> <span style="color:#f6f794"> **Supervised ML: Classification Algorithms** </span> </center>

#### <span style="color:#c69005"> What are "**Classification Algorithms**"? </span>

Classification is a type of supervised learning where the goal is to <span style="color:#F2C122"> **assign an input** </span> to one of several predefined categories or classes based on its features. Each input is labeled with a class, and the model learns to predict the class for new, unseen data.

* <i> **Classification:** The model assigns each new input (for example, a picture) to one of a fixed set of categories. </i>
* <i> Example: Given an image of a flower, classifying it as “rose,” “tulip,” or “daisy.” </i>

#### Differences with "similar" techniques: 

| **Techniques**        | **Description**                                               | **Key Difference with Classification**                                        |
|-----------------------|---------------------------------------------------------------|-------------------------------------------------------------------------------|
| **Classification**     | <span style="color:#F2C122"> **Assigning** </span> an input into one of several predefined categories or classes. | <span style="color:#F2C122"> **Each input is assigned to one class** </span> .                                          |
| **Segmentation**       | <span style="color:#F2C122"> **Dividing** </span> data into meaningful segments or regions (often in images or text). | Involves dividing data into parts, but not necessarily assigning each part to a class. |
| **Clustering**         | <span style="color:#F2C122"> **Grouping** </span> similar data points together without predefined labels. | Clustering is unsupervised: groups are discovered from the data, whereas classification learns predefined classes from labeled examples. |
| **Regression**         | <span style="color:#F2C122"> **Predicting** </span> a continuous value (numeric) based on input data. | Classification involves discrete classes, while regression <u> **deals with continuous outputs** </u> . |
| **Anomaly Detection**  | <span style="color:#F2C122"> **Identifying** </span> unusual or outlier data points that deviate from normal patterns. | Classification assigns labels, while anomaly detection <u> **identifies data points that don’t fit the pattern** </u> . |

#### <span style="color:#c69005"> **Types** of Supervised Classification Algorithms: </span>

* **Logistic Regression**
* **Decision Trees**
* **Random Forests**
* **Support Vector Machines (SVMs)**
* **K-Nearest Neighbors (KNN)**
* **Naive Bayes**
* **Neural Networks (Deep Learning)**
 

#### <span style="color:#c69005"> **What** Do These Algorithms Do? </span>

They learn patterns from labeled data and predict classes for new data.

#### <span style="color:#c69005"> **How** Are They Useful: </span>

Classification algorithms are useful when you have **labeled** data, and your goal is to **categorize** new data points into one of several predefined **classes** or labels.

#### <span style="color:#c69005"> **Why** are Classification Algorithms Useful?  </span>

* **Automation of Decision-Making:** Classification helps automate processes like fraud detection, recommendation systems, or medical diagnosis.
* **Improved Accuracy:** It helps in making predictions about unseen data based on learned patterns from the training data.
* **Efficient Categorization:** Classification helps categorize vast amounts of data quickly and accurately, enabling businesses to make data-driven decisions.
* **Risk Management:** Helps in identifying risks, such as detecting fraudulent transactions or predicting equipment failure.

#### <span style="color:#c69005"> **When** to Use Classification Algorithms </span>

- When you:
    - Have labeled data.
    - Want to predict categories.
    - Need to automate classification.
    - Have a clear target variable.


#### <span style="color:#c69005"> Key **Advantages** of Classification Algorithms: </span>

* **Versatility:** Can be applied to a wide range of domains (e.g., finance, healthcare, marketing).
* **Scalability:** Many algorithms can handle large datasets, especially those that can be trained incrementally or in parallel.
* **Interpretability:** Many classification algorithms provide insight into how decisions are made, which is useful for understanding model behavior.

____
# <center> <span style="color:#f6f794">**Types** of supervised classification algorithms: </span> </center>


### <span style="color:#c69005"> **Logistic Regression**  </span>

* What is it?
    * A linear probabilistic classifier that predicts the probability of a binary class. It uses a sigmoid function to transform the output into probabilities.
* What does it do?
    * It generates a linear decision boundary that separates the classes.
* How does it work?
    * The algorithm finds the coefficients (weights) that minimize the log loss function. A linear combination of the features is passed through the sigmoid function, which converts it to a probability between 0 and 1.
* When to choose it? 
    - Classes that are roughly linearly separable.
    - Little or no multicollinearity among features.
    - Binary classification (2 classes).
* Example:
    * Predicting if an email is spam based on features like message length and keyword frequency. 
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It draws a straight line to separate two groups and tells you how likely something belongs to one of them.
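
The steps described above (sigmoid, log loss, weight updates) can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy spam data, the learning rate, and the function names (`fit`, `predict_proba`, `log_loss`) are all assumptions made for the example, and batch gradient descent is just one common way to minimize the log loss.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1: sigmoid of a linear combination of the features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def log_loss(weights, bias, X, y):
    """Average binary cross-entropy over the dataset."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, target in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(X)

def fit(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the log loss, starting from zero weights."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target  # gradient w.r.t. z
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Toy data (made up): [message length in hundreds of characters, spam-keyword count]
X = [[0.5, 0], [1.0, 1], [0.8, 0], [1.2, 5], [0.9, 6], [1.5, 7]]
y = [0, 0, 0, 1, 1, 1]  # 1 = spam

weights, bias = fit(X, y)
print("P(spam) for a new email:", predict_proba(weights, bias, [1.1, 6]))
```

The decision boundary is the set of points where the linear combination equals zero, which is where the predicted probability is exactly 0.5.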



### <span style="color:#c69005"> **Decision Trees**  </span>

* What is it?
    * A non-linear classifier that splits data using simple rules into subsets at each decision node.
* What does it do?
    * It divides the data according to features until each terminal node (leaf) contains only one class or a stopping rule (such as a maximum depth) is reached.
* How does it work?
    * The tree is built using a greedy algorithm that maximizes information gain (based on measures like entropy or Gini impurity). Each node asks a question about a feature, and based on the answer, it splits the data.
* When to choose it? 
    - Data with complex interaction.
    - Categorical or numerical variables.
    - Need interpretability (understanding how the decision was made).
    - Data containing clear decision rules that lead to different outcomes.
* Example:
    * Classifying whether a person will buy a product based on age, income, and location.
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It asks yes/no questions about your data until it decides what class it belongs to—like playing 20 questions.
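
To make the greedy split search concrete, here is a small sketch that computes entropy, Gini impurity, and information gain, then picks the best single split. The toy purchase data and the helper names (`information_gain`, `best_split`) are assumptions for illustration; a full tree would apply the same search recursively to each resulting subset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance of mislabeling a random element of the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, feature, threshold, impurity=entropy):
    """Impurity reduction from splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:  # a split must put data on both sides
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted

def best_split(rows, labels, impurity=entropy):
    """Greedy search over every feature and every observed value as a threshold."""
    best = (0.0, None, None)  # (gain, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({x[feature] for x in rows}):
            gain = information_gain(rows, labels, feature, threshold, impurity)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best

# Toy data (made up): [age, income in thousands] -> did the person buy?
rows = [[22, 25], [25, 60], [47, 30], [52, 80], [46, 75], [56, 35]]
labels = ["no", "yes", "no", "yes", "yes", "no"]

gain, feature, threshold = best_split(rows, labels)
print(f"Best question: is feature {feature} <= {threshold}? (gain = {gain:.2f})")
```

In this toy dataset income separates the two classes perfectly, so the search chooses it over age.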



### <span style="color:#c69005"> **Random Forest**  </span>
* What is it?
    * An ensemble of many decision trees, where each tree is trained on a random subset of data and features.
* What does it do?
    * Each tree makes a prediction, and the final result is determined by the majority vote across all trees.
* How does it work?
    * It trains multiple trees on bootstrap samples of the data, and for each split, it selects a random subset of features. The predictions are aggregated through voting (for classification).
* When to choose it? 
    - Large datasets with complex relationships.
    - You want a robust model that generalizes well.
    - You want less overfitting than a single decision tree.
    - High-dimensional data.
* Example:
    * Classifying images of animals, where each tree predicts the animal in the image, and the majority vote decides the type.
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It grows a bunch of decision trees and lets them vote. The majority decides the final class.
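
The bootstrap-and-vote procedure can be sketched as follows. To keep it short, each "tree" here is a one-split stump, which is a simplification and not how full random forests are usually built; the animal data, the seed, and the function names are likewise assumptions for the example.

```python
import random
from collections import Counter

def majority_vote(votes):
    """Most common label among the votes."""
    return Counter(votes).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def train_stump(rows, labels, features):
    """Fit a one-split tree (a stump) using only the given feature indices."""
    best = None  # (weighted impurity, feature, threshold, left label, right label)
    for f in features:
        for t in sorted({x[f] for x in rows}):
            left = [y for x, y in zip(rows, labels) if x[f] <= t]
            right = [y for x, y in zip(rows, labels) if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority_vote(left), majority_vote(right))
    if best is None:  # no valid split: always predict the majority label
        label = majority_vote(labels)
        return lambda x: label
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label

def train_forest(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random subset of features considered for this tree's split.
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump(sample_rows, sample_labels, features))
    return forest

def predict(forest, x):
    """Every tree votes; the majority decides."""
    return majority_vote([tree(x) for tree in forest])

# Toy data (made up): [weight in kg, height in cm]
rows = [[3.5, 23], [4.0, 25], [4.5, 24], [5.0, 26], [3.8, 22],
        [20, 55], [25, 60], [30, 58], [18, 50], [28, 62]]
labels = ["cat"] * 5 + ["dog"] * 5

forest = train_forest(rows, labels)
print(predict(forest, [3.0, 20]), predict(forest, [35, 70]))
```

Because each tree sees a different sample and a different feature subset, individual trees make different mistakes, and the vote averages many of them out.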



### <span style="color:#c69005"> **Support Vector Machines (SVM)**  </span>
* What is it?
    * A classifier that finds the hyperplane that maximizes the margin between classes. It can use kernels to handle non-linear data.
* What does it do?
    * It finds the optimal hyperplane that separates the classes, and if data is non-linear, it uses a kernel to transform the space into one where the classes are linearly separable.
* How does it work?
    * The algorithm solves an optimization problem to maximize the margin between classes. It uses Lagrange multipliers to find the optimal separating hyperplane. Kernels like RBF (radial basis function) implicitly map data into higher dimensions, where it can become linearly separable.
* When to choose it? 
    - Well-separated data (clear class separation).
    - High-dimensional feature space.
    - Small to medium-sized datasets.
* Example:
    * Classifying images of cats and dogs, where SVM separates the classes using a large margin between the data points.
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It finds the best line (or curve) that separates your classes with the biggest possible space in between.
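
The sketch below illustrates two of the ideas above, the margin and the kernel, without solving the SVM optimization problem itself. The candidate hyperplanes are hand-picked, the data is made up, and the explicit feature map `x -> [x, x^2]` stands in for what a kernel does implicitly; all of these are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.
    It is positive only if every point lies on its correct side (labels +1 / -1)."""
    norm = math.sqrt(dot(w, w))
    return min(label * (dot(w, x) + b) / norm for x, label in zip(X, y))

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: a similarity that decays with squared Euclidean distance."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Toy data (made up), labels +1 / -1, separable by a vertical line.
X = [[1.0, 1.0], [2.0, 1.0], [2.0, 3.0], [5.0, 1.0], [5.0, 3.0], [6.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]

# Two hand-picked hyperplanes that both separate the classes.
margin_a = geometric_margin([1.0, 0.0], -3.5, X, y)  # x1 = 3.5, midway between classes
margin_b = geometric_margin([1.0, 0.0], -2.5, X, y)  # x1 = 2.5, close to the -1 class
print("margins:", margin_a, margin_b)  # an SVM prefers the larger margin

# Kernel idea with an explicit feature map: 1-D points that no single
# threshold separates become linearly separable after x -> [x, x^2].
X_1d = [-3.0, -1.0, 1.0, 3.0]
y_1d = [1, -1, -1, 1]
X_mapped = [[x, x * x] for x in X_1d]
margin_mapped = geometric_margin([0.0, 1.0], -5.0, X_mapped, y_1d)  # hyperplane x^2 = 5
print("margin after mapping:", margin_mapped)

print("RBF similarity of nearby vs distant points:",
      rbf_kernel([0.0, 0.0], [0.1, 0.1]), rbf_kernel([0.0, 0.0], [3.0, 3.0]))
```

A real SVM solver searches over all hyperplanes for the one with the largest margin, and with a kernel such as RBF it never builds the mapped features explicitly; it only evaluates kernel values between pairs of points.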



### <span style="color:#c69005"> **K-Nearest Neighbors (KNN)**  </span>
* What is it?
    * A lazy learner that classifies a data point based on its k nearest neighbors.
* What does it do?
    * It assigns a class to a new data point by finding the k nearest points and selecting the most common class among them.
* How does it work?
    * For each new data point, it calculates the Euclidean distance (or another distance metric) between the point and every training point, then assigns the most common class among the k closest points (k is the number of neighbors consulted, not the number of classes or clusters).
* When to choose it? 
    - Classes that separate clearly by proximity (similar points share a class).
    - Small datasets.
    - Simplicity matters: there is no training phase, although prediction slows down as the dataset grows.
* Example:
    * Predicting if a person has diabetes based on the proximity to other people with similar characteristics.
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It looks at the closest neighbors and classifies based on what most of them are.
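
A minimal sketch of the distance-then-vote procedure is shown below. The patient data, the feature choice, and the function names are made up for illustration; in practice the features would normally be scaled first so that no single feature dominates the distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    # Majority vote among the k neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (made up): [glucose in mg/dL, BMI]
train_X = [[85, 22], [90, 24], [95, 23], [150, 33], [160, 35], [170, 31]]
train_y = ["no diabetes"] * 3 + ["diabetes"] * 3

print(knn_predict(train_X, train_y, [155, 34]))
print(knn_predict(train_X, train_y, [88, 23]))
```

Note that all the work happens at prediction time, which is why KNN is called a lazy learner.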



### <span style="color:#c69005"> **Naive Bayes**  </span>
* What is it?
    * A probabilistic classifier that uses Bayes' Theorem under the assumption of independent features.
* What does it do?
    * It calculates the posterior probability of each class and assigns the class with the highest probability.
* How does it work?
    * It applies **Bayes' Theorem** to calculate the posterior probability. Each feature’s contribution is assumed to be independent given the class, so the likelihood of an input is the product of its per-feature likelihoods.
* When to choose it? 
    - Features that are roughly independent of each other within each class.
    - Text classification tasks, like spam detection or sentiment analysis.
* Example:
    * Classifying emails as "spam" or "not spam" based on keyword frequencies.
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It uses probabilities to guess the class, assuming each feature acts on its own.
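
As a sketch of how the posterior comparison works for text, the code below implements a small multinomial-style Naive Bayes with add-one (Laplace) smoothing. The four training messages, the smoothing choice, and the function names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the class with the highest (log) posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            likelihood = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)  # independence: log-likelihoods add up
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training messages (made up), already split into words.
docs = [["win", "money", "now"], ["free", "money", "offer"],
        ["meeting", "at", "noon"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "not spam", "not spam"]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "money"]))
print(predict_nb(model, ["meeting", "notes"]))
```

Working with log-probabilities turns the product of per-feature likelihoods into a sum, which avoids numerical underflow on long messages.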



### <span style="color:#c69005"> **Neural Networks (Deep Learning)**  </span>
* What is it?
    * A model inspired by the human brain, consisting of layers of artificial neurons that learn complex representations of the data.
* What does it do?
    * It adjusts the weights of connections between neurons to minimize the prediction error.
* How does it work?
    * Each neuron computes a weighted sum of its inputs and applies an activation function (like ReLU). During training, backpropagation computes the gradient of the loss with respect to every weight, and gradient descent updates the weights to minimize the loss.
* When to choose it? 
    - Large datasets.
    - Non-linear complex problems.
    - High accuracy is required, and interpretability is less important.
    - Unstructured data (images, audio, text).
* Example:
    * Classifying images of objects into categories like "cat", "dog", "car".
* <span style="color:#F2C122">  **In simple words**: </span> 
    * It’s like a big brain with layers that learns patterns—great for images, sound, and messy data.
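
The forward pass, backpropagation, and gradient descent update can be sketched with a tiny network: two inputs, two ReLU hidden units, and one sigmoid output. The starting weights are hand-picked, the network is fitted to a single made-up example, and the function names are assumptions; real training loops iterate over a whole dataset and usually rely on a deep learning library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden pre-activations, hidden activations, and output probability."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(params["W1"], params["b1"])]
    hidden = [max(0.0, v) for v in pre]  # ReLU activation
    z = sum(w * h for w, h in zip(params["W2"], hidden)) + params["b2"]
    return pre, hidden, sigmoid(z)

def loss(params, x, target):
    """Binary cross-entropy for one example."""
    p = forward(params, x)[2]
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def gradients(params, x, target):
    """Backpropagation: apply the chain rule from the output back to each weight."""
    pre, hidden, p = forward(params, x)
    dz = p - target  # gradient of the loss w.r.t. the output pre-activation
    d_hidden = [dz * w for w in params["W2"]]
    d_pre = [dh if v > 0 else 0.0 for dh, v in zip(d_hidden, pre)]  # ReLU derivative
    return {
        "W1": [[dp * xi for xi in x] for dp in d_pre],
        "b1": d_pre,
        "W2": [dz * h for h in hidden],
        "b2": dz,
    }

def sgd_step(params, grads, lr):
    """Return new parameters moved a small step against the gradient."""
    return {
        "W1": [[w - lr * g for w, g in zip(wr, gr)]
               for wr, gr in zip(params["W1"], grads["W1"])],
        "b1": [b - lr * g for b, g in zip(params["b1"], grads["b1"])],
        "W2": [w - lr * g for w, g in zip(params["W2"], grads["W2"])],
        "b2": params["b2"] - lr * grads["b2"],
    }

# Hand-picked starting weights (arbitrary, for illustration only).
params = {"W1": [[0.5, -0.3], [0.2, 0.8]], "b1": [0.1, -0.1],
          "W2": [0.7, -0.4], "b2": 0.0}
x, target = [1.0, 0.0], 1  # one made-up training example

losses = []
for _ in range(20):
    losses.append(loss(params, x, target))
    params = sgd_step(params, gradients(params, x, target), lr=0.1)

print("loss at start:", losses[0], "loss after 20 steps:", losses[-1])
```

Each step nudges every weight in the direction that lowers the loss on this example, which is the same mechanism a full-size network uses across millions of weights and examples.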

<br>

#

</br>



| <span style="color:#F2C122"> **Situation**   </span>                         | <span style="color:#F2C122"> **Data Requirements / Technical Notes** </span> | <span style="color:#F2C122"> **Recommended Algorithm** </span> |
|------------------------------------------|--------------------------------------------------------------------------|-----------------------------|
| **Linearly separable and binary data**   | Requires linearity between features and the log-odds; sensitive to multicollinearity and outliers; features should be scaled. | **Logistic Regression**     |
| **Complex relationships among features** | No need for feature scaling; robust to outliers; can handle both numerical and categorical data; prone to overfitting. | **Decision Trees**          |
| **Large datasets and robustness needed** | Robust to outliers; some implementations handle missing values natively; does not assume normality or homoscedasticity; no need for feature scaling. | **Random Forest**           |
| **Small sample size, high-dimensional**  | Requires feature scaling; sensitive to outliers; works best when classes are well-separated (margin-based). | **SVM**                     |
| **Small datasets with clear local structure** | Requires feature scaling; sensitive to noise and irrelevant features; assumes local homogeneity; prediction slows down as the training set grows. | **KNN**                     |
| **Text or discrete variables**           | Assumes feature independence; works well even with small datasets; robust to irrelevant features but sensitive to zero probabilities. | **Naive Bayes**             |
| **Non-linear, high complexity data**     | Requires large datasets; sensitive to feature scaling; can overfit if not regularized; data should be preprocessed (normalized, encoded). | **Neural Networks**         |


<br>

#

</br>

______

# <center> <span style="color:#f6f794"> Key words: </span> </center>

- **Supervised Learning**: A type of machine learning where the model is trained with labeled data to make predictions.
- **Model**: A mathematical representation or algorithm that learns from data to make predictions or decisions.
- **Labeled Data**: Data where each input is paired with a known output (label).
- **Mapping**: The process of learning a relationship between input data and output labels in supervised learning.
- **Input Features**: The data used by the model to make predictions (e.g., words, numbers, images).
- **Output Labels**: The correct answers or categories assigned to the input features during training.
- **Training**: The process of teaching a model by providing it with labeled data so it can learn.
- **Loss Function**: A function that measures how far the model's predictions are from the actual answers, which the model tries to minimize.
- **Prediction**: The result or output produced by the model based on new input data.
- **Classification**: A type of supervised learning where the model assigns input data into categories or classes.
- **Regression**: A type of supervised learning where the model predicts continuous numerical values.
- **Generalization**: The model's ability to make accurate predictions on new, unseen data.
- **Optimization**: The process of adjusting the model's parameters to minimize errors and improve performance.
- **Gradient Descent**: A popular optimization technique used to adjust model parameters by reducing the loss function.
- **Linear Probabilistic**: A method (like logistic regression) where the prediction is a probability derived from a linear combination of inputs.
- **Linear Decision Boundary**: A straight line or hyperplane that separates different classes in the feature space.
- **Log Loss Function**: Also called binary cross-entropy, it measures the performance of a classification model by penalizing incorrect predictions more strongly when the model is confident but wrong.
- **Entropy**: A measure of disorder or uncertainty. In decision trees, it measures how mixed the class labels are in a dataset.
- **Gini Impurity**: Another measure used in decision trees to evaluate the quality of a split; it represents the probability of incorrectly classifying a randomly chosen element.
- **Hyperplane**: A decision boundary in SVMs that separates classes in a high-dimensional space; in 2D, it’s a line, in 3D, a plane, and beyond that, a hyperplane.
- **Kernel**: A function that measures the similarity between two data points as if they had been mapped into a higher-dimensional space, which makes it easier to find a separating hyperplane (used in SVMs).
- **Lagrange Multipliers**: A mathematical technique used in SVM optimization to maximize the margin while satisfying constraints.
- **Euclidean Distance**: A way of measuring straight-line distance between two points in space, often used in KNN to find nearest neighbors.
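
For reference, several of the terms above have compact standard formulas. The notation is an assumption made here and not taken from the notes: $N$ examples with labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, class proportions $p_k$ within a node, and two points $a$ and $b$ with $n$ features.

$$
\begin{aligned}
\text{Sigmoid:} \quad & \sigma(z) = \frac{1}{1 + e^{-z}} \\
\text{Log loss:} \quad & L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
\text{Entropy:} \quad & H = -\sum_{k} p_k \log_2 p_k \\
\text{Gini impurity:} \quad & G = 1 - \sum_{k} p_k^2 \\
\text{Euclidean distance:} \quad & d(a, b) = \sqrt{\sum_{j=1}^{n} (a_j - b_j)^2}
\end{aligned}
$$

For example, a node holding half "yes" and half "no" labels has $H = 1$ bit and $G = 0.5$, while a pure node has $H = 0$ and $G = 0$.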

# <center>  <span style="color:#f6f794"> Key words, but let's make it easier: </span> </center>

- **Supervised Learning**: Teaching a computer to make decisions by showing it examples with the right answers.
- **Model**: A tool (like a robot brain) that learns from data to help make decisions or predictions.
- **Labeled Data**: Information where we already know the correct answer or category (like marking emails as "spam" or "not spam").
- **Mapping**: The process of connecting what the computer learns from the data to the right answer.
- **Input Features**: The information the computer uses to make its guess (like pictures, words, or numbers).
- **Output Labels**: The correct answer or category that we want the computer to predict (like "spam" or "not spam").
- **Training**: The process where the computer looks at examples and learns to make predictions.
- **Loss Function**: A way to measure how far off the computer’s guesses are from the right answer, and the goal is to make it as small as possible.
- **Prediction**: The computer's best guess about what the answer should be for new information it has never seen before.
- **Classification**: A type of learning where the computer puts things into different groups or categories.
- **Regression**: A type of learning where the computer predicts a number, like guessing the price of a house.
- **Generalization**: When the computer is good at making correct predictions, even for new things it hasn’t seen before.
- **Optimization**: Adjusting the computer's thinking to improve its decision-making and get better at making predictions.
- **Gradient Descent**: A method the computer uses to get better by learning from its mistakes and slowly improving its guesses.
- **Linear Probabilistic**: The computer uses math to guess a probability, like “there’s a 90% chance this is spam,” based on a straight-line relationship.
- **Linear Decision Boundary**: A line (or flat surface) that the computer draws to separate things into groups—like drawing a line between apples and oranges.
- **Log Loss Function**: A way of scoring how wrong a prediction is, especially when the computer is very sure but ends up being wrong—it’s a harsh teacher!
- **Entropy**: A way to measure uncertainty or “messiness” in the data—the more mixed up the groups are, the higher the entropy.
- **Gini Impurity**: Another way to measure how mixed a group is—used to help decision trees figure out how to split the data.
- **Hyperplane**: A fancy word for a line or surface the computer uses to separate groups, especially in high dimensions (like 3D or more).
- **Kernel**: A trick that helps the computer look at data in a new way so that it can find patterns even when things are not clearly separated.
- **Lagrange Multipliers**: A math tool used behind the scenes to help find the best way to separate groups while following certain rules.
- **Euclidean Distance**: Just the straight-line distance between two points—like using a ruler to measure how close two things are.

<br>

#

</br>



______

# <center>  <span style="color:#f6f794"> Useful links: </span> </center> 

https://datascientest.com/en/classification-algorithms-definition-and-main-models

https://www.ibm.com/think/topics/classification-machine-learning

https://www.sciencedirect.com/topics/engineering/classification-algorithm

https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning#what_is_classification
