# What is classification?

Classification is a supervised machine learning technique used to categorize data into predefined classes or labels. It involves training a model on a labeled dataset, where the input features are associated with specific output labels. Once trained, the model can predict the class of new, unseen data based on the patterns it learned during training.

For example, in email spam detection, a classification model can be trained to distinguish between "spam" and "not spam" emails based on features such as the presence of certain keywords, sender information, and email structure.

There are various algorithms used for classification, including decision trees, support vector machines, logistic regression, and neural networks. The choice of algorithm depends on the nature of the data and the specific requirements of the task at hand.

Classification algorithms fall into two groups: lazy learners and eager learners.

## 1. Lazy Learners
Lazy learners, also known as instance-based learners, do not build a general model during the training phase. Instead, they store the training data and wait until a query is made to make predictions. When a new instance needs to be classified, lazy learners compare it to the stored instances and use a similarity measure to determine the class.
### Examples of Lazy Learners:
- k-Nearest Neighbors (k-NN)
- Case-Based Reasoning (CBR)


## 2. Eager Learners
Eager learners, on the other hand, build a general model during the training phase. They analyze the training data and create a model that captures the underlying patterns and relationships. Once the model is built, it can be used to make predictions on new instances without needing to refer back to the original training data.
### Examples of Eager Learners:
- Decision Trees
- Support Vector Machines (SVM)
- Neural Networks
- Logistic Regression


# Classification algorithms

Many classification algorithms are available, but no single one is superior in every case. The best choice depends on the application and the nature of the available dataset. For example, if the classes are linearly separable, linear classifiers such as logistic regression or Fisher’s linear discriminant can outperform more sophisticated models, and vice versa.

## 1. Decision Tree Classifier
A decision tree is a classification or regression model that represents decision rules in a hierarchical tree structure. Each path from the root to a leaf corresponds to an if-then rule, and together these rules are mutually exclusive and exhaustive. The rules are learned from the training data one split at a time: each split partitions the instances, and every partition is then processed separately. This process continues until a stopping criterion is satisfied.

The tree is built using a top-down, recursive, divide-and-conquer strategy. Classic algorithms such as ID3 expect categorical attributes, so continuous attributes are discretized beforehand. Attributes selected near the root have the greatest influence on the classification outcome, and they are chosen according to measures of impurity reduction such as information gain.

However, decision trees are prone to overfitting, often producing overly complex trees that capture noise or outliers in the training data. Such overfitted models perform well on the training set but generalize poorly to unseen data. Overfitting can be controlled either through pre-pruning (stopping the growth of the tree early) or post-pruning (trimming branches after the tree is fully grown).
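The attribute-selection step described above can be sketched in a few lines. This is a minimal illustration of information gain on categorical attributes, not a full tree learner; the toy rows, the attribute names `outlook` and `windy`, and the labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    # Group the labels by the value each row takes for the attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: each row is a dict of categorical attributes.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {attr: information_gain(rows, labels, attr) for attr in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)  # the attribute a tree would place at the root
print(gains, best_attribute)
```

In this toy set, `outlook` separates the two classes perfectly while `windy` tells us nothing, so a tree built this way would split on `outlook` first.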

![Decision Tree](images/decision_tree.png)

## 2. Naive Bayes Classifier

Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class label. Despite this "naive" assumption, Naive Bayes classifiers often perform well in practice, especially for text classification tasks such as spam detection and sentiment analysis.

The classifier calculates the posterior probability of each class given the input features and assigns the class with the highest probability to the instance. Because the normalizing term $P(X = x)$ is the same for every class, it can be dropped, which gives the decision rule:

$$
\hat{y} = \arg\max_{c \in \mathcal{C}} P(C = c)\prod_{i=1}^{d} P(X_i = x_i \mid C = c)
$$
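As a rough sketch of how this decision rule can be evaluated, the code below estimates the probabilities by counting and compares classes using sums of logarithms instead of products. The add-one (Laplace) smoothing and the two toy features (whether an email contains the word "offer", whether the sender is known) are assumptions made for this example and are not part of the formula above.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature index) -> value -> count
    feature_values = defaultdict(set)     # feature index -> set of observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def predict_naive_bayes(model, row):
    """Return the class maximizing log P(C=c) + sum_i log P(X_i=x_i | C=c)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            numerator = value_counts[(c, i)][value] + 1
            denominator = n_c + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical features: (contains "offer", sender is known)
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("yes", "yes")]
labels = ["spam", "spam", "not spam", "not spam", "not spam"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("yes", "no")))
print(predict_naive_bayes(model, ("no", "yes")))
```

Working in log space is a common design choice because the product of many small probabilities can underflow to zero.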

![Naive Bayes](images/Naives-Bayes.jpg)

## 3. Artificial Neural Networks (ANN)
Artificial Neural Networks (ANNs) are computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers: an input layer, one or more hidden layers, and an output layer. Each connection between neurons has an associated weight that is adjusted during the training process to minimize the error in predictions. ANNs are capable of learning complex patterns and relationships in data, making them suitable for a wide range of tasks, including classification, regression, and pattern recognition.
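To make the idea of weight adjustment concrete, here is a deliberately tiny sketch: a single sigmoid neuron (no hidden layers) trained with stochastic gradient descent on the logical OR function. The learning rate, epoch count, zero initialization, and the OR data are illustrative assumptions. Real networks stack many such units and train them with backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, targets, learning_rate=0.5, epochs=2000):
    """Train one sigmoid neuron with stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            output = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Gradient of the cross-entropy loss w.r.t. the pre-activation.
            error = output - target
            # Move each weight against its gradient to reduce the error.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict_neuron(weights, bias, x):
    """Threshold the neuron's output at 0.5 to obtain a class label."""
    activation = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if activation >= 0.5 else 0

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]  # logical OR

weights, bias = train_neuron(samples, targets)
predictions = [predict_neuron(weights, bias, x) for x in samples]
print(predictions)
```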

![Artificial Neural Networks](images/ANN.png)

## 4. K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for classification and regression tasks. In KNN, the class of a new instance is determined by the majority class of its 'k' nearest neighbors in the feature space. The distance between instances is typically measured using metrics such as Euclidean distance or Manhattan distance. KNN is a lazy learner, meaning it does not build a model during training but instead stores the training data for use during prediction. The choice of 'k' and the distance metric can significantly impact the performance of the algorithm.

![K-Nearest Neighbors](images/KNN-2.png)

Depending on the value of 'k', KNN can be sensitive to noise in the data. A small 'k' may lead to overfitting, while a large 'k' may smooth out important patterns. Therefore, selecting an appropriate value for 'k' is crucial for optimal performance.
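The effect of 'k' can be seen in a small sketch. The training points below are hypothetical, and one of them is deliberately mislabeled to act as noise; Euclidean distance is used via `math.dist` (Python 3.8+).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query.
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D data: class "A" near (1.5, 1.5), class "B" near (6.5, 6.5).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 7.5), (6.5, 5.5),
                (2.1, 2.1)]                      # noisy point inside the "A" region
train_labels = ["A", "A", "A", "B", "B", "B", "B"]

query = (2.0, 2.0)
print(knn_predict(train_points, train_labels, query, k=1))  # follows the noisy neighbor
print(knn_predict(train_points, train_labels, query, k=3))  # majority vote outvotes the noise
```

With `k=1` the prediction is decided entirely by the single mislabeled neighbor, while `k=3` lets the two genuine "A" neighbors win the vote.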

Distance Metrics:
- Euclidean Distance
    formula:
$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

- Manhattan Distance
    formula:
$$
d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
$$

- Minkowski Distance
    formula (with $m = 1$ it reduces to the Manhattan distance, with $m = 2$ to the Euclidean distance):
$$
d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^m \right)^{1/m}
$$
- Hamming Distance
    formula, where $\delta(p_i, q_i) = 1$ if $p_i \neq q_i$ and $0$ otherwise:
$$
d(p, q) = \sum_{i=1}^{n} \delta(p_i, q_i)
$$
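The four distance formulas translate almost directly into code. The sketch below is one possible implementation; the function names are chosen for this example, and the Euclidean and Manhattan distances are written as special cases of the Minkowski distance.

```python
def minkowski(p, q, m):
    """Minkowski distance of order m between two equal-length vectors."""
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1 / m)

def euclidean(p, q):
    return minkowski(p, q, 2)

def manhattan(p, q):
    return minkowski(p, q, 1)

def hamming(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

p, q = (0, 0), (3, 4)
print(euclidean(p, q))                # 5.0
print(manhattan(p, q))                # 7.0
print(hamming("karolin", "kathrin"))  # 3
```

Hamming distance is the natural choice for categorical or binary features, where differences between values have no magnitude.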




# Evaluation Metrics for Classification

After training a classification model, it is essential to evaluate its performance to understand how well it generalizes to unseen data. Several metrics are commonly used to assess the effectiveness of a classifier:


1. **Accuracy**: The proportion of correctly classified instances out of the total instances. It is calculated as:
   $$
   \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
   $$
   where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.

2. **Precision**: The proportion of true positive predictions out of all positive predictions made by the classifier. It is calculated as:
   $$
   \text{Precision} = \frac{TP}{TP + FP}
   $$

3. **Recall (Sensitivity)**: The proportion of true positive predictions out of all actual positive instances. It is calculated as:
   $$
   \text{Recall} = \frac{TP}{TP + FN}
   $$

4. **F1 Score**: The harmonic mean of precision and recall, providing a single metric that balances both. It is calculated as:
   $$
   \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
   $$

5. **Confusion Matrix**: A table that summarizes the performance of a classifier by displaying the counts of true positive, true negative, false positive, and false negative predictions.

   |                     | Predicted Positive | Predicted Negative |
   |---------------------|--------------------|--------------------|
   | **Actual Positive** | TP                 | FN                 |
   | **Actual Negative** | FP                 | TN                 |

These metrics provide insights into different aspects of a classifier's performance, allowing for a comprehensive evaluation and comparison of different models.
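As a quick illustration, the metrics above can be computed from a pair of label lists. The labels here are made up, and returning `0.0` when a denominator is zero is a convention assumed for this sketch rather than part of the definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)   # (3, 5, 1, 1)
print(metrics)
```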

## Holdout Method

One of the most common techniques for evaluating a model is the holdout method. In this approach, the dataset is split into two separate subsets: a training set and a test set, typically in an 80%-20% ratio.

- The training set (80%) is used to train the model.

- The test set (20%) contains unseen data and is used to assess the model’s predictive performance.

This method provides a straightforward way to estimate how well a model generalizes to new data, though its results can vary depending on how the split is done.
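A holdout split can be sketched with the standard library alone. The function name, the fixed seed, and the toy data are assumptions for illustration; shuffling before splitting is what makes the result depend on how the split is done.

```python
import random

def holdout_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into training and test subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical dataset of 10 one-feature instances.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = holdout_split(rows, labels, test_fraction=0.2)
print(len(x_train), len(x_test))  # 8 2
```

Changing `seed` produces a different split, and on a small dataset the measured performance can change noticeably with it.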

## Cross-validation

Cross-validation is a robust technique for evaluating the performance of a machine learning model. It involves partitioning the dataset into 'k' subsets or folds. The model is trained on 'k-1' folds and tested on the remaining fold. This process is repeated 'k' times, with each fold serving as the test set once. The final performance metric is obtained by averaging the results from all 'k' iterations.
This scheme is known as **k-fold cross-validation**; the folds are of (approximately) equal size.

![Cross Validation](images/Cross-validation.png)

Overfitting is a common issue in machine learning, where a model performs very well on training data but poorly on unseen data. k-fold cross-validation helps detect overfitting, because every instance is used for testing exactly once and never by a model that saw it during training. It does not prevent overfitting by itself, but it gives a sound basis for choosing between models.

In this method, the dataset is randomly divided into k mutually exclusive subsets (folds) of approximately equal size. The procedure is as follows:

1. One fold is kept aside as the test set, and the remaining k-1 folds are used to train the model.
2. The model is evaluated on the test fold.
3. This process is repeated k times, each time with a different fold used as the test set.

The final performance metric is obtained by averaging the results over all k iterations, providing a more reliable estimate of the model’s generalization ability.
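The procedure can be sketched as follows. The helper names, the `fit` and `score` callbacks, and the trivial majority-class "model" are hypothetical and stand in for whatever classifier is being evaluated.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint, near-equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(rows, labels, k, fit, score):
    """Average the score obtained on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical stand-in model: always predict the most frequent training label.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(model, test_rows, test_labels):
    return sum(1 for label in test_labels if label == model) / len(test_labels)

rows = [[i] for i in range(10)]
labels = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b"]

mean_accuracy = cross_validate(rows, labels, k=5, fit=fit_majority, score=accuracy)
print(mean_accuracy)
```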

## ROC Curve (Receiver Operating Characteristic)

The ROC curve is a graphical representation used to evaluate the performance of a binary classification model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The TPR, also known as sensitivity or recall, measures the proportion of actual positives correctly identified by the model, while the FPR measures the proportion of actual negatives incorrectly classified as positives.

![ROC Curve](images/Roc.webp) 

Both rates are computed from the entries of the confusion matrix, with TPR = TP / (TP + FN) and FPR = FP / (FP + TN):


|                   | Predicted Positive | Predicted Negative |
|-------------------|--------------------|--------------------|
| **Actual Positive** | TP                 | FN                 |
| **Actual Negative** | FP                 | TN                 |
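To show how the curve is traced, the sketch below sweeps the decision threshold over a set of predicted scores and recomputes the confusion-matrix counts at each step. The scores and labels are made up, and the function assumes that both classes are present (otherwise a denominator would be zero).

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Instances scoring at or above the threshold are predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores (higher means "more likely positive").
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed pair is one point on the ROC curve; a curve that rises quickly toward the top-left corner indicates that positives are ranked above negatives.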
