### What is Classification in Machine Learning?
Classification is a supervised machine learning method in which a model learns to predict the correct label for a given input. The model is trained on the training data and evaluated on test data before it is used to make predictions on new, unseen data.
<br>
For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam), as illustrated below.
<br>

![image.png](attachment:image.png)

Before diving into the classification concept, we will first look at the two types of learners in classification: lazy and eager learners. Then we will clarify the difference between classification and regression.

<br>
    
**Lazy Learners Vs. Eager Learners**
    
There are two types of learners in machine learning classification: lazy and eager learners. 

**Eager learners** are machine learning algorithms that first build a model from the training dataset before making any prediction on future datasets. They spend more time in training because they generalize from the data up front, for example by learning model weights, but they need less time to make predictions.

Most machine learning algorithms are eager learners, and below are some examples: 

* Logistic Regression. 
* Support Vector Machine. 
* Decision Trees. 
* Artificial Neural Networks. 
    
<br>

**Lazy learners or instance-based learners**, on the other hand, do not build a model from the training data ahead of time, and this is where the lazy aspect comes from. They simply memorize the training data, and each time a prediction is needed, they search the whole training set for the nearest neighbors, which makes them slow at prediction time. Some examples of this kind are:

* K-Nearest Neighbor. 
* Case-based reasoning. 
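
To make the contrast concrete, here is a minimal sketch that uses only the Python standard library. The class names are hypothetical, and the nearest-centroid model is just a simple stand-in for an eager learner (it is not one of the algorithms listed above). The point is what each learner keeps after `fit`: the eager one keeps a small summary, the lazy one keeps every training example.

```python
from collections import defaultdict
from math import dist


class CentroidClassifier:
    """Eager stand-in: fit() condenses the training data into one centroid per class."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for point, label in zip(X, y):
            groups[label].append(point)
        # Only the per-class mean is kept; the raw examples are not stored.
        self.centroids = {
            label: tuple(sum(col) / len(points) for col in zip(*points))
            for label, points in groups.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda label: dist(x, self.centroids[label]))


class NearestNeighborClassifier:
    """Lazy: fit() only stores the data; all the work happens in predict()."""

    def fit(self, X, y):
        self.examples = list(zip(X, y))
        return self

    def predict(self, x):
        # Every stored example is scanned on each call.
        _, label = min(self.examples, key=lambda example: dist(x, example[0]))
        return label


# Toy two-feature data, invented for illustration.
X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
y = ["ham", "ham", "spam", "spam"]

eager = CentroidClassifier().fit(X, y)
lazy = NearestNeighborClassifier().fit(X, y)

print(len(eager.centroids), "centroids vs", len(lazy.examples), "stored examples")
print(eager.predict((2.0, 1.0)), lazy.predict((2.0, 1.0)))
```

With larger training sets, the lazy learner's prediction cost grows with the number of stored examples, while the eager learner's stays tied to the size of its summary.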

### Machine Learning Classification Vs. Regression
There are four main categories of Machine Learning algorithms: supervised, unsupervised, semi-supervised, and reinforcement learning. 

Even though classification and regression are both from the category of supervised learning, they are not the same. 

* The prediction task is a *classification* when the target variable is discrete. An application is the identification of the underlying sentiment of a piece of text. 
* The prediction task is a *regression* when the target variable is continuous. An example can be the prediction of the salary of a person given their education degree, previous work experience, geographical location, and level of seniority.

![image.png](attachment:image.png)

### Different Types of Classification Tasks in Machine Learning

There are four main classification tasks in Machine learning: binary, multi-class, multi-label, and imbalanced classifications. 

#### Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually exclusive categories. The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam, etc.

![image.png](attachment:image.png)


Logistic Regression and Support Vector Machines are natively designed for binary classification. Other algorithms, such as K-Nearest Neighbors and Decision Trees, can also be used for binary classification.


#### Multi-Class Classification
Multi-class classification, on the other hand, has more than two mutually exclusive class labels, and the goal is to predict which class a given input example belongs to. In the following case, the model correctly classified the image as a plane.

![image-2.png](attachment:image-2.png)

Most binary classification algorithms can also be used for multi-class classification. These algorithms include but are not limited to:

* Random Forest
* Naive Bayes 
* K-Nearest Neighbors 
* Gradient Boosting 
* SVM
* Logistic Regression

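Algorithms that are natively binary can be applied to a multi-class problem by splitting it into several binary problems, using one of two strategies: one-versus-one or one-versus-rest.

**One-versus-one**: this strategy trains as many classifiers as there are pairs of labels. If we have a 3-class classification, we will have three pairs of labels, thus three classifiers, as shown below.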

![image-3.png](attachment:image-3.png)

In general, for N labels, we will have N(N-1)/2 classifiers. Each classifier is trained on the binary dataset made of the examples from its two labels, and the final class is predicted by a majority vote among all the classifiers. The one-vs-one approach is commonly used with SVM and other kernel-based algorithms.
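
The pair count and the voting step can be sketched in a few lines of plain Python. The function names and the example votes are hypothetical; no classifier is actually trained here.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(labels):
    """Return every unordered pair of labels; one binary classifier is trained per pair."""
    return list(combinations(labels, 2))


def majority_vote(votes):
    """Pick the label that wins the most pairwise contests."""
    return Counter(votes).most_common(1)[0][0]


labels = ["plane", "boat", "truck"]
pairs = one_vs_one_pairs(labels)
n = len(labels)

print(pairs)
print(len(pairs) == n * (n - 1) // 2)  # the N(N-1)/2 count

# Hypothetical outputs of the three pairwise classifiers for a single image.
votes = ["plane", "plane", "truck"]
print(majority_vote(votes))
```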

**One-versus-rest**: in this strategy, we treat each label in turn as the positive class and combine all the remaining labels into a single negative class. With 3 classes, we will have three classifiers.

In general, for N labels, we will have N binary classifiers.

![image-4.png](attachment:image-4.png)
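
As a rough sketch of one-versus-rest, the relabeling step produces one binary target vector per label. The helper name and the scores below are made up for illustration; choosing the label whose classifier is most confident is a common way to combine the N outputs, stated here as an assumption.

```python
def one_vs_rest_targets(y, labels):
    """Build one binary target vector per label: 1 for that label, 0 for the rest."""
    return {label: [1 if value == label else 0 for value in y] for label in labels}


y = ["plane", "boat", "truck", "boat"]
targets = one_vs_rest_targets(y, ["plane", "boat", "truck"])
for label, binary_target in targets.items():
    print(label, binary_target)

# Hypothetical confidence scores from the three binary classifiers for one example.
scores = {"plane": 0.20, "boat": 0.70, "truck": 0.35}
predicted = max(scores, key=scores.get)
print(predicted)
```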


#### Multi-Label Classification
In multi-label classification tasks, we try to predict 0 or more classes for each input example. In this case, there is no mutual exclusion because the input example can have more than one label. 

Such a scenario can be observed in different domains, such as auto-tagging in Natural Language Processing, where a given text can contain multiple topics. Similarly, in computer vision, an image can contain multiple objects, as illustrated below: the model predicted that the image contains a plane, a boat, a truck, and a dog.

![image-5.png](attachment:image-5.png)
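
Multi-label targets are often encoded as a binary indicator matrix with one column per class. The sketch below shows that encoding in plain Python; the helper name and the example label sets are assumptions made for illustration.

```python
def to_indicator_matrix(label_sets, classes):
    """Encode each example's set of labels as a 0/1 row with one column per class."""
    return [[1 if cls in labels else 0 for cls in classes] for labels in label_sets]


classes = ["plane", "boat", "truck", "dog"]
label_sets = [
    {"plane", "boat", "truck", "dog"},  # an image containing all four objects
    {"dog"},                            # an image with a single label
    set(),                              # an image with no label at all
]

matrix = to_indicator_matrix(label_sets, classes)
for row in matrix:
    print(row)
```

Each row can contain any number of ones, including none, which is what distinguishes this setting from multi-class classification, where exactly one class is active per example.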

#### Imbalanced Classification
In imbalanced classification, the examples are unevenly distributed across the classes, meaning that the training data contains many more examples of one class than of the others. Let’s consider the following 3-class classification scenario where the training data contains 60% trucks, 25% planes, and 15% boats.

![image-6.png](attachment:image-6.png)

The imbalanced classification problem can occur in scenarios such as:

* Fraudulent transaction detections in financial industries
* Rare disease diagnosis 
* Customer churn analysis

Conventional predictive models such as Decision Trees, Logistic Regression, etc. may not be effective when dealing with an imbalanced dataset, because they can be biased toward predicting the class with the highest number of observations and may treat the classes with fewer observations as noise.

So, does that mean that such problems are left behind?

Of course not! We can use multiple approaches to tackle the imbalance problem in a dataset. The most commonly used approaches include sampling techniques or harnessing the power of cost-sensitive algorithms. 


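**Sampling Techniques**

These techniques aim to balance the class distribution of the original dataset by:

* Cluster-based oversampling: clustering each class (for example with K-means) and oversampling the clusters so that they end up with comparable sizes.
* Random undersampling: random elimination of examples from the majority class.
* SMOTE oversampling: creation of synthetic minority-class examples by interpolating between an existing minority example and one of its nearest minority-class neighbors. (Plain random replication of minority examples is known as random oversampling.)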
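
Here is a simplified sketch of two of these ideas using only the standard library. `random_undersample` trims every class down to the size of the smallest one, and `smote_like_point` shows only the interpolation step behind SMOTE: it assumes the two minority examples are already known to be neighbors and skips the nearest-neighbor search a full implementation would perform. Both function names are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop examples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    kept = []
    for label in counts:
        indices = [i for i, value in enumerate(y) if value == label]
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original ordering of the surviving examples
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_like_point(a, b, rng):
    """Create a synthetic point at a random position on the segment from a to b."""
    lam = rng.random()
    return tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b))


# The 60% / 25% / 15% split from the example above, with placeholder features.
X = [(float(i),) for i in range(100)]
y = ["truck"] * 60 + ["plane"] * 25 + ["boat"] * 15

X_balanced, y_balanced = random_undersample(X, y)
print(Counter(y_balanced))

synthetic = smote_like_point((0.0, 0.0), (2.0, 4.0), random.Random(0))
print(synthetic)
```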

**Cost-Sensitive Algorithms**

These algorithms take into consideration the cost of misclassification. They aim to minimize the total cost generated by the models.

* Cost-sensitive Decision Trees.
* Cost-sensitive Logistic Regression. 
* Cost-sensitive Support Vector Machines.
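
One simple way to assign misclassification costs is to weight each class inversely to its frequency, so that errors on rare classes cost more. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`; this particular formula is an assumption chosen for illustration, not something prescribed above, and the function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def total_cost(y_true, y_pred, weights):
    """Sum the weight of the true class over every misclassified example."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


y = ["truck"] * 60 + ["plane"] * 25 + ["boat"] * 15
weights = balanced_class_weights(y)
for label, weight in weights.items():
    print(label, round(weight, 3))

# Missing one rare "boat" costs more than missing one common "truck".
print(total_cost(["boat"], ["truck"], weights) > total_cost(["truck"], ["boat"], weights))
```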

![image.png](attachment:image.png)

### 1. Logistic Regression

![image-2.png](attachment:image-2.png)

Logistic regression uses the sigmoid function shown above to return the probability of a label. It is widely used when the classification problem is binary: true or false, win or lose, positive or negative, and so on.

The sigmoid function generates a probability output. By comparing the probability with a pre-defined threshold, the object is assigned to a label accordingly. 

Logistic regression common hyperparameters: penalty, max_iter, C, solver
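
The probability-then-threshold step can be sketched as follows. The weights, bias, and feature values are invented for illustration; in practice they would be learned during training.

```python
import math


def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a linear score passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical learned parameters and one input example.
weights, bias = [1.0, -1.0], 0.0
probability, label = predict_label([2.0, 1.0], weights, bias)
print(round(probability, 3), label)

# Raising the threshold makes the positive label harder to assign.
print(predict_label([2.0, 1.0], weights, bias, threshold=0.8)[1])
```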

### 2. Decision Tree

![image-3.png](attachment:image-3.png)

A decision tree builds its branches hierarchically, and each branch can be thought of as an if-else statement. The branches grow by partitioning the dataset into subsets based on the most informative features. The final classification happens at the leaves of the tree.

Decision tree common hyperparameters: criterion, max_depth, min_samples_split, min_samples_leaf, max_features
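
To illustrate how a single branch might be chosen, the sketch below scores candidate thresholds on one feature with Gini impurity. Gini is only one possible splitting criterion and is an assumption here; the function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try 'value <= t' for each distinct value and keep the purest split."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        # Weighted average impurity of the two child nodes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)

# The resulting branch reads like an if-else statement.
predict = lambda v: "no" if v <= threshold else "yes"
print(predict(2), predict(11))
```

A full tree repeats this search recursively on each resulting subset and across all features, until a stopping rule such as max_depth or min_samples_leaf is reached.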

### 3. Random Forest

![image-4.png](attachment:image-4.png)

As the name suggests, a random forest is a collection of decision trees. It is a common type of ensemble method, which aggregates the results of multiple predictors. A random forest additionally uses the bagging technique: each tree is trained on a random sample of the original dataset, and the forest takes the majority vote of the trees. Compared to a single decision tree, it generalizes better but is less interpretable, because each prediction combines the output of many trees.

Random forest common hyperparameters: n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap
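
Bagging and majority voting can be sketched without building real trees. In the code below, each "tree" is replaced by a one-feature stump whose threshold sits halfway between the two class means; that stand-in, the function names, and the toy data are all assumptions made to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) examples with replacement (the bagging step)."""
    return rng.choices(rows, k=len(rows))


def fit_stump(sample):
    """Stand-in for a tree: threshold halfway between the two class means."""
    positives = [x for x, label in sample if label == 1]
    negatives = [x for x, label in sample if label == 0]
    if not positives or not negatives:
        # The bootstrap sample happened to contain a single class.
        only = 1 if positives else 0
        return lambda x: only
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: int(x > threshold)


def fit_forest(rows, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def forest_predict(forest, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data: (feature value, label), with larger values belonging to class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
forest = fit_forest(data)
print(forest_predict(forest, 8.5), forest_predict(forest, 1.5))
```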

### 4. Support Vector Machine (SVM)


![image-5.png](attachment:image-5.png)

A support vector machine classifies data according to its position relative to a border between the positive class and the negative class. This border is known as the hyperplane, and it is chosen to maximize the margin, that is, the distance to the closest data points of each class. Like decision trees and random forests, support vector machines can be used for both classification and regression; SVC (support vector classifier) is the variant for classification problems.

Support vector machine common hyperparameters: C, kernel, gamma
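
For a linear kernel, the hyperplane can be written as w·x + b = 0, and a point is classified by the side it falls on. The sketch below assumes a hyperplane that has already been found (the values of `w` and `b` are made up) and uses the standard result that, when the closest points satisfy |w·x + b| = 1, the margin width is 2/||w||.

```python
import math


def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Return +1 for the positive side of the hyperplane, -1 for the negative side."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane parameters; training would normally produce these.
w, b = (1.0, 1.0), -5.0

print(classify((4.0, 4.0), w, b), classify((1.0, 1.0), w, b))
print(round(margin_width(w), 3))
```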

### 5. K-Nearest Neighbour (KNN)

![image-6.png](attachment:image-6.png)

You can think of the k-nearest neighbour algorithm as representing each data point in an n-dimensional space defined by n features. It calculates the distance from one point to another, then assigns a label to an unobserved data point based on the labels of the nearest observed data points. KNN can also be used for building recommendation systems.

KNN common hyperparameters: n_neighbors, weights, leaf_size, p
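
A minimal from-scratch sketch of this procedure follows. The parameter names `n_neighbors` and `p` mirror the hyperparameters listed above (with `p` as the order of the Minkowski distance, so `p=2` is Euclidean and `p=1` is Manhattan); the function names and data are hypothetical, and all neighbors get an equal vote.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, x, n_neighbors=3, p=2):
    """Label x by majority vote among its n_neighbors closest training points."""
    ranked = sorted(zip(X_train, y_train), key=lambda example: minkowski(example[0], x, p))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


X_train = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
y_train = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_train, y_train, (2, 2)))
print(knn_predict(X_train, y_train, (7, 8), n_neighbors=1, p=1))
```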

### 6. Naive Bayes

![image-7.png](attachment:image-7.png)

Naive Bayes is based on Bayes’ Theorem, an approach to calculating conditional probability from prior knowledge, together with the naive assumption that the features are independent of each other given the class. The biggest advantage of Naive Bayes is that, while most machine learning algorithms rely on a large amount of training data, it performs relatively well even when the training set is small. Gaussian Naive Bayes is a type of Naive Bayes classifier that assumes each feature follows a normal distribution within each class.

Gaussian Naive Bayes common hyperparameters: priors, var_smoothing
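
The following is a compact sketch of Gaussian Naive Bayes written from scratch. It estimates a prior, plus a mean and variance per feature, for each class, and then picks the class with the highest log-posterior. The function names are hypothetical, and `var_smoothing` is simplified here to a small constant added directly to each variance to avoid division by zero; library implementations may define it differently.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the class prior and per-feature (mean, variance) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            variance = sum((v - mean) ** 2 for v in column) / len(column) + var_smoothing
            stats.append((mean, variance))
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximizing log prior + sum of per-feature Gaussian log-likelihoods."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Independence assumption: feature log-likelihoods simply add up.
        for value, (mean, variance) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * variance)
            total += -((value - mean) ** 2) / (2 * variance)
        return total
    return max(model, key=log_posterior)


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.0)), predict_gaussian_nb(model, (5.1, 6.1)))
```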

### Metrics to Evaluate Machine Learning Classification Algorithms

Now that we have an idea of the different types of classification models, it is crucial to choose the right evaluation metrics for those models. In this section, we will cover the most commonly used metrics: accuracy, precision, recall, F1 score, and the area under the ROC (Receiver Operating Characteristic) curve, known as AUC (Area Under the Curve).

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)
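
The metrics covered in this section can be computed directly from true labels, predicted labels, and predicted scores. The sketch below uses only the standard library; the function names and the toy data are assumptions. The AUC is computed with its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1, with 0.0 used when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores, positive=1):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05, 0.15]

print(classification_metrics(y_true, y_pred))
print(round(roc_auc(y_true, scores), 3))
```

On imbalanced data, accuracy alone can look good while the minority class is mostly missed, which is why precision, recall, and F1 are usually reported alongside it.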