# Module 2: Supervised Learning

**Module Overview:**

**Dataset:**

## Introduction to Supervised Learning

**Supervised learning** involves machine learning algorithms that build models which learn from data and can then predict future values. It is comparable to building a mathematical equation or formula with many input variables in order to derive the desired output variable.
**The data points that the model learns from already have their corresponding outputs.** This is how the model is able to derive a connection between inputs and outputs.

There are two types of problems in the supervised setting: *regression* and *classification*. In **regression**, the output that the model learns and then tries to predict is a continuous value (e.g., a person's age or height). In **classification**, the output that the model learns and then tries to predict is a categorical value (e.g., a class from a finite number of classes, such as whether a tumor cell is benign or malignant).

[Figure 1](#sup_learn) illustrates the machine learning pipeline in the case of supervised learning. The figure shows three main parts in the pipeline: (1) **Pre-processing of Data**, (2) **Training of the model** and (3) **Evaluation of the model**. During pre-processing, the raw data undergoes transformations so that it is ready to be used in training. We have already described these steps in Module 1. The pre-processed data is split into train and test sets. During training, the model uses the train set to learn. After training is done, the test set is used to evaluate the performance of the model. This quantifies how well the model generalizes to data it has not seen during training. When the model is ready to be used, it receives new, unseen samples and predicts either a continuous value or a label, depending on whether the task at hand is regression or classification.

<center>
    <a id="sup_learn"></a>
    <img src="images/part2_supervised/supervised_learning_pipeline.jpg" alt="ML Supervised Learning" width="90%">
    <center><figcaption><em>Figure 1: Supervised Learning</em></figcaption></center>
</center>
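
The three stages can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in the module: the toy data, the 80/20 split ratio, the seed, and the helper names `split_train_test`, `fit_line`, and `mean_squared_error` are all assumptions made for this example.

```python
import random

def split_train_test(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split it into train and test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(samples):
    """Training: least-squares fit of y = a*x + b on (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
         / sum((x - mean_x) ** 2 for x, _ in samples))
    b = mean_y - a * mean_x
    return a, b

def mean_squared_error(samples, a, b):
    """Evaluation: average squared gap between actual and predicted y."""
    return sum((y - (a * x + b)) ** 2 for x, y in samples) / len(samples)

# (1) Pre-processed toy data: (x, y) pairs that roughly follow y = 2x + 1.
noise = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1]
data = [(float(x), 2.0 * x + 1.0 + e) for x, e in zip(range(10), noise)]

train_set, test_set = split_train_test(data)

# (2) Training uses only the train set.
a, b = fit_line(train_set)

# (3) Evaluation uses only the held-out test set.
test_mse = mean_squared_error(test_set, a, b)

# Inference on a new, unseen sample.
prediction = a * 12.0 + b
print(f"a={a:.3f}, b={b:.3f}, test MSE={test_mse:.4f}, prediction={prediction:.3f}")
```

Keeping the test set out of `fit_line` is the point of the split: the reported error then reflects samples the model never saw while learning.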


###  Regression

In the case of regression, the model learns from the data, with the aim of outputting a continuous value similar to what it saw during the training phase. 
<center>
    <a id="regression"></a>
    <img src="images/part2_supervised/regression_illustration.jpg" alt="Regression" width="90%">
    <center><figcaption><em>Figure 2: Regression</em></figcaption></center>
</center>


[Figure 2](#regression) illustrates regression on a 2D dataset. Here, each sample has two coordinates: x and y. y depends on x, so y = f(x). The aim of regression is to find a function f(x) that *best* approximates the corresponding values of y. *Best* is quantified through the loss metrics that we are going to see in the next section. In our example, this would be <span style="color:red">a line of the form $$y=ax+b$$</span>, <span style="color:red">indicated in red</span>. Once the function has been learned from the data, we can use it to <span style="color:green">predict the value for a new, unseen sample after training (indicated in green)</span>.

Note that real-life datasets usually have more than 2 dimensions. In that case a line is not sufficient, so a plane (3 coordinates) or a hyperplane (more than 3 coordinates) is used as the regressor model to predict values. This is called **multiple linear regression**, in contrast to the example shown in Figure 2, which is **simple linear regression**.
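
To make the hyperplane idea concrete, here is a minimal sketch of how a linear regression model produces a prediction once its coefficients are known. The function name `predict`, the weights, the intercept, and the feature values are made up for illustration; in practice the coefficients would be learned from the training data.

```python
def predict(features, weights, intercept):
    """Evaluate y = w1*x1 + w2*x2 + ... + wk*xk + b for one sample."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + intercept

# Simple linear regression: one feature, so the model is a line y = a*x + b.
line_prediction = predict([3.0], weights=[2.0], intercept=1.0)

# Multiple linear regression: three features, so the model is a hyperplane.
plane_prediction = predict([3.0, 0.5, 10.0], weights=[2.0, -4.0, 0.1], intercept=1.0)

print(line_prediction, plane_prediction)
```

The same function covers both cases: only the number of features, and therefore the number of learned weights, changes.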

Besides linear regression, which assumes a linear relationship between the variables, there are other types of regression, such as Decision Tree regression and Random Forest regression, which we will code below.

### Classification

In classification, the target variable that the model tries to learn from the data and later predict is a class or a category. 

<center>
    <a id="classification"></a>
    <img src="images/part2_supervised/classification_illustration.jpg" alt="Classification" width="90%">
    <center><figcaption><em>Figure 3: Classification</em></figcaption></center>
</center>

[Figure 3](#classification) shows an example of samples in a 2D coordinate system. The samples belong to two classes: <span style="color:red">positive</span> and <span style="color:green">negative</span>. The two classes are linearly separable, so the model can learn a line that perfectly separates them. <span style="color:blue">This line is indicated in blue</span> in our example. A new, unseen sample is classified according to the side of the boundary on which it falls. In our case, the circled sample is below the boundary line, so the model will classify it as <span style="color:red">positive</span>.
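
The "which side of the boundary" rule can be sketched as a sign check on a linear score. The boundary used below (the line $x_2 = x_1$), the function name `classify`, and the sample points are hypothetical and are not taken from Figure 3. They are chosen only so that points below the line come out as positive, matching the description above.

```python
def classify(point, weights, bias):
    """Label a 2D point by the side of the line w1*x1 + w2*x2 + bias = 0 it falls on."""
    score = weights[0] * point[0] + weights[1] * point[1] + bias
    return "positive" if score >= 0 else "negative"

# Hypothetical boundary x2 = x1, written as 1*x1 - 1*x2 + 0 = 0.
# Points below the line (x2 < x1) get a positive score.
weights, bias = (1.0, -1.0), 0.0

below = classify((3.0, 1.0), weights, bias)  # lies below the line
above = classify((1.0, 3.0), weights, bias)  # lies above the line
print(below, above)
```

In a trained model, `weights` and `bias` would be learned from the labeled samples rather than written by hand.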

Note that in real-life scenarios, datasets have many more dimensions and classes. Also, classes may not be linearly separable. In these cases, the models will be hyperplanes or other structures that are able to capture non-linear dependencies. In this tutorial, we will see Logistic Regression, Decision Tree classifiers and Random Forest classifiers in action.

## Training of Supervised Learning Model

### Parameters vs Hyperparameters


In machine learning, parameters and hyperparameters play different roles. The **parameters** are values that **the machine learning model learns from the data**. At the end of the learning process, the data will be described by a mathematical equation. The main goal of the learning process is to find the parameters of this mathematical equation that would best describe the data. For example, suppose that you have some points scattered in a 2D coordinate system, just like in [Figure 2](#regression). Your aim is to find the line with an equation of the form: $$y = ax + b$$. 

In this case, `a` and `b` would be the parameters that the model would learn from the points so that the line would represent them in the best possible way.

On the other hand, **hyperparameters control the learning process itself and how the parameters will be computed**. Hyperparameters are set by the data scientist or analyst; they are not learned by the model. You can think of them as **settings or configurations to tune the learning process**. Usually people use intuition, trial-and-error, and more systematic techniques like cross-validation to pick hyperparameters that make the learning process faster and produce more accurate results. Returning to the line example, the hyperparameters determine **how complex** the equation that describes the points is allowed to be.

All in all, **parameters determine the model output, while hyperparameters determine how the parameters are learned**. You can read more about the distinction between parameters and hyperparameters [in this blog post](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac).
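
The split of responsibilities can be sketched with a deliberately tiny model. The example assumes a one-parameter model $y = ax$ trained with an L2 penalty of strength `lam`, for which the best `a` has a closed form. The penalty, the data, the candidate values, and the helper names `fit_slope` and `mse` are all assumptions made for this illustration. The model computes `a` from the training data, while the analyst tries several values of `lam` and keeps the one that does best on a held-out validation set.

```python
def fit_slope(xs, ys, lam):
    """Learn the parameter a of y = a*x by minimizing sum((y - a*x)^2) + lam * a^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, a):
    """Mean squared error of the predictions a*x against the actual ys."""
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data, split into a training part and a validation part.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

# Hyperparameter: chosen by the analyst through trial and error.
candidates = [0.0, 0.1, 1.0, 10.0]
results = {lam: mse(val_x, val_y, fit_slope(train_x, train_y, lam)) for lam in candidates}
best_lam = min(results, key=results.get)

# Parameter: computed from the training data once the hyperparameter is fixed.
best_a = fit_slope(train_x, train_y, best_lam)
print(f"best lam={best_lam}, learned a={best_a:.4f}")
```

Nothing in `fit_slope` ever changes `lam`: it is an input to the learning procedure, whereas `a` is its output.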


💡 **CHECKPOINT:**
- How do parameters differ from hyperparameters? What role does each of them play in the learning process?
- What is tuned by the data analyst and what is learned by the model?
- Determine which of the following are parameters and which are hyperparameters in the following scenario: *We have a dataset that contains 1000 points, each of them having a single feature. The function that will approximate these points will have the form $y=ax^2+bx+c$. We do not want the model to overfit the data, so we set a regularization parameter λ=0.1, to be used during the training process. Since we do not want to wait long, we also fix the number of steps in which the loss function is computed and the weights are updated, n=100.* Given this scenario, determine whether a, b, c, λ and n are parameters or hyperparameters.

### Cost function

The **cost function**, an essential part of all machine learning algorithms, quantifies how well the model can approximate the data. It is also known as the loss metric, loss function or objective function. It measures how much the predicted values of the model differ from the actual values, whether they are discrete or continuous.

The cost function is a function of the parameters of the model. For example, in regression tasks, a commonly used cost function is the **mean squared error** (MSE): $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for sample $i$. Referring to [Figure 2](#regression), $\hat{y}_i=ax_i + b$ and the cost function can be written as: $$\text{MSE}=L(a,b)=\frac{1}{n} \sum_{i=1}^{n} (y_i - (ax_i + b))^2$$. We need to find the values of $a$ and $b$ for which the minimum $\text{MSE}$ is attained. This can be solved analytically by taking the partial derivatives of $L$ with respect to $a$ and $b$, setting them equal to 0, and solving for $a$ and $b$. However, an analytic solution is not always possible. That is why algorithms like Gradient Descent are used to find the optimal parameters. Gradient Descent approaches a minimum iteratively: at each step it computes the predictions and the loss, then updates the parameters in the direction that reduces the loss. [Figure 4](#gradient-descent) illustrates the process. You can learn more about Gradient Descent [here](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21).
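
For the line example, the analytic route can be worked out as follows. This is the standard least-squares derivation, written with the sample means $\bar{x}$ and $\bar{y}$ (symbols introduced here for convenience) and assuming that the $x_i$ are not all equal:

$$
\begin{aligned}
\frac{\partial L}{\partial a} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \left(y_i - (ax_i + b)\right) = 0 \\
\frac{\partial L}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - (ax_i + b)\right) = 0 \\
b &= \bar{y} - a\bar{x} \\
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
$$

The third line follows from the second equation, and substituting it into the first equation gives the expression for $a$.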

<center>
    <a id="gradient-descent"></a>
    <img src="images/part2_supervised/lin_reg.gif" alt="Gradient Descent" width="90%">
    <center><figcaption><em>Figure 4: Cost function in Linear Regression</em></figcaption></center>
</center>
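
The iterative procedure can be sketched for the line $y = ax + b$ with the MSE as cost function. This is a minimal batch Gradient Descent, not a production implementation: the starting point (0, 0), the learning rate, the number of steps, and the toy data are assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Minimize the MSE of y = a*x + b by repeatedly stepping against the gradient."""
    a, b = 0.0, 0.0
    n = len(xs)
    history = []  # MSE measured before each update
    for _ in range(n_steps):
        residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / n)
        # Partial derivatives of the MSE with respect to a and b.
        grad_a = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = -2.0 / n * sum(residuals)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b, history

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1 without noise

a, b, history = gradient_descent(xs, ys)
print(f"a={a:.4f}, b={b:.4f}, first MSE={history[0]:.2f}, last MSE={history[-1]:.2e}")
```

Here `a` and `b` are the parameters being learned, while `learning_rate` and `n_steps` are hyperparameters: a learning rate that is too large for the data at hand can make the loss grow instead of shrink.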


The concept of a cost function is the same in classification, but different cost functions are applied in these tasks. Two common ones are **binary cross-entropy**, used in binary classification problems, and **categorical cross-entropy**, used in multi-class classification problems. Again, during training, the parameters of the model are adjusted iteratively to minimize the value of the cost function, using algorithms like Gradient Descent.
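
Both cross-entropy losses can be computed directly from predicted probabilities. The sketch below uses the usual textbook definitions with the natural logarithm. The function names, the clipping constant `eps`, and the example probabilities are assumptions made for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with y in {0, 1} and p the predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -log(probability assigned to the true class); y_true holds class indices."""
    total = 0.0
    for label, probs in zip(y_true, p_pred):
        total += -math.log(max(probs[label], eps))
    return total / len(y_true)

# Binary case: same labels, predictions of different confidence.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.6, 0.4])

# Multi-class case: each row holds the predicted probabilities for three classes.
multi = categorical_cross_entropy([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(confident, unsure, multi)
```

Predictions that put more probability on the correct class yield a lower loss, which is why minimizing cross-entropy pushes the model toward confident, correct outputs.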

Cost functions are not only used during training but also during model evaluation, as we will see in the next section.

## Evaluation of a Supervised Learning Model

### Evaluation metrics for regression

### Evaluation metrics for classification

### Overfitting and Underfitting

(Here FACT BOX ABOUT Regularization)

## Classification Models

### Logistic Regression

### Decision Trees

### Random Forest

## Regression Models

### Linear Regression

### Decision Trees

### Random Forest

## Conclusion

**References:**