[Index](https://github.com/basilhan/ml-concepts/blob/master/README.md)

## Cost Function

#### Introduction

Recall that a machine learning model is described mathematically by its [hypothesis function](https://github.com/basilhan/ml-concepts/blob/master/PythonHypothesisFunction.ipynb) $h_{\mathbf{p}}(\mathbf{x})$. This function is parameterized by a set of parameters $\mathbf{p}$. In other words, a selected model has as many possible versions as there are possible combinations of parameter values. These versions will differ in quality, so we need to define what makes a version <i>good</i>. It is for this purpose that we introduce the cost function.  
<br>
The cost function $J(\mathbf{p})$ is a measure of the errors in the model. Mathematically, given a dataset $\begin{bmatrix} \mathbf{x}^{(1)} \cdots \mathbf{x}^{(m)} \end{bmatrix}$, the function maps each version of a model (as defined by the parameters $\mathbf{p}$) to a non-negative real number. This is an aggregated score of the deviations of the predicted values from the actual values. We are interested in determining the set of parameters $\mathbf{p}_{opt}$ at which $J(\mathbf{p})$ attains its minimal value.  
<br>
But first, we need to define $J(\mathbf{p})$. Its form depends on the specific model selected, which we turn our attention to next.

#### Linear Regression

Linear regression seeks to map features to a numeric target using a hypothesis function of the form :

\begin{equation}
h_{\mathbf{p}}(\mathbf{x}) = b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = \hat{y}
\end{equation}

Since we are dealing with numeric values, it is natural to consider a cost function that grows with the difference between the predicted value $\hat{y}^{(i)}$ and the actual value $y^{(i)}$ for each data instance. We usually define $J(\mathbf{p})$ as below :

\begin{equation}
J(\mathbf{p}) = \frac{1}{2m}\sum^{m}_{i=1}(h_\mathbf{p}(\mathbf{x}^{(i)})-y^{(i)})^2
\end{equation}

Some explanation :
* The square of the difference between $\hat{y}^{(i)}$ and $y^{(i)}$ is used because it removes any distinction as to which of the two is greater. In other words, $\hat{y}^{(i)}$ exceeding $y^{(i)}$ by a certain value is treated as the same amount of deviation as $\hat{y}^{(i)}$ falling short of $y^{(i)}$ by the same value. Squaring also gives much greater weight to larger deviations, since the penalty grows quadratically rather than linearly.
* All the squared differences are summed up across the dataset $\begin{bmatrix} \mathbf{x}^{(1)} \cdots \mathbf{x}^{(m)} \end{bmatrix}$.
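* The $\frac{1}{2m}$ is a constant factor included to make later calculations more elegant: dividing by $m$ averages the sum over the dataset, and the $\frac{1}{2}$ cancels the factor of 2 that appears when the square is differentiated. Scaling by a positive constant does not change which $\mathbf{p}$ minimizes $J(\mathbf{p})$.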
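
The definition above can be sketched in a few lines of plain Python. This is only an illustration, not code from this series: the function names `hypothesis` and `cost`, the split of $\mathbf{p}$ into a weight list `w` and a bias `b`, and the toy dataset are all assumptions made for the example.

```python
def hypothesis(x, w, b):
    """Linear hypothesis: b + w_1*x_1 + ... + w_n*x_n for one instance x."""
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))


def cost(X, y, w, b):
    """J(p) = 1/(2m) * sum over the dataset of (h_p(x_i) - y_i)^2."""
    m = len(X)
    squared_errors = (
        (hypothesis(x_i, w, b) - y_i) ** 2 for x_i, y_i in zip(X, y)
    )
    return sum(squared_errors) / (2 * m)


# Hypothetical toy dataset generated by y = 1 + 2x (one feature, m = 3).
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]

# Parameters that reproduce the data exactly give zero cost.
cost_exact = cost(X, y, w=[2.0], b=1.0)

# A poorer choice of parameters gives a larger cost:
# residuals are -2, -3, -4, so J = (4 + 9 + 16) / 6.
cost_poor = cost(X, y, w=[1.0], b=0.0)

print(cost_exact, cost_poor)
```

Two different parameter choices on the same dataset yield two different costs, which is exactly the sense in which $J(\mathbf{p})$ ranks versions of the model.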

A typical cost function for a 2-dimensional parameter space can be seen below, shown as a smooth surface.

<img src="https://github.com/basilhan/figures/blob/master/LinearRegressionCostFunction.png?raw=true">

#### Logistic Regression


#### Support Vector Machine


#### Locating the Minimum of the Cost Function

As mentioned, once we have defined our $J(\mathbf{p})$, we seek $\mathbf{p}_{opt}$ which returns the minimal value. We could try out different parameters randomly, observe what $J(\mathbf{p})$ returns for each, and choose the parameters which give the smallest value. But considering the size of the parameter space in a typical machine learning task, locating the optimal set by chance is hugely improbable. A more systematic approach is needed. A popular technique is the [gradient descent](https://github.com/basilhan/ml-concepts/blob/master/PythonGradientDescent.ipynb) algorithm.
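
To make the random approach concrete, here is a minimal sketch of it for a one-feature linear regression. The function name `random_search`, the sampling range of $[-10, 10]$ for each parameter, the fixed seed and the toy dataset are assumptions chosen for illustration only.

```python
import random


def cost(w, b, xs, ys):
    """Linear regression cost J for a single feature."""
    m = len(xs)
    return sum((b + w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def random_search(xs, ys, trials, low=-10.0, high=10.0, seed=0):
    """Sample (w, b) uniformly at random and keep the lowest-cost pair."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best_params, best_cost = None, float("inf")
    for _ in range(trials):
        w, b = rng.uniform(low, high), rng.uniform(low, high)
        c = cost(w, b, xs, ys)
        if c < best_cost:
            best_params, best_cost = (w, b), c
    return best_params, best_cost


# Hypothetical toy dataset generated by y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

for trials in (10, 100, 10000):
    (w, b), c = random_search(xs, ys, trials)
    print(f"trials={trials:>5}  w={w:.3f}  b={b:.3f}  J={c:.5f}")
```

Even with only two parameters, many samples are needed to get close to the minimum, and the number of samples required grows rapidly as parameters are added. This is the motivation for a method that uses the shape of $J(\mathbf{p})$ to decide where to look next.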

Permalink : https://nbviewer.jupyter.org/github/basilhan/ml-concepts/blob/master/PythonCostFunction.ipynb