[Index](https://github.com/basilhan/ml-concepts/blob/master/README.md)

## Hypothesis Function

#### Introduction

A hypothesis function $h_{\mathbf{p}}(\mathbf{x})$ provides a mathematical description of a machine learning model by mapping the feature vector $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top$ to a real value that will eventually be converted to the final learned target value. The subscript in $h_{\mathbf{p}}(\mathbf{x})$ indicates that the function is parameterized by the parameter vector $\mathbf{p}$.

<img src="https://github.com/basilhan/figures/blob/master/HypothesisFunction.png?raw=true">

#### Linear Models

Many types of machine learning models are [linearly](https://en.wikipedia.org/wiki/Linear_function) parameterized. In these models, the parameter vector consists of a scalar $b$ known as the bias, and a weight vector $\mathbf{w}$. Mathematically, the parameter vector can be expressed as :

\begin{equation*}
    \mathbf{p}
    =
    \begin{bmatrix}
        p_0 \\
        p_1 \\
        \vdots \\
        p_n
    \end{bmatrix}
    =
    \begin{bmatrix}
        b \\
        w_1 \\
        \vdots \\
        w_n
    \end{bmatrix}
\end{equation*}

where $b = p_0$ and $w_j = p_j$ for $j = 1, 2, \cdots, n$. Therefore, the weight vector, when fully expanded, is :

\begin{equation*}
    \mathbf{w}
    =
    \begin{bmatrix}
        w_1 \\
        w_2 \\
        \vdots \\
        w_n
    \end{bmatrix}
\end{equation*}

Note that the value of $n$ is also the number of dimensions in the feature space. There is therefore a one-to-one correspondence between the components of the weight vector and the components of the feature vector $\begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top$. That is, the weight $w_j$ corresponds to the feature $x_j$, where $j$ is an integer from $1$ to $n$. We can now write out the following linear expression.

\begin{equation}
b + w_1x_1 + w_2x_2 + \cdots + w_nx_n
\end{equation}

This can be expressed in the equivalent algebraic form :

\begin{equation}
b + \sum_{j=1}^{n} w_j x_j
\end{equation}

Or in more compact matrix notation :

\begin{equation}
b + \mathbf{w}^\mathsf{T}\mathbf{x}
\end{equation}

We note that the term $\mathbf{w}^\mathsf{T}\mathbf{x}$ is actually the [dot product](https://en.wikipedia.org/wiki/Dot_product) of the weight vector and the feature vector. For vectors of fixed magnitude, the dot product has some important characteristics :
* It is zero when the two vectors are [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) to each other.
* It is at its maximum when the two vectors point in exactly the same direction.
* It is at its minimum (most negative) when the two vectors point in completely opposite directions.
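
As a quick numerical check of these three characteristics, the sketch below compares the dot product of a weight vector with three feature vectors of the same magnitude. The vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical weight vector (magnitude 5), chosen only for illustration.
w = np.array([3.0, 4.0])

# Three feature vectors with the same magnitude as w but different directions.
x_same = np.array([3.0, 4.0])    # same direction as w
x_orth = np.array([-4.0, 3.0])   # orthogonal to w
x_opp = np.array([-3.0, -4.0])   # opposite direction to w

dot_same = float(np.dot(w, x_same))  # largest value for this magnitude
dot_orth = float(np.dot(w, x_orth))  # zero
dot_opp = float(np.dot(w, x_opp))    # smallest (most negative) value

print(dot_same, dot_orth, dot_opp)   # 25.0 0.0 -25.0
```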

Finally, as a useful visual, we can also imagine the following data flow representation of the linear expression :

<img src="https://github.com/basilhan/figures/blob/master/LinearEquation.png?raw=true">

Here, we can clearly see that this expression is the sum of weighted values of the features plus a bias, hence the names for the parameters.
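
The expanded, summation and matrix forms of the linear expression all compute the same number. A minimal sketch with NumPy, assuming hypothetical values for $b$, $\mathbf{w}$ and $\mathbf{x}$ with $n = 3$ :

```python
import numpy as np

# Hypothetical parameters and features for n = 3 (illustration only).
b = 0.5
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, 4.0])

# Expanded form: b + w_1 x_1 + w_2 x_2 + w_3 x_3
expanded = b + w[0] * x[0] + w[1] * x[1] + w[2] * x[2]

# Summation form: b + sum over j of w_j x_j
summation = b + sum(w_j * x_j for w_j, x_j in zip(w, x))

# Matrix form: b + w^T x
matrix = b + w @ x

print(expanded, summation, matrix)  # all three are equal
```

In practice the matrix form is usually preferred, since it stays the same regardless of the number of features.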

#### Linear Regression

In linear regression, we are interested in mapping features to a numeric target. The linear expression in the previous section is directly used as the hypothesis function.

\begin{equation}
h_{\mathbf{p}}(\mathbf{x}) = b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = \hat{y}
\end{equation}

Let us first consider the single-feature case, with $\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \end{bmatrix}$. If we plot the feature values against the actual target values, we may get something like this :

<img src="https://github.com/basilhan/figures/blob/master/SimpleLinearRegressionDots.png?raw=true">

The points appear to fall onto an imaginary straight line in the 2-dimensional model space. In such a case, a linear model will provide a good generalization of the data. The algebraic expression for such a straight line is :

\begin{equation}
h_{\mathbf{p}}(\mathbf{x}) = b + w_1x_1 = \hat{y}
\end{equation}

Plotting a line which provides a good fit :

<img src="https://github.com/basilhan/figures/blob/master/SimpleLinearRegression.png?raw=true">

This line is fully defined by the two parameters $b$ and $w_1$. Varying them produces different lines with different qualities of fit. The gradient of the line is equal to $w_1$, and the bias $b$ is the value of $\hat{y}$ at which the line crosses the vertical line $x_1 = 0$. Once we have this linear expression, we are able to predict the value of $y$ (i.e. $\hat{y}$) for any value of $x_1$ within the range of the observed data.
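
The roles of the two parameters can be illustrated with a small sketch. The function name `h` and the parameter values below are assumptions made for this example and are not taken from the plot above.

```python
def h(x1, b, w1):
    """Single-feature linear hypothesis: y_hat = b + w1 * x1."""
    return b + w1 * x1

# Hypothetical parameter values (illustration only).
b, w1 = 1.0, 2.0

# The bias is the prediction at x1 = 0.
intercept = h(0.0, b, w1)

# The weight is the change in the prediction per unit increase in x1.
slope = h(1.0, b, w1) - h(0.0, b, w1)

# Prediction for a new feature value.
y_hat = h(3.5, b, w1)

print(intercept, slope, y_hat)  # 1.0 2.0 8.0
```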

In a similar fashion, the idea can be extended to multiple-dimensional feature spaces. For 2-dimensions ($\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} \end{bmatrix}^\top$), the linear expression defines a flat plane tilted in a 3-dimensional model space.

\begin{equation}
h_{\mathbf{p}}(\mathbf{x}) = b + w_1x_1 + w_2x_2 = \hat{y}
\end{equation}

In general, although it is not possible to visualize, if we have $n$ dimensions in our feature space, we seek an $n$-dimensional hyperplane that fits our data plotted in $(n+1)$-dimensional model space.
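
For any number of features, the hypothesis function can be evaluated for several data instances at once by stacking the feature vectors as rows of a matrix. The following is a minimal sketch; the function name `predict`, the matrix `X` and the parameter values are hypothetical.

```python
import numpy as np

def predict(X, b, w):
    """Linear regression hypothesis applied to every row of X."""
    return b + X @ w

# Hypothetical data: m = 3 instances, n = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, -1.0]])
b = 0.5
w = np.array([2.0, -1.0])

y_hat = predict(X, b, w)  # one prediction per instance
print(y_hat)
```

The same function works unchanged for larger $n$, as long as `w` has one entry per column of `X`.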

#### Logistic Regression

In logistic regression, we are interested in mapping features to a categorical target. In this section, we only consider binary classes (e.g. positive or negative, true or false). Multi-class classification is simply an extension of the same basic idea.

While linear regression maps features to a continuum of target values, binary logistic regression is more restrictive. Only 2 target values are expected, which we will define as 0 and 1. A convenient technique employed to achieve this is to make use of the [Sigmoid function](https://nbviewer.jupyter.org/github/basilhan/math/blob/master/PythonSigmoid.ipynb) shown below for our hypothesis function.

\begin{equation}
h_{\mathbf{p}}(\mathbf{x}) =
\sigma_{\mathbf{p}}(\mathbf{x}) =
\frac{1}{1 + e^{-(b + \mathbf{w}^\top\mathbf{x})}}
\end{equation}

Plotting the function along a single dimension with unit weight and zero bias, we observe the below S-shaped curve which transitions from 0 to 1 as $x$ increases :

<img src="https://github.com/basilhan/figures/blob/master/Sigmoid.png?raw=true">

$$
\sigma(x) =
\frac{1}{1 + e^{-(b + wx)}}
$$
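
A minimal sketch of this single-variable function in NumPy follows. The function name `sigmoid` and its default arguments (zero bias, unit weight) are choices made for this example.

```python
import numpy as np

def sigmoid(x, b=0.0, w=1.0):
    """Sigmoid of the linear expression b + w * x."""
    return 1.0 / (1.0 + np.exp(-(b + w * x)))

xs = np.array([-6.0, 0.0, 6.0])
values = sigmoid(xs)

# Close to 0 for large negative x, exactly 0.5 at x = 0,
# and close to 1 for large positive x.
print(values)
```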

With an additional step, we can convert the sigmoid function output to one of two values as our learned target.

\begin{eqnarray}
  \hat{y} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sigma(x) \leq 0.5 \\
      1 & \mbox{if } \sigma(x) \gt 0.5
      \end{array} \right.
\end{eqnarray}

This is equivalent to the expression below, which considers only the linear term $b + wx$ in the exponent :

\begin{eqnarray}
  \hat{y} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } b + wx \leq 0 \\
      1 & \mbox{if } b + wx \gt 0
      \end{array} \right.
\end{eqnarray}
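
As a short derivation of this equivalence, using only the definition of $\sigma(x)$ given above, the condition for predicting class 1 can be rearranged step by step :

\begin{eqnarray*}
  \sigma(x) \gt 0.5 & \iff & \frac{1}{1 + e^{-(b + wx)}} \gt \frac{1}{2} \\
  & \iff & 1 + e^{-(b + wx)} \lt 2 \\
  & \iff & e^{-(b + wx)} \lt 1 \\
  & \iff & -(b + wx) \lt 0 \\
  & \iff & b + wx \gt 0
\end{eqnarray*}

The second step relies on both sides being positive, and the fourth on the exponential function being strictly increasing with $e^0 = 1$. The case $\sigma(x) \leq 0.5$ follows by negating each line.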

We can then easily extend this to multiple dimensions :

\begin{eqnarray}
  \hat{y} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } b + \mathbf{w}^\top\mathbf{x} \leq 0 \\
      1 & \mbox{if } b + \mathbf{w}^\top\mathbf{x} \gt 0
      \end{array} \right.
\end{eqnarray}
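
The two decision rules can be compared directly in code. This is a sketch under assumed values : the function names, the matrix `X` and the parameters are hypothetical.

```python
import numpy as np

def predict_class(X, b, w):
    """Classify using only the sign of the linear expression."""
    z = b + X @ w
    return (z > 0).astype(int)

def predict_class_via_sigmoid(X, b, w):
    """Classify by thresholding the sigmoid output at 0.5."""
    sigma = 1.0 / (1.0 + np.exp(-(b + X @ w)))
    return (sigma > 0.5).astype(int)

# Hypothetical data: 3 instances, 2 features.
X = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [-1.0, 3.0]])
b = -1.0
w = np.array([1.0, 0.5])

by_sign = predict_class(X, b, w)
by_sigmoid = predict_class_via_sigmoid(X, b, w)
print(by_sign, by_sigmoid)  # both rules give the same labels
```

Skipping the sigmoid is cheaper when only the class label is needed, while the sigmoid output remains useful when a value between 0 and 1 is wanted.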

The plot below shows the sigmoid function over a 2-dimensional feature space.

<img src="https://github.com/basilhan/figures/blob/master/Sigmoid2D.png?raw=true">

Although we cannot visualize higher values of $n$, in general the hypersurface defined by the multivariate sigmoid function in $(n+1)$-dimensional model space takes the value 0.5 along a set that projects onto an $(n-1)$-dimensional hyperplane in the feature space. This is in fact our decision hyperplane and is defined by the equation :

\begin{equation}
b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = 0
\end{equation}

There are several properties to note.  

* The hyperplane divides the feature space into two halves.
* Where the expression on the LHS of the equation evaluates to a positive real number, $\hat{y} = 1$. Therefore, the set of features for which this is true belongs to the "1" class.
* Similarly, the set of features for which the LHS expression evaluates to a negative real number belongs to the "0" class ($\hat{y} = 0$).
* The bias $b$ regulates the hyperplane's distance from the origin. Increasing the bias shifts the hyperplane such that more of the feature space falls under the "1" class and vice versa.
* The weight vector $\mathbf{w}$ is always orthogonal to the hyperplane. In other words, $\mathbf{w}$ determines the orientation of the decision hyperplane. This interesting result follows from the dot product property mentioned in the earlier section on linear models : the difference between any two points on the hyperplane has a zero dot product with $\mathbf{w}$.
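
The properties listed above can be checked numerically for a 2-dimensional feature space. The bias, weights and points below are hypothetical values chosen so that the arithmetic is exact.

```python
import numpy as np

# Hypothetical parameters: decision line -2 + 3*x1 + 4*x2 = 0.
b = -2.0
w = np.array([3.0, 4.0])

# Two points that lie on the decision line.
p1 = np.array([2.0, -1.0])
p2 = np.array([-2.0, 2.0])
on_line = (b + w @ p1, b + w @ p2)       # both are 0

# A direction along the line is orthogonal to w.
direction = p2 - p1
orthogonality = float(np.dot(w, direction))   # 0.0

# Distance from the origin to the line is |b| / ||w||.
distance = abs(b) / np.linalg.norm(w)         # 2 / 5

# Increasing the bias moves the origin from class 0 to class 1.
origin = np.zeros(2)
class_before = int(b + w @ origin > 0)          # bias -2
class_after = int((b + 3.0) + w @ origin > 0)   # bias +1

print(orthogonality, distance, class_before, class_after)
```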

There are more details on the effect of $b$ and $\mathbf{w}$ available in the [Sigmoid function](https://nbviewer.jupyter.org/github/basilhan/math/blob/master/PythonSigmoid.ipynb) section.

For illustration, let us consider a 2-dimensional feature space ($\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} \end{bmatrix}^\top$). The decision hyperplane in this case is a straight line (i.e. it has a dimension of 1). Below is an example which classifies green data instances ($\hat{y} = 1$) and red data instances ($\hat{y} = 0$) with the line defined by the equation :

$$
-0.422 + 2.22x_1 -3.76x_2 = 0
$$

<img src="https://github.com/basilhan/figures/blob/master/BinaryClassification.png?raw=true">

This example is taken from a more detailed explanation [here](https://nbviewer.jupyter.org/github/basilhan/ml-in-action/blob/master/PythonBasicBivariateLogisticRegression.ipynb). Note also that the arrow in the plot points in the same direction as the weight vector $\mathbf{w} = \begin{bmatrix} 2.22 & -3.76 \end{bmatrix}^\top$.
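
Using the bias and weights from the line equation above, new points can be classified as in the sketch below. The two query points and the helper `boundary_x2` are hypothetical and are not part of the original dataset or notebook.

```python
import numpy as np

# Parameters taken from the decision line equation above.
b = -0.422
w = np.array([2.22, -3.76])

# Hypothetical query points (not from the dataset).
points = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

z = b + points @ w
y_hat = (z > 0).astype(int)   # 1 = green, 0 = red
print(y_hat)                  # [1 0]

def boundary_x2(x1):
    """x2 coordinate of the decision line for a given x1."""
    return -(b + w[0] * x1) / w[1]

# Any point built this way lies on the line, so its linear expression is ~0.
on_boundary = b + w @ np.array([1.0, boundary_x2(1.0)])
```

The first point lies on the side that the weight vector points towards, so it is assigned to the "1" class.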

Permalink : https://nbviewer.jupyter.org/github/basilhan/ml-concepts/blob/master/PythonHypothesisFunction.ipynb