[Index](https://github.com/basilhan/ml-concepts/blob/master/README.md)

## Hypothesis Function

#### Introduction

A hypothesis function $h_{\boldsymbol{\theta}}(\mathbf{x})$ provides a mathematical description of a machine learning model by mapping the feature vector $\mathbf{x}$ to a real value that will eventually be converted to the final learned target value. The subscript in $h_{\boldsymbol{\theta}}(\mathbf{x})$ indicates that the function is parameterized by the parameter vector $\boldsymbol{\theta}$.  

#### Linear Models

Many types of machine learning models are [linearly](https://en.wikipedia.org/wiki/Linear_function) parameterized. In these models, the parameter vector consists of a scalar $b$ known as the bias, and a weight vector $\mathbf{w}$. Mathematically, the parameter vector can be expressed as :

\begin{equation*}
    \boldsymbol{\theta}
    =
    \begin{bmatrix}
        \theta_0 \\
        \theta_1 \\
        \vdots \\
        \theta_n
    \end{bmatrix}
    =
    \begin{bmatrix}
        b \\
        w_1 \\
        \vdots \\
        w_n
    \end{bmatrix}
\end{equation*}

where $b = \theta_0$ and $w_j = \theta_j$ for $j = 1, 2, \cdots, n$. Therefore, the weight vector, when fully expanded, is :

\begin{equation*}
    \mathbf{w}
    =
    \begin{bmatrix}
        w_1 \\
        w_2 \\
        \vdots \\
        w_n
    \end{bmatrix}
\end{equation*}

Note that $n$ is the number of dimensions in the feature space. There is therefore a one-to-one correspondence between the components of the weight vector and those of the feature vector $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top$. That is, the weight $w_j$ corresponds to the feature $x_j$, where $j$ is an integer from $1$ to $n$. We can now write out the following linear expression :

$$
b + w_1x_1 + w_2x_2 + \cdots + w_nx_n
$$

This can be expressed in the equivalent algebraic form :

$$
b + \sum_{j=1}^{n} w_j x_j
$$

Or in more compact matrix notation :

$$
b + \mathbf{w}^\top\mathbf{x}
$$
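As a quick check that the three forms above describe the same quantity, here is a minimal sketch in plain Python. The values of `b`, `w` and `x` are hypothetical, chosen only for illustration, and `dot` is a helper defined here rather than a library function.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

# Hypothetical example values (n = 3); any real numbers would do.
b = 0.5
w = [2.0, -1.0, 0.25]
x = [1.0, 3.0, 4.0]

# Fully expanded form: b + w_1*x_1 + w_2*x_2 + w_3*x_3
expanded = b + w[0] * x[0] + w[1] * x[1] + w[2] * x[2]

# Summation form: b + sum over j of w_j*x_j
summed = b + sum(w[j] * x[j] for j in range(len(w)))

# Vector form: b + w^T x
vector_form = b + dot(w, x)

print(expanded, summed, vector_form)  # 0.5 0.5 0.5
```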

Note that the second term is the [dot product](https://en.wikipedia.org/wiki/Dot_product) of the weight vector and the feature vector. For two vectors of fixed magnitude, the dot product has some important characteristics :
* It is zero when the two vectors are [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) to each other.
* It is at its maximum when the two vectors point in exactly the same direction.
* It is at its minimum when the two vectors point in completely opposite directions.
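
These properties can be illustrated with a small sketch in plain Python. The vectors below are hypothetical and all have magnitude 5, so only the direction changes between cases. The `dot` helper is defined here for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

u = [3.0, 4.0]            # magnitude 5
same = [3.0, 4.0]         # same direction as u
orthogonal = [-4.0, 3.0]  # u rotated by 90 degrees
opposite = [-3.0, -4.0]   # u rotated by 180 degrees

print(dot(u, same))        # 25.0
print(dot(u, orthogonal))  # 0.0
print(dot(u, opposite))    # -25.0

# Sweep a magnitude-5 vector around the full circle in 1-degree steps.
# Every dot product with u should stay within [-25, 25].
sweep = [
    dot(u, [5.0 * math.cos(math.radians(k)), 5.0 * math.sin(math.radians(k))])
    for k in range(360)
]
print(min(sweep) >= -25.0 - 1e-9 and max(sweep) <= 25.0 + 1e-9)  # True
```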

Finally, as a useful visual, we can also imagine the following data flow representation of the linear expression :

<img src="https://github.com/basilhan/figures/blob/master/LinearEquation.png?raw=true">

Here, we can clearly see that this expression is the sum of weighted values of the features plus a bias, hence the names for the parameters.

#### Linear Regression

In linear regression, we are interested in mapping features to a numeric target. The expression in the previous section can be directly used as the hypothesis function.

$$
h_{\boldsymbol{\theta}}(\mathbf{x}) = b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = \hat{y}
$$

We can acquire a geometrical intuition by considering a single-feature learning task first.

$$
h_{\boldsymbol{\theta}}(\mathbf{x}) = b + wx = \hat{y}
$$

<img src="https://github.com/basilhan/figures/blob/master/SimpleLinearRegression.png?raw=true">
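
The hypothesis function for linear regression can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `h`, the layout of `theta` as `[b, w_1, ..., w_n]`, and the parameter values are all hypothetical, and no fitting is performed.

```python
def h(theta, x):
    """Linear regression hypothesis.

    theta = [b, w_1, ..., w_n] and x = [x_1, ..., x_n].
    Returns y_hat = b + w_1*x_1 + ... + w_n*x_n.
    """
    b, w = theta[0], theta[1:]
    if len(w) != len(x):
        raise ValueError("expected one weight per feature")
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))

# Single-feature case with hypothetical parameters b = 1 and w = 2.
theta = [1.0, 2.0]
y_hat = [h(theta, [x]) for x in (0.0, 1.0, 2.0)]
print(y_hat)  # [1.0, 3.0, 5.0]
```

In the single-feature case, the predictions fall on a straight line whose intercept is $b$ and whose slope is $w$.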



#### Logistic Regression

In logistic regression, we are interested in mapping features to a categorical target. In this section, we only consider binary classes (e.g. positive or negative, true or false). Multi-class classification is simply an extension of the same basic idea.

While linear regression maps features to a continuum of target values, binary logistic regression expects only two target values. A convenient way to achieve this is to pass the linear expression through the [sigmoid function](https://nbviewer.jupyter.org/github/basilhan/math/blob/master/PythonSigmoid.ipynb), which squashes any real number into the interval between 0 and 1, and to use the result as our hypothesis function.

$$
h_{\boldsymbol{\theta}}(\mathbf{x}) =
\sigma_{\boldsymbol{\theta}}(\mathbf{x}) =
\frac{1}{1 + e^{-(b + \mathbf{w}^\top\mathbf{x})}}
$$

Plotting the function along a single dimension with unit weight and zero bias, we obtain the S-shaped curve below, which transitions from 0 to 1 :

<img src="https://github.com/basilhan/figures/blob/master/Sigmoid.png?raw=true">
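
The same curve can be evaluated numerically. The sketch below is a minimal plain-Python illustration. The function names are hypothetical, and the two-branch form of `sigmoid` is a design choice made here to avoid overflow in `math.exp` for large negative inputs, not something prescribed by the text.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^(-z)), written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)

def h(b, w, x):
    """Logistic regression hypothesis: sigmoid of b + w^T x."""
    z = b + sum(w_j * x_j for w_j, x_j in zip(w, x))
    return sigmoid(z)

# Unit weight and zero bias, as in the plot above.
xs = (-6.0, -2.0, 0.0, 2.0, 6.0)
curve = [h(0.0, [1.0], [x]) for x in xs]
for x, p in zip(xs, curve):
    print(f"x = {x:+.1f}  h = {p:.4f}")
```

The output rises monotonically, passes through exactly 0.5 at $x = 0$, and approaches 0 and 1 at the two extremes.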

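With one additional step, we can convert the sigmoid output to one of two values as our learned target. Using a threshold of 0.5, and noting that the sigmoid exceeds 0.5 exactly when its argument $b + \mathbf{w}^\top\mathbf{x}$ is positive, we get :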

\begin{eqnarray}
  \hat{y} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } b + \mathbf{w}^\top\mathbf{x} \leq \mbox{0} \\
      1 & \mbox{if } b + \mathbf{w}^\top\mathbf{x} > \mbox{0}
      \end{array} \right.
\end{eqnarray} 
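
The decision rule above can be sketched as follows. All names and parameter values are hypothetical. The sketch also compares the rule against thresholding the sigmoid output at 0.5, which is an assumption about how the rule relates to the sigmoid, checked here on a few sample points.

```python
import math

def linear(b, w, x):
    """The linear expression b + w^T x."""
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))

def predict(b, w, x):
    """Return 1 if b + w^T x > 0, otherwise 0."""
    return 1 if linear(b, w, x) > 0 else 0

def predict_via_sigmoid(b, w, x):
    """Same decision, expressed as a 0.5 threshold on the sigmoid output."""
    p = 1.0 / (1.0 + math.exp(-linear(b, w, x)))
    return 1 if p > 0.5 else 0

# Hypothetical parameters and sample points (n = 2).
b, w = -1.0, [2.0, -1.0]
points = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.0], [2.0, 5.0]]

labels = [predict(b, w, x) for x in points]
labels_sigmoid = [predict_via_sigmoid(b, w, x) for x in points]
print(labels)          # [1, 0, 0, 0]
print(labels_sigmoid)  # [1, 0, 0, 0]
```

The third point lies exactly on the boundary ($b + \mathbf{w}^\top\mathbf{x} = 0$) and is assigned to class 0, matching the $\leq$ case. Comparing the linear expression with 0 directly is the more robust choice in floating point, since for a very small positive argument the computed sigmoid can round to exactly 0.5.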

Graphically, we can imagine that we are solving the equation :

$$
w_1x_1 + w_2x_2 + \cdots + w_nx_n + b = 0
$$

This equation describes an $(n-1)$-dimensional decision hyperplane within an $n$-dimensional space.

For example, if $n = 2$, the decision hyperplane is a 1-dimensional line in a 2-dimensional space. The plot below shows the line $-0.422 + 2.22x_1 - 3.76x_2 = 0$. This example is taken from a more detailed explanation [here](https://nbviewer.jupyter.org/github/basilhan/ml-in-action/blob/master/PythonBasicBivariateLogisticRegression.ipynb).

<img src="https://github.com/basilhan/figures/blob/master/BinaryClassification.png?raw=true">

If $n = 3$, we have a 2-dimensional plane in 3-dimensional space, and so on. In every case, the hyperplane divides the space into a positive half-space and a negative half-space.

There are some interesting properties associated with the weight vector $\mathbf{w}$ and the bias $b$. The direction of $\mathbf{w}$ is always orthogonal to the hyperplane. This is illustrated by the arrow in the 2-dimensional plot above. In other words, $\mathbf{w}$ determines the orientation of the decision hyperplane. Independently, for a fixed $\mathbf{w}$, the bias $b$ determines the hyperplane's distance from the origin, which is $\lvert b \rvert / \lVert \mathbf{w} \rVert$.
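
Both properties can be checked numerically on the 2-dimensional example line $-0.422 + 2.22x_1 - 3.76x_2 = 0$. The sketch below is a plain-Python illustration. The helper names are hypothetical, and the distance is computed with the standard point-to-hyperplane formula, absolute value of $b$ divided by the length of $\mathbf{w}$, which is brought in here as an assumption rather than derived in the text.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

def distance_from_origin(b, w):
    """Distance from the origin to the hyperplane b + w^T x = 0."""
    return abs(b) / math.sqrt(dot(w, w))

# Coefficients of the example line.
b = -0.422
w = [2.22, -3.76]

def x2_on_line(x1):
    """Solve b + w_1*x_1 + w_2*x_2 = 0 for x_2."""
    return -(b + w[0] * x1) / w[1]

# Two points on the line give a direction vector lying along it.
p = [0.0, x2_on_line(0.0)]
q = [1.0, x2_on_line(1.0)]
direction = [q[0] - p[0], q[1] - p[1]]

# w is orthogonal to the line, so this is zero up to rounding error.
along = dot(w, direction)

# The point on the line closest to the origin lies along w.
norm_sq = dot(w, w)
closest = [-b * w_j / norm_sq for w_j in w]
distance = distance_from_origin(b, w)

print(abs(along) < 1e-9)
print(abs(math.sqrt(dot(closest, closest)) - distance) < 1e-12)
```

Doubling $b$ while keeping $\mathbf{w}$ fixed doubles the distance without changing the orientation, which is one way to see that the two parameters control different aspects of the hyperplane.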

Permalink : https://nbviewer.jupyter.org/github/basilhan/ml-concepts/blob/master/PythonHypothesisFunction.ipynb