# CS229, Fall 2017
# Problem Set #1 (Supervised Learning)

## 1. Logistic Regression [25 points]

(a) [10 points] Consider the average empirical loss (the risk) for logistic regression:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log(1+e^{-y^{(i)} \theta^{T} x^{(i)} }) = - \frac{1}{m} \sum_{i=1}^{m} \log(h_{\theta}(y^{(i)}x^{(i)}))$$

where $y^{(i)} \in \{ -1, 1\}$, $h_{\theta}(x) = g(\theta^{T}x)$ and $g(z) = 1 /(1+e^{-z})$. Find
the Hessian $H$ of this function and show that for any vector $z$, it holds that

$$ z^{T} H z \geq 0$$
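
As a concrete reference point, the loss can be evaluated numerically. The following is a minimal sketch, assuming the training inputs are stacked in a matrix `X` of shape `(m, n)` and the labels `y` take values in `{-1, 1}`; the function name `logistic_loss` and the toy data are illustrative and not part of the problem set.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """J(theta) = -mean(log g(y * X @ theta)) = mean(log(1 + exp(-y * X @ theta)))."""
    margins = y * (X @ theta)
    # np.logaddexp(0, -t) evaluates log(1 + exp(-t)) without overflow.
    return np.mean(np.logaddexp(0.0, -margins))

# Toy data (hypothetical): three examples, two features.
X_demo = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y_demo = np.array([1.0, -1.0, 1.0])

# At theta = 0 every margin is 0, so each term is log(2).
loss_at_zero = logistic_loss(np.zeros(2), X_demo, y_demo)
```

Using `np.logaddexp` rather than `np.log(1 + np.exp(...))` is a design choice for numerical stability when the margins are large in magnitude.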

(a) [Solution]

Before we proceed, we define:
* $m$ - number of training examples
* $n$ - number of features
* $x^{(i)}$ - the $i$-th training example
* $x^{(i)}_{j}$ - the $j$-th feature of the $i$-th training example

Given the definition of a Hessian Matrix, we can write the Hessian Matrix of the average empirical loss $J(\theta)$ as:

$$(H_{J})_{i,j} = \frac{\partial^2 J}{\partial \theta_{i} \partial \theta_{j}}\tag{1}$$

We will start by computing the first partial derivative of $J(\theta)$ with respect to $\theta_{j}$


$$\frac{\partial J}{\partial \theta_{j}} =-\frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_{j}}[\log(h_{\theta}(y^{(i)}x^{(i)}))]\tag{2}$$

Note that the partial derivative is a linear operator, which is why the partial derivative with respect to $\theta_{j}$ can be moved inside
the sum in equation (2). It is also important to note that $x^{(i)} \in \mathbb{R}^{n}$ and $y^{(i)} \in \{-1,1\}$ is a scalar, so their
product is also an $n$-dimensional vector.

Now we will substitute $h_{\theta}(x^{(i)}y^{(i)})$ with $g(\theta^{T}x^{(i)}y^{(i)})$ and thus equation (2) can be re-written
as:

$$\frac{\partial J}{\partial \theta_{j}} =-\frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_{j}}[\log(g(\theta^{T}y^{(i)}x^{(i)}))]\tag{3}$$

The partial derivative of $\log(g(\theta^{T}x^{(i)}y^{(i)}))$ with respect to $\theta_{j}$ can be computed by using the chain
rule:

$$ \frac{\partial \log(g)}{\partial \theta_{j}} = \frac{1}{g} \frac{\partial g}{\partial \theta_{j}}\tag{4}$$ 

We will now compute the partial derivative $\partial g / \partial \theta_{j}$ on the RHS of equation (4), using the quotient rule:

$$\frac{\partial g}{\partial \theta_{j}} = \frac{\partial}{\partial \theta_{j}}\left[\frac{1}{1+e^{-\theta^{T} x^{(i)}y^{(i)}}}\right] $$

$$ \frac{\partial g}{\partial \theta_{j}} = \frac{0\cdot(1+e^{-\theta^{T}x^{(i)}y^{(i)}})- \frac{\partial}{\partial \theta_{j}} [1+e^{-\theta^{T}x^{(i)}y^{(i)}}]}{(1+e^{-\theta^{T}x^{(i)}y^{(i)}})^{2}} $$

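$$\frac{\partial g}{\partial \theta_{j}} = y^{(i)} \frac{\partial}{\partial \theta_{j}}[\theta^{T}x^{(i)}] \frac{e^{-\theta^{T}x^{(i)}y^{(i)}}}{(1+e^{-\theta^{T}x^{(i)}y^{(i)}})^{2}} = y^{(i)} \frac{\partial}{\partial \theta_{j}}[\theta^{T}x^{(i)}] \frac{e^{\theta^{T}x^{(i)}y^{(i)}}}{(1+e^{\theta^{T}x^{(i)}y^{(i)}})^{2}} \tag{5}$$

The second equality follows from multiplying the numerator and the denominator by $e^{2\theta^{T}x^{(i)}y^{(i)}}$.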

Equation (5) seems a little complicated. In order to obtain a more helpful representation of equation (5), we will re-write some
of its terms.

$$ \frac{\partial}{\partial \theta_{j}}[\theta^{T} x^{(i)} ] =\frac{\partial}{\partial \theta_{j}}\left[\sum_{l=0}^{n}\theta_{l}x^{(i)}_{l}\right] =\frac{\partial}{\partial \theta_{j}}[\theta_{0}x^{(i)}_{0}+\theta_{1}x^{(i)}_1+\dots+\theta_{n}x^{(i)}_{n}] \tag{6}$$

In equation (6), every term with $l \neq j$ has zero derivative with respect to $\theta_{j}$. Only the term with $l = j$ survives, so the derivative is simply:

$$ \frac{\partial}{\partial \theta_{j}}[\theta^{T}x^{(i)}] = x^{(i)}_{j} \tag{7}$$

Also, the last factor on the RHS of equation (5) can be written as:

$$\frac{e^{\theta^{T}x^{(i)}y^{(i)}}}{(1+e^{\theta^{T}x^{(i)}y^{(i)}})^{2}} = g(\theta^{T}x^{(i)}y^{(i)})(1-g(\theta^{T}x^{(i)}y^{(i)})) \tag{8}$$

By substituting equations (7) and (8) into equation (5), we get

$$ \frac{\partial g}{\partial \theta_{j}} = y^{(i)} x^{(i)}_{j} g(\theta^{T}x^{(i)}y^{(i)})(1-g(\theta^{T}x^{(i)}y^{(i)})) \tag{9}$$
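
The identity behind equation (8), $g'(z) = g(z)(1-g(z))$, is easy to sanity-check numerically. The snippet below is a small sketch that compares it against a central finite difference; the helper name `sigmoid`, the grid of test points and the step size `eps` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)   # a few test points
eps = 1e-6                       # finite-difference step

# Central difference approximation of g'(z).
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
# Closed form from equation (8): g(z) * (1 - g(z)).
analytic = sigmoid(z) * (1.0 - sigmoid(z))

max_abs_diff = np.max(np.abs(numeric - analytic))
```

If the identity holds, `max_abs_diff` should be at the level of finite-difference and rounding error.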


And by substituting equation (9) into equation (4) we simply get:
$$ \frac{\partial \log(g)}{\partial \theta_{j}} = y^{(i)} x^{(i)}_{j} (1-g(\theta^{T}x^{(i)}y^{(i)})) \tag{10} $$


Finally, by substituting equation (10) into equation (3), we find that the partial derivative of $J$ with respect to $\theta_{j}$ is
equal to:

$$ \frac{\partial J}{\partial \theta_{j}} = - \frac{1}{m} \sum_{i=1}^{m} y^{(i)} x^{(i)}_{j} (1-g(\theta^{T}x^{(i)}y^{(i)})) $$
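
A finite-difference comparison is a quick way to check this gradient. The following is a minimal sketch on randomly generated data (not the data files from the problem set); the names `loss` and `gradient`, the random seed and the array shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # J(theta) = (1/m) * sum_i log(1 + exp(-y_i * theta^T x_i))
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def gradient(theta, X, y):
    # dJ/dtheta_j = -(1/m) * sum_i y_i * x_ij * (1 - g(y_i * theta^T x_i))
    margins = y * (X @ theta)
    return -(X.T @ (y * (1.0 - sigmoid(margins)))) / len(y)

# Synthetic data: m = 20 examples, n = 3 features, labels in {-1, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1.0, 1.0], size=20)
theta = rng.normal(size=3)

# Central differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2.0 * eps)
    for e in np.eye(3)
])
max_abs_diff = np.max(np.abs(numeric - gradient(theta, X, y)))
```

A small `max_abs_diff` indicates that the analytic expression and the numerical derivative of the loss agree.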

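We will continue by computing the second partial derivative, now with respect to $\theta_{k}$ (we write $k$ instead of $i$ for the first Hessian index of equation (1), since $i$ already denotes the sample index).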
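
One way to complete this step, sketched under the definitions above: write $k$ for the second parameter index and let $g_{i} = g(\theta^{T}x^{(i)}y^{(i)})$. By the same argument that led to equation (9), $\frac{\partial g_{i}}{\partial \theta_{k}} = y^{(i)} x^{(i)}_{k} g_{i}(1-g_{i})$, and since $(y^{(i)})^{2} = 1$:

$$ \frac{\partial^{2} J}{\partial \theta_{k} \partial \theta_{j}} = -\frac{1}{m} \sum_{i=1}^{m} y^{(i)} x^{(i)}_{j} \left(-\frac{\partial g_{i}}{\partial \theta_{k}}\right) = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_{j} x^{(i)}_{k}\, g_{i}(1-g_{i}) $$

Consequently, for any vector $z$:

$$ z^{T} H z = \frac{1}{m} \sum_{i=1}^{m} g_{i}(1-g_{i}) \left(z^{T}x^{(i)}\right)^{2} \geq 0 $$

because $0 < g_{i} < 1$ and the squared term is non-negative. The Python snippet below checks this numerically on synthetic data; the function name `hessian`, the random seed and the shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(theta, X, y):
    # H = (1/m) * sum_i g_i * (1 - g_i) * x_i x_i^T, with g_i = g(y_i * theta^T x_i)
    g = sigmoid(y * (X @ theta))
    weights = g * (1.0 - g)
    # X.T * weights scales the i-th column of X.T (the i-th example) by weights[i].
    return (X.T * weights) @ X / len(y)

# Synthetic data: m = 30 examples, n = 4 features, labels in {-1, 1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.choice([-1.0, 1.0], size=30)
theta = rng.normal(size=4)

H = hessian(theta, X, y)
min_eigenvalue = np.linalg.eigvalsh(H).min()  # eigvalsh assumes a symmetric matrix
z = rng.normal(size=4)
quadratic_form = z @ H @ z
```

A non-negative `min_eigenvalue` (up to rounding) and a non-negative `quadratic_form` are consistent with $H$ being positive semi-definite.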

(b) We have provided two data files:
*  http://cs229.stanford.edu/ps/ps1/logistic_x.txt
*  http://cs229.stanford.edu/ps/ps1/logistic_y.txt

In [None]:
import numpy as np