# 4. Convexity of Generalized Linear Models
In this question we will explore and prove some nice properties of Generalized Linear Models, specifically those related to their use of Exponential Family distributions to model the output.

Most commonly, GLMs are trained by using the negative log-likelihood (NLL) as the loss function. This is mathematically equivalent to Maximum Likelihood Estimation (i.e., maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood). In this problem, our goal is to show that the NLL loss of a GLM is a convex function w.r.t. the model parameters. As a reminder, this is convenient because any local minimum of a convex function is also a global minimum.

To recap, an exponential family distribution is one whose probability density can be represented as
$$p(y,\eta)=b(y)\exp\left(\eta^TT(y) - a(\eta)\right)$$
where $\eta$ is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, $\eta$ is modeled as $\theta^Tx$, where $x\in \mathbb{R}^n$ is the vector of input features of the example, and $\theta\in\mathbb{R}^n$ is the vector of learnable parameters. In order to show that the NLL loss is convex for GLMs, we break down the process into sub-parts, and approach them one at a time. Our approach is to show that the second derivative (i.e., Hessian) of the loss w.r.t. the model parameters is Positive Semi-Definite (PSD) at all values of the model parameters. We will also show some nice properties of Exponential Family distributions as intermediate steps.

For the sake of convenience we restrict ourselves to the case where $\eta$ is a scalar. Assume 
$p(Y|X;\theta)\sim{\rm ExponentialFamily}(\eta)$
where $\eta\in \mathbb{R}$ is a scalar, and $T(y) = y$. This makes the exponential family representation take the form
$$p(y,\eta)=b(y)\exp\left(\eta y - a(\eta)\right).$$
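
As a concrete illustration of this form (the Bernoulli example is our own choice and is not part of the problem statement), a Bernoulli distribution with mean $\phi\in(0,1)$ can be rewritten as

\begin{align*}
p(y;\phi)
& = \phi^y(1-\phi)^{1-y}\\
& = \exp\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right),
\end{align*}

which matches the scalar form above with $b(y)=1$, $\eta=\log\frac{\phi}{1-\phi}$, and $a(\eta)=-\log(1-\phi)=\log(1+e^{\eta})$.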

<b>(a)</b> [5 points] Derive an expression for the mean of the distribution. Show that 
$E[Y;\eta]=\frac{\partial}{\partial \eta}a(\eta)$ 
(note that $E[Y;\eta] = E[Y| X; \theta]$ since $\eta = \theta^TX$). In other words, show that the mean of an exponential family distribution is the first derivative of the log-partition function with
respect to the natural parameter.

<b>Hint:</b> Start with observing that 

\begin{align*}
\frac{\partial}{\partial\eta}\int p(y,\eta)dy =\int \frac{\partial}{\partial\eta} p(y,\eta)dy
\end{align*}
### Answer:

### Caution: In our solution, we work in the more general setting where $\theta$ is an $m\times n$ matrix, $x$ is a vector of length $m$, and $T(y)$ is a vector of length $n$, so that $\eta = \theta^Tx$ is a vector of length $n$.

\begin{align*}
\mathbf{0}
& = \frac{\partial}{\partial\eta}\int p(y,\eta)dy\\ 
& = \int \frac{\partial}{\partial\eta} p(y,\eta)dy\\
& = \int \frac{\partial}{\partial\eta} b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy \\
& = \int\left(T(y)- \frac{\partial}{\partial\eta}a(\eta)\right)b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy\\
& = \int T(y)b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy - \frac{\partial}{\partial\eta}a(\eta)\int b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy\\
& = E(T(y)|\eta)-\frac{\partial}{\partial\eta}a(\eta).
\end{align*}
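Rearranging gives $E[T(Y);\eta]=\frac{\partial}{\partial\eta}a(\eta)$, where the last step uses $\int b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy = 1$. With $T(y)=y$ this is the claimed identity.

#### <b>Remark:</b> The integral of a vector is taken elementwise.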
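
The identity in part (a) can be sanity-checked numerically. The sketch below is not part of the original solution: it assumes a Poisson distribution, for which $b(y)=1/y!$ and $a(\eta)=e^{\eta}$, and compares the mean obtained by direct summation with a central finite-difference estimate of $\frac{\partial}{\partial\eta}a(\eta)$. The helper names, the truncation point `y_max`, and the step size `h` are illustrative choices.

```python
import math

def poisson_pmf(y, eta):
    # Exponential-family form b(y) * exp(eta * y - a(eta)),
    # with b(y) = 1/y! and a(eta) = exp(eta) (assumed Poisson example).
    return math.exp(eta * y - math.exp(eta) - math.lgamma(y + 1))

def log_partition(eta):
    # a(eta) for the Poisson distribution.
    return math.exp(eta)

def mean_by_summation(eta, y_max=200):
    # E[Y; eta] computed directly; the tail beyond y_max is negligible here.
    return sum(y * poisson_pmf(y, eta) for y in range(y_max))

def derivative(f, eta, h=1e-5):
    # Central finite-difference approximation of f'(eta).
    return (f(eta + h) - f(eta - h)) / (2 * h)

eta = 0.7
mean_sum = mean_by_summation(eta)
mean_deriv = derivative(log_partition, eta)
print(mean_sum, mean_deriv)
```

The two printed values should agree up to finite-difference error, which is what part (a) predicts.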

<b>(b)</b> [5 points] Next, derive an expression for the variance of the distribution. In particular,
show that ${\rm Var}(Y;\eta) = \frac{\partial^2}{\partial \eta^2}a(\eta)$ (again, note that 
${\rm Var}(Y; \eta) = {\rm Var}(Y|X; \theta)$). In other words, show that the variance of an exponential family distribution is the second derivative of the log-partition function w.r.t. the natural parameter.

<b>Hint:</b> Building upon the result in the previous sub-problem can simplify the derivation. 
### Answer: 

Assume that $\eta$ is a vector of length $n$.
Differentiating the identity from the previous part once more with respect to $\eta$, we have
\begin{align*}
[\mathbf{0}]_{n\times n}
& = \frac{\partial}{\partial\eta}\int\left(T(y)- 
    \frac{\partial}{\partial\eta}a(\eta)\right)b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy\\    
& = -\frac{\partial^2}{\partial\eta^2}a(\eta)\int b(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy + \int\Big(T(y)- 
    \frac{\partial}{\partial\eta}a(\eta)\Big)\Big(T(y)- 
    \frac{\partial}{\partial\eta}a(\eta)\Big)^Tb(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy\\    
& = -\frac{\partial^2}{\partial\eta^2}a(\eta) + \int\Big(T(y)- 
    E(T(y);\eta)\Big)\Big(T(y)- E(T(y);\eta)\Big)^Tb(y)\exp\left(\eta^TT(y) - a(\eta)\right)dy\\   
& = -\frac{\partial^2}{\partial\eta^2}a(\eta) + {\rm Cov}(T(Y);\eta).
\end{align*}
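Hence ${\rm Cov}(T(Y);\eta) = \frac{\partial^2}{\partial\eta^2}a(\eta)$. For scalar $\eta$ and $T(y)=y$ this reduces to ${\rm Var}(Y;\eta) = \frac{\partial^2}{\partial \eta^2}a(\eta)$, as desired.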
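
The variance identity in part (b) can be checked in the same spirit. This sketch is an illustration only and assumes a Bernoulli distribution, for which $b(y)=1$ and $a(\eta)=\log(1+e^{\eta})$. It compares the variance computed from the two-point distribution with a finite-difference estimate of the second derivative of $a$. The function names and the step size `h` are hypothetical choices.

```python
import math

def log_partition(eta):
    # a(eta) = log(1 + exp(eta)) for the (assumed) Bernoulli example.
    return math.log1p(math.exp(eta))

def bernoulli_pmf(y, eta):
    # Exponential-family form with b(y) = 1 and T(y) = y.
    return math.exp(eta * y - log_partition(eta))

def variance_by_summation(eta):
    # Var(Y; eta) computed directly over the support {0, 1}.
    mean = sum(y * bernoulli_pmf(y, eta) for y in (0, 1))
    return sum((y - mean) ** 2 * bernoulli_pmf(y, eta) for y in (0, 1))

def second_derivative(f, eta, h=1e-4):
    # Central finite-difference approximation of f''(eta).
    return (f(eta + h) - 2 * f(eta) + f(eta - h)) / h ** 2

eta = -0.3
var_sum = variance_by_summation(eta)
var_deriv = second_derivative(log_partition, eta)
print(var_sum, var_deriv)
```

Both values should be close to $\sigma(\eta)(1-\sigma(\eta))$, where $\sigma$ is the logistic function, consistent with the result above.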

<b>(c)</b> [5 points] Finally, write out the loss function $l(\theta)$, the NLL of the distribution, as a function of $\theta$. Then, calculate the Hessian of the loss w.r.t $\theta$, and show that it is always PSD. This concludes the proof that NLL loss of GLM is convex.

<b>Hint 1:</b> Use the chain rule of calculus along with the results of the previous parts to simplify your derivations.

<b>Hint 2:</b> Recall that variance of any probability distribution is non-negative. 

###  Answer: 
<b>Remark:</b> The main takeaways from this problem are:
1. The NLL loss of any GLM is convex in its model parameters.
2. The exponential family of probability distributions is mathematically nice. Whereas calculating the mean and variance of a distribution in general involves integrals (hard), for an exponential family distribution we can calculate them using derivatives of the log-partition function (easy).

For simplicity of notation, assume that $\theta$ is of shape $(m, n)$.

\begin{align*}
l(\theta) 
& = -\log p(y|\eta)\\
& = -\log\Big(b(y)\exp\left(\eta^TT(y) - a(\eta)\right)\Big)\\
& = C - \left(\eta^TT(y) - a(\eta)\right)\\
& = C - \left(x^T\theta T(y) - a(\theta^T x)\right)
\end{align*}
where $\eta = \theta^T x$ and $C = -\log b(y)$ does not depend on $\theta$.
Therefore, 
\begin{align*}
\frac{\partial}{\partial\theta}l(\theta) 
& = -x\left(T(y) - \frac{\partial}{\partial\eta}a(\eta)\right)^T
\end{align*}
and thus, 
\begin{align*}
\frac{\partial}{\partial\theta_{ij}}l(\theta) 
& = -x_i\left(T(y)_j - \frac{\partial}{\partial\eta_j}a(\eta)\right)
\end{align*}
Computing the second derivative of $l$, we obtain 
\begin{align*}
\frac{\partial}{\partial\theta_{kl}}\Big(\frac{\partial}{\partial\theta_{ij}}l(\theta)\Big)
& = x_i\frac{\partial}{\partial\theta_{kl}}\left(\frac{\partial}{\partial\eta_j}a(\eta)\right)\\
& = x_ix_{k}\frac{\partial}{\partial\eta_{l}}\left(\frac{\partial}{\partial\eta_j}a(\eta)\right)\\
& = x_ix_{k}{\rm Cov}(T(Y);\eta)_{jl}
\end{align*}

Now, let $Z$ be a matrix of shape $(m,n)$ (the same as $\theta$), viewed as a vector of length $mn$
whose coordinates are indexed by the pairs $ij$. 
Correspondingly, the Hessian matrix $H = \frac{\partial^2}{\partial\theta^2}l(\theta)$ is of shape $(mn,mn)$, with entries $H_{ij,kl}$.

\begin{align*}
Z^THZ
& = \sum_{ij}\sum_{kl}z_{ij}H_{ij,kl}z_{kl}\\
& = \sum_{ij}\sum_{kl}z_{ij}\Big(x_ix_{k}{\rm Cov}(T(Y);\eta)_{jl}\Big)z_{kl}\\
& = \sum_{ij}\sum_{kl}x_iz_{ij}\Big({\rm Cov}(T(Y);\eta)_{jl}\Big)x_{k}z_{kl}\\
& = x^TZ\ \Sigma\ Z^Tx\\
& = (x^TZ)\ \Sigma\ (x^TZ)^T \geq 0
\end{align*}
since the covariance matrix $\Sigma = {\rm Cov}(T(Y);\eta)$ is positive semi-definite. As this holds for every $Z$ and every $\theta$, the Hessian $H$ is PSD everywhere, so the NLL loss $l(\theta)$ is convex.
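
As a numerical illustration of the PSD argument, the sketch below assumes the scalar-$\eta$ case with a Poisson GLM ($a(\eta)=e^{\eta}$), where the Hessian for one example reduces to $a''(\theta^Tx)\,xx^T$. It sums the loss over a small made-up dataset (a sum of PSD matrices is still PSD) and compares the analytic quadratic form $z^THz$ with a finite-difference second derivative of the NLL along random directions $z$. The dataset, the seed, and all function names are hypothetical.

```python
import math
import random

def nll(theta, xs, ys):
    # Poisson GLM NLL up to the constant -log b(y):
    # sum over examples of a(eta) - eta * y, with a(eta) = exp(eta).
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(t * xi for t, xi in zip(theta, x))
        total += math.exp(eta) - eta * y
    return total

def quadratic_form(theta, xs, z):
    # z^T H z with H = sum_i a''(eta_i) x_i x_i^T,
    # which equals sum_i a''(eta_i) * (x_i^T z)^2 and is visibly non-negative.
    total = 0.0
    for x in xs:
        eta = sum(t * xi for t, xi in zip(theta, x))
        xz = sum(xi * zi for xi, zi in zip(x, z))
        total += math.exp(eta) * xz ** 2
    return total

random.seed(0)
dim = 3
xs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
ys = [random.randint(0, 4) for _ in range(5)]
theta = [random.uniform(-1, 1) for _ in range(dim)]

h = 1e-4
results = []
for _ in range(20):
    z = [random.uniform(-1, 1) for _ in range(dim)]
    analytic = quadratic_form(theta, xs, z)
    plus = nll([t + h * zi for t, zi in zip(theta, z)], xs, ys)
    minus = nll([t - h * zi for t, zi in zip(theta, z)], xs, ys)
    # Second directional derivative of the loss along z.
    numeric = (plus - 2 * nll(theta, xs, ys) + minus) / h ** 2
    results.append((analytic, numeric))

print(min(a for a, _ in results))
```

Every analytic value is a weighted sum of squares and therefore non-negative, and it should match the finite-difference curvature of the loss up to numerical error. Random directions cannot prove positive semi-definiteness, but they give a quick check that the derived Hessian is the one belonging to the loss.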