# Quadratic Optimization
## Introduction
Experience will soon tell you that, among the wide range of solvers that support NLP, most provide specific support for quadratic functions. This is because many problems deal specifically with quadratic functions rather than with generic non-linear functions. Recall that a quadratic function is a polynomial function of the decision variables whose terms have degree at most two.
For instance, a prominent example of a quadratic function is the quadratic error function:

$Loss(z) = \frac{z^2}{2}$

Any optimization problem that aims to minimize a given error can use this empirical error function as the objective function. The next subsections present some examples of quadratic optimization problems that exclusively use this type of function.

## Machine Learning Linear Regression as a Quadratic Unconstrained NLP
### Supervised Learning Regression Problem Set-Up
The objective of a linear regression problem is to *explain* a response variable, or dependent variable, based on the 
values of a set of explanatory variables, or independent variables, assuming that the relationship between the 
response variable and the explanatory variables is linear. 

That is, given: 

- $x^{(t)}$: Feature vector t, i.e. the t-th sample of the explanatory variables (e.g. characteristics of a person such as gender, age, region, device, …). $x^{(t)}$ is a d-dimensional vector (i.e. $x^{(t)} \in \mathbb{R}^d$)
- $y^{(t)}$: Label t, i.e. the response variable that we want to explain for sample t (e.g. the probability that a person is a potential customer). Let us assume that every $y^{(t)}$ is a real value (i.e. $y^{(t)} \in \mathbb{R}$)
- $S_n$: Training set containing all labeled feature vectors, i.e. $S_n = \{(x^{(t)}, y^{(t)}) \mid t=1,\dots,n \}$

The objective is to learn the linear mapping between the feature vector space x and the response variable y: 

$\hat y = f(x, \theta, \theta_0) = \sum_{i=1}^{d}{x_i\theta_i} + \theta_0 = x·\theta + \theta_0$

so that the value of y can be estimated for any point in the feature space based on the training data set. The parameters to learn are:

- $\theta$: Linear regression function coefficients
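- $\theta_0$: Linear regression independent term (let us assume it is zero for now, without loss of generality, since it can be absorbed into $\theta$ by appending a constant feature equal to 1 to every $x^{(t)}$)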
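As a minimal sketch of the linear mapping above, the following plain-Python function evaluates $\hat y = x·\theta + \theta_0$ for a single feature vector. The function name `predict` and the numeric values are hypothetical and chosen only for illustration; $\theta_0$ defaults to zero, in line with the assumption made in the text.

```python
def predict(x, theta, theta_0=0.0):
    """Return the estimate x . theta + theta_0 for one feature vector x."""
    return sum(x_i * theta_i for x_i, theta_i in zip(x, theta)) + theta_0


# Hypothetical feature vector (d = 3) and coefficients
x = [1.0, 2.0, 3.0]
theta = [0.5, -1.0, 2.0]

y_hat = predict(x, theta)  # 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0
```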

In general, the objective is to minimise the **empirical risk**, denoted $R_n$, which is the average of the estimation error over all training points t. For now, let $\text{Loss}$ denote the loss function that represents the error at each training point. Then, the objective function is:

$\min R_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}{\text{Loss}( y^{(t)} - x^{(t)}·\theta)}$
    
Now, if we consider the quadratic error function, the empirical risk becomes: 

$\min R_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}{\frac{( y^{(t)} - x^{(t)}·\theta)^2}{2}}$

This minimisation problem is a quadratic optimization problem where the decision variables are the components of $\theta$.
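To make the objective concrete, here is a small sketch that evaluates the empirical risk with the quadratic error function on a toy training set. The function names and the data are assumptions made for illustration (the labels were generated as $y = x_1 + 2x_2$), and $\theta_0$ is taken to be zero as above.

```python
def quadratic_loss(z):
    """Quadratic error function Loss(z) = z^2 / 2."""
    return z * z / 2.0


def empirical_risk(theta, xs, ys):
    """Average quadratic loss of the residuals y - x . theta over the training set."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(x_i * theta_i for x_i, theta_i in zip(x, theta))
        total += quadratic_loss(y - prediction)
    return total / len(xs)


# Hypothetical training set with n = 3 samples and d = 2 features
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]

risk_at_zero = empirical_risk([0.0, 0.0], xs, ys)  # residuals are 1, 2, 3
risk_at_fit = empirical_risk([1.0, 2.0], xs, ys)   # residuals are all 0
```

With $\theta = (0, 0)$ the residuals are simply the labels, so the risk is $(0.5 + 2 + 4.5)/3$, while $\theta = (1, 2)$ reproduces every label exactly and the risk vanishes.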

Thus, we can compute the closed form of the gradient as: 

$\nabla_{\theta} R_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}{\nabla_{\theta}\frac{( y^{(t)} - x^{(t)}·\theta)^2}{2}} = \frac{1}{n}\sum_{t=1}^{n}{\left [ (y^{(t)}-x^{(t)}·\theta)(-x^{(t)})\right ]} = -\frac{1}{n}\sum_{t=1}^{n}{y^{(t)}x^{(t)}}+\frac{1}{n}\sum_{t=1}^{n}{(x^{(t)}·\theta)x^{(t)}}$

Now, note that the first term is a sum of products of scalars ($y^{(t)}$) and vectors of size d ($x^{(t)}$), and is therefore itself a vector of size d. Let us denote this term, without its leading minus sign, as b:

$b = \frac{1}{n}\sum_{t=1}^{n}{y^{(t)}x^{(t)}}$

As for the second term, note that it is a sum in which each summand is the scalar product of two vectors of size d (which is a scalar) times another vector of size d. We can apply the following transformation:

$\frac{1}{n}\sum_{t=1}^{n}{(x^{(t)}·\theta)x^{(t)}} = \frac{1}{n}\sum_{t=1}^{n}{x^{(t)}(x^{(t)})^T\theta} = \left[\frac{1}{n}\sum_{t=1}^{n}{x^{(t)}(x^{(t)})^T}\right]\theta = A\theta$

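That is, we first use the fact that $x^{(t)}·\theta$ is a scalar, so that $(x^{(t)}·\theta)x^{(t)} = x^{(t)}((x^{(t)})^T\theta) = (x^{(t)}(x^{(t)})^T)\theta$, which places the two $x^{(t)}$ together, and then we move $\theta$ outside the summation since it does not depend on t. The resulting $A = \frac{1}{n}\sum_{t=1}^{n}{x^{(t)}(x^{(t)})^T}$ is a $d \times d$ matrix.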
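The transformation can be checked numerically. The sketch below, written in plain Python on a hypothetical toy data set, builds $A$ and $b$ as defined above and compares the gradient computed sample by sample with the matrix form $A\theta - b$. All function and variable names are illustrative assumptions.

```python
def dot(u, v):
    """Scalar product of two equally sized vectors."""
    return sum(p * q for p, q in zip(u, v))


def gradient_per_sample(theta, xs, ys):
    """Average of (y - x . theta) * (-x) over the training set."""
    n, d = len(xs), len(theta)
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = y - dot(x, theta)
        for i in range(d):
            grad[i] += residual * (-x[i]) / n
    return grad


def build_A_b(xs, ys):
    """Return A = (1/n) sum x x^T (d x d) and b = (1/n) sum y x (size d)."""
    n, d = len(xs), len(xs[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in zip(xs, ys):
        for i in range(d):
            b[i] += y * x[i] / n
            for j in range(d):
                A[i][j] += x[i] * x[j] / n
    return A, b


# Hypothetical training set with n = 3 samples and d = 2 features
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
theta = [0.5, -0.5]  # arbitrary point at which both forms are compared

A, b = build_A_b(xs, ys)
grad_matrix_form = [dot(A[i], theta) - b[i] for i in range(len(b))]
grad_sample_form = gradient_per_sample(theta, xs, ys)
```

Both lists should agree up to floating-point rounding, which illustrates that moving $\theta$ outside the sum does not change the value of the gradient.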

With this transformation, setting the gradient to zero turns the optimisation problem into a system of linear equations in $\theta$:


$\nabla_{\theta} R_n(\theta) = -b + A\theta = 0 \rightarrow A\theta = b$
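As an illustration, the sketch below builds $A$ and $b$ for a hypothetical toy data set (labels generated as $y = x_1 + 2x_2$) and solves $A\theta = b$ with a small hand-written Gaussian elimination. The helper names are assumptions made for this example; in practice a library routine such as NumPy's `numpy.linalg.solve` would typically be used instead.

```python
def build_A_b(xs, ys):
    """Return A = (1/n) sum x x^T (d x d) and b = (1/n) sum y x (size d)."""
    n, d = len(xs), len(xs[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in zip(xs, ys):
        for i in range(d):
            b[i] += y * x[i] / n
            for j in range(d):
                A[i][j] += x[i] * x[j] / n
    return A, b


def solve_linear_system(A, b):
    """Solve A theta = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    # Augmented matrix [A | b], copied so that the inputs are not modified
    M = [list(A[i]) + [b[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is (numerically) singular; the minimiser is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system
    theta = [0.0] * d
    for i in reversed(range(d)):
        known = sum(M[i][j] * theta[j] for j in range(i + 1, d))
        theta[i] = (M[i][d] - known) / M[i][i]
    return theta


# Hypothetical training set with n = 3 samples and d = 2 features
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]

A, b = build_A_b(xs, ys)
theta_star = solve_linear_system(A, b)

# The gradient A theta - b should vanish at the solution
gradient_at_solution = [
    sum(A[i][j] * theta_star[j] for j in range(len(b))) - b[i]
    for i in range(len(b))
]
```

For this data set the system reduces to $2\theta_1 + \theta_2 = 4$ and $\theta_1 + 2\theta_2 = 5$, so the solver should recover the coefficients used to generate the labels, up to rounding.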
