# CHAPTER 9 - Interior-Point Methods

---
---

**Author:** Dr Giordano Scarciotti (g.scarciotti@imperial.ac.uk) - Imperial College London 

**Module:** ELEC70066 - Advanced Optimisation

**Version:** 1.1.2 - 02/03/2023

---
---

The material of this chapter is adapted from $[1]$.

# 9.1 Introduction: Interior-Point Methods

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/Wj96VECDBcE"></iframe>')

In this chapter we discuss *interior-point methods* for solving convex optimization problems that include inequality constraints

$$
\begin{array}{lll}
\min & f_0(x) &\\
s.t. & f_i(x) \le 0, & i = 1,\dots,m\\
& A x = b,  & 
\end{array} \tag{1}
$$

where $f_0,\dots,f_m: \mathbb{R}^n \to \mathbb{R}$ are convex and twice continuously differentiable, and $A\in\mathbb{R}^{p\times n}$ has $\textbf{rank }A = p < n$. We assume that the problem is solvable, with solution $x^*$ and optimal value $f_0(x^*) = p^*$. We also assume that the problem is strictly feasible, so that Slater's condition holds and there exist dual optimal variables $\lambda^*\in\mathbb{R}^m$, $\nu^*\in\mathbb{R}^p$ that, together with $x^*$, satisfy the KKT conditions, namely

$$
\begin{array}{rcll}
f_i(x^*) & \le & 0 & i=1,\dots,m\\
Ax^*-b & =   & 0 & \\
\lambda_i^* & \ge & 0 & i=1,\dots,m\\
\nabla f_0(x^*) +  \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + A^\top \nu^* &=&0 &\\
\lambda_i^* f_i(x^*) & = & 0 & i=1,\dots,m.
\end{array}\tag{2}
$$

Interior-point methods solve problem $(1)$ or the KKT conditions $(2)$ by applying Newton’s method to a sequence of equality constrained problems, or
to a sequence of modified versions of the KKT conditions. 

We will concentrate on a particular interior-point algorithm, the *barrier method*. We also briefly comment on a simple *primal-dual interior-point method* at the end of the chapter.

Interior-point methods solve an optimisation problem with equality and inequality constraints by reducing it to a sequence of linear
equality constrained problems. In [Chapter $8$](https://colab.research.google.com/drive/15IaEHZr_pq6MURtITz6xVILNAp818PZu?usp=sharing) we have seen that a linear equality constrained problem is solved by Newton's method by reducing it to a sequence of linear equality constrained quadratic problems. A linear equality constrained quadratic problem is solved by reducing it to a system of linear equations, which can be solved analytically.

In summary, we have completed the chain: the general problem $(1)$ is reduced to a series of linear equations, which we know how to solve ([Chapter $7$](https://colab.research.google.com/drive/1JzySYFafmlgUUkn7tDxiybTUebQlZB3u?usp=sharing)).

Many problems are already in the form $(1)$, and satisfy the assumption that the objective and constraint functions are twice differentiable. Obvious examples are LPs, QPs, QCQPs, and GPs in convex form. Many other problems do not have the required form $(1)$, with twice differentiable
objective and constraint functions, but can be reformulated in the required form. Examples include the $\ell_1$ and $\ell_\infty$ norm approximation problems (see [Chapter $1$](https://colab.research.google.com/drive/1efBkJ5F5U6QpIT4nU6QzxM5mZq5I6iv4?usp=sharing), and in particular Class Exercise $3$ of Week $6$).

Other convex optimisation problems, such as SOCPs and SDPs, are not readily recast in the required form, but can be handled by extensions of interior-point
methods to problems with generalized inequalities, which we will see later in this chapter.

# 9.2 Logarithmic Barrier Function and Central Path

## 9.2.1 Logarithmic Barrier Function

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/-wIP9xLfXAw"></iframe>')

Our goal is to formulate the inequality constrained problem $(1)$ as an equality constrained problem to which Newton’s method can be applied. Our first attempt consists in rewriting $(1)$ as 

$$
\begin{array}{ll}
\min & f_0(x) + \sum_{i=1}^m I_-(f_i(x))\\
s.t. & A x = b, 
\end{array}\tag{3}
$$

where $I_-(\cdot)$ is the [*indicator function*](https://en.wikipedia.org/wiki/Indicator_function) for the nonpositive reals, i.e. it is $0$ when the argument is nonpositive and it is $\infty$ when the argument is positive. Problems $(1)$ and $(3)$ are equivalent, but problem $(3)$ is not differentiable because the indicator function is discontinuous. Thus, Newton’s method cannot be applied.

The basic idea of the barrier method is to approximate the indicator function $I_−$
by the function

$$
\hat{I}_−(u) = −\frac{1}{t} \log(−u),\qquad \textbf{dom } \hat{I}_− = −\mathbb{R}_{++},
$$

where $t > 0$ is a parameter that sets the accuracy of the approximation. Both $I_−$ and $\hat{I}_−$ are convex, nondecreasing and $\infty$ for positive arguments. However, $\hat{I}_−$ is differentiable and it increases smoothly to $\infty$ as the argument increases to $0$.

Figure 9.1 below shows the functions $I_−$ and $\hat{I}_−$, for several values of $t$. As $t$ increases, the approximation
becomes more accurate.

<div>
<img src="https://drive.google.com/uc?export=view&id=1DmBg95C4Tsgn0Vj07E9mTOtUPPoJTfgk" width="400"/>
</div>

Figure 9.1. *The dashed line shows the function $I_-(u)$. The solid lines show $\hat{I}_-(u)=-(1/t)\log(-u)$, for $t=0.5,1,2$. Source: page 563 of $[1]$.*
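The behaviour shown in Figure 9.1 can be reproduced numerically. The snippet below is a minimal sketch: the function name `barrier_approx` and the sample points are illustrative choices, and the value $\infty$ returned for $u \ge 0$ is a convention used here to mimic the indicator function outside the domain of $\hat{I}_-$.

```python
import math

def barrier_approx(u, t):
    """Smooth approximation -(1/t) log(-u) of the indicator I_-.

    Returns +inf for u >= 0 (a convention adopted in this sketch).
    """
    if u >= 0:
        return math.inf
    return -math.log(-u) / t

# For a fixed u < 0 the approximation shrinks towards I_-(u) = 0 as t grows.
for t in (0.5, 1.0, 2.0, 100.0):
    values = [barrier_approx(u, t) for u in (-2.0, -1.0, -0.1, -0.01)]
    print(t, [round(v, 4) for v in values])
```

For every fixed $u<0$ the printed values approach $0$ as $t$ grows, while the function still blows up as $u \to 0^-$.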

Thus, problem $(3)$ is approximated by

$$
\begin{array}{ll}
\min & f_0(x) - \frac{1}{t} \sum_{i=1}^m \log(-f_i(x))\\
s.t. & A x = b, 
\end{array}\tag{4}
$$

on which we can apply Newton's method. The function 

$$
\phi(x) = -\sum_{i=1}^m \log(- f_i(x))
$$

is called the *logarithmic barrier function* for problem $(1)$. Its domain is the set of points that satisfy the inequality constraints of $(1)$ strictly: no matter the value of $t$, the barrier grows unbounded as $f_i(x) \to 0$ for any $i$.

Problem $(4)$ is just an approximation of $(1)$ and we should quantify how good or bad this approximation is. Naturally, if the parameter $t$ is large, then the approximation is good. However, a large $t$ implies also that the Hessian varies rapidly near the boundary of the feasibility set, which means that Newton's method will struggle to solve the problem if we pick $t$ large. It looks like we reached an *impasse*.

We will see that this problem can be circumvented by solving a sequence of problems of the form $(4)$, increasing the parameter $t$
(and therefore the accuracy of the approximation) at each step, and starting each Newton minimisation at the solution of the problem for the previous value of $t$.

For future reference, note that the gradient of $\phi(x)$ is

$$
\nabla\phi(x) =\sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x)
$$

i.e. it is a weighted sum of the gradients of the constraint functions, where the weights $1/(-f_i(x))$ are the reciprocals of the constraint slacks $-f_i(x)$ (thus, the weights are positive at every strictly feasible point). 
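As a sanity check of the gradient formula, the sketch below evaluates $\nabla\phi$ for linear constraints $f_i(x) = a_i^\top x - b_i$ and compares it with a central finite-difference approximation. The data `A`, `b` and the test point are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical data: three linear inequalities a_i^T x - b_i <= 0 in R^2.
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [1.0, 1.0, 1.0]

def constraint_values(x):
    """Return [f_1(x), ..., f_m(x)] with f_i(x) = a_i^T x - b_i."""
    return [sum(aij * xj for aij, xj in zip(ai, x)) - bi for ai, bi in zip(A, b)]

def phi(x):
    """Logarithmic barrier; only valid at strictly feasible x."""
    return -sum(math.log(-fi) for fi in constraint_values(x))

def grad_phi(x):
    """Sum of grad f_i(x) / (-f_i(x)); here grad f_i(x) = a_i."""
    f = constraint_values(x)
    return [sum(ai[j] / (-fi) for ai, fi in zip(A, f)) for j in range(len(x))]

def numerical_grad(x, h=1e-6):
    """Central finite differences, used only to check grad_phi."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((phi(xp) - phi(xm)) / (2.0 * h))
    return g

x = [0.2, -0.3]  # strictly feasible for the data above
print(grad_phi(x))
print(numerical_grad(x))
```

The two printed vectors should agree to several digits, and the weights $1/(-f_i(x))$ are all positive at this point.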

## 9.2.2 Central Path

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/IY-zg0z3nNc"></iframe>')

We now rewrite problem $(4)$ by multiplying the objective function by $t$, obtaining

$$
\begin{array}{ll}
\min & t f_0(x) - \phi(x)\\
s.t. & A x = b, 
\end{array}\tag{5}
$$

which has the same minimisers. For $t>0$ we define $x^*(t)$ as the solution of $(5)$ and we assume that this solution is unique. We call the set

$$
\{x^*(t) : t>0\}
$$

the *central path* associated to problem $(1)$. Points on the central path are characterised by the following necessary and sufficient conditions: $x^*(t)$ is strictly feasible, i.e. $Ax^*(t)=b$ and $f_i(x^*(t))<0$ for $i=1,\dots,m$, and there exists $\hat \nu \in \mathbb{R}^p$ such that

$$
0 = t \nabla f_0(x^*(t)) + \nabla \phi(x^*(t)) + A^\top \hat \nu, \tag{6}
$$

which we call *centrality condition*.

If we consider an LP in inequality form (see Exercise 6.4), then the central path has a simple geometric interpretation. In such a problem, $f_0(x) = c^\top x$, $f_i(x) = a_i^\top x - b_i$ for $i=1,\dots,m$ (where $a_i^\top$ is the $i$-th row of $A$), and there is no equality constraint. So the centrality condition boils down to

$$
t c + \nabla \phi(x^*(t)) = 0.
$$

At a point $x^*(t)$ on the central path the gradient $\nabla \phi(x^*(t))$, which is normal to the level set of $\phi$ through $x^*(t)$, must be parallel to $−c$. In other words, the hyperplane $c^\top x = c^\top x^*(t)$
is tangent to the level set of $\phi$ through $x^*(t)$. This interpretation is illustrated in the figure below.




<div>
<img src="https://drive.google.com/uc?export=view&id=1s5ZkUXLnJfjoobNT8DFpCKn9UllbANks" width="400"/>
</div>

Figure 9.2. *Central path for an LP with $n = 2$ and $m = 6$. The dashed
curves show three contour lines of the logarithmic barrier function $\phi$. The central path converges to the optimal point $x^*$ as $t\to\infty$. Also shown is the
point on the central path with $t = 10$. The line $c^\top x = c^\top x^*(10)$ is tangent to the contour line of $\phi$ through $x^*(10)$. Source: page 566 of $[1]$.*

From $(6)$ we can derive an important property of the central path: every central point yields a dual feasible point, and hence a lower bound on the optimal value $p^*$. In fact, note that $(6)$ can be rewritten as 

$$
0 = \nabla f_0(x^*(t)) + \sum_{i=1}^m \lambda^*_i(t) \nabla f_i(x^*(t)) + A^\top \nu^*(t)
$$

where 

$$
\lambda^*_i(t) = -\frac{1}{t f_i(x^*(t))}, \quad i=1,\dots,m \qquad \qquad \nu^*(t) = \frac{\hat{\nu}}{t}.
$$

It is clear that $\lambda^*_i(t) > 0$, because $f_i(x^*(t)) < 0$ for all $i$. Moreover, this equation implies that $x^*(t)$ minimises the Lagrangian $L(x,\lambda^*(t),\nu^*(t))$, which is convex in $x$; hence $(\lambda^*(t),\nu^*(t))$ is a dual feasible pair.


We can then determine a lower bound on the optimal value $p^*$ based on the solution of $(5)$. From the Lagrangian we find the dual function, which yields

$$
g(\lambda^*(t),\nu^*(t)) = f_0(x^*(t)) + \sum_{i=1}^m \lambda^*_i(t) f_i(x^*(t)) +  \nu^*(t)^\top (A x^*(t) - b) =  f_0(x^*(t)) - \frac{m}{t}
$$

where we have used the definition of $\lambda^*(t)$ and the fact that $x^*(t)$ is feasible. Thus, the duality gap associated to $x^*(t)$, $\lambda^*(t)$ and $\nu^*(t)$ is simply $m/t$ and, since $g(\lambda^*(t),\nu^*(t)) \le p^*$ by weak duality, we have

$$
f_0(x^*(t)) - p^* \le \frac{m}{t}
$$

which means that $x^*(t)$ is no more than $m/t$-suboptimal and that $x^*(t)$ converges to the optimal point as $t \to \infty$.
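The bound can be checked by hand on a toy problem. The sketch below assumes the hypothetical one-dimensional problem $\min x$ subject to $-x \le 0$ and $x - 1 \le 0$ (so $m=2$ and $p^*=0$), for which the centrality condition $t - 1/x + 1/(1-x) = 0$ can be solved in closed form. For this problem the dual function is $g(\lambda) = -\lambda_2$ whenever $1 - \lambda_1 + \lambda_2 = 0$.

```python
import math

def central_point(t):
    """Root in (0, 1) of t*x^2 - (t + 2)*x + 1 = 0, i.e. x*(t) for the toy problem."""
    return ((t + 2.0) - math.sqrt(t * t + 4.0)) / (2.0 * t)

def dual_point(t):
    """lambda_i*(t) = -1 / (t f_i(x*(t))) with f_1(x) = -x and f_2(x) = x - 1."""
    x = central_point(t)
    return 1.0 / (t * x), 1.0 / (t * (1.0 - x))

m, p_star = 2, 0.0
for t in (1.0, 10.0, 100.0, 1000.0):
    x = central_point(t)
    lam1, lam2 = dual_point(t)
    dual_value = -lam2            # g(lambda*(t)) for this toy problem
    print(t, x - p_star, m / t, x - dual_value)
```

In each row the suboptimality $f_0(x^*(t)) - p^*$ is below $m/t$, and the last column (primal value minus dual value) equals $m/t$ up to rounding.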

We conclude this section giving two alternative interpretations to the logarithmic barrier method:


*   We can interpret the centrality condition as a continuous deformation of the KKT conditions. In particular, a point $x$ is equal to $x^*(t)$ if and only if there exist $\lambda$ and $\nu$ such that
$$
\begin{array}{rcll}
f_i(x) & \le & 0 & i=1,\dots,m\\
Ax-b & =   & 0 & \\
\lambda_i & \ge & 0 & i=1,\dots,m\\
\nabla f_0(x) +  \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^\top \nu &=&0&\\
-\lambda_i f_i(x) & = & \frac{1}{t} & i=1,\dots,m.
\end{array}
$$
The only difference with the KKT conditions for $(1)$ is the presence of the term $\frac{1}{t}$ in the complementarity condition (which is a direct consequence of the expression of $\lambda_i^*(t)$ obtained above). Again, we see that for large $t$, $x^*(t)$ and the associated dual point $\lambda^*(t)$ and $\nu^*(t)$ "almost" satisfy the KKT optimality conditions for $(1)$, with the discrepancy approaching $0$ as $t \to \infty$.
*   We can interpret the central path with a physics analogy. Consider a particle at position $x$ and associate with each constraint the force $F_i(x) = \frac{1}{f_i(x)}\nabla f_i(x)$ acting on the particle. The potential associated with the total force field generated by the constraints is the logarithmic barrier $\phi$. As the
particle moves toward the boundary of the feasible set, it is strongly repelled by the forces generated by the constraints. We also associate a force $F_0(x) = -t \nabla f_0(x)$ acting on the particle, which pulls it towards where $f_0$ is smaller. The parameter $t$ scales the objective force relative to the constraint forces. The central point $x^*(t)$ is the point where all forces are balanced. As $t$ increases the balance changes, with the particle being pulled more towards the optimal point. This idea is illustrated in the figure below. 



<div>
<img src="https://drive.google.com/uc?export=view&id=1qlF4BTeyYiLOmLTxWXwPcgosb4mL15oi" width="600"/>
</div>

Figure 9.3. *Force field interpretation of central path. The objective force is equal to $−c$ and $−3c$, respectively. Source: page 568 of $[1]$.*

# 9.3 The Barrier Method

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/c1s5nieiGvU"></iframe>')

We have seen that the point $x^*(t)$ is $m/t$-suboptimal, and that a certificate of this accuracy is provided by the dual feasible pair $\lambda^*(t)$, $\nu^*(t)$. Thus, a very straightforward method for solving the original problem $(1)$ with a guaranteed specified accuracy $\varepsilon$ is to take $t = m/\varepsilon$ and solve the equality constrained problem $(5)$ using Newton’s method. Although this method can work well for small problems, good starting points, and moderate accuracy (i.e., $\varepsilon$ not too small), it does not work well in general. As a result it is rarely used.

A simple but very effective improvement to the approach above is based on solving a sequence of unconstrained (or linearly constrained) minimisation problems, using the last point found as the starting point for the next unconstrained minimisation problem. This method is called the *barrier method*, *path-following method*, or *sequential unconstrained minimisation technique* (this is the original name, used when it was first proposed). The algorithm is given below.

$$
\begin{array}{l}
\textbf{given} \text{ a strictly feasible } x, t:=t^{(0)}, \mu>1\text{, tolerance }\varepsilon >0.\\
\textbf{repeat}\\
\quad 1.\textit{ Centering step. }\\
\qquad\text{ Compute } x^*(t) \text{ by minimising }tf_0 + \phi, \text{ subject to }Ax=b \text{, starting at } x.\\
\quad 2.\textit{ Update. }x:=x^*(t).\\
\quad 3.\textit{ Stopping criterion. }\textbf{quit} \text{ if }m/t < \varepsilon\\
\quad 4.\textit{ Increase }t.\,\, t:=\mu t.
\end{array}
$$

At each iteration (except the first one) we compute the central point $x^*(t)$ starting from the previously computed central point, and then increase $t$ by a factor $\mu > 1$. We refer to each execution of Step $1$ as an *outer iteration* (at times we call it a *centering step*), while the iterations of Newton's method executed during Step $1$ are called *inner iterations*. At each inner step, we have a primal feasible point; however, we have a dual feasible point only at the end of each outer/centering step.
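The algorithm can be sketched in a few lines of Python. To keep the example dependency-free, the sketch assumes a hypothetical box-constrained LP, $\min c^\top x$ subject to $0 \le x_i \le 1$ (so $m = 2n$), for which the Hessian of $t f_0 + \phi$ is diagonal and the Newton step can be computed elementwise. The function name, the default parameters and the backtracking constants are illustrative choices, not part of the source.

```python
import math

def barrier_box_lp(c, t0=1.0, mu=10.0, eps=1e-3, newton_tol=1e-8, max_newton=100):
    """Barrier method for min c^T x s.t. 0 <= x_i <= 1 (m = 2n inequalities)."""
    n, m = len(c), 2 * len(c)
    x = [0.5] * n                  # strictly feasible starting point
    t = t0
    total_newton = 0

    def objective(x, t):
        # t*f0(x) + phi(x); +inf outside the domain of the barrier
        if any(xi <= 0.0 or xi >= 1.0 for xi in x):
            return math.inf
        return sum(t * ci * xi - math.log(xi) - math.log(1.0 - xi)
                   for ci, xi in zip(c, x))

    while True:
        # 1. Centering step: Newton's method (inner iterations).
        for _ in range(max_newton):
            grad = [t * ci - 1.0 / xi + 1.0 / (1.0 - xi) for ci, xi in zip(c, x)]
            hess = [1.0 / xi ** 2 + 1.0 / (1.0 - xi) ** 2 for xi in x]  # diagonal
            lam_sq = sum(g * g / h for g, h in zip(grad, hess))  # Newton decrement^2
            if lam_sq / 2.0 <= newton_tol:
                break
            step = [-g / h for g, h in zip(grad, hess)]
            f_old, s = objective(x, t), 1.0
            for _ in range(60):    # backtracking line search
                x_new = [xi + s * dx for xi, dx in zip(x, step)]
                if objective(x_new, t) <= f_old - 0.25 * s * lam_sq:
                    x = x_new      # 2. Update (x stays strictly feasible)
                    total_newton += 1
                    break
                s *= 0.5
            else:
                break              # no acceptable step found; stop centering
        # 3. Stopping criterion on the duality gap m/t.
        if m / t < eps:
            return x, t, total_newton
        # 4. Increase t.
        t *= mu

c = [1.0, -2.0, 0.5]               # hypothetical cost vector
x, t, steps = barrier_box_lp(c)
p_star = sum(min(ci, 0.0) for ci in c)   # optimum of the box-constrained LP
print(x, sum(ci * xi for ci, xi in zip(c, x)) - p_star, 2 * len(c) / t, steps)
```

The printed suboptimality should be below the final duality-gap bound $m/t$. A general implementation would replace the elementwise division by the solution of the Newton (KKT) system seen in Chapter $8$.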

The choice of the parameter $\mu$ involves a trade-off in the number of inner and outer iterations required. If $\mu$ is small (i.e., near $1$) then at each outer iteration $t$ increases by a small factor. As a result the initial point for the Newton process, i.e., the previous optimal point, is a very good starting point, and the number of Newton steps needed to compute the next iterate is small. Thus for small $\mu$ we expect a small number of Newton steps per outer iteration, but of course a large number of outer
iterations since each outer iteration reduces the gap by only a small amount. On the other hand if $\mu$ is large we have the opposite situation. After each outer iteration, $t$ increases by a large amount. Thus, the previous optimal point is probably not a very good approximation of the next iterate. Thus we expect many more inner
iterations. This results in fewer outer iterations, since the duality gap is reduced by the large factor $\mu$ at each outer iteration, but more inner iterations.


We will see below that for $\mu$ in a fairly large range, from around $3$ to $100$ or so, the two effects nearly cancel, so that the speed at which the duality gap is reduced remains approximately constant. This means that the choice of $\mu$ is not particularly critical.

Another important issue is the choice of initial value of $t$. Here the trade-off is simple: if $t^{(0)}$ is chosen too large, the first centering step will require too many iterations.
If $t^{(0)}$ is chosen too small, the algorithm will require extra outer iterations. There are several heuristics to select $t^{(0)}$, but we do not dwell further on this matter.

Convergence analysis for the barrier method is straightforward. Assuming that $t f_0 + \phi$ can be minimized by Newton’s method for $t = t^{(0)}$, $\mu t^{(0)}$, $\mu^2 t^{(0)}$, ..., then the duality gap after the initial centering step and $k$ additional centering steps, is $m/(\mu^k t^{(0)})$. Therefore the desired accuracy $\varepsilon$ is achieved after exactly

$$
\left\lceil \frac{\log(m/(\varepsilon t^{(0)}))}{\log \mu}\right\rceil
$$

centering steps, plus the initial centering step. It follows that the barrier method works provided the centering problem $(5)$ is solvable by Newton’s method, for $t \ge t^{(0)}$. The solvability of each Newton's method depends on conditions seen in Chapter $7$ (e.g. closedness of level sets and so on).
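The count above is easy to evaluate and to cross-check against a direct simulation of the update $t := \mu t$. The sketch below uses hypothetical values $m=100$, $\varepsilon = 10^{-3}$ and $t^{(0)}=1$; the function names are illustrative.

```python
import math

def outer_iterations(m, eps, t0, mu):
    """Centering steps needed after the initial one, from the formula above."""
    return math.ceil(math.log(m / (eps * t0)) / math.log(mu))

def simulate_outer_iterations(m, eps, t0, mu):
    """Count the updates t := mu*t performed until m/t < eps."""
    t, k = t0, 0
    while m / t >= eps:
        t *= mu
        k += 1
    return k

for mu in (2.0, 20.0, 150.0):
    print(mu, outer_iterations(100, 1e-3, 1.0, mu),
          simulate_outer_iterations(100, 1e-3, 1.0, mu))
```

Both columns agree for these values, and the count decreases as $\mu$ grows, which is the "fewer outer iterations" side of the trade-off discussed above.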

While this analysis shows that the barrier method does converge, under reasonable assumptions, it does not address a basic question: as the parameter $t$ increases, do the centering problems become more difficult (and therefore take more and more
iterations)? We now see in some numerical applications that this is not the case in practice. The centering problems appear to require a nearly constant number of Newton steps to solve, even as $t$ increases. This can be also shown theoretically for self-concordant functions, as we will see in later sections.

In the following series of figures we analyse the convergence rate of the barrier method on numerical simulations. We solve a small ($100$ constraints and $50$ variables) LP in inequality form, namely

$$
\begin{array}{ll}
\min & c^\top x\\
s.t. & A x \preceq b, 
\end{array} \tag{7}
$$

<div>
<img src="https://drive.google.com/uc?export=view&id=1oFisodhrDAfNkTPD0vWCh12kXnQ1W-oV" width="400"/>
</div>

Figure 9.4. *Progress of barrier method for a small LP, showing duality
gap versus cumulative number of Newton steps. Three plots are shown,
corresponding to three values of the parameter $\mu$: $2$, $50$ and $150$. In each case, we have approximately linear convergence of duality gap. Source: page 572 of $[1]$.*

Figure 9.4 shows the progress of the barrier method on this LP for three values of the parameter $\mu$. The vertical axis shows the duality gap on a log scale. The horizontal axis shows the cumulative total number of inner iterations. Each of the plots has a staircase shape, with each step associated with one outer iteration. The width of each stair tread (i.e., horizontal portion) is the number of Newton steps required for that outer iteration. The height of each stair riser (i.e., the vertical portion) is
exactly equal to (a factor of) $\mu$, since the duality gap is reduced by the factor of $\mu$ at the end of each outer iteration. The plots illustrate several typical features of the barrier method. 
*    The method works very well, with approximately linear convergence of the duality gap.
*    There is a trade-off in the choice of $\mu$. For small $\mu$, the treads are short, but the risers are also short. For large $\mu$, the risers are tall, but the treads are also long.

We analyse this trade-off further in the next figure. 

<div>
<img src="https://drive.google.com/uc?export=view&id=1RrThQxkL3oAkwgfJePo0aipPNcNjtMgS" width="400"/>
</div>

Figure 9.5. *Trade-off in the choice of the parameter $\mu$, for a small LP. The vertical axis shows the total number of Newton steps required to reduce the duality gap from $100$ to $10^{−3}$. Source: page 573 of $[1]$.*

Figure 9.5 shows the total number of Newton steps (so the sum over multiple outer iterations) required to reduce the duality gap by a fixed amount as a function of $\mu$ for the same LP. The plot shows that the barrier method performs very well for a wide range of values of $\mu$, from around $3$ to $200$. As our intuition suggests, the total number of Newton steps rises when $\mu$ is too small, due to the larger number of outer iterations required. However, unexpectedly, the total number of Newton steps does
not vary much for values of $\mu$ larger than around $3$. Since the performance does not improve with larger values of $\mu$, then a good choice is in the range $10$ – $100$.

This trend is true for several classes of problems, not just LPs. For instance, $[1]$ shows that a figure very similar to Figure 9.4 is obtained for a GP. This is omitted here.

Finally, we look at how the *dimension of the problem* (rather than $\mu$) influences the number of Newton steps in the inner iterations. This is done by solving a family of LPs with $m$ constraints and $n=2m$ variables in which the number of variables is increased from $20$ to $2000$, while keeping $\mu$ constant. The figure below shows that the average number of Newton steps is between $20$ and $27$, and never more than $30$, no matter the dimension of the problem. The trend in the figure may appear to keep increasing for larger dimensions, but it does not: beyond $m=100$ the number of iterations flattens at around $27$. This has been tested also on problems with billions of variables (but of course in this case each iteration is very slow).

<div>
<img src="https://drive.google.com/uc?export=view&id=14gsj0LUDNnB0n_lkXgylOLfewzvuZgbt" width="400"/>
</div>

Figure 9.6. *Average number of Newton steps required to solve $100$ randomly
generated LPs of different dimensions, with $n = 2m$. Error bars show the standard deviation, around the average value, for each value of $m$. Source: page 576 of $[1]$.*

In summary, 
1.   We can pick $\mu$ between $10$ and $100$, without much difference in terms of speed of convergence. 
2.   No matter how big the problem is, or the type of the problem, the number of inner steps is bounded to around $30$. (Of course, if the problem is large, then each Newton step will take longer to run.)
3.   The speed of convergence of the barrier method is linear.


# 9.4 Feasibility and Phase I Methods

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/kaAYJ8URSqk"></iframe>')

The barrier method requires a strictly feasible starting point $x^{(0)}$. When such a point is not known, the barrier method is preceded by a preliminary stage, called
*phase I*, in which a strictly feasible point is computed (or the constraints are found
to be infeasible). The strictly feasible point found during phase I is then used as the starting point for the barrier method, which is called *phase II*.

In this section we look at two phase I methods and we discuss what happens around the boundary between feasibility and infeasibility.

To find a feasible starting point for the barrier method, we can construct an auxiliary problem which has these two properties:
*   An initial point for this problem is always found easily.
*   A solution of this problem is a feasible point for the original problem.

Consider the problem

$$
\begin{array}{lll}
\displaystyle\min_{x,s} & s &\\
s.t. & f_i(x) \le s, & i = 1,\dots,m\\
& A x = b.  & 
\end{array} \tag{8}
$$

The variable $s$ can be interpreted as a bound on the maximum infeasibility of the inequalities; the goal is to drive the maximum infeasibility below zero.

This problem is always strictly feasible and solvable with the barrier method from a straightforward starting point: we can choose $x^{(0)}$ as any starting point for $x$ such that $Ax^{(0)}=b$, while for $s^{(0)}$ we can choose any number larger than $\max_{i=1,...,m} f_i(x^{(0)})$, which can be computed directly. We can therefore apply the barrier method to solve $(8)$ which is
a phase I optimization problem.
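The construction of the starting point for $(8)$ can be sketched as follows. The helper names, the `margin` parameter and the one-dimensional example are hypothetical, and equality constraints are assumed to be absent (or already satisfied by $x^{(0)}$).

```python
def phase1_start(constraints, x0, margin=1.0):
    """Return (x0, s0) strictly feasible for the phase I problem (8).

    `constraints` is a list of callables f_i; s0 is chosen strictly larger
    than max_i f_i(x0), so that f_i(x0) < s0 for every i.
    """
    s0 = max(f(x0) for f in constraints) + margin
    return x0, s0

def certifies_strict_feasibility(constraints, x, s):
    """True if (x, s) is feasible for (8) with s < 0, hence f_i(x) < 0 for all i."""
    return s < 0 and all(f(x) <= s for f in constraints)

# Hypothetical example: the inequalities x - 1 <= 0 and -x - 1 <= 0.
constraints = [lambda x: x - 1.0, lambda x: -x - 1.0]
x0, s0 = phase1_start(constraints, x0=5.0)
print(s0, all(f(x0) < s0 for f in constraints))
print(certifies_strict_feasibility(constraints, 0.0, -0.5))
```

Here $x^{(0)}=5$ violates the first inequality, yet $(x^{(0)}, s^{(0)})$ is strictly feasible for $(8)$. A pair with $s<0$, such as $(0, -0.5)$, certifies strict feasibility of the original inequalities.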


We can distinguish three cases depending on the sign of the optimal value $\bar{p}^*$ of $(8)$.
1.    If $\bar{p}^* < 0$, then the original problem has a strictly feasible solution. Moreover, we do not need to compute $\bar{p}^*$. As soon as we find a feasible pair $(x, s)$ for $(8)$ with $s < 0$, then $x$ satisfies $f_i(x) < 0$ and we can move to phase II.
2.    If $\bar{p}^*> 0$, then the original problem is infeasible. As in case 1, we do not need to solve
the phase I optimization problem to high accuracy; we can terminate
when a dual feasible point is found with positive dual objective (which proves by weak duality that $\bar{p}^* > 0$).
3. If $\bar{p}^*= 0$ and the minimum is attained at $x^*$ and $s^* = 0$, then the set of inequalities is feasible, but not strictly feasible. If $\bar{p}^* = 0$ and the minimum
is not attained, then the inequalities are infeasible. In practice, we never obtain $0$ (because of numerical approximations), but the algorithm will terminate with $|\bar{p}^*|<\varepsilon$. This means that the inequalities $f_i(x) \le - \varepsilon$ are infeasible, while the inequalities $f_i(x) \le  \varepsilon$ are feasible.

Problem $(8)$ is not the only possible phase I method. For instance, an alternative is based on minimising the sum of the infeasibilities, namely

$$
\begin{array}{lll}
\displaystyle\min_{x,s} & \mathbf{1}^\top s &\\
s.t. & f_i(x) \le s_i, & i = 1,\dots,m\\
& A x = b.  & \\
& s \succeq 0.
\end{array} \tag{9}
$$

The optimal value of $(9)$ is zero and achieved if and only if the original set of equalities and inequalities is feasible.

By looking at $(8)$ and $(9)$, can you guess what the difference will be when the problem is infeasible? You may recognise that problem $(8)$ is very similar to the LP formulation of the $\ell_\infty$-norm problem, whereas problem $(9)$ is very similar to the LP formulation of the $\ell_1$-norm problem (see Class Exercise $3$ of Week $6$). As expected, if we have an infeasible problem and we plot the distribution of the constraint violations, problem $(8)$ will have a high number of violated inequalities, each with a small violation, while problem $(9)$ will have a large number of constraints which are satisfied exactly at zero (as explained in [Section 1.1.3](https://colab.research.google.com/drive/1efBkJ5F5U6QpIT4nU6QzxM5mZq5I6iv4#scrollTo=1mUZN-95PGaY)). A figure is reported below.

<div>
<img src="https://drive.google.com/uc?export=view&id=1_jclUr0tVEDCdrFgoFJ9Xzaw_U-_7UfG" width="800"/>
</div>

Figure 9.7. *Distributions of the infeasibilities $b_i − a_i^\top x$ for an infeasible set
of $100$ inequalities $a_i^\top x \le b_i$, with $50$ variables. The vector $x_\max$ used in the left plot was obtained by the basic phase I algorithm. It satisfies $39$ of the $100$ inequalities. In the right plot the vector $x_{\text{sum}}$ was obtained by minimizing the sum of the infeasibilities. This vector satisfies $79$ of the $100$ inequalities. Source: page 581 of $[1]$.*

It is also valuable to look at the number of iterations needed by Newton's method on $(8)$ (or $(9)$) depending on whether the problem is feasible or not. Consider the constraint

$$
Ax \preceq b(\gamma) 
$$

where $b(\gamma) = b + \gamma \Delta b$. The data of the problem is then selected so that for positive $\gamma$ the problem is feasible and for negative $\gamma$ the problem is infeasible. The next figure shows the number of Newton steps as a function of $\gamma$.

<div>
<img src="https://drive.google.com/uc?export=view&id=1509K75NIM_Ee4RgL9DKpEmsY9zLHpVzU" width="400"/>
</div>

Figure 9.8. *Number of Newton iterations required to detect feasibility or infeasibility of a set of linear inequalities $Ax \preceq b + \gamma \Delta b$ parametrized by
$\gamma$. The inequalities are strictly feasible for $\gamma > 0$, and infeasible for $\gamma < 0$. For $\gamma$ larger than around $0.2$, about $30$ steps are required to compute a strictly feasible point; for $\gamma$ less than $−0.5$ or so, it takes around $35$ steps to produce a certificate proving infeasibility. For values of $\gamma$ in between, and
especially near zero, more Newton steps are required to determine feasibility. Source: page 584 of $[1]$.*

The figure shows that while for $\gamma>0.2$ we require $30$ steps and for $\gamma < -0.5$ we require $35$ steps, for $\gamma$ close to zero the number of steps increases dramatically. This is typical. The cost of solving a set of convex inequalities and linear equalities using the barrier method is modest, and approximately constant, as long as the problem is not very close to the boundary between feasibility and
infeasibility. When the problem is very close to the boundary, the number of Newton steps required to find a strictly feasible point or produce a certificate of infeasibility grows. When the problem is exactly on the boundary between
strictly feasible and infeasible, for example, feasible but not strictly feasible, the cost becomes infinite.

One could replace the barrier method for phase I with the infeasible start Newton's method to find a feasible point. However, this has two drawbacks:
1.   There is no good stopping criterion when the problem is infeasible; the residual simply fails to converge to zero.
2.   When close to infeasibility, the infeasible start Newton's method requires thousands of Newton steps more than the phase I method for the same level of "small" feasibility. 

# 9.5 Complexity Analysis via Self-Concordance

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/UpHOgtjN84c"></iframe>')

**Errata:** 
*   At 4:48 the video says "optimal value". It should be "optimal point".
*   At 11:32 the video says "$\mu$". It should be "$N$". At 11:36, 11:38 and 11:44 the video says "$N$". It should be "$\mu$". (a bit messy, just go with the text instead).

In Section $9.3$ we provided the complexity analysis of the outer iterations of the barrier method.
Using the complexity analysis of Newton’s method for self-concordant functions given in Chapter $7$, we can give a complete complexity analysis of the barrier method. We make two assumptions.
1.    The function $tf_0 + \phi$ is closed and self-concordant for all $t \ge t^{(0)}$.
2.    The sublevel sets of $(1)$ are bounded.

The second assumption implies that the centering problem has bounded sublevel sets, and, therefore, the centering problem is solvable (moreover, it also implies that the Hessian is positive definite everywhere). The self-concordance assumption holds naturally for a variety of problems such as LPs, QPs and QCQPs. Other problems can be reformulated so that the equivalent problem has self-concordant functions. For instance the linear inequality constrained entropy maximization problem

$$
\begin{array}{ll}
\min & \sum_{i=1}^n x_i \log(x_i)\\
s.t. & Fx \preceq g,\\
& A x = b, 
\end{array}
$$

does not satisfy assumption 1. However, the modified problem


$$
\begin{array}{ll}
\min & \sum_{i=1}^n x_i \log(x_i)\\
s.t. & Fx \preceq g,\\
& A x = b, \\
& x \succeq 0
\end{array}
$$

does satisfy assumption 1. Other problems that do not naturally satisfy the self-concordant assumption but that can be reformulated so that they satisfy it include GPs.



Recall from Chapter $7$ that the number of steps required by Newton's method for self-concordant functions is bounded above by

$$
\frac{f(x)-p^*}{\gamma} + c \tag{10}
$$

where $\gamma$ is a constant that depends only on the backtracking parameters and $c=\log_2 \log_2 (1/\varepsilon_{nt})$, with $\varepsilon_{nt}$ the tolerance of Newton's method. To simplify the notation, let $x=x^*(t)$ be the current (starting) iterate, $x^+=x^*(\mu t)$ be the next iterate (solution), $\lambda=\lambda^*(t)$ and $\nu=\nu^*(t)$. Applying the barrier algorithm given in Section $9.3$, we have a centering step that starts at $x=x^*(t)$ with the parameter updated to $\mu t$. Hence $f(x) = \mu t f_0(x) + \phi(x)$ and $p^* = f(x^+)$. Thus, $(10)$ reads

$$
\frac{\mu t f_0(x) + \phi(x) - \mu t f_0(x^+) - \phi(x^+) }{\gamma} + c
$$

which is an upper bound on the number of inner iterations. This upper bound cannot be computed because $x^+$ is unknown. Thus we find a further upper bound that is computable. The details of the computation are provided below

$$
\begin{array}{l}
\displaystyle \mu t f_0(x) + \phi(x) - \mu t f_0(x^+) - \phi(x^+)\\ 
\displaystyle = \mu t f_0(x) - \mu t f_0(x^+) + \sum_{i=1}^m \log(-\mu t \lambda_i f_i(x^+)) - m\log\mu\\
\displaystyle\le \mu t f_0(x) - \mu t f_0(x^+) -\mu t \sum_{i=1}^m \lambda_i f_i(x^+) - m - m\log\mu\\
\displaystyle= \mu t f_0(x) - \mu t \left( f_0(x^+) + \sum_{i=1}^m \lambda_i f_i(x^+) + \nu^\top(Ax^+ -b) \right) - m - m\log\mu\\
\displaystyle\le \mu t f_0(x) - \mu t g(\lambda,\nu) - m - m\log\mu\\
= m(\mu-1-\log \mu).
\end{array}
$$

To obtain the second line from the first, we use $\lambda_i = -1/(t f_i(x))$. In the first inequality we use the fact that $\log a \le a - 1$ for $a > 0$, with $a = -\mu t \lambda_i f_i(x^+)$. To obtain the fourth line from the third, we use $Ax^+ = b$, so the extra term $\nu^\top (Ax^+ - b)$ is zero. The second inequality follows from the definition of the dual function. The last line follows from $g(\lambda, \nu) = f_0(x) - m/t$.

The conclusion is that

$$
\frac{m(\mu-1-\log \mu)}{\gamma} + c
$$

is an upper bound on the number of Newton steps required by one outer iteration. Combining this with the results of Section $9.3$ we obtain

$$
N = \left\lceil \frac{\log(m/(\varepsilon t^{(0)}))}{\log \mu}\right\rceil \left(\frac{m(\mu-1-\log \mu)}{\gamma} + c\right)\tag{11}
$$

as an upper bound on the total number of Newton steps required by the barrier method ($(11)$ omits the complexity of the initial centering step). This formula confirms the observations that we have already noted in Section $9.3$ from the figures:
*    The barrier method converges at least linearly, since (for fixed $\mu$ and $m$) the number of steps required to reach a given precision grows logarithmically with the inverse of the precision ($N \propto \log(m/(\varepsilon t^{(0)}))$, the log of the ratio of the initial duality gap $m/t^{(0)}$ and the final duality gap $\varepsilon$).
*    The bound does not depend on $t$. This means that the centering steps do not become more difficult as $t$ grows.
*    The bound does not depend on the number of variables $n$ or on the number of linear constraints $p$. The bound depends linearly on the number of inequality constraints $m$ (in fact, this result can be sharpened to show that this dependence grows with $\sqrt{m}$ instead of $m$).
*   As $\mu$ approaches $1$, $N$ grows large. For large $\mu$, $N$ grows as $\mu/\log\mu$. This is consistent with our observation that values of $\mu$ between $3$ and $100$ work best.
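
The bound $(11)$ is straightforward to evaluate numerically. The sketch below is a direct transcription of the formula. The function name is hypothetical, and the defaults $\gamma = 1/375$ and $c = 6$ are illustrative assumptions (values of this order are quoted in $[1]$ for typical backtracking parameters and Newton tolerance), not values fixed by this chapter.

```python
import math


def barrier_newton_bound(m, mu, eps, t0, gamma=1.0 / 375.0, c=6.0):
    """Evaluate the bound (11) on the total number of Newton steps.

    Returns (number of outer iterations,
             bound on Newton steps per outer iteration,
             their product N).
    gamma and c are assumed, illustrative defaults.
    """
    if mu <= 1.0:
        raise ValueError("mu must be greater than 1")
    outer = math.ceil(math.log(m / (eps * t0)) / math.log(mu))
    per_centering = m * (mu - 1.0 - math.log(mu)) / gamma + c
    return outer, per_centering, outer * per_centering


# Tabulate the bound for a few values of mu (m, eps and t0 are arbitrary choices).
for mu in (1.05, 2.0, 20.0, 200.0):
    outer, inner, total = barrier_newton_bound(m=100, mu=mu, eps=1e-3, t0=1.0)
    print(f"mu={mu:7.2f}  outer={outer:4d}  per-centering<={inner:12.1f}  N<={total:14.1f}")
```

Evaluating it for several values of $\mu$ shows how the two factors trade off: the number of outer iterations decreases with $\mu$, while the bound on the Newton steps per centering increases.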

We conclude with a series of reminders about the limitations of the convergence analysis based on self-concordance:
*    The bound $(11)$ is not sharp. In fact, it predicts a number of steps which is orders of magnitude larger than the actual number. 
*    The bound predicts growth with $m$ (or $\sqrt{m}$), but from the simulations we saw that the number of steps is more or less constant for any $m$.
*    Consequently, the bound $(11)$ has little practical value. However, it has theoretical value: it shows that our observations in the simulations (such as independence of $n$) are mathematically true.
*    Self-concordance is used for this analysis, but the barrier method behaves just as well for functions which are not self-concordant. In summary, all we can say is that when the self-concordance condition holds, we can give a worst-case complexity bound.

# 9.6 Interior-Point Methods with Generalised Inequalities

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/lY13PkuCauM"></iframe>')

**Errata:** In the video $y_1$, $y_n$, $y_{n+1}$ appear in the gradient of the logarithm of the second-order cone (at e.g. 8:20). These should be $x_1$, $x_n$, $x_{n+1}$.

We now look at the case of generalised inequalities. We keep the mathematical expressions to a minimum by highlighting only the differences with the scalar case. Consider problem $(1)$ where $f_i(x)\le 0$ is replaced by $f_i(x) \preceq_{K_i} 0$, where $K_i$ is a proper cone. All functions $f_i$ are convex with respect to the proper cones $K_i$. The rest of the assumptions are the same. Problems of greatest interest are the SOCPs and SDPs. 

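The KKT conditions are the same as $(2)$ once we replace $f_i(x^*)\le 0$ with $f_i(x^*) \preceq_{K_i} 0$, and $\lambda_i^* \ge 0$ with $\lambda_i^* \succeq_{K_i^*} 0$. Since $f_i$ is now vector valued, the products in $(2)$ are read accordingly: $\lambda_i^* \nabla f_i(x^*)$ becomes $Df_i(x^*)^\top \lambda_i^*$, with $Df_i$ the Jacobian of $f_i$, and $\lambda_i^* f_i(x^*)$ becomes $\lambda_i^{*\top} f_i(x^*)$.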

The development of the barrier method is parallel to the case with scalar constraints once we develop a generalisation of the logarithm function that applies to general
proper cones. Then we can define a logarithmic barrier function. From that point on, the development is essentially the same as in the scalar case. In particular, the central path, barrier method, and complexity analysis are very similar.

We first define the analog of the logarithm, $\log x$, for a proper cone $K \subseteq \mathbb{R}^q$. We
say that $\psi : \mathbb{R}^q \to \mathbb{R}$ is a *generalised logarithm* for $K$ if: 
*    $\psi$ is concave, closed, twice continuously differentiable, $\textbf{dom }\psi = \textbf{int }K$, and $\nabla^2\psi(y) \prec 0$ for $y \in \textbf{int }K$.
*    There is a constant $\theta > 0$ such that for all $y \succ_K 0$, and all $s > 0$, 
$$
\psi(sy) = \psi(y) + \theta \log s.
$$
In other words, $\psi$ behaves like a logarithm along any ray in the cone $K$.

We call the constant $\theta$ the *degree* of $\psi$. Note that a generalised logarithm is only defined up to an additive constant: if $\psi$ is a generalised logarithm for $K$, then so is $\psi + a$, where $a \in\mathbb{R}$. The ordinary logarithm is, of course, a generalised logarithm for $\mathbb{R}_+$.

We will use the following two properties, which are satisfied by any generalised logarithm: If $y \succ_K 0$, then
$$
\nabla \psi (y) \succ_{K^*} 0,
$$
which implies $\psi$ is $K$-increasing, and
$$
y^\top \nabla \psi(y) = \theta.
$$

This is a useful property because it tells us the degree of the generalised logarithm.

---

**Example 9.1:** (*Nonnegative orthant*). The function $\psi(x) = \sum_{i=1}^n\log x_i$ is a generalised logarithm for $K = \mathbb{R}_+^n$. For $x\succ 0$,
$$
\nabla \psi(x) = \left[\frac{1}{x_1},\dots,\frac{1}{x_n}\right]
$$
so $\nabla \psi(x) \succ 0$ and $x^\top \nabla \psi(x) = n$.

**Example 9.2:** (*Second-order cone*). The function 
$$
\psi(x) = \log\left(x_{n+1}^2 - \sum_{i=1}^n x_i^2\right)
$$ 
is a generalised logarithm for the second-order cone 
$$
K=\left\{x \in \mathbb{R}^{n+1}: \left(\sum_{i=1}^n x_i^2\right)^{1/2} \le x_{n+1} \right\}.
$$
The gradient of $\psi$ is given by
$$
\nabla \psi(x) = \frac{2}{x_{n+1}^2 - \sum_{i=1}^n x_i^2}\left[\begin{array}{r}-x_1\\\vdots\\-x_n\\x_{n+1}\end{array}\right]
$$
so $x^\top \nabla \psi(x) = 2$. The logarithm of the second-order cone has degree $2$. 

**Example 9.3:** (*Positive semidefinite cone*). The function $\psi(X) = \log \det X$ is a generalised logarithm for $K = \mathbb{S}_+^p$. The degree is $p$ since
$$
\log\det(sX) = \log \det X + p \log s
$$
for $s > 0$. The gradient of $\psi$ at a point $X \in \mathbb{S}_{++}^p$ is equal to
$$
\nabla \psi(X) = X^{-1}
$$
so $\nabla \psi(X) = X^{-1} \succ 0$ and $\textbf{Tr}(X \nabla \psi(X)) = p$.
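
The degrees in Examples 9.1 and 9.2 can be checked numerically in two ways: from the scaling identity $\psi(sy) = \psi(y) + \theta\log s$ and from $y^\top\nabla\psi(y) = \theta$. The sketch below uses only the formulas given in the examples. The function names and test points are hypothetical, and the positive semidefinite cone is left out to keep the code free of linear-algebra dependencies.

```python
import math


def psi_orthant(x):
    """Generalised logarithm of the nonnegative orthant (Example 9.1)."""
    return sum(math.log(xi) for xi in x)


def grad_orthant(x):
    return [1.0 / xi for xi in x]


def psi_soc(x):
    """Generalised logarithm of the second-order cone (Example 9.2)."""
    return math.log(x[-1] ** 2 - sum(xi ** 2 for xi in x[:-1]))


def grad_soc(x):
    d = x[-1] ** 2 - sum(xi ** 2 for xi in x[:-1])
    return [-2.0 * xi / d for xi in x[:-1]] + [2.0 * x[-1] / d]


def degree_from_gradient(x, grad):
    """theta = y^T grad psi(y)."""
    return sum(xi * gi for xi, gi in zip(x, grad(x)))


def degree_from_scaling(x, psi, s=3.0):
    """theta = (psi(s*y) - psi(y)) / log(s)."""
    return (psi([s * xi for xi in x]) - psi(x)) / math.log(s)


y_orthant = [1.0, 2.0, 0.5]   # strictly positive point in R^3
y_soc = [1.0, 2.0, 3.0]       # interior point: sqrt(1 + 4) < 3

print(degree_from_gradient(y_orthant, grad_orthant), degree_from_scaling(y_orthant, psi_orthant))
print(degree_from_gradient(y_soc, grad_soc), degree_from_scaling(y_soc, psi_soc))
```

Both computations should agree, up to rounding, with the degrees $n = 3$ for the orthant and $2$ for the second-order cone.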

---

Returning to the development of the barrier method, the logarithmic barrier function is given by

$$
\phi(x) = - \sum_{i=1}^m \psi_i(-f_i(x)) \qquad \textbf{dom }\phi = \{ x: f_i(x) \prec_{K_i} 0, \,i=1,\dots, m\}
$$

where $\psi_i$ is the generalised logarithm for cone $K_i$. The function $\phi$ is convex because the functions $\psi_i$ are concave and $K_i$-increasing and the functions $f_i$ are $K_i$-convex.

The rest of the discussion made for the scalar case follows identically once we replace

$$
\lambda^*_i(t) = -\frac{1}{t f_i(x^*(t))}
$$

with

$$
\lambda^*_i(t) = \frac{1}{t} \nabla \psi_i (- f_i(x^*(t)))
$$

and the bound

$$
f_0(x^*(t)) - p^* \le \frac{m}{t}
$$

with the bound

$$
f_0(x^*(t)) - p^* \le \frac{\bar{m}}{t}
$$

where 

$$
\bar{m} = \sum_{i=1}^m \theta_i.
$$
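
The constant $\bar{m}$ comes from the second property of generalised logarithms. As a short sketch of the argument, write $x = x^*(t)$, $\lambda_i = \lambda_i^*(t)$ and let $\nu$ be the corresponding dual variable for the equality constraint. Since $x$ minimises the Lagrangian for these multipliers, $Ax = b$, and $y^\top \nabla\psi_i(y) = \theta_i$ at $y = -f_i(x)$, we have

$$
g(\lambda,\nu) = f_0(x) + \sum_{i=1}^m \lambda_i^\top f_i(x) = f_0(x) - \frac{1}{t}\sum_{i=1}^m \left(-f_i(x)\right)^\top \nabla\psi_i(-f_i(x)) = f_0(x) - \frac{\bar{m}}{t},
$$

so that $f_0(x^*(t)) - p^* \le f_0(x^*(t)) - g(\lambda,\nu) = \bar{m}/t$.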

Everything else is unchanged. The plots corresponding to Figures 9.4, 9.5 and 9.6 are also virtually identical when the problem is an SOCP or an SDP rather than the LP/GP used in the scalar case. 


<div>
<img src="https://drive.google.com/uc?export=view&id=1qj6XbtleZKf-c7-rDCCHkQdz6u-gd1-E" width="800"/>
</div>

Figure 9.9. *(Left) Progress of the barrier method for an SOCP, showing duality gap versus cumulative number of Newton steps. (Right) Trade-off in the choice of the parameter $\mu$, for a small SOCP. Source: page 603 of $[1]$.*

Likewise the complexity analysis is identical and so is the bound $(11)$ (once $m$ is replaced by $\bar{m}$).

It is interesting to note that the degree $\theta_i$ can be interpreted as a measure of equivalence with respect to the scalar case, i.e. $\theta_i$ tells us that the generalised constraint $i$ corresponds to $\theta_i$ scalar constraints.
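
As a small illustration of this interpretation, the sketch below adds up the degrees from Examples 9.1 to 9.3 for a problem with several conic constraints and evaluates the duality gap bound $\bar{m}/t$. The cone labels, function names and problem sizes are hypothetical choices made for this example.

```python
def cone_degree(kind, size):
    """Degree theta of the generalised logarithms in Examples 9.1 to 9.3."""
    if kind == "orthant":   # R^size_+ with psi = sum of log x_i
        return size
    if kind == "soc":       # second-order cone in R^size: degree 2 for any size
        return 2
    if kind == "psd":       # size x size PSD matrices with psi = log det X
        return size
    raise ValueError(f"unknown cone type: {kind}")


def total_degree(cones):
    """m_bar = sum of the degrees theta_i over all constraints."""
    return sum(cone_degree(kind, size) for kind, size in cones)


def gap_bound(cones, t):
    """Upper bound m_bar / t on f_0(x*(t)) - p*."""
    return total_degree(cones) / t


# Five scalar inequalities, one SOC constraint in R^10, one 4 x 4 LMI.
cones = [("orthant", 5), ("soc", 10), ("psd", 4)]
print(total_degree(cones), gap_bound(cones, t=100.0))
```

Here the second-order cone constraint counts as $2$ scalar constraints regardless of its dimension, whereas the $4 \times 4$ matrix inequality counts as $4$, giving $\bar{m} = 11$.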

# 9.7 Primal-Dual Interior-Point Methods

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="480" src="https://www.youtube.com/embed/saxeKI5UwXk"></iframe>')

The barrier method is just a basic example of an interior-point method. In this module we will not look into other methods (just as we did not look into alternatives to Newton's method for unconstrained optimisation). The aim here is to explain the main ideas behind convex optimisation algorithms. Replacing Newton's method or the barrier method with other algorithms does not change those ideas and properties, which belong to convex problems rather than to a specific algorithm. In any case, your favourite solver will use the right method when needed. If you have to implement your own solver, then you will have to look at the research literature relevant to your specific application to check whether better alternatives to the barrier method exist.

Nevertheless, we now mention some details of the methods called *primal-dual interior-point methods* just to give you a flavour of the potential benefits with respect to the basic barrier method.

Primal-dual interior-point methods are very similar to the barrier method, with some differences.
*    There is only one type of loop/iteration, i.e., there is no distinction between inner and outer iterations as in the barrier method. At each iteration, both the primal and dual variables are updated.
*    The search directions in a primal-dual interior-point method are obtained from Newton’s method applied to modified KKT equations (i.e., the optimality conditions for the logarithmic barrier centering problem). The primal-dual search directions are similar, but not identical, to the search directions that arise in the barrier method.
*    In a primal-dual interior-point method, the primal and dual iterates are not necessarily feasible.
*    Primal-dual interior-point methods are often more efficient than the barrier method, especially when high accuracy is required, since they can exhibit better than linear convergence. For several basic problem classes, such as linear, quadratic, second-order cone, geometric, and semidefinite programming, customized primal-dual methods outperform the barrier method. 
*    Another advantage of primal-dual algorithms over the barrier method is that they can work when the problem is feasible, but not strictly feasible.

For a few more details on primal-dual methods you can check $[1]$, but if you want a more thorough exposition, you will need to read dedicated literature.

# End of CHAPTER 9