# Input Convex Neural Networks




## 1. Introduction

The paper introduces a new neural network architecture: the input convex neural network (ICNN).

These are scalar-valued neural networks $f(x, y ; \theta)$, where $x$ and $y$ denote inputs to the function and $\theta$ denotes the parameters, built in such a way that the network is convex in (a subset of) the inputs $y$.

Fundamental benefit: we can optimize over the convex inputs to the network given some fixed value for the other inputs. That is, given some fixed $x$ (and possibly some fixed elements of $y$), we can globally and efficiently (because the problem is convex) solve the optimization problem

$$
\begin{equation*}
\underset{y}{\operatorname{argmin}} f(x, y ; \theta) \tag{1}
\end{equation*}
$$

This formalism lets us perform inference in the network via optimization. Instead of making predictions in a neural network via a purely feedforward process, we can make predictions by optimizing a scalar function (which effectively plays the role of an energy function) over some inputs to the function given the others.
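
As a minimal sketch of prediction by optimization, the snippet below runs plain gradient descent over $y$ on a hand-written toy energy that is convex in $y$ for every fixed $x$. The energy itself, the helper names (`energy`, `energy_grad_y`, `predict`), the step size, and the iteration count are all illustrative assumptions; this is not the inference procedure developed in the paper.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def softplus(t):
    # Numerically stable log(1 + exp(t)); convex and non-decreasing in t.
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def energy(x, y):
    # Toy energy (an assumption for illustration): a non-negative combination
    # of softplus units applied to affine functions of y, hence convex in y.
    return softplus(y + x) + 2.0 * softplus(-(y + x))


def energy_grad_y(x, y):
    # Derivative of the toy energy with respect to y.
    return sigmoid(y + x) - 2.0 * sigmoid(-(y + x))


def predict(x, y0=0.0, step=1.0, iters=200):
    # Inference: approximate argmin_y energy(x, y) by gradient descent on y.
    y = y0
    for _ in range(iters):
        y -= step * energy_grad_y(x, y)
    return y
```

For this toy energy the minimizer is available in closed form, $y^{\star}=\ln 2-x$, and because the problem is convex in $y$ the iteration approaches it from any starting point `y0`.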

Potential use cases for these networks:

Structured prediction: 

Given (typically high-dimensional) structured input and output spaces $\mathcal{X} \times \mathcal{Y}$, we can build a network over $(x, y)$ pairs that encodes the energy function for this pair, following typical energy-based learning formalisms (LeCun et al., 2006). 

Prediction involves finding the $y \in \mathcal{Y}$ that minimizes the energy for a given $x$, which is exactly the argmin problem in (1). 


In our setting, assuming that $\mathcal{Y}$ is a convex space (a common assumption in structured prediction), this optimization problem is convex. 

This is similar in nature to the structured prediction energy networks (SPENs) (Belanger \& McCallum, 2016), which also use deep networks over the input and output spaces, with the difference being that in our setting $f$ is convex in $y$, so the optimization can be performed globally.


Data imputation:

If we are given some space $\mathcal{Y}$, we can learn a network $f(y ; \theta)$ (removing the additional $x$ inputs, though these can be added as well) that, given an example with some subset $\mathcal{I}$ missing, imputes the likely values of these variables by solving the optimization problem as above: $\hat{y}_{\mathcal{I}}=\operatorname{argmin}_{y_{\mathcal{I}}} f\left(y_{\mathcal{I}}, y_{\overline{\mathcal{I}}} ; \theta\right)$. This could be used, e.g., in image inpainting, where the goal is to fill in some arbitrary set of missing pixels given observed ones.
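
The imputation step can be sketched as gradient descent restricted to the missing coordinates $y_{\mathcal{I}}$ while the observed coordinates stay fixed. The quadratic smoothness energy below is only a convex stand-in for a trained $f(y ; \theta)$, and the names `energy`, `energy_grad`, and `impute`, as well as the step size, are assumptions made for this example.

```python
def energy(y):
    # Stand-in convex energy (an assumption): neighbouring entries should be
    # similar, plus a small quadratic penalty on every entry.
    smooth = sum(0.5 * (y[i] - y[i + 1]) ** 2 for i in range(len(y) - 1))
    return smooth + 0.05 * sum(v * v for v in y)


def energy_grad(y):
    # Gradient of the energy above with respect to every entry of y.
    g = [0.1 * v for v in y]
    for i in range(len(y) - 1):
        d = y[i] - y[i + 1]
        g[i] += d
        g[i + 1] -= d
    return g


def impute(y_obs, missing, step=0.3, iters=500):
    # y_obs holds the observed values; entries whose index is in `missing`
    # are ignored and initialised to 0 before being optimised.
    y = [0.0 if i in missing else v for i, v in enumerate(y_obs)]
    for _ in range(iters):
        g = energy_grad(y)
        for i in missing:  # update only the missing coordinates
            y[i] -= step * g[i]
    return y


filled = impute([1.0, None, 3.0], missing={1})
```

With the middle entry missing, the optimality condition for this stand-in energy is $2.1\,y_{1}-4=0$, so `filled[1]` converges to $4/2.1$ while the observed entries are left untouched.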

Continuous action reinforcement learning:

Given a reinforcement learning problem with potentially continuous state and action spaces $\mathcal{S} \times \mathcal{A}$, we can model the (negative) $Q$ function, $-Q(s, a ; \theta)$, as an input convex neural network. In this case the action selection procedure can be formulated as a convex optimization problem $a^{\star}(s)=\operatorname{argmin}_{a}-Q(s, a ; \theta)$.

This paper lays the foundation for optimization, inference, and learning in these input convex models, and explores their performance in the applications above. Our main contributions are: we propose the ICNN architecture and a partially convex variant; we develop efficient optimization and inference procedures that are well-suited to the complexity of these specific models; we propose techniques for training these models, based upon either max-margin structured prediction or direct differentiation of the argmin operation; and we evaluate the system on multi-label prediction, image completion, and reinforcement learning domains; in many of these settings we show performance that improves upon the state of the art.

## 3. Convex neural network architectures

Chief claim: the class of (full and partial) input convex models is rich.

### 3.1. Fully input convex neural networks

Consider a fully convex, $k$-layer, fully connected ICNN, which we call a FICNN (shown in Figure 1). This model defines a neural network over the input $y$ (i.e., omitting any $x$ term in this function) using the architecture, for $i=0, \ldots, k-1$,

$$
\begin{equation*}
z_{i+1}=g_{i}\left(W_{i}^{(z)} z_{i}+W_{i}^{(y)} y+b_{i}\right), \quad f(y ; \theta)=z_{k} \tag{2}
\end{equation*}
$$

where $z_{i}$ denotes the layer activations (with $z_{0}, W_{0}^{(z)} \equiv 0$ ), $\theta=\left\{W_{0: k-1}^{(y)}, W_{1: k-1}^{(z)}, b_{0: k-1}\right\}$ are the parameters, and $g_{i}$ are non-linear activation functions. The central result on convexity of the network is the following:

Proposition 1. The function $f$ is convex in $y$ provided that all $W_{1: k-1}^{(z)}$ are non-negative, and all functions $g_{i}$ are convex and non-decreasing.

Proof: this follows from the facts that non-negative sums of convex functions are convex and that the composition of a convex function with a convex non-decreasing function is convex (see e.g. Boyd \& Vandenberghe (2004, 3.2.4)). The constraint that the $g_{i}$ be convex and non-decreasing is not particularly restrictive, as current nonlinear activation units like the rectified linear unit or max-pooling unit already satisfy it. The constraint that the $W^{(z)}$ terms be non-negative is somewhat restrictive, but because the bias terms and $W^{(y)}$ terms can be negative, the network still has substantial representation power, as we will shortly demonstrate empirically.
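
The recurrence (2) and Proposition 1 can be illustrated with a small pure-Python sketch. It assumes softplus for the hidden activations and the identity for the last one (both convex and non-decreasing), draws Gaussian random weights, and takes absolute values to make the $W^{(z)}$ entries non-negative. The helper names (`make_ficnn`, `ficnn`, `max_convexity_violation`) are made up for this example, and the midpoint test is a numerical sanity check, not a proof.

```python
import math
import random


def softplus(t):
    # Numerically stable log(1 + exp(t)); convex and non-decreasing.
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def make_ficnn(sizes, p, rng):
    # sizes = [m_1, ..., m_k] with m_k = 1; p is the dimension of y.
    layers, m_prev = [], 0
    for m in sizes:
        Wz = [[abs(rng.gauss(0, 1)) for _ in range(m_prev)] for _ in range(m)]  # >= 0
        Wy = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(m)]  # any sign
        b = [rng.gauss(0, 1) for _ in range(m)]
        layers.append((Wz, Wy, b))
        m_prev = m
    return layers


def ficnn(layers, y):
    z = []  # z_0 is empty, so the W_0^{(z)} term vanishes
    for i, (Wz, Wy, b) in enumerate(layers):
        pre = [a + c + d for a, c, d in zip(matvec(Wz, z), matvec(Wy, y), b)]
        last = i == len(layers) - 1
        z = pre if last else [softplus(t) for t in pre]
    return z[0]


def max_convexity_violation(layers, p, rng, trials=200):
    # Largest observed value of f(lam*y1 + (1-lam)*y2) - [lam*f(y1) + (1-lam)*f(y2)];
    # for a convex f this is never positive (up to rounding).
    worst = 0.0
    for _ in range(trials):
        y1 = [rng.uniform(-3, 3) for _ in range(p)]
        y2 = [rng.uniform(-3, 3) for _ in range(p)]
        lam = rng.random()
        mid = [lam * a + (1 - lam) * c for a, c in zip(y1, y2)]
        rhs = lam * ficnn(layers, y1) + (1 - lam) * ficnn(layers, y2)
        worst = max(worst, ficnn(layers, mid) - rhs)
    return worst
```

Note that only the weights acting on hidden units are constrained; the passthrough weights `Wy` and the biases keep arbitrary signs, which is where the remaining representational freedom comes from.
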

One notable addition in the ICNN is the set of "passthrough" layers that directly connect the input $y$ to hidden units in deeper layers. Such layers are unnecessary in traditional feedforward networks because previous hidden units can always be mapped to subsequent hidden units with the identity mapping; however, for ICNNs, the non-negativity constraint on subsequent $W^{(z)}$ weights restricts the allowable use of hidden units that mirror the identity mapping, and so we explicitly include this additional passthrough. Some passthrough layers have been recently explored in the deep residual networks (He et al., 2015) and densely connected convolutional networks (Huang et al., 2016), though these differ from those of an ICNN as they pass through hidden layers deeper in the network, whereas to maintain convexity our passthrough layers can only apply to the input directly.

![](https://cdn.mathpix.com/cropped/2024_11_05_2df55116d7c46344fa72g-03.jpg?height=324&width=709&top_left_y=212&top_left_x=1109)

Figure 2. A partially input convex neural network (PICNN).

Other linear operators like convolutions can be included in ICNNs without changing the convexity properties. Indeed, modern feedforward architectures such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan \& Zisserman, 2014), and GoogLeNet (Szegedy et al., 2015) with ReLUs (Nair \& Hinton, 2010) can be made input convex by imposing the constraints of Proposition 1. In the experiments that follow, we will explore ICNNs with both fully connected and convolutional layers, and we provide more detail about these additional architectures in Section A of the supplement.

### 3.2. Partially input convex architectures

The FICNN provides joint convexity over the entire input to the function, which indeed may be a restriction on the allowable class of models. Furthermore, this full joint convexity is unnecessary in settings like structured prediction where the neural network is used to build a joint model over an input and output example space and only convexity over the outputs is necessary.

In this section we propose an extension to the pure FICNN, the partially input convex neural network (PICNN), that is convex over only some inputs to the network (in general ICNNs will refer to this new class). As we will show, these networks generalize both traditional feedforward networks and FICNNs, and thus provide substantial representational benefits. We define a PICNN to be a network over $(x, y)$ pairs $f(x, y ; \theta)$ where $f$ is convex in $y$ but need not be convex in $x$. Figure 2 illustrates one potential $k$-layer PICNN architecture defined by the recurrences

$$
\begin{align*}
u_{i+1} & =\tilde{g}_{i}\left(\tilde{W}_{i} u_{i}+\tilde{b}_{i}\right) \\
z_{i+1} & =g_{i}\left(W_{i}^{(z)}\left(z_{i} \circ\left[W_{i}^{(z u)} u_{i}+b_{i}^{(z)}\right]_{+}\right)+\right. \\
& \left.W_{i}^{(y)}\left(y \circ\left(W_{i}^{(y u)} u_{i}+b_{i}^{(y)}\right)\right)+W_{i}^{(u)} u_{i}+b_{i}\right) \\
f(x, y ; \theta) & =z_{k}, u_{0}=x \tag{3}
\end{align*}
$$

where $u_{i} \in \mathbb{R}^{n_{i}}$ and $z_{i} \in \mathbb{R}^{m_{i}}$ denote the hidden units for the " $x$-path" and " $y$-path", $y \in \mathbb{R}^{p}$, and $\circ$ denotes the Hadamard product, the elementwise product between two vectors. The crucial element here is that unlike the FICNN, we only need the $W^{(z)}$ terms to be non-negative, and we can introduce arbitrary products between the $u_{i}$ hidden units and the $z_{i}$ hidden units. The following proposition highlights the representational power of the PICNN.

Proposition 2. A PICNN network with $k$ layers can represent any FICNN with $k$ layers and any purely feedforward network with $k$ layers.

Proof. To recover a FICNN we simply set the weights over the entire $x$ path to be zero and set $b^{(z)}=b^{(y)}=1$. We can recover a feedforward network by noting that a traditional feedforward network $\hat{f}(x ; \theta)$, where $\hat{f}: \mathcal{X} \rightarrow \mathcal{Y}$, can be viewed as a network with an inner product $\hat{f}(x ; \theta)^{T} y$ in its last layer (see e.g. LeCun et al. (2006) for more details). Thus, a feedforward network can be represented as a PICNN by setting the $x$ path to be exactly the feedforward component, then having the $y$ path be all zero except $W_{k-1}^{(y u)}=I$ and $W_{k-1}^{(y)}=1^{T}$.
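
A small pure-Python sketch of the recurrences (3) makes the first half of this argument concrete. It assumes tanh on the $x$-path, softplus on the hidden $y$-path layers, the identity on the last layer, a constant $x$-path width `n`, and Gaussian random parameters; the dictionary keys and the helpers `make_picnn`, `picnn`, and `as_ficnn` are names invented for this example. `as_ficnn` applies the reduction from the proof: every weight that multiplies $u_{i}$ is zeroed and $b^{(z)}=b^{(y)}=1$.

```python
import math
import random


def softplus(t):
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vecs):
    return [sum(vals) for vals in zip(*vecs)]


def hadamard(a, b):
    return [s * t for s, t in zip(a, b)]


def rand_mat(rows, cols, rng, nonneg=False):
    draw = (lambda: abs(rng.gauss(0, 1))) if nonneg else (lambda: rng.gauss(0, 1))
    return [[draw() for _ in range(cols)] for _ in range(rows)]


def make_picnn(n, p, sizes, rng):
    # n: width of the x-path, p: dimension of y, sizes = [m_1, ..., m_k], m_k = 1.
    layers, m_prev = [], 0
    for m in sizes:
        layers.append({
            "Wt": rand_mat(n, n, rng), "bt": rand_mat(1, n, rng)[0],
            "Wzu": rand_mat(m_prev, n, rng), "bz": rand_mat(1, m_prev, rng)[0],
            "Wz": rand_mat(m, m_prev, rng, nonneg=True),  # the only constrained weights
            "Wyu": rand_mat(p, n, rng), "by": rand_mat(1, p, rng)[0],
            "Wy": rand_mat(m, p, rng),
            "Wu": rand_mat(m, n, rng), "b": rand_mat(1, m, rng)[0],
        })
        m_prev = m
    return layers


def picnn(layers, x, y):
    u, z = x, []  # u_0 = x; z_0 is empty
    for i, P in enumerate(layers):
        gate_z = [max(t, 0.0) for t in add(matvec(P["Wzu"], u), P["bz"])]  # [.]_+
        gate_y = add(matvec(P["Wyu"], u), P["by"])
        pre = add(matvec(P["Wz"], hadamard(z, gate_z)),
                  matvec(P["Wy"], hadamard(y, gate_y)),
                  matvec(P["Wu"], u), P["b"])
        u = [math.tanh(t) for t in add(matvec(P["Wt"], u), P["bt"])]
        z = pre if i == len(layers) - 1 else [softplus(t) for t in pre]
    return z[0]


def as_ficnn(layers):
    # Reduction from the proof: zero the x-path couplings, set b^(z) = b^(y) = 1.
    out = []
    for P in layers:
        Q = dict(P)
        for name in ("Wt", "Wzu", "Wyu", "Wu"):
            Q[name] = [[0.0] * len(row) for row in P[name]]
        Q["bz"] = [1.0] * len(P["bz"])
        Q["by"] = [1.0] * len(P["by"])
        out.append(Q)
    return out
```

After the reduction both gates equal one and the $W^{(u)} u_{i}$ term vanishes, so each step collapses to $z_{i+1}=g_{i}(W_{i}^{(z)} z_{i}+W_{i}^{(y)} y+b_{i})$ and the output no longer depends on $x$.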

In [1]:
import torch
import torch.nn as nn

class StandardNN(nn.Module):
    def __init__(self):
        super(StandardNN, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 10)
        self.fc3 = nn.Linear(10, 1)
        self.activation = nn.ReLU()  # Non-linear activation

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        return self.fc3(x)


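In [2]:
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input convex network: convex in its scalar input y.

    Passthrough terms W^(y) y are omitted after the first layer,
    i.e. W_i^(y) = 0 for i >= 1 in Eq. (2).
    """

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 10)   # acts on the input y: weights may have any sign
        self.fc2 = nn.Linear(10, 10)  # acts on hidden units: weights must be non-negative
        self.fc3 = nn.Linear(10, 1)   # acts on hidden units: weights must be non-negative
        self.activation = nn.Softplus()  # convex and non-decreasing

    def forward(self, y):
        z = self.activation(self.fc1(y))
        # Take |W| on every forward pass so the non-negativity constraint of
        # Proposition 1 keeps holding while the parameters are trained.
        z = self.activation(F.linear(z, self.fc2.weight.abs(), self.fc2.bias))
        out = F.linear(z, self.fc3.weight.abs(), self.fc3.bias)
        # Optional convex term in the *input* y. It is not needed for convexity;
        # it only makes f strictly convex.
        return out + y**2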


In [6]:
# Use a scalar input for simplicity in this example
x = torch.tensor([1.0], requires_grad=True)

# Instantiate models
nn_model = StandardNN()
icnn_model = ICNN()

# Forward and backward pass through the standard NN
y_nn = nn_model(x)
y_nn_scalar = y_nn.sum()  # Reduce to a scalar before calling backward()
y_nn_scalar.backward()
# Clone the gradient: x.grad is zeroed in place and reused below, so a plain
# reference to it would end up holding the ICNN gradient as well.
grad_nn = x.grad.detach().clone()

# Clear the accumulated gradient before the second backward pass
x.grad.zero_()

# Forward and backward pass through the ICNN
y_icnn = icnn_model(x)
y_icnn_scalar = y_icnn.sum()  # Reduce to a scalar before calling backward()
y_icnn_scalar.backward()
grad_icnn = x.grad.detach().clone()

print(f"Standard NN Output: {y_nn_scalar.item()}, Gradient: {grad_nn.item()}")
print(f"ICNN Output: {y_icnn_scalar.item()}, Gradient: {grad_icnn.item()}")


The printed outputs and gradients change from run to run because both models are randomly initialized.
