\section{Introduction \& Literature Review}\label{sec1}
Pruning algorithms, as comprehensively surveyed by \cite{reed1993pruning}, are a useful set of heuristics designed to identify and remove elements from a neural network which are either redundant or do not significantly contribute to the output of the network. This is motivated by the observed tendency of neural networks to over-fit to the idiosyncrasies of their training data given too many trainable parameters or too few input patterns from which to generalize, as stated by \cite{chauvin1990generalization}.
Network architecture design and hyperparameter selection are inherently difficult tasks typically approached using a few well-known rules of thumb, e.g. various weight initialization procedures, choosing the width and number of layers, different activation functions, learning rates, momentum, etc. Some of this ``black art'' appears unavoidable. For problems which cannot be solved using linear threshold units alone, \cite{baum1989size} demonstrate that there is no way to precisely determine the appropriate size of a neural network a priori given any random set of training instances. Using too few neurons seems to inhibit learning, and so in practice it is common to attempt to over-parameterize networks initially using a large number of hidden units and weights, and then prune them afterwards if necessary. Of course, as the old saying goes, there's more than one way to skin a neural network.
\subsection{Non-Pruning Based Generalization \& Compression Techniques}
The generalization performance of neural networks has been well studied, and apart from pruning algorithms many heuristics have been used to avoid overfitting, such as dropout (\cite{srivastava2014dropout}), maxout (\cite{goodfellow2013maxout}), and cascade correlation (\cite{fahlman1989cascade}), among others. While cascade correlation specifically tries to construct minimal networks, many techniques that improve network generalization do not explicitly attempt to reduce the total number of parameters or the memory footprint of a trained network.
Model compression often has benefits with respect to generalization performance and the portability of neural networks to operate in memory-constrained or embedded environments. Without explicitly removing parameters from the network, weight quantization allows for a reduction in the number of bytes used to represent each weight parameter, as investigated by \cite{balzer1991weight}, \cite{dundar1994effects}, and \cite{hoehfeld1992learning}.
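As a minimal sketch of the idea (a generic uniform quantizer, not the exact scheme used in any of the cited works; all names here are our own), each float weight can be replaced by an 8-bit code plus a shared offset and scale:

```python
import numpy as np

def quantize_weights(w, n_bits=8):
    """Uniformly quantize weights to 2**n_bits levels over their observed range."""
    levels = 2 ** n_bits - 1
    w_min = w.min()
    scale = (w.max() - w_min) / levels
    # Map each weight to its nearest discrete level; codes fit in one byte.
    codes = np.round((w - w_min) / scale).astype(np.uint8)
    return codes, w_min, scale

def dequantize_weights(codes, w_min, scale):
    """Recover approximate float weights from the stored codes."""
    return codes.astype(np.float64) * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
codes, w_min, scale = quantize_weights(w)
w_hat = dequantize_weights(codes, w_min, scale)
# 8-bit codes need a quarter of float32 storage, and the round-trip
# error per weight is bounded by half a quantization step (scale / 2).
```

The same network topology is retained; only the per-parameter precision is reduced, which is why quantization composes naturally with pruning in later work.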
A recently proposed method for compressing recurrent neural networks (\cite{prabhavalkar2016compression}) uses the singular values of a trained weight matrix as basis vectors from which to derive a compressed hidden layer. Some other recent works like \cite{Anders2016quant} have tried successfully to achieve compression through weight quantization followed by an encoding step while others such as \cite{deepcompression2016} have tried to expand on this by adding weight-pruning as a preceding step to quantization and encoding.
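The low-rank flavor of this idea can be sketched as follows (layer sizes and the rank are hypothetical, chosen only for illustration; the cited work applies a related factorization to recurrent-layer weight matrices):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace W (m x n) with factors A (m x r) and B (r x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # m x r: left singular vectors scaled by singular values
    B = Vt[:rank, :]             # r x n: top right singular vectors
    return A, B

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 128))
A, B = low_rank_factorize(W, rank=32)
W_hat = A @ B  # rank-32 approximation of W
# Parameter count drops from m*n to r*(m+n) whenever r < m*n / (m + n);
# here 256*128 = 32768 parameters become 32*(256+128) = 12288.
```

Applying the layer as two smaller matrix multiplies (`x @ B.T @ A.T` style) also reduces compute, not just storage, when the rank is small.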
\section{Review of Pruning Algorithms}
In summary, there are many ways to improve network generalization: altering the training procedure, modifying the objective error function, or using compressed representations of the network parameters.
\subsection{Pruning Techniques}
If we wanted to continually shrink a network to its absolute minimal size in an optimal manner, we might accomplish this using any number of off-the-shelf pruning algorithms, such as Skeletonization (\cite{mozer1989skeletonization}), Optimal Brain Damage (\cite{lecun1989optimal}), or later variants such as Optimal Brain Surgeon (\cite{hassibi1993second}). In fact, we borrow much of our inspiration from these antecedent algorithms, with one major variation: \textit{Instead of pruning individual weights, we prune entire neurons, thereby eliminating all of their incoming and outgoing weight parameters in one go, resulting in more memory saved, faster.}
Scoring and ranking individual weight parameters in a large network is computationally expensive, and, generally speaking, removing a single weight does little to reduce a network's core memory footprint. We argue that pruning neurons instead of weights is more efficient both computationally and practically, in terms of quickly reaching a target reduction in memory size. Our approach also aims to give downstream applications a realistic expectation of the minimal increase in error resulting from the removal of a specified percentage of neurons. Such trade-offs are unavoidable, but performance impacts can be limited if a principled approach is used to select candidate neurons for removal.
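To make the bookkeeping concrete, here is a minimal sketch for a one-hidden-layer network (the simple magnitude-based score is a stand-in for a principled saliency measure, and all names are our own): dropping hidden unit $j$ deletes row $j$ of the input weight matrix, entry $j$ of the bias, and column $j$ of the output weight matrix, all at once.

```python
import numpy as np

def prune_neurons(W1, b1, W2, keep_frac):
    """Remove whole hidden units from a one-hidden-layer network.

    W1: (n_hidden, n_in) incoming weights; b1: (n_hidden,) biases;
    W2: (n_out, n_hidden) outgoing weights. Units are ranked by total
    absolute weight magnitude (an illustrative score only).
    """
    score = np.abs(W1).sum(axis=1) + np.abs(b1) + np.abs(W2).sum(axis=0)
    n_keep = max(1, int(round(keep_frac * W1.shape[0])))
    keep = np.sort(np.argsort(score)[-n_keep:])  # indices of surviving units
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(2)
W1 = rng.normal(size=(100, 20))
b1 = rng.normal(size=100)
W2 = rng.normal(size=(5, 100))
W1p, b1p, W2p = prune_neurons(W1, b1, W2, keep_frac=0.5)
# Each pruned unit here removes 20 + 1 + 5 = 26 parameters in one step,
# versus exactly 1 parameter for single-weight pruning.
```

The remaining layers stay dense and conventionally shaped, so the pruned network runs on standard dense linear algebra with no sparse-format overhead.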