@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.6.30) 4 NOV 2016 15:44
This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.6.30) 4 NOV 2016 16:07
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -1446,66 +1446,67 @@ fonts/map/pdftex/updmap/pdftex.map}
]) (./methodology.tex [2]
(./taylor_series.tex) (./less_math_error_approximation.tex [3])
(./pruning_algorithm.tex
<png/intuition_mini.png, id=196, 492.33937pt x 253.69781pt>
<png/intuition_mini.png, id=200, 492.33937pt x 253.69781pt>
File: png/intuition_mini.png Graphic file (type png)

<use png/intuition_mini.png>
Package pdftex.def Info: png/intuition_mini.png used on input line 3.
(pdftex.def) Requested size: 397.48499pt x 204.82826pt.
[4 <./png/intuition_mini.png>])) (./results.tex
[5]
<png/mnist-acc99-single-pass-method.pdf, id=223, 392.87402pt x 262.28615pt>
<png/mnist-acc99-single-pass-method.pdf, id=227, 392.87402pt x 262.28615pt>
File: png/mnist-acc99-single-pass-method.pdf Graphic file (type pdf)

<use png/mnist-acc99-single-pass-method.pdf>
Package pdftex.def Info: png/mnist-acc99-single-pass-method.pdf used on input l
ine 11.
(pdftex.def) Requested size: 194.76982pt x 130.03413pt.

<png/mnist-acc99-single-pass-accuracy.pdf, id=224, 385.3459pt x 266.80302pt>
<png/mnist-acc99-single-pass-accuracy.pdf, id=228, 385.3459pt x 266.80302pt>
File: png/mnist-acc99-single-pass-accuracy.pdf Graphic file (type pdf)

<use png/mnist-acc99-single-pass-accuracy.pdf>
Package pdftex.def Info: png/mnist-acc99-single-pass-accuracy.pdf used on input
line 12.
(pdftex.def) Requested size: 194.76982pt x 134.85048pt.
[6 <./png/mnist-acc99-single-pass-method.pdf> <./png/mnist-acc99-single-pass-a
ccuracy.pdf

<png/mnist-acc99-iterative-rerank-method.pdf, id=229, 392.87402pt x 262.28615pt
pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-acc99-single-pas
s-accuracy.pdf): PDF inclusion: multiple pdfs with page group included in a sin
gle page
>]
<png/mnist-acc99-iterative-rerank-method.pdf, id=334, 392.87402pt x 262.28615pt
>
File: png/mnist-acc99-iterative-rerank-method.pdf Graphic file (type pdf)
<use png/mnist-acc99-iterative-rerank-method.pdf>
Package pdftex.def Info: png/mnist-acc99-iterative-rerank-method.pdf used on in
put line 23.
(pdftex.def) Requested size: 194.76982pt x 130.03413pt.

<png/mnist-acc99-iterative-rerank-accuracy.pdf, id=230, 385.3459pt x 262.28615p
<png/mnist-acc99-iterative-rerank-accuracy.pdf, id=335, 385.3459pt x 262.28615p
t>
File: png/mnist-acc99-iterative-rerank-accuracy.pdf Graphic file (type pdf)
<use png/mnist-acc99-iterative-rerank-accuracy.pdf>
Package pdftex.def Info: png/mnist-acc99-iterative-rerank-accuracy.pdf used on
input line 24.
(pdftex.def) Requested size: 194.76982pt x 132.5675pt.

<png/mnist-acc99-gt-gain.pdf, id=233, 945.4384pt x 379.3234pt>
<png/mnist-acc99-gt-gain.pdf, id=338, 945.4384pt x 379.3234pt>
File: png/mnist-acc99-gt-gain.pdf Graphic file (type pdf)

<use png/mnist-acc99-gt-gain.pdf>
Package pdftex.def Info: png/mnist-acc99-gt-gain.pdf used on input line 41.
(pdftex.def) Requested size: 397.48499pt x 159.47679pt.
[6 <./png/mnist-acc99-single-pass-method.pdf> <./png/mnist-acc99-single-pass-a
ccuracy.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-acc99-single-pas
s-accuracy.pdf): PDF inclusion: multiple pdfs with page group included in a sin
gle page
>] <png/mnist-acc99-g1-gain.pdf, id=334, 945.4384pt x 379.3234pt>
<png/mnist-acc99-g1-gain.pdf, id=339, 945.4384pt x 379.3234pt>
File: png/mnist-acc99-g1-gain.pdf Graphic file (type pdf)

<use png/mnist-acc99-g1-gain.pdf>
Package pdftex.def Info: png/mnist-acc99-g1-gain.pdf used on input line 48.
(pdftex.def) Requested size: 397.48499pt x 159.47679pt.

<png/mnist-acc99-g2-gain.pdf, id=335, 945.4384pt x 379.3234pt>
<png/mnist-acc99-g2-gain.pdf, id=340, 945.4384pt x 379.3234pt>
File: png/mnist-acc99-g2-gain.pdf Graphic file (type pdf)

<use png/mnist-acc99-g2-gain.pdf>
@@ -1525,114 +1526,121 @@ f): PDF inclusion: multiple pdfs with page group included in a single page

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-acc99-g2-gain.pd
f): PDF inclusion: multiple pdfs with page group included in a single page
>] <png/mnist-deep-single-pass-method.pdf, id=568, 392.87402pt x 262.28615pt>
>] <png/mnist-deep-single-pass-method.pdf, id=574, 392.87402pt x 262.28615pt>
File: png/mnist-deep-single-pass-method.pdf Graphic file (type pdf)

<use png/mnist-deep-single-pass-method.pdf>
Package pdftex.def Info: png/mnist-deep-single-pass-method.pdf used on input li
ne 86.
(pdftex.def) Requested size: 194.76982pt x 130.03413pt.

<png/mnist-deep-single-pass-accuracy.pdf, id=569, 385.3459pt x 266.80302pt>
<png/mnist-deep-single-pass-accuracy.pdf, id=575, 385.3459pt x 266.80302pt>
File: png/mnist-deep-single-pass-accuracy.pdf Graphic file (type pdf)

<use png/mnist-deep-single-pass-accuracy.pdf>
Package pdftex.def Info: png/mnist-deep-single-pass-accuracy.pdf used on input
line 87.
(pdftex.def) Requested size: 194.76982pt x 134.85048pt.

<png/mnist-deep-iterative-rerank-method.pdf, id=573, 392.87402pt x 262.28615pt>
<png/mnist-deep-iterative-rerank-method.pdf, id=579, 392.87402pt x 262.28615pt>
File: png/mnist-deep-iterative-rerank-method.pdf Graphic file (type pdf)
<use png/mnist-deep-iterative-rerank-method.pdf>
Package pdftex.def Info: png/mnist-deep-iterative-rerank-method.pdf used on inp
ut line 101.
(pdftex.def) Requested size: 194.76982pt x 130.03413pt.

<png/mnist-deep-iterative-rerank-accuracy.pdf, id=574, 385.3459pt x 266.80302pt
<png/mnist-deep-iterative-rerank-accuracy.pdf, id=580, 385.3459pt x 266.80302pt
>
File: png/mnist-deep-iterative-rerank-accuracy.pdf Graphic file (type pdf)
<use png/mnist-deep-iterative-rerank-accuracy.pdf>
Package pdftex.def Info: png/mnist-deep-iterative-rerank-accuracy.pdf used on i
nput line 102.
(pdftex.def) Requested size: 194.76982pt x 134.85048pt.

Underfull \vbox (badness 6625) has occurred while \output is active []

[9 <./png/mnist-deep-single-pass-method.pdf> <./png/mnist-deep-single-pass-acc
uracy.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-single-pass
-accuracy.pdf): PDF inclusion: multiple pdfs with page group included in a sing
le page
> <./png/mnist-deep-iterative-rerank-method.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-iterative-r
erank-method.pdf): PDF inclusion: multiple pdfs with page group included in a s
ingle page
> <./png/mnist-deep-iterative-rerank-accuracy.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-iterative-r
erank-accuracy.pdf): PDF inclusion: multiple pdfs with page group included in a
single page
>] <png/mnist-deep-gt-gain.pdf, id=771, 945.4384pt x 379.3234pt>
>] <png/mnist-deep-gt-gain.pdf, id=683, 945.4384pt x 379.3234pt>
File: png/mnist-deep-gt-gain.pdf Graphic file (type pdf)

<use png/mnist-deep-gt-gain.pdf>
Package pdftex.def Info: png/mnist-deep-gt-gain.pdf used on input line 117.
(pdftex.def) Requested size: 397.48499pt x 159.47679pt.

<png/mnist-deep-g1-gain.pdf, id=772, 945.4384pt x 379.3234pt>
<png/mnist-deep-g1-gain.pdf, id=684, 945.4384pt x 379.3234pt>
File: png/mnist-deep-g1-gain.pdf Graphic file (type pdf)

<use png/mnist-deep-g1-gain.pdf>
Package pdftex.def Info: png/mnist-deep-g1-gain.pdf used on input line 124.
(pdftex.def) Requested size: 397.48499pt x 159.47679pt.

<png/mnist-deep-g2-gain.pdf, id=773, 945.4384pt x 379.3234pt>
<png/mnist-deep-g2-gain.pdf, id=685, 945.4384pt x 379.3234pt>
File: png/mnist-deep-g2-gain.pdf Graphic file (type pdf)

<use png/mnist-deep-g2-gain.pdf>
Package pdftex.def Info: png/mnist-deep-g2-gain.pdf used on input line 131.
(pdftex.def) Requested size: 397.48499pt x 159.47679pt.
[10 <./png/mnist-deep-gt-gain.pdf> <./png/mnist-deep-g1-gain.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-g1-gain.pdf
Underfull \vbox (badness 10000) has occurred while \output is active []

[10 <./png/mnist-deep-iterative-rerank-method.pdf> <./png/mnist-deep-iterative
-rerank-accuracy.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-iterative-r
erank-accuracy.pdf): PDF inclusion: multiple pdfs with page group included in a
single page
> <./png/mnist-deep-gt-gain.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-gt-gain.pdf
): PDF inclusion: multiple pdfs with page group included in a single page
>] <png/diamond.png, id=863, 881.2925pt x 332.24126pt>
>]
Underfull \vbox (badness 1048) has occurred while \output is active []

[11 <./png/mnist-deep-g1-gain.pdf> <./png/mnist-deep-g2-gain.pdf

pdfTeX warning: /Library/TeX/texbin/pdflatex (file ./png/mnist-deep-g2-gain.pdf
): PDF inclusion: multiple pdfs with page group included in a single page
>] <png/diamond.png, id=916, 881.2925pt x 332.24126pt>
File: png/diamond.png Graphic file (type png)
<use png/diamond.png>
Package pdftex.def Info: png/diamond.png used on input line 156.
(pdftex.def) Requested size: 317.9892pt x 119.88051pt.

<png/rshape.png, id=864, 875.27pt x 322.20375pt>
<png/rshape.png, id=917, 875.27pt x 322.20375pt>
File: png/rshape.png Graphic file (type png)
<use png/rshape.png>
Package pdftex.def Info: png/rshape.png used on input line 157.
(pdftex.def) Requested size: 317.9892pt x 117.06012pt.
)
(./conclusions.tex [11 <./png/mnist-deep-g2-gain.pdf>] [12 <./png/diamond.png>
<./png/rshape.png>]) (./iclr2017_conference.bbl [13

]) [14] (./diagram.tex
(./drawing.pdf_tex <drawing.pdf, id=935, page=1, 355.65945pt x 199.1693pt>
(./conclusions.tex [12 <./png/diamond.png> <./png/rshape.png>])
(./iclr2017_conference.bbl [13]) [14] (./diagram.tex (./drawing.pdf_tex
<drawing.pdf, id=941, page=1, 355.65945pt x 199.1693pt>
File: drawing.pdf Graphic file (type pdf)

<use drawing.pdf, page 1>
Package pdftex.def Info: drawing.pdf, page1 used on input line 52.
(pdftex.def) Requested size: 357.73405pt x 200.32971pt.

<drawing.pdf, id=936, page=2, 355.65945pt x 199.1693pt>
<drawing.pdf, id=942, page=2, 355.65945pt x 199.1693pt>
File: drawing.pdf Graphic file (type pdf)

<use drawing.pdf, page 2>
Package pdftex.def Info: drawing.pdf, page2 used on input line 58.
(pdftex.def) Requested size: 357.73405pt x 200.32971pt.

<drawing.pdf, id=937, page=3, 355.65945pt x 199.1693pt>
<drawing.pdf, id=943, page=3, 355.65945pt x 199.1693pt>
File: drawing.pdf Graphic file (type pdf)

<use drawing.pdf, page 3>
Package pdftex.def Info: drawing.pdf, page3 used on input line 61.
(pdftex.def) Requested size: 357.73405pt x 200.32971pt.

<drawing.pdf, id=938, page=4, 355.65945pt x 199.1693pt>
<drawing.pdf, id=944, page=4, 355.65945pt x 199.1693pt>
File: drawing.pdf Graphic file (type pdf)

<use drawing.pdf, page 4>
@@ -1673,7 +1681,7 @@ Package atveryend Info: Empty hook `AfterLastShipout' on input line 81.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 81.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 81.
Package rerunfilecheck Info: File `iclr2017_conference.out' has not changed.
(rerunfilecheck) Checksum: D787844ECBDD21416B638EED9F9A752F;2175.
(rerunfilecheck) Checksum: C07622087CD1A1A66A9DBD5509F563E0;2289.


LaTeX Warning: There were multiply-defined labels.
@@ -1685,13 +1693,13 @@ Package atveryend Info: Empty hook `AtVeryVeryEnd' on input line 81.
### simple group (level 1) entered at line 1061 ({)
### bottom level
Here is how much of TeX's memory you used:
15405 strings out of 494447
265882 string characters out of 6166766
352599 words of memory out of 5000000
18410 multiletter control sequences out of 15000+600000
15407 strings out of 494447
265922 string characters out of 6166766
352603 words of memory out of 5000000
18411 multiletter control sequences out of 15000+600000
20489 words of font info for 38 fonts, out of 8000000 for 9000
319 hyphenation exceptions out of 8191
40i,18n,45p,10423b,2109s stack positions out of 5000i,500n,10000p,200000b,80000s
40i,18n,45p,10423b,2117s stack positions out of 5000i,500n,10000p,200000b,80000s
{/usr/local/texlive/2016basic/texmf-dist/fonts/enc/dvips/base/8
r.enc}</usr/local/texlive/2016basic/texmf-dist/fonts/type1/public/amsfonts/cm/c
mex10.pfb></usr/local/texlive/2016basic/texmf-dist/fonts/type1/public/amsfonts/
@@ -1708,10 +1716,10 @@ ist/fonts/type1/urw/courier/ucrr8a.pfb></usr/local/texlive/2016basic/texmf-dist
/fonts/type1/urw/times/utmb8a.pfb></usr/local/texlive/2016basic/texmf-dist/font
s/type1/urw/times/utmr8a.pfb></usr/local/texlive/2016basic/texmf-dist/fonts/typ
e1/urw/times/utmri8a.pfb>
Output written on iclr2017_conference.pdf (21 pages, 760705 bytes).
Output written on iclr2017_conference.pdf (21 pages, 761715 bytes).
PDF statistics:
1168 PDF objects out of 1200 (max. 8388607)
539 compressed objects within 6 object streams
164 named destinations out of 1000 (max. 500000)
314 words of extra memory for PDF output out of 10000 (max. 10000000)
1174 PDF objects out of 1200 (max. 8388607)
545 compressed objects within 6 object streams
165 named destinations out of 1000 (max. 500000)
322 words of extra memory for PDF output out of 10000 (max. 10000000)

@@ -1,26 +1,27 @@
\BOOKMARK [1][-]{section.1}{Introduction \046 Literature Review}{}% 1
\BOOKMARK [1][-]{section.2}{Review of Pruning Algorithms}{}% 2
\BOOKMARK [1][-]{section.3}{Pruning Neurons to Shrink Neural Networks}{}% 3
\BOOKMARK [2][-]{subsection.3.1}{Brute Force Removal Approach}{section.3}% 4
\BOOKMARK [2][-]{subsection.3.2}{Taylor Series Representation of Error}{section.3}% 5
\BOOKMARK [3][-]{subsubsection.3.2.1}{Linear Approximation Approach}{subsection.3.2}% 6
\BOOKMARK [3][-]{subsubsection.3.2.2}{Quadratic Approximation Approach}{subsection.3.2}% 7
\BOOKMARK [2][-]{subsection.3.3}{Proposed Pruning Algorithm}{section.3}% 8
\BOOKMARK [3][-]{subsubsection.3.3.1}{Algorithm I: Single Overall Ranking}{subsection.3.3}% 9
\BOOKMARK [3][-]{subsubsection.3.3.2}{Algorithm II: Iterative Re-Ranking}{subsection.3.3}% 10
\BOOKMARK [1][-]{section.4}{Experimental Results}{}% 11
\BOOKMARK [2][-]{subsection.4.1}{Pruning a 1-Layer Network}{section.4}% 12
\BOOKMARK [3][-]{subsubsection.4.1.1}{Single Overall Ranking Algorithm}{subsection.4.1}% 13
\BOOKMARK [3][-]{subsubsection.4.1.2}{Iterative Re-Ranking Algorithm}{subsection.4.1}% 14
\BOOKMARK [3][-]{subsubsection.4.1.3}{Visualization of Error Surface \046 Pruning Decisions}{subsection.4.1}% 15
\BOOKMARK [2][-]{subsection.4.2}{Pruning A 2-Layer Network}{section.4}% 16
\BOOKMARK [3][-]{subsubsection.4.2.1}{Single Overall Ranking Algorithm}{subsection.4.2}% 17
\BOOKMARK [3][-]{subsubsection.4.2.2}{Iterative Re-Ranking Algorithm}{subsection.4.2}% 18
\BOOKMARK [3][-]{subsubsection.4.2.3}{Visualization of Error Surface \046 Pruning Decisions}{subsection.4.2}% 19
\BOOKMARK [2][-]{subsection.4.3}{Experiments on Toy Datasets}{section.4}% 20
\BOOKMARK [1][-]{section.5}{Conclusions \046 Future Work}{}% 21
\BOOKMARK [1][-]{appendix.A}{Second Derivative Back-Propagation}{}% 22
\BOOKMARK [2][-]{subsection.A.1}{First and Second Derivatives}{appendix.A}% 23
\BOOKMARK [3][-]{subsubsection.A.1.1}{Summary Of Output Layer Derivatives}{subsection.A.1}% 24
\BOOKMARK [3][-]{subsubsection.A.1.2}{Hidden Layer Derivatives}{subsection.A.1}% 25
\BOOKMARK [3][-]{subsubsection.A.1.3}{Summary Of Hidden Layer Derivatives}{subsection.A.1}% 26
\BOOKMARK [2][-]{subsection.1.1}{Non-Pruning Based Generalization \046 Compression Techniques}{section.1}% 2
\BOOKMARK [2][-]{subsection.1.2}{Pruning Techniques}{section.1}% 3
\BOOKMARK [1][-]{section.2}{Pruning Neurons to Shrink Neural Networks}{}% 4
\BOOKMARK [2][-]{subsection.2.1}{Brute Force Removal Approach}{section.2}% 5
\BOOKMARK [2][-]{subsection.2.2}{Taylor Series Representation of Error}{section.2}% 6
\BOOKMARK [3][-]{subsubsection.2.2.1}{Linear Approximation Approach}{subsection.2.2}% 7
\BOOKMARK [3][-]{subsubsection.2.2.2}{Quadratic Approximation Approach}{subsection.2.2}% 8
\BOOKMARK [2][-]{subsection.2.3}{Proposed Pruning Algorithm}{section.2}% 9
\BOOKMARK [3][-]{subsubsection.2.3.1}{Algorithm I: Single Overall Ranking}{subsection.2.3}% 10
\BOOKMARK [3][-]{subsubsection.2.3.2}{Algorithm II: Iterative Re-Ranking}{subsection.2.3}% 11
\BOOKMARK [1][-]{section.3}{Experimental Results}{}% 12
\BOOKMARK [2][-]{subsection.3.1}{Pruning a 1-Layer Network}{section.3}% 13
\BOOKMARK [3][-]{subsubsection.3.1.1}{Single Overall Ranking Algorithm}{subsection.3.1}% 14
\BOOKMARK [3][-]{subsubsection.3.1.2}{Iterative Re-Ranking Algorithm}{subsection.3.1}% 15
\BOOKMARK [3][-]{subsubsection.3.1.3}{Visualization of Error Surface \046 Pruning Decisions}{subsection.3.1}% 16
\BOOKMARK [2][-]{subsection.3.2}{Pruning A 2-Layer Network}{section.3}% 17
\BOOKMARK [3][-]{subsubsection.3.2.1}{Single Overall Ranking Algorithm}{subsection.3.2}% 18
\BOOKMARK [3][-]{subsubsection.3.2.2}{Iterative Re-Ranking Algorithm}{subsection.3.2}% 19
\BOOKMARK [3][-]{subsubsection.3.2.3}{Visualization of Error Surface \046 Pruning Decisions}{subsection.3.2}% 20
\BOOKMARK [2][-]{subsection.3.3}{Experiments on Toy Datasets}{section.3}% 21
\BOOKMARK [1][-]{section.4}{Conclusions \046 Future Work}{}% 22
\BOOKMARK [1][-]{appendix.A}{Second Derivative Back-Propagation}{}% 23
\BOOKMARK [2][-]{subsection.A.1}{First and Second Derivatives}{appendix.A}% 24
\BOOKMARK [3][-]{subsubsection.A.1.1}{Summary Of Output Layer Derivatives}{subsection.A.1}% 25
\BOOKMARK [3][-]{subsubsection.A.1.2}{Hidden Layer Derivatives}{subsection.A.1}% 26
\BOOKMARK [3][-]{subsubsection.A.1.3}{Summary Of Hidden Layer Derivatives}{subsection.A.1}% 27
BIN +1010 Bytes (100%) iclr2017_conference.pdf
BIN +2.07 KB (100%) iclr2017_conference.synctex.gz
@@ -1,15 +1,22 @@
\section{Introduction \& Literature Review}\label{sec1}

Pruning algorithms, as comprehensively surveyed by \cite{reed1993pruning}, are a useful set of heuristics designed to identify and remove elements from a neural network which are either redundant or do not significantly contribute to the output of the network. This is motivated by the observed tendency of neural networks to over-fit to the idiosyncrasies of their training data given too many trainable parameters or too few input patterns from which to generalize, as stated by \cite{chauvin1990generalization}.
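As a toy illustration of the general recipe (score network elements, then remove the least important), a magnitude-based weight-pruning pass might be sketched as follows; the assumption that small-magnitude weights contribute least is a common heuristic from this literature, not the method proposed in this paper:

```python
import numpy as np

def prune_smallest_weights(w, fraction):
    """Zero out the given fraction of weights with smallest magnitude."""
    w = w.copy()
    k = int(round(fraction * w.size))
    if k == 0:
        return w
    # Threshold at the k-th smallest |w|; everything at or below it is removed.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    w[np.abs(w) <= threshold] = 0.0
    return w

w = np.array([[0.8, -0.05], [0.01, -1.2]])
pruned = prune_smallest_weights(w, 0.5)  # removes the two smallest-|w| entries
```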

Network architecture design and hyperparameter selection are inherently difficult tasks typically approached using a few well-known rules of thumb, e.g. various weight initialization procedures, choosing the width and number of layers, different activation functions, learning rates, momentum, etc. Some of this ``black art'' appears unavoidable. For problems which cannot be solved using linear threshold units alone, \cite{baum1989size} demonstrate that there is no way to precisely determine the appropriate size of a neural network a priori given any random set of training instances. Using too few neurons seems to inhibit learning, and so in practice it is common to attempt to over-parameterize networks initially using a large number of hidden units and weights, and then prune them afterwards if necessary.
Network architecture design and hyperparameter selection are inherently difficult tasks typically approached using a few well-known rules of thumb, e.g. various weight initialization procedures, choosing the width and number of layers, different activation functions, learning rates, momentum, etc. Some of this ``black art'' appears unavoidable. For problems which cannot be solved using linear threshold units alone, \cite{baum1989size} demonstrate that there is no way to precisely determine the appropriate size of a neural network a priori given any random set of training instances. Using too few neurons seems to inhibit learning, and so in practice it is common to over-parameterize networks initially using a large number of hidden units and weights, and then prune them afterwards if necessary. Of course, over-parameterizing and then pruning is only one of several routes to a compact network.

\subsection{Non-Pruning Based Generalization \& Compression Techniques}

The generalization performance of neural networks has been well studied, and apart from pruning algorithms many heuristics have been used to avoid overfitting, such as dropout (\cite{srivastava2014dropout}), maxout (\cite{goodfellow2013maxout}), and cascade correlation (\cite{fahlman1989cascade}), among others. Compressing networks often has benefits with respect to generalization performance and the portability of neural networks to operate in memory-constrained or embedded environments. Without explicitly removing parameters from the network, weight quantization allows for a reduction in the number of bytes used to represent each weight parameter, as investigated by \cite{balzer1991weight}, \cite{dundar1994effects}, and \cite{hoehfeld1992learning}.
The generalization performance of neural networks has been well studied, and apart from pruning algorithms many heuristics have been used to avoid overfitting, such as dropout (\cite{srivastava2014dropout}), maxout (\cite{goodfellow2013maxout}), and cascade correlation (\cite{fahlman1989cascade}), among others. Of course, while cascade correlation specifically tries to construct minimal networks, many techniques that improve network generalization make no explicit attempt to reduce the total number of parameters or the memory footprint of a trained network.

Model compression often has benefits with respect to generalization performance and the portability of neural networks to operate in memory-constrained or embedded environments. Without explicitly removing parameters from the network, weight quantization allows for a reduction in the number of bytes used to represent each weight parameter, as investigated by \cite{balzer1991weight}, \cite{dundar1994effects}, and \cite{hoehfeld1992learning}.
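A minimal sketch of the uniform weight-quantization idea, assuming a simple affine code (a per-tensor minimum plus a step size, with weights stored as small integers) rather than the particular schemes of the cited works:

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Uniformly quantize weights to 2**bits levels over their value range."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    dtype = np.uint8 if bits <= 8 else np.uint16
    codes = np.round((w - lo) / scale).astype(dtype)
    return codes, lo, scale  # small-int codes plus two floats of metadata

def dequantize(codes, lo, scale):
    """Recover approximate float weights from the integer codes."""
    return codes * scale + lo

w = np.random.randn(4, 4).astype(np.float32)
codes, lo, scale = quantize_weights(w, bits=8)
w_hat = dequantize(codes, lo, scale)  # each entry is within half a step of w
```

Storing 8-bit codes instead of 32-bit floats cuts the weight storage roughly fourfold, at the cost of a bounded rounding error per weight.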

A recently proposed method for compressing recurrent neural networks (\cite{prabhavalkar2016compression}) uses the singular values of a trained weight matrix as basis vectors from which to derive a compressed hidden layer. Some other recent works like \cite{Anders2016quant} have tried successfully to achieve compression through weight quantization followed by an encoding step while others such as \cite{deepcompression2016} have tried to expand on this by adding weight-pruning as a preceding step to quantization and encoding.
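The low-rank idea underlying such SVD-based compression can be sketched generically as follows; this is a plain truncated-SVD factorization for illustration, not the exact procedure of the cited work:

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Approximate w (m x n) as the product of two thin matrices a @ b."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x rank, singular values folded in
    b = vt[:rank, :]             # rank x n
    return a, b                  # storage: rank*(m+n) floats instead of m*n

w = np.random.randn(64, 32)
a, b = low_rank_factorize(w, rank=32)  # full rank here, so reconstruction is exact
w_hat = a @ b
```

Choosing `rank` well below `min(m, n)` trades reconstruction error for a smaller parameter count.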

\section{Review of Pruning Algorithms}
In summary, there are many ways to improve network generalization: by altering the training procedure, by changing the objective error function, or by using compressed representations of the network parameters.

\subsection{Pruning Techniques}

If we wanted to continually shrink a network to its absolute minimal size in an optimal manner, we might accomplish this using any number of off-the-shelf pruning algorithms, such as Skeletonization (\cite{mozer1989skeletonization}), Optimal Brain Damage (\cite{lecun1989optimal}), or later variants such as Optimal Brain Surgeon (\cite{hassibi1993second}). In fact, we borrow much of our inspiration from these antecedent algorithms, with one major variation:\textit{ Instead of pruning individual weights, we prune entire neurons, thereby eliminating all of their incoming and outgoing weight parameters in one go, resulting in more memory saved, faster.}
If we wanted to continually shrink a network to its absolute minimal size, we might accomplish this using any number of off-the-shelf pruning algorithms, such as Skeletonization (\cite{mozer1989skeletonization}), Optimal Brain Damage (\cite{lecun1989optimal}), or later variants such as Optimal Brain Surgeon (\cite{hassibi1993second}). In fact, we borrow much of our inspiration from these antecedent algorithms, with one major variation: Instead of pruning individual weights, we prune entire neurons, thereby eliminating all of their incoming and outgoing weight parameters in one go, resulting in more memory saved, faster.
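Concretely, for a one-hidden-layer network with weight matrices `w1` (hidden x input) and `w2` (output x hidden), removing hidden neuron `j` deletes one row of `w1`, one bias entry, and one column of `w2` in a single step; the shapes below are illustrative only:

```python
import numpy as np

def remove_neuron(w1, b1, w2, j):
    """Delete hidden neuron j: its incoming row of w1, its bias, and its outgoing column of w2."""
    w1 = np.delete(w1, j, axis=0)   # rows of w1 index hidden units (incoming weights)
    b1 = np.delete(b1, j)
    w2 = np.delete(w2, j, axis=1)   # columns of w2 index hidden units (outgoing weights)
    return w1, b1, w2

w1 = np.random.randn(100, 784)   # hidden x input
b1 = np.random.randn(100)
w2 = np.random.randn(10, 100)    # output x hidden
w1, b1, w2 = remove_neuron(w1, b1, w2, j=17)
# One neuron-pruning step frees 784 + 1 + 10 parameters,
# versus a single parameter for one weight-pruning step.
```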

Scoring and ranking individual weight parameters in a large network is computationally expensive, and generally speaking the removal of a single weight from a large network is a drop in the bucket in terms of reducing the network's core memory footprint. We argue that pruning neurons instead of weights is more efficient both computationally and practically in terms of quickly reaching a target reduction in memory size. Our approach also gives downstream applications a realistic expectation of the minimal increase in error that results from removing a specified percentage of neurons. Such trade-offs are unavoidable, but the performance impact can be limited if a principled approach is used to find candidate neurons for removal.
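The bookkeeping this enables might be sketched as follows, assuming a hypothetical per-neuron importance score has already been computed (the scoring criteria themselves are developed later in the paper); the summed score of the removed neurons serves only as a crude proxy for the expected error increase:

```python
import numpy as np

def plan_pruning(scores, fraction):
    """Given per-neuron importance scores, select the lowest-scored fraction
    for removal and report their summed score as a rough error-increase proxy."""
    order = np.argsort(scores)              # least important neurons first
    k = int(round(fraction * len(scores)))
    victims = order[:k]
    return victims, float(scores[victims].sum())

scores = np.array([0.9, 0.1, 0.4, 0.05, 0.7])  # hypothetical importance values
victims, est_cost = plan_pruning(scores, 0.4)  # remove the 2 least important
```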