Add content to ROA paper
aleju committed Jun 2, 2016
1 parent ad2560e commit 99d4b24
Showing 1 changed file with 12 additions and 6 deletions.
18 changes: 12 additions & 6 deletions neural-nets/Rank_Ordered_Autoencoders.md
@@ -22,12 +22,18 @@
* Their encoder and decoder seem to be empty, i.e. `z` is the only hidden layer in the network.
* Their output is not just one image (or whatever is encoded); instead, they generate one output for every unit in layer `z`.
* Then they order these outputs based on the activation of the units in `z` (rank ordering), i.e. the output of the unit with the highest activation is placed first, the output of the unit with the 2nd highest activation second, and so on.
* They then generate the final output image based on a cumulative sum. So for three reconstructed output images `I1, I2, I3` (rank ordered that way) they would compute `final image = I1 + (I1+I2) + (I1+I2+I3)` (see the sketch after this list).
* They then compute the error based on that reconstruction (`reconstruction - input image`) and backpropagate it.
* Cumulative sum:
  * Using the cumulative sum puts most optimization pressure on units with high activation, as they have the largest influence on the reconstruction error.
  * The cumulative sum is best optimized by letting a few units have high activations and generate most of the output (correctly), while all other units ideally have low to zero activations and little or no influence on the output. (Though if the output generated by the first units is wrong, the cumulative error sum ends up being extremely high.)
  * So their `z` coding should end up with few but high activations, i.e. it should become very sparse.
  * The cumulative sum generates an individual error per output, while an ordinary sum generates the same error for every output. They argue that this "blurs" the error less.
* To avoid blow-ups in their network they use TReLUs, which saturate below 0 and above 1, i.e. `min(1, max(0, input))`.
* They use a custom derivative function for the TReLUs, which depends on both the unit's input value and its gradient. Basically, if the input is `>1` (saturated) and the error is high, the derivative pushes the weight down so that the input moves back into the unsaturated regime; similarly, inputs `<0` are pushed up. If the input value is between 0 and 1 and/or the error is low, nothing is changed. (A rough sketch of this rule follows after this list.)
* They argue that the algorithmic complexity of the rank ordering should be low, due to sorts being `O(n log(n))`, where `n` is the number of hidden units in `z`.
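
A minimal numpy sketch of the rank-ordered cumulative-sum reconstruction described above. The function name `rank_ordered_cumsum`, the argument names and the way the per-unit images are produced are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def rank_ordered_cumsum(activations, unit_outputs):
    """Rank-ordered cumulative-sum reconstruction (rough sketch).

    activations: hidden layer activations z, shape (n_units,).
    unit_outputs: one reconstructed image per unit, shape (n_units, n_pixels);
                  how each I_k is generated is left open here.
    """
    # Rank-order the per-unit outputs by activation, highest activation first.
    order = np.argsort(activations)[::-1]
    ordered = unit_outputs[order]          # I1, I2, I3, ...

    # Partial sums I1, I1+I2, I1+I2+I3, ...
    partial_sums = np.cumsum(ordered, axis=0)

    # final image = I1 + (I1+I2) + (I1+I2+I3)
    final_image = partial_sums.sum(axis=0)
    return final_image, partial_sums

# Hypothetical usage: 3 hidden units, one flattened 7x7 patch.
z = np.array([0.9, 0.1, 0.4])
outputs = np.random.rand(3, 49)
patch = np.random.rand(49)                 # stand-in for the input image
final_image, partials = rank_ordered_cumsum(z, outputs)
error = final_image - patch                # reconstruction - input image
```

Because the highest-ranked output appears in every partial sum and the lowest-ranked one only in the last, high-activation units dominate the reconstruction error, which is where the extra optimization pressure on them comes from.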
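
A rough sketch of the TReLU and the custom derivative rule as summarized above, assuming a standard gradient-descent sign convention; the paper's exact formulation may differ:

```python
import numpy as np

def trelu(x):
    """TReLU forward pass: saturates below 0 and above 1, i.e. min(1, max(0, x))."""
    return np.minimum(1.0, np.maximum(0.0, x))

def trelu_backward(x, grad_output):
    """Rough sketch of the custom derivative (assumption, not the paper's exact rule).

    A plain TReLU would have zero gradient in the saturated regimes. Here the
    upstream gradient is let through whenever a gradient-descent step
    (param -= lr * grad) would push the pre-activation back into the
    unsaturated range [0, 1]; otherwise it is blocked.
    """
    mask = np.ones_like(x)
    # Saturated above 1: only keep gradients that push the input down.
    mask[(x > 1.0) & (grad_output <= 0.0)] = 0.0
    # Saturated below 0: only keep gradients that push the input up.
    mask[(x < 0.0) & (grad_output >= 0.0)] = 0.0
    return grad_output * mask
```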

* Results

![Examples](images/Accurate_Image_Super-Resolution__examples.png?raw=true "Examples")

*Super-resolution quality of their model (top, bottom is a competing model).*
* They autoencode 7x7 patches from CIFAR-10.
* They get very sparse activations.
* Training and test loss develop identically, i.e. no overfitting.
