Stacked Capsule Autoencoders

Thought on paper

  • After attending several presentations on this topic, I finally read the paper, though I still skipped some details.
  • The main idea is to add inductive priors (i.e. define affine transformations, prediction probabilities, and occlusion/alpha),
  • but the model is realized and improved by applying several tricks, e.g. sparsity. Probably some model parts (CNN, autoencoder) have certain weaknesses for implementing the SCAE.
  • It focuses mainly on vision problems; it is not a generic idea.
  • As in the previous paper, adding known knowledge to the model seems to be the new trend in this area.


Growing Neural Cellular Automata

Thought on paper

  • This paper is the most interesting article I have read recently.
  • It connects my personal adventures in deep learning, complex systems (e.g. cellular automata), and life science.
  • In pure deep learning, many models respect (are inspired by) life science, e.g. CNNs, but do not fully consider the physical world's laws. I think that is too abstract, similar to the multiverse model vs. our world in theoretical physics.
  • By adding bias, here biological knowledge, to the neural network and computation model (the cell), the result seems to approach reality.
  • Another 2019 article from Hinton on Capsule Networks also seems to emphasize adding known real-world laws into deep learning models.


Visualizing the Impact of Feature Attribution Baselines

Thought on paper

  • This paper extends the topic of the last paper and focuses on the effect of the baseline, how to choose one, and its pitfalls.
  • Instead of a constant-value baseline, several different approaches are introduced, some of them promising, e.g. using the distribution of the data.
  • The claim that the interpretation of a model shall follow human logic is not valid, not to mention that we don't know exactly how human intuition works.
  • A question answered with more questions, but still insightful.


Axiomatic Attribution for Deep Networks

Thought on paper

  • This paper reasons about a deep network in the sense of which input (I) contributes to the output (O).
  • It uses a method called "Integrated Gradients", which literally integrates the gradient of the output with respect to the input.
  • But why not the plain gradient? The paper explains that the pure gradient may lose sensitivity for some inputs due to the network implementation.
  • Then how to use it? Take a network, run an inference, then compute the Integrated Gradients with a Riemann integral approximation, namely sampling n discrete values along the path and summing/averaging.
  • Last point: I am happy that I got so much information from this paper :)
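The sampling-and-averaging recipe above can be sketched with a toy differentiable function standing in for a real network (the function `f`, its gradient, and the step count are illustrative assumptions, not from the paper):

```python
import numpy as np

def f(x):
    # toy stand-in for a network's scalar output
    return np.sum(x ** 2)

def grad_f(x):
    # analytic gradient of the toy function
    return 2.0 * x

def integrated_gradients(x, baseline, grad_fn, steps=100):
    # Riemann (midpoint) approximation of the path integral:
    # average the gradient along the straight line from baseline to x,
    # then scale element-wise by (x - baseline)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline, grad_f)
```

A useful sanity check is the completeness axiom: the attributions sum to f(x) - f(baseline).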


Representation Learning with Contrastive Predictive Coding

Thought on paper

  • This article introduces CPC and explains the motivation of extracting the mutual information of input and context in a low-dimensional space.
  • The point is how to derive and convert the input/output into the domain of information.
  • I did not get this transformation intuitively from the equations.
  • When dealing with distributions, it is inevitable to confront intractable variables; approximate approaches are needed.


A Simple Baseline for Bayesian Uncertainty in Deep Learning

Thought on paper

  • Honestly, I hardly understand the content.
  • It seems to introduce the SWAG algorithm, which leverages the uncertainty in SGD and utilizes this information to speed up training by maintaining a big learning rate and to improve accuracy(?).


Putting An End to End-to-End: Gradient-Isolated Learning of Representations

Thought on paper

  • The method is based on the assumption of "slow features" among temporally or spatially local features.
  • By extracting these "slow features" via maximizing the mutual information (discarding the rest), the sub-networks/layers transform high-dimensional inputs into relatively low-dimensional ones, like most classification networks do.
  • The layer-wise training does not seem to depend on the final goal explicitly, as gradients do not propagate back across modules/layers; implicitly, the final goal exerts a one-time relationship, namely the "slow feature" assumption.
  • So to speak, the authors were quite precise and cautious to title it "Gradient-Isolated Learning of Representations" and not "Isolated Learning of Representations".
  • What is CPC? This is the second time I have met this topic; I need to do my homework.
  • I also don't understand the loss function; I assume it relates to CPC.


Scalable Active Learning for Autonomous Driving

Thought on paper

  • This is not a traditional academic paper but rather an engineering one.
  • The paper introduces an active learning (AL) method, which applies an acquisition function over multiple models.
  • The paper also presents the solution with the whole required hardware/software stack; partially, it is an advertisement for Nvidia.
  • Conclusion: a very practical solution to address continuous model improvement with new data.


Analyzing and Improving the Image Quality of StyleGAN

Thought on paper

  • Intended to catch up on the development of GANs.
  • The paper introduces some corrections/improvements to StyleGAN.
  • E.g. progressive growing was thought to be a major improvement for generating high-resolution images, but it comes with small drawbacks.
  • It explains the normalization trap in StyleGAN.
  • It analyzes entangled latent variables vs. disentangled ones via transformations (I am not sure whether this is my assumption or the article states it explicitly).
  • While reading the paper, I came up with a correlation between neural networks and chaotic systems. To be specific, a system with multiple variables needs to be cautious not to enter a chaotic state.


Neural networks grown and self-organized by noise

Thought on paper

  • This is a very interesting article on how to mimic the way biological neural networks form.
  • I have been speculating about a network that can self-organize and adjust (grow/shrink) based on its tasks.
  • The article introduces two algorithms:
    • how to grow a single cell into a two-layer network
    • how to form patches, like CNN filters, with a Hebbian-like algorithm
  • However, it does not consider high-level learning, i.e. the learned patches come purely from the environment, with no subjective purpose.
  • It seems to be a kind of unsupervised model.
  • The provided MNIST application is not convincing.


Computing Receptive Fields of Convolutional Neural Networks

Thought on paper

  • Provides a way to calculate the receptive field size in single-path and multi-path networks.
  • Discusses the problem of aligning the centers of receptive fields.
  • Attributes accuracy improvements to receptive field size.
    • Probably the increased receptive field size catches the outliers or the long tail of the data distribution.
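For the single-path case, the size calculation can be sketched as a running product of strides (the layer specs below are hypothetical examples, not from the paper):

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples for a single-path network
    size, jump = 1, 1  # jump = cumulative stride seen so far
    for k, s in layers:
        size += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s
    return size

# three stacked 3x3 convs with stride 1 give a 7x7 receptive field,
# which is why small kernels can substitute for one large kernel
```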


The Loss Surfaces of Multilayer Networks

Thought on paper

  • The authors represent the loss function in the form of a polynomial expression, w1·w2·…·wn.
  • The output is a collection of subsets of walks from the input through the hidden layers, with some walks filtered out by activation functions, e.g. ReLU.
  • The article tries to prove some bounds on neural networks with a physical model, the Hamiltonian spin-glass model (Ising model). I am not familiar with these things.
  • The conclusion is that big nets tend to converge but overfit.
  • No example of a loss surface is given, but I guess that due to the polynomial expression of the loss, the Hessian matrix w.r.t. the weights w has random negative/positive signs, and therefore the loss is non-convex.


Perceptual Losses for Real-Time Style Transfer and Super-Resolution

Thought on paper

  • Adds a feature loss term to the loss; the feature loss is extracted from a pretrained transfer model.
  • I need an example and code to see how it is done.
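As a partial self-answer to the last bullet, the core idea can be sketched: compare features of a fixed pretrained network rather than raw pixels. The `toy_extractor` below is purely an illustrative stand-in for a frozen network layer (e.g. a VGG activation), not the paper's actual setup:

```python
import numpy as np

def feature_loss(extractor, generated, target):
    # perceptual/feature loss: MSE in feature space, not pixel space
    return np.mean((extractor(generated) - extractor(target)) ** 2)

def toy_extractor(img):
    # stand-in for a frozen pretrained feature map: crude "edge" features
    return np.abs(np.diff(img))

a = np.array([0.0, 1.0, 0.0, 1.0])
b = a + 0.5  # same edges, shifted brightness
```

The pixel-wise MSE between `a` and `b` is large, but the feature loss is zero because the edge structure matches, which is the intuition behind perceptual losses.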


Visualizing the Loss Landscape of Neural Nets

Thought on paper

  • Non-convex loss functions: why, and how do we know? Check with the Hessian matrix and its eigenvalues; another topic I need to learn.
  • How is the loss surface calculated and visualized? PCA?
  • Is adding skip connections a universal way to convexify the loss function?
  • Wider nets have much more convex loss surfaces.
  • The paper provides code.
  • Are there other tricks to convexify and smooth the loss function?
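On the question of how the surface is calculated: the usual recipe is to evaluate the loss along one or two directions around the trained weights. A minimal 1-D sketch with a toy convex loss (the weights, direction, and loss function are all illustrative assumptions):

```python
import numpy as np

def loss_slice(loss_fn, theta, direction, alphas):
    # evaluate the loss along the line theta + alpha * direction;
    # this 1-D (or 2-D, with two directions) slice is what gets plotted
    return [loss_fn(theta + a * direction) for a in alphas]

theta = np.zeros(3)        # stand-in for trained weights
direction = np.ones(3)     # stand-in for a (normally random) probe direction
alphas = np.linspace(-1.0, 1.0, 5)
vals = loss_slice(lambda w: float(np.sum(w ** 2)), theta, direction, alphas)
```

For this toy quadratic loss, the slice is convex with its minimum at the trained point; for a real network the slice can reveal sharp, chaotic regions.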


Deep Residual Learning for Image Recognition

Thought on paper

  • I was confused about why it is called a residual network. The residual here is represented by F(x), the path without the skip connection.
  • It is called "residual" because F(x) models the difference between the target output H(x) and the input x.
  • It seems the residual is applied per residual block, because the final output of the network has quite a different shape from the input x.
  • It also bothered me what happens when the input and output dimensions do not match.
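On the last point: the paper handles the mismatch with a projection shortcut (a linear map, realized as a 1x1 convolution). A minimal numpy sketch with dense layers standing in for convolutions (shapes and names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2, w_proj=None):
    fx = relu(x @ w1) @ w2                           # F(x): residual path
    shortcut = x if w_proj is None else x @ w_proj   # project on mismatch
    return relu(fx + shortcut)                       # H(x) = F(x) + x

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))          # input width 4
y = residual_block(x,
                   rng.standard_normal((4, 8)),
                   rng.standard_normal((8, 8)),
                   w_proj=rng.standard_normal((4, 8)))  # widen 4 -> 8
```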


Visualizing and Understanding Convolutional Networks

Thought on paper

  • Revisits the topic of feature visualization.
  • Visualizing single activations layer by layer shows the decomposition from layer 1 to the last.
  • I.e. if you visualize the later layers, you see gradually more abstract activation maps.
  • If you visualize activations in earlier layers, you see the elements the filters recognize.


A systematic study of the class imbalance problem in convolutional neural networks

Thought on paper

  • Several methods are experimented with: oversampling, undersampling, and thresholding.
  • ROC AUC is used as the metric to measure the model; ROC AUC is another topic to explore.
  • Class imbalance was the first problem I hit when I trained a CNN with data from simulation.
  • I am still boggling over how to handle this problem in the RL domain, since in an RL environment the input distribution shifts during training.
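A remedy adjacent to the thresholding studied in the paper is inverse-frequency class weighting in the loss; a minimal sketch of the weighting itself (my own illustration, not the paper's method):

```python
import numpy as np

def class_weights(labels):
    # inverse-frequency weights: rare classes get a larger loss weight,
    # so each class contributes roughly equally to the total loss
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 90/10 imbalance: the minority class ends up weighted 9x the majority
w = class_weights(np.array([0] * 90 + [1] * 10))
```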


Cyclical Learning Rates for Training Neural Networks

Thought on paper

  • How to find a range for the learning rate based on the trend of accuracy: when accuracy decreases while the learning rate increases, set that as the upper bound of the learning rate.
  • Use a cyclical learning rate schedule to train the network instead of a fixed learning rate or a monotonically decreasing schedule.
  • The justification of why CLR works is not very clear to me. Some hints, like adding more variance to the gradients?
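The basic triangular schedule from the paper can be sketched as follows (the bounds and step size in the example are illustrative values, not the paper's):

```python
def triangular_clr(step, base_lr, max_lr, step_size):
    # one cycle = 2 * step_size steps: the lr climbs linearly from
    # base_lr to max_lr, then descends back to base_lr
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)  # 1 at cycle edges, 0 at peak
    return base_lr + (max_lr - base_lr) * (1.0 - x)

# e.g. base_lr=1e-3, max_lr=6e-3 (found with the range test), step_size=100
```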

A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Thought on paper

  • Experiments on learning rate, batch size, momentum, and weight decay.
  • Introduces the "1cycle" policy: a single learning rate cycle over the whole training run.
  • Experiments on cyclical learning rate and momentum.
  • Experiments on different datasets and neural network architectures.
  • All problems are supervised learning.
  • It would be interesting to see a similar analysis in the RL domain.


Revisiting Small Batch Training for Deep Neural Networks

Thought on paper

  • Batch size affects the variance of the weight updates.
  • A big batch size with mean loss reduces the variance and degrades the generalization of SGD.
  • Batch Normalization counteracts the mean effect and prefers a relatively large batch size.
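The first bullet can be checked empirically: the variance of a mean-gradient estimate shrinks roughly as 1/batch_size. The per-sample "gradients" below are toy scalars, not from a real network:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.standard_normal(100_000)  # stand-in per-sample gradients

def update_variance(batch_size, n_batches=2000):
    # variance of the averaged mini-batch gradient across many batches;
    # averaging over a batch shrinks the variance by ~1/batch_size
    means = [rng.choice(grads, batch_size).mean() for _ in range(n_batches)]
    return float(np.var(means))

small, large = update_variance(4), update_variance(64)
```

The 16x batch-size ratio shows up as roughly a 16x variance ratio, which is the noise the paper argues is useful for generalization.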


Group Normalization

Thought on paper

  • Generalizes Layer Normalization and Instance Normalization.
  • A way to add a human prior to the model architecture, i.e. we manually group the feature channels into similar categories.
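A sketch of the grouping, with the reductions to the other norms noted (NCHW layout assumed; the learnable scale/shift are omitted for brevity):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x: (N, C, H, W); split the channels into groups and normalize
    # within each group, separately per sample
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

# num_groups == C recovers Instance Norm; num_groups == 1 recovers Layer Norm
x = np.random.default_rng(0).standard_normal((2, 4, 3, 3))
```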


Instance Normalization: The Missing Ingredient for Fast Stylization

Thought on paper

  • The method is applied to style transfer.
  • Instance normalization, as its name implies, normalizes over a single image, i.e. contrast normalization.


Softmax and the class of "background"

  • Jeremy Howard explains that softmax needs to be used with caution: since it always converts logits into a probability distribution, the input needs to be curated so that at least one class exists; otherwise it makes no sense when a test input contains no object belonging to any class. For supervised training this is not a problem, because the input is always labelled with a class.
  • For semantic segmentation, the "background" class, i.e. none of the wanted classes, is hard to classify, since it needs O(N × I(class)) capacity to identify: the weights need to classify all possible classes as negative. I assume the "background" class learns the average bias or threshold for all possible classes to be classified as negative (not a class). So it depends highly on the training distribution and generalizes badly.
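The first point is easy to see numerically: even for logits where every class is unlikely, softmax still emits a full probability distribution (the logit values below are arbitrary):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# all classes have very negative logits ("none of the above"),
# yet the output still sums to 1 and crowns a confident winner
p = softmax(np.array([-5.0, -4.0, -6.0]))
```

Softmax cannot express "no class present"; it only ranks the classes it is given.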

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Thought on paper

  • Batch Normalization transforms the inputs of each layer to a zero-mean, unit-variance distribution; how does that affect the bias?
  • I need to work through a concrete example.
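A concrete example for the bullets above (gamma/beta are the learnable scale and shift; note that beta re-introduces a bias after normalization, which is why the preceding layer's own bias term becomes redundant):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the batch dimension, then rescale/shift
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

# two features on wildly different scales end up comparable
x = np.array([[1.0, 100.0],
              [3.0, 300.0]])
y = batch_norm(x)
```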

Layer Normalization

Thought on paper

  • Layer normalization deals with a problem of RNNs that Batch Normalization cannot handle.
  • The paper provides more insight into the mathematical properties of normalization operations, especially their invariances.
  • Layer normalization does not prove to be better for CNNs: CNNs have a smaller "perceptual angle" than dense networks, so correlating neurons that carry unrelated information is a factor in why Layer Normalization may not work well.
  • The paper gives food for thought on which kind of normalization to apply in which problem circumstances.


Self-Normalizing Neural Networks

Thought on paper

  • The article introduces the SELU activation function, with additional parameters lambda and alpha.
  • SELU has the property of maintaining the mean and variance of the activations across multiple layers.
  • It does not consider gradient propagation, so I am not sure whether the gradient vanishing or exploding problem is tackled.
  • The mathematical derivation is overwhelming and intimidating; it uses ~100 pages to illustrate it.
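The mean/variance-preserving property can at least be spot-checked numerically at the paper's fixed point (zero-mean, unit-variance inputs); the two constants are the ones given in the paper:

```python
import numpy as np

# fixed constants derived in the paper
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# push standard-normal activations through SELU: the mean and variance
# stay approximately at the (0, 1) fixed point
x = np.random.default_rng(0).standard_normal(1_000_000)
y = selu(x)
```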

All you need is a good init

Thought on paper

  • Applies weight initialization with an orthogonal matrix plus scaling with batch input data.
  • An extension of Xavier's and Kaiming's initializations.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Thought on paper

  • Applies weight initialization with an orthogonal matrix plus scaling.
  • An extension of Xavier's and Kaiming's initializations.


A Probabilistic Representation of Deep Learning

Thought on paper

  • The article tries to model DNNs with probabilistic graphical models.
  • Intuitively, it sounds like an attractive idea: layers near the data learn the prior distribution of the data, and layers near the output assemble the posterior of the target information.
  • I did not go through the details, so I am not sure whether the statement is justified in theory.


Fixup Initialization: Residual Learning Without Normalization

Thought on paper

  • He initialization handles the ReLU activation, but ResNet introduces skip connections, which introduce a new factor for gradient explosion.
  • A new method is introduced to fix this new architecture.
  • To me it seems it is always required to analyze the properties of the data (input) and the model architecture when an application is to be solved with a neural network:
    • to check from theory whether the solution will work in the first place
    • to monitor the training/learning process


Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Thought on paper

  • The paper deals with the combination of CNNs with ReLU; it should also be OK for non-CNNs with ReLU.
  • With Kaiming initialization, either the variance of the activations or that of the back-propagated gradients is preserved, but not both; one of them will be scaled by c2/dL or dL/c2.
  • It is enlightening that the convolution can be simplified as a matrix multiplication W·x, with W containing some regularity.
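A minimal sketch of the forward-preserving variant: Gaussian weights with std sqrt(2/fan_in), where the factor 2 compensates for ReLU zeroing half the signal. The layer sizes and depth below are illustrative, and dense layers stand in for the convolutions:

```python
import numpy as np

def kaiming_normal(fan_in, fan_out, rng):
    # std = sqrt(2 / fan_in) preserves activation variance through ReLU
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def relu(x):
    return np.maximum(x, 0.0)

# propagate a large batch through several ReLU layers: the second moment
# of the activations stays near 1 instead of collapsing by ~2x per layer
rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 512))
for _ in range(5):
    x = relu(x @ kaiming_normal(512, 512, rng))
```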


Understanding the difficulty of training deep feedforward neural networks

Thought on paper

  • The initialization of the weights strongly influences or determines the training process.
  • Standard uniform random initialization in [-1/sqrt(n), 1/sqrt(n)] causes saturation of activations in deep networks.
  • Sigmoid activation with uniform initialization tends to saturate the neural activations at 0.5, which causes the network to learn slowly.
  • The goal is to keep the variances of the activations and of the back-propagated gradients stable, i.e. constant across layers.
  • Thus Xavier initialization of the weights, with limit sqrt(6/(n1+n2)), is a compromise that keeps the variances of activations and gradients stable across layers.
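The limit in the last bullet can be sketched directly (the layer sizes below are illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # uniform on [-a, a] with a = sqrt(6/(n_in + n_out));
    # since Var(U[-a, a]) = a^2/3, this gives Var(W) = 2/(n_in + n_out),
    # the compromise between forward and backward variance preservation
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(256, 512, np.random.default_rng(0))
```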


A Discussion of Adversarial Examples Are Not Bugs, They Are Features

Thought on paper

  • The idea is that a neural network catches human-insensitive features and uses them for classification.
  • The human-insensitive features are not "robust", meaning they are random and weakly correlated with the result. Does that mean the dataset does not have enough data to cover the distribution of these features, like the mentioned high-frequency features in images? In other words, a man with a king's crown is classified as a king; here the crown is not a "robust" feature.

