# Denoising Autoencoders (dA)

- The Denoising Autoencoder (dA) is an extension of the classical autoencoder, introduced as a building block for deep networks.
- We begin with a brief review of autoencoders.

## Autoencoders
- An autoencoder takes an input \\( x \in [0,1]^{d} \\) and first maps it, through a deterministic mapping, to a hidden representation \\( y \in [0,1]^{d'} \\):
- \\( y = s( {W}{x} + {b} ) \\)
- Here s is a non-linearity such as the sigmoid.


- The latent representation \\( {y} \\), also called the code, is then mapped back into a reconstruction \\( {z} \\) of the same shape as \\( {x} \\).
- This mapping uses a similar transformation:
- \\( {z} = s({W'}{y} + {b'} ) \\)
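- The two mappings above can be sketched in plain Python for a single example. This is only an illustration under stated assumptions: the helper names `encode` and `decode`, the toy sizes, and the random initial values are hypothetical, and the row-vector convention (`x W` rather than `W x`) is chosen to match the Theano code later in these notes.

```python
import math
import random


def sigmoid(a):
    """Logistic non-linearity s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def encode(x, W, b):
    """y = s(x W + b), with W of shape (d, d_hidden) and x of length d."""
    return [sigmoid(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def decode(y, W_prime, b_prime):
    """z = s(y W' + b'), with W_prime of shape (d_hidden, d)."""
    return [sigmoid(sum(y[j] * W_prime[j][i] for j in range(len(y))) + b_prime[i])
            for i in range(len(b_prime))]


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d, d_hidden = 4, 2
W = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)] for _ in range(d)]
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d_hidden)]
b = [0.0] * d_hidden
b_prime = [0.0] * d

x = [0.0, 1.0, 1.0, 0.0]
y = encode(x, W, b)              # code, length d_hidden
z = decode(y, W_prime, b_prime)  # reconstruction, same length as x
print(len(y), len(z))
```

- Because both mappings end in a sigmoid, every entry of the code and of the reconstruction lies strictly between 0 and 1.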



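- (The prime symbol does not denote matrix transposition.) \\( {z} \\) should be seen as a prediction of \\( {x} \\) given the code \\( {y} \\).
- Optionally, the weight matrix \\( {W'} \\) of the reverse mapping can be constrained to be the transpose of the forward mapping: \\( {W'} = {W}^T \\). This is referred to as tied weights.
- The parameters of the model ( \\( {W}, {b}, {b'} \\), and \\( {W'} \\) when the weights are not tied ) are optimized to minimize the average reconstruction error.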


- The reconstruction error can be measured in several ways, depending on the distributional assumptions made about the input given the code.
- The traditional squared error \\( L({x}, {z}) = || {x} - {z} ||^2 \\) can be used.
- If the input is interpreted as a vector of bits or of bit probabilities, the cross-entropy of the reconstruction can be used:
- \\( L_{H}({x}, {z}) = - \sum_{k=1}^{d} [ x_k \log z_k + (1 - x_k) \log (1 - z_k) ] \\)
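- The two error measures can be compared on a small worked example. This is a minimal sketch: the function names and the sample vectors are made up for illustration, and the `eps` clipping is an added assumption to avoid `log(0)`; it is not part of the formula itself.

```python
import math


def squared_error(x, z):
    """Squared error ||x - z||^2 for one example."""
    return sum((xk - zk) ** 2 for xk, zk in zip(x, z))


def cross_entropy(x, z, eps=1e-12):
    """Cross-entropy -sum_k [x_k log z_k + (1 - x_k) log(1 - z_k)].

    eps is an assumption of this sketch: it keeps z away from exactly 0 or 1.
    """
    total = 0.0
    for xk, zk in zip(x, z):
        zk = min(max(zk, eps), 1.0 - eps)
        total += xk * math.log(zk) + (1.0 - xk) * math.log(1.0 - zk)
    return -total


x = [0.0, 1.0, 1.0, 0.0]
z_good = [0.1, 0.9, 0.8, 0.2]  # close to x
z_bad = [0.9, 0.1, 0.2, 0.8]   # far from x

print(squared_error(x, z_good), squared_error(x, z_bad))
print(cross_entropy(x, z_good), cross_entropy(x, z_bad))
```

- Both measures are smaller for the reconstruction that is closer to the input; the cross-entropy penalizes confident wrong predictions much more heavily.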
 
 
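- The hope is that the code \\( {y} \\) captures the coordinates along the main factors of variation in the data.
- This is similar to principal component analysis (PCA).
- In particular, if there is a single linear hidden layer and the network is trained with the mean squared error, the k hidden units learn to project the input onto the span of the first k principal components.
- If the hidden layer is non-linear, the auto-encoder behaves differently from PCA.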


- \\( {y} \\) can be seen as a lossy compression of \\( {x} \\).


- Let us implement the auto-encoder as a class using Theano.
- First, the parameters of the autoencoder ( \\( {W}, {b}, {b'} \\) ) are created as shared variables.
- Because the weights are tied, \\( {W}^T \\) is used for \\( {W'} \\).

In [None]:
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
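        Constructor of the dA class.
        Sets the number of visible units (the dimension of the input) and
        the number of hidden units, and receives symbolic variables for the
        input, weights and biases.
        In SdAs, the output of the first dA is the input of the second dA.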

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: random number generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                     generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden:  number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None


        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # between -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX
            # so that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden
        self.b = bhid
        # b_prime corresponds to the bias of the visible
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]
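- The initialization range used for `W` in the constructor can be checked with a small stand-alone calculation. This is a sketch using only the standard library: `init_weights` is a hypothetical helper, and nested lists stand in for the NumPy array and Theano shared variable of the real code.

```python
import math
import random


def init_weights(n_visible, n_hidden, rng):
    """Sample W uniformly in [-bound, bound], bound = 4 * sqrt(6 / (n_hidden + n_visible))."""
    bound = 4.0 * math.sqrt(6.0 / (n_hidden + n_visible))
    W = [[rng.uniform(-bound, bound) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    return W, bound


# Same layer sizes as the constructor defaults: 784 visible, 500 hidden units.
W, bound = init_weights(784, 500, random.Random(123))
print(bound)                # half-width of the sampling interval
print(len(W), len(W[0]))    # shape (n_visible, n_hidden)
```

- For the default sizes the bound is a little above 0.27, so all initial weights are small and the sigmoid units start close to their linear regime.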

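- Note that the symbolic input is passed to the autoencoder as a parameter, so that layers can be chained into a deep network: the symbolic output (the \\( {y} \\) above) of layer k is the symbolic input of layer k+1.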

- Next come the expressions for the hidden layer and for the reconstruction.

In [None]:
    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

In [None]:
    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer
        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

- The cost is computed from these functions, and the parameters are updated by stochastic gradient descent.

In [None]:
    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
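- What `T.grad` and the update list do can be illustrated by writing one gradient step by hand. This is a rough sketch and not the tutorial's implementation: it uses plain Python lists, a single uncorrupted example instead of a minibatch, and hypothetical helper names (`forward`, `cost`, `sgd_step`). The gradients are derived for tied weights, sigmoid units and the cross-entropy cost.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W, b, b_prime):
    """Tied-weight autoencoder: y = s(x W + b), z = s(y W^T + b')."""
    n_vis, n_hid = len(W), len(b)
    y = [sigmoid(sum(x[i] * W[i][j] for i in range(n_vis)) + b[j])
         for j in range(n_hid)]
    z = [sigmoid(sum(y[j] * W[i][j] for j in range(n_hid)) + b_prime[i])
         for i in range(n_vis)]
    return y, z


def cost(x, z):
    """Cross-entropy reconstruction cost for one example."""
    return -sum(xk * math.log(zk) + (1 - xk) * math.log(1 - zk)
                for xk, zk in zip(x, z))


def sgd_step(x, W, b, b_prime, learning_rate):
    """Update W, b, b_prime in place; return the cost before the update."""
    n_vis, n_hid = len(W), len(b)
    y, z = forward(x, W, b, b_prime)
    # Gradient of the cost w.r.t. the pre-activation of z is (z - x).
    delta_z = [z[i] - x[i] for i in range(n_vis)]
    # Back-propagate through W^T and the hidden sigmoid.
    delta_y = [sum(delta_z[i] * W[i][j] for i in range(n_vis)) * y[j] * (1 - y[j])
               for j in range(n_hid)]
    for i in range(n_vis):
        for j in range(n_hid):
            # W is used twice (encoder and decoder), so two terms are summed.
            W[i][j] -= learning_rate * (x[i] * delta_y[j] + delta_z[i] * y[j])
    for j in range(n_hid):
        b[j] -= learning_rate * delta_y[j]
    for i in range(n_vis):
        b_prime[i] -= learning_rate * delta_z[i]
    return cost(x, z)


rng = random.Random(0)
n_vis, n_hid = 4, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_vis)]
b = [0.0] * n_hid
b_prime = [0.0] * n_vis
x = [0.0, 1.0, 1.0, 0.0]

costs = [sgd_step(x, W, b, b_prime, learning_rate=0.1) for _ in range(100)]
print(costs[0], costs[-1])
```

- The rule `param - learning_rate * gparam` is the same as in the `updates` list above; Theano derives the gradient expressions symbolically instead of requiring them to be written out.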

- We can now define a function that iteratively updates the parameters W, b and b_prime so as to minimize the reconstruction cost.

In [None]:
    da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )

    start_time = timeit.default_timer()

    ############
    # TRAINING #
    ############

    # go through training epochs
    for epoch in xrange(training_epochs):
        # go through training set
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(train_da(batch_index))

        print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

    end_time = timeit.default_timer()

    training_time = (end_time - start_time)

    print >> sys.stderr, ('The no corruption code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((training_time) / 60.))
    image = Image.fromarray(
        tile_raster_images(X=da.W.get_value(borrow=True).T,
                           img_shape=(28, 28), tile_shape=(10, 10),
                           tile_spacing=(1, 1)))
    image.save('filters_corruption_0.png')

    # start-snippet-3
    #####################################
    # BUILDING THE MODEL CORRUPTION 30% #
    #####################################

    rng = numpy.random.RandomState(123)
    theano_rng = RandomStreams(rng.randint(2 ** 30))

    da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.3,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )

    start_time = timeit.default_timer()

    ############
    # TRAINING #
    ############

    # go through training epochs
    for epoch in xrange(training_epochs):
        # go through training set
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(train_da(batch_index))

        print 'Training epoch %d, cost ' % epoch, numpy.mean(c)

    end_time = timeit.default_timer()

    training_time = (end_time - start_time)

    print >> sys.stderr, ('The 30% corruption code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % (training_time / 60.))
    # end-snippet-3

    # start-snippet-4
    image = Image.fromarray(tile_raster_images(
        X=da.W.get_value(borrow=True).T,
        img_shape=(28, 28), tile_shape=(10, 10),
        tile_spacing=(1, 1)))
    image.save('filters_corruption_30.png')
    # end-snippet-4

    os.chdir('../')


if __name__ == '__main__':
    test_dA()
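- The `givens` slice `train_set_x[index * batch_size: (index + 1) * batch_size]` selects one minibatch per index. A small sketch with a plain list shows which rows each index covers. Here `minibatch` is a hypothetical helper, and computing `n_train_batches` by floor division is an assumption, since its definition is not shown in the excerpt above.

```python
def minibatch(data, index, batch_size):
    """Return the index-th slice of batch_size consecutive examples."""
    return data[index * batch_size:(index + 1) * batch_size]


train_set = list(range(10))  # stand-in for 10 training examples
batch_size = 3
# Assumption: only full batches are used, so leftover examples are skipped.
n_train_batches = len(train_set) // batch_size

batches = [minibatch(train_set, i, batch_size) for i in range(n_train_batches)]
print(batches)
```

- With 10 examples and a batch size of 3 there are 3 full batches, and the last example is not visited in an epoch.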

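- An auto-encoder with more hidden units than inputs could simply learn the identity function, which yields no useful representation, so it has to be prevented from doing so.
- One way is sparsity, which means forcing many of the hidden units to be zero or near-zero.
- Sparsity has been exploited very successfully in many cases.
- Another way is to add randomness to the transformation from input to reconstruction; this technique is used in RBMs as well as in Denoising Auto-Encoders.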

## Denoising Autoencoders

- The idea behind denoising autoencoders is simple.
- To force the hidden layer to discover more robust features, and to prevent it from simply learning the identity, the autoencoder is trained to reconstruct the input from a corrupted version of it.


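- The denoising auto-encoder is a stochastic version of the auto-encoder.
- Intuitively, a denoising auto-encoder does two things:
   - 1) It encodes the input, and undoes the effect of a corruption process stochastically applied to the input of the auto-encoder.
   - 2) It captures the statistical dependencies between the inputs, which is what makes undoing the corruption possible.
- The denoising auto-encoder can be understood from several perspectives: the manifold learning perspective, the stochastic operator perspective, the bottom-up information-theoretic perspective, and the top-down generative model perspective.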


- The stochastic corruption process sets some of the inputs to zero.
- Hence the denoising auto-encoder tries to predict the corrupted (i.e., missing) values from the uncorrupted (i.e., non-missing) values.
- Note that being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works).


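- To convert the autoencoder class into a denoising autoencoder class, all that is needed is a stochastic corruption step operating on the input.
- The input can be corrupted in many ways; here the corruption randomly sets some entries of the input to zero.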

In [None]:
    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs the
        same and zeroes out a randomly selected subset of size
        ``corruption_level``.
        Note : first argument of theano_rng.binomial is the shape (size) of
               random numbers that it should produce
               second argument is the number of trials
               third argument is the probability of success of any trial

                this will produce an array of 0s and 1s where 1 has a
                probability of 1 - ``corruption_level`` and 0 with
                ``corruption_level``

                The binomial function returns int64 data type by
                default.  int64 multiplied by the input
                type (floatX) always returns float64.  To keep all data
                in floatX when floatX is float32, we set the dtype of
                the binomial to floatX. As in our case the value of
                the binomial is always 0 or 1, this doesn't change the
                result. This is needed to allow the GPU to work
                correctly as it only supports float32 for now.

        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input
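- The masking noise above can be mimicked without Theano to see its effect on a concrete vector. This is a minimal sketch: `corrupt` is a hypothetical stand-in that draws one Bernoulli keep/drop decision per entry with Python's `random` module, playing the role of the `binomial(n=1, p=1 - corruption_level)` mask.

```python
import random


def corrupt(x, corruption_level, rng):
    """Keep each entry with probability 1 - corruption_level, else set it to 0."""
    return [xi * (1.0 if rng.random() < 1.0 - corruption_level else 0.0)
            for xi in x]


rng = random.Random(123)
x = [1.0] * 10000

kept_all = corrupt(x, 0.0, rng)   # corruption_level = 0: nothing is changed
zeroed = corrupt(x, 1.0, rng)     # corruption_level = 1: everything is zeroed
noisy = corrupt(x, 0.3, rng)      # about 30% of the entries become 0

fraction_zeroed = noisy.count(0.0) / float(len(noisy))
print(fraction_zeroed)
```

- Each entry is either left untouched or set to zero, so the reconstruction target (the clean input) and the corrupted input differ only at the masked positions.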

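- In the stacked autoencoder class (Stacked Autoencoders), the weights of the dA have to be shared with those of the corresponding sigmoid layer.
- For this reason, the constructor of the dA also accepts Theano variables pointing to the shared parameters.
- If those parameters are left as None, new ones are constructed.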
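- The parameter-sharing pattern of the constructor can be illustrated with a toy class. This is only a sketch of the idea: `TinyLayer` and its attributes are hypothetical, and plain lists stand in for Theano shared variables.

```python
class TinyLayer(object):
    """Toy layer that either reuses given parameters or creates its own."""

    def __init__(self, n_in, n_out, W=None, b=None):
        # Create fresh parameters only when none are passed in.
        if W is None:
            W = [[0.0] * n_out for _ in range(n_in)]
        if b is None:
            b = [0.0] * n_out
        self.W = W
        self.b = b


sigmoid_layer = TinyLayer(4, 3)
# The second object points at the same parameter objects as the first.
da_layer = TinyLayer(4, 3, W=sigmoid_layer.W, b=sigmoid_layer.b)
standalone = TinyLayer(4, 3)

da_layer.W[0][0] = 0.5          # an update made through the dA ...
print(sigmoid_layer.W[0][0])    # ... is visible in the sigmoid layer
print(standalone.W[0][0])       # the standalone layer is unaffected
```

- Because both objects hold references to the same parameters, pretraining the dA directly changes the weights that the sigmoid layer later uses.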


- Below is the final denoising autoencoder class.

In [None]:
class dA(object):
    r"""Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al.,2008
    for more details. If x is the input then equation (1) computes a partially
    destroyed version of x by means of a stochastic mapping q_D. Equation (2)
    computes the projection of the input into the latent space. Equation (3)
    computes the reconstruction of the input, while equation (4) computes the
    reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)

        y = s(W \tilde{x} + b)                                           (2)

        z = s(W' y  + b')                                                (3)

        L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)]      (4)

    """

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input) and the number of hidden units (the dimension
        d' of the latent or hidden space). The
        constructor also receives symbolic variables for the input, weights and
        bias. Such symbolic variables are useful when, for example, the input
        is the result of some computations, or when weights are shared between
        the dA and an MLP layer. When dealing with SdAs this always happens,
        the dA on layer 2 gets as input the output of the dA on layer 1,
        and the weights of the dA are used in the second stage of training
        to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: random number generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                     generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden:  number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None


        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # between -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX
            # so that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden
        self.b = bhid
        # b_prime corresponds to the bias of the visible
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]

    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs the
        same and zeroes out a randomly selected subset of size
        ``corruption_level``.
        Note : first argument of theano_rng.binomial is the shape (size) of
               random numbers that it should produce
               second argument is the number of trials
               third argument is the probability of success of any trial

                this will produce an array of 0s and 1s where 1 has a
                probability of 1 - ``corruption_level`` and 0 with
                ``corruption_level``

                The binomial function returns int64 data type by
                default.  int64 multiplied by the input
                type (floatX) always returns float64.  To keep all data
                in floatX when floatX is float32, we set the dtype of
                the binomial to floatX. As in our case the value of
                the binomial is always 0 or 1, this doesn't change the
                result. This is needed to allow the GPU to work
                correctly as it only supports float32 for now.

        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input

    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer

        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)

- Example code: http://deeplearning.net/tutorial/code/

In [11]:
! rm -f dA.py utils.py logistic_sgd.py

! wget http://deeplearning.net/tutorial/code/dA.py
! wget http://deeplearning.net/tutorial/code/utils.py
! wget http://deeplearning.net/tutorial/code/logistic_sgd.py

--2016-02-02 16:37:49--  http://deeplearning.net/tutorial/code/dA.py
Resolving deeplearning.net (deeplearning.net)... 132.204.26.28
Connecting to deeplearning.net (deeplearning.net)|132.204.26.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14736 (14K) [text/plain]
Saving to: ‘dA.py’


2016-02-02 16:37:50 (337 MB/s) - ‘dA.py’ saved [14736/14736]

--2016-02-02 16:37:50--  http://deeplearning.net/tutorial/code/utils.py
Resolving deeplearning.net (deeplearning.net)... 132.204.26.28
Connecting to deeplearning.net (deeplearning.net)|132.204.26.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5101 (5.0K) [text/plain]
Saving to: ‘utils.py’


2016-02-02 16:37:50 (437 MB/s) - ‘utils.py’ saved [5101/5101]

--2016-02-02 16:37:50--  http://deeplearning.net/tutorial/code/logistic_sgd.py
Resolving deeplearning.net (deeplearning.net)... 132.204.26.28
Connecting to deeplearning.net (deeplearning.net)|132.204.26.28|:80... connected.
HTTP request se

In [12]:
! ls -al *.py

-rw-rw-r-- 1 deepbio deepbio 14736 Feb  2 16:05 dA.py
-rw-rw-r-- 1 deepbio deepbio 17006 Feb  2 16:05 logistic_sgd.py
-rw-rw-r-- 1 deepbio deepbio  5101 Feb  2 16:05 utils.py


In [14]:
! python dA.py

... loading data
Training epoch 0, cost  63.2891694201
Training epoch 1, cost  55.7866565443
Training epoch 2, cost  54.7631168984
Training epoch 3, cost  54.2420533514
Training epoch 4, cost  53.888670659
Training epoch 5, cost  53.6203505434
Training epoch 6, cost  53.4037459012
Training epoch 7, cost  53.2219976788
Training epoch 8, cost  53.0658010178
Training epoch 9, cost  52.9295596873
Training epoch 10, cost  52.8094163525
Training epoch 11, cost  52.7024367362
Training epoch 12, cost  52.606310148
Training epoch 13, cost  52.5191693641
Training epoch 14, cost  52.4395240004
The no corruption code for file dA.py ran for 8.85m
Training epoch 0, cost  81.7714190632
Training epoch 1, cost  73.4285756365
Training epoch 2, cost  70.8632686268
Training epoch 3, cost  69.3396642015
Training epoch 4, cost  68.4134660704
Training epoch 5, cost  67.723705304
Training epoch 6, cost  67.2401360252
Training epoch 7, cost  66.849303071
Training epoch 8, cost  66.5663948395
Training epoch 9, 

![Filters learned with no corruption](dA_plots/filters_corruption_0.png) 
![Filters learned with 30% corruption](dA_plots/filters_corruption_30.png) 