# The Future of Deep Learning Research 

深度学习研究的未来

Topics
- How does Backpropagation work?
- What are the most popular deep learning algorithms today?
- 7 Research Directions I've handpicked

主题
- 误差反向传播算法是怎么工作的？
- 今天最流行的深度学习算法是什么？
- 我精心挑选的7个研究方向

At a recent AI conference, Geoffrey Hinton [remarked](https://www.axios.com/ai-pioneer-advocates-starting-over-2485537027.html) that he was “deeply suspicious” of back-propagation, and said “My view is throw it all away and start again.”

在最新的AI会议上， Geoffrey Hinton [remarked](https://www.axios.com/ai-pioneer-advocates-starting-over-2485537027.html) 说过“深度怀疑”误差反向传播算法，也说“我建议扔掉所有的，并且重新开始”

![alt text](https://assets.rbl.ms/11015654/960x540.png "Logo Title Text 1")

## The billion dollar question - how does the brain learn so well from sparse, unlabeled data?

数十亿美元的问题--大脑是如何从稀疏的未标记数据中学到这么好的？

### Let's first understand how backpropagation works

让我们先了解反向传播算法是如何工作的

![alt text](https://i.ytimg.com/vi/An5z8lR8asY/maxresdefault.jpg "Logo Title Text 1")

In 1986, Rumelhart, Hinton, and Williams published [this](http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf) paper describing 'backpropagation', a procedure for training neural networks by propagating the output error backward through the layers to adjust the weights. This paper is a big part of why the current Deep Learning boom is possible.

在1986年 Hinton发表的论文  [this](http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf) 详细介绍了一种新的神经网络优化策略, 称为 "反向传播算法"。本文是造成当前深度学习风口的可能原因。

![alt text](https://www.otexts.org/sites/default/files/fpp/images/nnet2.png "Logo Title Text 1")

In [2]:
import numpy as np

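# nonlinearity: sigmoid, or its derivative when deriv=True
# (with deriv=True, x is expected to already be a sigmoid output)
def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)

    return 1 / (1 + np.exp(-x))

# input data
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# output data
y = np.array([[0],
              [1],
              [1],
              [0]])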

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

print(syn0)
print(syn1)

[[-0.16595599  0.44064899 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.30887855]
 [-0.20646505  0.07763347 -0.16161097  0.370439  ]]
[[-0.5910955 ]
 [ 0.75623487]
 [-0.94522481]
 [ 0.34093502]]


![alt text](https://deeplearning4j.org/img/weighted_input_RBM.png "Logo Title Text 1")

![alt text](https://algebra1course.files.wordpress.com/2013/02/slide10.jpg "Logo Title Text 1")

![alt text](https://imgur.com/a/9krNv "Logo Title Text 1")

3 concepts behind Backpropagation (From Calculus)

3个在反向传播算法背后的概念（从演算角度）

1. Derivative 导数
![alt text](https://i.imgur.com/eRF9pXu.jpg "Logo Title Text 1")

2. Partial Derivative 偏导数

![alt text](https://i.imgur.com/Rergqbt.jpg "Logo Title Text 1")

3. Chain Rule 链式规则

![alt text](https://i.imgur.com/HFmGQyH.jpg "Logo Title Text 1")
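
To make the three concepts concrete, here is a minimal sketch that checks the sigmoid derivative numerically. The names `sigmoid`, `w`, and the sample inputs are assumptions made for illustration; the formula `s * (1 - s)` is the same one the notebook's `nonlin(x, deriv=True)` relies on.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1. Derivative: the slope of sigmoid at a few sample points.
x = np.array([-2.0, 0.0, 0.5, 3.0])
s = sigmoid(x)
analytic = s * (1 - s)                       # closed-form derivative

# Central finite difference as an independent check.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
derivative_matches = np.allclose(analytic, numeric, atol=1e-8)

# 2 + 3. Partial derivative and chain rule: for f(w, x) = sigmoid(w * x),
# df/dw = sigmoid'(w * x) * x, holding x fixed.
w = 0.7                                      # hypothetical weight
out = sigmoid(w * x)
chain_rule = out * (1 - out) * x
numeric_dw = (sigmoid((w + h) * x) - sigmoid((w - h) * x)) / (2 * h)
chain_rule_matches = np.allclose(chain_rule, numeric_dw, atol=1e-8)

print(derivative_matches, chain_rule_matches)
```

Backpropagation is this same chain-rule bookkeeping applied layer by layer.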

In [None]:
for j in range(60000):

    # Feed forward through layers 0, 1, and 2
    k0 = X
    k1 = nonlin(np.dot(k0, syn0))
    k2 = nonlin(np.dot(k1, syn1))

    # how much did we miss the target value?
    k2_error = y - k2

    if (j % 10000) == 0:
        print("Error:" + str(np.mean(np.abs(k2_error))))

    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    k2_delta = k2_error * nonlin(k2, deriv=True)

    # how much did each k1 value contribute to the k2 error (according to the weights)?
    k1_error = k2_delta.dot(syn1.T)

    # in what direction is the target k1?
    # were we really sure? if so, don't change too much.
    k1_delta = k1_error * nonlin(k1, deriv=True)

    # update the weights (a gradient step with an implicit learning rate of 1)
    syn1 += k1.T.dot(k2_delta)
    syn0 += k0.T.dot(k1_delta)
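
The loop can be read as the chain rule applied twice. The derivation below is a sketch that assumes a squared-error loss $E$, which the notebook never names explicitly; $\sigma$ is the sigmoid from `nonlin`, and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
k_1 &= \sigma(k_0\,\mathrm{syn}_0), \qquad
k_2 = \sigma(k_1\,\mathrm{syn}_1), \qquad
E = \tfrac{1}{2}\lVert y - k_2 \rVert^2 \\
\delta_2 &= (y - k_2)\odot k_2 \odot (1 - k_2) \\
\delta_1 &= \left(\delta_2\,\mathrm{syn}_1^{\top}\right)\odot k_1 \odot (1 - k_1) \\
\mathrm{syn}_1 &\leftarrow \mathrm{syn}_1 + k_1^{\top}\delta_2
  = \mathrm{syn}_1 - \frac{\partial E}{\partial\,\mathrm{syn}_1} \\
\mathrm{syn}_0 &\leftarrow \mathrm{syn}_0 + k_0^{\top}\delta_1
  = \mathrm{syn}_0 - \frac{\partial E}{\partial\,\mathrm{syn}_0}
\end{aligned}
```

Here $\delta_2$ and $\delta_1$ correspond to `k2_delta` and `k1_delta` in the code.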

# This is the method of choice for nearly all deep learning models trained on labeled data

这是所有 标记的深度学习模型 的选择方法

![alt text](http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png "Logo Title Text 1")

# How do artificial & biological neural nets compare?

人工神经网络 和 生物神经网络 是如何比较的？

Artificial neural networks are inspired by the hierarchical structure of the brain's neural networks.

大脑神经网络的层次结构启发了人工神经网路

![alt text](https://appliedgo.net/media/perceptron/neuron.png "Logo Title Text 1")

The brain has
- ~100 billion neurons
- Each neuron has
   - A cell body w/ connections
   - Numerous dendrites
   - A single axon
- Parallel chaining (each neuron connected to 10,000+ others)
- Great at connecting different concepts

大脑包含
-1000亿个神经元
-- 每个神经元包含
   - 细胞体 w/连接
   - 许多树突 
   - 一个单一轴突 
- 平行链 (每个神经元连接到 1万 + 其他)
- 很大的连接不同的概念
    
Computers have
- Not neurons, but transistors made in silicon!
- Serially chained (each connected to 2-3 others (logic gates))
- Great at storage and recall

计算机包含
- 没有神经元, 但是是用硅制造的晶体管!
- 串行链 (每个连接到2-3 其他 (逻辑门))
- 伟大的存储和回调

Some key differences 
- All sensory or motor systems in the brain are recurrent
- Sensory systems tend to have lots of lateral inhibition (neurons inhibiting other neurons in the same layer)
- There is no such thing as a fully connected layer in the brain; connectivity is usually sparse (though not random).
- Brains are born pre-wired to learn without supervision.
- The brain is low power. The distributed version of AlphaGo ran on 1,202 CPUs and 176 GPUs, not to train, but just to play. The brain's power consumption is ~20 W.

一些关键的区别
- 大脑中的所有感官或运动系统都是循环的
- 感官系统往往有大量的横向抑制 (神经元抑制同一层中的其他神经元)
- 在大脑中没有完全连通的层, 连接通常是稀疏的 (虽不是随机的)。.
- 大脑是天生的预设, 没有监督就可以学习。
-大脑是低功耗的。Alpha GO消耗了1202个cpu和176个gpu的电量, 不是为了训练, 而是为了运用。大脑的功耗相当于是20w。

![alt text](https://images.gr-assets.com/books/1348246481l/5080355.jpg
 "Logo Title Text 1")

"the brain is not a blank slate of neuronal layers 
waiting to be pieced together and wired-up; 
we are born with brains already structured 
for unsupervised learning in a dozen cognitive 
domains, some of which already work pretty well 
without any learning at all." - Steven Pinker

“大脑不是一个空白的神经板, 等待拼凑和它们;我们生来就已经在十几个认知领域中为无监督学习而构建了大脑, 其中一些已经很好的工作
没有任何学习。”- Steven Pinker


# Where are we today in unsupervised learning?

我们今天在哪里在用无监督学习？


## For classification

分类

- Clustering (k-means), along with related unsupervised tasks such as dimensionality reduction and anomaly detection

- 聚类(k-means算法，降维，异常检测)
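
As a rough illustration of clustering without labels, here is a minimal k-means sketch in NumPy. The function name `kmeans`, the toy blobs, and all parameter values are assumptions made for this example, not something taken from a specific library.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd-style k-means: assign points, then move centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                 # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy blobs; no labels are ever given to the algorithm.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.3, size=(20, 2))
blob_b = rng.normal(5.0, 0.3, size=(20, 2))
labels, centroids = kmeans(np.vstack([blob_a, blob_b]), k=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is that points from the same blob end up with the same id.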

![alt text](https://ds055uzetaobb.cloudfront.net/image_optimizer/ff1732816ba08239c0d3b200c3a9708070885705.jpg
 "Logo Title Text 1")

- Autoencoders 自编码器
![alt text](https://image.slidesharecdn.com/deeplearningfromanoviceperspective-150811155203-lva1-app6891/95/deep-learning-from-a-novice-perspective-16-638.jpg?cb=1439308391
 "Logo Title Text 1")

## For Generation 为生成

- Generative Adversarial Networks 生成对抗网络

![alt text](http://www.kdnuggets.com/wp-content/uploads/generative-adversarial-network.png
 "Logo Title Text 1")
- Variational Autoencoders 变分自编码器

![alt text](http://fastforwardlabs.github.io/blog-images/miriam/imgs_code/vae.4.png
 "Logo Title Text 1")

- Differentiable Neural Computer 可微神经计算机

https://en.wikipedia.org/wiki/Differentiable_neural_computer

![alt text](https://storage.googleapis.com/deepmind-live-cms/images/dnc_figure1.width-1500_C612yWA.png
 "Logo Title Text 1")

![alt text](https://storage.googleapis.com/deepmind-live-cms/images/dnc_figure2.width-1500_1bhcgxm.png
 "Logo Title Text 1")

- The controller receives external inputs and, based on these, interacts with the memory using read and write operations known as 'heads'. 
- To help the controller navigate the memory, DNC stores 'temporal links' to keep track of the order things were written in, and records the current 'usage' level of each memory location.
- It was demonstrated, for example, that a DNC can be trained to navigate a variety of rapid transit systems and then apply what it learned to get around on the London Underground. A neural network without memory would typically have to learn about each different transit system from scratch.

- 制器接收外部输入, 并在此基础上使用称为 "磁头" 的读写操作与内存进行交互。
- 为了帮助控制器导航内存, DNC 存储 "时间链路" 以跟踪所写内容的顺序, 并记录每个内存位置的当前 "使用" 级别。
- 例如, 分布式被证明是如何训练一个 DNC 来导航各种快速运输系统的, 然后运用它所学到的东西在伦敦地铁上四处走动。一个没有记忆的神经网络通常需要从零开始了解每个不同的运输系统。

Basically, many of the best unsupervised methods still require backprop (GANs, autoencoders, language models, word embeddings, etc.).

基本上, 许多最好的无监督算法 仍然需要 反向传播 (GANs、自编码器、语言模型、单词嵌入等。

So many GANs (https://deephunt.in/the-gan-zoo-79597dc8c347)

非常多的GANs (https://deephunt.in/the-gan-zoo-79597dc8c347)



# 7 Research Directions 7个研究方向

## Thesis - Unsupervised learning and reinforcement learning must be the primary modes of learning, because labels mean little to a child growing up.

论文 - 无监督学习和强化学习必须是学习的主要方式，因为标对孩子的成长意义不大。

### 1 Bayesian Deep Learning (smarter backprop)

贝叶斯深度学习 (智能反向传播)

![alt text](https://pbs.twimg.com/media/CdV2NH_W0AAfpGg.jpg
 "Logo Title Text 1")

- Deep learning struggles to model uncertainty.
- Let's use smarter weight initialization via Bayes' theorem
- So in a Bayesian setting the weights of your neural network are now random variables (sampled from a distribution)
- The parameters of this distribution are tuned via backpropagation.

- 深度学习努力建立不确定性。
- 让我们通过 Bayes Thereom 使用更智能的权重初始化
- 所以在贝叶斯设置你的神经网络的权重现在是随机变量 (抽样从分布）
- 通过反向传播的方式调整此分布的参数。
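
A minimal sketch of what "weights as random variables" can look like, assuming a Gaussian distribution per weight and the reparameterization `w = mu + sigma * eps`. The names `mu`, `rho`, and `sample_weights`, the toy linear task, and the omission of any prior (KL) term are all simplifying assumptions for illustration, not a full Bayesian treatment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target is x1 + x2, the third column is a bias input.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [2.]])

# Each weight has its own distribution N(mu, sigma^2); sigma = softplus(rho) > 0.
mu = np.zeros((3, 1))
rho = np.full((3, 1), -3.0)

def sample_weights(mu, rho, rng):
    sigma = np.log1p(np.exp(rho))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, eps

lr = 0.05
for step in range(2000):
    w, eps = sample_weights(mu, rho, rng)       # weights are sampled, not stored
    err = X @ w - y
    grad_w = X.T @ err / len(X)                 # gradient of the mean of 0.5 * err^2
    # Chain rule through w = mu + softplus(rho) * eps:
    mu -= lr * grad_w                           # dw/dmu = 1
    rho -= lr * grad_w * eps / (1.0 + np.exp(-rho))   # dsigma/drho = sigmoid(rho)

# Uncertainty estimate: predictions differ from one weight sample to the next.
preds = np.stack([X @ sample_weights(mu, rho, rng)[0] for _ in range(100)])
print(mu.ravel(), preds.std(axis=0).ravel())
```

Without a prior term nothing stops `sigma` from slowly shrinking, so a real model would add one; the point here is only that backpropagation tunes the parameters of the distribution rather than the weights directly.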

### 2 Spike-Timing-Dependent Plasticity 

钉定时 - 相关塑性

STDP is a rule that encourages neurons to 'pay more attention' to inputs that predict excitation. Suppose you usually only bring an umbrella if you have reason to think it will rain (the weather report, you see rain outside, etc.). Then you notice your neighbor carrying an umbrella even though you haven't seen any rain in the forecast, and sure enough, a few minutes later you see an updated forecast (or it starts raining). This happens a few times, and you get the idea: your neighbor seems to be getting this information (whether it is going to rain) before your current sources do. So in the future, you pay more attention to what your neighbor is doing.

STDP 是一个规则, 鼓励神经元'更多注意'来输入并预测激励。假设你通常只带一把雨伞, 如果你有理由认为会下雨 (天气报告, 或者你看到外面下雨等)。然后你注意到, 如果你看到你的邻居拿着一把雨伞, 即使你没有在天气预报中看到任何雨, 但果然, 几分钟后你会看到一个更新的预测 (或者开始下雨)。这发生了几次, 你会得到这样的想法: 你的邻居似乎在你当前的消息来源之前得到了这个信息 (不管是要下雨)。所以在未来, 你要更加关注你的邻居在做什么。

![alt text](http://ars.els-cdn.com/content/image/1-s2.0-S0925231214017007-gr1.jpg
 "Logo Title Text 1")

- Suppose we have two neurons, A and B. 
- A synapses onto B ( A->B ). 
- The STDP rule states that if A fires and B fires after a short delay, the synapse will be potentiated (i.e. B will increase the 'weight' assigned to inputs from A in the future).
- The magnitude of the weight increase shrinks as the delay between A firing and B firing grows.
- If A fires and then B fires ten seconds later, the weight change will be essentially zero.
- But if A fires and B fires ten milliseconds later, the weight update will be more substantial.
- The reverse also applies. If B fires first, then A, then the synapse will weaken, and the size of the change is again inversely proportional to the delay.

- 假设我们有两个神经元, a 和 B。
- A突触到B之上 ( A->B ). 
- STDP 规则规定, 如果A火和B火经过短暂的延迟, 突触将是强化 (即B将增加 "重量" 分配给输入从未来)。
- 重量增加的大小与A火和B火之间的延迟成反比。.
- 如果A火课, 然后十秒后B火了, 重量的变化将基本上为零。- 但如果A火了且十毫秒后B活了, 重量更新将更大。
- 反过来也适用。如果B先火, 然后是A, 那么突触就会减弱, 而改变的大小又与延迟成反比。
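
The rule above can be sketched as a single function of the delay between the two spikes. The exponential window and the constants `a_plus`, `a_minus`, and `tau_ms` are assumptions chosen for illustration; the notes only say that the change gets smaller as the delay gets longer.

```python
import math

def stdp_weight_change(delay_ms, a_plus=0.010, a_minus=0.012, tau_ms=20.0):
    """Weight change for the synapse A->B, where delay_ms = t_B - t_A.

    Positive delay (A fires first) strengthens the synapse,
    negative delay (B fires first) weakens it.
    """
    if delay_ms > 0:
        return a_plus * math.exp(-delay_ms / tau_ms)
    if delay_ms < 0:
        return -a_minus * math.exp(delay_ms / tau_ms)
    return 0.0

for delay in (10.0, 10_000.0, -10.0):
    print(delay, stdp_weight_change(delay))
```

With these assumed constants, a 10 ms delay gives a noticeable increase, a 10 second delay gives a change that is effectively zero, and reversing the order flips the sign.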

TL;DR - You cannot properly backpropagate weight updates in a graph-based network, since it's an asynchronous system (there are no layers with activations at fixed times). Instead, each neuron learns to trust the inputs that are faster than it is at the task.

你不能正确为权重更新而反向传播，在基于图表的网络中, 因为它是一个异步系统 (在固定时间没有激活的层), 所以你相信神经元的速度比你的任务要快。


### 3 Self Organizing Maps 自组织地图

![alt text](https://image.slidesharecdn.com/somchuc-110117091410-phpapp01/95/sefl-organizing-map-7-728.jpg?cb=1295255891
 "Logo Title Text 1")
 
![alt text](https://i.imgur.com/88tmp7q.png "Logo Title Text 1")

### 4 Synthetic Gradients 合成梯度

https://iamtrask.github.io/2017/03/21/synthetic-gradients/

![alt text](https://storage.googleapis.com/deepmind-live-cms-alt/documents/3-10_18kmHY7.gif
"Logo Title Text 1")

![alt text](https://iamtrask.github.io/img/synthetic_grads_paper.png
"Logo Title Text 1")

- The first layer forward propagates into the Synthetic Gradient generator (M i+1), which then returns a gradient. 
- This gradient is used instead of the real gradient (which would take a full forward propagation and backpropagation to compute). 
- The weights are then updated as normal, pretending that this Synthetic Gradient is the real gradient. 

- 第一层前向传播到合成梯度发生器(M i+1), 然后返回一个梯度。
- 这个梯度是用来代替真实梯度的 (这将需要一个完全正向传播和反向传播来计算)。
- 权重然后更新为正常, 假装这种合成梯度是真正的梯度。

Synthetic Gradient generators are nothing more than neural networks trained to take the output of a layer and predict the gradient that will likely occur at that layer.

合成梯度发生器 是仅仅一个神经网络, 被训练来采取一个层的输出和预测的梯度, 将可能发生在该层。

The whole point of this technique is to allow individual neural networks to train without waiting on each other to finish forward propagation and backpropagation.

这项技术的全部目的是允许单独的神经网络进行训练, 而不必等待对方来完成向前传播和向后传播。

- Individual layers make a "best guess" for what they think the data will say
- then update their weights according to this guess. 
- This "best guess" is called a Synthetic Gradient.
- The data is only used to help update each layer's "guesser" or Synthetic Gradient generator. 
- Most of the time, this allows individual layers to learn in isolation, which increases the speed of training.

- 每个独立的层对他们认为的数据做 "最好的猜测"
- 然后根据这个猜测更新他们的权重。
- 这个 "最佳猜想" 叫做合成梯度。
- 数据仅用于帮助更新每个层的 "猜" 或合成梯度发生器。
- 这允许 (大部分时间), 单独的层进行隔离学习, 这增加了训练的速度。
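
The "guess, update, then correct the guesser" loop can be sketched on the toy dataset from the backpropagation example. This is a simplified illustration under stated assumptions: the generator `M` is a plain linear map, it is corrected with the true gradient on every step, and the names `W1`, `W2`, `M`, `lr`, and `lr_m` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.uniform(-1, 1, (3, 4))   # layer 1 weights
W2 = rng.uniform(-1, 1, (4, 1))   # layer 2 weights
M = np.zeros((4, 4))              # linear synthetic gradient generator for layer 1
lr, lr_m = 0.5, 0.05

for step in range(2000):
    # Layer 1 makes its "best guess" and computes its update right away,
    # without waiting for layer 2 or the loss.
    h = sigmoid(X @ W1)
    synthetic_grad = h @ M                              # predicted dLoss/dh
    W1_next = W1 - lr * X.T @ (synthetic_grad * h * (1 - h))

    # Layer 2 trains as usual and produces the true gradient for h.
    out = sigmoid(h @ W2)
    delta_out = (out - y) * out * (1 - out)
    true_grad = delta_out @ W2.T                        # actual dLoss/dh
    W2 -= lr * h.T @ delta_out

    # The data is only used here to improve the guesser:
    # regress the synthetic gradient onto the true one.
    M -= lr_m * h.T @ (synthetic_grad - true_grad)
    W1 = W1_next

guess_error = np.abs(h @ M - true_grad).mean()
print(guess_error)
```

In a larger system the correction signal could arrive late or itself be synthetic, which is where the decoupling pays off; here everything runs in one loop purely to keep the sketch short.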

### 5 Evolutionary Strategies (https://blog.openai.com/evolution-strategies/) 进化策略 

1. Create a random, initial brain for the bird (this is the neural network, with 300 neurons in our case)
2. At every epoch, create a batch of modifications to the bird’s brain (also called “mutations”)
3. Play the game using each modified brain and calculate the final reward
4. Update the brain by pushing it towards the mutated brains, proportionate to their relative success in the batch (the more reward a brain has been able to collect during a game, the more it contributes to the update)
5. Repeat steps 2-4 until a local maximum for rewards is reached.

1. 创建一个随机的, 最初的大脑为鸟 (这是神经网络, 300个神经元在我们的案例中)
2. 在每个代, 都要对鸟的大脑进行一系列的修改 (也称为 "突变")
3. 使用每个修改过的大脑来玩游戏并计算最后的奖赏
4. 通过将大脑推向被突变的大脑, 使之更新, 与他们在间歇中的相对成功相称 (在游戏中大脑能够收集的奖励越多, 它对更新的贡献就越大)
5. 重复步骤2-4 直到达到当地最高奖励。
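
The five steps above can be sketched on a toy problem. Here the "brain" is just a 3-number parameter vector and the "game" is a made-up reward that peaks at a hidden `target`; those, along with `npop`, `sigma`, and `alpha`, are assumptions for illustration rather than the setup from the linked post.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.5, 0.1, -0.3])     # hypothetical optimum, unknown to the learner

def reward(w):
    # Higher is better; the maximum (0) is reached at w == target.
    return -np.sum((w - target) ** 2)

w = rng.standard_normal(3)              # 1. random initial "brain"
initial_reward = reward(w)
npop, sigma, alpha = 50, 0.1, 0.005     # population size, noise scale, step size

for epoch in range(400):
    noise = rng.standard_normal((npop, 3))                       # 2. batch of mutations
    rewards = np.array([reward(w + sigma * n) for n in noise])   # 3. play each one
    # 4. move toward mutations in proportion to their relative success
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    w = w + alpha / (npop * sigma) * noise.T @ advantage
    # 5. repeat

print(w, reward(w))
```

Only forward evaluations of `reward` are needed; no gradient of the reward is ever computed.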

Code: 
https://gist.github.com/karpathy/77fbb6a8dac5395f1b73e7a89300318d

- Mutation, selection, crossover via a fitness function 
- ES only requires the forward pass of the policy and does not require backpropagation (or value function estimation), which makes the code shorter and 2-3 times faster in practice.
- RL is a “guess and check” on actions, while ES is a “guess and check” on parameters.

- 变异, 选择, 通过合适功能来交叉
- ES 只需要策略的前向传递, 不需要反向传播 (或值函数估计), 这使得代码更短, 在实践中的速度在2-3 倍之间。
- RL 是对操作的 "猜测和检查", 而 ES 是对参数的 "猜测和检查"。

### 6 Moar Reinforcement Learning，Moar 强化学习

![alt text](https://i.imgur.com/ytKyctO.jpg
"Logo Title Text 1")

![alt text](https://adeshpande3.github.io/assets/Cover6th.png
"Logo Title Text 1")


### 7 Better hardware.  更高的硬件

- neuromorphic chips
- TPUs
- Wiring up transistors in parallel like the brain! 

- 神经芯片
- TPUs
- 像大脑一样平行地布线晶体管!

![alt text](http://4.bp.blogspot.com/-QBBjkC58sxo/ThYEopD5XPI/AAAAAAAAL9M/jcab9XA7eyY/s1600/SyNAPSE2goals.jpg
"Logo Title Text 1")

# My Conclusion? I agree with Andrej Karpathy

我的结论？我同意 Andrej Karpathy

## Let's create multi-agent simulated environments that heavily rely on reinforcement learning + evolutionary strategies

让我们创建大量依赖于强化学习 + 进化策略的多智能体模拟环境

![alt text](https://i.imgur.com/KnFCWQS.png
"Logo Title Text 1")

![alt text](https://i.imgur.com/spo1Ife.png
"Logo Title Text 1")

It comes down to the exploration-exploitation trade-off. You need exploitation to refine deep learning techniques, but without exploration (other techniques) you will never get the paradigm shift we need to go beyond classifying cat pictures and beating humans in artificial games.

它归结为勘探开发的权衡。你需要利用来提炼深度学习技巧, 但没有探索 (其他技术) 你将永远不会得到范式转变, 我们需要超越分类猫图片和殴打人类在人工游戏