This notebook is for creating an annotated bibliography of works.

Each paper should be read through, and given a brief summary. An example template of what information should be taken from each paper, and how to format it, is given below.

# Index

**Section 1**

[World Models](#worldmodels)

[Asynchronous Methods for Deep RL](#asynch_rl)

___
___

<a id='worldmodels'></a>

# World Models
(https://arxiv.org/pdf/1803.10122.pdf)

**keywords**: *Google Brain Project, VAE, RNN, RL*

### Main idea
1. Generate an environment for an RL agent to learn in. This paper focuses on generating a video game frame by frame with a learned world model, and trains a small RL agent inside that generated environment.
2. The system has three components: a VAE (V) encodes the visual input from every frame of the game, an RNN (M) encodes historical information from observing the game, and a linear model (C) converts these encoded inputs to actions.
3. The figure below shows the high-level diagram of the entire system. The V model encodes each frame into a latent vector $z_{t}$ at time $t$. The M model is an RNN with an additional output layer called a Mixture Density Network, which outputs a probability distribution $P(z_{t+1})$ over the next latent visual vector $z_{t+1}$ instead of a deterministic prediction of $z_{t+1}$. The RNN models $P(z_{t+1} | a_{t}, z_{t}, h_{t})$, where $a_{t}$ is the action taken at time $t$ and $h_{t}$ is the hidden state of the RNN at time $t$. The controller model C is a single linear layer which takes the concatenated $[z_{t} h_{t}]$ and outputs the action $a_{t}$. The controller is trained separately from V and M.  
<img src="./pics/worldmodels/HLD.png" alt="Drawing" style="width: 600px;"/>
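
To make the data flow between V, M and C concrete, here is a minimal NumPy sketch of one rollout. It is not the paper's implementation: `encode_frame` and `rnn_step` are hypothetical stand-ins for the trained VAE and RNN, the Mixture Density Network output is left out, and the layer sizes and random weights are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch): latent z, hidden h, action a.
Z_DIM, H_DIM, A_DIM = 32, 256, 3

def encode_frame(frame):
    """Stand-in for V: squash the first Z_DIM pixel values into a latent z_t."""
    return np.tanh(frame.reshape(-1)[:Z_DIM])

def rnn_step(z, a, h, W_m):
    """Stand-in for M's recurrence: h_{t+1} = tanh(W_m [z_t, a_t, h_t])."""
    return np.tanh(W_m @ np.concatenate([z, a, h]))

def controller(z, h, W_c, b_c):
    """C: a single linear layer applied to the concatenated [z_t, h_t]."""
    return W_c @ np.concatenate([z, h]) + b_c

# Random weights stand in for trained parameters.
W_m = rng.normal(scale=0.1, size=(H_DIM, Z_DIM + A_DIM + H_DIM))
W_c = rng.normal(scale=0.1, size=(A_DIM, Z_DIM + H_DIM))
b_c = np.zeros(A_DIM)

h = np.zeros(H_DIM)
actions = []
for t in range(5):
    frame = rng.random((64, 64, 3))   # fake observation in place of a game frame
    z = encode_frame(frame)           # V: frame -> z_t
    a = controller(z, h, W_c, b_c)    # C: [z_t, h_t] -> a_t
    h = rnn_step(z, a, h, W_m)        # M: (z_t, a_t, h_t) -> h_{t+1}
    actions.append(a)
```

Note that the controller only ever sees the compact $[z_{t} h_{t}]$ vector, never the raw frame, which is what keeps C so small.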

### Utility
This is a good proof of concept that an RL agent can learn in an environment generated by another model. The paper mainly tries to emulate the way humans predict the future state of their environment and then act accordingly. 

### Criticisms & Failures
1. Since the world model is an image generative model, it requires a lot of training, hence the choice of a single linear layer for the Controller network. The Controller is fitted separately using CMA-ES (Covariance Matrix Adaptation Evolution Strategy) rather than trained end-to-end with V and M.
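
To illustrate what fitting a small parameter vector with an evolution strategy looks like, here is a minimal sketch. It is a simplified Gaussian evolution strategy, not CMA-ES: there is no covariance matrix adaptation, only a mean and a shrinking step size. The `evolve` helper, the toy fitness function and all hyperparameters are assumptions made for this example; in the paper's setting the fitness would be the reward the controller collects in a rollout.

```python
import numpy as np

def evolve(fitness, n_params, pop_size=32, elite_frac=0.25,
           sigma=0.5, generations=60, seed=0):
    """Simplified evolution strategy (NOT full CMA-ES): sample candidates
    around a mean, keep the best ones, and move the mean to their average."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        # Sample a population of candidate parameter vectors.
        candidates = mean + sigma * rng.normal(size=(pop_size, n_params))
        scores = np.array([fitness(c) for c in candidates])
        # Keep the highest-scoring candidates and recentre on them.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        sigma *= 0.95  # shrink the search radius over time
    return mean

# Toy fitness: higher is better, maximised at `target`.
target = np.array([1.0, -2.0, 0.5])

def toy_fitness(params):
    return -np.sum((params - target) ** 2)

best = evolve(toy_fitness, n_params=3)
```

No gradients are needed, only a score per candidate, which is why this kind of method suits a small controller whose reward comes from whole rollouts.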

### Commentary
Really interesting: the paper says, "if agent needs to learn complex motor skills to walk around its environment, the world model will learn to imitate its own C model that has already learned to walk." This way, the world model encodes short-term memory into long-term memory by never removing elements from the environment.

<a id='asynch_rl'></a>

# Asynchronous Methods for Deep Reinforcement Learning
(https://arxiv.org/abs/1602.01783)

**keywords**: *Google DeepMind, RL, Performance Enhancements*

### Main idea
In typical deep reinforcement learning, you keep a memory of recent states (experience replay) and sample from it in shuffled order when training. This is done because successive states are highly correlated (one frame barely differs from the previous one). If you trained on states in the order they occur, the network would adjust its weights to perform well at the current moment rather than well in general.
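
The replay memory described above can be sketched as a fixed-size buffer that is sampled uniformly at random. This is a minimal illustration rather than any particular library's API: the `ReplayBuffer` class, its method names, and the integer "states" are assumptions made for this example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of recent transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between successive states.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100)
for t in range(150):
    # Fake transitions: the "state" is just the time step.
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(8)  # a minibatch drawn from across the stored history
```

Because the minibatch mixes transitions from many different time steps, a gradient step on it is less biased toward whatever the agent happened to see most recently.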

The authors propose an alternative to the replay memory: several agents train asynchronously in different environments while sharing the same underlying model. Because the agents are in different states at any given time, their updates are far less correlated, so no memory of past states needs to be kept for training.
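
As a toy illustration of several workers updating one shared model, the sketch below runs four Python threads that each compute noisy gradients on a simple quadratic objective and apply them to a shared parameter vector. Everything here is an assumption made for illustration: the objective stands in for an RL loss, each thread's random seed stands in for its own environment, and the lock is a simplification of this sketch rather than a statement about how the paper synchronises updates.

```python
import threading
import numpy as np

TARGET = np.array([2.0, -1.0])   # optimum of the toy objective
shared_params = np.zeros(2)      # the single model shared by all workers
lock = threading.Lock()          # simplification: serialise reads and writes

def worker(worker_id, n_steps=300, lr=0.05):
    # A per-worker seed plays the role of each agent's own environment.
    rng = np.random.default_rng(worker_id)
    for _ in range(n_steps):
        with lock:
            local = shared_params.copy()   # read the current shared model
        # Noisy gradient of 0.5 * ||params - TARGET||^2 from this worker's data.
        grad = (local - TARGET) + rng.normal(scale=0.01, size=2)
        with lock:
            shared_params[:] = shared_params - lr * grad  # apply to shared model

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each worker may compute its gradient from a slightly stale copy of the parameters; on this toy problem the shared vector still ends up close to `TARGET`.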

They propose several asynchronous algorithms and show that these train faster and more efficiently than a standard RL agent.


<img src="./pics/asynch_rl/performance.png" alt="Drawing" style="width: 600px;"/>

### Utility

The obvious utility is that, if we put in the effort to use this, training should in theory be faster and more efficient, letting us run more experiments. 


### Criticisms & Failures

n/a (Aka I'm not well-read enough to comment about it -Alex)


### Commentary

Not sure how useful this could actually be to us, for two reasons: engineering and resources. It's definitely more overhead to engineer this, and running several RL agents at once would require either scheduling or multiple GPUs (a problem we probably don't want to broach here).
