The paper extends deep Q-networks to “unbounded” action spaces by embedding the natural-language action descriptions that the environment provides and greedily choosing the action with the highest Q-value. A novel deep reinforcement relevance network (DRRN) is developed to handle actions defined through natural language in a text-game setting. The authors demonstrate that although the DRRN architecture uses fewer parameters, it converges faster than the baseline deep Q-networks.
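
As a minimal sketch of this action-selection rule (the name `q_value` is illustrative, not from the paper's code), the agent simply scores whatever action texts the environment offers at the current step and takes the best one:

```python
# Greedy choice over a per-step, possibly changing set of natural-language
# actions. `q_value` stands for any learned scorer of (state text, action text)
# pairs, such as the DRRN inner product described below.
def greedy_action(state_text, candidate_action_texts, q_value):
    scores = [q_value(state_text, a) for a in candidate_action_texts]
    return max(range(len(scores)), key=scores.__getitem__)  # index of best action
```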

Let 𝕊 denote the state space and 𝔸 the entire action space, i.e., all unique actions encountered over time. A vanilla Q-learning recursion needs to maintain a table of size |𝕊|×|𝔸|, which is not feasible for a large state space. An important structural difference from the DQN is that the DRRN not only extracts a fixed-dimension embedding from the state text but also builds a distributed representation for each action text, so that these vectors can capture both semantic and syntactic information. The authors attribute the DRRN's success in handling a natural-language action space to the fact that both the action texts and the state texts are mapped into a finite-dimensional embedding space. Learning employs an experience-replay strategy, in which an exploration policy interacts with the environment to collect data trajectories that are stored and replayed for updates. Training aligns the embedding of a state text with the embeddings of its relevant action texts, so that their scalar product, which defines the Q-value of the action, becomes higher.

The DRRN is evaluated on two popular text games and compared with two baselines: a linear model and a neural-network baseline (NN-RL, with two hidden layers). For the baselines, the state and action description strings are fed together as a bag-of-words input, and the number of outputs equals the maximum number of actions. Softmax selection over the Q-values is used to balance exploration and exploitation.
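
Below is a minimal sketch, not the authors' implementation, of the two-tower scoring and the softmax exploration rule; the layer sizes, bag-of-words featurization, and PyTorch framing are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DRRN(nn.Module):
    """Two towers embed state text and action text into a common space;
    Q(s, a) is the inner product of the two embeddings."""

    def __init__(self, vocab_size, embed_dim=100, hidden=100):
        super().__init__()
        self.state_net = nn.Sequential(                 # state-text tower
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim))
        self.action_net = nn.Sequential(                # action-text tower
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim))

    def forward(self, state_bow, action_bows):
        """state_bow: (vocab_size,) bag-of-words of the state text;
        action_bows: (k, vocab_size) bag-of-words of the k candidate actions.
        Returns the k Q-values."""
        s = self.state_net(state_bow)       # (embed_dim,)
        a = self.action_net(action_bows)    # (k, embed_dim)
        return a @ s                        # (k,) inner products = Q(s, a_i)


def softmax_action(q_values, temperature=1.0):
    """Boltzmann exploration: sample an action index in proportion to exp(Q/T)."""
    probs = torch.softmax(q_values / temperature, dim=0)
    return int(torch.multinomial(probs, num_samples=1))
```

During data collection, each sampled transition (state text, chosen action text, reward, next state text and its candidate actions) would be pushed into a replay buffer and later replayed for Q-learning updates on these inner-product Q-values.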

One shortcoming of the paper is that the experiments do not exercise the complexity of natural language to its full extent. For instance, the largest game tested uses a vocabulary of only 2258 words. More importantly, the state description at each time step is almost always limited to a description of the current scene, with no use of the higher-level concepts present in natural language. The environment also provides only a small number (2-4) of candidate actions to evaluate at each step, so the agent never has to pick an action from a large set. This architecture has also been leveraged to learn to attend over actions in settings where multiple actions are taken at each state (Slate MDPs).