# Using invariant representations to guide future exploration

I would like to further explore the idea of leveraging abstract concepts learned from previous experience to guide an agent's behavior when it faces a novel problem. The aim is to incrementally build and expand the agent's knowledge base, and to use it to construct a behavioral policy that guides exploration while the agent learns the optimal policy.

## Implementation Plan

The following is a rough outline of the implementation plan:
- [ ] Allow training of agents via Q-Learning for a wide range of tasks. At this stage, the goal is to set up the training framework, logging, metrics, and any other supporting tooling. In addition, make sure that the behavioral policy of the Q-Learning agent is swappable.
- [ ] Create an invariant representation of learned experience. For each new task, first sample tuples of experience, then train an autoencoder network on them and add that task to the knowledge base.
- [ ] Create a behavioral policy for a new task from the knowledge base. At this stage, we alter the training process to:
  - [ ] Let the agent explore for a limited amount of time.
  - [ ] Determine the set of most relevant tasks.
  - [ ] Construct a behavioral policy from the set of most relevant tasks and integrate it into Q-Learning.
- [ ] Lastly, run the experiment phase. Here, a carefully selected set of tasks needs to be run, both individually as a baseline and then in sequence using the knowledge base. If time permits, running several experiments with differently ordered tasks would be useful for estimating the standard deviation.
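
To make the "swappable behavioral policy" requirement from the first step concrete, here is a minimal sketch using tabular Q-Learning on a toy chain environment. It is not the `LLDQNTrainer` implementation. The names `epsilon_greedy`, `guided`, `q_learning`, and `greedy_actions` are hypothetical, and the `suggest` callback only stands in for whatever the knowledge base would eventually recommend.

```python
import random
from typing import Callable, List

# A behavioral policy maps (state, Q-values of that state) to an action index.
BehaviorPolicy = Callable[[int, List[float]], int]


def epsilon_greedy(epsilon: float) -> BehaviorPolicy:
    """Baseline exploration: random action with probability epsilon, else greedy."""
    def policy(state: int, q_values: List[float]) -> int:
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        best = max(q_values)
        # Break ties at random so an all-zero table still explores.
        return random.choice([a for a, q in enumerate(q_values) if q == best])
    return policy


def guided(suggest: Callable[[int], int], trust: float,
           fallback: BehaviorPolicy) -> BehaviorPolicy:
    """Hypothetical knowledge-base policy: follow `suggest` with probability `trust`."""
    def policy(state: int, q_values: List[float]) -> int:
        if random.random() < trust:
            return suggest(state)
        return fallback(state, q_values)
    return policy


def q_learning(behavior: BehaviorPolicy, n_states: int = 6, episodes: int = 200,
               alpha: float = 0.5, gamma: float = 0.9,
               max_steps: int = 100) -> List[List[float]]:
    """Tabular Q-Learning on a chain: action 0 moves left, action 1 moves right.

    The last state is terminal and yields reward 1. The update rule never
    looks at how the action was chosen, so `behavior` can be swapped freely.
    """
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = behavior(state, q[state])
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_actions(q: List[List[float]]) -> List[int]:
    """Greedy action for every non-terminal state."""
    return [1 if row[1] > row[0] else 0 for row in q[:-1]]


random.seed(0)
q_baseline = q_learning(epsilon_greedy(0.2))
q_guided = q_learning(guided(lambda state: 1, trust=0.5,
                             fallback=epsilon_greedy(0.2)))
print(greedy_actions(q_baseline), greedy_actions(q_guided))
```

Because Q-Learning is off-policy, both runs update toward the same greedy target; only the way experience is collected differs, which is what should let a knowledge-base policy be dropped in without touching the learning rule.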

In [1]:
import os
import sys

# Make the notebook's working directory importable so local modules can be found.
project_root = os.path.abspath('.')
if project_root not in sys.path:
    sys.path.append(project_root)

%load_ext autoreload
%autoreload 2

In [2]:
from models import Task, LLDQNTrainer, KnowledgeBase

In [3]:
knowledge_base = KnowledgeBase()
task = Task(env_name="CartPole-v1", knowledge_base=knowledge_base)
trainer = LLDQNTrainer(task)

[0.8        0.20563899 0.07758779 0.05       0.05      ]


In [4]:
trainer.run()

Epoch #1:   1%|          | 60/10000 [00:00<00:35, 280.64it/s, env_step=60, len=10, loss=5.366, n/ep=1, n/st=10, rew=10.00]

Epochs 1.
Decay Rate: 0.8
---


Epoch #1: 10001it [00:23, 422.14it/s, env_step=10000, len=226, loss=0.291, n/ep=0, n/st=10, rew=226.00]                           


Epoch #1: test_reward: 385.020000 ± 103.864236, best_reward: 385.020000 ± 103.864236 in #1


Epoch #2:   1%|          | 90/10000 [00:00<00:19, 510.21it/s, env_step=10090, len=165, loss=0.295, n/ep=0, n/st=10, rew=165.00]

Epochs 2.
Decay Rate: 0.20563899166908206
---


Epoch #2: 10001it [01:06, 150.63it/s, env_step=20000, len=231, loss=0.243, n/ep=0, n/st=10, rew=231.00]                           


Epoch #2: test_reward: 437.710000 ± 102.425123, best_reward: 437.710000 ± 102.425123 in #2


Epoch #3:   1%|1         | 120/10000 [00:00<00:14, 698.81it/s, env_step=20120, len=231, loss=0.253, n/ep=0, n/st=10, rew=231.00]

Epochs 3.
Decay Rate: 0.07758779419403619
---


Epoch #3: 10001it [01:07, 148.19it/s, env_step=30000, len=214, loss=0.059, n/ep=0, n/st=10, rew=214.00]                           


Epoch #3: test_reward: 193.330000 ± 13.449948, best_reward: 437.710000 ± 102.425123 in #2


Epoch #4:   1%|          | 80/10000 [00:00<00:20, 487.29it/s, env_step=30080, len=214, loss=0.058, n/ep=0, n/st=10, rew=214.00]

Epochs 4.
Decay Rate: 0.05
---


Epoch #4: 10001it [00:21, 467.67it/s, env_step=40000, len=202, loss=0.006, n/ep=0, n/st=10, rew=202.00]                           


Epoch #4: test_reward: 182.650000 ± 11.956902, best_reward: 437.710000 ± 102.425123 in #2


{'duration': '256.02s',
 'train_time/model': '35.12s',
 'test_step': 245429,
 'test_episode': 900,
 'test_time': '165.28s',
 'test_speed': '1484.95 step/s',
 'best_reward': 437.71,
 'best_result': '437.71 ± 102.43',
 'train_step': 40000,
 'train_episode': 725,
 'train_time/collector': '55.62s',
 'train_speed': '440.80 step/s'}
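
The timing fields in the summary above are strings with a unit suffix. The sketch below converts them to numbers and checks how the wall-clock time splits between training and evaluation. The values are copied from the output above; the helper name `seconds` is hypothetical.

```python
# Timing fields copied from the run summary above.
summary = {
    "duration": "256.02s",
    "train_time/model": "35.12s",
    "train_time/collector": "55.62s",
    "test_time": "165.28s",
}


def seconds(value: str) -> float:
    """Convert a string such as '35.12s' into a float number of seconds."""
    return float(value.rstrip("s"))


total = seconds(summary["duration"])
# Fraction of the total run spent in each phase.
shares = {name: seconds(value) / total
          for name, value in summary.items() if name != "duration"}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

In this run the three phases add up to the reported duration, and evaluation accounts for roughly 65% of it, so lowering the number of test episodes per epoch is the most direct way to shorten the experiments.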