## Feasibility study of using Reinforcement Learning to grow NCAs

This notebook is a sanity check: a deliberately simplified test to see whether this approach is worth pursuing.

The simplified problem is to color every cell of a 3x3 grid a specified color.

In [1]:
from stable_baselines3 import PPO

from NCAEnv import NCAEnv

In [2]:
env = NCAEnv()
state, _ = env.reset()  # reset() returns (observation, info)

In [3]:
env.action_space, env.observation_space

(Discrete(18), Box(0, 1, (9,), int32))
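
The spaces above suggest 9 cells with two possible values each, which would give 9 x 2 = 18 discrete actions. `NCAEnv` itself is not shown in this notebook, so the following is only a minimal sketch of an environment with the same interface. The class name `ToyColorEnv`, the action encoding (`action = cell * 2 + color`), the reward (minus the number of cells that differ from the target), the random initial grid, and `max_steps` are all assumptions made for illustration, not the actual implementation.

```python
import random


class ToyColorEnv:
    """Hypothetical stand-in for NCAEnv (NOT the author's implementation)."""

    N_CELLS = 9   # 3x3 grid, flattened row by row
    N_COLORS = 2  # each cell holds 0 or 1

    def __init__(self, target=1, max_steps=1000, seed=None):
        self.target = target        # color every cell should end up with
        self.max_steps = max_steps  # assumed episode length limit
        self.rng = random.Random(seed)
        self.grid = [0] * self.N_CELLS
        self.steps = 0

    def reset(self):
        # Assumption: each episode starts from a random grid.
        self.grid = [self.rng.randrange(self.N_COLORS) for _ in range(self.N_CELLS)]
        self.steps = 0
        return list(self.grid), {}

    def step(self, action):
        # Assumed encoding: 18 actions = 9 cells x 2 colors.
        cell, color = divmod(action, self.N_COLORS)
        self.grid[cell] = color
        self.steps += 1

        # Assumed reward: minus the number of cells that still differ from the target.
        wrong = sum(1 for value in self.grid if value != self.target)
        reward = -wrong
        terminated = wrong == 0                   # grid fully colored
        truncated = self.steps >= self.max_steps  # step limit reached
        return list(self.grid), reward, terminated, truncated, {}

    def render(self):
        # Print the flattened grid as three rows of three cells.
        for row in range(3):
            print(self.grid[3 * row: 3 * row + 3])
```

Under this encoding, action `5` would set cell 2 (top-right) to color 1. A real environment for Stable-Baselines3 would additionally subclass `gymnasium.Env` and declare `action_space` and `observation_space`.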

In [4]:
action = env.action_space.sample()
action

0

In [5]:
next_state, reward, done, _, _ = env.step(action)
env.render()

[[0 1 0]
 [1 0 1]
 [1 1 1]]


In [6]:
# PPO is overkill for this toy problem, but it will serve us well later
model = PPO("MlpPolicy", env, verbose=1, device="cpu")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
model.learn(total_timesteps=20_000)

----------------------------------
| rollout/           |           |
|    ep_len_mean     | 874       |
|    ep_rew_mean     | -3.96e+03 |
| time/              |           |
|    fps             | 7009      |
|    iterations      | 1         |
|    time_elapsed    | 0         |
|    total_timesteps | 2048      |
----------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 542         |
|    ep_rew_mean          | -2.25e+03   |
| time/                   |             |
|    fps                  | 4517        |
|    iterations           | 2           |
|    time_elapsed         | 0           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014375439 |
|    clip_fraction        | 0.106       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.88       |
|    explained_variance   | -0.00048    |
|    learning_rate  

<stable_baselines3.ppo.ppo.PPO at 0x1055bd490>

In [8]:
obs, _ = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:  # stop when solved or when the step limit is hit
        break
env.render()

[[1 1 1]
 [1 1 1]
 [1 1 1]]
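
A single rendered grid shows that the policy can reach the target, but not how many steps it took. The loop above could be wrapped in a small helper that also reports the episode length and return. This is a sketch under assumptions: `rollout` and `policy` are hypothetical names, and the environment is only assumed to follow the `reset()` / `step()` signatures used in this notebook.

```python
def rollout(env, policy, max_steps=100):
    """Run one episode and return (steps_taken, total_reward, solved).

    `policy` is any callable mapping an observation to an action.
    `solved` is True only if the environment reported `terminated`.
    """
    obs, _ = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return step, total_reward, terminated
    # Step budget exhausted without the episode ending.
    return max_steps, total_reward, False
```

With the trained model, one possible policy would be `lambda obs: int(model.predict(obs, deterministic=True)[0])`, since `model.predict` returns an `(action, state)` pair.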
