# TeBaG-RL - Text-based Adventure Game Reinforcement Learning

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/floriandonhauser/TeBaG-RL/blob/main/tf_TextWorld_RL.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/floriandonhauser/TeBaG-RL">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
</table>

This is the Jupyter notebook for our project for the TUM seminar "Applied Deep Learning for Natural Language Processing". We attempted to tackle the difficult combination of NLP and reinforcement learning, in the form of Deep Q-Learning, for text-based adventure games such as the classic [Zork](https://en.wikipedia.org/wiki/Zork).

We use the [TextWorld](https://github.com/microsoft/TextWorld) environment with a custom PyEnvironment wrapper to make it usable with TensorFlow. In addition, we utilize custom reward functions and a biased replay buffer with an accept/reject sampling method.

Two agents are trained and tested: one utilizing a pre-trained [NNLM](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) embedding and one using a pre-trained [smallBert](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) model. More details can be found in the accompanying materials.

## General Setup

Install the necessary Python packages on either a local machine or Google Colab:

In [None]:
%%capture
!pip install tf-agents
!pip install textworld
!pip install tensorflow-text

Google Colab set-up with files in Google Drive:

In [None]:
%%capture
from google.colab import drive
drive.mount("/content/drive")
import os
PROJECT_PATH = "/content/drive/MyDrive/TeBaG-RL/"
os.chdir(PROJECT_PATH)

***Or alternatively***, clone the repository from GitHub onto the (Google Colab) machine:

In [None]:
%%capture
!git clone https://github.com/floriandonhauser/TeBaG-RL.git

In [None]:
import os
PROJECT_NAME = "TeBaG-RL/"
os.chdir(PROJECT_NAME)
PROJECT_PATH = os.getcwd()

### Imports

In [None]:
from resources import DEFAULT_PATHS
from tf_train_loop import TWTrainer

%load_ext autoreload
%autoreload 2

## Generate games


Generate a simple debug game and a large dataset of train and eval games.

In [None]:
os.chdir(PROJECT_PATH + "/scripts/")

This cell creates **a single** debug game, which is the minimum needed to run anything.

In [None]:
%%shell
bash ./make_debug_game.sh

**Only** run this if necessary. Depending on the system, this process **will take hours**.

In [None]:
%%capture
%%shell
bash ./make_allgames.sh

In [None]:
os.chdir(PROJECT_PATH)

## Play a game yourself

In [None]:
os.chdir("resources/")

In [None]:
!tw-play game_th_lvl2_simple.ulx

In [None]:
os.chdir(PROJECT_PATH)

## Test environment

We wrote a unit test for the environment creation, as we needed to write a TensorFlow wrapper for the TextWorld game environment. This wrapper also enabled us to add custom rewards and punishments for different scenarios.

The environment test function uses TensorFlow built-in utility methods and a random agent. The printed output shows the input command ("Doing: ___"), the full environment state that we work with (only "obs", "description" and "inventory" are passed to the agent), and the resulting reward.
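
As a rough illustration of the filtering described above, the sketch below keeps only the three fields that are handed to the agent. The layout of the `state` dictionary, its extra keys, and the helper name `agent_view` are assumptions made for this example; they are not taken from the wrapper's actual code.

```python
# Hypothetical sketch: reduce a full environment state to the agent's input.
AGENT_KEYS = ("obs", "description", "inventory")  # fields named in the text above


def agent_view(state):
    """Return only the text fields the agent sees; missing fields become ''."""
    return {key: str(state.get(key, "")) for key in AGENT_KEYS}


# Made-up state for demonstration; the real state layout may differ.
full_state = {
    "obs": "You open the chest.",
    "description": "You are in a small attic.",
    "inventory": "You are carrying: a key.",
    "score": 0,
    "admissible_commands": ["take key", "go north"],
}
print(agent_view(full_state))
```

Everything outside `AGENT_KEYS` (here the made-up `score` and `admissible_commands` entries) stays available to the environment for reward computation without being shown to the agent.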

In [None]:
from tests import test_environment_creation

In [None]:
test_environment_creation()

## Run automatic vocab generation

We've implemented a feature in the environment wrapper that checks at each time step whether an interactable entity within the TextWorld game is in the agent's current object vocabulary. All missing entities are stored and can be appended to the object vocabulary file at the end of a game cycle.

This enables automatic vocabulary generation for a set of training games. The method below uses a random agent. However, it needs the full training set generated in the "Generate games" section above.
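
The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind `run_auto_vocab`: the helper names, the one-word-per-line file format, and the file name `vocab_objects.txt` are all hypothetical.

```python
import os
import tempfile


def find_missing_entities(entities, vocab):
    """Return entities that are not yet in the vocabulary, in first-seen order."""
    missing = []
    for entity in entities:
        if entity not in vocab and entity not in missing:
            missing.append(entity)
    return missing


def append_to_vocab_file(path, words):
    """Append one word per line to the vocabulary file (assumed format)."""
    with open(path, "a", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")


# Demonstration with a throwaway vocabulary file and made-up entities.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab_objects.txt")  # hypothetical name
    append_to_vocab_file(vocab_path, ["key", "chest"])

    with open(vocab_path, encoding="utf-8") as handle:
        vocab = set(handle.read().splitlines())

    # Entities encountered over several time steps of one game cycle.
    seen = ["key", "lantern", "chest", "lantern", "wooden door"]
    missing = find_missing_entities(seen, vocab)
    append_to_vocab_file(vocab_path, missing)  # done once, at the end of the cycle

    with open(vocab_path, encoding="utf-8") as handle:
        final_vocab = handle.read().splitlines()

print(missing)
print(final_vocab)
```

Collecting the missing entities first and writing them once per game cycle keeps file access out of the per-step loop.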

In [None]:
from environments import run_auto_vocab

In [None]:
run_auto_vocab()

## Train

Set rewards for training.

* **"win_lose_value"**: Value to be rewarded/punished for winning/losing the current game
* **"max_loop_pun"**: Punishment for revisiting a state that is already in the buffered set of hashed recent states, to avoid rewarding loops. (E.g. the agent would drop an item, go to another room, go back and pick up the item, while being rewarded for using items and changing the environment.)
* **"change_reward"**: Reward for changing either the inventory (using/taking an item) or the environment (exploring or opening an object).
* **"useless_act_pun"**: Punishment for using a non-recognizable command (Env-Return: "I don't understand that." or "I do not see such an object here." etc.)
* **"cmd_in_adm"**: Positive reward if the currently executed command is in the set of admissible commands allowed by the environment. This should encourage linking environment objects to commands, even if that command is not the right option.


In [None]:
REWARD_DICT = {
    "win_lose_value": 100,
    "max_loop_pun": 0,
    "change_reward": 1,
    "useless_act_pun": 1,
    "cmd_in_adm": 1,
}
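
To make the interplay of these keys more concrete, here is a minimal sketch of how the terms might be combined for a single step. The combination rule, the function name `shaped_reward`, its boolean inputs, and the buffer length of 10 are assumptions for illustration; the actual logic lives in the environment wrapper and may differ.

```python
from collections import deque

REWARDS = {
    "win_lose_value": 100,
    "max_loop_pun": 0,
    "change_reward": 1,
    "useless_act_pun": 1,
    "cmd_in_adm": 1,
}


def shaped_reward(rewards, *, won, lost, state_changed, understood,
                  admissible, state_hash, recent_hashes):
    """Combine the reward terms for one step (assumed combination rule)."""
    reward = 0.0
    if won:
        reward += rewards["win_lose_value"]
    if lost:
        reward -= rewards["win_lose_value"]
    if not understood:
        reward -= rewards["useless_act_pun"]
    if admissible:
        reward += rewards["cmd_in_adm"]
    if state_changed:
        if state_hash in recent_hashes:
            # Revisited a buffered state: punish the loop instead of
            # rewarding the change.
            reward -= rewards["max_loop_pun"]
        else:
            reward += rewards["change_reward"]
    recent_hashes.append(state_hash)
    return reward


recent = deque(maxlen=10)  # hashes of the most recent states (assumed length)
step = dict(won=False, lost=False, state_changed=True, understood=True,
            admissible=True, recent_hashes=recent)
first = shaped_reward(REWARDS, state_hash=hash("attic|key"), **step)
again = shaped_reward(REWARDS, state_hash=hash("attic|key"), **step)
print(first, again)
```

In this sketch the second visit to the same state no longer earns `change_reward`, which is the loop-avoidance effect that `max_loop_pun` is meant to strengthen when set above zero.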

### Activate TensorBoard logging

In [None]:
pathdir = DEFAULT_PATHS["path_logdir"]
%load_ext tensorboard
%tensorboard --logdir $pathdir

### Overfit on single debug game with each agent

Try overfitting on one debug game (a single "take x" command immediately wins or loses the game).
Whether the number of iterations is enough depends on whether the random agent finds the correct winning command.


In [None]:
DEFAULT_HP = {
    "learning_rate": 1e-4,
    "initial_collect_steps": 3000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    "batch_size": 128,
    "num_eval_episodes": 1,
    "game_gen_buffer": 25,
    "num_eval_games": 5,
}
trainer = TWTrainer(
    reward_dict=REWARD_DICT,
    hpar=DEFAULT_HP,
    debug=False,
    biased_buffer=True,
    # embedding into fc is default policy
    agent_label="FCPolicy",
)
eval_scores = trainer.train(
    num_iterations=2000,
    log_interval=100,
    eval_interval=100,
    game_gen_interval=500,
    plot_avg_ret=True,
)

In [None]:
DEFAULT_HP = {
    "learning_rate": 1e-4,
    "initial_collect_steps": 3000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    # large values lead to OOM with bert policy
    "batch_size": 64,
    "num_eval_episodes": 1,
    "game_gen_buffer": 25,
    "num_eval_games": 5,
}
trainer = TWTrainer(
    reward_dict=REWARD_DICT, 
    hpar=DEFAULT_HP,
    debug=False,
    biased_buffer=True,
    # embedding into fc is default policy
    # agent_label="FCPolicy",
    agent_label="BertPolicy",
)
eval_scores = trainer.train(
    num_iterations=2000,
    log_interval=100,
    eval_interval=100,
    game_gen_interval=500,
    plot_avg_ret=True,
)

### Train on 10 training games from the same level (level 2) with (simple) Embedding-FC policy agent.

In [None]:
DEFAULT_HP = {
    "learning_rate": 4.8247e-05,
    "initial_collect_steps": 30000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    "batch_size": 128,
    "num_eval_episodes": 1,
    "game_gen_buffer": 10,
    "num_eval_games": 10,
}

trainer = TWTrainer(
    env_dir="train_games_lvl2",
    reward_dict=REWARD_DICT,
    hpar=DEFAULT_HP,
    debug=False,
    # NOTE: unlike the other runs, the biased replay buffer is disabled here
    biased_buffer=False,
    # embedding into fc is default policy
    agent_label="FCPolicy",
    # agent_label="BertPolicy",
)

eval_scores = trainer.train(
    num_iterations=10000,
    log_interval=250,
    eval_interval=500,
    game_gen_interval=1000000,
    rndm_fill_replay=True,
    plot_avg_ret=True,
)

### Train on 10 training games from the same level (level 2) with Bert policy agent.

In [None]:
DEFAULT_HP = {
    "learning_rate": 4.8247e-05,
    "initial_collect_steps": 30000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    "batch_size": 64,
    "num_eval_episodes": 1,
    "game_gen_buffer": 10,
    "num_eval_games": 10,
    "num_test_games": 50,
}

trainer = TWTrainer(
    env_dir="train_games_lvl2",
    reward_dict=REWARD_DICT,
    hpar=DEFAULT_HP,
    debug=False,
    biased_buffer=True,
    agent_label="BertPolicy",
)

eval_scores = trainer.train(
    num_iterations=10000,
    log_interval=250,
    eval_interval=250,
    game_gen_interval=1000000,
    rndm_fill_replay=True,
    plot_avg_ret=True,
    test_agent=True,
)

### Train on all 1000 training games from the same level (level 2) with NNLM-FC policy agent.

In [None]:
DEFAULT_HP = {
    "learning_rate": 4.8247e-05,
    "initial_collect_steps": 50000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    "batch_size": 128,
    "num_eval_episodes": 1,
    "game_gen_buffer": 25,
    "num_eval_games": 10,
    "num_test_games": 50,
}

trainer = TWTrainer(
    env_dir="train_games_lvl2",
    reward_dict=REWARD_DICT,
    hpar=DEFAULT_HP,
    debug=False,
    biased_buffer=True,
    agent_label="FCPolicy",
)

eval_scores = trainer.train(
    num_iterations=50000,
    log_interval=250,
    eval_interval=500,
    game_gen_interval=1000,
    rndm_fill_replay=True,
    plot_avg_ret=True,
    test_agent=True,
)

### Train on all generated levels with Bert-FC policy agent

This would be the training loop to utilize multiple levels (with increasing difficulty) of the TreasureHunter-type games generated by TextWorld.

The levels differ in the number of rooms that need to be visited (including dead ends) and the number of other items available in the environment.

In [None]:
DEFAULT_HP = {
    "learning_rate": 4.8247e-05,
    "initial_collect_steps": 30000,
    "collect_steps_per_iteration": 1,
    "replay_buffer_max_length": 100000,
    "batch_size": 64,
    "num_eval_episodes": 1,
    "game_gen_buffer": 10,
    "num_eval_games": 10,
}

trainer = TWTrainer(
    env_dir="train_games_lvl2",
    reward_dict=REWARD_DICT,
    hpar=DEFAULT_HP,
    debug=False,
    biased_buffer=True,
    agent_label="BertPolicy",
)

eval_scores = trainer.train(
    num_iterations=10000,
    log_interval=250,
    eval_interval=500,
    game_gen_interval=1000,
    rndm_fill_replay=True,
    plot_avg_ret=True,
)

print(f"Changing to next lvl: 3 \n")

trainer.change_env_dir(f"train_games_lvl3")
eval_scores = trainer.train(
    num_iterations=10000,
    log_interval=250,
    eval_interval=500,
    game_gen_interval=1000,
    continue_training=True,
    rndm_fill_replay=True, 
    plot_avg_ret=True,
)

print(f"Changing to next lvl: 4 \n")

trainer.change_env_dir(f"train_games_lvl4")
eval_scores = trainer.train(
    num_iterations=12000,
    log_interval=250,
    eval_interval=500,
    game_gen_interval=1000,
    continue_training=True,
    rndm_fill_replay=True,
    plot_avg_ret=True,
)
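
The three near-identical calls above could also be folded into a loop over a level schedule. The sketch below uses only the `change_env_dir` and `train` methods seen in this notebook; the helper `run_curriculum` and the `_StubTrainer` class are hypothetical and exist only so the example runs without starting a real training job.

```python
def run_curriculum(trainer, schedule, **train_kwargs):
    """Train on each (env_dir, num_iterations) stage in order.

    The first stage assumes the trainer was created with that env_dir;
    later stages switch directories and continue training.
    """
    results = []
    for stage, (env_dir, num_iterations) in enumerate(schedule):
        kwargs = dict(train_kwargs, num_iterations=num_iterations)
        if stage > 0:
            print(f"Changing to next level: {env_dir}")
            trainer.change_env_dir(env_dir)
            kwargs["continue_training"] = True
        results.append(trainer.train(**kwargs))
    return results


class _StubTrainer:
    """Stand-in for TWTrainer that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def change_env_dir(self, env_dir):
        self.calls.append(("change_env_dir", env_dir))

    def train(self, **kwargs):
        self.calls.append(("train", kwargs))
        return kwargs["num_iterations"]


schedule = [
    ("train_games_lvl2", 10000),
    ("train_games_lvl3", 10000),
    ("train_games_lvl4", 12000),
]
stub = _StubTrainer()
run_curriculum(stub, schedule, log_interval=250, eval_interval=500,
               game_gen_interval=1000, rndm_fill_replay=True,
               plot_avg_ret=True)
print([name for name, _ in stub.calls])
```

With a real `TWTrainer` created for `train_games_lvl2`, passing it in place of `stub` would issue the same sequence of calls as the cell above.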


## Hyperparameter search

A simple hyperparameter search using the Optuna Python package. It also comes with very handy visualization tools illustrating correlations between, and the importance of, different hyperparameters.

To test for stable and good training, the objective value is defined as the average score for the last few iterations for the games in the current buffer.
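
As a small worked example of this objective, the sketch below averages the tail of a score history. The helper `objective_value` and the score values are made up for illustration, and the standard-library `fmean` stands in for the `np.mean(eval_scores[1][-5:])` expression used in the `objective` function of this section.

```python
from statistics import fmean


def objective_value(scores, last_n=5):
    """Average of the final `last_n` evaluation scores."""
    if not scores:
        raise ValueError("no evaluation scores recorded")
    return fmean(scores[-last_n:])


# Made-up history: an early lucky spike (50.0) falls outside the window,
# so it does not inflate the objective.
history = [0.0, 50.0, 3.0, 4.0, 5.0, 6.0, 5.0]
print(objective_value(history))
```

Averaging over several late evaluations rather than taking the single best score favors runs that end up stable over runs that got lucky once.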

In [None]:
%%capture
!pip install optuna
import optuna
import numpy as np
from optuna.visualization import plot_contour
from optuna.visualization import plot_slice
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances

In [None]:
def objective(trial):
    """Train a BertPolicy agent with sampled hyperparameters and return its score."""

    REWARD_DICT = {
        "win_lose_value": 100,
        "max_loop_pun": 0,
        "change_reward": 1,
        "useless_act_pun": 1,
        "cmd_in_adm": 1,
    }

    DEFAULT_HP = {
        "learning_rate": trial.suggest_loguniform("lr", 1e-5, 1e-3),
        "initial_collect_steps": 30000,
        "collect_steps_per_iteration": 1,
        "replay_buffer_max_length": 100000,
        # CAREFUL: batch sizes above 64 may cause OOM with the BERT policy
        "batch_size": trial.suggest_int("batch_size", 32, 64),
        "num_eval_episodes": 1,
        "game_gen_buffer": 10,
        "num_eval_games": 5,
    }

    trainer = TWTrainer(
        env_dir="train_games_lvl2",
        reward_dict=REWARD_DICT,
        hpar=DEFAULT_HP,
        debug=False,
        biased_buffer=True,
        agent_label="BertPolicy",
    )

    eval_scores = trainer.train(
        num_iterations=20000,
        log_interval=250,
        eval_interval=250,
        game_gen_interval=1000000,
        rndm_fill_replay=True,
        plot_avg_ret=True,
    )
    print(DEFAULT_HP)
    print(eval_scores)

    return np.mean(eval_scores[1][-5:])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_trial)

In [None]:
plot_optimization_history(study)

In [None]:
plot_contour(study)

In [None]:
plot_slice(study)

In [None]:
plot_param_importances(study)