# Outline of the notebook

- [Defining an MDP for a recommendation system using the gym API](#recsys) (15 min)
- [Online RL - Bandits](#bandits) (15 min)
- [Online RL - DQN](#dqn) (15 min)
- [Break](#break) (5 min)
- [Offline RL](#offline-rl) (15 min)

- [Slides](https://github.com/anyscale/academy/blob/main/ray-rllib/acm_recsys_tutorial_2022/slides/rllib_acm_recsys_2022_slides.pdf)

# Creating a RecSys gym environment <a class="anchor" id="recsys"></a>

RL is usually a good fit for sequential decision-making problems. In recommendation systems, for example, the items our agent recommends can change the interest profile of the users it interacts with, which in turn affects the decisions it faces at the next time step. This is in contrast to passive prediction problems, where the task is simply to predict the future and the prediction does not change the outcome.

To train RL agents successfully, we usually need a good simulator that approximates what would happen in the real world if the agent took certain actions. It is always recommended to start with an environment that emulates real-world behavior.

In this section, we will learn how to create and evaluate an example RecSys environment using the gym API.

## RecSim

In this notebook, we will use <b><a href="https://github.com/google-research/recsim">Google's RecSim environment</a></b>, which was developed for the YouTube recommendation problem.  It is a configurable environment, where ideally you would plug in your own users, products, and embedding features.

**Some further readings**

- <a href="https://github.com/google-research/recsim">RecSim github</a>
- <a href="https://arxiv.org/pdf/1909.04847.pdf">RecSim paper</a>

The following image depicts all the components of the RecSim package:

<img src="./images/recsim_environment.png" width="70%" />

The environment is <i>time-limit based</i>, meaning an episode terminates after a fixed number (10) of videos have been watched.

### Document Model

<img src="./images/document_model.png" width="70%" />

Documents represent the candidate pool of items to recommend, with features sampled in the range [0, 1]. In this tutorial, we use <b>a single feature, "sweetness"</b>, drawn from a uniform distribution over [0.8, 1.0] for "chocolaty" items and over [0, 0.2] for "kaley" options.
- The documents can be different at each step (produced by some "candidate generation" process), or fixed throughout the simulation.
- The recommendation algorithm observes the D candidate documents.  It then makes a selection (possibly ordered) of k documents and presents them in a "slate" to the user. We will focus on **slate size of 1** in this tutorial. 

### User Model

<img src="./images/user_model.png" width="70%" />

In RecSim, users are represented by a set of features, some of which may be latent variables that our recommendation system cannot observe during live interaction. In this tutorial, we assume that a sampled user has no observable features (such as age or gender), so our agent must infer the user's latent state from the history of interactions in the current session.
- The user examines a "slate" of recommended items and chooses one. After making their choice, the user emits an engagement score indicating how engaged they were with the chosen item. The agent has to learn to estimate the latent user state that shapes future choices.

### User Choice

<img src="./images/user_choice_model.png" width="70%" />

This module controls how a particular user responds to a set of recommendations and how their latent state evolves as a result of the interaction.
In this tutorial's environment, engagement is assumed to be a function of two competing phenomena:
-   The user's love of sweet items ($sweetness(item_t)$)
-   The user's long-term satisfaction, which favors healthier options. It is inversely correlated with the sweetness of the items suggested so far ($satisfaction_{t-1}$)

$$satisfaction_t := satisfaction_{t-1} * \sigma(-sweetness(item_{t}))$$
$$r(item_t) \propto sweetness(item_t) * satisfaction_{t-1}$$

For example, if a user who loves chocolate has not had any for a while, a chocolate item will be more engaging than a kaley item. On the other hand, if we keep recommending chocolate to the same person, they may lose interest and stop using our recommendations.
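
To build intuition for the two update rules above, here is a minimal toy sketch in plain Python. It applies the formulas literally, with an assumed initial satisfaction of 1.0 and hypothetical sweetness values of 0.9 (chocolate) and 0.1 (kale). The function name `simulate` is ours, and the actual RecSim environment implements richer dynamics, so treat the numbers as illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def simulate(sweetness_sequence, initial_satisfaction=1.0):
    """Apply the two update rules to a sequence of recommended items.

    Returns the per-step engagement and the final satisfaction.
    The initial satisfaction of 1.0 is an assumption of this sketch.
    """
    satisfaction = initial_satisfaction
    engagements = []
    for sweetness in sweetness_sequence:
        # r(item_t) is proportional to sweetness(item_t) * satisfaction_{t-1}
        engagements.append(sweetness * satisfaction)
        # satisfaction_t = satisfaction_{t-1} * sigma(-sweetness(item_t))
        satisfaction *= sigmoid(-sweetness)
    return engagements, satisfaction


chocolate_only, sat_choc = simulate([0.9] * 5)
kale_only, sat_kale = simulate([0.1] * 5)

print("engagement, chocolate only:", [round(e, 3) for e in chocolate_only])
print("engagement, kale only:     ", [round(e, 3) for e in kale_only])
print("final satisfaction (chocolate vs. kale):", sat_choc, sat_kale)
```

In this toy version, repeatedly recommending chocolate starts with high engagement but erodes satisfaction faster than recommending kale, which is the tension the agent has to manage.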


Let's take a closer look at this environment's properties:

In [None]:
# uncomment for running on anyscale
# import os; os.environ["PATH"] = "/home/ray/anaconda3/bin:" + os.environ["PATH"]
# !pip uninstall -y torch
# !python -m pip install torch==1.12.1
# !python -m pip install seaborn
# import torch
# print(f"torch: {torch.__version__}")

# !pip uninstall -y matplotlib
# !python3 -m pip install matplotlib==3.5.3
# import matplotlib
# print(f"matplotlib: {matplotlib.__version__}")

In [None]:
import time
import random
import numpy as np
from pprint import pprint
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

from ray import tune, air, data

import recsim 
from rllib_recsim.rllib_recsim import ModifiedLongTermSatisfactionRecSimEnv

In [None]:
# main parameters
seed = 100
num_candidates = 20
reward_scale = 1.0

In [None]:
# Let's first instantiate the environment

config = {
    # The number of possible documents/videos/candidates that we can recommend
    "num_candidates": num_candidates,  
    # The number of recommendations that we will be making
    "slate_size": 1, 
    # Set to False for re-using the same candidate documents each timestep.
    "resample_documents": True,
    # Use consistent seeds for the environment ...
    "seed": seed,
    # scale rewards with this factor
    "reward_scale": reward_scale,
}

env = ModifiedLongTermSatisfactionRecSimEnv(config)
print(env)

The printed environment shows a hierarchy of wrappers around the main RecSim environment. Next, we'll investigate the observation and action spaces. For more information on how the environment is implemented, see the source code accompanying this tutorial: [link](https://github.com/anyscale/academy/blob/main/ray-rllib/acm_recsys_tutorial_2022/rllib_recsim/rllib_recsim.py).

In [None]:
# Let's check out the observation space
print("observation_space:")
print("-"*20)
print(env.observation_space)
print("observation_space example:")
print("-"*20)
print(env.observation_space.sample())
print("observation space keys:")
print("-"*20)
print(list(env.observation_space.keys()))

The observation space is a dictionary with four keys.
- The `user` key is empty because there are no observable user features.
- The `doc` key contains a set of documents indexed by numerical keys. Each document has a scalar score representing its sweetness.
- The `response` key includes a single record of `click` and `engagement` from the immediately preceding interaction.
- The `time` key shows a normalized timestep within this user's session: -0.5 corresponds to the beginning of the interaction and +0.5 to the end.
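
As a rough illustration of how such an observation can be consumed, the sketch below builds a mock observation by hand and extracts the sweetness features. All values are made up, the exact layout of the `response` and `time` entries is an assumption, and `doc_features` is a hypothetical helper; a real observation comes from `env.reset()` or `env.step()`.

```python
import numpy as np

# Hand-built observation mimicking the structure described above.
# Values and the exact shapes of "response" and "time" are assumptions.
obs = {
    "user": np.array([], dtype=np.float32),
    "doc": {
        "0": np.array([0.12], dtype=np.float32),
        "1": np.array([0.95], dtype=np.float32),
        "2": np.array([0.83], dtype=np.float32),
        "3": np.array([0.05], dtype=np.float32),
    },
    "response": {"click": 1, "engagement": 4.2},
    "time": -0.5,
}


def doc_features(observation):
    """Return the sweetness of each candidate, ordered by its integer key."""
    keys = sorted(observation["doc"], key=int)
    return np.array([float(observation["doc"][k][0]) for k in keys])


features = doc_features(obs)
sweetest_action = int(np.argmax(features))  # index of the most chocolaty item
kaliest_action = int(np.argmin(features))   # index of the most kaley item
print(features, sweetest_action, kaliest_action)
```

Because the document keys are strings while actions are integers, the conversion between the two is the main thing to keep in mind when writing policies against this observation.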

In [None]:
# Let's check out the action space
print("action_space:")
print("-"*20)
print(env.action_space)
print("action_space example:")
print("-"*20)
print(env.action_space.sample())

The action space is simply an integer between 0 and 19 (`Discrete(20)`).

Let's now check out the environment's reward behavior. Let's see what happens if we always pick the sweetest item, and plot the reward over time.

In [None]:
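# Roll out one episode, always recommending the sweetest candidate
obs, rewards, done = env.reset(), [], False
while not done:
    action = int(max(obs["doc"], key=lambda k: obs["doc"][k]))  # doc keys are strings
    obs, reward, done, info = env.step(action)
    rewards.append(reward)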

In [None]:
plt.plot(rewards)
plt.ylabel('engagement per step for maximum greedy policy')
plt.xlabel('step')

We can note a couple of things here already:

1. The immediate engagement would be high initially when the sweetest item is suggested for the first time.
2. As we keep recommending the sweetest items, user satisfaction tapers off significantly and, as a result, engagement quickly drops.
3. Episodes seem to last for 10 timesteps.

#### Exercise (2 min):
Instead of picking the item with the highest feature value, pick the item with the lowest feature value and see what happens.

- What do your observations imply about this environment?
- What policy maximizes engagement with the user?

In [None]:
obs = env.reset()
rewards = []
done = False
while not done:
    # TODO (exercise): code here
    action = ...
    obs, reward, done, info = env.step(action)
    rewards.append(reward)

In [None]:
plt.plot(rewards)
plt.ylabel('engagement per step for minimum greedy policy')
plt.xlabel('step')

## Getting some baseline policies
Next we will run some simple baselines to get a feel for the reward we can accumulate in this environment using simple policies.

- Greedy minimum feature value (recommending the kaliest option)
- Greedy maximum feature value (recommending the chocolatiest option)
- Random policy (recommending random items from the pool)
- Even argmin (alternating between argmin on even time steps and argmax on odd time steps to keep the engagement high)


In [None]:
# Function that measures and prints the reward of a baseline policy.
# It computes the mean accumulated reward per episode when acting according to `baseline_type` at each time step.
def calc_baseline(baseline_type="random",
                  episodes=100):

    env_config = {
        # The number of possible documents/videos/candidates that we can recommend
        "num_candidates": num_candidates,  
        # The number of recommendations that we will be making
        "slate_size": 1, 
        # Set to False for re-using the same candidate documents each timestep.
        "resample_documents": True,
        # Use consistent seeds for the environment ...
        "seed": seed,
        # scale rewards with this factor
        "reward_scale": reward_scale,
    }

    env = ModifiedLongTermSatisfactionRecSimEnv(env_config)
    # Reset the env.
    obs = env.reset()

    # Number of episodes already done.
    num_episodes = 0
    # Current episode's accumulated reward.
    episode_reward = 0.0
    epsiode_satisfaction = []
    # Collect all episode rewards here to be able to calculate a random baseline reward.
    episode_rewards = []
    episode_satisfactions = []
    
    # Enter while loop (to step through the episode).
    time_step = 0
    while num_episodes < episodes:
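        # Produce one candidate action per baseline
        random_action = env.action_space.sample()
        # doc keys are strings, so convert the selected key to an integer action
        argmax_action = int(max(obs["doc"], key=lambda k: obs["doc"][k]))
        argmin_action = int(min(obs["doc"], key=lambda k: obs["doc"][k]))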

        action_dict = {
            'argmax': argmax_action, # greedy choc
            'argmin': argmin_action, # greedy kale
            'random': random_action,
        }
        # a baseline that performs argmin in even time steps and argmax in odd time steps
        action_dict["even_argmin"] = (
            action_dict["argmin"] if time_step % 2 == 0 else action_dict["argmax"]
        )
        action = action_dict[baseline_type]
        
        # Send the action to the env's `step()` method to receive: obs, reward, done, and info.
        obs, reward, done, _ = env.step(action)
        
        # Accumulate the rewards
        episode_reward += reward
        
        # Record the user's current satisfaction for this episode
        epsiode_satisfaction.append(
            env.environment._user_model._user_state.satisfaction
        )

        time_step += 1
        # Check whether the episode is done; if yes, reset and increase the episode counter.
        if done:
            if num_episodes % 99 == 0:
                print(f" {num_episodes} ", end="")
            elif num_episodes % 9 == 0:
                print(".", end="")
                
            # increment on end of episode
            num_episodes += 1
            time_step = 0
            obs = env.reset()
            episode_rewards.append(episode_reward)
            episode_reward = 0.0
            episode_satisfactions.append(np.mean(epsiode_satisfaction))
            # start a fresh satisfaction trace so episodes are not mixed together
            epsiode_satisfaction = []

    # Compute the mean episode reward (and its standard deviation across episodes).
    env_mean_reward = np.mean(episode_rewards)
    env_sd_reward = np.std(episode_rewards)

    # Print out the satisfaction over the episodes
    env_mean_satisfaction = np.mean(episode_satisfactions)
    env_sd_satisfaction = np.std(episode_satisfactions)
    
    print(f"\nMean {baseline_type} baseline reward: {env_mean_reward:.2f}+/-{env_sd_reward:.2f}, satisfaction: {env_mean_satisfaction:.2f}+/-{env_sd_satisfaction:.2f}")

    return env_mean_reward, episode_rewards

In [None]:
num_episodes = 1000
kaliest_baseline, _ = calc_baseline(baseline_type="argmin", episodes=num_episodes)
sweetest_baseline,  _ = calc_baseline(baseline_type="argmax", episodes=num_episodes)
random_baseline, _ = calc_baseline(baseline_type="random", episodes=num_episodes)
even_argmin_baseline, _ = calc_baseline(baseline_type="even_argmin", episodes=num_episodes)

### Discussion about the baselines

For every baseline, we have printed not only the engagement score accumulated over the user session but also the satisfaction term averaged over the session.
- **Random policy beats greedy options**
- **Alternation between argmax and argmin beats random**
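
When comparing baselines, note that the `+/-` values printed by `calc_baseline` are standard deviations across episodes (`np.std`), which describe the spread of episode rewards rather than the uncertainty of the mean. The sketch below, with a hypothetical helper `summarize` and made-up rewards, shows both quantities side by side.

```python
import numpy as np


def summarize(episode_rewards):
    """Return mean, standard deviation, and standard error of the mean."""
    rewards = np.asarray(episode_rewards, dtype=float)
    mean = rewards.mean()
    std = rewards.std()  # population std, same default as np.std
    # standard error uses the sample std (ddof=1) divided by sqrt(n)
    sem = rewards.std(ddof=1) / np.sqrt(len(rewards))
    return mean, std, sem


mean, std, sem = summarize([1.0, 2.0, 3.0, 4.0])
print(f"{mean:.2f} +/- {std:.2f} (standard error {sem:.2f})")
```

With many episodes the standard error shrinks while the standard deviation does not, so two baselines with overlapping spreads can still have clearly different means.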

The question is whether we can automatically learn an optimal policy in this recommendation environment.


## Questions (2 min)

- Any questions so far?

# Train a contextual bandit on the environment <a class="anchor" id="bandits"></a>

Bandits are a classical family of algorithms used in RecSys that optimize single-step objectives. **They maximize immediate engagement, not accumulated engagement over the user session.**

Any RL algorithm can be turned into a contextual bandit algorithm if the discount factor of the Markov Decision Process is set to 0.0. This will result in maximizing the immediate reward and hence a bandit solution. 
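
To see why a discount factor of 0 yields a bandit, consider the standard one-step Q-learning target $r + \gamma \max_{a'} Q(s', a')$. The sketch below is a generic illustration (not RLlib's internal code), and the function name `td_target` and the numbers are ours.

```python
import numpy as np


def td_target(reward, next_q_values, gamma, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else float(np.max(next_q_values))
    return reward + gamma * bootstrap


next_q = np.array([2.0, 3.0, 1.0])
bandit_target = td_target(1.0, next_q, gamma=0.0)  # only the immediate reward
rl_target = td_target(1.0, next_q, gamma=0.99)     # also values the future
print(bandit_target, rl_target)
```

With `gamma=0.0` the future term vanishes, so the learned Q-values estimate only the immediate reward of each action.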

In this section, we will use [DQN](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#dqn) to train an agent both with $\gamma = 0$ and $\gamma = 0.99$ to see the difference between a bandit solution and an RL solution.

In the config below we set a few settings:
- For the environment, we specify 20 candidates that are randomly re-sampled at each time-step.
- We use the torch implementation of the algorithm in RLlib
- For evaluation, we roll out 100 complete episodes at the end of each training iteration and compute the average of the un-discounted reward over the episode
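
The evaluation metric in the last bullet is the undiscounted episode return, whatever $\gamma$ is used for training. A small sketch of the difference, using a hypothetical helper `episode_return` and made-up rewards:

```python
def episode_return(rewards, gamma=1.0):
    """Sum of rewards over an episode, discounted by gamma per step."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


rewards = [1.0, 1.0, 1.0]
undiscounted = episode_return(rewards)           # what evaluation reports
myopic = episode_return(rewards, gamma=0.0)      # what a bandit optimizes
discounted = episode_return(rewards, gamma=0.5)
print(undiscounted, myopic, discounted)
```

This is why a bandit and a DQN agent can be compared on the same plot: both are scored on the full undiscounted session reward.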


In [None]:
# use tune to register the environment
tune.register_env("modified-lts", 
    lambda config: ModifiedLongTermSatisfactionRecSimEnv(config)
)

In [None]:
from ray.rllib.algorithms.dqn import DQNConfig, DQN

In [None]:
# set up the environment config
env_config = {
    "num_candidates": num_candidates,  
    "slate_size": 1, 
    "resample_documents": True,
    "seed": seed,
    "reward_scale": reward_scale
}

bandit_config = DQNConfig()
# setup the env
bandit_config = bandit_config.environment(env="modified-lts", env_config=env_config)
# setup framework to be torch
bandit_config = bandit_config.framework("torch")
# setup the gamma
bandit_config = bandit_config.training(gamma=0.0)

In [None]:
pprint(bandit_config.to_dict())

In RLlib we can run training loops in two ways:
1. An ad-hoc for-loop that calls `algo.train()`
2. Using `tune.Tuner()` with a stopping condition (recommended)

The code below shows the differences. Moving forward (and in all scripts) we use the latter option.

In [None]:
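# Ad-hoc for-loop: build the algorithm, then call train() once per iteration
algo = bandit_config.build()
for _ in range(2):  # the iteration count here is arbitrary
    results_bandit = algo.train()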

In [None]:
# print the results
print(results_bandit.keys())
pprint(results_bandit)

In [None]:
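# Using tune.Tuner(param_space=..., run_config=air.RunConfig)
# air.RunConfig(local_dir=..., stop=...)
# (the local_dir below is an arbitrary output folder)
run_config = air.RunConfig(local_dir="./results_notebook/online_rl/bandits", stop={"training_iteration": 1})
tuner = tune.Tuner(DQN, param_space=bandit_config.to_dict(), run_config=run_config)
results_bandit = tuner.fit()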

In [None]:
results_bandit

In [None]:
pprint(results_bandit[0].metrics)

Tune collects these results from every iteration and puts them in the output directory where the other logging artifacts are stored.

To run this experiment longer until it's trained you can run the following command to launch the bandit experiment:


```bash
python tutorial_scripts/run_online_rl.py --seed 0 --gamma 0.0 --exp_name bandits --timesteps 10_000
```

This script takes about 5 minutes to run. It will create experiment artifacts under `./results_scripts/`, including checkpoints as well as TensorBoard and tabular logs. You can later inspect this folder to monitor your experiments.

The suggested way is to use tensorboard to monitor the metrics of your run and look for `episode_reward_mean`.

```bash
tensorboard --logdir "./results_scripts"
```



In [None]:
# load the trained results and plot the metrics in notebook
import pandas as pd

# Load the results from the progress.csv in the result folder of your running script
df = pd.read_csv("saved_runs/bandits/progress.csv")
df

In [None]:
sns.lineplot(data=df, x="training_iteration", y="episode_reward_mean", label="bandits")
plt.axhline(sweetest_baseline, color="red", linestyle='--', label="sweetest baseline")
plt.axhline(random_baseline, color="k", linestyle='--', label="random baseline")

plt.legend()
plt.title('Bandits training performance')

## Compare the Bandit Training results to Baseline
- Bandit Mean Reward=~56 
- Kaliest (Argmin) Baseline Mean Reward = ~10.87+/-0.26
- Random Baseline Mean Reward = ~99.90+/-23.75
- Sweetest (Argmax) Baseline Mean Reward = ~56.56+/-1.37

<div class="alert alert-block alert-success">
    🤔  <b>Bandit mean reward is approx the same as the sweetest baseline!</b> 
</div>  

Try the code block below to compare what the bandit recommends with what is the sweetest item... you will see that the bandit always recommends the sweetest item!

In [None]:
# build the algorithm and load from checkpoint
with open("saved_runs/bandits/params.pkl", 'rb') as f:
    params = pickle.load(f)

pprint(params)

In [None]:
bandit_algo = DQN(params)
bandit_algo

In [None]:
# Load the checkpoint from the result folder of your running script
checkpoint = "saved_runs/bandits/checkpoint_000666"
bandit_algo.restore(checkpoint)

In [None]:
env = ModifiedLongTermSatisfactionRecSimEnv(env_config)
obs = env.reset()

# Run a single episode.
done = False
while not done:
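    # Use the `compute_single_action` method of our Algorithm.
    # This is one way to perform inference on a learned policy.
    # explore=False disables exploration so we see the greedy choice.
    bandit_action = bandit_algo.compute_single_action(obs, explore=False)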
    argmax_action = int(max(obs['doc'], key=lambda x: obs['doc'][x]))

    feature_value_of_bandit = obs["doc"][str(bandit_action)]
    feature_value_of_greedy = obs["doc"][str(argmax_action)]


    # Print out the picked document's feature value and compare that to the highest possible feature value.
    print("-"*50)
    print("observation_features: ", np.concatenate(list(obs["doc"].values())))
    print(f"bandit's feature value={feature_value_of_bandit}; argmax feature={feature_value_of_greedy};")

    # Apply the computed action in the environment and continue.
    obs, r, done, _ = env.step(bandit_action)

In this dummy RecSim environment, we did not have any user features. This leaves the contextual bandit without any user context, i.e. without any state. A stateless bandit cannot remember anything between timesteps, so it converges to the greedy policy that always recommends chocolate.

# Solving the problem with RL <a class="anchor" id="dqn"></a>

So far the bandit solution has just converged to the performance of the greedy argmax policy. **How can we improve over the random policy baseline?** Now let's run the DQN algorithm with $\gamma = 0.99$. We run this script for 1M environment timesteps until convergence. It will take about 1 hour to run this training job.

**Exercise (1 min)** How would you modify the previous config object (`bandit_config`) to train an RL agent to optimize long-term engagement (hint: use `.training()` API)?


In [None]:
# TODO: Code here
dqn_config = ...

tuner = tune.Tuner(
    DQN,
    param_space=dqn_config.to_dict(),
    run_config=air.RunConfig(
        local_dir='./results_notebook/online_rl/dqn',
        stop={"training_iteration": 1},  # one iteration as a quick demo; convergence needs ~1M timesteps
    )
)
dqn_results = tuner.fit()

You can run the same script as before, with `gamma=0.99` passed in as a parameter. We should also run the script longer (1M steps) as it will take longer for DQN to converge. 
```bash
python tutorial_scripts/run_online_rl.py --seed 0 --gamma 0.99 --exp_name dqn --timesteps 1_000_000
```

**Exercise** Take a look at the results and compare them to bandits via tensorboard.

In [None]:
# take a look at the results
bandit_df = pd.read_csv("saved_runs/bandits/progress.csv")
dqn_df = pd.read_csv("saved_runs/dqn/progress.csv")

In [None]:
# plot the results and compare to baselines
sns.lineplot(data=bandit_df, x="training_iteration", y="episode_reward_mean", label="Bandits")
sns.lineplot(data=dqn_df, x="training_iteration", y="episode_reward_mean", label="DQN")
plt.axhline(random_baseline, color="red", linestyle='--', label="random baseline")
plt.axhline(sweetest_baseline, color="blue", linestyle='--', label="sweetest baseline")
plt.legend()
plt.title('RL vs. Bandits training performance')


- **Sweetest** (straight line) is the mean reward achieved by the greedy policy that always selects the item with the highest immediate reward.
- **Bandit** short-term reward hovers around the "sweetest baseline".
- **Random** (straight line): items are chosen at random at each timestep. Since this baseline mixes the sweetest and kaliest options, engagement stays higher than with either greedy method, yielding larger rewards.
- **DQN (RL)**: an RL algorithm such as DQN, which optimizes for long-term engagement, significantly improves upon the random baseline.

# Questions and Break (5 min) <a class="anchor" id="break"></a>

# Offline RL with RecSys <a class="anchor" id="offline-rl"></a>

<img src="images/offline_rl.png">


### If we don't have a live environment, how do we know, how well our trained policy will perform?

One of the challenges in offline RL is the evaluation of the trained policy. In online RL (when a simulator
is available), one can either use the data collected for training to compute episode total rewards. Remember
that observations, actions, rewards, and done flags are all part of this training data. Alternatively,
one could run a separate worker (with the same trained policy) and run it on a fresh evaluation-only environment.
In this latter case, we would also have the chance to switch off any exploratory behavior (e.g. stochasticity used
for better action entropy).
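
As an illustration of what "switching off exploratory behavior" means for a Q-learning agent, here is a generic epsilon-greedy sketch. It is not RLlib's implementation; `select_action` and the Q-values are hypothetical.

```python
import numpy as np


def select_action(q_values, epsilon, explore, rng):
    """Epsilon-greedy action selection with an on/off switch for exploration."""
    if explore and rng.random() < epsilon:
        # exploratory step: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # greedy step: pick the action with the highest estimated value
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.4, 0.9, 0.3])

# During evaluation we set explore=False, so the choice is deterministic.
eval_actions = {select_action(q_values, 1.0, False, rng) for _ in range(100)}
# During training (explore=True) other actions are sampled as well.
train_actions = {select_action(q_values, 1.0, True, rng) for _ in range(100)}
print(eval_actions, train_actions)
```

Evaluating with exploration disabled measures the quality of the learned greedy policy rather than the noisier behavior used to collect training data.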

In offline RL, no such data from a live environment is available to us. There are a few ways of addressing this dilemma:

1) We deploy the learned policies into production, or maybe just a portion of our production system (similar to A/B testing), and see what happens.

2) We use a method called "off policy evaluation" (OPE) to compute an estimate on how the new policy would perform if we were to deploy it into a real environment. There are different OPE methods available in RLlib off-the-shelf.

3) The third option - which we will use here - is kind of cheating and only possible if you actually do have a simulator available (but you only want to use it for evaluation, not for training, because you want to see how cool offline RL is :) )

In this tutorial, we will use the third option to show the effectiveness of offline RL in improving over existing policies running in production. We will also see how much benefit we can get by improving our dataset quality, going from a totally random policy all the way to 20% expert demonstrations and 80% random.

We can use the currently running policies in production to collect some "historical data" that we can use to train RL agents with. Offline RL can be used to improve upon the existing policies deployed in production. We have prepared some datasets in advance for the purpose of this tutorial. They were all generated using `<path to the script>`. In this script we can mix the percentage of the "expert" data vs. random data to investigate the effect of dataset quality on the final performance of our models.   

Let's look at an exemplar dataset we have prepared before:



In [None]:
prefix = "s3://air-example-data/rllib/acm_recsys22_tutorial_data/"
train_data_path = prefix + "sampled_data_train_random_transitions_small"

dset = data.read_json(train_data_path)
df = dset.to_pandas()
print('Columns: ', df.columns)
print('Number of rows: ', len(df))
print('Value of the first row')
print('-'*20)
print(df.iloc[0])
print('Value of the second row:')
print('-'*20)
print(df.iloc[1])


From the dataset schema, we can see that RLlib always expects a `type` column whose value is `SampleBatch`. Each row holds the usual transition entities (i.e. observation, next_observation, action, reward, and done values). It also contains an episode_id, a timestep indicator, and an action_prob that gives the probability of the action that was chosen at data-collection time. For a random policy, the action prob will always be 1/20 (0.05).
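
The logged `action_prob` is the ingredient that off-policy evaluation methods build on. As a minimal sketch (not RLlib's OPE implementation, and not the evaluation approach used in this tutorial), ordinary trajectory-wise importance sampling reweights a logged episode's return by the ratio of target-policy to behavior-policy action probabilities. The function name and the numbers below are hypothetical.

```python
import numpy as np


def importance_sampling_return(rewards, behavior_probs, target_probs, gamma=1.0):
    """Ordinary importance sampling estimate of the return for one logged episode."""
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    weight = np.prod(ratios)  # product of per-step likelihood ratios
    discounts = gamma ** np.arange(len(rewards))
    return weight * np.sum(discounts * rewards)


# Two logged steps collected by a uniform random policy over 20 items (prob 0.05).
rewards = [1.0, 2.0]
behavior_probs = [0.05, 0.05]
# Hypothetical target policy that is twice as likely to take the logged actions.
target_probs = [0.1, 0.1]

estimate = importance_sampling_return(rewards, behavior_probs, target_probs)
same_policy = importance_sampling_return(rewards, behavior_probs, behavior_probs)
print(estimate, same_policy)
```

In practice such estimates are averaged over many logged episodes, and their variance grows quickly with episode length, which is one reason several OPE variants exist.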

Now that we have an understanding of the dataset format, we can use RLlib to train an offline RL algorithm. RLlib provides several out-of-the-box offline RL algorithms that you can use. Besides those offline-RL-specific algorithms, we can also use any off-policy algorithm (e.g. DQN) to do offline RL. The only difference between the online and offline versions is that instead of an environment sampler, we use a dataset sampler as the source of data. In the next section, we will use DQN, with the difference that we pass a dataset path to the input config.

Here is the summary of the differences:

- The input is switched from a sampler to an offline dataset via the `.offline_data()` API:

```python
    .offline_data(
        input_='dataset',
        input_config={
            'format': 'json',
            'paths': train_data_path
        }
    )
```

- The environment is no longer passed to `.environment()`. Instead, we need to pass in the expected action and observation spaces to construct the policies. We could create them manually based on our knowledge of the system. In this case we cheat and use the environment's attributes to get the correct action and observation spaces.

```python 
    .environment(
        action_space=action_space,
        observation_space=observation_space,
    )
```

- Evaluation config: Since we still need the environment simulator during evaluation, we need to specify it explicitly here. By default RLlib uses the training settings during evaluation; to avoid that, we need to explicitly specify the simulation environment configs.

```python
    .evaluation(
        evaluation_config={
            "input": "sampler",
            "explore": False,
            "env": "modified-lts",
            "env_config": env_config,
        },
    )
```

- (Advanced) configuring the replay buffer to become a dummy buffer. By default RLlib uses a large replay buffer that is useful in online RL but doesn't add much value in the offline case. It is recommended to by-pass this behavior by configuring the replay buffer size to be the same as the dataset sampling size, so that the sampling flow does not get disrupted by the replay buffer behavior.

```python
    .training(
        replay_buffer_config={
            "capacity": 512,
            "learning_starts": 0
        }
    )

```

Running the script below takes about 1 hour. 

In [None]:
env = ModifiedLongTermSatisfactionRecSimEnv(env_config)
action_space = env.action_space
observation_space = env.observation_space

dqn_config_offline = (
    dqn_config
    .offline_data(
        input_='dataset',
        input_config={
            'format': 'json',
            'paths': train_data_path,
        }
    )
    .environment(
        action_space=action_space,
        observation_space=observation_space,
    )
    .evaluation(
        evaluation_interval=10, 
        evaluation_duration=10, 
        evaluation_duration_unit="episodes",
        evaluation_parallel_to_training=True,
        evaluation_config={
            "input": "sampler",
            "explore": False,
            "env": "modified-lts",
            "env_config": env_config,
        },
    )
    .debugging(seed=seed, log_level="ERROR")
    .training(
        gamma=1.0,
        num_atoms=1,
        double_q=True,
        dueling=False,
        model=dict(
            fcnet_hiddens=[1024, 1024, 1024],
            fcnet_activation='relu', 
        ),
        train_batch_size=512,
        lr=3e-4,
        target_network_update_freq=512,
        replay_buffer_config={
            "capacity": 512,
            "learning_starts": 0
        }
    )
)

To run the long-running learning script run:

```bash
python tutorial_scripts/run_offline_rl.py --dataset_suffix sampled_data_train_random_transitions
```

In [None]:
tuner = tune.Tuner(
    DQN,
    param_space=dqn_config_offline.to_dict(),
    run_config=air.RunConfig(
        local_dir="./results_notebook/offline_rl/",
        stop={"training_iteration": 1},
    )
)
dqn_offline_results = tuner.fit()

Let's now look at the offline training results. From the plot below we can see that by running offline RL on randomly collected transitions we can improve over the random policy. This is extremely useful in practical scenarios where our goal is to improve over existing production policies.

In [None]:
dqn_offline_results

In [None]:
dqn_offline_results[0].metrics

In [None]:
# plot the results and compare to baselines
fname = "saved_runs/dqn_offline/random_data/progress.csv"
offline_dqn_df = pd.read_csv(fname)
print(offline_dqn_df.columns)
offline_dqn_df

In [None]:

sns.lineplot(data=offline_dqn_df, x="training_iteration", y="evaluation/episode_reward_mean", label="Offline_DQN")
sns.lineplot(data=dqn_df, x="training_iteration", y="episode_reward_mean", label="Online_DQN")
sns.lineplot(data=bandit_df, x="training_iteration", y="episode_reward_mean", label="Bandits")
plt.axhline(random_baseline, color="red", linestyle='--', label="random baseline")
plt.legend()
plt.title('Offline RL vs. Baselines training performance')

In [None]:
sns.lineplot(data=offline_dqn_df, x="training_iteration", y="info/learner/default_policy/learner_stats/mean_q", label="q_value")
plt.legend()
plt.title('Average Q value')

## Conclusion

- Bandits converge to a short-sighted solution
    -  They optimize immediate reward.
- DQN takes long-term reward into account
    -  Achieves a policy better than random.
    -  Can become even better than our heuristic-based baseline (even_argmin).
- Offline RL is a viable option when we don't have simulators for training
    -  Since it cannot explore freely, it won't perform as well as an online RL method.
    -  Its performance is bounded by the quality of the dataset.
