In [1]:
!pip install tf-agents[reverb]


In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import abc
import tensorflow as tf
import numpy as np

from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

## Python Environments

Python environments have a `step(action) -> next_time_step` method that applies an action to the environment, and returns the following information about the next step:
1. `observation`: This is the part of the environment state that the agent can observe to choose its actions at the next step.
2. `reward`: The agent is learning to maximize the sum of these rewards across multiple steps.
3. `step_type`: Interactions with the environment are usually part of a sequence or episode, e.g. multiple moves in a game of chess. `step_type` can be `FIRST`, `MID`, or `LAST` to indicate whether this time step is the first, an intermediate, or the last step in a sequence.
4. `discount`: This is a float representing how much to weight the reward at the next time step relative to the reward at the current time step.

These are grouped into a named tuple `TimeStep(step_type, reward, discount, observation)`.
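
As a rough illustration of how these four fields fit together, the sketch below builds a stand-in `TimeStep` with `collections.namedtuple` and accumulates a discounted return over a short sequence. The string step-type labels, the `discounted_return` helper, and the sample values are illustrative assumptions. This is not the TF-Agents class itself, and it assumes each step's `discount` scales the rewards that arrive after that step.

```python
from collections import namedtuple

# Stand-in for the TimeStep tuple described above (same field order).
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "discount", "observation"])

# Hypothetical step-type labels used only in this sketch.
FIRST, MID, LAST = "FIRST", "MID", "LAST"


def discounted_return(time_steps):
    """Sum rewards, weighting each one by the product of all earlier discounts."""
    total, weight = 0.0, 1.0
    for step in time_steps:
        total += weight * step.reward
        weight *= step.discount
    return total


# A made-up four-step episode: the first step carries no reward,
# and the last step has discount 0.0 to mark the end of the sequence.
episode = [
    TimeStep(FIRST, reward=0.0, discount=1.0, observation=[0.0]),
    TimeStep(MID, reward=1.0, discount=0.5, observation=[0.1]),
    TimeStep(MID, reward=1.0, discount=0.5, observation=[0.2]),
    TimeStep(LAST, reward=1.0, discount=0.0, observation=[0.3]),
]

print(discounted_return(episode))  # 1.0 + 0.5 * 1.0 + 0.25 * 1.0 = 1.75
```

With a discount of 1.0 at every step the same helper reduces to a plain sum of rewards, which is the quantity the CartPole experiments below accumulate.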


In [3]:
import gym

# Using CartPole-v0

### Implement the CartPole environment for a certain number of steps

In [4]:
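def run_for_steps(steps=100):
    """Take `steps` random actions and return the summed reward."""
    env = gym.make("CartPole-v0")
    total_reward = 0
    for _ in range(steps):
        # NOTE: the environment is reset before every action, so each step
        # starts from a fresh initial state and the loop accumulates one
        # reward per iteration.
        observation = env.reset()
        action = env.action_space.sample()  # take a random action
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward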

### Implement the CartPole environment for a certain number of episodes

In [5]:
def run_for_episodes(episodes=100):
    env = gym.make("CartPole-v0")
    total_reward = 0
    for episode in range(episodes):
        observation = env.reset()
        episode_reward = 0
        while True:
            action = env.action_space.sample() # take a random action
            observation, reward, done, info = env.step(action)
            episode_reward += reward
            if done:
                break
        total_reward += episode_reward
    return total_reward / episodes

### Compare and comment on the rewards earned for both approaches.

In [6]:
steps_reward = run_for_steps()
episodes_reward = run_for_episodes()
print(f"Reward for running for steps: {steps_reward}")
print(f"Reward for running for episodes: {episodes_reward}")

Reward for running for steps: 100.0
Reward for running for episodes: 23.01


Repeating the comparison with 500 and 1000 steps and episodes gives:

Reward for running for 500 steps: 500.0
Reward for running for 500 episodes: 21.798
Reward for running for 1000 steps: 1000.0
Reward for running for 1000 episodes: 22.53

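The two functions measure different quantities, so their results are not directly comparable. `run_for_steps` returns a cumulative total. It calls `env.reset()` at the start of every iteration, so each action is taken from a fresh initial state, the episode does not terminate, and every step earns a reward of 1. The total therefore equals the number of steps (100, 500, 1000). `run_for_episodes` resets the environment only at the start of each episode, plays the episode to termination with random actions, and returns the average reward per episode. That average depends on how long a random policy keeps the pole balanced, not on how many episodes are run, which is why it stays close to 22 (23.01, 21.798, 22.53) as the number of episodes changes.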
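
The difference between a cumulative step total and a per-episode average can be reproduced without Gym. The sketch below is a minimal illustration under stated assumptions: `CountdownEnv` is a hypothetical toy environment (not part of Gym or TF-Agents) that pays a reward of 1 per step and ends every episode after a fixed `horizon`, and the three helper functions are names invented here. It also shows what a step-limited loop returns when the environment is reset only once.

```python
class CountdownEnv:
    """Hypothetical toy environment: +1 per step, episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


def total_reward_reset_every_step(env, steps):
    """Resets before every action, so the episode never reaches its end."""
    total = 0.0
    for _ in range(steps):
        env.reset()
        _, reward, done, _ = env.step(0)
        total += reward
        if done:
            break
    return total


def total_reward_single_reset(env, steps):
    """Resets once, then stops at the step budget or when the episode ends."""
    total = 0.0
    env.reset()
    for _ in range(steps):
        _, reward, done, _ = env.step(0)
        total += reward
        if done:
            break
    return total


def mean_reward_per_episode(env, episodes):
    """Plays each episode to termination and averages the episode totals."""
    total = 0.0
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            _, reward, done, _ = env.step(0)
            total += reward
    return total / episodes


env = CountdownEnv(horizon=5)
print(total_reward_reset_every_step(env, 100))  # 100.0: grows with the step count
print(total_reward_single_reset(env, 100))      # 5.0: one episode, then break
print(mean_reward_per_episode(env, 100))        # 5.0: independent of episode count
```

In this toy setting the reset-every-step total scales with the step budget, while the other two results are fixed by the episode length, which mirrors the pattern in the CartPole numbers above.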