# **CartPole-v0 Environment**  
  
**AIM:**  
Build a simple agent.  
  
**OBJECTIVE:**  
Use the CartPole-v0 environment and write a program to:  
1. Run the CartPole environment for a fixed number of steps  
2. Run the CartPole environment for a fixed number of episodes  
3. Compare and comment on the rewards earned with both approaches.  
  

In [None]:
!pip install tf-agents[reverb]

In [18]:
#Importing libraries
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import abc
import tensorflow as tf

#Importing library for mathematical computation
import numpy as np

#Importing environment related modules
from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

In [None]:
# Run the CartPole environment for a fixed number of steps
env = suite_gym.load('CartPole-v0')  # Load the environment

tf_env = tf_py_environment.TFPyEnvironment(env)

time_step = tf_env.reset()  # reset() returns the initial time_step of a new episode.
# Defining variables
num_steps = 500  # Number of steps to run
transitions = []  # List of [time_step, action, next_time_step] transitions
reward = 0  # Running total of the reward

for i in range(num_steps):
  action = tf.constant([i % 2])  # Alternate between action 0 and action 1
  # step() applies the action and returns the new TimeStep. When the previous
  # TimeStep ended an episode, the wrapped environment is expected to reset
  # instead and return the first TimeStep of a new episode with zero reward.
  next_time_step = tf_env.step(action)
  transitions.append([time_step, action, next_time_step])
  reward = reward + next_time_step.reward  # Accumulate the total reward
  time_step = next_time_step

np_transitions = tf.nest.map_structure(lambda x: x.numpy(), transitions)
print('\n'.join(map(str, np_transitions)))

In [20]:
#Displaying total reward
print("Total reward over {} timesteps : {} ".format(num_steps,reward.numpy()))
#Displaying average reward
print("Average reward over {} timesteps : {} ".format(num_steps, reward.numpy()/num_steps))

Total reward over 500 timesteps : [488.] 
Average reward over 500 timesteps : [0.976] 


In [21]:
#Implement the CartPole environment for a certain number of episodes
env = suite_gym.load('CartPole-v0') #Load Environment
tf_env = tf_py_environment.TFPyEnvironment(env)

time_step = tf_env.reset()

# Defining variables
rewards = []  # Total reward of each episode
steps = []  # Number of steps in each episode
num_episodes = 500  # Number of episodes to run

for _ in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  while not time_step.is_last():
    action = tf.random.uniform([1], 0, 2, dtype=tf.int32)
    time_step = tf_env.step(action) # applies the action and returns the new TimeStep.
    episode_steps = episode_steps + 1
    episode_reward += time_step.reward.numpy()  # Accumulate this episode's reward
  rewards.append(episode_reward)
  steps.append(episode_steps)  # Number of steps taken in this episode
  time_step = tf_env.reset()  # reset() returns the initial time_step of the next episode.

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)

In [22]:
#Displaying total reward
print("Total reward over {} episodes : {} ".format(num_episodes, sum(rewards)))
#Displaying average reward
print('Average total reward over {} episodes: {}'.format(num_episodes, avg_reward))

Total reward over 500 episodes : [10693.] 
Average total reward over 500 episodes: 21.38599967956543


**Inferences:**  
**Step-** Each cycle of state, action, and reward is called a step. The reinforcement learning system keeps iterating through these cycles until it reaches a terminal state or a maximum number of steps has elapsed.   
**Episode-** The series of steps from an initial state to a terminal state is called an episode. At the beginning of each episode, the environment is reset to an initial state and the agent’s cumulative reward is reset to zero.  
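
The step and episode definitions can be sketched without any RL library. The `ToyEnv` class below is a hypothetical stand-in (it is neither CartPole nor a TF-Agents API): it assumes an episode that ends after a fixed number of steps and pays a reward of 1 per step.

```python
class ToyEnv:
    """Hypothetical environment: each episode lasts exactly `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        # Start a new episode with the step counter at 0.
        self.t = 0
        return self.t

    def step(self, action):
        # One state-action-reward cycle: advance the state and pay reward 1.
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done


def run_episode(env):
    """Run one episode and return (number of steps, total reward)."""
    env.reset()
    steps, total, done = 0, 0.0, False
    while not done:
        _, reward, done = env.step(action=0)
        steps += 1
        total += reward
    return steps, total


steps, total = run_episode(ToyEnv(horizon=5))
print(steps, total)
```

Each pass through the `while` loop is one step; the whole call to `run_episode` is one episode.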
  
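For this experiment we ran the environment for 500 timesteps in the first run and for 500 episodes in the second. The step-based run alternates between the two actions, whereas the episode-based run picks actions at random.   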

* For 500 timesteps the total reward is 488.0 and the average reward per timestep is 0.976.  
* For 500 episodes the total reward is 10693.0 and the average reward per episode is 21.39.  
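
The two averages can be reproduced from the printed totals. The sketch below only redoes that arithmetic. The last quantity assumes that CartPole pays a reward of 1 on every regular step, so the shortfall from 500 counts the steps that returned no reward. Reading those as automatic resets between episodes is an assumption; the run did not log it.

```python
# Totals printed by the two runs.
step_total, n_steps = 488.0, 500
episode_total, n_episodes = 10693.0, 500

avg_per_step = step_total / n_steps            # reward per timestep
avg_per_episode = episode_total / n_episodes   # reward per episode

# Assumption: every regular step pays 1, so the shortfall counts
# the steps that paid nothing.
zero_reward_steps = int(n_steps - step_total)

print(avg_per_step, avg_per_episode, zero_reward_steps)
```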
  
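The two averages are measured in different units (reward per timestep versus reward per episode), so they are not directly comparable. Since the objective of learning is to maximize the cumulative reward collected over an episode, measuring performance per episode is preferred over measuring it per timestep.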
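
One possible way to put a step-based run on the same per-episode scale is to split its reward stream at episode boundaries. The helper `episode_returns` and the sample log below are hypothetical and made up for illustration; in the notebook, the boundary flags would come from something like `is_last()` on each recorded time step.

```python
def episode_returns(rewards, is_last):
    """Sum per-step rewards into per-episode returns.

    rewards[i] is the reward of step i; is_last[i] is True when step i
    ended an episode. A trailing unfinished episode is dropped.
    """
    returns, running = [], 0.0
    for r, last in zip(rewards, is_last):
        running += r
        if last:
            returns.append(running)
            running = 0.0
    return returns


# Made-up log: two finished episodes (3 and 2 steps) plus one unfinished step.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
is_last = [False, False, True, False, True, False]
print(episode_returns(rewards, is_last))
```

Averaging the returned list gives a per-episode figure that can be compared with the result of the episode-based run.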