### Episodes of the Cart Pole Game

An episode begins by calling the ``reset`` function, which returns the **observation** values associated with the initial state of the MDP.

At each state, an **action** must be chosen that is then sent to the ``step`` function. The step function returns a **new observation** and the **reward** that resulted from that action.

The process of selecting an action that results in a new observation and a reward is repeated until the MDP terminates. The step function also returns a flag that indicates whether the MDP has terminated.

An **episode** of the CartPole MDP starts by calling the reset function and then repeatedly calling the step function until the MDP terminates. The rule used to select the actions is called the **policy**, and the sum of the rewards received is called the **value** of the policy.
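
The definitions above can be sketched as a small helper. This is an illustration only and not part of the original notebook: `run_episode` and `CountdownEnv` are hypothetical names, and `CountdownEnv` is a toy stand-in that merely mimics the return signatures of a Gymnasium environment's `reset` and `step` functions so that the loop can be run without CartPole. A policy is modeled here as a function that maps an observation to an action.

```python
class CountdownEnv:
    """Toy stand-in (hypothetical): terminates after a fixed number of steps."""

    def __init__(self, n_steps):
        self.n_steps = n_steps
        self.remaining = n_steps

    def reset(self):
        # Same return shape as a Gymnasium reset: (observation, info)
        self.remaining = self.n_steps
        return self.remaining, {}

    def step(self, action):
        # Same return shape as a Gymnasium step:
        # (observation, reward, terminated, truncated, info)
        self.remaining -= 1
        terminated = self.remaining == 0
        return self.remaining, 1.0, terminated, False, {}


def run_episode(env, policy):
    """Run one episode and return the sum of the rewards received."""
    obs, _ = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)  # choose an action from the current observation
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated  # stop on either end condition
    return total_reward


value = run_episode(CountdownEnv(3), policy=lambda obs: 0)
print(value)  # 3.0
```

The same `run_episode` function should work with the CartPole environment created below, for example with a policy that ignores the observation and samples a random action.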

In [1]:
# import Gymnasium (the maintained successor to OpenAI Gym)
import gymnasium as gym

In [2]:
#create CartPole environment
env = gym.make('CartPole-v1')

##### Starting Cart Pole
- Start an episode of Cart Pole using the reset() function
- Print the observed values. Note that these values have a random component and will not be the same every time you reset the environment.

In [3]:
#start an episode 
obs,_ = env.reset()
print(f"Observation: {obs}")

Observation: [ 0.00250597 -0.00742596 -0.00226926 -0.01668785]
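
For CartPole, the four entries of the observation are, in order, the cart position, the cart velocity, the pole angle (in radians), and the pole angular velocity. The sketch below labels them, using the observation printed above as sample input. The names `OBSERVATION_FIELDS` and `label_observation` are hypothetical and not part of Gymnasium.

```python
OBSERVATION_FIELDS = (
    "cart_position",
    "cart_velocity",
    "pole_angle",
    "pole_angular_velocity",
)


def label_observation(obs):
    """Pair each CartPole observation entry with a descriptive name."""
    return dict(zip(OBSERVATION_FIELDS, (float(x) for x in obs)))


# Sample values copied from the observation printed above
sample = [0.00250597, -0.00742596, -0.00226926, -0.01668785]
labeled = label_observation(sample)
for name, value in labeled.items():
    print(f"{name}: {value}")
```

To get the same initial observation on every run, a seed can be passed to the reset function, for example `env.reset(seed=42)`.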


##### Selecting actions
- The sample() function returns a random action from the action space  
- For Cart Pole, the actions are 0 and 1

The code in the next cell prints a small set of actions from the sample function to show the different values.

In [4]:
# sample() returns a random action
for i in range(8):
    action = env.action_space.sample()
    print(f"Action {i}: {action}")

Action 0: 1
Action 1: 1
Action 2: 1
Action 3: 0
Action 4: 0
Action 5: 0
Action 6: 1
Action 7: 1
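
In CartPole, action 0 pushes the cart to the left and action 1 pushes it to the right. The sketch below tallies the eight sampled actions printed above under those labels. `ACTION_MEANINGS` is an illustrative name, not a Gymnasium attribute.

```python
from collections import Counter

ACTION_MEANINGS = {0: "push cart left", 1: "push cart right"}

# Actions copied from the output above
sampled_actions = [1, 1, 1, 0, 0, 0, 1, 1]

counts = Counter(sampled_actions)
for action in sorted(ACTION_MEANINGS):
    print(f"{action} ({ACTION_MEANINGS[action]}): "
          f"{counts[action]} of {len(sampled_actions)}")
```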


##### An episode of Cart Pole
An episode of Cart Pole starts with the reset function. Then, until the episode ends, an action is selected and passed to the step function, which executes it. The step function returns a new observation, a reward, and two boolean flags, terminated and truncated, that indicate whether the episode has ended.
- Run an episode of Cart Pole by taking a random action at each step of the episode  
- Reset to start an episode
- While not done
    - Get a random action
    - Perform the action and record results
    - Print 
        - step number 
        - action taken 
        - new observation 
        - reward
        - terminated flag
        - truncated flag  
- The episode terminates when the terminated flag is True

The cell below runs an episode of the Cart Pole MDP. Actions are selected at random. Run the cell several times to observe that the number of steps in the episode varies. Note that in the final step the **terminated** flag is set to True.

In [5]:
obs,_ = env.reset()
print(f"Initial Observation: {obs}")
i = 0 # counts the number of steps in the episode
sum_reward = 0.0 # sums the rewards
terminated = False
truncated = False
while not (terminated or truncated):  # stop on failure or on the step limit
    action = env.action_space.sample()
    obs,reward,terminated,truncated,_ = env.step(action)
    i += 1
    sum_reward += reward
    print(f"Step: {i}, Action: {action}, New Observation: {obs}, Reward: {reward}")
    print(f"   --- Sum of Rewards: {sum_reward}, Terminated: {terminated}, Truncated: {truncated} ")

Initial Observation: [ 0.02010552 -0.04910374  0.02950318 -0.01884915]
Step: 1, Action: 1, New Observation: [ 0.01912344  0.14558296  0.0291262  -0.3020794 ], Reward: 1.0
   --- Sum of Rewards: 1.0, Terminated: False, Truncated: False 
Step: 2, Action: 0, New Observation: [ 0.0220351  -0.04994174  0.02308461 -0.00035487], Reward: 1.0
   --- Sum of Rewards: 2.0, Terminated: False, Truncated: False 
Step: 3, Action: 0, New Observation: [ 0.02103627 -0.24538703  0.02307751  0.2995212 ], Reward: 1.0
   --- Sum of Rewards: 3.0, Terminated: False, Truncated: False 
Step: 4, Action: 0, New Observation: [ 0.01612853 -0.4408302   0.02906794  0.59939206], Reward: 1.0
   --- Sum of Rewards: 4.0, Terminated: False, Truncated: False 
Step: 5, Action: 1, New Observation: [ 0.00731192 -0.24612673  0.04105578  0.31600505], Reward: 1.0
   --- Sum of Rewards: 5.0, Terminated: False, Truncated: False 
Step: 6, Action: 0, New Observation: [ 0.00238939 -0.4418087   0.04737588  0.62134767], Reward: 1.0
   --- Sum

If you run another step of the MDP from the terminal state, you will get a warning that the behavior of the step function is undefined. However, note that:
- **terminated** remains True
- the **reward** is zero, and consequently
- the **sum of the rewards** does not increase 

In [6]:
action = env.action_space.sample()
obs,reward,terminated,truncated,_ = env.step(action)
i += 1
sum_reward += reward
print(f"Step: {i}, Action: {action}, New Observation: {obs}, Reward: {reward}")
print(f"   --- Sum of Rewards: {sum_reward}, Terminated: {terminated}, Truncated: {truncated} ")

Step: 15, Action: 0, New Observation: [-0.14817002 -1.4293967   0.28458607  2.4311156 ], Reward: 0.0
   --- Sum of Rewards: 14.0, Terminated: True, Truncated: False 
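
Rather than stepping past a terminal state, the usual pattern is to call the reset function before starting the next episode. The following is a minimal sketch, assuming `env` follows the same `reset`/`step` interface used in the cells above. `run_random_episodes` is a hypothetical helper name, not a Gymnasium function.

```python
def run_random_episodes(env, n_episodes):
    """Run several random-action episodes, resetting between them.

    Returns a list with the number of steps taken in each episode.
    """
    lengths = []
    for _ in range(n_episodes):
        obs, _ = env.reset()  # always start from a fresh initial state
        steps = 0
        terminated = truncated = False
        while not (terminated or truncated):
            action = env.action_space.sample()
            obs, reward, terminated, truncated, _ = env.step(action)
            steps += 1
        lengths.append(steps)
    return lengths
```

Calling `run_random_episodes(env, 5)` with the environment created earlier returns five episode lengths, which will typically differ from one another because both the initial state and the actions are random.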


