# Episodes of CartPole implemented in MAPDL   

In [1]:
#!conda install numpy
#!pip install gym

#!pip install ./pyansys_rl ./pyansys_gym -q --user --no-warn-script-location

In [2]:
import itertools
import os

import gym
import numpy as np

import pyansys_cartpole

np.set_printoptions(precision=4, suppress=True)

<h2>Background:  Markov Decision Process</h2>
<img src="media/MDP_board.jpg" alt="Drawing" style="width: 400px;"/>

In a Markov Decision Process (MDP), an agent is immersed in an environment.  At any given time, the agent finds itself in a state and must select one of the available actions.  Once the agent takes an action, the environment responds by assigning a reward and transitioning the agent to a successor state.  This loop continues until a terminal state is reached.  This raises a natural question: can we learn to act optimally in such a setup, that is, to select sequences of actions that maximize the long-term cumulative reward?

<img src="media/MDP_loop.jpg" alt="Drawing" style="width: 700px;"/>
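
The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `ToyEnv`, `run_episode`, and `always_right` are hypothetical names invented for this sketch, and the toy reward rule has nothing to do with the CartPole physics. The `step` return value mirrors the `(state, reward, done, info)` tuple used later in this notebook.

```python
class ToyEnv:
    """Hypothetical toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        # The state is simply the step counter.
        self.t = 0
        return self.t

    def step(self, action):
        # Toy rule (an assumption of this sketch): action 1 earns +1, action 0 earns 0.
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon  # terminal state reached
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Run the agent/environment loop until a terminal state; return the cumulative reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                        # agent selects an action
        state, reward, done, _ = env.step(action)     # environment responds
        total += reward
    return total


def always_right(state):
    """Trivial policy that ignores the state and always returns action 1."""
    return 1


total_reward = run_episode(ToyEnv(horizon=5), always_right)
```

Learning to act optimally amounts to finding a `policy` function that makes this cumulative reward as large as possible.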

<h2>CartPole</h2>
<img src="media/cartpole_description.jpg" alt="Drawing" style="width: 600px;"/>

The CartPole is a classic control problem.  It is a balancing task: push the cart such that the pinned pole remains upright. In other words, the pole behaves as a solid inverted pendulum and is unstable about the desired configuration.  A simple implementation could use a revolute/hinge joint between the cart and the pole, and a translational joint between the cart and the ground. 

<h3>MAPDL in the loop</h3>
<img src="media/ANSYS_loop.jpg" alt="Drawing" style="width: 600px;"/>
<center>Fig: A single iteration of the CartPole as a Markov Decision Process using MAPDL</center>

In this implementation of the CartPole as a Markov Decision Process, we highlight the following components:

* Actions: push either left (0) or right (1) 
* State: $x_{\text{cart}}, v_{\text{cart}}, \theta_{\text{pole}}, v_{\text{pole}}$
* Reward: +1 for every timestep still in equilibrium
* Transition Model: courtesy of an MAPDL structural transient analysis 

At the start of each episode, the system is placed in a randomly seeded state, with positions, velocities, and angles drawn from a uniform distribution about the vertical resting position.  It is therefore unlikely to start exactly at equilibrium, and even if it did, that equilibrium would be unstable.
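
As a rough illustration of such a random start, the sketch below draws four state components from a uniform distribution centered on zero. The function name `sample_initial_state` and the half-width of 0.05 are assumptions made for this example; the bounds and reference angle actually used by the MAPDL wrapper are not shown in this notebook.

```python
import random


def sample_initial_state(rng, half_width=0.05):
    """Draw [x_cart, v_cart, theta_pole, v_pole] uniformly from [-half_width, half_width].

    The half-width is a hypothetical value chosen for illustration only.
    """
    return [rng.uniform(-half_width, half_width) for _ in range(4)]


rng = random.Random(0)  # fixed seed so the draw is reproducible
initial_state = sample_initial_state(rng)
```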

## Instance creation: MAPDL in the loop
Create an instance of an MAPDL environment that is wrapped for use in [OpenAI Gym](https://gym.openai.com/) through the newly developed Python gRPC bindings ([pyansys](https://pypi.org/project/pyansys/)).  The wrapper sets up the CartPole physics, accepts the available actions (i.e., forces), and calculates the state transitions (the kinematic response) after every time step (an MAPDL load step).  For reference, OpenAI Gym provides its own ad hoc [environment](https://gym.openai.com/envs/CartPole-v1/) that solves the system's [kinematic equations](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).
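
For comparison, here is a minimal sketch of the kind of closed-form transition that Gym's reference CartPole performs: the equations of motion are evaluated and advanced with one explicit Euler step. The constants follow Gym's classic-control implementation, and `gym_style_step` is a name chosen for this sketch. The angle here is measured from the vertical, as in Gym. This is not what the MAPDL environment does internally, since its transitions come from a structural transient analysis.

```python
import math

# Constants as used in Gym's classic-control CartPole (not the MAPDL model).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
HALF_LENGTH = 0.5   # half the pole length
FORCE_MAG = 10.0    # magnitude of the push
TAU = 0.02          # seconds between state updates


def gym_style_step(state, action):
    """Advance (x, x_dot, theta, theta_dot) by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # Accelerations from the cart-pole equations of motion.
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    # Explicit Euler update: positions use the old velocities.
    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


pushed_right = gym_style_step((0.0, 0.0, 0.0, 0.0), 1)
pushed_left = gym_style_step((0.0, 0.0, 0.0, 0.0), 0)
```

Starting from rest with the pole upright, a push to the right accelerates the cart to the right and tips the pole the opposite way. A push to the left gives the mirror image.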

In [4]:
env_name = 'pyansys-CartPole-v0'
env = gym.make(env_name)
mapdl = env.env.env._mapdl



Run several episodes (e.g., 3) of the CartPole with randomly chosen actions: sometimes 0 (push left), sometimes 1 (push right).

In [6]:
n_episodes = 3
for i in range(n_episodes):
    print('*' * 30, f'Episode: {i+1}', '*' * 30)
    cur_state = env.reset()  # start the episode from a randomly seeded state
    done, r_tot = False, 0
    while not done:
        # random policy: 0 pushes left, 1 pushes right
        action = np.random.choice([0, 1])
        # advance the environment by one time step (an MAPDL load step)
        next_state, reward, done, info = env.step(action)
        print('State:', cur_state, '\tAction:', '--->' if action else '<---', '\tReward: ', reward)
        cur_state, r_tot = next_state, r_tot + reward
    print('Episode Reward:', r_tot)
    print('')

****************************** Episode: 1 ******************************
State: [ 0.0133 -0.0328  1.5069  0.    ] 	Action: ---> 	Reward:  1
State: [0.0144 0.1074 1.3911 0.0258] 	Action: ---> 	Reward:  1
Episode Reward: 2

****************************** Episode: 2 ******************************
State: [-0.0218 -0.0213  1.9935  0.    ] 	Action: <--- 	Reward:  1
State: [-0.0229 -0.1079  2.1172  0.0343] 	Action: ---> 	Reward:  1
Episode Reward: 2

****************************** Episode: 3 ******************************
State: [0.0467 0.0398 1.7562 0.    ] 	Action: <--- 	Reward:  1
State: [ 0.0456 -0.1078  1.8794  0.0337] 	Action: ---> 	Reward:  1
Episode Reward: 2

