## Introduction

Good afternoon.  In my free time I enjoy experimenting with AI: not just solving a problem, but playing around with models and parameters to get a feel for what's really going on.  One of the topics I find most interesting is reinforcement learning (RL).  Instead of fitting a model to a fixed dataset, RL asks us to build an agent that experiences its data in the form of the game state and its labels in the form of rewards.  Therefore, not only do we have to be careful in how we design the model, but we also need to consider how we construct the game, the inputs, and the reward system.  Despite that freedom, each problem has an underlying structure that the model is trying to learn.  Hopefully, over the course of this notebook I'll illustrate both the structure behind this particular problem and the methodology behind reinforcement learning as a whole.

First, let me define some necessary terms used frequently in reinforcement learning.

1. <b>Environment</b>.  The environment is anything that pertains to the game being played.  This includes the rules for the game, the current state of the game, and the players participating in the game.
1. <b>Observation</b>.  The observation is a representation of all the relevant information about the current state of the environment.  The mathematical name for the set of all possible states of a given environment is its state space.
1. <b>Agent</b>.  An agent, put simply, is any model that receives input (usually a game state fed from the environment) and gives an output (usually in the form of a policy ${\pi}$, a rule for choosing actions).  The agent interacts with the environment in this way iteratively, tuning its model to produce better policies in order to optimize the reward it receives.

![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/sample_random.gif?raw=true)
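The back-and-forth between agent and environment described in the definitions above can be sketched in a few lines.  This is only a toy illustration: `CoinFlipEnv`, `RandomAgent`, and `run_episode` are hypothetical names made up for this example, and the "game" is just guessing coin flips for ten rounds.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip for 10 rounds."""

    def reset(self):
        self.rounds_played = 0
        self.coin = random.randint(0, 1)
        return self.rounds_played  # the observation handed to the agent

    def step(self, action):
        reward = 1 if action == self.coin else 0
        self.coin = random.randint(0, 1)
        self.rounds_played += 1
        done = self.rounds_played >= 10
        return self.rounds_played, reward, done


class RandomAgent:
    """Agent whose policy ignores the observation and guesses at random."""

    def act(self, observation):
        return random.randint(0, 1)


def run_episode(env, agent):
    """Play one full game and return the total reward collected."""
    observation = env.reset()
    total_reward, done = 0, False
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```

A learning agent would use the same loop, but would also store the rewards it sees and use them to change how `act` behaves.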

## Cartpole

In order to best discuss the core concepts behind RL, I've decided to use one of the simplest examples, known as the cartpole problem.  It consists of a pole standing on top of a cart.  The player can accelerate the cart right or left, moving the pivot point in an attempt to prevent the pole from falling.  The model is fed four different inputs: the x position of the cart, the x velocity of the cart, the angular position of the pole, and the angular velocity of the pole.

![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/ObservationSpace.PNG?raw=true)

At the beginning of the game, each of the four values is randomly initialized between -0.05 and 0.05 so that each initial state is slightly different but close to zero.  The game ends when the absolute value of the x position exceeds 2.4, the pole tilts more than 12$^{\circ}$ from vertical, or the game lasts for 500 frames.  (The observation space itself allows angles up to 0.418 radians, roughly 24$^{\circ}$, but the episode is cut off at half of that.)  The limitation on the cart is a lot more lenient than the one on the pole, which means any agent training to balance the pole will prioritize the pole's position over the cart's.  The agent has two options: accelerate the cart left or right.

![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/ActionSpace.PNG?raw=true)

Note that the agent is not allowed to keep the current speed.  The agent is also not allowed to directly adjust the pole speed.  Instead it must move the cart and use the resultant torque to keep the pole standing up.
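As a rough sketch of the rules above, here is how the starting state, the end-of-game check, and the two actions could be written down.  The thresholds are the ones Gym uses for CartPole-v1 (|x| > 2.4, pole more than 12 degrees from vertical, 500 frames); the function and constant names are my own and not part of Gym.

```python
import math
import random

X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = math.radians(12)   # pole angle limit, in radians
MAX_FRAMES = 500                 # time limit of CartPole-v1

# Gym encodes the two actions as integers.
ACTIONS = {0: "push cart left", 1: "push cart right"}


def initial_state(rng=random):
    """Four values (x, x velocity, angle, angular velocity) drawn near zero."""
    return [rng.uniform(-0.05, 0.05) for _ in range(4)]


def is_done(state, frame):
    """True once the cart or the pole leaves its limits, or time runs out."""
    x, _, theta, _ = state
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT or frame >= MAX_FRAMES
```

Note that there is no "do nothing" entry in `ACTIONS`, which is exactly why the agent cannot simply keep its current speed.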

## Physics

Before I get into the models I've used, let's talk about the environment and the physics behind it.  The cartpole is an example of an equilibrium problem, where we are trying to maintain a certain position (the pole standing straight up).  Furthermore, this is a type of equilibrium known as unstable equilibrium, where the slightest deviation from the target position leads to further deviation unless something intervenes.  As the horizontal position limit is much more lenient than the angle limit, we can ignore it for now in order to simplify our analysis of the environment.

![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/PotentialEnergyGraph.png?raw=true)|![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/PhaseSpaceDiagram.png?raw=true)
-|-
(A)|(B)

Figure A demonstrates the instability of the potential energy.  This is a good initial assessment, as every system tends to move towards its lowest potential energy.  Figure B is known as a phase space diagram, where we graph the rate of change of both position and velocity.  As we can see, the further the angle is from zero, the greater the acceleration (the rate of change of velocity).  The greater the angular velocity, the greater the rate of change of the angle (obviously, but what matters is that the rate of change points away from the center, the equilibrium point).  While both figures describe the same phenomenon, the phase space diagram does a much better job because we can see that the velocity is much more dangerous to equilibrium than the angle (this is because the game ends at small angles, so gravity does not get a chance to really affect the pole).  Another thing to note is that the velocity values may seem higher than possible.  This is because the game runs at 50 fps, so any model has the ability to update the speed every 0.02 seconds.  The dynamics of the system can be explored with the Lagrangian and the Euler-Lagrange equation.
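To get a feel for the phase space picture numerically, here is a minimal sketch under some loud assumptions: the cart is held still, the pole is treated as a point mass at a distance `LENGTH` from the pivot (0.5 m is my assumption), the 12 degree limit is the one Gym uses, and the integration is a plain Euler step at 50 fps.  The names `flow`, `euler_step`, and `frames_until_fall` are made up for this example.

```python
import math

G = 9.8        # gravitational acceleration, m/s^2
LENGTH = 0.5   # assumed distance from the pivot to the pole's mass, m
DT = 0.02      # one frame at 50 fps, s


def flow(theta, omega):
    """Rate of change of (angle, angular velocity) with the cart held still."""
    return omega, (G / LENGTH) * math.sin(theta)


def euler_step(theta, omega, dt=DT):
    """Advance the pole by one frame using a simple Euler update."""
    d_theta, d_omega = flow(theta, omega)
    return theta + d_theta * dt, omega + d_omega * dt


def frames_until_fall(theta, omega, limit=math.radians(12), max_frames=500):
    """Count frames until the angle leaves the limit (capped at max_frames)."""
    frames = 0
    while abs(theta) <= limit and frames < max_frames:
        theta, omega = euler_step(theta, omega)
        frames += 1
    return frames
```

Comparing `frames_until_fall(0.05, 0.0)` with `frames_until_fall(0.05, 0.5)` shows the same thing the diagram does: starting with some angular velocity ends the game sooner than starting from the same angle at rest.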

$$L = T - V \tag{3}$$

$$\dfrac{d}{dt}\dfrac{\partial L}{\partial \dot q} = \dfrac{\partial L}{\partial q} \tag{4}$$

Eq (3) sets the Lagrangian (L) equal to the total kinetic energy of the system (T) minus the total potential energy of the system (V).  Note that the Lagrangian itself is not a conserved quantity; with no friction and no applied force, it is the total energy T + V that stays constant.  Deriving Lagrangian mechanics is a little out of scope for this report, but I'll leave links at the end for those interested.  The first link is a mathematical derivation of the Euler-Lagrange equation (finding the shortest path between two points is the classic application), and the second explains why eq (4) describes the path a mechanical system actually takes.  For now, I'll keep the assumption that solving eq (4) properly will give the practical dynamics of the cartpole (and any similar RL environment).  The EL equation states that the time derivative of the partial derivative of the Lagrangian with respect to the rate of change of a coordinate ($\dot q$) must equal the partial derivative of the Lagrangian with respect to that coordinate ($q$).
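To make this concrete, here is a sketch of how the two equations play out for the cartpole.  The symbols are my own assumptions and not taken from the environment's source: a cart of mass $M$ at position $x$, a pole treated as a point mass $m$ at distance $l$ from the pivot, the angle $\theta$ measured from the upright position, and a horizontal force $F$ applied to the cart.

$$T = \tfrac{1}{2}(M+m)\dot x^2 + m l\,\dot x\,\dot\theta\cos\theta + \tfrac{1}{2} m l^2\dot\theta^2, \qquad V = m g l\cos\theta$$

Taking $q = \theta$ in the Euler-Lagrange equation:

$$\dfrac{d}{dt}\dfrac{\partial L}{\partial\dot\theta} = m l\,\ddot x\cos\theta - m l\,\dot x\,\dot\theta\sin\theta + m l^2\ddot\theta, \qquad \dfrac{\partial L}{\partial\theta} = -m l\,\dot x\,\dot\theta\sin\theta + m g l\sin\theta$$

$$\Rightarrow\quad l\,\ddot\theta + \ddot x\cos\theta = g\sin\theta$$

Taking $q = x$, and adding the applied force $F$ to the right-hand side as a generalized force (the form without it only covers an undriven cart):

$$(M+m)\,\ddot x + m l\,\ddot\theta\cos\theta - m l\,\dot\theta^2\sin\theta = F$$

With the cart held still ($\ddot x = 0$) the first result reduces to $\ddot\theta = (g/l)\sin\theta$, which is the "falls faster the further it leans" behaviour of an unstable equilibrium.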

## Baselines








![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/suicide_model_sample.gif?raw=true)
![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/random_model_sample.gif?raw=true)
![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/controlled_angle_sample.gif?raw=true)
![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/controlled_velocity_sample.gif?raw=true)
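Judging only by the file names of the clips above, the four baselines appear to be a model that fails on purpose, a random model, a controller that reacts to the pole angle, and a controller that reacts to the pole's angular velocity.  The sketch below is my guess at what such hand-written policies might look like, not the code that produced the clips; all four function names are hypothetical, and each takes the four-value observation and returns 0 (push left) or 1 (push right).

```python
import random

LEFT, RIGHT = 0, 1


def always_left(observation):
    """Pushes the same way every frame, so the pole falls almost at once."""
    return LEFT


def random_policy(observation):
    """Ignores the observation and picks a direction at random."""
    return random.choice((LEFT, RIGHT))


def angle_controller(observation):
    """Pushes the cart toward the side the pole is leaning to."""
    _, _, theta, _ = observation
    return RIGHT if theta > 0 else LEFT


def velocity_controller(observation):
    """Pushes the cart toward the side the pole is currently falling to."""
    _, _, _, omega = observation
    return RIGHT if omega > 0 else LEFT
```

Baselines like these cost nothing to train, which makes them a useful yardstick for any learned model.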

## Markov



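```python
# Written against the classic Gym API (gym < 0.26):
# reset() returns the observation and step() returns four values.
import gym
import numpy as np


class MarkovAgent:
    """Tabular agent that averages discounted returns per (state, action)."""

    def __init__(self):
        self.memory = []     # [state key, reward, action] for the current episode
        self.Qs = {}         # state key -> mean return for each of the two actions
        self.Ns = {}         # state key -> sample count for each of the two actions
        self.done = False
        self.gamma = 0.9     # discount factor
        self.epsilon = 1     # exploration rate, decayed after every episode
        self.decay = 0.99

        self.session_score = 0
        self.max_distance = 0

        # "CartPole-v3" is not a registered Gym environment; v1 is the version
        # with the 500-frame limit.
        self.env = gym.make("CartPole-v1")

    def reset(self):
        self.done = False
        self.memory = []

    def play_session(self):
        observation = self.env.reset()
        self.session_score = 0
        self.max_distance = 0

        while not self.done:
            s, action = self.take_action(observation)
            observation, reward, self.done, info = self.env.step(action)
            self.session_score += 1
            # The final step of an episode shorter than 2000 frames is scored -100.
            failed = self.done and self.session_score < 2000
            self.memory.append([s, -100 if failed else reward, action])
            # Track how far the cart strayed from the center.
            self.max_distance = max(self.max_distance, abs(observation[0]))

        # Walk the episode backwards, folding each discounted return into the
        # running mean stored for that (state, action) pair.
        past_reward = 0
        for s, reward, action in reversed(self.memory):
            ret = reward + self.gamma * past_reward
            self.Qs[s][action] = (self.Qs[s][action] * self.Ns[s][action] + ret) / (self.Ns[s][action] + 1)
            self.Ns[s][action] += 1
            past_reward = ret

        self.epsilon = max(self.decay * self.epsilon, 0.05)
        self.reset()

    def take_action(self, observation):
        # Discretize the state: cart values to 1 decimal, pole values to 2.
        # tobytes() replaces the deprecated tostring().
        s = np.asarray([round(observation[i], 1 + i // 2) for i in range(len(observation))]).tobytes()
        if s not in self.Ns:
            # Optimistic start: act as if 5 returns of 100 were already seen.
            self.Ns[s] = [5, 5]
            self.Qs[s] = [100, 100]

        if np.random.uniform(0, 1) > self.epsilon:
            action = int(np.argmax(self.Qs[s]))       # exploit the best estimate
        else:
            action = self.env.action_space.sample()   # explore at random
        return s, action
```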
![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/training_markov.gif?raw=true)
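How quickly the agent moves from exploring to exploiting depends on the exploration rate.  As a quick sketch, assuming the same values the agent uses (start at 1, multiply by 0.99 after every session, never go below 0.05), the schedule can be listed out directly; `epsilon_schedule` is a hypothetical helper, not part of the agent.

```python
def epsilon_schedule(episodes, start=1.0, decay=0.99, floor=0.05):
    """Return the exploration rate used in each episode, in order."""
    epsilon, history = start, []
    for _ in range(episodes):
        history.append(epsilon)
        epsilon = max(decay * epsilon, floor)
    return history
```

With these values the rate reaches its floor after roughly 300 sessions, so from that point on about one action in twenty is still random.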

![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/markov_decisionboundary.png?raw=true)|![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/svm_decisionboundary.png?raw=true)
-|-
(A)|(B)


I increased the frame limit to 1000 just to test the model, and wowza.


![Original Image](https://github.com/alexikerd/Cartpole/blob/main/graphics/svm_model_sample.gif?raw=true)

## Citations


1. Derivation of Euler-Lagrange equation https://farside.ph.utexas.edu/teaching/336L/Fluid/node266.html
2. Explanation of lagrangian mechanics http://www.physicsinsights.org/lagrange_1.html
3. Solution to Cartpole problem using physics https://danielpiedrahita.wordpress.com/portfolio/cart-pole-control/
4. How to handle P(s) and Q(s) with state as input https://web.stanford.edu/~surag/posts/alphazero.html