# Acrobot

The Acrobot system is a kind of robotic arm made of two links and two joints. Initially, the links hang straight down, and the goal is to swing the free end of the lower link up to a given height.
Thanks to the OpenAI Gym library we can focus solely on the training algorithm, without having to program the [Acrobot](https://gym.openai.com/envs/Acrobot-v1/) itself.

In [70]:
import numpy as np
import gym

## Random Policy

Below is a demonstration of the acrobot's behavior when it chooses its moves at random.

In [75]:
import gym
env = gym.make('Acrobot-v1')
for i_episode in range(20):
    total_reward = 0.0
    t = 0          # timesteps taken in this episode
    done = False   # must be set before the loop condition reads it
    observation = env.reset()
    while not done:
        action = env.action_space.sample()  # pick a random action
        observation, reward, done, info = env.step(action)
        env.render()
        total_reward += reward
        t += 1
    print("Episode {} finished after {} timesteps. Total reward: {}".format(i_episode, t, total_reward))

## Best Policy

In [76]:
def maxAction(Q,s):
    values=np.array([Q[s,a] for a in [0,1,2]])
    action = np.argmax(values)
    return action

We start by discretizing the state space. Each state is a 6-tuple: the first four elements are the cosines and sines of the two joint angles of the robotic arm, which take values between -1 and 1. The other two are the angular velocities, which in a few trial runs mostly stayed between -5 and 5.

In [77]:
#[cos(theta1) sin(theta1) cos(theta2) sin(theta2) thetaDot1 thetaDot2].
ct1_space=np.linspace(-1,1,10)
st1_space=np.linspace(-1,1,10)
ct2_space=np.linspace(-1,1,10)
st2_space=np.linspace(-1,1,10)
t1_space=np.linspace(-5,5,10)
t2_space=np.linspace(-5,5,10)

The following function takes the raw acrobot observation, i.e. the 6-tuple, and returns a state with integer elements. Each element is the index of the bin into which the corresponding floating-point value falls within the ranges defined above.

In [78]:
def getState(observation):
    ct1=int(np.digitize(observation[0],ct1_space))
    st1=int(np.digitize(observation[1],st1_space))
    ct2=int(np.digitize(observation[2],ct2_space))
    st2=int(np.digitize(observation[3],st2_space))
    t1=int(np.digitize(observation[4],t1_space))
    t2=int(np.digitize(observation[5],t2_space))
    
    return(ct1,st1,ct2,st2,t1,t2)
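
To make the binning concrete, here is a minimal sketch of how `np.digitize` maps values onto one of the grids used above. The sample values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Same kind of grid the notebook uses for the sine/cosine components.
bins = np.linspace(-1, 1, 10)

# np.digitize returns i such that bins[i-1] <= x < bins[i].
# Values below bins[0] map to 0; values >= bins[-1] map to len(bins).
samples = [-1.5, -1.0, 0.0, 0.95, 1.0, 2.0]  # illustrative values only
indices = [int(np.digitize(x, bins)) for x in samples]
print(indices)  # [0, 1, 5, 9, 10, 10]
```

Each component can therefore take 11 distinct indices (0 through 10), which is why the state enumeration loops over `len(space)+1` values per dimension.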
        

We define the set of states and initialize our dictionary Q.

In [79]:
states=[]

for i in range(len(ct1_space)+1):
    for j in range(len(st1_space)+1):
        for k in range(len(ct2_space)+1):
            for l in range(len(st2_space)+1):
                for m in range(len(t1_space)+1):
                    for n in range(len(t2_space)+1):
                        states.append((i,j,k,l,m,n))
                        

In [80]:
Q={}
for s in states:
    for a in [0,1,2]:
        Q[s,a]=0.0
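
Filling the table eagerly creates one entry for every state-action pair. The sketch below counts those entries and shows a lazy alternative based on `collections.defaultdict`. This is not what the notebook does; `Q_lazy` is a hypothetical name used only for this example.

```python
from collections import defaultdict

n_bins = 11      # np.digitize on a 10-point grid yields indices 0..10
n_dims = 6       # components of one observation
n_actions = 3    # actions 0, 1, 2

# Number of entries the eagerly initialised dictionary holds.
table_size = n_bins ** n_dims * n_actions
print(table_size)  # 5314683

# Lazy alternative: unseen (state, action) keys default to 0.0 on first access.
Q_lazy = defaultdict(float)
state = (0, 1, 5, 9, 10, 10)
Q_lazy[state, 2] += 0.5
print(len(Q_lazy))  # 1
```

With a lazy table, memory grows only with the states actually visited, at the cost of reads also creating entries.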
        

## Sarsa

In [81]:
def sarsa(env,alpha, gamma, epsilon,episodes, Q):
    
    Rewards=np.zeros(episodes)
    moves=np.zeros(episodes)
    for i in range(episodes):
        
        observation=env.reset()
        s=getState(observation)
        rand=np.random.random()
        a=maxAction(Q,s) if rand<1-epsilon else env.action_space.sample()
        done=False
        epReward=0
        epMoves=0
        while not done:
            observation_,reward,done,info= env.step(a)
            s_=getState(observation_)
            rand=np.random.random()
            # SARSA is on-policy: the next action is chosen from the next state s_
            a_=maxAction(Q,s_) if rand<1-epsilon else env.action_space.sample()
            epReward+=reward
            epMoves+=1
            
            Q[s,a]=Q[s,a]+alpha*(reward + gamma*Q[s_,a_]-Q[s,a])
            
            s,a=s_,a_
            
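        # Linear decay, assuming 1/episodes was intended: subtracting 1/epsilon
        # would push epsilon far below zero after the first episode.
        epsilon=max(epsilon-1/episodes,0.0)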
        Rewards[i]=epReward
        moves[i]=epMoves
        
        if i%100==0:
            print("Episode {} finished after {} timesteps. Total reward: {}".format(i, moves[i], Rewards[i]))
    return Q,Rewards,moves
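
A single step of the update inside the loop can be worked through by hand. The sketch below uses a toy two-entry table with made-up state labels and values; only `alpha` and `gamma` match the settings this notebook trains with.

```python
alpha, gamma = 0.5, 0.8

# Toy Q-table; the states "s0"/"s1" and their values are illustrative only.
Q_toy = {("s0", 0): 0.0, ("s1", 1): -2.0}

# One observed transition (s, a, reward, s', a').
s, a, reward, s_next, a_next = "s0", 0, -1.0, "s1", 1

# SARSA bootstraps from the action actually chosen in the next state.
td_target = reward + gamma * Q_toy[s_next, a_next]  # about -2.6
td_error = td_target - Q_toy[s, a]                  # about -2.6
Q_toy[s, a] += alpha * td_error                     # about -1.3

print(round(Q_toy[s, a], 6))
```

The target uses `Q_toy[s_next, a_next]`, not the maximum over actions, which is what distinguishes SARSA from Q-learning.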
        
        

## Function to simulate the acrobot

In [82]:
def taste(env,Q, episodes):
    
    for i in range(episodes):
        done=False 
        observation=env.reset()
        total_reward=0
        moves=0
        while not done:
            action = maxAction(Q, getState(observation))
            observation, reward, done, info = env.step(action)
            env.render()
            total_reward += reward
            moves+=1
            if done:
                print("Episode {} finished after {} timesteps. Total reward: {}".format(i, moves, total_reward))
                break

In [86]:
env=gym.make('Acrobot-v1')
episodes=10000 #12000+5000
alpha=0.5
gamma=0.8
eps=0.1

Q, Rewards, moves = sarsa(env,alpha, gamma, eps,episodes, Q)


Episode 0 finished after 500.0 timesteps. Total reward: -500.0
Episode 100 finished after 373.0 timesteps. Total reward: -372.0
Episode 200 finished after 500.0 timesteps. Total reward: -500.0
Episode 300 finished after 500.0 timesteps. Total reward: -500.0
Episode 400 finished after 500.0 timesteps. Total reward: -500.0
Episode 500 finished after 449.0 timesteps. Total reward: -448.0
Episode 600 finished after 500.0 timesteps. Total reward: -500.0
Episode 700 finished after 483.0 timesteps. Total reward: -482.0
Episode 800 finished after 265.0 timesteps. Total reward: -264.0
Episode 900 finished after 500.0 timesteps. Total reward: -500.0
Episode 1000 finished after 247.0 timesteps. Total reward: -246.0
Episode 1100 finished after 500.0 timesteps. Total reward: -500.0
Episode 1200 finished after 500.0 timesteps. Total reward: -500.0
Episode 1300 finished after 283.0 timesteps. Total reward: -282.0
Episode 1400 finished after 500.0 timesteps. Total reward: -500.0
Episode 1500 finished 
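
The log prints only every 100th episode, so individual lines are noisy. One way to judge the trend is a moving average over the per-episode rewards. The sketch below uses a short made-up array in place of the real `Rewards`; `moving_average` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Made-up stand-in for the `Rewards` array returned by sarsa().
fake_rewards = np.array([-500.0, -500.0, -400.0, -300.0, -300.0, -200.0])
smoothed = moving_average(fake_rewards, window=3)
print(smoothed)
```

With the real data, something like `moving_average(Rewards, 100)` would give one smoothed value per 100-episode window.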

In [87]:
taste(env,Q,1)

Episode 0 finished after 310 timesteps. Total reward: -309.0


In [None]:
np.save("Q_sarsa.npy", Q)
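
Because `Q` is a dictionary, `np.save` stores it as a pickled object inside a zero-dimensional array. Reading it back therefore needs `allow_pickle=True` and a call to `.item()`. A minimal round trip, using a toy table and a temporary file chosen only for this example:

```python
import os
import tempfile
import numpy as np

# Toy table with the same key layout as Q: (state tuple, action).
Q_toy = {((0, 1, 5, 9, 10, 10), 2): -1.3}

path = os.path.join(tempfile.mkdtemp(), "Q_toy.npy")
np.save(path, Q_toy)  # the dict is pickled into a 0-d object array

loaded = np.load(path, allow_pickle=True)  # allow_pickle defaults to False
Q_restored = loaded.item()                 # unwrap the array back into a dict
print(Q_restored == Q_toy)  # True
```

Only load pickled files from a source you trust, since unpickling can execute arbitrary code.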


In [None]:
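# Assumes a Q-table was previously saved under this name (the cell above writes Q_sarsa.npy).
# The table is a pickled dict, so allow pickling and unwrap the 0-d object array.
Qsaved=np.load("Q_acrobot.npy", allow_pickle=True).item()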

In [None]:
taste(env,Qsaved,1)