# **Reinforcement Learning: Introduction**

This tutorial explains the basic use of Reinforcement Learning and was written to fulfil an assignment for the Advanced Machine Learning course. The references used in this tutorial come from books and papers. The case covered here applies the classic tabular Q-learning method to the classic [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) puzzle.

![alt text](https://media2.giphy.com/media/46ib09ZL1SdWuREnj3/giphy.gif?cid=3640f6095c6e92762f3446634d90bc65) ![alt text](https://media0.giphy.com/media/d9QiBcfzg64Io/200w.webp?cid=3640f6095c6e93e92f30655873731752)![alt text](https://i.gifer.com/GpAY.gif)

Reinforcement Learning works by identifying the patterns that lead to optimal behaviour within the context of the given problem, so that the agent can make the best decision for its next step.

## **Q-Learning**

### Tabular Q-Learning with Frozen Lake
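
A minimal sketch of tabular Q-learning on the same 4x4 layout, written in plain Python so it runs without Gymnasium. The helpers `step`, `q_learning`, and `greedy_path`, the deterministic (non-slippery) dynamics, the purely random behaviour policy, and the hyperparameters are all assumptions made for illustration; they are not taken from the notebook cells.

```python
import random

# Same layout as the map used in this tutorial (S: start, F: frozen, H: hole, G: goal)
GRID = ["SFFF", "FHFH", "FFFF", "HFFG"]
N_ROWS, N_COLS = 4, 4
ACTIONS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def step(state, action):
    """Assumed deterministic transition: returns (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)  # walls keep the agent on the grid
    col = min(max(col + d_col, 0), N_COLS - 1)
    cell = GRID[row][col]
    return row * N_COLS + col, float(cell == "G"), cell in "HG"


def q_learning(n_episodes=5000, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_ROWS * N_COLS)]
    for _ in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.randrange(len(ACTIONS))  # random exploration (off-policy)
            next_state, reward, done = step(state, action)
            # Q-learning target: bootstrap from the best action in the next state
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the greedy policy from the start state and record visited states."""
    state, path = 0, [0]
    for _ in range(max_steps):
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = q_learning()
path = greedy_path(q_table)
print(path)
```

Because Q-learning is off-policy, a random behaviour policy is enough on a grid this small; an epsilon-greedy policy is the more common choice on larger problems.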

In [1]:
import gymnasium as gym
import numpy as np

# You need this part
# S:Start, F:Frozen, H:Hole, G:Goal
lake_map = ["SFFF", "FHFH", "FFFF", "HFFG"]
# is_slippery=True means stochastic and is_slippery=False means deterministic
env = gym.make('FrozenLake-v1', render_mode="human", desc=lake_map, map_name="4x4", is_slippery=True)
env.reset()
env.render()

# You need to find the policy using both value iteration and policy iteration
# You may not need this part!
action = ["left", "down", "right", "up"]
ncols = 4
nrows = 4
e = 0.001
max_iterations = 1000 #  maximum iterations if there is an infinite loop

# GIVEN
# A sample policy to make the following while loop works
# policy = [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 1, 0, 2, 2, 0]

#  Initializing the variables
n_states = env.observation_space.n # the total number of states in the environment
n_actions = env.action_space.n # number of possible actions in the environment
gamma_list = [0.8] # substitute with ('0.5' and '1')

'''
This function implements the value iteration algorithm to compute the optimal policy.

It initializes the state value function (V) for all states to 0 and then iteratively updates the values until they converge to the optimal values.
The loop continues until the maximum change in any value is less than the error threshold (e).

Parameters
----------
env : an object of a Gym environment class
    Given environment

gamma : float
    Discount factor

e : float
    Error threshold

Returns
-------
policy
    1-D NumPy array of integers
    Each element in the array represents the best action to take in the corresponding state to maximize the expected cumulative reward
    The optimal policy

'''
def value_iteration(env, gamma, e):
    V = np.zeros(n_states)

    # check for convergence
    # runs until 'delta' is less than a predefined value 'e'
    while True:
        delta = 0   # the maximum absolute difference between the old value of a state v and the new value V[s] computed in the current iteration
        for s in range(n_states):
            v = V[s]
            q_vals = np.zeros(n_actions)
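            for a in range(n_actions):
                # env.unwrapped exposes the transition model P even when the environment is wrapped
                for p, s_next, r, done in env.unwrapped.P[s][a]:
                    q_vals[a] += p * (r + gamma * V[s_next])
            V[s] = max(q_vals)  # Bellman optimality backup, applied once all actions are evaluated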
            delta = max(delta, abs(v - V[s]))
        if delta < e:
            break
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_vals = np.zeros(n_actions)
        for a in range(n_actions):
            for p, s_next, r, done in env.unwrapped.P[s][a]:
                q_vals[a] += p * (r + gamma * V[s_next])
        policy[s] = np.argmax(q_vals)  # greedy action with respect to the converged values
    return policy

'''
This function implements the policy iteration algorithm to compute the optimal policy.

It initializes a random policy for all states, then iteratively evaluates and improves the policy until convergence.
In the evaluation step, it computes the state values for the given policy until they converge to the optimal values.
In the improvement step, it updates the policy for each state by selecting the action that maximizes the expected value of the next state.

Parameters
----------
env : an object of a Gym environment class
    Given environment

gamma : float
    Discount factor

e : float
    Error threshold

Returns
-------
policy
    1-D NumPy array of integers
    Each element in the array represents the best action to take in the corresponding state to maximize the expected cumulative reward.
    The optimal policy.

'''
def policy_iteration(env, gamma, e):
    num = 0 # to check for the iterations
    policy = np.zeros(n_states, dtype=int)
    print()
    print("policy:", policy) # Add print statement to see policy at each step
    print() 
    while True:
        V = np.zeros(n_states)
        while True:
            delta = 0
            #if num > max_iterations:   # stops iteration at 1000 but the GUI hangs
            #    break                
            for s in range(n_states):
                v = V[s]
                a = policy[s]
                q_val = 0
                for p, s_next, r, done in env.unwrapped.P[s][a]:
                    q_val += p * (r + gamma * V[s_next])
                V[s] = q_val
                delta = max(delta, abs(v - V[s]))
            num += 1 # increment to keep count
            print("V: " + str(num))
            print(V.reshape(4, 4)) # Add print statement to see V values at each step
            if delta < e:
                break
        policy_stable = True

        # check for convergence
        # old policy at each state is saved in 'old_action', and then the Q-values for each action at the state are evaluated using the updated 
        # value function 'V'. The policy is then updated to choose the action with the highest Q-value, and if the updated policy at any state 
        # is different from the old policy, then policy_stable is set to False.
        for s in range(n_states):
            old_action = policy[s]
            q_vals = np.zeros(n_actions)
            for a in range(n_actions):
                for p, s_next, r, done in env.unwrapped.P[s][a]:
                    q_vals[a] += p * (r + gamma * V[s_next])
            policy[s] = np.argmax(q_vals)
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable: # if 'policy_stable' remains True after the loop, then the policy is considered to have converged and the iteration loop is broken
            break
        print()
        print("policy:", policy) # Add print statement to see policy at each step
        print()  
    return policy

# Loop to print out the optimal policies:
# Both policies are represented as arrays of integers, with each index corresponding to a state in the environment and the value at that 
# index representing the action to be taken in that state according to the optimal policy
for gamma in gamma_list:
    print("gamma: ", gamma)
    value_policy = value_iteration(env, gamma, e) #the optimal policy obtained by running the value iteration algorithm on the environment
    print("Value iteration policy:", value_policy)
    print()
    policy_policy = policy_iteration(env, gamma, e) #the optimal policy obtained by running the policy iteration algorithm on the environment
    print() 
    print("Policy iteration policy:", policy_policy)
    print()
    print("------------------------")


# GIVEN
# This part uses the found policy to interact with the environment.
# You don't need to change anything here.

s = 0
goal = ncols * nrows - 1
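while s != goal:
    a = value_policy[s]
    s, r, terminated, truncated, info = env.step(a)
    # Restart when the episode ends (hole or time limit) before the goal is reached
    if (terminated or truncated) and s != goal:
        s, info = env.reset()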
print("END")
print("------------------------")

gamma:  0.8
Value iteration policy: [1 3 0 3 0 0 0 0 3 1 1 1 0 2 2 0]


policy: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

V: 1
[[0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.33333333]
 [0.         0.         0.         0.        ]]
V: 2
[[0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.33333333]
 [0.         0.         0.         0.        ]]

policy: [0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0]

V: 3
[[0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.33333333]
 [0.         0.         0.33333333 0.        ]]
V: 4
[[0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.17777778 0.46962963]
 [0.         0.         0.42222222 0.        ]]
V: 5
[[0.         0.         0.         0.        ]
 [0.         
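
The `0.33333333` entry in the first sweeps of the output above can be reproduced by hand. The sketch below is a plain-Python policy evaluation for the all-`left` starting policy. It assumes the slippery dynamics move the agent in the intended direction or in one of the two perpendicular directions with probability 1/3 each, and that holes and the goal are absorbing. The helper names `move`, `transitions`, and `evaluate_policy` are hypothetical and are not part of Gymnasium.

```python
GRID = ["SFFF", "FHFH", "FFFF", "HFFG"]
N = 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def move(state, direction):
    """Move one cell in the given direction, staying inside the grid."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col


def transitions(state, action):
    """List of (probability, next_state, reward) under the assumed slippery dynamics."""
    if GRID[state // N][state % N] in "HG":
        return [(1.0, state, 0.0)]  # terminal states are absorbing
    outcomes = []
    for direction in [(action - 1) % 4, action, (action + 1) % 4]:
        nxt = move(state, direction)
        outcomes.append((1 / 3, nxt, float(GRID[nxt // N][nxt % N] == "G")))
    return outcomes


def evaluate_policy(policy, gamma=0.8, tol=0.001):
    """Iterative policy evaluation with in-place sweeps."""
    values = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            new_value = sum(p * (r + gamma * values[s2])
                            for p, s2, r in transitions(s, policy[s]))
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values


values = evaluate_policy([0] * 16)  # the all-"left" starting policy
print(round(values[11], 4))  # → 0.3333
```

Under this policy only state 11 (row 2, column 3) can slip down into the goal, so its value is 1/3 × 1 and every other state stays at 0.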


### CartPole

In [8]:
import gym, numpy as np, matplotlib.pyplot as plt
from neural_networks.policy_gradient_utilities import PolicyGradient

In [10]:
n_units = 5
gamma = .99
batch_size = 50
learning_rate = 1e-3
n_episodes = 10000
render = False
goal = 190
n_layers = 2
n_classes = 2
environment = gym.make('CartPole-v1')
environment_dimension = len(environment.reset())

In [11]:
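def calculate_discounted_reward(reward, gamma=gamma):
    # Discounted reward-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
    reward = np.asarray(reward, dtype=float).reshape(-1)
    output = np.zeros_like(reward)
    for i in reversed(range(len(reward))):
        output[i] = reward[i] + (gamma * output[i + 1] if i + 1 < len(reward) else 0.0)
    return output.reshape(-1, 1)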
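
As a worked example of discounting, the reward-to-go commonly used in policy-gradient methods is G_t = r_t + γ·G_{t+1}. The sketch below uses a hypothetical helper `reward_to_go` and made-up rewards, with a discount of 0.5 chosen only to keep the arithmetic easy to follow.

```python
def reward_to_go(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


example_returns = reward_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # → [1.75, 1.5, 1.0]
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 × 1 = 1.5, and the first step sees 1 + 0.5 × 1.5 = 1.75.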

In [35]:
def score_model(model, n_tests, render=render):
    scores = []
    for _ in range(n_tests):
        observation = environment.reset()
        reward_sum = 0
        while True:
            if render:
                environment.render()

            state = np.reshape(observation, [1, environment_dimension])
            predict = model.predict([state])[0]
            action = np.argmax(predict)  # act greedily when scoring
            observation, reward, done, _ = environment.step(action)
            reward_sum += reward

            if done:
                break
        # Record the total reward once the episode has finished
        scores.append(reward_sum)

    environment.close()
    return np.mean(scores)

In [36]:
def cart_pole_game(environment, policy_model, model_predictions):
    loss = []
    n_episode, reward_sum, score, episode_done = 0, 0, 0, False
    n_actions = environment.action_space.n
    observation = environment.reset()
    states = np.empty(0).reshape(0, environment_dimension)
    actions = np.empty(0).reshape(0, 1)
    rewards = np.empty(0).reshape(0, 1)
    discounted_rewards = np.empty(0).reshape(0, 1)

    while n_episode < n_episodes:
        state = np.reshape(observation, [1, environment_dimension])
        prediction = model_predictions.predict([state])[0]
        action = np.random.choice(range(environment.action_space.n), p=prediction)
        states = np.vstack([states, state])
        actions = np.vstack([actions, action])
        observation, reward, episode_done, info = environment.step(action)
        reward_sum += reward
        rewards = np.vstack([rewards, reward])

        if episode_done:
            discounted_reward = calculate_discounted_reward(rewards)
            discounted_rewards = np.vstack([discounted_rewards, discounted_reward])
            rewards = np.empty(0).reshape(0, 1)

            # Train once every batch_size completed episodes
            if (n_episode + 1) % batch_size == 0:
                discounted_rewards -= discounted_rewards.mean()
                discounted_rewards /= discounted_rewards.std()
                discounted_rewards = discounted_rewards.squeeze()
                actions = actions.squeeze().astype(int)
                train_actions = np.zeros([len(actions), n_actions])
                train_actions[np.arange(len(actions)), actions] = 1
                error = policy_model.train_on_batch([states, discounted_rewards], train_actions)
                loss.append(error)
                states = np.empty(0).reshape(0, environment_dimension)
                actions = np.empty(0).reshape(0, 1)
                discounted_rewards = np.empty(0).reshape(0, 1)
                score = score_model(model=model_predictions, n_tests=10)
                print('\nEpisode: %s \nAverage Reward: %s \nScore: %s \nError: %s' %(n_episode+1, reward_sum/float(batch_size), score, np.mean(loss[-batch_size:])))

                if score >= goal:
                    break

            # Reset the per-episode state only after an episode has finished
            reward_sum = 0
            n_episode += 1
            observation = environment.reset()

    plt.title('Policy Gradient Error plot over %s Episodes'%(n_episode+1))
    plt.xlabel('N batches')
    plt.ylabel('Error Rate')
    plt.plot(loss)
    plt.show()
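
The batch update in `cart_pole_game` standardises the discounted rewards and one-hot encodes the sampled actions before calling `train_on_batch`. A minimal plain-Python sketch of those two steps follows; the helper names `standardise` and `one_hot` and the input values are made up for illustration.

```python
import statistics


def standardise(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, like NumPy's default .std()
    return [(v - mean) / std for v in values]


def one_hot(actions, n_actions):
    """Turn integer actions into rows with a single 1 in the chosen column."""
    return [[1.0 if a == chosen else 0.0 for a in range(n_actions)]
            for chosen in actions]


returns = [3.0, 2.0, 1.0, 2.0]  # hypothetical discounted rewards
sampled_actions = [0, 1, 1, 0]  # hypothetical CartPole actions (0: left, 1: right)

scaled_returns = standardise(returns)
encoded_actions = one_hot(sampled_actions, n_actions=2)
print(scaled_returns)
print(encoded_actions)
```

Standardising keeps the scale of the training signal comparable from batch to batch. It divides by the standard deviation, so a batch in which every return is identical would need a guard against division by zero.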


In [65]:
if __name__ == '__main__':
    mlp_model = PolicyGradient(
        n_units=n_units,
        n_layers=n_layers,
        n_columns=environment_dimension,
        n_outputs=n_classes,
        learning_rate=learning_rate,
        hidden_activation='selu',
        output_activation='softmax',
        loss_function='log_likelihood'
    )
    
    policy_model, model_predictions = mlp_model.create_policy_model(environment_dimension,)
    
    policy_model.summary()
    
    cart_pole_game(
        environment=environment,
        policy_model=policy_model,
        model_predictions=model_predictions
    )

TypeError: ('Keyword argument not understood:', 'input')

## **Monte Carlo**