# **COMP 2211 Exploring Artificial Intelligence** #
## Lab 10 Reinforcement Learning ##

<img src="https://miro.medium.com/max/4800/1*7PoZafFLEVXQiseVx5y4cw.jpeg" width="600" height="326" /> 

# Introduction
A Markov Decision Process (MDP) is a good starting point for studying Reinforcement Learning.

In this lab, we will build an MDP and solve it with Value Iteration.


# Problem Setting
You wake up in your bed.

You're in Hong Kong now.

You have finished all your final exams. 
Today is the first day of your summer-break trip to Hong Kong. 
You now want to plan your sequence of activities for the day. 

Let's use an integer to represent the amount of Hong Kong Dollars (HKD) you have.
To start with, you have **m** HKD. 
You repeatedly choose one of two activities until you have either doubled your money (at least **2m** HKD) or spent it all (0 HKD). If you have only 1 HKD, you cannot choose activity B.

1. Activity A costs 1 HKD. With probability 0.05, it will return 2 HKD, and 0 HKD otherwise.

2. Activity B costs 2 HKD. With probability 0.03, it will return 4 HKD, and 0 HKD otherwise.

Of course, you want the strategy that maximizes your probability of ending up with at least **2m** HKD.
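To get a feel for the game before modelling it, the rules above can be simulated directly. The following is a minimal sketch, not part of the lab's required code: `simulate_day`, `always_a`, and the `policy` callback are hypothetical names, and falling back to activity A when B is unaffordable is an assumption made only to keep the simulation well defined.

```python
import random

def simulate_day(m, policy, rng):
    """Play one day starting with m HKD; return True if we end with >= 2m HKD."""
    pr = (0.05, 0.03)        # success probability of activity A, B
    cost = (1, 2)            # price of activity A, B
    payout = (2, 4)          # amount returned on success
    money = m
    while 0 < money < 2 * m:
        a = policy(money)    # 0 = activity A, 1 = activity B
        if money < cost[a]:  # assumption: fall back to A if B is unaffordable
            a = 0
        money -= cost[a]
        if rng.random() < pr[a]:
            money += payout[a]
    return money >= 2 * m

def always_a(money):
    """Hypothetical baseline policy: always pick activity A."""
    return 0

rng = random.Random(0)
n_days = 20000
wins = sum(simulate_day(2, always_a, rng) for _ in range(n_days))
win_rate = wins / n_days     # empirical chance of doubling the money
print("Estimated win rate with always-A, m = 2:", win_rate)
```

Swapping in a different `policy` function lets you compare simple strategies empirically before computing the optimal one.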

# Task
Now we want to model this problem as an MDP. Using the Bellman Optimality Equation, Value Iteration yields a near-optimal policy: iteration stops once the largest change in any state's value between two sweeps falls below the threshold $\xi=10^{-5}$. 
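As an illustration of the idea, here is a sketch of Value Iteration on a hypothetical three-state MDP (not the lab's MDP). It assumes the same array conventions as the skeleton below: `R[s, s']` is the reward for moving from `s` to `s'`, and `T[a, s, s']` is the probability of that move under action `a`. The function name `value_iteration_sketch` and the toy arrays are made up for this example.

```python
import numpy as np

def value_iteration_sketch(R, T, gamma, threshold=1e-5):
    """Generic value iteration. R: (S, S) rewards, T: (A, S, S) probabilities."""
    v_prev = np.zeros(R.shape[0])
    while True:
        # q[a, s] = sum over s' of T[a, s, s'] * (R[s, s'] + gamma * v_prev[s'])
        q = (T * (R + gamma * v_prev)).sum(axis=2)
        v_now = q.max(axis=0)               # Bellman optimality backup
        if np.abs(v_now - v_prev).max() < threshold:
            return v_now, q.argmax(axis=0)  # values and greedy policy
        v_prev = v_now

# Hypothetical toy MDP: state 0 = lost, state 1 = still playing, state 2 = won.
R_toy = np.zeros((3, 3))
R_toy[1, 2] = 1.0                 # one-off reward for the move 1 -> 2
T_toy = np.zeros((2, 3, 3))
T_toy[:, 0, 0] = 1.0              # state 0 is absorbing under both actions
T_toy[:, 2, 2] = 1.0              # state 2 is absorbing under both actions
T_toy[0, 1] = [0.8, 0.0, 0.2]     # action 0 wins with probability 0.2
T_toy[1, 1] = [0.5, 0.0, 0.5]     # action 1 wins with probability 0.5

v, policy = value_iteration_sketch(R_toy, T_toy, gamma=0.9)
print(v, policy)
```

In this toy case the only real decision is in state 1, where action 1 has the higher chance of collecting the reward.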

Input: a positive integer $m \le 100$ and a real number $\gamma \in (0,1]$, the discount factor (the tests below also use $\gamma = 1$).

To make the **correct implementation** unique, and therefore testable, we impose one constraint: a lump-sum (one-off) reward. The reward is **0** for ending with **no money** and **$10^5$** for ending with at least **2m** HKD.

Hint 1: the lump-sum reward gives no credit for almost winning (e.g., having 2m-1 HKD). Only the final result of your day matters.

Hint 2: feel free to check out the review part of this lab.

Hint 3: discounting is also a way to encourage winning faster.
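A quick calculation makes Hint 3 concrete. The sketch below assumes the usual convention that a reward collected `delay` steps in the future is scaled by $\gamma$ raised to the power `delay`; the helper name `discounted_value` is hypothetical.

```python
def discounted_value(reward, gamma, delay):
    """Value today of a one-off reward collected `delay` steps in the future."""
    return (gamma ** delay) * reward

# The same lump-sum reward is worth less the longer it takes to win.
for delay in (0, 5, 10):
    print(delay, round(discounted_value(10**5, 0.9, delay), 2))
```

With $\gamma = 1$ the delay has no effect, which is why the optimal policy can differ between the discounted and undiscounted tests.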

In [None]:
import numpy as np

In [None]:
def construction(m):
    # Return a tuple R, T
    # R: numpy array of shape = (2m+1, 2m+1), dtype = float
    # T: numpy array of shape = (2, 2m+1, 2m+1), dtype = float

    # Setting of the two activities
    pr = 0.05, 0.03                 # Tuple type
    cost = 1, 2
    reward = 2, 4
    #################################################################################
    # START OF YOUR CODE                            #
    # TODO:                                    #
    # Construct reward matrix R;                        #
    # Construct transition matrices T1, T2 for the two actions in T.    #
    #################################################################################
    R = None 
    T = None, None


    #################################################################################
    # END OF YOUR CODE                             #
    #################################################################################

    return R, T

def valueIteration(m, gamma, threshold = 1e-5):
    # Return a numpy array optimal_policy with shape = (2m+1, ), dtype = int
    # optimal_policy consists of 0s and 1s: in each state, 0 means take activity A (action 1) and 1 means take activity B (action 2)
    # optimal_policy[0] and optimal_policy[2m] are 0 (terminal states)

    R, T = construction(m)
    #######################################################################
    # START OF YOUR CODE                       #
    # TODO:                               #
    # Initialize expected discounted sum of rewards vector v_prev #
    #######################################################################
    v_prev = None                   # shape = (2m+1,)
    
    
    #######################################################################
    # END OF YOUR CODE                        #
    #######################################################################
    
    optimal_policy = None 
    while True:
        ###################################################################
        # START OF YOUR CODE                     #
        # TODO:                             #
        # Complete the computation of v_now and optimal_policy.  #
        ###################################################################
        v_now = None

        
        
        ###################################################################
        # END OF YOUR CODE                      #
        ###################################################################

        error = np.abs(v_now - v_prev).max()
        if error < threshold:
            break
        v_prev = v_now
    
    print('valueIteration function with m =', str(m), 'and gamma =', str(gamma))
    print('Your v_now:\n', v_now)             # Variable scope = this function
    print('Your optimal_policy:', optimal_policy)
    print('\n')
    return optimal_policy

# Test

Feel free to try your own parameters after finishing these tests.

In [None]:
R, T = construction(2)
print('Your R:')
print(R)
print('Your T:')
print(T)

# np.array(...) is the expected result
assert (R == np.array([[0, 0, 0, 0, 100000], [0, 0, 0, 0, 100000], [0, 0, 0, 0, 100000], [0, 0, 0, 0, 100000], [0, 0, 0, 0, 0]])).all()
assert (T == np.array([[[1, 0, 0, 0, 0], [0.95, 0, 0.05, 0, 0], [0, 0.95, 0, 0.05, 0], [0, 0, 0.95, 0, 0.05], [0, 0, 0, 0, 1]], [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0.97, 0, 0, 0, 0.03], [0, 0.97, 0, 0, 0.03], [0, 0, 0, 0, 1]]])).all()

Your R:
[[     0.      0.      0.      0. 100000.]
 [     0.      0.      0.      0. 100000.]
 [     0.      0.      0.      0. 100000.]
 [     0.      0.      0.      0. 100000.]
 [     0.      0.      0.      0.      0.]]
Your T:
[[[1.   0.   0.   0.   0.  ]
  [0.95 0.   0.05 0.   0.  ]
  [0.   0.95 0.   0.05 0.  ]
  [0.   0.   0.95 0.   0.05]
  [0.   0.   0.   0.   1.  ]]

 [[1.   0.   0.   0.   0.  ]
  [0.   0.   0.   0.   0.  ]
  [0.97 0.   0.   0.   0.03]
  [0.   0.97 0.   0.   0.03]
  [0.   0.   0.   0.   1.  ]]]


In [None]:
# np.array(...) is the expected result
assert (valueIteration(m=2, gamma=1) == np.array([0, 0, 1, 0, 0])).all()
assert (valueIteration(m=4, gamma=1) == np.array([0, 0, 1, 1, 1, 0, 1, 0, 0])).all()
assert (valueIteration(m=4, gamma=0.9) == np.array([0, 0, 1, 0, 1, 0, 1, 0, 0])).all()
assert (valueIteration(m=10, gamma=1) == np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0])).all()

valueIteration function with m = 2 and gamma = 1
Your v_now:
 [   0.  150. 3000. 7850.    0.]
Your optimal_policy: [0 0 1 0 0]


valueIteration function with m = 4 and gamma = 1
Your v_now:
 [0.00000000e+00 1.43342530e-01 2.86685070e+00 7.50159258e+00
 9.55616902e+01 2.45418347e+02 3.09269484e+03 7.93806009e+03
 0.00000000e+00]
Your optimal_policy: [0 0 1 1 1 0 1 0 0]


valueIteration function with m = 4 and gamma = 0.9
Your v_now:
 [0.00000000e+00 1.03284014e-01 2.29520033e+00 5.78773015e+00
 8.50074197e+01 2.11020860e+02 3.07421148e+03 7.62845081e+03
 0.00000000e+00]
Your optimal_policy: [0 0 1 0 1 0 1 0 0]


valueIteration function with m = 10 and gamma = 1
Your v_now:
 [0.00000000e+00 1.25384912e-10 2.50885562e-09 6.56373936e-09
 8.36285207e-08 2.14773850e-07 2.70664665e-06 6.94708666e-06
 8.75175661e-05 2.24628733e-04 2.82974625e-03 7.26301088e-03
 9.14951404e-02 2.34837536e-01 2.95834322e+00 7.59308080e+00
 9.56530978e+01 2.45509618e+02 3.09278350e+03 7.93814433e+03
 0.00000000e+00]
Your optimal_policy: [0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0]

# What Next?
1. You may want to check out the course [CS234 Reinforcement Learning provided by Stanford](https://web.stanford.edu/class/cs234) (one of its homework assignments is also the source of the idea for this lab).

2. [Gymnasium](https://gymnasium.farama.org/) is a standard API for reinforcement learning and a diverse collection of reference environments.

3. In our setting, the transition matrix is very sparse: each row has at most two nonzero entries. In this case, dense vectorization and matrix multiplication are not that beneficial, because most of the multiplications are by zero.
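The sparsity point can be illustrated with a small sketch. It does not build the lab's transition matrix. Instead it assumes a hypothetical random matrix with two nonzero entries per row (the same sparsity pattern) and compares a dense product with a computation that visits only the nonzero entries. All variable names are made up for this example.

```python
import numpy as np

n_states = 201                    # 2m + 1 with m = 100
rng = np.random.default_rng(0)

# Hypothetical sparse transition matrix: exactly two nonzero entries per row.
T = np.zeros((n_states, n_states))
for s in range(n_states):
    i, j = rng.choice(n_states, size=2, replace=False)
    T[s, i], T[s, j] = 0.95, 0.05

v = rng.random(n_states)

dense = T @ v                     # touches all n_states**2 entries

# Sparse alternative: accumulate contributions of the nonzero entries only.
rows, cols = np.nonzero(T)
sparse = np.zeros(n_states)
np.add.at(sparse, rows, T[rows, cols] * v[cols])

density = np.count_nonzero(T) / T.size
print("Fraction of nonzero entries:", density)
```

Both computations give the same vector, but the dense product spends almost all of its work on zeros, which is the sense in which vectorization over the full matrix buys little here.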
