# **COMP 2211 Exploring Artificial Intelligence** #
## Lab 10 Reinforcement Learning Review ##

<img src="https://miro.medium.com/max/4800/1*7PoZafFLEVXQiseVx5y4cw.jpeg" width="600" height="326" /> 


# Background knowledge

## In Class Knowledge

### Discount

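**Discounting** weights future rewards by a parameter called the **discount factor** $\gamma$, where $0 \leq \gamma \leq 1$. A reward received $i$ steps from now is multiplied by $\gamma^i$, so the smaller $\gamma$ is, the less distant rewards count. For the infinite sum below to be guaranteed finite (with bounded rewards), we need $\gamma < 1$.

**Discounted sum of future rewards** $=reward_{now}+\sum_{i=1}^{\infty} \gamma^i \times reward_i$, where $reward_i$ is the reward received $i$ steps from now.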
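As a small illustration of the formula above, the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` and the reward sequence are hypothetical, chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    total = 0.0
    # Work backwards through the rewards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical reward sequence: 4 now, 0 one step later, -8 two steps later
print(discounted_return([4, 0, -8], gamma=0.5))  # 4 + 0.5*0 + 0.25*(-8) = 2.0
```

With `gamma=1.0` the same function returns the plain sum of the rewards, and with `gamma=0.0` only the immediate reward counts.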

**Bellman equation** (for a fixed action $a$): $V(s)=R(s,a)+\gamma \sum_{s'} P(s'|s,a)\,V(s')$.
1.   $V(s)$: the value of state $s$.
2.   $R(s, a)$: the immediate reward for taking action $a$ in state $s$.
3.   $P(s'|s, a)$: the probability of transitioning to state $s'$ given state $s$ and action $a$.
4.   $V(s')$: the value of the next state $s'$.
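One way to read the Bellman equation is as a single "backup" for one state: take the immediate reward and add the discounted, probability-weighted average of the next-state values. The sketch below does exactly that; the function name `bellman_backup` and all numbers are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def bellman_backup(reward, gamma, next_state_probs, next_state_values):
    """Compute R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') for one state."""
    expected_next_value = float(np.dot(next_state_probs, next_state_values))
    return reward + gamma * expected_next_value

# Illustrative numbers: three possible next states
probs = np.array([0.5, 0.5, 0.0])       # P(s'|s, a), must sum to 1
values = np.array([4.8, -1.6, -11.2])   # assumed values V(s') of the next states
print(bellman_backup(reward=4.0, gamma=0.5,
                     next_state_probs=probs, next_state_values=values))
```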




### Markov State

$S_t \text{ is Markov} \iff P(S_{t+1}|S_t)=P(S_{t+1}|S_1,S_2, \dots, S_t)$

1. The future is independent of the past, given the present.
2. The present captures information about the past.
3. Once the present is known, the history may be thrown away.
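The three statements above can be seen in a simulation: when sampling a trajectory, the next state is drawn using only the current state. The sketch below uses the weather transition matrix from the example later in this lab; the helper `sample_trajectory` and the fixed seed are assumptions made for illustration.

```python
import numpy as np

# Weather chain from the example later in this lab: 0 = sun, 1 = wind, 2 = hail
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
STATE_NAMES = ["sun", "wind", "hail"]

def sample_trajectory(T, start, steps, seed=0):
    """Sample a state sequence; each step looks only at the current state."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        current = states[-1]
        # The next-state distribution is row `current` of T.
        # Earlier states in `states` are never consulted.
        states.append(int(rng.choice(len(T), p=T[current])))
    return states

trajectory = sample_trajectory(T, start=0, steps=10)
print(" -> ".join(STATE_NAMES[s] for s in trajectory))
```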



### A Markov System with Rewards

- Components
    - MRP: $(S, T, R, γ)$
    - A set of N states $s_i$; each $s_i$ has a reward $r_i$ [$\vec r$]
    - A transition **probability** matrix [$T$]
        - where $T_{i,j}=P[i \to j]=P[next= s_j|now=s_i]$
        - each row sums up to be 1
    - Discount factor $\gamma \in (0,1)$, applied to all rewards to compute the expected discounted sum of future rewards **starting in state $s_i$** [$\vec v$]. There are several ways to compute $\vec v$:

- Directly solve the linear equation

$$
\vec v=\vec r+\gamma T\vec v \to \vec v=(I-\gamma T)^{-1} \vec r
$$

- Dynamic Programming (we will do Value Iteration)
    - $\vec v^{(0)}=\vec r$
    - $\vec v^{(k)}=\vec r+\gamma T\vec v^{(k-1)}$
    - $\vec v^{(k)}$ is the expected discounted sum of rewards over the next $k$ time steps from now
    - Stop when the maximum absolute difference between two successive estimates is less than a threshold $\xi$ ($\max_i |v_i^{(k)}-v_i^{(k-1)}|<\xi$)
- Monte-Carlo evaluation
- Temporal-Difference learning
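The list above names Monte-Carlo evaluation without showing it. A minimal sketch, under stated assumptions: sample many trajectories from a start state, compute the discounted return of each, and average them. The function `monte_carlo_value`, the truncation horizon, the episode count, and the seed are all illustrative choices; the matrix and rewards are those of the weather example later in this lab.

```python
import numpy as np

# Weather chain (0 = sun, 1 = wind, 2 = hail) with per-state rewards
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])
gamma = 0.5

def monte_carlo_value(T, R, gamma, start, episodes=2000, horizon=20, seed=0):
    """Estimate v[start] by averaging sampled discounted returns.

    Each episode is truncated after `horizon` steps; with gamma = 0.5 the
    ignored tail is weighted by at most gamma**20, which is negligible.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, discount, episode_return = start, 1.0, 0.0
        for _ in range(horizon):
            episode_return += discount * R[state]   # collect the current reward
            discount *= gamma
            state = rng.choice(len(R), p=T[state])  # move to a sampled next state
        total += episode_return
    return total / episodes

estimate_sun = monte_carlo_value(T, R, gamma, start=0)
print(f"Monte-Carlo estimate of v[sun]: {estimate_sun:.2f}")
```

The estimate is noisy and only approaches the solution of the linear equation as the number of episodes grows, which is the trade-off against solving the system directly.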


### Example: Weather

- We will use the Markov chain from [the lecture slides](https://course.cse.ust.hk/comp2211/notes/12-reinforcement-learning-full.pdf), pp. 23-28


```python
import numpy as np

# T: transition probability matrix (row = current state, column = next state)
#              sun  wind hail
T = np.array([[0.5, 0.5, 0.0],   # sun
              [0.5, 0.0, 0.5],   # wind
              [0.0, 0.5, 0.5]])  # hail

# R: the reward for each state
#             sun wind hail
R = np.array([4, 0, -8], dtype=float)

# gamma: the discount factor
gamma = 0.5


# Solve the linear equation directly: v = (I - gamma * T)^(-1) r
def cal_reward_directly(T, R, gamma):
  return np.linalg.inv(np.identity(T.shape[0]) - gamma * T) @ R

print("Total future rewards calculated directly:")
print(cal_reward_directly(T, R, gamma))


# Now we use the iterative method to find the total reward:
# threshold: the largest difference we accept between 2 successive iterations
threshold = 1e-1

def cal_reward_by_iterations(T, R, gamma, threshold):
  reward_previous_iteration = np.zeros_like(R, dtype=float)
  reward_this_iteration = R  # v^(0) = r
  count = 0                  # Number of iterations needed to reach the result
  while np.max(np.abs(reward_previous_iteration - reward_this_iteration)) > threshold:
    count += 1
    reward_previous_iteration = reward_this_iteration
    # v^(k) = r + gamma * T v^(k-1)
    reward_this_iteration = R + gamma * (T @ reward_this_iteration)

  print(f"It took {count} iterations to find the total reward with the threshold = {threshold}.")
  return reward_this_iteration

print("\n\nTotal future rewards calculated by iteration:")
print(cal_reward_by_iterations(T, R, gamma, threshold))
print("Try to change the threshold value to see how it affects the number of iterations required and the total reward value")
```

Output:

```
Total future rewards calculated directly:
[  4.8  -1.6 -11.2]


Total future rewards calculated by iteration:
It took 5 iterations to find the total reward with the threshold = 0.1.
[  4.83984375  -1.55859375 -11.15625   ]
Try to change the threshold value to see how it affects the number of iterations required and the total reward value
```
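To follow up on the suggestion to vary the threshold, the sketch below repeats the same iteration for several thresholds and compares each result with the direct solution. The helper `iterate_values` is a hypothetical variant of the iterative function that returns the iteration count instead of printing it, and the threshold values are arbitrary.

```python
import numpy as np

T = np.array([[0.5, 0.5, 0.0],   # sun
              [0.5, 0.0, 0.5],   # wind
              [0.0, 0.5, 0.5]])  # hail
R = np.array([4, 0, -8], dtype=float)
gamma = 0.5

def iterate_values(T, R, gamma, threshold):
    """Iterate v = R + gamma * T v; return (values, number of iterations)."""
    previous = np.zeros_like(R)
    current = R.copy()
    count = 0
    while np.max(np.abs(previous - current)) > threshold:
        count += 1
        previous = current
        current = R + gamma * (T @ current)
    return current, count

# Direct solution of (I - gamma * T) v = R, used as the reference
exact = np.linalg.solve(np.identity(len(R)) - gamma * T, R)

for threshold in [1e-1, 1e-3, 1e-6]:
    values, count = iterate_values(T, R, gamma, threshold)
    error = np.max(np.abs(values - exact))
    print(f"threshold={threshold:g}: {count} iterations, "
          f"max error vs. direct solution = {error:.2e}")
```

A smaller threshold costs more iterations but brings the iterative result closer to the direct solution.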


### Markov Decision Process (MDP)

- MDP: $(S, A, T, R, γ)$
- A Markov System with Rewards and **Actions/Decisions** [$a$]
- Each action $a$ has a transition matrix $T^{(a)}$, where $T_{i,j}^{(a)}=P[i \overset{a}{ \to } j]=P[next=s_j|now=s_i, action=a]$
- Relax the reward from a vector to a matrix $R$, where $r_{i,j}$ is the reward for the transition from $s_i$ to $s_j$
- **Bellman Optimality Equation**
    - $v^{(k)}_i=\max_a \{ \sum_j T^{(a)} _{i,j} (r _{i,j} + \gamma  v^{(k-1)} _j)  \}$
    - $\vec v^{(k)}=\max_a \{ [T^{(a)} \otimes R]+ \gamma T^{(a)} \vec v^{(k-1)} \}$ (elementwise multiplication $\otimes$→ row sum $[\cdot ]$ → elementwise max)
    - Equivalently, $\vec v^{(k)}=\max_a \{ R^{(a)} + \gamma T^{(a)} \vec v^{(k-1)} \}$
    - Corner case: a **terminal state** can be represented as a state that transitions back into itself and yields reward 0 with probability 1, regardless of the action taken. **You need to handle these states carefully.**
- A **policy** is a mapping from states to actions
- Value Iteration
    - **Near-optimal policy** $\vec \pi^{(k)}=\text{argmax}_a \{ [T^{(a)} \otimes R]+ \gamma T^{(a)} \vec v^{(k-1)} \}$ (elementwise multiplication $\otimes$, row sum $[\cdot ]$, elementwise argmax), where $\pi^{(k)}_i$ is the action chosen in state $s_i$
    - Stop when $\max_i |v_i^{(k)}-v_i^{(k-1)}|<\xi$
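The Bellman optimality update and the argmax step above can be sketched in NumPy as follows. This is a minimal sketch, not a reference solution: the 3-state, 2-action MDP is invented for illustration, the function name `value_iteration` is hypothetical, and starting from $\vec v^{(0)}=\vec 0$ is an assumption. State 2 is a terminal state, modelled as described above: it transitions to itself with probability 1 and reward 0 under every action.

```python
import numpy as np

def value_iteration(T, R, gamma, threshold=1e-6):
    """T has shape (num_actions, N, N); R has shape (N, N) with entries r[i, j].

    Returns (values, policy), where policy[i] is the greedy action in state i.
    """
    # Row sums of T^(a) (elementwise *) R: expected immediate reward, shape (A, N)
    expected_reward = (T * R).sum(axis=2)
    v = np.zeros(T.shape[1])          # assumption: start from all-zero values
    while True:
        q = expected_reward + gamma * (T @ v)   # shape (A, N)
        v_new = q.max(axis=0)                   # elementwise max over actions
        if np.max(np.abs(v_new - v)) < threshold:
            return v_new, q.argmax(axis=0)      # elementwise argmax over actions
        v = v_new

# Hypothetical MDP: states 0 -> 1 -> 2, where state 2 is terminal.
# Action 0 ("stay") keeps the current state; action 1 ("move") advances
# with probability 0.8 and stays put with probability 0.2.
T = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # action 0: stay
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 1: move
])
R = np.zeros((3, 3))
R[1, 2] = 10.0        # reward only for reaching the terminal state from state 1

values, policy = value_iteration(T, R, gamma=0.9)
print("values:", values)
print("policy:", policy)
```

Because the terminal row of every $T^{(a)}$ is a zero-reward self-loop, its value stays at 0 and it needs no special case in the update.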

## High Level Summary

> Consider the game of chess. The only real reward signal comes at the end of the game when we either win, earning a reward of, say, 1, or when we lose, receiving a reward of, say, -1. **So reinforcement learners must deal with the *credit assignment* problem: determining which actions to credit or blame for an outcome.** The same goes for an employee who gets a promotion on October 11. That promotion likely reflects a large number of well-chosen actions over the previous year. **Getting more promotions in the future requires figuring out what actions along the way led to the promotion.**
> 
>Reinforcement learners may also have to deal with the problem of **partial observability**. That is, the current observation might not tell you everything about your current state. Say a cleaning robot found itself trapped in one of many identical closets in a house. Inferring the precise location of the robot might require considering its previous observations before entering the closet.
>
>Finally, at any given point, reinforcement learners might know of one good policy, but there might be many other better policies that the agent has never tried. The reinforcement learner must constantly choose whether to ***exploit*** the best (currently) known strategy as a policy, or to ***explore*** the space of strategies, potentially giving up some short-run reward in exchange for knowledge.
>
>When the environment is fully observed, we call the reinforcement learning problem a ***Markov decision process***. When the state does not depend on the previous actions, we call the problem a ***contextual bandit problem***. When there is no state, just a set of available actions with initially unknown rewards, this problem is the classic ***multi-armed bandit problem***.

(From [d2l.ai](https://d2l.ai/chapter_introduction/index.html?highlight=reinforcement#reinforcement-learning))
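The exploit/explore trade-off in the multi-armed bandit setting can be made concrete with a small simulation. The sketch below uses the epsilon-greedy rule, which is one common strategy and is not named in the quoted text: with probability `epsilon` pull a random arm (explore), otherwise pull the arm with the best estimate so far (exploit). The arm probabilities, step count, and seed are made up for illustration.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Simulate a Bernoulli bandit; return (estimated means, pull counts)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms, dtype=int)
    estimates = np.zeros(n_arms)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))      # explore: random arm
        else:
            arm = int(np.argmax(estimates))      # exploit: best arm so far
        # Bernoulli reward: 1 with probability true_means[arm], else 0
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        # Incremental update of the running average reward for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arms with success probabilities unknown to the agent
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("estimated means:", np.round(estimates, 2))
print("pull counts:    ", counts)
```

With `epsilon=0` the agent never explores and can lock onto whichever arm happened to pay off first, which is the risk the quoted passage describes.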

# Credit
1. The image: [Medium](https://medium.datadriveninvestor.com/alphago-a-documentary-about-artificial-intelligence-37c147252889)