# Q-Learning Pseudocode

```
┌─────────────────────────────┬────────
│                             │
│    A         B         C    │    D
│                             │
└─────────┐         
          │         │                  │
     E    │    F    │    G         H   │
          │         │                  │
                    └─────────         │
│                                      │
│    I         J         K         L   │
│                                      │
└───────────────────           ────────┘

```

|location|state|
|-|-|
|A|0|
|B|1|
|C|2|
|D|3|
|E|4|
|F|5|
|G|6|
|H|7|
|I|8|
|J|9|
|K|10|
|L|11|
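
The table above can be held as a pair of lookup dictionaries. This is a minimal sketch; the names `location_to_state` and `state_to_location` are illustrative and do not come from the original notes.

```python
# Map each location letter to its state index, matching the table above.
location_to_state = {letter: index for index, letter in enumerate("ABCDEFGHIJKL")}

# Inverse mapping, useful for turning a state index back into a location letter.
state_to_location = {index: letter for letter, index in location_to_state.items()}

print(location_to_state["G"])
print(state_to_location[11])
```

Keeping both directions makes it easy to describe a route with letters while the Q-table itself is indexed by integers.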

An action is the index of a location that can be reached from the current one.

Choosing a reachable location gives a reward of 1; any other choice gives a reward of 0.
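
The rule can be sketched as a small function. The `NEIGHBORS` table and the name `base_reward` are assumptions made for this example; the connections are read off the reward matrix in this document.

```python
LOCATIONS = "ABCDEFGHIJKL"

# Hypothetical adjacency table: for each location, the locations reachable in one move.
NEIGHBORS = {
    "A": "B",   "B": "ACF", "C": "BG",  "D": "H",
    "E": "I",   "F": "BJ",  "G": "CH",  "H": "DGL",
    "I": "EJ",  "J": "FIK", "K": "JL",  "L": "HK",
}


def base_reward(state, action):
    """Return 1 if the location with index `action` is reachable from `state`, else 0."""
    current = LOCATIONS[state]
    target = LOCATIONS[action]
    return 1 if target in NEIGHBORS[current] else 0


print(base_reward(0, 1))  # A to B, which are connected
print(base_reward(0, 2))  # A to C, which are not connected
```

This covers only the 0/1 part of the rule; the larger reward attached to the goal location is handled separately.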

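To send the AI agent to G, it is given the following reward matrix. Rows are the current location and columns are the chosen action; the (G, G) cell carries the large reward of 1000 that marks G as the goal.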

||A|B|C|D|E|F|G|H|I|J|K|L|
|-|-|-|-|-|-|-|-|-|-|-|-|-|
|A|0|1|0|0|0|0|0|0|0|0|0|0|
|B|1|0|1|0|0|1|0|0|0|0|0|0|
|C|0|1|0|0|0|0|1|0|0|0|0|0|
|D|0|0|0|0|0|0|0|1|0|0|0|0|
|E|0|0|0|0|0|0|0|0|1|0|0|0|
|F|0|1|0|0|0|0|0|0|0|1|0|0|
|G|0|0|1|0|0|0|1000|1|0|0|0|0|
|H|0|0|0|1|0|0|1|0|0|0|0|1|
|I|0|0|0|0|1|0|0|0|0|1|0|0|
|J|0|0|0|0|0|1|0|0|1|0|1|0|
|K|0|0|0|0|0|0|0|0|0|1|0|1|
|L|0|0|0|0|0|0|0|1|0|0|1|0|
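
As a sketch, the matrix above can be written as a nested list, and the same 0/1 layout can be reused for a different goal by moving the 1000 entry. The function name `reward_matrix_for_goal` and its `goal_reward` parameter are illustrative assumptions.

```python
# Reward matrix copied from the table above (rows: current state, columns: action).
R = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],     # A
    [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],     # B
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],     # C
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],     # D
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],     # E
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],     # F
    [0, 0, 1, 0, 0, 0, 1000, 1, 0, 0, 0, 0],  # G (goal)
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],     # H
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],     # I
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],     # J
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],     # K
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],     # L
]


def reward_matrix_for_goal(rewards, goal_state, goal_reward=1000):
    """Return a copy of `rewards` whose only large reward sits at (goal_state, goal_state)."""
    # Keep the 0/1 connectivity entries and drop any previous goal reward.
    retargeted = [[1 if value == 1 else 0 for value in row] for row in rewards]
    retargeted[goal_state][goal_state] = goal_reward
    return retargeted


R_to_L = reward_matrix_for_goal(R, goal_state=11)
print(R_to_L[11])
```

The input matrix is left untouched, so `R` still describes the route to G after the call.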

The structure of the `Environment()` and `AI()` classes at a glance:

```python
class Environment():

    def __init__(self):
        """Initialize the Environment."""

    def get_random_state(self):
        """Return a random possible state of the game."""

    def get_qvalue(self, random_state, action):
        """Return the Q-value of this (random_state, action) pair."""

    def update(self, action):
        """Update the environment, reach the next state and return the Q-values of this new state."""

    def get_reward(self, random_state, action):
        """Return the reward obtained by playing this action from this random possible state."""

    def calculate_TD(self, qvalue, next_state, reward, gamma):
        """Return the Temporal Difference.

        TD = reward + gamma * max(qvalues_next_state) - qvalue
        Here next_state carries the Q-values returned by update().
        """

    def update_qvalue(self, TD, qvalue, alpha):
        """Update the Q-value given as argument.

        qvalue = qvalue + alpha * TD
        """


class AI():

    def __init__(self):
        """Initialize the AI."""

    def play_action(self):
        """Play a random action."""
```
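
One possible way to fill in this skeleton is sketched below. Several details are assumptions rather than part of the original outline: the environment takes the reward matrix as a constructor argument, it remembers the last (state, action) pair read by `get_qvalue` so that `update_qvalue` knows which entry to write, `playable_actions` is a hypothetical helper, and `play_action` receives the list of playable actions instead of taking no arguments. The three-location corridor at the end is made up to keep the example small.

```python
import random


class Environment:
    """Sketch of the environment: it owns the rewards and the Q-table."""

    def __init__(self, rewards):
        self.rewards = rewards                    # rewards[state][action]
        size = len(rewards)
        self.qvalues = [[0.0] * size for _ in range(size)]
        self.state = 0
        self.last_pair = None                     # (state, action) last read

    def get_random_state(self):
        self.state = random.randrange(len(self.rewards))
        return self.state

    def playable_actions(self, state):
        # Hypothetical helper: actions with a positive reward are the reachable ones.
        return [a for a, r in enumerate(self.rewards[state]) if r > 0]

    def get_qvalue(self, random_state, action):
        self.last_pair = (random_state, action)   # remembered for update_qvalue
        return self.qvalues[random_state][action]

    def update(self, action):
        self.state = action                       # an action is the destination index
        return self.qvalues[self.state]

    def get_reward(self, random_state, action):
        return self.rewards[random_state][action]

    def calculate_TD(self, qvalue, next_state, reward, gamma):
        # next_state holds the Q-values of the state that was just reached.
        return reward + gamma * max(next_state) - qvalue

    def update_qvalue(self, TD, qvalue, alpha):
        state, action = self.last_pair
        self.qvalues[state][action] = qvalue + alpha * TD


class AI:
    def play_action(self, playable_actions):
        # Play a random action among the ones the environment allows.
        return random.choice(playable_actions)


# Tiny corridor X - Y - Z (made up for this example), with Z as the goal.
rewards = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1000],
]
env = Environment(rewards)

state = 1
action = 2  # fixed here; AI().play_action(env.playable_actions(state)) would pick 0 or 2
qvalue = env.get_qvalue(state, action)
next_qvalues = env.update(action)
reward = env.get_reward(state, action)
TD = env.calculate_TD(qvalue, next_qvalues, reward, gamma=0.75)
env.update_qvalue(TD, qvalue, alpha=0.9)
print(env.get_qvalue(state, action))  # 0.0 + 0.9 * (1 + 0.75 * 0.0 - 0.0) = 0.9
```

Restricting the AI to playable actions keeps the Q-table from learning values for moves between locations that are not connected.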

The Q-learning process, using the `Environment` and `AI` defined above:

```python
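env = Environment()
ai = AI()

gamma = 0.75  # discount factor; example value, not specified in the source
alpha = 0.9   # learning rate; example value, not specified in the source

while True:
    # Start each iteration from a random state.
    random_state = env.get_random_state()

    # Let the AI play a random action.
    action = ai.play_action()

    # Read the current Q-value of the (state, action) pair.
    qvalue = env.get_qvalue(random_state, action)

    # Reach the next state; update() returns the Q-values of that state.
    next_state = env.update(action)

    reward = env.get_reward(random_state, action)

    # TD = reward + gamma * max(qvalues_next_state) - qvalue
    TD = env.calculate_TD(qvalue, next_state, reward, gamma)

    # qvalue = qvalue + alpha * TD
    env.update_qvalue(TD, qvalue, alpha)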
```
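
For comparison, the same loop can be flattened into a single function that runs end to end. This is a sketch under stated assumptions, not the original implementation: the `while True` loop is replaced by a fixed number of iterations, only playable actions (reward greater than 0) are sampled, and the values of `gamma`, `alpha`, `iterations`, and `seed` are arbitrary choices. The helper `best_route` is hypothetical and simply follows the largest Q-value from each state.

```python
import random

LOCATIONS = "ABCDEFGHIJKL"

# Reward matrix from this document (goal G): rows are states, columns are actions.
R = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1000, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
]


def train(rewards, gamma=0.75, alpha=0.9, iterations=10000, seed=0):
    """Run the Q-learning loop for a fixed number of iterations and return the Q-table."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same table
    size = len(rewards)
    Q = [[0.0] * size for _ in range(size)]
    for _ in range(iterations):
        state = rng.randrange(size)
        playable = [a for a, r in enumerate(rewards[state]) if r > 0]
        action = rng.choice(playable)
        next_state = action  # an action is the index of the destination
        TD = rewards[state][action] + gamma * max(Q[next_state]) - Q[state][action]
        Q[state][action] += alpha * TD
    return Q


def best_route(Q, start, goal):
    """Follow the highest Q-value from `start` until `goal` is reached."""
    route = [start]
    state = LOCATIONS.index(start)
    goal_state = LOCATIONS.index(goal)
    # The length check guards against looping forever on an untrained table.
    while state != goal_state and len(route) <= len(LOCATIONS):
        state = max(range(len(Q)), key=lambda a: Q[state][a])
        route.append(LOCATIONS[state])
    return route


Q = train(R)
print(best_route(Q, "E", "G"))
```

From E there are two equally short ways to G (through F, B, C or through K, L, H), so which of the two is printed depends on small differences in the learned Q-values.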