# Q-Learning with OpenAI Gym

### Reinforcement learning is learning "what to do" from experience, guided by rewards and penalties.

- agent: individual that is exposed to the environment
- state: situation that the agent currently encounters
- action: a move that takes the agent from one state to another
- reward / penalty: feedback received by the agent after a transition, depending on its result
- policy: strategy for choosing an action in a given state in expectation of better outcomes
- state space: set of all possible situations
- action space: set of all actions the agent can take in a given state

### Taxi-v2 Environment

- The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
- The pipe ("|") represents a wall which the taxi cannot cross.
- R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

Problem:<br>
There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10-point penalty for each illegal pick-up or drop-off action.
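
As a quick worked example of the scoring rules above, the following sketch totals the reward for one episode. The helper `episode_return` and its parameters are hypothetical names introduced for illustration. It assumes every time-step earns exactly one of the three rewards (an ordinary move costs 1, an illegal pick-up or drop-off costs 10, and the successful drop-off earns 20).

```python
# Reward values taken from the problem statement above.
STEP_PENALTY = -1
ILLEGAL_PENALTY = -10
DROPOFF_REWARD = 20


def episode_return(moves, illegal_actions, delivered):
    """Total reward for one episode (hypothetical helper, not part of Gym).

    Assumption: each time-step earns exactly one reward, so an illegal
    action costs 10 instead of 1, and the final drop-off earns 20.
    """
    total = moves * STEP_PENALTY + illegal_actions * ILLEGAL_PENALTY
    if delivered:
        total += DROPOFF_REWARD
    return total


# 12 ordinary moves, 1 illegal pick-up attempt, then a successful drop-off.
print(episode_return(moves=12, illegal_actions=1, delivered=True))  # -2
```

Under these assumptions, an agent only ends with a positive score if it delivers the passenger in fewer than 20 ordinary moves and avoids illegal actions.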

In [2]:
# Import OpenAI Gym, which provides the game environment.
import gym

# Build environment and render it.
env = gym.make("Taxi-v2").env
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Here R is shown in blue (passenger), Y in purple (destination), and the taxi is the yellow cell in the bottom-right corner.



In [5]:
# Reset environment and return to random initial state.
env.reset()
env.render()

print(f"Action Space {env.action_space} (south, north, east, west, pickup, dropoff)")
print(f"State Space {env.observation_space}")

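+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Here B is shown in blue (passenger), Y in purple (destination), and the taxi is the yellow cell in the second row, second column.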

Action Space Discrete(6) (south, north, east, west, pickup, dropoff)
State Space Discrete(500)
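
The 500 states can be accounted for by enumerating every combination of taxi position, passenger location, and destination. The breakdown below is not stated in this notebook. It assumes the standard Taxi layout: a 5x5 grid, 5 passenger locations (R, G, Y, B, or inside the taxi), and 4 destinations.

```python
from itertools import product

# Assumed dimensions of the Taxi state space (not printed by the notebook).
taxi_rows = range(5)
taxi_cols = range(5)
passenger_locations = range(5)  # R, G, Y, B, plus "in the taxi"
destinations = range(4)         # R, G, Y, B

# Every state is one (row, column, passenger, destination) combination.
all_states = list(product(taxi_rows, taxi_cols, passenger_locations, destinations))
print(len(all_states))  # 500
```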


In [10]:
# Encode a specific state and give it to the environment to render it in Gym.
state = env.encode(2, 2, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 248
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Here Y is shown in blue (passenger), R in purple (destination), and the taxi is the yellow cell in the third row, third column.
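
To see where the number 248 comes from, here is a minimal re-implementation of the encoding as a mixed-radix number. It is a sketch rather than the library's own code, and it assumes the component sizes 5 (rows), 5 (columns), 5 (passenger locations), and 4 (destinations). It reproduces the state 248 printed above.

```python
def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four state components into a single integer (illustrative)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_index
    state = state * 4 + destination_index
    return state


def decode(state):
    """Inverse of encode: recover the four components from the integer."""
    state, destination_index = divmod(state, 4)
    state, passenger_index = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode(2, 2, 2, 0))  # 248
print(decode(248))         # (2, 2, 2, 0)
```

With this layout, moving the taxi one row changes the state number by 100 and moving one column changes it by 20.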



In [11]:
# Get reward table for specific state with the default reward values for each action.
env.P[248] # dictionary structure {action: [(probability, nextstate, reward, done)]}

{0: [(1.0, 348, -1, False)],
 1: [(1.0, 148, -1, False)],
 2: [(1.0, 268, -1, False)],
 3: [(1.0, 228, -1, False)],
 4: [(1.0, 248, -10, False)],
 5: [(1.0, 248, -10, False)]}
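
The table can be read programmatically. The sketch below copies the printed dictionary into a literal so that it runs without Gym. The `ACTION_NAMES` list follows the action order printed earlier (south, north, east, west, pickup, dropoff), and the variable names are illustrative.

```python
# Literal copy of the transition table printed above for state 248.
P_248 = {
    0: [(1.0, 348, -1, False)],
    1: [(1.0, 148, -1, False)],
    2: [(1.0, 268, -1, False)],
    3: [(1.0, 228, -1, False)],
    4: [(1.0, 248, -10, False)],
    5: [(1.0, 248, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Each entry is a list of (probability, next_state, reward, done) tuples.
for action, transitions in P_248.items():
    for probability, next_state, reward, done in transitions:
        print(f"{ACTION_NAMES[action]:>7}: next_state={next_state}, "
              f"reward={reward}, done={done}")

# In this state the -10 entries are the illegal pick-up and drop-off attempts:
# the taxi is not at the passenger's location, so the state does not change.
illegal = [ACTION_NAMES[a] for a, ts in P_248.items() if ts[0][2] == -10]
print(illegal)  # ['pickup', 'dropoff']
```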

### Q-Learning
- lets the agent use the environment's rewards to learn, over time, the best action to take in a given state
- to remember whether an action was beneficial, a Q-value is stored for each (state, action) pair
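
A minimal sketch of how such Q-values could be stored and updated is shown below, using the standard Q-learning rule Q(s, a) ← (1 − α)·Q(s, a) + α·(r + γ·max Q(s', a')). The learning rate `ALPHA` and discount factor `GAMMA` are assumed values chosen for illustration, and the table is a plain list of lists so that the example runs without Gym.

```python
N_STATES, N_ACTIONS = 500, 6
ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.6  # discount factor (assumed value)

# One Q-value per (state, action) pair, all initialised to zero.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def update(state, action, reward, next_state):
    """Apply one Q-learning update for an observed transition."""
    best_next = max(q_table[next_state])  # value of the best follow-up action
    old_value = q_table[state][action]
    q_table[state][action] = (1 - ALPHA) * old_value + ALPHA * (reward + GAMMA * best_next)


# Transition taken from the reward table above: in state 248,
# action 0 (south) leads to state 348 with reward -1.
update(248, 0, -1, 348)
print(q_table[248][0])
```

After this single update, the Q-value for moving south in state 248 is slightly negative, reflecting the cost of the step. Repeating such updates over many episodes lets the drop-off reward propagate back to earlier states.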