# Chapter 47: Q-Learning

### This code generates the data results for Example 4 in Chapter 47: Q-Learning  (vol. II)
TEXT: A. H. Sayed, INFERENCE AND LEARNING FROM DATA, Cambridge University Press, 2022.

<div style="text-align: justify">
DISCLAIMER:  This computer code is  provided  "as is"   without  any  guarantees.
Practitioners  should  use it  at their own risk.  While  the  codes in  the text 
are useful for instructional purposes, they are not intended to serve as examples 
of full-blown or optimized designs. The author has made no attempt at optimizing 
the codes, perfecting them, or even checking them for absolute accuracy. In order 
to keep the codes at a level  that is  easy to follow by students, the author has 
often chosen to  sacrifice  performance or even programming elegance in  lieu  of 
simplicity. Students can use the computer codes to run variations of the examples 
shown in the text. 
</div>

The Jupyter notebook and python codes are developed by Eduardo Faria Cabrera

required libraries:
    
1. numpy
2. matplotlib

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from functions import *

## Example 47.4 (Optimal policy for a game over a grid)

We illustrate the operation of the $Q-$learning algorithm under $\epsilon-$greedy exploration  by reconsidering the earlier grid problem from Fig. 44.4. Recall that the grid consisted of $16$ squares labeled $\#1$ through $\#16$. Four squares are special; these are squares $\#4, \#8, \#11,$ and $\#16$. Square $\#8$ (danger) and $\#16$ (stars) are terminal states. If the agent reaches one of these terminal states, the agent moves to an EXIT state and the game stops. The agent collects either a reward of $+10$ at square $\#16$ or a reward of $-10$ at square $\#8$. The agent collects a reward of $-0.1$ at all other states. We iterate the $Q-$learning 
iteration (47.52) over 50,000 episodes using $\epsilon=0.1$, $\gamma=0.9$, and a constant step size $\mu=0.01$. We employ the optimistic initialization (47.49) for the state--action value function, namely, 

$ q_{-1}(s,a)=\frac{\max\{0.1,10\}}{1-0.9}=100,\;\;\;\forall\;(s,a)\in\mathbb{S}\times\mathbb{A}$

In the implementation of the algorithm, the maximization of $q_{n-1}(s_n,a')$ and $q_{n-1}(s',a')$ is performed over the actions that are $\textit{permissible}$ at the respective states, $s_n$ and $s'$, respectively. After convergence, we examine the resulting state--action value function $q(s,a)$, whose values we collect into a matrix $Q$ with each row corresponding to a state value and each column to an action:

$
Q=\begin{array}{c|rrrrr}
&\textnormal{ up}&\textnormal{ down}&\textnormal{ left}&\textnormal{ right}&\textnormal{ stop}\\\hline
s=1&94.8026 & \fbox{108.0683} & 105.7811 & 105.9368 & 100.0000\\
s=2&\fbox{108.7510}&  108.2464&  108.4248 & 107.9651 & 100.0000\\
s=3&\fbox{108.8600} & 108.4572 & 108.3800&  108.3750 & 100.0000\\
s=4&- & -& - & - & -\\
s=5&\fbox{109.2077} & 109.0464 & 109.0972 & 108.9634&  100.0000\\
s=6&108.9250 & 108.7999 & \fbox{109.0351} & 108.8364&  100.0000\\
s=7&106.4342 & 105.5675 & \fbox{109.0016} &  94.6887&  100.0000\\
s=8&- & - & - & -&   \fbox{90.0000}\\
s=9&\fbox{109.8279} &  95.5984&  107.0995 & 106.5895&  100.0000\\
s=10&\fbox{109.7102} & 108.9413 & 109.4142&  109.5597&  100.0000\\
s=11&- & - & - & -&  -\\
s=12&\fbox{109.3789} & 109.1449 & 109.2611 & 109.2649 & 100.0000\\
s=13&109.4312 & 109.3414 & 109.3911 & \fbox{109.5225} & 100.0000\\
s=14&109.5837 & 109.5774 & 109.4614 & \fbox{109.6903}&  100.0000\\
s=15&109.7302 & 109.6244 & 109.6038 & \fbox{109.8346} & 100.0000\\
s=16&- & - & - & -&  \fbox{110.0000}\\
s=17&- & - & - & - & \fbox{100.0000}
\end{array}$

For each state $s$, we determine the action $a$ that maximizes $q(s,a)$. These results are indicated by boxes in the above expression, leading to  the following deterministic optimal policy:

$
\pi(a=\textnormal{ "down"}|s=1)=1\\
\pi(a=\textnormal{ "up"}|s=2)=1\\
\pi(a=\textnormal{ "up"}|s=3)=1\\
\pi(a=\textnormal{ "up"}|s=5)=1\\
\pi(a=\textnormal{ "left"}|s=6)=1\\
\pi(a=\textnormal{ "left"}|s=7)=1\\
\pi(a=\textnormal{ "stop"}|s=8)=1\\
\pi(a=\textnormal{ "up"}|s=9)=1\\
\pi(a=\textnormal{ "up"}|s=10)=1\\
\pi(a=\textnormal{ "up"}|s=12)=1\\
\pi(a=\textnormal{ "right"}|s=13)=1\\
\pi(a=\textnormal{ "right"}|s=14)=1\\
\pi(a=\textnormal{ "right"}|s=15)=1\\
\pi(a=\textnormal{ "stop"}|s=16)=1\\
\pi(a=\textnormal{ "stop"}|\textnormal{ s="EXIT"})=1
$

The optimal actions are represented by arrows in Fig. 47.1. Observe in particular that the optimal action at state $s=7$ is to move left. By doing so, the agent has a $70\%$ chance to move to state $s=6$ and $15\%$ chance each to move to states $s=2$ and $s=10$. Note that there is no chance to end up in the danger state $s=8$. Likewise, the optimal action selected for state $s=1$ avoids any possibility of ending up in the danger state $s=1$. This is because by choosing to move downward, the agent has $85\%$ chance of staying at state $s=1$ and $15\%$ chance of moving to state $s=2$. 

In [2]:
# grid problem from example 1

# states
# we include the block locations 4 and 11 for convenience of coding; though they will never be reached
states = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] # s = 17 is the EXIT state
NS = len(states) # number of states

# actions
actions = ['up', 'down', 'left', 'right', 'stop']
NA = len(actions) # number of actions

# rewards
reward = -0.1*np.ones(NS)
reward[7] = -10 # reward at state s = 8
reward[15] = +10 # reward at state s = 16
reward[16] = 0 # reward at exit satate s = 17

# target policy pi(a|s)
Pi = np.zeros((NA, NS)) # matrix Pi specifies the policy pi(a|s)
                      # each row is an action; each column is a state

for j in range(NS):
    s = states[j]
    if s in [1, 2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 15]:
        Pi[0,j] = 1/4 # up
        Pi[1,j] = 1/4 # down
        Pi[2,j] = 1/4 # left
        Pi[3,j] = 1/4 # right
        Pi[4,j] = 0  # STOP
    
    else:
        Pi[0,j] = 0 # up
        Pi[1,j] = 0 # down
        Pi[2,j] = 0 # left
        Pi[3,j] = 0 # right
        Pi[4,j] = 1 # STOP

# transition kernel
P = np.zeros((NS, NA, NS)) # entries are Prob(s, a, s')

P[0, 0, 0] = 0.15 # start at s=1, move UP, end in state 1
P[0, 0, 1] = 0.15
P[0, 0, 7] = 0.7

P[0, 1, 0] = 0.85 # start at s=1, move DOWN, end in state 1
P[0, 1, 1] = 0.15

P[0,2,0] = 0.15 # start at s=1, move LEFT, end in state 1
P[0,2,1] = 0.70
P[0,2,7] = 0.15

P[0,3,0] = 0.85 # start at s=1, move RIGHT, end in state 1
P[0,3,7] = 0.15

P[1,0,0] = 0.15  # start at s=2, move UP, end in state 1
P[1,0,2] = 0.15
P[1,0,6] = 0.70

P[1,1,0] = 0.15  # start at s=2, move DOWN, end in state 1
P[1,1,2] = 0.15
P[1,1,1] = 0.70
                
P[1,2,2] = 0.70  # start at s=2, move LEFT, end in state 3                  
P[1,2,6] = 0.15
P[1,2,1] = 0.15

P[1,3,0] = 0.70   # start at s=2, move RIGHT, end in state 1 
P[1,3,6] = 0.15
P[1,3,1] = 0.15

P[2,0,5] = 0.70   # start at s=3, move UP
P[2,0,2] = 0.15
P[2,0,1] = 0.15

P[2,1,2] = 0.85  # start at s=3, move DOWN
P[2,1,1] = 0.15

P[2,2,2] = 0.85  # start at s=3, move LEFT
P[2,2,5] = 0.15 

P[2,3,1] = 0.70   # start at s=3, move RIGHT
P[2,3,5] = 0.15
P[2,3,4] = 0.15

P[4,0,11] = 0.70  # start at s=5, move UP
P[4,0,4]  = 0.15
P[4,0,5]  = 0.15

P[3,0,3] = 1 # values for location 4 this state is never reached
P[3,1,3] = 1 # so these values are irrelevant
P[3,2,3] = 1
P[3,3,3] = 1
P[3,4,3] = 1

P[4,1,4] = 0.85  # start at s=5, move DOWN
P[4,1,5] = 0.15

P[4,2,4]  = 0.85  # start at s=5, move LEFT
P[4,2,11] = 0.15

P[4,3,5]  = 0.70  # start at s=5, move RIGHT
P[4,3,11] = 0.15
P[4,3,4]  = 0.15

P[5,0,4] = 0.15   # start at s=6, move UP
P[5,0,5] = 0.70
P[5,0,6] = 0.15

P[5,1,2] = 0.70    # start at s=6, move DOWN
P[5,1,4] = 0.15
P[5,1,6] = 0.15

P[5,2,4] = 0.70   # start at s=6, move LEFT
P[5,2,5] = 0.15
P[5,2,2] = 0.15

P[5,3,6] = 0.70   # start at s=6, move RIGHT
P[5,3,5] = 0.15
P[5,3,2] = 0.15

P[6,0,9] = 0.70  # start at s=7, move UP
P[6,0,5]  = 0.15
P[6,0,7]  = 0.15

P[6,1,1] = 0.70  # start at s=7, move DOWN
P[6,1,5] = 0.15
P[6,1,7] = 0.15

P[6,2,5]  = 0.70 # start at s=7, move LEFT
P[6,2,9] = 0.15
P[6,2,1]  = 0.15

P[6,3,7] = 0.70  # start at s=7, move RIGHT
P[6,3,1] = 0.15
P[6,3,9] = 0.15

P[7,0,16] = 0   # start at s=8 [DANGER] EXIT
P[7,1,16] = 0
P[7,2,16] = 0
P[7,3,16] = 0
P[7,4,16] = 1 #STOP action

P[8,0,15] = 0.70   # start at s=9 move UP
P[8,0,9] = 0.15
P[8,0,8]  = 0.15

P[8,1,7]  = 0.70   # start at s=9 move DOWN
P[8,1,9] = 0.15
P[8,1,8]  = 0.15

P[8,2,9] = 0.70  # start at s=9 move LEFT
P[8,2,15] = 0.15
P[8,2,7]  = 0.15

P[8,3,8]  = 0.70  # start at s=9 move RIGHT
P[8,3,7]  = 0.15
P[8,3,15] = 0.15

P[9,0,14] = 0.70   # start at s=10 move UP
P[9,0,8]  = 0.15
P[9,0,9] = 0.15

P[9,1,6]  = 0.70  # start at s=10 move DOWN
P[9,1,8]  = 0.15
P[9,1,9] = 0.15

P[9,2,9] = 0.70  # start at s=10 move LEFT
P[9,2,14] = 0.15
P[9,2,6]  = 0.15

P[9,3,8]  = 0.70   # start at s=10 move RIGHT
P[9,3,6]  = 0.15
P[9,3,14] = 0.15

P[10,0,3] = 1 # values for location 11 this state is never reached
P[10,1,3] = 1 # so these values are irrelevant
P[10,2,3] = 1
P[10,3,3] = 1
P[10,4,3] = 1

P[11,0,12] = 0.70  # start at s=12 move UP
P[11,0,11] = 0.30

P[11,1,4]  = 0.70  # start at s=12 move DOWN
P[11,1,11] = 0.30

P[11,2,12] = 0.15  # start at s=12 move LEFT
P[11,2,4]  = 0.15
P[11,2,11] = 0.70

P[11,3,11] = 0.70  # start at s=12 move RIGHT
P[11,3,4]  = 0.15
P[11,3,12] = 0.15

P[12,0,12] = 0.85 # start at s=13 move UP
P[12,0,13] = 0.15

P[12,1,11] = 0.70  # start at s=13 move DOWN
P[12,1,12] = 0.15
P[12,1,13] = 0.15

P[12,2,12] = 0.85 # start at s=13 move LEFT
P[12,2,11] = 0.15

P[12,3,13] = 0.70  # start at s=13 move RIGHT
P[12,3,11] = 0.15
P[12,3,12] = 0.15

P[13,0,13] = 0.70 # start at s=14 move UP
P[13,0,12] = 0.15
P[13,0,14] = 0.15

P[13,1,13] = 0.70  # start at s=14 move DOWN
P[13,1,12] = 0.15
P[13,1,14] = 0.15

P[13,2,12] = 0.70  # start at s=14 move LEFT
P[13,2,13] = 0.30

P[13,3,14] = 0.70  # start at s=14 move RIGHT
P[13,3,13] = 0.30

P[14,0,14] = 0.70  # start at s=15 move UP
P[14,0,13] = 0.15
P[14,0,15] = 0.15

P[14,1,9] = 0.70   # start at s=15 move DOWN
P[14,1,13] = 0.15
P[14,1,15] = 0.15

P[14,2,13] = 0.70  # start at s=15 move LEFT
P[14,2,9] = 0.15
P[14,2,14] = 0.15

P[14,3,15] = 0.70   # start at s=15 move RIGHT
P[14,3,9] = 0.15
P[14,3,14] = 0.15

P[15,0,16] =0   # start at s=16 [REWARD] EXIT
P[15,1,16] =0
P[15,2,16] =0
P[15,3,16] =0
P[15,4,16] =1 # STOP action

P[16,0,16] = 0
P[16,1,16] = 0
P[16,2,16] = 0
P[16,3,16] = 0
P[16,4,16] = 1 # EXIT state

# Computing rpi(s)
rpi = np.zeros(NS)
for s in range(NS):
    policy = Pi[:, s]
    for a in range(NA):
        for sprime in range(NS):
            rpi[s] += policy[a]*P[s, a, sprime]*reward[s]

# Computing P^{\pi}
Ppi = np.zeros((NS, NS))
for s in range(NS):
    policy = Pi[:, s]
    for sprime in range(NS):
        for a in range(NA):
            Ppi[s, sprime] += policy[a]*P[s, a, sprime]

# behavior policy phi(a|s) used to simulate off-policy algorithms
Phi = np.zeros((NA, NS)) # matri Phi specifies the behavior policy phi(a|s)
                         # each row is an action; each column is a state

for j in range(NS):
    s = states[j]
    if s in [1, 2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 15]:
        Phi[0,j] = 3/8 # up
        Phi[1,j] = 1/8 # down
        Phi[2,j] = 2/6 # left
        Phi[3,j] = 1/6 # right
        Phi[4,j] = 0  # STOP
    else:
        Phi[0,j] = 0 # up
        Phi[1,j] = 0 # down
        Phi[2,j] = 0 # left
        Phi[3,j] = 0 # right
        Phi[4,j] = 1  # STOP

# one-hot encoding for the actions
A = np.zeros((5, 5))
A[0, :] = np.array([1, 0, 0, 0, 0]) # up
A[1, :] = np.array([0, 1, 0, 0, 0]) # down
A[2, :] = np.array([0, 0, 1, 0, 0]) # left
A[3, :] = np.array([0, 0, 0, 1, 0]) # right
A[4, :] = np.array([0, 0, 0, 0, 1]) # STOP

# 4x1 reduced feature vectors with four binary entries
# is agent on same row as SUCCESS
# is agent on same row as DANGER
# is agent in rightmost two columns
# is agent in leftmost two columns

# reduced features for state-value function
# no offset is included in the feature vectors since v^{\pi}=0 at state 17
# v^{\pi}(s) = h'*w

Mr = 4
Hr = np.zeros((NS, Mr))
Hr[0,:]  = np.array([0, 0, 1, 0]) # state 1
Hr[1,:]  = np.array([0, 0, 1, 0]) # state 2
Hr[2,:]  = np.array([0, 0, 0, 1]) # state 3
Hr[3,:]  = np.array([0, 0, 0, 0]) # not a valid state
Hr[4,:]  = np.array([0, 1, 0, 1]) # state 5...
Hr[5,:]  = np.array([0, 1, 0, 1])
Hr[6,:]  = np.array([0, 1, 1, 0])
Hr[7,:]  = np.array([0, 1, 1, 0])
Hr[8,:]  = np.array([0, 0, 1, 0])
Hr[9,:] = np.array([0, 0, 1, 0])
Hr[10,:] = np.array([0, 0, 0, 0]) # not a valid state 
Hr[11,:] = np.array([0, 0, 0, 1])
Hr[12,:] = np.array([1, 0, 0, 1])
Hr[13,:] = np.array([1, 0, 0, 1])
Hr[14,:] = np.array([1, 0, 1, 0])
Hr[15,:] = np.array([1, 0, 1, 0]) # state 16
Hr[16,:] = np.array([0, 0, 0, 0]) # EXIT state

Fr = np.kron(Hr, A) # Kronecker product of dimensions (NSxNA) x (MrxNA)
Tr = Mr*NA

# one-hot encoded feature vectors for state-value function
# no offset is included in the feature vectors because v^{\pi}=0 at state 17
# v^{\pi}(s) = h'*w

Me = NS
He = np.zeros((NS, Me))
He[0,:]   = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # state 1
He[1,:]   = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # state 2
He[2,:]   = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # state 3
He[3,:]   = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # not valid state
He[4,:]   = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # state 5
He[5,:]   = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # ...
He[6,:]   = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
He[7,:]   = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
He[8,:]   = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
He[9,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
He[10,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]) # not valid state
He[11,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
He[12,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
He[13,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
He[14,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
He[15,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]) # state 16
He[16,:]  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) # EXIT state

Fe = np.kron(He, A) # Kronecker product of dimensions (NSxNA) x (MexNA)
Te = Me*NA


In [3]:
# Q-learning under epsilon-greedy exploration

E = 50000 # number of episodes
gamma = 0.9
mu = 0.01
epsilon = 0.1 # parameter for epsilon-greedy exploration
rmin = 0.1
rmax = 10
Q = (max(rmin, rmax)/ (1-gamma))*np.ones((NS, NA)) # Q(s, a) matrix; optimistic initialization
beta = np.zeros((NS, NA))
max_episode_duration = 50

q_vec = np.zeros(NA)
q_prime_vec = np.zeros(NA)
a_vec = np.zeros(NA)
kernel = np.zeros(NS)

for e in range(E): # iterates over episodes
    counter = 0
    sample = 1
    while sample == 1:
        idx = np.random.randint(NS-1)+1 # select a random non-exit state index
        if (idx != 4) and (idx != 11) and (idx != 17): # excluding the block locations and exit state
            s = states[idx-1]
            sample = 0
    
    while (s != 17) and (counter < max_episode_duration): # state s different from EXIT state
        q = Q[s, :] # row in Q corresponding to state s

        pi_vec = Pi[:, s] # policy vector at state s --> determines which actions are possible ar s
        counter2 = 0
        for j in range(NA):
            if pi_vec[j] > 0: # the j-th action is possible
                q_vec[counter2] = q[j] # q-value
                a_vec[counter2] = j # corresponding valid action
                counter2 += 1
        ax = np.argmax(q_vec[0:counter2]) # permissible action at s with largest q-value
        idx = (q_vec[0:counter2]).max()
        act1 = int(a_vec[ax]) # index of the permissible action

        y = np.random.rand() # epsilon-greed strategy
        if y <= epsilon:
            ay = np.random.randint(counter2) # choose from actions permissible at state s
            act = int(a_vec[ay])
        else:
            act = act1
        
        for j in range(NS):
            kernel[j] = P[s, act, j]
        
        sprime = select_next_state(kernel)
        r = reward[s]

        pi_prime_vec = Pi[:, sprime] # again, we find max Q(s', a') over permissible actions at s'
        counter3 = 0
        q_prime = Q[sprime, :]
        for j in range(NA):
            if pi_prime_vec[j] > 0: # the h-th action is possible
                q_prime_vec[counter3] = q_prime[j] # q_value
                counter3 += 1
        
        max_value = (q_prime_vec[0:counter3]).max() # maximum over permissible actions at s' 

        beta[s, act] = r + max_value - Q[s, act]
        Q[s, act] += mu*beta[s, act]
        s = sprime
        counter += 1

# after convergence, we determine the optimal policy from the resulting Q martrix

act = np.zeros(NS)
action_state = [None]*NS
act[3] = -1 # no action since 4 is not a valud state
act[10] = - 1 # 11 is not a valid state
action_state[3] = 'NA' # no action; not applicable since 4 and 11 are not valid states
action_state[10] = 'NA'

for s in range(NS):
    if (s != 3) and (s != 10): # not valid states; exclude them
        q = Q[s, :] # row in Q corresponding to state s
        pi_vec = Pi[:, s] # policy vector at state s --> determines which actions are possible ar s
        counter = 0
        for j in range(NA):
            if pi_vec[j] > 0: # the j-th action in possible
                q_vec[counter] = q[j] # q-value
                a_vec[counter] = j # corresponding valid action
                counter += 1
        ax = np.argmax(q_vec[0:counter]) # permissible action at s with largest q-value
        idx = (q_vec[0:counter]).max()
        act[s] = int(a_vec[ax]) # index of the permissible action
        action_state[s] = actions[int(act[s])]
        print(s+1, action_state[s])

print(Q)

1 down
2 up
3 up
5 up
6 left
7 left
8 stop
9 up
10 up
12 up
13 right
14 right
15 right
16 stop
17 stop
[[  94.67839089  107.77649441  104.88192131  103.40640133  100.        ]
 [ 108.64136084  107.56378441  107.7978384   107.17389268  100.        ]
 [ 108.82349707  107.99568507  107.97622014  107.95527212  100.        ]
 [ 100.          100.          100.          100.         -261.441     ]
 [ 109.20840562  108.52212306  108.54057623  108.590292    100.        ]
 [ 108.59504551  108.44819562  109.04015803  108.48017103  100.        ]
 [ 106.58022704  105.2559485   108.96588207   95.52035231  100.        ]
 [ 100.          100.          100.          100.           90.        ]
 [ 109.83254345   95.67699701  105.24131119  105.22059385  100.        ]
 [ 109.70864799  108.75325587  109.38233091  109.54897926  100.        ]
 [ 100.          100.          100.          100.         -251.08260866]
 [ 109.38376481  108.23449245  108.43942792  108.59703859  100.        ]
 [ 109.06907061  108.