## Problem 3: High dimensional linear system 

This experiment tests how well the global and stepwise algorithms handle high-dimensional data.
    
### MDP

- States: 30-dimensional, with 15 exogenous & 15 endogenous state variables
    - X_t = [X_1t, ... ,X_15t]^T (exogenous)
    - E_t = [E_1t, ... ,E_15t]^T (endogenous)
        
- State transition function: 
    - X_t+1 = M_x * X_t + epsilon_x
    - E_t+1 = M_e * [E_t, X_t, A_t]^T + epsilon_e
    - where M_x & M_e are the transition matrices for the exogenous MRP & endogenous MDP, with entries generated according to N(0,1)
    - epsilon_x is the exogenous Gaussian noise, distributed as N(0, 0.09), and
    - epsilon_e is the endogenous Gaussian noise, distributed as N(0, 0.04)

- Starting state is a zero vector 

- The observed state vector is a linear mixture of the hidden exogenous & endogenous states, defined as:
    - S_t = M * [E_t, X_t]^T where M is a real 30x30 matrix with entries generated according to N(0,1)

- Reward: R_t = R_xt + R_et where R_xt is the exogenous reward & R_et is the endogenous reward
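
The observation mixing S_t = M * [E_t, X_t]^T can be sketched in NumPy as follows. This is a minimal illustration rather than the experiment's actual code: the helper name `observe`, the variable `mix`, and the seeded generator are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make this sketch reproducible

n_exo, n_endo = 15, 15
# Mixing matrix M (30x30) with N(0,1) entries; `mix` is a hypothetical name.
mix = rng.standard_normal((n_endo + n_exo, n_endo + n_exo))

def observe(e_t, x_t, mix):
    """Return S_t = M [E_t, X_t]^T for hidden endo state e_t and exo state x_t."""
    hidden = np.concatenate([e_t, x_t])  # stacked hidden state, shape (30,)
    return mix @ hidden

# The starting state is the zero vector, so the first observation is zero as well.
s_0 = observe(np.zeros(n_endo), np.zeros(n_exo), mix)
```

Because the agent only ever sees S_t, the exogenous and endogenous coordinates are entangled in every observed dimension.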


### Experiments 
All 4 Q-learners: 
- observe the entire current state s_t
- are initialized identically and use the same random seed
- use a discount factor of 0.9
- use a learning rate of 0.05
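
To make the role of the discount factor and learning rate concrete, here is a hedged sketch of a single Q-learning update. The notebook does not state which function approximator the Q-learners use, so the linear form, the helper names `q_values` and `q_update`, and the number of actions are all assumptions made for illustration.

```python
import numpy as np

GAMMA = 0.9    # discount factor from the text
ALPHA = 0.05   # learning rate from the text

def q_values(weights, s):
    """Q(s, .) for a linear approximator; weights has shape (n_actions, state_dim)."""
    return weights @ s

def q_update(weights, s, a, r, s_next):
    """One semi-gradient Q-learning step; returns an updated copy of the weights."""
    td_target = r + GAMMA * np.max(q_values(weights, s_next))
    td_error = td_target - q_values(weights, s)[a]
    new_weights = weights.copy()
    # For a linear Q, the gradient of Q(s, a) w.r.t. weights[a] is s itself.
    new_weights[a] += ALPHA * td_error * s
    return new_weights

n_actions, state_dim = 3, 30   # n_actions is hypothetical; state_dim matches the MDP
w = np.zeros((n_actions, state_dim))
s = np.ones(state_dim)
w_new = q_update(w, s, a=1, r=1.0, s_next=np.zeros(state_dim))
```

Only the row of weights for the chosen action changes; the other actions' estimates are left untouched by this step.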

Difference between Q-learners:
- The "full Q-learner" is trained on the full reward  
- The endogenous reward Q-learners are trained on the (estimated) endogenous reward
- For the first L = 1000 steps, the full reward is used & we collect a database of (s,a,r,s') transitions
- After these L steps, we apply the Global & Stepwise algorithms to estimate W_x & W_e
- The algorithms then fit a linear regression model R_exo(W_x^T s) to predict the reward r as a function of the exogenous state x = W_x^T s
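
The regression step in the last bullet can be sketched as below. This is an illustration under stated assumptions: the data are random placeholders, `W_x` is a random orthonormal stand-in for the output of the Global or Stepwise algorithm, the intercept term is a choice made here, and taking the endogenous reward estimate as the regression residual is an assumption about how the estimate is formed.

```python
import numpy as np

rng = np.random.default_rng(0)

L, d, d_exo = 1000, 30, 15
S = rng.standard_normal((L, d))   # placeholder observed states s, one per row
r = rng.standard_normal(L)        # placeholder full rewards r

# Placeholder W_x with orthonormal columns (assumption, not the real estimate).
W_x, _ = np.linalg.qr(rng.standard_normal((d, d_exo)))

def fit_exo_reward(S, r, W_x):
    """Least-squares fit of r on x = W_x^T s, with an intercept term."""
    X = S @ W_x                                  # each row is x^T = s^T W_x
    X1 = np.column_stack([X, np.ones(len(X))])   # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, r, rcond=None)
    return coef

def predict_exo_reward(S, W_x, coef):
    """Evaluate the fitted linear model R_exo(W_x^T s) on each row of S."""
    return (S @ W_x) @ coef[:-1] + coef[-1]

coef = fit_exo_reward(S, r, W_x)
r_exo_hat = predict_exo_reward(S, W_x, coef)
r_endo_hat = r - r_exo_hat   # assumption: endogenous reward taken as the residual
```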


- The temperature for the Boltzmann exploration is 5.0
- Steepest descent solving in Manopt is used
- The PCC constraint epsilon is 0.05

In [1]:
import numpy as np

### MDP

__States:__ 30 dimensional, with exogenous & endogenous components

In [2]:
# Initial state is zero vector 
states = np.zeros((1,30))
print(states.dtype)
print(states)

float64
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]]


__State Transitions:__ 

Implement the following:
    - transition matrices for the exo MRP & endo MDP, with entries generated according to N(0,1)
        - M_x is a real 15x15 matrix
        - M_e is a real 15x31 matrix
        - where each row of each matrix is normalized to sum to 0.99 for stability
    - Gaussian noise variables
        - Epsilon_x generated by N(0,0.09)
        - Epsilon_e generated by N(0,0.04)
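
One literal reading of "each row is normalized to sum to 0.99" is sketched below. The helper name `normalize_rows_to_sum` and the seeded generator are assumptions introduced for this illustration; the notebook cells that follow compute L2 norms instead.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility of the sketch

def normalize_rows_to_sum(mat, target=0.99):
    """Scale each row so that its entries sum to `target`."""
    row_sums = mat.sum(axis=1, keepdims=True)   # shape (n_rows, 1)
    return mat * (target / row_sums)

m_x_scaled = normalize_rows_to_sum(rng.standard_normal((15, 15)))
m_e_scaled = normalize_rows_to_sum(rng.standard_normal((15, 31)))
```

With N(0,1) entries a raw row sum can land close to zero, in which case that row is scaled by a very large factor, so it may be worth inspecting the scaled matrices before using them.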

In [18]:
# helper functions used to verify that the L2-norm computation is working
def sum_sqrt(a):
    # row-wise L2 norm computed by hand: sqrt of the sum of squared entries
    return np.sqrt(np.sum(np.abs(a)**2, axis=-1))

def apply_norm_along_axis(a):
    # row-wise L2 norm via np.linalg.norm applied to each row
    return np.apply_along_axis(np.linalg.norm, 1, a)

In [19]:
# transition matrices for the exo MRP & endo MDP: M_x is 15x15 & M_e is 15x31
# (M_e acts on [E_t, X_t, A_t], i.e. 15 + 15 + 1 = 31 inputs)
m_x = np.random.randn(15,15)
m_e = np.random.randn(15,31)
print('m_x: \n', m_x)
print('m_e: \n', m_e)

m_x: 
 [[-9.46621752e-01  4.98574745e-01  1.43495572e+00 -1.10285555e+00
   8.37911565e-01  1.58601299e+00  4.34951766e-01 -1.29515650e+00
   1.64795162e-02  1.34774909e+00  4.02203364e-01  1.85942485e+00
  -2.65910459e-01 -2.33998367e-01 -9.61939086e-02]
 [ 3.22983222e-01  1.62890433e-01  3.12229324e-01 -7.20060568e-01
  -2.66994435e+00 -5.32024649e-01  1.99346518e-01 -8.06205080e-01
   4.71412407e-01  1.22570259e+00  6.35789901e-01  1.45221552e+00
   8.27014670e-03  1.13573659e+00  2.38917505e-01]
 [-1.62402783e-01  5.12649359e-01  4.97954994e-01 -9.36729038e-02
  -3.94621988e-01 -1.51732368e+00  8.86838323e-01  5.70386630e-01
   7.61705108e-01  1.04345847e+00 -7.89903146e-01 -3.89348654e-01
  -7.87939004e-01  4.94090586e-01  3.44180279e-01]
 [ 2.35889020e-01  1.32789137e+00 -7.86342256e-01 -9.88295592e-01
  -5.34896380e-01  5.13032889e-01 -9.18139103e-01  1.17520623e+00
   1.31890804e+00 -1.67986517e-01 -4.77506645e-02  4.59400975e-01
  -3.37757107e-01 -4.62720938e-01 -1.78477818e-0

In [20]:
# compute the row-wise L2 norms of M_x & M_e
# np.linalg.norm(x, axis=1) is a fast way to compute the L2 norm of each row
# dividing a row by its L2 norm would make its squared elements sum to 1;
# the rows are not rescaled here, so the print below shows the raw first row of m_x
m_x_norm = np.linalg.norm(m_x, axis=1)
m_e_norm = np.linalg.norm(m_e, axis=1)
print('m_x row 0 (unnormalized): \n', m_x[0])
#print('m_e row 0 (unnormalized): \n', m_e[0])

#print('sum: ', apply_norm_along_axis(m_x))

m_x row 0 (unnormalized): 
 [-0.94662175  0.49857475  1.43495572 -1.10285555  0.83791157  1.58601299
  0.43495177 -1.2951565   0.01647952  1.34774909  0.40220336  1.85942485
 -0.26591046 -0.23399837 -0.09619391]


In [21]:
# cross-check the row-wise L2 norms of m_x with the hand-written helper
norm_m_x = sum_sqrt(m_x)
print(norm_m_x)

[3.88263423 3.79907453 2.74658182 2.92036291 4.10050081 3.18606146
 4.44727577 4.83298637 3.45013083 2.9719169  3.47634091 3.32461781
 4.02265662 4.04991993 3.40690131]
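
The norms printed above are all well above 1, which shows that the rows of `m_x` have been measured but not rescaled. A minimal sketch of actually rescaling each row to unit L2 norm follows; the matrix `m` is a random stand-in for `m_x`, and the seeded generator is an assumption made for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal((15, 15))   # stand-in for m_x

# keepdims=True keeps shape (15, 1) so the division broadcasts across each row.
row_norms = np.linalg.norm(m, axis=1, keepdims=True)
m_unit = m / row_norms              # each row now has L2 norm 1

# Recompute the row norms by hand as an independent check.
check = np.sqrt(np.sum(m_unit**2, axis=-1))
```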


In [22]:
# state noise distributions: exo is N(0,0.09) & endo is N(0,0.04)
# N(mu, sigma^2) -> sigma * np.random.randn() + mu
sigma_x  = 0.3
sigma_e = 0.2
epsilon_x = sigma_x * np.random.randn()
epsilon_e = sigma_e * np.random.randn()
print(epsilon_x)
print(epsilon_e)

0.26841882395143063
-0.07673246312768339


In [9]:
# State transitions (WIP)
#new_exo_state = M_x * previous_exo_state + epsilon_x
#new_endo_state = M_e * [previous_endo_state, previous_exo_state, action]^T + epsilon_e

#update_x = m_x @ state_x + epsilon_x
#update_e = m_e @ np.concatenate([state_e, state_x, [action]]) + epsilon_e
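
A runnable version of the transition step might look like the sketch below. Several details are assumptions rather than facts from the notebook: the action is treated as a single scalar (which is what makes M_e 15x31), the noise is drawn independently for each state component, and the function name `step` and the seeded generator are introduced here. The matrices are left unnormalized for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

n_exo, n_endo = 15, 15
sigma_x, sigma_e = 0.3, 0.2   # standard deviations for N(0, 0.09) and N(0, 0.04)

m_x = rng.standard_normal((n_exo, n_exo))                 # 15x15
m_e = rng.standard_normal((n_endo, n_endo + n_exo + 1))   # 15x31

def step(x_t, e_t, a_t, m_x, m_e, rng):
    """One transition: X_t+1 = M_x X_t + eps_x, E_t+1 = M_e [E_t, X_t, A_t]^T + eps_e."""
    eps_x = sigma_x * rng.standard_normal(n_exo)    # one noise draw per exo component
    eps_e = sigma_e * rng.standard_normal(n_endo)   # one noise draw per endo component
    x_next = m_x @ x_t + eps_x
    e_next = m_e @ np.concatenate([e_t, x_t, [a_t]]) + eps_e
    return x_next, e_next

# Start from the zero vector, as in the MDP definition.
x_0, e_0 = np.zeros(n_exo), np.zeros(n_endo)
x_1, e_1 = step(x_0, e_0, a_t=0.0, m_x=m_x, m_e=m_e, rng=rng)
```

Note that the exogenous update never reads `e_t` or `a_t`, which is what makes X_t exogenous.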

In [None]:
# policy: action selection is done with Boltzmann exploration (WIP)
#if exploration == "boltzmann":
    #Choose an action probabilistically, with weights relative to the Q-values.
    #Q_d, allQ = sess.run([q_net.Q_dist,q_net.Q_out],feed_dict={q_net.inputs:[s],q_net.Temp:e,q_net.keep_per:1.0})
    #a = np.random.choice(Q_d[0],p=Q_d[0])
    #a = np.argmax(Q_d[0] == a)
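
The commented-out snippet above depends on a TensorFlow session and a `q_net` object that are not defined in this notebook. As a framework-free alternative, Boltzmann exploration can be sketched in plain NumPy. The function names `boltzmann_probs` and `boltzmann_action` are hypothetical; the default temperature of 5.0 is the value given in the experiment description.

```python
import numpy as np

def boltzmann_probs(q_values, temperature=5.0):
    """Softmax over Q-values divided by the temperature."""
    q = np.asarray(q_values, dtype=float) / temperature
    q -= q.max()               # shift for numerical stability; probabilities unchanged
    exp_q = np.exp(q)
    return exp_q / exp_q.sum()

def boltzmann_action(q_values, temperature=5.0, rng=None):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = boltzmann_probs(q_values, temperature)
    return int(rng.choice(len(probs), p=probs))

probs = boltzmann_probs([1.0, 2.0, 3.0])
action = boltzmann_action([1.0, 2.0, 3.0], rng=np.random.default_rng(0))
```

Sampling the action index directly avoids the two-step "sample a probability, then look up its index" pattern in the commented code, which can pick the wrong action when two actions share the same probability.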