## Problem 3: High-dimensional linear system

This experiment tests how well the Global and Stepwise algorithms handle high-dimensional data.
    
### MDP

- States: 30-dimensional, with 15 exogenous and 15 endogenous state variables
    - X_t = [X_1t, ..., X_15t]^T (exogenous)
    - E_t = [E_1t, ..., E_15t]^T (endogenous)
        
- State transition function:
    - X_t+1 = M_x * X_t + epsilon_x
    - E_t+1 = M_e * [E_t, X_t, A_t]^T + epsilon_e
    - where M_x and M_e are the transition matrices for the exogenous MRP and the endogenous MDP, with entries generated according to N(0,1)
    - epsilon_x is the exogenous Gaussian noise, distributed as N(0, 0.09)
    - epsilon_e is the endogenous Gaussian noise, distributed as N(0, 0.04)

- Starting state is a zero vector 

- The observed state vector is a linear mixture of the hidden exogenous and endogenous states, defined as:
    - S_t = M * [E_t, X_t]^T, where M is a real 30x30 matrix with entries generated according to N(0,1)

- Reward: R_t = R_xt + R_et, where R_xt is the exogenous reward and R_et is the endogenous reward
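
As a rough illustration of the observation mixing defined above, the sketch below draws a 30x30 matrix M from N(0,1) and applies it to the stacked hidden state. The helper name `observe` and the seeded generator are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

# Mixing matrix M: 30x30 with entries drawn from N(0, 1)
M = rng.standard_normal((30, 30))

def observe(e_t, x_t, M):
    """Return S_t = M @ [E_t, X_t]^T for 15-dim hidden states (hypothetical helper)."""
    hidden = np.concatenate([e_t, x_t])  # stacked hidden state, shape (30,)
    return M @ hidden

# The starting state is the zero vector, so the first observation is also zero
s_0 = observe(np.zeros(15), np.zeros(15), M)
```

Because the mixture is linear, the agent never sees X_t or E_t directly; it only sees S_t.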


### Experiments 
All 4 Q-learners: 
- observe the entire current state s_t
- are initialized identically and employ the same random seed 
- use a discount factor of 0.9
- use a learning rate of 0.05

Difference between Q-learners:
- The "full Q-learner" is trained on the full reward  
- The endogenous reward Q-learners are trained on the (estimated) endogenous reward 
- For the first L = 1000 steps, the full reward is employed, and we collect a database of (s, a, r, s') transitions
- After these L steps, we apply the Global and Stepwise algorithms to estimate W_x and W_e
- Then the algorithms fit a linear regression model R_exo(W_x^T s) to predict the reward r as a function of the exogenous state x = W_x^T s
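
The regression step could look roughly like the following. This is a minimal sketch that assumes W_x has already been estimated (a random placeholder stands in for it here), uses ordinary least squares with an intercept via `np.linalg.lstsq`, and takes the estimated endogenous reward to be the full reward minus the predicted exogenous part. The function names and the synthetic data are hypothetical and not taken from the experiment.

```python
import numpy as np

def fit_exo_reward(S, r, W_x):
    """Least-squares fit of r ~ R_exo(W_x^T s); returns (coefficients, intercept)."""
    X = S @ W_x                                     # exogenous features, one row per transition
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
    beta, *_ = np.linalg.lstsq(X1, r, rcond=None)
    return beta[:-1], beta[-1]

def endo_reward(S, r, W_x, coef, intercept):
    """Estimated endogenous reward: full reward minus the predicted exogenous reward."""
    return r - ((S @ W_x) @ coef + intercept)

# Synthetic stand-ins for the L = 1000 collected transitions (not real experiment data)
rng = np.random.default_rng(0)
S = rng.standard_normal((1000, 30))     # observed states
W_x = rng.standard_normal((30, 15))     # placeholder for the estimated projection
true_coef = rng.standard_normal(15)
r = (S @ W_x) @ true_coef + 0.5         # toy reward that is purely exogenous

coef, intercept = fit_exo_reward(S, r, W_x)
r_endo = endo_reward(S, r, W_x, coef, intercept)
```

In this toy case the reward is entirely exogenous, so the residual `r_endo` is numerically zero; with a real reward the residual is what the endogenous reward Q-learners would train on.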


- The temperature for the Boltzmann exploration is 5.0
- The steepest descent solver in Manopt is used
- The PCC constraint epsilon is 0.05

In [1]:
import numpy as np

### MDP

__States:__ 30 dimensional, with exogenous & endogenous components

In [2]:
# Initial state is zero vector 
states = np.zeros((1,30))
print(states.dtype)
print(states)

float64
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]]


__State Transitions:__ 

Implement the following:
- transition matrices for the exogenous MRP and endogenous MDP, with entries generated according to N(0,1)
    - M_x is a real matrix of dimension 15x15
    - M_e is a real matrix of dimension 15x31
    - each row of each matrix is normalized to sum to 0.99 for stability
- Gaussian noise variables
    - epsilon_x generated by N(0, 0.09)
    - epsilon_e generated by N(0, 0.04)
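
One possible reading of the row normalization described above is to rescale each row so that its entries add up to 0.99. The sketch below does exactly that; the helper `normalize_rows_to_sum` and the seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the sketch

def normalize_rows_to_sum(m, target=0.99):
    """Rescale each row of m so that its entries sum to `target` (hypothetical helper)."""
    row_sums = m.sum(axis=1, keepdims=True)
    return m * (target / row_sums)

# Exogenous MRP matrix: 15x15
M_x = normalize_rows_to_sum(rng.standard_normal((15, 15)))
# Endogenous MDP matrix: 15x31, since [E_t, X_t, A_t] has 15 + 15 + 1 entries
M_e = normalize_rows_to_sum(rng.standard_normal((15, 31)))
```

With mixed-sign Gaussian entries a row sum can land close to zero, in which case the rescaled entries become large, so it may be worth inspecting `np.abs(np.linalg.eigvals(M_x)).max()` before relying on the system being stable.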

In [3]:
# functions to verify that l2-norm is working
'''
def sum_sqrt(a):
    return np.sqrt(np.sum(np.abs(a)**2, axis=-1))

def apply_norm_along_axis(a):
    return np.apply_along_axis(np.linalg.norm, 1, a)
'''


In [4]:
# transition matrices for the MRP & MDP; M_x is 15x15 & M_e is 15x31
# (M_e acts on [E_t, X_t, A_t], which has 15 + 15 + 1 = 31 entries)
m_x = np.random.randn(15,15)
m_e = np.random.randn(15,31)
print('m_x: \n', m_x)
print('m_e: \n', m_e)

m_x: 
 [[-1.66111977  0.99017422  0.99553902 -0.96027142 -1.57219366 -0.53846818
  -2.00343655  0.60045816  1.09411002  1.48793533 -0.92398151 -1.20933663
   0.5259119  -2.18418288 -1.74992139]
 [ 0.17246995  1.48789604 -1.06840166 -0.41209592 -1.47960086  0.60065623
  -0.00710774  1.11440123  0.10969171 -0.85004143  0.54350354  0.06543376
   0.73261124  1.83704256 -0.35133936]
 [ 0.0781673  -0.26877405  0.3996426  -0.46343816 -0.85401835 -0.63584499
   0.33157194  0.91387019 -0.56911346  0.99898567  1.88545343  0.28997697
   0.93476602  0.23487627 -0.97662392]
 [-0.28076461  0.7976943   0.02564357  1.29383044  0.42603453  0.94328828
  -0.2752606   0.01113817 -1.1210267  -1.31733059  1.8676044   1.15167585
  -0.99937183 -0.10299309 -0.98266096]
 [-0.03954612 -0.20370832 -0.23029544  0.23044053 -0.30333559  0.99637357
   1.24826898  0.21399581  1.0143156  -0.42438508 -1.7981757  -0.84726075
  -0.33514992 -1.07039646  0.08945039]
 [-0.48760187 -0.87040387  0.69004991 -0.29784728  1.00799

In [5]:
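# compute the row-wise L2 norms of the transition matrices M_x & M_e
# np.linalg.norm(x, axis=1) returns one L2 norm per row
# dividing a row by its norm makes its squared elements sum to 1; that division
# is not applied here, so the row printed below is still the raw m_x[0]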
m_x_norm = np.linalg.norm(m_x, axis=1)
m_e_norm = np.linalg.norm(m_e, axis=1)
print('m_x L2-norm: \n', m_x[0])
#print('m_x L2-norm: \n', m_e)

#print('sum: ', apply_norm_along_axis(m_x))

m_x L2-norm: 
 [-1.66111977  0.99017422  0.99553902 -0.96027142 -1.57219366 -0.53846818
 -2.00343655  0.60045816  1.09411002  1.48793533 -0.92398151 -1.20933663
  0.5259119  -2.18418288 -1.74992139]


In [7]:
# verify l2-norm works
#norm_m_x = sum_sqrt(m_x)
#print(norm_m_x)

In [8]:
# state noise distributions: exo is N(0,0.09) & endo is N(0,0.04)
# N(mu, sigma^2) -> sigma * np.random.randn() + mu
# (each call below draws a single scalar sample)
sigma_x  = 0.3
sigma_e = 0.2
epsilon_x = sigma_x * np.random.randn()
epsilon_e = sigma_e * np.random.randn()
print(epsilon_x)
print(epsilon_e)

-0.512761333133071
0.004779688078053899


In [9]:
# State Transitions -- WIP
# new_exo_state = M_x * previous_exo_state + epsilon_x
# new_endo_state = M_e * [previous_endo_state, previous_exo_state, action] + epsilon_e

# update_x = m_x @ state_x + epsilon_x
# update_e = m_e @ np.concatenate([state_e, state_x, [action]]) + epsilon_e
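
The work-in-progress transition above might be completed along these lines. This is only a sketch: it assumes a scalar action (consistent with M_e being 15x31), draws independent noise for every state component, and uses a hypothetical `step` helper with un-normalized placeholder matrices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the sketch

def step(x_t, e_t, a_t, M_x, M_e, sigma_x=0.3, sigma_e=0.2):
    """One transition of the linear system (hypothetical helper)."""
    eps_x = sigma_x * rng.standard_normal(x_t.shape)  # N(0, 0.09) per component
    eps_e = sigma_e * rng.standard_normal(e_t.shape)  # N(0, 0.04) per component
    x_next = M_x @ x_t + eps_x                        # exogenous part ignores the action
    e_next = M_e @ np.concatenate([e_t, x_t, [a_t]]) + eps_e
    return x_next, e_next

# Placeholder matrices; the notebook's normalization step is omitted here
M_x = rng.standard_normal((15, 15))
M_e = rng.standard_normal((15, 31))

# First step from the zero starting state with an arbitrary action value
x_1, e_1 = step(np.zeros(15), np.zeros(15), 1.0, M_x, M_e)
```

Note that the exogenous update never touches `e_t` or `a_t`, which is the property that makes X_t an exogenous process.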

In [None]:
# policy: action selection is done with Boltzmann exploration (reference snippet, left commented out)
#if exploration == "boltzmann":
    #Choose an action probabilistically, with weights relative to the Q-values.
    #Q_d, allQ = sess.run([q_net.Q_dist,q_net.Q_out],feed_dict={q_net.inputs:[s],q_net.Temp:e,q_net.keep_per:1.0})
    #a = np.random.choice(Q_d[0],p=Q_d[0])
    #a = np.argmax(Q_d[0] == a)

### Action Selection 

In [None]:
#### softmax action selection 

- actions are ranked and weighted according to value estimates 
- here we use a Boltzmann distribution 
- temperature value:
    - high temperatures cause the actions to be nearly equiprobable
    - low temperatures cause larger differences in selection probability between actions
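
The effect of the temperature can be seen on a small made-up example. The sketch below computes Boltzmann probabilities for three illustrative Q-values at a high and a low temperature; `boltzmann_probs` is a hypothetical helper, and the Q-values are not from the experiment.

```python
import numpy as np

def boltzmann_probs(q_values, temperature):
    """Softmax of q_values / temperature, shifted by the max for numerical stability."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature
    w = np.exp(z)
    return w / w.sum()

q = [1.0, 2.0, 3.0]                            # made-up Q-values for three actions
p_hot = boltzmann_probs(q, temperature=5.0)    # probabilities stay fairly close together
p_cold = boltzmann_probs(q, temperature=0.1)   # almost all mass on the best action
```

At temperature 5.0 the three probabilities differ only modestly, while at 0.1 the highest-valued action is selected almost every time.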

In [9]:
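import math

def action_selection(Qs, time_step, temperature):
    # Q-values of all N actions at this time step (N = number of actions)
    q_values = Qs[time_step, :]

    # subtract the max before exponentiating for numerical stability;
    # this leaves the resulting probabilities unchanged
    q_max = max(q_values)

    # exp( Q(s_t, a) / temperature ) for every action a
    nums = [math.exp((q - q_max) / temperature) for q in q_values]

    # sum_i of exp( Q(s_t, a_i) / temperature )
    denom = sum(nums)

    # action_t ~ num/denom: sample an action index from the Boltzmann distribution
    probs = [num / denom for num in nums]
    action = np.random.choice(len(probs), p=probs)
    return action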