## Problem 3: High-dimensional linear system

This experiment tests how well the global and stepwise algorithms handle high-dimensional data.
    
### MDP

- States: 30-dimensional, with 15 exogenous & 15 endogenous state variables
    - X_t = [X_1t, ..., X_15t]^T
    - E_t = [E_1t, ..., E_15t]^T
        
- State transition function: 
    - X_t+1 = M_x * X_t + epsilon_x
    - E_t+1 = M_e * [E_t, X_t, A_t]^T + epsilon_e
    - where M_x & M_e are the transition matrices for the exogenous MRP & endogenous MDP, with entries generated according to N(0,1)
    - epsilon_x is the exogenous normal noise, distributed as N(0, 0.09), and
    - epsilon_e is the endogenous normal noise, distributed as N(0, 0.04)

- Starting state is a zero vector 

- The observed state vector is a linear mixture of the hidden exogenous & endogenous states, defined as:
    - S_t = M * [E_t, X_t]^T, where M is a 30x30 real matrix with entries generated according to N(0,1)

- Reward: R_t = R_xt + R_et, where R_xt is the exogenous reward & R_et is the endogenous reward
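
The MDP described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the action dimension (`action_dim = 1`) is not given in the text, the noise parameters 0.09 and 0.04 are read as variances (standard deviations 0.3 and 0.2), and the names `step`, `rng`, `n_exo`, and `n_endo` are hypothetical. The reward functions are not specified here, so the sketch covers only the hidden dynamics and the observation mixing.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

n_exo, n_endo = 15, 15
action_dim = 1  # assumption: the action dimension is not stated in the text

# Transition and mixing matrices with N(0, 1) entries
M_x = rng.standard_normal((n_exo, n_exo))
M_e = rng.standard_normal((n_endo, n_endo + n_exo + action_dim))
M = rng.standard_normal((n_exo + n_endo, n_exo + n_endo))


def step(x, e, a):
    """Advance the hidden state one step and return (x_next, e_next, s_next)."""
    # Exogenous MRP: X_{t+1} = M_x X_t + epsilon_x, with std 0.3 (variance 0.09)
    x_next = M_x @ x + rng.normal(0.0, 0.3, size=n_exo)
    # Endogenous MDP: E_{t+1} = M_e [E_t, X_t, A_t]^T + epsilon_e, with std 0.2 (variance 0.04)
    e_next = M_e @ np.concatenate([e, x, a]) + rng.normal(0.0, 0.2, size=n_endo)
    # Observed state: S_t = M [E_t, X_t]^T
    s_next = M @ np.concatenate([e_next, x_next])
    return x_next, e_next, s_next


# Start from the zero vector and take one step with a placeholder action
x, e = np.zeros(n_exo), np.zeros(n_endo)
a = np.zeros(action_dim)
x, e, s = step(x, e, a)
```

The agent would only see `s`; the split into `x` and `e` is hidden, which is what the Global and Stepwise algorithms try to recover.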


### Experiments 
All 4 Q-learners: 
- observe the entire current state s_t
- are initialized identically and employ the same random seed 
- use a discount factor of 0.9
- use a learning rate of 0.05
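
To show where the discount factor and learning rate enter, here is a minimal sketch of the standard one-step Q-learning update on a single state-action value. The function name `q_update` and the toy numbers are hypothetical. How the notebook represents Q over the 30-dimensional continuous state is not specified, so this only illustrates the update rule itself.

```python
def q_update(q_sa, reward, q_next, gamma=0.9, alpha=0.05):
    """Return the updated Q(s, a) after one observed transition.

    q_sa    : current estimate of Q(s, a)
    reward  : reward used for training (full or estimated endogenous)
    q_next  : iterable of Q(s', a') over all actions a'
    """
    td_target = reward + gamma * max(q_next)   # bootstrapped target
    return q_sa + alpha * (td_target - q_sa)   # move a fraction alpha toward it


# Toy example: target is 1.0 + 0.9 * 2.0 = 2.8, so the estimate moves to about 0.14
q_new = q_update(q_sa=0.0, reward=1.0, q_next=[0.0, 2.0])
```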

Difference between Q-learners:
- The "full Q-learner" is trained on the full reward  
- The endogenous reward Q-learners are trained on the (estimated) endogenous reward 
- For the first L steps, where L = 1000, the full reward is employed, & we collect a database of (s, a, r, s') transitions
- After these L steps, we apply the Global & Stepwise algorithms to estimate W_x & W_e
- Then the algorithms fit a linear regression model R_exo(W_x^Ts) to predict the reward r as a function of the exogenous state x=W_x^Ts
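
The regression step can be sketched as an ordinary least-squares fit of the reward on the projected state x = W_x^T s. Everything below is illustrative: `fit_exo_reward` and `predict_exo_reward` are hypothetical helper names, the data is synthetic, the `W_x` here is a random orthonormal placeholder rather than the output of the Global or Stepwise algorithm, and taking the residual `r - R_exo(x)` as the estimated endogenous reward is an assumption about how the estimate is formed.

```python
import numpy as np


def fit_exo_reward(S, r, W_x):
    """Least-squares fit of r on x = W_x^T s, with an intercept as the last coefficient."""
    X = S @ W_x                                     # (N, d_x) exogenous coordinates
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, r, rcond=None)
    return coef


def predict_exo_reward(S, W_x, coef):
    """Predicted exogenous reward R_exo(W_x^T s) for each row of S."""
    return (S @ W_x) @ coef[:-1] + coef[-1]


# Synthetic stand-in for the database collected during the first L = 1000 steps
rng = np.random.default_rng(0)
N = 1000
S = rng.standard_normal((N, 30))
W_x = np.linalg.qr(rng.standard_normal((30, 15)))[0]   # placeholder with orthonormal columns
true_w = rng.standard_normal(15)
r = (S @ W_x) @ true_w + 2.0 + 0.01 * rng.standard_normal(N)

coef = fit_exo_reward(S, r, W_x)
r_endo_hat = r - predict_exo_reward(S, W_x, coef)       # assumed endogenous-reward estimate
```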


- The temperature for the Boltzmann exploration is 5.0
- Optimization uses the steepest descent solver in Manopt
- The PCC constraint epsilon is 0.05
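
As a rough illustration of Boltzmann exploration, the sketch below samples an action with probability proportional to exp(Q / T) at temperature T = 5.0. The helper names are hypothetical, and the convention of dividing by the temperature (rather than multiplying by an inverse temperature) is an assumption, since the text does not state which form is used.

```python
import numpy as np


def boltzmann_probs(q_values, temperature=5.0):
    """Softmax over Q-values scaled by the temperature."""
    z = np.asarray(q_values, dtype=float) / temperature
    z = z - z.max()              # shift for numerical stability; probabilities are unchanged
    p = np.exp(z)
    return p / p.sum()


def boltzmann_action(q_values, rng, temperature=5.0):
    """Sample an action index according to the Boltzmann distribution."""
    p = boltzmann_probs(q_values, temperature)
    return int(rng.choice(len(p), p=p))


rng = np.random.default_rng(0)
probs = boltzmann_probs([1.0, 2.0, 3.0])
action = boltzmann_action([1.0, 2.0, 3.0], rng)
```

A higher temperature flattens the distribution toward uniform, while a lower one concentrates it on the greedy action.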

In [1]:
import numpy as np

### MDP

__States:__ 30 dimensional, with exogenous & endogenous components

In [2]:
# Initial state is zero vector 
states = np.zeros((1,30))
print(states.dtype)
print(states)

float64
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]]


__State Transitions:__ 
Define 
    - transition matrices for the exogenous MRP & endogenous MDP
        - Mx
        - Me
    - Gaussian noise variables
        - epsilon_x generated by N(0,0.09)
        - epsilon_e generated by N(0,0.04)

In [8]:
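# transition matrices for the exogenous MRP (mx) & endogenous MDP (me), entries ~ N(0,1)
action_dim = 1  # assumption: the action dimension is not stated above
mx = np.random.randn(15, 15)
me = np.random.randn(15, 15 + 15 + action_dim)  # acts on [E_t, X_t, A_t]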

In [4]:
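# exo-endo noise variables: N(mu, sigma^2) is sampled as sigma * np.random.randn() + mu
exo_sigma, endo_sigma = 0.3, 0.2  # standard deviations, reading 0.09 and 0.04 as variances
exo_epsilon = exo_sigma * np.random.randn(15) + 0
endo_epsilon = endo_sigma * np.random.randn(15) + 0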

In [5]:
# State Transitions
# new_exo_state = Mx @ previous_exo_state + epsilon_x
# new_endo_state = Me @ [previous_endo_state, previous_exo_state, action] + epsilon_e