# Non-Stationary Contextual Bandits Evaluation

In this notebook we build a simple contextual-bandit setting to understand how a non-stationary policy should be evaluated.

#### Contextual bandits
We introduce a contextual bandit with 2 arms, whose action $a_t\in\{0,1\}$ depends on a context $x_t$ sampled from a context distribution $\mathcal{D}$. Since we want a non-stationary contextual bandit, this distribution changes smoothly over time, so the context is sampled from a time-dependent distribution $x_t\sim\mathcal{D}_t$.
Moreover, at each time $t$ the policy parameter $\theta_t$ is given by a time-dependent hyperpolicy $\nu$, which is parametrized by the hyperparameters $\rho$.

#### Problem definition 
We define the non-stationary process that builds the context distributions $\mathcal{D}_t$ as a sinusoidal one, i.e. $\mathcal{D}_t = \mathcal{N}(\mu_x(t),\sigma_x)$ with $\mu_x(t) = A_x\sin(\phi_x t + \psi_x) + B_x$ and a constant $\sigma_x$. We want the agent to recognise whether $x_t>\mu_x(t)$ or $x_t<\mu_x(t)$. The two possible actions represent those two cases (in the code, $a_t=1$ stands for $x_t>\mu_x(t)$), and the reward is equal to $|x_t-\mu_x(t)|$ if the agent is right and $-|x_t-\mu_x(t)|$ if it is wrong. To handle the non-stationary process, the hyperpolicy is based on a similar sinusoidal process, and its goal is to learn and replicate the non-stationary process in order to maximize the rewards.
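
As a minimal sketch of the definitions above, the sinusoidal mean, the context sampling, and the signed reward can be written as plain functions. The names `context_mean` and `reward`, the default parameter values, and the seed are illustrative assumptions, not part of the notebook's classes; action `1` is assumed to mean "above the mean".

```python
import numpy as np

def context_mean(t, A_x=1.0, phi_x=0.1, psi_x=0.0, B_x=0.0):
    """Sinusoidal mean mu_x(t) = A_x * sin(phi_x * t + psi_x) + B_x."""
    return A_x * np.sin(phi_x * t + psi_x) + B_x

def reward(x_t, mu_t, action):
    """+|x_t - mu_t| if the action is right, -|x_t - mu_t| otherwise.

    Assumption: action 1 claims x_t > mu_t, action 0 claims x_t < mu_t.
    """
    correct = (action == 1) == (x_t > mu_t)
    return abs(x_t - mu_t) if correct else -abs(x_t - mu_t)

rng = np.random.default_rng(0)   # seeded only for reproducibility
sigma_x = 1.0                    # illustrative value
for t in range(3):
    mu_t = context_mean(t)
    x_t = rng.normal(mu_t, sigma_x)   # x_t ~ N(mu_x(t), sigma_x)
    # An agent that always answers "above" is right about half of the time
    print(t, reward(x_t, mu_t, action=1))
```

The magnitude of the reward grows with the distance from the mean, so mistakes on contexts close to $\mu_x(t)$ cost little.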

In the end, the bandit can be represented in two ways, depending on the desired approach:
 - **Action-based exploration**: The stochasticity is given by the policy, which is represented by a Bernoulli distribution with $\pi_{\theta_t}(a_t=1|x_t) = p_t$ and $\pi_{\theta_t}(a_t=0|x_t) = 1-p_t$, where $p_t \equiv f(\theta_t, x_t) = S(x_t-\theta_t)$ (with $S$ the sigmoid function). In this case the hyperpolicy is deterministic: at each $t$, the value of $\theta_t$ is given by $\theta_t = \nu_\rho(t) = A_\theta\sin(\phi_\theta t + \psi_\theta) + B_\theta$.
 - **Parameter-based exploration**: The stochasticity is given by the hyperpolicy, which is represented by a time-dependent Gaussian distribution $\nu_\rho(\theta_t|t) = \mathcal{N}(\mu_\theta(t),\sigma_\theta)$ with $\mu_\theta(t) = A_\theta\sin(\phi_\theta t + \psi_\theta) + B_\theta$ and a constant $\sigma_\theta$. The sampled $\theta_t$ then defines a deterministic policy $\pi_{\theta_t}(x_t)$, which is a step function of $x_t-\theta_t$: it equals $1$ if $x_t>\theta_t$ and $0$ if $x_t<\theta_t$.
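
The difference between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the tuple layout `rho = (A_theta, phi_theta, psi_theta, B_theta)`, and the numeric values are hypothetical and do not come from the classes defined later in the notebook.

```python
import numpy as np

def sinusoid(t, A, phi, psi, B):
    """A * sin(phi * t + psi) + B, used for the hyperpolicy mean."""
    return A * np.sin(phi * t + psi) + B

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def action_based(x_t, t, rho, rng):
    """Deterministic theta_t, stochastic (Bernoulli) policy."""
    theta_t = sinusoid(t, *rho)
    p_t = sigmoid(x_t - theta_t)        # probability of choosing action 1
    return int(rng.random() < p_t)      # Bernoulli(p_t) draw

def parameter_based(x_t, t, rho, sigma_theta, rng):
    """Stochastic theta_t ~ N(mu_theta(t), sigma_theta), deterministic step policy."""
    theta_t = rng.normal(sinusoid(t, *rho), sigma_theta)
    return int(x_t > theta_t)

rng = np.random.default_rng(0)          # seeded only for reproducibility
rho = (1.0, 0.1, 0.0, 0.0)              # illustrative hyperparameters
x_t = 0.3
print(action_based(x_t, t=5, rho=rho, rng=rng))
print(parameter_based(x_t, t=5, rho=rho, sigma_theta=0.5, rng=rng))
```

In both cases the action is random for a fixed context; what changes is whether the randomness enters through the action distribution or through the sampled parameter.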

In [1]:
# Import libraries
import numpy as np
import scipy.stats

## Define the environment

Start by defining the environment that contains the non-stationary context and that assigns rewards for each action of the bandit.

In [None]:
class environment:
    
    def __init__(self, t0=0, sigma_x=1, **kwargs):
        
        # Store the given parameters
        self.t = t0
        self.sigma_x = sigma_x
        
        # if params that define the mean are not given, take random ones
        var_names = ['A_x', 'B_x', 'phi_x', 'psi_x'] 
        for var in var_names:
            if var in kwargs.keys():
                setattr(self, var, kwargs[var])
            else:
                setattr(self, var, np.random.rand())
    
    
    # Increase the time variable by 'delta_t' steps
    def increase_t(self, delta_t=1):
        self.t += delta_t
    
    
    # Compute the context mean at a given instant (or at 'self.t' if no instant is given)
    def x_mean(self, t=None):
        if t is None:
            t = self.t
        return self.A_x * np.sin(self.phi_x*t + self.psi_x) + self.B_x
    
    # Sample a context from the distribution at time 'self.t'
    def sample_x(self):
        return scipy.stats.norm.rvs(loc   = self.x_mean(), 
                                    scale = self.sigma_x)
    
    # Obtain reward of the action 'over_mu': positive if correct
    #                                        negative if wrong
    def get_reward(self, x_t, over_mu):
        if over_mu == (x_t > self.x_mean()):
            return np.abs(x_t-self.x_mean())
        else: 
            return -np.abs(x_t-self.x_mean())
    
    
    #def play(self, nu, n_steps):
        
        # TO DO
        
    

## Hyper-policy and policy

Here we will define the classes for the hyperpolicies and policies. We will follow the structure described at the beginning of the notebook.

In [None]:
class policy:
    
    def __init__(self, theta, stochastic=False):
        self.theta = theta
        self.stochastic = stochastic
        
    def params(self):
        return self.theta
        
    def action(self, x):
        if self.stochastic:
            p = 1 / (1 + np.exp(-(x-self.theta)))   # sigmoid S(x - theta): probability of action 1
            return scipy.stats.bernoulli.rvs(p)
        else:
            return (x > self.theta)

In [None]:
class hyperpolicy:
        
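    def __init__(self, sigma_theta=1, **kwargs):
        
        # Store sigmas of the gaussian distributions.
        # Assumption: both default to 'sigma_theta' unless 'sigma_W' / 'sigma_b' are passed explicitly
        self.sigma_W = kwargs.get('sigma_W', sigma_theta)
        self.sigma_b = kwargs.get('sigma_b', sigma_theta)
        
        # List of other params that can be set (sinusoid parameters of the W and b means)
        self.var_names = ['A_W', 'B_W', 'phi_W', 'psi_W',
                          'A_b', 'B_b', 'phi_b', 'psi_b'] 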
        
        # If params are given, initialize hyperpolicy means with them, else take random ones
        for var in self.var_names:
            if var in kwargs.keys():
                setattr(self, var, kwargs[var])
            else:
                setattr(self, var, np.random.rand())
               
            
        
    def params(self):
        # Return the current value of every parameter listed in 'self.var_names'
        return {name: getattr(self, name) for name in self.var_names}
    
    
    def W_mean(self, t):
        return self.A_W * np.sin(self.phi_W*t + self.psi_W) + self.B_W
    
    def b_mean(self, t):
        return self.A_b * np.sin(self.phi_b*t + self.psi_b) + self.B_b
    
    def W_pdf(self, W, t):
        return scipy.stats.norm.pdf(W, loc=self.W_mean(t), scale=self.sigma_W)
    
    def b_pdf(self, b, t):
        return scipy.stats.norm.pdf(b, loc=self.b_mean(t), scale=self.sigma_b)
    
    def theta_pdf(self, theta, t):
        return self.W_pdf(theta[0],t) * self.b_pdf(theta[1],t)
    
    
    def sample_theta(self, t):
        W = scipy.stats.norm.rvs(loc=self.W_mean(t), scale=self.sigma_W)
        b = scipy.stats.norm.rvs(loc=self.b_mean(t), scale=self.sigma_b)
        return W, b

    def sample_policy(self, t):
        W, b = self.sample_theta(t)
        # 'linear_policy' is expected to be defined elsewhere: it is not part of this notebook section
        return linear_policy(W,b)
    
    
    #def update_params(self, delta_params):
        
        # TO DO
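
The `play` and `update_params` methods are left as TO DO, so the classes are not yet wired together. As a rough, standalone sketch of how a parameter-based hyperpolicy could be scored against the environment, the loop below follows the scalar-$\theta$ formulation from the problem definition rather than the classes above. The helper names `sinusoid` and `rollout`, the tuple layout `(A, phi, psi, B)`, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def sinusoid(t, A, phi, psi, B):
    """A * sin(phi * t + psi) + B, evaluated elementwise on t."""
    return A * np.sin(phi * t + psi) + B

def rollout(rho_env, rho_hyper, sigma_x=1.0, sigma_theta=0.1,
            n_steps=2000, seed=0):
    """Average reward collected by a parameter-based hyperpolicy over n_steps."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    mu_x = sinusoid(t, *rho_env)                        # context mean at each step
    x = rng.normal(mu_x, sigma_x)                       # x_t ~ N(mu_x(t), sigma_x)
    theta = rng.normal(sinusoid(t, *rho_hyper), sigma_theta)
    action = x > theta                                  # deterministic step policy
    correct = action == (x > mu_x)
    rewards = np.where(correct, 1.0, -1.0) * np.abs(x - mu_x)
    return rewards.mean()

rho_env = (1.0, 0.05, 0.0, 0.0)
matched = rollout(rho_env, rho_hyper=rho_env)                    # hyperpolicy tracks the context mean
shifted = rollout(rho_env, rho_hyper=(1.0, 0.05, np.pi, 0.0))    # opposite phase
print(matched, shifted)
```

A hyperpolicy whose sinusoid matches the one of the environment should collect a higher average reward than one that is out of phase, which is the behaviour a learning rule for $\rho$ would try to reach.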