# RL WCST Agent 001

The first and simplest RL agent, showcasing the basic structure.

# Model Development

1. Consider the simplest possible model:
    - 1 agent learning the WCST over time
    - no Bayes
    - no data
 
2. Fitting to data == inferring the learning rate.
    - Optimisation techniques? 
    - Performance criterion?

3. Additional Experiments
    - Capture variation **between** individuals
    - Model neurocorrelates

4. Hierarchical Bayes
    - Capture variation **across** individuals

## Optimality

The optimal learning behaviour would be to simply choose the last correct action. This would equate to an update equation that totally saturates previous information & essentially only considers the most recent bit of information.

In [1]:
# !conda activate dynocog
# !conda init
# !conda install pandas -y
# !conda install -c pytorch pytorch -y
# !conda install -c conda-forge numpyro
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import torch
import sys
from tqdm import tqdm
import plotly.express as px
import plotly.graph_objects as go

In [2]:
# ---- Hyperparameters ----x
rules   = ['shape', 'number', 'color']
colors  = ['yellow', 'red', 'green', 'blue']
shapes  = ['star', 'triangle', 'circle', 'cross']
numbers = [1,2,3,4]
cards   = ['card 1', 'card 2', 'card 3', 'card 4']
matching_cards = {
  'card 1': {'color': 'red', 'shape':'circle', 'number':1},
  'card 2': {'color': 'green', 'shape':'triangle', 'number':2},
  'card 3': {'color': 'blue', 'shape':'cross', 'number':3}, 
  'card 4': {'color': 'yellow', 'shape':'star', 'number':4}}

# Build RL Environment

Now we build the RL environment that can be used to train the parameters.

In our model the RL environment reflects WCST.


## Data Structure

Dictionaries give fast lookup by key in Python, so we structure each sample as a dictionary. Each environmental sample has three components:
- Current card: the card drawn
- Target card: the correct matching option
- Rule: the matching rule

Each of these is captured in a dictionary & each observation is stored in a dictionary called 'item'. Each item is then appended to a list (giving us a list of observations), which is later converted into an enumerable object.

Return: a list of observations & outputs representing the RL environment.
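For illustration, a single observation could look like the sketch below. The card values here are made up for the example and are not taken from the data file; only the three keys follow the structure described above.

```python
# Hypothetical observation; the values are illustrative only.
example_item = {
    'curr_card':   {'color': 'blue', 'shape': 'circle', 'number': 2},
    'target_card': {'color': 'red',  'shape': 'circle', 'number': 1},
    'rule':        'shape',
}

# The environment is simply a list of such dictionaries.
example_items = [example_item]

# Under the active rule, the current card and the target card share an attribute.
first_rule = example_items[0]['rule']
is_match = example_item['curr_card'][first_rule] == example_item['target_card'][first_rule]
```

Here `is_match` is `True` because both cards are circles and the rule is `'shape'`.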

In [3]:
# Read the raw trial log; the context manager closes the file for us.
with open("../data/wcst.txt", "r") as f:
    lines = f.readlines()

all_items = []
for l in lines:
    
    # ---- target card ----x
    ind = 'card ' + l[18]

    # ---- current card ----x
    curr_card = l[0:16].strip()
    curr_card = re.split('([0-9]+)', curr_card)
    curr_card = {'color': curr_card[2], 'shape':curr_card[0], 'number':curr_card[1]}

    # ---- matching rule + target card ----x
    rule = (l[27:27+8]).strip().replace('\"','')    
    target_card = matching_cards[ind]

    # ----- list current + target cards ----x
    item = {'curr_card': curr_card, 'target_card':target_card, 'rule':rule}
    all_items.append(item)

# RL Environment

Now we compile these observations in a class to create all the functionality we require.

## Experience Replay (Replay Memory)
- **sample_environment**: We sample random observations to decouple the samples from their original order; this is known as experience replay in the literature.
- **batch**: Return the next $n$ observations in order.
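The difference between the two sampling modes can be sketched on a toy list. The names `items`, `replay` and `ordered` are placeholders for this example, and the seed is only there to make the sketch reproducible.

```python
import numpy as np

items = list(range(10))            # stand-in for the list of observations
rng = np.random.default_rng(0)     # seeded only for reproducibility

# Experience replay: draw indices uniformly at random, with replacement.
replay_idx = rng.integers(0, len(items), size=5)
replay = [items[i] for i in replay_idx]

# In-order batch: take the next n items, wrapping around at the end of the data.
start, n = 8, 5
ordered = [items[(start + k) % len(items)] for k in range(n)]
```

With `start = 8` the ordered batch wraps around the end of the list, which is the cyclic behaviour the in-order method is meant to have.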


In [4]:
class WCST_Environment:
    def __init__(self, all_items):
        self.all_items  = all_items
        self.n          = len(all_items)
        self.index      = 0     # position of the next in-order sample
        self.sample_idx = []    # history of randomly sampled indices
    
    def sample_environment(self, n_samples):
        """Return: n_samples items drawn uniformly at random (with replacement)"""
        idx = np.random.choice(self.n, n_samples)
        self.sample_idx.extend(idx.tolist())   # list.append returns None, so extend in place
        return([self.all_items[i] for i in idx])
    
    def batch(self, n):
        """Return: the next n samples, in order
           Note:
                - The function is cyclical & will return the same sequence of samples periodically if a large batch is required.
                - If one wishes to 'reset' & sample from the start either:
                    1. set self.index = 0 & sample
                    2. sample from self.all_items directly
        """
        deck = []
        nt = self.index + n
        if nt > self.n:
            loops = nt // self.n - 1    # integer division: number of full passes over the data
            res   = nt % self.n         # modulo: number of items taken from the start of the last pass
            # --- until end ---x
            deck.extend(self.all_items[self.index:])
            # --- add 'loops' full data loops ----x
            for _ in range(loops):
                deck.extend(self.all_items)
            # --- remaining data ---x
            deck.extend(self.all_items[:res])
            self.index = res

        else: 
            deck = self.all_items[self.index:nt]
            self.index = nt
        return(deck)

In [5]:
wcst = WCST_Environment(all_items)
batch = wcst.batch(7550)
wcst.index

50

# Defining Computation

RL Class should contain:
- Environment
- Model (given trained parameters)
- Training Loop
- Trainable Parameters
- Graphics (training processes etc)

There are two types of trainable model parameters:
- Parameter values: torch.tensor()
- Parameter distributions: numpy object()


# Model

We are working with a bandit as we only have $1$ state. We wish to maximise return by selecting the action that yields the highest expected return, that is, our policy $\pi(\alpha)$ is to select the action $\alpha$ with the highest expected reward $Q_t(\alpha)$ at time $t$:

$$\pi(\alpha) = \underset{\alpha}{\operatorname{argmax}} \; Q_t(\alpha)$$

where the action space is given by the decision rule:

$$\alpha \in \{\text{shape}, \text{colour}, \text{number}\}$$
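A minimal sketch of this greedy policy is given below. Breaking ties uniformly at random is an assumption made here so that equal $Q$ values do not always favour the first action; `greedy_action` is a hypothetical helper name.

```python
import numpy as np

def greedy_action(q_values, rng):
    """Return an action with maximal Q, breaking ties uniformly at random."""
    actions = list(q_values)
    q = np.array([q_values[a] for a in actions])
    best = np.flatnonzero(q == q.max())     # indices of all maximisers
    return actions[rng.choice(best)]

rng = np.random.default_rng(0)
q_values = {'shape': 0.0, 'colour': 1.0, 'number': 0.0}
chosen = greedy_action(q_values, rng)       # 'colour' is the unique maximum
```

When all values are equal (for example at initialisation) the function returns each action with equal probability.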

----- 
The simplest calculation of this would be to let the expected return of each action be the sample average return from trying that action. That is to say:

$$Q_t(\alpha) = \frac{1}{t-1}\sum_1^{t-1}r_i$$

where $r_t$ is the binary reward received after taking an action:

$$r_t \in\{0,1\}$$

Note: for faster computation & more efficient memory, this average can be written recursively, where $n$ is the number of returns observed so far:

$$Q_t = Q_{t-1} + \frac{1}{n}[r_t - Q_{t-1}]$$
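As a quick check, the recursive form reproduces the ordinary sample mean. The reward sequence below is arbitrary and only serves the example.

```python
rewards = [1, 0, 0, 1, 1]   # arbitrary binary rewards for the example

q = 0.0
for n, r in enumerate(rewards, start=1):
    # Q_t = Q_{t-1} + (1/n) [r_t - Q_{t-1}]
    q = q + (1.0 / n) * (r - q)

sample_mean = sum(rewards) / len(rewards)
```

After the loop `q` and `sample_mean` agree up to floating point error, without the full reward history ever being stored.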

Two problems arise:
- This ignores non-stationarity (the rule changes) and as such more recent feedback must carry more weight.
- This ignores the interaction effect of the game, knowledge about one action actually gives great info about the other actions as they are mutually exclusive. Should we encode this? It can be learnt through more aggressive updating.

The $\frac{1}{n}$ term effectively weights all returns $r_i$ equally. Thus these issues can be mitigated by overweighting newer returns & underweighting older returns. This can be achieved by setting this updating parameter to a constant $\lambda$:

$$Q_t = Q_{t-1} + \lambda[r_t - Q_{t-1}]$$
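To see why a constant $\lambda$ overweights recent returns, the recursion can be unrolled. This is a standard identity, written here with $Q_0$ as the initial value and assuming the same action is updated at every step:

$$Q_t = (1-\lambda)^t Q_0 + \lambda\sum_{i=1}^{t}(1-\lambda)^{t-i} r_i$$

For $0 < \lambda < 1$ the weight on $r_i$ decays geometrically with its age $t-i$, so older returns fade while the most recent return always carries weight $\lambda$.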

-----

Recall that $r_t \in\{0,1\}$, thus:

If $r_t = 1$ (correct choice):

$$
\begin{equation}
\begin{split}
    Q_t &= Q_{t-1} + \lambda[r_t - Q_{t-1}] \\
    Q_t &= Q_{t-1} + \lambda[1 - Q_{t-1}] \\
    Q_t &= \lambda + (1-\lambda)Q_{t-1}
\end{split}
\end{equation}
$$

Or, if $r_t = 0$ (incorrect choice):

$$
\begin{equation}
\begin{split}
    Q_t &= Q_{t-1} + \lambda[r_t - Q_{t-1}] \\
    Q_t &= Q_{t-1} + \lambda[0 - Q_{t-1}] \\
    Q_t &= (1-\lambda)Q_{t-1} 
\end{split}
\end{equation}
$$

It's easy to see that $\lambda=1$ (the extreme case) would produce the desired response:


If $r_t = 1$ & $\lambda=1$:

$$
\begin{equation}
\begin{split}
    Q_t &= \lambda + (1-\lambda)Q_{t-1} \\
    Q_t &= 1 
\end{split}
\end{equation}
$$

Or, if $r_t = 0$ & $\lambda=1$:

$$
\begin{equation}
\begin{split}
    Q_t &= (1-\lambda)Q_{t-1} \\
    Q_t &= 0
\end{split}
\end{equation}
$$

Which is correct as only one choice can be the correct choice. This is the extreme weighting, where we only take the value of the most recent return.
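The two cases can be checked numerically. The sketch below compares $\lambda=1$ with an intermediate value; the reward sequence and the choice of $0.5$ are assumptions for illustration only.

```python
def update(q_prev, reward, lam):
    """One step of Q_t = Q_{t-1} + lam * (r_t - Q_{t-1})."""
    return q_prev + lam * (reward - q_prev)

rewards = [1, 1, 0, 1, 0]          # arbitrary binary rewards
q_full, q_half = 0.0, 0.0
trace_full, trace_half = [], []
for r in rewards:
    q_full = update(q_full, r, 1.0)    # extreme weighting
    q_half = update(q_half, r, 0.5)    # intermediate weighting
    trace_full.append(q_full)
    trace_half.append(q_half)
```

With $\lambda=1$ the trace simply copies the reward sequence, while with $\lambda=0.5$ each value is a blend of the new reward and the previous estimate.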


-----

# Resampling

The agent should sample actions from a list & remove incorrect options, to ensure the agent cycles through all possible actions.
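One possible way to sketch this elimination idea is shown below. It is not part of the agent defined later in this notebook; `update_candidates` is a hypothetical helper, and refilling the candidate list after every option has failed is an assumption about how a rule change would be handled.

```python
import random

def update_candidates(candidates, action, reward, all_actions):
    """Return the actions still worth trying after observing one reward."""
    if reward == 1:
        return [action]                      # correct: stick with this action
    remaining = [a for a in candidates if a != action]
    if not remaining:                        # everything failed: the rule must have changed
        remaining = [a for a in all_actions if a != action]
    return remaining

rng = random.Random(0)
all_actions = ['colour', 'shape', 'number']
candidates = list(all_actions)
true_rule = 'number'                         # hidden rule for this toy run
history = []
for _ in range(3):
    action = rng.choice(candidates)
    reward = int(action == true_rule)
    candidates = update_candidates(candidates, action, reward, all_actions)
    history.append((action, reward))
```

With three actions and a fixed rule, the agent is guaranteed to have found the correct action by the third trial, whichever order it samples in.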

-----

# RL Instance
On the implementation design:
- Most memory + compute constraints arise from the Bayesian model, as such we wish to store lists of the Q-values etc so that we can monitor the algorithm over time.
- All $Q$ values are initialized at $0$.

In [150]:
wcst = WCST_Environment(all_items)


class RL_Instance:
    """Defining the RL model
        # REQUIREMENTS
        # - Model (given trained parameters)
        # - Training Loop
        # - Trainable Parameters
        # - Graphics (training processes etc)
    """
    def __init__(self, wcst_instance):
        self.wcst = wcst_instance
        self.actions  = ['colour', 'shape', 'number']
        self.Q = {'colour': [0],
                  'shape' : [0],
                  'number': [0]}
        self.data = None
        # self.sample_actions = ['colour', 'shape', 'number']
        # --- ground truth ---x
        self.ground_truth = [i['rule'] for i in self.wcst.all_items]
        self.rule_changes = [self.ground_truth[i-1]!=self.ground_truth[i] for i in range(1,len(self.ground_truth))]
        self.rule_changes.insert(0, False)
        self.change_idx = np.where(self.rule_changes)[0]
        # --- track performance ---x
        self.a_t = []
        self.acc = []
        self.lam = 1

    def forward(self, batch):
        # --- pass through the model != training ---x
        # NOTE: the action labels must match the 'rule' strings parsed from the data file.
        Q_t = [self.Q[a][-1] for a in self.actions]
        # greedy action with random tie-breaking (np.argmax would always pick the first maximum)
        inds = np.flatnonzero(np.array(Q_t) == max(Q_t))
        at   = self.actions[np.random.choice(inds)]

        ch = [at == b['rule'] for b in batch]       # True where the chosen rule is correct
        self.acc.extend(ch)
        self.a_t.extend([at] * len(ch))
        return({'matches': ch, 'action': at})
    
    def train(self, batch, lam=None):
        # --- train on the batch ----x
        if lam is not None: self.lam = lam
        for b in tqdm(batch): 
            forward = self.forward([b])
            at = forward['action']                                    # action taken
            rt = int(forward['matches'][0])                           # binary reward r_t
            Qt = self.Q[at][-1] + self.lam*(rt - self.Q[at][-1])      # update Qt of choice
            self.Q[at].append(Qt)                                     # append to Q value


    def visualize_training(self):
        self.data = pd.DataFrame({
            'sample':range(len(self.acc)), 
            'accuracy': [int(a) for a in self.acc], 
            'action_taken': self.a_t})
        fig = go.Figure() 

        fig.add_trace(go.Scatter(x=self.data['sample'], y=self.data['accuracy'], name='Low 2007',
                                line = dict(color='steelblue', width=1, dash='solid')))
        for c in self.change_idx:
            fig.add_vline(x=c, line_width=1, line_dash='dash', line_color='#800000')
        fig.update_layout(
            template='plotly_white',
            title='RL Agents Performance',
            width=900, height=500,
            legend=dict(
                yanchor='bottom', xanchor='right',
                y=0.99,
                x=0.01
            ))
        fig.show()

In [151]:
wcst = WCST_Environment(all_items)
batch = wcst.batch(100)

rl = RL_Instance(wcst_instance=wcst)
# rl.forward(batch)
# rl.a_t

rl.train(batch)
rl.visualize_training()

100%|██████████| 100/100 [00:00<00:00, 34484.12it/s]


In [152]:
rl.visualize_training()

# Next Steps

How do we get from this baseline model to:
- Incorporating experimental data 
- Capturing information from the other experiments
- Capturing information from the other participants (Hierarchical Bayes)

## Modeling Data

Essentially, fitting this model to the data is equivalent to training the parameter $\lambda$, i.e. **the rate at which the action value approximations $Q_t(\alpha)$ are updated: the HUMAN LEARNING RATE**

$$Q_t = Q_{t-1} + \lambda[r_t - Q_{t-1}]$$

#### RPE - Reward Prediction Error: $r_t - Q_{t-1}$
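As a small illustration of this term, the sketch below records the prediction error at each step of the update. `rpe_trace` is a hypothetical helper, and the rewards and $\lambda=0.5$ are arbitrary choices for the example.

```python
def rpe_trace(rewards, lam, q0=0.0):
    """Return the reward prediction errors r_t - Q_{t-1} for a reward sequence."""
    q, rpes = q0, []
    for r in rewards:
        delta = r - q          # reward prediction error
        q = q + lam * delta    # Q_t = Q_{t-1} + lam * RPE
        rpes.append(delta)
    return rpes

rpes = rpe_trace([1, 1, 0], lam=0.5)
```

The error shrinks while the reward is repeated and turns negative as soon as an expected reward fails to arrive.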

All subsequent steps are to the same effect: to add information about this learning rate.
- Additional experiments can be captured to add information about different learning rates between individuals
- The Hierarchical Bayes can be used to capture variation (information) across individuals


## Problem

The model is too simple! 

_It is completely reasonable to assume that all participants will act with a learning rate of $\lambda=1$: if they understand the rules, they will simply default to the last correct choice, i.e. $\lambda=1$._

As such:

_*No additional statistical attributes will capture additional information or variation. Are the additional experiments superfluous?*_

## MSc Project Solution + Additional paper

Abstract the model design phase to building statistical tools/machinery to handle different types of psychological data.

1. How to handle RL learning problems
2. How to capture variation across the sample (Bayes)
3. Additional Statistical Machinery



# Final Remarks
 - EEG data & follow-on work? Can I assist early?
    - Format of the data?
 - PhD application in December: happy to help on any applications/modeling/programming design tasks for coauthorship.
 - @Jonathan & @Allan to inform modeling direction.

# Thoughts?