# RL Recommendation Pipeline MVP 1.0
This pipeline runs after the Evaluation pipeline and considers the following inputs and outputs:

Inputs:

*Context features*
1.   Demographics ✅
2.   Self-efficacy score change (composite score including emotion scores and emotion lexicon, using the output from voice analysis + lexicon indicators) ✅
3.   Conscientiousness score (conscientiousness model: content -> conscientiousness prob) ✅
4.   Temporal info (time between suggested action and most recent activity - sugg.device.since) ✅

*Feedback features*
1.   User rating (as a proxy for whether the user intends to take this action) ✅
2.   Activity mode (as a proxy for whether the user takes this action) ✅

*Arms*
1.   Messages/Actions (may consider extracting the actions from messages + other actions) ✅


Outputs:
1.   Probabilities for each action

Post processing:
1.   Pad the output probabilities with 0.2 for each unavailable action
2.   Thompson sampling
3.   Generate message for the selected action
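
The first two post-processing steps can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `pad_and_sample`, the action names, and the renormalization after padding are all assumptions. Drawing an action from the resulting distribution corresponds to Thompson sampling when the probabilities are the estimated chances that each arm is the best one.

```python
import numpy as np

def pad_and_sample(probs, all_actions, pad_value=0.2, rng=None):
    """Pad unavailable actions with pad_value, renormalize, and sample one action.

    probs: dict mapping an available action name to its model probability.
    all_actions: list of every action name (hypothetical identifiers).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Actions missing from `probs` are treated as unavailable and get pad_value.
    padded = np.array([probs.get(a, pad_value) for a in all_actions], dtype=float)
    # Assumption: renormalize so the padded vector is a valid distribution.
    padded = padded / padded.sum()
    # Probability matching: pick an action in proportion to its probability.
    choice = rng.choice(len(all_actions), p=padded)
    return all_actions[choice], dict(zip(all_actions, padded))

all_actions = ["walk", "stretch", "meditate"]   # hypothetical action names
model_probs = {"walk": 0.5, "stretch": 0.3}     # "meditate" is unavailable
action, dist = pad_and_sample(model_probs, all_actions, rng=np.random.default_rng(0))
```

Padding keeps unavailable actions from being starved of exploration; whether to renormalize or to rescale only the observed probabilities is a design choice left open here.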



For the MVP, we consider an LLM for the conscientiousness model and a finite-dimensional model (a Bayesian contextual bandit whose reward function is a nonlinear model).

For the future, we should consider (1) causal inference applied in the evaluation pipeline and (2) a deep reinforcement learning (DRL) framework.


In [None]:
!pip install monai torchinfo pytorch-metric-learning

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load requirements & connect to drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch.utils.data import DataLoader
from imblearn.over_sampling import RandomOverSampler
import os

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix , classification_report
from transformers import Trainer , TrainingArguments , BertTokenizer , BertForSequenceClassification

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords = set(stopwords.words("english"))

from transformers import AutoModelForSequenceClassification , AutoTokenizer

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# nltk.download('punkt')
device = "cuda" if torch.cuda.is_available() else "cpu"

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def _tokenize(text):
    """Lowercase, strip punctuation and digits, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    return text.split()

def preprocessing(text):
    """Return the cleaned text with English stopwords removed."""
    return " ".join(w for w in _tokenize(text) if w not in stopwords)

def count_stopword(text):
    """Count the stopwords in the cleaned text."""
    return sum(w in stopwords for w in _tokenize(text))

def count_word(text):
    """Count all words (stopwords included) in the cleaned text."""
    return len(_tokenize(text))

# Bayesian Contextual Bandit with Nonlinear Reward Function - By 5/18

We need to define the reward function and derive the prior/posterior.

What if we model the distribution as a random function?

action space: A = {a_1, ..., a_k}
R(a_i) ~ F_i

Step 0: R needs to be defined - could be a composite of multiple metrics

Step 1: Training
Given data points d_1, ..., d_n, randomly sample p times with replacement. For the j-th sample, the reward of arm a_i is modeled as
R_j(a_i) = f_ij(X_j)
...

If there is no available data to train an arm a_s, initialize its reward function from a prior:
R_j(a_s) ~ N(0, 1). The distribution can be chosen by the user, or simply be random(25th percentile of observed data, 50th percentile of observed data).
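
A rough sketch of Step 1 under stated assumptions: the nonlinear reward model is stood in for by a least-squares fit on quadratic features (any nonlinear regressor could take its place), and the names `quad_features` and `fit_bootstrap_ensemble` are hypothetical. An arm with no data in a bootstrap sample falls back to a constant drawn from N(0, 1).

```python
import numpy as np

def quad_features(X):
    """Quadratic feature map [1, x, x^2]; a stand-in for the nonlinear model."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

def fit_bootstrap_ensemble(X, actions, rewards, k, p, rng):
    """Return models[i][j], a callable reward model f_ij for arm i and sample j."""
    n = len(rewards)
    models = [[None] * p for _ in range(k)]
    for j in range(p):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        Xb, ab, rb = X[idx], actions[idx], rewards[idx]
        for i in range(k):
            mask = ab == i
            if not mask.any():
                # No data for this arm in this sample: use a N(0, 1) draw.
                c = rng.normal(0.0, 1.0)
                models[i][j] = lambda Xn, c=c: np.full(len(Xn), c)
            else:
                w, *_ = np.linalg.lstsq(quad_features(Xb[mask]), rb[mask], rcond=None)
                models[i][j] = lambda Xn, w=w: quad_features(Xn) @ w
    return models

# Synthetic example: 3 arms, but arm 2 is never observed.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
actions = rng.integers(0, 2, size=60)
rewards = X[:, 0] ** 2 + actions + rng.normal(scale=0.1, size=60)
models = fit_bootstrap_ensemble(X, actions, rewards, k=3, p=20, rng=rng)

# Reward matrix for one new individual: rows are arms, columns are samples.
x_new = rng.normal(size=(1, 2))
R = np.array([[f(x_new)[0] for f in arm] for arm in models])
```

Evaluating every fitted function on a new context yields the k*p reward matrix used in the sampling step.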

Step 2: Sample from F_1, ..., F_k and take a = argmax over a_i of (R(a_1), ..., R(a_k)).
Given a data point, sample one function from the f_ij's and estimate R_j(a_i); repeat p times.
R will be a k*p reward matrix for each individual: each row holds the p samples for one arm, and each column holds the estimated rewards (probs) of all k arms in one sample. One can use R to compute P(a_i > a_j) = sum_{s=1..p} 1[R_s(a_i) > R_s(a_j)] / p.

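R_max = R.max(axis=0)  # best reward in each sample, shape (p,)

R_ind = R == R_max  # broadcasts to shape (k, p); True where the arm attains the maximum

R_freq = R_ind.sum(axis=1)  # number of samples in which each arm wins, shape (k,)

Prob = R_freq / R_freq.sum()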
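
As a small worked example of the computation above, the sketch below turns a toy k*p reward matrix into per-arm probabilities and a pairwise comparison. The function names and the matrix values are made up for illustration; ties are broken in favor of the lowest arm index here, which is a simplifying assumption.

```python
import numpy as np

def arm_probabilities(R):
    """R has shape (k, p): rows are arms, columns are sampled reward draws."""
    R = np.asarray(R, dtype=float)
    winners = R.argmax(axis=0)                        # best arm in each sample
    counts = np.bincount(winners, minlength=R.shape[0])
    return counts / counts.sum()                      # share of samples each arm wins

def pairwise_win_probability(R, i, j):
    """Estimate P(a_i > a_j) as the fraction of samples where arm i beats arm j."""
    R = np.asarray(R, dtype=float)
    return float(np.mean(R[i] > R[j]))

# Toy matrix with k = 3 arms and p = 4 samples (illustrative values only).
R = np.array([[0.9, 0.2, 0.7, 0.4],
              [0.1, 0.8, 0.3, 0.5],
              [0.5, 0.5, 0.6, 0.1]])
prob = arm_probabilities(R)               # arms 0 and 1 each win 2 of 4 samples
p_0_beats_2 = pairwise_win_probability(R, 0, 2)
```

Sampling an action from `prob` selects each arm in proportion to how often it was the best across the p draws.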


Guess: as long as the function class used to estimate the reward function in each random sample is Donsker, f_i is Donsker, and we should be able to derive asymptotic properties (limits/convergence).