This document is a review of the literature on methods for addressing contextual, non-stationary bandit problems. Sources currently summarized:

1.  Q. Wu, N. Iyer, and H. Wang, “Learning Contextual Bandits in a Non-stationary Environment,” The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval - SIGIR ’18, pp. 495–504, 2018.
2.  N. Hariri, B. Mobasher, and R. D. Burke, “Adapting to User Preference Changes in Interactive Recommendation,” in IJCAI, 2015.
3.  O.-C. Granmo and S. Berg, “Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters,” in Trends in Applied Intelligent Systems, 2010, pp. 199–208.
4.  J. Vermorel and M. Mohri, “Multi-armed Bandit Algorithms and Empirical Evaluation,” in Machine Learning: ECML 2005, 2005, pp. 437–448.

Not yet summarized but will be eventually:
-   H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford, “Efficient Contextual Bandits in Non-stationary Worlds,” arXiv:1708.01799 [cs, stat], Aug. 2017.
-   D. Russo and B. Van Roy, “Learning to Optimize via Information-Directed Sampling,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 1583–1591.
-   K. Chaloner and I. Verdinelli, “Bayesian Experimental Design: A Review,” Statist. Sci., vol. 10, no. 3, pp. 273–304, Aug. 1995.
-   V. Raj and S. Kalyani, “Taming Non-stationary Bandits: A Bayesian Approach,” arXiv:1707.09727 [cs, stat], Jul. 2017.
-   B. Dumitrascu, K. Feng, and B. E. Engelhardt, “PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits,” arXiv:1805.07458 [cs, stat], May 2018.

# Learning Contextual Bandits in a Non-stationary Environment
By: Q. Wu, N. Iyer, and H. Wang

**Environment: Piecewise stationary**

**Method: UCB-based Changepoint detection (dLinUCB)**
-   General idea:
    -   Maintain multiple candidate models and estimate distribution of error for each.
    -   Prediction error for each candidate model is estimated based on last set of observations in a sliding window.
    -   The model chosen for the current policy is the one with the lowest LCB on its estimated error (choosing the highest would select the worst-looking model).
-   Other details:
    -   Model abandonment: If model error exceeds its UCB on error, it is discarded.
    -   Model creation: If no suitable models remain, a new one is created.
    -   Model updating: Each candidate model is updated with data from the time of its creation to the time its error exceeds its UCB.
-   In this paper, the LinUCB model is used for each of the candidate models, but the method can be generalized to other models.

**Details:**

For each slave bandit model $m$, define a binary random variable $e_i(m)$. This indicates whether the slave model's prediction error at time $i$ exceeds its confidence bound:

$$e_i(m) = \mathbb{1}\{ |\hat{r}_i(m) - r_i| > B_i(m, a_i) + \epsilon \}$$

where:
-   $B_i(m, a_i)$ = upper confidence bound of model $m$ reward estimation (method-specific, see paper for LinUCB equation)
    -   note, the dependence on the arm $a_i$ assumes the arm has some set of features and users have preferences for those
-   $|\hat{r}_i(m) - r_i|$ is the error in the reward estimation, where $r_i$ is the observed reward at time $i$.
-   $\epsilon = \sqrt{2}\,\sigma\,\mathrm{erf}^{-1}(\cdot)$ is a high-probability bound on the Gaussian noise in the received feedback (the argument of $\mathrm{erf}^{-1}$ is omitted here; see the paper).
    -   $\mathrm{erf}^{-1}$ is the inverse of the Gauss error function.

According to Theorem 3.1 in the paper, for a linear model $m$, if the underlying environment is stationary, $\mathbb{P}(e_i(m) = 1) \le \sigma_1$, where $\sigma_1 \in (0, 1)$ is a hyper-parameter in $B_i(m, a_i)$. In practice, the error at time $t$ is estimated as the mean of the indicators $e_i(m)$ over the last $\tau$ observations (a $\tau$-sized sliding window): $\hat{e}_t(m) = \frac{1}{\hat{\tau}(m)} \sum_{i=t-\hat{\tau}(m)}^{t} e_i(m)$, where $\hat{\tau}(m) = \min\{t - t_m, \tau\}$ and $t_m$ is the time at which the model was created.
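
To make the indicator and the sliding-window estimate concrete, here is a minimal Python sketch. The class name `BadnessTracker` and the arguments `bound` and `eps` are hypothetical; the confidence bound $B_i(m, a_i)$ and $\epsilon$ are assumed to be computed elsewhere and passed in. The window here holds at most $\tau$ indicators, which is a simplification of the summation limits written above.

```python
from collections import deque


class BadnessTracker:
    """Sliding-window estimate of how often a model's error leaves its bound.

    Hypothetical helper for illustration; not the paper's implementation.
    """

    def __init__(self, tau):
        # Keep only the most recent `tau` indicator values e_i(m).
        self.window = deque(maxlen=tau)

    def update(self, predicted, observed, bound, eps):
        # e_i(m) = 1 if |r_hat - r| > B_i(m, a_i) + eps, else 0.
        e = int(abs(predicted - observed) > bound + eps)
        self.window.append(e)
        return e

    def badness(self):
        # Mean of the indicators currently in the window (0.0 if empty).
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)
```

A model whose `badness()` stays near or below the hyper-parameter from Theorem 3.1 is behaving as expected under stationarity; a value well above it suggests the environment has changed for that model.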

**Regret Analysis: $O(\Gamma_T \sqrt{S_{\max}} \log S_{\max})$**
-   $\Gamma_T$ = total number of ground-truth environments up to time $T$.
-   $S_{\max}$ = longest stationary period up to time $T$.
-   Authors state: "This arguably is the _lowest_ upper regret bound any bandit algorithm can achieve in such a non-stationary environment without further assumptions."
-   Not too impressive: many assumptions are unnecessary, and complete independence across environments seems to be one of those.

**Experiments:**
-   Compared against:
    1.  LinUCB
    2.  adaptive Thompson Sampling (adTS): changepoint detection module
    3.  Windowed mean-shift detection algorithm (WMDUCB1): UCB1-type with changepoint detection
    4.  Meta-Bandit algorithm: switches between two UCB1 models
    5.  Collaborative filtering with bandit learning (CLUB) -- on the real-world datasets
-   Datasets: 1 synthetic and 2 real-world
    -   Synthetic:
        -   Size-$K$ ($K$ = 1000) arm pool, with each arm associated with a $d$-dimensional feature vector $x_a$.
        -   Also $\theta^*$, the global user preference vector over the $d$ features.
        -   All parameters drawn from Uniform(0, 1).
        -   Rewards are generated as $x_a^\top \theta^*$ and corrupted by Gaussian noise before being returned to the learner.
        -   After $S$ rounds, $\theta^*$ is randomized until a specified distance $\Delta$ between rewards returned is achieved for some specified proportion $\rho$ of the arms.
        -   All algorithms executed for 5000 iterations.
        -   Accumulated regret used to evaluate each.
        -   **Ranking:** (1) dLinUCB, (2) adTS, (3) LinUCB, (4) Meta-Bandit, (5) WMDUCB1
    -   Real-world 1: Yahoo! Today Module
        -   Clickstream dataset with 45,811,883 user visits in ten-day period in May 2009.
        -   For each visit, both user and 10 candidate articles are associated with feature vector of six dimensions.
        -   Optimizing for CTR.
        -   Replay is used to evaluate all methods, based on CTR normalized by logged random strategy's CTR.
        -   Two modes of testing:
            1.  Estimate latent user preferences for items with 2 variants: (1) non-personalized: assume all users share same preferences, (2) personalized: assume each user has own latent preferences.
                -   **Ranking:** (1) dLinUCB, (2) UCB, (3) adTS & CLUB basically tied
            2.  Estimate article popularity over time
                -   **Ranking:** (1) dLinUCB & LinUCB tied, (2) adTS
    -   Real-world 2: LastFM & Delicious
        -   HetRec 2011 workshop data
        -   LastFM contains 1892 users and 17632 items (artists) -- listened artists = positive feedback
        -   Delicious contains 1861 users and 69226 items (URLs) -- bookmarked URLs = positive feedback
        -   For both, tags preprocessed into TF-IDF vectors, then PCA to reduce to 25 principal components
        -   $K$ fixed to 25, with 1 of 25 picked to be positive for a particular user
        -   Followed Hartland et al. (Multi-armed Bandit, Dynamic Environments, and Meta-Bandits) to simulate non-stationarity.
        -   **Ranking:** (1) dLinUCB, (2) adTS, (3) LinUCB
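
The synthetic environment described above can be roughly sketched as follows, assuming NumPy. The function names and the default values for `d`, `S`, `noise_sd`, `delta`, and `rho` are placeholders rather than the paper's settings, the uniformly random arm choice stands in for a real policy, and the rejection loop is only one reading of "randomized until a specified distance is achieved."

```python
import numpy as np


def resample_theta(rng, arms, theta_old, delta, rho, max_tries=10_000):
    """Redraw theta until at least a fraction `rho` of the arms change
    their expected reward by more than `delta` (assumed interpretation)."""
    for _ in range(max_tries):
        theta_new = rng.uniform(0.0, 1.0, size=arms.shape[1])
        changed = np.abs(arms @ theta_new - arms @ theta_old) > delta
        if changed.mean() >= rho:
            return theta_new
    raise RuntimeError("no theta satisfied the (delta, rho) condition")


def simulate(K=1000, d=5, S=500, T=5000, noise_sd=0.05,
             delta=0.1, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    arms = rng.uniform(0.0, 1.0, size=(K, d))   # feature vectors x_a
    theta = rng.uniform(0.0, 1.0, size=d)       # global preferences theta*
    rewards = np.empty(T)
    change_points = []
    for t in range(T):
        # Every S rounds the environment switches to a new theta*.
        if t > 0 and t % S == 0:
            theta = resample_theta(rng, arms, theta, delta, rho)
            change_points.append(t)
        a = rng.integers(K)                     # placeholder policy
        # Linear reward corrupted by Gaussian noise.
        rewards[t] = arms[a] @ theta + rng.normal(0.0, noise_sd)
    return rewards, change_points
```

Swapping the placeholder policy for a bandit algorithm and tracking the gap to the best arm's expected reward would give the accumulated-regret curves used in the comparison.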

**Thoughts:**
-   Using a Bayesian approach, we could estimate these error distributions using the posterior predictive distributions.
-   Another option would be to use the likelihood on the past $\tau$ samples. Perhaps when this drops low enough, discard the model? I'm not sure what the bound would be on the likelihood that is equivalent to the bound given in theorem 3.1.
-   It's not entirely clear how much of the benefit comes from the improved changepoint detection and how much comes from maintaining multiple candidate models.
-   dLinUCB showed marked improvement over adTS on the first two datasets but nearly tied with it on the last two. The Yahoo! dataset actually seems closest to the domain I'm interested in, so this may not be concerning. It doesn't seem to be cherry-picking, but it would be interesting to try to tease out why this is the case.

# Adapting to User Preference Changes in Interactive Recommendation
By: Hariri, Mobasher, and Burke

**Environment: Piecewise stationary**

**Method: TS-based Changepoint detection (adTS)**
-   General idea:
    -   Changepoint detection through combo of CUSUM charts and bootstrap (from Wayne, 2000 - change-point analysis...)
    -   Use two sliding intervals of fixed length $N$.
    -   Fit two models, one on each, and compare their distributions using Mahalanobis distance.
    -   Changepoint detection method produces confidence of changepoint. Pre-defined threshold used to trigger detection above certain confidence.
    -   Accept detected changepoint if there are at least L (look-ahead parameter) points after the change.
-   Other details:
    -   Start checking for changepoints after S (splitting threshold parameter) time points of observations.
    -   What happens when change is detected? Options are to combine data before and after in some way or just discard before. The method in this paper discards before.
    -   Use conjugate Bayesian multivariate Gaussian regression for estimating rewards and Thompson Sampling for policy.
    -   Represent both users and items as set of features. Factorize items using PCA.
    -   Learn latent user preferences and detect changes in these.
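
For intuition about the CUSUM-plus-bootstrap step, here is a minimal sketch of that style of change-point analysis on a univariate series, assuming NumPy. The function names are hypothetical, and how adTS applies the test to its multivariate reward model (via the Mahalanobis distance) is not captured here.

```python
import numpy as np


def cusum_range(x):
    """Range (max - min) of the cumulative sums of deviations from the mean."""
    s = np.concatenate(([0.0], np.cumsum(x - x.mean())))
    return s.max() - s.min()


def changepoint_confidence(x, n_boot=1000, seed=0):
    """Fraction of shuffled copies whose CUSUM range is smaller than the
    observed one. Shuffling destroys any ordering, so a high fraction
    suggests the observed ordering contains a change."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = cusum_range(x)
    smaller = sum(cusum_range(rng.permutation(x)) < observed
                  for _ in range(n_boot))
    return smaller / n_boot


def estimate_changepoint(x):
    """Index of the last point before the change: where |CUSUM| peaks."""
    x = np.asarray(x, dtype=float)
    s = np.cumsum(x - x.mean())
    return int(np.argmax(np.abs(s)))
```

A detection would be triggered when `changepoint_confidence` exceeds the pre-defined threshold, and then accepted only if enough points (the look-ahead $L$) follow the estimated index.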

**Experiments:**
-   Dataset description:
    -   Yahoo! Music ratings dataset version 1.0. Over 10M ratings of musical artists over a one-month period prior to March 2004.
    -   Ratings of 0 to 100 given by 1,948,882 users to 98,213 artists.
-   Two goals: evaluate accuracy of changepoint detection and compare to conventional recommenders.
-   Evaluation method: 5-fold CV: one fold for testing, one for tuning, three for training.
    -   Artists not rated were given rating of overall mean rating across dataset.
-   Changes simulated by merging two users together, providing ratings from one, then switching to the other after $T = 30$ rounds.
-   Filtered test users so each had at least $T$ ratings in user's profile.
-   Compared against:
    1.  User-based kNN (k = 10)
    2.  Standard Beta-binomial TS
    3.  Optimal recommender (always serves item with highest rating, knows the changepoint exactly)
-   **Ranking:** (1) Optimal (2) adTS, (3) Standard TS, (4) User-based kNN
    -   As expected, adTS was no better than standard TS until after the changepoint.
    -   It probably would have looked much better after a few changepoints.

**Thoughts:**
-   The experiments in this paper were very limited. It would have been much more interesting to try out varying hybrid user sizes. For instance, merge 2-5 users and see how the method fares when there are more changepoints.
-   It also seems like the method of merging users into hybrid users should have ensured actual differences between users were present. Otherwise the simulated change might actually not have been a real change in preferences.
-   Overall, it seems like this method is a serious hack-job. It has _so_ many parameters.
    1.  Interval size $N$
    2.  Limits on both how many points ahead (Look-ahead $L$) and behind (splitting threshold $S$) before detecting
    3.  Number of latent features $d$
    4.  Change detection confidence (set to 0.95 for all experiments in paper)
    5.  Several model hyperparameters (though these can be set to be uninformative or weakly-informative in the usual Bayesian manner)
-   Also, the authors conduct no sensitivity analyses, so it's unclear how much their choices were cherry-picked. The relatively competitive performance of adTS in the experiments by Wu, Iyer, and Wang alleviates these concerns somewhat.
-   It is interesting to think of segments/audiences as "users." Then we can try to factorize the set of all experiences, or "items." Grouping users into segments provides a means to use collaborative filtering. So this approach may be a feasible alternative to content tagging.

# Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters
By: Granmo and Berg

**Environment: Constantly Changing (Drifting) with Normally Distributed Rewards**

**Method: Kalman Filtered Conjugate Gaussian Estimation (KF-MANB)**
-   Presents a TS algorithm for normal rewards that incorporates a Kalman filter on all value model parameters.
    -   Although authors don't appear to be aware that they're doing TS after estimating the distributions.
    -   They state in a footnote "To the best of our knowledge, the concept of having automata choose actions based on the _order of statistics_ of instances of estimate distributions, has been unreported in the literature."
-   Must provide observation and transition noise parameters, but sensitivity analysis shows method is robust to settings.
-   **Not contextual**; estimates single mean and variance for each arm at each time point.
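
A minimal sketch of this kind of scheme, assuming NumPy: each arm keeps a Gaussian belief (mean and variance) that is updated with a scalar Kalman filter under a random-walk model, and arms are chosen by sampling from those beliefs. The class name, the prior values, and the attribute names are assumptions for illustration, not the authors' code.

```python
import numpy as np


class KalmanGaussianBandit:
    """Per-arm scalar Kalman filter with Thompson-style arm selection."""

    def __init__(self, n_arms, obs_var, trans_var,
                 prior_mean=0.0, prior_var=1e4, seed=0):
        self.mu = np.full(n_arms, prior_mean, dtype=float)
        self.var = np.full(n_arms, prior_var, dtype=float)
        self.obs_var = obs_var      # observation noise variance
        self.trans_var = trans_var  # transition (drift) noise variance
        self.rng = np.random.default_rng(seed)

    def select_arm(self):
        # Draw one sample from each arm's belief and play the largest.
        samples = self.rng.normal(self.mu, np.sqrt(self.var))
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Predict: every arm's mean may have drifted, so uncertainty grows.
        self.var += self.trans_var
        # Correct: Kalman update for the pulled arm only.
        gain = self.var[arm] / (self.var[arm] + self.obs_var)
        self.mu[arm] += gain * (reward - self.mu[arm])
        self.var[arm] *= (1.0 - gain)
```

Because the variance of unpulled arms keeps growing by `trans_var` each round, stale arms are eventually sampled again, which is what lets the method track drifting means.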

**Experiments:**
-   Summary:
    -   On both stationary & non-stationary simulated data, this method outperforms all others.
    -   Next contenders are POKER and $\epsilon^n$-greedy ($\epsilon$ reduced on some schedule).
-   Compared to:
    1.  UCB1-Normal
    2.  Interval Estimation (UCB)
    3.  Pursuit (top-performing method from Learning Automata field)
    4.  $\epsilon^n$-greedy
    5.  POKER
-   Evaluation in terms of regret
-   Report results for 10-armed stationary and non-stationary simulated datasets
    -   Ensemble of 1000 independent replications with different RNGs used
    -   In each replication, 2000 arm pulls conducted
    -   Set difference in reward distribution means to be 50.0 and space them evenly
    -   Simulated several variations with increasing observation noise and fixed transition noise (stationary) for wide range of signal-to-noise ratio (12.5 to 200 grid)
        -   All other methods do much worse with higher observation noise settings
    -   Also simulated several variations with increasing transition noise and fixed observation noise (0 to 200 grid)
        -   Even the next-best method (POKER) is much worse with even a small amount of transition noise.
    -   **Ranking:** (1) KF-MANB (2) POKER, (3) $\epsilon^n$-greedy with $c=0.3$, other settings not much worse

**Thoughts:**
-   Very promising method for continuously changing environments.
-   Interesting to note how poorly the other methods which are designed for changepoint detection do in drifting environments.
-   The term "sibling" in "Sibling Kalman Filters" in the title is a bit nebulous. I wonder if the siblings are the filters for the different arms?
-   Clearly there's a gap in the literature with applying Kalman Filters to the more interesting linear models capable of working in contextual settings.

# Multi-armed Bandit Algorithms and Empirical Evaluation
By: Vermorel and Mohri

**Environment: Stationary and Non-Contextual**

**Summary:**
-   Evaluate many methods on two stationary datasets: (1) simulated mixed-duration with normal rewards and (2) CDN latency minimization network dataset.
-   Authors posit that their main contributions are an empirical evaluation of many methods and the introduction of their own, which, obviously, outperforms all the others.
-   Also provides a nice summary of existing methods.

**Method: Price of Knowledge and Estimated Reward (POKER)**
-   Based on 3 ideas:
    1.  Natural way of balancing exploration and exploitation is to assign a price to knowledge gained while pulling a particular lever. This notion of "value of information" has been studied in other domains and is often referred to as "exploration bonuses" in bandit literature. "The objective is to quantify the uncertainty in the same units as the rewards."
    2.  Properties of unobserved arms could potentially be estimated, to some extent, from those already observed. This idea motivates the use of other arm estimates in determining the expected magnitude of improvement from pulls that yield improvements.
    3.  Specification of the expected time horizon remaining provides an intuitive way of tuning exploration vs. exploitation. The value of exploration goes down when there is less time to exploit what is learned through it.
-   POKER chooses the arm with the highest estimated price.
    -   The price is the sum of the estimated mean reward and the exploration bonus.
    -   The exploration bonus is the product of the probability of an improvement, the expected reward mean improvement, and the horizon remaining.


**Details:**
-   Sketch of method:
    1.   Estimate means (normally distributed) for each arm: $\hat{\mu}_i$
    2.   Sort them largest to smallest: $\hat{\mu}_{i_1} \ge \ldots \ge \hat{\mu}_{i_q}$
    3.   Compute index of benchmark as $\sqrt{q}$, where $q$ is the number of arms with nonzero rewards so far.
    4.   Compute estimated reward improvement as $\delta_\mu = \frac{\hat{\mu}_{i_1} - \hat{\mu}_{i_{\sqrt{q}}}}{q}$.
    5.   Compute probability of improvement as $P[\mu_i \ge \hat{\mu}^* + \delta_\mu]$, where $\hat{\mu}^* = \hat{\mu}_{i_1}$ is the highest estimated mean reward.
    6.   Compute price of each arm $i$ as: $p_i = \hat{\mu}_i + P[\mu_i \ge \hat{\mu}^* + \delta_\mu]\,\delta_\mu H$, where $H$ is the horizon remaining.
    7.   Serve arm with highest price and observe feedback.
    8.   Update model mean estimates and repeat.
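
The pricing steps in the sketch above could look roughly like this in Python, assuming NumPy. This follows the formulas as written in these notes, not the paper's listing. It assumes every arm has already been observed (so $q$ is simply the number of arms), that the belief about each arm's mean is normal with standard error $\hat{\sigma}_i / \sqrt{n_i}$, and that $\sqrt{q}$ is rounded down; the function names are hypothetical.

```python
import math

import numpy as np


def normal_tail(x, mean, sd):
    """P[X >= x] for X ~ Normal(mean, sd)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def poker_prices(means, sds, counts, horizon):
    """Price of each arm: estimated mean plus exploration bonus."""
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    counts = np.asarray(counts, dtype=float)

    q = len(means)                        # assumption: all arms observed
    ordered = np.sort(means)[::-1]        # largest to smallest
    bench = max(int(math.sqrt(q)), 1) - 1 # 0-based index of the benchmark arm
    delta = (ordered[0] - ordered[bench]) / q
    best = ordered[0]

    # Probability that each arm's true mean beats the best estimate by delta.
    std_err = sds / np.sqrt(counts)
    probs = np.array([normal_tail(best + delta, m, s)
                      for m, s in zip(means, std_err)])

    # Exploration bonus = probability * improvement * remaining horizon.
    return means + probs * delta * horizon
```

The arm to serve is then `int(np.argmax(poker_prices(...)))`. With a horizon of zero the bonus vanishes and the choice reduces to pure exploitation.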

**Experiments:**
-   Simulated data
    -   all arms normally distributed; means and stdevs drawn from uniform (0, 1)
    -   data generated for 1000 arms and 10K rounds.
    -   3 configurations: 100, 1000, and 10K rounds, corresponding to fewer rounds than levers, as many rounds as levers, and more rounds than levers.
    -   idea of these configs is to evaluate in non-asymptotic cases as well as closer to asymptotic
-   Real-world: URLs Retrieval Latency
    -   Agent selects one source and waits until the data are retrieved (latency feedback); objective is minimization of total latency
    -   Home pages (arms) of 700+ universities, retrieved roughly every 10 min for about 10 days.
    -   Evaluated by running for 130 and 1300 rounds
-   **Ranking:** POKER is best in both settings. IntEstim (UCB) second best. $\epsilon$-first and $\epsilon$-greedy also do quite well. GaussMatch (TS) does well with enough rounds, but poorly when there are few rounds.

**Thoughts:**
-   Interesting tidbit in the lit review section: "A different version of the bandit problem has been studied by [10, 23, 9, 8] where the reward distributions are assumed to be known to the player. This problem is not about balancing exploration and exploitation, it admits an optimal solution based on the so-called Gittins indices."
-   Another interesting tidbit: "This paper considers the _opaque_ bandit problem where a unique reward is observed at each round, in contrast to the _transparent_ one where all rewards are observed." I hadn't heard of this distinction clarified by this opaque/transparent terminology before; it's a nice way to do it.
-   The notion of estimating properties of unobserved arms from those observed seems similar to a common-prior assumption. Another way they could be estimated is by factorizing the user and item spaces (i.e. contextual and multivariate extensions). Is there a unifying way to look at information gained that spans these problem settings?
-   **It's a good idea** to run experiments of multiple durations, as performance with fewer rounds can be _way_ different than performance after many rounds.