Bandit Bench

This project is a small, unscientific benchmark of algorithms for the Bernoulli Multi-Armed Bandit. It benchmarks my specific use-case of short-horizon problems (500 arm pulls) with Bernoulli rewards (i.e., either there is a reward or there is not, with a given probability).

Algorithms are only included in the benchmark if

They are easy to implement
They do not depend on the time horizon explicitly
They do not need (much) parameter tuning

Algorithms

Baselines

Random Baseline (chooses arms randomly)
Greedy Baseline (chooses the arm with the maximum average reward)
ϵ-Greedy
ϵ-Decreasing
Explore Then Commit
Least Failures (then most successes, then random)

Bootstrap-based

Dueling-based

Approximations of the Gittins Index

Whittle's approximation (works best for large simulation horizon)
Brezzi and Lai's approximation

Thompson Sampling-based

ϵ-Exploring Thompson Sampling (PDF)
Information Relaxation Sampling (Finite Horizon) (simplified version that uses an assumed, fixed horizon)
Non-Parametric Thompson Sampling
Optimistic Thompson Sampling
Satisficing Thompson Sampling
Thompson Sampling with Virtual Helping Agents (C3)
Thompson Sampling
VarTS

Upper Confidence Bound-based

Other

Results

The following table shows the average rank and runtime of each algorithm when considering the five experiments further down in this file.

Algorithm	Average Rank	Average Time (seconds)
Batch Ensemble for MAB (m=0)	15.2	0
IRS.FH (H=2)	16.2	1
UCB-DT (γ=0.90)	18.6	3
IRS.FH (H=3)	18.8	1
UCB-DT (γ=0.95)	19.2	3
IRS.FH (H=1)	19.8	1
UCB-DT (γ=0.75)	21.0	3
TS-UCB (100 samples)	23.6	65
UCB-DT (γ=1.00)	23.8	3
ϵ-Exploring TS-UCB (1 samples)	25.6	0
Batch Ensemble for MAB (m=1)	26.6	0
ϵ-Exploring TS-UCB (10 samples)	26.8	1
IRS.FH (H=4)	27.0	1
ϵ-Exploring TS-UCB (100 samples)	27.2	7
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	27.4	0
Gittins Index -- Whittle's Approximation (β=0.99)	30.0	0
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	30.2	0
MOSS-Anytime (α=-0.85)	30.4	0
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	30.6	0
TS-UCB (10 samples)	32.4	7
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	32.4	0
IRS.FH (H=5)	35.2	1
Gittins Index -- Whittle's Approximation (β=0.90)	36.0	0
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	36.2	0
BayesUCB (δ=0.300)	37.0	0
ϵ-Decreasing (ϵ=0.990)	37.4	0
POKER (H=5)	37.8	0
POKER (H=10)	38.0	0
ReUCB (a=2.00)	38.6	1
Greedy	38.8	0
ϵ-Decreasing (ϵ=0.900)	38.8	0
POKER (H=1)	39.0	0
BayesUCB (δ=0.200)	39.2	0
ReUCB (a=1.50)	39.2	1
BayesUCB (δ=0.400)	39.6	0
ReUCB (a=1.00)	40.0	1
ϵ-Decreasing (ϵ=0.700)	40.6	0
CODE (δ=0.990)	41.0	0
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	41.8	0
MOSS-Anytime (α=-0.50)	43.0	0
Thompson Sampling with Virtual Helping Agents (Combiner C3)	43.2	18
ϵ-Greedy (ϵ=0.010)	43.4	0
ϵ-Greedy (ϵ=0.020)	44.4	0
BayesUCB (δ=0.500)	47.4	0
Gittins Index -- Whittle's Approximation (β=0.70)	47.8	0
TS-UCB (1 samples)	48.2	1
ϵ-Greedy (ϵ=0.050)	48.6	0
POKER (H=25)	49.0	0
Gittins Index -- Whittle's Approximation (β=0.50)	49.2	0
SoftElim (theta=8.00)	50.0	0
IRS.FH (H=10)	51.6	1
ϵ-Decreasing (ϵ=0.500)	51.8	0
MOSS-Anytime (α=-0.33)	52.0	0
ϵ-Exploring Thompson Sampling	52.2	0
POKER (H=50)	53.8	0
POKER (H=100)	55.6	0
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	56.4	0
RAVEN-UCB (a0=1, b0=5, eps=0.5)	58.4	0
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	60.2	0
RAVEN-UCB (a0=1, b0=5, eps=0.1)	60.8	0
ϵ-Greedy (ϵ=0.100)	61.0	0
WR-SDA (forced_exploration=true)	61.4	2
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	63.0	0
UCBT	63.8	0
BayesUCB (δ=0.100)	65.0	0
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	65.4	0
IRS.FH (H=25)	65.6	2
RAVEN-UCB (a0=1, b0=5, eps=0.001)	67.8	0
Forced Exploration	67.8	0
Optimistic Thompson Sampling	68.0	1
WR-SDA (forced_exploration=false)	68.4	1
BayesUCB (δ=0.900)	69.0	0
SoftElim (theta=4.00)	70.8	0
ReBoot (r=0.25)	71.2	0
MARS (δ=0.100)	72.0	0
POKER (H=250)	74.2	0
Batch Ensemble for MAB (m=2)	76.0	0
VarTS	78.2	0
Thompson Sampling	78.4	1
Weighted Bootstrap	79.2	3
Satisficing Thompson Sampling (ϵ=0.005)	79.8	1
ReBoot (r=0.50)	80.0	0
Satisficing Thompson Sampling (ϵ=0.010)	81.0	1
Bootstrapped Thompson Sampling (J=10)	81.6	0
KL-UCB	82.8	8
Vanilla Residual Bootstrap (init=0)	83.2	0
Hellinger-UCB	83.4	2
Bootstrapped Thompson Sampling (J=500)	84.0	4
Bootstrapped Thompson Sampling (J=1000)	85.2	9
Bootstrapped Thompson Sampling (J=100)	85.8	1
Multiplier Bootstrap-based Exploration	86.2	6
Non-Parametric Thompson Sampling	86.4	5
CODE (δ=0.900)	89.6	0
UCB1-Tuned	90.2	0
Garbage In, Reward Out (a=0.10)	91.2	1
Satisficing Thompson Sampling (ϵ=0.050)	95.0	1
Bounded Dirichlet Sampling	95.2	2
Vanilla Residual Bootstrap (init=1)	95.6	0
MARS (δ=0.010)	96.8	1
SoftElim (theta=2.00)	97.2	0
Kullback-Leibler Maillard Sampling	98.2	1
ϵ-Decreasing (ϵ=0.200)	98.4	0
Perturbed-History Exploration (a=1.1)	101.0	1
RS (a=0.65)	102.8	0
RS (a=0.50)	103.0	0
EB-TCI	103.2	0
Softsatisficing (a=0.65)	107.4	0
Batch Ensemble for MAB (m=4)	108.4	0
Tsallis-INF	109.8	1
Delightful GB (lr=2.000)	110.0	0
ReBoot (r=0.90)	110.6	0
MARS (δ=0.001)	111.4	10
MARS (δ=0.002)	111.6	5
Batch Ensemble for MAB (m=8)	111.8	0
Garbage In, Reward Out (a=0.33)	112.0	1
MARS (δ=1.000)	114.0	0
RS (a=0.90)	114.4	0
Satisficing Thompson Sampling (ϵ=0.100)	114.4	1
Delightful GB (lr=1.000)	114.4	0
ETC (m=10)	114.4	0
lil' UCB (δ=0.100)	115.0	0
RS (a=0.75)	116.2	0
FTPL-GR (lr=1.000)	116.8	4
Vanilla Residual Bootstrap (init=5)	117.8	0
ReBoot (r=1.00)	118.4	0
SoftElim (theta=1.00)	120.0	0
Perturbed-History Exploration (a=2.1)	122.4	1
Softsatisficing (a=0.75)	123.0	0
Softsatisficing (a=0.90)	125.4	0
ETC (m=20)	126.4	0
ϵ-Decreasing (ϵ=0.100)	126.8	0
FTPL-GR (lr=0.100)	127.0	4
Delightful GB (lr=0.500)	128.4	0
ETC (m=5)	130.0	0
lil' UCB (δ=0.010)	130.4	0
Gradient Bandit	130.6	0
Garbage In, Reward Out (a=1.00)	131.6	1
Softsatisficing (a=0.50)	132.0	0
Gradient Bandit (with baseline)	132.0	0
Boltzmann-Gumbel Exploration	132.2	0
ETC (m=25)	133.0	0
ReBoot (r=1.50)	134.0	0
RS (a=0.25)	134.2	0
RS (a=0.99)	134.6	0
Softsatisficing (a=0.25)	136.8	0
RS (a=0.10)	136.8	0
lil' UCB (δ=0.001)	137.0	0
Softsatisficing (a=0.99)	137.4	0
ReBoot (r=1.70)	138.0	0
Least Failures	138.0	0
SoftElim (theta=0.50)	138.2	0
Perturbed-History Exploration (a=5.1)	141.0	1
ReBoot (r=2.10)	144.8	0
UCB1	145.0	0
EXP-IX	145.4	0
ETC (m=3)	145.6	0
Softsatisficing (a=0.10)	146.2	0
RAVEN-UCB (a0=5, b0=1, eps=0.5)	149.0	0
RAVEN-UCB (a0=5, b0=1, eps=0.1)	149.6	0
RAVEN-UCB (a0=5, b0=1, eps=0.001)	151.0	0
ETC (m=2)	151.0	0
RS (a=0.00)	154.4	0
FTPL-GR (lr=0.010)	157.6	4
SoftElim (theta=0.10)	159.2	0
FTPL-GR (lr=0.001)	164.6	4
CODE (δ=0.050)	165.8	0
Random	165.8	0

Data

Uniform

This experiment uses 10 arms, with the means sampled uniformly from the interval [0, 1]. This is a relatively easy instance, because there is likely to be a single best arm that is easy to find. This is reflected in the %-Optimal column, where the best algorithms reach over 2/3 pull rate of the optimal arm.

Results

Algorithm	%-Optimal	Regret (Mean)	Regret (Median Absolute Deviation)	Time
Gittins Index -- Whittle's Approximation (β=0.99)	65.59	16.1524	6.1313	0.21s
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	68.73	16.2172	3.6835	0.42s
BayesUCB (δ=0.300)	68.64	16.4619	5.4296	0.19s
Batch Ensemble for MAB (m=0)	71.10	16.5066	2.6067	0.05s
Batch Ensemble for MAB (m=1)	70.97	16.7903	2.9368	0.08s
TS-UCB (100 samples)	71.96	16.8902	3.3855	58.33s
IRS.FH (H=2)	71.14	16.9375	2.7279	1.26s
IRS.FH (H=3)	72.29	16.9395	3.0432	1.29s
BayesUCB (δ=0.400)	64.26	17.1313	6.9164	0.19s
TS-UCB (10 samples)	72.56	17.2502	3.7256	6.84s
Gittins Index -- Whittle's Approximation (β=0.90)	61.89	17.4329	6.6927	0.21s
BayesUCB (δ=0.200)	71.21	17.5846	4.0029	0.19s
IRS.FH (H=4)	72.57	17.6097	3.5134	1.32s
ϵ-Exploring TS-UCB (100 samples)	69.94	17.8335	2.8808	6.68s
ϵ-Exploring TS-UCB (10 samples)	70.16	17.9184	2.9572	0.73s
ϵ-Exploring TS-UCB (1 samples)	70.36	18.0420	3.0933	0.15s
IRS.FH (H=1)	68.89	18.0936	2.5509	1.14s
UCB-DT (γ=1.00)	69.93	18.1466	2.5287	2.55s
UCB-DT (γ=0.95)	72.44	18.1946	2.4725	2.58s
UCB-DT (γ=0.75)	72.50	18.1962	2.5172	2.47s
UCB-DT (γ=0.90)	72.42	18.2016	2.4807	2.54s
IRS.FH (H=5)	72.59	18.3624	4.0320	1.38s
MOSS-Anytime (α=-0.85)	69.71	18.8113	2.5659	0.18s
CODE (δ=0.990)	68.91	18.9329	2.9569	0.17s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	70.40	19.0812	2.7529	0.24s
POKER (H=10)	65.48	19.3526	3.2000	0.31s
POKER (H=5)	66.34	19.3558	2.7035	0.32s
ReUCB (a=2.00)	68.28	19.4080	2.6887	0.95s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	70.19	19.4646	2.7603	0.27s
TS-UCB (1 samples)	71.83	19.5545	5.3564	0.86s
ReUCB (a=1.50)	67.94	19.5633	2.7023	0.94s
POKER (H=1)	66.55	19.5748	2.5553	0.29s
ReUCB (a=1.00)	67.60	19.7102	2.7106	0.91s
Greedy	66.26	19.7129	2.5470	0.17s
BayesUCB (δ=0.500)	58.32	20.1518	9.3444	0.19s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	69.57	20.5446	2.8610	0.21s
ϵ-Decreasing (ϵ=0.990)	66.35	20.7765	2.7735	0.27s
BayesUCB (δ=0.100)	68.95	20.9900	3.7246	0.16s
ϵ-Greedy (ϵ=0.010)	66.18	21.1769	2.8588	0.19s
IRS.FH (H=10)	71.83	21.2519	5.4688	1.44s
ϵ-Decreasing (ϵ=0.900)	66.48	21.2824	2.8492	0.28s
POKER (H=25)	61.38	21.3386	6.6186	0.31s
Thompson Sampling with Virtual Helping Agents (Combiner C3)	62.44	21.5482	6.2481	30.29s
SoftElim (theta=8.00)	67.25	21.7745	3.4756	0.39s
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	68.76	22.2566	2.9656	0.23s
MOSS-Anytime (α=-0.50)	70.74	22.4582	2.7088	0.21s
ϵ-Greedy (ϵ=0.020)	65.99	22.7752	3.1672	0.24s
RS (a=0.75)	57.47	23.4688	11.0929	0.19s
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	68.14	23.4949	3.2027	0.23s
ϵ-Decreasing (ϵ=0.700)	66.55	23.6847	3.3687	0.26s
WR-SDA (forced_exploration=false)	66.87	23.8280	5.0922	1.12s
WR-SDA (forced_exploration=true)	69.42	24.2844	5.2129	1.27s
IRS.FH (H=25)	70.09	24.4452	6.0015	1.74s
MOSS-Anytime (α=-0.33)	69.75	24.4536	2.6909	0.21s
Gittins Index -- Whittle's Approximation (β=0.70)	51.65	24.9930	11.3705	0.21s
RAVEN-UCB (a0=1, b0=5, eps=0.5)	65.05	25.4596	3.6551	0.23s
Optimistic Thompson Sampling	68.80	25.6235	7.1784	0.92s
POKER (H=50)	56.21	25.6788	9.6913	0.31s
SoftElim (theta=4.00)	64.43	26.0903	4.2092	0.39s
RAVEN-UCB (a0=1, b0=5, eps=0.1)	64.72	26.1332	3.7397	0.23s
ϵ-Greedy (ϵ=0.050)	65.45	27.3929	4.0210	0.22s
ϵ-Exploring Thompson Sampling	63.13	27.7465	9.1583	0.16s
UCBT	65.40	28.7984	4.0759	0.08s
Weighted Bootstrap	66.08	28.9034	7.1124	2.76s
Thompson Sampling	66.09	28.9339	7.1373	0.81s
Satisficing Thompson Sampling (ϵ=0.005)	65.94	29.0318	7.1008	0.86s
Satisficing Thompson Sampling (ϵ=0.010)	65.61	29.3229	7.0179	0.94s
KL-UCB	66.78	29.6304	7.3837	7.38s
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	65.19	29.6926	4.6441	0.23s
ReBoot (r=0.25)	61.18	30.3599	5.2731	0.18s
Softsatisficing (a=0.75)	49.75	30.3853	17.8608	0.12s
Gittins Index -- Whittle's Approximation (β=0.50)	46.19	30.4565	14.2565	0.18s
RAVEN-UCB (a0=1, b0=5, eps=0.001)	62.48	30.6351	4.5496	0.24s
CODE (δ=0.900)	54.94	30.6423	6.5536	0.17s
Batch Ensemble for MAB (m=2)	62.49	30.7481	7.8970	0.14s
POKER (H=100)	51.57	30.8991	12.6895	0.31s
Hellinger-UCB	63.89	31.0005	7.0702	2.23s
ϵ-Decreasing (ϵ=0.500)	65.55	31.3306	4.6232	0.25s
UCB1-Tuned	62.03	31.6747	3.6906	0.16s
Vanilla Residual Bootstrap (init=0)	59.99	33.1442	5.4073	0.19s
Non-Parametric Thompson Sampling	63.70	33.7962	7.1820	4.50s
RS (a=0.65)	44.01	33.8259	18.1643	0.18s
ReBoot (r=0.50)	58.58	34.0829	5.9224	0.23s
Bounded Dirichlet Sampling	63.86	34.1647	7.1345	2.10s
MARS (δ=0.100)	64.47	34.5294	4.8042	0.32s
Satisficing Thompson Sampling (ϵ=0.050)	57.19	35.0506	6.7983	0.95s
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	42.16	35.0542	18.6679	0.40s
ϵ-Greedy (ϵ=0.100)	63.98	35.8380	5.3322	0.22s
Multiplier Bootstrap-based Exploration	60.70	36.1612	4.2418	5.51s
SoftElim (theta=2.00)	57.22	36.3273	4.6857	0.39s
Kullback-Leibler Maillard Sampling	59.67	37.5162	8.3979	0.49s
Perturbed-History Exploration (a=1.1)	56.96	37.8929	5.6711	0.77s
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	40.26	38.2054	20.8866	0.39s
POKER (H=250)	46.27	38.6838	15.5508	0.31s
Garbage In, Reward Out (a=0.10)	57.65	38.7302	5.2772	0.80s
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	39.50	39.7859	22.1131	0.39s
Softsatisficing (a=0.65)	40.21	39.8959	24.8725	0.09s
BayesUCB (δ=0.900)	39.28	40.0985	22.3598	0.19s
FTPL-GR (lr=1.000)	58.80	40.3823	10.4546	4.60s
Vanilla Residual Bootstrap (init=1)	59.43	40.6304	4.7837	0.24s
Bootstrapped Thompson Sampling (J=500)	40.59	41.9370	21.7066	4.25s
Bootstrapped Thompson Sampling (J=1000)	40.88	41.9668	21.1936	8.31s
Bootstrapped Thompson Sampling (J=100)	40.77	42.3584	21.7453	1.08s
Bootstrapped Thompson Sampling (J=10)	39.55	42.8224	21.8677	0.35s
VarTS	64.30	42.9062	5.1435	0.44s
RS (a=0.90)	61.99	43.3480	18.0732	0.16s
Satisficing Thompson Sampling (ϵ=0.100)	44.13	44.2992	10.4673	0.95s
lil' UCB (δ=0.100)	52.19	44.8365	5.5606	0.30s
Delightful GB (lr=2.000)	42.93	46.2175	19.8031	0.48s
Tsallis-INF	54.25	46.4787	5.9697	0.90s
Softsatisficing (a=0.90)	56.49	46.6220	27.5781	0.27s
Forced Exploration	62.89	46.6666	6.2607	0.09s
ReBoot (r=0.90)	52.24	47.2795	6.7367	0.25s
Delightful GB (lr=1.000)	46.65	47.8063	14.1295	0.48s
Garbage In, Reward Out (a=0.33)	51.74	49.2706	5.5459	1.06s
Vanilla Residual Bootstrap (init=5)	55.69	50.7442	6.1208	0.26s
SoftElim (theta=1.00)	47.68	51.6367	6.0049	0.39s
ReBoot (r=1.00)	49.90	51.8800	6.7533	0.25s
MARS (δ=0.010)	54.83	53.5390	6.1628	1.36s
Batch Ensemble for MAB (m=4)	48.14	53.9552	22.0397	0.21s
EB-TCI	42.82	55.0174	15.7714	0.31s
MARS (δ=0.001)	50.10	55.4818	10.4818	10.22s
RS (a=0.50)	32.37	55.6085	36.1138	0.18s
Perturbed-History Exploration (a=2.1)	47.44	56.5448	6.0521	0.90s
ETC (m=10)	47.32	56.6956	11.0554	0.25s
FTPL-GR (lr=0.100)	50.69	57.1522	7.0720	3.97s
MARS (δ=0.002)	50.18	59.5941	8.8231	5.46s
Delightful GB (lr=0.500)	42.57	61.8842	12.8072	0.42s
lil' UCB (δ=0.010)	44.08	62.1486	6.5312	0.33s
Softsatisficing (a=0.50)	28.97	63.5094	43.9773	0.07s
MARS (δ=1.000)	37.18	65.5059	21.5650	0.08s
Garbage In, Reward Out (a=1.00)	43.03	66.4802	6.9482	1.19s
Boltzmann-Gumbel Exploration	43.87	68.9250	6.5817	0.31s
Gradient Bandit	46.48	69.6675	9.5534	0.41s
Gradient Bandit (with baseline)	48.72	70.6839	6.1066	0.50s
SoftElim (theta=0.50)	37.94	71.6155	8.1877	0.42s
ReBoot (r=1.50)	40.44	72.1794	8.1305	0.24s
lil' UCB (δ=0.001)	39.18	73.8291	8.0325	0.29s
ETC (m=5)	27.93	78.7963	24.1796	0.22s
ReBoot (r=1.70)	37.41	79.4522	8.9230	0.23s
Batch Ensemble for MAB (m=8)	34.92	80.6517	46.0772	0.27s
ϵ-Decreasing (ϵ=0.200)	50.82	81.7548	11.1762	0.23s
Perturbed-History Exploration (a=5.1)	36.06	83.3539	9.5119	1.04s
RS (a=0.99)	42.93	84.4523	29.9659	0.17s
ETC (m=20)	49.52	85.1694	11.9964	0.22s
UCB1	34.52	86.8474	10.2054	0.17s
Least Failures	40.55	88.7625	28.1293	0.16s
Softsatisficing (a=0.99)	39.79	89.3024	27.8456	0.40s
ReBoot (r=2.10)	32.31	92.8131	10.7156	0.24s
EXP-IX	31.87	95.7830	13.0250	0.42s
ETC (m=3)	22.30	98.5252	27.0722	0.25s
RAVEN-UCB (a0=5, b0=1, eps=0.5)	28.73	104.0006	13.6507	0.26s
RAVEN-UCB (a0=5, b0=1, eps=0.1)	28.72	104.0303	13.6481	0.25s
RAVEN-UCB (a0=5, b0=1, eps=0.001)	28.63	104.3277	13.7403	0.26s
RS (a=0.25)	19.47	104.5549	74.4142	0.19s
ETC (m=25)	41.95	105.2629	14.8396	0.24s
ETC (m=2)	20.21	110.5641	26.8868	0.24s
Softsatisficing (a=0.25)	18.13	114.1714	81.1272	0.06s
ϵ-Decreasing (ϵ=0.100)	35.59	127.2145	17.7947	0.15s
RS (a=0.10)	16.10	132.1474	91.9926	0.16s
SoftElim (theta=0.10)	19.54	132.7254	18.1293	0.37s
Softsatisficing (a=0.10)	15.69	135.0154	93.5393	0.04s
FTPL-GR (lr=0.010)	20.49	142.8645	20.4026	4.04s
RS (a=0.00)	15.09	144.1846	97.5086	0.13s
CODE (δ=0.050)	10.94	187.9726	24.8420	0.18s
FTPL-GR (lr=0.001)	10.91	196.2329	29.3352	3.98s
Random	10.01	204.0160	30.3495	0.04s

Half-Range

This experiment uses 10 arms, with the means sampled uniformly from the interval [0.25, 0.75]. This is a harder instance, because the arms are closer together and thus harder to distinguish.

This experiment was taken from the GIRO paper.

Results

Algorithm	%-Optimal	Regret (Mean)	Regret (Median Absolute Deviation)	Time
Batch Ensemble for MAB (m=0)	44.01	24.3162	7.5674	0.06s
IRS.FH (H=2)	43.85	24.7422	7.8222	1.27s
IRS.FH (H=3)	45.28	25.0047	6.6326	1.32s
Gittins Index -- Whittle's Approximation (β=0.99)	44.09	25.0082	7.8533	0.21s
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	40.05	25.2717	12.2698	0.40s
UCB-DT (γ=0.90)	43.02	25.6120	7.2004	2.59s
UCB-DT (γ=0.95)	43.00	25.6319	7.1816	2.61s
UCB-DT (γ=0.75)	43.05	25.6700	7.2075	2.52s
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	41.68	25.7326	9.2085	0.42s
Gittins Index -- Whittle's Approximation (β=0.90)	41.53	25.7390	9.3584	0.22s
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	40.51	25.7582	11.6365	0.40s
Gittins Index -- Whittle's Approximation (β=0.50)	40.01	25.7986	11.1271	0.19s
BayesUCB (δ=0.900)	39.16	25.8336	13.1050	0.20s
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	38.83	26.1183	12.9800	0.36s
BayesUCB (δ=0.500)	42.67	26.1689	8.4137	0.19s
Gittins Index -- Whittle's Approximation (β=0.70)	40.02	26.4219	10.1272	0.22s
BayesUCB (δ=0.400)	43.04	26.4233	8.1311	0.20s
BayesUCB (δ=0.300)	44.61	26.4883	7.0859	0.20s
IRS.FH (H=4)	44.59	26.4971	6.7171	1.35s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	42.52	26.5203	7.2100	0.21s
Batch Ensemble for MAB (m=1)	43.55	26.6555	8.8838	0.09s
Thompson Sampling with Virtual Helping Agents (Combiner C3)	43.95	26.7444	8.6626	13.72s
IRS.FH (H=1)	39.73	26.9134	9.8115	1.16s
TS-UCB (100 samples)	45.02	26.9156	6.2201	66.44s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	41.15	27.0983	8.1856	0.26s
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	42.23	27.1170	6.8959	0.24s
RS (a=0.65)	44.47	27.1789	13.8551	0.17s
BayesUCB (δ=0.200)	45.32	27.2126	6.1882	0.19s
ϵ-Exploring TS-UCB (1 samples)	42.08	27.2128	8.1379	0.16s
MOSS-Anytime (α=-0.85)	40.04	27.3181	8.7262	0.19s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	40.90	27.4143	8.4479	0.24s
MOSS-Anytime (α=-0.50)	44.05	27.4891	5.4358	0.23s
ϵ-Exploring TS-UCB (10 samples)	41.55	27.5450	8.1473	0.74s
ϵ-Exploring TS-UCB (100 samples)	41.08	27.6722	8.2739	6.79s
CODE (δ=0.990)	39.41	27.7728	10.1499	0.17s
TS-UCB (10 samples)	44.54	27.9224	5.9276	6.02s
UCB-DT (γ=1.00)	38.52	28.0522	9.8213	2.55s
IRS.FH (H=5)	43.60	28.0814	6.8634	1.41s
ϵ-Decreasing (ϵ=0.990)	38.24	28.1487	9.6145	0.16s
Greedy	37.83	28.2076	9.9996	0.09s
Softsatisficing (a=0.65)	41.45	28.2436	15.4217	0.19s
ϵ-Decreasing (ϵ=0.900)	38.32	28.3069	9.4761	0.18s
ϵ-Decreasing (ϵ=0.700)	39.35	28.3077	8.7988	0.18s
POKER (H=1)	37.76	28.3667	10.1082	0.29s
POKER (H=5)	37.76	28.3800	10.0953	0.33s
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	41.48	28.3881	6.9847	0.24s
POKER (H=10)	37.74	28.4050	10.0473	0.32s
SoftElim (theta=8.00)	41.57	28.4570	8.0884	0.41s
ϵ-Greedy (ϵ=0.010)	38.03	28.4793	9.7905	0.10s
ReUCB (a=2.00)	38.15	28.5063	9.6122	0.96s
ReUCB (a=1.50)	38.07	28.5077	9.7464	0.95s
ReUCB (a=1.00)	37.96	28.5231	9.8749	0.92s
ϵ-Greedy (ϵ=0.020)	38.36	28.6900	9.4808	0.14s
POKER (H=25)	37.49	28.8412	9.4550	0.32s
ϵ-Greedy (ϵ=0.050)	39.46	29.3486	8.7084	0.15s
Bootstrapped Thompson Sampling (J=10)	38.57	29.4073	13.9756	0.34s
ϵ-Decreasing (ϵ=0.500)	40.91	29.4333	7.5048	0.17s
MOSS-Anytime (α=-0.33)	42.29	29.8866	5.9957	0.22s
POKER (H=100)	38.92	29.9131	6.5647	0.34s
RAVEN-UCB (a0=1, b0=5, eps=0.5)	41.51	30.0406	6.8188	0.24s
BayesUCB (δ=0.100)	42.74	30.3237	6.0061	0.18s
POKER (H=50)	36.98	30.4262	8.3416	0.35s
RS (a=0.50)	34.65	30.4958	17.8406	0.19s
RAVEN-UCB (a0=1, b0=5, eps=0.1)	41.13	30.6400	6.5227	0.24s
ϵ-Exploring Thompson Sampling	40.39	30.7426	8.8886	0.16s
Bootstrapped Thompson Sampling (J=500)	38.36	30.8943	13.6813	4.26s
Bootstrapped Thompson Sampling (J=100)	38.23	30.9704	13.6387	1.02s
Bootstrapped Thompson Sampling (J=1000)	37.93	31.2238	13.7505	8.31s
SoftElim (theta=4.00)	41.35	31.4556	6.6870	0.44s
ϵ-Greedy (ϵ=0.100)	40.16	31.5381	7.6639	0.15s
TS-UCB (1 samples)	41.21	31.8313	6.2230	0.79s
IRS.FH (H=10)	41.13	32.0134	6.6906	1.45s
UCBT	41.92	32.0754	5.3843	0.08s
MARS (δ=0.100)	40.72	32.4299	6.8526	0.32s
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	39.29	32.7851	6.8408	0.24s
Forced Exploration	41.72	33.1699	5.7046	0.09s
RAVEN-UCB (a0=1, b0=5, eps=0.001)	39.57	33.2162	6.5485	0.25s
Softsatisficing (a=0.50)	31.09	33.7174	20.6161	0.08s
WR-SDA (forced_exploration=true)	40.23	33.8608	6.5333	1.88s
POKER (H=250)	37.22	33.9079	8.0820	0.34s
WR-SDA (forced_exploration=false)	37.74	34.3702	7.8470	1.75s
VarTS	40.86	34.4084	5.8058	0.45s
IRS.FH (H=25)	38.87	35.4924	6.7519	1.88s
CODE (δ=0.900)	35.87	35.7202	11.4984	0.18s
UCB1-Tuned	38.36	36.0304	5.8517	0.16s
ReBoot (r=0.25)	35.81	36.8892	8.1828	0.18s
Vanilla Residual Bootstrap (init=0)	35.10	38.0391	7.9288	0.19s
Optimistic Thompson Sampling	36.78	38.4207	7.1289	0.95s
Multiplier Bootstrap-based Exploration	36.05	38.7066	7.0003	5.47s
SoftElim (theta=2.00)	35.14	39.4061	7.2049	0.40s
ReBoot (r=0.50)	34.21	39.5480	8.2009	0.24s
Delightful GB (lr=2.000)	33.57	39.6689	13.8321	0.49s
ETC (m=10)	33.45	40.0881	11.7950	0.15s
Hellinger-UCB	36.12	40.4295	6.1041	2.42s
Weighted Bootstrap	35.00	40.5410	7.4857	2.82s
Thompson Sampling	34.98	40.5432	7.5133	0.82s
Satisficing Thompson Sampling (ϵ=0.005)	34.96	40.5786	7.5540	0.81s
Satisficing Thompson Sampling (ϵ=0.010)	34.87	40.6461	7.5447	0.85s
Batch Ensemble for MAB (m=2)	35.57	40.7384	9.1522	0.17s
Garbage In, Reward Out (a=0.10)	33.73	42.0945	7.6013	0.99s
Perturbed-History Exploration (a=1.1)	33.49	42.3004	7.7267	0.87s
KL-UCB	34.54	42.7149	6.2245	7.92s
EB-TCI	30.56	42.8317	9.3319	0.32s
Satisficing Thompson Sampling (ϵ=0.050)	32.52	43.1108	8.0902	0.95s
Non-Parametric Thompson Sampling	33.09	43.6865	7.5605	4.62s
Vanilla Residual Bootstrap (init=1)	32.88	43.7710	7.4509	0.26s
MARS (δ=0.010)	33.86	44.2200	6.6462	1.38s
Delightful GB (lr=1.000)	31.39	44.6731	11.2050	0.49s
Bounded Dirichlet Sampling	32.79	44.7466	7.9659	2.70s
Tsallis-INF	32.35	45.6862	8.4068	0.92s
lil' UCB (δ=0.100)	31.70	46.4287	6.7023	0.31s
MARS (δ=1.000)	27.30	46.6224	23.1260	0.09s
Kullback-Leibler Maillard Sampling	29.69	47.8324	8.4744	0.50s
Satisficing Thompson Sampling (ϵ=0.100)	27.45	48.1450	10.2207	0.97s
Garbage In, Reward Out (a=0.33)	30.11	48.1458	8.0648	1.30s
MARS (δ=0.001)	30.62	48.3961	7.5977	11.25s
ReBoot (r=0.90)	29.34	48.4181	8.4845	0.26s
ϵ-Decreasing (ϵ=0.200)	33.79	49.1413	7.5396	0.14s
MARS (δ=0.002)	30.50	49.4331	7.3296	5.92s
SoftElim (theta=1.00)	28.58	49.5066	8.4513	0.40s
ETC (m=5)	21.32	50.0278	17.6885	0.15s
ReBoot (r=1.00)	27.89	50.9352	8.6898	0.27s
ETC (m=20)	31.24	51.1732	8.6350	0.15s
Perturbed-History Exploration (a=2.1)	27.91	52.2188	8.4423	0.98s
Softsatisficing (a=0.75)	30.32	52.6410	16.5585	0.40s
FTPL-GR (lr=0.100)	28.27	52.7166	10.1011	4.03s
Vanilla Residual Bootstrap (init=5)	28.26	53.2834	8.4062	0.26s
Delightful GB (lr=0.500)	26.85	53.6128	10.7001	0.43s
RS (a=0.75)	29.84	54.6692	13.4306	0.20s
Gradient Bandit	29.48	54.9276	8.8193	0.42s
Gradient Bandit (with baseline)	29.07	55.8207	8.2645	0.49s
Batch Ensemble for MAB (m=4)	26.48	55.8781	10.9358	0.27s
ETC (m=25)	32.18	56.3820	8.2546	0.15s
FTPL-GR (lr=1.000)	26.06	56.8336	12.0625	3.95s
lil' UCB (δ=0.010)	25.83	56.9410	8.2814	0.33s
Garbage In, Reward Out (a=1.00)	25.12	57.7304	9.1152	1.28s
Boltzmann-Gumbel Exploration	25.61	58.0539	8.8928	0.32s
ReBoot (r=1.50)	22.85	61.0890	9.6647	0.26s
SoftElim (theta=0.50)	22.46	61.3682	9.9029	0.39s
lil' UCB (δ=0.001)	22.85	62.7995	9.1698	0.30s
ReBoot (r=1.70)	21.38	64.4112	10.0761	0.23s
Perturbed-History Exploration (a=5.1)	21.44	65.8492	10.0502	1.03s
UCB1	20.42	68.0927	10.1489	0.17s
ϵ-Decreasing (ϵ=0.100)	24.60	68.8686	9.8576	0.07s
ReBoot (r=2.10)	19.16	69.7726	10.8419	0.23s
ETC (m=3)	15.41	69.9994	18.3348	0.14s
EXP-IX	19.28	71.2582	11.2795	0.44s
RS (a=0.25)	15.97	71.7134	48.5357	0.19s
Batch Ensemble for MAB (m=8)	17.28	74.4122	19.8696	0.32s
RAVEN-UCB (a0=5, b0=1, eps=0.5)	17.71	74.8132	11.3884	0.30s
RAVEN-UCB (a0=5, b0=1, eps=0.1)	17.70	74.8207	11.3883	0.26s
RAVEN-UCB (a0=5, b0=1, eps=0.001)	17.67	74.8915	11.3848	0.26s
Softsatisficing (a=0.90)	17.73	76.2770	12.8748	0.43s
RS (a=0.90)	17.16	77.8704	12.5705	0.19s
Softsatisficing (a=0.25)	14.53	78.0274	53.4058	0.06s
ETC (m=2)	15.27	80.4676	18.0151	0.11s
Softsatisficing (a=0.99)	15.71	81.5688	13.1278	0.42s
RS (a=0.99)	15.39	82.4437	12.9414	0.19s
Least Failures	15.39	82.4443	12.9451	0.08s
RS (a=0.10)	13.03	84.6250	56.6531	0.17s
FTPL-GR (lr=0.010)	14.69	85.0457	12.8256	4.06s
SoftElim (theta=0.10)	13.70	85.4625	12.8718	0.36s
RS (a=0.00)	12.85	85.6737	57.3841	0.12s
Softsatisficing (a=0.10)	12.81	86.5606	58.7008	0.04s
FTPL-GR (lr=0.001)	10.46	100.0294	14.8858	4.00s
Random	10.01	102.0080	15.1748	0.02s
CODE (δ=0.050)	10.00	102.0185	14.8649	0.17s

Hard

This experiment uses 10 arms. All arms have a success probability of 0.5, except for the best arm, which has a success probability of 0.51.

This experiment was taken from the paper describing Boltzmann-Gumbel Exploration.

Results

Algorithm	%-Optimal	Regret (Mean)	Regret (Median Absolute Deviation)	Time
POKER (H=100)	18.10	4.0949	0.1700	0.32s
Greedy	17.00	4.1498	0.1100	0.10s
POKER (H=10)	17.00	4.1498	0.1100	0.32s
POKER (H=1)	17.00	4.1498	0.1100	0.28s
POKER (H=25)	17.00	4.1498	0.1100	0.32s
POKER (H=5)	17.00	4.1498	0.1100	0.32s
POKER (H=50)	17.00	4.1499	0.1100	0.33s
ϵ-Decreasing (ϵ=0.990)	16.90	4.1552	0.1000	0.17s
ϵ-Decreasing (ϵ=0.900)	16.80	4.1598	0.1000	0.16s
ϵ-Greedy (ϵ=0.010)	16.64	4.1682	0.1000	0.11s
ReUCB (a=1.00)	16.59	4.1704	0.1200	0.92s
ϵ-Decreasing (ϵ=0.700)	16.29	4.1854	0.1000	0.16s
ϵ-Greedy (ϵ=0.020)	16.25	4.1873	0.1000	0.14s
ReUCB (a=1.50)	15.54	4.2231	0.1200	0.96s
ReUCB (a=2.00)	15.41	4.2293	0.1200	0.95s
ϵ-Exploring TS-UCB (1 samples)	15.26	4.2371	0.1700	0.16s
Gittins Index -- Whittle's Approximation (β=0.50)	15.19	4.2406	0.1200	0.19s
ϵ-Greedy (ϵ=0.050)	15.11	4.2447	0.0900	0.15s
ϵ-Decreasing (ϵ=0.500)	14.77	4.2614	0.0800	0.15s
VarTS	14.62	4.2691	0.1200	0.51s
ϵ-Exploring TS-UCB (10 samples)	14.23	4.2887	0.1700	0.75s
Batch Ensemble for MAB (m=8)	14.22	4.2892	0.3800	0.33s
ϵ-Exploring TS-UCB (100 samples)	14.10	4.2951	0.1700	7.27s
ϵ-Decreasing (ϵ=0.200)	14.05	4.2973	0.1600	0.14s
ϵ-Greedy (ϵ=0.100)	13.97	4.3014	0.0800	0.13s
ϵ-Exploring Thompson Sampling	13.71	4.3147	0.1100	0.17s
Batch Ensemble for MAB (m=2)	13.68	4.3160	0.2900	0.18s
Forced Exploration	13.53	4.3235	0.1000	0.09s
RS (a=0.50)	13.42	4.3290	0.0100	0.20s
IRS.FH (H=1)	13.36	4.3318	0.1200	1.19s
UCB-DT (γ=0.90)	13.27	4.3365	0.1000	2.60s
UCB-DT (γ=0.95)	13.27	4.3365	0.1000	2.69s
Gittins Index -- Whittle's Approximation (β=0.70)	13.25	4.3374	0.1200	0.22s
UCB-DT (γ=1.00)	13.19	4.3406	0.1200	2.58s
UCB-DT (γ=0.75)	13.05	4.3474	0.1000	2.68s
MOSS-Anytime (α=-0.33)	13.00	4.3502	0.2000	0.21s
MOSS-Anytime (α=-0.85)	12.95	4.3526	0.1800	0.18s
MOSS-Anytime (α=-0.50)	12.94	4.3532	0.1700	0.21s
IRS.FH (H=2)	12.84	4.3582	0.1700	1.27s
Batch Ensemble for MAB (m=1)	12.71	4.3646	0.4800	0.09s
Gittins Index -- Whittle's Approximation (β=0.99)	12.66	4.3672	0.2000	0.22s
IRS.FH (H=3)	12.63	4.3687	0.1900	1.39s
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	12.62	4.3690	0.1800	0.26s
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	12.62	4.3692	0.2100	0.23s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	12.58	4.3710	0.1800	0.20s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	12.58	4.3710	0.1800	0.25s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	12.58	4.3710	0.1800	0.24s
Batch Ensemble for MAB (m=0)	12.57	4.3714	0.1700	0.05s
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	12.47	4.3767	0.1200	0.34s
BayesUCB (δ=0.500)	12.44	4.3782	0.2000	0.21s
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	12.41	4.3797	0.2600	0.23s
POKER (H=250)	12.40	4.3802	0.2500	0.34s
BayesUCB (δ=0.400)	12.37	4.3813	0.2000	0.20s
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	12.37	4.3816	0.1200	0.44s
Gittins Index -- Whittle's Approximation (β=0.90)	12.37	4.3816	0.1700	0.22s
IRS.FH (H=4)	12.29	4.3857	0.1900	1.45s
TS-UCB (100 samples)	12.17	4.3915	0.2500	65.62s
UCBT	12.17	4.3916	0.4200	0.07s
BayesUCB (δ=0.300)	12.17	4.3917	0.2500	0.20s
IRS.FH (H=5)	12.02	4.3990	0.2200	1.37s
RAVEN-UCB (a0=1, b0=5, eps=0.001)	12.01	4.3996	0.3500	0.24s
RAVEN-UCB (a0=1, b0=5, eps=0.5)	12.00	4.3999	0.2500	0.27s
RAVEN-UCB (a0=1, b0=5, eps=0.1)	11.98	4.4008	0.2500	0.26s
ϵ-Decreasing (ϵ=0.100)	11.91	4.4043	0.1500	0.08s
SoftElim (theta=8.00)	11.90	4.4050	0.2100	0.42s
Bootstrapped Thompson Sampling (J=10)	11.83	4.4083	0.1600	0.37s
Bootstrapped Thompson Sampling (J=500)	11.80	4.4101	0.3400	4.41s
Bootstrapped Thompson Sampling (J=1000)	11.78	4.4109	0.3400	9.08s
Bootstrapped Thompson Sampling (J=100)	11.76	4.4118	0.3100	1.11s
EB-TCI	11.56	4.4218	0.4400	0.33s
Batch Ensemble for MAB (m=4)	11.56	4.4221	0.3100	0.26s
WR-SDA (forced_exploration=false)	11.52	4.4238	0.3200	1.34s
BayesUCB (δ=0.200)	11.51	4.4245	0.2300	0.20s
Thompson Sampling with Virtual Helping Agents (Combiner C3)	11.46	4.4269	0.2600	6.04s
IRS.FH (H=10)	11.46	4.4270	0.2400	1.67s
TS-UCB (10 samples)	11.45	4.4275	0.2500	7.13s
Vanilla Residual Bootstrap (init=0)	11.42	4.4292	0.3500	0.19s
MARS (δ=0.100)	11.41	4.4296	0.2400	0.34s
WR-SDA (forced_exploration=true)	11.41	4.4296	0.3000	1.53s
CODE (δ=0.900)	11.39	4.4305	0.4900	0.18s
ReBoot (r=0.25)	11.38	4.4311	0.3500	0.19s
BayesUCB (δ=0.900)	11.37	4.4314	0.1200	0.22s
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	11.35	4.4324	0.1200	0.43s
ReBoot (r=0.50)	11.34	4.4329	0.3800	0.22s
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	11.29	4.4355	0.0900	0.45s
TS-UCB (1 samples)	11.21	4.4395	0.2400	0.87s
CODE (δ=0.990)	11.21	4.4397	0.1200	0.16s
Optimistic Thompson Sampling	11.20	4.4399	0.3000	1.04s
Garbage In, Reward Out (a=0.10)	11.16	4.4418	0.3400	1.13s
BayesUCB (δ=0.100)	11.16	4.4421	0.2400	0.16s
Non-Parametric Thompson Sampling	11.16	4.4422	0.3400	4.96s
Satisficing Thompson Sampling (ϵ=0.010)	11.15	4.4425	0.3400	0.85s
Satisficing Thompson Sampling (ϵ=0.005)	11.15	4.4426	0.3300	0.97s
Thompson Sampling	11.14	4.4430	0.3400	0.76s
IRS.FH (H=25)	11.14	4.4431	0.2600	2.20s
SoftElim (theta=4.00)	11.14	4.4431	0.2500	0.47s
Perturbed-History Exploration (a=1.1)	11.13	4.4433	0.3600	0.87s
Weighted Bootstrap	11.13	4.4436	0.3400	2.96s
Multiplier Bootstrap-based Exploration	11.12	4.4439	0.3100	5.72s
Softsatisficing (a=0.65)	11.12	4.4441	0.3000	0.38s
Vanilla Residual Bootstrap (init=1)	11.11	4.4443	0.3500	0.25s
Satisficing Thompson Sampling (ϵ=0.050)	11.09	4.4454	0.4000	0.95s
Garbage In, Reward Out (a=0.33)	11.04	4.4480	0.3800	1.14s
Tsallis-INF	11.01	4.4497	0.2700	1.07s
KL-UCB	10.99	4.4505	0.2800	8.29s
MARS (δ=1.000)	10.95	4.4523	0.0000	0.09s
RS (a=0.65)	10.95	4.4523	0.2500	0.16s
ReBoot (r=0.90)	10.94	4.4528	0.3800	0.22s
SoftElim (theta=2.00)	10.93	4.4536	0.3100	0.43s
MARS (δ=0.010)	10.93	4.4537	0.3700	1.38s
Kullback-Leibler Maillard Sampling	10.91	4.4544	0.3500	0.51s
Delightful GB (lr=1.000)	10.91	4.4547	0.3100	0.49s
Perturbed-History Exploration (a=2.1)	10.89	4.4557	0.3400	1.03s
SoftElim (theta=1.00)	10.88	4.4560	0.3300	0.44s
Delightful GB (lr=2.000)	10.87	4.4565	0.2300	0.49s
Vanilla Residual Bootstrap (init=5)	10.85	4.4574	0.2700	0.25s
Hellinger-UCB	10.85	4.4575	0.2800	2.66s
lil' UCB (δ=0.100)	10.85	4.4575	0.2600	0.31s
ReBoot (r=1.00)	10.84	4.4578	0.3500	0.22s
Bounded Dirichlet Sampling	10.83	4.4586	0.3100	2.58s
MARS (δ=0.002)	10.79	4.4603	0.2800	5.95s
UCB1-Tuned	10.74	4.4632	0.2400	0.17s
Satisficing Thompson Sampling (ϵ=0.100)	10.72	4.4641	0.3100	1.03s
MARS (δ=0.001)	10.71	4.4646	0.3200	11.34s
lil' UCB (δ=0.010)	10.70	4.4651	0.2200	0.31s
Boltzmann-Gumbel Exploration	10.67	4.4663	0.2700	0.32s
Delightful GB (lr=0.500)	10.66	4.4669	0.2800	0.42s
Garbage In, Reward Out (a=1.00)	10.66	4.4669	0.2600	1.23s
Gradient Bandit	10.66	4.4672	0.2000	0.40s
Gradient Bandit (with baseline)	10.60	4.4699	0.2000	0.48s
FTPL-GR (lr=0.100)	10.58	4.4711	0.2500	3.90s
lil' UCB (δ=0.001)	10.54	4.4730	0.2000	0.31s
ReBoot (r=1.50)	10.49	4.4756	0.2100	0.22s
FTPL-GR (lr=1.000)	10.46	4.4769	0.2900	4.00s
SoftElim (theta=0.50)	10.41	4.4797	0.1700	0.45s
Perturbed-History Exploration (a=5.1)	10.40	4.4798	0.1900	1.12s
ReBoot (r=1.70)	10.40	4.4798	0.1800	0.22s
Softsatisficing (a=0.75)	10.38	4.4810	0.1900	0.40s
RS (a=0.75)	10.36	4.4819	0.1600	0.19s
EXP-IX	10.36	4.4822	0.1600	0.41s
ReBoot (r=2.10)	10.29	4.4854	0.1400	0.22s
RS (a=0.00)	10.28	4.4861	0.0000	0.13s
ETC (m=25)	10.27	4.4863	0.0000	0.14s
RAVEN-UCB (a0=5, b0=1, eps=0.1)	10.25	4.4873	0.1100	0.24s
RAVEN-UCB (a0=5, b0=1, eps=0.5)	10.25	4.4873	0.1100	0.24s
RS (a=0.10)	10.24	4.4881	0.0000	0.16s
UCB1	10.23	4.4883	0.1600	0.16s
Softsatisficing (a=0.50)	10.23	4.4884	0.0000	0.08s
RS (a=0.90)	10.22	4.4891	0.1100	0.17s
RAVEN-UCB (a0=5, b0=1, eps=0.001)	10.21	4.4893	0.1200	0.24s
Softsatisficing (a=0.90)	10.21	4.4897	0.1200	0.41s
Softsatisficing (a=0.99)	10.17	4.4914	0.1100	0.41s
FTPL-GR (lr=0.010)	10.14	4.4930	0.0900	4.00s
ETC (m=5)	10.11	4.4943	0.0000	0.16s
ETC (m=20)	10.11	4.4946	0.0000	0.15s
SoftElim (theta=0.10)	10.08	4.4958	0.0700	0.37s
Least Failures	10.08	4.4961	0.0900	0.06s
RS (a=0.99)	10.08	4.4961	0.0900	0.15s
Softsatisficing (a=0.25)	10.05	4.4976	0.0000	0.06s
ETC (m=2)	10.04	4.4982	0.4300	0.10s
ETC (m=3)	10.04	4.4982	0.4300	0.14s
FTPL-GR (lr=0.001)	10.02	4.4988	0.0400	4.03s
Random	10.02	4.4992	0.0500	0.02s
CODE (δ=0.050)	10.00	4.5000	0.0000	0.18s
RS (a=0.25)	10.00	4.5000	0.0000	0.17s
Softsatisficing (a=0.10)	10.00	4.5000	0.0000	0.04s
ETC (m=10)	9.94	4.5030	0.0000	0.14s

Beta

This experiment uses 10 arms. The arm means are sampled from a Beta(1, 8) distribution.

This experiment was taken from the paper Multiplier Bootstrap-based Exploration.

Results

Algorithm	%-Optimal	Regret (Mean)	Regret (Median Absolute Deviation)	Time
BayesUCB (δ=0.900)	57.80	20.8173	5.4766	0.20s
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	57.01	21.0752	5.5688	0.41s
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	56.78	21.0892	5.6266	0.41s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	55.98	21.1003	6.6151	0.27s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	55.97	21.1006	6.6257	0.26s
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	56.64	21.1021	5.7940	0.36s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	55.97	21.1268	6.5255	0.23s
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	55.99	21.1428	6.5261	0.25s
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	55.95	21.1445	6.5317	0.25s
Gittins Index -- Whittle's Approximation (β=0.70)	57.25	21.1454	5.2421	0.22s
Gittins Index -- Whittle's Approximation (β=0.50)	56.58	21.2012	5.5556	0.19s
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	57.41	21.2745	4.9635	0.44s
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	55.91	21.2843	6.6230	0.25s
Gittins Index -- Whittle's Approximation (β=0.90)	57.73	21.7090	4.9167	0.23s
IRS.FH (H=1)	55.74	21.8777	5.8971	1.19s
MOSS-Anytime (α=-0.85)	54.95	22.2898	5.6517	0.19s
Batch Ensemble for MAB (m=0)	57.67	22.3441	5.1453	0.06s
UCB-DT (γ=0.75)	54.64	22.4071	6.1492	2.44s
UCB-DT (γ=0.90)	54.45	22.4627	6.1571	2.42s
UCB-DT (γ=0.95)	54.39	22.4968	6.1852	2.49s
UCB-DT (γ=1.00)	53.32	22.6778	7.3649	2.59s
Thompson Sampling with Virtual Helping Agents (Combiner C3)	56.85	22.9838	7.1485	23.21s
IRS.FH (H=2)	56.33	23.0910	5.0660	1.28s
SoftElim (theta=8.00)	56.09	23.5078	5.9049	0.44s
CODE (δ=0.990)	51.11	23.5974	9.3932	0.17s
Gittins Index -- Whittle's Approximation (β=0.99)	56.84	23.7789	5.0319	0.23s
MOSS-Anytime (α=-0.50)	56.24	24.1465	4.0881	0.22s
BayesUCB (δ=0.500)	56.42	24.2684	5.5883	0.20s
IRS.FH (H=3)	55.61	24.5294	4.7675	1.37s
TS-UCB (100 samples)	56.12	24.7396	4.2715	72.32s
ReBoot (r=0.25)	52.26	24.7586	8.6759	0.19s
IRS.FH (H=4)	55.19	25.4645	4.9019	1.38s
BayesUCB (δ=0.400)	55.31	25.7435	5.7135	0.19s
RAVEN-UCB (a0=1, b0=5, eps=0.5)	54.87	26.1006	5.6929	0.26s
RAVEN-UCB (a0=1, b0=5, eps=0.1)	54.85	26.1238	5.6912	0.26s
MOSS-Anytime (α=-0.33)	54.80	26.1464	4.2098	0.22s
IRS.FH (H=5)	54.72	26.2315	4.9551	1.41s
TS-UCB (10 samples)	55.00	26.2465	4.2202	6.33s
RAVEN-UCB (a0=1, b0=5, eps=0.001)	54.65	26.3698	5.5242	0.26s
MARS (δ=0.100)	50.54	26.5911	7.3123	0.31s
Batch Ensemble for MAB (m=1)	53.77	26.8886	7.9552	0.10s
BayesUCB (δ=0.300)	53.93	27.5129	5.7981	0.20s
RS (a=0.25)	51.31	27.6540	15.4650	0.18s
IRS.FH (H=10)	53.19	28.6872	5.2676	1.43s
SoftElim (theta=4.00)	51.91	28.7964	6.3713	0.44s
UCBT	47.49	28.8558	8.0049	0.08s
ReBoot (r=0.50)	51.44	28.9633	6.3791	0.24s
MARS (δ=0.010)	51.25	29.2541	7.4360	1.36s
TS-UCB (1 samples)	52.69	29.2908	4.9082	0.83s
ϵ-Exploring TS-UCB (1 samples)	47.60	29.4661	9.1640	0.16s
Softsatisficing (a=0.25)	49.07	29.5243	16.9209	0.23s
ϵ-Exploring TS-UCB (10 samples)	47.35	29.8509	9.2614	0.78s
BayesUCB (δ=0.200)	51.93	29.8533	5.8074	0.20s
ϵ-Exploring TS-UCB (100 samples)	47.30	29.8928	9.1374	6.90s
RS (a=0.10)	44.30	29.9723	13.9283	0.16s
Hellinger-UCB	50.41	30.1850	5.4750	2.45s
ϵ-Decreasing (ϵ=0.500)	45.66	30.9426	10.3885	0.20s
Bootstrapped Thompson Sampling (J=10)	49.88	31.1623	6.5576	0.34s
Forced Exploration	48.86	31.4112	9.0715	0.09s
WR-SDA (forced_exploration=true)	45.51	31.4285	8.7057	2.16s
IRS.FH (H=25)	50.57	32.0301	5.6474	1.76s
Multiplier Bootstrap-based Exploration	49.17	32.4139	6.0942	5.86s
ϵ-Decreasing (ϵ=0.700)	42.88	32.7340	12.0469	0.19s
ϵ-Exploring Thompson Sampling	44.35	33.2297	12.5748	0.18s
ϵ-Greedy (ϵ=0.100)	44.10	33.2831	11.8153	0.13s
BayesUCB (δ=0.100)	48.69	33.6207	5.8840	0.17s
ϵ-Greedy (ϵ=0.050)	42.23	33.7998	13.3609	0.14s
UCB1-Tuned	48.22	34.0173	5.5690	0.18s
Bootstrapped Thompson Sampling (J=100)	47.17	34.7044	6.3147	1.03s
Garbage In, Reward Out (a=0.10)	46.53	35.1309	6.4203	0.76s
Bootstrapped Thompson Sampling (J=500)	46.87	35.1931	6.3344	4.91s
Bootstrapped Thompson Sampling (J=1000)	46.77	35.2492	6.3528	8.70s
Vanilla Residual Bootstrap (init=1)	46.87	35.3194	6.1483	0.29s
ϵ-Decreasing (ϵ=0.900)	40.31	35.3717	14.8777	0.19s
Optimistic Thompson Sampling	47.14	35.7522	6.0706	0.86s
ϵ-Greedy (ϵ=0.020)	39.94	35.9324	16.4079	0.14s
Softsatisficing (a=0.10)	39.16	36.0339	20.3466	0.06s
ϵ-Decreasing (ϵ=0.990)	39.46	36.2891	16.3213	0.20s
ReUCB (a=2.00)	39.22	36.5714	16.0590	0.97s
ReUCB (a=1.50)	39.18	36.6102	16.1550	0.99s
Vanilla Residual Bootstrap (init=0)	39.16	36.6560	16.2068	0.24s
ReUCB (a=1.00)	39.11	36.6897	16.2417	0.94s
ETC (m=5)	39.97	37.5465	17.0296	0.14s
MARS (δ=0.002)	41.98	37.6275	11.3986	5.06s
Thompson Sampling	45.13	37.7121	6.4934	0.80s
Weighted Bootstrap	45.12	37.7328	6.5247	3.26s
Satisficing Thompson Sampling (ϵ=0.005)	45.09	37.7390	6.4866	0.83s
ϵ-Greedy (ϵ=0.010)	38.41	37.7394	18.4671	0.11s
Satisficing Thompson Sampling (ϵ=0.010)	44.99	37.8316	6.5136	0.87s
KL-UCB	44.75	37.9754	5.7666	7.94s
SoftElim (theta=2.00)	43.27	38.7769	7.2693	0.49s
MARS (δ=0.001)	39.54	38.9137	13.8711	10.13s
ETC (m=10)	40.33	38.9869	13.6763	0.14s
Non-Parametric Thompson Sampling	43.90	39.3507	6.7160	4.76s
Greedy	36.66	39.9099	21.7087	0.10s
Bounded Dirichlet Sampling	43.54	39.9645	6.6454	2.69s
CODE (δ=0.900)	40.61	40.2050	13.2482	0.19s
ReBoot (r=0.90)	42.54	40.7881	7.1968	0.23s
Satisficing Thompson Sampling (ϵ=0.050)	41.59	40.8984	7.1496	0.97s
WR-SDA (forced_exploration=false)	36.73	41.0443	19.6162	1.85s
POKER (H=100)	36.39	41.3770	22.1022	0.26s
POKER (H=250)	36.40	41.3832	21.9572	0.26s
POKER (H=50)	36.31	41.4064	22.3819	0.26s
POKER (H=25)	36.24	41.4405	22.6058	0.26s
Kullback-Leibler Maillard Sampling	40.83	41.4463	7.5405	0.48s
POKER (H=10)	36.05	41.5399	23.0560	0.26s
POKER (H=5)	35.92	41.6486	23.3861	0.25s
POKER (H=1)	35.72	41.8819	23.9184	0.23s
VarTS	44.32	41.9995	7.9769	0.45s
Perturbed-History Exploration (a=1.1)	40.79	42.7866	7.3646	0.94s
ReBoot (r=1.00)	40.65	43.3432	7.6618	0.23s
ϵ-Decreasing (ϵ=0.200)	41.58	44.0842	11.9547	0.18s
Garbage In, Reward Out (a=0.33)	38.75	45.1922	7.8091	1.03s
ETC (m=3)	33.51	45.7840	28.1017	0.14s
Delightful GB (lr=2.000)	36.05	46.5685	9.5762	0.48s
ETC (m=20)	37.94	47.1134	13.4466	0.14s
Satisficing Thompson Sampling (ϵ=0.100)	33.63	48.2505	9.6052	0.99s
lil' UCB (δ=0.100)	36.34	48.6046	7.3285	0.31s
Batch Ensemble for MAB (m=2)	35.22	48.8043	9.6369	0.21s
Delightful GB (lr=1.000)	33.35	50.3568	10.1421	0.46s
ETC (m=25)	37.82	51.7141	13.9357	0.14s
SoftElim (theta=1.00)	33.15	51.8623	8.8820	0.49s
ETC (m=2)	29.53	53.1694	28.6333	0.10s
ReBoot (r=1.50)	33.20	53.6329	10.8113	0.23s
Perturbed-History Exploration (a=2.1)	32.69	53.7141	9.5801	1.10s
Tsallis-INF	32.39	54.7917	11.3371	1.00s
Gradient Bandit	33.08	54.8612	11.0784	0.44s
Gradient Bandit (with baseline)	32.39	55.7285	11.4145	0.49s
Vanilla Residual Bootstrap (init=5)	31.29	56.3702	10.3022	0.31s
ReBoot (r=1.70)	30.99	56.8129	12.0371	0.22s
Delightful GB (lr=0.500)	29.44	57.0335	11.8970	0.44s
Garbage In, Reward Out (a=1.00)	29.48	58.3347	11.5697	1.29s
Boltzmann-Gumbel Exploration	29.89	58.4917	11.5794	0.32s
lil' UCB (δ=0.010)	29.25	58.7242	11.2953	0.31s
EB-TCI	24.42	59.0388	22.7179	0.30s
ϵ-Decreasing (ϵ=0.100)	30.85	61.5675	16.7498	0.09s
ReBoot (r=2.10)	27.57	61.9376	14.1671	0.23s
Batch Ensemble for MAB (m=4)	26.13	62.3496	14.4520	0.34s
RS (a=0.00)	21.30	64.4828	45.7779	0.12s
lil' UCB (δ=0.001)	25.41	64.5631	14.7753	0.28s
SoftElim (theta=0.50)	24.23	65.1032	13.5526	0.44s
FTPL-GR (lr=0.100)	24.27	65.5709	16.0043	4.01s
Batch Ensemble for MAB (m=8)	21.62	67.3453	17.1377	0.52s
MARS (δ=1.000)	21.56	67.5569	47.1317	0.09s
Perturbed-History Exploration (a=5.1)	23.50	67.6540	15.6673	1.01s
RS (a=0.50)	22.27	69.0806	16.5670	0.21s
Softsatisficing (a=0.50)	22.04	69.2504	16.5095	0.41s
UCB1	22.31	69.6096	17.0817	0.18s
FTPL-GR (lr=1.000)	18.50	75.4350	20.0029	4.07s
RAVEN-UCB (a0=5, b0=1, eps=0.001)	18.85	75.5057	19.3704	0.27s
RAVEN-UCB (a0=5, b0=1, eps=0.5)	18.84	75.5100	19.3704	0.29s
RAVEN-UCB (a0=5, b0=1, eps=0.1)	18.84	75.5124	19.3704	0.28s
EXP-IX	17.71	77.5879	20.1208	0.46s
Softsatisficing (a=0.65)	16.04	80.3949	20.9280	0.42s
RS (a=0.65)	16.07	80.4757	21.2548	0.18s
FTPL-GR (lr=0.010)	14.62	83.6299	22.2985	4.07s
RS (a=0.75)	14.49	83.6745	22.3955	0.19s
Softsatisficing (a=0.75)	14.41	83.6750	22.5529	0.42s
SoftElim (theta=0.10)	13.20	85.3669	22.8842	0.42s
RS (a=0.90)	13.27	86.1578	23.4025	0.18s
Softsatisficing (a=0.90)	13.19	86.1783	23.4203	0.43s
Least Failures	12.90	86.8718	23.7396	0.09s
RS (a=0.99)	12.90	86.8720	23.7404	0.18s
Softsatisficing (a=0.99)	12.75	87.0894	23.7866	0.44s
FTPL-GR (lr=0.001)	10.39	92.2580	25.9058	3.99s
Random	9.99	93.1436	26.0904	0.02s
CODE (δ=0.050)	10.00	93.1468	25.9588	0.13s

Reverse Beta

This experiment uses 10 arms. The arm means are sampled from a Beta(8, 1) distribution.

I added this to see which algorithms are affected by rewards close to 1 instead of close to 0.

Results

Algorithm	%-Optimal	Regret (Mean)	Regret (Median Absolute Deviation)	Time
TS-UCB (100 samples)	57.88	6.9424	2.3169	63.34s
TS-UCB (10 samples)	57.81	7.2576	2.1382	6.63s
RS (a=0.90)	44.63	7.3959	3.7683	0.16s
POKER (H=5)	55.21	7.8042	1.8429	0.32s
TS-UCB (1 samples)	57.78	7.8597	1.9439	0.86s
Batch Ensemble for MAB (m=0)	56.24	7.8614	1.7577	0.05s
POKER (H=1)	56.13	8.0894	1.7428	0.29s
POKER (H=10)	48.12	8.3258	2.2912	0.30s
UCB-DT (γ=1.00)	55.66	8.4709	1.4837	2.58s
IRS.FH (H=2)	54.85	8.4788	1.5053	1.25s
ϵ-Exploring TS-UCB (100 samples)	54.56	8.4801	1.5113	7.25s
IRS.FH (H=3)	55.12	8.4888	1.5141	1.28s
ϵ-Exploring TS-UCB (10 samples)	54.54	8.5219	1.5144	0.83s
IRS.FH (H=1)	54.33	8.5340	1.5095	1.19s
IRS.FH (H=4)	55.12	8.5479	1.5428	1.28s
UCB-DT (γ=0.90)	55.84	8.5612	1.5143	2.55s
ϵ-Exploring TS-UCB (1 samples)	54.46	8.5696	1.5205	0.15s
UCB-DT (γ=0.95)	55.76	8.5784	1.5121	2.53s
IRS.FH (H=5)	55.41	8.5928	1.5664	1.33s
ReUCB (a=1.50)	54.41	8.5966	1.5037	0.99s
ReUCB (a=2.00)	54.44	8.5966	1.5062	1.00s
ReUCB (a=1.00)	54.34	8.6079	1.5056	0.94s
Greedy	53.82	8.6471	1.5273	0.09s
UCB-DT (γ=0.75)	55.91	8.6739	1.5558	2.56s
ϵ-Decreasing (ϵ=0.990)	53.57	8.8897	1.5610	0.18s
Batch Ensemble for MAB (m=1)	51.73	8.9409	4.0595	0.09s
IRS.FH (H=10)	55.40	8.9571	1.7240	1.42s
ϵ-Decreasing (ϵ=0.900)	53.45	9.0268	1.5815	0.18s
Gittins Index -- Brezzi and Lai's Approximation (β=0.99)	39.47	9.0345	4.6410	0.42s
BayesUCB (δ=0.200)	43.36	9.0488	4.8228	0.19s
ϵ-Greedy (ϵ=0.010)	53.34	9.0561	1.5947	0.10s
Optimistic Thompson Sampling	54.95	9.3835	3.1222	0.87s
ϵ-Greedy (ϵ=0.020)	52.84	9.4814	1.6900	0.14s
CODE (δ=0.990)	48.81	9.4822	1.7486	0.18s
ϵ-Decreasing (ϵ=0.700)	52.78	9.6599	1.7204	0.17s
IRS.FH (H=25)	54.09	10.0328	2.1203	1.39s
WR-SDA (forced_exploration=true)	52.76	10.1746	2.5789	0.84s
WR-SDA (forced_exploration=false)	52.23	10.2123	2.7845	0.68s
POKER (H=50)	41.34	10.6033	2.7010	0.29s
POKER (H=25)	41.02	10.6190	2.7176	0.29s
POKER (H=100)	41.56	10.6923	2.7973	0.29s
ϵ-Greedy (ϵ=0.050)	51.57	10.7107	1.9386	0.14s
POKER (H=250)	42.10	10.8128	2.8381	0.30s
ϵ-Exploring Thompson Sampling	45.36	10.8543	4.0462	0.18s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.5)	49.54	10.9591	1.8509	0.26s
MOSS-Anytime (α=-0.85)	50.92	11.2098	1.9147	0.18s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.1)	48.79	11.4276	1.8858	0.26s
ϵ-Decreasing (ϵ=0.500)	50.84	11.5088	2.0720	0.15s
KL-UCB	51.49	11.6751	3.5785	6.36s
RS (a=0.99)	52.43	12.1155	6.1302	0.16s
Softsatisficing (a=0.90)	28.86	12.3245	7.4167	0.08s
Thompson Sampling	48.52	12.4232	2.7600	0.87s
Weighted Bootstrap	48.57	12.4424	2.7624	2.94s
RAVEN-UCB (a0=0.5, b0=0.5, eps=0.001)	47.05	12.5486	1.8639	0.23s
Thompson Sampling with Virtual Helping Agents (Combiner C3)	36.81	12.6008	4.1699	17.73s
Satisficing Thompson Sampling (ϵ=0.005)	47.58	12.6638	2.7907	0.86s
ϵ-Greedy (ϵ=0.100)	49.34	12.7974	2.3308	0.13s
RAVEN-UCB (a0=0.5, b0=10, eps=0.5)	46.61	12.9612	1.8873	0.27s
Satisficing Thompson Sampling (ϵ=0.010)	45.49	13.1710	2.8249	0.89s
Batch Ensemble for MAB (m=2)	41.46	13.3774	9.4515	0.11s
Non-Parametric Thompson Sampling	47.46	13.6038	4.3455	4.65s
RAVEN-UCB (a0=0.5, b0=10, eps=0.1)	45.57	13.8469	1.9559	0.26s
BayesUCB (δ=0.300)	26.75	13.8625	8.5920	0.19s
Forced Exploration	48.30	13.9900	2.5181	0.08s
Least Failures	47.50	14.3230	6.3554	0.07s
Softsatisficing (a=0.99)	47.05	14.4903	8.0153	0.25s
Bounded Dirichlet Sampling	45.58	14.5418	4.6561	2.36s
FTPL-GR (lr=1.000)	44.10	14.7489	2.7172	3.83s
SoftElim (theta=8.00)	38.46	15.0256	2.2659	0.41s
BayesUCB (δ=0.100)	37.27	15.0673	2.3660	0.16s
Kullback-Leibler Maillard Sampling	43.53	15.1294	5.1731	0.53s
MOSS-Anytime (α=-0.50)	44.06	15.3933	2.1697	0.22s
Hellinger-UCB	43.81	15.5306	5.4955	1.54s
RAVEN-UCB (a0=0.5, b0=10, eps=0.001)	44.04	15.6325	2.3208	0.26s
VarTS	44.55	15.6709	2.3704	0.42s
MOSS-Anytime (α=-0.33)	40.99	17.0540	2.2724	0.22s
MARS (δ=1.000)	33.17	17.3184	5.7422	0.09s
Gittins Index -- Whittle's Approximation (β=0.99)	21.93	17.3442	10.7921	0.22s
UCBT	32.33	18.1863	6.0728	0.07s
RAVEN-UCB (a0=1, b0=5, eps=0.5)	32.80	18.2035	2.8245	0.28s
Batch Ensemble for MAB (m=4)	25.45	18.2890	11.3150	0.13s
RAVEN-UCB (a0=1, b0=5, eps=0.1)	32.55	18.5572	2.8832	0.28s
MARS (δ=0.100)	35.28	18.7945	6.0606	0.33s
Satisficing Thompson Sampling (ϵ=0.050)	23.97	18.9613	5.2597	1.04s
SoftElim (theta=4.00)	29.57	19.3858	3.0999	0.46s
BayesUCB (δ=0.400)	19.76	19.4237	12.4626	0.19s
EB-TCI	35.93	19.7395	5.2415	0.30s
ReBoot (r=0.25)	34.89	19.9697	3.1894	0.18s
RAVEN-UCB (a0=1, b0=5, eps=0.001)	31.26	20.1089	3.2838	0.28s
Gittins Index -- Whittle's Approximation (β=0.90)	19.03	20.6196	13.1752	0.22s
Vanilla Residual Bootstrap (init=0)	33.47	21.5207	3.2212	0.25s
Multiplier Bootstrap-based Exploration	28.45	22.2710	3.5416	5.48s
ETC (m=20)	33.55	22.3233	4.2529	0.16s
ETC (m=10)	27.09	22.3539	6.4168	0.17s
ReBoot (r=0.50)	30.87	22.5161	3.8147	0.22s
ϵ-Decreasing (ϵ=0.200)	38.74	22.6530	4.3599	0.15s
UCB1-Tuned	25.07	22.9077	3.4824	0.17s
Vanilla Residual Bootstrap (init=1)	31.19	23.1578	3.3412	0.25s
Tsallis-INF	26.30	23.2635	4.3108	0.95s
MARS (δ=0.010)	26.79	23.5506	7.7650	1.35s
Vanilla Residual Bootstrap (init=5)	30.48	23.8825	3.5154	0.26s
Garbage In, Reward Out (a=0.10)	26.82	23.9510	3.8778	0.83s
MARS (δ=0.001)	23.58	24.5993	9.9033	8.90s
Batch Ensemble for MAB (m=8)	19.96	24.6930	17.4961	0.17s
Perturbed-History Exploration (a=1.1)	24.17	24.8624	4.3134	0.94s
SoftElim (theta=2.00)	21.91	24.9126	4.1359	0.40s
MARS (δ=0.002)	23.66	25.1527	9.2101	4.85s
RS (a=0.75)	16.69	25.2082	15.9762	0.16s
BayesUCB (δ=0.500)	15.86	25.3034	16.2815	0.20s
FTPL-GR (lr=0.100)	26.18	25.4951	4.2832	4.01s
Satisficing Thompson Sampling (ϵ=0.100)	12.50	26.5896	8.8139	0.99s
ETC (m=25)	28.64	27.0247	5.2417	0.15s
CODE (δ=0.900)	16.26	27.7259	4.4425	0.20s
Garbage In, Reward Out (a=0.33)	21.22	28.0093	4.7583	1.36s
Bootstrapped Thompson Sampling (J=500)	17.05	28.0954	16.5475	4.31s
Bootstrapped Thompson Sampling (J=1000)	16.11	28.1867	16.6249	8.81s
ReBoot (r=0.90)	24.08	28.2376	5.0547	0.22s
Delightful GB (lr=1.000)	18.81	28.2884	8.7302	0.50s
Delightful GB (lr=2.000)	17.79	28.4145	12.2193	0.48s
lil' UCB (δ=0.100)	19.19	28.5694	4.7509	0.30s
Bootstrapped Thompson Sampling (J=100)	15.82	29.0489	16.7117	1.11s
ReBoot (r=1.00)	22.53	29.7884	5.3791	0.23s
Softsatisficing (a=0.75)	14.24	30.0109	19.0522	0.08s
Bootstrapped Thompson Sampling (J=10)	15.12	30.0861	17.7177	0.36s
Gittins Index -- Whittle's Approximation (β=0.70)	14.69	30.0926	19.2525	0.22s
Perturbed-History Exploration (a=2.1)	18.72	30.3983	5.2058	1.04s
SoftElim (theta=1.00)	16.81	30.6976	5.1339	0.41s
Delightful GB (lr=0.500)	17.18	31.8160	7.6577	0.45s
lil' UCB (δ=0.010)	16.72	32.2288	5.5208	0.30s
Garbage In, Reward Out (a=1.00)	17.26	32.4632	5.6672	1.24s
ϵ-Decreasing (ϵ=0.100)	27.51	32.6383	6.3517	0.10s
Boltzmann-Gumbel Exploration	17.44	32.7460	5.6438	0.33s
Gradient Bandit	18.28	33.0070	6.4322	0.43s
Gittins Index -- Whittle's Approximation (β=0.50)	13.79	34.0919	21.5379	0.19s
lil' UCB (δ=0.001)	15.51	34.2797	5.8524	0.27s
Gradient Bandit (with baseline)	17.93	34.4364	5.7630	0.45s
EXP-IX	15.62	34.8327	6.2311	0.43s
ReBoot (r=1.50)	18.20	35.2644	6.5573	0.28s
Perturbed-History Exploration (a=5.1)	15.29	35.4831	6.2519	1.00s
SoftElim (theta=0.50)	13.95	35.6632	6.1276	0.40s
UCB1	14.55	36.1248	6.3580	0.18s
ReBoot (r=1.70)	17.25	36.7828	6.9301	0.26s
RS (a=0.65)	12.54	36.9123	23.2223	0.17s
Gittins Index -- Brezzi and Lai's Approximation (β=0.95)	12.48	37.2813	23.4074	0.40s
Softsatisficing (a=0.65)	12.05	38.0281	23.7792	0.07s
Gittins Index -- Brezzi and Lai's Approximation (β=0.9)	12.42	38.3118	23.9070	0.40s
RAVEN-UCB (a0=5, b0=1, eps=0.5)	13.50	38.3221	6.8900	0.30s
RAVEN-UCB (a0=5, b0=1, eps=0.1)	13.50	38.3307	6.8949	0.32s
RAVEN-UCB (a0=5, b0=1, eps=0.001)	13.47	38.4149	6.9164	0.28s
Gittins Index -- Brezzi and Lai's Approximation (β=0.8)	12.38	38.6260	24.0049	0.37s
BayesUCB (δ=0.900)	12.29	38.7864	24.1887	0.19s
ReBoot (r=2.10)	15.90	39.2124	7.5247	0.28s
ETC (m=5)	12.36	41.7571	9.1900	0.19s
RS (a=0.50)	11.80	42.0199	25.5228	0.17s
FTPL-GR (lr=0.010)	12.28	42.4732	8.0136	3.98s
ETC (m=3)	12.03	43.5920	9.6906	0.18s
Softsatisficing (a=0.50)	11.18	43.6892	26.5617	0.07s
SoftElim (theta=0.10)	11.05	43.9424	8.1715	0.35s
ETC (m=2)	11.03	45.2564	9.3287	0.13s
RS (a=0.25)	11.31	46.4787	27.3045	0.16s
Softsatisficing (a=0.25)	11.03	46.8837	27.6280	0.06s
Softsatisficing (a=0.10)	10.99	47.1460	27.6826	0.04s
RS (a=0.10)	11.22	47.2198	27.6594	0.16s
RS (a=0.00)	11.13	47.2636	27.6526	0.12s
FTPL-GR (lr=0.001)	10.23	48.4041	9.6683	4.04s
CODE (δ=0.050)	10.00	49.2639	9.8811	0.20s
Random	9.99	49.2870	10.0029	0.02s

Notes / Conclusions

I have probably made mistakes while translating the formulas from the papers into code – KL-UCB, PHE, GIRO, etc. are doing much worse than I expected.
Keep in mind that the experiments were on the Bernoulli Bandit with a short horizon only. Many of the algorithms also work on other kinds of bandits and may rank differently there.
Greedy algorithms seem to do well on a short horizon in general, but fall behind after a few hundred steps.
UCB-DT does very well; the tuning parameter can be adjusted for the length of the horizon. More greedy settings (as used above) seem to do well over short time horizons.
ϵ-Exploring Thompson Sampling seems to match or exceed Thompson Sampling while being computationally much lighter.
Augmenting TS-UCB with the ϵ-Exploring policy speeds up TS-UCB by a lot, but trades off a bit of performance. For reasons I'm not entirely sure of, fewer samples seem to perform just as well or better than the variants with more samples.

Name		Name	Last commit message	Last commit date
Latest commit History 135 Commits
experiments		experiments
src		src
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
TASKS.md		TASKS.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bandit Bench

Algorithms

Baselines

Bootstrap-based

Dueling-based

Approximations of the Gittins Index

Thompson Sampling-based

Upper Confidence Bound-based

Other

Results

Data

Uniform

Half-Range

Hard

Beta

Reverse Beta

Notes / Conclusions

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Bandit Bench

Algorithms

Baselines

Bootstrap-based

Dueling-based

Approximations of the Gittins Index

Thompson Sampling-based

Upper Confidence Bound-based

Other

Results

Data

Uniform

Half-Range

Hard

Beta

Reverse Beta

Notes / Conclusions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages