## Stateless Problem: Thompson Sampling

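Thompson sampling for the slot-machine (multi-armed bandit) problem: keep a Beta distribution over each machine's win probability, draw one sample per machine, and play the machine with the largest sample.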

Lemma: the Beta distribution. For $x \in [0, 1]$ and $\alpha, \beta > 0$, its density is
$$
\begin{aligned}
\mathcal{B}(x; \alpha, \beta) &= C x^{\alpha - 1} (1 - x)^{\beta - 1} \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \\
&= \frac{1}{\mathrm{B}(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}
\end{aligned}
$$

where $\Gamma$ is the Euler Gamma function:
$$
\Gamma(x) = \int_0^{+\infty} t^{x - 1} e^{-t} \, dt \quad (x > 0)
$$
If $x \in \mathbb{N}^+$, then $\Gamma(x) = (x - 1)!$
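
A quick numerical check of the Gamma function and of the normalizing constant $C$ can be sketched with the standard library's `math.gamma`. The helper name `beta_norm` is an illustrative assumption, not part of the notebook.

```python
import math

# For positive integers n, Gamma(n) equals (n - 1)!.
for n in range(1, 8):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

def beta_norm(alpha, beta):
    """Normalizing constant C = Gamma(a + b) / (Gamma(a) * Gamma(b))."""
    return math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))

print(beta_norm(1, 1))  # Beta(1, 1) is the uniform density on [0, 1], so C = 1
print(beta_norm(2, 3))  # Gamma(5) / (Gamma(2) * Gamma(3)) = 24 / 2 = 12
```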

For a random variable $X \sim \mathcal{B}(\alpha, \beta)$, $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$ and $\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$.
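
The mean and variance formulas can be compared against samples. The following is a minimal sketch; the parameters $(3, 7)$, the seed, and the helper `beta_mean_var` are arbitrary choices made for illustration.

```python
import numpy as np

def beta_mean_var(alpha, beta):
    """Closed-form mean and variance of a Beta(alpha, beta) variable."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
alpha, beta = 3.0, 7.0
samples = rng.beta(alpha, beta, size=200_000)

mean, var = beta_mean_var(alpha, beta)
print("mean: formula", mean, "empirical", samples.mean())
print("var:  formula", var, "empirical", samples.var())
```

With this many samples the empirical values should land close to the closed-form ones, though the exact digits depend on the seed.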

In [1]:
import numpy as np

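# Win probability of each slot machine, drawn uniformly; 10 machines in total
probs = np.random.uniform(size=10)

# Record the rewards returned by each machine (each list starts with a single 1)
rewards = [[1] for _ in range(10)]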

probs, rewards

(array([0.74421478, 0.26813075, 0.92777303, 0.5701738 , 0.37258173,
        0.22315075, 0.38718998, 0.79308439, 0.72092343, 0.6306549 ]),
 [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1]])

In [2]:
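# Beta distribution test
print('With small parameters, Beta samples are highly random')
for _ in range(5):
  print(np.random.beta(1, 1))

print('With large parameters, Beta samples become stable')
for _ in range(5):
  print(np.random.beta(1e5, 1e5))

With small parameters, Beta samples are highly random
0.9133636940271159
0.728337316337896
0.642989119991121
0.963628999183373
0.3425361414062037
With large parameters, Beta samples become stable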
0.5023018584560439
0.49988173809382147
0.49816430260921796
0.49808291204244937
0.499978411763636
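
The difference in spread between the two runs follows from the Beta variance formula. As a worked check, with $\sigma$ denoting the standard deviation (notation introduced here for convenience):
$$
\begin{aligned}
\mathcal{B}(x; 1, 1): \quad \sigma^2 &= \frac{1 \cdot 1}{(1 + 1)^2 (1 + 1 + 1)} = \frac{1}{12}, & \sigma &\approx 0.289 \\
\mathcal{B}(x; 10^5, 10^5): \quad \sigma^2 &= \frac{10^{10}}{(2 \cdot 10^5)^2 (2 \cdot 10^5 + 1)} = \frac{1}{4 (2 \cdot 10^5 + 1)} \approx 1.25 \times 10^{-6}, & \sigma &\approx 1.1 \times 10^{-3}
\end{aligned}
$$
Both distributions have mean $0.5$, but the second one puts nearly all of its mass within a few thousandths of it, which matches the samples printed above.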


In [3]:
import random

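# Thompson sampling
def choose_one():
  # Number of wins per machine, plus 1 (Beta parameter alpha)
  count_1 = [sum(i) + 1 for i in rewards]

  # Number of losses per machine, plus 1 (Beta parameter beta)
  count_0 = [sum(1 - np.array(i)) + 1 for i in rewards]

  # Draw one sample per machine from Beta(count_1, count_0)
  # as an estimate of that machine's win probability
  beta = np.random.beta(count_1, count_0)

  # Play the machine with the largest sampled value
  return beta.argmax()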

choose_one()

2
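
One way to read the `+ 1` terms in `choose_one` is as a uniform $\mathcal{B}(x; 1, 1)$ prior on each machine's win probability $x$; this interpretation is an assumption, since the notebook does not state it. Under that assumption, after observing $s$ wins and $f$ losses (symbols introduced here), Bayes' rule gives:
$$
\begin{aligned}
p(x \mid s, f) &\propto x^{s} (1 - x)^{f} \cdot \mathcal{B}(x; 1, 1) \\
&\propto x^{(s + 1) - 1} (1 - x)^{(f + 1) - 1}
\end{aligned}
$$
which, once normalized, is $\mathcal{B}(x; s + 1, f + 1)$. Here $s$ and $f$ correspond to the number of 1s and 0s in a machine's reward list, so `count_1` and `count_0` play the roles of $\alpha$ and $\beta$. Since every list is seeded with a single 1, $s$ includes that initial entry as well.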

In [4]:
def try_and_play():
  i = choose_one()
  
  # Play machine i: it pays out 1 with probability probs[i], otherwise 0
  reward = 0
  if random.random() < probs[i]:
    reward = 1

  # Record the result
  rewards[i].append(reward)
  
try_and_play()

rewards

[[1], [1, 0], [1], [1], [1], [1], [1], [1], [1], [1]]

In [5]:
def get_result():
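  # Play 5000 rounds
  for _ in range(5000):
    try_and_play()

  # Best expected result: always playing the machine with the highest win probability
  target = probs.max() * 5000

  # Actual total reward (also counts the 10 seeded 1s and any earlier trial plays)
  result = sum([sum(i) for i in rewards])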
  
  return target, result

get_result()

(4638.865153582355, 4612)
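
The cells above share the global `probs` and `rewards`, so rerunning them changes the totals. As a rough self-contained sketch of the same loop, the function below keeps its state local and reports how often each machine was played. The name `run_thompson`, the three fixed probabilities, and the seed are illustrative assumptions and not part of the notebook; unlike the notebook, it starts from a plain $\mathcal{B}(x; 1, 1)$ prior without the seeded win.

```python
import numpy as np

def run_thompson(probs, n_plays, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return (wins, losses) per machine."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    wins = np.zeros(len(probs))
    losses = np.zeros(len(probs))
    for _ in range(n_plays):
        # One draw per machine from Beta(wins + 1, losses + 1).
        samples = rng.beta(wins + 1, losses + 1)
        arm = int(samples.argmax())
        # The chosen machine pays out with its true probability.
        if rng.random() < probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

probs = [0.2, 0.5, 0.9]  # hypothetical win probabilities
n_plays = 2000
wins, losses = run_thompson(probs, n_plays)

pulls = wins + losses                # how often each machine was played
target = max(probs) * n_plays        # expected reward of always playing the best machine
regret = target - wins.sum()         # shortfall relative to that target
print("pulls per machine:", pulls)
print("target:", target, "result:", wins.sum(), "regret:", regret)
```

With gaps this large between the machines, most plays should go to the best one, so the total reward stays close to the target.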