# Upper Confidence Bound (UCB) 

The Upper Confidence Bound (UCB) algorithm is often phrased as “optimism in the face of uncertainty.” That is, if a bandit has the potential to be the best, we should try it. As a bandit is tried more and more, the uncertainty around the estimate of its payout shrinks. The algorithm allocates exploratory effort to actions that might be optimal and is in this sense "optimistic."

Practically, one wants to limit this optimism to what is **plausibly possible**. The optimism is limited by a hyperparameter, which should be tuned and adjusted.

So what do we mean by plausible?  

Recall the setting: let $X_1, X_2, \ldots, X_n$ be independent and 1-sub-Gaussian, which in particular means that $E[X_i] = 0$, and let $\hat \mu = \sum_{t=1}^n X_t / n$ be their sample mean.

Note that a sub-Gaussian is a probability distribution with strong tail decay. Informally, the tails of a sub-Gaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. 

This knowledge allows one to use a Hoeffding bound for the uncertainty. In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount.


\begin{align*}
\Pr(\hat \mu \geq \epsilon) \leq \exp\left(-n\epsilon^2 / 2\right)\,.
\end{align*}

For more detail on sub-Gaussian variables and Hoeffding bounds, see https://en.wikipedia.org/wiki/Hoeffding%27s_inequality



Setting the right-hand side of the bound equal to $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\frac{2}{n} \ln\left(\frac{1}{\delta}\right)}$, which leads to


\begin{align*}
\Pr\left(\hat \mu \geq \sqrt{\frac{2}{n} \ln\left(\frac{1}{\delta}\right)}\right) \leq \delta\,.
\end{align*}
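As a rough numerical check of this bound, the sketch below computes the width $\sqrt{\frac{2}{n}\ln\left(\frac{1}{\delta}\right)}$ and estimates by simulation how often the sample mean of standard normal draws exceeds it. The function names `confidence_width` and `empirical_tail` are hypothetical, and the standard normal is only one convenient example of a 1-sub-Gaussian distribution.

```python
import math
import random


def confidence_width(n, delta):
    """Width epsilon = sqrt((2 / n) * ln(1 / delta)) from the bound above."""
    return math.sqrt(2.0 / n * math.log(1.0 / delta))


def empirical_tail(n, delta, trials=2000, seed=0):
    """Fraction of simulated sample means that reach or exceed the width.

    Assumption: samples are standard normal, which is 1-sub-Gaussian.
    """
    rng = random.Random(seed)
    eps = confidence_width(n, delta)
    exceed = 0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if mu_hat >= eps:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    print(confidence_width(100, 0.05))
    print(empirical_tail(50, 0.1))
```

The simulated frequency should come out at or below $\delta$; for Gaussian samples it is typically well below, since the bound is not tight.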


Suppose that in round $t$ the learner has observed $T_i(t-1)$ samples from bandit $i$, with an empirical mean reward of $\hat \mu_i(t-1)$ for that bandit. Applying the bound above to the deviation of the empirical mean from the true mean, the largest plausible value of the mean for bandit $i$ is

\begin{align*}
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \ln\left(\frac{1}{\delta}\right)}\,.
\end{align*}

Then the algorithm chooses the action $i$ that maximizes the above quantity. If the hyperparameter $\delta$ is chosen very small, the bonus term is large and the algorithm is more optimistic; if $\delta$ is large, the bonus is smaller and the upper bound holds with less certainty. The value of $1-\delta$ is called the *confidence level*.
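The selection rule can be sketched in a few lines of Python. The names `ucb_index` and `select_arm` are hypothetical, and giving an untried arm an infinite index (so that it is tried first) is an assumed convention that the text above does not specify.

```python
import math


def ucb_index(mean, count, delta):
    """Empirical mean plus the bonus sqrt((2 / count) * ln(1 / delta))."""
    if count == 0:
        # Assumed convention: an arm with no samples is maximally plausible.
        return math.inf
    return mean + math.sqrt(2.0 / count * math.log(1.0 / delta))


def select_arm(means, counts, delta):
    """Return the position of the arm with the largest index."""
    indices = [ucb_index(m, c, delta) for m, c in zip(means, counts)]
    return max(range(len(indices)), key=indices.__getitem__)


if __name__ == "__main__":
    # Arm 1 has a lower mean but far fewer samples, so its bonus dominates.
    print(select_arm([0.5, 0.4], [100, 2], delta=0.05))
    # With equal sample counts the higher empirical mean wins.
    print(select_arm([0.5, 0.4], [100, 100], delta=0.05))
```

In the first call the rarely sampled arm is chosen despite its lower empirical mean, which is the optimism described above.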


An alternate form, in which $\delta$ is not set explicitly and the size of the bonus is instead scaled by a constant usually called $c$, is commonly used as well:


\begin{align*}
\hat \mu_i + c\sqrt{\frac{\ln(n)}{n_i}}\,,
\end{align*}


where $\hat \mu_i$ is the current average reward of bandit $i$, $n$ is the total number of trials over all bandits so far, and $n_i$ is the number of pulls given to bandit $i$.


Note the following:

* The bonus in the upper bound is proportional to the square root of $\ln(n)$, so as the experiment progresses the upper bound of every arm that is not being pulled keeps growing, although only slowly.
* The bonus is inversely proportional to the square root of $n_i$. The more often a specific arm has been pulled, the closer its upper bound shrinks toward the point estimate.
* Note that in this form $\ln(n)$ takes the place of $\ln \frac{1}{\delta}$, so this term changes over the course of the agent's learning.
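A small numerical illustration of these points, as a sketch: the function name `exploration_bonus` is hypothetical, and the default $c = \sqrt{2}$ is an assumed value chosen only for the example.

```python
import math


def exploration_bonus(n, n_i, c=math.sqrt(2)):
    """Bonus c * sqrt(ln(n) / n_i) added to the empirical mean of an arm."""
    return c * math.sqrt(math.log(n) / n_i)


# Bonus of an arm stuck at 10 pulls while the total trial count grows.
stale = [exploration_bonus(n, 10) for n in (100, 1_000, 10_000)]

# Bonus at a fixed total of 10_000 trials as the arm itself is pulled more.
fresh = [exploration_bonus(10_000, n_i) for n_i in (10, 100, 1_000)]

if __name__ == "__main__":
    print(stale)  # slowly increasing
    print(fresh)  # decreasing
```

The first list grows slowly with $n$, while the second shrinks quickly with $n_i$, so an arm's bound tightens when it is pulled and loosens gradually when it is ignored.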

The UCB algorithm always picks the arm with the highest plausible reward, that is, the highest upper confidence bound as given by either equation above.
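Putting the pieces together, the following is a minimal simulation sketch using the second form of the bound. Several details are assumptions rather than part of the description above: the rewards are Bernoulli, each arm is played once before the index is used (so that $n_i > 0$), $c$ defaults to $\sqrt{2}$, and the name `run_ucb` is hypothetical.

```python
import math
import random


def run_ucb(true_means, horizon, c=math.sqrt(2), seed=0):
    """Simulate UCB on Bernoulli arms; return pull counts and mean estimates."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # n_i: number of pulls of each arm
    means = [0.0] * k     # running average reward of each arm

    for t in range(1, horizon + 1):
        if t <= k:
            # Assumed initialisation: play every arm once.
            arm = t - 1
        else:
            n = t - 1  # total number of trials so far
            arm = max(
                range(k),
                key=lambda i: means[i] + c * math.sqrt(math.log(n) / counts[i]),
            )
        # Bernoulli reward with the chosen arm's success probability.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average.
        means[arm] += (reward - means[arm]) / counts[arm]

    return counts, means


if __name__ == "__main__":
    counts, means = run_ucb([0.2, 0.5, 0.8], horizon=2000)
    print(counts)
    print(means)
```

With well-separated arms like these, most pulls should end up on the arm with the highest true mean, while the others are still sampled occasionally as their bonuses grow.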

