
Coursera: Fundamentals of Reinforcement Learning


Textbook: Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., 2018): file:///C:/users/fred/Documents/tech%20papers/RLbook2018_sutton_and_barto.pdf

Weekly Reading: Chapter 2-2.7 (pp 25-36)

  • Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run. ... Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation. (p. 26)

  • As noted earlier, we often encounter reinforcement learning problems that are effectively nonstationary. In such cases it makes sense to give more weight to recent rewards than to long-past rewards. (p. 32)

  • All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, Q1(a) [action a "quality"]. In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant α [update step size], the bias is permanent, though decreasing over time as given by (2.6). In practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. (p. 34)

  • Upper-Confidence-Bound (UCB) action selection: rather than ε-greedy, which in the ε fraction of steps when the highest-valued action is not chosen picks evenly among the others, choose based on an "upper bound" estimate of how good each action might be (see the sketches after this list):
  • A_t = argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ], where N_t(a) is the number of times action a has been selected so far
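
The quotes above on incremental updates, constant step size α for nonstationary problems, and initial action-value estimates all come together in the chapter's simple bandit algorithm. Below is a minimal sketch in Python, not the book's exact pseudocode; the class name `EpsilonGreedyAgent` and its parameters are my own choices for illustration.

```python
import random


class EpsilonGreedyAgent:
    """Minimal k-armed bandit agent: epsilon-greedy action selection with the
    incremental value update Q <- Q + step * (R - Q)."""

    def __init__(self, k, epsilon=0.1, alpha=0.1, initial_q=0.0):
        self.k = k
        self.epsilon = epsilon    # probability of choosing a random (exploratory) action
        self.alpha = alpha        # None -> sample averages; a float -> constant step size
        self.q = [initial_q] * k  # action-value estimates Q(a); large initial_q = optimistic start
        self.n = [0] * k          # number of times each action has been selected

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if random.random() < self.epsilon:
            return random.randrange(self.k)
        return max(range(self.k), key=lambda a: self.q[a])

    def update(self, action, reward):
        self.n[action] += 1
        # Sample-average step size 1/n lets the bias from the initial estimates
        # disappear once every action has been tried; a constant alpha weights
        # recent rewards more heavily (better for nonstationary problems) but
        # leaves a bias that only decays over time.
        step = self.alpha if self.alpha is not None else 1.0 / self.n[action]
        self.q[action] += step * (reward - self.q[action])
```

For example, `EpsilonGreedyAgent(k=10, epsilon=0.1, alpha=0.1, initial_q=5.0)` combines a constant step size with an optimistic start, while `alpha=None, initial_q=0.0` gives the plain sample-average method.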
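
A sketch of the UCB rule from the last bullet, assuming the value estimates and selection counts are maintained by the caller; the function name `ucb_action` and the default `c=2.0` are illustrative choices, not fixed by the book.

```python
import math
import random


def ucb_action(q, n, t, c=2.0):
    """Upper-confidence-bound action selection:
    A_t = argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ].

    q: action-value estimates Q_t(a)
    n: counts N_t(a), how many times each action has been selected
    t: current time step (1-based)
    c: controls the width of the confidence bound (larger c = more exploration)
    """
    # An action that has never been selected has an unbounded upper confidence
    # bound, so it is treated as a maximizing action.
    untried = [a for a, count in enumerate(n) if count == 0]
    if untried:
        return random.choice(untried)
    return max(range(len(q)), key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
```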

Weekly Reading: Chapter 2 summary (pp 42-43)

  • "The ε-greedy methods choose randomly a small fraction of the time, whereas UCB methods choose deterministically but achieve exploration by subtly favoring at each step the actions that have so far received fewer samples." (p 42)
  • "Despite their simplicity, in our opinion the methods presented in this chapter can fairly be considered the state of the art. ... special kind of action value called a Gittins index [an instance of a Bayesian method, which assume a known initial distribution] ... neither the theory nor the computational tractability of this approach appear to generalize ... In general, the update computations can be very complex, but for certain special distributions (called conjugate priors) they are easy. One possibility is to then select actions at each step according to their posterior probability of being the best action"

Week 2 weekly reading: Chapter 3, Finite Markov Decision Processes

Week 3 weekly reading: Chapter 3.5-3.8 (pp 58-67) in Reinforcement Learning: An Introduction.
