# Reinforcement learning
* **overview**
    * agents interact with environment through available actions
    * actions impact environment, which impacts agents through rewards
    * rewards are unknown and must be estimated by the agent
    * process is dynamically repeated so agents learn to estimate rewards over time
    * agent > action > environment (state) > reward (positive actions are reinforced)
* can require very large amounts of data and computation
* examples > games, recommendation engines, marketing, automated bidding
* the solution is a policy, by which the agent chooses actions in response to the state
* agents maximize expected rewards over time
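
The agent > action > environment > reward loop above can be illustrated with a toy example. This is a minimal sketch, assuming a stateless environment with two actions and noisy Gaussian rewards (a bandit); `run_bandit`, `true_means`, and the value of `epsilon` are hypothetical choices made for illustration only.

```python
import random

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Agent estimates unknown rewards by repeatedly acting and observing."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n  # agent's running estimate of each action's reward
    counts = [0] * n       # how often each action was taken
    for _ in range(steps):
        # Agent: mostly pick the best-looking action, sometimes a random one.
        if rng.random() < epsilon:
            action = rng.randrange(n)
        else:
            action = max(range(n), key=lambda a: estimates[a])
        # Environment: respond with a noisy reward (assumed Gaussian here).
        reward = rng.gauss(true_means[action], 1.0)
        # Agent: update the running average for the chosen action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 1.0])
print(estimates, counts)
```

Over time the action with the higher average reward is chosen more often, which is the sense in which positive actions are reinforced.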

* **differs from traditional machine learning**
    * unlike labels, rewards are not known in advance and are uncertain
    * action impacts environment, the state changes, which changes the problem
    * agents face a tradeoff between rewards across different periods

* **approaches**

## Q-learning in Keras
* Q-learning is a value-based reinforcement learning algorithm
* trains agents to make sequences of decisions
* learns the value of a specific action in a given state
* finds the optimal action-selection policy for an agent
* Q-function
    * provides a measure of expected utility of action `a` in state `s`
    * assumes the optimal policy is followed after taking action `a`
    * updated iteratively using the Bellman equation
        * immediate reward
        * estimated future rewards
        * $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
            * $s$ - current state,
            * $a$ - current action,
            * $r$ - reward received after action $a$,
            * $s'$ - state resulting from taking action $a$,
            * $a'$ - candidate action in the next state $s'$,
            * $\alpha$ - learning rate, controlling the extent to which new information overrides the old
            * $\gamma$ - discount factor, modeling the importance of future rewards
* algorithm
    * initialize the environment and params
        * pick a platform
        * initialize the Q-table
        * set hyperparameters (learning rate, discount factor, exploration rate)
    * build Q-network
        * construct a Q-network that approximates the Q-function
    * train the Q-network
        * implement a training loop
    * evaluate the agent
        * test the trained agent and measure the rewards it accumulates
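
The Bellman update for the Q-function can be written out for the tabular case. This is a minimal sketch, assuming a small discrete problem where `Q` is a NumPy array indexed by `[state, action]`; the helper `q_update` and its `done` flag (which drops the future term at the end of an episode) are assumptions added for illustration and are not part of the formula above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-learning update to Q in place and return the TD error."""
    # Estimated future rewards: best Q-value available in the next state.
    future = 0.0 if done else gamma * np.max(Q[s_next])
    # Difference between the target (r + future) and the current estimate.
    td_error = r + future - Q[s, a]
    # Move the estimate toward the target by a fraction alpha.
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))   # 3 states, 2 actions
Q[1] = [0.5, 2.0]      # pretend state 1 already has some learned values
q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])         # approximately 1.4 = 0.5 * (1.0 + 0.9 * 2.0 - 0.0)
```

With $\alpha = 0.5$ the new estimate lands halfway between the old value (0) and the target (2.8).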

* detailed algorithm
    * initialize the environment
        * CartPole problem from OpenAI Gym (balance a pole on a moving cart)
        * initialize the params (learning rate, discount factor, exploration rate)
    * build Q-network
        * Q-table for storing Q-values, impractical for large/continuous state-spaces
        * Q-network approximates the table
            * architecture
                * 2-3 dense layers with ReLU activation
                * input layer -> states, output layer -> actions
    * train Q-network
        * get initial state
        * select action
            * with probability of epsilon select a random action (exploration)
            * otherwise select the action with the highest predicted Q-value (exploitation)
        * take action
            * execute the chosen action in the environment
        * update Q-values
            * use Bellman equation
            * compute the target Q-value
            * train the Q-network to minimize the difference between the predicted and target Q-value
        * repeat
            * reduce the exploration rate to shift from exploration to exploitation
    * evaluate the agent
        * interaction with the environment using the learned policy
        * exploitation of learned Q-value patterns
        * accumulation of total rewards over several episodes
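
The network and the per-step update described above might look as follows in Keras. This is a minimal sketch rather than a full agent: the layer width of 24, the helper names (`build_q_network`, `select_action`, `train_on_transition`), and the decay constants are assumptions, and the transition below is random placeholder data instead of output from a real environment. The state and action sizes (4 and 2) match CartPole.

```python
import numpy as np
from tensorflow import keras

STATE_DIM, N_ACTIONS = 4, 2  # CartPole: 4 observation values, 2 actions

def build_q_network(state_dim=STATE_DIM, n_actions=N_ACTIONS, learning_rate=1e-3):
    """Dense network mapping a state to one Q-value per action."""
    model = keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model

def select_action(model, state, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    n_actions = model.output_shape[-1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))            # exploration
    q_values = model.predict(state[None, :], verbose=0)[0]
    return int(np.argmax(q_values))                    # exploitation

def train_on_transition(model, state, action, reward, next_state, done, gamma=0.99):
    """Fit the network toward the Bellman target for one transition."""
    q_values = np.array(model.predict(state[None, :], verbose=0))
    next_q = model.predict(next_state[None, :], verbose=0)[0]
    # Target: immediate reward plus discounted best future Q-value.
    target = reward if done else reward + gamma * float(np.max(next_q))
    # Only the taken action's Q-value is changed, so only it contributes loss.
    q_values[0, action] = target
    model.train_on_batch(state[None, :], q_values)
    return target

rng = np.random.default_rng(0)
model = build_q_network()
# Placeholder transition (random vectors standing in for environment output).
state = rng.normal(size=STATE_DIM).astype(np.float32)
next_state = rng.normal(size=STATE_DIM).astype(np.float32)
epsilon = 1.0
action = select_action(model, state, epsilon, rng)
target = train_on_transition(model, state, action, 1.0, next_state, done=False)
epsilon = max(0.01, epsilon * 0.995)  # shift from exploration to exploitation
```

In a full training loop these calls would run once per environment step, with `state`, `reward`, `next_state`, and `done` supplied by the environment. Evaluation would call `select_action` with `epsilon=0` and sum the rewards over several episodes.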

