# Reinforcement Learning

## Introduction

At the very core of learning is the idea of interacting with the environment. This interaction produces a wealth of information about cause and effect and about the consequences of actions. An infant, for example, acts without the help of a teacher, yet in the process it learns about the world around it, and learns to walk and talk, using only the sensorimotor connection to its environment. Likewise, when driving a car, we seek to improve by being aware of how the environment responds to the actions we take. Learning from interaction is a foundation of learning and intelligence. The computational approach to learning from interaction, together with the methods involved, is called **Reinforcement Learning (RL)**. It is a goal-directed learning process, in contrast to other machine learning approaches such as the widely used supervised learning. We will discuss this comparison later in the book.

The idea is that a learner, called an **agent**, learns control strategies by interacting with an **environment**. The agent's job is to learn **what to do**, i.e., how to map situations to actions, through **trial and error**. The agent receives **rewards** from the environment as **feedback** for its actions.

The agent must eventually discover which actions yield the most reward by trying them. We can think of this as learning a set of actions through positive reinforcement. In most cases, actions may affect not only the immediate reward but also the rewards received at subsequent time steps. The agent must therefore learn to take actions that maximize the total reward over time.
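The interaction described above can be sketched as a simple loop. The following is a minimal illustration, not a reference implementation: `CorridorEnv`, its reward values, and the two policies are hypothetical names and numbers invented for this example. The agent pays a small cost on every step and is rewarded only when it reaches the goal, so the reward is delayed.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4 with the goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return (state, reward, done)."""
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        # Assumed reward scheme: small cost per step, payoff only at the goal.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected over time."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent maps situation to action
        state, reward, done = env.step(action)   # environment responds with feedback
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    """Pure trial and error: ignore the state and pick a direction at random."""
    return random.choice([-1, 1])


def always_right(state):
    """A policy that heads straight for the goal."""
    return 1


random.seed(0)
env = CorridorEnv()
print("random policy:", run_episode(env, random_policy))
print("always right: ", run_episode(env, always_right))
```

Under these assumed rewards, `always_right` reaches the goal in four steps and collects roughly 0.7 in total, which is the best any policy can do here. The random policy can only match or fall below that, because every extra step costs 0.1.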

```{note}
Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
```

Reinforcement learning can be thought of as a framework for optimizing the behavior of a dynamical system.

```{note}
A dynamical system is a concept in mathematics that describes how the state of a system changes over time.

* State Space: The system has a "state" at any given moment, represented by a point in an abstract space. Example: the position and velocity of a moving object.
* Dynamics: The rules that govern how the state changes over time.
* Time Evolution: The system evolves over time according to these rules, which can be deterministic or stochastic.
* Time Steps: The system's state is updated either at discrete time steps or continuously.
```
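As a rough illustration of the note above, the sketch below simulates a discrete-time dynamical system whose state is the position and velocity of a moving object. The function names `step` and `simulate`, the constant acceleration, the time step `dt`, and the optional noise term are all assumptions made for this example.

```python
import random


def step(state, dt=0.1, accel=-9.8, noise_std=0.0):
    """Dynamics: advance (position, velocity) by one discrete time step (Euler update)."""
    position, velocity = state
    # With noise_std > 0 the dynamics become stochastic; otherwise they are deterministic.
    disturbance = random.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    new_position = position + velocity * dt
    new_velocity = velocity + (accel + disturbance) * dt
    return (new_position, new_velocity)


def simulate(initial_state, n_steps, **kwargs):
    """Time evolution: apply the dynamics repeatedly and record every state visited."""
    trajectory = [initial_state]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], **kwargs))
    return trajectory


# State space: each state is a point (position, velocity).
trajectory = simulate((0.0, 5.0), n_steps=3)
for t, (pos, vel) in enumerate(trajectory):
    print(f"t={t}: position={pos:.3f}, velocity={vel:.3f}")
```

Passing `noise_std=0.5` to `simulate` gives a stochastic version of the same system, where repeated runs from the same initial state generally produce different trajectories.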

At each step, the agent measures the current state of the environment, denoted S. It may not measure the full state of the environment; it knows only the state it observes now and the states it observed in the past. Based on this information, the agent takes an action, that is, it decides what to do next.

### Difference between Reinforcement Learning and other types of learning

#### Supervised Learning

Reinforcement learning is very different from supervised learning, in which the learner is given a training set of labeled examples provided by an external supervisor. Each example describes a situation, and each label is the correct action to take in that situation. The learner fits a function that maps situations to actions and uses it to extrapolate, or generalize, to new situations. This is an important kind of learning, but on its own it is inadequate for learning from interaction.

In interactive learning problems, it is often impractical to obtain examples of desired behaviour for the many situations the agent may encounter. In such cases, an agent must be able to learn from its own experience.

#### Unsupervised Learning

In unsupervised learning, the idea is to find patterns and structure hidden in unlabeled data, so there is no external supervisor and no labeled data from which to learn a function mapping. Because reinforcement learning also does not rely on labeled examples of correct behavior, it is sometimes mistaken for a kind of unsupervised learning. However, its underlying goal is to maximize a reward signal rather than to find hidden structure. Any hidden structure is uncovered only implicitly, through the agent's experience.

```{note}
Reinforcement learning can be regarded as a third paradigm of machine learning, alongside supervised and unsupervised learning.
```

#### Exploration vs Exploitation

To obtain a lot of reward, the agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent therefore has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to discover the effects of other actions and make better selections in the future. This is called the **exploration-exploitation trade-off**.

On a stochastic task, each action must be tried many times to get a reliable estimate of its expected reward. The agent must try a variety of actions and progressively favor those that appear to be best. Neither exploration nor exploitation can be pursued exclusively without failing at the task, which is why this fundamental problem in reinforcement learning is also called the **exploration-exploitation dilemma**. Researchers have studied it for decades, yet it remains unresolved.
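One common heuristic for balancing the two is the epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits the action with the highest estimated reward. The sketch below applies it to a hypothetical three-action task with noisy (Gaussian) rewards. The function name, the reward means, and the parameter values are assumptions chosen for illustration, and epsilon-greedy is only one of many possible strategies.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, n_steps=5000, seed=0):
    """Estimate the expected reward of each action while balancing explore/exploit."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    counts = [0] * n_actions        # how often each action has been tried
    estimates = [0.0] * n_actions   # running average reward per action
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: try an action regardless of its current estimate.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Stochastic reward (assumed Gaussian with unit standard deviation).
        reward = rng.gauss(true_means[action], 1.0)

        # Incremental update of the sample mean for the chosen action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("times each action was tried:", counts)
```

Because every action keeps being tried occasionally, the estimates for all actions improve over time, while most of the steps are spent on the action that currently appears best. Setting `epsilon=0.0` removes exploration entirely, so the agent can lock onto whichever action happened to look good early on.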

This dilemma does not arise in supervised or unsupervised learning, at least in their purest forms.

### Engineering and Scientific disciplines

Reinforcement learning has been part of artificial intelligence and machine learning for decades and is closely integrated with statistics, optimization, and control theory. Because it is the closest of the machine learning paradigms to the kind of learning that humans and animals do, it is strongly related to psychology and neuroscience, and it was originally inspired by biological learning systems.

### Examples

* A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
* A master chess player makes a move. The choice is informed both by planning—anticipating possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of particular positions and moves.
* A child learns to walk by falling down and getting up again.
* **A man preparing breakfast**. It is a mundane task, yet it involves a complex web of conditional behavior that is acquired over time and tuned by the consequences of interaction. Rapid judgements are made from a series of eye movements that gather information and guide reaching for objects. Each step is guided by goals and is optimized, for example in deciding what to carry first and what to put on the dining table so as to minimize the number of trips to the kitchen while avoiding dropping things. Whether or not the agent, in this case the human, is aware of it, he is constantly determining which action to take in a given state.

All of these examples involve **interaction** between a decision-making agent and its **environment**, within which the agent seeks to achieve a goal.

The agent’s actions are permitted to **affect** the future state of the environment (e.g., the next chess position), thereby affecting the actions and opportunities available to the agent at later times.

At the same time, the **effects** of actions cannot be fully predicted, so the agent must **monitor** its environment frequently, that is, observe each state and react appropriately. The chess player makes a move, observes the opponent's response, and eventually knows whether or not he has won.

The agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play.

The agent may start with knowledge from previous experience, with knowledge built in by design or evolution, or from scratch. In every case, interaction with the environment is essential for adjusting behavior to exploit specific features of the task.

### Elements of Reinforcement Learning