# **REINFORCEMENT LEARNING**

---





**OVERVIEW :**

---



----> Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing their results. For each good action, the agent receives positive feedback (a reward), and for each bad action, it receives negative feedback (a penalty).



----> In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning. Since there is no labeled data, the agent can learn only from its own experience.


----> The agent learns by trial and error and, based on that experience, learns to perform the task better. Hence, we can say that **"Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within that."** A robotic dog learning how to move its limbs is an example of reinforcement learning.



----> It is a core part of Artificial Intelligence, and many AI agents are built on the concept of reinforcement learning. Here we do not need to pre-program the agent's behaviour, as it learns from its own experience without step-by-step human instruction.
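
The interaction loop described in this overview can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CoinFlipEnv`, its 80% bias, the ten-step episode length and the `run_episode` helper are all hypothetical names and values chosen for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, seed=0, episode_length=10):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # a single dummy state

    def step(self, action):
        # The coin lands on 1 with probability 0.8 (assumed value).
        outcome = 1 if self.rng.random() < 0.8 else 0
        # Positive feedback for a good action, a penalty for a bad one.
        reward = 1 if action == outcome else -1
        self.t += 1
        done = self.t >= self.episode_length
        return 0, reward, done


def run_episode(env, policy):
    """Let the agent act until the episode ends; return the total reward."""
    state = env.reset()
    total, done = 0, False
    while not done:
        action = policy(state)                  # the agent acts
        state, reward, done = env.step(action)  # the environment responds
        total += reward
    return total


total_reward = run_episode(CoinFlipEnv(seed=0), policy=lambda state: 1)
```

Here the policy is fixed (it always guesses 1); a learning agent would use the rewards it receives to change that policy over time.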








**KEY FEATURES :**

---




1]  In RL, the agent is not told anything about the environment or which actions to take.

2]  It is based on trial and error.

3]  The agent chooses its next action and changes state according to the feedback from its previous action.

4]  The agent may receive a delayed reward.

5]  The environment is often stochastic, and the agent needs to explore it to obtain the maximum positive reward.
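
Several of these features (trial and error, delayed reward, exploration) show up together in tabular Q-learning with an epsilon-greedy policy. The sketch below is an illustration under stated assumptions: the five-state chain, the reward of 1 at the last state, and the hyperparameters `alpha`, `gamma` and `epsilon` are all hypothetical choices, not values from the text.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4; the only reward is at state 4
ACTIONS = (-1, +1)      # index 0 = move left, index 1 = move right


def step(state, action):
    """Deterministic chain: the reward arrives only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0   # delayed reward
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)              # explore: random action
            else:
                best = max(q[state])              # exploit, random tie-break
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            nxt, r, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate towards the bootstrapped target.
            target = r + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


q = q_learning()
# Greedy action index per non-goal state; index 1 (move right) is expected.
greedy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES - 1)]
```

The agent is never told that "right" is the useful direction; it discovers this from the reward that propagates back from the goal over many episodes.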

**IMPORTANCE :**

---



Reinforcement learning is applicable to a wide range of complex problems that are hard to tackle with other machine learning algorithms. RL is often seen as closer to artificial general intelligence (AGI), as it can pursue a long-term goal while exploring various possibilities autonomously. Some of the benefits of RL include:

**Focuses on the problem as a whole**

Conventional machine learning algorithms are designed to excel at specific subtasks, without a notion of the big picture. RL, on the other hand, doesn’t divide the problem into subproblems; it works directly to maximize the long-term reward. It has a clear objective and is capable of trading off short-term rewards for long-term benefits.
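
The "long-term reward" is usually formalised as a discounted return, the sum of future rewards weighted by powers of a discount factor. The helper below is a small sketch of that calculation; the name `discounted_return`, the discount factor of 0.9 and the two reward sequences are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A small immediate reward versus a larger reward two steps later.
impatient = discounted_return([1, 0, 0])    # 1.0
patient = discounted_return([0, 0, 10])     # 0.9**2 * 10, about 8.1
```

With this discount factor the delayed reward is still worth more, so an agent maximising the return would give up the immediate reward.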


**Does not need a separate data collection step** 

In RL, training data is obtained via the direct interaction of the agent with the environment. Training data is the learning agent’s experience, not a separate collection of data that has to be fed to the algorithm. This significantly reduces the burden on the supervisor in charge of the training process.


**Works in dynamic, uncertain environments** 

RL algorithms are inherently adaptive and built to respond to changes in the environment. In RL, time matters: the experience that the agent collects is not independent and identically distributed (i.i.d.), unlike the data that conventional machine learning algorithms assume. Because the dimension of time is built into the mechanics of RL, learning adapts as the environment changes.

**CHALLENGES :**

---



Reinforcement learning (RL) has gained enormous popularity in recent years, especially in robotics. It is arguably the most advanced tool for achieving truly independent machines (although self-learning may get there first). However, it does not yet work reliably in practice. Here are the main challenges (and current solutions) that reinforcement learning faces today.

**Low sample efficiency**

---

RL algorithms have a notoriously high sample complexity, i.e. they require a very large number of trial-and-error actions before being able to solve a task. Applying such methods to real robot arms would require a very large number of iterations, which is not always possible in practice.

For example, in this work, over 580k grasping attempts were necessary before a successful strategy was learned. A human can learn to drive a car in about 30 hours, whereas an RL agent would require vastly more training time. Most RL robotics problems so far are limited to single-object interactions.


**Current solutions**

The solutions for tackling this high sample complexity include:

1]  Build a virtual model of the robot using a physics simulator, use it to learn a policy, and then deploy that policy on the real robot. See the section on “Limitation of virtual environments”.

2]  Use model-based approaches. See the section on “Limitation of model-based approaches”.

3]  Combine model-free and model-based approaches.

4]  Use Soft Actor Critic (SAC), a sample-efficient, model-free, off-policy algorithm.

5]  Parallelise learning across multiple robots.

6]  Use imitation learning (i.e. let a human manipulate the robot for a few trials to teach action sequences that solve the task successfully). For example, this and this.
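
As an illustration of the imitation-learning idea in the list above, the sketch below shows behaviour cloning, one simple form of imitation learning, in a tabular setting: the policy copies the action the demonstrator chose most often in each state. The state and action labels and the `clone_behaviour` helper are hypothetical; real robotic systems would fit a function approximator to continuous observations instead.

```python
from collections import Counter, defaultdict


def clone_behaviour(demonstrations):
    """Build a state -> action policy from (state, action) demonstration pairs."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    # For each state, keep the demonstrator's most frequent action.
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


# Hypothetical demonstrations recorded while a human guides the robot.
demos = [("far", "reach"), ("far", "reach"), ("near", "grasp"), ("far", "wait")]
policy = clone_behaviour(demos)
```

A policy obtained this way can serve as a starting point that RL then refines, which is one way demonstrations reduce the number of trial-and-error attempts.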


**Limitation of virtual environments**

---




In problems where the dynamics can be accurately captured by a simulation, pre-training the agent in a virtual simulated environment is an effective approach. However, such problems are often limited to narrow tasks and usually require extensive manual adjustments to work properly. Moreover, it is often challenging to build a virtual model that takes into account all the intricacies and imperfections of a physical robot. This is known as the Sim-to-Real gap. In particular, it is challenging to simulate realistically the images that a robot's visual sensors will capture.


**Current solutions**

1]  Randomise the visual inputs provided by the environment and train the policy to be robust to changes, so that the real world looks like just another variation of the simulation.

2]  Train a conditional GAN to transform the randomised images back to the canonical form of the original simulation that the policy is familiar with.
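
The first idea above, randomising visual inputs, can be sketched as a function that perturbs each simulated frame before the policy sees it. This is a minimal, dependency-free illustration: the image is a nested list of grey levels in [0, 1], and the perturbation ranges for brightness, contrast and noise are assumed values, not ones taken from any particular paper.

```python
import random


def randomise_image(image, rng):
    """Apply a random brightness shift, contrast scale and pixel noise."""
    brightness = rng.uniform(-0.2, 0.2)   # assumed range
    contrast = rng.uniform(0.7, 1.3)      # assumed range
    randomised = []
    for row in image:
        new_row = []
        for pixel in row:
            value = (pixel - 0.5) * contrast + 0.5 + brightness
            value += rng.gauss(0.0, 0.02)               # sensor-like noise
            new_row.append(min(1.0, max(0.0, value)))   # clip to [0, 1]
        randomised.append(new_row)
    return randomised


rng = random.Random(0)
sim_frame = [[0.1, 0.5], [0.5, 0.9]]      # tiny stand-in for a rendered frame
variants = [randomise_image(sim_frame, rng) for _ in range(4)]
```

During training, each episode (or each frame) would be rendered with a fresh set of perturbations so that the policy cannot rely on any single visual appearance.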


**Limitation of model-based approaches**

---


Model-free RL techniques learn a policy based on rewards obtained by interacting with the environment. Their drawback is high sample complexity. In contrast, in model-based methods the agent learns a model of the environment and uses it to plan and improve its policy, which can dramatically reduce the number of interactions it needs with the environment. This leads to lower sample complexity and more robust policies; see for example this paper. Even though model-based approaches reduce sample complexity, they suffer from a number of limitations:

1]  They have lower asymptotic performance compared to model-free algorithms.

2]  They may be subject to catastrophic failures due to errors in the learned model.

3]  They are challenging to integrate with deep neural networks.

4]  They don’t generalize well across multiple environment dimensions.

5]  They require the ability to accurately learn the dynamics of a model, which can be very difficult for complex systems.
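
The model-based idea described above (learn a model, then plan with it) can be sketched on a toy problem. The example below is a simplified illustration: it assumes a deterministic five-state chain, records observed transitions in a lookup table as the "learned model", and then runs value iteration on that table without touching the environment again. All names and constants are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)


def true_step(state, action):
    """The real environment (unknown to the agent): a deterministic chain."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def learn_model(n_samples=200, seed=0):
    """Record the observed (next_state, reward) for each (state, action) tried."""
    rng = random.Random(seed)
    model = {}
    for _ in range(n_samples):
        s = rng.randrange(N_STATES - 1)   # sample a non-goal state
        a = rng.randrange(len(ACTIONS))
        model[(s, a)] = true_step(s, ACTIONS[a])
    return model


def plan(model, gamma=0.9, sweeps=50):
    """Value iteration on the learned model; no further environment calls."""
    v = [0.0] * N_STATES                  # the goal state is terminal (value 0)
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            v[s] = max(
                (r + gamma * v[nxt]
                 for (ms, _), (nxt, r) in model.items() if ms == s),
                default=0.0,
            )
    return v


model = learn_model()
values = plan(model)
```

If the recorded model were wrong or incomplete for some state, the planned values would inherit that error, which is the failure mode the second limitation in the list refers to.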



**Reward specification**


---


RL problems require a reward to be defined and specified so that the agent can learn a policy for a given task. In robotics, it is not always straightforward to specify a reward as it requires a lot of domain knowledge that may not always be available. For example, for the task of inserting a book between two other books, designing a reward function based only on visual inputs can be challenging.



**Current solutions**

Some solutions that tackle the reward specification problem include:

1]  Provide the agent with several images of the goal state and allow it to query a human to find out whether the current state is a goal state. For example, see this paper.

2]  Let the agent define its own reward from visual inputs based on pixel recognition, without human interaction.


**Sparse reward**

---


Many RL problems feature very sparse rewards, i.e. the agent does not receive an informative reward at every time step. It gets no feedback on how to improve its performance during the episode and must figure out by itself which action sequences lead to the final reward. Mountain-car is a classic RL benchmark problem with sparse rewards.


**Current solutions**

Reward engineering, also known as reward shaping (or, informally, “Rew-art”). It consists of using domain knowledge to augment the sparse reward and turn it into a dense reward (for example, one that is higher in states that are closer to the end goal).

In the Mountain-car example, reward engineering could consist of adding the velocity of the car to the reward to encourage it to gather speed.
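
A shaped reward of this kind might look like the sketch below. It assumes the environment returns a sparse reward and exposes the car's velocity; the function name `shaped_reward`, the use of the absolute velocity (speed) and the weight of 10 are illustrative assumptions, not part of any benchmark definition.

```python
def shaped_reward(sparse_reward, velocity, weight=10.0):
    """Sparse reward plus a bonus proportional to the car's speed.

    The absolute value is used so that gathering speed in either
    direction is encouraged (an assumption made for this sketch).
    """
    return sparse_reward + weight * abs(velocity)


slow = shaped_reward(-1.0, 0.00)   # no bonus when the car stands still
fast = shaped_reward(-1.0, 0.05)   # larger reward when the car moves fast
```

The weight controls how strongly the bonus competes with the original reward, and it usually has to be tuned by hand.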

Hindsight Experience Replay (HER) and Scheduled Auxiliary Control (SAC-X). HER leverages the agent's failed attempts at reaching the final reward by replaying the failed episodes with a different goal than the one the agent was trying to achieve. This allows the agent to learn even if the episode was unsuccessful. SAC-X instead has the agent learn a set of auxiliary tasks alongside the main one.
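
The goal-relabelling step at the heart of HER can be sketched as follows. This is a simplified illustration that uses only the final state reached as the substitute goal; the tuple layout, the `relabel_with_hindsight` helper and the 0 / -1 reward convention are assumptions made for the example.

```python
def sparse_reward(state, goal):
    """0 when the goal is reached, -1 otherwise (assumed convention)."""
    return 0.0 if state == goal else -1.0


def relabel_with_hindsight(episode, reward_fn):
    """Replay an episode as if the state actually reached had been the goal.

    `episode` is a list of (state, action, next_state, goal) tuples.
    """
    achieved = episode[-1][2]   # the final state the agent actually reached
    relabelled = []
    for state, action, next_state, _ in episode:
        reward = reward_fn(next_state, achieved)
        relabelled.append((state, action, next_state, achieved, reward))
    return relabelled


# A failed episode: the goal was state 5, but the agent ended in state 1.
failed = [(0, +1, 1, 5), (1, +1, 2, 5), (2, -1, 1, 5)]
replay = relabel_with_hindsight(failed, sparse_reward)
```

The relabelled transitions contain non-trivial rewards even though the original episode never reached its goal, so they can be added to the replay buffer of an off-policy learner.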

Use Divergent Policy Search methods (novelty, surprise and diversity approaches) to explore the space of observable policies; see for example this paper.

Use curiosity-driven exploration, intrinsic motivation or count-based exploration to encourage the agent to explore new states.
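
Count-based exploration, one of the options just mentioned, can be sketched as an intrinsic bonus that shrinks as a state is visited more often. The form `beta / sqrt(N(s))` used here is one common choice; the class name `CountBonus` and the value of `beta` are assumptions for illustration.

```python
import math
from collections import Counter


class CountBonus:
    """Add an exploration bonus of beta / sqrt(visit count) to the reward."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = Counter()

    def __call__(self, state, extrinsic_reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + bonus


bonus = CountBonus(beta=1.0)
first_visit = bonus("s0", 0.0)    # a new state earns the full bonus
second_visit = bonus("s0", 0.0)   # the bonus decays on repeated visits
new_state = bonus("s1", 0.0)      # another unseen state earns the full bonus again
```

This only works directly when states can be counted; for large or continuous state spaces the counts have to be approximated.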


**Limitations of reward engineering**

Reward engineering seems an attractive solution to the problem of sparse rewards; however, it faces the following limitations:

1]  It assumes some a priori knowledge of how to solve the problem, which is not always available, especially for complex tasks.

2]  An agent with an engineered reward no longer aims at solving the initial task but instead optimises a proxy that will hopefully help with the learning process. This may compromise the performance relative to the true objective, and sometimes even lead to unexpected and unwanted behaviour.


**Overfitting / failing to generalise**

---


It is hard to prevent an RL agent from overfitting to a problem. It may learn to solve a task at a super-human level, yet perform very poorly on other, similar tasks.

A solution could be to train the agent on a large distribution of environments, but that is very computationally expensive.


**Limitation in robotics: continuous action and state space**

---


To solve specific tasks in an RL context, robots must receive and execute instructions continuously (or at least at very short time intervals). Some RL algorithms, such as tabular Q-learning, can only deal with discrete state and action spaces. To control robots with them, it is thus necessary to discretise the continuous state and action spaces.

However, as the number of degrees of freedom increases, the number of discrete bins grows exponentially, which can be prohibitive in terms of computational resources. This is informally known as the curse of dimensionality.

A number of RL algorithms have been developed to tackle this problem, such as DDPG, TRPO, PPO, NAF or Branching Dueling Q-Network (BDQ).
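
The growth in table size is easy to see with a small calculation. The sketch below discretises one continuous value into bins and counts how many joint bins a table would need as the number of degrees of freedom grows; the helper names and the choice of 10 bins per dimension are assumptions for illustration.

```python
def discretise(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in 0..n_bins-1."""
    ratio = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(ratio * n_bins)))


def table_size(n_bins, n_dims):
    """Number of joint bins when every dimension is split into n_bins bins."""
    return n_bins ** n_dims


joint_angle_bin = discretise(0.3, low=-1.0, high=1.0, n_bins=10)

# With 10 bins per dimension, the table grows tenfold with each extra dimension.
sizes = {dof: table_size(10, dof) for dof in (1, 3, 7)}
```

With seven degrees of freedom and only 10 bins each, a table already needs ten million entries per action, which is why algorithms that handle continuous spaces directly are attractive.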

**FUTURE SCOPE :**

---

----> While reinforcement learning may ultimately have promise, it is important not to overstate its current achievements or its current applicability.  For instance, while there has been much focus on the role reinforcement learning had in the development of AlphaGo, what is less well known is that AlphaGo’s training started with supervised learning, in which deep neural networks learnt from 30 million moves played by expert human players.  Reinforcement learning was only applied after this extensive initial training.  Further, after the reinforcement learning phase, moves from those games were then fed into a second neural network, and Monte Carlo tree search was used to combine the networks during play.  In other words, reinforcement learning played only a part (albeit an important one) in the success of AlphaGo; it was not the entire solution.

----> Data scientists have a tendency to apply new methods to every problem they encounter, simply because they are fascinated by them, often without stopping to think about whether the new method suits their particular problem.  This is a form of bias known as “Maslow’s hammer”.

----> In the case of reinforcement learning, there are several blogs that explain how it could be applied to recommendation engines.  However, such blogs tend not to explain why reinforcement learning should be used for this task (as opposed to tried and tested machine learning methods) nor do such blogs discuss the challenges of productionising such a solution in a real world system.

----> Only time will tell whether reinforcement learning becomes as mainstream as some predict, or whether it is best suited only to niche problems such as game solving and robotics.  However, does that then mean it doesn’t warrant investment, research or learning about in the meantime?  Absolutely not!  When asked why he wanted to scale Mount Everest, George Mallory famously replied “Because it’s there” – it provided a focus, a challenge, and a reason and sometimes that is all that is needed.  With RL, we have a challenge and reward, and so I for one will continue to learn about this fascinating approach, but I won’t be applying it to problems I encounter, at least not until its challenges have been overcome.


**CONCLUSION:**

----> From the above discussion, we can say that Reinforcement Learning is one of the most interesting and useful parts of Machine learning. In RL, the agent explores the environment on its own, without step-by-step human guidance. It is one of the main learning approaches used in Artificial Intelligence. But there are cases where it should not be used: for example, if you have enough labeled data to solve the problem, other ML algorithms can be used more efficiently. The main issue with RL algorithms is that some factors, such as delayed feedback, may slow down learning.



**REFERENCE LINKS :**

---



https://www.javatpoint.com/reinforcement-learning

https://www.datamachinist.com/reinforcement-learning/challenges-in-reinforcement-learning/

https://www.synopsys.com/ai/what-is-reinforcement-learning.html

https://www.capgemini.com/gb-en/2020/05/is-reinforcement-learning-worth-the-hype/

