
Agent Only Learns to Rotate rather than move (always)!! (Best Formatted Issue So far!!😂) #1457

Closed
ust007 opened this issue Dec 1, 2018 · 12 comments
Labels: help-wanted (Issue contains request for help or information.)

ust007 commented Dec 1, 2018

Hey, I know this is a lot to read, but I tried to explain the problem as best I could.

So I wanted to train an agent to reach a target at some random distance on the x-z plane, with obstacles in between. My agent and target are both on the same plane.
Both spawn at random locations within a range of 30f.
MAIN SCENE (screenshot)
INPUTS (continuous)

  1. x and z relative position of the target
    AddVectorObs(relativePosition.x / 30f);
    AddVectorObs(relativePosition.z / 30f);

  2. Five rayPer.Perceive rays at different angles, same as the Hallway example:
    float rayDistance = 10f;
    float[] rayAngles = { 20f, 60f, 90f, 120f, 160f };
    string[] detectableObjects = { "target", "wall" };
    AddVectorObs(rayPer.Perceive(rayDistance, rayAngles, detectableObjects, 0.3f, 0f));
    
  3. AddVectorObs(GetStepCount() / (float)agentParameters.maxStep);
    also copied from the Hallway example (as I understand it, this gives the agent a sense of time).

  4. Whether the agent is on a bad road or not, which for now is always 0 since I haven't introduced roads yet (the four observations are consolidated in the sketch below the list):
    AddVectorObs(OnRoad ? 1 : 0);
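
Putting the four observations together, here is a consolidated sketch of the CollectObservations override. Only the individual AddVectorObs lines above are my actual code; the target reference and the way relativePosition is computed are just glue filled in to make the sketch complete, assuming the ML-Agents 0.x Agent API:

    // Consolidated observation vector; assumes a `target` Transform reference.
    public override void CollectObservations()
    {
        Vector3 relativePosition = target.position - transform.position;

        // 1. Relative x/z position of the target, normalized by the 30f spawn range.
        AddVectorObs(relativePosition.x / 30f);
        AddVectorObs(relativePosition.z / 30f);

        // 2. Ray perception at five angles, same settings as the Hallway example.
        float rayDistance = 10f;
        float[] rayAngles = { 20f, 60f, 90f, 120f, 160f };
        string[] detectableObjects = { "target", "wall" };
        AddVectorObs(rayPer.Perceive(rayDistance, rayAngles, detectableObjects, 0.3f, 0f));

        // 3. Fraction of the episode elapsed, giving the agent a sense of time.
        AddVectorObs(GetStepCount() / (float)agentParameters.maxStep);

        // 4. Whether the agent is on a bad road (always 0 for now).
        AddVectorObs(OnRoad ? 1 : 0);
    }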

OUTPUTS (continuous):
Only two: one to move forward and backward, and one to rotate left and right.
Exactly the same as the Hallway example.

Rewards
+1 if the agent collides with the target.
Time penalty: AddReward(-1f / agentParameters.maxStep);
And a small reward for closing in plus a small penalty for moving away (a sketch of the wiring follows).
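
A sketch of how these reward pieces fit together; the reward values are mine, but the OnCollisionEnter hook and the "target" tag shown here are just assumed glue for the sketch:

    // Terminal reward when the agent touches the target.
    void OnCollisionEnter(Collision collision)
    {
        if (collision.gameObject.CompareTag("target"))
        {
            AddReward(1f);   // +1 for reaching the target
            Done();          // end the episode
        }
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // ... apply the move-forward/backward and rotate actions here ...

        // Per-step time penalty.
        AddReward(-1f / agentParameters.maxStep);

        // Small closing-in reward / moving-away penalty; the exact line is
        // shown further down this thread.
    }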

The trainer parameters in the .yaml file are exactly the same as in the Hallway example.
I tried lots of different things but nothing worked, so I just copied them.

Training: all of the training below was done with the obstacles disabled.
Time scale of 100.
PPO with LSTM.
Same as the Hallway example.

  1. When the target is in range of rayPer.Perceive:
    At first the agent learns to rotate and keeps rotating for about 150k steps. After that it moves and starts to close in on the target 2-3 times but doesn't reach it. After that it knows that closing in gives a positive reward, so it does that every time.
    A => So basically, after every reset it starts rotating (always in the same direction), and when rayPer.Perceive detects the target it closes in.
    So, is it trained? See further...

  2. In the above training, at around 300k steps I decided to increase the distance to the target.
    Now, if the target is in range of rayPer.Perceive it does the same as A (statement A in point 1 above).
    If the target is far away it just keeps rotating (same direction as usual), with occasional random motion where it moves forward and rotates at the same time.
    So no success :( after 600k steps.

  3. So at 600k steps I changed it so that all target positions are out of range of rayPer.Perceive.
    Result: it keeps rotating.

  4. After 1, 2 and 3 I decided to train it from scratch, with the target out of range of rayPer.Perceive right from the start.
    Again no success... it just learns to rotate (same direction) with very, very little movement.

No success even with the obstacles disabled.

So if someone can help and guide me, any suggestions would be appreciated.
My thoughts on what could be wrong:

  1. The task is difficult and needs more training.
  2. Maybe I am giving unnecessary extra data as input.
  3. It needs some different kind of approach to train gradually.
  4. LSTM doesn't work well with continuous input.
    But I also ran trainings 1 and 2 with use_recurrent: false, so maybe LSTM is not the problem.

Next things I am going to try:

  1. Discrete action space.
  2. Training multiple agents at the same time.

PS: I really want to know why it always learns to rotate, and always in the same direction.
Also, this took me an hour to type while my stupid agent was once again learning to rotate in the background. 😂😂😂

Screenshots of the results: Training-1 Results, Training-2 Results, Training-3 Results, Training-4 Results.

ust007 changed the title from "Agent Only Learns to Rotate rather than move (always)!!" to "Agent Only Learns to Rotate rather than move (always)!! (Best Formatted Issue So far!!😂)" Dec 1, 2018
@superjayman

I have noticed similar behavior from my agents. It looks like sometimes they get stuck in a local minimum and just go around in circles eating up food. Not sure how to fix it?


LeSphax commented Dec 3, 2018

One thing I can see from your graphs is that entropy is increasing. When that happens to me it usually means that the task is too hard for the agent and it is not getting enough reward.

Increasing entropy means the agent is becoming less confident about how good its actions are; essentially it is picking more and more random actions.
Part of the PPO algorithm rewards the agent for taking random actions, to make sure it keeps exploring. So your agent has decided that taking random actions is the best way to get reward.
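
(For reference, this exploration bonus is the entropy term in the PPO objective. Using the notation of the PPO paper, and assuming I remember correctly that its strength corresponds to the beta hyperparameter in the trainer .yaml:

    L_t(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right]

where S[\pi_\theta](s_t) is the entropy of the policy at state s_t. A more random policy has higher entropy, which is what the entropy curve in TensorBoard is plotting.)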

How does this small closing-in reward work? It feels like it is a one-time reward, and after getting it the agent decides that there is nothing more to get and it should just take random actions.

So you could make this reward continuous, i.e. reward the agent every time it gets closer than it was in the last frame.
You could also put the goal closer to the agent so that there is a better chance it will reach it by moving randomly: either use curriculum learning and put the goal further and further away each lesson,
or just make the distance to the goal random.

eshvk added the help-wanted label Dec 4, 2018

ust007 commented Dec 7, 2018

@LeSphax, I think you are right about the entropy.
But as I mentioned in my post, I already give a reward every step when the agent moves closer to the target, and a penalty when it moves away.
And I switched to a discrete action space, and that changed a lot.
But I am still unable to train it to reach the target when the target is not in range of rayPer.Perceive().
I tried both techniques: first starting close and then slowly moving the targets further away, and another time just starting from far-away points. No good results from either.

I am starting to think the task may be too difficult.


LeSphax commented Dec 12, 2018

Hey @ust007,

Just to make sure I understand the reward and penalty the agent gets for moving closer / moving away, could you detail it further or maybe show me the code?

Now that I think about it, maybe the penalty for going away is the problem here. I think it is generally better to avoid giving an agent penalties that are caused by its own actions. I.e.:
DO: give the agent a small penalty every timestep no matter what (to encourage finishing faster).
DON'T: give the agent a penalty when it does an action that you don't want it to do.

Because until the agent understands the difference between moving away and moving closer, it will just learn that whenever it moves it has a 50/50 chance of getting a penalty or a reward.
So removing the penalty should be helpful in your case.

You say you tried to put the goal close to the agent, but how close did you put it? My idea was to start by putting the goal so close to the agent that it walks into it by chance quite often. That way you wouldn't need to rely on your getting-closer reward; the +1 reward for colliding with the target should be enough.

About the rayPer.Perceive thing, it's hard to follow exactly what you did. It looks like you changed the environment manually in the middle of the training session.
I would recommend against changing the environment manually in general, because it makes it hard to reproduce your results afterwards. If you want to change the task in the middle of training you should use curriculum learning, as it allows you to retrain the agent from scratch on the same curriculum.

So to summarize:

  • Could you show the code of the small reward/penalty you are giving?
  • Try removing the penalty for moving away (see the sketch after this list).
  • If that doesn't work, put the goal really close to the agent. Once it learns to move to the goal, use curriculum learning to put the goal further and further away.
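
For the second bullet, a minimal sketch of a shaping reward that only pays when the agent gets closer and never penalizes moving away (all variable names here are placeholders, not taken from your code):

    // Somewhere in AgentAction, after movement has been applied.
    float distanceToTarget = Vector3.Distance(transform.position, target.position);
    float progress = previousDistance - distanceToTarget;    // > 0 only when closing in
    AddReward(Mathf.Max(0f, progress) / startingDistance);   // clamp away the penalty
    previousDistance = distanceToTarget;                     // remember for the next step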


ust007 commented Dec 12, 2018

Thanks @LeSphax,
I will explain the reward/penalty thing below.

First, about curriculum learning, I didn't get the part where you said

"it allows you to retrain the agent from scratch on the same curriculum."

What does this mean?

  1. That the agent gets to know that there are some changes in the environment and it has to try new things on top of the previously learned things?
  2. That it doesn't know the environment has changed?
  3. Or maybe something else? Please explain in detail,
    because, as you said, I was changing the environment manually by pausing or using the --load feature.
    Please explain the difference if you can.

Let me explain what result I want:
in the end I want the agent to bypass the obstacles and reach the target.
What I did:

  1. I am manually introducing the obstacles (I don't know how to use curriculum learning here).
  2. I trained it and it learned, but sometimes it just gets stuck on the obstacles.
  3. And there is an unexpected behavior where the agent moves towards the target backwards, rather than approaching it with its rayPer.Perceive side facing forward; it moves with the opposite end first.
    This results in the agent getting stuck on obstacles, since it cannot detect that there is a wall or obstacle behind it. I don't know why this happens.

(Screenshot: obstacles)

Now about the reward/penalty thing, this is the code:

AddReward((previousDistance - distanceToTarget) / CurrentEpDistance);

previousDistance: the distance between the target and the agent in the last step.
distanceToTarget: the distance between the target and the agent in the current step.
CurrentEpDistance: the initial distance at the start of the episode.

Overall, I wanted the agent to receive a total of +1 (summed over all the steps) in an episode if it covers the full distance between the agent and the target.
And similarly, if it moves away this code gives it a penalty.
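
To be clear about how the sum works, here is a sketch of the surrounding wiring. Only the AddReward line above is my actual code; the previousDistance update and the AgentReset initialization shown here are just the natural completion of it, not a verbatim paste:

    public override void AgentReset()
    {
        // (Re)spawn logic omitted; record the starting distance for normalization.
        CurrentEpDistance = Vector3.Distance(transform.position, target.position);
        previousDistance = CurrentEpDistance;
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // ... movement / rotation ...

        float distanceToTarget = Vector3.Distance(transform.position, target.position);

        // Each step pays (previousDistance - distanceToTarget) / CurrentEpDistance, so over
        // an episode the shaping terms sum to (CurrentEpDistance - finalDistance) / CurrentEpDistance,
        // i.e. at most +1 when the agent covers the whole starting distance.
        AddReward((previousDistance - distanceToTarget) / CurrentEpDistance);
        previousDistance = distanceToTarget;
    }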


LeSphax commented Dec 12, 2018

About the curriculum, you should try to read the documentation, but basically it allows you to create several lessons.
This way you can have your agent do tasks that get harder and harder. For example, a curriculum could be:

1- The goal is very close to the agent
2- The goal is far from the agent
3- The goal is far and there are obstacles

Using this, you can just retrain the agent from scratch every time (i.e. without using --load),
and the changes to the environment will be applied automatically. This makes your results easier to reproduce, because you don't have to remember what you did manually; everything is written in code.
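
Concretely, each lesson of a curriculum just sets one or more reset parameters, and the agent's reset code reads the current value. A minimal sketch, assuming a curriculum parameter named goal_distance and the Academy.resetParameters dictionary from the ML-Agents version of that time (both the parameter name and the lookup are illustrative, not your actual setup):

    public override void AgentReset()
    {
        // Read the current lesson's value; the trainer updates this dictionary
        // whenever a lesson's reward/progress threshold is passed.
        var academy = FindObjectOfType<Academy>();
        float goalDistance = 5f;   // fallback if no curriculum is running
        if (academy != null && academy.resetParameters.ContainsKey("goal_distance"))
            goalDistance = academy.resetParameters["goal_distance"];

        // Place the target on the x-z plane at that distance from the agent.
        Vector2 dir = Random.insideUnitCircle.normalized;
        target.position = transform.position + new Vector3(dir.x, 0f, dir.y) * goalDistance;
    }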

About your code, it seems to be correct, but it might improve without the penalty.
From what I understand you now have a different problem: the agent goes towards the objective but gets stuck on obstacles.
I think this just means it needs to train more. You could also start with smaller obstacles and make them bigger once it passes them successfully.


atapley commented Dec 14, 2018

I have trained agents to locate goals when both the agent's position and the goal's location are randomized, and in my experience the agent will rotate in one direction at the start every time, because it is looking for either the goal or something it recognizes, such as a corner or a specific obstacle. Once it sees something it is looking for, it will move in that direction.

The reason it moves backwards towards the goal is your distance reward. The agent learns that spinning around to face the goal wastes time, so it solves that by moving backwards instead, since it can still collect the positive reward that way. If you get rid of the backwards action, the agent will be forced to approach using the ray-facing side (a sketch of this is below).
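
A minimal sketch of what that could look like with the discrete action space you switched to: one branch for movement with no backward option at all, and one branch for rotation (the branch layout and speed constants are placeholders, not your actual setup):

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Branch 0: 0 = do nothing, 1 = move forward (no backward action exists).
        int move = Mathf.FloorToInt(vectorAction[0]);
        if (move == 1)
            transform.position += transform.forward * moveSpeed * Time.deltaTime;

        // Branch 1: 0 = do nothing, 1 = rotate left, 2 = rotate right.
        int turn = Mathf.FloorToInt(vectorAction[1]);
        if (turn == 1)
            transform.Rotate(0f, -turnSpeed * Time.deltaTime, 0f);
        else if (turn == 2)
            transform.Rotate(0f, turnSpeed * Time.deltaTime, 0f);

        AddReward(-1f / agentParameters.maxStep);   // keep the per-step time penalty
    }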


ust007 commented Dec 14, 2018

@atapley I get your point about the rotation. But about the backwards movement, you said

"since it can get a positive reward by moving backwards"

It can get the same reward by moving forwards towards the goal, so there is no reason for it to be biased towards moving backwards.
But with more training it has improved, and it now also approaches from the ray-facing side.


atapley commented Dec 14, 2018

The agent would need to turn around to face the goal, which involves more actions than just moving backwards, and therefore more negative time-step rewards. The reason it improves as it trains more could just be that, after getting stuck on an obstacle for a while because it was moving backwards, it learns that it's better to move forwards. But at least in the early stages of training, that is a plausible cause of the issue.


ust007 commented Dec 15, 2018

@atapley Yeah you are right. Thanks for the help.

@xiaomaogy

Thanks for reaching out to us. Hopefully you were able to resolve your issue. We are closing this due to inactivity, but if you need additional assistance, feel free to reopen the issue.


lock bot commented Jan 22, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 22, 2020