Confusion about a few terms used in the hyperparameters in the YAML file #2252

Closed
Junggy opened this issue Jul 12, 2019 · 14 comments

Junggy commented Jul 12, 2019

Hello,

I am a bit confused by some of the hyperparameters used in the *.yaml file
(Experiences, Time Horizon, Batch Size, Buffer Size, Num Epoch).
According to the documentation (https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-PPO.md), this is what I have understood so far:

1. Experience: the agent's collected [observations, actions, rewards] for one step. (But what is meant by processing the experiences?)
2. Time Horizon: how many experiences to collect before they are used in the value estimate.
3. Batch Size: how many experiences to use for a single gradient update.
4. Buffer Size: how many gradients should be processed (i.e. averaging multiple gradients) before actually updating the model; here, buffer_size = n_gradients * batch_size.
5. Num Epoch: this one I don't understand; what is meant by the number of passes through the experience buffer during gradient descent? Can you give me a detailed explanation, or a reference if one is available?

If there is any misunderstanding or anything wrong, please correct me and give me a more detailed explanation.

Thanks in advance.

@Junggy Junggy added the discussion Issue contains general discussion. label Jul 12, 2019
shihzy commented Jul 12, 2019

Hi @Junggy

1 - Processing means going through one iteration of observation, action, and reward update.
2-4 - Not sure what you mean by these questions; can you elaborate?
5 - This article does a good job of explaining epochs and batches for gradient descent: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
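
To put rough numbers on it (the figures below are made up for illustration, not ml-agents defaults):

```python
# Illustrative only: generic epoch/batch arithmetic.
dataset_size = 2048   # samples available for one update (the experience buffer)
batch_size = 256      # samples per gradient step
num_epoch = 3         # full passes over those samples

batches_per_epoch = dataset_size // batch_size        # 8 gradient steps per pass
total_gradient_steps = batches_per_epoch * num_epoch  # 24 gradient steps in total
print(batches_per_epoch, total_gradient_steps)        # 8 24
```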

@shihzy shihzy self-assigned this Jul 12, 2019
Junggy commented Jul 12, 2019

Thank you for the quick answer. @unityjeffrey

Hmmmm, okay, you misunderstood my 5th question.
I know perfectly well what a batch and an epoch are. I just don't understand what that sentence means, and especially how the epoch is defined in this case.

So did you define a buffer_size amount of experience as one full dataset (one epoch)?

So how it works internally is: every time the buffer of buffer_size is filled, it calculates a gradient with a batch_size amount of experience "buffer_size / batch_size" times, then post-processes (i.e. averages the gradients) and updates the model, and continues this n_epoch times before collecting a new buffer_size amount of experiences.

Is this correct?

Regarding Time Horizon, it is written that:

"time_horizon corresponds to how many steps of experience to collect per-agent before adding it to the 1) experience buffer. 2) When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state"

1) The experience buffer is mentioned here. Is this experience buffer the same buffer as the one with Buffer Size? (I really don't get this part. What is the relation between "Time Horizon" and "Batch Size" & "Buffer Size" & "Num Epoch"?)
So does the experience buffer take a time-horizon number of experiences as one unit, and is one epoch then unit * buffer_size experiences?
Or is the experience buffer mentioned here a different kind of buffer?

2) Here it says that when this limit is reached, it calculates a value estimate. So when the Buffer Size is filled, are those experiences used to calculate the value estimate? And is the layer that calculates the value estimate also updated n_epoch times?

@mattinjersey

Thanks, I also find this all confusing, so it would be good to clarify.

shihzy commented Jul 15, 2019

cc: @xiaomaogy

@xiaomaogy

@awjuliani

@awjuliani

Hi all. Let me try to clarify:

Time horizon is how many steps of experience to collect in a single trajectory before calculating the discounted returns and advantages for that trajectory, and then adding it to the buffer.

The buffer size is how big this buffer of trajectories can get before we use it for training. Once the buffer reaches this size, we then go through it (num_epoch) number of times, taking a random (batch_size) size batch at a time. After the epochs of training, we then clear the buffer and start filling it again from scratch.
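
To make that concrete, here is a minimal Python sketch of the cycle described above. This is not the actual ml-agents trainer code; collect_trajectory, compute_returns_and_advantages, and policy.update are hypothetical placeholders used purely for illustration:

```python
import random

def train(env, policy, time_horizon, buffer_size, batch_size, num_epoch):
    buffer = []  # individual experiences, each annotated with its return/advantage
    while True:
        # Collect up to `time_horizon` steps, then compute discounted returns and
        # advantages for that trajectory (bootstrapping with a value estimate if
        # the episode has not ended yet). Both helpers are hypothetical.
        trajectory = collect_trajectory(env, policy, max_steps=time_horizon)
        compute_returns_and_advantages(trajectory, policy)
        buffer.extend(trajectory)  # the trajectory is flattened into single experiences

        # Once the buffer is full, make `num_epoch` passes over it in random
        # `batch_size`-sized mini-batches, then clear it and start from scratch.
        if len(buffer) >= buffer_size:
            for _ in range(num_epoch):
                random.shuffle(buffer)
                for start in range(0, len(buffer), batch_size):
                    policy.update(buffer[start:start + batch_size])
            buffer.clear()
```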

Junggy commented Jul 16, 2019

@awjuliani Thanks for the answer.

Pretty much everything makes sense.
But just one last thing; this is what I have always been confused about.

You said:
"Time horizon is how many steps of experience to collect in a single trajectory before calculating the discounted returns and advantages for that trajectory, and then adding it to the buffer."

So one unit of the buffer is not a single experience, but a time-horizon amount of experiences (meaning a single trajectory)? (i.e. buffer = buffer_size * single_trajectory (meaning time_horizon * experiences), not buffer_size * single_experience)

Something like: the buffer is filled with a buffer_size amount of trajectories, and each trajectory is filled with a time_horizon amount of experiences. Right?

@awjuliani

@Junggy

The buffer consists of single experiences, and the buffer size corresponds to the number of experiences. When a trajectory is added, it is added as single experiences, not as a whole unit. That being said, when dealing with LSTMs the experiences from trajectories are kept in temporal order so that they can be re-used during training.
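
For example, with purely illustrative numbers (not recommended values):

```python
time_horizon = 64    # a full-length trajectory contributes 64 individual experiences
buffer_size = 2048   # counted in individual experiences, not in trajectories

# A full buffer therefore holds roughly this many full-length trajectories' worth of steps:
print(buffer_size // time_horizon)  # 32
```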

mattinjersey commented Jul 16, 2019

Could you define the word "experience"? Does an experience include all the vector observations from a single timestep? So 1 experience might consist of 40 vector observations for one game, but 100 vector observations for another game.

Also, could you define the word "trajectory"?

Junggy commented Jul 18, 2019

@awjuliani thanks, it's getting clearer!

This is what I understood in the end. Can you clarify whether it is right or wrong?

let's say, time_horizon = 4, buffer_size= 4, batch_size=2, n_epoch=2

  1. A time_horizon amount of experience is gathered, and the discounted advantage is calculated - a trajectory
    (i.e. time_horizon_buffer = [exp_1, exp_2, exp_3, exp_4] -> adv_1)

  2. The first experience and its advantage go into the buffer.
    (i.e. buffer = [(exp_1, adv_1)])

  3. Discard the first experience in the time_horizon buffer.
    (i.e. time_horizon_buffer = [exp_2, exp_3, exp_4])

------------------- repeat until buffer is filled ------------------------
i.e.

  1. A new experience is received, and the next discounted advantage is calculated - a new trajectory
    (i.e. time_horizon_buffer = [exp_2, exp_3, exp_4, exp_5] -> adv_2)

  2. The first experience and its advantage go into the buffer.
    (i.e. buffer = [(exp_1, adv_1), (exp_2, adv_2)])

  3. Discard the first experience in the time_horizon_buffer.
    (i.e. time_horizon_buffer = [exp_3, exp_4, exp_5])

......
......
...... repeated 2 more times

-------------------------- buffer is filled ---------------------------
(i.e. buffer = [(exp_1, adv_1), (exp_2, adv_2), (exp_3, adv_3), (exp_4, adv_4)])

Calculate the gradient with a batch_size number of samples from the buffer (i.e. sample size = 2),
repeat until it has gone through all samples in the buffer (i.e. 4 / 2 = 2 times),
and repeat all of this n_epoch times (i.e. 2 times).

Empty everything and start over again
(time_horizon_buffer = [], buffer = [])

Is this correct?
Thanks in advance!

Junggy commented Jul 18, 2019

@mattinjersey

I think an experience is everything you receive from Unity after an action, like a dictionary.
e.g. exp = {observation: some_vector, visual_observation: some_images, reward: some_scalar}
Something like this, and it seems it will change depending on the setup.

@Nanocentury

This should totally be added to the docs. Thanks for asking and answering.

@vincentpierre

Thank you for the discussion. We are closing this issue due to inactivity.
