
Cache Autoscaler for DL training as mentioned in Fluid paper Section V.B.3 #3668

Open
mengwanguc opened this issue Dec 26, 2023 · 1 comment

@mengwanguc

Hi,

I'm wondering where I can find the Cache System Autoscaler for DL training mentioned in Section V.B.3 of the Fluid paper.

I've spent hours scanning the documentation but failed to find a corresponding doc or script.

There are two related docs, but neither answers my question:
(1) Accelerate Machine Learning Training with Fluid: it does describe how to run ML training on Fluid, but the cache capacity is hardcoded, whereas according to the Fluid paper the cache amount should adapt dynamically to the training speed.
(2) Cache Runtime Auto Scaling: it describes how to scale cache workers on the fly, but the scaling policy is quite simple (it only checks whether 90% of the cache capacity is used), while the scaling policy for DL training in the paper is much more sophisticated (see the sketch below for the kind of policy I mean).
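For concreteness, here is a minimal sketch, in Python, of the kind of training-speed-driven policy I understand the paper to describe. Everything here is hypothetical: the metric sources, thresholds, and the scaling call are placeholders, not Fluid's actual API.

```python
# Hypothetical sketch of a training-speed-driven cache autoscaler.
# The metric sources and the scaling call are placeholders; Fluid's real
# autoscaler (paper Section V.B.3) may use different signals and actuation
# (e.g. patching the cache runtime's replica count via the Kubernetes API).

import time
from dataclasses import dataclass


@dataclass
class Metrics:
    training_throughput: float  # samples/s consumed by the DL job
    cache_throughput: float     # samples/s the cache tier can serve
    cache_hit_ratio: float      # fraction of reads served from cache


def collect_metrics() -> Metrics:
    """Placeholder: in a real setup these would come from the training job
    and the cache runtime (e.g. metrics exposed via Prometheus)."""
    raise NotImplementedError


def scale_cache_workers(replicas: int) -> None:
    """Placeholder: e.g. patch the cache runtime's replica count."""
    raise NotImplementedError


def autoscale(min_replicas: int = 1, max_replicas: int = 8,
              interval_s: float = 30.0) -> None:
    replicas = min_replicas
    while True:
        m = collect_metrics()
        # Scale out while the cache tier cannot keep up with the trainer:
        # a low hit ratio or cache throughput below training demand means
        # data loading, not compute, is the bottleneck.
        if m.cache_throughput < m.training_throughput or m.cache_hit_ratio < 0.9:
            replicas = min(replicas + 1, max_replicas)
        # Scale in when the cache comfortably outpaces the trainer.
        elif m.cache_throughput > 1.5 * m.training_throughput:
            replicas = max(replicas - 1, min_replicas)
        scale_cache_workers(replicas)
        time.sleep(interval_s)
```

The point is that such a controller compares training demand against what the cache can actually serve, rather than looking only at capacity utilization as the current Cache Runtime Auto Scaling doc does.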

Therefore, I'm wondering whether there is any doc, demo, or script that could help me reproduce the cache autoscaling for DL training described in the paper.
Any help would be appreciated!

Thanks,
Meng

@TrafalgarZZZ
Member

/assign @RongGu
