I'm wondering where I can find the Cache System Autoscaler for DL training as mentioned in Fluid paper Section V.B.3?
I've spent hours searching the documentation, but I couldn't find a corresponding doc or script.
There are 2 related docs, but they don't answer my question:
(1) Accelerate Machine Learning Training with Fluid: it does describe how to run ML training on Fluid, but the cache capacity is hardcoded, whereas according to the Fluid paper the cache amount should adapt dynamically to the training speed.
(2) Cache Runtime Auto Scaling: it describes how to scale cache workers on the fly, but the scaling policy is quite simple (scale out when 90% of the cache capacity is used), while the scaling policy for DL training described in the paper is much more complicated.
Therefore, I'm wondering if there is any doc/demo/script/etc that can help me reproduce the cache autoscaling for DL training, as mentioned in the paper?
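To make the question concrete, the kind of adaptive policy I have in mind looks roughly like this. This is only a sketch of the idea; every name here is hypothetical and none of it comes from Fluid's actual API — the point is just that the worker count should track training throughput rather than a fixed utilization threshold:

```python
def desired_cache_workers(
    samples_per_sec: float,    # observed training throughput
    bytes_per_sample: float,   # average sample size in the dataset
    worker_bandwidth: float,   # read bandwidth one cache worker can serve (bytes/s)
    max_workers: int,          # cluster-imposed upper limit
) -> int:
    """Hypothetical policy: pick enough cache workers so that their
    aggregate bandwidth keeps up with the rate at which training
    consumes data, instead of reacting to a 90%-capacity threshold."""
    required_bandwidth = samples_per_sec * bytes_per_sample
    # Ceiling division: round up so we never undershoot the demand.
    needed = -(-int(required_bandwidth) // int(worker_bandwidth))
    # Clamp to [1, max_workers].
    return max(1, min(max_workers, needed))

# Example: training consumes 2000 samples/s of 150 KB each (~300 MB/s);
# if each cache worker serves ~100 MB/s, three workers are needed.
print(desired_cache_workers(2000, 150_000, 100_000_000, 10))
```

In practice a controller would re-evaluate this as the measured training speed changes and resize the cache runtime accordingly, which is what I understood Section V.B.3 of the paper to be doing.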
Any help will be appreciated!
Thanks,
Meng