I'm wondering where I can find the Cache System Autoscaler for DL training as mentioned in Fluid paper Section V.B.3?
I've spent hours searching the documentation, but I couldn't find a corresponding doc or script.
There are 2 related docs, but they don't answer my question:
(1) Accelerate Machine Learning Training with Fluid: it does describe how to run ML training on Fluid, but the cache capacity is hardcoded, whereas according to the Fluid paper the cache amount should adapt dynamically to the training speed.
(2) Cache Runtime Auto Scaling: it describes how to scale cache workers on the fly, but the scaling policy is quite simple (scale out when 90% of the cache capacity is used), while the scaling policy for DL training described in the paper is much more complicated.
Therefore, I'm wondering if there is any doc/demo/script/etc that can help me reproduce the cache autoscaling for DL training, as mentioned in the paper?
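To make the question concrete, the kind of adaptive policy I have in mind looks roughly like this. This is only a sketch of the idea; every name here is hypothetical and none of it comes from Fluid's actual API — the point is just that the worker count should track training throughput rather than a fixed utilization threshold:

```python
def desired_cache_workers(
    samples_per_sec: float,    # observed training throughput
    bytes_per_sample: float,   # average sample size in the dataset
    worker_bandwidth: float,   # read bandwidth one cache worker can serve (bytes/s)
    max_workers: int,          # cluster-imposed upper limit
) -> int:
    """Hypothetical policy: pick enough cache workers so that their
    aggregate bandwidth keeps up with the rate at which training
    consumes data, instead of reacting to a 90%-capacity threshold."""
    required_bandwidth = samples_per_sec * bytes_per_sample
    # Ceiling division: round up so we never undershoot the demand.
    needed = -(-int(required_bandwidth) // int(worker_bandwidth))
    # Clamp to [1, max_workers].
    return max(1, min(max_workers, needed))

# Example: training consumes 2000 samples/s of 150 KB each (~300 MB/s);
# if each cache worker serves ~100 MB/s, three workers are needed.
print(desired_cache_workers(2000, 150_000, 100_000_000, 10))
```

In practice a controller would re-evaluate this as the measured training speed changes and resize the cache runtime accordingly, which is what I understood Section V.B.3 of the paper to be doing.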
Any help will be appreciated!
Thanks,
Meng