Conversation
Is the output actually compressed? Looks like it's still float?
It is actually compressed. The float dtype is just a holder: 16 float values are packed into a single float (2 bits each).
For now, this method only supports compressing float data.
We may want to consider adding a data type like int2 |
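The "16 floats into one float" claim above works out because 16 values at 2 bits each fill exactly the 32 bits of one float32 slot. A minimal sketch of that packing (illustrative only; `pack16`/`unpack16` are hypothetical names, not the PR's code):

```python
import numpy as np

def pack16(codes):
    """Pack 16 two-bit codes (values in 0..3) into one 32-bit word."""
    assert len(codes) == 16
    word = np.uint32(0)
    for i, c in enumerate(codes):
        word |= np.uint32(c) << np.uint32(2 * i)
    return word

def unpack16(word):
    """Recover the 16 two-bit codes from a 32-bit word."""
    return [(int(word) >> (2 * i)) & 0b11 for i in range(16)]

codes = [0, 1, 2] * 5 + [1]        # 16 example codes
word = pack16(codes)
assert unpack16(word) == codes
# The packed word occupies the same 4 bytes as a float32, so it can live
# inside a float NDArray via a bitwise reinterpretation:
as_float = word.view(np.float32)   # dtype is just a holder for the bits
```

An `int2` storage type, as suggested above, would make the holder dtype explicit instead of reinterpreting float bits.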
How is this going to be used by kvstore and ps-lite? How does it recognize a compressed array? |
For multi-GPU training, we can directly invoke quantize_2bit() before copying data to another GPU, and invoke dequantize_2bit() when receiving the data block. But in distributed training, I'm not sure whether we need to modify the ps-lite code to invoke a dequantize function before updating the model?
The compressed array is also an NDArray: the first two elements are the two thresholds, and the remaining elements are the compressed data. Every 16 values are packed into one float.
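The layout described above implies a fixed size for the compressed array: a two-element header for the thresholds plus one float per 16 gradient values. A small sketch of that size calculation (my reading of the comment, not the merged code):

```python
import math

def compressed_size(n):
    """Float slots needed to hold n compressed gradient values:
    2 header slots (neg/pos thresholds) + ceil(n / 16) packed slots."""
    return 2 + math.ceil(n / 16)

# e.g. a gradient of 1000 floats compresses to 2 + 63 = 65 floats,
# roughly a 15x reduction once the small header is amortized
print(compressed_size(1000))
```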
How are gradients from different gpus aggregated? |
Before aggregating the gradients, we first need to decompress them into a new array with the same size as the original gradient array, but the values in the new array are changed: they consist only of the two thresholds. The new array can be re-used.
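A sketch of that decompress-then-aggregate path, assuming each 2-bit code maps to the negative threshold, zero, or the positive threshold on decompression (the helper name and code values here are assumptions, not the PR's actual operator):

```python
import numpy as np

def dequantize(codes, neg_thr, pos_thr):
    """Expand 2-bit codes (0 -> 0.0, 1 -> pos_thr, 2 -> neg_thr)
    back to a float array of the original gradient size."""
    lookup = np.array([0.0, pos_thr, neg_thr], dtype=np.float32)
    return lookup[codes]

# Each GPU sends only codes; the aggregator decompresses, then sums.
codes_gpu0 = np.array([1, 0, 2, 1], dtype=np.uint8)   # +thr, 0, -thr, +thr
codes_gpu1 = np.array([0, 1, 2, 2], dtype=np.uint8)
aggregated = (dequantize(codes_gpu0, -0.5, 0.5)
              + dequantize(codes_gpu1, -0.5, 0.5))
# aggregated == [0.5, 0.5, -1.0, 0.0]
```

Because the decompression target always has the full gradient shape, a single scratch array can be reused across gradients, as the comment notes.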
quantize also needs to output the residual, right?
Yes, before compressing we need to add the residual, and we also need to calculate the new residual after compressing. I think it would be better to let the user do this, since this operator is just a very low-level operator. We can build a higher-level Python wrapper for that.
The residual is stored on each local device.
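The residual handling discussed above is the standard error-feedback loop: fold the stored residual into the gradient before quantizing, then keep the quantization error as the new residual. A minimal sketch under assumed thresholds (function name and quantize rule are illustrative, not the PR's code):

```python
import numpy as np

def quantize_with_residual(grad, residual, neg_thr=-0.5, pos_thr=0.5):
    adjusted = grad + residual           # fold in the residual from last step
    q = np.zeros_like(adjusted)
    q[adjusted >= pos_thr] = pos_thr     # 2-bit quantize: {neg_thr, 0, pos_thr}
    q[adjusted <= neg_thr] = neg_thr
    new_residual = adjusted - q          # error feedback carried to next step
    return q, new_residual

g = np.array([0.7, 0.2, -0.9], dtype=np.float32)
r = np.zeros_like(g)                     # residual lives on the local device
q, r = quantize_with_residual(g, r)
# q == [0.5, 0.0, -0.5]; the leftover [0.2, 0.2, -0.4] is carried forward
```

Fusing these steps into one operator, as suggested below, avoids materializing `adjusted` separately.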
You can calculate the compression and the residual in one operator to make it faster.
I will do that.
@piiswrong I added the calculation of the residual to this operator and added a test case to the test file.
Where are the tests? I don't see any |
I removed the test file just to verify that the code could be built successfully. I will add the tests back later.
This is the implementation of two-bit compression.
Usage: