[SYSTEMDS-2699] CLA IO Compressed Matrix #1697
Conversation
Since the compression format has a tendency to change, the files written will not always be fully supported across different versions. A suggestion for detecting changes or incompatible versions is to write an identifier at the beginning of the files.
Since the Git hash is not available at all times, we could use the SystemDS version number as a fallback. I do not personally like either solution; maybe someone else has suggestions? Other design decisions:
Help / comments appreciated
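A minimal sketch of what such an identifier could look like, using plain `java.io` streams; the magic bytes and the version constant are illustrative placeholders, not the actual on-disk format:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Sketch of a version header: magic bytes plus a format version that is
// bumped on incompatible changes; the values here are illustrative only.
public class ClaHeaderSketch {
    private static final byte[] MAGIC = {'C', 'L', 'A'};
    private static final int FORMAT_VERSION = 1;

    public static void writeHeader(DataOutputStream out) throws IOException {
        out.write(MAGIC);
        out.writeInt(FORMAT_VERSION);
    }

    public static void checkHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        int version = in.readInt();
        if (!Arrays.equals(magic, MAGIC) || version != FORMAT_VERSION)
            throw new IOException("Unsupported compressed format version: " + version);
    }
}
```

On read, a mismatch then fails fast with a clear error instead of a garbled deserialization further down.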
Well, regarding the overall design I would recommend following the existing binary format. We write sequence files of key-value (index-block) pairs from both local and distributed writers, such that the files can be read in any execution mode. Right now it seems you directly serialize the entire block, similar to what the buffer pool eviction did. A version ID at the beginning of the file/blocks is fine, but we should strive to keep the file layout static; except for new encoding schemes, this should be possible.
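For reference, a sketch of that existing layout, assuming Hadoop's `SequenceFile` writer with the SystemDS `MatrixIndexes`/`MatrixBlock` pair as key and value; the path and block dimensions are illustrative, and a single empty block stands in for the per-block loop:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch of the binary layout: a sequence file of (index, block)
// key-value pairs, one entry per b x b block of the matrix.
public class BinaryBlockWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("scratch/example.bin")),
                SequenceFile.Writer.keyClass(MatrixIndexes.class),
                SequenceFile.Writer.valueClass(MatrixBlock.class))) {
            // in the real writers this runs over all blocks of the matrix
            writer.append(new MatrixIndexes(1, 1), new MatrixBlock(1000, 1000, true));
        }
    }
}
```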
I was thinking of the index-block design used in binary, but since the compression framework compresses an entire matrix in CP, I would have to decompose the compression into multiple blocks if we want this, and, in CP, reading would have to combine them again. I think this overcomplicates things unless we somehow make the compression able to combine different blocks with the same compression plan. Furthermore, if we write a compressed distributed block-indexed matrix to disk, we get multiple blocks with different formats that would not combine nicely in CP anyway; we would have to enforce that such a read leads to SP instructions. In the end, these problems make reading and writing the same way as binary blocks a bit challenging, especially if you want the same behavior.
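As a concrete sketch of that decomposition, slicing a CP-compressed matrix into row blocks before writing could look roughly as follows; this assumes `slice` can produce compressed outputs without decompressing (which is what a later commit on this PR adds), and the block height is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.sysds.runtime.matrix.data.MatrixBlock;

// Illustrative decomposition of one CP-compressed matrix into row blocks
// of height blen; relies on slice returning compressed outputs (an
// assumption at this point in the discussion).
public class BlockDecompositionSketch {
    public static List<MatrixBlock> sliceRowBlocks(MatrixBlock cmb, int blen) {
        List<MatrixBlock> blocks = new ArrayList<>();
        for (int rl = 0; rl < cmb.getNumRows(); rl += blen) {
            int ru = Math.min(rl + blen, cmb.getNumRows()) - 1; // inclusive bounds
            blocks.add(cmb.slice(rl, ru));
        }
        return blocks;
    }
}
```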
Please do not add these special cases / workarounds to the compiler.
Agreed. Hence I was asking for suggestions.
Thanks - as I said, strive for clear semantics first and don't worry about performance or suboptimal compression ratios. Writing out b x b blocks according to the CP compression scheme is fine (with splitting of column groups across block boundaries). When reading b x b compressed blocks, take the compression plan of the first blocks that touch the individual columns, and then merge the remaining blocks in. Once the initial version is ready and fully operational, we can talk about performance to minimize reallocations, etc.
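A rough sketch of that read-side merge, keyed by column-block index; the `appendRows` helper is hypothetical and left as a placeholder, since merging blocks under one compression plan is the actual work:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.sysds.runtime.compress.CompressedMatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch: the first block read for a column range fixes the compression
// plan; subsequent row blocks for the same columns are merged in.
public class CompressedReadMergeSketch {
    public static Map<Long, CompressedMatrixBlock> mergeByColumn(
            List<MatrixIndexes> indexes, List<CompressedMatrixBlock> blocks) {
        Map<Long, CompressedMatrixBlock> merged = new HashMap<>();
        for (int i = 0; i < blocks.size(); i++) {
            long colIx = indexes.get(i).getColumnIndex();
            CompressedMatrixBlock cur = merged.get(colIx);
            if (cur == null)
                merged.put(colIx, blocks.get(i)); // adopt this block's plan
            else
                appendRows(cur, blocks.get(i)); // hypothetical merge step
        }
        return merged;
    }

    private static void appendRows(CompressedMatrixBlock target, CompressedMatrixBlock next) {
        // placeholder: re-encode 'next' under target's plan and append its
        // rows; the details depend on the individual column-group encodings
        throw new UnsupportedOperationException("sketch only");
    }
}
```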
Okay, let's see what I can do. I have a few ideas.
Thanks for the help |
This commit adds the basic blocks for writing a compressed matrix to disk, and adds a basic test for the case of writing a matrix and reading it back from disk. Further testing and full integration into DML are needed, as well as a mechanism to detect if the format of the compression groups has changed.
This commit adds slicing of rows with compressed output, which allows us to produce compressed blocks for writing a full compressed matrix to disk. Also contained in this commit are the extensions to reading and writing of the compressed format (.cla) that allow it to work for Spark. Closes apache#1697
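For illustration, a hedged sketch of such a round-trip test; the writer/reader class names are assumptions about what this PR introduces, so they appear only in comments:

```java
import org.apache.sysds.runtime.compress.CompressedMatrixBlockFactory;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;

// Sketch of a write/read round trip for a compressed matrix; the actual
// .cla writer/reader entry points are left as commented placeholders.
public class CompressedRoundTripSketch {
    public static void main(String[] args) {
        // generate a test matrix and compress it in CP
        MatrixBlock mb = MatrixBlock.randOperations(1000, 10, 1.0, 0, 1, "uniform", 7);
        MatrixBlock cmb = CompressedMatrixBlockFactory.compress(mb).getLeft();

        // hypothetical round trip (placeholder names, not the PR's API):
        // WriterCompressed.writeCompressedMatrixToHDFS(cmb, "scratch/X.cla");
        // MatrixBlock back = ReaderCompressed.readCompressedMatrixFromHDFS("scratch/X.cla", 1000, 10);
        // finally compare 'back' against 'mb' element-wise with zero tolerance
        System.out.println("compressed size in memory: " + cmb.getInMemorySize());
    }
}
```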