
Conversation

@Baunsgaard (Contributor)

This commit adds the basic blocks for writing a compressed matrix to disk, and adds a basic test that writes a matrix and reads it back from disk.

Further testing and full integration into DML are needed, as well as a mechanism to detect if the format of the compression groups has changed.

@Baunsgaard (Contributor, Author)

Since the compression format has a tendency to change, the files written will not always be fully supported across different versions. A suggestion for detecting changes or incompatible versions is to write an identifier at the beginning of the files:

  • GitHash
  • SystemDS version Number

Since the GitHash is not available at all times, we could use the SystemDS version number as a fallback. I do not personally like either solution; maybe someone else has suggestions?
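A minimal sketch of the header idea, assuming a hypothetical CompressedFormatHeader helper; the MAGIC bytes and FORMAT_VERSION counter are made up for illustration and are not actual SystemDS constants:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a version header for the compressed format.
public class CompressedFormatHeader {
    // Magic bytes identifying the file as a compressed SystemDS matrix.
    private static final byte[] MAGIC = {'C', 'L', 'A', 0};
    // Bump whenever the on-disk layout of the column groups changes.
    private static final int FORMAT_VERSION = 1;

    public static void writeHeader(DataOutputStream out) throws IOException {
        out.write(MAGIC);
        out.writeInt(FORMAT_VERSION);
    }

    public static void checkHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        for (int i = 0; i < MAGIC.length; i++)
            if (magic[i] != MAGIC[i])
                throw new IOException("Not a compressed SystemDS file");
        int version = in.readInt();
        if (version != FORMAT_VERSION)
            throw new IOException("Unsupported compressed format version: " + version);
    }
}
```

A fixed version counter would avoid the availability problem of the GitHash: it only has to be incremented when the serialized layout actually changes.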

Other design decisions:

  1. For distributed execution I intend to simply write each compressed block to a different file, like we already do.
  2. Parallel reading and writing could be achieved with many files; for instance, I could split each column group into a separate file instead of using multiple blocks. Perhaps someone has experience or ideas here?

Help / Comments appreciated

@Baunsgaard (Contributor, Author)

@mboehm7

@mboehm7 (Contributor) commented Sep 18, 2022

Well, regarding the overall design I would recommend following the existing binary format. We write sequence files of key-value (index-block) pairs from both local and distributed writers, such that the files can be read in any execution mode. Right now it seems you directly serialize the entire block, similar to what the buffer pool eviction did.
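A minimal sketch of what this index-block write path could look like, assuming Hadoop's SequenceFile API and SystemDS's MatrixIndexes/MatrixBlock writables; the real writers configure more options (compression codec, replication, parallelism):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch of writing a single (index, block) pair in the binary format.
public class BinaryBlockWriteSketch {
    public static void writeBlock(String fname, MatrixBlock mb,
            long rowBlockIndex, long colBlockIndex) throws java.io.IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(fname)),
                SequenceFile.Writer.keyClass(MatrixIndexes.class),
                SequenceFile.Writer.valueClass(MatrixBlock.class))) {
            // Key-value pair: 1-based block index plus the block payload.
            writer.append(new MatrixIndexes(rowBlockIndex, colBlockIndex), mb);
        }
    }
}
```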

A version ID at the beginning of the file/blocks is fine, but we should strive to keep the file layout static; except for new encoding schemes, this should be possible.

@Baunsgaard (Contributor, Author)

I was thinking of the index-block design used in the binary format, but since the compression framework compresses an entire matrix in CP, I would have to decompose the compression into multiple blocks if we want this, and reading in CP would have to combine them again.

I think this overcomplicates things unless we somehow make the compression able to combine different blocks with the same compression plan.

Furthermore, if we write a compressed distributed block-indexed matrix to disk, we get multiple blocks with different formats that would not combine nicely in CP anyway, forcing such a read to lead to SP instructions.

In the end these problems make reading and writing the same way as binary blocks a bit challenging, especially if you want the same behavior.
But I can suggest we always treat the compressed format as an index-block based file with a block size >= nCols && nRows ;)
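A sketch of that "one big block" suggestion, under the assumption that a CompressedMatrixBlock can be registered directly as the sequence-file value class (the reader would have to agree on this):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.sysds.runtime.compress.CompressedMatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch: write the entire CP-compressed matrix as one index-block pair,
// keeping the sequence-file layout of the binary format.
public class SingleBlockWriteSketch {
    public static void write(String fname, CompressedMatrixBlock cmb)
            throws java.io.IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer w = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(fname)),
                SequenceFile.Writer.keyClass(MatrixIndexes.class),
                SequenceFile.Writer.valueClass(CompressedMatrixBlock.class))) {
            // A block size >= max(nRows, nCols) means exactly one block: (1,1).
            w.append(new MatrixIndexes(1, 1), cmb);
        }
    }
}
```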

@mboehm7 (Contributor) commented Sep 18, 2022

Please do not add these special cases / workarounds to the compiler.

@Baunsgaard (Contributor, Author)

Agreed. Hence I was asking for suggestions.

@mboehm7 (Contributor) commented Sep 18, 2022

Thanks - as I said, strive for clear semantics first and don't worry about performance or suboptimal compression ratios. Writing out b x b blocks according to the CP compression scheme is fine (with splitting of column groups across block boundaries). When reading b x b compressed blocks, take the compression plan of the first block that touches the individual columns, and then merge the remaining blocks in. Once the initial version is ready and fully operational, we can talk about performance to minimize reallocations, etc.
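A hedged sketch of this merge strategy, for the simple case of blocks stacked vertically in one block column; the CompressedBlock interface and its methods recodeTo/appendRows are illustrative placeholders, not SystemDS API:

```java
// Sketch: the first block fixes the compression plan for its columns;
// later blocks are re-encoded to that plan and their rows appended.
public class BlockMergeSketch {
    interface CompressedBlock {
        // Re-encode this block using the column-group plan of another block.
        CompressedBlock recodeTo(CompressedBlock plan);
        // Append the rows of another block encoded with the same plan.
        CompressedBlock appendRows(CompressedBlock other);
    }

    static CompressedBlock mergeVertical(CompressedBlock[] blocks) {
        CompressedBlock merged = blocks[0]; // first block defines the plan
        for (int i = 1; i < blocks.length; i++)
            merged = merged.appendRows(blocks[i].recodeTo(merged));
        return merged;
    }
}
```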

@Baunsgaard (Contributor, Author)

Okay, let's see what I can do. I have a few ideas.

@Baunsgaard (Contributor, Author)

Thanks for the help

Baunsgaard deleted the CompressedIO branch October 21, 2022 12:01
fathollahzadeh pushed a commit to fathollahzadeh/systemds that referenced this pull request Dec 7, 2022
This commit adds slicing of rows with compressed output, which allows us to have compressed blocks for writing a full compressed matrix to disk.

Also contained in this commit are extensions to the reading and writing of the compressed format (.cla), to allow it to work for Spark.

Closes apache#1697