
Conversation

@Baunsgaard (Contributor)

This commit adds the basic blocks for writing a compressed matrix to disk, and adds a basic test that writes a matrix and reads it back from disk.

Further testing and full integration into DML are needed, as well as a mechanism to detect if the format of the compression groups has changed.

@Baunsgaard (Contributor, Author)

Since the compression format has a tendency to change, the files written will not always be fully supported across different versions. A suggestion for detecting changes or incompatible versions is to write an identifier at the beginning of the files:

  • GitHash
  • SystemDS version Number

Since the GitHash is not available at all times, we could use the SystemDS version number as a fallback. I do not personally like either solution; maybe someone else has suggestions?
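A minimal sketch of the header idea, assuming a hypothetical CompressedFormatHeader helper; the MAGIC bytes and FORMAT_VERSION counter are made up for illustration and are not actual SystemDS constants:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a version header for the compressed format.
public class CompressedFormatHeader {
    // Magic bytes identifying the file as a compressed SystemDS matrix.
    private static final byte[] MAGIC = {'C', 'L', 'A', 0};
    // Bump whenever the on-disk layout of the column groups changes.
    private static final int FORMAT_VERSION = 1;

    public static void writeHeader(DataOutputStream out) throws IOException {
        out.write(MAGIC);
        out.writeInt(FORMAT_VERSION);
    }

    public static void checkHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        for (int i = 0; i < MAGIC.length; i++)
            if (magic[i] != MAGIC[i])
                throw new IOException("Not a compressed SystemDS file");
        int version = in.readInt();
        if (version != FORMAT_VERSION)
            throw new IOException("Unsupported compressed format version: " + version);
    }
}
```

A fixed version counter would avoid the availability problem of the GitHash: it only has to be incremented when the serialized layout actually changes.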

Other design decisions:

  1. For distributed execution I intend to simply write each compressed block to a different file, like we already do.
  2. Parallel reading and writing could be achieved with many files; for instance, I could split each column group into a separate file instead of using multiple blocks. Perhaps someone has experience or ideas here?

Help / Comments appreciated

@Baunsgaard (Contributor, Author)

@mboehm7

@mboehm7 (Contributor) commented Sep 18, 2022

Well, regarding the overall design I would recommend following the existing binary format. We write sequence files of key-value (index-block) pairs from both local and distributed writers, such that the files can be read in any execution mode. Right now it seems you directly serialize the entire block, similar to what the buffer pool eviction did.
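A minimal sketch of what this index-block write path could look like, assuming Hadoop's SequenceFile API and SystemDS's MatrixIndexes/MatrixBlock writables; the real writers configure more options (compression codec, replication, parallelism):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch of writing a single (index, block) pair in the binary format.
public class BinaryBlockWriteSketch {
    public static void writeBlock(String fname, MatrixBlock mb,
            long rowBlockIndex, long colBlockIndex) throws java.io.IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(fname)),
                SequenceFile.Writer.keyClass(MatrixIndexes.class),
                SequenceFile.Writer.valueClass(MatrixBlock.class))) {
            // Key-value pair: 1-based block index plus the block payload.
            writer.append(new MatrixIndexes(rowBlockIndex, colBlockIndex), mb);
        }
    }
}
```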

A version ID at the beginning of the file/blocks is fine, but we should strive to keep the file layout static; except for new encoding schemes, this should be possible.

@Baunsgaard (Contributor, Author)

I was thinking of the index-block design used in the binary format, but since the compression framework compresses an entire matrix in CP, I would have to decompose the compression into multiple blocks if we want this, and reading in CP would have to combine them again.

I think this overcomplicates things unless we somehow make the compression able to combine different blocks with the same compression plan.

Furthermore, if we write a compressed distributed block-indexed matrix to disk, we get multiple blocks with different formats that would not combine nicely in CP anyway, forcing such a read to lead to SP instructions.

In the end these problems make reading and writing the same way as binary blocks a bit challenging, especially if you want the same behavior.
But I can suggest we always treat the compressed format as an index-block based file with a block size >= nCols && nRows ;)
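A sketch of that "one big block" suggestion, under the assumption that a CompressedMatrixBlock can be registered directly as the sequence-file value class (the reader would have to agree on this):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.sysds.runtime.compress.CompressedMatrixBlock;
import org.apache.sysds.runtime.matrix.data.MatrixIndexes;

// Sketch: write the entire CP-compressed matrix as one index-block pair,
// keeping the sequence-file layout of the binary format.
public class SingleBlockWriteSketch {
    public static void write(String fname, CompressedMatrixBlock cmb)
            throws java.io.IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer w = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(fname)),
                SequenceFile.Writer.keyClass(MatrixIndexes.class),
                SequenceFile.Writer.valueClass(CompressedMatrixBlock.class))) {
            // A block size >= max(nRows, nCols) means exactly one block: (1,1).
            w.append(new MatrixIndexes(1, 1), cmb);
        }
    }
}
```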

@mboehm7 (Contributor) commented Sep 18, 2022

Please do not add these special cases / workarounds to the compiler.

@Baunsgaard (Contributor, Author)

Agreed. Hence I was asking for suggestions.

@mboehm7 (Contributor) commented Sep 18, 2022

Thanks - as I said, strive for clear semantics first and don't worry about performance or suboptimal compression ratios. Writing out b x b blocks according to the CP compression scheme is fine (with splitting of column groups across block boundaries). When reading b x b compressed blocks, take the compression plan of the first block that touches the individual columns, and then merge the remaining blocks in. Once the initial version is ready and fully operational, we can talk about performance to minimize reallocations, etc.
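A hedged sketch of this merge strategy, for the simple case of blocks stacked vertically in one block column; the CompressedBlock interface and its methods recodeTo/appendRows are illustrative placeholders, not SystemDS API:

```java
// Sketch: the first block fixes the compression plan for its columns;
// later blocks are re-encoded to that plan and their rows appended.
public class BlockMergeSketch {
    interface CompressedBlock {
        // Re-encode this block using the column-group plan of another block.
        CompressedBlock recodeTo(CompressedBlock plan);
        // Append the rows of another block encoded with the same plan.
        CompressedBlock appendRows(CompressedBlock other);
    }

    static CompressedBlock mergeVertical(CompressedBlock[] blocks) {
        CompressedBlock merged = blocks[0]; // first block defines the plan
        for (int i = 1; i < blocks.length; i++)
            merged = merged.appendRows(blocks[i].recodeTo(merged));
        return merged;
    }
}
```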

@Baunsgaard (Contributor, Author)

Okay, let's see what I can do. I have a few ideas.

@Baunsgaard (Contributor, Author)

Thanks for the help

Baunsgaard deleted the CompressedIO branch October 21, 2022 12:01
fathollahzadeh pushed a commit to fathollahzadeh/systemds that referenced this pull request Dec 7, 2022
This commit adds slicing of rows with compressed output, which allows us to have compressed blocks for writing a full compressed matrix to disk.

Also contained in this commit are extensions to the reading and writing of the compressed format (.cla), to allow it to work for Spark.

Closes apache#1697