Proposal to put more metadata in FROSTT file format #14
Comments
Hi Aart,
First of all, thanks for the comment. In order to make the format as simple as possible, we omitted the meta-data, since it can be inferred from the list of non-zeros. The downside of not having meta-data is that loading the data requires more than one pass through it. To alleviate any issues, we have implemented a loader function in C and will be putting it up on the website shortly (i.e., in the next couple of weeks).
We will discuss including an option for meta-data and how it should be formatted, and make the appropriate changes to the repo. Please let us know if you have any more suggestions or comments.
-- Jee
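To illustrate what "inferred from the list of non-zeros" means in practice, here is a minimal sketch of such an inference pass. It is not the C loader mentioned above; the function name `infer_metadata` is hypothetical, and it assumes 1-based indices, whitespace-separated fields, and `#` comment lines.

```python
import io

def infer_metadata(lines):
    """Infer (order, dims, nnz) from FROSTT-style text lines.

    Each data line is assumed to be: idx_1 ... idx_N value (1-based indices).
    Blank lines and lines starting with '#' are skipped (an assumption).
    """
    order = None
    dims = []
    nnz = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        indices = [int(tok) for tok in fields[:-1]]  # last field is the value
        if order is None:
            order = len(indices)
            dims = [0] * order
        elif len(indices) != order:
            raise ValueError("inconsistent number of indices")
        # The largest index seen per mode is only a lower bound on the true size.
        dims = [max(d, i) for d, i in zip(dims, indices)]
        nnz += 1
    return order, dims, nnz

sample = io.StringIO("1 1 1 1.0\n2 3 1 2.5\n1 2 4 -3.0\n")
print(infer_metadata(sample))  # (3, [2, 3, 4], 3)
```

A loader that wants to preallocate its arrays needs these numbers before it can store anything, which is why a header-less file takes one pass to count and a second pass to read.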
Hi Jee,
A nice compromise could be having an optional comment at the beginning that serves as a header with metadata. Something like this?
Thanks Shaden! If maintaining backward compatibility and/or the original simplicity is important, then adding the metadata as a comment is a very acceptable compromise indeed. But allow me to make two cases for simply adding it to the specification proper.
(1) It adds an extra integrity check, even for information that can be inferred. For example, even though the number of nonzeros can be inferred by scanning to EOF, if somebody accidentally removed a few lines of data, that would go undetected with the original format. In the extended format, this change would be detected.
(2) It adds exact information. For example, even though the number of modes (I called that rank) and the number of nonzeros can be inferred, the exact dimension size of each mode remains ambiguous. Scanning each mode for its maximum index value is not guaranteed to recover the true size. As an extreme example, take
Is this a dense 1x1x1 tensor, a sparse 17x13x3 tensor, or a very sparse 10,000 x 10,000 x 10,000 tensor? No good way to know. Expressing this as
would unambiguously define the sparse tensor.
In any case, I highly appreciate your team taking the time to discuss this more, and hope my feedback is useful. In the meantime, I thank you and your team for providing this very nice set of sparse tensors; that is a really great contribution!
Aart
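The two arguments above (an integrity check and exact sizes) can be sketched as a loader that validates the data against a header. This is only an illustration: the header layout used here (a line with order and nnz, then a line with the mode sizes) is an assumption taken from one of the variants quoted later in this thread, and `load_with_header` is a hypothetical name.

```python
def load_with_header(text):
    """Parse a tensor whose first two lines are 'order nnz' and the mode sizes.

    The header layout is an assumption for illustration, not an agreed format.
    Indices are assumed to be 1-based; '#' lines are treated as comments.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    order, nnz = (int(tok) for tok in lines[0].split())
    dims = [int(tok) for tok in lines[1].split()]
    if len(dims) != order:
        raise ValueError("header lists %d sizes for order %d" % (len(dims), order))

    entries = []
    for ln in lines[2:]:
        fields = ln.split()
        if len(fields) != order + 1:
            raise ValueError("expected %d indices and a value: %r" % (order, ln))
        idx = tuple(int(tok) for tok in fields[:-1])
        for i, d in zip(idx, dims):
            if not 1 <= i <= d:
                raise ValueError("index %d outside 1..%d" % (i, d))
        entries.append((idx, float(fields[-1])))

    # Integrity check: a truncated file no longer matches the declared count.
    if len(entries) != nnz:
        raise ValueError("header declares %d nonzeros, found %d" % (nnz, len(entries)))
    return dims, entries

dims, entries = load_with_header("# This is exact!\n3 1\n17 13 3\n1 1 1 1.0\n")
print(dims, entries)
```

With a header like this, a file that lost a few data lines fails the count check instead of silently loading as a smaller tensor, and the mode sizes are read directly instead of being guessed from the largest index.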
@aartbik this is similar to the binary format writer I made for FROSTT-like tensors here: https://gitlab.com/tensors/genten/-/blob/mpi-sgd/tools/sparse_tensor_to_binary.cpp
I wanted a concise binary format so that I could read the tensor more easily in parallel, and I settled on the following:
* The header for the text files needs to be in the form
-----------------------------------------------------------------
sptensor -> Type
5 -> Number of dimensions
1605 4198 1631 4209 868131 -> Sizes of each dimension
1698825 -> Number nonzero
1 1 1 1049 156 1.000000 -> This is the first nonzero element
... -> More nonzero elements
-----------------------------------------------------------------
Making a binary form of this is straightforward as well, but it might be nice to all agree on what the binary format of FROSTT tensors should look like. I have the following form:
The output file will have the following form, without the newlines or -> comments
73 70 74 6e -> 4 char 'sptn'
ndims -> uint32_t
bytes_for_float_type -> uint32_t, might need more here, or just drop this field and assume double to be easier
size0 size1 size2 size3 size4 -> ndims uint64_t
bits0 bits1 bits2 bits3 bits4 -> number of bits used for each index
number_non_zero -> uint64_t
/* the elements depend on the size of each mode to make the file size smaller we
* will use the smallest of uint8_t uint16_t uint32_t uint64_t that holds all
* the elements from the size field above, for now all floats are stored as
* described above. unlike the textual format we will always use zero based
* indexing
*/
1 1 1 1049 156 1.000000 -> uint16_t uint16_t uint16_t uint16_t uint32_t float_type
I settled on this layout but would be open to changing it. Because for high-dimensional tensors the vast majority of the file size is dedicated to coordinates, I thought it was a good idea to squeeze each coordinate into the smallest builtin type it would fit into.
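The "smallest builtin type" rule can be sketched as follows. This is not the code from the linked tool; `index_bits` and `pack_element` are hypothetical names, and little-endian byte order and an 8-byte float are assumptions that the description above leaves open.

```python
import struct

# struct format codes for the four builtin unsigned widths.
_CODES = {8: "B", 16: "H", 32: "I", 64: "Q"}

def index_bits(size):
    """Smallest of 8/16/32/64 bits that holds the zero-based indices 0..size-1."""
    if size < 1:
        raise ValueError("mode size must be positive")
    max_index = size - 1
    for bits in (8, 16, 32, 64):
        if max_index < (1 << bits):
            return bits
    raise ValueError("mode size does not fit in 64 bits")

def pack_element(one_based_indices, value, sizes):
    """Pack one nonzero as fixed-width zero-based indices followed by a double.

    Assumptions of this sketch: little-endian layout, 8-byte float value.
    """
    fmt = "<" + "".join(_CODES[index_bits(s)] for s in sizes) + "d"
    zero_based = [i - 1 for i in one_based_indices]
    return struct.pack(fmt, *zero_based, value)

sizes = [1605, 4198, 1631, 4209, 868131]
print([index_bits(s) for s in sizes])   # [16, 16, 16, 16, 32]
record = pack_element([1, 1, 1, 1049, 156], 1.0, sizes)
print(len(record))                      # 20 bytes per nonzero
```

Because every record has the same width, the k-th nonzero sits at a fixed offset past the header, which is what makes reading slices of the file in parallel straightforward.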
Thanks for the information. We seem to agree on the fields for the text format. For a binary format, it makes sense to include some form of compression. Defining the number of bits used for the (max) index in each dimension sounds reasonable. You could also consider a variable-length encoding scheme, with savings for the smaller indices, even if some are large.
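As one concrete instance of a variable-length scheme, here is a minimal sketch of unsigned LEB128 (base-128 varint) encoding. The comment above does not name a specific scheme, so this choice and the function names are assumptions for illustration.

```python
def encode_varint(n):
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

for index in (0, 127, 128, 868130):
    encoded = encode_varint(index)
    print(index, len(encoded), decode_varint(encoded)[0])
```

Small indices take one byte even in a mode whose largest index needs three, which is the saving described above. The cost is that records no longer have a fixed width, so an element's byte offset cannot be computed without scanning or a separate offset table.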
I agree; I was just trying to do something simple so I could read the files with MPI_IO, with each input at a known offset.
Do we want to explicitly express the order of the tensor
# This is exact!
3 1
17 13 3
1 1 1 1.0
or just derive it from the number of mode sizes provided?
# This is exact!
1
17 13 3 -> Information contained here
1 1 1 1.0
Alternatively, we could copy Matrix Market
# This is exact!
17 13 3 1 -> nnz is the final input
1 1 1 1.0
and that would let you use the same reader to read MM COO files.
Finally (@ShadenSmith, sorry if this is specified somewhere else that I am missing), should we specify that indices always use 1-based indexing (this appears to be what Matrix Market does), 0-based indexing, or include a flag in the file to swap between them?
Hopefully this doesn't come off as bikeshedding. I don't have a strong preference, but I would be excited if we all agreed on a format so I could ditch my one-off conversion scripts. I'm fine with the FROSTT team just picking an option, and with changing my code to match.
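A minimal sketch of the Matrix Market style variant, where the order is derived from the size line and the last field is nnz. The names `parse_size_line` and `to_zero_based` are hypothetical, and the index-base handling shows just one of the options raised above (1-based on disk, converted on load).

```python
def parse_size_line(line):
    """Split a size line 'd_1 ... d_N nnz' into (dims, nnz).

    The order N is derived from the number of sizes rather than stored.
    """
    fields = [int(tok) for tok in line.split()]
    if len(fields) < 2:
        raise ValueError("size line needs at least one mode size and nnz")
    *dims, nnz = fields
    return dims, nnz

def to_zero_based(indices):
    """Convert 1-based on-disk indices to 0-based in-memory indices."""
    return tuple(i - 1 for i in indices)

print(parse_size_line("17 13 3 1"))  # ([17, 13, 3], 1)
print(parse_size_line("2 3 6"))      # ([2, 3], 6), a matrix-shaped size line
print(to_zero_based((1, 1, 1)))      # (0, 0, 0)
```

The same function handles a Matrix Market coordinate size line (rows, columns, nnz) as the order-2 case, which is the reader reuse suggested above.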
Thanks everyone for the comments. This part of the page has become lively again :)
I'm sorry for the extremely late update...
Disclaimer:
--- General idea ---
An example would be (for the Chicago crime tensor)
For binary files, it would contain the following information:
--- Templated library for loading/storing tensors ---
For those of you who are developing in C, I've also included example code that shows how these templated classes can be used through interface functions (see examples/C).
The code can be found here:
If you have any questions, please leave a comment or send me an email. Thank you!
I found the current format that only lists nonzero elements of a tensor rather terse. Granted, the rank can be inferred from the number of integer columns and the number of nonzeros by scanning to EOF. But the dimension sizes would be unclear for very sparse tensors where not all ranges are "filled".
So why not include a bit more metadata in the file header?
May I propose, at the very least
For example, a 2x3 dense matrix would look as follows