Proposal to put more metadata in FROSTT file format #14

Open
aartbik opened this issue Jan 17, 2021 · 9 comments

aartbik commented Jan 17, 2021

I found the current format, which only lists the nonzero elements of a tensor, rather terse. Granted, the rank can be inferred from the number of integer columns, and the number of nonzeros by scanning to EOF. But the dimension sizes remain unclear for very sparse tensors where not all index ranges are "filled".

So why not include a bit more metadata in the file header?

May I propose, at the very least, the following:

# extended FROSTT file format
<rank> <nnz>
<dimension sizes (one per rank)>
... data as before ...

For example, a 2x3 dense matrix would look as follows:

# this is a comment
2 6
2 3
1 1 1.1
1 2 1.2
1 3 1.3
2 1 2.1
2 2 2.2
2 3 2.3
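
For concreteness, a first-cut reader for this extended header could look like the sketch below (hypothetical code, not part of any existing FROSTT tooling; it assumes the header layout proposed above):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1]);
  std::string line;
  // Skip leading '#' comment lines.
  while (std::getline(in, line) && (line.empty() || line[0] == '#')) {
  }
  // First non-comment line: <rank> <nnz>.
  std::istringstream header(line);
  std::size_t rank = 0, nnz = 0;
  header >> rank >> nnz;
  // Next line: one dimension size per mode.
  std::vector<std::size_t> dims(rank);
  for (std::size_t m = 0; m < rank; ++m) in >> dims[m];
  // Remaining lines: <rank> indices followed by a value.
  std::vector<std::size_t> idx(rank);
  double value = 0.0;
  std::size_t count = 0;
  for (; count < nnz; ++count) {
    bool ok = true;
    for (std::size_t m = 0; m < rank; ++m)
      ok = ok && static_cast<bool>(in >> idx[m]);
    if (!ok || !(in >> value)) break;  // truncated or malformed file
  }
  // The declared nnz doubles as an integrity check for truncated files.
  std::cout << "read " << count << " of " << nnz << " nonzeros\n";
  return 0;
}
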
@jeewhanchoi (Contributor)

Hi Aart,

First of all, thanks for the comment.

In order to make the format as simple as possible, we omitted the meta-data, since it can be inferred from the list of non-zeros.

The downside of not having meta-data is that it requires more than one pass through the data to load it. In order to alleviate any issues, we have implemented a loader function in C and will be putting it up on the website shortly (i.e., in the next couple of weeks).

We will discuss including an option for meta-data, and how it should be formatted, and make the appropriate changes to the repo.

Please let us know if you have any more suggestions or comments.

-- Jee

aartbik commented Jan 17, 2021

Hi Jee,
Thanks for your very timely response, and I appreciate you discussing this! To give some background, I am working on a "sparse tensor compiler" in MLIR using the solid foundation laid by the TACO compiler team (see the ongoing discussion in this thread). For benchmarking, I found the FROSTT repository very helpful, but was just wondering about adding some more metadata (see this thread). Please let me know if I can help.
Aart

@ShadenSmith (Collaborator)

A nice compromise could be having an optional comment at the beginning that serves as a header with metadata. Something like this:

#META modes=2 nnz=6 dims=2,3
1 1 1.1
1 2 1.2
1 3 1.3
2 1 2.1
2 2 2.2
2 3 2.3
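
A reader could pick up such an optional #META line and simply fall back to the current multi-pass behavior when it is absent; a minimal sketch, with names that are purely illustrative:

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse an optional "#META modes=2 nnz=6 dims=2,3" line. Returns false if the
// line is not a #META header, in which case nothing about existing readers
// has to change.
bool parseMetaLine(const std::string &line, std::size_t &modes,
                   std::size_t &nnz, std::vector<std::size_t> &dims) {
  if (line.compare(0, 5, "#META") != 0) return false;
  std::istringstream ss(line.substr(5));
  std::string tok;
  while (ss >> tok) {
    std::size_t eq = tok.find('=');
    if (eq == std::string::npos) continue;
    std::string key = tok.substr(0, eq), val = tok.substr(eq + 1);
    if (key == "modes") {
      modes = std::stoul(val);
    } else if (key == "nnz") {
      nnz = std::stoul(val);
    } else if (key == "dims") {
      std::istringstream ds(val);
      std::string d;
      while (std::getline(ds, d, ',')) dims.push_back(std::stoul(d));
    }
  }
  return true;
}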

aartbik commented Jan 17, 2021

Thanks Shaden!

If maintaining backward compatibility and/or the original simplicity is important, then adding the metadata as a comment is a very acceptable compromise indeed. But allow me to make two cases for simply adding it to the specification proper.

(1) It adds an extra integrity check, even for information that can be inferred. For example, even though the number of nonzeros can be inferred by scanning to EOF, if somebody accidentally removed a few lines of data, that would go undetected with the original format. In the extended format, this change would be detected.

(2) It adds exact information. For example, even though the number of modes (I called that the rank) and the number of nonzeros can be inferred, the exact dimension sizes for each mode remain ambiguous. Scanning each mode for the maximum index value in that mode is not guaranteed to be precise. As an extreme example, take

# what is this exactly?
1 1 1 1.0

Is this a dense 1x1x1 tensor, a sparse 17x13x3 tensor, or a very sparse 10,000 x 10,000 x 10,000 tensor? No good way to know. Expressing this as

# This is exact!
3 1
17 13 3
1 1 1 1.0

would unambiguously define the sparse tensor.

In any case, I highly appreciate your team taking the time to discuss this further, and I hope my feedback is useful. In the meantime, thank you and your team for providing this very nice set of sparse tensors; it is a really great contribution!

Aart

calewis commented Jan 25, 2021

@aartbik this is similar to the binary format writer I made for FROSTT-like tensors here: https://gitlab.com/tensors/genten/-/blob/mpi-sgd/tools/sparse_tensor_to_binary.cpp

I wanted a concise binary format so that I could read the tensor more easily in parallel, and I settled on the following:

* The header for the text files needs to be in the form
-----------------------------------------------------------------
sptensor                   -> Type
5                          -> Number of dimensions
1605 4198 1631 4209 868131 -> Sizes of each dimension
1698825                    -> Number of nonzeros
1 1 1 1049 156 1.000000    -> This is the first nonzero element
...                        -> More nonzero elements
-----------------------------------------------------------------

Making a binary form of this is straightforward as well, but it might be nice to all agree on what the binary format of FROSTT tensors should look like. I have the following form:

The output file will have the following form, without the newlines or '->' comments:
73 70 74 6e                   -> 4 char 'sptn'
ndims                         -> uint32_t
bytes_for_float_type          -> uint32_t, might need more here, or just drop this field and assume double to be easier
size0 size1 size2 size3 size4 -> ndims uint64_t
bits0 bits1 bits2 bits3 bits4 -> number of bits used for each index
number_non_zero               -> uint64_t
/* the elements depend on the size of each mode to make the file size smaller we
* will use the smallest of uint8_t uint16_t uint32_t uint64_t that holds all
* the elements from the size field above, for now all floats are stored as
* described above.  unlike the textual format we will always use zero based
* indexing
*/
1 1 1 1049 156 1.000000 -> uint16_t uint16_t uint16_t uint16_t uint32_t float_type

but would be open to changing it.

Because, for high-dimensional tensors, the vast majority of the file size is actually dedicated to coordinates, I thought it was a good idea to squeeze each coordinate into the smallest built-in type it would fit in.
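
For illustration, the width-selection step described above could look like the following sketch (the widths of the header fields themselves are my assumption here; the actual writer linked above may differ):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Smallest of 8/16/32/64 bits that holds any index of a mode of size `dim`,
// matching the "smallest of uint8_t/uint16_t/uint32_t/uint64_t" rule above.
static std::uint32_t bitsForDim(std::uint64_t dim) {
  if (dim <= UINT8_MAX) return 8;
  if (dim <= UINT16_MAX) return 16;
  if (dim <= UINT32_MAX) return 32;
  return 64;
}

// Write the header: magic, ndims, float width, dims, per-mode index widths,
// and nnz. The per-nonzero records would follow, each index stored in the
// width chosen above, with zero-based indexing.
void writeHeader(std::FILE *f, const std::vector<std::uint64_t> &dims,
                 std::uint64_t nnz) {
  const char magic[4] = {'s', 'p', 't', 'n'};
  std::fwrite(magic, 1, 4, f);
  std::uint32_t ndims = static_cast<std::uint32_t>(dims.size());
  std::uint32_t float_bytes = 8;  // assume double for simplicity
  std::fwrite(&ndims, sizeof(ndims), 1, f);
  std::fwrite(&float_bytes, sizeof(float_bytes), 1, f);
  std::fwrite(dims.data(), sizeof(std::uint64_t), dims.size(), f);
  for (std::size_t m = 0; m < dims.size(); ++m) {
    std::uint32_t bits = bitsForDim(dims[m]);
    std::fwrite(&bits, sizeof(bits), 1, f);
  }
  std::fwrite(&nnz, sizeof(nnz), 1, f);
}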

aartbik commented Jan 26, 2021

Thanks for the information. We seem to agree on the fields for the text format. For a binary format, it makes sense to include some form of compression. Defining the number of bits used for the (max) index in each dimension sounds reasonable. You could also consider a variable-length encoding scheme, with savings for the smaller indices, even if some are large.
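
As a concrete example of such a scheme, a LEB128-style varint (as used by Protocol Buffers) stores 7 payload bits per byte, so small indices cost a single byte even when other indices in the same tensor are large. A minimal sketch, purely for illustration:

#include <cstdint>
#include <vector>

// LEB128-style varint: 7 payload bits per byte, high bit set while more
// bytes follow.
void encodeVarint(std::uint64_t v, std::vector<std::uint8_t> &out) {
  while (v >= 0x80) {
    out.push_back(static_cast<std::uint8_t>(v) | 0x80);
    v >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(v));
}

std::uint64_t decodeVarint(const std::uint8_t *&p) {
  std::uint64_t v = 0;
  int shift = 0;
  while (*p & 0x80) {
    v |= static_cast<std::uint64_t>(*p++ & 0x7f) << shift;
    shift += 7;
  }
  v |= static_cast<std::uint64_t>(*p++) << shift;
  return v;
}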

calewis commented Jan 26, 2021

I agree; I was just trying to do something simple so I could read the files with MPI_IO and each input would have a known offset.

Do we want to explicitly express the order of the tensor

# This is exact!
3 1
17 13 3
1 1 1 1.0

or just derive it from the number of mode sizes provided?

# This is exact!
1
17 13 3 -> Information contained here
1 1 1 1.0

Alternatively, we could copy Matrix Market:

# This is exact!
17 13 3 1 -> nnz is the final input
1 1 1 1.0

and that would let you use the same reader to read MM COO files.

Finally (@ShadenSmith, sorry if this is specified somewhere else and I am missing it), should we specify that indices always use 1-based indexing (this appears to be what Matrix Market does), 0-based indexing, or include a flag in the file to switch between them?

Hopefully this doesn't come off as bike-shedding; I don't have a strong preference, but I would be excited if we all agreed on a format so I could ditch my one-off conversion scripts. I'm fine with the FROSTT team just picking an option and me changing my code to match.

jeewhanchoi commented Jan 26, 2021

Thanks everyone for the comments. This part of the page has become lively again :)

  1. We will define/implement a meta-data scheme with optional comments at the beginning where these parameters can be specified. Hopefully, this will maintain backwards compatibility.

  2. We already have a binary reader/writer that is compatible with the current format. We will share it soon.
    I'm finally done with two of the three NSF proposals, so I will try to push these out to the website in the next week or so.
    Once these have been pushed out, I would appreciate any feedback/bug report that you guys can provide.

  3. As for variable bits, we are in the process of putting together a "compression" scheme with variable bit encoding for the modes. We will update the datasets with the appropriate description once it has been published (soon).
    This one will take a bit more time as we are still focused on writing the paper and with experiments to get the paper out to ICS.

jeewhanchoi commented Mar 27, 2021

I'm sorry for the extremely late update...
I had been playing around with implementing new templated C++ code for loading and saving tensors, and I thought that finishing it might give me some ideas as to how the meta-data should be stored (and I've been generally busy with teaching).

Disclaimer:
All of the text below is a first-cut proposal, so if you have a better suggestion, please leave a comment (or send me an email at jeec@uoregon.edu).
I am working on an official document that will go on the FROSTT website that describes all of this once we settle on an 'official standard.'

--- General idea ---
As Shaden suggested, I think going with comment-based metadata will keep existing libraries backwards compatible with our tensors, and I think it's the cleanest option (just my opinion, of course).
I suggest one metadata entry per line, as it would allow users to add more later and would make it easier to parse.

An example (for the Chicago crime tensor) would be:
# nmodes 4
# nnz 5330673
# dims 6186 24 77 32
Basically, each line starts with a '#', and the keywords 'nmodes', 'nnz', and 'dims' can be used to specify the metadata, with the value(s) following, separated by spaces.
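
A reader for this scheme only needs to scan the leading comment lines; a rough sketch of the parsing (not the actual loader, and the names are mine):

#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct TensorMeta {
  std::size_t nmodes = 0;
  std::size_t nnz = 0;
  std::vector<std::size_t> dims;
};

// Consume the leading "# keyword value(s)" lines and stop at the first
// non-comment line (the first nonzero). Unknown keywords are skipped so
// that more metadata can be added later without breaking this reader.
TensorMeta readMeta(std::istream &in) {
  TensorMeta meta;
  while (in.peek() == '#') {
    std::string line;
    std::getline(in, line);
    std::istringstream ss(line);
    std::string hash, key;
    ss >> hash >> key;  // the '#' and the keyword
    if (key == "nmodes") {
      ss >> meta.nmodes;
    } else if (key == "nnz") {
      ss >> meta.nnz;
    } else if (key == "dims") {
      std::size_t d;
      while (ss >> d) meta.dims.push_back(d);
    }
  }
  return meta;
}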

For binary files, it would contain the following information:

  1. # of bytes used to store the value data, # of bytes to store the indices, and # of bytes to store the nnz.
  2. nmodes, dims, and nnz (or other metadata if we decide to add more - we need to find a good way of providing an arbitrary number of metadata for binary files).
  3. indices and values
    Indices are stored as an nmodes x nnz array, so it would be the nnz mode-0 indices, followed by the nnz mode-1 indices, etc.
    This way, the binary file reading function can read the # of bytes used by each data type and use this information to read the subsequent data. A minimal read sketch following this layout is given below.
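
Following that layout, reading could look roughly like this (for brevity the sketch fixes the widths at 64-bit sizes/indices and double values; the real reader would first read the byte-width fields from item 1 and dispatch on them):

#include <cstdint>
#include <cstdio>
#include <vector>

struct BinTensor {
  std::uint64_t nmodes = 0, nnz = 0;
  std::vector<std::uint64_t> dims;
  std::vector<std::uint64_t> indices;  // nmodes * nnz entries, mode-major
  std::vector<double> values;          // nnz entries
};

// Read nmodes, dims, nnz, then the mode-major index arrays, then the values.
bool readBinary(std::FILE *f, BinTensor &t) {
  if (std::fread(&t.nmodes, sizeof(t.nmodes), 1, f) != 1) return false;
  t.dims.resize(t.nmodes);
  if (std::fread(t.dims.data(), sizeof(std::uint64_t), t.nmodes, f) != t.nmodes)
    return false;
  if (std::fread(&t.nnz, sizeof(t.nnz), 1, f) != 1) return false;
  // All nnz mode-0 indices come first, then all mode-1 indices, and so on.
  t.indices.resize(t.nmodes * t.nnz);
  if (std::fread(t.indices.data(), sizeof(std::uint64_t), t.indices.size(), f)
      != t.indices.size())
    return false;
  t.values.resize(t.nnz);
  return std::fread(t.values.data(), sizeof(double), t.nnz, f) == t.nnz;
}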

--- Templated library for loading/storing tensors ---
I've created a library that defines a templated class 'SpTensor' that holds relevant information about the tensor and has functions for reading and writing text/binary tensors (a rough outline is sketched after the list below).
I've made it a C++ templated class for the following reasons:

  1. By making the data types templated, you can specify different data types to hold the mode lengths, indices, values, etc. (e.g., float vs. double, unsigned int vs. unsigned long int, etc.)
  2. By making SpTensor a class, you can inherit from it to add more data structures to the sparse tensor as needed. You can also modify the functions to give them different behavior.
  3. I've also defined a MetaData class. Currently it reads the above three metadata entries, but you can use inheritance to add more metadata and provide functions to read them as needed.
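
Since the repo is still private, here is only a hypothetical outline of the interface described above; apart from the names SpTensor and MetaData, nothing below is taken from the actual code:

#include <cstddef>
#include <string>
#include <vector>

// Metadata read from the leading comment lines; derived classes can add
// more fields and override parseLine() to read them.
class MetaData {
 public:
  std::size_t nmodes = 0;
  std::size_t nnz = 0;
  std::vector<std::size_t> dims;
  virtual ~MetaData() {}
  virtual void parseLine(const std::string &line) { (void)line; }
};

// Templated on the index and value types, so e.g. SpTensor<unsigned, float>
// or SpTensor<unsigned long, double> can be chosen per dataset.
template <typename IndexT, typename ValueT>
class SpTensor {
 public:
  MetaData meta;
  std::vector<IndexT> indices;  // nnz * nmodes coordinates
  std::vector<ValueT> values;   // nnz values

  void readText(const std::string &path);
  void writeText(const std::string &path) const;
  void readBinary(const std::string &path);
  void writeBinary(const std::string &path) const;
};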

For those of you who are developing in C, I've also included example code that shows how these templated classes can be used through interface functions (see examples/C).
While you can use a C compiler to compile your own code, you will need a C++ compiler to a) compile the library and then b) link the object files together to generate the executable.
We will eventually provide this code as a static/dynamic library.
Also, the code uses C++11 features (i.e., compile it with -std=c++11).

The code can be found here:
https://github.com/jeewhanchoi/tensor_tools
This is a WIP, so I'm keeping the repo private for the moment. However, if you want to take a look or try it out, please let me know and I'll provide access.
Disclaimer 2:
I'm not a good C++ programmer, so there may be many inefficiencies and bugs.
Please feel free to suggest ways to improve the code (or better yet, make a pull request to make the improvements yourself).

If you have any questions, please leave a comment or send me an email.

Thank you!
