[Java] serialize to output stream is limited to 2GB #1528
Comments
Fury needs to go back in the buffer to update some headers in certain situations. In such cases, flushing ahead is not possible. In the long run, we may be able to support streaming writes if we provide an option to disable such look-back. But could you share which cases require you to serialize such a big object? It's rare in a production environment, and protobuf doesn't support it either. |
We have quite large files on disk and cannot use protobuf because of the 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite a lot. |
This is interesting. If embeddings are stored, we may need a larger limit. Could we split a big object into several small objects for serialization? I mean, you can serialize like this:

```java
Fury fury = xxx;
OutputStream stream = xxx;
fury.serialize(stream, o1);
fury.serialize(stream, o2);
fury.serialize(stream, o3);
```

Then for deserialization, you can:

```java
Fury fury = xxx;
FuryInputStream stream = xxx;
Object o1 = fury.deserialize(stream);
Object o2 = fury.deserialize(stream);
Object o3 = fury.deserialize(stream);
``` |
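Independent of the Fury API, the splitting idea above can be sketched in plain Java: slice a large embeddings column into fixed-size chunks so that no single serialized object approaches the 2GB limit. Names here (`ChunkedColumn`, `split`) are illustrative, not part of Fury.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedColumn {
    // Split a column into slices of at most chunkSize elements each,
    // so each slice can be serialized as a separate small object.
    static List<float[]> split(float[] column, int chunkSize) {
        List<float[]> chunks = new ArrayList<>();
        for (int start = 0; start < column.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, column.length);
            float[] chunk = new float[end - start];
            System.arraycopy(column, start, chunk, 0, chunk.length);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

Each chunk could then be passed to `fury.serialize(stream, chunk)` in turn and read back with repeated `fury.deserialize(stream)` calls, as in the snippet above.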
If we can't split an object graph into multiple serializations, then we do need to support a larger size limit. |
In our case we have a file with meta information and embeddings of several million images. All the embeddings are stored in a column-based style for fast access and distance calculations. The embeddings are around 1000 dimensions, which means we can only store 2 million images in one file; otherwise just the embeddings alone are too large for Fury. |
Why not split this file into smaller files?
It is just a hassle. Right now the meta information of all images is stored in row-based style, followed by the embedding information of all the images in column-based style. I see three options with the current implementation of Fury to handle large files:
For options 2 and 3 we would need to keep track of how big the file already is in order to make reasonable splits (not splitting the meta data or embedding of an image across two files). Finding a good splitting point for option 1 is more straightforward, since the number of embeddings fitting into one file can be calculated in advance. In all cases, ideal memory allocation and ordering of data would need more consideration.
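One way to keep track of how big the file already is, independent of Fury, is to wrap the target OutputStream in a small counting wrapper and check the count before writing the next record. A sketch under that assumption; `CountingOutputStream` here is a hypothetical helper, not a Fury class:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper: counts bytes written so the caller can decide
// when to roll over to a new file before hitting a size limit.
public class CountingOutputStream extends FilterOutputStream {
    private long count;

    public CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        count += len;
    }

    public long getCount() {
        return count;
    }
}
```

Before serializing the next record, `getCount()` can be compared against the split threshold, and a new file opened if it would be exceeded.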
Search before asking
Version
Version: 0.4.1
OS: Windows
JDK: 21
Component(s)
Java
Minimal reproduce step
What did you expect to see?
I was hoping to get a file with 2147483646 bytes, all zero.
What did you see instead?
Anything Else?
I think that when providing an OutputStream to the serialize method, the intermediate MemoryBuffer should behave like the buffer inside a BufferedOutputStream: when the buffer is full, it should flush its contents to the underlying OutputStream in order to free up its bytes.
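The suggested behavior can be sketched with a simplified stand-in for Fury's MemoryBuffer (names are illustrative, not the actual Fury internals): a fixed-size buffer that spills to the underlying OutputStream when full, instead of growing without bound toward the 2GB `int` limit. As noted earlier in the thread, this conflicts with Fury's need to seek back and patch headers, so it would presumably require an option to disable that look-back.

```java
import java.io.IOException;
import java.io.OutputStream;

// Simplified stand-in for the proposed behavior: flush to the underlying
// OutputStream when the fixed-size buffer fills, like BufferedOutputStream,
// rather than reallocating an ever-larger array.
public class FlushingBuffer {
    private final byte[] buf;
    private int pos;
    private final OutputStream out;

    public FlushingBuffer(OutputStream out, int capacity) {
        this.out = out;
        this.buf = new byte[capacity];
    }

    public void writeByte(byte b) throws IOException {
        if (pos == buf.length) {
            flush(); // spill to the stream to free up the buffer's bytes
        }
        buf[pos++] = b;
    }

    public void flush() throws IOException {
        out.write(buf, 0, pos);
        pos = 0;
    }
}
```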
Are you willing to submit a PR?