[Java] serialize to output stream is limited to 2GB #1528

Open · Neiko2002 opened this issue Apr 16, 2024 · 7 comments
Labels: bug (Something isn't working)
Neiko2002 commented Apr 16, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version: 0.4.1
OS: Windows
JDK: 21

Component(s)

Java

Minimal reproduce step

import io.fury.Fury;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;

public static void main(String[] args) throws Exception {
	Fury fury = Fury.builder().requireClassRegistration(false).build();
	// Serialize a ~2GB object graph directly to an OutputStream backed by a temp file.
	try (OutputStream output = new BufferedOutputStream(Files.newOutputStream(Files.createTempFile(null, null)))) {
		fury.serialize(output, new BigObj());
	}
}

public static class BigObj {
	// Two byte[] fields of ~1GB each, ~2GB in total.
	public byte[] b1 = new byte[Integer.MAX_VALUE / 2];
	public byte[] b2 = new byte[Integer.MAX_VALUE / 2];
}

What did you expect to see?

I was hoping to get a file with 2147483646 bytes, all zero.

What did you see instead?

Exception in thread "main" java.lang.NegativeArraySizeException: -2147483510
	at io.fury.memory.MemoryBuffer.ensure(MemoryBuffer.java:1980)
	at io.fury.memory.MemoryBuffer.writePrimitiveArrayWithSizeEmbedded(MemoryBuffer.java:1946)
	at io.fury.serializer.ArraySerializers$ByteArraySerializer.write(ArraySerializers.java:290)

Anything Else?

I think that when an OutputStream is provided to the serialize method, the intermediate MemoryBuffer should behave like the buffer inside a BufferedOutputStream: when the buffer is full, it should flush its content to the underlying OutputStream in order to free up its bytes.
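
Something along these lines is what I mean (a rough sketch with made-up names like SpillingBuffer and writeBytes, not Fury's actual internals, and it ignores any need to seek back into bytes that were already flushed):

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only: a writer that spills its fixed-size buffer to the
// underlying OutputStream whenever it runs out of space, instead of growing in memory.
final class SpillingBuffer {
    private final byte[] buffer;
    private final OutputStream out;
    private int position;

    SpillingBuffer(OutputStream out, int capacity) {
        this.out = out;
        this.buffer = new byte[capacity];
    }

    void writeBytes(byte[] data, int offset, int length) throws IOException {
        while (length > 0) {
            int free = buffer.length - position;
            if (free == 0) {
                flush(); // spill to the stream instead of growing the buffer
                free = buffer.length;
            }
            int chunk = Math.min(free, length);
            System.arraycopy(data, offset, buffer, position, chunk);
            position += chunk;
            offset += chunk;
            length -= chunk;
        }
    }

    // Write out whatever is buffered; the caller would also flush once at the end.
    void flush() throws IOException {
        out.write(buffer, 0, position);
        position = 0;
    }
}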

Are you willing to submit a PR?

  • I'm willing to submit a PR!
Neiko2002 added the bug label Apr 16, 2024
@chaokunyang (Collaborator)

Fury needs to go back in the buffer to update some headers in certain situations. In such cases, flushing ahead is not possible. In the long run, we may be able to do streaming writes if we provide an option to disable such look-backs.

But could you share in which cases you need to serialize such a big object? It's rare in a production environment, and protobuf doesn't support it either.

@Neiko2002 (Author)


We have quite large files on disk and cannot use protobuf because of the 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite often.

@chaokunyang (Collaborator)


We have quite large files on disk and cannot use protobuf because of the 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite often.

This is interesting. If embeddings are stored, we may need a larger limit. Could we split a big object into several small objects for serialization? I mean, you can serialize like this:

Fury fury = xxx;
OutputStream stream = xxx;
fury.serialize(stream, o1);
fury.serialize(stream, o2);
fury.serialize(stream, o3);

Then for deserialization, you can:

Fury fury = xxx;
FuryInputStream stream = xxx;
Object o1 = fury.deserialize(stream);
Object o2 = fury.deserialize(stream);
Object o3 = fury.deserialize(stream);
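
Fleshed out a bit (a sketch only; it assumes the io.fury package layout from the stack trace above and that FuryInputStream wraps a regular InputStream, which may differ across Fury versions; class and file names are just for illustration):

import io.fury.Fury;
import io.fury.io.FuryInputStream;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitSerialization {
    public static void main(String[] args) throws Exception {
        Fury fury = Fury.builder().requireClassRegistration(false).build();
        Path file = Files.createTempFile("fury-split", ".bin");

        // Each serialize() call writes one self-contained object, so no single
        // buffer has to hold anywhere near 2GB.
        try (OutputStream out = Files.newOutputStream(file)) {
            fury.serialize(out, new byte[100_000_000]);
            fury.serialize(out, new byte[100_000_000]);
            fury.serialize(out, new byte[100_000_000]);
        }

        // Deserialize in the same order the objects were written.
        try (InputStream in = Files.newInputStream(file)) {
            FuryInputStream stream = new FuryInputStream(in);
            byte[] o1 = (byte[]) fury.deserialize(stream);
            byte[] o2 = (byte[]) fury.deserialize(stream);
            byte[] o3 = (byte[]) fury.deserialize(stream);
            System.out.println(o1.length + ", " + o2.length + ", " + o3.length);
        }
    }
}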

@chaokunyang (Collaborator)


If we can't split an object graph into multiple serializations, then we do need to support a larger size limit.

@Neiko2002 (Author)


In our case we have a file with meta information and embeddings of several million images. All the embeddings are stored in a column-based style for fast access and distance calculations. The embeddings have around 1000 dimensions, which means we can only store about 2 million images in one file; otherwise just the embeddings alone are too large for Fury.
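
For a rough sense of the numbers (the per-dimension element width is an assumption here, it is not stated above):

2,147,483,647 bytes ÷ 1,000 dimensions ≈ 2.1 million images if each dimension takes one byte
2,147,483,647 bytes ÷ (1,000 × 4 bytes) ≈ 537,000 images if each dimension is a 4-byte float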

@chaokunyang (Collaborator) commented Apr 17, 2024

Why not split this file into smaller files?

@Neiko2002 (Author) commented Apr 17, 2024

It is just a hassle. Right now the meta information of all images is stored in row-based style, followed by the embedding information of all images in column-based style. I see three options with the current implementation of Fury to handle large files:

  1. Store meta information and embeddings in separate files, and split the embedding file into smaller files to circumvent the 2GB limit
  2. Try to keep everything in one file, but create additional files if it breaks the 2GB barrier
  3. Store everything in row-based format (meta information and embedding per image) and split the files if needed

For options 2 and 3 we would need to keep track of how big the file already is in order to make reasonable splits (not splitting the meta data or embedding of an image across two files). Finding a good splitting point for option 1 is more straightforward, since the number of embeddings fitting into one file can be calculated in advance. In all cases, ideal memory allocation and ordering of the data would need more consideration.
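
For option 1, a rough sketch of chunked writing with the current API could look like the following (class, method, constant, and file names are made up for illustration, and it uses one float[] per image rather than the real column-based layout):

import io.fury.Fury;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ChunkedEmbeddingWriter {
    // Keep each serialized chunk well below the 2GB MemoryBuffer limit.
    private static final int IMAGES_PER_CHUNK = 500_000;

    public static void writeChunks(Fury fury, Path dir, List<float[]> embeddings) throws Exception {
        int chunkIndex = 0;
        for (int start = 0; start < embeddings.size(); start += IMAGES_PER_CHUNK) {
            int end = Math.min(start + IMAGES_PER_CHUNK, embeddings.size());
            Path chunkFile = dir.resolve("embeddings-" + chunkIndex++ + ".fury");
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(chunkFile))) {
                // Each chunk is an independent, self-contained serialization.
                fury.serialize(out, embeddings.subList(start, end).toArray(new float[0][]));
            }
        }
    }
}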
