[Java] serialize to output stream is limited to 2GB #1528

Open · Neiko2002 opened this issue Apr 16, 2024 · 7 comments
Labels: bug (Something isn't working)
Neiko2002 commented Apr 16, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version: 0.4.1
OS: Windows
JDK: 21

Component(s)

Java

Minimal reproduce step

import io.fury.Fury;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;

public static void main(String[] args) throws Exception {
	Fury fury = Fury.builder().requireClassRegistration(false).build();
	// Serialize a ~2GB object graph directly to an OutputStream backed by a temp file.
	try (OutputStream output = new BufferedOutputStream(Files.newOutputStream(Files.createTempFile(null, null)))) {
		fury.serialize(output, new BigObj());
	}
}

public static class BigObj {
	// Two byte[] fields of ~1GB each, ~2GB in total.
	public byte[] b1 = new byte[Integer.MAX_VALUE / 2];
	public byte[] b2 = new byte[Integer.MAX_VALUE / 2];
}

What did you expect to see?

I was hoping to get a file with 2147483646 bytes, all zero.

What did you see instead?

Exception in thread "main" java.lang.NegativeArraySizeException: -2147483510
	at io.fury.memory.MemoryBuffer.ensure(MemoryBuffer.java:1980)
	at io.fury.memory.MemoryBuffer.writePrimitiveArrayWithSizeEmbedded(MemoryBuffer.java:1946)
	at io.fury.serializer.ArraySerializers$ByteArraySerializer.write(ArraySerializers.java:290)

Anything Else?

I think that when an OutputStream is provided to the serialize method, the intermediate MemoryBuffer should behave like the buffer inside a BufferedOutputStream: when the buffer is full, it should flush its content to the underlying OutputStream in order to free up its bytes.
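
Something along these lines is what I mean (a rough sketch with made-up names like SpillingBuffer and writeBytes, not Fury's actual internals, and it ignores any need to seek back into bytes that were already flushed):

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only: a writer that spills its fixed-size buffer to the
// underlying OutputStream whenever it runs out of space, instead of growing in memory.
final class SpillingBuffer {
    private final byte[] buffer;
    private final OutputStream out;
    private int position;

    SpillingBuffer(OutputStream out, int capacity) {
        this.out = out;
        this.buffer = new byte[capacity];
    }

    void writeBytes(byte[] data, int offset, int length) throws IOException {
        while (length > 0) {
            int free = buffer.length - position;
            if (free == 0) {
                flush(); // spill to the stream instead of growing the buffer
                free = buffer.length;
            }
            int chunk = Math.min(free, length);
            System.arraycopy(data, offset, buffer, position, chunk);
            position += chunk;
            offset += chunk;
            length -= chunk;
        }
    }

    // Write out whatever is buffered; the caller would also flush once at the end.
    void flush() throws IOException {
        out.write(buffer, 0, position);
        position = 0;
    }
}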

Are you willing to submit a PR?

  • I'm willing to submit a PR!
Neiko2002 added the bug label Apr 16, 2024
@chaokunyang (Collaborator)

Fury needs to go back in the buffer to update some headers in certain situations. In such cases, flushing ahead is not possible. In the long run, we may be able to do streaming writes if we provide an option to disable such look-backs.

But could you share in which cases you need to serialize such a big object? It's rare in a production environment, and protobuf doesn't support it either.

@Neiko2002 (Author)


We have quite large files on disk and cannot use protobuf because of the 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite often.

@chaokunyang (Collaborator)


We have quite large files on disk and cannot use protobuf because of the 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite often.

This is interesting. If embeddings are stored, we may need a larger limit. Could we split a big object into several small objects for serialization? I mean, you can serialize like this:

Fury fury = xxx;
OutputStream stream = xxx;
fury.serialize(stream, o1);
fury.serialize(stream, o2);
fury.serialize(stream, o3);

Then for deserialization, you can:

Fury fury = xxx;
FuryInputStream stream = xxx;
Object o1 = fury.deserialize(stream);
Object o2 = fury.deserialize(stream);
Object o3 = fury.deserialize(stream);
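
Fleshed out a bit (a sketch only; it assumes the io.fury package layout from the stack trace above and that FuryInputStream wraps a regular InputStream, which may differ across Fury versions; class and file names are just for illustration):

import io.fury.Fury;
import io.fury.io.FuryInputStream;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitSerialization {
    public static void main(String[] args) throws Exception {
        Fury fury = Fury.builder().requireClassRegistration(false).build();
        Path file = Files.createTempFile("fury-split", ".bin");

        // Each serialize() call writes one self-contained object, so no single
        // buffer has to hold anywhere near 2GB.
        try (OutputStream out = Files.newOutputStream(file)) {
            fury.serialize(out, new byte[100_000_000]);
            fury.serialize(out, new byte[100_000_000]);
            fury.serialize(out, new byte[100_000_000]);
        }

        // Deserialize in the same order the objects were written.
        try (InputStream in = Files.newInputStream(file)) {
            FuryInputStream stream = new FuryInputStream(in);
            byte[] o1 = (byte[]) fury.deserialize(stream);
            byte[] o2 = (byte[]) fury.deserialize(stream);
            byte[] o3 = (byte[]) fury.deserialize(stream);
            System.out.println(o1.length + ", " + o2.length + ", " + o3.length);
        }
    }
}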

@chaokunyang (Collaborator)


If we can't split an object graph into multiple serializations, then we do need to support a larger size limit.

@Neiko2002 (Author)


In our case we have a file with meta information and embeddings of several million images. All the embeddings are stored in a column-based style for fast access and distance calculations. The embeddings have around 1000 dimensions, which means we can only store about 2 million images in one file; otherwise just the embeddings alone are too large for Fury.
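
For a rough sense of the numbers (the per-dimension element width is an assumption here, it is not stated above):

2,147,483,647 bytes ÷ 1,000 dimensions ≈ 2.1 million images if each dimension takes one byte
2,147,483,647 bytes ÷ (1,000 × 4 bytes) ≈ 537,000 images if each dimension is a 4-byte float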

@chaokunyang (Collaborator) commented Apr 17, 2024

Why not split this file into smaller files?

@Neiko2002 (Author) commented Apr 17, 2024

It is just a hassle. Right now the meta information of all images is stored in row-based style, followed by the embedding information of all images in column-based style. I see three options with the current implementation of Fury to handle large files:

  1. Store meta information and embeddings in separate files, and split the embedding file into smaller files to circumvent the 2GB limit
  2. Try to keep everything in one file, but create additional files if it breaks the 2GB barrier
  3. Store everything in row-based format (meta information and embedding per image) and split the files if needed

For options 2 and 3 we would need to keep track of how big the file already is in order to make reasonable splits (not splitting the meta data or embedding of an image across two files). Finding a good splitting point for option 1 is more straightforward, since the number of embeddings fitting into one file can be calculated in advance. In all cases, ideal memory allocation and ordering of the data would need more consideration.
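
For option 1, a rough sketch of chunked writing with the current API could look like the following (class, method, constant, and file names are made up for illustration, and it uses one float[] per image rather than the real column-based layout):

import io.fury.Fury;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ChunkedEmbeddingWriter {
    // Keep each serialized chunk well below the 2GB MemoryBuffer limit.
    private static final int IMAGES_PER_CHUNK = 500_000;

    public static void writeChunks(Fury fury, Path dir, List<float[]> embeddings) throws Exception {
        int chunkIndex = 0;
        for (int start = 0; start < embeddings.size(); start += IMAGES_PER_CHUNK) {
            int end = Math.min(start + IMAGES_PER_CHUNK, embeddings.size());
            Path chunkFile = dir.resolve("embeddings-" + chunkIndex++ + ".fury");
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(chunkFile))) {
                // Each chunk is an independent, self-contained serialization.
                fury.serialize(out, embeddings.subList(start, end).toArray(new float[0][]));
            }
        }
    }
}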
