how to reduce memory usage with streams? is chunkSize being ignored? #413

Closed
UnsungHero97 opened this issue May 6, 2023 · 8 comments

@UnsungHero97

Hello! Thank you for this library!

I'm experimenting with zip.js to determine whether or not I can use it for a production service I maintain involving unlimited file transfers. The idea is to zip files client side and stream the zip from sender to receiver. The other piece of the experiment is to use Cloudflare Workers as the proxy to route the bytes from sender to receiver.

Good news, I have it working! However, I'm running into 2 issues that I hope to get help with.

The first issue I'm seeing is that memory usage grows to be as large as the files being transferred. For example, sending gigabytes leads to gigabytes of memory usage.

Screenshot 2023-05-05 at 7 36 12 PM

The screenshot shows zip.js using 4.8 GB of memory for a transfer of 4 files, each a little over 1 GB. The reason for streaming is to allow unlimited transfers: no limit on how many files can be transferred or how large each file can be.

I have not yet figured out why this is happening but I suspect it has something to do with the final zip file. According to the docs, calling zip.close() returns a promise of the contents of the zip. For my use case, the sender does not need the final zip and so zip.close() doesn't need to return anything.

How can I avoid this?

The second issue is with configuring chunkSize. I have not been able to get it to take effect. As far as I can tell, I just need to call configure() once globally, which I've done:

Screenshot 2023-05-05 at 7 41 20 PM
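For reference, the global call is just the following (assuming the npm package import; the values match what I describe below):

import { configure } from "@zip.js/zip.js";

configure({
  chunkSize: 1024 * 1024 * 10 // 10 MB
});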

In this example, I'm setting chunkSize to 10 MB, but none of the requests have a Content-Length equal to the chunk size I've configured. Across these transfer requests, Content-Length varies, and 2 MB is the largest as far as I can tell.

Screenshot 2023-05-05 at 7 38 08 PM

Screenshot 2023-05-05 at 7 38 35 PM

What am I doing wrong with chunkSize?

@UnsungHero97 UnsungHero97 changed the title how to reduce memory usage with streams? also, chunkSize seems to be ignored? how to reduce memory usage with streams? is chunkSize being ignored? May 6, 2023
@gildas-lormeau
Owner

gildas-lormeau commented May 7, 2023

I used this test to answer your questions. It generates 4GB of data to compress on the fly and does not store the compressed data. On paper, this test should not leak memory because no data is stored.

https://plnkr.co/edit/tUWLHBVp3xHyxSqh

import * as zip from "https://unpkg.com/@zip.js/zip.js/index.js";
let total = 0;

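// Writer that discards the zip output and only logs how many bytes were written
// ("out" is the log element on the test page)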
class LogWriter extends zip.Writer {
  writeUint8Array(array) {
    out.innerHTML = "write " + array.length + " bytes <br>" + out.innerHTML;
  }
}

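// Reader that generates the 4GB of input data on the fly instead of reading a file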
class LogReader extends zip.Reader {
  init() {
    this.size = 4 * 1024 * 1024 * 1024;
  }

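  // build at least `length` bytes by repeating a 1 KB block of random data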
  readUint8Array(offset, length) {
    const ONE_KB_DATA = Array.from(
      crypto.getRandomValues(new Uint8Array(1024))
    );
    const size = Math.ceil(length / 1024);
    const data = new Uint8Array(Array(size).fill(ONE_KB_DATA).flat());
    total += data.length;
    out.innerHTML =
      "read " +
      data.length +
      " bytes (" +
      Math.floor(total / (1024 * 1024)) +
      "/" +
      this.size / (1024 * 1024) +
      ")<br>" +
      out.innerHTML;
    return data;
  }
}

async function test() {
  zip.configure({ chunkSize: 16 * 1024 * 1024 });
  out.innerHTML += "START<br>";
  const zipWriter = new zip.ZipWriter(new LogWriter());
  // await the entry and close() so errors surface and the archive is finalized
  await zipWriter.add("test", new LogReader(), {});
  await zipWriter.close();
}

test();

It shows that chunkSize is used when reading data. It also seems to confirm the memory leak you described. The heap snapshot in the developer tools does not help me find where the data is allocated. I'm a bit surprised because the library relies entirely on streams, and the code above shows it works as expected. I need to investigate this further.

@UnsungHero97
Author

UnsungHero97 commented May 7, 2023

Here's what my setup looks like (minus a bunch of console.logs and error handling):

import { configure, ZipWriter } from "@zip.js/zip.js";

configure({
  chunkSize: 1024 * 1024 * 10
});

async function transfer(token: string, files: File[]) {

  const zipStream = new TransformStream<Uint8Array, Uint8Array>();
  const zip = new ZipWriter(zipStream.writable, { level: 0 });

  let resolveReader = (_value: unknown) => {};
  let rejectReader = (_reason?: unknown) => {};
  const readerPromise = new Promise((resolve, reject) => {
    resolveReader = resolve;
    rejectReader = reject;
  });

  let resolveWriter = (_value: unknown) => {};
  let rejectWriter = (_reason?: unknown) => {};
  const writerPromise = new Promise((resolve, reject) => {
    resolveWriter = resolve;
    rejectWriter = reject;
  });

  // reader side: drain the zip stream and upload each chunk sequentially,
  // which also applies backpressure to the writer side
  (async () => {
    const reader = zipStream.readable.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) {
        break;
      }
      console.log('reader.read()', done, value.byteLength);
      await fetch(`${url()}/transfer/${token}`, { body: value, method: 'PATCH' });
    }
    reader.releaseLock();
    resolveReader(null);
  })().catch(rejectReader);

  // writer side: add each file to the archive, then finalize it
  (async () => {
    for (let fileIndex = 0; fileIndex < files.length; fileIndex += 1) {
      const file = files[fileIndex];
      await zip.add(file.name, file.stream());
    }
    await zip.close();
    resolveWriter(null);
  })().catch(rejectWriter);

  await Promise.all([readerPromise, writerPromise]);
}

Here, await reader.read() returns chunks that are not the size I configured. Not only that, the chunk sizes are not consistent:

Screenshot 2023-05-07 at 10 24 19 AM

I understand that some of these chunks are for zip metadata, but the largest chunk coming through is 2 MB.

As for the memory leak, I have not gotten very far debugging it. I suspected web workers at first but the same thing happens with useWebWorkers: false.
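(For completeness, that option can be set globally like this:)

configure({ useWebWorkers: false });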

@UnsungHero97
Author

UnsungHero97 commented May 7, 2023

On second thought, I think the chunkSize behavior I'm seeing is probably just the browser's default stream implementation. I also just came across this:

https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamDefaultReader

Note however that zero-copy transfer from an underlying source is only supported for underlying byte sources that autoallocate buffers. In other words, the stream must have been constructed specifying both type="bytes" and autoAllocateChunkSize. For any other underlying source, the stream will always satisfy read requests with data from internal queues.

This seems like a possible explanation for the memory issue. I'll play with ReadableStreamBYOBReader to see if that resolves the memory issue:

https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamBYOBReader

The ReadableStreamBYOBReader interface of the Streams API defines a reader for a ReadableStream that supports zero-copy reading from an underlying byte source. It is used for efficient copying from underlying sources where the data is delivered as an "anonymous" sequence of bytes, such as files.
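A minimal sketch of a BYOB read, just to illustrate the API (placeholder data, not my transfer code); it only works when the underlying source is a byte stream:

// Standalone sketch: getReader({ mode: "byob" }) is only allowed when the
// source is constructed as a byte stream (type: "bytes").
async function byobSketch() {
  const byteStream = new ReadableStream({
    type: "bytes",
    start(controller) {
      controller.enqueue(crypto.getRandomValues(new Uint8Array(1024)));
      controller.close();
    }
  });
  const reader = byteStream.getReader({ mode: "byob" });
  // read() fills the buffer we pass in instead of allocating a new one
  const { value, done } = await reader.read(new Uint8Array(1024));
  console.log(done, value?.byteLength);
}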

@gildas-lormeau
Owner

gildas-lormeau commented May 8, 2023

For the record, chunkSize is ignored if you're using the default settings and web streams when zipping data. You have to set highWaterMark and size yourself in this case. I hope I'll find some time on my end to understand what's wrong. Personally, I doubt it comes from the Reader, or it would be a bug. There is no reason to keep all the data in memory in my example.
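For illustration, with web streams those values are supplied through the queuing strategies passed to the TransformStream (standard Web Streams API, not a zip.js option); note they control queueing and backpressure, not the exact chunk size handed to read():

// Sketch only: explicit queuing strategies on the TransformStream that feeds the
// ZipWriter. highWaterMark/size limit how much data may be queued; they do not
// guarantee fixed-size chunks on the readable side.
const strategy = new ByteLengthQueuingStrategy({ highWaterMark: 10 * 1024 * 1024 });
const zipStream = new TransformStream<Uint8Array, Uint8Array>({}, strategy, strategy);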

@gildas-lormeau
Owner

gildas-lormeau commented May 8, 2023

I've found the cause of the issue: it was related to the usage of ReadableStream#tee and the CRC32 value computed when compressing data. I refactored the code (see 0d85d73) to remove the usage of this method, which fixes the issue. It's fixed in version 2.7.7, which I just published. Now the test I posted is able to compress 4GB of data with constant memory usage.
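For the record, a rough sketch of the difference (illustration only, not the actual zip.js code): tee() has to buffer every chunk for the slower of the two branches, whereas a pass-through TransformStream can update a running checksum without retaining anything:

// Illustration: compute something per chunk (e.g. a CRC32 update) inline while
// the data flows through, instead of tee()-ing the stream into two branches.
function withInlineChecksum(
  source: ReadableStream<Uint8Array>,
  update: (chunk: Uint8Array) => void
): ReadableStream<Uint8Array> {
  return source.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        update(chunk);             // feed the checksum with this chunk
        controller.enqueue(chunk); // pass it along unchanged, nothing is buffered
      }
    })
  );
}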

@UnsungHero97
Author

Amazing! I'll give it a try and let you know if it works on my end.

@UnsungHero97
Author

I tested v2.7.7 a few times locally and memory usage was between 100 MB and 200 MB, with a few spikes to ~450 MB, which is much better than before!

Quick question ... with compression set to 0 via new ZipWriter(zipStream.writable, { level: 0 }), does that bypass the compression logic entirely, or does it still do some computation on the stream?

Thanks for the super quick fix! I really appreciate your effort to get this resolved!

@gildas-lormeau
Owner

I'm glad to hear you confirm it's fixed on your end as well. Even with level set to 0, zip.js will need to compute the CRC32 signature used to verify the integrity of the data. The compression logic is entirely bypassed though.
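For context, the per-byte work that remains at level 0 is the standard CRC-32 used by the zip format; a minimal sketch (not zip.js's internal code) looks like this:

// Table-driven CRC-32 (polynomial 0xEDB88320), the checksum stored in zip entries.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c;
});

// Call once per chunk, threading the returned value into the next call;
// finalize with (crc ^ 0xffffffff) >>> 0 after the last chunk.
function updateCrc32(chunk: Uint8Array, crc = 0xffffffff): number {
  for (let i = 0; i < chunk.length; i++) {
    crc = CRC_TABLE[(crc ^ chunk[i]) & 0xff] ^ (crc >>> 8);
  }
  return crc;
}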
