
feat: deduplicate shared URL downloads across test suites#338

Open
dabrain34 wants to merge 2 commits into fluendo:master from dabrain34:dab_duplication_download

Conversation

@dabrain34
Contributor

@dabrain34 dabrain34 commented Feb 26, 2026

Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.

  • Add DownloadManager class in utils.py with download-once caching and centralized archive cleanup
  • Refactor TestSuite.download() to use pre-downloaded archives from the manager across all three download paths
  • Use a thread pool to download concurrently and make DownloadManager thread-safe so duplicate URLs are still fetched only once.

This change considerably speeds up downloading the AV1-ARGON* test suites, which previously re-downloaded the 6 GB archive for every test vector.

Fixes #309

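The download-once guarantee described above can be sketched roughly as follows. This is an illustrative model, not the PR's actual code: the `DownloadManager.get` method and the injected `fetch` callable (standing in for something like `utils.download`) are assumptions.

```python
import threading
from typing import Callable, Dict


class DownloadManager:
    """Download-once cache: each URL is fetched at most once, even when
    many threads request it concurrently (sketch, not the PR's API)."""

    def __init__(self, fetch: Callable[[str, str], None]):
        self._fetch = fetch  # e.g. a wrapper around utils.download(url, dest)
        self._lock = threading.Lock()
        self._events: Dict[str, threading.Event] = {}
        self._paths: Dict[str, str] = {}

    def get(self, url: str, dest: str) -> str:
        """Return the local path for url, downloading it only if no other
        caller has already started (or finished) the same download."""
        with self._lock:
            event = self._events.get(url)
            if event is None:
                # First requester for this URL becomes the downloader.
                event = threading.Event()
                self._events[url] = event
                owner = True
            else:
                owner = False
        if owner:
            self._fetch(url, dest)
            self._paths[url] = dest
            event.set()  # wake up everyone waiting on this URL
        else:
            event.wait()  # later requesters block until the file exists
        return self._paths[url]
```

With this shape, concurrent `get()` calls for the same URL all return the same path while the underlying fetch runs exactly once; a real implementation would also need error propagation to waiters and the checksum/cleanup logic discussed in the review below.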
@dabrain34
Contributor Author

@ylatuya ping

@dabrain34
Contributor Author

dabrain34 commented Apr 21, 2026

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

Comment thread fluster/test_suite.py
)
# When archive_path is provided, the archive was already downloaded
# by the DownloadManager — skip directly to extraction.
if ctx.archive_path and os.path.exists(ctx.archive_path):
Contributor

Shouldn't all the download logic be in the DownloadManager? I would expect _download_single_test_vector and _download_single_archive to use the download manager instead of utils.download, and to let the download manager handle all the checks so that they are not duplicated.
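The delegation the reviewer is asking for could look roughly like this. The `TestSuite` stand-in below is heavily simplified and the method signatures are hypothetical; the point is only that the suite hands every fetch to the manager instead of calling utils.download itself.

```python
class TestSuite:
    """Simplified stand-in for fluster's TestSuite, showing delegation only."""

    def __init__(self, download_manager):
        # The manager owns all download state: caching, existence and
        # checksum checks, and cleanup of corrupt files.
        self._dm = download_manager

    def _download_single_archive(self, url: str, dest: str) -> str:
        # No direct utils.download call and no duplicated checks here:
        # the manager decides whether to download, reuse, or re-fetch.
        return self._dm.get(url, dest)

    def _download_single_test_vector(self, url: str, dest: str) -> str:
        # Same single entry point for per-vector sources.
        return self._dm.get(url, dest)
```

This keeps the "does the file exist / is the checksum valid" logic in one place, which is the deduplication the review comments below keep returning to.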

Comment thread fluster/test_suite.py
f"Checksum mismatch for source file {os.path.basename(first_tv.source)}: {checksum} "
f"instead of '{first_tv.source_checksum}'"
# Verify existing file: clean up corrupt, skip if valid
skip_download = False
Contributor


All of this logic should be handled by the DownloadManager

Comment thread fluster/test_suite.py
extract_all: bool = False,
keep_file: bool = False,
retries: int = 2,
download_manager: Optional["utils.DownloadManager"] = None,
Contributor


I would not make the DownloadManager optional; it's simpler to maintain.

Comment thread fluster/test_suite.py
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:
Contributor


This should be handled by the DownloadManager: the API should accept a list of URLs to download and let the DownloadManager handle all the parallel downloads.
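A list-based entry point along those lines might look like the sketch below. The `download_batch` name and the `(url, dest)` tuple shape are assumptions for illustration, not the PR's code; the manager is anything exposing a `get(url, dest)` method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def download_batch(manager, urls_and_dests: List[Tuple[str, str]], jobs: int = 4) -> List[str]:
    """Hand the whole URL list to the manager and run the downloads in
    parallel inside it, instead of each caller building its own pool."""
    max_workers = max(1, min(jobs, len(urls_and_dests)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Duplicate URLs in the list are still fetched only once,
        # because the manager's cache deduplicates them.
        return list(pool.map(lambda ud: manager.get(*ud), urls_and_dests))
```

Moving the pool inside the manager means callers like TestSuite.download() never touch concurrency details at all.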

Comment thread fluster/test_suite.py
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:
Contributor


We use Pool from multiprocessing
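If the project standard is multiprocessing rather than concurrent.futures, `multiprocessing.pool.ThreadPool` keeps the same `map`-style API while still sharing the in-process download cache (a process `Pool` would not, since each worker process has its own memory, and a shared DownloadManager dict would not deduplicate across processes). A rough sketch, where `download_all` and `download_one` are illustrative names:

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List


def download_all(urls: List[str], download_one: Callable[[str], str], jobs: int = 4) -> List[str]:
    """Run download_one over urls with a multiprocessing-style pool.

    ThreadPool (threads) is used instead of Pool (processes) so that the
    DownloadManager's cache is shared and downloads stay deduplicated.
    """
    with ThreadPool(processes=max(1, min(jobs, len(urls)))) as pool:
        # map preserves input order in its results.
        return pool.map(download_one, urls)
```

Since downloads are I/O-bound, threads also avoid the pickling constraints and fork cost of a process pool.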

Comment thread fluster/test_suite.py
@@ -328,7 +400,16 @@ def _callback_error(err: Any) -> None:

downloads = []
for tv in self.test_vectors.values():
Contributor


This should go away if we are using the DownloadManager.

@rsanchez87
Contributor

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

@dabrain34, tested with python3 fluster.py download AV1-ARGON-PROFILE0-CORE-ANNEX-B AV1-ARGON-PROFILE1-CORE-ANNEX-B AV1-ARGON-PROFILE2-CORE-ANNEX-B
master: 49m 40s
PR: 16m 2s (~3x faster, ZIP downloaded once instead of 3 times) ✅

Also regression tests ✔️

I’ll test again once the requested changes by @ylatuya are implemented. Thanks!

@dabrain34
Contributor Author

Thanks for the test. Indeed, this is even better on low-speed lines, as we no longer re-download the AV1 zip file every time.

I'm currently addressing the comments from @ylatuya. When this is ready, I'll come back to you.



Development

Successfully merging this pull request may close these issues.

Downloading the AV1 test suites results in downloading multiple times a 6GB archive

3 participants