feat: deduplicate shared URL downloads across test suites #338
dabrain34 wants to merge 2 commits into fluendo:master
Conversation
Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.

- Add a DownloadManager class in utils.py with download-once caching and centralized archive cleanup.
- Refactor TestSuite.download() to use pre-downloaded archives from the manager across all three download paths.
- Use a thread pool to download concurrently, and make DownloadManager thread-safe so duplicate URLs are still fetched only once.
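A minimal sketch of the download-once idea (simplified and hypothetical; the real class in utils.py also handles checksums and archive cleanup):

```python
import os
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor


class DownloadManager:
    """Download each URL at most once, even under concurrent requests."""

    def __init__(self, out_dir: str, jobs: int = 4):
        self.out_dir = out_dir
        self._lock = threading.Lock()
        self._futures = {}  # url -> Future resolving to the local path
        self._pool = ThreadPoolExecutor(max_workers=jobs)

    def _fetch(self, url: str) -> str:
        os.makedirs(self.out_dir, exist_ok=True)
        local_path = os.path.join(self.out_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, local_path)
        return local_path

    def _schedule(self, url: str):
        # Register each URL exactly once under the lock; later callers
        # get the same Future instead of triggering a second fetch.
        with self._lock:
            if url not in self._futures:
                self._futures[url] = self._pool.submit(self._fetch, url)
            return self._futures[url]

    def download(self, url: str) -> str:
        return self._schedule(url).result()

    def download_all(self, urls):
        # Schedule everything first so unique URLs download in parallel,
        # then wait for all results in order.
        futures = [self._schedule(url) for url in urls]
        return [future.result() for future in futures]
```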
@ylatuya ping

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.
```python
)
# When archive_path is provided, the archive was already downloaded
# by the DownloadManager — skip directly to extraction.
if ctx.archive_path and os.path.exists(ctx.archive_path):
```
Shouldn't all the download logic be in the DownloadManager? I would expect _download_single_test_vector and _download_single_archive to use the download manager instead of utils.download, and let the download manager handle all the checks so that those are not duplicated.
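A rough sketch of that delegation (the helper bodies and attribute names here are hypothetical):

```python
# Both helpers hand the actual transfer to the DownloadManager instead
# of calling utils.download directly, so caching and verification live
# in one place.
def _download_single_archive(ctx, download_manager):
    # ctx.source_url and ctx.archive_path are illustrative names.
    ctx.archive_path = download_manager.download(ctx.source_url)


def _download_single_test_vector(tv, download_manager):
    tv.local_path = download_manager.download(tv.source)
```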
| f"Checksum mismatch for source file {os.path.basename(first_tv.source)}: {checksum} " | ||
| f"instead of '{first_tv.source_checksum}'" | ||
| # Verify existing file: clean up corrupt, skip if valid | ||
| skip_download = False |
All of this logic should be handled by the DownloadManager.
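For illustration, the verify-or-clean-up step could live inside the manager roughly like this (SHA-256 is an assumption; the actual checksum algorithm may differ):

```python
import hashlib
import os


def _is_valid_cached_file(path: str, expected_checksum: str) -> bool:
    # Reuse a previously downloaded file only if it hashes correctly;
    # remove a corrupt leftover so the manager fetches it again.
    if not os.path.exists(path):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_checksum:
        return True
    os.remove(path)
    return False
```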
```python
extract_all: bool = False,
keep_file: bool = False,
retries: int = 2,
download_manager: Optional["utils.DownloadManager"] = None,
```
I would not make the DownloadManager optional; it's simpler to maintain.
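In other words, something along these lines (a sketch; other parameters elided):

```python
# The manager becomes a required keyword argument, so the body needs
# no "manager is None" fallback path.
def download(
    url: str,
    extract_all: bool = False,
    keep_file: bool = False,
    retries: int = 2,
    *,
    download_manager: "utils.DownloadManager",
) -> None:
    ...
```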
```python
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:
```
This should be handled by the DownloadManager: the API should support a list of URLs to download and let the DownloadManager handle all the parallel downloads.
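With a batch entry point like the download_all sketched earlier, the caller side would shrink to something like (illustrative):

```python
# The thread-pool plumbing moves into the manager; the caller just
# hands over the whole list (download_all is a hypothetical name).
paths = download_manager.download_all(unique_source_list)
```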
We use Pool from multiprocessing.
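For reference, the same fan-out with multiprocessing.Pool would look roughly like this (download_one is a hypothetical module-level worker, since Pool requires picklable callables; jobs and unique_source_list are the variables from the snippet above):

```python
import os
import urllib.request
from multiprocessing import Pool


def download_one(url):
    # Fetch one URL into the current directory and report the result.
    local_path = os.path.basename(url)
    urllib.request.urlretrieve(url, local_path)
    return (url, local_path)


max_workers = max(1, min(jobs, len(unique_source_list)))
with Pool(processes=max_workers) as dl_pool:
    results = dl_pool.map(download_one, unique_source_list)
```

Threads are usually enough for I/O-bound downloads, but processes sidestep the GIL if checksumming large archives dominates.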
```
@@ -328,7 +400,16 @@ def _callback_error(err: Any) -> None:

downloads = []
for tv in self.test_vectors.values():
```
This should go away if we are using the DownloadManager.
@dabrain34, tested ✔️ (regression tests as well ✔️). I'll test again once the requested changes by @ylatuya are implemented. Thanks!
|
Thanks for the test. Indeed, this is even better on slow connections, since we no longer re-download the AV1 zip file every time. I'm currently addressing the comments from @ylatuya; when this is ready I'll come back to you.
Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.
This considerably speeds up the download of the AV1-ARGON* test suites, which previously re-downloaded the 6 GB archive for every test vector.
Fix #309