
Question: How to recursively archive directories containing many PDF download links #1423

Closed
JKL213 opened this issue May 7, 2024 · 3 comments
Labels
status: done, type: support

Comments

JKL213 commented May 7, 2024

Title as is. I want to archive an Apache2 file server page (download.eversberg.eu) that has a ton of PDF files. Any convenient way to achieve this? I tried depth=1 and it does not really work. The PDF links still redirect to the original page.

pirate (Member) commented May 7, 2024

ArchiveBox doesn't go more than 1 hop deep when using --depth=1, so unless the links to the PDFs are directly on the page you're passing it, it's not going to recursively keep following links beyond the first hop.

You can do multiple passes if you need it to go deeper into a folder structure:

archivebox add --depth=1 'https://download.eversberg.eu'
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
...
# repeat for however many levels deep you want to go
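
If you'd rather not paste that command repeatedly, the same passes can be scripted with a small shell loop (a sketch, not from the original reply; the pass count of 3 is an assumption, so adjust it to however deep the folder tree goes):

# each pass feeds every URL ArchiveBox already knows for the domain
# back in with one more hop of depth
for pass in 1 2 3; do
    archivebox list --csv=url --filter-type=domain download.eversberg.eu \
        | archivebox add --depth=1
done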

I recommend SiteSucker or wget's recursive feature; they may be better suited to this use case.
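
For reference, a wget invocation for this kind of directory tree might look like the following (a sketch using standard wget flags, not part of the original reply; tune the depth and delay to the site):

# recurse up to 3 levels, never climb above the starting directory,
# keep only PDFs, and wait 1s between requests to be polite
wget --recursive --level=3 --no-parent --accept '*.pdf' --wait=1 \
    'https://download.eversberg.eu/'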

See this issue for more info: #191

pirate changed the title from "Question: How can I archive a file server page?" to "Question: How to recursively archive a page containing many PDF download links" on May 7, 2024
pirate added the status: needs followup and type: support labels on May 7, 2024
JKL213 (Author) commented May 7, 2024

I'm sorry, my request was terribly worded. I'm looking to archive the site "download.eversberg.eu", which is essentially just a file host running on Apache2. I managed to solve my issue by running individual archive tasks for every folder with a depth setting of 1. I'm essentially just looking for a convenient way to archive below depth 1.
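
For concreteness, each of those per-folder passes presumably looked something like this (the folder name here is hypothetical):

archivebox add --depth=1 'https://download.eversberg.eu/some-folder/'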

pirate (Member) commented May 7, 2024

No worries, I figured out what you meant after I visited the domain and edited my comment above 👍

pirate changed the title from "Question: How to recursively archive a page containing many PDF download links" to "Question: How to recursively archive directories containing many PDF download links" on May 7, 2024
pirate closed this as completed on May 7, 2024
pirate added the status: done label and removed the status: needs followup label on May 7, 2024