
Question: How to recursively archive directories containing many PDF download links #1423

Closed
JKL213 opened this issue May 7, 2024 · 3 comments
Labels
status: done, type: support

Comments

JKL213 commented May 7, 2024

Title as is. I want to archive an Apache2 file server page (download.eversberg.eu) that has a ton of PDF files. Any convenient way to achieve this? I tried depth=1 and it does not really work. The PDF links still redirect to the original page.

pirate (Member) commented May 7, 2024

ArchiveBox doesn't go more than 1 hop deep when using --depth=1, so unless the links to the PDFs are directly on the page you're passing it, it's not going to recursively keep following links beyond the first hop.

You can do multiple passes if you need it to go deeper into a folder structure:

archivebox add --depth=1 'https://download.eversberg.eu'
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
archivebox list --csv=url --filter-type=domain download.eversberg.eu | archivebox add --depth=1
...
# repeat for however many levels deep you want to go
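
If you'd rather not paste that command repeatedly, the same passes can be scripted with a small shell loop (a sketch, not from the original reply; the pass count of 3 is an assumption, so adjust it to however deep the folder tree goes):

# each pass feeds every URL ArchiveBox already knows for the domain
# back in with one more hop of depth
for pass in 1 2 3; do
    archivebox list --csv=url --filter-type=domain download.eversberg.eu \
        | archivebox add --depth=1
done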

I recommend SiteSucker or wget's recursive feature; they may be better suited to this use case.
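
For reference, a wget invocation for this kind of directory tree might look like the following (a sketch using standard wget flags, not part of the original reply; tune the depth and delay to the site):

# recurse up to 3 levels, never climb above the starting directory,
# keep only PDFs, and wait 1s between requests to be polite
wget --recursive --level=3 --no-parent --accept '*.pdf' --wait=1 \
    'https://download.eversberg.eu/'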

See this issue for more info: #191

pirate changed the title from "Question: How can I archive a file server page?" to "Question: How to recursively archive a page containing many PDF download links" on May 7, 2024
pirate added the status: needs followup and type: support labels on May 7, 2024
JKL213 (Author) commented May 7, 2024

I'm sorry, my request was terribly worded. I'm looking to archive the site "download.eversberg.eu", which is essentially just a file host running on Apache2. I managed to solve my issue by running individual archive tasks for every folder with a depth setting of 1. I'm essentially just looking for a convenient way to archive below depth 1.
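
For concreteness, each of those per-folder passes presumably looked something like this (the folder name here is hypothetical):

archivebox add --depth=1 'https://download.eversberg.eu/some-folder/'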

pirate (Member) commented May 7, 2024

No worries, I figured out what you meant after I visited the domain and edited my comment above 👍

pirate changed the title from "Question: How to recursively archive a page containing many PDF download links" to "Question: How to recursively archive directories containing many PDF download links" on May 7, 2024
pirate closed this as completed on May 7, 2024
pirate added the status: done label and removed the status: needs followup label on May 7, 2024