Find a scalable way to archive reference URLS

As originally posted by @ziadhany in this issue, we now have a way to collect reference URLs, but there are issue with doing it at scale, including rate limiting:
- https://github.com/aboutcode-org/vulnerablecode/issues/1724#issuecomment-4110565625


> I think we should use web.archive.org to store archive URLs via their "Save Page Now" endpoint `https://web.archive.org/save/<url>` 
> 
> https://github.com/akamhy/waybackpy/blob/master/waybackpy/save_api.py.
> Since our URLs are stored in AdvisoryReference, we could add a new field `archive_url` to that model.
> 
> One major constraint is the rate limit of 15 requests per minute. With 250,000 URLs to process, the initial run would take about 11.5 days:
> 
> 250,000 / (15 * 60 * 24) = 11.57 days 


@ziadhany is reaching out to archive.org to find better and improved ways to archive at scale.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Find a scalable way to archive reference URLS #2303

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Find a scalable way to archive reference URLS #2303

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions