As originally posted by @ziadhany in this issue, we now have a way to collect reference URLs, but there are issues with doing it at scale, including rate limiting:
I think we should use web.archive.org to archive URLs via its "Save Page Now" endpoint, https://web.archive.org/save/<url>. See waybackpy's save_api.py for an example client: https://github.com/akamhy/waybackpy/blob/master/waybackpy/save_api.py.
Since our URLs are stored in AdvisoryReference, we could add a new field archive_url to that model.
One major constraint is the rate limit of 15 requests per minute. With 250,000 URLs to process, the initial run would take about 11.57 days:
250,000 / (15 * 60 * 24) ≈ 11.57 days
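To make the constraint concrete, here is a minimal sketch of a throttled archiving loop. It assumes the plain https://web.archive.org/save/<url> form of the Save Page Now endpoint and the 15 requests/minute limit mentioned above; the function and variable names are hypothetical, and the actual HTTP call is passed in as a parameter so the pacing logic can be shown separately from any HTTP library choice.

```python
import time

# Assumed Save Page Now endpoint prefix and rate limit (from the discussion above).
SPN_ENDPOINT = "https://web.archive.org/save/"
RATE_LIMIT_PER_MIN = 15

def save_url(url: str) -> str:
    """Build the Save Page Now request URL for one reference URL."""
    return SPN_ENDPOINT + url

def estimated_days(total_urls: int, per_minute: int = RATE_LIMIT_PER_MIN) -> float:
    """Days needed to archive total_urls at per_minute requests per minute."""
    return total_urls / (per_minute * 60 * 24)

def archive_all(urls, fetch):
    """Archive each URL via fetch(), sleeping between calls to stay under the limit."""
    delay = 60.0 / RATE_LIMIT_PER_MIN  # seconds between requests (4s at 15/min)
    for url in urls:
        fetch(save_url(url))
        time.sleep(delay)
```

For example, `archive_all(urls, requests.get)` would issue one save request every four seconds, and `estimated_days(250_000)` reproduces the ~11.57-day figure above.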
@ziadhany is reaching out to archive.org to find better ways to archive at scale.