Skip to content

Find a scalable way to archive reference URLS #2303

@pombredanne

Description

@pombredanne

As originally posted by @ziadhany in this issue, we now have a way to collect reference URLs, but there are issue with doing it at scale, including rate limiting:

I think we should use web.archive.org to store archive URLs via their "Save Page Now" endpoint https://web.archive.org/save/<url>

https://github.com/akamhy/waybackpy/blob/master/waybackpy/save_api.py.
Since our URLs are stored in AdvisoryReference, we could add a new field archive_url to that model.

One major constraint is the rate limit of 15 requests per minute. With 250,000 URLs to process, the initial run would take about 11.5 days:

250,000 / (15 * 60 * 24) = 11.57 days

@ziadhany is reaching out to archive.org to find better and improved ways to archive at scale.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions