Skip to content

drowninginflowers/pdap_url_cache

Repository files navigation

This is a workflow and set of python scripts that extract a series of URLs and their metadata from a JSON file then caches them at the Internet Archive based on their update frequency.

The JSON URL data comes from the Police Data Accessability Project.

TO-DO:

  • alter the builder script to pull data directly from a database instead of buliding from JSON
    • alter workflow to automatically add URLs to the active cache list when new entries appear in the database
  • handle exceptions:
    • reattempt a cache
    • store broken URL info and remove from active caching list
    • update broken URL info in database itself
  • use the Internet Archive API instead of savepagenow
    • compare current versions of a page to previously cached pages to determine caching needs instead of using approximate timedeltas
    • schedule caching using Archive API task scheduler instead of using sequential cache requests

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages