Archive new term #3

Closed
wants to merge 4 commits into from

@ondenman ondenman commented Jan 3, 2017

What does this do?

Uses the scraped_page_archive gem to archive every page the scraper fetches.

Why is this needed?

The country had an election on 2016-09-04. I had hoped we could archive the previous term before it disappeared, but it looks like the site now lists the current term. As I had already begun adding the scraper, archiving the current term was only a trivial extra step, and at least it's now archived for the future.

Archiving it now gives us the chance to go back and re-scrape later even if it disappears.

Checklists:

Scraper Change checklist

  • 1. scraper is on Morph.io under the "everypolitician-scrapers" group?
  • 2. scraper's GitHub "Website" link points at morph.io page?
  • 3. scraper is set to auto-run?
  • 4. scraper is archiving?
  • 5. legislature has a scraper webhook set?

Adding Archiving:

  • 1. we are using at least version 0.5 of scraped_page_archive gem?
  • 2. scraper uses scraped_page_archive gem directly or via a suitable strategy?
  • 3. MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is configured?
  • 4. pages are being archived in new branch of correct scraper repo?
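For item 3 of the archiving checklist, a minimal pre-flight check might look like the sketch below. The helper name `archive_configured?` is hypothetical and is not part of the scraped_page_archive gem; it only tests whether the environment variable named in the checklist is set to a non-blank value.

```ruby
# Hypothetical helper, not part of scraped_page_archive: reports whether
# the environment variable from the checklist is set to a non-blank value.
ARCHIVE_ENV_VAR = 'MORPH_SCRAPER_CACHE_GITHUB_REPO_URL'

def archive_configured?(env = ENV)
  # nil.to_s is "", so an unset variable is treated the same as a blank one.
  !env[ARCHIVE_ENV_VAR].to_s.strip.empty?
end

# Warn early instead of finding out later that nothing was archived.
warn "#{ARCHIVE_ENV_VAR} is not set" unless archive_configured?
```

Passing the environment in as an argument keeps the check easy to exercise with a plain Hash instead of the real `ENV`.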

tmtmtmtm and others added 4 commits January 3, 2017 16:04
Members are now listed over four pages. This commit adds the
fourth page.
Updated scraper to scrape new layout.

@tmtmtmtm tmtmtmtm left a comment

This is really doing two different things:

  1. Adding archiving
  2. Scraping a new term

The description claims it's only doing the first, but the code also does the second. You should only add archiving to a scraper that's already working, or as the first step of an entirely new scraper. So I would suggest you either begin an entirely new scraper, starting by archiving the required pages and then extending it to scrape the data you want, or adjust the existing scraper to work for the new term (ideally rewriting it for Scraped) and add archiving once that's done.

But trying to do both in one step like this isn't really working.

@tmtmtmtm tmtmtmtm assigned ondenman and unassigned tmtmtmtm Jan 4, 2017
@ondenman ondenman closed this Mar 2, 2017
@tmtmtmtm tmtmtmtm deleted the archive-new-term branch September 6, 2017 10:49