Archive new term #3

ondenman · 2017-01-03T17:14:20Z

What does this do?

Uses scraped-page-archive to archive all pages scraped.

Why is this needed?

The country held an election recently (2016-09-04). I had hoped we could archive the previous term before it disappeared, but the site now appears to list the current term. As I had already started on the scraper, archiving the current term was only a small extra step; at least it is now archived for the future.

Archiving it now gives us the chance to go back and re-scrape later even if it disappears.

Checklists:

Scraper Change checklist

1. scraper is on Morph.io under the "everypolitician-scrapers" group?
2. scraper's GitHub "Website" link points at morph.io page?
3. scraper is set to auto-run?
4. scraper is archiving?
5. legislature has a scraper webhook set?

Adding Archiving:

1. we are using at least version 0.5 of scraped_page_archive gem?
2. scraper uses scraped_page_archive gem directly or via a suitable strategy?
3. MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is configured?
4. pages are being archived in new branch of correct scraper repo?
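
Two of the "Adding Archiving" items (the minimum gem version and the environment variable) can be checked mechanically. The following is a minimal sketch of such a check, not something taken from this pull request: the helper name `archiving_problems` and the constant `MINIMUM_ARCHIVE_VERSION` are hypothetical, and only the version number 0.5 and the variable name `MORPH_SCRAPER_CACHE_GITHUB_REPO_URL` come from the checklist above.

```ruby
require 'rubygems' # provides Gem::Version for version comparison

# Hypothetical constant: the minimum version named in the checklist.
MINIMUM_ARCHIVE_VERSION = Gem::Version.new('0.5')

# Hypothetical helper (not part of scraped_page_archive).
# Takes the installed gem version as a string and an ENV-like hash,
# and returns a list of human-readable problems (empty if none).
def archiving_problems(gem_version, env)
  problems = []

  # Checklist item 1: at least version 0.5 of the gem.
  if Gem::Version.new(gem_version) < MINIMUM_ARCHIVE_VERSION
    problems << "scraped_page_archive #{gem_version} is older than #{MINIMUM_ARCHIVE_VERSION}"
  end

  # Checklist item 3: the archive repo URL must be configured.
  repo_url = env['MORPH_SCRAPER_CACHE_GITHUB_REPO_URL'].to_s
  problems << 'MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is not set' if repo_url.strip.empty?

  problems
end
```

In a real scraper the version string could presumably be read from `Gem.loaded_specs` and the second argument could simply be `ENV`; the remaining checklist items still need a manual look at Morph.io and the scraper repo.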


tmtmtmtm

This is really doing two different things:

Adding archiving
Scraping a new term

The description claims it's only doing the first, but the code also does the second. You should only add archiving to a scraper that's already working, or as the first step of an entirely new scraper. So I would suggest you either (a) begin an entirely new scraper, starting by archiving the pages required and then extending it to scrape the data you want, or (b) adjust the existing scraper to work for the new term (ideally rewriting it for Scraped), and then add archiving once that is done.

But trying to do both in one step like this isn't really working.

tmtmtmtm and others added 4 commits January 3, 2017 16:04

Use URI.join to generate absolute URLs

b7bbc2a

Require scraped-page-archive

0329554

Add additional members list page

4409e53

Members are now listed over four pages. This commit adds the fourth page.

Update for new term

372fd20

Updated scraper to scrape new layout.
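
Two of the commits above concern building absolute URLs with `URI.join` and covering all four member list pages. As a rough illustration only, the sketch below shows how those two ideas can fit together. The base URL, the page file names, and the helper `absolute_urls` are all hypothetical; the actual site and its paging scheme are not shown in this conversation.

```ruby
require 'uri'

# Hypothetical values: placeholders for the real site and its list pages.
BASE_URL = 'http://www.example.org/members/'
LIST_PAGES = %w[list.html list2.html list3.html list4.html].freeze

# Resolve each (possibly relative) link against the base URL,
# so the scraper always works with absolute URLs.
def absolute_urls(base, links)
  links.map { |link| URI.join(base, link).to_s }
end

# One absolute URL per member list page.
member_list_urls = absolute_urls(BASE_URL, LIST_PAGES)
member_list_urls.each { |url| puts url }
```

`URI.join` also handles root-relative links such as `/photos/1.jpg`, which is the usual reason to prefer it over string concatenation.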

ondenman mentioned this pull request Jan 4, 2017

Add scraped page archive #2

Closed

10 tasks

ondenman requested a review from tmtmtmtm January 4, 2017 10:30

ondenman assigned tmtmtmtm Jan 4, 2017

tmtmtmtm force-pushed the master branch from b7bbc2a to a24fcdd January 4, 2017 11:04

tmtmtmtm suggested changes Jan 4, 2017

tmtmtmtm assigned ondenman and unassigned tmtmtmtm Jan 4, 2017

ondenman closed this Mar 2, 2017

tmtmtmtm deleted the archive-new-term branch September 6, 2017 10:49
