# Support module scrapers #12

The dream here is to let other users maintain scrapers in a community repo, or in their own GitHub repos, and let developers simply install them via npm:

```bash
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login
```

ConfigInit:

```yaml
scrape:
  module: '@community-scrapers/twitter-feed'
```

yields Config:

```yaml
input:
  - '@community-scrapers/twitter-feed:username'
define:
  '@community-scrapers/twitter-feed:feedpage': ...
  '@community-scrapers/twitter-feed:post': ...
  '@community-scrapers/twitter-feed:post-media': ...
scrape:
  module: '@community-scrapers/twitter-feed'
```

Defs in the local `define` block can override those inside the module's `define` block.
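
A minimal sketch of how that expansion could work, assuming a hypothetical `loadModuleConfig` helper and simplified `Config` shapes (none of these names are the real implementation):

```ts
interface Config {
  input: string[]
  define: Record<string, object>
  scrape: object
}

// illustrative stub: a real implementation might require() the module's config
const loadModuleConfig = (name: string): Config => require(name).config

function expandModuleConfig(moduleName: string, localDefine: Record<string, object>): Config {
  const moduleConfig = loadModuleConfig(moduleName)

  // namespace the module's inputs and define keys with the module name
  const input = moduleConfig.input.map((key) => `${moduleName}:${key}`)
  const define: Record<string, object> = {}
  for (const [key, def] of Object.entries(moduleConfig.define)) {
    define[`${moduleName}:${key}`] = def
  }

  // local defs (keyed by the fully qualified names) win over the module's
  return { input, define: { ...define, ...localDefine }, scrape: moduleConfig.scrape }
}
```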

## How to wire this stuff up?

### inputs

Create an object in each ScrapeStep that came from a module. The object should map fully qualified input keys to the module's internal keys. The internal keys are the ones actually used in the handlebar templates, e.g.

```js
{
  '@community-scrapers/twitter-feed:username': 'username'
}
```
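
As a sketch (the function and variable names here are assumptions), that map would be applied to localize inputs before the module's templates render:

```ts
const inputKeyMap: Record<string, string> = {
  '@community-scrapers/twitter-feed:username': 'username'
}

// translate fully qualified input keys into the module's internal keys,
// so a template like '{{ username }}' sees the right value
function localizeInputs(globalInputs: Record<string, string>): Record<string, string> {
  const localized: Record<string, string> = {}
  for (const [fullKey, internalKey] of Object.entries(inputKeyMap)) {
    if (fullKey in globalInputs) localized[internalKey] = globalInputs[fullKey]
  }
  return localized
}

// localizeInputs({ '@community-scrapers/twitter-feed:username': 'jack' })
// => { username: 'jack' }
```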

scrape

Two options:

1. Create a separate flow.ts instance for the module and hook it up to whatever is above/below it.
2. Crawl through the module's scraper structure, find all empty scrapeEach arrays, and reattach the rest of the structure there (see the sketch below).
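
A rough sketch of option 2, over a simplified recursive node shape (the field names are assumptions, not the library's actual types):

```ts
interface ScrapeNode {
  scraper: string
  scrapeEach: ScrapeNode[]
}

// walk the module's structure and attach the surrounding config's
// continuation wherever a scrapeEach array bottoms out empty
function attachToLeaves(node: ScrapeNode, continuation: ScrapeNode[]): void {
  if (node.scrapeEach.length === 0) {
    node.scrapeEach = continuation
  } else {
    for (const child of node.scrapeEach) attachToLeaves(child, continuation)
  }
}
```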

### stateful values

There may be times when a local or module scraper produces a value that you want available for the rest of the run. Most often this will be an auth/access token.

```yaml
define:
  'user-likes-page':
    download:
      urlTemplate: 'https://twitter.com/likes'
      headerTemplates:
        'x-twitter-access-token': '{{ accessToken }}'
    parse:
      selector: '.post a'
      attribute: 'href'
scrape:
  module: '@community-scrapers/twitter-login'
  valueAsInput: 'accessToken'
  forEach:
    - scraper: 'user-likes-page'
```

This is essentially global state: whenever '@community-scrapers/twitter-login' gives us a value, we update the input value for 'accessToken' and replace the passed-down value with ''.
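
In pseudo-implementation terms (all names here are illustrative):

```ts
// global state shared across the run
const globalInputs: Record<string, string> = {}

// called whenever a scraper configured with `valueAsInput` emits a value
function handleEmittedValue(valueAsInput: string, value: string): string {
  globalInputs[valueAsInput] = value // e.g. globalInputs['accessToken'] = '<token>'
  return '' // downstream scrapers receive an empty passed-down value
}
```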

### organizing dependencies

Using worker_threads, it is possible to keep module scrapers in a separate directory with their own dependencies:

```bash
mkdir scrape-pages-runners
cd scrape-pages-runners
npm init
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login
```

Your main Node.js process can then run something like:

```js
const { Worker } = require('worker_threads')

const worker = new Worker('./scrape-pages-runners/worker.js', { workerData: { config, options } })
worker.on('message', ([event, data]) => console.log(event, data)) // wire up scraper events here
worker.on('exit', () => console.log('complete.'))
```
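
The worker side could look something like the sketch below. Only the `[event, data]` message shape comes from the snippet above; the scrape-pages entry point and event names are assumptions:

```js
// scrape-pages-runners/worker.js (sketch)
const { parentPort, workerData } = require('worker_threads')
const { scrape } = require('scrape-pages') // hypothetical entry point

const { config, options } = workerData
const emitter = scrape(config, options) // assumed to return an EventEmitter

// forward scraper events to the main thread as [event, data] tuples
for (const event of ['queued', 'progress', 'complete']) { // assumed event names
  emitter.on(event, (data) => parentPort.postMessage([event, data]))
}
```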
