The dream here is to let other users maintain scrapers in a community repo, or on their own GitHubs, and let developers simply install them via npm.

```shell
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login
```

ConfigInit:
```yaml
scrape:
  module: '@community-scrapers/twitter-feed'
```

yields Config:
```yaml
input:
  - '@community-scrapers/twitter-feed:username'
define:
  '@community-scrapers/twitter-feed:feedpage': ...
  '@community-scrapers/twitter-feed:post': ...
  '@community-scrapers/twitter-feed:post-media': ...
scrape:
  module: '@community-scrapers/twitter-feed'
```

Local `define` defs can override those inside the module's `define`.
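For example, a local `define` entry under the same namespaced key could shadow the module's version. This is a hypothetical sketch; the key-matching behavior and the selector shown are made up for illustration:

```yaml
define:
  # local override: shadows the module's own 'post-media' definition
  '@community-scrapers/twitter-feed:post-media':
    parse:
      selector: 'img.media'
      attribute: 'src'
scrape:
  module: '@community-scrapers/twitter-feed'
```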
## How to wire this stuff up?
### inputs
Create an object in each ScrapeStep that came from a module. The object should map full (namespaced) input keys to the module's internal keys; the internal keys are the ones actually used in the handlebars templates. E.g.
```js
{
  '@community-scrapers/twitter-feed:username': 'username'
}
```
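At template-resolution time, that mapping could be applied by renaming keys before the values reach the handlebars templates. A minimal sketch; the `remapInputs` helper is hypothetical, not part of the library:

```js
// hypothetical helper: rename namespaced input keys to a module's internal keys
const remapInputs = (inputs, keyMap) =>
  Object.fromEntries(
    Object.entries(inputs).map(([key, value]) => [keyMap[key] ?? key, value])
  )

remapInputs(
  { '@community-scrapers/twitter-feed:username': 'some-user' },
  { '@community-scrapers/twitter-feed:username': 'username' }
)
// => { username: 'some-user' }, ready for `{{ username }}` in a template
```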
### scrape

Two options:
- Create a separate `flow.ts` instance for a module and hook that up to whatever is above/below it.
- Crawl through a module scraper, find all empty `scrapeEach` arrays, and reattach the rest of the structure there (sketched below).
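A minimal sketch of the second option, assuming scraper nodes shaped like `{ scraper, scrapeEach }` (the node shape and helper name are assumptions for illustration):

```js
// hypothetical: walk a module's scraper tree and graft the local structure
// onto every leaf, i.e. every node whose scrapeEach array is empty
const graftOntoLeaves = (node, localSteps) => {
  if (!node.scrapeEach || node.scrapeEach.length === 0) {
    return { ...node, scrapeEach: localSteps } // reattach local structure here
  }
  return {
    ...node,
    scrapeEach: node.scrapeEach.map((child) => graftOntoLeaves(child, localSteps))
  }
}
```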
### stateful values
There may be times when a local/module scraper produces a value that you want to keep for the rest of the run. Most often this will be an auth/access token.
```yaml
define:
  'user-likes-page':
    download:
      urlTemplate: 'https://twitter.com/likes'
      headerTemplates:
        'x-twitter-access-token': '{{ accessToken }}'
    parse:
      selector: '.post a'
      attribute: 'href'
scrape:
  module: '@community-scrapers/twitter-login'
  valueAsInput: 'accessToken'
  forEach:
    - scraper: 'user-likes-page'
```

This is essentially global state: whenever '@community-scrapers/twitter-login' gives us a value, we update the input value for 'accessToken' and replace the passed-down value with ''.
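At runtime, that could look something like the sketch below. The handler name and step shape are hypothetical; only the `valueAsInput` semantics described above are taken from this proposal:

```js
// hypothetical handling of `valueAsInput`: when a module step yields a value,
// stash it as a run-wide input and pass '' downstream instead
const runInputs = {}

const onModuleValue = (step, value) => {
  if (step.valueAsInput) {
    runInputs[step.valueAsInput] = value // e.g. runInputs.accessToken
    return '' // downstream scrapers receive '' as the passed-down value
  }
  return value
}
```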
### organizing dependencies
It is possible to have a separate directory where module scrapers live, using `worker_threads` to run them.
```shell
mkdir scrape-pages-runners
cd scrape-pages-runners
npm init
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login
```

Your main Node.js process can run something like:
```js
const { Worker } = require('worker_threads')

const worker = new Worker('./scrape-pages-runners/worker.js', { workerData: { config, options } })
worker.on('message', ([event, data]) => console.log(event, data)) // wire up scraper events here
worker.on('exit', () => console.log('complete.'))
```
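The worker script on the other side could look something like this sketch. The `scrape(config, options)` entry point and the event names are assumptions, not scrape-pages' confirmed API:

```js
// scrape-pages-runners/worker.js — a minimal sketch of the worker side.
// The `scrape(config, options)` call and the event names below are
// assumptions for illustration; adapt them to scrape-pages' actual exports.
const { parentPort, workerData } = require('worker_threads')
const { scrape } = require('scrape-pages')

const { config, options } = workerData
const scraper = scrape(config, options)

// forward scraper events to the main thread as [event, data] tuples,
// matching the worker.on('message', ...) handler in the main process
for (const event of ['queued', 'complete', 'done']) { // hypothetical event names
  scraper.on(event, (data) => parentPort.postMessage([event, data]))
}
```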