Collect data from pages using the active browser session. Useful for difficult-to-scrape pages where Selenium or other traditional scraping methods do not work.
Scrapelet runs as a bookmarklet, which executes a snippet of JavaScript on the page you are viewing.
This JavaScript can read the contents of the page, find the element you're interested in, and send that HTML to an HTTP endpoint for further processing and storage.
This approach is useful for sites with advanced scraping detection & prevention mechanisms. Because the bookmarklet piggy-backs on the page the user has already loaded, it doesn't make any further requests. This makes it invisible to the target server, and puts no extra strain on it.
The current implementation only searches the current page, so it's only practical when a small number of pages need to be scraped.
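To make the mechanism concrete, here is a minimal sketch of what a bookmarklet of this kind might do; the selector and endpoint URL are placeholders, not Scrapelet's actual values:

```javascript
// Minimal bookmarklet sketch (placeholders only, not Scrapelet's real code):
// grab one element from the page the user is already viewing and POST its
// HTML to a collection endpoint.
javascript:(function () {
  var ENDPOINT = 'https://example.com/collect';        // placeholder endpoint
  var el = document.querySelector('.property-stats');  // placeholder selector
  if (!el) { alert('Target element not found'); return; }
  fetch(ENDPOINT, {
    method: 'POST',
    // text/plain keeps this a "simple" request, so no CORS preflight is sent
    headers: { 'Content-Type': 'text/plain' },
    body: JSON.stringify({ url: location.href, html: el.outerHTML })
  }).then(function () { alert('Sent!'); })
    .catch(function (err) { alert('Failed: ' + err); });
})();
```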
Real estate websites are protective of their information, as they may offer paid services to access it. Scraping is often against the terms of service, so an alternative method is needed.
Scrapelet grabs the heading with all the property stats and shoots it across to the spreadsheet script, which picks out the relevant information and adds a row to the table.
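On the spreadsheet side the handler can be very small. The following is only an illustrative sketch, not the gsheet.gs that ships with the repo; it assumes the bookmarklet posts a JSON body with `url` and `html` fields:

```javascript
// Illustrative Apps Script handler (not the shipped gsheet.gs).
// Parses the posted payload, pulls out one value and appends a row.
function doPost(e) {
  var data = JSON.parse(e.postData.contents);            // assumes { url, html }
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var price = (data.html.match(/\$[\d,]+/) || [''])[0];  // example: first "$1,234"-style figure
  sheet.appendRow([new Date(), data.url, price, data.html]);
  return ContentService.createTextOutput('OK');
}
```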
- Create a Google Spreadsheet to dump data into
- Add the example `gsheet.gs` code to your spreadsheet script (Tools -> Script Editor)
- Publish as a Web App
- Build the bookmarklet:
  - Update `config.js` to point to your Web App URL `https://script.google.com/.../exec` (a sketch of `config.js` follows this list)
  - Run `npm install`, then `npm run build`
  - Grab the output from `dist/scrapelet.js` and save it as a bookmark in your browser
- Browse to the page you want to scrape and click the bookmarklet
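The `config.js` referenced above might look something like this; the field name is an assumption, so check the file in the repo for the real structure:

```javascript
// Sketch of config.js; the "endpoint" field name is an assumption, check the
// repo's own config.js for the actual shape.
module.exports = {
  // URL produced by the "Publish as a Web App" step
  endpoint: 'https://script.google.com/.../exec'
};
```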
If all went well, you should have a new row in your table. Tweak the Sheets script to extract the data you need!
$ npm install
$ npm run build
Will Robertson (contact@willrobertson.id.au)
This project is licensed under the ISC license - see the LICENSE.md file for details.