A starter project for scraping similar data from multiple sources using Node, Cheerio, and Request, and saving the results in a MongoDB instance.
- Node & NPM
- A MongoDB server instance (specify its URL in config/)
- An empty GitHub repo for your version of the scraper
> git clone https://github.com/elnaz/scraper
> cd scraper
> git remote set-url origin git@github.com:YOUR_USERNAME/YOUR_SCRAPER_PROJECT.git
> git push origin master
> npm i
> npm start
Note: For legal reasons, the example source, /lib/sources/example.js, is fake, so the starter project won't work when you first clone it. To add your own sources, see below.
Let's say you need to scrape people from multiple different sources. For each source:
- Create a file with the source's name in the /lib/sources/ directory.
- In /lib/sources/source-name.js,
  - Define and export a URL constant of the source's web page.
  - Define and export a parsePeople function that takes in a Cheerio selector, $, uses it to select the data you want to scrape about each person on the page, and returns an array of parsed JSON people objects.
- Require the new source in the SOURCES array of /lib/index.js.
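
The steps above can be sketched as a single source file. This is a minimal illustration, not the project's actual example source: the file name, the URL, the CSS classes (.person, .name, .title), and the fields on each person object are all hypothetical placeholders that you would replace with values matching the page you are scraping.

```javascript
// Hypothetical /lib/sources/example-directory.js
// The URL and CSS selectors below are placeholders, not a real source.

// Web page this source is scraped from (placeholder address).
const URL = 'https://example.com/people';

// Takes a Cheerio selector `$` already loaded with the page's HTML and
// returns an array of plain person objects.
function parsePeople($) {
  return $('.person')
    .map((index, element) => {
      const person = $(element);
      return {
        name: person.find('.name').text().trim(),
        title: person.find('.title').text().trim(),
      };
    })
    .get(); // convert the Cheerio collection into a plain array
}

module.exports = { URL, parsePeople };
```

A file like this would then be required in the SOURCES array of /lib/index.js, for example as require('./sources/example-directory'), assuming the array holds the required modules directly.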