GitHub - breck7/crawlers: Crawlers for extracting measurements from the web for Scroll datasets

measurementscrawlers.scroll.pub

Repository contents (46 commits):
.github/workflows
awis
bigquery
cancer.gov
cloc
compiler-explorer
dblp.org
github.com
isbndb.com
leachim6
monaco
news.ycombinator.com
pygments
pypl
reddit.com
riju.codes
semanticscholar.org
spacy
stackoverflow.com
website
whois
wikipedia.org
.gitignore
CNAME
PoliteCrawler.js
footer.scroll
header.scroll
package.json
readme.scroll

import header.scroll
title MeasurementsCrawlers

Crawlers for extracting measurements from the web for Scroll datasets.

* Crawlers generally follow four steps:

1. Match - match entity ids from the source to concept ids
2. Fetch - fetch content from the source site and save to disk cache
3. Parse - parse the content into JSON objects and save to disk cache
4. Update - map the content to the measureParser and save to the concept base
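
The four steps can be sketched roughly as follows. This is a minimal illustration, not code from this repository: every name in it (`fakeSource`, `conceptBase`, `cache`, `match`, `fetchContent`, `parse`, `update`, `runCrawler`, and the `pageViews` measure) is a hypothetical stand-in, the "source site" is an in-memory object rather than a real network call, and the cache lives in memory rather than on disk.

```javascript
// Hypothetical sketch of Match -> Fetch -> Parse -> Update.
// None of these names come from the repository; they are assumptions.

// Stand-in for a remote source: entity id -> raw content.
const fakeSource = {
  "Python_(programming_language)": '{"pageViews": 1200}',
  "Rust_(programming_language)": '{"pageViews": 800}'
}

// Stand-in for the concept base: concept id -> measurements.
const conceptBase = { python: {}, rust: {}, cobol: {} }

// In-memory stand-in for the disk cache used by Fetch and Parse.
const cache = { raw: {}, parsed: {} }

// 1. Match: map entity ids from the source to concept ids.
const match = (entityIds, concepts) => {
  const matches = {}
  for (const entityId of entityIds) {
    // Naive normalization: drop the "_(...)" suffix and lowercase.
    const conceptId = entityId.split("_(")[0].toLowerCase()
    if (conceptId in concepts) matches[conceptId] = entityId
  }
  return matches
}

// 2. Fetch: get raw content from the source, reusing the cache if present.
const fetchContent = entityId => {
  if (!(entityId in cache.raw)) cache.raw[entityId] = fakeSource[entityId]
  return cache.raw[entityId]
}

// 3. Parse: turn raw content into a JSON object and cache the result.
const parse = entityId => {
  if (!(entityId in cache.parsed)) cache.parsed[entityId] = JSON.parse(fetchContent(entityId))
  return cache.parsed[entityId]
}

// 4. Update: map parsed content onto a measure and save it to the concept.
const update = (conceptId, parsed) => {
  conceptBase[conceptId].pageViews = parsed.pageViews
}

const runCrawler = () => {
  const matches = match(Object.keys(fakeSource), conceptBase)
  for (const [conceptId, entityId] of Object.entries(matches)) {
    update(conceptId, parse(entityId))
  }
  return conceptBase
}

const result = runCrawler()
console.log(result)
```

In this sketch `python` and `rust` each receive a `pageViews` value, while `cobol` stays empty because no source entity matches it. Caching after both Fetch and Parse means a later run can skip whichever step already has saved output.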

permalink index.html
import footer.scroll