This is the crawler component of Buzzbang, a project that enables applications to find and use Bioschemas markup and provides Google-like search over it for humans. Please see https://github.com/buzzbangorg/buzzbang-doc/wiki for more information.
These instructions are for Linux. Windows is not supported.
1. Create the intermediate crawl database
2. Queue URLs for Bioschemas JSON-LD extraction, either by adding them directly or by crawling sitemaps
./bsbang-crawl.py <path-to-crawl-db> <location>
The location can be:
- a sitemap
- a webpage (e.g. http://identifiers.org or file://test/examples/FAIRsharing.html)
- a path to a file of locations (e.g. conf/default-targets.txt, which will crawl all the locations listed in that file)
./bsbang-crawl.py data/crawl.db conf/default-targets.txt
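The crawler has to tell these three kinds of location apart before it can act on them. Below is a minimal sketch of how such dispatch could work; the helper names and classification rules are hypothetical, for illustration only, and not the actual bsbang-crawl.py internals:

```python
from urllib.parse import urlparse

def classify_location(location):
    """Guess what kind of crawl target a location string is.

    Hypothetical helper for illustration; the real crawler's logic may differ.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https", "file"):
        # Treat XML resources as sitemaps, anything else as an ordinary webpage
        if location.endswith((".xml", ".xml.gz")):
            return "sitemap"
        return "webpage"
    # No URL scheme: assume a local targets file, one location per line
    return "targets-file"

def expand_targets_file(path):
    """Yield the locations listed in a targets file, skipping blanks and comments."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line
```

With this sketch, `conf/default-targets.txt` classifies as a targets file and each line it contains is classified and crawled in turn.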
3. Extract Bioschemas JSON-LD from webpages and insert into the crawl database.
To dump the crawled data from the database:
./bsbang-dump.py <path-to-crawl-db> <path-to-save-jsonld>
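Extraction boils down to pulling `<script type="application/ld+json">` blocks out of each crawled page. Here is a self-contained sketch using only the standard library; the class name and sample markup are illustrative, not the extractor's actual code:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(json.loads(data))

# Illustrative page with a Bioschemas-style Dataset annotation
page = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(page)
print(parser.blocks[0]["@type"])  # Dataset
```

Each collected block is a plain Python dict, ready to be stored in the crawl database.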
4. Install Solr.
5. Create a Solr core named 'bsbang'
cd $SOLR/bin
./solr create -c bsbang
6. Run Solr setup
cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>
./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/bsbang/
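For reference, Solr schemas can also be configured programmatically via Solr's Schema API rather than through a setup script. The sketch below is a hedged illustration of that API; the field name used and the payload helper are assumptions, not necessarily what bsbang-setup-solr.py actually creates:

```python
import json
from urllib import request

def build_add_field_payload(name, field_type="text_general", multi=False):
    """Build a Schema API 'add-field' request body for a stored field."""
    return {"add-field": {"name": name, "type": field_type,
                          "stored": True, "multiValued": multi}}

def add_field(solr_core_url, name, **kwargs):
    """POST an add-field request to the core's managed schema endpoint."""
    req = request.Request(solr_core_url.rstrip("/") + "/schema",
                          data=json.dumps(build_add_field_payload(name, **kwargs)).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running Solr with the 'bsbang' core created in step 5):
# add_field("http://localhost:8983/solr/bsbang/", "description")
```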
7. Index the extracted Bioschemas JSON-LD in Solr
./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>
./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/bsbang/
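Under the hood, indexing amounts to POSTing the extracted JSON-LD documents to Solr's JSON update handler. A minimal standard-library sketch under that assumption follows; the function names and document fields are hypothetical, and bsbang-index.py's actual logic may differ:

```python
import json
from urllib import request

def build_update_url(solr_core_url):
    """Solr's JSON update handler lives at <core>/update; commit immediately."""
    return solr_core_url.rstrip("/") + "/update?commit=true"

def index_docs(solr_core_url, docs):
    """POST a list of flat field->value dicts to the Solr core."""
    body = json.dumps(docs).encode("utf-8")
    req = request.Request(build_update_url(solr_core_url), data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running Solr with the 'bsbang' core):
# index_docs("http://localhost:8983/solr/bsbang/",
#            [{"id": "dataset-1", "name": "Example dataset"}])
```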
See https://github.com/justinccdev/bsbang-frontend for a frontend project for the index.
Run the tests with:
$ python3 -m unittest discover
Future possibilities include:
- Switch to using a third-party crawler or components rather than this custom-built one. Please see https://github.com/justinccdev/bsbang-crawler/issues/5
- Make crawler periodically re-crawl.
- Understand much more structure (e.g. DataSet elements within DataCatalog).
- Parse other Bioschemas and schema.org types used by life sciences websites (e.g. Organization, Service, Product).
- Instead of using SQLite as the intermediate crawl store, use something more scalable (perhaps MongoDB, Cassandra, etc.). But also see the item above about replacing parts or most of the crawling infrastructure with a third-party project, which will already have solved some, if not all, of these scalability issues.
- Crawl and understand PhysicalEntity/BioChemEntity/ResearchEntity once these types mature further.
Any other suggestions are welcome as GitHub issues for discussion or as pull requests.
Contributions welcome! Please
- Make pull requests to the dev branch.
- Conform to the PEP 8 style guide.