Alpha project for crawling bioschemas JSON-LD

This is the crawler component for Buzzbang, a project to enable applications to find and use Bioschemas markup, and to provide Google-like search over it for humans. Please see https://github.com/buzzbangorg/buzzbang-doc/wiki for more information.

Usage

These instructions are for Linux. Windows is not supported.

1. Create the intermediate crawl database

./setup/bsbang-setup-sqlite.py <path-to-crawl-db>

Example:

./setup/bsbang-setup-sqlite.py data/crawl.db
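
If you want to check what the setup script created, the crawl database is a plain SQLite file and can be inspected from Python. A minimal sketch; the table names are whatever bsbang-setup-sqlite.py defines, so this lists them rather than assuming a schema:

```python
import sqlite3

def list_tables(db_path):
    """List all tables in an SQLite database, e.g. the crawl DB
    created by ./setup/bsbang-setup-sqlite.py."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# Example: list_tables("data/crawl.db")
```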

2. Queue URLs for Bioschemas JSON-LD extraction, either by adding them directly or by crawling sitemaps

./bsbang-crawl.py <path-to-crawl-db> <location>

The location can be:

  • a sitemap (e.g. http://beta.synbiomine.org/synbiomine/sitemap.xml)
  • a webpage (e.g. http://identifiers.org or file://test/examples/FAIRsharing.html)
  • a path to a file of targets (e.g. conf/default-targets.txt), in which case every location listed in that file is crawled

Example:

./bsbang-crawl.py data/crawl.db conf/default-targets.txt
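
The three location types above could, in principle, be told apart with a rule like the following. This is only an illustration; bsbang-crawl.py may use different logic internally:

```python
def classify_location(location):
    """Rough classification of a crawl target into the three kinds
    accepted by bsbang-crawl.py: sitemap, webpage or targets file."""
    if location.startswith(("http://", "https://", "file://")):
        # URLs ending in .xml are treated as sitemaps here;
        # anything else is assumed to be an ordinary webpage.
        return "sitemap" if location.endswith(".xml") else "webpage"
    # A bare filesystem path is taken to be a targets file whose
    # lines are themselves locations to crawl.
    return "targets-file"
```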

3. Extract Bioschemas JSON-LD from webpages and insert it into the crawl database.

./bsbang-extract.py <path-to-crawl-db>
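
At its core, this step pulls `<script type="application/ld+json">` blocks out of each fetched page, which is how Bioschemas markup is embedded in HTML. A simplified sketch of the idea (the real extractor may be more forgiving about malformed pages):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json">
    blocks, where Bioschemas markup is embedded in a page."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.documents.append(json.loads(data))

def extract_jsonld(html):
    """Return all JSON-LD objects embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.documents
```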

To dump the crawled JSON-LD from the database to a file:

./bsbang-dump.py <path-to-crawl-db> <path-to-save-jsonld>

4. Install Solr.

5. Create a Solr core named 'bsbang'

cd $SOLR/bin
./solr create -c bsbang

6. Run Solr setup

cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>

Example:

./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/bsbang/
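
Before running the setup and indexing scripts it can be useful to confirm the core is actually reachable. Stock Solr exposes a standard /admin/ping handler per core, so a quick check might look like this (adjust for secured or non-default deployments):

```python
import json
import urllib.request

def solr_core_ok(solr_core_url, timeout=10):
    """Return True if the Solr core answers its /admin/ping handler
    with status OK; False on connection errors or a bad status."""
    url = solr_core_url.rstrip("/") + "/admin/ping?wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status") == "OK"
    except OSError:
        return False

# Example: solr_core_ok("http://localhost:8983/solr/bsbang/")
```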

7. Index the extracted Bioschemas JSON-LD in Solr

./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>

Example:

./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/bsbang/
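
For reference, indexing into Solr is ultimately just POSTing JSON documents to the core's /update handler. A sketch of building such a request; the actual document fields bsbang-index.py sends depend on its Solr schema:

```python
import json
import urllib.request

def build_update_request(solr_core_url, docs):
    """Build a POST request that sends a batch of documents to
    Solr's JSON /update handler, committing immediately."""
    url = solr_core_url.rstrip("/") + "/update?commit=true"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def index_docs(solr_core_url, docs, timeout=30):
    """Send the documents to Solr and return the HTTP status code."""
    req = build_update_request(solr_core_url, docs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```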

Frontend

See https://github.com/justinccdev/bsbang-frontend for a frontend project to the index.

Tests

$ python3 -m unittest discover

TODO

Future possibilities include:

  • Switch to a third-party crawler or crawler components rather than this custom-built one; see https://github.com/justinccdev/bsbang-crawler/issues/5
  • Make the crawler re-crawl periodically.
  • Understand much more structure (e.g. DataSet elements within a DataCatalog).
  • Parse other Bioschemas and schema.org types used by life-sciences websites (e.g. Organization, Service, Product).
  • Instead of using SQLite as the intermediate crawl store, use something more scalable (perhaps MongoDB, Cassandra, etc.). But see also the item above about switching to a third-party crawler, which will already have solved some, if not all, of the scalability issues.
  • Crawl and understand PhysicalEntity/BioChemEntity/ResearchEntity once these types mature further.

Any other suggestions are welcome, either as GitHub issues for discussion or as pull requests.

Hacking

Contributions welcome! Please

  • Make pull requests to the dev branch.
  • Conform to the PEP 8 style guide.

Thanks!