Justin Clark-Casey edited this page Aug 28, 2018 · 29 revisions

Buzzbang logo

About

Buzzbang is an umbrella organization for

  • Software that allows applications to find and use Bioschemas markup.
  • Frontends that search over the crawled markup.

As such, it consists of multiple projects, some of which are found in buzzbangorg repositories (such as the original Buzzbang crawler and a Google-like frontend), and others of which are located elsewhere, such as bioschemas-gocrawlit. We are working towards common components and formats, such as a common way for crawlers to store markup in MongoDB.

The rest of this document is mainly about the components currently in the buzzbangorg repository itself, though these will change over time.

Buzzbangorg itself currently contains 3 projects:

  1. The original crawler pipeline, with stages that collect Bioschemas data from websites and index it into a Solr instance.
  2. A new Python crawler that uses Scrapy and will store the crawl in a MongoDB instance as well as a Solr instance.
  3. A search frontend by which a human can search the index created by the crawler pipeline. Anybody can deploy this, and there is a live demo at http://buzzbang.science

These projects are at an alpha stage but have already been successfully used in some prototype Bioschemas-consuming applications.
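At its core, the crawling side of these projects comes down to finding `<script type="application/ld+json">` blocks in a webpage and parsing their contents. The following is a minimal sketch of that step using only the Python standard library; the class name and the page fragment are invented for illustration and do not reflect the actual bsbang-crawler code.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []  # parsed JSON-LD objects found in the page

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = []
            self._in_jsonld = False

# Invented page fragment for illustration.
page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "DataCatalog", "name": "Example catalog"}
</script>
</head><body>A life sciences resource.</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.items[0]["@type"])
```

A real crawler would of course add URL fetching, politeness delays and error handling on top of this extraction step, which is one reason for the planned move to an established framework such as Scrapy.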

Media

State of play

schema.org is a community project to develop a set of schemas that can be embedded in webpages in formats such as JSON-LD, RDFa and Microdata. Example schemas include Movie, Store and Product. Among other use cases, this embedded data can be crawled by search engines such as Google and Yandex and used to return structured results for queries (such as the information boxes you see on some Google search results).
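To make the idea concrete, here is a minimal, invented example of what schema.org JSON-LD markup looks like once lifted out of a page; any consumer can parse it with a standard JSON library and dispatch on the schema.org type:

```python
import json

# An invented schema.org JSON-LD snippet, as it might appear inside a
# <script type="application/ld+json"> tag on a webpage.
markup = """
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example protein expression dataset",
  "url": "https://example.org/datasets/1"
}
"""

data = json.loads(markup)
print(data["@type"])  # consumers dispatch on the schema.org type
```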

Bioschemas is a project by the life sciences community to specify how schemas from schema.org can be used to mark up life sciences information. As such, it has 2 aspects:

  1. When using existing schema.org schemas, such as DataCatalog and Dataset, Bioschemas will specify which properties are mandatory, which are optional, and the cardinality of each property. This is because schema.org itself specifies none of these things.
  2. In some cases, Bioschemas will propose new schemas, such as BioChemEntity to describe biological and chemical entities, where nothing suitable already exists in schema.org. Once these have gone through review by the Bioschemas community, they will also be suggested to the main schema.org community.
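A consumer of Bioschemas markup might check an item against such a profile. The sketch below uses a hypothetical, heavily cut-down profile: the property sets and the profile shape are invented for illustration, and the real Bioschemas profiles specify many more properties plus cardinalities.

```python
# A hypothetical, cut-down profile; the real Bioschemas Dataset profile
# specifies more properties and also their cardinalities.
DATASET_PROFILE = {
    "mandatory": {"name", "description", "url"},
    "optional": {"keywords", "creator"},
}

def missing_mandatory(item, profile):
    """Return the mandatory profile properties absent from a JSON-LD item."""
    return sorted(profile["mandatory"] - set(item))

item = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "url": "https://example.org/datasets/1",
}
print(missing_mandatory(item, DATASET_PROFILE))  # → ['description']
```

A validator like this could run as a crawl pipeline stage, flagging pages whose markup falls short of the relevant profile.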

Bioschemas is an extremely young project. As such, the specifications are subject to considerable change and some are not final (in particular BioChemEntity). In addition, very few life sciences information sources have yet implemented this markup. Nonetheless, bsbang-crawler is an alpha project to start crawling this data so that it can then be searched in the companion frontend project, and later on possibly joined together using embedded ontology terms in the form of a knowledge graph.

As an alpha project, bsbang-crawler is itself subject to considerable change. Until now, the crawler has been custom-written. However, this is a poor choice for future scalability and maintainability, so whilst there might be a bit more work done on the custom crawler code, we are actively looking at a transition to an established crawler package.

There is a companion bsbang-frontend project, which is a very simple Google-like search engine on top of the extracted data.

Architecture

Roadmap

  • Transition to making Buzzbang an umbrella project for multiple crawlers and other components that make use of gathered Bioschemas information. These will likely be organized around a core common data model (Ricardo is looking into this). This will mean renaming various components and changing the focus of this wiki page to cover more components than the existing single Buzzbang crawler and frontend.
  • Transition to an established crawler package - Notes on transition from the current custom crawling code to using an existing crawling package such as Scrapy. Ankit is working on this.
  • Store gathered Bioschemas metadata in a common data model, possibly MongoDB, so that many different applications (search engines, data synchronization tools) can use it with a common API. Ricardo is looking into this.
  • Make the crawled metadata available for direct download, or if possible publish to a project such as Common Crawl, so that applications and other data resources can consume it without needing to operate their own crawling framework.
  • Improve the indexing of metadata (Justin/Ankit are looking into this).
  • Improve the presentation and linking of results in the search frontend (Justin/Ankit are looking into this).

Collaboration is welcome on any of these items.
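The common data model mentioned in the roadmap is still being worked out, but the basic idea is to normalize each crawled JSON-LD item into one store-agnostic document shape that MongoDB (or any other store) could hold. The field names below are an assumption for illustration, not a settled Buzzbang format.

```python
from datetime import datetime, timezone

def to_crawl_document(source_url, jsonld_item):
    """Normalize one crawled JSON-LD item into a flat document.
    The field names here are illustrative, not a settled Buzzbang format."""
    return {
        "source_url": source_url,
        "schema_type": jsonld_item.get("@type"),
        # Keep the data properties; drop JSON-LD keywords like @context.
        "properties": {k: v for k, v in jsonld_item.items()
                       if not k.startswith("@")},
        "crawled_at": datetime.now(timezone.utc).isoformat(),
    }

doc = to_crawl_document(
    "https://example.org/datasets/1",
    {"@context": "https://schema.org", "@type": "Dataset", "name": "Example"},
)
print(doc["schema_type"])
```

With pymongo, a crawler could then store such a document with something like `collection.insert_one(doc)`, and search engines or synchronization tools could query the same collection through a common API.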

Limitations

  • Buzzbang at this stage only crawls JSON-LD markup. This needs to be extended to also consume the RDFa and Microdata markup formats used by schema.org.

  • It currently only crawls DataCatalog and PhysicalEntity (soon to be renamed BioChemEntity) JSON-LD found at pre-registered URLs, which are either webpages or crawlable sitemap.xml files.

  • And more! (just not listed yet :)
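When a pre-registered URL points at a sitemap.xml rather than a single webpage, the crawler first has to expand the sitemap into the page URLs it lists. A minimal standard-library sketch of that step, run on an invented sitemap (this handles only the `<urlset>` form, not sitemap index files):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text):
    """Extract page URLs from a sitemap.xml document (<urlset> form only;
    sitemap index files are not handled in this sketch)."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

# Invented sitemap for illustration.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/datasets/1</loc></url>
  <url><loc>https://example.org/datasets/2</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
```

Each extracted URL would then be fetched and scanned for JSON-LD markup in the usual way.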

Life sciences websites with Bioschemas markup

Some sites have initial markup, chiefly those listed in https://github.com/justinccdev/bsbang-crawler/blob/dev/conf/default-targets.txt. However, this is very limited in scope. The most promising set of sites for initial markup are sample databases. As shown on the Bioschemas front page, there is an event on 15-16 March 2018 to introduce the Bioschemas Sample schema to biobanks, for their feedback and to help them implement markup on their sites if they choose to do so. This may result in a large amount of useful real-world markup that Buzzbang can crawl.

Archive

Notes for GSOC students - this also contains some information about the state of Bioschemas and the project that might be useful here.

Related work

References
