Bio2RDF Dataset Metrics (v2 obsolete)

Michel Dumontier edited this page May 13, 2016 · 2 revisions

Bio2RDF dataset metrics

This page is now obsolete. Please look to Bio2RDF-Dataset-Summary-Statistics for the latest.

Every Bio2RDF dataset includes a set of metrics that quantitatively summarize its contents. Here we show how to use them to explore a Bio2RDF dataset. All of these metrics are available in a named graph in each dataset's SPARQL endpoint. The named graph IRI follows the pattern 'http://bio2rdf.org/bio2rdf-[dataset]-statistics', where "[dataset]" is the preferred short name for the Bio2RDF dataset as seen here.
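The IRI pattern above can be sketched as a small helper. This is only an illustration: the function name `statistics_graph_iri` is hypothetical, and the helper assumes the short name is passed exactly as it should appear in the IRI (for example "ctd").

```python
def statistics_graph_iri(dataset: str) -> str:
    """Build the statistics named-graph IRI for a Bio2RDF dataset short name."""
    name = dataset.strip()
    if not name:
        raise ValueError("dataset short name must not be empty")
    # Pattern from this page: http://bio2rdf.org/bio2rdf-[dataset]-statistics
    return f"http://bio2rdf.org/bio2rdf-{name}-statistics"


print(statistics_graph_iri("ctd"))
```

The resulting IRI is the value to place in the FROM clause of the queries on this page.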

These Bio2RDF dataset metrics are serialized using the 'dataset' namespace. All metrics for a dataset are attached to a single resource whose IRI is composed of 'http://bio2rdf.org/dataset_resource:' followed by an MD5 hash unique to each dataset, and which is typed as 'http://bio2rdf.org/dataset_vocabulary:Endpoint'. This resource is linked to each metric by a metric-specific predicate. For example, the predicate that links the Bio2RDF dataset IRI to its triple count is 'http://bio2rdf.org/dataset_vocabulary:has_triple_count'. For metrics that have multiple results, such as the frequency of resources of each type in a dataset, each result is given its own IRI generated with an MD5 hash. That result IRI is in turn linked to the IRI of the resource being counted (e.g. the IRI of a given type) and to a literal holding its frequency.

The following is a list of the metrics available for each dataset, as well as sample SPARQL queries that you can use to access them. These queries were all executed over the Comparative Toxicogenomics Database (CTD) dataset available here.

1. total number of triples

Retrieve the total number of triples in this endpoint:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint data_vocab:has_triple_count ?tc .
  }

View results by clicking here. There are 151485732 triples in the CTD dataset.

2. number of unique subjects

Retrieve the total number of unique subjects in this endpoint:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint data_vocab:has_unique_subject_count ?sc .
  }

View results by clicking here. There are 13627566 unique subject IRIs in the CTD dataset.

3. number of unique predicates

Retrieve the total number of unique predicates in this endpoint:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint data_vocab:has_unique_predicate_count ?pc .
  }

View results by clicking here. There are 27 unique predicate IRIs in the CTD dataset.

4. number of unique objects

Retrieve the total number of unique objects in this endpoint:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint data_vocab:has_unique_object_count ?oc .
  }

View results by clicking here. There are 14136295 unique object IRIs in the CTD dataset.
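Metrics 1 through 4 share the same query shape and differ only in the predicate. As a rough sketch, a helper could generate them from a lookup table. The names `SCALAR_METRICS` and `scalar_metric_query`, the short metric keys, and the `?count` variable are hypothetical conveniences; the predicates are the ones used in the queries above.

```python
# Predicates taken from the four queries above; the dictionary keys are
# hypothetical labels chosen for this sketch.
SCALAR_METRICS = {
    "triples": "has_triple_count",
    "subjects": "has_unique_subject_count",
    "predicates": "has_unique_predicate_count",
    "objects": "has_unique_object_count",
}


def scalar_metric_query(dataset: str, metric: str) -> str:
    """Return a SPARQL query for one single-valued metric of a dataset."""
    predicate = SCALAR_METRICS[metric]  # raises KeyError for unknown metrics
    return (
        "PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>\n"
        "SELECT *\n"
        f"FROM <http://bio2rdf.org/bio2rdf-{dataset}-statistics>\n"
        "WHERE {\n"
        " ?endpoint a data_vocab:Endpoint .\n"
        f" ?endpoint data_vocab:{predicate} ?count .\n"
        "}\n"
    )


print(scalar_metric_query("ctd", "triples"))
```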

5. number of unique types

Retrieve each type used in this endpoint, together with the number of entities of that type:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_type_count> ?atype .
   ?atype <http://bio2rdf.org/dataset_vocabulary:has_count> ?tc.
   ?atype <http://bio2rdf.org/dataset_vocabulary:has_type> ?at.
  }

View results by clicking here. For example, there are 11632475 entities of the type 'http://bio2rdf.org/ctd_vocabulary:Gene-Disease-Association'.
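Multi-valued metrics such as this one return one row per result. If the endpoint is asked for the standard SPARQL 1.1 JSON results format, the rows could be collapsed into a type-to-count mapping along these lines. The function name `type_counts` is hypothetical, and `SAMPLE` is a hand-written document shaped like that format; it contains only the single row quoted above and is not actual endpoint output.

```python
import json


def type_counts(results_json: str) -> dict[str, int]:
    """Map each type IRI (?at) to its entity count (?tc)."""
    doc = json.loads(results_json)
    counts = {}
    for binding in doc["results"]["bindings"]:
        counts[binding["at"]["value"]] = int(binding["tc"]["value"])
    return counts


# Hand-written sample in SPARQL JSON results layout (assumption, not real output).
SAMPLE = json.dumps({
    "head": {"vars": ["endpoint", "atype", "tc", "at"]},
    "results": {"bindings": [{
        "at": {"type": "uri",
               "value": "http://bio2rdf.org/ctd_vocabulary:Gene-Disease-Association"},
        "tc": {"type": "literal", "value": "11632475"},
    }]},
})

print(type_counts(SAMPLE))
```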

6. unique predicate-object links and their frequencies

Retrieve each predicate in this endpoint, together with the number of unique objects it links to:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_predicate_object_count> ?anObject.
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_count> ?aC.
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_predicate> ?aP.
  }

View results by clicking here. For example, the predicate 'http://bio2rdf.org/ctd_vocabulary:gene' is linked to 12050897 unique object IRIs.

7. unique predicate-literal links and their frequencies

Retrieve each predicate in this endpoint, together with the number of unique literals it links to:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_predicate_literal_count> ?anObject.
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_count> ?aC.
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_predicate> ?aP.
  }

View results by clicking here. For example, the predicate 'http://purl.org/dc/terms/description' is linked to 9814 unique literals.

8. unique subject-predicate-unique object links and their frequencies

Retrieve, for each predicate in this endpoint, the number of unique subjects and the number of unique objects it links:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_predicate_unique_subject_unique_object_count> ?anObject .
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_predicate> ?aP.
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_object_count> ?oC .
   ?anObject <http://bio2rdf.org/dataset_vocabulary:has_subject_count> ?sC .
  }

View results by clicking here. For example, the 'http://bio2rdf.org/ctd_vocabulary:disease' predicate links 12764011 unique subjects to 8039 unique objects.

9. unique subject-predicate-unique literal links and their frequencies

Retrieve, for each predicate in this endpoint, the number of unique subjects and the number of unique literals it links:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_predicate_unique_subject_unique_literal_count> ?anObj.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_predicate> ?p.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_subject_count> ?sc.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_literal_count> ?lc.
  }

View results by clicking here. For example, the predicate 'http://bio2rdf.org/ctd_vocabulary:action' links 418422 unique subjects to 2045 unique literals.

10. unique subject type-predicate-object type links and their frequencies

Retrieve the subject type-predicate-object type links in this endpoint and their frequencies:

  PREFIX data_vocab: <http://bio2rdf.org/dataset_vocabulary:>
  SELECT *
  FROM <http://bio2rdf.org/bio2rdf-ctd-statistics>
  WHERE {
   ?endpoint a data_vocab:Endpoint.
   ?endpoint <http://bio2rdf.org/dataset_vocabulary:has_type_relation_type_count> ?anObj.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_subject_type> ?subjectType.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_subject_count> ?subjectCount.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_predicate> ?aPred.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_object_count> ?objCount.
   ?anObj <http://bio2rdf.org/dataset_vocabulary:has_object_type> ?objType.
  }

View results by clicking here. This metric describes the frequencies of a given subject type and object type as linked by a specific predicate. For example, the predicate 'http://bio2rdf.org/ctd_vocabulary:chemical' links 418413 subjects of type 'http://bio2rdf.org/ctd_vocabulary:Chemical-Gene-Association' to 9447 objects of type 'http://bio2rdf.org/ctd_vocabulary:Chemical'.

Developing useful SPARQL queries using dataset metrics

The subject type-predicate-object type metric gives the information needed to understand how entities are related to one another in a dataset. It can also inform the construction of an immediately useful SPARQL query, without spending time on "exploratory" queries to become familiar with the dataset model. For example, the subject type-predicate-object type metrics for CTD show that records of type Chemical-Disease-Association are linked to Chemicals and Diseases via the 'http://bio2rdf.org/ctd_vocabulary:chemical' and 'http://bio2rdf.org/ctd_vocabulary:disease' predicates, respectively. These metrics allow one to construct the following query to retrieve the chemicals and diseases that are associated in the CTD dataset:

 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 SELECT ?cdi ?chemicalLabel ?diseaseLabel
 WHERE {
  ?cdi rdf:type <http://bio2rdf.org/ctd_vocabulary:Chemical-Disease-Association> .
  ?cdi <http://bio2rdf.org/ctd_vocabulary:chemical> ?chemical .
  ?chemical rdf:type <http://bio2rdf.org/ctd_vocabulary:Chemical> .
  ?cdi <http://bio2rdf.org/ctd_vocabulary:disease> ?disease .
  ?disease rdf:type <http://bio2rdf.org/ctd_vocabulary:Disease> .
  ?chemical rdfs:label ?chemicalLabel .
  ?disease rdfs:label ?diseaseLabel .
 }
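
To run a query like this from a script, one option is a plain HTTP GET as described by the SPARQL 1.1 Protocol, with the query text in the `query` parameter. The sketch below only builds the request and does not send it. The endpoint address is a placeholder, since this page links to the CTD endpoint without spelling out its URL, and the function name `build_sparql_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results media type.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Placeholder endpoint (assumption); substitute the real dataset endpoint.
request = build_sparql_request(
    "http://example.org/sparql",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
)
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would execute the query against a live endpoint.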