Data Commons

rowlandm edited this page Feb 3, 2024 · 11 revisions

What is Data Commons?

A Data Commons is a software platform combined with a governance framework that together allow a community to manage, analyze, and share its data.

Motivations: Why do we need this?

  • Use aggregated data to speed up solutions to a particular problem, e.g. gathering enough data to apply machine learning (ML)
  • Standardised data format [2]
  • Communities need to share data with each other; however, data governance must be considered as well

With proper management, providing different levels of access to data is useful

  • Cloud computing: the technology allows us to manage large data with many supporting features [3]
  • Speed things up: each community doesn't need to curate data itself, so it can focus on its work and produce results faster [3]
  • Bring communities together: a data commons reduces each community's cost of maintaining its data and lowers the barriers to accessing that data [3]

Data Commons Compositions

  • ARDC version - Australian Research Data Commons

It is composed of four elements: People & Policy, Platform & Software, Data & Services, and Storage & Compute.

  • Alternate Version

In this version, it is composed of:

  • Data governance and data management framework for handling data
  • Framework for identifying datasets
  • Framework for analyzing and visualizing datasets
  • Framework for sharing results and collaborating on them

Examples

We use the four components of the Alternate version as a checklist to test whether existing systems comply with that composition.

Bioinformatics Data Commons | Data Governance and Data Management | Identifying Datasets | Analyzing and Visualizing Data | Share and Collaborate Results
UK Biobank
NCI Genomic Data Commons
Haemosphere
cBioPortal
Stemformatics

The table shows that most of these systems implement the framework structure; however, not all of them adequately address data governance and data management.

WEHI Goal for Data Commons

  1. Aim to make it easier to set up a streamlined data commons for a particular community
  2. Aim to create an on-demand data commons infrastructure
  3. Aim to make each data commons extensively configurable to suit its community's needs

User Stories for a Data Commons Framework

You can see the User Stories here.

Current Architectural Design of WEHI Data Commons Framework

  • Treat a Data Commons as an exclusive yacht club for datasets. The dress code is stricter than for a common institute-wide dataset registry.
  • There may be multiple Data Commons in an institute, or across institutes.
  • It should be easy to create the core parts of a new Data Commons using a standard framework.

Dataset Registry

The dataset registry is a place where metadata about the refined dataset, and possibly sample information, is stored. It should point to the location of the raw, processed and summarised data, as well as to the appropriate data portals.

It should allow the user to search for features of a dataset and also know if they can access a dataset straight away, or if they need to ask permission.

It should have an ecosystem of tools to allow data scientists to easily locate and access datasets.
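
As a rough illustration of the registry requirements above, the following Python sketch models one registry entry and a simple keyword search. It is not a WEHI implementation: the class names (DatasetRecord, DatasetRegistry), the field names, and the access rule are all hypothetical and only reflect the requirements listed in this section.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Metadata for one refined dataset (all field names are illustrative)."""
    dataset_id: str
    title: str
    keywords: set[str]
    raw_location: str         # pointer to the raw data
    processed_location: str   # pointer to the processed data
    summarised_location: str  # pointer to the summarised data
    portal_urls: list[str] = field(default_factory=list)
    open_access: bool = False  # False means permission must be requested
    approved_users: set[str] = field(default_factory=set)

    def access_status(self, user: str) -> str:
        """Tell a user whether access is immediate or needs a request."""
        if self.open_access or user in self.approved_users:
            return "available"
        return "request permission"


class DatasetRegistry:
    """In-memory registry; a real one would sit on a database or API."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[record.dataset_id] = record

    def search(self, keyword: str) -> list[DatasetRecord]:
        """Case-insensitive match against titles and keywords."""
        kw = keyword.lower()
        return [
            r for r in self._records.values()
            if kw in r.title.lower() or kw in {k.lower() for k in r.keywords}
        ]
```

Keeping the access check on the record lets a search result report "available" or "request permission" next to each hit, which matches the requirement that users know up front whether they can access a dataset.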

Data Portals

There is interest in having research data that is summarised and easy to access for non-computational researchers. Several data portals, such as cBioPortal, Aquila, and OMERO, provide this type of functionality.

Because the data is so heterogeneous, more than one data portal may be needed, and different Data Commons may need different Data Portals.

We also need to be able to support our own Data Portals (e.g. interactive data visualisations written in R/Shiny).
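
One way to express that different Data Commons may need different Data Portals is a per-commons configuration. The sketch below is a hypothetical illustration rather than an existing WEHI design: the class name DataCommonsConfig, the data-type labels, and the pairing of portals with data types are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataCommonsConfig:
    """Configuration for one Data Commons (hypothetical structure)."""
    name: str
    # Maps a data-type label to the portals that serve that kind of data.
    portals_by_data_type: dict[str, list[str]] = field(default_factory=dict)

    def add_portal(self, data_type: str, portal: str) -> None:
        """Attach a portal to a data type, ignoring duplicates."""
        portals = self.portals_by_data_type.setdefault(data_type, [])
        if portal not in portals:
            portals.append(portal)

    def portals_for(self, data_type: str) -> list[str]:
        """Return a copy of the portals registered for a data type."""
        return list(self.portals_by_data_type.get(data_type, []))


# Example: heterogeneous data in one commons needs more than one portal,
# including an in-house portal alongside third-party ones.
commons = DataCommonsConfig(name="example-commons")
commons.add_portal("genomics", "cBioPortal")
commons.add_portal("imaging", "OMERO")
commons.add_portal("genomics", "in-house Shiny app")
print(commons.portals_for("genomics"))
```

A second DataCommonsConfig instance with a different mapping would represent another commons with its own set of portals.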


Roadmap

  • Create a proof of concept (PoC) for a single thematic that uses small public datasets
  • Review other systems and work with real data
  • Test alternative portals and data registries in parallel
  • Migrate to a trial for one or two thematics
  • Mature the services and push them into production

Conclusion

This page provides a concise overview of what Data Commons entails, including how WEHI aims to shape its data commons infrastructure.

Reference
