Data Commons
A Data Commons is a software platform combined with a governance framework that together allow a community to manage, analyze and share its data.
- Use aggregated data to speed up solutions for a particular problem, for example by gathering enough data to apply machine learning (ML)
- Standardised Data Format 2
- The need to share data among communities; however, data governance must be considered as well
With proper management, offering different levels of access to data is useful
- Cloud Computing: the technology allows us to manage large volumes of data with many supporting features 3
- Speed things up: communities do not each need to curate the data themselves, which lets them focus and produce results faster 3
- Bring Communities together: Data commons reduce the cost for each community of maintaining its data, and they lower the barriers to accessing the data 3
- ARDC version - Australian Research Data Commons
It is composed of 4 elements: People & Policy, Platform & Software, Data & Services, and Storage & Compute
![image](https://private-user-images.githubusercontent.com/44901743/252577904-329689cc-5d46-404e-a87d-99b66a065f7b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM0MDE0MjMsIm5iZiI6MTcyMzQwMTEyMywicGF0aCI6Ii80NDkwMTc0My8yNTI1Nzc5MDQtMzI5Njg5Y2MtNWQ0Ni00MDRlLWE4N2QtOTliNjZhMDY1ZjdiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODExVDE4MzIwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZhZWM2NjliN2IyNzA4YjM0NWVjOTY4MDMwZjNjNDY0N2JmZGFlZDdkZTcyZTBkODZjNGU4YmY2YTJmYzgxOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.J7ITAU-sXbY--S8ZkiuoPhH-_RJHy4HOo5DCOEV2e6w)
- Alternate Version
In this version, it is composed of:
- Data Governance and Data Management Framework to handle data
- Framework to handle identifying datasets
- Framework to handle analyzing and visualizing datasets
- Framework to share and collaborate results
![Screenshot 2566-07-11 at 13 57 47](https://private-user-images.githubusercontent.com/44901743/252578311-d69f2d44-872f-4764-ad23-48d035fd2b07.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM0MDE0MjMsIm5iZiI6MTcyMzQwMTEyMywicGF0aCI6Ii80NDkwMTc0My8yNTI1NzgzMTEtZDY5ZjJkNDQtODcyZi00NzY0LWFkMjMtNDhkMDM1ZmQyYjA3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODExVDE4MzIwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTE2YmRkNTVkZTlhNTc0MjJiMGQ1ZTU3NDlkMTVjY2Q4YTZjOTY1OTFmYWQwOWYwNzE0NDNmOTY1MDA1YjJjNTImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.UhAzRuog699_aiHNioo_pQGSziNK05DKY6Nate036CA)
We use the alternate version as a checklist to test whether existing systems comply with its composition.
Bioinformatics Data Commons | Data Governance and Data Management | Identifying Datasets | Analyzing and Visualizing data | Share and Collaborate Results |
---|---|---|---|---|
UK Biobank | ✅ | ✅ | ✅ | ✅ |
NCI Genomic Data Commons | ✅ | ✅ | ✅ | ✅ |
Haemosphere | ✅ | ✅ | ✅ | |
cBioPortal | ✅ | ✅ | ✅ | |
Stemformatics | ✅ | ✅ | ✅ | |
The table shows that most of these systems cover the framework's components; however, not all of them adequately address data governance and management.
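The checklist in the table can also be kept as data so that gaps are easy to query. The following is only a minimal sketch: the variable and function names are hypothetical, and it assumes that an empty cell in the table means the component is not covered.

```python
# Components of the alternate data commons model (column names from the table).
COMPONENTS = (
    "Data Governance and Data Management",
    "Identifying Datasets",
    "Analyzing and Visualizing Data",
    "Share and Collaborate Results",
)

# One tuple per system, in column order. True where the table shows a check mark.
# Assumption: an empty cell is treated as "not covered".
COVERAGE = {
    "UK Biobank": (True, True, True, True),
    "NCI Genomic Data Commons": (True, True, True, True),
    "Haemosphere": (True, True, True, False),
    "cBioPortal": (True, True, True, False),
    "Stemformatics": (True, True, True, False),
}


def missing_components(system: str) -> list[str]:
    """Return the components a system does not cover, in table order."""
    return [
        name
        for name, covered in zip(COMPONENTS, COVERAGE[system])
        if not covered
    ]


def fully_compliant() -> list[str]:
    """Return the systems that cover all four components."""
    return [system for system in COVERAGE if not missing_components(system)]


if __name__ == "__main__":
    for system in COVERAGE:
        gaps = missing_components(system) or ["none"]
        print(f"{system}: missing {', '.join(gaps)}")
```

Keeping the checklist in this form makes it straightforward to add another system or component later without redrawing the table by hand.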
- Aim to make it easier to set up a data commons for a particular community
- Aim to create an on-demand data commons infrastructure
- Aim to make each data commons extensively configurable to suit its community's needs
You can see the User Stories here.
- Treat a Data Commons as an exclusive yacht club for datasets. The dress code is higher than for a common institute-wide dataset registry.
- There may be multiple Data Commons in an institute, or across institutes.
- It should be easy to create the core parts of a new Data Commons using a standard framework
The data registry is where metadata about each refined dataset, and possibly sample information, is stored. It should point to the location of the raw, processed and summarised data, as well as to the appropriate data portals.
It should allow the user to search for features of a dataset and also know if they can access a dataset straight away, or if they need to ask permission.
It should have an ecosystem of tools to allow data scientists to easily locate and access datasets.
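To make the registry requirements above more concrete, here is a minimal sketch of what a registry entry and a feature search could look like. Everything in it is an assumption made for illustration: the `RegistryEntry` fields, the `search` helper, the dataset names and the storage paths are hypothetical and do not describe an existing WEHI system.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Metadata for one refined dataset (all field names are hypothetical)."""

    name: str
    features: dict[str, str]   # searchable metadata, e.g. {"assay": "RNA-seq"}
    locations: dict[str, str]  # where the "raw", "processed", "summarised" data live
    portals: list[str] = field(default_factory=list)  # portals that show this dataset
    open_access: bool = False  # True: usable straight away; False: ask permission


def search(registry: list[RegistryEntry], **wanted: str) -> list[tuple[RegistryEntry, str]]:
    """Return (entry, access note) pairs whose features match every requested value."""
    hits = []
    for entry in registry:
        if all(entry.features.get(key) == value for key, value in wanted.items()):
            note = "available now" if entry.open_access else "permission required"
            hits.append((entry, note))
    return hits


# Placeholder entries; names and paths are invented for this example.
registry = [
    RegistryEntry(
        name="example-public-rnaseq",
        features={"assay": "RNA-seq", "organism": "human"},
        locations={
            "raw": "/placeholder/raw/example-public-rnaseq",
            "processed": "/placeholder/processed/example-public-rnaseq",
            "summarised": "/placeholder/summarised/example-public-rnaseq",
        },
        portals=["cBioPortal"],
        open_access=True,
    ),
    RegistryEntry(
        name="example-restricted-rnaseq",
        features={"assay": "RNA-seq", "organism": "mouse"},
        locations={"raw": "/placeholder/raw/example-restricted-rnaseq"},
    ),
]

if __name__ == "__main__":
    for entry, note in search(registry, assay="RNA-seq"):
        print(f"{entry.name}: {note}")
```

The access note is returned together with each hit so that a user learns immediately whether a dataset can be used straight away or whether permission has to be requested first.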
There is interest in having research data that is summarised and easy to access for non-computational researchers. There are a few data portals such as cBioPortal, Aquila, Omero, and others that provide this type of functionality.
Because the data is so heterogeneous, more than one data portal may be needed, and different Data Commons may need different data portals.
We also need to be able to support our own data portals (e.g. interactive data visualisations written in Shiny/R).
- Create a proof of concept (PoC) for a single thematic that uses small public datasets
- Review other systems and work with real data
- Test alternative portals and data registries in parallel
- Migration to trial for one or two thematics
- Maturation of services and push into production
This page provides a concise overview of what Data Commons entails, including how WEHI aims to shape its data commons infrastructure.