Algovera: Decentralised Storage for Data Science Communities #517

Closed
richardblythman opened this issue Apr 3, 2022 · 4 comments

richardblythman commented Apr 3, 2022

Open Grant Proposal: Algovera: Decentralised Storage for Data Science Communities

Name of Project: Algovera

Proposal Category: app-dev, devtools-libraries

Proposer: richardblythman

(Optional) Technical Sponsor:

Do you agree to open source all work you do on behalf of this RFP and dual-license under MIT and APACHE2 licenses?: Yes

Project Description

The field of deep learning requires large amounts of storage and compute. Machine learning datasets often reach into the hundreds of gigabytes, and pre-trained model weights can be large too. As a result, data scientists are regular users of centralised cloud services such as AWS, GCP and Azure. Popular dataset and model hubs like HuggingFace and ActiveLoop Hub also rely exclusively on centralised storage platforms. However, these services are expensive and can be difficult to use. At the same time, datasets and model weights for deep learning R&D are scattered across many websites and cloud platforms. The workflow for using a deep learning model usually requires downloading a dataset and model weights locally and following a readme for setup and processing, which is a time-consuming and tedious user experience. Finally, data scientists do not own, and cannot monetise, what they build on top of HuggingFace and ActiveLoop Hub.

Decentralised storage solutions have the potential to vastly reduce the costs data scientists incur for storing raw and processed versions of datasets, as well as model weights. They also offer an opportunity to establish a common interface and storage standard for public and private deep learning datasets, easing data scientists' workflows. Web3 further enables ownership and monetisation of the data, models and apps that data scientists develop, and mitigates some of the risks associated with centralised AI apps that store user information. We see decentralised storage as a key component of the Web3 machine learning stack that we are developing.

Currently, there are few tools for interacting with decentralised storage in the language (Python) and frameworks (Jupyter notebooks) that data scientists are used to. This is especially true for writing to storage. The main existing solution is ipfsspec, a read-only IPFS implementation of fsspec (a unified Pythonic interface to local, remote and embedded file systems and byte storage). This project therefore aims to make it easier for data scientists to use decentralised storage and to access public datasets and model weights through a unified Python interface. We will also create a proof of concept of a decentralised Web3 hub for machine learning datasets and models, one that combines the best elements of decentralised storage, Web2 hubs and Web3 marketplaces.
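To make the current gap concrete, the read-only workflow that exists today looks roughly like the sketch below, following the ipfsspec documentation. It assumes ipfsspec is installed and a public IPFS gateway is reachable; the CID is a placeholder.

```python
# Today's read-only path: ipfsspec registers the "ipfs" protocol with fsspec,
# so public content can be fetched by CID, but nothing can be written back.
import fsspec

cid = "QmExampleCID"  # placeholder: CID of a public dataset file

with fsspec.open(f"ipfs://{cid}", "rb") as f:
    data = f.read()

print(f"read {len(data)} bytes from IPFS")
```

The library proposed here would add the missing write path alongside this read path.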

Value

What are the benefits to getting this right?
Web3 often talks about the value of data, and of data as the new oil. This is awesome. But just as oil needs to be refined, data needs to be processed and turned into useful insights and predictions. Our observation is that there are not enough data scientists and machine learning practitioners working in Web3, and that Web3 data is not being refined or consumed enough for this reason. Making it easier for data scientists to work with decentralised storage and to generate and monetise insights has the potential to increase the value of Web3 data, which in turn incentivises individuals to withdraw their data from centralised platforms. We can help to kickstart this flywheel.

What are the risks if you don't get it right?
In our experience, data scientists are generally quite sceptical of crypto and Web3. Failing to get the workflow and UX of the platforms right could further alienate this community. For example, how should we use data tokens with open source datasets on a Web3 Hub (if at all)?

What are the risks that will make executing on this project difficult?
The HuggingFace team is not hugely responsive on their Discord, and it has been tricky to find someone to talk to about providing support for integrating decentralised storage. As mentioned, data scientists also tend to be sceptical of Web3.

Deliverables

  • Research report with detailed specification on technical architecture of solution, and GitHub repo with tutorials for existing workflows with decentralised storage
  • Python library (compatible with fsspec) for writing and reading from decentralised storage solutions such as IPFS and Filecoin
  • PRs to two popular dataset and model hubs, HuggingFace and ActiveLoop, to integrate our newly-developed Python interface
  • Proof of concept of a Web3 hub for storing machine learning datasets and models
  • Documents, tutorial videos and demos

Development Roadmap

Milestone 1 - Research and Concept Phase
In this milestone, we will outline the technical architecture of the solution in detail. We will also map out existing workflows for IPFS, Filecoin and Estuary. The results of this milestone will be a detailed specification and research report for the rest of the solution, as well as tutorials for existing workflows in this GitHub repo.

We will also build on existing outreach with several other projects that are working in a similar direction. To date, we have hosted and joined working groups, and attended regular meetings with a number of projects to discuss considerations and differentiation.

We will continue to run and record weekly sessions for our decentralised storage working group within the Algovera community, as well as attending other working groups.

Estimated Time - 2 full-time person-months ($10,000)

Dates:
4/15/2022 - 5/15/2022

Milestone 2 - Python Interface for Decentralised Storage
In this milestone, we will implement a Python library for reading from and writing to decentralised storage such as IPFS and Filecoin. We will use the format of fsspec, a unified Pythonic interface to local, remote and embedded file systems and byte storage; it is a popular library used by HuggingFace and others. This stage will involve some coordination with the Filecoin shared-zarr working group (and the creators of fsspec). An implementation of fsspec for IPFS (ipfsspec) exists, but it is read-only. We plan to use Estuary for storage on Filecoin.
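As a rough sketch of the direction (not the final API), a write-capable implementation could subclass fsspec's AbstractFileSystem and push bytes to Filecoin through Estuary's HTTP API. The class name, upload method, endpoint and response field below are illustrative assumptions.

```python
# Illustrative sketch only: a write path to Filecoin via Estuary, wrapped in an
# fsspec-style filesystem. Endpoint, auth scheme and response fields are assumptions.
import io

import requests
from fsspec.spec import AbstractFileSystem


class EstuaryFileSystem(AbstractFileSystem):  # hypothetical class name
    protocol = "estuary"

    def __init__(self, api_token, api_url="https://api.estuary.tech", **kwargs):
        super().__init__(**kwargs)
        self.api_token = api_token
        self.api_url = api_url

    def put_bytes(self, data: bytes, name: str) -> str:
        """Upload raw bytes via Estuary and return the resulting CID."""
        resp = requests.post(
            f"{self.api_url}/content/add",
            headers={"Authorization": f"Bearer {self.api_token}"},
            files={"data": (name, io.BytesIO(data))},
        )
        resp.raise_for_status()
        return resp.json()["cid"]
```

Reads would continue to go through IPFS gateways as in ipfsspec, so the same interface covers both halves of the round trip.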

Estimated Time - 3 full-time person-months ($15,000)

Dates:
5/15/2022 - 6/30/2022

Milestone 3 - Integration of Python Interface with Data Science Frameworks
In this milestone, we will submit PRs to two popular dataset and model hubs, HuggingFace and ActiveLoop, to integrate our newly-developed Python interface. HuggingFace uses the fsspec standard for working with centralised cloud storage (see here), so integration is straightforward. Integration with ActiveLoop Hub will involve creating a class that wraps fsspec. This phase will lean on the experience of our team member Dyllan at ActiveLoop.
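Because the HuggingFace datasets library already accepts fsspec filesystems (today mainly for S3), the integration we have in mind would look something like the sketch below. `IPFSFileSystem` stands in for the Milestone 2 interface and is an assumption, not an existing import.

```python
# Illustrative: how the Milestone 2 filesystem could slot into an existing
# HuggingFace `datasets` workflow. The local save is runnable today; the
# commented line shows the intended decentralised target.
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1%]")  # small public slice

dataset.save_to_disk("./imdb-train")  # current local/centralised workflow
# After integration (hypothetical):
# dataset.save_to_disk("datasets/imdb-train", fs=IPFSFileSystem())
```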

Estimated Time - 1 full-time person-month ($5,000)

Dates:
6/30/2022 - 7/15/2022

Milestone 4 - Web 3 AI Hub
There are several existing Web2 hubs for datasets, models, apps and other assets, such as Google AI Hub, HuggingFace Hub and ActiveLoop Hub. However, existing hubs do not facilitate ownership or monetisation of algorithms developed by users of the platforms. At the same time, the Ocean Protocol marketplace facilitates ownership and monetisation, although it’s designed for private datasets only. With our learnings from the previous milestones (and our previous successful Ocean grants), we will combine the best parts of HuggingFace, ActiveLoop Hub and the Ocean marketplace to create a Web3 AI marketplace where data scientists can use public and private datasets, and generate revenue from the algorithms that they develop. We will upload 10 popular open source machine learning datasets to start, as well as the numerous assets that the Algovera community has created to date.

Estimated Time - 3 full-time person-months ($15,000)

Dates:
7/15/2022 - 8/31/2022

Milestone 5 - Tutorial Documents, Videos and Demos
Documentation is the best way to communicate our applications to data scientists. We plan to create workflow tutorials, blogs and video tutorials, and open source developer-friendly documentation. These materials will focus on using our tools in a machine learning workflow. Algovera has experience with creating educational material for our popular data science in Web3 course. We will also publicise our work in other ecosystems that we are a part of such as Ocean and Gitcoin/Kernel.

Estimated Time - 1 full-time person-month ($5,000)

Dates:
8/31/2022 - 9/15/2022

Total Budget Requested

$50,000

Maintenance and Upgrade Plans

We plan to integrate this with Ocean Protocol's Provider for data and compute services, such as compute-to-data (C2D). An API will be developed to allow data scientists to upload datasets and models to the Web3 Hub.

Team

Team Members

  • Dr. Richard Blythman (Algovera)
  • Jakub Smekal (Algovera and Opscientia)
  • Dyllan McCreary (Gitcoin FDD)
  • Vedant Padwal (SAME Project)
  • Vintage Gold (Algovera)

Team Member LinkedIn Profiles

Team Website

https://www.algovera.ai

Relevant Experience

Dr. Richard Blythman is a machine learning R&D engineer with 5 years of experience in university, industry and startups. He is the founder of Algovera, a Web3 project and community advancing the development of the decentralised AI stack. Algovera has completed 9 successful grants with Ocean Protocol.

Jakub Smekal is an undergraduate student in maths and physics, and a core team member of Algovera. He is a recipient of an Opscientia fellowship. He has worked on Python libraries for simulating complex systems and for integrating Jupyter with MetaMask. He has experience working as a machine learning engineer in computer vision.

Dyllan McCreary is a deep learning research engineer and software engineer with 4 years of experience in industry. He spent 1 year building an open source Python package for computer vision data pipelines (Activeloop Hub), optimised for streaming and user experience.

Vedant Padwal is an undergraduate student in computer engineering and a contributor to and maintainer of MLOps open source projects such as Kubeflow, KServe, SAME-Project and NVIDIA-MERLIN. He has experience building large-scale end-to-end machine learning systems, Kubernetes apps, full-stack apps and recommender systems, and has worked with several people from industry and academia on open source projects.

Vintage Gold has a Master’s in data science with experience in natural language processing, deep learning models, and time series. He also has prior experience with configuring and designing accounting and financial systems, and building financial models for city planners.

Team code repositories

Additional Information

Currently, GitcoinDAO's FDD workstream is collaborating directly with us, providing feedback and pragmatic use cases for our Web3 ML stack on the sybil-account problem, to protect their quadratic funding grant matching mechanism.


ErinOCon commented Apr 4, 2022

Hi @richardblythman, Thank you for your proposal. We will review this and get back to you, on this thread, with a status update or questions.

@richardblythman
Author

Awesome. Thanks Erin!

@ErinOCon
Collaborator

Hi @richardblythman, we would like to fund part of the outlined work! Please email devgrants@fil.org to discuss next steps.

@richardblythman
Author

That's great. Thanks @ErinOCon. Just sent an email.
