Open Data

Make Open Data compatible with the Modern Data Ecosystem.

Motivation

Open Data is a public good. As a result, individual [[incentives]] are not aligned with collective ones.

For an organization or research group, spending time curating and maintaining datasets for other people to use doesn't make economic sense unless you can profit from it.

The current landscape has a few problems:

  • Non-Interoperability. Data is isolated in multiple places and across different formats.
  • Data Loss. Data is commonly stored on perishable hardware and in perishable formats.
  • Hard to Search. Dataset indexing is difficult since there are many competing standards.
  • No Collaboration. No incentives exist for people to work on improving or curating datasets.

Open Data can help organizations, scientists, and governments make better decisions. Data is one of the best ways to learn about the world and [[Coordination|coordinate]] people.

Open protocols create open systems. Open code creates tools. Open data creates open knowledge. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.

Iterative improvements over public datasets yield large amounts of value (check how Dune did it with blockchain data)¹. Access to data gives people the opportunity to create new businesses and make better decisions.

Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code.

Why Now?

We have better and cheaper infrastructure: faster storage, better compute, and larger amounts of data. We need to improve our data workflows now. What does a world where people collaborate on datasets look like? The data is there. We just need to use it.

During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets).

These trends are already making their way into movements like DeSci and smaller projects like Py-Code Datasets. But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. We should use the same kind of tooling and approaches to manage the open datasets that surround us. A sort of data operating system.

Data wrangling is a perpetual maintenance commitment that takes a lot of ongoing attention and resources. Better, more modern data tooling can reduce these costs.

Organizations like Our World in Data or 538 provide useful analyses but have to deal with dataset management themselves, spending a lot of their time building custom tools around their workflows. That works, but it limits the potential of these datasets. Sadly, there is no data get OWID/daily-covid-cases or data query "select * from 538/polls" that could act as a quick and easy entry point to explore datasets.
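That entry point doesn't exist yet, but modern query engines already get close when datasets are published as plain files at stable URLs. A minimal sketch in Python, assuming a hypothetical Parquet file with country, date, and new_cases columns at a placeholder URL:

import duckdb

# DuckDB can query remote files over HTTP directly (recent versions auto-load
# the httpfs extension), which is close to the "data query" experience above.
con = duckdb.connect()
df = con.sql("""
    SELECT country, date, new_cases
    FROM 'https://example.com/owid/daily-covid-cases.parquet'  -- placeholder URL
    WHERE date >= DATE '2021-01-01'
""").df()
print(df.head())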

We could have a better data ecosystem if we collaborated on open standards! So, let's move towards more composable, maintainable, and reproducible open data.

¹ Blockchain data might be a great place to start building on these ideas as the data there is open, immutable, and useful.

Design Principles

  • Easy. Create, curate and share datasets without friction.
    • Frictionless: Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and multiple formats.
    • Pragmatism: published data is better than data that is almost published because something is still missing. Publishing datasets to the web is too hard right now, and there are few purpose-built tools that help.
  • Versioned and Modular. Data and metadata (e.g: relations) should be updated, forked, and discussed like code in version-controlled repositories.
    • Prioritize composability (e.g: the Arrow ecosystem) so tools/services can be swapped.
    • Metadata as a first-class citizen. Even if minimal and automated.
    • Git-based collaboration. Adopt and integrate with git and GitHub to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, and PRs to datasets.
    • Provide a declarative way of defining the dataset's schema and other meta-properties like relations or tests/checks.
    • Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, reports, ...
  • Reproducible and Verifiable. People should be able to trust the final datasets without having to recompute everything from scratch. In "reality", events are immutable; data should be too. Make datasets the center of the tooling.
    • With immutability and content addressing, you can move backwards in time and run transformations or queries against the dataset as it was at a certain point in time (see the sketch after this list).
    • Datasets are books, not houses!
  • Permissionless. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. On data.
  • Aligned Incentives. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly (e.g: number of contributors to Dune).
    • Bounties could be created to reward people that add useful but missing datasets.
    • Surfacing and creating great datasets could be rewarded (retroactively or with bounties).
    • Curating the data provides compounding benefits for the entire community!
    • Reward dataset creators according to usefulness. E.g: CommonCrawl built an amazing repository that OpenAI has used for its GPT LLMs. It's not clear how well CommonCrawl was compensated.
  • Open Source and Decentralized. Datasets should be stored in multiple places.
    • Don't create yet another standard. Provide a way for people to integrate current indexers. Work on adapters for different dataset sources. Similar to:
      • Foreign Data Wrappers in PostgreSQL
      • Trustfall.
      • Open source data integration projects like Airbyte. They can be used to build open data connectors, making it possible to replicate something from $RANDOM_SOURCE (e.g: spreadsheets, Ethereum blocks, URLs, ...) to any destination.
      • Adapters are created by the community so data becomes connected.
      • Having better data will help create better and more accessible AI models (people are working on this).
    • Integrate with the modern data stack to avoid reinventing the wheel and to build on skill sets people already have.
    • Decentralize the computation (run it where the data lives) and then cache immutable, static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end-user queries require reading only a small amount of data!
  • Other principles from the Indie Web, like have fun!
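As a concrete illustration of the content-addressing idea above, a minimal sketch, assuming each published dataset version is materialized as a single Parquet file (the path below is hypothetical; multi-file datasets would hash a manifest of per-file digests instead):

import hashlib

def content_id(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a stable identifier derived purely from the file bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"

# Every published version gets an immutable address. "Updating" a dataset means
# publishing a new version with a new content ID, never rewriting an old one,
# so queries can always be replayed against the dataset as it was.
print(content_id("datasets/my-org/my-dataset/2024-01.parquet"))  # hypothetical path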

Modules

Packaging

Package managers have been hailed as one of the most important innovations Linux brought to the computing industry. The activities of both publishers and users of datasets resemble those of authors and users of software packages.

  • Distribution. Decentralized. No central authority. Can work in closed and private networks. Cache/CDN friendly.
    • A data package is a URI (like in Deno). You can import from a URL (data add example.com/dataset.yml or data add example.com/hub_curated_datasets.yml); see the sketch after this list.
    • As Rufus Pollock puts it, keep it as simple as possible: store the table location and schema, and get me the data on the hard disk (or in my browser) fast.
    • Bootstrap a package registry. E.g: a GitHub repository with lots of known datapackages that acts as a fallback and a quick way to get started with the tool (data list returns a bunch of known open datasets and integrates with platforms like Huggingface).
  • Indexing. Should be easy to list datasets matching a certain pattern or reading from a certain source.
    • Datasets are linked to a [[Open Data#Datafile|Datafile]]/datapackage.yml with metadata.
    • One repository, one dataset or one catalog/hub.
    • To avoid yet another open dataset portal, build adapters to integrate with other indexes.
    • FAIR.
  • Formatting. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be format and storage agnostic. Give me the dataset with id xyz as a CSV in this folder.
  • Social. Allow users, organizations, stars, citations, attaching default visualizations (d3, Vega, Vegafusion, and others), ...
    • Importing datasets. Make it possible to data fork user/data, improve something, and publish the resulting dataset back (via something like a PR).
    • Have issues and discussions close to the dataset.
  • Extensible. Users could extend the package resource (e.g: Time Series Tabular Package inherits from Tabular Package) and add better support for more specific kinds of data (geographical).
    • Build integrations to ingest and publish data in other hubs (e.g: CKAN, HuggingFace, ...).
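A minimal sketch of the "data package is a URI" idea: fetch a package descriptor from a URL and download the resources it lists. The resources/name/path layout loosely follows the Frictionless datapackage convention, but the exact field names here are assumptions, as is the URL.

import urllib.request
import yaml  # pip install pyyaml

def data_add(package_url: str, dest_dir: str = ".") -> None:
    # Fetch the package descriptor (a YAML file describing the dataset).
    with urllib.request.urlopen(package_url) as resp:
        package = yaml.safe_load(resp.read())
    # Download every resource the descriptor points at.
    for resource in package.get("resources", []):
        target = f"{dest_dir}/{resource['name']}"
        urllib.request.urlretrieve(resource["path"], target)
        print(f"added {resource['name']} from {resource['path']}")

data_add("https://example.com/hub_curated_datasets.yml")  # placeholder URL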

Storage and Serialization

  • Permanence. Each version should be permanent and accessible (look at git, IPFS, dolt, ...).
  • Versioning. Should be able to manage diffs and incremental changes in a smart way. E.g: only storing the new added rows or updated columns.
  • Smart. Use appropriate protocols for storing the data. E.g: rows/columns shouldn't be duplicated if they don't change.
    • Think at the dataset level and not the file level.
    • Tabular data could be partitioned to make it easier for future retrieval.
  • Immutability. Never remove historical data. Data should be append only.
    • Similar to how git deals with it. You could force the deletion of something in case that's needed, but that's not the default behavior.
  • Flexible. Allow arbitrary backends. Both centralized (S3, GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers.
    • As agnostic as possible, supporting many types of data: tables, geospatial, images, ...
    • Can all datasets be represented as tabular datasets? That would make it possible to run SQL (selects, group bys, joins) on top of them, which might be the easiest way to start collaborating.
    • A dataset could have different formats derived from a common one. Build converters between formats relying on the Apache Arrow in-memory standard. This is similar to how Pandoc and LLVM work! The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automagically and run some checks at the data level to verify they contain the same information (see the sketch after this list).
    • Datasets could be tagged from a library of types (e.g: ip-address) and conversion functions (ip-to-country). Given that the representation is common (Arrow), the transformations could be written in multiple languages.

Transformations

  • Deterministic. Packaged lambda style transformations (WASM/Docker).
  • Declarative. Transformations should be defined as code and be idempotent. Similar to how Pachyderm/Kamu/Holium work.
    • E.g: The transformation tool ends up orchestrating containers/functions that read/write from the storage layer, Pachyderm style (see the sketch after this list).
  • Environment agnostic. Can be run locally and remotely. One machine or a cluster. Streaming or batch.
  • Templated. Having a repository/market of open transformations could enable a bunch of use cases that are ready to plug into datasets.
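A minimal sketch of a declarative, idempotent transformation: a pure function from an input snapshot to an output artifact, so re-running it on the same input always produces the same result. Paths and column names are placeholders, and the SQL engine (DuckDB here) is just one possible runtime.

import duckdb

def transform(input_path: str, output_path: str) -> None:
    """Pure function of the input file contents; safe to re-run."""
    duckdb.sql(f"""
        COPY (
            SELECT country, AVG(new_cases) AS avg_cases
            FROM '{input_path}'
            GROUP BY country
        ) TO '{output_path}' (FORMAT PARQUET)
    """)

# Same input, same query, same output: re-running is effectively a no-op.
transform("raw/daily-covid-cases.parquet", "derived/avg-cases-by-country.parquet")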

Consumption

  • Accessible. Datasets are files. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
  • Documentation. Surface derived work (e.g: reports, other datasets, ...).
  • Embedded Visualizations. Know what's in there before downloading it.
    • Sane Defaults. Suggest basic charts (bars, lines, time series, clustering). Multiple views.
    • Exploratory. Allow drill downs and customization. Offer a simple way for people to query/explore the data.
    • Dynamic. Use only the data you need. No need to pull 150GB.
  • Default APIs. For some datasets, offering REST API / GraphQL endpoints might be useful. Same with providing an SQL interface (see the sketch below).
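A minimal sketch of such a default API, assuming a dataset already materialized as a Parquet file (path, view name, and endpoint are all placeholders): a read-only SQL endpoint built with FastAPI on top of DuckDB.

import duckdb
from fastapi import FastAPI

app = FastAPI()
con = duckdb.connect()
con.sql("CREATE VIEW polls AS SELECT * FROM 'datasets/538/polls.parquet'")  # placeholder path

@app.get("/query")
def query(sql: str):
    # A real service would restrict this to read-only statements and add limits.
    return con.sql(sql).df().to_dict(orient="records")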

Frequently Asked Questions

I'm not super clear on these answers! Please reach out if you want to chat about it.

1. What would be a great use case to start with?

I'd say chain-related data. It's open, and people are eager to get their hands on it. I'm working in that area, so I might be biased.

2. Why should people use this instead of doing their own thing?

If everybody converged on it, e.g: "datapackage.json" as a metadata and schema description standard, then an ecosystem of utilities and libraries for processing data could take advantage of it.

3. What is the incentive for people to adopt it?

I wonder if there are ways to use novel mechanisms (e.g: DAOs) to incentivize people? Also, companies like Golden and index.as are doing interesting work on monetizing data curation.

4. How can LLMs help "building bridges"?

LLMs could infer schemas and types and generate some metadata for us. [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scraping websites)]], making it easy to create datasets from random sources.

They're definitely blurring the line between structured and unstructured data too. Imagine pointing an LLM at a GitHub repository with some CSVs and getting an auto-generated datapackage.json.
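Even without an LLM, part of that skeleton can be inferred mechanically. A minimal sketch, assuming a CSV in the repository (the path is a placeholder; the emitted types are Arrow type names, not the Frictionless vocabulary), that a model or a human could then enrich with descriptions:

import json
import pyarrow.csv as pa_csv

table = pa_csv.read_csv("some_repo/data.csv")  # placeholder path
package = {
    "name": "data",
    "resources": [{
        "path": "data.csv",
        "schema": {
            "fields": [{"name": f.name, "type": str(f.type)} for f in table.schema]
        },
    }],
}
print(json.dumps(package, indent=2))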

5. How can we stream/update new data reliably? E.g: some datasets like Ethereum blocks could be updated every few minutes

I don't have a great answer. Perhaps just push the new data into partitioned datasets?

6. Is it possible to mount a large amount of data (FUSE) from a remote source and get it dynamically as needed?

It should be possible. I wonder if we could mount all datasets locally and explore them as if they were on your laptop.

7. Can new table formats play efficiently with IPFS?

Parquet could be a great fit if we figure out how to deterministically serialize it and integrate it with IPLD. This would reduce storage size, as unchanged columns could be encoded under the same CID.

Later on I think it could be interesting to explore running delta-rs on top of IPFS.

8. How to work with private data?

Not sure. Homomorphic encryption?

9. How could something like Ver work?

If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs|knowledge graph]]!

10. How can [[Knowledge Graphs]] help with the data catalog?

It could help users connect datasets. With good enough core datasets, it could be used as an LLM backend.

An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.

12. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?

Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps.

Related Projects

Data Package Managers

Computation

Large Open Datasets

Open Data Organizations

Indexes

Open Source Web Data IDE

After playing with Rill Developer, DuckDB, Vega, WASM, Rath, and other modern data IDEs, I think we have all the pieces for an awesome web-based BI/data exploration tool. Some of the features it could have:

  • Let me add local and remote datasets. Not just one, as I'd like to join them later.
  • Let me plot it using Vega-Lite. Guide me through alternatives like Vega's Voyager2 does.
    • Might be as simple as surfacing Observable Plot with DuckDB WASM...
  • Use LLMs to improve the datasets and offer next steps:
    • Get suggested transformations for certain columns. If it detects a date, extract the day of the week. If it detects a string, lower() it...
    • Get suggested plots, given that it'll know both the column names and the types. It should be possible to create a prompt that returns some plot ideas and another that takes those ideas and writes the Vega-Lite code to make them work.
    • Make it easy to query the data via natural language.
  • Let me transform them with SQL (DuckDB) and Python (JupyterLite). Similar to Neptyne but in the browser (WASM).
  • Let me save the plots in a separate space and give me a shareable URL encoded link.
    • Local datasets could be shared using something like Magic Wormhole or a temporal storage service.
  • Let me grab the state of the app (YAML/JSON), version control it, and generate static (to publish in GitHub Pages) and dynamic (hosted somewhere) dashboards from it.
  • It could also have "smart" data checks. Similar to deepchecks alerting for anomalies, outliers, noisy variables, ...
  • Given a large amount of [[Open Data]], it could offer a way for people to upload their datasets and get them augmented (see the sketch below).
    • E.g: Upload a CSV with year and country and the tool could suggest GDP per Capita or population.

Could be an awesome front-end to explore [[Open Data]].
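A minimal sketch of that augmentation flow, assuming the uploaded CSV has country and year columns and a reference table exists at a placeholder path:

import duckdb

augmented = duckdb.sql("""
    SELECT u.*, r.population           -- suggested extra column
    FROM 'uploaded.csv' AS u           -- user-provided file
    LEFT JOIN 'reference/population_by_country_year.parquet' AS r  -- placeholder path
    USING (country, year)
""").df()
print(augmented.head())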

Relevant Projects

Datafile

Inspired by ODF.

name: "My Dataset"
owner: "My Org"
kind: "dataset"
version: 1
description: "Some description"
license: "MIT"
documentation:
 url: "somewhere.com"
source:
 - name: "prod"
   db: "psql:/...."
pipeline:
 - name: "Extract X"
   type: image
   image: docker/image:latest
   cmd: "do something"
materializations:
 - format: "Parquet"
   location: "s3://....."
   partition: "year"
schema:
 fields:
  - name: "name"
    type: "string"
    description: "The name of the user"
  - name: "year"
  - description: "...."
 primary_key: "country_name"
metadata: "..."
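A minimal sketch of loading and sanity-checking such a Datafile from Python; the required keys are taken from the example above, and everything else (file name, error handling) is an assumption.

import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "owner", "kind", "version", "schema"}

def load_datafile(path: str) -> dict:
    with open(path) as f:
        datafile = yaml.safe_load(f)
    missing = REQUIRED_KEYS - datafile.keys()
    if missing:
        raise ValueError(f"Datafile is missing keys: {sorted(missing)}")
    return datafile

datafile = load_datafile("datapackage.yml")  # hypothetical file name
print(datafile["name"], datafile["version"])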

Simple Package Manager Design

  • A package spec file describing a package.
  • A hierarchical owner/name folder structure for installed packages.
  • Spec file locator with fallback to the package registry.
  • Versioning and latest versions.
  • Asset checksums.
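A minimal sketch of two of these pieces, the owner/name folder layout and asset checksums; the root directory and digest algorithm are assumptions.

import hashlib
from pathlib import Path

def install_path(root: str, owner: str, name: str, version: str) -> Path:
    """Installed packages live under <root>/<owner>/<name>/<version>/."""
    return Path(root) / owner / name / version

def verify_asset(path: Path, expected_sha256: str) -> bool:
    """Check a downloaded asset against the checksum recorded in the spec file."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256

print(install_path("~/.data/packages", "owid", "daily-covid-cases", "1.2.0"))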

Architecture

Architecture diagram (edit on Excalidraw).