5 Levels of data reusability #103

@joepio

Tim Berners-Lee's 5-star Open Data is a really cool mental model for thinking about open data quality. Check out the website if you haven't seen it: https://5stardata.info/en/

But I think it doesn't quite fit what most developers would consider usable data, so it might make sense to provide a different list that focuses on data reusability.

Mostly, it lacks typed data as a distinguishing criterion: whether the data has a machine-readable schema. Personally, I think this is one of the most important characteristics. It's probably one of the main reasons why SQL is so incredibly popular, and why pretty much all programming languages have things like structs or classes with (type-safe) properties. But not all data has this, so I think it should be a distinction layer - a separate level, if you will.

Also, we can introduce verifiability of data, powered by Atomic Commits (or any other technology that does something similar).

I'm not sure whether we should call it '5 levels', it's definitely not as catchy as '5 stars'. I'm also not fully certain about 'usability', but I think it describes what I mean pretty well.

Anyways, here's a work in progress / draft. Feel free to share ideas / criticism / thoughts!

========

5 Levels of data reusability

Not all data are created equal.
There are notable differences in how much you can do with data and how much effort it takes.
The more reusable data is, the easier it will be to use it as a developer, researcher or other type of data user.
Reusability is about being able to transform, sort, query, serialize, modify, render and audit data without requiring too much work.

This list is inspired by Tim Berners-Lee's 5-star open data.

Level 0: proprietary data

If you don't give others the rights to read, use or modify your data, its reusability is zero.

That's why it's important to have a license that allows others to use your data.
A good choice is the Open Database License (ODbL), which permits reuse under attribution and share-alike conditions.
Creative Commons licenses are also good options to clearly communicate whether, and how, your data may be reused.

It's also important to use open formats (such as CSV, JSON or PNG), instead of proprietary formats (tied to specific vendors, such as PSD or RAR).

Level 1: unstructured data

Examples: images, videos, plain text

Unstructured data is the least usable.
Humans can read it, and AI / Machine Learning systems can draw more conclusions from it than ever,
but it's hard to build an actual application or graphic from only unstructured data.

Hi! I'm Joep, I was born in 1991.

Level 2: structured data

Examples: CSV, XML, JSON, TOML, Excel

Structured data can be read by machines, and this allows us to do all sorts of useful things.
We can query, sort and filter.
But still, this type of data often requires human input when it needs to be processed.
And we don't have guarantees about which fields will be filled, or what their datatypes are.
One time, a birthYear can be a string, and the next time it can be a number.
Data can be structured, but still unpredictable.

{
  "name": "Joep",
  "birthYear": 1991
}
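
To make that unpredictability concrete, here's a minimal sketch in Python. The second record ("Ada", with birthYear as a string) is made up for illustration, and the function names are hypothetical. It shows that a naive sort breaks as soon as the datatype of birthYear varies, and that the consumer has to add their own normalization step to work around it.

```python
import json

# Two records from the same hypothetical feed; birthYear changes type between them.
raw = '[{"name": "Joep", "birthYear": 1991}, {"name": "Ada", "birthYear": "1815"}]'
people = json.loads(raw)

def sort_by_birth_year(records):
    """Sort naively, trusting that every birthYear has the same type."""
    return sorted(records, key=lambda r: r["birthYear"])

def normalized_birth_year(record):
    """Coerce birthYear to int; raises ValueError if the value is not numeric."""
    return int(record["birthYear"])

try:
    sort_by_birth_year(people)
    naive_sort_ok = True
except TypeError:
    # Python refuses to compare int and str, so the mixed list cannot be sorted.
    naive_sort_ok = False

# With manual normalization the sort works, but every consumer has to write this.
oldest_first = sorted(people, key=normalized_birth_year)

print(naive_sort_ok)                       # False
print([p["name"] for p in oldest_first])   # ['Ada', 'Joep']
```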

If we want predictability, we need to make it type-safe.

Level 3: type-safe data

Examples: SQL + DB schema, JSON + JSON Schema, XML + XSD, RDF + SHACL, in-memory data in type-safe programming languages

Type-safe data means that every value of the data has an explicit datatype.
It is strongly typed and has a clear schema that describes which properties you can expect in a Resource.
This means that someone re-using type-safe data can know for certain that it conforms to a specification, a set of rules.
The shape of the data is predictable.
This predictability means that developers can safely re-use it in their system without worrying about missing fields or datatype errors.

Lots of software has internal type safety, especially if you use type-safe programming languages like TypeScript, Kotlin or Rust.
However, when the data leaves the system, a lot of type-related information is lost.
Even if this schema-related information is described somewhere, the schema itself is often not machine-readable.
The best way to have type-safe data is to describe the schema in a machine-readable format.

In SQL, we can use a DB schema. In JSON, we can add a JSON Schema file. For XML, we have XSD.
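
As a rough illustration of what a machine-readable schema buys you, here's a small Python sketch. It is not JSON Schema or any real schema language: the schema layout, the `PERSON_SCHEMA` and `PYTHON_TYPES` names and the `validate` function are all assumptions made up for this example. The point is only that the schema is itself plain data, so a program can check records against it without a human reading documentation.

```python
import json

# Hypothetical schema, itself plain JSON so it can travel along with the data.
PERSON_SCHEMA = json.loads("""
{
  "name":      {"type": "string",  "required": true},
  "birthYear": {"type": "integer", "required": true}
}
""")

# Mapping from the schema's type names to Python types (an assumption of this sketch).
PYTHON_TYPES = {"string": str, "integer": int}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, rule in schema.items():
        if field not in record:
            if rule["required"]:
                problems.append(f"missing field: {field}")
            continue
        value = record[field]
        expected = PYTHON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so booleans are rejected explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(
                f"{field}: expected {rule['type']}, got {type(value).__name__}"
            )
    return problems

good = validate({"name": "Joep", "birthYear": 1991}, PERSON_SCHEMA)
bad = validate({"name": "Joep", "birthYear": "1991"}, PERSON_SCHEMA)

print(good)  # []
print(bad)   # ['birthYear: expected integer, got str']
```

In practice you'd reach for an existing schema language and validator instead of rolling your own; the sketch just shows the principle.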

In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the required datatypes, which helps developers who re-use the data understand what they can expect from a value.

{
  "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
  "https://atomicdata.dev/properties/name": "Joep",
  "https://atomicdata.dev/properties/birthYear": 1991,
  "https://atomicdata.dev/properties/worksOn": "Atomic Data"
}
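
Here's a rough Python sketch of that idea: the key of each value is a Property URL, and that URL tells you which datatype to expect. The `PROPERTY_DATATYPES` table is a hard-coded, hypothetical stand-in for what you would learn by resolving each Property; it does not use the real Atomic Data datatype definitions, and `check_resource` is a made-up helper.

```python
# Hypothetical local stand-in for what resolving each Property URL would tell us.
# A real client would look up the Property resource instead of hard-coding this.
PROPERTY_DATATYPES = {
    "https://atomicdata.dev/properties/isA": list,
    "https://atomicdata.dev/properties/name": str,
    "https://atomicdata.dev/properties/birthYear": int,
    "https://atomicdata.dev/properties/worksOn": str,
}

def check_resource(resource):
    """Return the property URLs whose values do not match the expected datatype."""
    mismatches = []
    for prop, value in resource.items():
        expected = PROPERTY_DATATYPES.get(prop)
        # Unknown properties count as mismatches, since nothing describes them.
        if expected is None or not isinstance(value, expected):
            mismatches.append(prop)
    return mismatches

joep = {
    "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
    "https://atomicdata.dev/properties/name": "Joep",
    "https://atomicdata.dev/properties/birthYear": 1991,
    "https://atomicdata.dev/properties/worksOn": "Atomic Data",
}

# Same resource, but with birthYear accidentally serialized as a string.
broken = {**joep, "https://atomicdata.dev/properties/birthYear": "1991"}

print(check_resource(joep))    # []
print(check_resource(broken))  # ['https://atomicdata.dev/properties/birthYear']
```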

Level 4: browsable data

Examples: Atomic Data, properly hosted RDF

If your data is connected to other pieces of machine-readable data, it becomes browsable, similar to how websites link to each other.
This effectively creates a web of data, and allows for a whole new way to think about the internet.
This is what allows decentralized applications, true data ownership, and a new set of applications.

{
  "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
  "https://atomicdata.dev/properties/name": "Joep",
  "https://atomicdata.dev/properties/birthYear": 1991,
  "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev"
}
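
What "browsable" means for a program can be sketched like this in Python. To keep it runnable offline, the "web" here is just a dictionary from URL to resource; the `https://example.com/agents/joep` URL, the contents of both resources and the `follow` helper are hypothetical. A real client would fetch each URL over HTTP instead of doing a dictionary lookup.

```python
# Hypothetical in-memory "web": URL -> resource.
# A real client would fetch each URL over HTTP instead.
WEB = {
    "https://example.com/agents/joep": {
        "https://atomicdata.dev/properties/name": "Joep",
        "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
    },
    "https://atomicdata.dev": {
        "https://atomicdata.dev/properties/name": "Atomic Data",
    },
}

NAME = "https://atomicdata.dev/properties/name"
WORKS_ON = "https://atomicdata.dev/properties/worksOn"

def follow(start_url, path, web=WEB):
    """Start at start_url and follow each property in path to the linked resource."""
    resource = web[start_url]
    for prop in path:
        # The value is a URL, so it can be looked up just like the start URL.
        resource = web[resource[prop]]
    return resource

project = follow("https://example.com/agents/joep", [WORKS_ON])
print(project[NAME])  # Atomic Data
```

Compare this with the Level 3 example: there, worksOn was the string "Atomic Data", which a program can display but not follow.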

Level 5: verifiable data

Examples: Atomic Data + Atomic Commits

When your data is verifiable, other people can verify who created and modified it.
They can use cryptography to validate signatures, which proves that a specific person or machine created a piece of data.

{
  "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
  "https://atomicdata.dev/properties/name": "Joep",
  "https://atomicdata.dev/properties/birthYear": 1991,
  "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
  "https://atomicdata.dev/properties/previousCommit": "https://atomicdata.dev/commits/EF18751AE781"
}
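
To give a feel for what validating a signature involves, here's a small Python sketch. Important caveat: this is not how Atomic Commits work. It uses an HMAC with a shared secret key as a stand-in, because Python's standard library has no public-key signatures. With real signatures, the verifier only needs the author's public key and never the secret. The key, the helper names and the serialization choice are all assumptions for this example.

```python
import hashlib
import hmac
import json

def canonical_bytes(resource):
    """Serialize deterministically so the same data always yields the same bytes."""
    return json.dumps(resource, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign(resource, key):
    """Return a hex HMAC-SHA256 tag over the canonical serialization."""
    return hmac.new(key, canonical_bytes(resource), hashlib.sha256).hexdigest()

def verify(resource, key, tag):
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(resource, key), tag)

BIRTH_YEAR = "https://atomicdata.dev/properties/birthYear"

key = b"demo-key-not-for-production"  # hypothetical shared secret
resource = {
    "https://atomicdata.dev/properties/name": "Joep",
    BIRTH_YEAR: 1991,
}
tag = sign(resource, key)

# Someone changes a value after the data was signed.
tampered = {**resource, BIRTH_YEAR: 1990}

print(verify(resource, key, tag))   # True
print(verify(tampered, key, tag))   # False
```

The part that carries over to real signatures is the shape of the check: serialize the data in a deterministic way, then confirm that the signature matches exactly those bytes. Any change to the data makes the check fail.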
