
Describing your data schema

Without a description of your data, goodtables can only check that its structure is valid: for example, that all rows have the same number of columns and that there are no blank headers. To validate the actual contents, you need to describe the data schema.
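To make the idea of a structure-only check concrete, here is a minimal sketch of the two checks mentioned above, written with Python's standard library. It is an illustration, not how goodtables is implemented; the function name `structural_errors` and the sample text are made up for this example.

```python
import csv
import io


def structural_errors(csv_text):
    """Return a list of structural problems found in CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    errors = []
    # Check 1: every header cell must contain some text.
    for position, name in enumerate(header, start=1):
        if not name.strip():
            errors.append(f"blank header in column {position}")
    # Check 2: every data row must have as many cells as the header.
    for line_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            errors.append(
                f"row {line_number} has {len(row)} columns, expected {len(header)}"
            )
    return errors


# Made-up sample: the second header is blank and the last row is one cell short.
sample = "date,,to,amount\n2017-01-01,Jane,John,1000\n2017-01-15,Jane,Paul\n"
print(structural_errors(sample))
```

Neither check looks at what the cells contain, which is why a schema is needed to validate the values themselves.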

The data schema describes the type of value each column should hold (strings, numbers, dates), its format (this string should be an e-mail address), and its constraints (values in the age column should be greater than 18). You can think of it as a kind of data dictionary. The best way to describe the data schema is by writing a data package.

Instructions

In the root folder of your data, create a datapackage.json file with the following contents:

{
  "name": "my-dataset",
  "title": "My dataset",
  "resources": [
    {
      "name": "my-data",
      "path": "data/data.csv"
    }
  ]
}
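If you would rather generate the file than type it by hand, the following sketch writes the same descriptor with Python's standard library. The helper name `write_descriptor` is hypothetical, and the temporary directory only stands in for the root folder of your data.

```python
import json
import tempfile
from pathlib import Path

# The same descriptor as above, expressed as a Python dictionary.
descriptor = {
    "name": "my-dataset",
    "title": "My dataset",
    "resources": [
        {"name": "my-data", "path": "data/data.csv"},
    ],
}


def write_descriptor(root):
    """Write datapackage.json into the given folder and return its path."""
    path = Path(root) / "datapackage.json"
    path.write_text(json.dumps(descriptor, indent=2) + "\n", encoding="utf-8")
    return path


# A temporary directory is used here so the sketch has no side effects.
with tempfile.TemporaryDirectory() as root:
    written = write_descriptor(root)
    loaded = json.loads(written.read_text(encoding="utf-8"))
    print(written.name, "->", loaded["resources"][0]["path"])
```

The resource path is relative to the folder that holds datapackage.json, so in this layout the CSV file is expected at data/data.csv.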

This is the simplest tabular data package we can create. Let's see what our data looks like so we can write the table schema for it.

+------------+------+------+--------+
| date       | from | to   | amount |
+============+======+======+========+
| 2017-01-01 | Jane | John | 1000   |
+------------+------+------+--------+
| 2017-01-15 | Jane | Paul | 500    |
+------------+------+------+--------+
| 2017-02-03 | John | Jane | 2000   |
+------------+------+------+--------+

Each column in a table schema can have up to three parts: the data type ("string", "number", "date"), the data format ("e-mail", "URI", "ISO date"), and the constraints ("number must be at least 18"). Not all columns will have all three parts.
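As a rough illustration of the three parts, here are two field descriptors written as Python dictionaries. The `email` and `age` columns are hypothetical and are not part of the example data. The keys `type`, `format` and `constraints`, the `email` string format and the `minimum` constraint are taken from the Table Schema specification.

```python
import json

# Hypothetical column with a type and a format, but no constraints.
email_field = {
    "name": "email",
    "type": "string",    # data type
    "format": "email",   # data format
}

# Hypothetical column with a type and a constraint, but no format.
age_field = {
    "name": "age",
    "type": "integer",               # data type
    "constraints": {"minimum": 18},  # lowest accepted value
}

print(json.dumps({"fields": [email_field, age_field]}, indent=2))
```

Leaving a part out simply means the column uses the default for it, or has no restriction of that kind.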

In our data, we have the following columns:

+--------+---------+------------+-----------------------+
| column | type    | format     | constraints           |
+========+=========+============+=======================+
| date   | date    | YYYY-MM-DD |                       |
+--------+---------+------------+-----------------------+
| from   | string  |            |                       |
+--------+---------+------------+-----------------------+
| to     | string  |            |                       |
+--------+---------+------------+-----------------------+
| amount | number  |            | Greater or equal to 0 |
+--------+---------+------------+-----------------------+

Writing this as a table schema in our data package, we have:

{
  "name": "my-dataset",
  "title": "My dataset",
  "resources": [
    {
      "name": "my-data",
      "path": "data/data.csv",
      "schema": {
        "fields": [
          {
            "name": "date",
            "type": "date",
            "description": "The transaction date"
          },
          {
            "name": "from",
            "type": "string",
            "description": "Payer"
          },
          {
            "name": "to",
            "type": "string",
            "description": "Payee"
          },
          {
            "name": "amount",
            "type": "number",
            "description": "Transaction value in Euros",
            "constraints": {
              "minimum": 0
            }
          }
        ]
      }
    }
  ]
}
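To show what this schema lets a validator check, here is a simplified sketch that applies the field types and the `minimum` constraint to CSV rows using only Python's standard library. It is not the goodtables implementation: `CASTERS`, `check_rows` and the `bad` sample are names and data invented for this illustration, and only the three types used above are handled.

```python
import csv
import io
from datetime import datetime

# The "fields" part of the schema above, without the descriptions.
schema = {
    "fields": [
        {"name": "date", "type": "date"},
        {"name": "from", "type": "string"},
        {"name": "to", "type": "string"},
        {"name": "amount", "type": "number", "constraints": {"minimum": 0}},
    ]
}

# One conversion function per type used in this schema (hypothetical mapping).
CASTERS = {
    "string": str,
    "number": float,
    "date": lambda text: datetime.strptime(text, "%Y-%m-%d").date(),
}


def check_rows(csv_text, schema):
    """Return one message per cell that breaks its type or minimum."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            name, raw = field["name"], row[field["name"]]
            try:
                value = CASTERS[field["type"]](raw)
            except (TypeError, ValueError):
                errors.append(f"row {line_number}: {name}={raw!r} is not a valid {field['type']}")
                continue
            minimum = field.get("constraints", {}).get("minimum")
            if minimum is not None and value < minimum:
                errors.append(f"row {line_number}: {name}={raw!r} is below the minimum {minimum}")
    return errors


good = "date,from,to,amount\n2017-01-01,Jane,John,1000\n2017-01-15,Jane,Paul,500\n2017-02-03,John,Jane,2000\n"
# Made-up rows with an impossible month and a negative amount.
bad = "date,from,to,amount\n2017-13-01,Jane,John,1000\n2017-01-15,Jane,Paul,-5\n"
print(check_rows(good, schema))
print(check_rows(bad, schema))
```

The example data passes, while the made-up rows are reported for the invalid date and for the amount below zero.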

Note that we didn't have to define the date format explicitly, as the default format is YYYY-MM-DD.
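The sketch below illustrates the difference between relying on the default date format and declaring one explicitly. It assumes, following the Table Schema specification, that a date field may carry a strptime-style pattern in its `format` property. The helper `parse_date` and the `%d/%m/%Y` variant are hypothetical, and the specification's `any` format is not handled here.

```python
from datetime import datetime

# strptime spelling of the default YYYY-MM-DD format.
DEFAULT_DATE_PATTERN = "%Y-%m-%d"


def parse_date(text, field):
    """Parse text using the field's format, or the default when none is given."""
    fmt = field.get("format", "default")
    pattern = DEFAULT_DATE_PATTERN if fmt == "default" else fmt
    return datetime.strptime(text, pattern).date()


# The field from our schema: no format, so the default applies.
implicit = {"name": "date", "type": "date"}
# Hypothetical variant for data that stores dates as day/month/year.
explicit = {"name": "date", "type": "date", "format": "%d/%m/%Y"}

print(parse_date("2017-01-15", implicit))
print(parse_date("15/01/2017", explicit))
```

Both calls yield the same date, so an explicit format is only needed when your data does not already use YYYY-MM-DD.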

You can find all supported data types, formats and constraints in the Table Schema specification.