# Open development data. A step-by-step guide to documenting and publishing a dataset

– Alberto Cottica, UNDP Accelerator Labs

## Table of contents

* [About](#about)
* [Rationale](#rationale)
* [About Frictionless](#about.frictionless)
* [Step 1: import the library](#step-1)
* [Step 2: import the data and create the package](#step-2)
* [Step 3: validate](#step-3)
* [Step 4: check the metadata document](#step-4)
* [Step 5: manually edit the YAML file and convert to JSON](#step-5)
* [Step 6: save onto a suitable repository](#step-6)

## About <a class="anchor" id="about"></a>

This [Jupyter notebook](https://docs.jupyter.org/en/latest/) is for documenting a dataset using [Frictionless](https://frictionlessdata.io/introduction/). This entails making explicit your implicit knowledge about the data: how they were collected, when and by whom, what they are intended for, and any relevant contextual information (example: "this dataset was collected during the COVID-19 pandemic, which is likely to have influenced responses ...").

I have tried to make this document as general as possible; as an example, I used [this dataset](https://zenodo.org/doi/10.5281/zenodo.7515227), which collects the answers to a 2022 survey on the digitalization of micro and informal businesses in the Global South. As in all Jupyter notebooks, the blocks of Python code in this document are executable. I cleared their output in the interest of brevity.

My warmest thanks to [Andrea Borruso](https://aborruso.github.io/) for his help in bootstrapping me with Frictionless. 


## Rationale <a class="anchor" id="rationale"></a>

In the course of their activities as part of UNDP's R&D function, the [UNDP Accelerator Labs](https://acceleratorlabs.undp.org) regularly unearth information of general interest for the development community. By making it available as a public good, the Accelerator Labs can contribute to the vision of an **open ecosystem of development**, where practitioners, donors and researchers share information, debate what insights and strategies that information supports, and collaborate towards advancing the Sustainable Development Goals. Once produced, public goods are near-costless to acquire; high-quality public information, then, is a very efficient enabler for development programmes.

The catch is that making high-quality data available requires curation, and curation is hard work. In this document, we discuss the case in which a dataset, collected as a byproduct of a development project or an Accelerator Lab's learning cycle, is published so that others can reuse it. In choosing how to do it we are guided by the [FAIR principles](https://www.go-fair.org): data should be Findable, Accessible, Interoperable and Reusable.

## About Frictionless <a class="anchor" id="about.frictionless"></a>

We use Frictionless because it is a widely accepted open standard for tabular data, and it was developed by and for the open data community. This means it encodes the ethics of documenting in the best interest of people you have not met yet, but who might be interested in re-using your data.

The two main parts of the Frictionless project are: 

* An open standard to document datasets.
* Software that makes the work of cleaning, documenting and publishing open data much easier, notably the [Frictionless framework](https://v4.framework.frictionlessdata.io/).

The standard is called Data Package. The idea is this:

* All the data files you want to document as a coherent whole (normally `.csv` files) are in the same directory. For example, you might have a file containing the text of interviews and another one containing demographic characteristics of the people you interviewed. They contain different information, but are part of the same project. These files are called *resources*.
* In that same directory, you put a metadata file with a standardized name (`datapackage.json`) that describes each resource, and each variable in each resource. The formats accepted for the metadata files are JSON and YAML. For qualitative data we prefer YAML, as it's more human readable.
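
To make this layout concrete, here is a minimal sketch that writes a bare-bones `datapackage.json` by hand, using only Python's standard library. The package name and the file names `interviews.csv` and `demographics.csv` are hypothetical, standing in for the two resources of the example above. In practice you would let Frictionless generate this file, as in Step 2; the sketch only illustrates what the file contains.

```python
import json
import tempfile
from pathlib import Path


def build_descriptor(name, csv_files):
    """Return a minimal Data Package descriptor with one resource per CSV file."""
    return {
        "name": name,
        "resources": [
            {"name": Path(f).stem, "path": f, "format": "csv"}
            for f in csv_files
        ],
    }


# Hypothetical names, for illustration only.
descriptor = build_descriptor("my-project", ["interviews.csv", "demographics.csv"])

# The descriptor lives in the same directory as the data files.
# A temporary directory stands in for that directory here.
package_dir = Path(tempfile.mkdtemp())
descriptor_path = package_dir / "datapackage.json"
descriptor_path.write_text(json.dumps(descriptor, indent=4), encoding="utf-8")

print(descriptor_path.read_text(encoding="utf-8"))
```

Each entry in `resources` points to one data file; the per-variable documentation discussed in Steps 4 and 5 goes into a `schema` inside each resource.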

There are two ways to use the Frictionless framework: the Command Line Interface (CLI) or the Python library. The documentation is more complete for the Python library, which is what we use here. What follows assumes you have Python 3, and have installed the Frictionless library. The framework's [documentation](https://framework.frictionlessdata.io/) contains an installation guide in case you don't have these components on your computer yet. 


## Step 1: import the library <a class="anchor" id="step-1"></a>

The `import frictionless` command imports the library. The rest of the code makes sure that Python finds the library's files; the paths are specific to my machine, so adjust them to match yours.

In [None]:
import sys

# Machine-specific search paths: replace them with your own, or drop this
# block if `import frictionless` already works in your environment.
paths = ['', '/Users/johndoe/Documents', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python39.zip', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload', '/Users/albertocottica/Library/Python/3.9/lib/python/site-packages', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages']
for p in paths:
    if p not in sys.path:  # skip paths that Python already searches
        sys.path.append(p)

import frictionless as fl
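
Before editing any paths, it can help to check whether Python already finds the library. The following is a small sketch using only the standard library; the helper name `is_installed` is my own, and the check assumes the package was installed under the distribution name `frictionless`.

```python
import importlib.metadata
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("frictionless"):
    try:
        print("Frictionless version:", importlib.metadata.version("frictionless"))
    except importlib.metadata.PackageNotFoundError:
        print("Frictionless is importable, but its version could not be read.")
else:
    print("Frictionless not found; see the installation guide in its documentation.")
```

If the library is found, the path manipulation above is probably unnecessary on your machine.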


## Step 2: import the data and create the package <a class="anchor" id="step-2"></a>

Import your data file into Python. If you do not have one ready, you are welcome to download the CSV file from [this repository](https://zenodo.org/doi/10.5281/zenodo.7515227).

In [None]:
dirPath = './' # a relative path avoids the "unsafe path" error: https://specs.frictionlessdata.io/data-resource/#data-location
filename = dirPath + 'data/' + 'small_businesses_digitalization_survey.csv'
resource = fl.Resource(filename)  # the data file as a Frictionless resource
package = fl.describe(filename, type = 'package')  # infer the metadata of a package containing the file
# Save the inferred metadata in both formats
package.to_yaml(dirPath + 'datapackage.yaml')
package.to_json(dirPath + 'datapackage.json')

## Step 3: validate <a class="anchor" id="step-3"></a>

Check that the package and its resources have no errors or missing values. The `validate` command generates a `report` object, which contains lists of errors and warnings.

In [None]:
report = fl.validate(dirPath + 'datapackage.json')
print(report.to_summary())  # to_summary is a method, so it must be called

## Step 4: check the metadata document <a class="anchor" id="step-4"></a>

In Step 2 we used Frictionless to generate the metadata file automatically (for pedagogic purposes we generated it in two versions, one in YAML and one in JSON). Frictionless infers the name and type of each variable, but you will need to add the descriptions manually. In our case, we refer to the questionnaire to do that.

Pay close attention to the types of variables. If Frictionless detects a variable that contains `Yes` in some cells and `No` in some others, it will most likely treat it as a string variable, when it is really a Boolean. Similarly, values like `24/09/2015` are likely to be read as strings, when they are really dates. In our case, many answers have been encoded as `0` or `1` and are treated as integers, but are really Booleans, so we change the metadata file to take that into account. 

Assigning the correct type to each variable helps the people who will use your data in the future: they will be able to import the data with the correct types already assigned, avoiding manual conversions.
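
As a rough aid for spotting such cases, the sketch below scans a CSV file and lists the columns whose values look like an encoded Boolean. It uses only the standard library; the function name, the list of encodings and the sample data are all my own assumptions, not part of Frictionless or of the survey dataset.

```python
import csv
import io

# Value sets that usually encode a Boolean (an assumption; extend as needed).
BOOLEAN_ENCODINGS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]


def boolean_candidates(csv_file):
    """Return the names of columns whose non-empty values fit a Boolean encoding."""
    reader = csv.DictReader(csv_file)
    seen = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name in seen:
            value = (row.get(name) or "").strip().lower()
            if value:  # empty cells are ignored
                seen[name].add(value)
    return [
        name for name, values in seen.items()
        if values and any(values <= encoding for encoding in BOOLEAN_ENCODINGS)
    ]


# Made-up sample data, not taken from the survey.
sample = io.StringIO(
    "respondent,has_website,uses_mobile_money,employees\n"
    "a,Yes,1,3\n"
    "b,No,0,12\n"
    "c,Yes,,1\n"
)
print(boolean_candidates(sample))  # → ['has_website', 'uses_mobile_money']
```

The result is only a list of candidates: a numeric column that happens to contain nothing but `0` and `1` would be flagged too, so check each one against the questionnaire.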

In [None]:
packagename = dirPath + 'data/' + 'small_businesses_digitalization_survey.csv'
print(packagename)
# Infer the metadata again and print it as YAML for inspection
package = fl.describe(packagename, trusted = True)
print(package.to_yaml())

## Step 5: manually edit the YAML file and convert to JSON <a class="anchor" id="step-5"></a>

Once you have created the YAML file, go through it paying attention to miscategorized variables, as above. This is also where you can (and should!) add human-readable descriptions of your variables. The variables are listed in the `schema`, and each one should have a `description` field to help the people who will reuse your data to understand them better. My example dataset's `schema` looks like this:

```
    "schema": {
        "fields": [
            {
                "type": "string",
                "name": "_uuid",
                "description": "Unique identifier."
            },
            {
                "type": "date",
                "name": "start",
                "description": "The survey's start date in the country."
            },
            {
                "type": "date",
                "name": "end",
                "description": "The survey's end date in the country."
            },
            ... many more variables...
            ]
        }
```


This process can be automated to some degree. For example, I like to use Pandas or OpenRefine to repair miscategorizations that affect many variables in the same way, such as all the 0-1 variables that were categorized as numeric but are in fact Boolean.
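
The same kind of bulk repair can also be applied directly to the metadata file. Below is a minimal sketch with the standard library that changes the type of a list of fields in one pass; the function name `retype_fields` and the field names are hypothetical, and the descriptor is a made-up stand-in for the one generated in Step 2.

```python
import json


def retype_fields(descriptor, field_names, new_type="boolean"):
    """Set the type of the named schema fields in place; return the names changed."""
    changed = []
    for field in descriptor["schema"]["fields"]:
        if field["name"] in field_names:
            field["type"] = new_type
            changed.append(field["name"])
    return changed


# Made-up resource descriptor; the 0-1 field names are hypothetical.
resource = {
    "name": "example",
    "schema": {
        "fields": [
            {"name": "_uuid", "type": "string"},
            {"name": "has_website", "type": "integer"},
            {"name": "uses_mobile_money", "type": "integer"},
        ]
    },
}

changed = retype_fields(resource, {"has_website", "uses_mobile_money"})
print(changed)
print(json.dumps(resource["schema"], indent=4))
```

In the Table Schema defaults, `1` and `0` are among the values read as true and false, so retyping 0-1 variables should not require touching the data. A `Yes`/`No` column would additionally need `trueValues` and `falseValues` in its field descriptor. Either way, validate again after editing, as in Step 3.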

In addition, and perhaps most important of all, you should pay attention to the metadata that pertain to the entire dataset, including its title, author(s), human-readable description and, of course, the license, without which you cannot speak of open data. You can do that via Frictionless, using the `Package` and `Resource` classes, but since there is not much scope for automation here (you need to input description, authors etc. more or less manually) I normally just do it with a code editor, and use Frictionless to validate the metadata file. The metadata must follow the [Data Package specifications](https://specs.frictionlessdata.io/tabular-data-package).


```
{
    "path": "./small_businesses_digitalization_survey.csv",
    "name": "small_businesses_digitalization_survey",
    "title": "Digitalization of informal businesses in the global South.",
    "description": "Results of a survey on the use of digital tools by informal businesses in the global South in 2022. We uploaded the questionnaire onto a digital surveys platform. We then reached out through UNDP\u2019s network of informal or small businesses in 19 countries, inviting them to complete the questionnaire and spread awareness about it.\u00a0This implies that respondents are in no way a random sample of the target populations; this choice was made in the interest of speed and cost-effectiveness. We obviously claim no representativity, though we believe that some of our results are strong enough to be considered, to a first approximation, valid as a big picture.\u00a0",
    "author": "UNDP Accelerator Labs",
    "license": "CC-BY 4.0",
    "profile": "tabular-data-resource",
    "scheme": "file",
    "format": "csv",
    "hashing": "md5",
    "encoding": "utf-8",
    "schema": {
        "fields": [
            ... many more variables...
            ]
        }
   }
```
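
Since this part is edited by hand, it is easy to forget a key or a description. The sketch below is a simple completeness check written with the standard library; the list of expected keys reflects what this guide considers essential and is my own choice, not a requirement of the specification, and the descriptor is a shortened, deliberately incomplete version of the example above.

```python
import json

# Keys this guide treats as essential (an assumption, not part of the specification).
EXPECTED_KEYS = ("name", "title", "description", "license")


def missing_metadata(descriptor, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty in the descriptor."""
    missing = []
    for key in expected:
        value = descriptor.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def undocumented_fields(descriptor):
    """Return the names of schema fields that lack a non-empty description."""
    fields = descriptor.get("schema", {}).get("fields", [])
    return [f["name"] for f in fields if not (f.get("description") or "").strip()]


# Deliberately incomplete descriptor, for illustration only.
example = json.loads("""
{
    "name": "small_businesses_digitalization_survey",
    "title": "Digitalization of informal businesses in the global South.",
    "schema": {"fields": [
        {"name": "_uuid", "type": "string", "description": "Unique identifier."},
        {"name": "start", "type": "date"}
    ]}
}
""")

print(missing_metadata(example))     # → ['description', 'license']
print(undocumented_fields(example))  # → ['start']
```

To run the check on your own file, load it with `json.load` instead of the inline string. This complements the Frictionless validation of Step 3 and does not replace it.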

## Step 6: save onto a suitable repository <a class="anchor" id="step-6"></a>

Now that you have a fully documented dataset, it's time to upload it to a repository where other development practitioners and researchers can easily find it. Choosing a repository is in part a judgement call: there is no One Best Data Repository. Choose one in accordance with the FAIR principles and their intended consequences: maximum freedom of implementation for everyone, and preventing the dominance of a very small number of private and public parties. You might want to think about which community is most likely to reuse your data, and pick a repository well loved by that community.

For my own work, I tend to use [Zenodo](https://zenodo.org). My arguments:

* Operated by [CERN](https://home.web.cern.ch/about), it hosts CERN's own high-quality data on particle physics. It is a non-commercial, presumably long-lived environment: CERN was founded in 1954, and Zenodo itself launched in 2013.
* Assigns a [Digital Object Identifier](https://www.doi.org/) to each entry. DOI is itself an ISO standard, and it enables measuring the bibliometric impact of your data. 
* Supports versioning: you can update the same entry multiple times. Each version has its own DOI that you can point to; additionally, the entry has a DOI that automatically resolves to the latest version. This way, you can point, in the present, to a dataset that will be updated in the future.
* Supports [ORCID](https://orcid.org) identifiers for authors, making datasets more interoperable with the academic community.
* Automatically generates, in various formats, a bibliography entry for your datasets, for the convenience of the researchers and practitioners that reuse your data and need to cite it in their own articles or reports.
* Hosts not only data, but any kind of documents (papers, presentations, slide decks, video files...) with the same features. 
* Indexed by [OpenAIRE](https://www.openaire.eu/about), which is useful if your data come from a project funded by the European Union's research framework programmes. Additionally, it guarantees interoperability with the [European Open Science Cloud](https://eosc-portal.eu/).
* Provides pageview and download statistics, which offer a proxy of impact.

Congratulations! You have just made your data available as fully FAIR data.