# All About Data

This lecture attempts to give a broad overview of Earth and Environment Science data formats, exchange protocols, and best practices.

## What is (are) data?

Ultimately, as we know, all digital data are just 1s and 0s. Each 1 or 0 value is called a _bit_.
Bits are usually grouped into sets of 8, called _bytes_.
A byte can represent $2^8$, or 256, distinct values.
We can choose to _encode_ and _organize_ the bytes in many different ways to represent different types of information.

### Data Encoding

Encoding is the way that we map raw bytes to meaningful values.

#### Numerical Data

The most straightforward way to interpret a byte is as an integer.

In [8]:
value = 0b10101010  # 1 byte base-2 literal
value

170

However, many different numerical data types can be encoded as bytes.
The more bytes we use for each value, the more range or precision we get.

In [18]:
import numpy as np
print(np.dtype('i2'), np.dtype('i2').itemsize, "bytes")
print(np.dtype('f8'), np.dtype('f8').itemsize, "bytes")
# complex256 is not available on every platform (e.g. it is missing on Windows)
print(np.dtype('complex256'), np.dtype('complex256').itemsize, "bytes")

int16 2 bytes
float64 8 bytes
complex256 32 bytes
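
To make the idea of encoding concrete, the sketch below interprets the same four bytes in three different ways using Python's standard `struct` module. The byte values and the little-endian format codes are chosen purely for illustration.

```python
import struct

# Four raw bytes, chosen for illustration
raw = bytes([0x00, 0x00, 0x80, 0x3F])

# "<" means little-endian; i = 32-bit signed integer,
# f = 32-bit float, H = 16-bit unsigned integer
as_int32 = struct.unpack("<i", raw)[0]
as_float32 = struct.unpack("<f", raw)[0]
as_two_uint16 = struct.unpack("<2H", raw)

print(as_int32)       # 1065353216
print(as_float32)     # 1.0
print(as_two_uint16)  # (0, 16256)
```

The bytes never change; only the encoding we choose to apply to them does.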


#### Text Data

We can also encode text as bytes.

The simplest encoding is [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange).
ASCII is a 7-bit code, usually stored as one byte per character, and therefore the ASCII alphabet only contains 128 different characters.

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cf/USASCII_code_chart.png" width="30%" />

As computers became more powerful, and the computing community grew beyond the US and Western Europe, a more inclusive character encoding standard called [Unicode](https://en.wikipedia.org/wiki/Unicode) took hold.
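The most common Unicode encoding is [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
UTF-8 is a variable-width encoding: it uses between one and four bytes per character, and ASCII characters still take a single byte.
This web page and the Jupyter Notebook it was generated from both use UTF-8 encoding.

Fun fact: emojis are Unicode characters, so they can be encoded with UTF-8.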

In [25]:
"😀".encode()

b'\xf0\x9f\x98\x80'
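
UTF-8 needs a different number of bytes for different characters. The short sketch below encodes four sample characters and reports how many bytes each one takes. The characters are written as escape sequences so the example does not depend on how the file itself is encoded; the particular choice of characters is just for illustration.

```python
# Sample characters, written as escapes, mapped to a short description
samples = {
    "A": "Latin capital letter A",
    "\u00e9": "Latin small letter e with acute",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

# Number of bytes each character occupies once encoded as UTF-8
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {len(encoded)} byte(s) -> {encoded!r}")
```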

Single values will only take us so far.
To represent scientific data, we need to think about data _organization_.

### Tabular Data

A very common type of data is "tabular data".
We discussed it already in our Pandas lecture.
Tabular data consists of _rows_ and _columns_.
The columns usually have a name and a specific data type.
Each row is a distinct sample.

Here is an example of tabular data:

| Name | Mass | Diameter |
| -- | -- | -- |
| Mercury | 0.330 $\times 10^{24}$ kg | 4879 km |  
| Venus | 4.87 $\times 10^{24}$ kg | 12,104 km |  
| Earth | 5.97 $\times 10^{24}$ kg | 12,756 km | 

(Via https://nssdc.gsfc.nasa.gov/planetary/factsheet/)

The simplest and most common way to encode tabular data is in a text file as [CSV (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values).
CSV is readable by humans and computer programs.
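
As a rough illustration, the planet table above could be written as CSV and read back with Python's standard `csv` module. The column names (with units moved into the header) are an assumption made for this sketch, not a standard.

```python
import csv
import io

# The planet table above as CSV text; column names are illustrative
csv_text = """name,mass_1e24_kg,diameter_km
Mercury,0.330,4879
Venus,4.87,12104
Earth,5.97,12756
"""

# DictReader yields one dict per row; every value arrives as a string
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so numeric columns are converted by hand
planets = [
    {
        "name": row["name"],
        "mass_1e24_kg": float(row["mass_1e24_kg"]),
        "diameter_km": int(row["diameter_km"]),
    }
    for row in rows
]

largest = max(planets, key=lambda p: p["diameter_km"])
print(largest["name"], largest["diameter_km"], "km")
```

Note that the data types and units live only in our heads (or in the header text); this is one reason separate metadata matters for CSV files.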

For larger datasets, [Apache Parquet](https://parquet.apache.org/) is a good alternative.
Parquet files are not human readable, but they can be parsed by computers much more quickly and efficiently.
They also use compression to achieve a smaller file size compared to CSV.

Multiple related tabular datasets can be stored and queried in a [relational database](https://en.wikipedia.org/wiki/Relational_database).
Databases are very useful but beyond the scope of this class.

### Array Data

When we have numerical data that are organized into an N-dimensional rectangular grid of values, we are dealing with array data.

<img src="http://xarray.pydata.org/en/stable/_images/dataset-diagram.png" width="600px" />

(via http://xarray.pydata.org/en/stable/user-guide/data-structures.html)

In Python, we work with array data using NumPy and Xarray.
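
A minimal sketch of array data in NumPy follows. The grid dimensions (time, latitude, longitude) and the variable name are hypothetical; the point is only that shape, data type, and total size are all well defined for a rectangular array.

```python
import numpy as np

# Hypothetical grid: 12 time steps, 4 latitudes, 8 longitudes
temperature = np.zeros((12, 4, 8), dtype="f4")

print(temperature.ndim)    # number of dimensions
print(temperature.shape)   # size along each dimension
print(temperature.nbytes)  # total bytes = number of elements * itemsize

# Selecting the first time step gives a 2D (lat, lon) slice
first_step = temperature[0]
print(first_step.shape)
```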

Array data can be stored in the following standard formats:

- [Hierarchical Data Format (HDF5)](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) - Container for many arrays
- [Network Common Data Form (NetCDF)](https://www.unidata.ucar.edu/software/netcdf/) - Container for many arrays which conform to the [NetCDF data model](https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html)
- [Zarr](https://zarr.readthedocs.io/en/stable/) - New cloud-optimized format for array storage
- [TileDB Embedded](https://docs.tiledb.com/main/) - New cloud-optimized format for array storage


### "GIS" Data

#### Raster Data

#### Vector Data

### Graph Data

### Unstructured Data

JSON, etc.
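
As a small illustration of why JSON suits data that do not fit a table or a grid, the sketch below round-trips a nested record through the standard `json` module. The station record and its field names are entirely made up.

```python
import json

# Hypothetical station record with nested, variable-length fields
record = {
    "station": "example-station",
    "location": {"lat": 40.8, "lon": -73.9},
    "observations": [
        {"variable": "air_temperature", "value": 12.5, "units": "degC"},
        {"variable": "wind_speed", "value": 3.1, "units": "m/s"},
    ],
}

text = json.dumps(record, indent=2)  # Python objects -> JSON text
restored = json.loads(text)          # JSON text -> Python objects

print(text)
```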

### Metadata

Common metadata conventions:

- [Climate and Forecast (CF) Conventions](https://cfconventions.org/) - Commonly used with NetCDF data
- [Attribute Convention for Data Discovery (ACDD)](https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3)
- [Schema.org](https://schema.org/)

## How do programs access data?

### Local Files

Some programs require data to be on your local hard drive to read.

To download files to your local computer, we strongly recommend [Pooch](https://www.fatiando.org/pooch/latest/).
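
One reason to use a download helper is that it can check a file against a known checksum, so a corrupted or silently changed file is caught early. Pooch supports this through the `known_hash` argument of `pooch.retrieve`. The sketch below is not Pooch's implementation; it only illustrates the underlying check with the standard library, using a temporary file with made-up contents as a stand-in for a download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Made-up file contents standing in for a real download
contents = b"name,diameter_km\nEarth,12756\n"

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "planets.csv"
    data_file.write_bytes(contents)

    # In practice the expected hash is published alongside the data
    expected = hashlib.sha256(contents).hexdigest()
    actual = sha256_of_file(data_file)
    file_is_intact = actual == expected

print(file_is_intact)
```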

### Over a Network

A more modern and scalable way of loading data is to read it directly over the network into your program.

#### HTTP streams

#### OPeNDAP

#### Other APIs

### Cloud Object Storage

## Best practices for data sharing

FAIR

https://eos.org/editors-vox/enabling-findable-accessible-interoperable-and-reusable-data

https://www.force11.org/group/fairgroup/fairprinciples

### To be Findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.

F2. data are described with rich metadata.

F3. (meta)data are registered or indexed in a searchable resource.

F4. metadata specify the data identifier.

### To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol.

A1.1. the protocol is open, free, and universally implementable.

A1.2. the protocol allows for an authentication and authorization procedure, where necessary.

A2. metadata are accessible, even when the data are no longer available.

### To be Interoperable:
    
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles.

I3. (meta)data include qualified references to other (meta)data.


### To be Re-useable:
    
R1. (meta)data have a plurality of accurate and relevant attributes.

R1.1. (meta)data are released with a clear and accessible data usage license.

R1.2. (meta)data are associated with their provenance.

R1.3. (meta)data meet domain-relevant community standards.

## Persistent Identifiers (PI / PID)

Persistent identifiers are quasi-permanent unique identifiers that can be used to look up data, articles, books, etc.
They are a key aspect of making data _Findable_.
Digital persistent identifiers are usually HTTP URLs that are guaranteed to continue working for a long time (in contrast to general URLs, which could change or disappear at any time).

### Digital Object Identifiers

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/DOI_logo.svg/1200px-DOI_logo.svg.png" width="20%" />

In science, the [DOI system](https://www.doi.org/) has become a universally accepted way to identify and find digital scholarly objects.
A DOI consists of a string of characters. Prepending the URL `https://doi.org/` to a DOI will generate a web link that can be used to look up the object.

Publishers (of journal articles, books, data repositories, etc.) will typically generate (or "mint") a DOI for new content when it is published.
The publisher commits to making the DOI work "forever". 
That's why not just anyone can create a DOI; it's a commitment.

Here are some examples of DOIs. Click the links to see where they go.

| Product Type | DOI | DOI Resolver |
| -- | -- | -- |
| Article | `10.1109/MCSE.2021.3059437` | https://doi.org/10.1109/MCSE.2021.3059437 |
| Dataset | `10.6084/m9.figshare.3507758.v1` | https://doi.org/10.6084/m9.figshare.3507758.v1 |
| Software | `10.5281/zenodo.4821276` | https://doi.org/10.5281/zenodo.4821276 |
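
The rule for turning a DOI into a link is simple enough to sketch in a few lines. The helper names `doi_to_url` and `url_to_doi` are hypothetical, introduced only for this illustration.

```python
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Build a resolver link by prepending the DOI resolver URL."""
    return DOI_RESOLVER + doi.strip()


def url_to_doi(url):
    """Recover the DOI from a doi.org resolver link."""
    if not url.startswith(DOI_RESOLVER):
        raise ValueError(f"not a doi.org link: {url}")
    return url[len(DOI_RESOLVER):]


# DOIs taken from the table above
for doi in ["10.1109/MCSE.2021.3059437", "10.5281/zenodo.4821276"]:
    print(doi_to_url(doi))
```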

## How can YOU create FAIR data and code?

### What not to do

- Put the data / code on your personal website
- Put the data / code in Google Drive / Dropbox / etc.
- Put the data / code in GitHub

The problem with these solutions is that they are _not persistent_ and therefore not _findable_.


### Rough guide to FAIRly sharing data

This guide applies to _small_ data (< 10 GB). 
Sharing medium or large datasets is more difficult; clear solutions and best practices do not yet exist.


#### Step 1: Quality control the data and metadata

Before sharing a dataset, you should apply quality control to ensure there are no bad or incorrect values.
Choose a standard data format appropriate to the structure of your data, as reviewed above.
You also need to generate metadata for your data.
If your format supports embedded metadata (e.g. NetCDF, Zarr), you should provide it following one or more of the metadata conventions reviewed above (e.g. CF, ACDD).
If not (e.g. CSV files), you should provide a separate metadata file (e.g. `README.txt`) to accompany your data.
Once your data and metadata quality control is complete, export the data file(s) to a local folder.
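
What counts as a "bad" value depends on the dataset, but a basic check often looks for missing values and values outside a physically plausible range. The sketch below is one minimal way to do that; the function name, the sample temperatures, and the range limits are all hypothetical.

```python
import math


def find_bad_values(values, lower, upper):
    """Return indices of values that are NaN or outside [lower, upper]."""
    bad = []
    for i, v in enumerate(values):
        if math.isnan(v) or not (lower <= v <= upper):
            bad.append(i)
    return bad


# Hypothetical air temperatures in degC, including a NaN and a
# -9999.0 placeholder of the kind sometimes used for missing data
temperatures = [12.1, 13.4, float("nan"), -9999.0, 14.0]

bad_indices = find_bad_values(temperatures, lower=-90.0, upper=60.0)
print(bad_indices)  # [2, 3]
```

Flagged values still need a human decision: they may be errors to remove, fill values to document in the metadata, or real extremes to keep.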


#### Step 2: Upload the data to Zenodo

![Zenodo](https://about.zenodo.org/static/img/logos/zenodo-gradient-200.png)

Our recommended data repository is [Zenodo](https://zenodo.org/).

> Built and developed by researchers, to ensure that everyone can join in Open Science.
>
> The OpenAIRE project, in the vanguard of the open access and open data movements in Europe was commissioned by the EC to support their nascent Open Data policy by providing a catch-all repository for EC funded research. CERN, an OpenAIRE partner and pioneer in open source, open access and open data, provided this capability and Zenodo was launched in May 2013.
>
> In support of its research programme CERN has developed tools for Big Data management and extended Digital Library capabilities for Open Data. Through Zenodo these Big Science tools could be effectively shared with the long-tail of research.

Via https://about.zenodo.org/

First create a Zenodo account: https://zenodo.org/signup/
Make sure to link your Zenodo account to your GitHub and [ORCID](https://orcid.org/) accounts.

Then go to https://zenodo.org/deposit to deposit a new record.
Follow the instructions provided, and make sure to include as much metadata as you can.
Make sure to choose an open-access license such as Creative Commons.
This will make your data more Interoperable and Re-useable.

When all your data are uploaded and you have double-checked the metadata, you can click "Publish".
This will create a permanent archive for your data and generate a new DOI! 🎉

#### Step 3: Verify you can access the data

Use the [Pooch DOI downloader](https://www.fatiando.org/pooch/latest/api/generated/pooch.DOIDownloader.html) to download and open your data files from a Jupyter Notebook.

### Sharing Code

Putting code on GitHub is very convenient for collaboration.
But GitHub does not meet the FAIR requirements.
Therefore, to make your code truly FAIR, you need to deposit it in a permanent repository.
Again we recommend Zenodo.
Fortunately, Zenodo and GitHub integrate very well with each other.

The steps for archiving a GitHub repo in Zenodo are [very well documented](https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).

```{note}
Prior to archiving your repo, you need to [create a CITATION.cff file](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files).
```