# Processed data package demo

In [1]:
import bw_processing as bwp
import numpy as np

Let's create a temporary directory to play around in:

In [2]:
import tempfile, os

In [3]:
tempdir = tempfile.mkdtemp()
os.chdir(tempdir)

There are two main interfaces for `bw_processing`: `create_datapackage` and `load_datapackage`. They both return an instance of `bw_processing.datapackage.Datapackage`. Let's create a datapackage:

In [4]:
dp = bwp.create_datapackage()

And add the simplest kind of data, a vector of static data:

In [5]:
dp.add_persistent_vector(
    matrix="something",
    indices_array=np.arange(10).astype(bwp.INDICES_DTYPE),
    data_array=np.arange(10),
    name='first'
)

The default filesystem is in-memory, so we can't save or serialize this data, but we can look at it. The `Datapackage` class exposes three attributes: `data`, `metadata`, and `resources`. The last of these is just a shortcut to `dp.metadata['resources']`. This layout and terminology follow the Data Package standard by the Open Knowledge Foundation.
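
To make the "shortcut" relationship concrete, here is a minimal sketch using a hypothetical stand-in class. `MiniPackage` is our own illustration and is not how `bw_processing` is implemented; it only shows what it means for one attribute to expose the list stored inside another.

```python
# Hypothetical stand-in, NOT the bw_processing implementation.
class MiniPackage:
    def __init__(self):
        self.data = []                     # the arrays themselves
        self.metadata = {"resources": []}  # everything describing them

    @property
    def resources(self):
        # Shortcut: return the same list object that lives inside `metadata`
        return self.metadata["resources"]


pkg = MiniPackage()
pkg.metadata["resources"].append({"name": "first.indices"})

# Both access paths reach the same list, so changes are visible through either
print(pkg.resources is pkg.metadata["resources"])
print(pkg.resources[0]["name"])
```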

We added two arrays (the indices and the data values), so the `data` attribute is a list holding both:

In [6]:
dp.data

[array([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7),
        (8, 8), (9, 9)], dtype=[('row', '<i4'), ('col', '<i4')]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
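
The first array above is a NumPy structured array with `row` and `col` fields. As a rough sketch of how such index arrays can be built with plain NumPy, the snippet below defines its own dtype, `ASSUMED_INDICES_DTYPE`, which we assume mirrors the dtype printed above; it is not imported from `bw_processing`.

```python
import numpy as np

# Assumed to mirror the dtype printed above; the name is ours, not the library's.
ASSUMED_INDICES_DTYPE = np.dtype([("row", np.int32), ("col", np.int32)])

# Casting a plain integer array copies each value into every field,
# which is why the demo ends up with (0, 0), (1, 1), ...
diagonal = np.arange(3).astype(ASSUMED_INDICES_DTYPE)

# For off-diagonal positions, fill the two fields separately.
off_diagonal = np.zeros(3, dtype=ASSUMED_INDICES_DTYPE)
off_diagonal["row"] = [0, 0, 2]
off_diagonal["col"] = [1, 2, 0]

print(diagonal.tolist())
print(off_diagonal.tolist())
```

Filling the fields separately is the usual route when rows and columns differ, since a single cast can only produce identical row and column values.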

`metadata` describes everything except for the data itself. This includes information we provided (e.g. the matrix name), and defaults for values we didn't give (e.g. the license). `create_datapackage` will also generate information we don't provide, like the name or id:

In [7]:
dp.metadata

{'profile': 'data-package',
 'name': '4b73b70261ce4ce58399bd9db456d1d9',
 'id': '275bf88d8fb1464c8ee19998578aa200',
 'licenses': [{'name': 'ODC-PDDL-1.0',
   'path': 'http://opendatacommons.org/licenses/pddl/',
   'title': 'Open Data Commons Public Domain Dedication and License v1.0'}],
 'resources': [{'profile': 'data-resource',
   'format': 'npy',
   'mediatype': 'application/octet-stream',
   'name': 'first.indices',
   'matrix': 'something',
   'kind': 'indices',
   'path': 'first.indices.npy',
   'group': 'first',
   'category': 'vector',
   'nrows': 10},
  {'profile': 'data-resource',
   'format': 'npy',
   'mediatype': 'application/octet-stream',
   'name': 'first.data',
   'matrix': 'something',
   'kind': 'data',
   'path': 'first.data.npy',
   'group': 'first',
   'category': 'vector',
   'nrows': 10}],
 'created': '2021-08-27T13:48:02.933377Z',
 'combinatorial': False,
 'sequential': False,
 'seed': None,
 'sum_intra_duplicates': True,
 'sum_inter_duplicates': False}

The documentation covers these fields; most of them come from the Open Knowledge Foundation (OKF) standard. The `resources` attribute returns just the list of resource descriptions:

In [8]:
dp.resources

[{'profile': 'data-resource',
  'format': 'npy',
  'mediatype': 'application/octet-stream',
  'name': 'first.indices',
  'matrix': 'something',
  'kind': 'indices',
  'path': 'first.indices.npy',
  'group': 'first',
  'category': 'vector',
  'nrows': 10},
 {'profile': 'data-resource',
  'format': 'npy',
  'mediatype': 'application/octet-stream',
  'name': 'first.data',
  'matrix': 'something',
  'kind': 'data',
  'path': 'first.data.npy',
  'group': 'first',
  'category': 'vector',
  'nrows': 10}]
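
Each resource dictionary carries a `group` and a `kind`, which tie the indices and data arrays of one vector together. The sketch below uses trimmed copies of the two dictionaries shown above and plain Python (no `bw_processing` calls) to show one way to look up resource names by group; the `by_group` structure is our own illustration.

```python
from collections import defaultdict

# Trimmed copies of the two resource dictionaries shown above.
resources = [
    {"name": "first.indices", "kind": "indices", "group": "first", "matrix": "something"},
    {"name": "first.data", "kind": "data", "group": "first", "matrix": "something"},
]

# Map each group label to {kind: resource name}.
by_group = defaultdict(dict)
for resource in resources:
    by_group[resource["group"]][resource["kind"]] = resource["name"]

print(dict(by_group))
```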

Let's add a second vector to the same datapackage.

In [9]:
dp.add_persistent_vector(
    matrix="something else",
    indices_array=np.arange(10).astype(bwp.INDICES_DTYPE),
    data_array=np.random.random(size=(10,)),
    name='second'
)

Individual resources can be retrieved by name with `get_resource`:

In [31]:
for x, y in zip(dp.get_resource("first.indices"), dp.get_resource("first.data")):
    print(x, y)

[(0, 0) (1, 1) (2, 2) (3, 3) (4, 4) (5, 5) (6, 6) (7, 7) (8, 8) (9, 9)] [0 1 2 3 4 5 6 7 8 9]
{'profile': 'data-resource', 'format': 'npy', 'mediatype': 'application/octet-stream', 'name': 'first.indices', 'matrix': 'something', 'kind': 'indices', 'path': 'first.indices.npy', 'group': 'first', 'category': 'vector', 'nrows': 10} {'profile': 'data-resource', 'format': 'npy', 'mediatype': 'application/octet-stream', 'name': 'first.data', 'matrix': 'something', 'kind': 'data', 'path': 'first.data.npy', 'group': 'first', 'category': 'vector', 'nrows': 10}
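
The output above suggests that each `get_resource` call returns a pair of the array and its metadata dictionary, which is why the loop prints exactly two lines: first both arrays, then both dictionaries. The following sketch reproduces that pairing with stand-in tuples (the values are shortened placeholders, and the pair shape is an assumption drawn from the printed output), and shows tuple unpacking as an alternative.

```python
# Stand-in values; we assume each lookup returns a (data, metadata) pair,
# as the printed output suggests.
first_indices = ([(0, 0), (1, 1)], {"name": "first.indices"})
first_data = ([0, 1], {"name": "first.data"})

# zip pairs element 0 with element 0 (the arrays),
# then element 1 with element 1 (the metadata dictionaries).
rows = list(zip(first_indices, first_data))

# Tuple unpacking keeps each array next to its own metadata instead.
indices, indices_meta = first_indices
values, values_meta = first_data
print(len(rows), indices_meta["name"], values_meta["name"])
```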
