Creating ASDF Files
===

This tutorial provides an overview of creating ASDF files using the Python library [asdf](https://pypi.org/project/asdf/).

In [None]:
import os
import numpy as np
import asdf

np.random.seed(42)

def print_file(fn):
    """
    A helper function to print out an ASDF file
    """
    with open(fn, "r", encoding="unicode_escape") as f:
        print(f.read())

# Introduction

Creating an ASDF file starts with creating an instance of the [AsdfFile](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile) class.

In [None]:
af = asdf.AsdfFile()

# 🌲 The "tree"
Data within an ASDF file is organized in a "tree" structure, an arbitrarily nested mapping of key/value pairs (think of this as a Python dict). This allows data to be hierarchically organized within the file.

For example, if you have some "data" values and some "meta" values describing the condition of the data, these can be organized within the tree under "data" and "meta" keys.

In [None]:
af.tree["meta"] = {"my": {"nested": "metadata"}}
af.tree["data"] = [1, 2, 3, 4]
print(af.tree)

For ease of use, the [AsdfFile](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile) instance can be used like a dictionary (removing the need to always access the `tree` attribute).

In [None]:
af["meta"]

# 🍁 Tree contents

ASDF supports many of the Python built-in types, and they largely match the basic types in [YAML](https://yaml.org/spec/1.1/).

| Python type | YAML type |
| --- | --- |
| `dict` | `mapping` |
| `list` | `sequence` |
| `str` | `string` |
| `float` | `float` |
| `int` | `int` |
| `None` | `null` |

# Exercise 1: Make an ASDF tree
Create an [AsdfFile](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile) instance and build a tree containing all of the above supported types.

# 💾 Saving to disk
[AsdfFile.write_to](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile.write_to) is the main method used to save ASDF files to disk.

In [None]:
af.write_to("my_tree.asdf")
os.path.exists("my_tree.asdf")

As ASDF files contain a plain-text header, simple trees can result in files that are human-readable.

In [None]:
af = asdf.AsdfFile()
af["trunk"] = {"branches": [0, 1, 2], "roots": ["a", "b", "c"]}
af.write_to("trunk.asdf")
print_file("trunk.asdf")  # print out contents of file

# Exercise 2: Save your tree
Recreate (if necessary) your custom tree containing all of the supported types and write it to an ASDF file. Open the file in a text editor and view the contents.

# 📋 Standard metadata
In the above example we didn't add the `asdf_library` and `history` keys that appear in the file. These are standard metadata keys added to every ASDF file. They record:

- `asdf_library`: Software library used to produce the file
- `history`: ASDF extensions used to produce the file (and optional user-added history entries)

We won't cover these in more depth. Please see the documentation for more details:
- [AsdfFile.add_history_entry](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile.add_history_entry)
- [asdf-1.0.0 schema](https://asdf-standard.readthedocs.io/en/1.1.1/generated/stsci.edu/asdf/core/asdf-1.0.0.html)

# 🪄 "Tagged" Types
For more complicated types, ASDF supports "tagged" objects (as does YAML). 

(YAML Tags are simple identifiers used to represent information about the type of native data structures.)

When asdf saves an object with a "tag", it can later recognize the tag and deserialize the "tagged" object to a more complicated type. To cover this topic we'll use `complex` numbers.

In [None]:
af = asdf.AsdfFile()
af["z"] = 1 + 1j
af.write_to("complex.asdf")
print_file("complex.asdf")

asdf stored our `complex` number `z` as a "tagged" string "(1+1j)" in the file using the `core/complex-1.0.0` tag. When we open this file, asdf will see the tag and recreate the `complex` number for us from the string.

In [None]:
af = asdf.open("complex.asdf")
type(af["z"])

The mapping between tags and objects is handled by the asdf extension API. Support for new objects can be added by pip installing packages (like [asdf-astropy](https://pypi.org/project/asdf-astropy/)) or users can create and register their own extensions.

We won't go into details about creating an extension here but please see the documentation if there are objects you would like to store in an ASDF file (that aren't already supported):
https://asdf.readthedocs.io/en/latest/asdf/extending/extensions.html

# Exercise 3: Tagged objects
Open one of the ASDF files created above. What is the type of the value stored under the "asdf_library" key in the tree?

# 🔢 N-Dimensional arrays
In addition to plain-text representations, ASDF files can contain binary data, which is often used to store arrays of numerical data. Binary data is efficient to read and write and avoids the loss of precision that can occur when numerical values are converted to and from text.
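
The precision point can be illustrated without asdf at all. This sketch uses only the standard library and assumes text written with a fixed number of digits (a full-precision `repr` would round-trip, so this shows the truncated case only):

```python
import struct

value = 0.1 + 0.2  # not exactly 0.3 in floating point

# Text with a fixed number of digits drops information.
text = f"{value:.6f}"
from_text = float(text)

# Eight raw bytes (an IEEE 754 double) reproduce the value exactly.
raw = struct.pack("<d", value)
from_binary = struct.unpack("<d", raw)[0]

print(from_text == value)    # False
print(from_binary == value)  # True
```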

Binary data is stored in "blocks" written after the ASDF tree. Objects in the tree may contain references to binary "blocks", the most common being [NDArrayType](https://asdf.readthedocs.io/en/latest/api/asdf.tags.core.NDArrayType.html#asdf.tags.core.NDArrayType), the class asdf uses for `numpy.ndarray` instances.

In [None]:
af = asdf.AsdfFile()
af["data"] = np.arange(10)
af.write_to("binary_data.asdf")
print_file("binary_data.asdf")

When we read this file back in we'll find an [NDArrayType](https://asdf.readthedocs.io/en/latest/api/asdf.tags.core.NDArrayType.html#asdf.tags.core.NDArrayType) instance for `data`.

In [None]:
af = asdf.open("binary_data.asdf")
type(af["data"])

This can mostly be treated the same as a `numpy.ndarray` but provides a few asdf-specific features. By default the array is "lazy loaded". This means [NDArrayType](https://asdf.readthedocs.io/en/latest/api/asdf.tags.core.NDArrayType.html#asdf.tags.core.NDArrayType) will only load the binary data from disk when the array contents are accessed (to reduce disk IO and improve performance).

In [None]:
print(af["data"])

# Exercise 4: Saving arrays
Generate an ASDF file with:
- an array ("array")
- a second reference to the same array ("array_reference")
- an additional different array ("other_array")

Examine the file contents. Open the file and examine "array" and "array_reference". Do they still refer to the same object? (hint: YAML supports anchors and aliases to have multiple references to the same object).

# 👀 Array views
Array views will be stored in ASDF files as views of an ASDF block. For a file with multiple views of the same array this can save space on disk.

In [None]:
af = asdf.AsdfFile()
af["base_array"] = np.zeros((100, 100))
af["view"] = af["base_array"][0]
af.write_to("shared_array.asdf")
print_file("shared_array.asdf")
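
What makes `view` a view is a NumPy property that can be checked before anything is written. This short sketch uses only NumPy to confirm that the two arrays share memory and to compare their sizes:

```python
import numpy as np

base_array = np.zeros((100, 100))
view = base_array[0]  # basic slicing returns a view, not a copy

# The view points into the base array's memory.
shares = np.shares_memory(base_array, view)
print(shares)  # True

# The view covers one row of float64 values; the base covers all rows.
print(view.nbytes)        # 800
print(base_array.nbytes)  # 80000
```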

# Exercise 5: Saving views
Save an ASDF file with a large array and a small view of the array. Open this file and change the view contents. This will require disabling memory mapping by passing `memmap=False` to [asdf.open](https://asdf.readthedocs.io/en/latest/api/asdf.open.html#asdf.open). What happens to the large array?

# 🗄 Storage options
For small arrays it is sometimes helpful to "inline" the array data. An "inline" array is stored as human-readable text in the YAML header instead of an ASDF block. This is controlled by calling [AsdfFile.set_array_storage](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile.set_array_storage).

In [None]:
af = asdf.AsdfFile()
af["small_array"] = np.arange(5)
af.set_array_storage(af["small_array"], "inline")
af.write_to("inline_array.asdf")
print_file("inline_array.asdf")

For large arrays it may be preferable to compress the ASDF block. Every installation of asdf supports the [bzip2](http://www.bzip.org/) and [zlib](http://www.zlib.net/) compression algorithms (more can be added via extensions). To tell asdf to compress an array, provide a supported 4-character code (such as "bzp2" for bzip2) to [AsdfFile.set_array_compression](https://asdf.readthedocs.io/en/latest/api/asdf.AsdfFile.html#asdf.AsdfFile.set_array_compression).

In [None]:
af = asdf.AsdfFile()
af["compressed_array"] = np.zeros((1000, 1000))
af.set_array_compression(af["compressed_array"], "bzp2")
af.write_to("compressed.asdf")
print_file("compressed.asdf")

In [None]:
af['compressed_array']

# Exercise 6: Array storage options
Generate an ASDF file with:
- one array compressed with "zlib"
- a second array that is uncompressed

What happens if you read and then rewrite the file to a new filename? Is the "zlib" compressed array still compressed in the new file?