Commit 8b3770c

[docs] - Update IO Manager docs to contain Asset IO management (#8337)

clairelin135 committed Jun 13, 2022
1 parent 227a336 commit 8b3770c
Showing 11 changed files with 143 additions and 140 deletions.
14 changes: 7 additions & 7 deletions docs/content/concepts/assets/asset-materializations.mdx
@@ -27,7 +27,7 @@ Dagster lets you track the interactions between ops, outputs, and assets over time
There are two general patterns for dealing with assets when using Dagster:

- Put the logic to write/store assets inside the body of an op.
-- Focus the op purely on business logic, and delegate the logic to write/store assets to an [IOManager](/concepts/io-management/io-managers).
+- Focus the op purely on business logic, and delegate the logic to write/store assets to an [IO manager](/concepts/io-management/io-managers).

Regardless of which pattern you are using, <PyObject module="dagster" object="AssetMaterialization" /> events are used to communicate to Dagster that a materialization has occurred. You can create these events either by explicitly logging them at runtime, or (using an experimental interface), have Dagster automatically generate them by defining that a given op output corresponds to a given <PyObject module="dagster" object="AssetKey" />.

@@ -81,7 +81,7 @@ width={3808}
height={2414}
/>

-### Logging an AssetMaterialization from an IOManager
+### Logging an AssetMaterialization from an IO Manager

To record that an <PyObject object="IOManager"/> has mutated or created an asset, we can log an <PyObject module="dagster" object="AssetMaterialization" /> event from its `handle_output` method. We do this via the method <PyObject object="OutputContext" method="log_event" />.

@@ -138,7 +138,7 @@ def my_metadata_materialization_op(context):
```python
    # ...
    return remote_storage_path
```

-#### Example: IOManager
+#### Example: IO Manager

```python file=concepts/assets/materialization_io_managers.py startafter=start_marker_1 endbefore=end_marker_1
from dagster import AssetMaterialization, IOManager
# ...
```
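
The collapsed example follows this general pattern. Here is a minimal sketch, assuming a hypothetical `PandasCsvIOManager` that stores DataFrames as CSV files under a made-up `my_base_dir` path; the names and storage layout are illustrative, not the exact code from the repository:

```python
import os

import pandas as pd

from dagster import AssetKey, AssetMaterialization, IOManager


class PandasCsvIOManager(IOManager):
    """Sketch of an IO manager that logs an AssetMaterialization on write."""

    def _get_path(self, step_key, name):
        # Hypothetical storage layout: one CSV per step output.
        return os.path.join("my_base_dir", step_key, name)

    def load_input(self, context):
        file_path = self._get_path(context.upstream_output.step_key, context.upstream_output.name)
        return pd.read_csv(file_path)

    def handle_output(self, context, obj):
        file_path = self._get_path(context.step_key, context.name)
        # Write the asset to persistent storage.
        obj.to_csv(file_path)
        # Record that a materialization occurred.
        context.log_event(
            AssetMaterialization(
                asset_key=AssetKey(file_path),
                description="Persisted result to storage.",
            )
        )
```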
@@ -205,7 +205,7 @@ def my_asset_op(context):

In this case, the <PyObject object="AssetMaterialization" /> and the <PyObject object="Output" /> events both correspond to the same data, the dataframe that we have created. With this in mind, we can simplify the above code, and provide useful information to the Dagster framework, by making this link between the `my_dataset` asset and the output of this op explicit.

-Just as there are two places in which you can log runtime <PyObject object="AssetMaterialization" /> events (within an op body and within an IOManager), we provide two different interfaces for linking an op output to an asset. Regardless of which you choose, every time the op runs and logs that output, an <PyObject object="AssetMaterialization" /> event will automatically be created to record this information.
+Just as there are two places in which you can log runtime <PyObject object="AssetMaterialization" /> events (within an op body and within an IO manager), we provide two different interfaces for linking an op output to an asset. Regardless of which you choose, every time the op runs and logs that output, an <PyObject object="AssetMaterialization" /> event will automatically be created to record this information.

If you use an <PyObject object="Output" /> event to yield your output and specify any metadata entries on it (see [Op Event Docs](/concepts/ops-jobs-graphs/op-events#attaching-metadata-to-outputs)), these entries will automatically be attached to the materialization event for this asset.

@@ -228,7 +228,7 @@ def my_constant_asset_op(context):

If you've defined a custom <PyObject object="IOManager"/> to handle storing your op's outputs, the <PyObject object="IOManager"/> will likely be the most natural place to define which asset a particular output will be written to. To do this, you can implement the `get_output_asset_key` function on your <PyObject object="IOManager"/>.

-Similar to the above interface, this function takes an <PyObject object="OutputContext"/> and returns an <PyObject object="AssetKey"/>. The following example functions nearly identically to `PandasCsvIOManagerWithMetadata` from the [runtime example](/concepts/assets/asset-materializations#example-iomanager) above.
+Similar to the above interface, this function takes an <PyObject object="OutputContext"/> and returns an <PyObject object="AssetKey"/>. The following example functions nearly identically to `PandasCsvIOManagerWithMetadata` from the [runtime example](/concepts/assets/asset-materializations#example-io-manager) above.

```python file=/concepts/assets/materialization_io_managers.py startafter=start_asset_def endbefore=end_asset_def
from dagster import AssetKey, IOManager, MetadataEntry
# ...
```
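
The collapsed example resembles the following sketch, which reuses the same hypothetical CSV storage layout as above; the row-count metadata entry is also an illustrative assumption:

```python
import os

import pandas as pd

from dagster import AssetKey, IOManager, MetadataEntry


class PandasCsvIOManagerWithOutputAsset(IOManager):
    def _get_path(self, step_key, name):
        # Hypothetical storage layout: one CSV per step output.
        return os.path.join("my_base_dir", step_key, name)

    def load_input(self, context):
        file_path = self._get_path(context.upstream_output.step_key, context.upstream_output.name)
        return pd.read_csv(file_path)

    def handle_output(self, context, obj):
        file_path = self._get_path(context.step_key, context.name)
        obj.to_csv(file_path)
        # Metadata yielded here is attached to the generated AssetMaterialization.
        yield MetadataEntry.int(obj.shape[0], label="number of rows")

    def get_output_asset_key(self, context):
        # Link this output to an asset key derived from its storage path.
        return AssetKey(self._get_path(context.step_key, context.name))
```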
Expand All @@ -254,11 +254,11 @@ class PandasCsvIOManagerWithOutputAsset(IOManager):

When an output is linked to an asset in this way, the generated <PyObject object="AssetMaterialization" /> event will contain any <PyObject object="MetadataEntry" /> information yielded from the `handle_output` function (in addition to all of the `metadata` specified on the corresponding <PyObject object="Output" /> event).

-See the [IOManager docs](/concepts/io-management/io-managers#yielding-metadata-from-an-iomanager) for more information on yielding these entries from an IOManager.
+See the [IO manager docs](/concepts/io-management/io-managers#yielding-metadata-from-an-io-manager) for more information on yielding these entries from an IO manager.

#### Specifying partitions for an output-linked asset

-If you are already specifying a `get_output_asset_key` function on your <PyObject object="IOManager" />, you can optionally specify a set of partitions that this manager will be updating or creating by also defining a `get_output_asset_partitions` function. If you do this, an <PyObject object="AssetMaterialization" /> will be created for each of the specified partitions. One useful pattern for passing this partition information (which will likely vary each run) to the manager is to specify the set of partitions on the configuration of the output. You can do this by providing [per-output configuration](/concepts/io-management/io-managers#providing-per-output-config-to-an-io-manager) on the IOManager.
+If you are already specifying a `get_output_asset_key` function on your <PyObject object="IOManager" />, you can optionally specify a set of partitions that this manager will be updating or creating by also defining a `get_output_asset_partitions` function. If you do this, an <PyObject object="AssetMaterialization" /> will be created for each of the specified partitions. One useful pattern for passing this partition information (which will likely vary each run) to the manager is to specify the set of partitions on the configuration of the output. You can do this by providing [per-output configuration](/concepts/io-management/io-managers#providing-per-output-config-to-an-io-manager) on the IO manager.

Then, you can calculate the asset partitions that a particular output will correspond to by reading this output configuration in `get_output_asset_partitions`:
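
A sketch of what this might look like, assuming the output's config carries a list of partition names under a `"partitions"` key; the config shape and asset key are illustrative assumptions:

```python
from dagster import AssetKey, IOManager


class PandasCsvIOManagerWithOutputAssetPartitions(IOManager):
    def load_input(self, context):
        ...

    def handle_output(self, context, obj):
        ...

    def get_output_asset_key(self, context):
        return AssetKey("my_dataset")

    def get_output_asset_partitions(self, context):
        # Read the partition names from this output's run config,
        # supplied via per-output config.
        return set(context.config["partitions"])
```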

18 changes: 0 additions & 18 deletions docs/content/concepts/assets/multi-assets.mdx
@@ -47,24 +47,6 @@ def my_function():

By default, the names of the outputs will be used to form the asset keys of the multi-asset. The decorated function will be used to create the op for these assets and must emit an output for each of them. In this case, we can emit multiple outputs by returning a tuple of values, one for each asset.

### Customizing how assets are materialized with IO managers

As with regular assets, you can customize how each asset is materialized with [IO managers](/concepts/io-management/io-managers). To do this, specify an `io_manager_key` on each output of the multi-asset.

```python file=/concepts/assets/multi_assets.py startafter=start_io_manager_multi_asset endbefore=end_io_manager_multi_asset
from dagster import Out, multi_asset


@multi_asset(
outs={
"s3_asset": Out(io_manager_key="s3_io_manager"),
"adls_asset": Out(io_manager_key="adls2_io_manager"),
},
)
def my_assets():
return "store_me_on_s3", "store_me_on_adls2"
```

### Subsetting multi-assets

By default, it is assumed that the computation inside of a multi-asset will always produce the contents of all of the associated assets. This means that attempting to execute a set of assets that produces some, but not all, of the assets defined by a given multi-asset will result in an error.
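
For example, a multi-asset can opt into subsetting along the lines of the sketch below, which assumes the `can_subset` flag, optional (`is_required=False`) outputs, and the context's `selected_output_names` property:

```python
from dagster import Out, Output, multi_asset


@multi_asset(
    outs={
        "asset_one": Out(is_required=False),
        "asset_two": Out(is_required=False),
    },
    can_subset=True,
)
def split_assets(context):
    # Only produce the assets that were selected for this run.
    if "asset_one" in context.selected_output_names:
        yield Output(1, output_name="asset_one")
    if "asset_two" in context.selected_output_names:
        yield Output(2, output_name="asset_two")
```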
95 changes: 1 addition & 94 deletions docs/content/concepts/assets/software-defined-assets.mdx
@@ -17,7 +17,7 @@ A software-defined asset includes the following:

**Note**: A crucial distinction between software-defined assets and [ops](/concepts/ops-jobs-graphs/ops) is that software-defined assets know about their dependencies, while ops do not. Ops aren't connected to dependencies until they're placed inside a [graph](/concepts/ops-jobs-graphs/jobs-graphs).

-**Materializing** an asset is the act of running its op and saving the results to persistent storage. You can initiate materializations from [Dagit](/concepts/dagit/dagit) or by invoking Python APIs. By default, assets are materialized to pickle files on your local filesystem, but materialization behavior is [fully customizable](#customizing-how-assets-are-materialized-with-io-managers). It's possible to materialize an asset in multiple storage environments, such as production and staging.
+**Materializing** an asset is the act of running its op and saving the results to persistent storage. You can initiate materializations from [Dagit](/concepts/dagit/dagit) or by invoking Python APIs. By default, assets are materialized to pickle files on your local filesystem, but materialization behavior is fully customizable, using [IO managers](/concepts/io-management/io-managers#applying-io-managers-to-assets). It's possible to materialize an asset in multiple storage environments, such as production and staging.

---

@@ -348,99 +348,6 @@ There are a couple ways in Dagit to launch a run that materializes assets:
- Navigate to the Asset Details Page for the asset and click the "Materialize" button in the upper right corner.
- In the graph view of the Asset Catalog page, click the "Materialize" button in the upper right corner. You can click on assets to collect a subset to materialize.

## Customizing how assets are materialized with IO managers

By default, materializing an asset will pickle it to a local file named "my_asset" in a temporary directory. You can specify this directory by providing a value for the `local_artifact_storage` property in your dagster.yaml file.
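
For reference, a minimal dagster.yaml sketch for setting that directory (the `base_dir` value is an illustrative assumption):

```yaml
local_artifact_storage:
  module: dagster.core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir: /path/to/storage/directory
```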

[IO managers](/concepts/io-management/io-managers) enable fully overriding this behavior and storing asset contents in any way you wish, e.g. writing them as tables in a database or as objects in a cloud object store. Dagster also provides built-in IO managers that pickle assets to AWS S3 (<PyObject module="dagster_aws.s3" object="s3_pickle_io_manager" />), Azure Blob Storage (<PyObject module="dagster_azure.adls2" object="adls2_pickle_io_manager" />), and GCS (<PyObject module="dagster_gcp.gcs" object="gcs_pickle_io_manager" />), or you can write your own.

To apply an IO manager to a set of assets, you can use <PyObject object="with_resources" />:

```python file=/concepts/assets/asset_io_manager.py startafter=start_marker endbefore=end_marker
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

from dagster import asset, with_resources


@asset
def upstream_asset():
return [1, 2, 3]


@asset
def downstream_asset(upstream_asset):
return upstream_asset + [4]


assets_with_io_manager = with_resources(
[upstream_asset, downstream_asset],
resource_defs={"io_manager": s3_pickle_io_manager, "s3": s3_resource},
)
```

This example also includes `"s3": s3_resource`, because the `s3_pickle_io_manager` depends on an s3 resource.

When `upstream_asset` is materialized, the value `[1, 2, 3]` will be pickled and stored in an object on S3. When `downstream_asset` is materialized, the value of `upstream_asset` will be read from S3 and unpickled, and `[1, 2, 3, 4]` will be pickled and stored in a different object on S3.

Different assets can have different IO managers:

```python file=/concepts/assets/asset_different_io_managers.py startafter=start_marker endbefore=end_marker
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

from dagster import asset, fs_io_manager, with_resources


@asset(io_manager_key="s3_io_manager")
def upstream_asset():
return [1, 2, 3]


@asset(io_manager_key="fs_io_manager")
def downstream_asset(upstream_asset):
return upstream_asset + [4]


assets_with_io_managers = with_resources(
[upstream_asset, downstream_asset],
resource_defs={
"s3_io_manager": s3_pickle_io_manager,
"s3": s3_resource,
"fs_io_manager": fs_io_manager,
},
)
```

When `upstream_asset` is materialized, the value `[1, 2, 3]` will be pickled and stored in an object on S3. When `downstream_asset` is materialized, the value of `upstream_asset` will be read from S3 and unpickled, and `[1, 2, 3, 4]` will be pickled and stored in a file on the local filesystem.

The same assets can be bound to different resources and IO managers in different environments. For example, for local development, you might want to store assets on your local filesystem, while in production you might want to store them in S3.

```python file=/concepts/assets/asset_io_manager_prod_local.py startafter=start_marker endbefore=end_marker
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

from dagster import asset, fs_io_manager, with_resources


@asset
def upstream_asset():
return [1, 2, 3]


@asset
def downstream_asset(upstream_asset):
return upstream_asset + [4]


prod_assets = with_resources(
[upstream_asset, downstream_asset],
resource_defs={"io_manager": s3_pickle_io_manager, "s3": s3_resource},
)

local_assets = with_resources(
[upstream_asset, downstream_asset],
resource_defs={"io_manager": fs_io_manager},
)
```

## Building jobs that materialize assets

You can define a job that materializes a fixed selection of assets each time it runs. Multiple jobs within the same repository can target overlapping sets of assets.
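
As a sketch, assuming the `define_asset_job` API from Dagster releases of this era, with illustrative names:

```python
from dagster import asset, define_asset_job, repository


@asset
def my_asset():
    return [1, 2, 3]


# A job that materializes only the selected assets each time it runs.
my_asset_job = define_asset_job("my_asset_job", selection="my_asset")


@repository
def my_repo():
    return [my_asset, my_asset_job]
```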
