[docs] Asset storage description in filesystem IO Manager docs (#8240)
clairelin135 committed Jun 8, 2022
1 parent 90fd674 commit db7fb76
Showing 7 changed files with 40 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/content/api/modules.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/searchindex.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/sections.json

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions docs/content/concepts/io-management/io-managers.mdx
@@ -37,7 +37,7 @@ Not all inputs depend on upstream outputs. The [Unconnected Inputs](/concepts/io
that are responsible for storing the output of an op and loading it as input to downstream
ops. For example, an IO Manager might store and load objects from files on a filesystem.

Each op output can have its own IOManager, or multiple op outputs can share an IOManager. The IOManager that's used for handling a particular op output is automatically used for loading it in downstream ops.
Each op output can have its own IO manager, or multiple op outputs can share an IO manager. The IO manager that's used for handling a particular op output is automatically used for loading it in downstream ops.

<Image
alt="two-io-managers"
@@ -50,9 +50,11 @@ height={1040}

This diagram shows a job with two IO managers, each of which is shared across a few inputs and outputs.

The default IOManager, <PyObject module="dagster" object="fs_io_manager" />, stores and retrieves values in the filesystem while pickling. If a job is invoked via <PyObject object="JobDefinition" method="execute_in_process" />, the default IOManager is switched to <PyObject module="dagster" object="mem_io_manager"/>, which stores outputs in memory. Dagster provides out-of-the-box IOManagers that pickle objects and save them. These are <PyObject module="dagster_aws.s3" object="s3_pickle_io_manager"/> , <PyObject module="dagster_azure.adls2" object="adls2_pickle_io_manager"/> , or <PyObject module="dagster_gcp.gcs" object="gcs_pickle_io_manager"/>.
The default IO manager, <PyObject module="dagster" object="fs_io_manager" />, stores and retrieves values in the filesystem while pickling. If a job is invoked via <PyObject object="JobDefinition" method="execute_in_process" />, the default IO manager is switched to <PyObject module="dagster" object="mem_io_manager"/>, which stores outputs in memory.

IOManagers are [resources](/concepts/resources), which means users can supply different IOManagers for the same op outputs in different situations. For example, you might use an in-memory IOManager for unit-testing a job and an S3IOManager in production.
Dagster provides out-of-the-box IO managers that pickle objects and save them: <PyObject module="dagster_aws.s3" object="s3_pickle_io_manager"/>, <PyObject module="dagster_azure.adls2" object="adls2_pickle_io_manager"/>, and <PyObject module="dagster_gcp.gcs" object="gcs_pickle_io_manager"/>. These IO managers, along with <PyObject module="dagster" object="fs_io_manager" />, store each op output at a unique path built from the run ID, step key, and output name. Assets, by contrast, are stored at a stable path derived from the asset key.
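To make the layout concrete, here is a dependency-free sketch of the kind of per-run path scheme described above; the directory names are illustrative assumptions, not Dagster's exact on-disk layout:

```python
import os

def op_output_path(base_dir, run_id, step_key, output_name):
    # Each op output lands at a path unique to this run, so reruns
    # never clobber earlier results. (Illustrative layout only.)
    return os.path.join(base_dir, run_id, step_key, output_name)

print(op_output_path("/my/base/path", "abc123", "op_1", "result"))
```

Because the run ID is part of the path, a fresh run writes to a fresh location; asset outputs drop the run ID, which is why later materializations overwrite earlier ones.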

IO managers are [resources](/concepts/resources), which means users can supply different IO managers for the same op outputs in different situations. For example, you might use an in-memory IO manager when unit-testing a job and an S3-backed IO manager in production.

---

@@ -87,7 +89,7 @@ def my_job():

Not all the outputs in a job should necessarily be stored the same way. Maybe some of the outputs should live on the filesystem so they can be inspected, and others can be transiently stored in memory.

To select the IOManager for a particular output, you can set an `io_manager_key` on <PyObject module="dagster" object="Out" />, and then refer to that `io_manager_key` when setting IO managers in your job. In this example, the output of `op_1` will go to `fs_io_manager` and the output of `op_2` will go to `s3_pickle_io_manager`.
To select the IO manager for a particular output, you can set an `io_manager_key` on <PyObject module="dagster" object="Out" />, and then refer to that `io_manager_key` when setting IO managers in your job. In this example, the output of `op_1` will go to `fs_io_manager` and the output of `op_2` will go to `s3_pickle_io_manager`.

```python file=/concepts/io_management/io_manager_per_output.py startafter=start_marker endbefore=end_marker
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource
```

@@ -118,7 +120,7 @@ def my_job():

## Defining an IO manager

If you have specific requirements for where and how your outputs should be stored and retrieved, you can define your own IOManager. This boils down to implementing two functions: one that stores outputs and one that loads inputs.
If you have specific requirements for where and how your outputs should be stored and retrieved, you can define your own IO manager. This boils down to implementing two functions: one that stores outputs and one that loads inputs.

To define an IO manager, use the <PyObject module="dagster" object="io_manager" displayText="@io_manager" /> decorator.
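Before reaching for the decorator, note that the contract itself boils down to two methods. As a dependency-free illustration of the idea (plain Python, not Dagster's actual `IOManager` base class):

```python
import os
import pickle
import tempfile

class PickleIOManagerSketch:
    """Stores outputs as pickle files keyed by name; loads them back on demand."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, key):
        return os.path.join(self.base_dir, key)

    def handle_output(self, key, obj):
        # The "store an output" half of the contract.
        with open(self._path(key), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, key):
        # The "load an input" half of the contract.
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

manager = PickleIOManagerSketch(tempfile.mkdtemp())
manager.handle_output("op_1_result", {"rows": 42})
print(manager.load_input("op_1_result"))  # {'rows': 42}
```

Dagster's real interface passes rich context objects instead of bare string keys, but the store/load split is the same.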

10 changes: 10 additions & 0 deletions python_modules/libraries/dagster-aws/dagster_aws/s3/io_manager.py
@@ -90,6 +90,16 @@ def s3_pickle_io_manager(init_context):
Serializes objects via pickling. Suitable for objects storage for distributed executors, so long
as each execution node has network connectivity and credentials for S3 and the backing bucket.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
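The path rule in this docstring can be reproduced with a few lines of plain Python; this is an illustration of the layout, not the manager's actual implementation:

```python
import os

def asset_path(base_dir, asset_key_components):
    # The final key component becomes the filename; earlier components
    # become parent directories under base_dir.
    return os.path.join(base_dir, *asset_key_components)

print(asset_path("/my/base/path", ["one", "two", "three"]))
```

For `AssetKey(["one", "two", "three"])` this yields a file named "three" under "/my/base/path/one/two/", matching the example above.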
Attach this resource definition to your job to make it available to your ops.
.. code-block:: python
@@ -116,6 +116,16 @@ def adls2_pickle_io_manager(init_context):
as each execution node has network connectivity and credentials for ADLS and the backing
container.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
Attach this resource definition to your job in order to make it available to all your ops:
.. code-block:: python
10 changes: 10 additions & 0 deletions python_modules/libraries/dagster-gcp/dagster_gcp/gcs/io_manager.py
@@ -89,6 +89,16 @@ def gcs_pickle_io_manager(init_context):
Serializes objects via pickling. Suitable for objects storage for distributed executors, so long
as each execution node has network connectivity and credentials for GCS and the backing bucket.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
Attach this resource definition to your job to make it available to your ops.
.. code-block:: python

1 comment on commit db7fb76

@vercel
Copy link

@vercel vercel bot commented on db7fb76 Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sign in to comment.