[docs] Asset storage description in filesystem IO Manager docs (#8240)
clairelin135 committed Jun 8, 2022
1 parent 90fd674 commit db7fb76
Showing 7 changed files with 40 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/content/api/modules.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/searchindex.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/sections.json

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions docs/content/concepts/io-management/io-managers.mdx
@@ -37,7 +37,7 @@ Not all inputs depend on upstream outputs. The [Unconnected Inputs](/concepts/io
that are responsible for storing the output of an op and loading it as input to downstream
ops. For example, an IO Manager might store and load objects from files on a filesystem.

Each op output can have its own IOManager, or multiple op outputs can share an IOManager. The IOManager that's used for handling a particular op output is automatically used for loading it in downstream ops.
Each op output can have its own IO manager, or multiple op outputs can share an IO manager. The IO manager that's used for handling a particular op output is automatically used for loading it in downstream ops.

<Image
alt="two-io-managers"
@@ -50,9 +50,11 @@ height={1040}

This diagram shows a job with two IO managers, each of which is shared across a few inputs and outputs.

The default IOManager, <PyObject module="dagster" object="fs_io_manager" />, stores and retrieves values in the filesystem while pickling. If a job is invoked via <PyObject object="JobDefinition" method="execute_in_process" />, the default IOManager is switched to <PyObject module="dagster" object="mem_io_manager"/>, which stores outputs in memory. Dagster provides out-of-the-box IOManagers that pickle objects and save them. These are <PyObject module="dagster_aws.s3" object="s3_pickle_io_manager"/> , <PyObject module="dagster_azure.adls2" object="adls2_pickle_io_manager"/> , or <PyObject module="dagster_gcp.gcs" object="gcs_pickle_io_manager"/>.
The default IO manager, <PyObject module="dagster" object="fs_io_manager" />, stores and retrieves values in the filesystem while pickling. If a job is invoked via <PyObject object="JobDefinition" method="execute_in_process" />, the default IO manager is switched to <PyObject module="dagster" object="mem_io_manager"/>, which stores outputs in memory.

IOManagers are [resources](/concepts/resources), which means users can supply different IOManagers for the same op outputs in different situations. For example, you might use an in-memory IOManager for unit-testing a job and an S3IOManager in production.
Dagster provides out-of-the-box IO managers that pickle objects and save them: <PyObject module="dagster_aws.s3" object="s3_pickle_io_manager"/>, <PyObject module="dagster_azure.adls2" object="adls2_pickle_io_manager"/>, and <PyObject module="dagster_gcp.gcs" object="gcs_pickle_io_manager"/>. These IO managers, along with <PyObject module="dagster" object="fs_io_manager" />, store each op output at a unique path built from the run ID, step key, and output name. Assets, by contrast, are stored at a stable path derived from the asset key.
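To make the layout concrete, here is a dependency-free sketch of the kind of per-run path scheme described above; the directory names are illustrative assumptions, not Dagster's exact on-disk layout:

```python
import os

def op_output_path(base_dir, run_id, step_key, output_name):
    # Each op output lands at a path unique to this run, so reruns
    # never clobber earlier results. (Illustrative layout only.)
    return os.path.join(base_dir, run_id, step_key, output_name)

print(op_output_path("/my/base/path", "abc123", "op_1", "result"))
```

Because the run ID is part of the path, a fresh run writes to a fresh location; asset outputs drop the run ID, which is why later materializations overwrite earlier ones.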

IO managers are [resources](/concepts/resources), which means users can supply different IO managers for the same op outputs in different situations. For example, you might use an in-memory IO manager when unit-testing a job and an S3-backed IO manager in production.

---

@@ -87,7 +89,7 @@ def my_job():

Not all the outputs in a job should necessarily be stored the same way. Maybe some of the outputs should live on the filesystem so they can be inspected, and others can be transiently stored in memory.

To select the IOManager for a particular output, you can set an `io_manager_key` on <PyObject module="dagster" object="Out" />, and then refer to that `io_manager_key` when setting IO managers in your job. In this example, the output of `op_1` will go to `fs_io_manager` and the output of `op_2` will go to `s3_pickle_io_manager`.
To select the IO manager for a particular output, you can set an `io_manager_key` on <PyObject module="dagster" object="Out" />, and then refer to that `io_manager_key` when setting IO managers in your job. In this example, the output of `op_1` will go to `fs_io_manager` and the output of `op_2` will go to `s3_pickle_io_manager`.

```python file=/concepts/io_management/io_manager_per_output.py startafter=start_marker endbefore=end_marker
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource
```

@@ -118,7 +120,7 @@ def my_job():

## Defining an IO manager

If you have specific requirements for where and how your outputs should be stored and retrieved, you can define your own IOManager. This boils down to implementing two functions: one that stores outputs and one that loads inputs.
If you have specific requirements for where and how your outputs should be stored and retrieved, you can define your own IO manager. This boils down to implementing two functions: one that stores outputs and one that loads inputs.

To define an IO manager, use the <PyObject module="dagster" object="io_manager" displayText="@io_manager" /> decorator.
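Before reaching for the decorator, note that the contract itself boils down to two methods. As a dependency-free illustration of the idea (plain Python, not Dagster's actual `IOManager` base class):

```python
import os
import pickle
import tempfile

class PickleIOManagerSketch:
    """Stores outputs as pickle files keyed by name; loads them back on demand."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, key):
        return os.path.join(self.base_dir, key)

    def handle_output(self, key, obj):
        # The "store an output" half of the contract.
        with open(self._path(key), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, key):
        # The "load an input" half of the contract.
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

manager = PickleIOManagerSketch(tempfile.mkdtemp())
manager.handle_output("op_1_result", {"rows": 42})
print(manager.load_input("op_1_result"))  # {'rows': 42}
```

Dagster's real interface passes rich context objects instead of bare string keys, but the store/load split is the same.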

10 changes: 10 additions & 0 deletions python_modules/libraries/dagster-aws/dagster_aws/s3/io_manager.py
@@ -90,6 +90,16 @@ def s3_pickle_io_manager(init_context):
Serializes objects via pickling. Suitable for objects storage for distributed executors, so long
as each execution node has network connectivity and credentials for S3 and the backing bucket.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
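The path rule in this docstring can be reproduced with a few lines of plain Python; this is an illustration of the layout, not the manager's actual implementation:

```python
import os

def asset_path(base_dir, asset_key_components):
    # The final key component becomes the filename; earlier components
    # become parent directories under base_dir.
    return os.path.join(base_dir, *asset_key_components)

print(asset_path("/my/base/path", ["one", "two", "three"]))
```

For `AssetKey(["one", "two", "three"])` this yields a file named "three" under "/my/base/path/one/two/", matching the example above.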
Attach this resource definition to your job to make it available to your ops.
.. code-block:: python
@@ -116,6 +116,16 @@ def adls2_pickle_io_manager(init_context):
as each execution node has network connectivity and credentials for ADLS and the backing
container.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
Attach this resource definition to your job in order to make it available to all your ops:
.. code-block:: python
10 changes: 10 additions & 0 deletions python_modules/libraries/dagster-gcp/dagster_gcp/gcs/io_manager.py
@@ -89,6 +89,16 @@ def gcs_pickle_io_manager(init_context):
Serializes objects via pickling. Suitable for objects storage for distributed executors, so long
as each execution node has network connectivity and credentials for GCS and the backing bucket.
Assigns each op output to a unique filepath containing run ID, step key, and output name.
Assigns each asset to a single filesystem path, at "<base_dir>/<asset_key>". If the asset key
has multiple components, the final component is used as the name of the file, and the preceding
components as parent directories under the base_dir.
Subsequent materializations of an asset will overwrite previous materializations of that asset.
With a base directory of "/my/base/path", an asset with key
`AssetKey(["one", "two", "three"])` would be stored in a file called "three" in a directory
with path "/my/base/path/one/two/".
Attach this resource definition to your job to make it available to your ops.
.. code-block:: python

1 comment on commit db7fb76

@vercel
Copy link

@vercel vercel bot commented on db7fb76 Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sign in to comment.