Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: dagstermill io use io manager for output notebook #4490

Merged
merged 5 commits into from
Sep 16, 2021

Conversation

yuhan
Copy link
Contributor

@yuhan yuhan commented Aug 11, 2021

Summary

depends on #4500

prior context: https://dagster.phacility.com/D6145

User-facing changes:

(1) Deprecating output_notebook in favor of output_notebook_name.

  • Using the old property output_notebook would require "file_manager" and result in a FileHandle output - no breaking change to existing users.
  • Using the new property output_notebook_name would result in a bytes output and require "output_notebook_io_manager", see details in (2)

(2) With the new param, it requires a dedicated IO manager for output notebook. When output_notebook or output_notebook_name is specified, "output_notebook_io_manager" is required as a resource.

  • we provide a built-in local_output_notebook_io_manager for handling local output notebook materialization.
  • New capabilities: Users now can customize their own "output_notebook_io_manager" by extending OutputNotebookIOManager. This enables use cases:
    • users who want better asset key control can override get_output_asset_key.
    • users who want precise IO control over the output notebook can customize their own handle_output and load_input, e.g. they can control the file name of the output notebook.
    • users who want to attach more meaningful metadata can yield EventMetadataEntry in their own handle_output method

this is also fixing #3380

Test Plan

bk

fan_in_notebook_pipeline_legacy uses output_notebook:
image

fan_in_notebook_pipeline uses output_notebook_name:
image

Migration guide

old (deprecated, won't break):

my_notebook_solid = define_dagstermill_solid(
    name="my_notebook_solid",
    notebook_path="hello.ipynb",
    output_notebook="notebook", # here
)

@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                "file_manager": local_file_manager, # here
                "io_manager": fs_io_manager,
            }
        )
    ]
)
def my_pipeline():
    my_notebook_solid()

new:

my_notebook_solid = define_dagstermill_solid(
    name="my_notebook_solid",
    notebook_path="hello.ipynb",
    output_notebook_name="notebook", # here
)

@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                "output_notebook_io_manager": local_output_notebook_io_manager, # here
                "io_manager": fs_io_manager,
            }
        )
    ]
)
def my_pipeline():
    my_notebook_solid()

@vercel
Copy link

vercel bot commented Aug 11, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/elementl/dagster/7wzLG346E5vpZVEziX5GmpDZovf5
✅ Preview: https://dagster-git-yuhan-dagstermill-2-elementl.vercel.app

@yuhan yuhan changed the base branch from yuhan/dagstermill-file-manager-test to yuhan/dagstermill-io-1 August 11, 2021 01:32
@sryza
Copy link
Contributor

sryza commented Aug 16, 2021

I know we talked about this in the past, but refresh my memory - what's a situation where it's helpful for the notebook to be an output? Do we know of users that use the notebook in downstream steps? What do they do with it?

@yuhan
Copy link
Contributor Author

yuhan commented Aug 16, 2021

@sryza here's some prior context about user request #2319

@sryza
Copy link
Contributor

sryza commented Aug 16, 2021

@sryza here's some prior context about user request #2319

Thanks - makes sense

@sryza
Copy link
Contributor

sryza commented Aug 16, 2021

In addition to the existing code, you will need to specify an "output_notebook_io_manager" resource, we provide backcompact_output_notebook_io_manager which depends on file manager for easier migration:

I'm not 100% following why we need to make this change. What's the downside of preserving existing behavior here?

Copy link
Contributor Author

@yuhan yuhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the existing code, you will need to specify an "output_notebook_io_manager" resource, we provide backcompact_output_notebook_io_manager which depends on file manager for easier migration:

I'm not 100% following why we need to make this change. What's the downside of preserving existing behavior here?

It's an easy path for us that we don't have to maintain the code path like [1] and [2]. But you're right this is not necessary - will explore preserving existing behavior!

@yuhan yuhan marked this pull request as draft August 16, 2021 23:30
Base automatically changed from yuhan/dagstermill-io-1 to master August 18, 2021 19:43
@yuhan yuhan changed the title RFC: dagstermill io 2/ use io manager for output notebook RFC: dagstermill io use io manager for output notebook Aug 18, 2021
@yuhan yuhan changed the base branch from master to yuhan/dagstermill-4 August 20, 2021 16:32
@yuhan yuhan marked this pull request as ready for review August 20, 2021 22:25
Base automatically changed from yuhan/dagstermill-4 to master August 24, 2021 02:00
@@ -355,14 +355,14 @@ def filepicklelist_resource(init_context):
resource_defs={
"list": ResourceDefinition(lambda _: []),
"io_manager": fs_io_manager,
"file_manager": local_file_manager,
"output_notebook_io_manager": local_output_notebook_io_manager,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could fs_io_manager be used here for the output_notebook_io_manager as well? If so, would it make sense for the default IO manager key for the notebook output to just be "io_manager"?

I don't have an opinion here either way, just exploring the options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no because 1) fs_io_manager pickles the content so users can't open the persisted file right away 2) fs_io_manager doesn't come with the get_output_asset_key so it won't yield asset materialization or provide handy asset key customization

OutputNotebookIOManager base class kinda shield the asset key handling from notebook users - when they customize their own io manager by extending the class, they don't have to worry about "how do i get the notebook show up in asset catalog."

@amarrella
Copy link
Contributor

nice! was just looking for something like this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants