
Custom Code Injection #1

Open
pattonw opened this issue Sep 16, 2020 · 0 comments

It is unlikely that the default config files will be able to cover every desired use case for dacapo, so super users will need to be able to customize as much of the dacapo pipeline as they would like. However, we do want the following:

  1. Minimize repetition: avoid having to handle the master process separately from the worker processes.
  2. Maximize customization: it should be easy to replace any part of the dacapo process (reading data, the training pipeline, evaluation, post-processing, writing to the db, etc.). These should all be generic enough that users can replace any part.
  3. Provide a single, obvious way of overriding specific behavior.

Current state:
Custom local dacapo.PostProcessors are supported via config arguments: you give the path to your post_processor_module, that directory is added to sys.path, and your custom module can then be imported.
Pros:

  1. Allows you to include any custom dacapo.PostProcessor without having to write a custom run script.

Cons:

  1. Modifying the sys.path variable and depending on local files could get unwieldy, with naming conflicts when overriding many parts of the framework.
  2. Local files may not be available to workers; this might need extra work such as configuring mount directories.
  3. Sharing a project requires copying the python environment, the local code, and the configurations.
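To make the current mechanism concrete, here is a minimal sketch of the sys.path-based loading it describes. The function name, argument names, and the expectation that the class subclasses dacapo.PostProcessor are illustrative assumptions, not the actual dacapo config keys or API:

```python
import importlib
import sys


def load_post_processor(module_dir, module_name, class_name):
    """Hypothetical sketch of the current approach: the config supplies a
    directory that gets appended to sys.path, after which the user's
    module can be imported like any other installed module."""
    if module_dir not in sys.path:
        sys.path.append(module_dir)
    module = importlib.import_module(module_name)
    # The returned class would be expected to implement the
    # dacapo.PostProcessor interface.
    return getattr(module, class_name)
```

The naming-conflict con above follows directly from this sketch: if `module_name` shadows an installed package, the wrong module may be imported.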

Proposal #1:
Functional interface, something such as:

def run_all(
    data_configs,
    model_configs,
    post_processor_configs,
    ...,
    task_configs,
    train_iteration = None, # Optional generator that yields training batches
    post_processor = None, # Optional dacapo.PostProcessor
    loss = None, # Optional torch.nn.Module
    ...,
    data = None, # Optional dacapo.Data
):
    ...

Pros:

  1. Easy to replace any part that is exposed in run_all.
  2. The exact same script could run as master and as worker; it would then be the job of dacapo to figure out whether run_all was called by the master or by a worker (probably via an environment variable).
  3. The user has full control. It only depends on local files if the user wants it to, and customization could be very minimal, i.e. a single run.py script containing the custom code.

Cons:

  1. Custom code requires a custom run script: everyone who wants to do something similar (or every similar dacapo project) would need at least a copy of the same boilerplate run.py script.
  2. Confusing function names: every worker would call run_all, but then run only a single specific configuration.
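Pro 2 above hinges on master/worker dispatch inside run_all. A minimal sketch of how dacapo might do that via an environment variable; the variable name DACAPO_WORKER_TASK and the spawn/run return values are assumptions for illustration only:

```python
import os


def run_all(task_configs):
    """Hypothetical dispatch: the same run_all call behaves differently
    depending on whether dacapo launched this process as a worker."""
    worker_task = os.environ.get("DACAPO_WORKER_TASK")
    if worker_task is None:
        # Master: schedule one worker per configuration; each worker
        # would re-run this same script with DACAPO_WORKER_TASK set.
        return [("spawn", cfg) for cfg in task_configs]
    # Worker: run only the single configuration assigned to it.
    return [("run", task_configs[int(worker_task)])]
```

This is also where con 2 shows up: every worker calls run_all, yet each executes only one configuration.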

Proposal #2:
Plugin system: see the python packaging documentation on package plugin systems.
Pros:

  1. Sharing your environment and configurations is enough.
  2. Easier to enforce some structure on custom code, making backwards compatibility easier going forward.
  3. No custom script required. Could provide a command-line tool that handles more than just the default setup.

Cons:

  1. Higher barrier to user customization (could probably be alleviated by providing templates).
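The standard python mechanism for such plugins is entry points: plugin packages register classes in their packaging metadata, and the host discovers them at runtime. A sketch of how dacapo could discover third-party post-processors this way; the group name "dacapo.post_processors" is a hypothetical choice, not an existing dacapo convention:

```python
from importlib.metadata import entry_points


def discover_post_processors(group="dacapo.post_processors"):
    """Collect classes that installed plugin packages registered under
    the given entry-point group. Installing the plugin package into the
    environment is enough; no sys.path manipulation is needed."""
    eps = entry_points()
    # Python 3.10+ supports selecting by group; older versions return
    # a dict keyed by group name.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

This directly supports pro 1 (sharing the environment is enough) and pro 3 (a generic command-line tool can enumerate whatever plugins are installed).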
@pattonw pattonw mentioned this issue Sep 16, 2020
@pattonw pattonw self-assigned this Sep 22, 2020
pattonw added a commit that referenced this issue Mar 24, 2021
pattonw pushed a commit to e11bio/dacapo that referenced this issue Feb 14, 2024