Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate and build high quality command line tooling (master task) #10

Closed
2 of 6 tasks
schrockn opened this issue Jun 4, 2018 · 0 comments
Closed
2 of 6 tasks
Assignees

Comments

@schrockn
Copy link
Member

schrockn commented Jun 4, 2018

The command line tooling in dagster is a mess. This should be consolidated into a single, sensible command that provides a lot of value.

Initially I built this that you treated a file that contained the definition of a pipeline as a command line tool. This was done by conditionally invoking the pipeline if it was main. This was a mistake. We want to be able to invoke a pipeline from a traditional feeling command line tool.

embedded_cli.py is the code that drives this so-called "embedded" cli.

To see an example of this:

run

python3 pipelines.py in dagster/dagster_examples

That is the multi-pipeline case.

Or, for example

python3 pipeline.py in dagster/dagster_examples/qhp for example

That is the single-pipeline case

This should be sub-divided into a number of tasks:

  • File layout for making dagster pipeline tool aware of pipelines. This is essentially already implemented. dagster.cli.init.py:12-25
  • Listing pipelines and their description (also these lines, but should be improved)
  • A good way of printing out pipeline metadata. Currently print_pipeline in embedded_cli.py. It is not good. Invoked using the "meta" command in the single pipeline case.
  • Consume a yaml file that constructs the objects defined in dagster/config.py. We want to be able to drive a pipeline execution or materialization from a yaml file.
  • Be able to codegen a yaml file based on the declared definition of a pipeline.
  • Be able to verify the validity of an incoming yaml file based on a definition of a pipeline (not hi-pri)
@schrockn schrockn assigned schrockn and freiksenet and unassigned schrockn Jun 4, 2018
rexledesma added a commit that referenced this issue Nov 14, 2023
## Summary & Motivation
Following up on #17980, one
of the issues with memory usage in `dagster-dbt` is that if the
`manifest` is specified as a path, we potentially hold multiple copies
of the dictionary manifest in memory when creating our assets. This is
because we read from the path to create the dictionary manifest, and
then we hold a reference to the manifest dictionary in our asset
metadata. When the same path is read multiple times, multiple copies
ensure.

We can optimize this by implementing a cache when reading from the path,
so that the same dictionary manifest is returned even when a manifest
path is specified.

To further optimize this to hold no reference to the manifest
dictionary, we could instead hold a reference to the manifest path in
our asset metadata. However,

1. This only works if a manifest path is passed. If a manifest is
already specified as a dictionary, then we still have to hold it in
memory. This is the worst case.
2. (1) can be solved by preventing a manifest from being specified as a
dictionary, but that would be a breaking change.

## How I Tested These Changes
Use `tracemalloc`.

1. Create a `jaffle_dagster` project using `dagster-dbt project
scaffold` on `jaffle_shop`.
2. Update the code to read the manifest from a path in three separate
`@dbt_assets` definitions.
3. Run `PYTHONTRACEMALLOC=1 DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster
dev`

**Before Change**
```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.4 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
- #3: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-- dbt/dagster_dbt/dbt_manifest.py:17: 3262.9 KiB
-    manifest = cast(Mapping[str, Any], orjson.loads(manifest.read_bytes()))
#4: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 891.8 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53203 other: 38030.8 KiB
Total allocated size: 88177.9 KiB
```

**After Change**
```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.9 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10 /lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1057.4 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.6 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53195 other: 38013.7 KiB
Total allocated size: 85957.5 KiB
```

**Baseline (one `@dbt_assets` definition)**

```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.2 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1056.9 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.7 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53175 other: 37976.1 KiB
Total allocated size: 85918.9 KiB
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants