docs(sda): fix typos and edit wording (#7136)
rexledesma committed Mar 21, 2022
1 parent 3509c98 commit 98e99c8
Showing 7 changed files with 29 additions and 35 deletions.
2 changes: 1 addition & 1 deletion docs/content/api/modules.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/searchindex.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/content/api/sections.json

Large diffs are not rendered by default.

22 changes: 12 additions & 10 deletions docs/content/guides/dagster/software-defined-assets.mdx
@@ -7,14 +7,16 @@ description: The "software-defined asset" APIs sit atop of the graph/job/op APIs

<CodeReferenceLink filePath="examples/software_defined_assets" />

The "Software-defined asset" APIs sit atop of the graph/job/op APIs and enable a novel novel approach to orchestration that puts assets at the forefront. As a reminder, to Dagster, an "asset" is a data product: an object produced by a data pipeline, e.g. a table, ML model, or report.
The software-defined asset APIs sit atop of the graph/job/op APIs and enable a novel approach to orchestration that puts assets at the forefront.

Conceptually, software-defined assets invert the typical relationship between assets and computation. Instead of defining a graph of ops and recording which assets those ops end up materializing, you define a set of assets, each of which knows how to compute its contents from upstream assets.
In Dagster, an "asset" is a data product, an object produced by a data pipeline. Some examples are tables, machine learning models, or reports.

Conceptually, software-defined assets invert the typical relationship between assets and computation. Instead of defining a graph of ops and recording which assets those ops end up materializing, you define a set of assets. Each asset knows how to compute its contents from upstream assets.

Taking a software-defined asset approach has a few main benefits:

- **Write less code** - because each asset knows about the assets it depends on, you don't need to use `@graph` / `@job` to wire up dependencies between your ops.
- **Track cross-job dependencies via asset lineage** - Dagit allows you to find the parents and children of any asset, even if they live in different jobs. This is useful for finding the sources of problems and for understanding the consequences of changing or removing an asset.
- **Write less code** - Each asset knows about the assets it depends on; you don't need to use `@graph` / `@job` to wire up dependencies.
- **Track cross-job dependencies via asset lineage** - Dagster allows you to find the parents and children of any asset, even if they live in different jobs. This is useful for finding the sources of problems and for understanding the consequences of changing or removing an asset.
- **Know when you need to take action on an asset** - In a unified view, Dagster compares the assets you've defined in code to the assets you've materialized in storage. You can catch that you've deployed code for generating a new table, but that you haven't yet materialized it. Or that you've deployed code that adds a column to a table, but that your stored table is still missing that column. Or that you've removed an asset definition, but the table still exists in storage.

In this example, we'll define some tables with dependencies on each other. We have a table of temperature samples collected in five-minute increments, and we want to compute a table of the highest temperatures for each day.
@@ -23,7 +25,7 @@ In this example, we'll define some tables with dependencies on each other. We ha

### Defining the assets

Here are our asset (aka table) definitions.
Here are our asset definitions that define tables we want to materialize.

```python file=../../software_defined_assets/software_defined_assets/assets.py startafter=start_marker endbefore=end_marker
import pandas as pd
@@ -53,17 +55,17 @@ def hottest_dates(daily_temperature_highs: DataFrame) -> DataFrame:

`sfo_q2_weather_sample` represents our base temperature table. It's a <PyObject module="dagster" object="SourceAsset" />, meaning that we rely on it, but don't generate it.

`daily_temperature_highs` represents a computed asset. It's derived by taking the `sfo_q2_weather_sample` table and applying the decorated function to it. Notice that it's defined using a pure function - i.e. a function with no side effects, just logical data transformation. The code for storing and retrieving the data in persistent storage will be supplied later on in an <PyObject object="IOManager" /> - that allows swapping in different implementations in different environments. E.g. we might want to store data in a local CSV file for easy testing, but store data a data warehouse in production.
`daily_temperature_highs` represents a computed asset. It's derived by taking the `sfo_q2_weather_sample` table and applying the decorated function to it. Notice that it's defined using a pure function, a function with no side effects, just logical data transformation. The code for storing and retrieving the data in persistent storage will be supplied later on in an <PyObject object="IOManager" />. This allows us to swap in different implementations in different environments. For example, in local development, we might want to store data in a local CSV file for easy testing. However in production, we would want to store data in a data warehouse.

`hottest_dates` is a computed asset that depends on another computed asset - the `daily_temperture_highs` asset.
`hottest_dates` is a computed asset that depends on another computed asset, `daily_temperature_highs`.

The framework infers asset dependencies by looking at the names of the arguments to the decorated functions. E.g. the function that defines the `daily_temperature_highs` asset has an argument named `sfo_q2_weather_sample` - corresponding to the asset of the same name.
The framework infers asset dependencies by looking at the names of the arguments to the decorated functions. The function that defines the `daily_temperature_highs` asset has an argument named `sfo_q2_weather_sample`, which corresponds to the asset definition of the same name.

### Combining the assets into a group

Having defined some assets, we can combine them into an <PyObject object="AssetGroup" />, which allows working with them in Dagit. It also allows combining them with resources and IO managers that determine how they're stored and connect them to external services.

It's common to use a utility like <PyObject object="AssetGroup" method="from_module" /> or \<PyObject object="AssetGroup" method='from_package_name" /> to pick up all the assets within a module or package, so you don't need to list them individually.
It's common to use a utility like <PyObject object="AssetGroup" method="from_module" /> or <PyObject object="AssetGroup" method="from_package_name" /> to pick up all the assets within a module or package, so you don't need to list them individually.

```python file=../../software_defined_assets/software_defined_assets/weather_assets_group.py startafter=asset_group_start endbefore=asset_group_end
# imports the module called "assets" from the package containing the current module
@@ -78,7 +80,7 @@ weather_assets = AssetGroup.from_modules(
)
```

The order that we supply the assets when constructing an <PyObject object="AssetGroup" /> doesn't matter - the dependencies are determined by what's declared inside each asset.
The order that we supply the assets when constructing an <PyObject object="AssetGroup" /> doesn't matter, since the dependencies are determined by each asset definition.
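
The name-based dependency inference described above, and the reason the order of the assets doesn't matter, can be illustrated with a small standalone sketch. This is not Dagster's implementation: it uses only the standard library (`inspect`, and `graphlib`, which requires Python 3.9+), the helper names `infer_upstream_names` and `execution_order` are hypothetical, and plain lists stand in for the DataFrames.

```python
import inspect
from graphlib import TopologicalSorter


def infer_upstream_names(fn):
    """Return the parameter names of fn, read here as upstream asset names."""
    return list(inspect.signature(fn).parameters)


# Hypothetical stand-ins for the asset functions in this guide; plain lists
# replace the DataFrames, and no Dagster decorator is involved.
def daily_temperature_highs(sfo_q2_weather_sample):
    return [max(day) for day in sfo_q2_weather_sample]


def hottest_dates(daily_temperature_highs):
    return sorted(daily_temperature_highs, reverse=True)[:2]


def execution_order(asset_fns):
    """Order asset names so that each one comes after the assets it depends on."""
    graph = {fn.__name__: infer_upstream_names(fn) for fn in asset_fns}
    return list(TopologicalSorter(graph).static_order())


# Supplying the functions in either order yields the same dependency order.
order_a = execution_order([daily_temperature_highs, hottest_dates])
order_b = execution_order([hottest_dates, daily_temperature_highs])
print(order_a)
```

Because each function's own signature carries its dependencies, the graph is the same however the functions are listed, which is the property the paragraph above relies on.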

The functions we used to define our assets describe how to compute their contents, but not how to read and write them to persistent storage. For reading and writing, we define an <PyObject object="IOManager" />. In this case, our `LocalFileSystemIOManager` stores DataFrames as CSVs on the local filesystem:

Binary file modified docs/next/public/objects.inv
Binary file not shown.
1 change: 1 addition & 0 deletions docs/sphinx/sections/api/apidocs/assets.rst
@@ -15,6 +15,7 @@ A software-defined asset combines:
.. autodecorator:: asset

.. autoclass:: AssetGroup
:members:

.. autodecorator:: multi_asset

35 changes: 13 additions & 22 deletions python_modules/dagster/dagster/core/asset_defs/asset_group.py
@@ -164,31 +164,22 @@ def build_job(
Args:
name (str): The name to give the job.
selection (Union[str, List[str]]): A single selection query or list of selection queries to execute. For example:
* ``['some_asset_key']``: selects ``some_asset_key`` itself.
* ``['*some_asset_key']``: select ``some_asset_key`` and all
its ancestors (upstream dependencies).
* ``['*some_asset_key+++']``: select ``some_asset_key``, all
its ancestors, and its descendants
(downstream dependencies) within 3 levels down.
* ``['*some_asset_key', 'other_asset_key_a', 'other_asset_key_b
+']``: select ``some_asset_key`` and all its
ancestors, ``other_asset_key_a`` itself, and
``other_asset_key_b`` and its direct child asset keys. When
subselecting into a multi-asset, all of the asset keys in
that multi-asset must be selected.
selection (Union[str, List[str]]): A single selection query or list of selection queries
to execute. For example:
- ``['some_asset_key']`` select ``some_asset_key`` itself.
- ``['*some_asset_key']`` select ``some_asset_key`` and all its ancestors (upstream dependencies).
- ``['*some_asset_key+++']`` select ``some_asset_key``, all its ancestors, and its descendants (downstream dependencies) within 3 levels down.
- ``['*some_asset_key', 'other_asset_key_a', 'other_asset_key_b+']`` select ``some_asset_key`` and all its ancestors, ``other_asset_key_a`` itself, and ``other_asset_key_b`` and its direct child asset keys. When subselecting into a multi-asset, all of the asset keys in that multi-asset must be selected.
executor_def (Optional[ExecutorDefinition]): The executor
definition to use when executing the job. Defaults to the
executor on the AssetGroup. If no executor was provided on the
AssetGroup, then it defaults to
:py:class:`multi_or_in_process_executor`.
tags (Optional[Dict[str, Any]]): Arbitrary metadata for any
execution of the Job.
Values that are not strings will be json encoded and must meet t
he criteria that
`json.loads(json.dumps(value)) == value`. These tag values may
be overwritten by tag
values provided at invocation time.
AssetGroup, then it defaults to :py:class:`multi_or_in_process_executor`.
tags (Optional[Dict[str, Any]]): Arbitrary metadata for any execution of the job.
Values that are not strings will be json encoded and must meet the criteria that
`json.loads(json.dumps(value)) == value`. These tag values may be overwritten
by tag values provided at invocation time.
description (Optional[str]): A description of the job.
Examples:
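
The `tags` docstring above requires that non-string values satisfy `json.loads(json.dumps(value)) == value`. The following sketch shows what that criterion accepts and rejects. It is an illustration only, not the code Dagster runs: `normalize_tags` is a hypothetical helper, and raising `ValueError` is an assumption made for the example.

```python
import json


def normalize_tags(tags):
    """Return a str -> str mapping, JSON-encoding non-string values.

    Hypothetical helper: raises ValueError when a value does not survive
    a JSON round trip.
    """
    normalized = {}
    for key, value in tags.items():
        if isinstance(value, str):
            # Strings are kept as-is.
            normalized[key] = value
            continue
        try:
            encoded = json.dumps(value)
        except TypeError as exc:
            raise ValueError(f"tag {key!r} is not JSON serializable") from exc
        # The round-trip criterion quoted in the docstring.
        if json.loads(encoded) != value:
            raise ValueError(f"tag {key!r} does not survive a JSON round trip")
        normalized[key] = encoded
    return normalized


print(normalize_tags({"team": "weather", "retries": 3, "regions": ["sfo", "oak"]}))
```

Values such as integers and lists of strings pass the check. A tuple does not, because JSON decodes it back as a list, and a dict with integer keys does not either, because JSON turns the keys into strings.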
