[docs] - Update Ops & Asset pages for release [CON-21] (#8158)
* Update ops page

- Re-organizes the overview for Ops
- Adds info about assets

* Update assets page

Re-write overview for assets page

* Re-org intro for resources

* Resources - Move Overview up

* Resources - Make casing in headings consistent

* Prepping for asset update

* Revert "Prepping for asset update"

This reverts commit ef6350a.

* Revert "Resources - Make casing in headings consistent"

This reverts commit a003fb9.

* Revert "Resources - Move Overview up"

This reverts commit c7f63c7.

* Revert "Re-org intro for resources"

This reverts commit c9a98e6.

* Cleaning up materialization copy in intro

* Run snapshot
erinkcochran87 committed Jun 8, 2022
1 parent 2e236b2 commit 1804af5
Showing 2 changed files with 47 additions and 36 deletions.
32 changes: 16 additions & 16 deletions docs/content/concepts/assets/software-defined-assets.mdx
@@ -5,31 +5,31 @@ description: A software-defined asset is a description of how to compute the con

# Software-Defined Assets

A software-defined asset is a description of how to compute the contents of a particular data asset.
An **asset** is an object in persistent storage, such as a table, file, or persisted machine learning model. A **software-defined asset** is a Dagster object that couples an asset to the function and upstream assets that are used to produce its contents.

## Relevant APIs

| Name | Description |
| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <PyObject object="asset" decorator /> | A decorator used to define assets. |
| <PyObject object="AssetGroup" /> | A group of software-defined assets. |
| <PyObject object="SourceAsset" /> | A class that describes an asset, but doesn't define how to compute it. Within an <PyObject object="AssetGroup" />, <PyObject object="SourceAsset" />s are used to represent assets that other assets depend on, but can't be materialized themselves. |

## Overview
Software-defined assets enable a declarative approach to data management, in which code is the source of truth on what data assets should exist and how those assets are computed.

An "asset" is an object in persistent storage, e.g. a table, a file, or a persisted machine learning model. A software-defined asset is a Dagster object that couples an asset to the function and upstream assets that are used to produce its contents. Software-defined assets enable a declarative approach to data management, in which your code is the source of truth on what data assets should exist and how those assets are computed.

A software-defined asset includes three main components:
A software-defined asset includes the following:

- An <PyObject object="AssetKey" />, which is a handle for referring to the asset.
- A set of upstream asset keys, which refer to assets that the contents of the software-defined asset are derived from.
- An [op](/concepts/ops-jobs-graphs/ops), which is a function responsible for computing the contents of the asset from its upstream dependencies.

A crucial distinction between software-defined assets and [ops](/concepts/ops-jobs-graphs/ops) is that software-defined assets know about their dependencies, while ops do not. Ops aren't hooked up to dependencies until they're placed inside a [graph](/concepts/ops-jobs-graphs/jobs-graphs).
**Note**: A crucial distinction between software-defined assets and [ops](/concepts/ops-jobs-graphs/ops) is that software-defined assets know about their dependencies, while ops do not. Ops aren't connected to dependencies until they're placed inside a [graph](/concepts/ops-jobs-graphs/jobs-graphs).

**Materializing** an asset is the act of running its op and saving the results to persistent storage. You can initiate materializations from [Dagit](/concepts/dagit/dagit) or by invoking Python APIs. By default, assets are materialized to pickle files on your local filesystem, but materialization behavior is [fully customizable](#customizing-how-assets-are-materialized-with-io-managers). It's possible to materialize an asset in multiple storage environments, such as production and staging.

"Materializing" an asset is the act of running its op and saving the results to persistent storage. You can initiate materializations from [Dagit](/concepts/dagit/dagit), Dagster's web UI, or by invoking Python APIs. By default, assets are materialized to pickle files on your local filesystem, but materialization behavior is [fully customizable](#customizing-how-assets-are-materialized-with-io-managers).
---

## Relevant APIs

| Name | Description |
| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <PyObject object="asset" decorator /> | A decorator used to define assets. |
| <PyObject object="AssetGroup" /> | A group of software-defined assets. |
| <PyObject object="SourceAsset" /> | A class that describes an asset, but doesn't define how to compute it. Within an <PyObject object="AssetGroup" />, <PyObject object="SourceAsset" />s are used to represent assets that other assets depend on, but can't be materialized themselves. |

A single software-defined asset might be represented in multiple storage environments - e.g. it might have a "production" version and a "staging" version.
---

## Defining assets

51 changes: 31 additions & 20 deletions docs/content/concepts/ops-jobs-graphs/ops.mdx
@@ -5,10 +5,40 @@ description: Ops are the core unit of computation in Dagster and contain the log

# Ops

Ops are the core unit of computation in Dagster. Multiple ops can be connected to create a [Graph](/concepts/ops-jobs-graphs/jobs-graphs).
Ops are one of Dagster's two core units of computation; [assets](/concepts/assets/software-defined-assets) are the other.

An individual op should perform relatively simple tasks, such as:

- Deriving a dataset from other datasets
- Executing a database query
- Initiating a Spark job in a remote cluster
- Querying an API and storing the result in a data warehouse
- Sending an email or Slack message

Collections of ops can be assembled to create a [graph](/concepts/ops-jobs-graphs/jobs-graphs).

<Image alt="ops" src="/images/ops.png" width={3200} height={1040} />

Ops support a variety of useful features for data orchestration, such as:

- **Flexible execution strategies**: Painlessly transition from development to production with ops, as they are sealed units of logic independent of execution strategy. Collections of ops - called [graphs](/concepts/ops-jobs-graphs/jobs-graphs) - can be bound via [jobs](/concepts/ops-jobs-graphs/jobs-graphs) to an appropriate [executor](/deployment/executors) for single-process execution or distribution across a cluster.

- **Pluggable external systems**: If your data pipeline interfaces with external systems, you may want to use local substitutes during development over a cloud-based production system. Dagster provides [resources](/concepts/resources) as an abstraction layer for this purpose.

Ops can be written against abstract resources (e.g. `database`), with resource definitions later bound at the [job](/concepts/ops-jobs-graphs/jobs-graphs) level. Op logic thus remains decoupled from any particular implementation of an external system.

- **Input and output management**: Ops have defined [inputs and outputs](#inputs-and-outputs), analogous to the arguments and return value(s) of a Python function. An input or output can be annotated with a [Dagster type](/concepts/types) for arbitrarily complex runtime validation. Outputs can additionally be tagged with an [IO Manager](/concepts/io-management/io-managers) to manage storage of the associated data in between ops. This enables easy swapping of I/O strategy depending on the execution environment, as well as efficient caching of data intermediates.

- **Configuration**: Operations in a data pipeline are often parameterized by both upstream data (e.g. a stream of database records) and configuration parameters independent of upstream data (e.g. a "chunk size" of incoming records to operate on). Define configuration parameters by providing an associated [config schema](/concepts/configuration/config-schema) to the op.

- **Event streams**: Ops emit a stream of [events](/concepts/ops-jobs-graphs/op-events) during execution. Certain events are emitted by default - such as indicating the start of an op's execution - but op authors are additionally given access to an event API.

This can be used to report data asset creation or modification (<PyObject object="AssetMaterialization"/>), the result of a data quality check (<PyObject object="ExpectationResult"/>), or other arbitrary information. Event streams can be visualized by Dagster's browser UI [Dagit](/concepts/dagit/dagit). This rich log of execution facilitates debugging, inspection, and real-time monitoring of running jobs.

- **Testability**: The properties that enable flexible execution of ops also facilitate versatile testing. Ops can be [tested](/concepts/testing) in isolation or as part of a pipeline. Further, the [resource](/concepts/resources) API allows external systems (e.g. databases) to be stubbed or substituted as needed.

---

## Relevant APIs

| Name | Description |
@@ -19,25 +19,6 @@ Ops are the core unit of computation in Dagster. Multiple ops can be connected t
| <PyObject object="OpExecutionContext"/> | An object exposing Dagster system APIs for resource access, logging, and more. Can be injected into an op by specifying `context` as the first argument of the compute function. |
| <PyObject object="OpDefinition" /> | Class for ops. You will rarely want to instantiate this class directly. Instead, you should use the <PyObject object="op" decorator />. |

## Overview

Ops are Dagster's core unit of computation. Individual ops should perform relatively simple tasks. Collections of ops can then be assembled into [Graphs](/concepts/ops-jobs-graphs/jobs-graphs) to perform more complex tasks. Some examples of tasks appropriate for a single op:

- Derive a dataset from other datasets.
- Execute a database query.
- Initiate a Spark job in a remote cluster.
- Query an API and store the result in a data warehouse.
- Send an email or Slack message.

The op as computational unit enables many useful features for data orchestration:

- **Flexible execution strategies**: Data pipelines are frequently developed locally and later deployed to production. Ops represent sealed units of logic independent of pipeline execution strategy, making the transition from development to production painless. Collections of ops (i.e. [Graphs](/concepts/ops-jobs-graphs/jobs-graphs)) can be bound via [Jobs](/concepts/ops-jobs-graphs/jobs-graphs) to an appropriate [Executor](/deployment/executors) for single-process execution or distribution across a cluster.
- **Pluggable external systems**: Production data pipelines almost always interface with external systems. The availability of these systems varies across environments. During development, it may be desirable to use a local substitute (e.g. SQLite) in place of a production system (e.g. cloud-hosted Postgres). Dagster provides [Resources](/concepts/resources) as an abstraction layer for this purpose. Ops can be written against abstract resources (e.g. `database`), with resource definitions later bound at the [Job](/concepts/ops-jobs-graphs/jobs-graphs) level. Op logic can thus remain uncoupled to any particular implementation of an external system.
- **Input and output management**: Ops have defined [inputs and outputs](#inputs-and-outputs), analogous to the arguments and return value(s) of a Python function. An input or output can be annotated with a [Dagster Type](/concepts/types) for arbitrarily complex runtime validation. Outputs can additionally be tagged with an [IO Manager](/concepts/io-management/io-managers) to manage storage of the associated data in between ops. This enables easy swapping of I/O strategy depending on the execution environment, as well as efficient caching of data intermediates.
- **Configuration**: Operations in a data pipeline are often parameterized by both upstream data (e.g. a stream of database records) and "configuration" parameters independent of upstream data (e.g. a "chunk size" of incoming records to operate on). Ops can be given an associated [Config Schema](/concepts/configuration/config-schema) to define such configuration parameters.
- **Event streams**: Ops emit a stream of [Events](/concepts/ops-jobs-graphs/op-events) during execution. Certain events are emitted by default (e.g. indicating the start of an op's execution), but op authors are additionally given access to an event API. This can be used to report data asset creation or modification (<PyObject object="AssetMaterialization"/>), the result of a data quality check (<PyObject object="ExpectationResult"/>), or other arbitrary information. Event streams can be visualized by Dagster's browser UI [Dagit](/concepts/dagit/dagit). This rich log of execution facilitates debugging, inspection, and real-time monitoring of running jobs.
- **Testability**: The properties that enable flexible execution of ops also facilitate versatile testing. Ops can be [tested](/concepts/testing) in isolation or as part of a pipeline. Further, the [Resource](/concepts/resources) API allows external systems (e.g. databases) to be stubbed or substituted as needed.

---

## Defining an op
