asset tutorial (#7269)
sryza committed Apr 13, 2022
1 parent f23a00b commit 1600323
Showing 19 changed files with 525 additions and 10 deletions.
20 changes: 19 additions & 1 deletion docs/content/_navigation.json
@@ -405,7 +405,25 @@
"path": "/guides/dagster/memoization"
},
{
"title": "Software-Defined Assets (Experimental)",
"title": "Software-Defined Assets Tutorial (Experimental)",
"path": "/guides/dagster/asset-tutorial",
"children": [
{
"title": "Defining an Asset",
"path": "/guides/dagster/asset-tutorial/defining-an-asset"
},
{
"title": "Building Graphs of Assets",
"path": "/guides/dagster/asset-tutorial/asset-graph"
},
{
"title": "Testing Assets",
"path": "/guides/dagster/asset-tutorial/testing-assets"
}
]
},
{
"title": "Software-Defined Assets with Pandas and PySpark (Experimental)",
"path": "/guides/dagster/software-defined-assets"
},
{
19 changes: 10 additions & 9 deletions docs/content/guides.mdx
@@ -4,12 +4,13 @@

This section explains how to accomplish common tasks in Dagster and showcases Dagster's experimental features.

| Name | Description |
| ------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| [Migrating to Graphs, Jobs, and Ops](/guides/dagster/graph_job_op) | This guide describes how to migrate to the Graph, Job, and Op APIs from the legacy Dagster APIs (Solids and Pipelines). |
| [Versioning and Memoization](/guides/dagster/memoization) | This guide describes how to use Dagster's versioning and memoization features. <Experimental /> |
| [Software-Defined Assets](/guides/dagster/software-defined-assets) | This guide describes how and why to use software-defined assets. <Experimental /> |
| [Run Attribution](/guides/dagster/run-attribution) | This guide describes how to perform Run Attribution by using a Custom Run Coordinator <Experimental /> |
| [Re-execution](/guides/dagster/re-execution) | This guide describes how to re-execute a job within Dagit and using Dagster's APIs. |
| [Fully-Featured Example Project](/guides/dagster/example_project) | This guide describes the Hacker News example project, which takes advantage of many of Dagster's features |
| [Validating Data with Dagster Type Factories](/guides/dagster/dagster_type_factories) | This guide illustrates the use of a Dagster Type factory to validate Pandas dataframes using the third-party library Pandera. |
| Name | Description |
| ------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| [Migrating to Graphs, Jobs, and Ops](/guides/dagster/graph_job_op) | This guide describes how to migrate to the Graph, Job, and Op APIs from the legacy Dagster APIs (Solids and Pipelines). |
| [Versioning and Memoization](/guides/dagster/memoization) | This guide describes how to use Dagster's versioning and memoization features. <Experimental /> |
| [Software-Defined Assets Tutorial](/guides/dagster/asset-tutorial) | This guide teaches how to use software-defined assets, one concept at a time. <Experimental /> |
| [Software-Defined Assets with Pandas and PySpark](/guides/dagster/software-defined-assets) | This guide offers a fast introduction to software-defined assets, with Pandas and PySpark. <Experimental /> |
| [Run Attribution](/guides/dagster/run-attribution) | This guide describes how to perform Run Attribution by using a Custom Run Coordinator <Experimental /> |
| [Re-execution](/guides/dagster/re-execution) | This guide describes how to re-execute a job within Dagit and using Dagster's APIs. |
| [Fully-Featured Example Project](/guides/dagster/example_project) | This guide describes the Hacker News example project, which takes advantage of many of Dagster's features |
| [Validating Data with Dagster Type Factories](/guides/dagster/dagster_type_factories) | This guide illustrates the use of a Dagster Type factory to validate Pandas dataframes using the third-party library Pandera. |
36 changes: 36 additions & 0 deletions docs/content/guides/dagster/asset-tutorial.mdx
@@ -0,0 +1,36 @@
---
title: Software-Defined Assets Tutorial | Dagster
description: Getting familiar with Dagster software-defined assets and tooling through a hands-on tutorial
---

# Tutorial

If you're new to software-defined assets, we recommend working through this tutorial to become familiar with them. It uses small examples that are intended to illustrate real data problems.

## Index

The tutorial is divided into several sections:

- [**Defining an Asset**](/guides/dagster/asset-tutorial/defining-an-asset)
- [**Building Graphs of Assets**](/guides/dagster/asset-tutorial/asset-graph)
- [**Testing Assets**](/guides/dagster/asset-tutorial/testing-assets)

## Setup

### Python and pip

We’ll assume that you have some familiarity with Python, but you should be able to follow along even if you’re coming from a different programming language. To check that Python and the pip package manager are installed in your environment, or to install them, follow the instructions [here](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).

### Dagster and Dagit

```bash
pip install dagster dagit requests
```

This installs a few packages:

- **Dagster**: the core programming model and abstraction stack; stateless, single-node, single-process and multi-process execution engines; and a CLI tool for driving those engines.
- **Dagit**: the UI for developing and operating Dagster assets.
- **Requests**: not part of Dagster. Our examples will use it to download data from the internet.

You can also check out our [Getting Started](/getting-started) page to make sure you have installed the packages and set up the environment properly.
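
As a quick sanity check, you can confirm that the three packages are importable from the Python environment you plan to use. The following is a minimal sketch that relies only on the standard library. The helper name `missing_packages` is our own and is not part of Dagster.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names in `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # The package names below are the ones installed by the pip command above.
    missing = missing_packages(["dagster", "dagit", "requests"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All tutorial packages are importable.")
```

If anything is reported as missing, re-run the `pip install` command inside the same environment and try again.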
119 changes: 119 additions & 0 deletions docs/content/guides/dagster/asset-tutorial/asset-graph.mdx
@@ -0,0 +1,119 @@
# Building Graphs of Assets

Software-defined assets can depend on other software-defined assets. An asset dependency means that the contents of an "upstream" asset are used to compute the contents of the "downstream" asset.

Why split up code into multiple assets? There are a few reasons:

- Dagster can materialize assets without re-materializing all the upstream assets. This means that, if we hit a failure, we can re-materialize just the assets that didn't materialize successfully, which often allows us to avoid re-executing expensive computation.
- When two assets don't depend on each other, Dagster can materialize them simultaneously.

## Let's Get Serial

Having defined a data set of cereals, we'll define a downstream asset that contains only the cereals that are manufactured by Nabisco.

```python file=/guides/dagster/asset_tutorial/serial_asset_graph.py
import csv

import requests

from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]
```

We've defined our new asset, `nabisco_cereals`, with an argument, `cereals`.

Dagster offers a few ways of specifying asset dependencies, but the easiest is to include an upstream asset name as an argument to the decorated function. When it's time to materialize the contents of the `nabisco_cereals` asset, the contents of the `cereals` asset are provided as the value for the `cereals` argument to its compute function.

So:

- `cereals` doesn't depend on any other asset.
- `nabisco_cereals` depends on `cereals`.
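
To build intuition for how an argument name can stand in for an upstream asset, here is a small sketch that reads parameter names with the standard library's `inspect` module. It uses plain, undecorated functions and a hypothetical helper, `upstream_names`. It illustrates the idea only and is not a description of how Dagster resolves dependencies internally.

```python
import inspect


def cereals():
    # Stand-in data; the tutorial's asset downloads cereal.csv instead.
    return [{"name": "cereal1", "mfr": "N"}, {"name": "cereal2", "mfr": "K"}]


def nabisco_cereals(cereals):
    return [row for row in cereals if row["mfr"] == "N"]


def upstream_names(fn):
    """Return the parameter names of `fn`, read as upstream asset names."""
    return list(inspect.signature(fn).parameters)


print(upstream_names(cereals))          # []
print(upstream_names(nabisco_cereals))  # ['cereals']
```

The empty list for `cereals` and the single entry for `nabisco_cereals` match the two bullet points above.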

Let's visualize these assets in Dagit:

```bash
dagit -f serial_asset_graph.py
```

Navigate to <http://127.0.0.1:3000>:

<img src="/images/guides/asset-tutorial/serial_asset_graph.png" />

<br />

## A More Complex Asset Graph

Assets don't need to be wired together serially. An asset can depend on, and be depended on by, any number of other assets.

Here, we're interested in which of Nabisco's cereals has the most protein. We define four assets:

- The `cereals` and `nabisco_cereals` assets, same as above.
- A `cereal_protein_fractions` asset, which records each cereal's protein content as a fraction of its total mass.
- A `highest_protein_nabisco_cereal` asset, which is the name of the Nabisco cereal that has the highest protein content.

```python file=/guides/dagster/asset_tutorial/complex_asset_graph.py
import csv

import requests

from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]


@asset
def cereal_protein_fractions(cereals):
    """
    For each cereal, records its protein content as a fraction of its total mass.
    """
    result = {}
    for cereal in cereals:
        total_grams = float(cereal["weight"]) * 28.35
        result[cereal["name"]] = float(cereal["protein"]) / total_grams

    return result


@asset
def highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions):
    """
    The name of the nabisco cereal that has the highest protein content.
    """
    sorted_by_protein = sorted(
        nabisco_cereals, key=lambda cereal: cereal_protein_fractions[cereal["name"]]
    )
    return sorted_by_protein[-1]["name"]
```
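
The arithmetic inside `cereal_protein_fractions` is easier to follow with a single worked value. The sketch below assumes, based on the 28.35 conversion factor, that the `weight` column is in ounces and the `protein` column is in grams. The function name `protein_fraction` and the sample numbers are hypothetical.

```python
GRAMS_PER_OUNCE = 28.35  # same conversion factor as in the asset above


def protein_fraction(weight_ounces, protein_grams):
    """Protein content as a fraction of total mass, with both in grams."""
    total_grams = weight_ounces * GRAMS_PER_OUNCE
    return protein_grams / total_grams


# Hypothetical cereal: a 1-ounce serving containing 4 grams of protein.
print(round(protein_fraction(1.0, 4.0), 3))  # 0.141
```

Because the result is a ratio, a cereal with a heavier serving and the same protein content gets a proportionally lower fraction.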

Let's visualize these assets in Dagit:

```bash
dagit -f complex_asset_graph.py
```

<img src="/images/guides/asset-tutorial/complex_asset_graph.png" />

If you click the "Materialize All" button, you'll see that `cereals` executes first, followed by `nabisco_cereals` and `cereal_protein_fractions` executing in parallel, since they don't depend on each other's outputs. `highest_protein_nabisco_cereal` executes last, only after `nabisco_cereals` and `cereal_protein_fractions` have both executed.
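
The ordering described above follows directly from the shape of the dependency graph. As a rough illustration, the sketch below groups the four assets into batches using `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The `dependencies` mapping and the `execution_batches` helper are our own, and this is not a description of Dagster's actual executor.

```python
from graphlib import TopologicalSorter

# Each asset maps to the set of assets it depends on.
dependencies = {
    "cereals": set(),
    "nabisco_cereals": {"cereals"},
    "cereal_protein_fractions": {"cereals"},
    "highest_protein_nabisco_cereal": {"nabisco_cereals", "cereal_protein_fractions"},
}


def execution_batches(deps):
    """Group nodes into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned here could run at the same time.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in execution_batches(dependencies):
    print(batch)
# ['cereals']
# ['cereal_protein_fractions', 'nabisco_cereals']
# ['highest_protein_nabisco_cereal']
```

The middle batch holds the two assets that don't depend on each other, which is why they are candidates for parallel execution.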
91 changes: 91 additions & 0 deletions docs/content/guides/dagster/asset-tutorial/defining-an-asset.mdx
@@ -0,0 +1,91 @@
---
title: Software-Defined Assets Tutorial | Dagster
description: A software-defined asset specifies an asset that you want to exist and how to compute its contents.
---

# A First Asset

## The Cereal Dataset

Our assets will represent and transform a simple but scary CSV dataset, cereal.csv, which contains nutritional facts about 80 breakfast cereals.

## Hello, Asset!

Let's write our first Dagster asset and save it as `cereal.py`.

A software-defined asset specifies an asset that you want to exist and how to compute its contents. Typically, you'll define assets by annotating ordinary Python functions with the <PyObject module="dagster" object="asset" decorator /> decorator.

Our first asset represents a dataset of cereal data, downloaded from the internet.

```python file=/guides/dagster/asset_tutorial/cereal.py startafter=start_asset_marker endbefore=end_asset_marker
import csv
import requests
from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    cereal_rows = [row for row in csv.DictReader(lines)]

    return cereal_rows
```

In this simple case, our asset doesn't depend on any other assets.
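
To see what the asset's function returns without downloading anything, you can run the same parsing steps on an inline string. This is a minimal sketch: `SAMPLE_CSV` is a hypothetical three-column stand-in, not the real contents of cereal.csv.

```python
import csv

# Hypothetical sample in the same shape as a CSV file: a header row, then data rows.
SAMPLE_CSV = "name,mfr,protein\ncereal1,N,4\ncereal2,K,2\n"

lines = SAMPLE_CSV.split("\n")
rows = [row for row in csv.DictReader(lines)]

print(rows[0]["name"])     # cereal1
print(rows[0]["protein"])  # 4 (a string; csv does not convert types)
print(len(rows))           # 2
```

Each row is a dictionary keyed by the header names, and every value is a string, which is why later code in this tutorial wraps numeric columns in `float(...)`.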

## Materializing our asset

"Materializing" an asset means computing its contents and then writing them to persistent storage. By default, Dagster will pickle the value returned by the function and store it in the local filesystem, using the name of the asset as the name of the file. Where and how the contents are stored is fully customizable, e.g. you might store them in a database or a cloud object store like S3. We'll look at how that works later.

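The default behavior described above, pickling a value into a file named after the asset, can be approximated with the standard library. This is a sketch of the idea only: the helper names and the directory layout are assumptions, and Dagster's own storage code and file locations are not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def store_asset_value(base_dir, asset_name, value):
    """Pickle `value` into a file named after the asset and return its path."""
    path = Path(base_dir) / asset_name
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return path


def load_asset_value(base_dir, asset_name):
    """Read back a value previously written by `store_asset_value`."""
    with open(Path(base_dir) / asset_name, "rb") as f:
        return pickle.load(f)


# A temporary directory stands in for the storage location.
with tempfile.TemporaryDirectory() as tmp:
    store_asset_value(tmp, "cereals", [{"name": "cereal1", "mfr": "N"}])
    print(load_asset_value(tmp, "cereals"))  # [{'name': 'cereal1', 'mfr': 'N'}]
```

Storing by asset name is what lets a downstream asset find its upstream input later without recomputing it.
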
Assuming you’ve saved this code as `cereal.py`, you can execute it via two different mechanisms:

### Dagit

To visualize your assets in Dagit, just run the following. Make sure you're in the directory that contains the file with your code:

```bash
dagit -f cereal.py
```

You'll see output like:

```bash
Serving dagit on http://127.0.0.1:3000 in process 70635
```

You should be able to navigate to <http://127.0.0.1:3000> in your web browser and view your asset.

<img
  alt="defining_an_asset.png"
  src="/images/guides/asset-tutorial/defining_an_asset.png"
/>

Clicking on the "Materialize All" button will launch a run that will materialize the asset. After that run has completed, the shaded box underneath "cereals" holds information about that run. Clicking on the Run ID, which is the string of characters in the upper right of that box, will take you to a view that includes a structured stream of logs and events that occurred during its execution.

<img alt="asset_run.png" src="/images/guides/asset-tutorial/asset_run.png" />

In this view, you can filter and search through the logs corresponding to the run that's materializing your asset.

To see a history of all the materializations for your asset, you can navigate to the _Asset Details_ page for it. Click the "cereals" link in the upper left corner of this run page, next to "Success". Another way to get to the same page is to navigate back to the Asset Graph page by clicking "Assets" in the top navigation pane, clicking on your asset, and then clicking on "View in Asset Catalog" at the top of the pane that shows up on the right.

<img src="/images/guides/asset-tutorial/asset_details.png" />

Success!

### Python API

If you'd rather materialize your asset as a script, you can do that without spinning up Dagit. Just add a few lines to `cereal.py`. This executes a run within the Python process.

```python file=/guides/dagster/asset_tutorial/cereal.py startafter=start_materialize_marker endbefore=end_materialize_marker
from dagster import AssetGroup

if __name__ == "__main__":
    AssetGroup([cereals]).materialize()
```

Now you can just run:

```bash
python cereal.py
```
62 changes: 62 additions & 0 deletions docs/content/guides/dagster/asset-tutorial/testing-assets.mdx
@@ -0,0 +1,62 @@
---
title: Testing Assets | Dagster
description: Dagster enables you to unit-test individual assets and graphs of assets
---

# Testing Assets

Creating testable and verifiable data pipelines is one of the focuses of Dagster. We believe ensuring data quality is critical for managing the complexity of data systems. Here, we'll cover how to write unit tests for individual assets, as well as for graphs of assets together.

## Testing the Cereal Asset Definitions

Let's go back to the assets we defined in the [prior section](/guides/dagster/asset-tutorial/asset-graph#a-more-complex-asset-graph), and ensure that they work as expected by writing some unit tests.

We'll start by writing a test for the `nabisco_cereals` asset definition, which filters the larger list of cereals down to those that were manufactured by Nabisco. To run the function that derives an asset from its upstream dependencies, we can invoke it directly, as if it's a regular Python function:

```python file=/guides/dagster/asset_tutorial/complex_asset_graph_tests.py startafter=start_asset_test endbefore=end_asset_test
def test_nabisco_cereals():
    cereals = [
        {"name": "cereal1", "mfr": "N"},
        {"name": "cereal2", "mfr": "K"},
    ]
    result = nabisco_cereals(cereals)
    assert len(result) == 1
    assert result == [{"name": "cereal1", "mfr": "N"}]
```
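
The same direct-invocation pattern extends to assets with more than one upstream dependency: pass a hand-built value for each argument. The sketch below repeats the logic of `highest_protein_nabisco_cereal` as a plain function, without the `@asset` decorator, so that it runs on its own. The input values are hypothetical.

```python
def highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions):
    # Plain-function copy of the asset's logic, without the @asset decorator.
    sorted_by_protein = sorted(
        nabisco_cereals, key=lambda cereal: cereal_protein_fractions[cereal["name"]]
    )
    return sorted_by_protein[-1]["name"]


def test_highest_protein_nabisco_cereal():
    # Hypothetical inputs standing in for the two upstream assets.
    nabisco_cereals = [
        {"name": "cereal1", "mfr": "N"},
        {"name": "cereal3", "mfr": "N"},
    ]
    # cereal2 has the highest fraction overall but is not a Nabisco cereal.
    cereal_protein_fractions = {"cereal1": 0.05, "cereal2": 0.30, "cereal3": 0.10}
    result = highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions)
    assert result == "cereal3"
```

Including a non-Nabisco cereal with a higher fraction checks that the function only ranks the cereals it was given.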

We'll also write a test for all the assets together. To do that, we need to combine them into an <PyObject object="AssetGroup" />. Then, we can invoke <PyObject object="AssetGroup" method="materialize_in_process" />, which returns an <PyObject module="dagster" object="ExecuteInProcessResult" />, whose methods let us investigate, in detail, the success or failure of execution, the values produced by the computation, and (as we'll see later) other events associated with execution.

```python file=/guides/dagster/asset_tutorial/complex_asset_graph_tests.py startafter=start_asset_group_test endbefore=end_asset_group_test
from dagster import AssetGroup


def test_cereal_asset_group():
    group = AssetGroup(
        [
            nabisco_cereals,
            cereals,
            cereal_protein_fractions,
            highest_protein_nabisco_cereal,
        ]
    )

    result = group.materialize()
    assert result.success
    assert result.output_for_node("highest_protein_nabisco_cereal") == "100% Bran"
```

Now you can use pytest, or your test runner of choice, to run the unit tests.

```bash
pytest test_complex_asset_graph.py
```

Dagster is written to make testing easy in a domain where it has historically been very difficult. You can learn more about testing in Dagster by reading the [Testing](/concepts/testing) page.

<br />

## Conclusion

🎉 Congratulations! Having come this far, you now have a working, testable, and maintainable group of software-defined assets.

<br />
29 changes: 29 additions & 0 deletions docs/screenshot_capture/screenshots.yaml
@@ -64,6 +64,35 @@
  url: http://127.0.0.1:3000/workspace/hello_cereal_repository@scheduler.py/schedules/good_morning_schedule
  vetted: false

#################
# Asset tutorial
#################

- path: guides/asset-tutorial/defining_an_asset.png
  defs_file: examples/docs_snippets/docs_snippets/guides/dagster/asset_tutorial/cereal.py
  url: http://127.0.0.1:3000/

- path: guides/asset-tutorial/asset_run.png
  defs_file: examples/docs_snippets/docs_snippets/guides/dagster/asset_tutorial/cereal.py
  url: http://127.0.0.1:3000/
  steps:
    - launch a run and click on the run

- path: guides/asset-tutorial/asset_details.png
  defs_file: examples/docs_snippets/docs_snippets/guides/dagster/asset_tutorial/cereal.py
  url: http://127.0.0.1:3000/instance/assets/cereals
  steps:
    - hit the materialize button and then close the run toast

- path: guides/asset-tutorial/serial_asset_graph.png
  defs_file: examples/docs_snippets/docs_snippets/guides/dagster/asset_tutorial/serial_asset_graph.py
  url: http://127.0.0.1:3000/

- path: guides/asset-tutorial/complex_asset_graph.png
  defs_file: examples/docs_snippets/docs_snippets/guides/dagster/asset_tutorial/complex_asset_graph.py
  url: http://127.0.0.1:3000/


#################
# Concepts: misc
#################