Showing 19 changed files with 525 additions and 10 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
--- | ||
title: Software-Defined Assets Tutorial | Dagster
description: Getting familiar with Dagster software-defined assets and tooling through a hands-on tutorial | ||
--- | ||
|
||
# Tutorial | ||
|
||
If you're new to software-defined assets, we recommend working through this tutorial to become familiar with them, using small examples that are intended to be illustrative of real data problems. | ||
|
||
## Index | ||
|
||
The tutorial is divided into several sections: | ||
|
||
- [**Defining an Asset**](/guides/dagster/asset-tutorial/defining-an-asset) | ||
- [**Building Graphs of Assets**](/guides/dagster/asset-tutorial/asset-graph) | ||
- [**Testing Assets**](/guides/dagster/asset-tutorial/testing-assets) | ||
|
||
## Setup | ||
|
||
### Python and pip | ||
|
||
We’ll assume that you have some familiarity with Python, but you should be able to follow along even if you’re coming from a different programming language. To check that Python and the pip package manager are installed in your environment, or to install them, follow the instructions [here](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
|
||
### Dagster and Dagit | ||
|
||
```bash | ||
pip install dagster dagit requests | ||
``` | ||
|
||
This installs a few packages: | ||
|
||
- **Dagster**: the core programming model and abstraction stack; stateless, single-node, single-process and multi-process execution engines; and a CLI tool for driving those engines. | ||
- **Dagit**: the UI for developing and operating Dagster assets. | ||
- **Requests**: not part of Dagster. Our examples will use it to download data from the internet. | ||
|
||
You can also check out our [Getting Started](/getting-started) page to make sure you have installed the packages and set up the environment properly. |
119 changes: 119 additions & 0 deletions
docs/content/guides/dagster/asset-tutorial/asset-graph.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# Building Graphs of Assets | ||
|
||
Software-defined assets can depend on other software-defined assets. An asset dependency means that the contents of an "upstream" asset are used to compute the contents of the "downstream" asset. | ||
|
||
Why split up code into multiple assets? There are a few reasons: | ||
|
||
- Dagster can materialize assets without re-materializing all the upstream assets. This means that, if we hit a failure, we can re-materialize just the assets that didn't materialize successfully, which often allows us to avoid re-executing expensive computation.
- When two assets don't depend on each other, Dagster can materialize them simultaneously. | ||
|
||
## Let's Get Serial | ||
|
||
Having defined a data set of cereals, we'll define a downstream asset that contains only the cereals that are manufactured by Nabisco. | ||
|
||
```python file=/guides/dagster/asset_tutorial/serial_asset_graph.py
import csv

import requests

from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]
```
|
||
We've defined our new asset, `nabisco_cereals`, with an argument, `cereals`. | ||
|
||
Dagster offers a few ways of specifying asset dependencies, but the easiest is to include an upstream asset name as an argument to the decorated function. When it's time to materialize the contents of the `nabisco_cereals` asset, the contents of the `cereals` asset are provided as the value for the `cereals` argument to its compute function.
|
||
So: | ||
|
||
- `cereals` doesn't depend on any other asset. | ||
- `nabisco_cereals` depends on `cereals`.
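
The argument-name convention can be illustrated with plain Python. The sketch below is not Dagster's implementation and does not use Dagster at all. It only shows, using a hypothetical helper called `upstream_names` and undecorated stand-ins for the two assets, that a function's parameter names are enough to recover the dependency list above.

```python
import inspect


def cereals():
    # Undecorated stand-in for the real asset; returns hypothetical sample rows.
    return [{"name": "cereal1", "mfr": "N"}, {"name": "cereal2", "mfr": "K"}]


def nabisco_cereals(cereals):
    return [row for row in cereals if row["mfr"] == "N"]


def upstream_names(fn):
    """Hypothetical helper: treat each parameter name as an upstream asset name."""
    return list(inspect.signature(fn).parameters)


print(upstream_names(cereals))          # []
print(upstream_names(nabisco_cereals))  # ['cereals']
```

Because the dependency is expressed through the parameter name, renaming the argument would point it at a different upstream asset.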
|
||
Let's visualize these assets in Dagit: | ||
|
||
```bash | ||
dagit -f serial_asset_graph.py | ||
``` | ||
|
||
Navigate to <http://127.0.0.1:3000>: | ||
|
||
<img src="/images/guides/asset-tutorial/serial_asset_graph.png" /> | ||
|
||
<br /> | ||
|
||
## A More Complex Asset Graph | ||
|
||
Assets don't need to be wired together serially. An asset can depend on, and be depended on by, any number of other assets.
|
||
Here, we're interested in which of Nabisco's cereals has the most protein. We define four assets: | ||
|
||
- The `cereals` and `nabisco_cereals` assets, same as above. | ||
- A `cereal_protein_fractions` asset, which records each cereal's protein content as a fraction of its total mass. | ||
- A `highest_protein_nabisco_cereal` asset, which is the name of the Nabisco cereal that has the highest protein content.
|
||
```python file=/guides/dagster/asset_tutorial/complex_asset_graph.py
import csv

import requests

from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]


@asset
def nabisco_cereals(cereals):
    """Cereals manufactured by Nabisco"""
    return [row for row in cereals if row["mfr"] == "N"]


@asset
def cereal_protein_fractions(cereals):
    """
    For each cereal, records its protein content as a fraction of its total mass.
    """
    result = {}
    for cereal in cereals:
        total_grams = float(cereal["weight"]) * 28.35
        result[cereal["name"]] = float(cereal["protein"]) / total_grams

    return result


@asset
def highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions):
    """
    The name of the Nabisco cereal that has the highest protein content.
    """
    # Sort ascending by protein fraction, so the last entry is the highest.
    sorted_by_protein = sorted(
        nabisco_cereals, key=lambda cereal: cereal_protein_fractions[cereal["name"]]
    )
    return sorted_by_protein[-1]["name"]
```
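
As a quick check of the arithmetic in `cereal_protein_fractions`, here is the same calculation for a single made-up row. The sketch assumes, as the `28.35` factor suggests, that `weight` is in ounces and `protein` is in grams. The row values are hypothetical and are not taken from cereal.csv.

```python
GRAMS_PER_OUNCE = 28.35

# Hypothetical row; csv.DictReader yields every field as a string.
row = {"name": "example cereal", "weight": "1", "protein": "4"}

total_grams = float(row["weight"]) * GRAMS_PER_OUNCE
protein_fraction = float(row["protein"]) / total_grams

print(round(protein_fraction, 3))  # 0.141
```

The explicit `float(...)` conversions matter here: dividing the raw string fields would raise a `TypeError`.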
|
||
Let's visualize these assets in Dagit: | ||
|
||
```bash | ||
dagit -f complex_asset_graph.py | ||
``` | ||
|
||
<img src="/images/guides/asset-tutorial/complex_asset_graph.png" /> | ||
|
||
If you click the "Materialize All" button, you'll see that `cereals` executes first, followed by `nabisco_cereals` and `cereal_protein_fractions` executing in parallel, since they don't depend on each other's outputs. `highest_protein_nabisco_cereal` executes last, only after `nabisco_cereals` and `cereal_protein_fractions` have both finished.
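
The ordering described above follows from the dependency graph alone. The sketch below is an illustration rather than Dagster's scheduler. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to group the four assets into stages, where every asset in a stage has all of its upstream assets finished. The `dependencies` dictionary is written out by hand for this example.

```python
from graphlib import TopologicalSorter

# Each asset mapped to the set of assets it depends on.
dependencies = {
    "cereals": set(),
    "nabisco_cereals": {"cereals"},
    "cereal_protein_fractions": {"cereals"},
    "highest_protein_nabisco_cereal": {"nabisco_cereals", "cereal_protein_fractions"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

stages = []
while sorter.is_active():
    # Assets whose upstream assets have all been marked done.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for stage in stages:
    print(stage)
# ['cereals']
# ['cereal_protein_fractions', 'nabisco_cereals']
# ['highest_protein_nabisco_cereal']
```

Assets that land in the same stage have no dependency on each other, which is what makes it safe to run them at the same time.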
91 changes: 91 additions & 0 deletions
docs/content/guides/dagster/asset-tutorial/defining-an-asset.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
--- | ||
title: Software-Defined Assets Tutorial | Dagster | ||
description: A software-defined asset specifies an asset that you want to exist and how to compute its contents.
--- | ||
|
||
# A First Asset | ||
|
||
## The Cereal Dataset | ||
|
||
Our assets will represent and transform a simple but scary CSV dataset, cereal.csv, which contains nutritional facts about 80 breakfast cereals. | ||
|
||
## Hello, Asset! | ||
|
||
Let's write our first Dagster asset and save it as `cereal.py`. | ||
|
||
A software-defined asset specifies an asset that you want to exist and how to compute its contents. Typically, you'll define assets by annotating ordinary Python functions with the <PyObject module="dagster" object="asset" decorator /> decorator. | ||
|
||
Our first asset represents a dataset of cereal data, downloaded from the internet. | ||
|
||
```python file=/guides/dagster/asset_tutorial/cereal.py startafter=start_asset_marker endbefore=end_asset_marker
import csv
import requests
from dagster import asset


@asset
def cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    cereal_rows = [row for row in csv.DictReader(lines)]

    return cereal_rows
```
|
||
In this simple case, our asset doesn't depend on any other assets. | ||
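
To see the shape of the value that `cereals` returns without touching the network, the parsing step can be run on an inline sample. This is a sketch: `sample_text` is a hypothetical two-row stand-in, and its columns are only a guess at a small subset of what the real file contains.

```python
import csv

# Hypothetical sample; the real cereal.csv has more columns and rows.
sample_text = "name,mfr,protein\ncereal1,N,4\ncereal2,K,3"

lines = sample_text.split("\n")
cereal_rows = [row for row in csv.DictReader(lines)]

print(cereal_rows[0]["name"])                    # cereal1
print(type(cereal_rows[0]["protein"]).__name__)  # str
```

Each row is a dictionary keyed by the header line, and every value is a string, so numeric fields need an explicit conversion before any arithmetic.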
|
||
## Materializing our asset

"Materializing" an asset means computing its contents and then writing them to persistent storage. By default, Dagster pickles the value returned by the function and stores it in the local filesystem, using the name of the asset as the name of the file. Where and how the contents are stored is fully customizable: for example, you might store them in a database or a cloud object store like S3. We'll look at how that works later.
|
||
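
The default storage behavior described above can be pictured with a small standard-library sketch. This is not Dagster's storage code. The helpers `store_asset_value` and `load_asset_value` are hypothetical names introduced here, and a temporary directory stands in for wherever Dagster would actually write.

```python
import pickle
import tempfile
from pathlib import Path


def store_asset_value(base_dir, asset_name, value):
    """Hypothetical helper: pickle `value` to a file named after the asset."""
    path = Path(base_dir) / asset_name
    with path.open("wb") as f:
        pickle.dump(value, f)
    return path


def load_asset_value(base_dir, asset_name):
    """Hypothetical helper: read back the pickled value for `asset_name`."""
    with (Path(base_dir) / asset_name).open("rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as base_dir:
    rows = [{"name": "cereal1", "mfr": "N"}]
    path = store_asset_value(base_dir, "cereals", rows)
    restored = load_asset_value(base_dir, "cereals")
    print(path.name)         # cereals
    print(restored == rows)  # True
```

Naming the file after the asset is what lets a downstream computation find its upstream value by name alone.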
Assuming you’ve saved this code as `cereal.py`, you can execute it via two different mechanisms: | ||
|
||
### Dagit | ||
|
||
To visualize your assets in Dagit, just run the following. Make sure you're in the directory that contains the file with your code: | ||
|
||
```bash | ||
dagit -f cereal.py | ||
``` | ||
|
||
You'll see output like:
|
||
```bash | ||
Serving dagit on http://127.0.0.1:3000 in process 70635 | ||
``` | ||
|
||
You should be able to navigate to <http://127.0.0.1:3000> in your web browser and view your asset. | ||
|
||
<img | ||
alt="defining_an_asset.png" | ||
src="/images/guides/asset-tutorial/defining_an_asset.png" | ||
/> | ||
|
||
Clicking on the "Materialize All" button will launch a run that will materialize the asset. After that run has completed, the shaded box underneath "cereals" holds information about that run. Clicking on the Run ID, which is the string of characters in the upper right of that box, will take you to a view that includes a structured stream of logs and events that occurred during its execution. | ||
|
||
<img alt="asset_run.png" src="/images/guides/asset-tutorial/asset_run.png" /> | ||
|
||
In this view, you can filter and search through the logs corresponding to the run that's materializing your asset. | ||
|
||
To see a history of all the materializations for your asset, you can navigate to the _Asset Details_ page for it. Click the "cereals" link in the upper left corner of this run page, next to "Success". Another way to get to the same page is to navigate back to the Asset Graph page by clicking "Assets" in the top navigation pane, clicking on your asset, and then clicking on "View in Asset Catalog" at the top of the pane that shows up on the right. | ||
|
||
<img src="/images/guides/asset-tutorial/asset_details.png" /> | ||
|
||
Success! | ||
|
||
### Python API | ||
|
||
If you'd rather materialize your asset as a script, you can do that without spinning up Dagit. Just add a few lines to `cereal.py`; running the file then executes a run within the Python process.
|
||
```python file=/guides/dagster/asset_tutorial/cereal.py startafter=start_materialize_marker endbefore=end_materialize_marker
from dagster import AssetGroup

if __name__ == "__main__":
    AssetGroup([cereals]).materialize()
```
|
||
Now you can just run: | ||
|
||
```bash | ||
python cereal.py | ||
``` |
62 changes: 62 additions & 0 deletions
docs/content/guides/dagster/asset-tutorial/testing-assets.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
--- | ||
title: Testing Assets | Dagster | ||
description: Dagster enables you to unit-test individual assets and graphs of assets | ||
--- | ||
|
||
# Testing Assets | ||
|
||
Creating testable and verifiable data pipelines is one of the focuses of Dagster. We believe ensuring data quality is critical for managing the complexity of data systems. Here, we'll cover how to write unit tests for individual assets, as well as for graphs of assets together. | ||
|
||
## Testing the Cereal Asset Definitions | ||
|
||
Let's go back to the assets we defined in the [prior section](/guides/dagster/asset-tutorial/asset-graph#a-more-complex-asset-graph), and ensure that they work as expected by writing some unit tests. | ||
|
||
We'll start by writing a test for the `nabisco_cereals` asset definition, which filters the larger list of cereals down to those that were manufactured by Nabisco. To run the function that derives an asset from its upstream dependencies, we can invoke it directly, as if it were a regular Python function:
|
||
```python file=/guides/dagster/asset_tutorial/complex_asset_graph_tests.py startafter=start_asset_test endbefore=end_asset_test
def test_nabisco_cereals():
    cereals = [
        {"name": "cereal1", "mfr": "N"},
        {"name": "cereal2", "mfr": "K"},
    ]
    result = nabisco_cereals(cereals)
    assert len(result) == 1
    assert result == [{"name": "cereal1", "mfr": "N"}]
```
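
The same direct-invocation style extends to assets with more than one upstream input. The sketch below repeats the body of `highest_protein_nabisco_cereal` as an undecorated function so that it runs without Dagster installed. The input rows and fractions are hypothetical values chosen for the test.

```python
def highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions):
    # Same logic as the asset, minus the @asset decorator.
    sorted_by_protein = sorted(
        nabisco_cereals, key=lambda cereal: cereal_protein_fractions[cereal["name"]]
    )
    return sorted_by_protein[-1]["name"]


def test_highest_protein_nabisco_cereal():
    nabisco_cereals = [
        {"name": "cereal1", "mfr": "N"},
        {"name": "cereal2", "mfr": "N"},
    ]
    # cereal3 has the highest fraction overall but is not in the Nabisco list.
    cereal_protein_fractions = {"cereal1": 0.05, "cereal2": 0.10, "cereal3": 0.20}
    result = highest_protein_nabisco_cereal(nabisco_cereals, cereal_protein_fractions)
    assert result == "cereal2"


test_highest_protein_nabisco_cereal()
```

Supplying both upstream values by hand keeps the test independent of the network download in `cereals`.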
|
||
We'll also write a test for all the assets together. To do that, we need to combine them into an <PyObject object="AssetGroup" />. Then, we can invoke <PyObject object="AssetGroup" method="materialize" />, which returns an <PyObject module="dagster" object="ExecuteInProcessResult" />, whose methods let us investigate, in detail, the success or failure of execution, the values produced by the computation, and (as we'll see later) other events associated with execution.
|
||
```python file=/guides/dagster/asset_tutorial/complex_asset_graph_tests.py startafter=start_asset_group_test endbefore=end_asset_group_test
from dagster import AssetGroup


def test_cereal_asset_group():
    group = AssetGroup(
        [
            nabisco_cereals,
            cereals,
            cereal_protein_fractions,
            highest_protein_nabisco_cereal,
        ]
    )

    result = group.materialize()
    assert result.success
    assert result.output_for_node("highest_protein_nabisco_cereal") == "100% Bran"
```
|
||
Now you can use pytest, or your test runner of choice, to run the unit tests. | ||
|
||
```bash | ||
pytest test_complex_asset_graph.py | ||
``` | ||
|
||
Dagster is written to make testing easy in a domain where it has historically been very difficult. You can learn more about Testing in Dagster by reading the [Testing](/concepts/testing) page. | ||
|
||
<br /> | ||
|
||
## Conclusion | ||
|
||
🎉 Congratulations! Having made it this far, you now have a working, testable, and maintainable group of software-defined assets.
|
||
<br /> |
1600323
Successfully deployed to the following URLs:
dagster – ./docs/next
docs.dagster.io
dagster.vercel.app
dagster-git-master-elementl.vercel.app
dagster-elementl.vercel.app
new-docs.dagster.io