# Data Pipeline Orchestration with Dagster

In this notebook we get to know the basics of Dagster by creating a simple data pipeline.

## Project Setup

### Some Preparations

Change the current working directory to `/workshop`.

In [1]:
import os
os.chdir("/workshop")

## Dagster Op Jobs

In this exercise we will define a Dagster job using Ops.

In the file `dagster_exercise_ops` you will find four functions that we will use as Ops. Notice that some of the functions return one or more objects. This will be important when we start to combine the Ops later. You don't need to change the function definitions here.

The Ops should be executed in the following order:
1. download_data
2. split_data
3. train_model
4. evaluate_model

Note: The functions are imported into `dagster_exercise_ops_job` to keep this exercise clean. All adjustments could also be made in the original file `dagster_exercise_ops`.

### Exercise 1: Define your pipeline

Fill the gaps in the `dagster_exercise_ops_job` file to create a Dagster job. If you need any help, have a look at the [Dagster documentation of Ops](https://docs.dagster.io/_apidocs/ops).


### Exercise 2: Start the Dagster UI

After completing the creation of the pipeline, start the Dagster UI using the following command:

In [None]:
%%bash
dagster dev -f notebooks/dagster/dagster_exercise_ops_job.py --host 0.0.0.0

2023-11-06 11:22:56 +0000 - dagster - INFO - Using temporary directory /workshop/tmptr6e9uj_ for storage. This will be removed when dagster dev exits.
2023-11-06 11:22:56 +0000 - dagster - INFO - To persist information across sessions, set the environment variable DAGSTER_HOME to a directory to use.
2023-11-06 11:22:56 +0000 - dagster - INFO - Launching Dagster services...

  Telemetry:

  As an open source project, we collect usage statistics to inform development priorities. For more
  information, read https://docs.dagster.io/getting-started/telemetry.

  We will not see or store solid definitions, pipeline definitions, modes, resources, context, or
  any data that is processed within solids and pipelines.

  To opt-out, add the following to $DAGSTER_HOME/dagster.yaml, creating that file if necessary:

    telemetry:
      enabled: false


  Welcome to Dagster!

  If you have any questions or would like to engage with the Dagster


* 'schema_extra' has been renamed to 'json_schema_extra'


2023-11-06 11:23:07 +0000 - dagster.daemon - INFO - Instance is configured with the following daemons: ['AssetDaemon', 'BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']



2023-11-06 11:23:08 +0000 - dagster-webserver - INFO - Serving dagster-webserver on http://0.0.0.0:3000 in process 70



2023-11-06 11:24:42 +0000 - dagster - DEBUG - data_pipeline - 3c748465-5987-4fe6-aa67-0c9cae5de16b - 3119 - RUN_START - Started execution of run for "data_pipeline".
2023-11-06 11:24:42 +0000 - dagster - DEBUG - data_pipeline - 3c748465-5987-4fe6-aa67-0c9cae5de16b - 3119 - ENGINE_EVENT - Executing steps using multiprocess executor: parent process (pid: 3119)
2023-11-06 11:24:42 +0000 - dagster - DEBUG - data_pipeline - 3c748465-5987-4fe6-aa67-0c9cae5de16b - 3119 - download_data_op - STEP_WORKER_STARTING - Launching subprocess for "download_data_op".

2023-11-06 11:24:50 +0000 - dagster - DEBUG - data_pipeline - 3c748465-5987-4fe6-aa67-0c9cae5de16b - 3357 - STEP_WORKER_STARTED - Executing step "download_data_op" in subprocess.
2023-11-06 11:24:50 +0000 - dagster - DEBUG - data_pipeline - 3c748465-5987-4fe6-aa67-0c9ca

Open the Dagster UI in your browser (the log above shows it is served on port 3000). You will see a graph representing the job `data_pipeline`. All Ops should be shown, along with their inputs, their outputs, and how they are connected.

### Exercise 3: Add missing Configs

Click on the `Launchpad` tab. Here you can add the missing configuration of your pipeline. You might have already noticed that your first op needs two parameters, `url` and `path`. Dagster is aware of this as well and does not let you start a new run without them. It also shows an error at the bottom left.

Add the missing config parameters by clicking `Scaffold missing config` underneath the error message. Fill in the empty parameter values as follows:

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
path = 'data/winequality-red.csv'

### Exercise 4: Launch the Job run

Now launch a run.

While the job is running, you can follow the execution of the pipeline in the Dagster UI.

Explore the UI on your own and see what Dagster tracks for you.


## Dagster Asset Jobs

Similar to the Dagster Op Jobs, we now want to define a job using Assets.

We prepared the Asset functions for you in the `dagster_exercise_assets.py` file. The tasks performed are the same as in the Dagster Op job exercise, but we changed the functions so that they all return an object. These objects will later be represented by the Assets.

### Exercise 1: Define the assets

In `dagster_exercise_assets_job` you will find cleaner representations of the tasks to perform.

Define the given functions as Dagster Assets as described in the job file.

Note that you don't need to define an extra job function. Dagster will know how to assemble the underlying Ops if your Asset definitions are correct.

### Exercise 2: Start the Dagster UI

After completing the creation of the Assets, start the Dagster UI using the following command:

In [None]:
%%bash
dagster dev -f notebooks/dagster/dagster_exercise_assets_job.py

You will see a graph representing the defined Asset group. All Assets should be shown, along with their connections to each other.

None of the Assets have been materialized yet.

### Exercise 3: Materialize the Assets

Run `Materialize all` on the Assets and watch them turn green.

### Exercise 4: Explore the UI

Have a look at the information Dagster gives you about the materialized Assets. Also notice that there are runs, just as you have already seen for the Dagster Op jobs.
