# Data Pipeline Orchestration with Dagster Assets

In this notebook we will get to know the basics of Dagster assets. To do so, we will build a simple data preprocessing pipeline.

**Dagster's definition of an asset and a software-defined asset (SDA):**
> An asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A software-defined asset is a description, in code, of an asset that should exist and how to produce and update that asset.

In our use case we look at a song dataset from Spotify. We want to clean the data, find and remove duplicates, and prepare it so that it could be used to predict a song's music genre.

To that end, we planned a small preprocessing pipeline which performs the following steps:

1. Load song data -> `song_data` Asset
2. Remove unnecessary data -> `data_cleaned` Asset
3. Extract Duplicates -> `duplicates` Asset
4. Deduplicate data -> `data_deduplicated` Asset
5. Perform one-hot-encoding on the `key` column -> `data_encoded` Asset
6. Standardize columns with a wide value range and save as a csv -> `data_standardized` Asset
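
To make the six steps concrete, here is a plain-pandas sketch on a tiny made-up table, without any Dagster involved. The table contents are invented for illustration; only the column names `key` and `tempo` mirror the real data. Note that step 6 as implemented in this notebook rescales values to the range [0, 1] (min-max scaling).

```python
import pandas as pd

# 1. "Load" the data: a made-up stand-in for the song data set.
toy = pd.DataFrame(
    {
        "id": ["x1", "x2", "x3"],
        "key": [0, 5, 5],
        "tempo": [100.0, 140.0, 140.0],
    }
)

# 2. Remove unnecessary data: without "id", the last two rows become identical.
cleaned = toy.drop(["id"], axis=1)

# 3. Extract duplicates: rows that repeat an earlier row.
duplicates = cleaned[cleaned.duplicated()]

# 4. Deduplicate: keep the first occurrence of each row.
deduplicated = cleaned.drop_duplicates(keep="first")

# 5. One-hot-encode "key" into one indicator column per value.
encoded = pd.get_dummies(deduplicated, columns=["key"], prefix="key")

# 6. Rescale "tempo" to the range [0, 1] (min-max scaling).
standardized = encoded.copy()
tempo = standardized["tempo"]
standardized["tempo"] = (tempo - tempo.min()) / (tempo.max() - tempo.min())

print(len(duplicates), list(encoded.columns), standardized["tempo"].tolist())
```

Each intermediate variable corresponds to one asset of the pipeline; the tasks below wrap the same kind of logic in Dagster assets.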

The code for these tasks is already provided. All you need to do is put their logic together in the form of a dagster asset graph.

After the definitions are complete, we will have a look at the Dagster UI and materialize the assets.
Finally, we will define asset jobs as an alternative way to materialize a set of assets and run them in the UI.

Here are the imports we will need for the whole task.

In [None]:
from typing import List
import pandas as pd
from dagster import asset, Config, Output, define_asset_job, Definitions, AssetSelection
from pandas import DataFrame

## 1) Load song data

The function `load_data` contains the logic with which the asset is generated.
`DataConfig` is a Dagster config class that makes it possible to adjust the materialization of the asset, for example by providing a different `input_file` path. You will see that this config can be modified via the Dagster UI without changing the underlying code, which makes the SDA very flexible.

By adding the `asset` decorator to the `load_data` function, you define it as a Dagster asset and can already see and use it in the Dagster UI.
We also want to provide some more information about the asset. As `load_data` is not a good name for a data instance, rename the asset to `song_data` and give it a proper description. Both can be done by modifying the `asset` decorator.

In [None]:
class DataConfig(Config):
    input_file: str = "data/genres_v2.csv"
    url: str = "https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify/data"


# ...
def load_data(config: DataConfig) -> DataFrame:
    return pd.read_csv(config.input_file)

## 2) Remove unnecessary data

Let's have a look at the next asset. Its name is fine, but we still have to define it as a Dagster asset, and it has no description yet. Additionally, the name of the input dataframe has to match the name of our asset from task 1; Dagster uses the matching names to connect the two assets.
Modify the code cell to add the missing information.

In [None]:
# ...
def data_cleaned(song_data: pd.DataFrame):
    return song_data.drop(
        [
            "type",
            "id",
            "uri",
            "track_href",
            "analysis_url",
            "song_name",
            "Unnamed: 0",
            "title",
        ],
        axis=1,
    )

## 3) Extract Duplicates

For this task we have to define the function as an asset and give it a proper description.
Since our preprocessing pipeline or input data could change, it would be useful to know how many duplicated entries our dataset contains. By monitoring this number, we would notice if it changes.

For Dagster assets you can track such information by attaching additional `metadata` to your output. `dagster.Output` provides a simple way to do so. Instead of returning `df` directly, return an `Output` object with `df` as its value and a dictionary as `metadata`. This metadata dict should contain one entry for the number of rows and another for the number of columns.

As the metadata values are numeric, Dagster will automatically plot them over time in the Dagster UI. So if you materialize the asset multiple times, you can see how the values changed from one materialization to the next.

In [None]:
# ...
def duplicates(data_cleaned: pd.DataFrame):
    df = data_cleaned[data_cleaned.duplicated()]
    return df

## 4) Deduplicate Data & 5) Perform one-hot-encoding on the 'key' column

For the next two assets, do the same as you did in 3) for the `duplicates` asset:
define them as assets, add a proper description and add metadata containing the number of rows and columns.

In [None]:
# ...
def data_deduplicated(data_cleaned: pd.DataFrame):
    df = data_cleaned.drop_duplicates(keep="first")
    return df

In [None]:
# ...
def data_encoded(data_deduplicated: pd.DataFrame):
    df = pd.get_dummies(data_deduplicated, columns=["key"], prefix="key")
    return df

## 6) Standardize columns with a wide value range and save as a csv

This is the last asset you need to define for the preprocessing pipeline. On top of the basic tasks you have done in tasks 3, 4 and 5, define a new metadata entry. This time it is not a numeric value but the list of standardized columns.

You may have noticed that the columns to standardize are hardcoded. Let's change that by providing a config for this asset. The `StandardizationConfig` should be a subclass of `dagster.Config` with a single attribute `columns_to_standardize`. Add the config as an input of `data_standardized` and replace the `columns_to_standardize` variable with the corresponding config attribute.

If you need guidance, have a look at the first asset. 

In [None]:
# ...


# ...
def data_standardized(data_encoded: pd.DataFrame):
    pd.set_option("display.max_columns", 500)
    data_encoded.describe()
    columns_to_standardize: List[str] = ["duration_ms", "tempo"]
    for col in columns_to_standardize:
        data_encoded[col] = (data_encoded[col] - data_encoded[col].min()) / (
            data_encoded[col].max() - data_encoded[col].min()
        )
    data_encoded.to_csv("data/genres_standardized.csv", sep=";", index=False)
    return data_encoded

## 7) Have a look at the Dagster UI

We now have all of our assets defined, combined and configured. Let's have a look at the [Dagster UI](http://localhost:3000).

The UI was started automatically when the Docker container of this workshop was started.

Navigating to the `Assets` tab, you can see all the assets you defined, where they are located and whether they have been materialized yet.

*Don't see anything?* You may need to reload your code locations. Navigate to the `Deployments` tab and reload `dagster_exercise_assets_job.py`.

Select the `song_data` asset and materialize it. You will notice that a run is started, which performs the materialization. Once the run has completed successfully, the status of the asset changes to `Materialized`.

Navigate to the `global asset lineage` via the link in the upper right corner. Here you can see how the assets are connected with each other. Materialize all assets to check whether your definitions are correct.

After materialization, you can get more information about the assets by clicking on them. Have a look at `data_deduplicated` and the generated metadata plots. Optionally, redo the materialization and see how new metadata values are added. Dagster will also notice that the downstream assets are now outdated, as `data_deduplicated` was updated separately.

Optionally, have a look at an asset in the asset catalog. It stores all information about the asset, such as metadata, materialization timestamp, previous materializations, lineage, code version and data version.

Congratulations! You successfully defined your first assets and combined them into a working preprocessing pipeline!

## 8) Define Asset jobs

Materializing via the Dagster UI is simple, but if you need to materialize a set of assets repeatedly, there is a better way: Dagster asset jobs.

With an asset job, you combine the materialization of multiple assets in one job. This is similar to selecting multiple assets in the `Asset` tab or `global asset lineage` and clicking materialize. However, you define the job in your code, which lets you capture recurring workflows as named, reusable jobs.

An asset job can be defined using Dagster's `define_asset_job` function. You need to set the `name` parameter and may pass a selection of the assets you want to materialize in this job (a string or a list of strings). If you don't pass a selection, all assets in the same code location (here, the same file) will be selected.

Define an `all_assets_job`, which materializes all assets.
Define a `get_duplicates_job`, which only materializes the `duplicates` asset.

In [None]:
# uncomment the following lines and complete them
# get_all_assets_job = define_asset_job(...)
# get_duplicates_job = define_asset_job(...)

Unlike assets, jobs are not automatically displayed in the Dagster UI. You need to list all jobs (and assets) you want to see in a Dagster `Definitions` object, using its `assets` and `jobs` parameters.

In [None]:
# uncomment the following lines and complete them
# defs = Definitions(
#     assets=...,
#     jobs=...,
# )

Have a look at the dagster UI again. 

*Don't see the jobs?* You may need to reload your code location in the `Deployment` tab to see the changes.

Navigate to `Overview` -> `Jobs` to see your defined jobs listed there. Click on one of the jobs to see the asset lineage of all assets that are materialized by this job. Run the materialization.

Congratulations! You also learned how to define asset jobs!

## 9) Optional Task: Define an asset group

As the number of assets can grow in a complex data processing pipeline, Dagster lets you define asset groups, which help keep your assets organized.

Not all of the above assets are needed to get a cleaned data set. The `duplicates` asset is informative, but not explicitly needed. We want to have the option to save time and only materialize the necessary assets. Therefore, we will group all but `duplicates` in an asset group called `datapreprocessing`.

Add the `group_name` parameter to all assets but `duplicates` and see how the asset lineage visualization changes.

Define an asset job only for the assets in the asset group. Instead of listing each asset separately, use `AssetSelection.groups("datapreprocessing")` to define the selection. Don't forget to add it to your Definitions.

Have a look at the dagster UI for the new job and materialize its assets.

Hooray! You have now gained a good insight into Dagster assets and their functionality!