### Install and configure Databand

Before using Databand tracking, you have to install and enable it. Set the following environment variables:
- DBND__TRACKING : enables dbnd tracking
- DBND__CORE__DATABAND_URL : address of your Databand application
- DBND__CORE__DATABAND_ACCESS_TOKEN : access token for your application; you can generate one in the application
- DBND__ENABLE__SPARK_CONTEXT_ENV : enables tracking of Spark features

In [0]:
%pip install dbnd

In [0]:
import os

os.environ['DBND__TRACKING'] = 'True'
os.environ['DBND__CORE__DATABAND_URL'] = 'X'
os.environ['DBND__CORE__DATABAND_ACCESS_TOKEN'] = 'X'
os.environ['DBND__ENABLE__SPARK_CONTEXT_ENV'] = 'True'
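
Since the cell above ships with `'X'` placeholders, a quick check before running the examples can catch a forgotten URL or token. The following is a minimal sketch; `missing_dbnd_settings` and `EXPECTED_DBND_VARS` are hypothetical names for this notebook, not part of the `dbnd` package, and the check only looks at the four variables listed in this section.

```python
import os

# The four variables this section sets; 'X' is the placeholder used in the cell above.
EXPECTED_DBND_VARS = (
    "DBND__TRACKING",
    "DBND__CORE__DATABAND_URL",
    "DBND__CORE__DATABAND_ACCESS_TOKEN",
    "DBND__ENABLE__SPARK_CONTEXT_ENV",
)
PLACEHOLDER = "X"


def missing_dbnd_settings(env=None):
    """Return names of expected variables that are unset, empty, or still 'X'.

    Hypothetical helper for this notebook; it is not part of the dbnd package.
    """
    env = os.environ if env is None else env
    return [
        name
        for name in EXPECTED_DBND_VARS
        if env.get(name, "").strip() in ("", PLACEHOLDER)
    ]


missing = missing_dbnd_settings()
if missing:
    print("Still to configure:", ", ".join(missing))
```

The helper takes an optional mapping so it can be tried against a plain dict without touching the real environment.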

### (Optional) Configure tracking parameters

Tracked tasks are easier to monitor in the app when you give them their own names. You can configure:
- JOB : job name (pipeline)
- PROJECT : project name
- NAME : run name

For more about tracking parameters, see the documentation:

https://www.ibm.com/docs/en/dobd?topic=integrations-python#tracking-parameters


In [0]:
os.environ['DBND__TRACKING__PROJECT'] = 'Databricks Spark Examples'
os.environ['DBND__RUN_INFO__NAME'] = 'Example spark tracking'
os.environ['DBND__TRACKING__JOB'] = 'Spark Examples'
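
If the same three parameters are set in several places, it may help to keep the mapping from parameter to environment variable in one spot. This is a minimal sketch; `tracking_env` is a hypothetical helper, and the variable names are simply the ones used in the cell above.

```python
import os


def tracking_env(job, project, run_name):
    """Map the three optional tracking parameters to their environment variables.

    Hypothetical helper; the variable names are taken from the cell above.
    """
    return {
        "DBND__TRACKING__JOB": job,
        "DBND__TRACKING__PROJECT": project,
        "DBND__RUN_INFO__NAME": run_name,
    }


# Apply the same values as the cell above in a single call.
os.environ.update(
    tracking_env(
        job="Spark Examples",
        project="Databricks Spark Examples",
        run_name="Example spark tracking",
    )
)
```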

### Import Spark and Databand dependencies


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from dbnd import dbnd_tracking, log_dataset_op, task, dbnd_tracking_stop

# Tracking dataset operations
## Scenario: Read data -> Modify -> Write

**Important! Run all the cells above, then run one selected example. If you want to try another example, rerun all the cells above first. Otherwise, results may be overwritten in the Databand app.**



### Example 1 - Track snippet of code

The simplest way to track dataset operations for any code snippet:


In [0]:
os.environ['DBND__TRACKING__JOB'] = 'Spark Example 1'

spark = SparkSession.builder.appName("SimplePlaybook").getOrCreate()

with dbnd_tracking():
    # Read operation
    data = [
        ("2024-08-01", 150),
        ("2024-08-02", 50),
        ("2024-08-03", 200),
        ("2024-08-04", 75),
    ]

    columns = ["transaction_date", "transaction_amount"]
    data = spark.createDataFrame(data, schema=columns)
    log_dataset_op(op_path="input_path", op_type="read", data=data)  # <- log info about the read

    # Data manipulations
    data = data.filter(col("transaction_amount") > 100)
    log_dataset_op(op_path="output_path", op_type="write", data=data)  # <- log info about the write

print("done")
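
To cross-check what Example 1 produces, the same filter can be reproduced in plain Python without Spark or Databand. This is only a sanity-check sketch on the sample rows from the example; it does not call any tracking API.

```python
# Same rows as in Example 1, filtered without Spark.
rows = [
    ("2024-08-01", 150),
    ("2024-08-02", 50),
    ("2024-08-03", 200),
    ("2024-08-04", 75),
]

# Equivalent of data.filter(col("transaction_amount") > 100)
kept = [(date, amount) for date, amount in rows if amount > 100]

print(len(rows), "rows read")     # 4 rows read
print(len(kept), "rows written")  # 2 rows written
print(kept)                       # [('2024-08-01', 150), ('2024-08-03', 200)]
```

So the DataFrame logged for the read has 4 rows, and the one logged for the write has 2.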