# Introduction
This example code requires that you have followed all the steps in the [README](../README.md) and also have worked through [Example 1](./1_get_data.ipynb).

In this example we will explore retrieving data on behalf of another company.
<br>This works by utilizing [Altinn Delegation](https://docs.digdir.no/docs/Maskinporten/maskinporten_func_delegering.html).

Typical use cases: you are a service provider that retrieves data on behalf of your customers, or you are a company that retrieves data on behalf of another company.
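
As far as the Maskinporten documentation describes it, delegation is expressed in the JWT grant the client sends to the token endpoint: the supplier adds a `consumer_org` claim holding the organisation number of the company it acts on behalf of. The sketch below only builds such a claim set; it does not sign or send anything. The helper name `build_grant_claims`, the placeholder client ID, the sample organisation number and the 120-second lifetime are assumptions for illustration, and this example does not show whether `lib.maskinporten` sets this claim.

```python
import time
import uuid


def build_grant_claims(client_id, audience, scope, resource,
                       consumer_org=None, lifetime_seconds=120):
    """Build the claim set for a JWT grant (hypothetical helper, not part of lib.maskinporten)."""
    now = int(time.time())
    claims = {
        "aud": audience,                 # Who the grant is addressed to
        "iss": client_id,                # The client that issues the grant
        "scope": scope,
        "resource": resource,
        "iat": now,                      # Issued at
        "exp": now + lifetime_seconds,   # Expiry, kept short on purpose
        "jti": str(uuid.uuid4()),        # Unique ID for this grant
    }
    if consumer_org is not None:
        # Assumption: the organisation number of the company you act on behalf of
        claims["consumer_org"] = consumer_org
    return claims


delegated_claims = build_grant_claims(
    client_id="my-client-id",            # Placeholder
    audience="https://test.sky.maskinporten.no",
    scope="dsb:data/dlesupervision.read",
    resource="https://data.dsb.no",
    consumer_org="999999999",            # Made-up organisation number
)
print(delegated_claims["consumer_org"])
```

Without `consumer_org` the same function returns a claim set for a token requested on your own behalf.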

## Define variables
In the cell below, define the variables that are used by the code blocks in the rest of this notebook.

In [None]:
# --------------------------------------------------
# Maskinporten
# --------------------------------------------------
maskinporten_private_key_file_path = "./private_key.pem" # Path to the private key pem file
maskinporten_client_key_id = "bf253e40-5d6a-49bd-88ff-f6c0625eff8c" # The ID of the key the private key corresponds to
maskinporten_client_id = "eac8848f-a662-46b5-bff4-0447db28bb7e" # The ID of the client you created for Maskinporten in Selvbetjeningsportalen
maskinporten_audience = "https://test.sky.maskinporten.no" # The Skyporten / Maskinporten audience (this URL is the test environment)
maskinporten_scope = "dsb:data/dlesupervision.read"  # The scope you received from DSB
maskinporten_resource = "https://data.dsb.no" # The resource identifier / audience you received from DSB

# --------------------------------------------------
# Delta Sharing
# --------------------------------------------------
delta_sharing_endpoint = "https://norwayeast.azuredatabricks.net/api/2.0/delta-sharing/metastores/6d2d21f2-bfea-44bb-bb60-4eefa6be569b/recipients/a1ed2789-9e81-4673-b753-6974f957aa76" # The delta sharing endpoint you received from DSB

## Acquire an access token from Skyporten / Maskinporten
In the code below we will request an access token from Skyporten / Maskinporten.

The access token is stored in the `access_token` variable and is used in the subsequent code blocks in this file.

The access token is also written to the output of the code cell, so you can see both its decoded JSON claims and the encoded (base64url) token that is sent as a bearer token.
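
The access token is a JWT: three base64url-encoded segments (header, payload, signature) joined by dots. As a minimal sketch of what decoding involves, the snippet below reads the payload using only the standard library. `decode_jwt_payload` is a hypothetical helper, and the token is a hand-made, unsigned sample rather than a real Maskinporten token. It does not verify the signature, so it is suitable for inspection only.

```python
import base64
import json


def decode_jwt_payload(token):
    """Return the payload (claims) of a JWT as a dict, without verifying the signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    # JWT segments drop the base64 padding, so add it back before decoding
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _encode_segment(data):
    """Encode a dict as an unpadded base64url JWT segment (only used to build the sample)."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hand-made sample token with an empty signature segment
sample_token = ".".join([
    _encode_segment({"alg": "none", "typ": "JWT"}),
    _encode_segment({"sub": "example-subject", "scope": "dsb:data/dlesupervision.read"}),
    "",
])

sample_payload = decode_jwt_payload(sample_token)
print(json.dumps(sample_payload, indent=2))
```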

In [None]:
# Import the helper function in the lib folder to get an access token from Maskinporten
from lib.maskinporten import get_maskinporten_access_token

# This code is for reloading the module above to make sure changes are up-to-date if running this script multiple times
import lib.maskinporten
import importlib
importlib.reload(lib.maskinporten)

# Read the private key from a file (the with-block closes the file afterwards)
with open(maskinporten_private_key_file_path, "rb") as key_file:
    private_key = key_file.read()

# Request an access token from Maskinporten
access_token = get_maskinporten_access_token(
    key_id=maskinporten_client_key_id,
    client_id=maskinporten_client_id,
    audience=maskinporten_audience,
    scope=maskinporten_scope,
    resource=maskinporten_resource,
    private_key=private_key,
)

# Decode and print the access token (for inspection only, the signature is not verified)
import jwt
decoded = jwt.decode(
    access_token,
    options={"verify_signature": False},
    algorithms=["RS256"],
)

print("Decoded access token:")
import json
print(json.dumps(decoded, indent=2))

print("Encoded access token (base64url JWT):")
print(access_token)

print()
print(f"The sub-value '{decoded['sub']}' is what you need to send to DSB")

## Delta sharing
The code below will use the `access_token` variable from the previous code block to create a delta sharing profile JSON file.

The Delta Sharing client uses this profile file to access the data on the Delta Sharing server/endpoint.
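
For reference, a Delta Sharing profile is a small JSON document with the fields `shareCredentialsVersion`, `endpoint` and `bearerToken`. The sketch below writes such a file using only the standard library. It is an assumption about roughly what a helper like `create_sharing_profile` produces, not its actual implementation; `write_profile_file`, the endpoint and the token value are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_profile_file(folder, profile_name, endpoint, bearer_token):
    """Write a Delta Sharing profile file and return its path (hypothetical helper)."""
    profile_path = Path(folder) / f"{profile_name}.share"
    profile_content = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    profile_path.write_text(json.dumps(profile_content, indent=2), encoding="utf-8")
    return str(profile_path)


# Write a demo profile to a temporary folder and read it back
demo_folder = tempfile.mkdtemp()
demo_profile_path = write_profile_file(
    folder=demo_folder,
    profile_name="demo_profile",
    endpoint="https://example.invalid/delta-sharing/",  # Placeholder endpoint
    bearer_token="not-a-real-token",                    # Placeholder token
)
demo_profile = json.loads(Path(demo_profile_path).read_text(encoding="utf-8"))
print(demo_profile["endpoint"])
```

Because the bearer token is stored in the file, the profile has to be recreated with a fresh token once the access token has expired.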

In [None]:
# Import helper functions for Delta Sharing
from lib.deltasharing import create_sharing_profile, get_table_urls
# Reload the module if running this script multiple times
import importlib
import lib.deltasharing
importlib.reload(lib.deltasharing)

# Create the Delta Sharing profile JSON file
profile = create_sharing_profile(
    profile_name="dsb_maskinporten_profile",
    bearer_token=access_token,
    endpoint=delta_sharing_endpoint
)

In [None]:
# --------------------------------------------------
# Testing access
# Using the created profile we will connect to the Delta Sharing server and list all available tables
# --------------------------------------------------
from delta_sharing import SharingClient
client = SharingClient(profile)

# List all tables using the Delta Sharing client
print(client.list_all_tables())

# List all tables using the get_table_urls helper function
table_urls = sorted(get_table_urls(profile, client))
print("\n".join(table_urls))


### Pandas
Pandas is useful when working with smaller datasets, as you do not need to set up a Spark cluster to work with the data.<br>
Pandas also integrates well with the [Data Wrangler](vscode:extension/ms-toolsai.datawrangler) Visual Studio Code extension.
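
`load_as_pandas` takes a table URL of the form `<profile-file>#<share>.<schema>.<table>`. The sketch below builds and splits such a URL; `make_table_url` and `split_table_url` are hypothetical helpers, and the profile path is a placeholder.

```python
def make_table_url(profile_path, share, schema, table):
    """Build a Delta Sharing table URL from its parts."""
    return f"{profile_path}#{share}.{schema}.{table}"


def split_table_url(table_url):
    """Split a table URL into (profile_path, share, schema, table)."""
    # Split on the last '#' so that a '#' in the profile path does not break parsing
    profile_path, _, table_ref = table_url.rpartition("#")
    share, schema, table = table_ref.split(".")
    return profile_path, share, schema, table


example_url = make_table_url(
    "./dsb_maskinporten_profile.share",  # Placeholder profile path
    "keb_test", "gold", "age_stats_tbl",
)
print(example_url)
print(split_table_url(example_url))
```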

In [None]:
#
# Consume with Pandas
#
from delta_sharing import load_as_pandas

# Load a specific table as a Pandas DataFrame
pandas_df = load_as_pandas(url=table_urls[-1])

# Show first 5 rows
print(pandas_df.head(n=5))

# Show summary statistics of the DataFrame
pandas_df.describe()

In [None]:
from lib.deltasharing import pandas_dump_tables

pandas_dump_tables(
    table_urls=table_urls,
    data_folder="./data"
)

### Spark
Spark should be used when working with large datasets, as it sends the jobs to a local Spark cluster that optimizes them.

<span style="color:orange">Warning ⚠</span> The Spark setup code below can only run once per Jupyter session.

In [None]:

from pyspark.sql import SparkSession
from pyspark import SparkConf 
from delta_sharing import load_as_spark

if "spark" in globals():
    print("Spark session already exists, reusing it.")
else:
    spark_conf = (
        SparkConf()
        .set("spark.jars.packages", "io.delta:delta-sharing-spark_2.12:3.3.1") # org.apache.hadoop:hadoop-azure:3.3.1
        .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    spark = (
        SparkSession.builder
        .master("local[*]")  # Use all available cores
        .config(conf=spark_conf)
        .getOrCreate()
    )

In [None]:
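# Load a specific table as a Spark DataFrame (URL format: <profile>#<share>.<schema>.<table>)
spark_df = load_as_spark(f"{profile}#keb_test.gold.age_stats_tbl")

# Show the first 20 rows (the default for show())
spark_df.show()

# Show the column names and types
spark_df.printSchema()

# Show summary statistics of the DataFrame
spark_df.summary().show()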