### Setup

Make sure you have the files available from previous demos.

In [0]:
# This cell sets all the configuration parameters to connect to Azure Data Lake
spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "****************************")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "*******************************")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/************************/oauth2/token")
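
The five settings above all share the same account suffix, so they can be generated in one place. The following is a minimal sketch, not the notebook's own method: `adls_oauth_conf` and `apply_conf` are hypothetical helper names, and the argument values are placeholders. On Databricks, the client secret would typically be read from a secret scope rather than pasted into the cell.

```python
def adls_oauth_conf(account_name, client_id, client_secret, tenant_id):
    """Build the OAuth settings for one storage account as a dict.

    Hypothetical helper: the keys mirror the spark.conf.set calls above.
    """
    host = f"{account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark_session, conf):
    """Apply every key/value pair to the session configuration."""
    for key, value in conf.items():
        spark_session.conf.set(key, value)


# In a notebook, with placeholder values (not real credentials):
# apply_conf(spark, adls_oauth_conf("<account_name>", "<client_id>",
#                                   "<client_secret>", "<tenant_id>"))
```

Keeping the settings in a dict also makes it easy to inspect the keys before applying them.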

Verify that the cloud storage is accessible:

In [0]:
dbutils.fs.ls("abfss://pyspark@warnerdatalake.dfs.core.windows.net/")

[FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/exports/', name='exports/', size=0, modificationTime=1740581924000),
 FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/', name='imports/', size=0, modificationTime=1740581918000)]

Let's load our two datasets, transactions and customers:

In [0]:
from pyspark.sql import functions as F

# Paths to datasets
transactions_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/transactions_data.parquet"
customers_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.parquet"

# Load DataFrames
df_transactions = spark.read.parquet(transactions_path)
df_customers = spark.read.parquet(customers_path)

# Display sample data
df_transactions.limit(5).display()
df_customers.limit(5).display()



transaction_id,customer_id,transaction_date,amount,category
1,3065,2025-03-17,76.1,Clothes
2,3274,2025-02-18,91.91,Clothes
3,130,2025-01-10,11.81,Accessories
4,320,2025-03-06,20.37,Furniture
5,6480,2025-03-22,12.31,Beauty


customer_id,first_name,last_name,email,age,country
1,First_1,Last_1,First_1.Last_1@example.com,40,Canada
2,First_2,Last_2,First_2.Last_2@example.com,55,USA
3,First_3,Last_3,First_3.Last_3@example.com,59,USA
4,First_4,Last_4,First_4.Last_4@example.com,49,Canada
5,First_5,Last_5,First_5.Last_5@example.com,58,Canada
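
Both dataset paths hang off the same container root. As a small sketch, a helper (hypothetical name `abfss_path`) can assemble them from their parts so that stray or doubled slashes never end up in the URI. The container and account names are the ones used in this notebook.

```python
def abfss_path(container, account, *parts):
    """Join path parts under an abfss:// root using single slashes."""
    root = f"abfss://{container}@{account}.dfs.core.windows.net"
    # Drop leading/trailing slashes from each part and skip empty parts.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([root] + cleaned)


transactions_path = abfss_path("pyspark", "warnerdatalake",
                               "imports", "transactions_data.parquet")
customers_path = abfss_path("pyspark", "warnerdatalake",
                            "imports", "customers_data.parquet")
```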


Let's join the tables on `customer_id`:


In [0]:
df_join = df_transactions.join(df_customers, "customer_id", "inner")
df_join.count()

1000000

In [0]:
df_join_sum = df_join \
    .groupBy("country") \
    .agg(F.sum("amount").alias("total_amount"))

# Display the result
df_join_sum.display()

df_join_sum.explain(mode="formatted")

country,total_amount
USA,35188490.35
UK,1529744.27
Canada,13281149.6


== Physical Plan ==
AdaptiveSparkPlan (11)
+- == Initial Plan ==
   HashAggregate (10)
   +- Exchange (9)
      +- HashAggregate (8)
         +- Project (7)
            +- BroadcastHashJoin Inner BuildRight (6)
               :- Filter (2)
               :  +- Scan parquet  (1)
               +- Exchange (5)
                  +- Filter (4)
                     +- Scan parquet  (3)


(1) Scan parquet 
Output [2]: [customer_id#1458, amount#1460]
Batched: true
Location: InMemoryFileIndex [abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/transactions_data.parquet]
PushedFilters: [IsNotNull(customer_id)]
ReadSchema: struct<customer_id:int,amount:decimal(10,2)>

(2) Filter
Input [2]: [customer_id#1458, amount#1460]
Condition : isnotnull(customer_id#1458)

(3) Scan parquet 
Output [2]: [customer_id#1467L, country#1472]
Batched: true
Location: InMemoryFileIndex [abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.parquet]
PushedFilters: [IsNotNull(customer_id)
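
The plan above shows two Parquet scans feeding a `BroadcastHashJoin`. Instead of reading the plan by eye, we can check for an operator programmatically. This is a rough sketch with hypothetical helper names (`plan_text`, `has_operator`); it assumes `explain()` writes to Python's standard output, which is how PySpark's `DataFrame.explain` behaves, and it does a plain substring match rather than parsing the plan.

```python
import contextlib
import io


def plan_text(df, mode="formatted"):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode=mode)
    return buffer.getvalue()


def has_operator(plan, operator):
    """Return True if the operator name appears anywhere in the plan text."""
    return operator in plan


# In a notebook:
# plan = plan_text(df_join_sum)
# has_operator(plan, "BroadcastHashJoin")
# has_operator(plan, "Scan parquet")
```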

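If we are going to keep working with the joined data, we can cache it so that later queries read the join result from the cache instead of rescanning the Parquet files and repeating the join. `cache()` is lazy, so we follow it with `count()` to materialize the cached data.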

In [0]:
df_join.cache()
df_join.count()


1000000
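
Cached data occupies executor storage until it is released with `unpersist()`. As a sketch of one way to keep the two calls paired, the hypothetical context manager `cached` below caches a DataFrame, materializes it, and releases it when the block ends. It only relies on the `cache()`, `count()`, and `unpersist()` methods of the DataFrame passed in.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df, materialize it, and release it when the block exits."""
    df.cache()
    df.count()          # an action is needed to actually fill the cache
    try:
        yield df
    finally:
        df.unpersist()  # free the cached blocks once we are done


# In a notebook:
# with cached(df_join) as df:
#     df.groupBy("country").agg(F.sum("amount").alias("total_amount")).display()
```

The `try`/`finally` ensures the cache is released even if a query inside the block fails.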

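Now run the aggregation again. This time the aggregation reads from an `InMemoryTableScan`; the join plan shown beneath `InMemoryRelation` only describes how the cached data was built.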

In [0]:
df_join_sum = df_join \
    .groupBy("country") \
    .agg(F.sum("amount").alias("total_amount"))

# Display the result
df_join_sum.display()

df_join_sum.explain(mode="formatted")

country,total_amount
USA,35188490.35
UK,1529744.27
Canada,13281149.6


== Physical Plan ==
AdaptiveSparkPlan (14)
+- == Initial Plan ==
   HashAggregate (13)
   +- Exchange (12)
      +- HashAggregate (11)
         +- InMemoryTableScan (1)
               +- InMemoryRelation (2)
                     +- AdaptiveSparkPlan (10)
                     +- == Initial Plan ==
                        Project (9)
                        +- BroadcastHashJoin Inner BuildRight (8)
                           :- Filter (4)
                           :  +- Scan parquet  (3)
                           +- Exchange (7)
                              +- Filter (6)
                                 +- Scan parquet  (5)


(1) InMemoryTableScan
Output [2]: [amount#1460, country#1472]
Arguments: [amount#1460, country#1472]

(2) InMemoryRelation
Arguments: [customer_id#1458, transaction_id#1457L, transaction_date#1459, amount#1460, category#1461, first_name#1468, last_name#1469, email#1470, age#1471L, country#1472], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCach