### Setup

Before running the notebook, complete these tasks:

- Get service credentials: Client ID `<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>` and Client Credential `<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>`. Follow the instructions in [Create service principal with portal](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal). 
- Get directory ID `<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>`: This is also referred to as *tenant ID*. Follow the instructions in [Get tenant ID](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-tenant-id). 
- If you haven't set up the service app, follow this [tutorial](https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse). Then grant access at the root directory or the desired folder level, either to the service app or to everyone.

In [0]:
# This cell sets all the configuration parameters to connect to Azure Data Lake
spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "****************************")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "*******************************")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/************************/oauth2/token")
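
The five settings above share one suffix, `<account_name>.dfs.core.windows.net`, so they can be generated from a few inputs. The following is a minimal sketch in plain Python, not part of the original notebook: the helper name `adls_oauth_conf` and its parameters are hypothetical, and the values in the usage lines are placeholders. It only builds a dictionary; in a notebook, each pair would then be applied with `spark.conf.set(key, value)`.

```python
def adls_oauth_conf(account_name, client_id, client_secret, tenant_id):
    """Build the ABFS OAuth settings for one storage account (hypothetical helper)."""
    suffix = f"{account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# Placeholder values; in a notebook, apply each pair with spark.conf.set(key, value).
conf = adls_oauth_conf("<account_name>", "<client-id>", "<client-secret>", "<tenant-id>")
for key in conf:
    print(key)
```

Keeping the settings in one function makes it harder to mistype the account suffix in one of the five keys.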

Verify that cloud storage is accessible

In [0]:
dbutils.fs.ls("abfss://pyspark@warnerdatalake.dfs.core.windows.net/")

[FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/exports/', name='exports/', size=0, modificationTime=1740581924000),
 FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/', name='imports/', size=0, modificationTime=1740581918000)]
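
The paths in this notebook follow the form `abfss://<container>@<account>.dfs.core.windows.net/<path>`. As a small illustration, the sketch below assembles such a URI from its parts in plain Python. The helper name `abfss_uri` is hypothetical and is not used elsewhere in the notebook.

```python
def abfss_uri(container, account, *parts):
    """Join path segments under an ADLS Gen2 container (hypothetical helper)."""
    # Strip stray slashes from each segment so the path part never contains "//".
    clean = [p.strip("/") for p in parts if p.strip("/")]
    return f"abfss://{container}@{account}.dfs.core.windows.net/" + "/".join(clean)

root = abfss_uri("pyspark", "warnerdatalake")
customers = abfss_uri("pyspark", "warnerdatalake", "imports", "customers_data")
print(root)
print(customers)
```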

Create a Customers dataframe

In [0]:
from pyspark.sql import functions as F

# Number of customers to simulate
num_customers = 10000

# List of countries and corresponding skewed weights
countries = ["USA", "Canada", "UK"]
country_weights = [0.7, 0.2, 0.1]  # 70% USA, 20% Canada, 10% UK

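# Function to get a weighted random selection.
# Each F.rand() call is an independent draw, so the second threshold is the
# probability of countries[1] given that countries[0] was not selected.
def get_weighted_country():
    return F.when(F.rand() < country_weights[0], countries[0]) \
            .when(F.rand() < country_weights[1] / (1 - country_weights[0]), countries[1]) \
            .otherwise(countries[2])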

# Create a customer dataframe with additional columns: age and country
df_customers = (
    spark.range(1, num_customers + 1)
         .withColumnRenamed("id", "customer_id")
         .withColumn("first_name", F.concat(F.lit("First_"), F.col("customer_id")))
         .withColumn("last_name", F.concat(F.lit("Last_"), F.col("customer_id")))
         .withColumn("email", F.concat(F.col("first_name"), F.lit("."), F.col("last_name"), F.lit("@example.com")))
         # Add skewed age column: about 70% aged 40 to 60, about 30% aged 18 to 39
         .withColumn("age", 
                     F.when(F.rand() < 0.3, (F.floor(F.rand() * 22) + 18))  # 18 to 39
                      .otherwise((F.floor(F.rand() * 21) + 40)))           # 40 to 60
         # Add skewed country column using weighted selection
         .withColumn("country", get_weighted_country())
)

df_customers.show(10)


+-----------+----------+---------+--------------------+---+-------+
|customer_id|first_name|last_name|               email|age|country|
+-----------+----------+---------+--------------------+---+-------+
|          1|   First_1|   Last_1|First_1.Last_1@ex...| 40| Canada|
|          2|   First_2|   Last_2|First_2.Last_2@ex...| 55|    USA|
|          3|   First_3|   Last_3|First_3.Last_3@ex...| 59|    USA|
|          4|   First_4|   Last_4|First_4.Last_4@ex...| 49| Canada|
|          5|   First_5|   Last_5|First_5.Last_5@ex...| 58| Canada|
|          6|   First_6|   Last_6|First_6.Last_6@ex...| 55|    USA|
|          7|   First_7|   Last_7|First_7.Last_7@ex...| 32|    USA|
|          8|   First_8|   Last_8|First_8.Last_8@ex...| 56|    USA|
|          9|   First_9|   Last_9|First_9.Last_9@ex...| 47|    USA|
|         10|  First_10|  Last_10|First_10.Last_10@...| 30| Canada|
+-----------+----------+---------+--------------------+---+-------+
only showing top 10 rows
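
When every branch of a chained `when` draws its own random number, a later branch is only reached by rows that failed the earlier ones, so its threshold has to be a conditional probability rather than a cumulative one. The sketch below checks this in plain Python without Spark. The function names `conditional_thresholds` and `pick` are hypothetical, and the fixed seed is only there to make the check repeatable.

```python
import random

def conditional_thresholds(weights):
    """Convert absolute weights into per-branch thresholds for independent draws."""
    thresholds = []
    remaining = 1.0
    for w in weights[:-1]:  # the last option is the fallback (otherwise)
        thresholds.append(w / remaining)
        remaining -= w
    return thresholds

def pick(options, thresholds, rng):
    """Mimic a chained when/otherwise where every branch draws a new random number."""
    for option, t in zip(options, thresholds):
        if rng.random() < t:
            return option
    return options[-1]

countries = ["USA", "Canada", "UK"]
weights = [0.7, 0.2, 0.1]
thresholds = conditional_thresholds(weights)

rng = random.Random(0)  # fixed seed only to make the check repeatable
n = 100_000
counts = {c: 0 for c in countries}
for _ in range(n):
    counts[pick(countries, thresholds, rng)] += 1
shares = {c: counts[c] / n for c in countries}
print(thresholds)
print(shares)
```

With weights of 0.7, 0.2, and 0.1, the second threshold works out to 0.2 / 0.3, and the simulated shares should land close to the intended 70/20/10 split.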


Save it into storage

In [0]:
csv_output_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data"
df_customers.coalesce(1).write.mode("overwrite").option("header", "true").csv(csv_output_path)


In [0]:
orc_output_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data"
df_customers.coalesce(1).write.mode("overwrite").orc(orc_output_path)

In [0]:
pqt_output_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data"
df_customers.coalesce(1).write.mode("overwrite").parquet(pqt_output_path)

In [0]:
json_output_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data"
df_customers.coalesce(1).write.mode("overwrite").json(json_output_path)

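All four cells write to the same `customers_data` path with `mode("overwrite")`, so each write replaces the previous output. After each file is generated, rename or move it so that it matches the `customers_data.*` pattern before running the next cell.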
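
The renaming step can be planned with a few lines of plain Python. This is a sketch under assumptions: the helper name `plan_rename`, the sample listing, and the target name `customers_data.csv` are hypothetical. It relies only on Spark's convention of naming data files `part-...` and marker files with a leading underscore.

```python
def plan_rename(file_names, base_name, extension):
    """Return (source, target) for the single part file in a Spark output folder."""
    # Spark names data files "part-..."; marker files such as _SUCCESS start with "_".
    parts = [n for n in file_names if n.startswith("part-")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0], f"{base_name}.{extension}"

# Hypothetical listing of an output folder written with coalesce(1)
listing = ["_SUCCESS", "part-00000-1234-c000.csv"]
source, target = plan_rename(listing, "customers_data", "csv")
print(source, "->", target)
```

In a Databricks notebook, the listing could come from `dbutils.fs.ls` and the move itself could be done with `dbutils.fs.mv`. The destination folder is left to the reader.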

Generate the transactions file

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# Define the list of categories
categories = ["Clothes", "Gadgets", "Food", "Toys", "Books", "Furniture", "Electronics", "Sports", "Beauty", "Accessories"]

# Create an array column literal with the categories
categories_array = F.array(*[F.lit(cat) for cat in categories])

# Number of transactions to simulate (at least 1M)
num_transactions = 1000000

df_transactions = (
    spark.range(1, num_transactions + 1)
         .withColumnRenamed("id", "transaction_id")
         # Assign a random customer_id from 1 to num_customers (defined in the Customers cell above)
         .withColumn("customer_id", (F.rand() * num_customers).cast("integer") + 1)
         # Generate a random transaction date between 2025-01-01 and 2025-03-31 (offsets of 0 to 89 days)
         .withColumn("transaction_date", 
                     F.expr("date_add('2025-01-01', cast(rand() * 90 as int))"))
         # Generate a random transaction amount between 0 and 100, formatted as decimal(10,2)
         .withColumn("amount", 
                     (F.rand() * 100).cast(DecimalType(10,2)))
         # Add a category column by randomly selecting one from the categories array
         .withColumn("category", 
                     F.element_at(categories_array, (F.floor(F.rand() * len(categories)) + 1).cast("integer")))
)

df_transactions.show(5)


+--------------+-----------+----------------+------+-----------+
|transaction_id|customer_id|transaction_date|amount|   category|
+--------------+-----------+----------------+------+-----------+
|             1|       3065|      2025-03-17| 76.10|    Clothes|
|             2|       3274|      2025-02-18| 91.91|    Clothes|
|             3|        130|      2025-01-10| 11.81|Accessories|
|             4|        320|      2025-03-06| 20.37|  Furniture|
|             5|       6480|      2025-03-22| 12.31|     Beauty|
+--------------+-----------+----------------+------+-----------+
only showing top 5 rows
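
The random expressions in the cell above can be sanity-checked without Spark. The sketch below is a plain-Python check, not part of the original notebook. It assumes `rand()` returns values in [0, 1) and that `element_at` uses 1-based positions for arrays. The helper name `category_index` is hypothetical.

```python
from datetime import date, timedelta
import math

start = date(2025, 1, 1)
# cast(rand() * 90 as int) yields whole-day offsets 0..89 because rand() is in [0, 1)
offsets = range(90)
first_day = start + timedelta(days=offsets[0])
last_day = start + timedelta(days=offsets[-1])
print(first_day, last_day)

categories = ["Clothes", "Gadgets", "Food", "Toys", "Books", "Furniture",
              "Electronics", "Sports", "Beauty", "Accessories"]

def category_index(r):
    """Map r in [0, 1) to a 1-based position, as element_at expects."""
    return math.floor(r * len(categories)) + 1

lowest = category_index(0.0)
highest = category_index(0.999999)
print(lowest, highest)
```

The date offsets span 2025-01-01 through 2025-03-31, which matches the February and March dates in the sample output, and the category positions stay within 1 to 10.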


In [0]:
pqt_output_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/transactions_data"
df_transactions.coalesce(1).write.mode("overwrite").parquet(pqt_output_path)