### Setup

Before running the code below, complete these tasks:

- Upload the customer_data* and transactions_data* files to a cloud bucket or follow the 00 - Generating Data notebook to generate them.
- However you get the files into storage, replace the paths in the code below with ones that match your environment. The paths I use point to Azure Data Lake Storage.

First, I set the Azure storage configuration.

In [0]:
# This cell sets all the configuration parameters to connect to Azure Data Lake
spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "****************************")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "*******************************")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/************************/oauth2/token")
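
The five settings above differ only in the storage account name and the credential values, so they can be assembled by a small helper. This is a minimal sketch, not the notebook's own code: `adls_oauth_conf`, its parameter names, and the sample values are hypothetical, and real credentials are better read from a secret store than pasted into a cell.

```python
def adls_oauth_conf(account_name, client_id, client_secret, tenant_id):
    """Build the OAuth settings for one ADLS Gen2 storage account.

    Hypothetical helper: it only assembles the same five keys used in the
    cell above and does not touch Spark itself.
    """
    suffix = f"{account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Made-up values, for illustration only.
conf = adls_oauth_conf("myaccount", "my-client-id", "my-client-secret", "my-tenant-id")

# In a notebook, each pair would then be applied with spark.conf.set(key, value):
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

Keeping the keys in one place makes it harder to mistype the account name in just one of the five settings.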

Verify that the cloud storage is accessible.

In [0]:
dbutils.fs.ls("abfss://pyspark@warnerdatalake.dfs.core.windows.net/")

[FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/exports/', name='exports/', size=0, modificationTime=1740581924000),
 FileInfo(path='abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/', name='imports/', size=0, modificationTime=1740581918000)]

- First, define the paths of the files we are reading.
- Replace them with paths appropriate for your environment.

In [0]:
# Define paths for each file format in ADLS
csv_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.csv"
json_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.json"
orc_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.orc"
parquet_path = "abfss://pyspark@warnerdatalake.dfs.core.windows.net/imports/customers_data.parquet"
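
Because the four paths share everything except the file extension, one way to make them easier to adapt is to build them from the container, account, and relative path. The sketch below is an illustration under that assumption; `abfss_path` is a hypothetical helper, not part of Spark or Databricks.

```python
def abfss_path(container, account_name, relative_path):
    """Return an abfss:// URI for a file in ADLS Gen2 (hypothetical helper)."""
    # Strip any leading slash so the URI has exactly one separator after the host.
    return f"abfss://{container}@{account_name}.dfs.core.windows.net/{relative_path.lstrip('/')}"


# Same container, account, and file names as the cell above.
paths = {
    fmt: abfss_path("pyspark", "warnerdatalake", f"imports/customers_data.{fmt}")
    for fmt in ("csv", "json", "orc", "parquet")
}

print(paths["csv"])
```

To point the notebook at a different environment, only the container and account arguments would need to change.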

Let's load and verify a DataFrame for each format.

First, load the CSV file and preview it with the show method.

In [0]:
df_csv = (
    spark.read
         .option("header", "true")      # Assume first row as header
         .option("inferSchema", "true") # Infer data types
         .csv(csv_path)
)
print("CSV Data:")
df_csv.show(5)

CSV Data:
+-----------+----------+---------+--------------------+---+-------+
|customer_id|first_name|last_name|               email|age|country|
+-----------+----------+---------+--------------------+---+-------+
|          1|   First_1|   Last_1|First_1.Last_1@ex...| 40| Canada|
|          2|   First_2|   Last_2|First_2.Last_2@ex...| 55|    USA|
|          3|   First_3|   Last_3|First_3.Last_3@ex...| 59|    USA|
|          4|   First_4|   Last_4|First_4.Last_4@ex...| 49| Canada|
|          5|   First_5|   Last_5|First_5.Last_5@ex...| 58| Canada|
+-----------+----------+---------+--------------------+---+-------+
only showing top 5 rows
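
Setting `inferSchema` makes Spark scan the CSV data an extra time to guess the column types. When the layout is already known, the schema can be declared up front instead. The following is a sketch under assumptions: the column names come from the output above, the types are my guess from the sample rows, and `read_customers_csv` is a hypothetical helper that expects the notebook's `spark` session.

```python
# Column names are taken from the output above; the types are an assumption
# based on what the sample rows look like.
CUSTOMER_SCHEMA_DDL = (
    "customer_id INT, first_name STRING, last_name STRING, "
    "email STRING, age INT, country STRING"
)


def read_customers_csv(spark, path):
    """Read the customers CSV with a declared schema (hypothetical helper)."""
    return (
        spark.read
             .option("header", "true")     # first row holds the column names
             .schema(CUSTOMER_SCHEMA_DDL)  # declared types, so no inference pass
             .csv(path)
    )


# In the notebook this would be called as:
# df_csv = read_customers_csv(spark, csv_path)
```

A declared schema also keeps the column types stable if a later file happens to contain values that inference would read differently.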


The display function gives a more user-friendly view.

In [0]:
print("CSV Data:")
df_csv.limit(20).display()

CSV Data:


customer_id,first_name,last_name,email,age,country
1,First_1,Last_1,First_1.Last_1@example.com,40,Canada
2,First_2,Last_2,First_2.Last_2@example.com,55,USA
3,First_3,Last_3,First_3.Last_3@example.com,59,USA
4,First_4,Last_4,First_4.Last_4@example.com,49,Canada
5,First_5,Last_5,First_5.Last_5@example.com,58,Canada
6,First_6,Last_6,First_6.Last_6@example.com,55,USA
7,First_7,Last_7,First_7.Last_7@example.com,32,USA
8,First_8,Last_8,First_8.Last_8@example.com,56,USA
9,First_9,Last_9,First_9.Last_9@example.com,47,USA
10,First_10,Last_10,First_10.Last_10@example.com,30,Canada


Load JSON and print the schema

In [0]:
df_json = (
    spark.read
         .json(json_path)
)

print("JSON Schema:")
df_json.printSchema()
print("JSON Data:")
df_json.show(5)

JSON Schema:
root
 |-- age: long (nullable = true)
 |-- country: string (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)

JSON Data:
+---+-------+-----------+--------------------+----------+---------+
|age|country|customer_id|               email|first_name|last_name|
+---+-------+-----------+--------------------+----------+---------+
| 40| Canada|          1|First_1.Last_1@ex...|   First_1|   Last_1|
| 55|    USA|          2|First_2.Last_2@ex...|   First_2|   Last_2|
| 59|    USA|          3|First_3.Last_3@ex...|   First_3|   Last_3|
| 49| Canada|          4|First_4.Last_4@ex...|   First_4|   Last_4|
| 58| Canada|          5|First_5.Last_5@ex...|   First_5|   Last_5|
+---+-------+-----------+--------------------+----------+---------+
only showing top 5 rows
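
Notice that the JSON columns come back in alphabetical order, which is how Spark's JSON schema inference orders fields, while the CSV read kept the order of the file. If a consistent column order matters, a `select` restores it. This is a small sketch; `CUSTOMER_COLUMNS` and `in_csv_order` are hypothetical names introduced for illustration.

```python
# Column order as it appears in the CSV output earlier in this section.
CUSTOMER_COLUMNS = ["customer_id", "first_name", "last_name", "email", "age", "country"]


def in_csv_order(df):
    """Return df with its columns arranged in the CSV order (hypothetical helper)."""
    return df.select(*CUSTOMER_COLUMNS)


# In the notebook this would be called as:
# in_csv_order(df_json).show(5)
```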


Load ORC and describe the data

In [0]:
df_orc = spark.read.orc(orc_path)

print("ORC Description:")
df_orc.describe().display()

ORC Description:


summary,customer_id,first_name,last_name,email,age,country
count,10000.0,10000,10000,10000,10000.0,10000
mean,5000.5,,,,43.622,
stddev,2886.8956799071675,,,,11.43080847264244,
min,1.0,First_1,Last_1,First_1.Last_1@example.com,18.0,Canada
max,10000.0,First_9999,Last_9999,First_9999.Last_9999@example.com,60.0,USA
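
The `customer_id` row of this output can be sanity-checked without Spark. The sketch below assumes that `customer_id` takes every integer from 1 to 10000 exactly once, which is consistent with the count, min, max, and mean shown above but is not stated by the data itself. It compares the sample and population standard deviations to see which one matches the `stddev` row.

```python
import statistics

# Assumption: customer_id is exactly the integers 1..10000.
ids = range(1, 10001)

mean_id = statistics.mean(ids)
sample_std = statistics.stdev(ids)       # divides by n - 1
population_std = statistics.pstdev(ids)  # divides by n

print("mean:", mean_id)
print("sample stddev:", sample_std)
print("population stddev:", population_std)
```

Under that assumption, the sample value is the one that agrees with the reported 2886.8956799071675, so the `stddev` row is a sample standard deviation.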


Finally, load the Parquet file and check the summary.

In [0]:
df_parquet = spark.read.parquet(parquet_path)

print("Parquet Summary:")
df_parquet.summary().display()

Parquet Summary:


summary,customer_id,first_name,last_name,email,age,country
count,10000.0,10000,10000,10000,10000.0,10000
mean,5000.5,,,,43.622,
stddev,2886.8956799071675,,,,11.43080847264244,
min,1.0,First_1,Last_1,First_1.Last_1@example.com,18.0,Canada
25%,2499.0,,,,37.0,
50%,4999.0,,,,46.0,
75%,7499.0,,,,53.0,
max,10000.0,First_9999,Last_9999,First_9999.Last_9999@example.com,60.0,USA
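
Unlike `describe()`, `summary()` also accepts the names of the statistics to compute, which helps when only a few of them are of interest. The sketch below limits the output to the numeric columns and their quartiles; `numeric_quartiles` is a hypothetical helper, and the column names are the ones shown in the output above.

```python
def numeric_quartiles(df):
    """Summarize only the numeric columns with min, quartiles, and max.

    Hypothetical helper; df is expected to be a Spark DataFrame such as
    df_parquet from the cell above.
    """
    return (
        df.select("customer_id", "age")
          .summary("min", "25%", "50%", "75%", "max")
    )


# In the notebook this would be called as:
# numeric_quartiles(df_parquet).display()
```

Selecting the numeric columns first avoids the empty cells that string columns produce for the percentile rows.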
