### Setup

Before running the code, complete these tasks:

- Upload the customers_data* and transactions_data* files to a cloud bucket, or follow the 00 - Generating Data notebook to generate them.
- However you get the files into your storage, replace the paths in the code below with paths that fit your environment. The paths I use point to Azure Data Lake Storage.

First, I set the Azure storage configuration.

In [0]:
# This cell sets all the configuration parameters to connect to Azure Data Lake
spark.conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "****************************")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "*******************************")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/************************/oauth2/token")
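
The five settings above share one key pattern, so they can also be generated from a single helper. The sketch below is one possible way to do that and is not part of the original notebook: `adls_oauth_conf` and its parameter names are hypothetical, and the placeholder values must be replaced with the details of your own service principal.

```python
def adls_oauth_conf(account_name, client_id, client_secret, tenant_id):
    """Build the OAuth settings for one ADLS account as a dict (hypothetical helper)."""
    host = f"{account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder values for illustration only; substitute your own.
conf = adls_oauth_conf("<account_name>", "<client_id>", "<client_secret>", "<tenant_id>")

# In a notebook where `spark` is defined, apply the settings like this:
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

Keeping the keys in one place makes it harder to mistype the account name in one of the five settings.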

Verify that cloud storage is accessible

In [0]:
all_files = dbutils.fs.ls("abfss://etl1@dbstoragebbpbs73u57xmm.dfs.core.windows.net/")
filtered_files = [file for file in all_files if file.name.startswith("customers_data") and file.name.endswith((".csv", ".json", ".orc", ".parquet"))]
display(filtered_files)
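
The same filter can be wrapped in a small function so that it also covers the transactions_data* files mentioned in the setup steps. This is only a sketch: `select_data_files` and `FakeFile` are hypothetical names, and `FakeFile` is a stand-in for the entries returned by `dbutils.fs.ls`, of which only the `name` attribute is used here.

```python
from collections import namedtuple

DATA_EXTENSIONS = (".csv", ".json", ".orc", ".parquet")


def select_data_files(files, prefix, extensions=DATA_EXTENSIONS):
    """Keep entries whose name starts with `prefix` and ends with one of `extensions`."""
    # str.endswith accepts a tuple, so one call covers every extension.
    return [f for f in files if f.name.startswith(prefix) and f.name.endswith(extensions)]


# Stand-in listing so the example runs outside Databricks.
FakeFile = namedtuple("FakeFile", ["name"])
sample = [
    FakeFile("customers_data.csv"),
    FakeFile("customers_data.txt"),
    FakeFile("transactions_data.orc"),
]

customers = select_data_files(sample, "customers_data")
transactions = select_data_files(sample, "transactions_data")
```

In the notebook, pass `all_files` instead of `sample`.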

- First, define the paths of the files to read.
- Replace them with paths appropriate for your environment.

In [0]:
# Define paths for each file format in ADLS
csv_path = "abfss://etl1@dbstoragebbpbs73u57xmm.dfs.core.windows.net/customers_data.csv"
json_path = "abfss://etl1@dbstoragebbpbs73u57xmm.dfs.core.windows.net/customers_data.json"
orc_path = "abfss://etl1@dbstoragebbpbs73u57xmm.dfs.core.windows.net/customers_data.orc"
parquet_path = "abfss://etl1@dbstoragebbpbs73u57xmm.dfs.core.windows.net/customers_data.parquet"
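
Because the four paths differ only in their extension, they can be derived from one base URI, which leaves a single place to edit when adapting the notebook. The following is a sketch under that assumption: `build_paths` is a hypothetical helper, and the container and account in the example call are placeholders.

```python
def build_paths(base_uri, stem, formats=("csv", "json", "orc", "parquet")):
    """Return {format: full path} for one dataset stem under base_uri (hypothetical helper)."""
    base = base_uri.rstrip("/")  # avoid a doubled slash when joining
    return {fmt: f"{base}/{stem}.{fmt}" for fmt in formats}


# Placeholder container and account; substitute your own.
paths = build_paths("abfss://<container>@<account_name>.dfs.core.windows.net/", "customers_data")

csv_path = paths["csv"]
json_path = paths["json"]
orc_path = paths["orc"]
parquet_path = paths["parquet"]
```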

Let's load and verify the DataFrame for each format.

First, load the CSV file and use the show command.

In [0]:

df_csv = (
    spark.read
         .option("header", "true")       # Treat the first row as the header
         .option("inferSchema", "true")  # Infer column data types
         .csv(csv_path)
)
print("CSV Data:")
df_csv.show(5)

The display function is more user-friendly.

In [0]:
print("CSV Data:")
df_csv.limit(20).display()

Load JSON and print the schema

In [0]:
df_json = (
    spark.read
         .json(json_path)
)

print("JSON Schema:")
df_json.printSchema()
print("JSON Data:")
df_json.limit(20).display()

Load ORC and describe the data

In [0]:
df_orc = (
    spark.read.orc(orc_path)
)

df_orc.describe().display()

Finally, load Parquet and check the summary.

In [0]:
df_parquet = spark.read.parquet(parquet_path)

print("Parquet Summary:")
df_parquet.summary().display()
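
Once all four DataFrames are loaded, a quick cross-check can confirm that each format was read the same way. The sketch below assumes the four files hold the same customer data. `compare_frames` and `FakeFrame` are hypothetical names, and `FakeFrame` only mimics the `count()` method and `columns` attribute of a Spark DataFrame so the example runs without a cluster. Columns are compared as sets because column order can differ between readers.

```python
def compare_frames(frames):
    """Return (row_counts, columns_match) for a dict of name -> DataFrame-like objects."""
    row_counts = {name: df.count() for name, df in frames.items()}
    # Compare as sets so that column order does not matter.
    distinct_column_sets = {frozenset(df.columns) for df in frames.values()}
    return row_counts, len(distinct_column_sets) <= 1


class FakeFrame:
    """Minimal stand-in exposing only what compare_frames uses."""

    def __init__(self, columns, n_rows):
        self.columns = columns
        self._n_rows = n_rows

    def count(self):
        return self._n_rows


# Illustrative column names and row counts, not taken from the real files.
demo = {
    "csv": FakeFrame(["id", "name"], 3),
    "json": FakeFrame(["name", "id"], 3),
}
counts, same_columns = compare_frames(demo)

# In the notebook, pass the real DataFrames instead:
# compare_frames({"csv": df_csv, "json": df_json, "orc": df_orc, "parquet": df_parquet})
```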