In [0]:
dbutils.widgets.text("ACCESS_KEY", '...')


This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.


### Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS).
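
As a rough sketch of how the two options differ in configuration, the helpers below build the Spark configuration entry for each. The function names and the placeholder account, container, and credential values are hypothetical; the `fs.azure.sas.<container>.<account>.blob.core.windows.net` key is the WASB setting for a SAS token and, unlike an account key, it is scoped to a single container.

```python
from typing import Tuple


def account_key_conf(account: str, access_key: str) -> Tuple[str, str]:
    """Return the (key, value) pair for account-key authentication."""
    return (f"fs.azure.account.key.{account}.blob.core.windows.net", access_key)


def sas_conf(container: str, account: str, sas_token: str) -> Tuple[str, str]:
    """Return the (key, value) pair for SAS authentication on one container."""
    return (f"fs.azure.sas.{container}.{account}.blob.core.windows.net", sas_token)


# Placeholder values for illustration only, not real credentials.
key_name, _ = account_key_conf("mystorageaccount", "<access-key>")
sas_name, _ = sas_conf("input", "mystorageaccount", "<sas-token>")
print(key_name)
print(sas_name)
```

In a notebook, either pair could be applied with `spark.conf.set(*account_key_conf(...))` or `spark.conf.set(*sas_conf(...))`.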

To get started, we need to set the location and type of the file.

In [0]:
storage_account_access_key = dbutils.widgets.get("ACCESS_KEY")
print(storage_account_access_key)

abcdef


In [0]:
# Hardcoded credentials - oh no, what a letdown...
storage_account_name = "agnieszka2kuban"
storage_account_access_key = "tlxN66Zu/xTd+A8RuH1EBphp+L0x9WhVZfDUJHvk9SqnjjRgotxsWu1jHAWZv3R4yeM3fxKoAsZP+AStHcc/IQ=="

In [0]:
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)

Description of variables:
`input` is the container name.

Loading all files:
To load every file in the container, leave nothing after `blob.core.windows.net/` in the path.
#file_location = f"wasbs://input@{storage_account_name}.blob.core.windows.net/"
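
The URL layout can be made explicit with a small helper. This is only a sketch: `wasbs_url` is a hypothetical name and `mystorageaccount` is a placeholder, not the account used in this notebook.

```python
def wasbs_url(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URL; an empty path addresses the whole container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Whole container versus a single file.
all_files = wasbs_url("input", "mystorageaccount")
one_file = wasbs_url("input", "mystorageaccount", "AzureUsage.csv")
print(all_files)
print(one_file)
```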


### Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use an *option* to specify that we want to infer the schema from the file. We can also explicitly set this to a particular schema if we have one already.
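
For the explicit-schema option, one possibility is to pass a DDL-style schema string to the reader instead of `inferSchema`. The sketch below takes the column names from the CSV header of the usage file; the column types are assumptions, and `Date` is left as a string because values such as `2/5/2024` do not follow Spark's default date format. `read_usage` is a hypothetical helper name.

```python
# Column names follow the CSV header; the types are assumptions.
USAGE_COLUMNS = [
    ("SubscriptionName", "STRING"),
    ("SubscriptionGuid", "STRING"),
    ("Date", "STRING"),
    ("ResourceGuid", "STRING"),
    ("ServiceName", "STRING"),
    ("ServiceType", "STRING"),
    ("ServiceRegion", "STRING"),
    ("ServiceResource", "STRING"),
    ("Quantity", "DOUBLE"),
    ("Cost", "DOUBLE"),
]

# Spark readers accept a schema as a DDL string: "name TYPE, name TYPE, ...".
USAGE_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in USAGE_COLUMNS)


def read_usage(spark, file_location):
    """Read the CSV with a fixed schema instead of inferring it."""
    return spark.read.csv(file_location, header=True, schema=USAGE_SCHEMA_DDL)
```

A fixed schema avoids the extra pass over the data that inference needs and keeps column types stable between runs.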

First, let's create a DataFrame in Python.

In [0]:

#file_location = f"wasbs://input@{storage_account_name}.blob.core.windows.net/AzureUsage.csv"

# Loading by file-name mask - wildcard (*)
# file_location = f"wasbs://input@{storage_account_name}.blob.core.windows.net/*AzureUsage (*"

file_location = f"wasbs://input@{storage_account_name}.blob.core.windows.net/"
file_type = "csv"
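
To preview which file names a mask such as `*AzureUsage (*` would select, the standard-library `fnmatch` module gives a rough local approximation. Spark resolves the pattern with Hadoop glob rules, which differ from `fnmatch` in some details, and the blob names below are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "*AzureUsage (*"

# Hypothetical blob names, for illustration only.
blob_names = ["AzureUsage.csv", "AzureUsage (1).csv", "AzureUsage (2).csv", "notes.txt"]

# Keep only the names that match the mask (case-sensitive).
matches = [name for name in blob_names if fnmatchcase(name, pattern)]
print(matches)
```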

In [0]:
# Suspicious syntax suggested by Databricks
# df = spark.read.format(file_type).option("inferSchema", "true").option("header", "true").load(file_location)

In [0]:
df = spark.read.csv(file_location, header=True, inferSchema=True)
df.display()

SubscriptionName,SubscriptionGuid,Date,ResourceGuid,ServiceName,ServiceType,ServiceRegion,ServiceResource,Quantity,Cost
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/5/2024,59d063a4-87cd-40da-a237-0cd24bbb451d,Azure Data Factory v2,All,All,Cloud Pipeline Activity,0.150002,15.000748
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/6/2024,59d063a4-87cd-40da-a237-0cd24bbb451d,Azure Data Factory v2,All,All,Cloud Pipeline Activity,0.433335,0.002165
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/7/2024,59d063a4-87cd-40da-a237-0cd24bbb451d,Azure Data Factory v2,All,All,Cloud Pipeline Activity,0.283335,0.001415
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/2/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.003,1.8e-05
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/3/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0001,1e-06
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/4/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0002,2e-06
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/5/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0064,4e-05
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/6/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0005,5e-06
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/7/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0015,1.1e-05
EVIDEN MS Partner,61096b0b-79b9-442a-9e59-9b03e9ed7dca,2/8/2024,5bd67b90-1a10-5cba-84a9-1fc79c3f4a4f,Storage,General Block Blob v2 Hierarchical Namespace,PL Central,Hot Other Operations,0.0001,1e-06




### Step 3: Save the data to the container

In [0]:
# Write the DataFrame as Parquet, with one folder per ServiceName value.
output_path = f"wasbs://output@{storage_account_name}.blob.core.windows.net/"
df.write.parquet(output_path, partitionBy='ServiceName', mode="overwrite")

In [0]:
# Write the same DataFrame as semicolon-delimited CSV.
output_path = f"wasbs://output@{storage_account_name}.blob.core.windows.net/"
df.write.csv(output_path, sep=';', mode="overwrite")
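
Both write cells target the root of the `output` container with `mode="overwrite"`, so the CSV write replaces the Parquet output. One way to keep both, sketched here on the assumption that separate `parquet/` and `csv/` subfolders are acceptable (the folder and function names are hypothetical), is to give each format its own prefix.

```python
from typing import Dict


def output_paths(account: str, container: str = "output") -> Dict[str, str]:
    """Return one output prefix per file format."""
    base = f"wasbs://{container}@{account}.blob.core.windows.net/"
    return {"parquet": base + "parquet/", "csv": base + "csv/"}


def write_both(df, account: str) -> None:
    """Write df as partitioned Parquet and as semicolon-delimited CSV."""
    paths = output_paths(account)
    df.write.parquet(paths["parquet"], partitionBy="ServiceName", mode="overwrite")
    df.write.csv(paths["csv"], sep=";", mode="overwrite")


# Placeholder account name for illustration.
paths = output_paths("mystorageaccount")
print(paths["parquet"])
print(paths["csv"])
```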


### Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a *view* or a table.

In [0]:
df.createOrReplaceTempView("AzureUsage")


We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how we can use `%sql` to query the view from SQL.

In [0]:
%sql

SELECT ServiceName, count(ServiceName) FROM AzureUsage GROUP BY ServiceName

ServiceName,count(ServiceName)
SQL Database,7
Storage,45
Bandwidth,4
Azure Data Factory v2,4


In [0]:
%sql
SELECT ServiceName, Round(sum(Cost), 2) as Cost_USD FROM AzureUsage GROUP BY ServiceName ORDER BY 2 desc 


ServiceName,Cost_USD
Azure Data Factory v2,15.02
SQL Database,4.22
Storage,0.03
Bandwidth,0.0
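
The same aggregation can also be issued from a Python cell through `spark.sql`, which returns a DataFrame rather than a rendered result. This is a minimal sketch: `cost_by_service` is a hypothetical helper name, and it assumes the `AzureUsage` temp view has already been registered.

```python
COST_BY_SERVICE_SQL = """
SELECT ServiceName, ROUND(SUM(Cost), 2) AS Cost_USD
FROM AzureUsage
GROUP BY ServiceName
ORDER BY Cost_USD DESC
"""


def cost_by_service(spark):
    """Run the aggregation against the AzureUsage temp view."""
    return spark.sql(COST_BY_SERVICE_SQL)
```

In the notebook, `cost_by_service(spark).display()` would render the result in the same way as the `%sql` cell.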



In [0]:
#%scala
#display('AzureUsage')

In [0]:
#%r
#print(AzureUsage)