
This notebook shows you how to create and query a table or DataFrame loaded from data stored in Azure Blob storage.


### Step 1: Set the data location and type

There are two ways to access Azure Blob storage: account keys and shared access signatures (SAS). This notebook uses an account key.

To get started, we need to set the location and type of the file.

In [1]:
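# Account name and key for the Blob storage account. In a shared notebook, prefer reading
# the key from a secret scope (for example with dbutils.secrets.get) over hard-coding it.
storage_account_name = "cataschevasticadw"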
storage_account_access_key = "GWaJcOy67wGkrcgSSjMny86j5ZxVtBiXGNxU7dx+L/yMDXCl3Vk+1XKJ3vuEs+w4Edh+KWonSkdZ+ASty3bspQ=="

In [2]:
file_location = "wasbs://cataschevasticadw@cataschevasticadw.blob.core.windows.net/FactTempSales"
file_type = "csv"

In [3]:
# Register the account key so Spark can authenticate against the storage account.
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)

NameError: name 'spark' is not defined
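
The `NameError` above means the cell ran in a Python kernel with no active Spark session. Databricks notebooks predefine `spark`; other environments have to create a session first. As a minimal sketch, the helpers below (illustrative names, not part of any library) build the configuration key and the `wasbs://` URL used in this step, plus the SAS variant of the key for the second access method. `get_spark` assumes PySpark is installed and is only needed outside Databricks.

```python
def account_key_conf(account: str) -> str:
    """Spark conf key that holds the storage account access key."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"


def sas_conf(container: str, account: str) -> str:
    """Spark conf key that holds a SAS token for a single container."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def wasbs_url(container: str, account: str, path: str) -> str:
    """Build wasbs://<container>@<account>.blob.core.windows.net/<path>."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def get_spark(app_name: str = "blob-demo"):
    """Create (or reuse) a session when `spark` is not predefined."""
    from pyspark.sql import SparkSession  # lazy import: requires PySpark

    return SparkSession.builder.appName(app_name).getOrCreate()


# Reproduces the file_location value used in this notebook.
print(wasbs_url("cataschevasticadw", "cataschevasticadw", "FactTempSales"))
```

Outside Databricks, `spark = get_spark()` would have to run before the `spark.conf.set(...)` cell, and the cluster would also need the Azure storage connector on its classpath.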


### Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use an *option* to specify that we want to infer the schema from the file. We can also explicitly set this to a particular schema if we have one already.

First, let's create a DataFrame in Python.

In [None]:
df = spark.read.format(file_type) \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("delimiter", ",") \
  .load(file_location)
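
If the schema is already known, it can be passed to the reader instead of `inferSchema`, which avoids the extra pass over the file that inference needs. A minimal sketch, assuming the column types below (they are guesses from the sample output, not confirmed by the source); `SALES_COLUMNS`, `to_ddl`, and `read_with_schema` are illustrative names.

```python
# Assumed column types for FactTempSales; adjust to the real data.
SALES_COLUMNS = [
    ("OrderID", "INT"),
    ("ProductID", "STRING"),
    ("OrderStatus", "STRING"),
    ("ProductKey", "INT"),
    ("CustomerKey", "INT"),
    ("EmployeeKey", "INT"),
    ("DeliveryPartnerID", "INT"),
    ("OrderDateKey", "INT"),
    ("Quantity", "INT"),
    ("Price", "DOUBLE"),
    ("ExtendedPriceAmount", "DOUBLE"),
]


def to_ddl(columns):
    """Turn (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_with_schema(spark, location, file_type="csv"):
    """Read the file with an explicit schema instead of inferring it."""
    return (
        spark.read.format(file_type)
        .schema(to_ddl(SALES_COLUMNS))  # the reader accepts a DDL string
        .option("header", "true")
        .option("delimiter", ",")
        .load(location)
    )
```

In the notebook this would be called as `df = read_with_schema(spark, file_location)`.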


### Step 3: Query the data

Now that we have created our DataFrame, we can query it. The cell below displays the whole DataFrame; you can also select particular columns before displaying them.

In [None]:
display(df)

OrderID,ProductID,OrderStatus,ProductKey,CustomerKey,EmployeeKey,DeliveryPartnerID,OrderDateKey,Quantity,Price,ExtendedPriceAmount
63,SKU007,in process,7,1,3,13,20240618,100,0.75,75
63,SKU008,in process,8,1,3,13,20240618,300,15.0,4500
67,SKU001,in process,1,2,7,8,20240622,100,1.5,150
67,SKU015,in process,15,2,7,8,20240622,200,12.0,2400
65,SKU011,in process,11,4,1,15,20240620,450,2.5,1125
65,SKU012,in process,12,4,1,15,20240620,200,18.0,3600
61,SKU003,in process,3,5,11,4,20240616,150,3.0,450
61,SKU004,in process,4,5,11,4,20240616,100,25.0,2500
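
To narrow the result to particular columns and rows, `select` and `where` can be chained before `display`. The sketch below assumes the column names from the output above; `large_orders` and the threshold of 200 are illustrative.

```python
def large_orders(df, min_quantity=200):
    """Keep a few columns and only rows with Quantity >= min_quantity."""
    return (
        df.select("OrderID", "ProductID", "Quantity", "ExtendedPriceAmount")
        .where(f"Quantity >= {min_quantity}")  # SQL expression string
    )
```

In the notebook this would be used as `display(large_orders(df))`.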



### Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a *view* or a table.

In [None]:
df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")


We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how we can use `%sql` to query the view from SQL.

In [None]:
%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
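
With the placeholders filled in, the aggregation might look like the query in the sketch below. To keep it runnable anywhere, it is checked against the eight rows shown in Step 3 using Python's built-in `sqlite3` rather than Spark. The view name `fact_temp_sales` is an assumption; in the notebook the same string could be passed to `spark.sql`.

```python
import sqlite3

QUERY = """
SELECT OrderID, SUM(ExtendedPriceAmount) AS TotalAmount
FROM fact_temp_sales
GROUP BY OrderID
ORDER BY OrderID
"""

# (OrderID, ProductID, ExtendedPriceAmount) taken from the Step 3 output.
ROWS = [
    (63, "SKU007", 75), (63, "SKU008", 4500),
    (67, "SKU001", 150), (67, "SKU015", 2400),
    (65, "SKU011", 1125), (65, "SKU012", 3600),
    (61, "SKU003", 450), (61, "SKU004", 2500),
]

# SQLite stands in for Spark here, purely as a local check of the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_temp_sales "
    "(OrderID INT, ProductID TEXT, ExtendedPriceAmount REAL)"
)
conn.executemany("INSERT INTO fact_temp_sales VALUES (?, ?, ?)", ROWS)
totals = conn.execute(QUERY).fetchall()
print(totals)  # one (OrderID, TotalAmount) pair per order
```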


Because the DataFrame is registered as a temp view, it is available only to this notebook. If you'd like other users to be able to query this data, you can also create a table from the DataFrame.

In [None]:
df.write.format("parquet").saveAsTable("FactTempSales")


This table will persist across cluster restarts and allow various users across different notebooks to query this data.
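
By default, `saveAsTable` raises an error if the table already exists, so re-running the write cell fails. A minimal sketch of a re-runnable variant, assuming that replacing the existing table is acceptable; `save_table` is an illustrative helper.

```python
def save_table(df, name, mode="overwrite"):
    """Write df as a Parquet-backed table using the given save mode."""
    df.write.format("parquet").mode(mode).saveAsTable(name)
```

In the notebook this would be called as `save_table(df, "FactTempSales")`; passing `mode="append"` would add rows instead of replacing the table.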

In [None]:
%sql
SELECT * FROM FactProduction
INNER JOIN DimMaterial
  ON DimMaterial.MaterialKey = FactProduction.MaterialKey
WHERE DimMaterial.RowIsCurrent = 1 AND FactProduction.MaterialID = 3

OrderID,ProductID,MaterialID,ProductionStatus,ProductKey,MaterialKey,EmployeeKey,ProductionStartDateKey,ProductionEndDateKey,CostOfMaterial,AmountOfMaterialUsed,UnitsOfProduct,ExtendedCost,RowIsCurrent,MaterialKey.1,MaterialID.1,MaterialName,CostOfMaterial.1,SupplierID,SupplierName,RowIsCurrent.1,RowStartDate,RowEndDate,RowChangeReason,RowIsDeleted
2,SKU003,3,completed,3,3,2,20230311,20230313.0,0.1,2.0,200,40.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
4,SKU007,3,completed,7,3,3,20230619,20230621.0,0.1,4.0,120,48.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
17,SKU007,3,completed,7,3,7,20230318,20230320.0,0.1,4.0,170,68.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
27,SKU007,3,completed,7,3,7,20230813,20230815.0,0.1,4.0,270,108.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
37,SKU007,3,completed,7,3,3,20231202,20231204.0,0.1,4.0,370,148.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
47,SKU003,3,completed,3,3,8,20230716,20230715.0,0.1,2.0,200,40.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
57,SKU007,3,completed,7,3,7,20230216,20230218.0,0.1,4.0,400,160.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
61,SKU003,3,in production,3,3,11,20240617,,0.1,2.0,150,30.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
63,SKU007,3,in production,7,3,3,20240619,,0.1,4.0,100,40.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False
68,SKU003,3,completed,3,3,13,20240624,20240626.0,0.1,2.0,200,40.0,1,3,3,Clay,0.1,3,Top Steel,1,1899-12-31,9999-12-31,,False


In [None]:
%sql
SELECT to_date(CAST(19960423 AS VARCHAR(8)), 'yyyyMMdd') as formatted_date

formatted_date
1996-04-23
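
The same conversion in plain Python, as a sketch for checking date keys outside SQL. `key_to_date` is an illustrative helper; it also accepts keys that are displayed as floats (such as `20230120.0`).

```python
from datetime import date, datetime


def key_to_date(key) -> date:
    """Convert a yyyyMMdd key (int, str, or float-like) to a date."""
    digits = str(int(float(key)))  # 20230120.0 -> "20230120"
    return datetime.strptime(digits, "%Y%m%d").date()


print(key_to_date(19960423))  # 1996-04-23
```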


In [None]:
%sql

SELECT ProductID, 
  AVG(DATEDIFF(day, 
          to_date(CAST(OrderDateKey AS VARCHAR(8)), 'yyyyMMdd'), 
          to_date(CAST(ShippedDateKey AS VARCHAR(8)), 'yyyyMMdd'))) AS ExecutionTimeInDays
FROM FactSales
WHERE FactSales.OrderStatus IN ('completed', 'in delivery')
GROUP BY ProductID

ProductID,ExecutionTimeInDays
SKU003,4.0
SKU014,4.5
SKU013,4.333333333333333
SKU005,4.0
SKU011,4.875
SKU007,4.2
SKU006,3.8333333333333335
SKU002,4.25
SKU008,4.5
SKU010,3.5
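
The aggregation above can be cross-checked in plain Python. The sketch below assumes rows given as `(ProductID, OrderStatus, OrderDateKey, ShippedDateKey)` tuples. The three sample rows are copied from the `factsales` listing at the end of this notebook and cover only a fraction of the table, so the averages differ from the full result above.

```python
from collections import defaultdict
from datetime import datetime


def _to_date(key):
    """Parse a yyyyMMdd integer key into a date."""
    return datetime.strptime(str(int(key)), "%Y%m%d").date()


def avg_execution_days(rows):
    """Average days from order to shipment per ProductID."""
    spans = defaultdict(list)
    for product_id, status, order_key, shipped_key in rows:
        # Same filter as the SQL; rows without a ship date are skipped,
        # mirroring how AVG ignores NULL values.
        if status in ("completed", "in delivery") and shipped_key is not None:
            days = (_to_date(shipped_key) - _to_date(order_key)).days
            spans[product_id].append(days)
    return {pid: sum(d) / len(d) for pid, d in spans.items()}


sample = [
    ("SKU001", "completed", 20230115, 20230120),
    ("SKU006", "completed", 20230618, 20230620),
    ("SKU005", "cancelled", 20230405, None),
]
print(avg_execution_days(sample))
```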


In [None]:
%sql
SELECT * FROM factsales

OrderID,ProductID,OrderStatus,ProductKey,CustomerKey,EmployeeKey,DeliveryPartnerID,OrderDateKey,ShippedDateKey,RecievedDateKey,CancellationDateKey,Quantity,Price,ExtendedPriceAmount,RowIsCurrent
1,SKU001,completed,1,1,7,11,20230115,20230120.0,20230125.0,,150,1.5,225.0,1
1,SKU002,completed,2,1,7,11,20230115,20230120.0,20230125.0,,100,45.0,4500.0,1
2,SKU003,completed,3,12,2,9,20230310,20230315.0,20230320.0,,200,3.0,600.0,1
2,SKU004,completed,4,12,2,9,20230310,20230315.0,20230320.0,,75,25.0,1875.0,1
3,SKU005,cancelled,5,8,5,3,20230405,,,20230407.0,150,5.0,750.0,1
4,SKU006,completed,6,4,3,7,20230618,20230620.0,20230625.0,,250,7.5,1875.0,1
4,SKU007,completed,7,4,3,7,20230618,20230620.0,20230625.0,,120,0.75,90.0,1
5,SKU008,completed,8,5,5,6,20230820,20230825.0,20230830.0,,300,15.0,4500.0,1
5,SKU009,completed,9,5,5,6,20230820,20230825.0,20230830.0,,200,2.0,400.0,1
6,SKU010,cancelled,10,15,6,1,20231015,,,20231018.0,150,1.5,225.0,1
