## Spark Overview


### **[Apache Spark](https://spark.apache.org/)™ is a multi-language engine for executing data engineering ... on single-node machines or clusters.**



<img src="https://spark.apache.org/docs/3.5.1/img/cluster-overview.png" width="1200" />









<img src="https://github.com/dbrownems/SparkDataEngineeringForSQLServerProfessionals/blob/main/cluster_overview2.png?raw=true" width="1200" />

# Introduction to Notebooks



Notebooks are the primary development tool for Spark.

 - They are an interactive tool for development and data analysis
 - They can also be saved and run as part of a job
 - In addition to code, they support markdown, so you can embed rich documentation in your jobs

<details>
Additional resources:

- [Develop, execute, and manage Microsoft Fabric notebooks](https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook)
- [Python for beginners](https://learn.microsoft.com/en-us/shows/intro-to-python-development/)
- [Spark docs](https://spark.apache.org/docs/latest/)
- [Delta docs](https://docs.delta.io/latest/index.html)


</details>

# Python code in Notebooks

In [2]:
%%pyspark
# top level variables in notebooks have session scope
msg = "hello from python"

def print_message():
    msg2 = msg
    print(msg2)

a = 2

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 4, Finished, Available, Finished)

In [3]:
%%pyspark

#print the session variable
print(msg)

#run the function
print_message()

#change the value
msg = "hello again"

#print the changed value
print_message()

#what is the print_message object?
print(print_message)

#assign the function to a variable
f = print_message

print(f)

#run that
f()


#msg2 isn't defined here; it's a local variable inside the print_message function
print(msg2)

#notice that every statement before this one still ran: Python is an "interpreted" language

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 5, Finished, Available, Finished)

hello from python
hello from python
hello again
<function print_message at 0x7f8d0a1b16c0>
<function print_message at 0x7f8d0a1b16c0>
hello again


NameError: name 'msg2' is not defined
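
The same scoping behavior can be reproduced in plain Python outside a notebook. The sketch below is a variation on the cell above, not a copy: `print_message` returns the text instead of printing it so the values can be compared, and the `NameError` is caught rather than left to stop the cell.

```python
# Module-level variable (in a notebook this would have session scope).
msg = "hello from python"

def print_message():
    # msg is looked up when the function runs, not when it is defined.
    msg2 = msg  # msg2 exists only inside this call
    return msg2

first = print_message()
msg = "hello again"
second = print_message()  # sees the new value of msg

# Functions are ordinary objects, so a second name can refer to the same one.
f = print_message
same_object = f is print_message

# msg2 never leaks out of the function, so reading it here raises NameError.
try:
    msg2
    leaked = True
except NameError:
    leaked = False

print(first, "|", second, "|", same_object, "|", leaked)
```

Because the statements run one at a time, everything before the failing lookup has already taken effect by the time the error is raised.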

# Working with Data

## DataFrame basics


"A DataFrame is a distributed collection of data organized into named columns."

[Datasets and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html)

In [4]:
%%pyspark

df = spark.read.format("Delta").load("Tables/Sales_Customers")
print(df)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 6, Finished, Available, Finished)

DataFrame[CustomerID: int, CustomerName: string, BillToCustomerID: int, CustomerCategoryID: int, BuyingGroupID: int, PrimaryContactPersonID: int, AlternateContactPersonID: int, DeliveryMethodID: int, DeliveryCityID: int, PostalCityID: int, CreditLimit: decimal(38,18), AccountOpenedDate: timestamp, StandardDiscountPercentage: decimal(38,18), IsStatementSent: boolean, IsOnCreditHold: boolean, PaymentDays: int, PhoneNumber: string, FaxNumber: string, DeliveryRun: string, RunPosition: string, WebsiteURL: string, DeliveryAddressLine1: string, DeliveryAddressLine2: string, DeliveryPostalCode: string, PostalAddressLine1: string, PostalAddressLine2: string, PostalPostalCode: string, LastEditedBy: int, ValidFrom: timestamp, ValidTo: timestamp]


In [5]:
%%pyspark

df = df.where("CustomerName like 'A%'").select("CustomerID","CustomerName")
print(df)


StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 7, Finished, Available, Finished)

DataFrame[CustomerID: int, CustomerName: string]


In [6]:
%%pyspark

display(df)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 8, Finished, Available, Finished)

SynapseWidget(Synapse.DataFrame, d9d8b6a4-b1b5-42a9-91d8-292cdb217e67)

- A DataFrame can be a distributed in-memory collection of rows
- A DataFrame can be a reference to a Data Lake location
- A DataFrame can specify a series of transformations

So what _is_ a DataFrame?
<details>

1. A DataFrame is a reference to external data

2. A DataFrame is a query over one or more DataFrames

<details>
So basically, a DataFrame is a query, and the DataFrame API is an API for creating and modifying queries.
</details>
</details>
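
To make the "DataFrame as a query" idea concrete, here is a toy model in plain Python. It is not how Spark is implemented: `ToyQuery`, `load_customers`, and the sample rows are hypothetical names and data invented for illustration. The point it shows is that `where` and `select` only record steps, and the source is read only when an action (`collect`) runs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToyQuery:
    """Hypothetical stand-in for a DataFrame: a source plus pending steps."""
    source: Callable[[], list]  # how to fetch rows when they are needed
    steps: tuple = ()           # transformations recorded so far

    def where(self, predicate):
        # Returns a new query; no rows are read here.
        return ToyQuery(self.source, self.steps + (("where", predicate),))

    def select(self, *names):
        return ToyQuery(self.source, self.steps + (("select", names),))

    def collect(self):
        # The "action": only now is the source read and the steps applied.
        rows = self.source()
        for kind, arg in self.steps:
            if kind == "where":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows

reads = []

def load_customers():
    reads.append("read")  # record each time the source is actually touched
    return [
        {"CustomerID": 1, "CustomerName": "Ann", "City": "X"},
        {"CustomerID": 2, "CustomerName": "Bob", "City": "Y"},
    ]

q = (ToyQuery(load_customers)
     .where(lambda r: r["CustomerName"].startswith("A"))
     .select("CustomerID", "CustomerName"))

reads_before = len(reads)  # building the query has not read anything yet
result = q.collect()
print(reads_before, len(reads), result)
```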


In [7]:
%%pyspark

# the DataFrame object has an API to transform the DataFrame
# and you can easily do things like rename all the columns

def fix_col_name(name):
    name = name.lower()\
               .replace("cust_","customer_")\
               .replace("addr_","address_")
               
    return "".join(x.capitalize() for x in name.lower().split("_"))

df = spark.sql("select 1 ID, 'Ann' CUST_NAME, '123 Garden Way' CUST_ADDRESS")

display(df)
for col in df.columns:
    df = df.withColumnRenamed(col, fix_col_name(col))

display(df)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 9, Finished, Available, Finished)

SynapseWidget(Synapse.DataFrame, 6fc4d264-1424-491d-b81e-61b63174a948)

SynapseWidget(Synapse.DataFrame, ee0d6d53-2c62-42d5-98b4-68a5726d55fc)
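
Since `fix_col_name` is plain Python, it can be checked without a Spark session. The sketch below repeats the function from the cell above (condensed onto fewer lines) and applies it to the three column names from the sample query.

```python
def fix_col_name(name):
    # Same logic as the notebook cell: expand abbreviations, then PascalCase.
    name = name.lower().replace("cust_", "customer_").replace("addr_", "address_")
    return "".join(part.capitalize() for part in name.split("_"))

renamed = {col: fix_col_name(col) for col in ["ID", "CUST_NAME", "CUST_ADDRESS"]}
print(renamed)
# {'ID': 'Id', 'CUST_NAME': 'CustomerName', 'CUST_ADDRESS': 'CustomerAddress'}
```

Note that `ID` becomes `Id`, and that the `addr_` rule does not fire on `CUST_ADDRESS` because the text after `addr` is not an underscore. In PySpark, `df.toDF(*[fix_col_name(c) for c in df.columns])` should apply the same renaming in a single call instead of a loop.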

[SparkSQL Reference](https://spark.apache.org/docs/latest/sql-ref-syntax.html#sql-syntax)

In [8]:
show tables

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 10, Finished, Available, Finished)

<Spark SQL result set with 31 rows and 3 fields>

In [10]:
SELECT * FROM Sales_Customers LIMIT 10;
SELECT * FROM WideWorldImporters_gold.Dimension_Customer LIMIT 1000;

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 14, Finished, Available, Finished)

<Spark SQL result set with 10 rows and 30 fields>

<Spark SQL result set with 204 rows and 11 fields>

In [11]:
%%pyspark
df = spark.sql("SELECT * FROM WideWorldImporters_bronze.Sales_Customers LIMIT 1000")
display(df)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 15, Finished, Available, Finished)

SynapseWidget(Synapse.DataFrame, 99806408-ec14-43f5-b640-933a58a7ae1f)

## Loading a Dimension

### Generating Dimension Keys

[Spark SQL built-in functions](https://spark.apache.org/docs/latest/api/sql/index.html)

In [12]:
--hash of business key and source system
select xxhash64(CustomerId,"CRM") ID, *
from WideWorldImporters_bronze.Sales_Customers limit 10;

--or use a GUID
select uuid() ID, * 
from WideWorldImporters_bronze.Sales_Customers limit 10;

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 17, Finished, Available, Finished)

<Spark SQL result set with 10 rows and 31 fields>

<Spark SQL result set with 10 rows and 31 fields>

In [13]:
-- or use https://spark.apache.org/docs/latest/api/sql/index.html#monotonically_increasing_id
-- But the sequence has big gaps when processing across multiple worker nodes
select monotonically_increasing_id() ID, *
from WideWorldImporters_gold.Dimension_Customer_by_postalcode

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 18, Finished, Available, Finished)

<Spark SQL result set with 403 rows and 12 fields>

In [14]:
--or use SQL analytic functions to assign monotonically increasing keys
select  coalesce(c.CustomerKey,max(c.CustomerKey) over() 
                             + row_number() over (partition by c.CustomerKey order by s.CustomerID)) CustomerKey, 
       c.CustomerKey ExistingDimKey, 
       s.CustomerID
from WideWorldImporters_bronze.Sales_Customers s 
left join WideWorldImporters_gold.Dimension_Customer c 
  on s.CustomerID = c.WWICustomerID
order by CustomerKey;

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 20, Finished, Available, Finished)

<Spark SQL result set with 663 rows and 3 fields>
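
The key-assignment rule in the query above (keep the existing key if there is one, otherwise number new rows after the current maximum) can be sketched in plain Python. `assign_keys` and its sample inputs are hypothetical. The sketch also makes two assumptions that differ from the SQL: it starts from 0 when the dimension is empty, whereas `max()` over only NULLs is NULL in SQL and would produce NULL keys, and it takes the maximum over every key passed in rather than only over rows that joined.

```python
def assign_keys(source_ids, existing_keys):
    """Keep existing surrogate keys; number new business keys after the current max.

    source_ids:    iterable of business keys (CustomerID)
    existing_keys: dict of business key -> surrogate key already in the dimension
    """
    # Assumption: an empty dimension starts numbering at 1.
    current_max = max(existing_keys.values(), default=0)
    # New business keys are numbered in business-key order, like
    # row_number() over (... order by s.CustomerID).
    new_ids = sorted(i for i in set(source_ids) if i not in existing_keys)
    keys = {i: existing_keys[i] for i in source_ids if i in existing_keys}
    for offset, business_key in enumerate(new_ids, start=1):
        keys[business_key] = current_max + offset
    return keys

keys = assign_keys([10, 20, 30, 40], {10: 1, 30: 7})
first_load = assign_keys([5, 3], {})
print(keys, first_load)
```

Here customers 10 and 30 keep keys 1 and 7, while the new customers 20 and 40 receive 8 and 9.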

### Temporary Views and Temporary Tables

In [15]:
--temporary views are very cool
--like Common Table Expressions or subqueries, but much more powerful
--They have session lifetime, rather than statement lifetime
create or replace temp view CustomerKeys as
select  coalesce(c.CustomerKey,max(c.CustomerKey) over() 
                             + row_number() over (partition by c.CustomerKey order by s.CustomerID)) CustomerKey, 
        s.CustomerID
from WideWorldImporters_bronze.Sales_Customers s 
left join WideWorldImporters_gold.Dimension_Customer c 
  on s.CustomerID = c.WWICustomerID

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 21, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

In [16]:
explain select * from CustomerKeys

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 22, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 1 fields>

In [17]:
--but temp views can be cached, at which point they are essentially temp tables
--data is cached on the executor VMs, so this is useful for Delta tables too
cache table CustomerKeys

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 23, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

In [18]:
explain  select * from CustomerKeys

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 24, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 1 fields>

### Merging the dimension


In [19]:
select * from WideWorldImporters_gold.Dimension_Customer limit 10

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 25, Finished, Available, Finished)

<Spark SQL result set with 10 rows and 11 fields>

In [20]:
-- describe WideWorldImporters_gold.Dimension_Customer;
-- describe WideWorldImporters_bronze.Sales_Customers;
 create or replace temp view CustomerMergeSource 
 as
 select k.CustomerKey CustomerKey,
        c.CustomerID WWICustomerID,
        c.CustomerName Customer,
        bc.CustomerName BillToCustomer,
        cat.CustomerCategoryName Category,
        bg.BuyingGroupName BuyingGroup,
        p.FullName PrimaryContact,
        c.PostalPostalCode PostalCode,
        cast(0 as int) LineageKey,
        c.ValidFrom,
        c.ValidTo
    from WideWorldImporters_bronze.Sales_Customers c
    left join CustomerKeys k
       on k.CustomerID = c.CustomerID
    left join WideWorldImporters_bronze.Sales_Customers bc 
       on c.BillToCustomerID = bc.CustomerID
    left join WideWorldImporters_bronze.Sales_CustomerCategories cat 
       on cat.CustomerCategoryID = c.CustomerCategoryID
    left join WideWorldImporters_bronze.Sales_BuyingGroups bg 
       on c.BuyingGroupId = bg.BuyingGroupID
    left join WideWorldImporters_bronze.Application_People p 
       on p.PersonID = c.PrimaryContactPersonID
        



StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 26, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

### Validate the data

In [21]:
%%pyspark 

# fail if any source row did not get a CustomerKey
ids = spark.sql("select WWICustomerID from CustomerMergeSource where CustomerKey is null limit 100").collect()

if len(ids) > 0:
    raise ValueError(f"Invalid CustomerKey values for {len(ids)} keys, example: {ids[0]}")

# fail if the same CustomerKey was assigned to more than one row
ids = spark.sql("select CustomerKey from CustomerMergeSource group by CustomerKey having count(*)>1 limit 100").collect()

if len(ids) > 0:
    raise ValueError(f"Duplicate CustomerKey values for {len(ids)} keys, example: {ids[0]}")

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 27, Finished, Available, Finished)
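
The two checks above are simple enough to express over plain Python rows, which can be handy for testing the rule itself without a Spark session. `find_key_problems` and the sample rows are hypothetical; only the column names come from the notebook.

```python
from collections import Counter

def find_key_problems(rows):
    """Return (business keys with no surrogate key, surrogate keys used more than once)."""
    missing = [r["WWICustomerID"] for r in rows if r["CustomerKey"] is None]
    counts = Counter(r["CustomerKey"] for r in rows if r["CustomerKey"] is not None)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return missing, duplicates

rows = [
    {"WWICustomerID": 1, "CustomerKey": 100},
    {"WWICustomerID": 2, "CustomerKey": None},  # no key assigned
    {"WWICustomerID": 3, "CustomerKey": 100},   # key reused
]
missing, duplicates = find_key_problems(rows)
print(missing, duplicates)
```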

### Upsert the Dimension

In [22]:
merge into WideWorldImporters_gold.Dimension_Customer dest
using CustomerMergeSource src
  on src.WWICustomerID = dest.WWICustomerID
when matched then update set *
when not matched then insert *


StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 28, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 4 fields>
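
As a rough mental model, the `merge` above behaves like the following toy upsert over Python dictionaries. `merge_upsert` and the sample rows are hypothetical, and the model is deliberately simplified: for example, Delta raises an error when several source rows match the same target row, which this sketch does not check.

```python
def merge_upsert(dest, src, key):
    """Toy MERGE: update rows whose key matches, insert the rest. Returns counts."""
    by_key = {row[key]: row for row in dest}
    updated = inserted = 0
    for row in src:
        if row[key] in by_key:
            by_key[row[key]].update(row)  # when matched then update set *
            updated += 1
        else:
            new_row = dict(row)
            dest.append(new_row)          # when not matched then insert *
            by_key[row[key]] = new_row
            inserted += 1
    return updated, inserted

dest = [{"WWICustomerID": 1, "Customer": "Ann"}]
src = [
    {"WWICustomerID": 1, "Customer": "Ann B"},
    {"WWICustomerID": 2, "Customer": "Bob"},
]
first_run = merge_upsert(dest, src, "WWICustomerID")
second_run = merge_upsert(dest, src, "WWICustomerID")
print(first_run, second_run, dest)
```

Running the same merge a second time only updates rows and inserts nothing, which is what makes an upsert safe to re-run after a failed job.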

## Bring in Unstructured Data

In [23]:
alter table WideWorldImporters_gold.Dimension_Customer 
add columns( Latitude float, Longitude float );

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 29, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

```
[
    {
        "CustomerID": 1,
        "Location": "POINT (-102.6201979 41.4972022)"
    },
    {
        "CustomerID": 2,
        "Location": "POINT (-115.8743507 48.7163356)"
    },
    {
        "CustomerID": 3,
        "Location": "POINT (-112.7271223 34.2689145)"
    },
    {
        "CustomerID": 4,
        "Location": "POINT (-98.580361 37.2811339)"
    },
```

In [24]:
%%pyspark 

# read the file as plain text lines first, to see what it looks like
df = spark.read.text("Files/CustomerLocations.json").limit(20)
display(df)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 30, Finished, Available, Finished)

SynapseWidget(Synapse.DataFrame, a9af7b63-d557-4b4a-b143-4daf729b8899)

In [25]:
%%pyspark

# to read data files without a built-in schema, supply the schema explicitly
# You can infer the schema, and then save and modify it if you like

from pyspark.sql.types import *

schema = StructType([
    StructField("CustomerID",IntegerType(),True),
    StructField("Location",StringType(),True)
])

dfCustLocations = spark.read\
                       .schema(schema)\
                       .option("multiLine", True)\
                       .json("Files/CustomerLocations.json")

display(dfCustLocations)

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 31, Finished, Available, Finished)

SynapseWidget(Synapse.DataFrame, 2069c371-cb87-4b8a-a4ac-58ef28f17780)

In [26]:
select * 
from dfCustLocations

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 32, Finished, Available, Finished)

Error: [TABLE_OR_VIEW_NOT_FOUND] The table or view `dfCustLocations` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 2 pos 5;
'Project [*]
+- 'UnresolvedRelation [dfCustLocations], [], false


In [27]:
%%pyspark

dfCustLocations.createOrReplaceTempView("CustLocations")

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 33, Finished, Available, Finished)

In [28]:
--   'POINT (-123.8860114 47.4631419)'
  
  select CustomerId, split(Location,' ') locSplit
  from CustLocations

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 34, Finished, Available, Finished)

<Spark SQL result set with 663 rows and 2 fields>

In [30]:


-- select CustomerId, split(Location,' ') locSplit
-- from CustLocations;
-- "["POINT","(-120.1290272","36.0041223)"]"
create or replace temp view CustLocations2 as
with q AS
(
  select CustomerId, split(Location,' ') locSplit
  from CustLocations
)
select CustomerId, try_cast(replace(locSplit[1],'(','') as double) Long, try_cast(replace(locSplit[2],')','') as double) Lat
from q;


StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 36, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

In [31]:
select * from CustLocations2

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 37, Finished, Available, Finished)

<Spark SQL result set with 663 rows and 3 fields>
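
The split-and-replace parsing in the view above translates directly to plain Python, which makes the steps easy to check on a single value. `parse_point` is a hypothetical helper; returning `None` for malformed input is meant to mirror what `try_cast` does in the SQL version.

```python
def parse_point(location):
    """Return (longitude, latitude) from 'POINT (x y)', or (None, None) if malformed."""
    try:
        parts = location.split(" ")             # ['POINT', '(x', 'y)']
        lon = float(parts[1].replace("(", ""))  # WKT lists x (longitude) first
        lat = float(parts[2].replace(")", ""))
        return lon, lat
    except (AttributeError, IndexError, ValueError):
        # Bad or missing input yields nulls instead of an error.
        return None, None

good = parse_point("POINT (-102.6201979 41.4972022)")
bad = parse_point("POINT EMPTY")
missing = parse_point(None)
print(good, bad, missing)
```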

In [32]:
%%pyspark
%pip install shapely

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 42, Finished, Available, Finished)

Collecting shapely
  Downloading shapely-2.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Downloading shapely-2.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: shapely
Successfully installed shapely-2.0.7

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.



In [33]:
%%pyspark
from shapely import wkt

shape = wkt.loads('POINT (-123.8860114 47.4631419)')
shape.centroid.x

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 44, Finished, Available, Finished)

-123.8860114

In [34]:
%%pyspark
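from shapely import wkt
from pyspark.sql.types import FloatType

def lat(s):
    # WKT points are "POINT (x y)", so y is the latitude
    if s is None:
        return None
    shape = wkt.loads(s)
    return float(shape.centroid.y)


def lon(s):
    # x is the longitude
    if s is None:
        return None
    shape = wkt.loads(s)
    return float(shape.centroid.x)


# register the functions so they can be called from Spark SQL
spark.udf.register("lat", lat, FloatType())
spark.udf.register("lon", lon, FloatType())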

lon('POINT (-123.8860114 47.4631419)')

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 45, Finished, Available, Finished)

-123.8860114

In [35]:
select lat('POINT (-123.8860114 47.4631419)') lat

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 46, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 1 fields>
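
One detail worth knowing about the UDFs above: they are registered with `FloatType`, which is a 4-byte single-precision type in Spark, whereas the earlier split-based view cast to `double`. The sketch below uses Python's `struct` module to emulate the round trip through a 32-bit float; `as_float32` is a hypothetical helper, not part of Spark.

```python
import struct

def as_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]

lon_value = -123.8860114
stored = as_float32(lon_value)
error = abs(stored - lon_value)
print(stored, error)
```

The stored value differs slightly from the input. Whether that matters depends on the precision the coordinates need; registering the UDFs with `DoubleType` and using `double` columns would avoid the rounding.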

In [36]:
create or replace temp view CustLocations2 as

select CustomerId, lon(location) Long, lat(location) Lat
from CustLocations;

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 47, Finished, Available, Finished)

<Spark SQL result set with 0 rows and 0 fields>

In [37]:
with q as
(
    select c.*, l.Long NewLongitude, l.Lat NewLatitude
    from WideWorldImporters_gold.Dimension_Customer c
    left join CustLocations2 l 
    on c.WWICustomerID = l.CustomerID
)
update q set Latitude = NewLatitude, Longitude = NewLongitude

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 48, Finished, Available, Finished)

Error: [DELTA_UNSUPPORTED_SOURCE] UPDATE destination only supports Delta sources.
Some(Project [CustomerKey#8225, WWICustomerID#8226, Customer#8227, BillToCustomer#8228, Category#8229, BuyingGroup#8230, PrimaryContact#8231, PostalCode#8232, ValidFrom#8233, ValidTo#8234, LineageKey#8235, Latitude#8236, Longitude#8237, Long#8241 AS NewLongitude#8223, Lat#8242 AS NewLatitude#8224]
+- Join LeftOuter, (WWICustomerID#8226 = CustomerID#8240)
   :- Relation spark_catalog.wideworldimporters_gold.dimension_customer[CustomerKey#8225,WWICustomerID#8226,Customer#8227,BillToCustomer#8228,Category#8229,BuyingGroup#8230,PrimaryContact#8231,PostalCode#8232,ValidFrom#8233,ValidTo#8234,LineageKey#8235,Latitude#8236,Longitude#8237] parquet
   +- Project [cast(CustomerId#8150 as int) AS CustomerId#8240, cast(Long#8238 as float) AS Long#8241, cast(Lat#8239 as float) AS Lat#8242]
      +- Project [CustomerId#8150, lon(location#8151)#8243 AS Long#8238, lat(location#8151)#8244 AS Lat#8239]
         +- Relation [CustomerID#8150,Location#8151] json
)

In [38]:
update WideWorldImporters_gold.Dimension_Customer c
set Latitude = (select max(Latitude) from CustLocations2 l where l.CustomerID = c.WWICustomerID ),
    Longitude = (select max(Longitude) from CustLocations2 l where l.CustomerID = c.WWICustomerID )

   

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 49, Finished, Available, Finished)

Error: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_SCALAR_SUBQUERY] Unsupported subquery expression: Correlated scalar subqueries can only be used in filters, aggregations, projections, and UPDATE/MERGE/DELETE commandsUpdateCommand Delta[version=56, ... msit-onelake.dfs.fabric.microsoft.com/54be16ef-ba86-4fd6-b494-2013153c2245/Tables/Dimension_Customer], `spark_catalog`.`wideworldimporters_gold`.`dimension_customer`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [CustomerKey#8248, WWICustomerID#8249, Customer#8250, BillToCustomer#8251, Category#8252, BuyingGroup#8253, PrimaryContact#8254, PostalCode#8255, ValidFrom#8256, ValidTo#8257, LineageKey#8258, scalar-subquery#8246 [max(Latitude#8259) && WWICustomerID#8249], scalar-subquery#8247 [max(Longitude#8260) && WWICustomerID#8249]]
   +- SubqueryAlias c
      +- SubqueryAlias spark_catalog.WideWorldImporters_gold.Dimension_Customer
         +- Relation spark_catalog.wideworldimporters_gold.dimension_customer[CustomerKey#8248,WWICustomerID#8249,Customer#8250,BillToCustomer#8251,Category#8252,BuyingGroup#8253,PrimaryContact#8254,PostalCode#8255,ValidFrom#8256,ValidTo#8257,LineageKey#8258,Latitude#8259,Longitude#8260] parquet
.; line 1 pos 0;
UpdateCommand Delta[version=56, ... msit-onelake.dfs.fabric.microsoft.com/54be16ef-ba86-4fd6-b494-2013153c2245/Tables/Dimension_Customer], `spark_catalog`.`wideworldimporters_gold`.`dimension_customer`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [CustomerKey#8248, WWICustomerID#8249, Customer#8250, BillToCustomer#8251, Category#8252, BuyingGroup#8253, PrimaryContact#8254, PostalCode#8255, ValidFrom#8256, ValidTo#8257, LineageKey#8258, scalar-subquery#8246 [max(Latitude#8259) && WWICustomerID#8249], scalar-subquery#8247 [max(Longitude#8260) && WWICustomerID#8249]]
   +- SubqueryAlias c
      +- SubqueryAlias spark_catalog.WideWorldImporters_gold.Dimension_Customer
         +- Relation spark_catalog.wideworldimporters_gold.dimension_customer[CustomerKey#8248,WWICustomerID#8249,Customer#8250,BillToCustomer#8251,Category#8252,BuyingGroup#8253,PrimaryContact#8254,PostalCode#8255,ValidFrom#8256,ValidTo#8257,LineageKey#8258,Latitude#8259,Longitude#8260] parquet


**#MergeEverything**

In [39]:
merge into  WideWorldImporters_gold.Dimension_Customer dest 
using CustLocations2 src
on src.CustomerID = dest.WWICustomerID
when matched then update set Latitude = src.Lat, Longitude = src.Long

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 50, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 4 fields>

In [40]:
select * from  WideWorldImporters_gold.Dimension_Customer limit 10


StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 51, Finished, Available, Finished)

<Spark SQL result set with 10 rows and 13 fields>

## Delta table history

In [41]:
describe history WideWorldImporters_gold.Dimension_Customer

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 52, Finished, Available, Finished)

<Spark SQL result set with 21 rows and 15 fields>

```
--try to unwind the ETL, yuck
alter table WideWorldImporters_gold.Dimension_Customer drop column Latitude;
alter table WideWorldImporters_gold.Dimension_Customer drop column Longitude;
```

In [42]:
SELECT *, count(*) over () rows 
FROM WideWorldImporters_gold.Dimension_Customer  
version as of 55

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 53, Finished, Available, Finished)

<Spark SQL result set with 664 rows and 12 fields>

In [43]:
restore table WideWorldImporters_gold.Dimension_Customer 
to version as of 55

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 54, Finished, Available, Finished)

<Spark SQL result set with 1 rows and 6 fields>

In [44]:
describe history WideWorldImporters_gold.Dimension_Customer

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 55, Finished, Available, Finished)

<Spark SQL result set with 22 rows and 15 fields>
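
The history grew from 21 to 22 rows after the restore, which suggests that the restore is recorded as a new entry rather than erasing later versions. The toy model below illustrates that idea in plain Python. `ToyVersionedTable` is hypothetical and stores a full copy of the rows per version, which is not how Delta stores data.

```python
class ToyVersionedTable:
    """Hypothetical model: every write appends a new immutable snapshot."""

    def __init__(self, rows):
        self.history = [("CREATE", list(rows))]

    def write(self, operation, rows):
        self.history.append((operation, list(rows)))

    def read(self, version=None):
        # version=None means "latest", like a plain select.
        index = len(self.history) - 1 if version is None else version
        return list(self.history[index][1])

    def restore(self, version):
        # Restoring keeps later versions; it commits a copy of the old snapshot.
        self.write(f"RESTORE to {version}", self.read(version))

table = ToyVersionedTable([{"id": 1}])
table.write("MERGE", [{"id": 1, "Latitude": 41.5}])
table.restore(0)

print(len(table.history), table.read(), table.read(1))
```

After the restore, the latest version matches version 0 again, while version 1 (with the extra column) is still readable by number.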

# Notebook Orchestration

In [45]:
%%pyspark

mssparkutils.notebook.runMultiple(["LoadCustomerDim","LoadDateDim","LoadEmployeeDim"])
mssparkutils.notebook.runMultiple(["LoadPurchaseFact","LoadOrderFact"])
mssparkutils.notebook.runMultiple(["LoadSaleFact"])

StatementMeta(, fc232cc6-0e44-48fb-b632-9a319b068c4a, 56, Finished, Available, Finished)

{'0': {'exitVal': '', 'exception': None}}
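
The three calls above form stages: the dimension loads run first, then the two fact loads, then the last fact load. As I understand it, `runMultiple` runs the notebooks in each list concurrently, and each call returns before the next one starts. The sketch below shows the same staging pattern with Python's standard library; `run_stages` and `fake_run` are hypothetical stand-ins, not part of `mssparkutils`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stages(stages, run_notebook):
    """Run each stage's notebooks concurrently; start a stage only after the previous one ends."""
    results = {}
    for stage in stages:
        with ThreadPoolExecutor(max_workers=len(stage)) as pool:
            # Iterating pool.map waits for every notebook in this stage.
            for name, exit_value in zip(stage, pool.map(run_notebook, stage)):
                results[name] = exit_value
    return results

def fake_run(name):
    # Hypothetical stand-in for a notebook run.
    return f"{name}: ok"

stages = [
    ["LoadCustomerDim", "LoadDateDim", "LoadEmployeeDim"],
    ["LoadPurchaseFact", "LoadOrderFact"],
    ["LoadSaleFact"],
]
results = run_stages(stages, fake_run)
print(results)
```

Grouping by stage keeps the dependency rule simple: facts load only after every dimension they reference has finished.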