## Introduction to Spark Notebooks

Let's look at how to do data discovery/sandboxing with Spark Pools.  

A few pointers to get started:
* run only 1 cell at a time
* you will need to change the storage connection settings (account, container, SAS token) to match your environment
* `ESC + a` to add a cell `above` the current cell
* `ESC + b` to add a cell `below` the current cell
* `Ctrl+Enter` or `Shift+Enter` to execute a cell


Navigate to `sale-small/Year=2018/Quarter=Q1/Month=1/Day=20180101/sale-small-20180101-snappy.parquet`, right-click the file, and choose **New notebook**, then **Load to DataFrame**.

You can copy the code from the cells below or simply use this notebook directly, but you will have to replace the storage connection settings with the ones from your other notebook.


# Let's make sure your Spark session is configured correctly first

In [None]:
1+1

In [None]:
#sp=rl&st=2022-04-28T16:20:50Z&se=2024-04-29T00:20:50Z&spr=https&sv=2022-11-02&sr=c&sig=J6MOgl2Z2Egxsi8LuUh%2FqgMy3bgv4htpz5q5MXMNIPA%3D
#https://asadatalakedavew891.blob.core.windows.net/wwi-02?sp=rl&st=2022-04-28T16:20:50Z&se=2024-04-29T00:20:50Z&spr=https&sv=2022-11-02&sr=c&sig=J6MOgl2Z2Egxsi8LuUh%2FqgMy3bgv4htpz5q5MXMNIPA%3D

storageAccount='asadatalakedavew891'
container='wwi-02'
sasToken='sp=rl&st=2022-04-28T16:20:50Z&se=2024-04-29T00:20:50Z&spr=https&sv=2022-11-02&sr=c&sig=J6MOgl2Z2Egxsi8LuUh%2FqgMy3bgv4htpz5q5MXMNIPA%3D'
lakepath='wasbs://{}@{}.blob.core.windows.net/'.format(container,storageAccount)

sc._jsc.hadoopConfiguration().set("fs.azure.sas.{0}.{1}.blob.core.windows.net".format(container,storageAccount), sasToken)
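
To see what the configuration cell builds, the two `format` calls can be run on their own, without a Spark session. This is a minimal sketch that reuses the account and container names from the cell above (substitute your own); `config_key` is a name introduced here purely for illustration.

```python
# Same values as the configuration cell; substitute your own account and container.
storageAccount = 'asadatalakedavew891'
container = 'wwi-02'

# Base URL that every path in this notebook is appended to.
lakepath = 'wasbs://{}@{}.blob.core.windows.net/'.format(container, storageAccount)

# Name of the Hadoop setting that holds the SAS token for this container.
# config_key is a hypothetical variable name, used here for illustration only.
config_key = 'fs.azure.sas.{0}.{1}.blob.core.windows.net'.format(container, storageAccount)

print(lakepath)
print(config_key)
```

If either printed string looks wrong, fix the account or container name before running the cells that read from the lake.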

A little Python tutorial on syntax...

In [None]:
filepath = lakepath + 'sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191201/sale-small-20191201-snappy.parquet'
dfSales = spark.read.load(filepath, format='parquet')
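
The file paths in this notebook follow a regular folder layout, so a path can be built from a date instead of typed by hand. The sketch below infers that layout from the two example paths shown in this notebook and assumes it holds for every day; `sale_small_path` is a hypothetical helper, not part of any library.

```python
from datetime import date


def sale_small_path(day: date) -> str:
    """Build the relative path of one daily file (hypothetical helper)."""
    quarter = (day.month - 1) // 3 + 1   # months 1-3 -> Q1, ..., months 10-12 -> Q4
    stamp = day.strftime('%Y%m%d')       # e.g. 20191201
    return (
        'sale-small/Year={}/Quarter=Q{}/Month={}/Day={}/sale-small-{}-snappy.parquet'
        .format(day.year, quarter, day.month, stamp, stamp)
    )


print(sale_small_path(date(2019, 12, 1)))
```

In the notebook, the result would be appended to `lakepath` before being passed to `spark.read.load`.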


In [None]:
# show the dataframe
# dfSales.show(10)

In [None]:
# display the dataframe
# display(dfSales)

In [None]:
# print the schema
# dfSales.printSchema()

This is a Python (.ipynb) notebook, but we can write SQL too, using the `%%sql` cell `magic`.

In [None]:
# registering the DataFrame as a temp view creates the link between PySpark and Spark SQL
# dfSales.createOrReplaceTempView("dfSales")

In [None]:
%%sql

-- display the data using SQL
-- SELECT * FROM dfSales LIMIT 10


Let's look at using wildcards in data lake paths.
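
To get a feel for which files a wildcard path selects, the matching can be approximated in plain Python. This sketch assumes the usual Hadoop-style behaviour, where each `*` stands for exactly one path level. `matches_glob` is a hypothetical helper, and the first and third candidate paths are made up to follow this notebook's folder layout.

```python
from fnmatch import fnmatchcase


def matches_glob(path: str, pattern: str) -> bool:
    """Return True if each path level matches the pattern level at the same depth."""
    path_parts = path.split('/')
    pattern_parts = pattern.split('/')
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts))


pattern = 'sale-small/Year=2018/Quarter=Q4/*/*/*'

# Candidate paths; the first and third are hypothetical examples.
candidates = [
    'sale-small/Year=2018/Quarter=Q4/Month=10/Day=20181001/sale-small-20181001-snappy.parquet',
    'sale-small/Year=2018/Quarter=Q1/Month=1/Day=20180101/sale-small-20180101-snappy.parquet',
    'sale-small/Year=2018/Quarter=Q4/Month=10',
]

for path in candidates:
    print(matches_glob(path, pattern), path)
```

Only the first candidate matches: the second is in a different quarter, and the third stops at the month folder, so it has too few levels for the three `*` placeholders.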

In [None]:
filepath = lakepath + 'sale-small/Year=2018/Quarter=Q4/*/*/*'
dfSales2018Q4 = spark.read.load(filepath, format='parquet')


In [None]:
# display(dfSales2018Q4)

In [None]:
# describe() returns a new DataFrame of summary statistics; call show() to see it
# dfSales2018Q4.describe().show()
# dfSales.describe().show()

In [None]:
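# now let's do some aggregations on our df.
# Let's look at sum/avg profit by TransactionDate

# round/sum/avg must come from pyspark.sql.functions; Python's built-ins do not work on columns
from pyspark.sql import functions as F

# python is whitespace-sensitive; the outer parens let the chained calls span several lines
dfProfitByDate = (
    dfSales2018Q4
    .groupBy("TransactionDate")
    .agg(
        F.round(F.sum("ProfitAmount"), 2).alias("(sum)Profit"),
        F.round(F.avg("ProfitAmount"), 2).alias("(avg)Profit"),
    )
    .orderBy("TransactionDate")
)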
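
To make clear what the `groupBy` / `agg` / `orderBy` chain computes, here is the same calculation in plain Python on a few made-up rows. The data is hypothetical and the code is only a sketch of the logic, not how Spark executes it. Here the built-ins `sum` and `round` work on plain lists; in the Spark cell, the functions of the same name must come from `pyspark.sql.functions`.

```python
from collections import defaultdict

# Hypothetical rows: (TransactionDate, ProfitAmount). Not real data from the lake.
rows = [
    ('20181001', 10.0),
    ('20181001', 5.5),
    ('20181002', 3.25),
]

# groupBy("TransactionDate"): collect the profit amounts for each date.
profits_by_date = defaultdict(list)
for transaction_date, profit in rows:
    profits_by_date[transaction_date].append(profit)

# agg(...) then orderBy(...): one (date, sum, avg) row per date, sorted by date.
summary = [
    (d, round(sum(p), 2), round(sum(p) / len(p), 2))
    for d, p in sorted(profits_by_date.items())
]

for row in summary:
    print(row)
```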

In [None]:
display(dfProfitByDate)

In [None]:
dfProfitByDate.show(100)

In [None]:
dfProfitByDate.show(100)

In [None]:
%%sql
--now let's try from SQL
select * from dfSales2018Q4

In [None]:
# uh oh, what happened?
# the SQL cell cannot see a Python DataFrame until it is registered as a temp view

dfSales2018Q4.createOrReplaceTempView("dfSales2018Q4")

# now try the above cell again

In [None]:
%%sql

-- now do the same aggregation in SQL
-- SELECT
--     TransactionDate,
--     round(sum(ProfitAmount), 2) AS SumProfit,
--     round(avg(ProfitAmount), 2) AS AvgProfit
-- FROM dfSales2018Q4
-- GROUP BY TransactionDate
-- ORDER BY TransactionDate


In [None]:
%%sql

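-- now, how would we save this to a "temporary dataframe" so we could use it in Python or in another SQL cell?
CREATE OR REPLACE TEMP VIEW dfProfits AS
SELECT TransactionDate, round(sum(ProfitAmount), 2) AS SumProfit, round(avg(ProfitAmount), 2) AS AvgProfit
FROM dfSales2018Q4
GROUP BY TransactionDate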

In [None]:
dfProfits = spark.sql("SELECT * FROM dfProfits")

In [None]:
display(dfProfits)

In [None]:
#explore the datalake

mssparkutils.fs.ls(lakepath)

In [None]:
filepath

In [None]:
newfilepath = lakepath + 'sale-small/Year=2018/Quarter=Q4'

In [None]:
mssparkutils.fs.ls(newfilepath)

In [None]:
%help fs

In [None]:
%fs ls wasbs://wwi-02@asadatalakedavew891.blob.core.windows.net/sale-small/Year=2018/Quarter=Q4

In [None]:
%lsmagic