## Introduction to Spark Notebooks

Let's look at how to do data discovery/sandboxing with Spark Pools.  

A few pointers to get started:
* run only one cell at a time
* you will need to change the storage connection strings (the `abfss://` paths) to point at your own storage
* `ESC + a` adds a cell `above` the current cell
* `ESC + b` adds a cell `below` the current cell
* `Ctrl+Enter` or `Shift+Enter` executes a cell


Navigate to `sale-small/Year=2018/Quarter=Q1/Month=1/Day=20180101/sale-small-20180101-snappy.parquet`, right-click the file, and choose **New notebook**, then **Load to DataFrame**.

You can copy the code from the cells below or simply use this notebook directly, but either way you will have to replace the storage connection string with the one from your other notebook.
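Since every `spark.read.load` call in this notebook embeds the full storage path, it can help to build that path from its parts so that only one value changes per environment. The sketch below is a hypothetical helper: the names `abfss_uri` and `sale_small_path` and the `<your-storage-account>` placeholder are not part of the original notebook. It assumes the folder layout seen in the paths on this page, including that `Quarter` follows from `Month`.

```python
from datetime import date


def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an ADLS Gen2 URI in the abfss:// form used by spark.read.load."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def sale_small_path(day: date) -> str:
    """Relative path of one daily sale-small file.

    Assumes the layout seen in this notebook:
    Year=YYYY/Quarter=Qn/Month=M/Day=YYYYMMDD, with Month not zero-padded.
    """
    quarter = (day.month - 1) // 3 + 1
    stamp = day.strftime("%Y%m%d")
    return (
        f"sale-small/Year={day.year}/Quarter=Q{quarter}/Month={day.month}"
        f"/Day={stamp}/sale-small-{stamp}-snappy.parquet"
    )


# Placeholder: replace with your own storage account name.
ACCOUNT = "<your-storage-account>"
uri = abfss_uri("wwi-02", ACCOUNT, sale_small_path(date(2019, 12, 1)))
print(uri)
```

In the notebook you would then pass the result to the reader, for example `spark.read.load(uri, format='parquet')`.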


In [1]:
%%pyspark
df = spark.read.load('abfss://wwi-02@asadatalakedavew891.dfs.core.windows.net/sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191201/sale-small-20191201-snappy.parquet', format='parquet')
display(df.limit(10))

StatementMeta(SparkPool01, 3, 1, Finished, Available)

SynapseWidget(Synapse.DataFrame, 81472526-9210-44c8-a7fd-33a185263527)

Notice that the first Spark cell you execute takes a few minutes, because the cluster has to spin up and get ready.

I believe the default is to spin down each cluster after 15 minutes of inactivity.

In [2]:
df.printSchema()

StatementMeta(SparkPool01, 3, 2, Finished, Available)

root
 |-- TransactionId: string (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- ProductId: short (nullable = true)
 |-- Quantity: byte (nullable = true)
 |-- Price: decimal(38,18) (nullable = true)
 |-- TotalAmount: decimal(38,18) (nullable = true)
 |-- TransactionDate: integer (nullable = true)
 |-- ProfitAmount: decimal(38,18) (nullable = true)
 |-- Hour: byte (nullable = true)
 |-- Minute: byte (nullable = true)
 |-- StoreId: short (nullable = true)

This is a Python notebook (`.ipynb`), but we can write SQL too, using a cell `magic`. The DataFrame first has to be registered under a name that SQL can refer to.

In [4]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df.createOrReplaceTempView("df")

StatementMeta(SparkPool01, 3, 4, Finished, Available)



In [8]:
%%sql
select * from df


StatementMeta(SparkPool01, 3, 6, Finished, Available)

<Spark SQL result set with 1000 rows and 11 fields>
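The `%%sql` cell has a plain-Python counterpart, `spark.sql`, which returns a DataFrame you can keep working with. This is a minimal sketch, assuming the `spark` session and the `df` DataFrame that the notebook already provides. The helper name `query_temp_view` and its parameters are hypothetical.

```python
def query_temp_view(spark, df, view_name="df", limit=10):
    """Register df under view_name and query it with SQL from Python.

    spark is the SparkSession the notebook provides and df is a Spark
    DataFrame; both are passed in so the function has no hidden globals.
    """
    # Make the DataFrame visible to SQL under the given name.
    df.createOrReplaceTempView(view_name)
    # spark.sql returns a DataFrame, so the result can be filtered,
    # joined, or displayed like any other.
    return spark.sql(f"SELECT * FROM {view_name} LIMIT {limit}")
```

In a notebook cell this could be used as `display(query_temp_view(spark, df))`.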

If you type the above commands yourself, rather than pasting them, you should see autocompletion.

Note that it does take some time to do even the simplest things in Spark. It has to build the DAG, spawn executors, etc. It's a BIG DATA tool, not a SMALL data tool. Likewise, it is meant for "batch" processing; for "interactive" querying (what I call sandboxing) there may be faster tools.
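Part of that overhead is visible from the API: transformations such as `limit` only extend the plan, and nothing runs until an action asks for rows. The sketch below separates the two steps. It assumes `df` is a Spark DataFrame like the one loaded earlier, and `plan_then_run` is a hypothetical helper name.

```python
def plan_then_run(df, n=10):
    """Show the plan for a small preview, then actually fetch the rows."""
    # Transformation: returns a new DataFrame immediately, no job is started.
    preview = df.limit(n)
    # Prints the physical plan Spark would run, without executing the query.
    preview.explain()
    # Action: this is the point where Spark schedules work on the executors.
    return preview.collect()
```

Calling `plan_then_run(df)` in a cell prints the plan first and then returns a list of `Row` objects; the wait you notice belongs to the `collect()` step.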

In [9]:
%%pyspark
df = spark.read.load('abfss://wwi-02@asadatalakedavew891.dfs.core.windows.net/sale-small/Year=2018/Quarter=Q4/*/*/*', format='parquet')
df.limit(10)

StatementMeta(SparkPool01, 3, 7, Finished, Available)

DataFrame[TransactionId: string, CustomerId: int, ProductId: smallint, Quantity: tinyint, Price: decimal(38,18), TotalAmount: decimal(38,18), TransactionDate: int, ProfitAmount: decimal(38,18), Hour: tinyint, Minute: tinyint, StoreId: smallint]

In [12]:
# evaluating a DataFrame expression on its own, like above, only prints
# its schema, because Spark has not fetched any rows yet.
# the trick in Spark, when that happens, is to just rerun the command
# wrapped in a display.  That will usually fix it.
display(df.limit(10))

StatementMeta(SparkPool01, 3, 10, Finished, Available)

SynapseWidget(Synapse.DataFrame, 166c7091-1ee1-4474-90ae-b2126030532c)

In [13]:
# now let's do some aggregations on our df.
# Let's look at sum/avg profit by TransactionDate
# import the functions module under an alias: a wildcard import would
# shadow Python's built-in round() and sum()
from pyspark.sql import functions as F

profitByDate = (
    df.groupBy("TransactionDate")
    .agg(
        F.round(F.sum("ProfitAmount"), 2).alias("(sum)Profit"),
        F.round(F.avg("ProfitAmount"), 2).alias("(avg)Profit"),
    )
    .orderBy("TransactionDate")
)
profitByDate.show(100)

StatementMeta(SparkPool01, 3, 11, Finished, Available)

+---------------+-----------+-----------+
|TransactionDate|(sum)Profit|(avg)Profit|
+---------------+-----------+-----------+
|       20181001|23415937.44|      20.55|
|       20181002|23323219.84|      20.54|
|       20181003|23398888.63|      20.54|
|       20181004|23463913.21|      20.53|
|       20181005|23335470.35|      20.53|
|       20181006| 5847087.49|      20.55|
|       20181007| 5851524.93|      20.54|
|       20181008|23421213.19|      20.53|
|       20181009|23393917.90|      20.53|
|       20181010|23304688.28|      20.52|
|       20181011|23362432.44|      20.52|
|       20181012|23329466.17|      20.52|
|       20181013| 5831549.92|      20.52|
|       20181014| 5773161.78|      20.54|
|       20181015|23286408.68|      20.52|
|       20181016|23400373.75|      20.54|
|       20181017|23461667.35|      20.53|
|       20181018|23270211.07|      20.51|
|       20181019|23444803.46|      20.54|
|       20181020| 5845799.59|      20.55|
|       20181021| 5820686.06|     

In [15]:
# again, try the display trick, and then note that I can chart it too
# in this case, pass the DataFrame itself and drop the show() call:
# show() prints the rows and returns None, so there is nothing to display
# often Spark isn't intuitive for new users.
display(profitByDate)

StatementMeta(SparkPool01, 3, 13, Finished, Available)

SynapseWidget(Synapse.DataFrame, 6a7e1fa4-8a68-486b-ac0b-438899ef8fbd)
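One detail worth noting when reading the results by date: `TransactionDate` is an `integer` in the schema (for example `20181001`), not a date type. This is a minimal sketch of decoding it with the Python standard library, assuming the values always follow the `YYYYMMDD` pattern seen in the table above. `parse_transaction_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_transaction_date(transaction_date: int) -> date:
    """Decode an integer such as 20181001 into a datetime.date."""
    return datetime.strptime(str(transaction_date), "%Y%m%d").date()


# Print each key next to its calendar date and day of the week.
for value in (20181005, 20181006, 20181007, 20181008):
    d = parse_transaction_date(value)
    print(value, d.isoformat(), d.strftime("%A"))
```

Decoded this way, `20181006` and `20181007` fall on a Saturday and a Sunday, which are the rows in the table above where `(sum)Profit` drops while `(avg)Profit` stays flat. Inside Spark the equivalent conversion would normally be done on the column rather than row by row in Python, for example with `to_date` from `pyspark.sql.functions` applied to the value cast to a string.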