# Basic Spark SQL Usage

### Example of using Spark SQL with Stroom DataFrame

#### Prerequisites
This notebook is designed to work with a Stroom server process running on `localhost`, into which the example data has been loaded (e.g. by running the gradle task `setupSampleData`).

Before starting the Jupyter notebook server process, set the environment variable `STROOM_API_KEY` to the API token of a suitably privileged Stroom user account.
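A missing key otherwise only shows up later as a `KeyError` when the data source is loaded. The following is a minimal sketch of an early check; `require_api_key` is a hypothetical helper name and is not part of Stroom or Spark.

```python
import os


def require_api_key(env=os.environ):
    """Return the Stroom API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; only the variable name
    STROOM_API_KEY comes from this notebook.
    """
    key = env.get("STROOM_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "STROOM_API_KEY is not set; export it before starting "
            "the Jupyter notebook server."
        )
    return key


# Typical use in a notebook cell:
# token = require_api_key()
```

Calling the helper in the first cell makes the failure appear next to its cause rather than inside the data source call.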

#### Setup
Import the standard utility classes and functions, including those for JSON handling.

In [1]:
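import os
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
from pyspark.sql import SparkSession

# Reuse the active Spark session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()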

#### Create a schema using Extraction Pipeline

In [2]:
mySchema = StructType([
    StructField("user", StringType(), True, metadata={"get": "UserId"}),
    StructField("type", StringType(), True, metadata={"get": "Generator"}),
    StructField("streamid", StringType(), False, metadata={"get": "StreamId"}),
    StructField("eventid", StringType(), False, metadata={"get": "EventId"}),
])

In [3]:
stroomDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
    token=os.environ['STROOM_API_KEY'],
    host='localhost',
    protocol='http',
    uri='api/stroom-index/v2',
    index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',
    pipeline='e5ecdf93-d433-45ac-b14a-1f77f16ae4f7',
    schema=mySchema,
).select(['streamid', 'eventid', 'user', 'type'])

In [4]:
display(stroomDf.limit(5).toPandas().head())

|   | streamid | eventid | user | type |
|---|----------|---------|------|------|
| 0 | 742 | 1 | user1 | CSV |
| 1 | 740 | 3 | user3 | CSV |
| 2 | 777 | 1 | user1 |  |
| 3 | 777 | 2 | user2 |  |
| 4 | 777 | 3 | user3 |  |


#### Using Spark SQL

To query the data with SQL, first register the Stroom DataFrame created above as a temporary view.

Each query returns its results as a DataFrame, so further operations can be applied to them.

In [5]:
stroomDf.createOrReplaceTempView("userops")
sqlDf = spark.sql("select * from userops where user='user1' and type='CSV'")

In [6]:
display(sqlDf.limit(5).toPandas().head())

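|   | streamid | eventid | user | type |
|---|----------|---------|------|------|
| 0 | 742 | 1 | user1 | CSV |
| 1 | 1796 | 1 | user1 | CSV |
| 2 | 1794 | 1 | user1 | CSV |
| 3 | 1799 | 1 | user1 | CSV |
| 4 | 1100 | 1 | user1 | CSV |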
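The view-then-query pattern can also be tried without a Stroom server. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, purely for illustration; it is not how Spark executes the query. It loads the five sample rows displayed earlier and treats the blank `type` cells as NULL, which is an assumption. The table name `events` is hypothetical.

```python
import sqlite3

# The five sample rows displayed earlier; blank "type" cells are
# assumed to be NULL.
rows = [
    ("742", "1", "user1", "CSV"),
    ("740", "3", "user3", "CSV"),
    ("777", "1", "user1", None),
    ("777", "2", "user2", None),
    ("777", "3", "user3", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table events (streamid text, eventid text, user text, type text)"
)
conn.executemany("insert into events values (?, ?, ?, ?)", rows)

# A temporary view plays the role of createOrReplaceTempView("userops").
conn.execute("create temp view userops as select * from events")

# The same filter as the Spark SQL query in the notebook.
matches = conn.execute(
    "select * from userops where user='user1' and type='CSV'"
).fetchall()
print(matches)  # [('742', '1', 'user1', 'CSV')]
conn.close()
```

Only the first sample row satisfies both conditions; the `user1` row in stream 777 is excluded because its `type` is not `CSV`.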


In [20]:
sqlDf2 = spark.sql("""
    select user, type, count(streamid, eventid) as events
    from userops
    where user != 'user1'
    group by user, type
    order by events desc
""")
display(sqlDf2.toPandas())

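|   | user | type | events |
|---|------|------|--------|
| 0 | user8 | CSV | 4366 |
| 1 | user7 | CSV | 4313 |
| 2 | user4 | CSV | 4304 |
| 3 | user3 | CSV | 4275 |
| 4 | user6 | CSV | 4268 |
| 5 | user10 | CSV | 4255 |
| 6 | user2 | CSV | 4246 |
| 7 | user9 | CSV | 4223 |
| 8 | user5 | CSV | 4142 |
| 9 | user7 | Apache HTTPD | 150 |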
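To make the aggregation concrete, the following plain-Python sketch mirrors what the query computes: it drops rows for the excluded user, groups by `(user, type)`, counts the rows in which both `streamid` and `eventid` are non-null (which is how Spark SQL's multi-argument `count` behaves), and sorts by that count in descending order. The rows are made up for illustration and are not real Stroom data; `count_events` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical rows (streamid, eventid, user, type); not real Stroom data.
rows = [
    ("742", "1", "user1", "CSV"),
    ("740", "3", "user3", "CSV"),
    ("741", "2", "user2", "CSV"),
    ("743", "3", "user3", "CSV"),
    ("744", None, "user3", "CSV"),  # not counted: eventid is null
    ("777", "2", "user2", "Apache HTTPD"),
]


def count_events(rows, excluded_user):
    """Mimic: select user, type, count(streamid, eventid) ... group by ..."""
    counts = Counter()
    for streamid, eventid, user, event_type in rows:
        # "where user != ..." also drops rows whose user is null.
        if user is None or user == excluded_user:
            continue
        # The group exists even if none of its rows are counted.
        both_present = streamid is not None and eventid is not None
        counts[(user, event_type)] += 1 if both_present else 0
    # "order by events desc"
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


events = count_events(rows, "user1")
for (user, event_type), n in events:
    print(user, event_type, n)
```

In the notebook itself the same work is done by Spark across the whole index, so this sketch is only useful for checking one's reading of the query.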
