# Basic Spark SQL Usage

### Example of using Spark SQL with Stroom DataFrame

#### Prerequisites
This notebook is designed to work with a `stroom-full-test` Stroom stack installed on `localhost`.

Before starting the Jupyter notebook server process, set the environment variable `STROOM_API_KEY` to the API token of a suitably privileged Stroom user account.
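
A missing key otherwise only surfaces later, when the DataFrame is loaded, so it can help to check for the variable at the top of the notebook. The following is a minimal sketch; the helper name `require_env` is hypothetical and is not part of Stroom or PySpark.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before starting the Jupyter notebook server."
        )
    return value


# Typical use at the top of the notebook (raises if the key is missing):
# api_key = require_env("STROOM_API_KEY")
```

Failing early with a named variable in the message is easier to diagnose than a `KeyError` raised from inside a longer expression.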

#### Use Java 8
`pyspark` must be started from a shell that uses Java 8. Failure to do so will result in errors, some of them quite obscure, such as complaints about missing Hive classes.
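
One way to confirm which Java the shell will use is to inspect the banner printed by `java -version`. The sketch below is an illustration only: the function names `parse_java_major` and `current_java_major` are hypothetical, and the parsing assumes the usual banner format, in which Java 8 and earlier report themselves as `1.x`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("Unrecognised `java -version` output")
    major, minor = match.group(1), match.group(2)
    # Releases up to Java 8 report themselves as 1.x (for example "1.8.0_292").
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def current_java_major():
    """Return the major version of the `java` on PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)


# Typical use before starting pyspark:
# if current_java_major() != 8:
#     print("Warning: pyspark should be started from a Java 8 shell")
```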

#### Setup
Import standard utility classes and functions, including those for JSON handling.

In [1]:
import os  # needed to read STROOM_API_KEY from the environment
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
from pyspark.sql import SparkSession

#### Create a schema using XPaths

N.B. The XPath `@*` extracts both the StreamId and the EventId from the Event and places them in a single field.
The values of this field are unique, which is handy when working with SQL.

In [2]:
mySchema = StructType([
    StructField("user", StringType(), True,
                metadata={"get": "EventSource/User/Id"}),
    StructField("operation", StringType(), True,
                metadata={"get": "EventDetail/TypeId"}),
    StructField("eventid", StringType(), False,
                metadata={"get": "@*"}),
])
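
To make the three paths in the schema concrete, the sketch below applies equivalent lookups to a small XML fragment using only the Python standard library. It illustrates the idea and is not how the Stroom data source evaluates XPaths. The sample event is hypothetical and much simpler than a real one, and joining the attribute values with no separator is an assumption, although the `eventid` values shown later in this notebook (for example `17610`) appear consistent with it.

```python
import xml.etree.ElementTree as ET

# Hypothetical event fragment; real Stroom events carry more structure.
SAMPLE_EVENT = """
<Event StreamId="176" EventId="10">
  <EventSource><User><Id>admin</Id></User></EventSource>
  <EventDetail><TypeId>StroomIndexQueryResourceImpl.search</TypeId></EventDetail>
</Event>
"""


def extract_fields(event_xml):
    """Mimic the three schema fields using ElementTree lookups."""
    event = ET.fromstring(event_xml)
    return {
        # Relative paths, matching the "get" metadata in the schema.
        "user": event.findtext("EventSource/User/Id"),
        "operation": event.findtext("EventDetail/TypeId"),
        # Stand-in for XPath @*: every attribute value of Event, joined
        # with no separator (an assumption about how the values combine).
        "eventid": "".join(event.attrib.values()),
    }


fields = extract_fields(SAMPLE_EVENT)
print(fields)
```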

In [3]:
stroomDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token=os.environ['STROOM_API_KEY'], host='localhost', protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',
        pipeline='26ed1000-255e-4182-b69b-00266be891ee',
        schema=mySchema).select('eventid', 'user', 'operation', 'idxUserId')

In [4]:
display(stroomDf.limit(5).toPandas().head())

| | eventid | user | operation | idxUserId |
|---|---|---|---|---|
| 0 | 711 | | GET | |
| 1 | 572 | CN=A Test Client (testuser),O=Test Organizatio... | POST | |
| 2 | 573 | CN=A Test Client (testuser),O=Test Organizatio... | POST | |
| 3 | 574 | CN=A Test Client (testuser),O=Test Organizatio... | POST | |
| 4 | 575 | CN=A Test Client (testuser),O=Test Organizatio... | POST | |


#### Using Spark SQL

To run SQL queries, first create a temporary view over the Stroom DataFrame created above.

Each query result is itself a DataFrame, so further operations can be applied to it.

In [14]:
stroomDf.createOrReplaceTempView("userops")
sqlDf = spark.sql("select * from userops where user = 'admin' and operation='StroomIndexQueryResourceImpl.search'")

In [15]:
display(sqlDf.limit(5).toPandas().head())

| | eventid | user | operation | idxUserId |
|---|---|---|---|---|
| 0 | 1766 | admin | StroomIndexQueryResourceImpl.search | |
| 1 | 1767 | admin | StroomIndexQueryResourceImpl.search | |
| 2 | 1768 | admin | StroomIndexQueryResourceImpl.search | |
| 3 | 1769 | admin | StroomIndexQueryResourceImpl.search | |
| 4 | 17610 | admin | StroomIndexQueryResourceImpl.search | |


In [16]:
sqlDf2 = spark.sql("select user,operation, count (eventid) as events from userops \
                    where idxUserId != 'admin' group by user, operation \
                    order by events desc")
display(sqlDf2.toPandas())

| | user | operation | events |
|---|---|---|---|
| 0 | | POST | 2392 |
| 1 | CN=A Test Client (testuser),O=Test Organizatio... | POST | 1628 |
| 2 | | GET | 723 |
| 3 | | HEAD | 467 |
| 4 | INTERNAL_PROCESSING_USER | getCerts | 3 |
| 5 | INTERNAL_PROCESSING_USER | GlobalConfigResourceImpl.fetchUiConfig | 3 |
| 6 | INTERNAL_PROCESSING_USER | OpenIdResourceImpl.token | 2 |
| 7 | INTERNAL_PROCESSING_USER | AuthenticationResourceImpl.fetchPasswordPolicy | 2 |
