# Basic Spark SQL Usage

### Example of using Spark SQL with Stroom DataFrame

#### Prerequisites
This notebook is designed to work with a `stroom-full-test` Stroom stack installed on `localhost`.

Before starting the Jupyter notebook server process, set the environment variable `STROOM_API_KEY` to the API token of a suitably privileged Stroom user account.
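A quick check at the top of the notebook can catch a missing token before any query is attempted. The following is a minimal sketch; the helper name `require_stroom_api_key` is hypothetical and is not part of Stroom or Spark.

```python
import os


def require_stroom_api_key(environ=os.environ):
    """Return the Stroom API token, or fail early with a clear message.

    `require_stroom_api_key` is a hypothetical helper written for this
    notebook; it only reads the STROOM_API_KEY environment variable.
    """
    token = environ.get("STROOM_API_KEY", "").strip()
    if not token:
        raise RuntimeError(
            "STROOM_API_KEY is not set; export it before starting "
            "the Jupyter notebook server process"
        )
    return token
```

Calling `require_stroom_api_key()` in the first cell fails fast if the variable was not exported in the shell that launched Jupyter.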

#### Java 8
This notebook must be run from `pyspark` in a Java 8 environment. Other versions of Java will cause execution to fail, potentially with `ClassNotFoundError: HiveConf`.
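One way to confirm the Java version before launching `pyspark` is to parse the output of `java -version`. This is a sketch under the assumption that `java` is on the `PATH` and prints a line containing `version "..."`; the function names are hypothetical.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major, minor = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_292).
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def installed_java_major_version():
    """Run `java -version` (assumed to be on the PATH) and parse the result."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr)
```

For example, `installed_java_major_version() == 8` could be asserted in a shell script or an early notebook cell.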

#### Setup
Import standard utility classes and functions, including those for JSON handling.

In [1]:
import os  # needed below to read STROOM_API_KEY

from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
from pyspark.sql import SparkSession

#### Create a schema using Extraction Pipeline

In [2]:
mySchema = StructType([StructField("user", StringType(), True, 
                                   metadata={"get": "UserId"}), 
                       StructField("type", StringType(), True, 
                                   metadata={"get": "Generator"}),
                       StructField("streamid", StringType(), False,
                                metadata={"get": "StreamId"}),
                       StructField("eventid", StringType(), False,
                                metadata={"get": "EventId"}),])

In [3]:
stroomDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token=os.environ['STROOM_API_KEY'],  # API token read from the environment
        host='localhost', protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',
        pipeline='e5ecdf93-d433-45ac-b14a-1f77f16ae4f7',
        schema=mySchema).select(['streamid', 'eventid', 'user', 'type'])

In [4]:
display(stroomDf.limit(5).toPandas().head())

| | streamid | eventid | user | type |
|---|---|---|---|---|
| 0 | 71 | 1 | | Stroom NGINX |
| 1 | 57 | 3 | CN=A Test Client (testuser),O=Test Organizatio... | Stroom NGINX |
| 2 | 57 | 4 | CN=A Test Client (testuser),O=Test Organizatio... | Stroom NGINX |
| 3 | 57 | 5 | CN=A Test Client (testuser),O=Test Organizatio... | Stroom NGINX |
| 4 | 57 | 6 | CN=A Test Client (testuser),O=Test Organizatio... | Stroom NGINX |


#### Using Spark SQL

To run SQL queries, first create a temporary view over the Stroom DataFrame created above.

Query results are themselves DataFrames, so further operations can be applied to them.

In [12]:
stroomDf.createOrReplaceTempView("userops")
sqlDf = spark.sql("select * from userops where user='admin' and type='StroomEventLoggingService'")

In [13]:
display(sqlDf.limit(5).toPandas().head())

| | streamid | eventid | user | type |
|---|---|---|---|---|
| 0 | 85 | 3 | admin | StroomEventLoggingService |
| 1 | 86 | 1 | admin | StroomEventLoggingService |
| 2 | 86 | 2 | admin | StroomEventLoggingService |
| 3 | 86 | 3 | admin | StroomEventLoggingService |
| 4 | 86 | 4 | admin | StroomEventLoggingService |
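Because `spark.sql` returns a DataFrame, a SQL query can be followed by ordinary DataFrame methods. The sketch below assumes the `userops` view already exists and that `spark` is the session provided by `pyspark`; the function name `admin_events_per_stream` is hypothetical.

```python
def admin_events_per_stream(spark, view_name="userops"):
    """Run a SQL query, then refine the result with DataFrame methods.

    `spark` is assumed to be an existing SparkSession and `view_name`
    a temporary view that has already been registered.
    """
    # spark.sql returns a DataFrame, so DataFrame methods chain onto it.
    sql_df = spark.sql(
        f"select streamid, eventid, user, type from {view_name} "
        "where user = 'admin'"
    )
    return (
        sql_df
        .groupBy("streamid")                # DataFrame API rather than SQL
        .count()                            # adds a column named `count`
        .orderBy("count", ascending=False)  # streams with most events first
    )
```

In the notebook this could be displayed with `display(admin_events_per_stream(spark).toPandas())`.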


In [14]:
sqlDf2 = spark.sql("select user,type, count (streamid, eventid) as events from userops \
                    where user != 'user1' group by user, type \
                    order by events desc")
display(sqlDf2.toPandas())

| | user | type | events |
|---|---|---|---|
| 0 | | Apache HTTPD | 2982 |
| 1 | CN=A Test Client (testuser),O=Test Organizatio... | Stroom NGINX | 1660 |
| 2 | admin | StroomEventLoggingService | 832 |
| 3 | | Stroom NGINX | 799 |
| 4 | INTERNAL_PROCESSING_USER | StroomEventLoggingService | 10 |
