# Basic Spark SQL Usage

### Example of using Spark SQL with Stroom DataFrame

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
from pyspark.sql import SparkSession

#### Create a schema using XPaths

N.B. The XPath @* selects every attribute of the Event, so both StreamId and EventId are extracted and placed into a single field.
The values in this field are unique per event, which is handy when working with SQL.

In [2]:
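# Each field's "get" metadata entry holds the XPath used to populate it.
mySchema = StructType([
    StructField("user", StringType(), True,
                metadata={"get": "EventSource/User/Id"}),
    StructField("operation", StringType(), True,
                metadata={"get": "EventDetail/TypeId"}),
    # @* gathers StreamId and EventId into one non-nullable field.
    StructField("eventid", StringType(), False,
                metadata={"get": "@*"}),
])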
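
To illustrate what the three paths in the schema select, here is a minimal sketch that uses only the Python standard library. It is not how the Stroom data source is implemented: the sample XML, the helper name `event_key`, and the `|` separator (inferred from the `eventid` values in the notebook output) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical event fragment. The attribute names follow the StreamId/EventId
# pair mentioned above; the element content is made up for illustration.
SAMPLE = """
<Events>
  <Event StreamId="5298" EventId="1">
    <EventSource><User><Id>user1</Id></User></EventSource>
    <EventDetail><TypeId>0001</TypeId></EventDetail>
  </Event>
  <Event StreamId="5377" EventId="2">
    <EventSource><User><Id>user2</Id></User></EventSource>
    <EventDetail><TypeId>0001</TypeId></EventDetail>
  </Event>
</Events>
"""


def event_key(event):
    """Join all attribute values of an Event, imitating XPath @* (assumed '|' separator)."""
    return "|".join(event.attrib.values())


rows = [
    {
        "eventid": event_key(ev),
        "user": ev.findtext("EventSource/User/Id"),
        "operation": ev.findtext("EventDetail/TypeId"),
    }
    for ev in ET.fromstring(SAMPLE.strip()).iter("Event")
]

for row in rows:
    print(row)
```

Because the combined key is built from both attributes, two events from the same stream still get distinct `eventid` values.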

In [3]:
# Connection details and the schema are passed to the data source as options.
stroomDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required', host='localhost:8080', protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',
        pipeline='bb25824e-6369-464a-81e1-876ffe3b95a0',
        schema=mySchema).select('eventid', 'user', 'operation', 'idxUserId')

In [4]:
display(stroomDf.limit(5).toPandas().head())

Unnamed: 0,eventid,user,operation,idxUserId
0,5298|1,user1,1,
1,5377|2,user2,1,
2,5377|5,user5,1,
3,5377|6,user6,1,
4,5377|7,user7,1,


#### Using Spark SQL

To run SQL queries against the Stroom DataFrame created above, first register it as a temporary view.

Query results are themselves DataFrames, so further operations can be applied to them.

In [5]:
stroomDf.createOrReplaceTempView("userops")
sqlDf = spark.sql("select * from userops where user='user1' and operation='0001'")

In [6]:
display(sqlDf.limit(5).toPandas().head())

Unnamed: 0,eventid,user,operation,idxUserId
0,5298|1,user1,1,
1,5636|1,user1,1,
2,5380|1,user1,1,
3,5632|47,user1,1,
4,5632|93,user1,1,


In [7]:
sqlDf2 = spark.sql("""
    select user, operation, count(eventid) as events
    from userops where idxUserId != 'User1'
    group by user, operation
    order by events desc
""")
display(sqlDf2.toPandas())

Unnamed: 0,user,operation,events
0,user8,0001,1446
1,user3,0001,1380
2,user10,0001,1378
3,user2,0001,1358
4,user7,0001,1355
5,user4,0001,1344
6,user9,0001,1337
7,user5,0001,1323
8,user6,0001,1310
9,user10,Logon,10
