# Stroom Spark DataSource - Basic

### How to obtain a DataFrame for further processing or analysis within Spark.

#### Setup
Import the standard utility classes and functions.

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
import time

#### Basic Usage
Create the most basic kind of DataFrame to pull data from Stroom, and view the first records.

Every field in the index can be searched using a column named **idxFieldName**, where FieldName is the name of the field within the Stroom index.

The data itself is returned as a single JSON structure within the field **json**.
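
The naming convention above can be captured in a small helper. This is only a sketch: `idx_column` is a hypothetical name introduced here for illustration and is not part of the Stroom data source.

```python
def idx_column(field_name: str) -> str:
    """Return the DataFrame column name for a Stroom index field.

    Follows the idxFieldName convention described above.
    """
    return "idx" + field_name


# Field names taken from the index used in this notebook.
index_fields = ["UserId", "EventTime", "Feed (Keyword)"]
columns = [idx_column(name) for name in index_fields]
print(columns)
```

The resulting name can be used wherever a column name is expected, for example `basicDf[idx_column('UserId')]` once the DataFrame exists.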

In [2]:
basicDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required',host='localhost:8080',protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',pipeline='bb25824e-6369-464a-81e1-876ffe3b95a0')

In [3]:
display(basicDf.limit(5).toPandas().head()) 

Unnamed: 0,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,idxIPAddress,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive)
0,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
2,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
3,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
4,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,


#### Working with JSON

The JSON is much easier to work with when parsed into a structure by Spark.

First, the schema must be discovered. Here Spark infers it from the JSON strings and makes it available as a variable called json_schema, which can be inspected in the usual way.

In [4]:
json_schema = spark.read.json(basicDf.rdd.map(lambda row: row.json)).schema

json_schema

StructType(List(StructField(Event,StructType(List(StructField(EventDetail,StructType(List(StructField(Authenticate,StructType(List(StructField(Action,StringType,true),StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true),StructField(Value,LongType,true))),true),true),StructField(LogonType,StringType,true),StructField(User,StructType(List(StructField(Id,StringType,true))),true))),true),StructField(Description,StringType,true),StructField(TypeId,StringType,true),StructField(View,StructType(List(StructField(Resource,StructType(List(StructField(ResponseCode,LongType,true),StructField(Type,StringType,true),StructField(URL,StringType,true))),true))),true))),true),StructField(EventId,LongType,true),StructField(EventSource,StructType(List(StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true),StructField(Value,StringType,true))),true),true),StructField(Device,StructType(List(StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true
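
To get a feel for what the inferred schema describes, it can help to list the leaf paths of a single record in plain Python. This is an illustration only and is not how Spark infers a schema. The sample record is hypothetical and heavily trimmed, and `leaf_paths` is a helper name introduced here.

```python
import json

# Hypothetical, heavily trimmed record; real Stroom events carry many more fields.
sample = (
    '{"Event": {"EventId": 1,'
    ' "EventSource": {"User": {"Id": "user1"}},'
    ' "EventDetail": {"TypeId": "1"}}}'
)


def leaf_paths(node, prefix=""):
    """Yield (dotted_path, python_type_name) for every leaf value below node."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Array elements share the path of the array itself.
        for item in node:
            yield from leaf_paths(item, prefix)
    else:
        yield prefix, type(node).__name__


paths = dict(leaf_paths(json.loads(sample)))
for path, type_name in paths.items():
    print(path, type_name)
```

Dotted paths of this form, such as `Event.EventSource.User.Id`, correspond to the nested field names in json_schema.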

#### Working with JSON (cont)

Now a new DataFrame can be constructed that uses this schema to parse the JSON into an additional (complex) column called evt.

For further ease of use, alias columns can be created from the nested fields of evt. Scroll right in the output to see these columns.

In [5]:
# Parse the JSON string into a struct column, then alias selected nested fields.
wideDf = (basicDf
    .withColumn('evt', from_json(col('json'), json_schema))
    .withColumn('timestamp', col('evt.Event.EventTime.TimeCreated'))
    .withColumn('user', col('evt.Event.EventSource.User.Id'))
    .withColumn('operation', col('evt.Event.EventDetail.TypeId')))

In [6]:
display(wideDf.limit(5).toPandas().head())

Unnamed: 0,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,...,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive),evt,timestamp,user,operation
0,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,...,,,,,,,"(((Row(Action='Logon', Data=[Row(Name='FileNo'...",2010-01-01T00:03:00.000Z,user4,1
1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,...,,,,,,,"(((Row(Action='Logon', Data=[Row(Name='FileNo'...",2010-01-01T00:00:00.000Z,user1,1
2,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,...,,,,,,,"(((Row(Action='Logon', Data=[Row(Name='FileNo'...",2010-01-01T00:01:00.000Z,user2,1
3,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,...,,,,,,,"(((Row(Action='Logon', Data=[Row(Name='FileNo'...",2010-01-01T00:07:00.000Z,user8,1
4,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,...,,,,,,,"(((Row(Action='Logon', Data=[Row(Name='FileNo'...",2010-01-01T00:02:00.000Z,user3,1


#### Using XPath Expression fields
Alternatively, Stroom can extract field values itself. Supply a schema in which each field carries an "xpath" entry in its metadata, and Stroom evaluates that XPath to populate the field.

In [7]:
mySchema = StructType([StructField("myUser", StringType(), True, 
                                   metadata={"xpath": "EventSource/User/Id"}), 
                       StructField("myOperation", StringType(), True, 
                                   metadata={"xpath": "EventDetail/TypeId"})])
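
To illustrate what these two paths select, the sketch below evaluates them against a single event using Python's standard library. It assumes each path is evaluated relative to an individual Event element, which the paths in mySchema suggest but which is not confirmed here. The XML is a hypothetical event with the namespace omitted, and Stroom's own XPath evaluation may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical event with no XML namespace, for illustration only.
event_xml = """
<Event>
  <EventSource><User><Id>user1</Id></User></EventSource>
  <EventDetail><TypeId>1</TypeId></EventDetail>
</Event>
""".strip()

event = ET.fromstring(event_xml)

# The same field names and paths as in mySchema.
xpaths = {"myUser": "EventSource/User/Id", "myOperation": "EventDetail/TypeId"}

# findtext returns the text of the first element matching the path.
extracted = {name: event.findtext(path) for name, path in xpaths.items()}
print(extracted)
```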

In [8]:
dfWithXPaths = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required',host='localhost:8080',protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',pipeline='bb25824e-6369-464a-81e1-876ffe3b95a0',
        schema=mySchema)

In [9]:
display(dfWithXPaths.limit(5).toPandas().head()) 

Unnamed: 0,myUser,myOperation,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,idxIPAddress,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive)
0,user1,1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
1,user4,1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
2,user2,1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
3,user3,1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,
4,user2,1,"{""Event"":{""xmlns"":""event-logging:3"",""StreamId""...",,,,,,,,,,,,,,,,


#### Combining XPath (Stroom) and JSON (Spark) fields
It is possible to add JSON fields to a DataFrame created with a schema that defines XPaths.

In [10]:
dfWithXPaths

DataFrame[myUser: string, myOperation: string, json: string, idxStreamId: string, idxEventId: string, idxFeed: string, idxFeed (Keyword): string, idxAction: string, idxEventTime: string, idxUserId: string, idxSystem: string, idxEnvironment: string, idxIPAddress: string, idxHostName: string, idxGenerator: string, idxCommand: string, idxCommand (Keyword): string, idxDescription: string, idxDescription (Case Sensitive): string]

In [11]:
# Add the Spark-parsed JSON fields alongside the XPath fields extracted by Stroom.
extraWideDf = (dfWithXPaths
    .withColumn('evt', from_json(col('json'), json_schema))
    .withColumn('user', col('evt.Event.EventSource.User.Id'))
    .withColumn('operation', col('evt.Event.EventDetail.TypeId')))

In [12]:
extraWideDf

DataFrame[myUser: string, myOperation: string, json: string, idxStreamId: string, idxEventId: string, idxFeed: string, idxFeed (Keyword): string, idxAction: string, idxEventTime: string, idxUserId: string, idxSystem: string, idxEnvironment: string, idxIPAddress: string, idxHostName: string, idxGenerator: string, idxCommand: string, idxCommand (Keyword): string, idxDescription: string, idxDescription (Case Sensitive): string, evt: struct<Event:struct<EventDetail:struct<Authenticate:struct<Action:string,Data:array<struct<Name:string,Value:bigint>>,LogonType:string,User:struct<Id:string>>,Description:string,TypeId:string,View:struct<Resource:struct<ResponseCode:bigint,Type:string,URL:string>>>,EventId:bigint,EventSource:struct<Data:array<struct<Name:string,Value:string>>,Device:struct<Data:array<struct<Name:string,Value:string>>,IPAddress:string,MACAddress:string>,Generator:string,System:struct<Environment:string,Name:string>,User:struct<Data:array<struct<Name:string,Value:string>>,Id:str

In [13]:
display(extraWideDf.select('myUser','user','myOperation','operation').limit(5).toPandas().head())

Unnamed: 0,myUser,user,myOperation,operation
0,user4,user4,1,1
1,user2,user2,1,1
2,user1,user1,1,1
3,user3,user3,1,1
4,user4,user4,1,1


## Conclusion
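These are three different ways to access the same data: searching on Stroom index fields, parsing the JSON in Spark, and having Stroom extract fields by XPath. All of them can reach the same result eventually, and the best approach will vary depending on the situation. The cells below count the events for user1 using each approach and time each count.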


In [18]:
start = time.time()
indexCount = basicDf.filter(basicDf['idxUserId'] == 'user1').count()
end = time.time()
indexTime = end - start

In [19]:
start = time.time()
sparkCount = wideDf.filter(wideDf['user'] == 'user1').count()
end = time.time()
sparkTime = end - start

In [20]:
start = time.time()
stroomCount = dfWithXPaths.filter(dfWithXPaths['myUser'] == 'user1').count()
end = time.time()
stroomTime = end - start

In [21]:
if indexCount == sparkCount == stroomCount:
    print("All counts are the same (as expected):", sparkCount)
else:
    print("Counts Differ!")
    print("Stroom using indexes: ", indexCount)
    print("Stroom using xpath: ", stroomCount)
    print("Spark: ", sparkCount)

print()
print("Times as follows")
print("Stroom using indexes: ", indexTime)
print("Stroom using xpath: ", stroomTime)
print("Spark: ", sparkTime)

All counts are the same (as expected): 1398

Times as follows
Stroom using indexes:  8.50417423248291
Stroom using xpath:  8.989763259887695
Spark:  9.178659200668335
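
The start/end pattern repeated in the timing cells above could be wrapped in a small helper. This is a minimal sketch: `timed` is a name introduced here, and the workload is a stand-in so that the example runs without a Spark session.

```python
import time


def timed(action):
    """Run action() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this would be one of the count() calls.
total, elapsed = timed(lambda: sum(range(1000)))
print(total, elapsed)
```

In the notebook it might be used as `indexCount, indexTime = timed(lambda: basicDf.filter(basicDf['idxUserId'] == 'user1').count())`. A single run of each count gives a rough indication only, and the sub-second differences above may be within run-to-run variation.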
