# Stroom Spark DataSource - Basic

### How to obtain a DataFrame for further processing or analysis within Spark.

#### Setup
Import standard utility classes and functions.

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
import time

#### Basic Usage
Create the most basic kind of DataFrame to pull data from Stroom, and view the first records.

All the fields in the index can be searched using the field name **idxFieldName**, where FieldName is the name of the field within the Stroom index.

The specified pipeline is a Stroom Search Extraction Pipeline that uses the stroom:json XSLT function to create a JSON representation of the entire event. The resulting field is called "Json" by default; its name can optionally be changed with the parameter jsonField.

In this manner, all data is returned as a single JSON structure within the field **json**.

In [2]:
basicDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required',host='localhost:8080',protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',pipeline='13143179-b494-4146-ac4b-9a6010cada89')
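
The connection settings in the call above recur in every later `load` call in this notebook. A minimal sketch of collecting them in one place follows, assuming a hypothetical helper named `stroom_options` (it is not part of the data source). The values are the ones used above, and the `load` call is shown only as a comment because it needs a live Spark session and a Stroom instance.

```python
def stroom_options(pipeline, **overrides):
    """Build keyword arguments for DataFrameReader.load().

    Hypothetical helper: the option names and values mirror the call above.
    """
    options = {
        "token": "not required",
        "host": "localhost:8080",
        "protocol": "http",
        "uri": "api/stroom-index/v2",
        "index": "57a35b9a-083c-4a93-a813-fc3ddfe1ff44",
        "pipeline": pipeline,
    }
    options.update(overrides)  # e.g. a different host or index
    return options


basic_options = stroom_options("13143179-b494-4146-ac4b-9a6010cada89")

# With a Spark session available, the options would be passed as:
# basicDf = spark.read.format("stroom.spark.datasource.StroomDataSource").load(**basic_options)
```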

In [3]:
display(basicDf.limit(5).toPandas().head()) 

Unnamed: 0,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,idxIPAddress,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive)
0,"[{""StreamId"":""1400"",""EventId"":""1"",""EventTime"":...",,,,,,,,,,,,,,,,
1,"[{""StreamId"":""1400"",""EventId"":""2"",""EventTime"":...",,,,,,,,,,,,,,,,
2,"[{""StreamId"":""1398"",""EventId"":""2"",""EventTime"":...",,,,,,,,,,,,,,,,
3,"[{""StreamId"":""1398"",""EventId"":""3"",""EventTime"":...",,,,,,,,,,,,,,,,
4,"[{""StreamId"":""1398"",""EventId"":""5"",""EventTime"":...",,,,,,,,,,,,,,,,


#### Working with JSON

The JSON is much easier to work with when parsed into a structure by Spark.

First it is necessary to perform some schema discovery.  Here the schema is made available as a variable called json_schema which can be inspected in the normal way.

In [4]:
json_schema = spark.read.json(basicDf.rdd.map(lambda row: row.json)).schema

json_schema

StructType(List(StructField(EventDetail,StructType(List(StructField(Authenticate,StructType(List(StructField(Action,StringType,true),StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true),StructField(Value,StringType,true))),true),true),StructField(LogonType,StringType,true),StructField(User,StructType(List(StructField(Id,StringType,true))),true))),true),StructField(Description,StringType,true),StructField(TypeId,StringType,true))),true),StructField(EventId,StringType,true),StructField(EventSource,StructType(List(StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true),StructField(Value,StringType,true))),true),true),StructField(Device,StructType(List(StructField(Data,ArrayType(StructType(List(StructField(Name,StringType,true),StructField(Value,StringType,true))),true),true),StructField(IPAddress,StringType,true),StructField(MACAddress,StringType,true))),true),StructField(Generator,StringType,true),StructField(System,StructType(List(StructField
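
Schema discovery amounts to reading the JSON records and merging the field structure found in each. The following is a deliberately simplified, Spark-free sketch of that idea. It is not Spark's actual inference algorithm, and the two sample records are hypothetical, loosely modelled on field names that appear in the schema above.

```python
import json
from functools import reduce


def describe(value):
    """Describe the structure of one parsed JSON value."""
    if isinstance(value, dict):
        return {key: describe(child) for key, child in value.items()}
    if isinstance(value, list):
        # An array is described by the merged structure of its elements.
        return [reduce(merge, (describe(item) for item in value), None)]
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return None
    return "string"


def merge(left, right):
    """Combine two descriptions, keeping every field seen in either."""
    if left is None:
        return right
    if right is None:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        return {key: merge(left.get(key), right.get(key)) for key in {**left, **right}}
    if isinstance(left, list) and isinstance(right, list):
        return [merge(left[0], right[0])]
    return left  # simplification: on a type conflict, keep the first type seen


# Hypothetical sample records, not taken from Stroom.
records = [
    '{"EventId": "1", "EventDetail": {"TypeId": "1"}}',
    '{"EventId": "2", "EventSource": {"Generator": "example"}}',
]
shape = reduce(merge, (describe(json.loads(record)) for record in records), None)
print(json.dumps(shape, indent=2))
```

Because a field missing from one record may appear in another, every record contributes to the result, which is why discovery needs to read the data before the JSON can be parsed into columns.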

#### Working with JSON (cont)

Now a new DataFrame can be constructed by parsing the JSON with that schema into an additional (complex) column called evt.

For further ease of use, alias columns can be created from fields within evt. These appear as the rightmost columns of the output.

In [5]:
wideDf = (basicDf
    .withColumn('evt', from_json(col('json'), json_schema))
    .withColumn('timestamp', col('evt.EventTime.TimeCreated'))
    .withColumn('user', col('evt.EventSource.User.Id'))
    .withColumn('operation', col('evt.EventDetail.TypeId')))

In [6]:
display(wideDf.limit(5).toPandas().head())

Unnamed: 0,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,...,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive),evt,timestamp,user,operation
0,"[{""StreamId"":""1398"",""EventId"":""2"",""EventTime"":...",,,,,,,,,,...,,,,,,,"(((Logon, [Row(Name='FileNo', Value='2'), Row(...",2010-01-01T00:01:00.000Z,user2,1
1,"[{""StreamId"":""1398"",""EventId"":""3"",""EventTime"":...",,,,,,,,,,...,,,,,,,"(((Logon, [Row(Name='FileNo', Value='2'), Row(...",2010-01-01T00:02:00.000Z,user3,1
2,"[{""StreamId"":""1400"",""EventId"":""1"",""EventTime"":...",,,,,,,,,,...,,,,,,,"(((Logon, [Row(Name='FileNo', Value='2'), Row(...",2010-01-01T00:00:00.000Z,user1,1
3,"[{""StreamId"":""1400"",""EventId"":""2"",""EventTime"":...",,,,,,,,,,...,,,,,,,"(((Logon, [Row(Name='FileNo', Value='2'), Row(...",2010-01-01T00:01:00.000Z,user2,1
4,"[{""StreamId"":""2176"",""EventId"":""1"",""EventTime"":...",,,,,,,,,,...,,,,,,,"(((Logon, [Row(Name='FileNo', Value='2'), Row(...",2010-01-01T00:00:00.000Z,user1,1


#### Using Search Extraction Pipeline Created Fields
Any field created by the Stroom Search Extraction Pipeline can be accessed by supplying a schema whose fields carry metadata with a "get" key naming the corresponding field in Stroom.

In [7]:
mySchema = StructType([StructField("user", StringType(), True, 
                                   metadata={"get": "UserId"})])
dfWithExtractedFields = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required',host='localhost:8080',protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',pipeline='e5ecdf93-d433-45ac-b14a-1f77f16ae4f7',
        schema=mySchema)
display(dfWithExtractedFields.limit(5).toPandas().head())

Unnamed: 0,user,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,idxIPAddress,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive)
0,user2,,,,,,,,,,,,,,,,,
1,user3,,,,,,,,,,,,,,,,,
2,user5,,,,,,,,,,,,,,,,,
3,user7,,,,,,,,,,,,,,,,,
4,user1,,,,,,,,,,,,,,,,,


#### Using XPath Expression fields
If the Extraction Pipeline uses an XPathExtractionOutputFilter, Stroom will evaluate an XPath for each field. The XPath is supplied as the value of the "get" metadata key.

In [8]:
xpathSchema = StructType([
    StructField("myUser", StringType(), True,
                metadata={"get": "EventSource/User/Id"}),
    StructField("myOperation", StringType(), True,
                metadata={"get": "EventDetail/TypeId"})])

In [9]:
dfWithXPaths = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token='not required',host='localhost:8080',protocol='http',
        uri='api/stroom-index/v2',
        index='57a35b9a-083c-4a93-a813-fc3ddfe1ff44',pipeline='bb25824e-6369-464a-81e1-876ffe3b95a0',
        schema=xpathSchema)

In [10]:
display(dfWithXPaths.limit(5).toPandas().head()) 

Unnamed: 0,myUser,myOperation,json,idxStreamId,idxEventId,idxFeed,idxFeed (Keyword),idxAction,idxEventTime,idxUserId,idxSystem,idxEnvironment,idxIPAddress,idxHostName,idxGenerator,idxCommand,idxCommand (Keyword),idxDescription,idxDescription (Case Sensitive)
0,user1,1,,,,,,,,,,,,,,,,,
1,user2,1,,,,,,,,,,,,,,,,,
2,user3,1,,,,,,,,,,,,,,,,,
3,user2,1,,,,,,,,,,,,,,,,,
4,user5,1,,,,,,,,,,,,,,,,,


## Comparison
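There are four different ways to access the data, and all of them can eventually reach the same result. The best approach will vary depending on the situation. The cells below time the same filter-and-count query using index fields, JSON parsed by Spark, pipeline-extracted fields, and XPath fields.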


In [11]:
start = time.time()
indexCount = basicDf.filter(basicDf['idxUserId'] == 'user1').count()
end = time.time()
indexTime = end - start

In [12]:
start = time.time()
sparkCount = wideDf.filter(wideDf['user'] == 'user1').count()
end = time.time()
sparkTime = end - start

In [13]:
start = time.time()
stroomExtractionCount = dfWithExtractedFields.filter(dfWithExtractedFields['user'] == 'user1').count()
end = time.time()
stroomExtractionTime = end - start

In [14]:
start = time.time()
stroomXPathCount = dfWithXPaths.filter(dfWithXPaths['myUser'] == 'user1').count()
end = time.time()
stroomXPathTime = end - start

In [15]:
if indexCount == sparkCount == stroomExtractionCount == stroomXPathCount:
    print("All counts are the same (as expected):", sparkCount)
else:
    print("Counts Differ!")
    print("Stroom using indexes: ", indexCount)
    print("Stroom using xpath: ", stroomXPathCount)
    print("Stroom using extracted field: ", stroomExtractionCount)
    print("Spark: ", sparkCount)

print()
print("Times as follows")
print("Stroom using indexes: ", indexTime)
print("Stroom using xpath: ", stroomXPathTime)
print("Stroom using extracted field: ", stroomExtractionTime)
print("Spark: ", sparkTime)

All counts are the same (as expected): 30

Times as follows
Stroom using indexes:  1.7833888530731201
Stroom using xpath:  2.052158832550049
Stroom using extracted field:  2.1541764736175537
Spark:  2.4311017990112305
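
The four timing cells above repeat one pattern. A minimal sketch of a reusable helper follows, assuming a hypothetical name `timed`. It uses `time.perf_counter`, which is intended for measuring short intervals, in place of `time.time`. The stand-in workload only keeps the example runnable without Spark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without a Spark session.
total, seconds = timed(lambda: sum(range(1000)))
print(total, seconds)

# In this notebook the same helper could wrap each query, for example:
# indexCount, indexTime = timed(
#     lambda: basicDf.filter(basicDf['idxUserId'] == 'user1').count())
```

Each figure above comes from a single run, so small differences between the approaches should be read with caution.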
