# Stroom Spark DataSource - Use for Analytic Creation

### How to obtain a DataFrame for further processing or analysis within Spark.

#### Prerequisites
This notebook is designed to work with a Stroom server process running on `localhost`, into which data from the `EventGen` application has been ingested and indexed in the manner described in `stroom-analytic-demo`.

You must set the environment variable `STROOM_API_KEY` to the API token associated with a suitably privileged Stroom user account before starting the Jupyter notebook server process.
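Because the token is read from the environment when the DataFrame is created, a missing variable otherwise only surfaces as a bare `KeyError`. The following is a minimal sketch of an early check; the helper name `require_api_key` is hypothetical and not part of Stroom or the DataSource.

```python
import os


def require_api_key(environ=os.environ, name="STROOM_API_KEY"):
    """Return the API token from the environment, or fail with a clear message.

    `require_api_key` is a hypothetical helper written for this notebook.
    """
    token = environ.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; export it before starting the Jupyter server."
        )
    return token


# Demonstration with an explicit mapping (the token value is a placeholder).
print(require_api_key({"STROOM_API_KEY": " example-token "}))
```

In the notebook itself the helper would be called with no arguments, so that it reads the real environment, and its result passed as the `token` option.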

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col
from IPython.display import display
import os
import time

#### Basic Usage
Create the most basic kind of DataFrame to pull data from the Stroom index "Sample Index", and view the first few records.

All the fields in the index can be searched using the field name **idxFieldName**, where `FieldName` is the name of the field within the Stroom index.

The specified pipeline is a Stroom Search Extraction Pipeline that uses the `stroom:json` XSLT function to create a JSON representation of the entire event. The field that holds this representation is called `json` by default, but its name can optionally be changed with the parameter `jsonField`.

In this manner, all data is returned as a single JSON structure within the field **json**.

In [2]:
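# Read from the Stroom index "Sample Index" through the Stroom Spark DataSource.
basicDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token=os.environ['STROOM_API_KEY'],  # API token taken from the environment
        host='localhost',
        protocol='http',
        uri='api/stroom-index/v2',
        index='32dfd401-ee11-49b9-84c9-88c3d3f68dc2',
        pipeline='13143179-b494-4146-ac4b-9a6010cada89')  # search extraction pipeline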

In [3]:
display(basicDf.limit(5).toPandas())

| | json | idxStreamId | idxEventId | idxFeed | idxFeed (Keyword) | idxAction | idxEventTime | idxUserId | idxSystem | idxEnvironment | idxIPAddress | idxHostName | idxGenerator | idxCommand | idxCommand (Keyword) | idxDescription | idxDescription (Case Sensitive) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `[{"StreamId":"4830","EventId":"1","EventTime":...` | | | | | | | | | | | | | | | | |
| 1 | `[{"StreamId":"4830","EventId":"2","EventTime":...` | | | | | | | | | | | | | | | | |
| 2 | `[{"StreamId":"4830","EventId":"3","EventTime":...` | | | | | | | | | | | | | | | | |
| 3 | `[{"StreamId":"4830","EventId":"4","EventTime":...` | | | | | | | | | | | | | | | | |
| 4 | `[{"StreamId":"4830","EventId":"5","EventTime":...` | | | | | | | | | | | | | | | | |
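The `idx` naming convention is easy to get wrong by hand, particularly for columns such as `idxFeed (Keyword)` whose names contain spaces and parentheses. The sketch below shows one way to derive the column name and to backtick-quote it for use inside a Spark SQL expression string. Both helper names are hypothetical and are not provided by the DataSource.

```python
def idx_column(field_name):
    """Map a Stroom index field name to the DataFrame column that exposes it."""
    return "idx" + field_name


def quote_identifier(column_name):
    """Backtick-quote a column name for use in a Spark SQL expression string.

    A literal backtick inside the name is escaped by doubling it.
    """
    return "`" + column_name.replace("`", "``") + "`"


print(idx_column("UserId"))
print(quote_identifier(idx_column("Feed (Keyword)")))
```

With a live session, the quoted name could then be used in an expression string passed to `basicDf.filter(...)` or `basicDf.selectExpr(...)`. How such a condition is evaluated against the Stroom index is not shown in this notebook.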


#### Working with JSON

The JSON is much easier to work with when Spark has parsed it into a structure.

First, the schema must be discovered. The cell below infers it from the JSON strings and stores it in the variable `json_schema`, which can be inspected in the usual way.

In [4]:
json_schema = spark.read.json(basicDf.rdd.map(lambda row: row.json)).schema

json_schema

StructType(List(StructField(EventDetail,StructType(List(StructField(Authenticate,StructType(List(StructField(Action,StringType,true),StructField(Outcome,StructType(List(StructField(Permitted,StringType,true),StructField(Reason,StringType,true),StructField(Success,StringType,true))),true),StructField(User,StructType(List(StructField(Id,StringType,true))),true))),true),StructField(Process,StructType(List(StructField(Action,StringType,true),StructField(Command,StringType,true),StructField(Type,StringType,true))),true),StructField(TypeId,StringType,true))),true),StructField(EventId,StringType,true),StructField(EventSource,StructType(List(StructField(Client,StructType(List(StructField(HostName,StringType,true))),true),StructField(Device,StructType(List(StructField(HostName,StringType,true))),true),StructField(Generator,StringType,true),StructField(System,StructType(List(StructField(Environment,StringType,true),StructField(Name,StringType,true))),true),StructField(User,StructType(List(Struct
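To make the idea of schema discovery concrete, here is a deliberately simplified plain-Python sketch that merges the fields seen across several JSON strings into one nested description. It is an illustration only and not Spark's actual inference algorithm. The sample strings are hypothetical, with field names borrowed from the schema output above, and all function names are invented for this sketch.

```python
import json


def infer_schema(value):
    """Describe one parsed JSON value: nested dicts for objects, type names otherwise."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    return type(value).__name__


def merge_schemas(left, right):
    """Combine two descriptions so that a field seen in either one is kept."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, child in right.items():
            merged[key] = merge_schemas(merged[key], child) if key in merged else child
        return merged
    # Matching scalar types are kept; conflicting ones fall back to "str".
    return left if left == right else "str"


def discover_schema(json_strings):
    """Scan every JSON string and merge the fields found in each record."""
    schema = {}
    for text in json_strings:
        parsed = json.loads(text)
        # A top-level array is treated as a sequence of records.
        records = parsed if isinstance(parsed, list) else [parsed]
        for record in records:
            schema = merge_schemas(schema, infer_schema(record))
    return schema


# Hypothetical sample values; real events carry many more fields.
samples = [
    '[{"EventId": "1", "EventSource": {"Generator": "gen-a"}}]',
    '[{"EventId": "2", "EventDetail": {"TypeId": "login"}}]',
]
print(discover_schema(samples))
```

The sketch shows why discovery has to read the data: `EventSource` and `EventDetail` each appear in only one of the two sample records, yet both end up in the merged result.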

#### Working with JSON (cont)

The schema created above can also be used for Spark Structured Streaming via Kafka.


In [5]:
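# Subscribe to the Kafka topic as a streaming DataFrame.
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "ANALYTIC-DEMO-UEBA") \
  .load()
# Kafka delivers key and value as binary, so cast both to strings.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")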

DataFrame[key: string, value: string]
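For a single Kafka record, the cast above amounts roughly to decoding the raw bytes as UTF-8, after which the value can be parsed as JSON. The plain-Python sketch below illustrates that per-record step outside Spark. The function names and sample bytes are hypothetical, and turning malformed JSON into `None` is a choice made for this sketch rather than a description of Spark's behaviour.

```python
import json


def cast_to_string(raw):
    """Decode a Kafka key or value; None (a missing key, say) stays None."""
    if raw is None:
        return None
    return raw.decode("utf-8", errors="replace")


def parse_record(key, value):
    """Return the decoded key and the parsed JSON value of one record."""
    key_text = cast_to_string(key)
    value_text = cast_to_string(value)
    try:
        event = json.loads(value_text) if value_text is not None else None
    except json.JSONDecodeError:
        # Sketch-level choice: malformed JSON becomes None instead of raising.
        event = None
    return key_text, event


# Hypothetical record contents.
print(parse_record(b"k1", b'{"EventId": "7"}'))
```

Within Spark, the equivalent step would apply the discovered `json_schema` to the string `value` column rather than parsing each record in Python.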