# Single Event with Complex Analysis

## Part Two (ML Training)

#### Prerequisites
This notebook is designed to work with a Stroom server process running on `localhost`, into which data from the `EventGen` application has been ingested and indexed in the manner described in the previous notebook of this series, "Part One (Data Exploration)".

You must set the environment variable `STROOM_API_KEY` to the API token associated with a suitably privileged Stroom user account before starting the Jupyter notebook server process.
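
A missing token otherwise only surfaces later, as a `KeyError` when the data source is loaded. The following is a minimal sketch of a fail-fast check that could be run in the first cell. The helper name `require_env` is hypothetical and is not part of Stroom or this notebook.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before starting the Jupyter server"
        )
    return value


# Usage (commented out so the sketch runs without a token being set):
# token = require_env("STROOM_API_KEY")
```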

In [1]:
# SparkSession is used below to create the session, so import it explicitly
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col, coalesce, unix_timestamp, lit, to_timestamp, hour, date_format, date_trunc
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x renamed it to OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator, VectorAssembler, StringIndexer
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from IPython.display import display
import os
import time

#### Schema Discovery
It is necessary to specify the structure of the JSON data arriving on the topic.  This structure can be determined at runtime.

As the same format of data is also available via an indexed search using the `stroom-spark-datasource`, one way to determine the JSON schema is by interrogating the data held in the `Sample Index` Stroom index.

The specified pipeline is a Stroom Search Extraction Pipeline that uses the `stroom:json` XSLT function to create a JSON representation of the entire event.  This field is called "Json" by default, but the name of the field that contains the JSON representation can (optionally) be changed with the parameter `jsonField`.

In this manner, all data is returned as a single JSON structure within the field **json**.

In [4]:
spark = SparkSession \
    .builder \
    .appName("MyTestApp") \
    .getOrCreate()
schemaDf = spark.read.format('stroom.spark.datasource.StroomDataSource').load(
        token=os.environ['STROOM_API_KEY'],host='localhost',protocol='http',
        uri='api/stroom-index/v2',traceLevel="0",
        index='5b41ebbf-b53e-41e6-a4e5-e5a220d8fd69',pipeline='13143179-b494-4146-ac4b-9a6010cada89',
        maxResults='300000').filter((col('idxEventTime') > '2019-12-08T00:00:00.000Z')
            & (col('idxEventTime') < '2019-12-24T00:00:00.000Z')
            & (col('idxDescription') == 'Authentication Failure'))

print ('Using ', schemaDf.count(), ' records for training')
json_schema = spark.read.json(schemaDf.rdd.map(lambda row: row.json)).schema

json_schema

Using  2371  records for training


StructType(List(StructField(EventDetail,StructType(List(StructField(Authenticate,StructType(List(StructField(Action,StringType,true),StructField(Outcome,StructType(List(StructField(Permitted,StringType,true),StructField(Reason,StringType,true),StructField(Success,StringType,true))),true),StructField(User,StructType(List(StructField(Id,StringType,true))),true))),true),StructField(TypeId,StringType,true))),true),StructField(EventId,StringType,true),StructField(EventSource,StructType(List(StructField(Device,StructType(List(StructField(HostName,StringType,true))),true),StructField(Generator,StringType,true),StructField(System,StructType(List(StructField(Environment,StringType,true),StructField(Name,StringType,true))),true))),true),StructField(EventTime,StructType(List(StructField(TimeCreated,StringType,true))),true),StructField(StreamId,StringType,true)))
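
The idea behind determining a schema at runtime can be pictured with a small pure-Python sketch: walk each sample record, note the type of every field, and take the union across records so that fields missing from one record are still captured. This is only an illustration of the concept, not Spark's implementation. The helper names and the two sample records are hypothetical, with field names borrowed from the schema shown above.

```python
import json


def infer_schema(value):
    """Describe a parsed JSON value as nested dicts of field name -> type name."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        return [infer_schema(value[0])] if value else []
    return type(value).__name__


def merge_schemas(left, right):
    """Union two inferred schemas; on a type conflict the first one seen wins."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, child in right.items():
            merged[key] = merge_schemas(left[key], child) if key in left else child
        return merged
    return left


# Hypothetical sample records, each carrying a different subset of fields
records = [
    '{"EventId": "1", "EventTime": {"TimeCreated": "2019-12-09T08:15:00.000Z"}}',
    '{"EventId": "2", "EventDetail": {"TypeId": "Authentication Failure"}}',
]

schema = {}
for record in records:
    schema = merge_schemas(schema, infer_schema(json.loads(record)))

print(json.dumps(schema, indent=2))
```

Because the schema is a union over the sampled records, a field that never appears in the sample will be absent from it, which is one reason to infer it from a reasonably large extract.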

## Feature Engineering
Creating a feature vector suitable for ML

In [8]:
featuresDf = schemaDf.withColumn('evt', from_json(col('json'), json_schema)).\
    withColumn ('timestamp', to_timestamp(col('evt.EventTime.TimeCreated')).cast("timestamp")).\
    withColumn('operation', col('evt.EventDetail.TypeId')).\
    groupBy(date_trunc('day',"timestamp").alias("date"), 
            date_format('timestamp', 'EEEE').alias("day"), 
            hour("timestamp").alias("hour"),
            col('operation')).\
    count().\
    sort(col('date'),col('hour'))

featuresDf.show()


# operationNameIndexer = StringIndexer(inputCol="operation",outputCol="opCat")
# operationEncoder = OneHotEncoderEstimator(inputCols=['opCat'],outputCols=['opVec'])
hourEncoder = OneHotEncoderEstimator(inputCols=['hour'],outputCols=['hourVec'])
dayNameIndexer = StringIndexer(inputCol="day",outputCol="dayCat")
dayEncoder = OneHotEncoderEstimator(inputCols=['dayCat'],outputCols=['dayVec'])
basicPipeline = Pipeline(stages=[hourEncoder, dayNameIndexer, dayEncoder])

pipelineModel = basicPipeline.fit(featuresDf)
pipelineModel.write().overwrite().save("models/inputVecPipelineModel")

vecDf = pipelineModel.transform(featuresDf)

vecDf.show()


+-------------------+-------+----+--------------------+-----+
|               date|    day|hour|           operation|count|
+-------------------+-------+----+--------------------+-----+
|2019-12-08 00:00:00| Sunday|  14|Authentication Fa...|    1|
|2019-12-08 00:00:00| Sunday|  22|Authentication Fa...|    1|
|2019-12-09 00:00:00| Monday|   7|Authentication Fa...|    1|
|2019-12-09 00:00:00| Monday|   8|Authentication Fa...|    6|
|2019-12-09 00:00:00| Monday|   9|Authentication Fa...|   16|
|2019-12-09 00:00:00| Monday|  10|Authentication Fa...|   21|
|2019-12-09 00:00:00| Monday|  11|Authentication Fa...|   16|
|2019-12-09 00:00:00| Monday|  12|Authentication Fa...|   25|
|2019-12-09 00:00:00| Monday|  13|Authentication Fa...|   30|
|2019-12-09 00:00:00| Monday|  14|Authentication Fa...|   31|
|2019-12-09 00:00:00| Monday|  15|Authentication Fa...|   16|
|2019-12-09 00:00:00| Monday|  16|Authentication Fa...|   16|
|2019-12-09 00:00:00| Monday|  17|Authentication Fa...|   13|
|2019-12
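
The aggregation above reduces each event to a (date, day name, hour) bucket and counts the events in each bucket. A minimal pure-Python sketch of the same bucketing follows. The function name `hourly_counts` and the sample timestamps are hypothetical, and time zones are ignored here, whereas Spark applies its session time zone.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count events per (date, day name, hour) bucket, sorted by bucket."""
    buckets = Counter()
    for text in timestamps:
        ts = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
        # Same three grouping keys as the Spark query: date, day name, hour
        buckets[(ts.date().isoformat(), ts.strftime("%A"), ts.hour)] += 1
    return sorted(buckets.items())


# Hypothetical event times in the format used by EventTime.TimeCreated
sample = [
    "2019-12-09T08:01:00.000Z",
    "2019-12-09T08:30:00.000Z",
    "2019-12-09T09:05:00.000Z",
    "2019-12-08T14:00:00.000Z",
]

for bucket, count in hourly_counts(sample):
    print(bucket, count)
```

Note that hours with no events produce no row at all, so the model is never shown a count of zero.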

We can now assemble the complete feature vector.  It is shown below alongside the required output, which is simply the scalar `count`.

In [9]:
vectorAssembler = VectorAssembler(inputCols = ['hourVec','dayVec'], outputCol = 'features')

trainingDf = vectorAssembler.transform(vecDf).select('features','count')

trainingDf.show()


+--------------------+-----+
|            features|count|
+--------------------+-----+
|     (29,[14],[1.0])|    1|
|     (29,[22],[1.0])|    1|
|(29,[7,26],[1.0,1...|    1|
|(29,[8,26],[1.0,1...|    6|
|(29,[9,26],[1.0,1...|   16|
|(29,[10,26],[1.0,...|   21|
|(29,[11,26],[1.0,...|   16|
|(29,[12,26],[1.0,...|   25|
|(29,[13,26],[1.0,...|   30|
|(29,[14,26],[1.0,...|   31|
|(29,[15,26],[1.0,...|   16|
|(29,[16,26],[1.0,...|   16|
|(29,[17,26],[1.0,...|   13|
|(29,[18,26],[1.0,...|   11|
|(29,[19,26],[1.0,...|   11|
|(29,[20,26],[1.0,...|    5|
|(29,[21,26],[1.0,...|    5|
|(29,[22,26],[1.0,...|    1|
|     (29,[26],[1.0])|    1|
|(29,[0,25],[1.0,1...|    1|
+--------------------+-----+
only showing top 20 rows
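
The sparse notation `(29,[7,26],[1.0,1.0])` reads as: a vector of length 29 with the value 1.0 at positions 7 and 26. The sketch below reproduces that layout in plain Python, assuming both encoders keep their default behaviour of dropping the last category, so that 24 hours occupy 23 slots and 7 days occupy 6 slots. The day indices used (3 for Monday, 6 for Sunday) are inferred from the output above rather than read from the fitted `StringIndexer`, and the helper names are hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode an index; the last category becomes all zeros if dropped."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(hour, day_index):
    """Concatenate the hour and day encodings, as VectorAssembler would."""
    return one_hot(hour, 24) + one_hot(day_index, 7)


def sparse_repr(vec):
    """Return (size, active indices, active values) for a dense vector."""
    indices = [i for i, v in enumerate(vec) if v != 0.0]
    return (len(vec), indices, [vec[i] for i in indices])


print(sparse_repr(assemble(7, 3)))   # Monday 07:00, assuming Monday has index 3
print(sparse_repr(assemble(14, 6)))  # Sunday 14:00, assuming Sunday is the dropped category
```

Under these assumptions the day slots start at offset 23, which is why a Monday row carries index 26 (23 + 3) and why the Sunday rows show only an hour index.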



## Training (Linear Regression)
Now fit a linear regression model to predict the number of authentication failures for each hour and day of the week, and save the model for later use.

In [10]:
linearReg = LinearRegression(maxIter=20, regParam=0.001, featuresCol='features', labelCol='count')

linearRegModel = linearReg.fit(trainingDf)

linearRegModel.write().overwrite().save("models/linearRegressionAuthFailuresModel")

## Model Evaluation (Linear Regression)
There are many ways that an ML model could be refined and improved.  Here we are only interested in whether the model fits the data, so it is evaluated against the training set itself.


In [11]:
summaryInfo = linearRegModel.evaluate(trainingDf)
print ("Mean Absolute Error", summaryInfo.meanAbsoluteError, "Residuals", summaryInfo.residuals)

Mean Absolute Error 3.292371960739538 Residuals DataFrame[residuals: double]


In [12]:
linearRegModel.transform(trainingDf).show()

+--------------------+-----+------------------+
|            features|count|        prediction|
+--------------------+-----+------------------+
|     (29,[14],[1.0])|    1| 6.555973381511475|
|     (29,[22],[1.0])|    1|-7.270283360971843|
|(29,[7,26],[1.0,1...|    1|5.0790358057289335|
|(29,[8,26],[1.0,1...|    6|7.9394014613871295|
|(29,[9,26],[1.0,1...|   16|13.086953493613871|
|(29,[10,26],[1.0,...|   21|18.002570782066748|
|(29,[11,26],[1.0,...|   16| 20.60465112558277|
|(29,[12,26],[1.0,...|   25|21.492568904673462|
|(29,[13,26],[1.0,...|   30| 23.18054079271544|
|(29,[14,26],[1.0,...|   31|19.421370222186795|
|(29,[15,26],[1.0,...|   16|18.707163779950015|
|(29,[16,26],[1.0,...|   16|17.493012828147492|
|(29,[17,26],[1.0,...|   13| 12.34841936939023|
|(29,[18,26],[1.0,...|   11|13.464921706757428|
|(29,[19,26],[1.0,...|   11| 9.111181549376402|
|(29,[20,26],[1.0,...|    5| 7.348978712470895|
|(29,[21,26],[1.0,...|    5| 6.115948669800332|
|(29,[22,26],[1.0,...|    1|  5.59511347
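
To make the error metric concrete, the sketch below recomputes the mean absolute error by hand for the 17 complete rows displayed above, with predictions rounded to three decimal places. It covers only those rows, so the result is not expected to equal the figure reported for the whole training set. It also picks out any negative predictions, which are impossible values for a count.

```python
# (count, prediction) pairs copied from the displayed rows, rounded to 3 d.p.
rows = [
    (1, 6.556), (1, -7.270), (1, 5.079), (6, 7.939), (16, 13.087),
    (21, 18.003), (16, 20.605), (25, 21.493), (30, 23.181), (31, 19.421),
    (16, 18.707), (16, 17.493), (13, 12.348), (11, 13.465), (11, 9.111),
    (5, 7.349), (5, 6.116),
]


def mean_absolute_error(pairs):
    """Average of |actual - predicted| over all pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)


negatives = [(actual, predicted) for actual, predicted in rows if predicted < 0]

print("MAE over displayed rows:", round(mean_absolute_error(rows), 3))
print("Negative predictions:", negatives)
```

An unconstrained linear model can produce negative outputs such as the one flagged here, so a consumer of this model may want to clamp predictions at zero.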

## Training (Logistic Regression)
Although the value to predict is a count rather than a category, it may take so few distinct values that logistic regression can be used, with each distinct count treated as a class label.  Let's try!

In [13]:
logisticReg = LogisticRegression(maxIter=20, regParam=0.001, featuresCol='features', labelCol='count')

logisticRegModel = logisticReg.fit(trainingDf)

logisticRegModel.write().overwrite().save("models/logisticRegressionAuthFailuresModel")

## Model Evaluation (Logistic Regression)
There are many ways that an ML model could be refined and improved.  Here we are only interested in understanding whether the model fits the data.


In [14]:
logisticRegModel.transform(trainingDf).show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|count|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|     (29,[14],[1.0])|    1|[-2.1631433212228...|[1.19124516688074...|       1.0|
|     (29,[22],[1.0])|    1|[-2.0581177605006...|[3.02925063554502...|       1.0|
|(29,[7,26],[1.0,1...|    1|[-2.2686740344799...|[5.49477583726992...|       1.0|
|(29,[8,26],[1.0,1...|    6|[-2.3589126058195...|[2.18803680242672...|      16.0|
|(29,[9,26],[1.0,1...|   16|[-2.3557513308063...|[1.09249672120952...|      16.0|
|(29,[10,26],[1.0,...|   21|[-2.3414182895521...|[4.95686434473195...|      16.0|
|(29,[11,26],[1.0,...|   16|[-2.3595197366029...|[1.06119311955341...|      31.0|
|(29,[12,26],[1.0,...|   25|[-2.3429049530476...|[1.14344337315033...|      33.0|
|(29,[13,26],[1.0,...|   30|[-2.3553860157658...|[1.08143829586599...|      30.0|
|(29,[14,26],[1.
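
Because the classifier treats every count as an unrelated label, it can be judged in two ways: how often it picks exactly the right label, and how far off it is numerically when it does not. The sketch below computes both for the nine complete rows displayed above. It is an illustration on that small extract only, not an evaluation of the model.

```python
# (count, predicted class) pairs copied from the nine complete rows above
counts = [1, 1, 1, 6, 16, 21, 16, 25, 30]
predicted = [1, 1, 1, 16, 16, 16, 31, 33, 30]

pairs = list(zip(counts, predicted))

# Classification view: a prediction is either exactly right or wrong
accuracy = sum(actual == guess for actual, guess in pairs) / len(pairs)

# Regression view: how far the predicted label is from the true count
mean_abs_error = sum(abs(actual - guess) for actual, guess in pairs) / len(pairs)

print("Exact matches:", accuracy)
print("Mean absolute error:", mean_abs_error)
```

The classification view treats predicting 31 for a true count of 16 as no worse than predicting 16 for a true count of 21, which is worth bearing in mind when comparing this model with the regressors.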

## Training (Random Forest Regression)
A decision tree / random forest approach might be more successful.

In [15]:
randomForestRegressor = RandomForestRegressor(featuresCol='features', labelCol='count')

randomForestModel = randomForestRegressor.fit(trainingDf)

randomForestModel.write().overwrite().save("models/randomForestAuthFailuresModel")

## Model Evaluation (Random Forest Regression)
There are many ways that an ML model could be refined and improved.  Here we are only interested in understanding whether the model fits the data.

In [16]:
randomForestModel.transform(trainingDf).show()

+--------------------+-----+------------------+
|            features|count|        prediction|
+--------------------+-----+------------------+
|     (29,[14],[1.0])|    1| 9.296174154562895|
|     (29,[22],[1.0])|    1| 6.372897085061266|
|(29,[7,26],[1.0,1...|    1| 6.876284959473503|
|(29,[8,26],[1.0,1...|    6| 8.738633311467257|
|(29,[9,26],[1.0,1...|   16| 8.738633311467257|
|(29,[10,26],[1.0,...|   21|11.729651891699124|
|(29,[11,26],[1.0,...|   16| 17.86019124883081|
|(29,[12,26],[1.0,...|   25|18.418382159382045|
|(29,[13,26],[1.0,...|   30|20.409249132398266|
|(29,[14,26],[1.0,...|   31|10.345037605024945|
|(29,[15,26],[1.0,...|   16|13.347927376863652|
|(29,[16,26],[1.0,...|   16| 9.645655072509118|
|(29,[17,26],[1.0,...|   13| 8.738633311467257|
|(29,[18,26],[1.0,...|   11| 8.738633311467257|
|(29,[19,26],[1.0,...|   11| 8.738633311467257|
|(29,[20,26],[1.0,...|    5| 8.738633311467257|
|(29,[21,26],[1.0,...|    5| 8.738633311467257|
|(29,[22,26],[1.0,...|    1| 6.631452995
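
Several of the rows above share an identical prediction. The sketch below counts how often each predicted value occurs among the 17 complete rows displayed (rounded to three decimal places) and computes the mean absolute error over the same rows. A plausible but unverified explanation for the repeated value is that, with the default tree depth and one-hot features, many hour/day combinations end up in the same leaves.

```python
from collections import Counter

# (count, prediction) pairs copied from the displayed rows, rounded to 3 d.p.
rows = [
    (1, 9.296), (1, 6.373), (1, 6.876), (6, 8.739), (16, 8.739),
    (21, 11.730), (16, 17.860), (25, 18.418), (30, 20.409), (31, 10.345),
    (16, 13.348), (16, 9.646), (13, 8.739), (11, 8.739), (11, 8.739),
    (5, 8.739), (5, 8.739),
]

# How many rows share each predicted value
prediction_counts = Counter(predicted for _, predicted in rows)
most_common_value, occurrences = prediction_counts.most_common(1)[0]

forest_mae = sum(abs(actual - predicted) for actual, predicted in rows) / len(rows)

print("Most frequent prediction:", most_common_value, "seen", occurrences, "times")
print("MAE over displayed rows:", round(forest_mae, 3))
```

If the repeated value does come from shallow trees, increasing `maxDepth` or `numTrees` on the `RandomForestRegressor` would be a natural next experiment.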