## Transform MachineInfo from RDD to DataFrames

* Import `PsPipeline`
* Construct `miRDDReader`
* Transform `miRDDs` to `miDF`
* Perform/Benchmark some queries on `miDF`

#### Overview
This notebook tests two hypotheses:
 1. Spark DataFrames make it feasible to run complex SQL queries on MachineInfo at scale.
 2. Saving the data as `Parquet` (a columnar format) could reduce storage cost.

#### TODO

* Fix `miReader`

### 1. Inject `PsPipeline`

In [1]:
// Import extra dependency using Jar
// Ref: https://toree.apache.org/docs/current/user/faq/
%AddJar file:///Users/datng2/pspipline.jar
sc.setLogLevel("DEBUG")

Starting download from file:///Users/datng2/pspipline.jar
Finished download of pspipline.jar


In [2]:
// For implicit conversions from RDDs to DataFrames
// https://stackoverflow.com/questions/44094108/not-able-to-import-spark-implicits-in-scalatest
val spark2 = spark
// Reuse the session's SQLContext rather than constructing a second one
val sqlContext = spark2.sqlContext
// One implicits import is enough; importing both sets can make `toDF` ambiguous
import spark2.implicits._

import java.net.URI
import collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.hadoop.fs.{FileSystem, Path}
import com.tetration.proto.MachineInfoProtos.MachineInfo
import com.tetration.processanalytics.pipeline.io.MachineInfoBatchReaderWithProtoField

spark2 = org.apache.spark.sql.SparkSession@fe55ca1
sqlContext = org.apache.spark.sql.SQLContext@45773d74




org.apache.spark.sql.SQLContext@45773d74

### 2. Build DataFrames from RDD

```
/* 
*      Vanilla schemas to represent relationship between
*  MachineInfo (1) <-------> (N) ProcessInfo (1) <-------> (N) ForensicEvent
*
*/
```
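
The one-to-many chain sketched above can be flattened into rows with nested `flatMap`s, which is the same idea the transform functions later in this notebook apply to the real protobuf messages. The snippet below is a minimal, Spark-free sketch: `Machine`, `ProcessInfo`, `ForensicEvent`, and `EventRow` are hypothetical stand-ins, not the actual `MachineInfo` proto classes, and their fields are assumptions chosen for illustration.

```scala
// Hypothetical stand-ins for the protobuf messages (assumed fields, for illustration only).
final case class ForensicEvent(eventType: String)
final case class ProcessInfo(commandString: String, events: List[ForensicEvent])
final case class Machine(sensorId: String, processes: List[ProcessInfo])

// One flat row per (machine, process, event) triple; parent keys are copied down.
final case class EventRow(sensorId: String, commandString: String, eventType: String)

// The for-comprehension desugars to nested flatMap/map calls.
def flatten(machines: Seq[Machine]): Seq[EventRow] =
  for {
    m <- machines
    p <- m.processes
    e <- p.events
  } yield EventRow(m.sensorId, p.commandString, e.eventType)

val sample = Seq(
  Machine("s1", List(
    ProcessInfo("bash", List(ForensicEvent("exec"), ForensicEvent("fork"))),
    ProcessInfo("sshd", Nil))),
  Machine("s2", List(
    ProcessInfo("cron", List(ForensicEvent("exec"))))))

val rows = flatten(sample)
rows.foreach(println)
```

Note that a process with no events (`sshd` here) produces no rows, so this flattening behaves like an inner join; keeping such parents would require a separate table per level, as the notebook does with its per-level row classes.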

In [4]:
val miProtoReader = new MachineInfoBatchReaderWithProtoField(
    sc,
    sc.hadoopConfiguration,
    FileSystem.get(URI.create("file:///"), sc.hadoopConfiguration),
    "/Users/datng2/data/machineinfo/galois")

val miBatches = Seq("201805091902", 
                    "201805091903", 
                    "201805091904").asJava

val miRDDs = miProtoReader
    .readBatches(miBatches).asScala
    .reduce(_ union _)
    .rdd

miProtoReader = com.tetration.processanalytics.pipeline.io.MachineInfoBatchReaderWithProtoField@72dbe8db
miBatches = [201805091902, 201805091903, 201805091904]
miRDDs = UnionRDD[13] at union at <console>:67


UnionRDD[13] at union at <console>:67

### Define experimental schemas

In [5]:
// Case-class parameters are already public vals, so the explicit `val` is redundant
case class ProcessInfoRow(
    sensorId: String,
    processKeyPart1: Long,
    processKeyPart2: Long,
    commandString: String)

case class SensorIdRow(sensorId: String)

defined class ProcessInfoRow
defined class SensorIdRow


In [6]:
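// Transform functions

// One ProcessInfoRow per process reported in a MachineInfo, tagged with the parent sensor ID
val MIToProcessInfo: MachineInfo => List[ProcessInfoRow] = mi => {
  mi.getProcessInfoList.asScala
    .map(pi => ProcessInfoRow(
        mi.getSensorId,
        pi.getKey.getPart1,
        pi.getKey.getPart2,
        pi.getCommandString))
    .toList
}

// One SensorIdRow per process (not per machine), so this table has as many rows as ProcessInfo
val MIToSensorId: MachineInfo => List[SensorIdRow] = mi => {
  mi.getProcessInfoList.asScala
    .map(_ => SensorIdRow(mi.getSensorId))
    .toList
}

// flatMap flattens the per-machine lists; toDF relies on the implicits imported earlier
val processInfoDF = miRDDs.flatMap(MIToProcessInfo).toDF
val sensorIdDF = miRDDs.flatMap(MIToSensorId).toDF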

MIToProcessInfo = > List[ProcessInfoRow] = <function1>
MIToSensorId = > List[SensorIdRow] = <function1>
processInfoDF = [sensorId: string, processKeyPart1: bigint ... 2 more fields]
sensorIdDF = [sensorId: string]


<console>:6: error: Symbol 'type scala.AnyRef' is missing from the classpath.
This symbol is required by 'class org.apache.spark.sql.catalyst.QualifiedTableName'.
Make sure that type AnyRef is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'QualifiedTableName.class' was compiled against an incompatible version of scala.
  lazy val $print: String =  {
           ^


[sensorId: string]

### Sample Queries

In [7]:
sensorIdDF.createOrReplaceTempView("SensorId")
sqlContext.sql("SELECT COUNT(*) FROM SensorId").show()

+--------+
|count(1)|
+--------+
|   11486|
+--------+



In [8]:
processInfoDF.createOrReplaceTempView("ProcessInfo")
val res = sqlContext.sql("""
    SELECT sensorId, COUNT(DISTINCT commandString) FROM ProcessInfo
        GROUP BY sensorId
""")
res.show(10)

+--------------------+-----------------------------+
|            sensorId|count(DISTINCT commandString)|
+--------------------+-----------------------------+
|1f5daca2e393a1ab5...|                          128|
|5601456f1b9e75fc6...|                          212|
|4a345e6a678d3bb21...|                          127|
|fb6bd6990776bce3d...|                          128|
|fda22d272815f1503...|                           24|
|aff1466a070b40053...|                          239|
|c513d16dc8aa88554...|                          130|
|7ff2be60e4d97634a...|                          130|
|749e050e5494f15d5...|                          419|
|d09989fcc31fa0c41...|                          231|
+--------------------+-----------------------------+
only showing top 10 rows



res = [sensorId: string, count(DISTINCT commandString): bigint]


[sensorId: string, count(DISTINCT commandString): bigint]
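
The outline lists benchmarking these queries, but the cells above only display results. A minimal timing helper could look like the sketch below; `timed` is a hypothetical name, not part of Spark or of `PsPipeline`, and the workload is a plain-Scala stand-in so the snippet runs without a cluster.

```scala
// Hypothetical helper: evaluates `body` once and returns its result
// together with the elapsed wall-clock time in milliseconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  (result, elapsedMs)
}

// Stand-in workload (no Spark required): sum the integers 1 to 1,000,000.
val (total, ms) = timed { (1 to 1000000).map(_.toLong).sum }
println(f"sum=$total took $ms%.1f ms")
```

Because DataFrames are evaluated lazily, a measurement in this notebook would need to wrap an action such as `res.count()` or `res.show(10)` rather than the `sqlContext.sql(...)` call alone. A single run also includes JIT warm-up and any caching effects, so repeated runs are likely to give a fairer picture.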

### Saving to Parquet

In [None]:
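import org.apache.hadoop.io.compress.GzipCodec

// Compress Parquet output with gzip so both formats use the same codec
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// Columnar output: one Parquet directory per DataFrame
sensorIdDF.write.parquet("sensorid.parquet")
processInfoDF.write.parquet("pi.parquet")

// Baseline for comparison: the same rows as gzip-compressed text files
sensorIdDF.rdd.saveAsTextFile("sensorid.rdd", classOf[GzipCodec])
processInfoDF.rdd.saveAsTextFile("pi.rdd", classOf[GzipCodec])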
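
To check the storage-cost hypothesis from the overview, the sizes of the output directories have to be compared once the cell above has run. The sketch below is one possible way to do that, assuming the outputs land on the local filesystem; `dirSizeBytes` is a hypothetical helper, and the demo uses a temporary directory rather than the notebook's real outputs.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper: total size in bytes of all regular files under `root`.
def dirSizeBytes(root: Path): Long = {
  val stream = Files.walk(root)
  try {
    stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .map(p => Files.size(p))
      .sum
  } finally {
    stream.close() // Files.walk holds open directory handles until closed
  }
}

// Demo on a throwaway directory: a 100-byte file plus a 50-byte file in a subdirectory.
val tmp = Files.createTempDirectory("size-demo")
Files.write(tmp.resolve("a.bin"), new Array[Byte](100))
Files.createDirectories(tmp.resolve("sub"))
Files.write(tmp.resolve("sub").resolve("b.bin"), new Array[Byte](50))

val demoBytes = dirSizeBytes(tmp)
println(s"demo directory holds $demoBytes bytes")
```

In this notebook the comparison might then be `dirSizeBytes(java.nio.file.Paths.get("pi.parquet"))` against `dirSizeBytes(java.nio.file.Paths.get("pi.rdd"))`, assuming both were written relative to the kernel's working directory. The totals include Spark's `_SUCCESS` marker and any `.crc` checksum files, which are small but not zero.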