# Truck Data Wrangler - Streaming part

In this notebook we develop a solution to stream the truck data using Spark Structured Streaming.

First, let's get a Spark session to work with.

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Truck Data Wrangler").getOrCreate()
spark

## Schema

After getting the spark session, we'll define the schema of this Structured Streaming process:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `c0` | `integer` | The index key, used only as a reference for row order |
| `event_type` | `string` | The event type, according to the categorization of the data |
| `label` | `string` | The label for data segmentation |
| `timestamp` | `long` | The time of the reading (the sample values look like microseconds since the Unix epoch) |
| `accel_x` | `double` | The X-axis accelerometer value |
| `accel_y` | `double` | The Y-axis accelerometer value |
| `accel_z` | `double` | The Z-axis accelerometer value |
| `gyro_roll` | `double` | The roll-axis gyroscope value |
| `gyro_pitch` | `double` | The pitch-axis gyroscope value |
| `gyro_yaw` | `double` | The yaw-axis gyroscope value |

In [14]:
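from pyspark.sql.types import (
    DoubleType, IntegerType, LongType, StringType, StructField, StructType
)

# With a user-supplied schema, CSV columns are matched by position,
# so the field order here must follow the column order of the file.
csvSchema = StructType([
    StructField("c0", IntegerType(), True),  # row index; integer, as documented in the table
    StructField("event_type", StringType(), False),
    StructField("label", StringType(), False),
    StructField("timestamp", LongType(), False),
    StructField("accel_x", DoubleType(), False),
    StructField("accel_y", DoubleType(), False),
    StructField("accel_z", DoubleType(), False),
    StructField("gyro_roll", DoubleType(), False),
    StructField("gyro_pitch", DoubleType(), False),
    StructField("gyro_yaw", DoubleType(), False)
])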

## Loading the data

In [15]:
truck_events_df = spark.read.schema(csvSchema).csv('data/unified.csv', header=True)
truck_events_df.createOrReplaceTempView("truck_events")

spark.sql("""
SELECT 
    *
FROM 
    truck_events 
LIMIT
    10
""").toPandas()

| | c0 | event_type | label | timestamp | accel_x | accel_y | accel_z | gyro_roll | gyro_pitch | gyro_yaw |
| - | -- | ---------- | ----- | --------- | ------- | ------- | ------- | --------- | ---------- | -------- |
| 0 | 0 | agressive_bump | 0 | 1550163148318484 | 0.033898 | 0.077898 | 0.749529 | -0.000423 | -0.000528 | 7.6e-05 |
| 1 | 1 | agressive_bump | 0 | 1550163148368484 | 0.032748 | 0.077898 | 0.749353 | -0.000423 | -0.000528 | 7.6e-05 |
| 2 | 2 | agressive_bump | 0 | 1550163148418484 | 0.034472 | 0.080838 | 0.74988 | -0.000954 | 4e-06 | 7.6e-05 |
| 3 | 3 | agressive_bump | 0 | 1550163148468484 | 0.033898 | 0.080838 | 0.749002 | -0.000422 | -0.000528 | -0.000456 |
| 4 | 4 | agressive_bump | 0 | 1550163148518484 | 0.033898 | 0.080838 | 0.749353 | -0.000422 | -0.000527 | -0.000455 |
| 5 | 5 | agressive_bump | 0 | 1550163148568484 | 0.034472 | 0.079368 | 0.750055 | 0.00011 | -0.000527 | 7.7e-05 |
| 6 | 6 | agressive_bump | 0 | 1550163148618484 | 0.034472 | 0.080838 | 0.748827 | -0.000422 | -0.000527 | 7.7e-05 |
| 7 | 7 | agressive_bump | 0 | 1550163148668484 | 0.033898 | 0.079368 | 0.749529 | -0.000421 | -0.000526 | 7.7e-05 |
| 8 | 8 | agressive_bump | 0 | 1550163148718484 | 0.033898 | 0.080838 | 0.749529 | -0.000421 | -0.000526 | 7.6e-05 |
| 9 | 9 | agressive_bump | 0 | 1550163148768484 | 0.032748 | 0.079368 | 0.749178 | -0.000421 | -0.000526 | -0.000455 |
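
The `timestamp` values in the sample rows are large integers whose unit is not stated. A minimal sketch, assuming they are microseconds since the Unix epoch, converts the first two sample timestamps to UTC datetimes and derives the sampling interval (`to_datetime` and `sample_timestamps` are illustrative names):

```python
from datetime import datetime, timedelta, timezone

# First two timestamps from the sample rows.
# Assumption: microseconds since the Unix epoch (the unit is not stated in the data).
sample_timestamps = [1550163148318484, 1550163148368484]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(ts_us):
    """Convert an integer microsecond timestamp to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(microseconds=ts_us)


first, second = (to_datetime(ts) for ts in sample_timestamps)

# Dividing two timedeltas gives the interval as a float number of milliseconds.
interval_ms = (second - first) / timedelta(milliseconds=1)

print(first.isoformat())  # 2019-02-14T16:52:28.318484+00:00
print(interval_ms)        # 50.0
```

Under that assumption the readings are 50 ms apart (20 Hz), and any rate computed from the raw `timestamp` differences is expressed per microsecond.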


## Stream Processing

Now that we have tested the schema by loading `data/unified.csv` as a batch, we set up the streaming read and the processing applied to it.

In [16]:
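inputPath = 'data/unified.csv'

# Streaming file sources are documented as watching a directory (or glob) for
# new files; point inputPath at that location if the data arrives as files.
rawRecords = (
    spark
        .readStream
        .schema(csvSchema)                # streaming file sources need an explicit schema
        .option("maxFilesPerTrigger", 1)  # process at most one new file per micro-batch
        .option("header", True)           # the file has a header row, as in the batch read
        .csv(inputPath)                   # the input is CSV, not JSON
)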

### Generating jerk data as the stream flows in

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

# Rows are grouped by (event_type, label) and ordered by timestamp within each group.
column_list = ["event_type", "label"]
win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("timestamp")

# Caveat: Spark does not support lag() over a non-time-based window on a
# streaming DataFrame; a query built this way on rawRecords fails analysis when
# started. The logic has to run per micro-batch (for example inside
# foreachBatch) or on a batch DataFrame.
jerk_truck_events_df = rawRecords

columns_that_needs_latest_values = ['accel_x', 'accel_y', 'accel_z', 'timestamp']

# Carry the previous row's value of each column as last_<column>.
for column_name in columns_that_needs_latest_values:
    jerk_truck_events_df = jerk_truck_events_df.withColumn("last_" + column_name, F.lag(col(column_name)).over(win_spec))

# Jerk per axis = change in acceleration / change in timestamp.
# The first row of each partition has no previous row, so its jerk is set to 0.
for axis in ["x", "y", "z"]:
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "jerk_" + axis,
        F.when(F.isnull(col("last_accel_" + axis)), 0)
         .when(F.isnull(col("last_timestamp")), 0)
         .otherwise(
             (col("accel_" + axis) - col("last_accel_" + axis))
             / (col("timestamp") - col("last_timestamp"))
         )
    )