# Truck Data Wrangler - Streaming part

In this notebook we will develop a solution to stream the trucks data using Spark Structured Streaming.

First of all, lets get a Spark Session to work on

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Truck Data Wrangler").getOrCreate()
spark

## Schema

After getting the spark session, we'll define the schema of this Structured Streaming process:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `c0` | `integer` | The index key, used just as reference of the order |
| `event_type` | `string` | The event type accordingly to the categorization of the data |
| `label` | `string` | The label for data segmentation |
| `accel_x` | `double` | The X-axis accelerometer value |
| `accel_y` | `double` | The Y-axis accelerometer value |
| `accel_z` | `double` | The Z-axis accelerometer value |
| `gyro_roll` | `double` | The Roll-axis accelerometer value |
| `gyro_pitch` | `double` | The Pitch-axis accelerometer value |
| `gyro_yaw` | `double` | The Yaw-axis accelerometer value |

## Schema on Apache Spark

In [None]:
from pyspark.sql.types import *

csvSchema = StructType([
    StructField("c0", StringType(), True),
    StructField("event_type", StringType(), False),
    StructField("label", StringType(), False),
    StructField("timestamp", LongType(), False),
    StructField("accel_x", DoubleType(), False),
    StructField("accel_y", DoubleType(), False),
    StructField("accel_z", DoubleType(), False),
    StructField("gyro_roll", DoubleType(), False),
    StructField("gyro_pitch", DoubleType(), False),
    StructField("gyro_yaw", DoubleType(), False)
])

## Schema on TimescaleDB

For the database to serve as a read to visualize and query our truck data, we'll go with TimescaleDB.

In [4]:
!pip install psycopg2

Collecting psycopg2
[?25l  Downloading https://files.pythonhosted.org/packages/5c/1c/6997288da181277a0c29bc39a5f9143ff20b8c99f2a7d059cfb55163e165/psycopg2-2.8.3.tar.gz (377kB)
[K     |████████████████████████████████| 378kB 2.6MB/s eta 0:00:01
[31m    ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-o3eu0p9x/psycopg2/setup.py'"'"'; __file__='"'"'/tmp/pip-install-o3eu0p9x/psycopg2/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-install-o3eu0p9x/psycopg2/
    Complete output (23 lines):
    running egg_info
    creating pip-egg-info/psycopg2.egg-info
    writing pip-egg-info/psycopg2.egg-info/PKG-INFO
    writing dependency_links to pip-egg-info/psycopg2.egg-info/dependency_links.txt
    writing top

In [5]:
import psycopg2
from config import parse_config
from sql_queries import create_jerked_truck_events, drop_jerked_truck_events_table

configs = parse_config()

print(configs['timescaledb']['host'])

# connect to default database
conn = psycopg2.connect("host={} port={} dbname={} user={} password={}".format( \
    configs['timescaledb']['host'], \
    configs['timescaledb']['port'], \
    configs['timescaledb']['db'], \
    configs['timescaledb']['user'], \
    configs['timescaledb']['password'], \
))
conn.set_session(autocommit=True)
cur = conn.cursor()

# create sparkify database with UTF8 encoding
cur.execute(drop_jerked_truck_events_table)
cur.execute(create_jerked_truck_events)

ModuleNotFoundError: No module named 'psycopg2'

## Loading the data

We will test load the data just to see if the schema is compatible with the stream file source.

In [None]:
truck_events_df = spark.read.schema(csvSchema).csv('data/unified.csv', header=True)
truck_events_df.createOrReplaceTempView("truck_events")

truck_events_df.limit(10).toPandas()

## Stream Processing

Now that we tested the schema by loading our default `data/unified.csv`, we have to set the stream processing options and actions.

In [None]:
inputPath = 'data/'

rawRecords = (
    spark
        .readStream
        .schema(csvSchema)
        .option("maxFilesPerTrigger", 1)
        .json(inputPath)
)

### Generating jerk data as stream flow in

Essentially, we need to be calculating the jerk values and the flags (is_accelerating, is_breaking, is_turning_right and is_turning_left), however the streaming data frame don't support partitioning/ordering windows with non-time based column types. For that reason we will have to explode that columns in another table using the `forEachBatch` callback.

In [None]:
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql.window import Window

jerk_truck_events_df = rawRecords

jerk_truck_events_df = jerk_truck_events_df.withColumn(
    "date_timestamp",
    F.to_date(F.from_unixtime(((col("timestamp") / 1000) / 1000), 'yyyy-MM-dd HH:mm:ss.SSS'))
)

#column_list = ["timestamp", "event_type","label"]


jerk_truck_events_df.printSchema()

In [None]:
def explodeJerkColumns(df, epochId):
    global configs
    jerk_truck_events_df = df
    
    column_list = ["event_type","label"]
    
    win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("timestamp")

    columns_that_needs_latest_values = ['accel_x', 'accel_y', 'accel_z', 'timestamp']

    for column_name in columns_that_needs_latest_values:
        jerk_truck_events_df = jerk_truck_events_df.withColumn("last_" + column_name, F.lag(col(column_name)).over(win_spec))

    # x axis
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "jerk_x", 
        F.when(F.isnull(col("last_accel_x")), 0)
         .when(F.isnull(col("last_timestamp")), 0)
         .otherwise((col("accel_x") - col("last_accel_x")) / (col("timestamp") - col("last_timestamp")))
    )

    # y axis
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "jerk_y", 
        F.when(F.isnull(col("last_accel_y")), 0)
         .when(F.isnull(col("last_timestamp")), 0)
         .otherwise((col("accel_y") - col("last_accel_y")) / (col("timestamp") - col("last_timestamp")))
    )

    # z axis
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "jerk_z", 
        F.when(F.isnull(col("last_accel_z")), 0)
         .when(F.isnull(col("last_timestamp")), 0)
         .otherwise((col("accel_z") - col("last_accel_z")) / (col("timestamp") - col("last_timestamp")))
    )

    # adding the is_accelerating flag
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "is_accelerating",
        F.when(F.isnull(col("jerk_x")), 0)
         .when(col("jerk_x") > 0, 1)
         .otherwise(0)
    )

    # adding the is_breaking flag
    jerk_truck_events_df = jerk_truck_events_df.withColumn(
        "is_breaking",
        F.when(F.isnull(col("jerk_x")), 0)
         .when(col("jerk_x") < 0, 1)
         .otherwise(0)
    )
    
    dbhost = configs['timescaledb']['host']
    dbport = configs['timescaledb']['port']
    dbname = configs['timescaledb']['db']
    dbuser = configs['timescaledb']['user']
    dbpass = configs['timescaledb']['password']
    url = "jdbc:postgresql://"+dbhost+":"+dbport+"/"+dbname
    properties = {
        "driver": "org.postgresql.Driver",
        "user": dbuser,
        "password": dbpass
    }

    jerk_truck_events_df.write.jdbc(url=url, table="jerked_truck_events", mode="append",
                          properties=properties)

streamingIn = jerk_truck_events_df \
    .writeStream \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", ".spark-stream-checkpoint/") \
    .foreachBatch(explodeJerkColumns) \
    .start(path=inputPath)

In [None]:
#spark.sql("SELECT * FROM jerked_truck_events").limit(10).toPandas()