# Truck Data Wrangler

Exploratory analysis of the dataset inside the `data` folder.

## Unifying the CSVs

First, we will combine all the CSV files into a single one. All of them share these columns:

 - Timestamp (`timestamp timestamp`)
 - Longitudinal acceleration (`accel_x decimal (12, 6)`)
 - Lateral acceleration (`accel_y decimal (12, 6)`)
 - Vertical acceleration (`accel_z decimal (12, 6)`)
 - Roll rotation (`gyro_roll decimal (12, 6)`)
 - Pitch rotation (`gyro_pitch decimal (12, 6)`)
 - Yaw rotation (`gyro_yaw decimal (12, 6)`)
 - Label (`label integer`)

We will also add one more column to identify which file each row comes from:

 - Event type (`event_type string`)
 
Although this is not a scalable approach, I will use pandas here, since it is enough for analyzing this specific dataset.

In [1]:
import pandas as pd

data_files = {
    'aggressive_bump_1550163148318484.csv': 'aggressive_bump',
    'aggressive_longitudinal_acceleration_1549653321089461.csv': 'aggressive_longitudinal_acceleration',
    'aggressive_turn_1549625320507325.csv': 'aggressive_turn',
    'normal_longitudinal_acceleration_1549908723215048.csv': 'normal_longitudinal_acceleration',
    'normal_mixed_1549901031015048.csv': 'normal_mixed',
    'normal_mixed_1550054269957615.csv': 'normal_mixed',
    'normal_turn_1549626293857325.csv': 'normal_turn'
}

read_frames = []

for data_file, event_type in data_files.items():
    tmp_df = pd.read_csv('raw_data/' + data_file)
    tmp_df['event_type'] = event_type
    tmp_df = tmp_df[['event_type', 'label', 'timestamp', 'accel_x', 'accel_y', 'accel_z', 'gyro_roll', 'gyro_pitch', 'gyro_yaw']]
    read_frames.append(tmp_df)

df = pd.concat(read_frames)
df.to_csv('data/unified.csv')
df.shape

FileNotFoundError: [Errno 2] File b'data/aggressive_bump_1550163148318484.csv' does not exist: b'data/aggressive_bump_1550163148318484.csv'
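
The `data_files` dictionary above maps every file name to its event type by hand, which makes typos easy. As a minimal sketch, assuming every file name follows the `<event_type>_<digits>.csv` pattern seen in the dictionary, the event type could be derived from the file name instead. The helper `event_type_from_filename` and the variable `derived_data_files` are hypothetical names, not part of the original notebook.

```python
import re

# Assumption: file names end with "_<digits>.csv", as in the dictionary above.
_SUFFIX = re.compile(r"_\d+\.csv$")


def event_type_from_filename(filename):
    """Return the file name without its numeric suffix and extension."""
    if not _SUFFIX.search(filename):
        raise ValueError("unexpected file name: " + filename)
    return _SUFFIX.sub("", filename)


# A few of the file names from the notebook, used as sample input.
file_names = [
    "aggressive_bump_1550163148318484.csv",
    "normal_mixed_1549901031015048.csv",
    "normal_mixed_1550054269957615.csv",
]

# Same shape as the hand-written dictionary: file name -> event type.
derived_data_files = {name: event_type_from_filename(name) for name in file_names}
```

Raising on an unexpected name is a deliberate choice in this sketch: a file that does not match the pattern is reported instead of silently getting a wrong event type.
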

# Using Spark to wrangle the data

After creating the Spark session, we will load `data/unified.csv`.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Truck Data Wrangler").getOrCreate()
spark

## Average acceleration by event type and label

To calculate the average acceleration, grouped by event type and label, we first read `data/unified.csv` and register it as a temporary SQL view:

In [None]:
from pyspark.sql.types import (
    DoubleType, IntegerType, LongType, StringType, StructField, StructType
)

csvSchema = StructType([
    StructField("c0", IntegerType(), True),  # unnamed index column written by df.to_csv
    StructField("event_type", StringType(), False),
    StructField("label", StringType(), False),
    StructField("timestamp", LongType(), False),
    StructField("accel_x", DoubleType(), False),
    StructField("accel_y", DoubleType(), False),
    StructField("accel_z", DoubleType(), False),
    StructField("gyro_roll", DoubleType(), False),
    StructField("gyro_pitch", DoubleType(), False),
    StructField("gyro_yaw", DoubleType(), False)
])

truck_events_df = spark.read.schema(csvSchema).csv('data/unified.csv', header=True)
truck_events_df.createOrReplaceTempView("truck_events")

After creating the SQL view, the following query will give us the grouped results we want to analyze:

In [None]:
spark.sql("""
SELECT 
    event_type, 
    label, 
    AVG(accel_x) AS avg_accel_x,
    AVG(accel_y) AS avg_accel_y,
    AVG(accel_z) AS avg_accel_z
FROM 
    truck_events 
GROUP BY
    event_type, label
ORDER BY
    event_type, label
""").toPandas()

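To make explicit what the `GROUP BY` query computes, here is a minimal pure-Python sketch of the same aggregation for a single column. The sample rows are made up for illustration, and `average_by_group` is a hypothetical helper, not something the notebook defines.

```python
from collections import defaultdict

# Made-up sample rows: (event_type, label, accel_x)
sample_rows = [
    ("normal_turn", "0", 1.0),
    ("normal_turn", "0", 3.0),
    ("aggressive_turn", "1", 4.0),
]


def average_by_group(rows):
    """Average the last field of each row, grouped by the leading fields."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for *key, value in rows:
        sums[tuple(key)] += value
        counts[tuple(key)] += 1
    # sorted() mirrors the ORDER BY event_type, label of the query
    return {key: sums[key] / counts[key] for key in sorted(sums)}


averages = average_by_group(sample_rows)
```

Each `(event_type, label)` pair ends up with one average, in the same order the query would return its rows.
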
## The maximum jerk timestamp

To find the timestamp of the maximum jerk, we first have to calculate the jerk, which is the rate of change of acceleration over time.

For two consecutive sensor readings, it can be approximated with this simple equation:

`jerk = (𝑎2 − 𝑎1) / (𝑡2 − 𝑡1)`

Where:
 - 𝑎2 is the acceleration of the current sensor reading
 - 𝑎1 is the acceleration of the previous sensor reading
 - 𝑡2 is the timestamp of the current sensor reading
 - 𝑡1 is the timestamp of the previous sensor reading
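
As a worked example of the equation, here is a small sketch with made-up readings. It assumes the timestamps are epoch values in microseconds, which is suggested by the later queries dividing `timestamp` by 1000 twice before calling `from_unixtime`. The unit of the acceleration values is not stated in the dataset description, so the result is only "acceleration units per microsecond". The function name `jerk` is chosen for illustration.

```python
def jerk(a1, a2, t1, t2):
    """Finite-difference jerk between two consecutive readings."""
    if t2 == t1:
        raise ValueError("readings share the same timestamp")
    return (a2 - a1) / (t2 - t1)


# Made-up readings taken 10 ms apart; timestamps in microseconds (assumption).
jerk_per_us = jerk(a1=0.5, a2=1.5, t1=1_000_000, t2=1_010_000)

# Convert from "per microsecond" to "per second" for easier reading.
jerk_per_s = jerk_per_us * 1_000_000
```

Because the timestamps are so large compared with the acceleration values, the raw jerk values are tiny; scaling them to a per-second rate makes them easier to compare.
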
 


In [None]:
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql.window import Window

column_list = ["event_type","label"]
win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("timestamp")

jerk_truck_events_df = truck_events_df

columns_that_needs_latest_values = ['accel_x', 'accel_y', 'accel_z', 'timestamp']

for column_name in columns_that_needs_latest_values:
    jerk_truck_events_df = jerk_truck_events_df.withColumn("last_" + column_name, F.lag(col(column_name)).over(win_spec))

jerk_truck_events_df.toPandas().head()
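
To illustrate what `F.lag` over the window does, the following pure-Python sketch reproduces the idea on made-up rows: within each `(event_type, label)` group, rows are ordered by timestamp and each row receives the values of the row before it. The first row of a group has no predecessor and gets `None`. The helper `with_previous` is hypothetical and only handles one acceleration column.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows: (event_type, label, timestamp, accel_x)
sample_rows = [
    ("normal_turn", "0", 30, 0.3),
    ("normal_turn", "0", 10, 0.1),
    ("aggressive_turn", "1", 20, 0.9),
    ("normal_turn", "0", 20, 0.2),
]


def with_previous(rows):
    """Append the previous accel_x and timestamp of the same group to each row."""
    result = []
    # Sort by partition keys, then by timestamp, like partitionBy + orderBy.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    for _, group in groupby(ordered, key=itemgetter(0, 1)):
        previous = None
        for row in group:
            if previous is None:
                last_accel, last_ts = None, None
            else:
                last_accel, last_ts = previous[3], previous[2]
            result.append(row + (last_accel, last_ts))
            previous = row
    return result


lagged_rows = with_previous(sample_rows)
```
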

Now that the previous values are appended to the data frame, we need to add one more column per axis with the calculated jerk:

In [None]:
# Time elapsed between the previous and the current reading
delta_t = col("timestamp") - col("last_timestamp")

df = jerk_truck_events_df

# Same formula on each axis; the first row of a window has no previous reading, so its jerk is set to 0
for axis in ["x", "y", "z"]:
    df = df.withColumn(
        "jerk_" + axis,
        F.when(F.isnull(col("last_accel_" + axis)) | F.isnull(col("last_timestamp")), 0)
         .otherwise((col("accel_" + axis) - col("last_accel_" + axis)) / delta_t)
    )

#df.toPandas().head()
df.describe("jerk_x").show()
df.describe("jerk_y").show()
df.describe("jerk_z").show()

#### Answering the question: the timestamp of the maximum jerk on each axis (with its event_type/label):

In [None]:
df.createOrReplaceTempView("jerked_truck_events")

spark.sql("""
SELECT 
    event_type, 
    label, 
    from_unixtime(cast(((timestamp / 1000) / 1000) as bigint),'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp,
    jerk_x
FROM 
    jerked_truck_events 
ORDER BY
    jerk_x DESC, event_type, label
LIMIT 1
""").toPandas()

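The query above returns the single largest `jerk_x` in the whole table. If one result per `event_type`/`label` pair is wanted instead, the idea can be sketched in plain Python as below. The rows are made up and `max_jerk_per_group` is a hypothetical helper.

```python
# Made-up rows: (event_type, label, timestamp, jerk_x)
sample_rows = [
    ("normal_turn", "0", 10, 0.2),
    ("normal_turn", "0", 20, 0.7),
    ("aggressive_turn", "1", 15, 1.4),
    ("aggressive_turn", "1", 25, 0.9),
]


def max_jerk_per_group(rows):
    """Return {(event_type, label): (timestamp, jerk)} for the largest jerk of each group."""
    best = {}
    for event_type, label, timestamp, jerk_value in rows:
        key = (event_type, label)
        # Keep the first row seen for a group, then only strictly larger values.
        if key not in best or jerk_value > best[key][1]:
            best[key] = (timestamp, jerk_value)
    return best


max_per_group = max_jerk_per_group(sample_rows)
```

This compares signed values, as the query does. If the strongest change in either direction is the goal, comparing absolute values would be the variation to try.
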
In [None]:

spark.sql("""
SELECT 
    event_type, 
    label, 
    from_unixtime(cast(((timestamp / 1000) / 1000) as bigint),'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp,
    jerk_y
FROM 
    jerked_truck_events 
ORDER BY
    jerk_y DESC, event_type, label
LIMIT 1
""").toPandas()

In [None]:

spark.sql("""
SELECT 
    event_type, 
    label, 
    from_unixtime(cast(((timestamp / 1000) / 1000) as bigint),'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp,
    jerk_z
FROM 
    jerked_truck_events 
ORDER BY
    jerk_z DESC, event_type, label
LIMIT 1
""").toPandas()

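One detail about the timestamp formatting in these queries: `from_unixtime` works on whole seconds, and the cast to `bigint` drops the fraction, so the `.SSS` part of the output is expected to always read `000`. The sketch below shows one way to keep the milliseconds, in plain Python rather than Spark SQL. It assumes the timestamps are epoch microseconds and formats them in UTC, whereas `from_unixtime` uses the Spark session time zone. `format_micros` is a hypothetical helper, and the sample value is the number from one of the file names, assumed here to be such a timestamp.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_micros(timestamp_us):
    """Format an epoch timestamp in microseconds as UTC with millisecond precision."""
    # Integer microseconds avoid the rounding of a float division by 1e6.
    moment = EPOCH + timedelta(microseconds=timestamp_us)
    # %f yields six digits (microseconds); drop the last three to keep milliseconds.
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


formatted = format_micros(1550163148318484)
```
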
In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import matplotlib.dates as mdates

df_jerk = spark.sql("""
SELECT 
    from_unixtime(cast(((timestamp / 1000) / 1000) as bigint),'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp,
    event_type, 
    label,
    jerk_x,
    jerk_y,
    jerk_z
FROM 
    jerked_truck_events 
ORDER BY
    jerked_truck_events.timestamp, event_type, label
""").toPandas()

def show_subplots(grouped_df_jerk):
    ncols=2
    nrows = int(np.ceil(grouped_df_jerk.ngroups/ncols))

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10,20), sharey=True)
    
    fig.autofmt_xdate()

    for (key, ax) in zip(grouped_df_jerk.groups.keys(), axes.flatten()):
        ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d ')
        ax.set_title(key)
        grouped_df_jerk.get_group(key).plot(kind='bar',ax=ax,x='timestamp')
        # place the legend on every subplot, not only on the last one
        ax.legend(loc='best')

    plt.show()

grouped_df_jerk = df_jerk[['timestamp', 'event_type', 'label', 'jerk_x']].groupby(['event_type', 'label'])

show_subplots(grouped_df_jerk)

In [None]:
grouped_df_jerk = df_jerk[['timestamp', 'event_type', 'label', 'jerk_y']].groupby(['event_type', 'label'])

show_subplots(grouped_df_jerk)

In [None]:
grouped_df_jerk = df_jerk[['timestamp', 'event_type', 'label', 'jerk_z']].groupby(['event_type', 'label'])

show_subplots(grouped_df_jerk)