In [45]:
import pandas as pd
import numpy as np
from IPython.display import display
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

# EDA 1: Understanding the Data

Our complete dataset consists of nine CSV files, all defined in the data dictionary provided by Honda. Some files contain time-series logs of kinematics and events; the others are dictionaries that decode features of the log data. A brief description of each file follows.

### Logs:
- __`Summary.csv`__: Aggregated summary per vehicle per trip. Each ignition cycle generates only one summary record.
- __`EvtWarn.csv`__: Log of all the events observed by the host vehicle. Each event of interest creates a one-record snapshot of the kinematics.
- __`Host.csv`__: Periodic log of the host vehicle’s internal signals.
- __`RvBsm.csv`__: Messages received by the host vehicle from remote vehicles.
- __`Spat.csv`__: Messages received by the host vehicle from the intersection unit, one record every 0.5 s.

### Dictionaries:
- __`EventAppID.csv`__: Type of event/alert
- __`AlertLevel.csv`__: Level of the event
- __`RvZone.csv`__: Relative location of the remote vehicle
- __`RvBasicVehClass.csv`__: Basic type/class of the remote vehicle

In this notebook, we do some preliminary analysis to get a better understanding of the data and of how the different tables relate to one another.

## Part 1: Log Data
In this part, we look at the log data to better understand how the tables are connected. We look to summarize the following for each log data table:

1. Scope
2. Granularity
3. Representation

We hope that this will help paint a clearer picture of what our data represents.

### A) Feature Summaries
In the following cells, we determine the features in each log and where they overlap. 

In [46]:
# Load Data
print('loading summary...')
summary = pd.read_csv('../Data/Summary.csv')
print('loading host (this one takes a bit)...')
host = pd.read_csv('../Data/Host.csv')
print('loading rvbsm...')
rvbsm = pd.read_csv('../Data/RvBsm.csv')
print('loading evtwarn...')
evtwarn = pd.read_csv('../Data/EvtWarn.csv')
print('loading spat...')
spat = pd.read_csv('../Data/Spat.csv')
print('loading rvzone...')
rvzone = pd.read_csv('../Data/RvZone.csv')
print('loading vehclass...')
vehclass = pd.read_csv('../Data/RvBasicVehClass.csv')
print('loading alertlevel...')
alertlevel = pd.read_csv('../Data/AlertLevel.csv')
print('loading eventappid...')
eventappid = pd.read_csv('../Data/EventAppID.csv')
print('done!')

loading summary...
loading host (this one takes a bit)...
loading rvbsm...
loading evtwarn...
loading spat...
loading rvzone...
loading vehclass...
loading alertlevel...
loading eventappid...
done!


In [47]:
# Create a DataFrame that summarizes the overlap of features in our data. 
# Note that the kinematic features of RvBsm are set with an Rv prefix in corresponding EvtWarn data
logs = {
       'Summary': summary, 
       'Host': host, 
       'RvBsm': rvbsm, 
       'EvtWarn': evtwarn, 
       'Spat': spat
}

log_cols = {
       'Summary': [], 
       'Host': [],
       'RvBsm': [], 
       'EvtWarn': [], 
       'Spat': []
}
all_cols = []
for name in logs:
    cols = list(logs[name].columns)
    for col in cols:
        if col not in all_cols:
            all_cols.append(col)

for name in logs:
    these_cols = list(logs[name].columns)
    col_indicators = []
    for col in all_cols:
        if col in these_cols:
            col_indicators.append(1)
        else:
            col_indicators.append(0)
    log_cols[name] = col_indicators

feature_summary = pd.DataFrame(log_cols, index = all_cols)
feature_summary

Feature,Summary,Host,RvBsm,EvtWarn,Spat
Device,1,1,1,1,1
Trip,1,1,1,1,1
StartTime,1,0,0,0,0
Endtime,1,0,0,0,0
UTCTime,1,0,0,0,0
TripStart,1,0,0,0,0
TodTripStart,1,0,0,0,0
Time,0,1,1,1,1
NativeFlag,0,1,1,1,1
LocalTimeMS,0,1,1,1,1
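
The indicator table can also be built more compactly. The following is a minimal sketch, assuming small stand-in DataFrames (`toy_logs`) in place of the real logs and a hypothetical helper name `feature_overlap`; only the column names matter for this step.

```python
import pandas as pd

# Hypothetical stand-ins for the real logs; only the column names matter here.
toy_logs = {
    "Summary": pd.DataFrame(columns=["Device", "Trip", "StartTime"]),
    "Host": pd.DataFrame(columns=["Device", "Trip", "Time"]),
    "Spat": pd.DataFrame(columns=["Device", "Trip", "Time"]),
}


def feature_overlap(logs):
    """Return a 0/1 table: rows are column names, columns are log names."""
    # dict.fromkeys keeps the first-seen order of column names across all logs.
    all_cols = list(dict.fromkeys(col for df in logs.values() for col in df.columns))
    return pd.DataFrame(
        {name: [int(col in df.columns) for col in all_cols] for name, df in logs.items()},
        index=all_cols,
    )


toy_overlap = feature_overlap(toy_logs)

# Columns present in every log are the ones whose row is all ones.
shared = toy_overlap.index[toy_overlap.all(axis=1)].tolist()
print(shared)
```

Filtering the rows with `all(axis=1)` gives the columns shared by every log, which is a quick way to spot candidate join keys.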


In [48]:
# The number of features in each log
feature_summary.sum()

Summary     7
Host       29
RvBsm      17
EvtWarn    35
Spat       10
dtype: int64

### B) Devices, Trips, and Times

The feature summary shows that `Device` and `Trip` appear in all five log datasets, while `Time` appears in every log except Summary.

- __`Device`__: the host vehicle (connected vehicle) unique id
- __`Trip`__: the unique trip id for that host vehicle
- __`Time`__: the time since start of trip in centiseconds

Together, these three columns are the natural candidate for a primary key in the time-series logs. Our main questions are:

 1. Do all the logs contain data about the same trips and devices?
 2. Do all the logs contain the same time range?
 3. Do all the logs have any common time ranges?

From the data dictionary, we know that Summary contains one row for each unique trip of every vehicle in the dataset, so we use the devices and trips in Summary as the reference when checking the other four log DataFrames.
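
One way to answer the first question at the (`Device`, `Trip`) level is a left merge with an indicator column. This is a rough sketch on toy data rather than the real logs; `trips_missing_from` is a hypothetical helper name.

```python
import pandas as pd

# Toy data, not the real logs.
toy_summary = pd.DataFrame({"Device": [1, 1, 2], "Trip": [10, 11, 10]})
toy_log = pd.DataFrame({"Device": [1, 1, 1], "Trip": [10, 10, 11], "Time": [0, 10, 0]})


def trips_missing_from(summary_df, log_df):
    """Return the (Device, Trip) pairs in summary_df that never appear in log_df."""
    keys = ["Device", "Trip"]
    summary_keys = summary_df[keys].drop_duplicates()
    log_keys = log_df[keys].drop_duplicates()
    # indicator=True adds a "_merge" column saying where each key pair was found.
    merged = summary_keys.merge(log_keys, on=keys, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)


missing = trips_missing_from(toy_summary, toy_log)
print(missing)
```

Applied to each log in turn, the same helper would list exactly which trips a log lacks, rather than only how many.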

#### Raw Data

In [49]:
# Function for interacting with the df output in the notebook
# source: Data 100 
def df_interact(df):
    '''
    Outputs sliders that show rows and columns of df
    '''
    def peek(row=0, col=0):
        return df.iloc[row:row + 5, col:col + 6]
    interact(peek, row=(0, len(df), 5), col=(0, len(df.columns) - 6))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))

for df_name in logs:
    print(df_name)
    df_interact(logs[df_name])
    print('\n')

Summary



(4804 rows, 7 columns) total


Host



(11418494 rows, 29 columns) total


RvBsm



(648149 rows, 17 columns) total


EvtWarn



(2461 rows, 35 columns) total


Spat



(56568 rows, 10 columns) total




#### Devices

In [70]:
# See if any devices are missing that are in Summary
summary_devices = list(summary["Device"].unique())
host_devices = list(host["Device"].unique())
rvbsm_devices = list(rvbsm["Device"].unique())
evtwarn_devices = list(evtwarn["Device"].unique())
spat_devices = list(spat["Device"].unique())

all_devices = {
    'host':host_devices, 
    'rvbsm':rvbsm_devices, 
    'evtwarn':evtwarn_devices, 
    'spat':spat_devices
}

print('Missing Devices:')
for df in all_devices:
    for sum_device in summary_devices:
        if sum_device not in all_devices[df]:
            print(df, sum_device)

Missing Devices:
evtwarn 2998
spat 2998


In [88]:
# See how many unique trips are in each dataframe
spat_dev = []
spat_dev_lens = []
evtwarn_dev = []
evtwarn_dev_lens = []
sum_dev = []
sum_dev_lens = []
host_dev = []
host_dev_lens = []
rvbsm_dev = []
rvbsm_dev_lens = []

for device in summary_devices:
    if device != 2998:
        df_dev = list(spat[spat["Device"] == device].drop("Device", axis = 1)["Trip"].unique())
        spat_dev.append(df_dev)
        spat_dev_lens.append(len(df_dev))
        
        df_dev = list(evtwarn[evtwarn["Device"] == device].drop("Device", axis = 1)["Trip"].unique())
        evtwarn_dev.append(df_dev)
        evtwarn_dev_lens.append(len(df_dev))
    else:
        spat_dev.append(None)
        spat_dev_lens.append(0)
        evtwarn_dev.append(None)
        evtwarn_dev_lens.append(0)
    
    df_dev = list(summary[summary["Device"] == device].drop("Device", axis = 1)["Trip"].unique())
    sum_dev.append(df_dev)  
    sum_dev_lens.append(len(df_dev))
    
    df_dev = list(host[host["Device"] == device].drop("Device", axis = 1)["Trip"].unique())
    host_dev.append(df_dev)
    host_dev_lens.append(len(df_dev))
    
    df_dev = list(rvbsm[rvbsm["Device"] == device].drop("Device", axis = 1)["Trip"].unique())
    rvbsm_dev.append(df_dev)
    rvbsm_dev_lens.append(len(df_dev))

all_trips = pd.DataFrame({
    'Summary' : sum_dev,
    'Host' : host_dev,
    'RvBsm' : rvbsm_dev,
    'EvtWarn': evtwarn_dev,
    'Spat' : spat_dev,  
}, index = summary_devices)

trip_counts = pd.DataFrame({
    'Summary_Total_Trips' : sum_dev_lens,
    'Host_Total_Trips' : host_dev_lens,
    'RvBsm_Total_Trips' : rvbsm_dev_lens,
    'EvtWarn_Total_Trips': evtwarn_dev_lens, 
    'Spat_Total_Trips' : spat_dev_lens 
}, index = summary_devices)

trip_counts

Device,Summary_Total_Trips,Host_Total_Trips,RvBsm_Total_Trips,EvtWarn_Total_Trips,Spat_Total_Trips
2004,143,143,44,4,28
2008,38,38,21,6,20
2017,149,149,32,2,18
2107,307,307,45,2,12
2147,122,122,57,3,50
2218,279,279,92,3,88
2233,171,171,42,3,27
2331,271,271,86,8,40
2348,7,7,3,1,1
2494,148,148,78,7,57
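
The per-device trip counts can also be obtained with `groupby` and `nunique`, which avoids the explicit loop and the special case for devices missing from a log. This is a sketch on toy data; `trips_per_device` is a hypothetical helper name.

```python
import pandas as pd

# Toy data, not the real logs.
toy_logs = {
    "Summary": pd.DataFrame({"Device": [1, 1, 2], "Trip": [10, 11, 10]}),
    "Host": pd.DataFrame({"Device": [1, 1, 1, 2], "Trip": [10, 10, 11, 10]}),
    "EvtWarn": pd.DataFrame({"Device": [1], "Trip": [11]}),
}


def trips_per_device(logs):
    """Count unique trips per device in each log; absent devices count as 0."""
    counts = {name: df.groupby("Device")["Trip"].nunique() for name, df in logs.items()}
    # Building a DataFrame from Series aligns them on Device; gaps become NaN.
    return pd.DataFrame(counts).fillna(0).astype(int)


toy_trip_counts = trips_per_device(toy_logs)
print(toy_trip_counts)
```

A device that never appears in a log simply ends up with a count of 0 after alignment.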


In [103]:
# From df_interact, there seem to be some rv_bsm data with the same timestamp. 
# check to see how common that is in the data
rvbsm_repeated_times = rvbsm.loc[:,["Device", "Trip", "Time"]].groupby(["Device", "Trip"]).count()["Time"] - rvbsm.loc[:,["Device", "Trip", "Time"]].groupby(["Device", "Trip"]).nunique()["Time"]
rvbsm_repeated_sum = rvbsm_repeated_times.sum()

print("Number of repeated timestamps in the same trip in RvBsm:", rvbsm_repeated_sum)

Number of repeated timestamps in the same trip in RvBsm: 356445


In [107]:
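# What share of (Device, Trip) groups contain at least one repeated timestamp?
# Note: this is a fraction of trips (not of rows), on a 0-1 scale rather than a percentage.
percent_rvbsm_repeated = rvbsm_repeated_times.astype(bool).mean()
print("Fraction of RvBsm trips with at least one repeated timestamp:", percent_rvbsm_repeated)

Fraction of RvBsm trips with at least one repeated timestamp: 0.5897149245388486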
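
Repeated timestamps can be measured per row or per trip, and the two shares answer different questions. The sketch below uses toy data (`toy_rvbsm`) to show both, relying on `DataFrame.duplicated`, which flags every row after the first occurrence of a key.

```python
import pandas as pd

# Toy data, not the real RvBsm log.
toy_rvbsm = pd.DataFrame({
    "Device": [1, 1, 1, 1, 2, 2],
    "Trip":   [10, 10, 10, 11, 10, 10],
    "Time":   [0, 0, 10, 0, 5, 6],
})

key = ["Device", "Trip", "Time"]

# True for each row whose (Device, Trip, Time) was already seen earlier.
extra_rows = toy_rvbsm.duplicated(subset=key)

# Row-level share: fraction of rows that repeat an earlier timestamp.
row_share = extra_rows.mean()

# Trip-level share: fraction of (Device, Trip) groups with any repeat.
trip_share = extra_rows.groupby([toy_rvbsm["Device"], toy_rvbsm["Trip"]]).any().mean()

print(extra_rows.sum(), row_share, trip_share)
```

When `Time` has no missing values, `extra_rows.sum()` matches the count-minus-nunique total computed above.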


### Summary of Findings:

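1. Every device in Summary also appears in Host and RvBsm; device 2998 is the only one missing from EvtWarn and Spat.
2. Not every trip appears in every log. For the devices shown, Host has the same trip count as Summary, while RvBsm, EvtWarn, and Spat cover only a subset of trips.
3. RvBsm contains repeated `Time` values within the same trip (356,445 repeats), so `Device`, `Trip`, and `Time` alone do not uniquely identify its rows.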

### C) The Dictionaries

In [115]:
dicts = {
    'RvZone': rvzone,
    'RvBasicVehClass': vehclass,
    'AlertLevel': alertlevel,
    'EventAppID': eventappid
}

In [118]:
rvzone

Unnamed: 0,Id,Name
0,0,Unclassified
1,1,Ahead
2,2,Behind
3,3,Oncoming
4,4,AheadLeft
5,5,AheadRight
6,6,BehindLeft
7,7,BehindRight
8,8,OncomingLeft
9,9,OncomingRight


In [119]:
vehclass

Unnamed: 0,Id,Name
0,0,Car
1,20,LightTruck
2,25,HeavyTruck
3,40,Motorcycle
4,60,EmergencyVeh
5,82,Pedestrian
6,85,Bicycle


In [120]:
alertlevel

Unnamed: 0,Id,Name
0,2,Inform
1,3,Warning


In [121]:
eventappid

Unnamed: 0,Id,Name
0,0,EEBL
1,1,FCW
2,2,IMA
3,3,BSW/LCW
4,4,DNPW
5,5,CLW
6,6,
7,7,RSZW
8,8,CSW
9,9,RLVW
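
Each dictionary maps an integer `Id` to a `Name`, so a coded column in a log can be decoded with `Series.map`. The sketch below reuses the two `AlertLevel` entries shown above; the event table `toy_events` and its column name `AlertLevel` are assumptions for illustration and may not match the actual EvtWarn schema.

```python
import pandas as pd

# Lookup table with the two entries shown in the AlertLevel dictionary.
toy_alertlevel = pd.DataFrame({"Id": [2, 3], "Name": ["Inform", "Warning"]})

# Hypothetical event rows; the column name "AlertLevel" is an assumption.
toy_events = pd.DataFrame({"AlertLevel": [3, 2, 3, 1]})


def decode(codes, lookup):
    """Translate integer codes to names; codes absent from lookup become NaN."""
    return codes.map(lookup.set_index("Id")["Name"])


toy_events["AlertLevelName"] = decode(toy_events["AlertLevel"], toy_alertlevel)
print(toy_events)
```

Codes with no dictionary entry come back as NaN, which makes undocumented values easy to find with `isna()`.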
