In [21]:
import pandas as pd
import numpy as np
from IPython.display import display
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

# EDA 1: Understanding the Data

Our complete dataset consists of nine CSV files defined in the data dictionary provided by Honda. Some files contain time-series logs of kinematics and events; the others are lookup dictionaries that decode coded features in the log data. A brief description of each file follows.

### Logs:
- __`Summary.csv`__: Aggregated summary per vehicle per trip. Each ignition cycle generates only one summary record.
- __`EvtWarn.csv`__: Log of all the events observed by the host vehicle. Each event of interest creates a one-record snapshot of the kinematics.
- __`Host.csv`__: Periodic log of the host vehicle's internal signals.
- __`RvBsm.csv`__: Messages received by the host vehicle from remote vehicles.
- __`Spat.csv`__: Messages received from the intersection unit. The host receives one record every 0.5 sec.

### Dictionaries:
- __`EventAppID.csv`__: Type of event/alert
- __`AlertLevel.csv`__: Level of the event
- __`RvZone.csv`__: Relative location of the remote vehicle
- __`RvBasicVehClass.csv`__: Basic type/class of the remote vehicle

In this notebook, we do some preliminary analysis to better understand the data and how the different tables relate to one another.

## Part 1: Log Data
In this part, we look at the log data to better understand how the tables are connected. We look to summarize the following for each log data table:

1. Scope
2. Granularity
3. Representation

We hope that this will help paint a clearer picture of what our data represents.
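As a rough illustration of what "scope" and "granularity" could mean in practice, the sketch below counts rows, devices, and trips for one log table and reports the median number of rows per trip. The helper name `describe_log` and the toy values are hypothetical and are not part of the Honda data; the sketch only assumes that a log has `Device` and `Trip` columns.

```python
import pandas as pd

def describe_log(df, name):
    """Return simple scope and granularity figures for one log table.

    Hypothetical helper: assumes the table has 'Device' and 'Trip' columns.
    """
    # Number of rows recorded for each (Device, Trip) pair
    per_trip = df.groupby(["Device", "Trip"]).size()
    return {
        "log": name,
        "rows": len(df),                      # scope: total records
        "devices": df["Device"].nunique(),    # scope: distinct host vehicles
        "trips": len(per_trip),               # scope: distinct (Device, Trip) pairs
        "median_rows_per_trip": float(per_trip.median()),  # granularity
    }

# Toy stand-in for a log table (made-up values, not the real data)
toy_host = pd.DataFrame({
    "Device": [2004, 2004, 2004, 2008],
    "Trip":   [1, 1, 2, 1],
    "Time":   [0, 10, 0, 0],
})

toy_summary = describe_log(toy_host, "Host")
print(toy_summary)
```

On the real tables, calling the same helper on each entry of a dictionary of logs would give one comparable row per log.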

### A) Feature Summaries
In the following cells, we list the features in each log and identify where they overlap.

In [22]:
# Load the five log tables and the four dictionary tables
print('loading summary...')
summary = pd.read_csv('../Data/Summary.csv')
print('loading host (this one takes a bit)...')
host = pd.read_csv('../Data/Host.csv')
print('loading rvbsm...')
rvbsm = pd.read_csv('../Data/RvBsm.csv')
print('loading evtwarn...')
evtwarn = pd.read_csv('../Data/EvtWarn.csv')
print('loading spat...')
spat = pd.read_csv('../Data/Spat.csv')
print('loading rvzone...')
rvzone = pd.read_csv('../Data/RvZone.csv')
print('loading vehclass...')
vehclass = pd.read_csv('../Data/RvBasicVehClass.csv')
print('loading alertlevel...')
alertlevel = pd.read_csv('../Data/AlertLevel.csv')
print('loading eventappid...')
eventappid = pd.read_csv('../Data/EventAppID.csv')
print('done!')

In [32]:
# Create a DataFrame that summarizes the overlap of features in our data. 
# Note that the kinematic features of RvBsm are set with an Rv prefix in corresponding EvtWarn data
logs = {
       'Summary': summary, 
       'Host': host, 
       'RvBsm': rvbsm, 
       'EvtWarn': evtwarn, 
       'Spat': spat
}

log_cols = {
       'Summary': [], 
       'Host': [],
       'RvBsm': [], 
       'EvtWarn': [], 
       'Spat': []
}
# Collect every distinct column name across the logs, in first-seen order
all_columns = []
for name in logs:
    cols = list(logs[name].columns)
    for col in cols:
        if col not in all_columns:
            all_columns.append(col)

for name in logs:
    these_cols = list(logs[name].columns)
    col_indicators = []
    for col in all_columns:
        if col in these_cols:
            col_indicators.append(1)
        else:
            col_indicators.append(0)
    log_cols[name] = col_indicators

feature_summary = pd.DataFrame(log_cols, index = all_columns)
feature_summary

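| | Summary | Host | RvBsm | EvtWarn | Spat |
|---|---|---|---|---|---|
| Device | 1 | 1 | 1 | 1 | 1 |
| Trip | 1 | 1 | 1 | 1 | 1 |
| StartTime | 1 | 0 | 0 | 0 | 0 |
| Endtime | 1 | 0 | 0 | 0 | 0 |
| UTCTime | 1 | 0 | 0 | 0 | 0 |
| TripStart | 1 | 0 | 0 | 0 | 0 |
| TodTripStart | 1 | 0 | 0 | 0 | 0 |
| Time | 0 | 1 | 1 | 1 | 1 |
| NativeFlag | 0 | 1 | 1 | 1 | 1 |
| LocalTimeMS | 0 | 1 | 1 | 1 | 1 |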
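The same indicator matrix can be built more compactly, and the features shared by every log can then be read off directly. This is a minimal sketch on toy tables; `feature_overlap` is a hypothetical helper name and the column sets below are trimmed stand-ins, not the full logs.

```python
import pandas as pd

def feature_overlap(tables):
    """Build a 0/1 matrix: one row per column name, one column per table."""
    # Distinct column names across all tables, in first-seen order
    all_columns = []
    for df in tables.values():
        for col in df.columns:
            if col not in all_columns:
                all_columns.append(col)
    # 1 if the table has the column, 0 otherwise
    indicators = {
        name: [int(col in df.columns) for col in all_columns]
        for name, df in tables.items()
    }
    return pd.DataFrame(indicators, index=all_columns)

# Toy tables with only a few columns each (made-up stand-ins)
toy_tables = {
    "Summary": pd.DataFrame(columns=["Device", "Trip", "StartTime"]),
    "Host": pd.DataFrame(columns=["Device", "Trip", "Time"]),
}

overlap = feature_overlap(toy_tables)
# Features present in every table: the row sum equals the number of tables
shared = list(overlap.index[overlap.sum(axis=1) == overlap.shape[1]])
print(overlap)
print(shared)
```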


In [33]:
# The number of features in each log
feature_summary.sum()

Summary     7
Host       29
RvBsm      17
EvtWarn    35
Spat       10
dtype: int64

### B) Devices, Trips, and Times

We see that `Device` and `Trip` appear in all five log datasets, while `Time` appears in every log except `Summary`.

- __`Device`__: the host vehicle (connected vehicle) unique id
- __`Trip`__: the unique trip id for that host vehicle
- __`Time`__: the time since start of trip in centiseconds

Using all three together gives the primary key for each row of data. Our main questions are: 

 1. Do all the logs contain data about the same trips and devices?
 2. Do all the logs contain the same time range?
 3. Do all the logs have any common time ranges?
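The primary-key claim and the two time-range questions can be checked with a sketch along these lines. It counts rows that repeat a `(Device, Trip, Time)` combination and computes, per trip, the time window covered by two logs and their overlap. The helper names and toy values are hypothetical; on the real logs the duplicate count may well be non-zero, which would mean the three columns alone do not identify a row.

```python
import pandas as pd

KEY = ["Device", "Trip", "Time"]

def count_duplicate_keys(df, key=KEY):
    """Number of rows that repeat an earlier (Device, Trip, Time) combination."""
    return int(df.duplicated(subset=key).sum())

def trip_time_ranges(df):
    """First and last Time value recorded for each (Device, Trip) pair."""
    return df.groupby(["Device", "Trip"])["Time"].agg(["min", "max"])

# Toy stand-ins for two logs (made-up values)
toy_host = pd.DataFrame({"Device": [2004] * 3, "Trip": [1, 1, 1], "Time": [0, 10, 20]})
toy_rvbsm = pd.DataFrame({"Device": [2004] * 3, "Trip": [1, 1, 1], "Time": [10, 10, 30]})

dup_host = count_duplicate_keys(toy_host)
dup_rvbsm = count_duplicate_keys(toy_rvbsm)

# Align the per-trip ranges of both logs on their shared (Device, Trip) index
ranges = trip_time_ranges(toy_host).join(
    trip_time_ranges(toy_rvbsm), lsuffix="_host", rsuffix="_rvbsm", how="inner"
)
# The common window starts at the later start and ends at the earlier end
ranges["overlap_start"] = ranges[["min_host", "min_rvbsm"]].max(axis=1)
ranges["overlap_end"] = ranges[["max_host", "max_rvbsm"]].min(axis=1)
print(dup_host, dup_rvbsm)
print(ranges)
```

A trip where `overlap_start` exceeds `overlap_end` would have no common time window between the two logs.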

From the data dictionary, we know that `Summary` contains one row for each unique trip of every vehicle in the dataset, so we compare the devices and trips in `Summary` against the other four log DataFrames.

#### Devices

In [49]:
summary_devices = list(summary["Device"].unique())
host_devices = list(host["Device"].unique())
rvbsm_devices = list(rvbsm["Device"].unique())
evtwarn_devices = list(evtwarn["Device"].unique())
spat_devices = list(spat["Device"].unique())

all_devices = {
    'host':host_devices, 
    'rvbsm':rvbsm_devices, 
    'evtwarn':evtwarn_devices, 
    'spat':spat_devices
}


# Report every device that appears in Summary but is missing from another log
for log_name in all_devices:
    for sum_device in summary_devices:
        if sum_device not in all_devices[log_name]:
            print(log_name, sum_device)
summary_devices

evtwarn 2998
spat 2998


[2004,
 2008,
 2017,
 2107,
 2147,
 2218,
 2233,
 2331,
 2348,
 2494,
 2496,
 2527,
 2533,
 2559,
 2584,
 2588,
 2627,
 2720,
 2804,
 2858,
 2936,
 2941,
 2969,
 2998,
 2999]
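The same comparison can be expressed with set differences, which also extends to `(Device, Trip)` pairs for the trip-level check. The sketch below uses a hypothetical helper `missing_keys` and made-up toy tables; it is one possible way to run the check, not necessarily the one used later in this analysis.

```python
import pandas as pd

def missing_keys(reference, others, columns):
    """For each table in `others`, list key tuples found in `reference` but not in it."""
    def key_set(df):
        # Distinct combinations of the key columns, as plain tuples
        return set(df[columns].drop_duplicates().itertuples(index=False, name=None))

    ref_keys = key_set(reference)
    return {name: sorted(ref_keys - key_set(df)) for name, df in others.items()}

# Toy stand-ins (made-up values)
toy_summary = pd.DataFrame({"Device": [2004, 2004, 2998], "Trip": [1, 2, 1]})
toy_others = {
    "host": pd.DataFrame({"Device": [2004, 2004, 2998], "Trip": [1, 2, 1]}),
    "evtwarn": pd.DataFrame({"Device": [2004], "Trip": [1]}),
}

missing_devices = missing_keys(toy_summary, toy_others, ["Device"])
missing_trips = missing_keys(toy_summary, toy_others, ["Device", "Trip"])
print(missing_devices)
print(missing_trips)
```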

In [None]:
# Reload the four dictionary tables
print('loading rvzone...')
rvzone = pd.read_csv('../Data/RvZone.csv')
print('loading vehclass...')
vehclass = pd.read_csv('../Data/RvBasicVehClass.csv')
print('loading alertlevel...')
alertlevel = pd.read_csv('../Data/AlertLevel.csv')
print('loading eventappid...')
eventappid = pd.read_csv('../Data/EventAppID.csv')
print('done!')
# Display the event/alert type lookup table
eventappid

| | Id | Name |
|---|---|---|
| 0 | 0 | EEBL |
| 1 | 1 | FCW |
| 2 | 2 | IMA |
| 3 | 3 | BSW/LCW |
| 4 | 4 | DNPW |
| 5 | 5 | CLW |
| 6 | 6 | |
| 7 | 7 | RSZW |
| 8 | 8 | CSW |
| 9 | 9 | RLVW |
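A dictionary table like this one can be used to decode the integer codes in a log. The sketch below maps a code column to the `Name` values with `Series.map`. The column name `EventAppId` in the toy events table is an assumption made for illustration (the actual column name in `EvtWarn` is not shown here), and the toy rows are made up; only the three `Id`/`Name` pairs are taken from the table above.

```python
import pandas as pd

# Subset of the lookup table shown above
toy_eventappid = pd.DataFrame({"Id": [0, 1, 2], "Name": ["EEBL", "FCW", "IMA"]})

# Toy events log; 'EventAppId' is a hypothetical column name
toy_events = pd.DataFrame({
    "Device": [2004, 2004, 2008],
    "Trip": [1, 1, 3],
    "EventAppId": [1, 2, 1],
})

# Build an Id -> Name mapping and decode the code column
id_to_name = toy_eventappid.set_index("Id")["Name"]
toy_events["EventName"] = toy_events["EventAppId"].map(id_to_name)

# Count how often each event type occurs
event_counts = toy_events["EventName"].value_counts()
print(toy_events)
print(event_counts)
```

Codes that have no entry in the lookup table would come out as `NaN` after the mapping, which makes gaps in the dictionary easy to spot.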
