# Inferring the Correct Vehicle Type with TensorFlow

<font color="aqua">Ruggero Fabbiano – 21 February 2021</font>

<font color="aqua">**Note: descriptive cells to be refined**</font>

## Introduction

Our previous analysis suggested that there are issues with the pedestrian objects in SUMo MoST. The idea behind this notebook is to use deep learning to classify the types of such objects correctly. We will use all the other vehicles as a learning set for our model, then use this model to identify "fake" pedestrians.

#### Why Deep Learning?

Our data set can be thought of as a collection of multivariate time-series (after proper manipulation), each describing a vehicle trip around the Monaco area. As we know, SKTimes is the "SKLearn version for time-series"; as a package conceived specifically for this kind of data, it is expected to give better results than deep learning for multivariate classification. So why approach this problem with deep learning?

The answer is that this article is meant precisely to provide a comparison with [the work of another author with SKTimes](notebook_link), in the hope that such a comparison will lead to interesting conclusions. We will therefore use TensorFlow to build a neural network for this problem.

## The Data

For ease of understanding, we use here the same data as in [this notebook](notebook_link). We also assume that all the "data cleaning" performed there has already been carried out.

We can therefore import the data, after installing and importing the required packages.

## Install Needed Packages

In [None]:
!pip install pandas

## Import Packages and Data

In [201]:
import pandas as pd

In [2]:
data = pd.read_pickle("MoST_0600_0830_processed.pkl")

data.head()

| | t | angle | ID | lane_edge | position | slope | speed | type | x | y | z | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 21600.0 | 50.48 | pedestrian_2-1_3985_tr | 153330#1_1 | 2.2 | -1.57 | 0.0 | moped | 4811.5 | 2704.21 | 143.81 | 06:00:00 |
| 3 | 21600.0 | 50.48 | pedestrian_2-1_3985 | 153330#1 | 2.2 | -1.55 | 0.0 | pedestrian | 4811.5 | 2704.21 | 143.81 | 06:00:00 |
| 4 | 21600.0 | 307.42 | pedestrian_2-1_5063_tr | 152413#3_1 | 2.3 | -6.2 | 0.0 | motorcycle | 2006.06 | 3313.49 | 400.4 | 06:00:00 |
| 5 | 21600.0 | 307.42 | pedestrian_2-1_5063 | 152413#3 | 2.3 | -6.2 | 0.0 | pedestrian | 2006.06 | 3313.49 | 400.4 | 06:00:00 |
| 6 | 21600.0 | 121.94 | pedestrian_2-1_5887_tr | -152557#1_1 | 2.3 | -0.04 | 0.0 | motorcycle | 6482.5 | 3763.28 | 57.6 | 06:00:00 |


## Framing the Problem

We want to classify vehicles. As we saw in [this notebook](notebook_link), there is something strange about pedestrians. There we re-classified them based on an intuition about what the possible bug could be; here we want to see whether a machine learning algorithm can do better.

We opted for deep learning as a complementary approach to [this notebook](notebook_link), which uses SKTimes, a package well suited to time-series classification that should perform better. Even with deep learning, we have two main strategies:
* extracting features from the multivariate time-series that represent each vehicle path, and then solving a "standard" classification problem on those features;
* using our dataframe as-is and feeding the time-series directly to the model as the data to be classified.

We start with the second approach, to exploit the time-series more fully, even if it could prove more complex / resource-intensive.
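To make the two strategies concrete, here is a minimal sketch on a made-up toy dataframe. The values and the chosen summary statistics are illustrative assumptions, not taken from the MoST data: the first option collapses each trip into one row of features, the second keeps one ordered sequence per object.

```python
import pandas as pd

# Toy data with the same column names as our dataframe (values are invented)
toy = pd.DataFrame({
    "t":     [0.0, 5.0, 10.0, 0.0, 5.0],
    "ID":    ["a", "a", "a", "b", "b"],
    "speed": [0.0, 3.0, 4.5, 10.0, 30.0],
    "type":  ["pedestrian"] * 3 + ["moped"] * 2,
})

# Strategy 1: feature extraction, one row of summary statistics per object
features = toy.groupby("ID")["speed"].agg(["mean", "max", "std"])

# Strategy 2: keep the raw time-series, one ordered sequence per object
sequences = {
    obj_id: trip.sort_values("t")["speed"].tolist()
    for obj_id, trip in toy.groupby("ID")
}

print(features)
print(sequences)
```

With the first representation any tabular classifier can be used, at the price of choosing the statistics by hand; the second leaves it to the model to find the relevant temporal patterns.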

## Selecting Features

The first step is therefore to select the features we want to keep; let's recap the available columns:

In [5]:
data.columns

Index(['t', 'angle', 'ID', 'lane_edge', 'position', 'slope', 'speed', 'type',
       'x', 'y', 'z', 'time'],
      dtype='object')

* Concerning time, we keep only one of the two columns (<font color="aqua">which one is better though? Lighter / more easily exploitable?</font>).
* ID is of course needed, to know which samples belong to the same time-series.
* lane_edge is probably useful, since the list of lanes may help to interpret the other variables.
* angle, position, slope, x, y, z are (probably) determined by the lane, so not very interesting.
* speed is needed for sure.
* type is our target.

Let's proceed accordingly:

In [7]:
data.drop(['angle', 'position', 'slope', 'x', 'y', 'z', 'time'], axis=1, inplace=True)

data.head()

| | t | ID | lane_edge | speed | type |
|---|---|---|---|---|---|
| 2 | 21600.0 | pedestrian_2-1_3985_tr | 153330#1_1 | 0.0 | moped |
| 3 | 21600.0 | pedestrian_2-1_3985 | 153330#1 | 0.0 | pedestrian |
| 4 | 21600.0 | pedestrian_2-1_5063_tr | 152413#3_1 | 0.0 | motorcycle |
| 5 | 21600.0 | pedestrian_2-1_5063 | 152413#3 | 0.0 | pedestrian |
| 6 | 21600.0 | pedestrian_2-1_5887_tr | -152557#1_1 | 0.0 | motorcycle |


We could derive additional variables, such as an approximate acceleration; however, since we are working directly with time-series, the model should already be able to capture this time-related behaviour, so this should not be necessary.
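For reference, should such a derived variable turn out to be useful, a per-object finite difference is one possible way to approximate it. The sketch below uses an invented toy dataframe and assumes that speed is in km/h and t in seconds; neither the data nor the new `acceleration` column name comes from the original notebook.

```python
import pandas as pd

# Toy data (invented values); speed in km/h and t in seconds are assumptions
toy = pd.DataFrame({
    "t":     [0.0, 5.0, 10.0, 0.0, 5.0],
    "ID":    ["a", "a", "a", "b", "b"],
    "speed": [0.0, 3.6, 5.4, 18.0, 36.0],
})

toy = toy.sort_values(["ID", "t"])
by_object = toy.groupby("ID")

# Finite difference within each object; the first sample of each trip is NaN
dv = by_object["speed"].diff() / 3.6   # km/h -> m/s
dt = by_object["t"].diff()             # s
toy["acceleration"] = dv / dt          # m/s^2

print(toy)
```

Grouping by ID before differencing prevents the last sample of one trip from being subtracted from the first sample of the next.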

## Splitting the Data Set

Another thing to do is to divide our dataframe into training and testing sets.
<font color="aqua">Take into account the fact that the database is probably strongly unbalanced?</font>
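As a rough illustration of how the imbalance question could be examined, the sketch below counts each object once (rather than each time stamp) and derives inverse-frequency class weights. The toy data is invented, and the weighting formula (number of objects divided by number of classes times the class count) is a common heuristic assumed here, not something established by this notebook.

```python
import pandas as pd

# Toy data (invented): one row per time stamp, several rows per object
toy = pd.DataFrame({
    "ID":   ["a", "a", "b", "c", "c", "c", "d"],
    "type": ["car", "car", "car", "moped", "moped", "moped", "car"],
})

# Count each object once, not each time stamp
type_per_object = toy.drop_duplicates("ID").set_index("ID")["type"]
counts = type_per_object.value_counts()

# Inverse-frequency weights: rarer classes get a larger weight
class_weights = (len(type_per_object) / (len(counts) * counts)).to_dict()

print(counts.to_dict())
print(class_weights)
```

Keras' `Model.fit` accepts a `class_weight` dictionary keyed by integer class index, so the string labels would first have to be mapped to integers before such weights could be used there.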

To do so, we have to remember that we want to classify pedestrians. However, we cannot simply move all of them into the testing set, because the model must also learn what a pedestrian time-series looks like!
We need a simple method to identify what could be "true" pedestrians.

Let's naively start with speed.

First, how many possible pedestrians do we have?

In [13]:
A = data[data['type']=='pedestrian']['ID'].unique()
B = data['ID'].unique()

print(F"{len(A)} pedestrians out of {len(B)} objects")

19526 pedestrians out of 38882 objects


More than half.

[SUMo vehicle type parameter defaults](https://sumo.dlr.de/docs/Vehicle_Type_Parameter_Defaults.html) tell us that v = 5.4 km/h. With some variation (std = 0.1), more than 99 % of pedestrians should have v_max <= 5.7 (the mean plus three standard deviations). Let's choose a speed of 6 km/h as the limit.
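The arithmetic behind this threshold can be checked with a few lines of standard-library Python. This is only a sketch under two assumptions that the text does not state explicitly: that maximum speeds are normally distributed, and that the standard deviation of 0.1 is expressed in km/h.

```python
from statistics import NormalDist

# Assumption: maximum speeds are normally distributed around the default
mean_speed = 5.4   # km/h, default quoted above
std_speed = 0.1    # standard deviation quoted above, assumed to be in km/h
speeds = NormalDist(mu=mean_speed, sigma=std_speed)

three_sigma = mean_speed + 3 * std_speed
share_below_three_sigma = speeds.cdf(three_sigma)
share_below_limit = speeds.cdf(6.0)

print(f"mean + 3 std = {three_sigma:.1f} km/h")
print(f"share below {three_sigma:.1f} km/h: {share_below_three_sigma:.4f}")
print(f"share below 6.0 km/h: {share_below_limit:.6f}")
```

Under these assumptions the 6 km/h limit sits well beyond three standard deviations, so it should exclude almost no genuine pedestrian.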

In [146]:
ped_type = data[data['type']=='pedestrian'].groupby('ID')['speed'].max()

In [147]:
real_ped = ped_type[ped_type<6]

In [148]:
real_ped

ID
pedestrian_1-1-pt_1013    5.508
pedestrian_1-1-pt_1042    3.924
pedestrian_1-1-pt_1053    3.456
pedestrian_1-1-pt_1054    4.608
pedestrian_1-1-pt_1075    3.060
                          ...  
pedestrian_2-1_2917       3.600
pedestrian_2-1_3415       3.240
pedestrian_2-1_4340       3.636
pedestrian_2-1_4381       3.528
pedestrian_3-1_2654       3.312
Name: speed, Length: 460, dtype: float64

460! Out of the almost 20000 "possible pedestrians", only 460 have a maximum speed below our 6 km/h limit!

Let's do a kind of "manual sanity check" on acceleration. [SUMo vehicle type parameter defaults](https://sumo.dlr.de/docs/Vehicle_Type_Parameter_Defaults.html) tell us that a_max = 1.5 and a_min = -2 (m/s^2). With some variation (std = 0.1), more than 99 % of pedestrians should have a_max < 1.8 and a_min > -2.3. Let's accept -2.5 < a < 2.

In [149]:
ped_speeds = data[data['type']=='pedestrian'][['ID', 'speed']]
real_ped = pd.DataFrame(real_ped)

In [153]:
def get_acc_peaks(ID):
    # Speed change between consecutive samples: /5 for the time step (assumed 5 s), /3.6 for km/h -> m/s
    acc = ped_speeds[ped_speeds['ID']==ID]['speed'].diff().iloc[1:]/5/3.6 # m/s^2
    if acc.empty: # single-sample trip: no acceleration can be computed
        return 0, 0
    return acc.min(), acc.max()

def set_acc_peaks(DF):
    # DF is indexed by object ID; collect the acceleration peaks of each object
    acc_min, acc_max = [], []
    for obj_ID in DF.index:
        m, M = get_acc_peaks(obj_ID)
        acc_min.append(m)
        acc_max.append(M)
    return pd.DataFrame({"min.": acc_min, "max.": acc_max}, index=DF.index)

In [157]:
real_ped[["a min.", "a max."]] = set_acc_peaks(real_ped)

In [168]:
assert real_ped[real_ped['a min.']<-2.5].empty
assert real_ped[real_ped['a max.']>2].empty

OK, those will then be our "real" pedestrians. The others will be the objects to be classified.

In [188]:
test_data = data[data['ID'].isin(ped_type[ped_type>=6].index)]
train_data = data[~data['ID'].isin(test_data['ID'])]

Unfortunately, the train and test sizes cannot be chosen freely; let's look at their ratio to see whether it is acceptable:

In [196]:
train_size = len(train_data['ID'].unique())
test_size = len(test_data['ID'].unique())
data_size = len(data['ID'].unique())
print(F"Train time-series: {train_size}, corresponding to {train_size/data_size:.2%} of the data set")
print(F"Test time-series: {test_size}, corresponding to {test_size/data_size:.2%} of the data set")

Train time-series: 19816, corresponding to 50.96% of the data set
Test time-series: 19066, corresponding to 49.04% of the data set


In [199]:
train_size = len(train_data)
test_size = len(test_data)
data_size = len(data)
print(F"Train instances: {train_size}, corresponding to {train_size/data_size:.2%} of the data set")
print(F"Test instances: {test_size}, corresponding to {test_size/data_size:.2%} of the data set")

Train instances: 8498510, corresponding to 70.86% of the data set
Test instances: 3495423, corresponding to 29.14% of the data set


Our data set is surprisingly well balanced if we think of each time stamp as an independent sample! Unfortunately, as expected, when reasoning in terms of object time-series we have almost as many test series as train series. Let's see what happens.

In [200]:
train_data.to_pickle("train.pkl")
test_data.to_pickle("test.pkl")

## Building the Classification Model

In [None]:
train_data = pd.read_pickle("train.pkl")
test_data = pd.read_pickle("test.pkl")

We have only one numeric feature, so we don't have to scale!
The categorical features, on the other hand, are not as convenient to handle...
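One possible way to handle the categorical column is to integer-encode it and to pad every trip to a common length, so that all objects fit in fixed-shape arrays. The sketch below uses an invented toy dataframe; the `lane_code` column, the choice of 0 as padding value and the end-padding are assumptions made for illustration, not the approach settled on in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (invented values) with the columns kept above
toy = pd.DataFrame({
    "t":         [0.0, 5.0, 10.0, 0.0, 5.0],
    "ID":        ["a", "a", "a", "b", "b"],
    "lane_edge": ["e1", "e1", "e2", "e3", "e1"],
    "speed":     [0.0, 3.0, 4.5, 10.0, 30.0],
})

# Integer-encode the categorical column; code 0 is reserved for padding
codes, vocabulary = pd.factorize(toy["lane_edge"])
toy["lane_code"] = codes + 1

# One fixed-length row per object, padded with zeros at the end
max_len = toy.groupby("ID").size().max()
object_ids = sorted(toy["ID"].unique())
speed_batch = np.zeros((len(object_ids), max_len), dtype="float32")
lane_batch = np.zeros((len(object_ids), max_len), dtype="int64")
for i, obj_id in enumerate(object_ids):
    trip = toy[toy["ID"] == obj_id].sort_values("t")
    speed_batch[i, :len(trip)] = trip["speed"].to_numpy()
    lane_batch[i, :len(trip)] = trip["lane_code"].to_numpy()

print(speed_batch.shape, lane_batch.shape)  # (objects, time steps)
```

Integer codes of this kind are the input an embedding layer (for instance `tf.keras.layers.Embedding`) expects, which avoids a very wide one-hot encoding when there are many distinct lanes.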