<h1> Car Lane Prediction </h1>

<p align='justify'>

This Jupyter notebook presents a classification example built with
the Scikit-Learn library. The following steps are performed in this
notebook:
</p>

<ol align='justify'>
    <li> Preprocessing, i.e. feature generation and filtering of
         the data, is carried out using the PTRAIL library. The
         sampling interval is also checked to decide whether
         interpolation is needed.
    </li>
    <li> Several models, such as RandomForestClassifier,
         DecisionTreeClassifier, GaussianNB, KNeighborsClassifier
         and KMeans clustering, are then trained on the cleaned
         dataset using the Scikit-Learn library.
    </li>
    <li>
        Finally, the traffic lane of each vehicle is predicted on
        the held-out test split.
    </li>
</ol>

In [1]:
# Import the dataset.

import pandas as pd
from ptrail.core.TrajectoryDF import PTRAILDataFrame
from datetime import datetime

pdf = pd.read_csv('./data/traffic.csv')

# Convert the integer time values (parsed as %H%M%S%f) into datetime objects.
pdf['datetime'] = [
    pd.to_datetime(datetime.strptime(str(int(value)), "%H%M%S%f"))
    for value in pdf['datetime']
]

# Now converting the dataframe to PTRAILDataFrame.
np_traffic = PTRAILDataFrame(data_set=pdf  ,
                             latitude='lat',
                             longitude='lon',
                             datetime='datetime',
                             traj_id='id')
np_traffic.head()

traj_id,DateTime,vehicle_type,velocity,traffic_lane,lon,lat,kilopost,vehicle_length,detected_flag
1371,1900-01-01 07:30:00.000,1,48.0,2,135.46995,34.710999,3539.5,3.0,0
1371,1900-01-01 07:30:00.100,1,47.9,2,135.469957,34.710991,3532.5,3.0,0
1371,1900-01-01 07:30:00.200,1,47.9,2,135.469963,34.710984,3532.5,3.0,0
1371,1900-01-01 07:30:00.300,1,47.9,2,135.469968,34.710979,3531.5,3.0,0
1371,1900-01-01 07:30:00.400,1,47.9,2,135.469972,34.710974,3530.8,3.0,0
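
The time values in this dataset are stored as integers, so leading zeros are lost before parsing. A minimal sketch of a more defensive parser is shown below. It assumes the values follow an HHMMSSfff layout (two-digit hour, minute and second, then three fractional digits), which is inferred from the 0.1 s steps in the output above and is not confirmed by the dataset documentation. The helper name `parse_hhmmssfff` is hypothetical.

```python
from datetime import datetime


def parse_hhmmssfff(value):
    """Parse an integer such as 73000100 into a datetime (07:30:00.100).

    Assumption: the layout is HHMMSSfff, i.e. nine digits once the
    leading zero dropped by the integer representation is restored.
    """
    text = str(int(value)).zfill(9)  # restore leading zeros
    return datetime.strptime(text, "%H%M%S%f")


parsed = parse_hhmmssfff(73000100)
print(parsed)
```

Zero-padding removes the ambiguity that arises when the hour has a single digit and the first two digits of the string also happen to form a valid hour.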


In [2]:
%%time

# First, generate the distance between consecutive points, then
# run a Hampel filter on that column to remove outliers.
from ptrail.features.kinematic_features import KinematicFeatures
from ptrail.preprocessing.filters import Filters

np_traffic_dist = KinematicFeatures.create_distance_column(np_traffic)
filt_traffic = Filters.hampel_outlier_detection(np_traffic_dist,
                                                column_name='Distance')
print(f"Length of original DF: {len(np_traffic_dist)}")
print(f"Length of filtered DF: {len(filt_traffic)}")

Length of original DF: 44905
Length of filtered DF: 44628
CPU times: user 151 ms, sys: 112 ms, total: 263 ms
Wall time: 3.45 s
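
A Hampel filter flags a point as an outlier when it deviates from the median of its local window by more than a multiple of the scaled median absolute deviation (MAD). The sketch below illustrates the idea on a plain pandas Series. The function name `hampel_mask`, the window size and the threshold are assumptions chosen for illustration; PTRAIL's own implementation and defaults may differ.

```python
import numpy as np
import pandas as pd


def _mad(window):
    """Median absolute deviation of one rolling window."""
    return np.median(np.abs(window - np.median(window)))


def hampel_mask(values, window=5, n_sigmas=3.0):
    """Return a boolean Series that is True where a value is an outlier."""
    series = pd.Series(values, dtype=float)
    rolling = series.rolling(window, center=True, min_periods=1)
    median = rolling.median()
    mad = rolling.apply(_mad, raw=True)
    # 1.4826 scales the MAD to the standard deviation of a normal distribution.
    threshold = n_sigmas * 1.4826 * mad
    return (series - median).abs() > threshold


# Hypothetical distances between consecutive points, with one spike.
distances = [1.0, 1.1, 0.9, 1.0, 25.0, 1.0, 1.1, 0.9, 1.0]
mask = hampel_mask(distances)
print(mask.tolist())
```

Rows where the mask is True would be dropped, which corresponds to the reduction in length reported above.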




In [3]:
# Now, let's drop duplicate points from the dataset.

dp_traffic = Filters.remove_duplicates(filt_traffic)
print(f"Length of original DF: {len(filt_traffic)}")
print(f"Length of filtered DF: {len(dp_traffic)}")

Length of original DF: 44628
Length of filtered DF: 44628


In [4]:
# Finally, before interpolation, let's remove trajectories
# that have too few points.
few_filt_traffic = Filters.remove_trajectories_with_less_points(dp_traffic)
print(f"Length of original DF: {len(dp_traffic)}")
print(f"Length of filtered DF: {len(few_filt_traffic)}")

Length of original DF: 44628
Length of filtered DF: 44626
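
Removing short trajectories amounts to counting the points per trajectory ID and keeping only the IDs that reach a minimum count. A rough pandas equivalent is sketched below. The helper name `drop_short_trajectories` and the `min_points=3` threshold are assumptions for illustration and are not PTRAIL's documented default.

```python
import pandas as pd


def drop_short_trajectories(df, id_col="traj_id", min_points=3):
    """Keep only rows whose trajectory has at least `min_points` rows."""
    sizes = df.groupby(id_col)[id_col].transform("size")
    return df[sizes >= min_points]


# Hypothetical data: trajectory 1 has 3 points, 2 has 2 points, 3 has 1 point.
toy = pd.DataFrame({"traj_id": [1, 1, 1, 2, 2, 3],
                    "velocity": [48.0, 47.9, 47.9, 28.6, 28.6, 30.3]})
kept = drop_short_trajectories(toy)
print(f"Length of original DF: {len(toy)}")
print(f"Length of filtered DF: {len(kept)}")
```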


In [5]:
# Before interpolating, check how many consecutive points are
# more than 0.1 seconds apart. This shows whether interpolation
# is needed at all.
a = few_filt_traffic.reset_index()['DateTime'].diff().dt.total_seconds()
(a > 0.1).value_counts()

False    44435
True       191
Name: DateTime, dtype: int64
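
The check above takes the difference over the whole frame, so the first point of each trajectory is compared with the last point of the previous one. A sketch of a per-trajectory variant is given below. It uses hypothetical toy data with column names matching the frame above, and compares against a `pd.Timedelta` rather than a float number of seconds.

```python
import pandas as pd

# Hypothetical data: two trajectories, the second starting a minute later.
toy = pd.DataFrame({
    "traj_id": [1, 1, 1, 2, 2],
    "DateTime": pd.to_datetime([
        "1900-01-01 07:30:00.000",
        "1900-01-01 07:30:00.100",
        "1900-01-01 07:30:00.500",
        "1900-01-01 07:31:00.000",
        "1900-01-01 07:31:00.100",
    ]),
})

limit = pd.Timedelta(milliseconds=100)

# Naive check: the jump between trajectories is counted as a gap.
naive_gap = toy["DateTime"].diff() > limit

# Grouped check: differences are taken within each trajectory only.
grouped_gap = toy.groupby("traj_id")["DateTime"].diff() > limit

print(naive_gap.tolist())
print(grouped_gap.tolist())
```

Only gaps inside a trajectory are relevant when deciding whether to interpolate, so the grouped count is the more conservative of the two.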

In [6]:
# Based on the counts shown above, almost all consecutive points are
# at most 0.1 seconds apart, so the data is already densely sampled.
# Hence interpolation is not performed here. For the training and
# testing datasets, we put 70% of the IDs into the training dataset
# and the remaining 30% into the testing dataset.

# 70% into the training dataset.
ids_ = list(few_filt_traffic.traj_id.value_counts().keys())
train_df = []
for i in range(int(len(ids_) * 0.7)):
    small = few_filt_traffic.reset_index().loc[few_filt_traffic.reset_index()['traj_id'] == ids_[i]]
    train_df.append(small)

np_train = PTRAILDataFrame(pd.concat(train_df),
                           latitude='lat',
                           longitude='lon',
                           datetime='DateTime',
                           traj_id='traj_id')
print(np_train.shape)
np_train.head()

(40275, 10)


traj_id,DateTime,index,vehicle_type,velocity,traffic_lane,lon,lat,kilopost,vehicle_length,detected_flag,Distance
1389,1900-01-01 07:30:00.000,1603,1,28.6,2,135.468358,34.712838,3790.2,4.5,1,
1389,1900-01-01 07:30:00.100,1604,1,28.6,2,135.468364,34.712832,3789.0,4.5,1,0.885661
1389,1900-01-01 07:30:00.200,1605,1,29.2,2,135.46837,34.712825,3788.0,4.5,1,0.900019
1389,1900-01-01 07:30:00.300,1606,1,29.7,2,135.468376,34.712818,3786.9,4.5,1,0.928737
1389,1900-01-01 07:30:00.400,1607,1,30.3,2,135.468381,34.712811,3785.9,4.5,1,0.937812


In [7]:
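# Remaining IDs into the testing dataset.
# Note: the range below starts at int(len(ids_) * 0.7) + 1, so the ID
# at index int(len(ids_) * 0.7) ends up in neither split. Drop the
# "+ 1" to include it in the testing dataset.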
test_df = []
for i in range(int(len(ids_) * 0.7)+1, len(ids_)):
    small = few_filt_traffic.reset_index().loc[few_filt_traffic.reset_index()['traj_id'] == ids_[i]]
    test_df.append(small)

np_test = PTRAILDataFrame(pd.concat(test_df),
                          latitude='lat',
                          longitude='lon',
                          datetime='DateTime',
                          traj_id='traj_id')
print(np_test.shape)
np_test.head()

(4116, 10)


traj_id,DateTime,index,vehicle_type,velocity,traffic_lane,lon,lat,kilopost,vehicle_length,detected_flag,Distance
1371,1900-01-01 07:30:00.000,0,1,48.0,2,135.46995,34.710999,3539.5,3.0,0,
1371,1900-01-01 07:30:00.100,1,1,47.9,2,135.469957,34.710991,3532.5,3.0,0,1.115504
1371,1900-01-01 07:30:00.200,2,1,47.9,2,135.469963,34.710984,3532.5,3.0,0,0.939478
1371,1900-01-01 07:30:00.300,3,1,47.9,2,135.469968,34.710979,3531.5,3.0,0,0.763477
1371,1900-01-01 07:30:00.400,4,1,47.9,2,135.469972,34.710974,3530.8,3.0,0,0.596403
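
Splitting by trajectory ID keeps all points of one vehicle in the same split. A compact sketch of such a split is shown below. Unlike the cells above, which take the IDs in `value_counts()` order, this version shuffles the IDs with a fixed seed. The helper name `split_by_trajectory` and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def split_by_trajectory(df, id_col="traj_id", train_frac=0.7, seed=0):
    """Split rows so that every trajectory ID lands in exactly one split."""
    ids = df[id_col].unique()
    shuffled = np.random.default_rng(seed).permutation(ids)
    cut = int(len(shuffled) * train_frac)
    train_ids, test_ids = shuffled[:cut], shuffled[cut:]
    train = df[df[id_col].isin(train_ids)]
    test = df[df[id_col].isin(test_ids)]
    return train, test


# Hypothetical data: 10 trajectories with 3 points each.
toy = pd.DataFrame({"traj_id": np.repeat(np.arange(10), 3),
                    "velocity": np.linspace(20.0, 50.0, 30)})
train, test = split_by_trajectory(toy)
print(train.shape, test.shape)
```

Because the slices `[:cut]` and `[cut:]` share the same boundary, every ID is assigned to one of the two splits and none is dropped.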


In [9]:
# Finally, splitting the test and train datasets as follows:
#   1. Training:
#       1.1 train_x
#       1.2 train_y
#   2. Testing:
#       2.1 test_x
#       2.2 test_y

train_x = np_train.drop(columns=['traffic_lane', 'Distance'])
train_y = np_train.reset_index()['traffic_lane']

test_x = np_test.drop(columns=['traffic_lane', 'Distance'])
test_y = np_test.reset_index()['traffic_lane']

In [10]:
%%time

# Now it is time to train some models and predict
# the traffic lane of each vehicle.

# 1. RandomForestClassifier model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier()
rf_model.fit(train_x, train_y)

# Now, let's predict on the training and testing sets
# using the model trained above.
rf_train_predict = rf_model.predict(train_x)
rf_test_predict = rf_model.predict(test_x)


# Finally, let's measure the accuracy of the model on both
# datasets.
rf_train_accuracy = accuracy_score(train_y, rf_train_predict)
rf_test_accuracy = accuracy_score(test_y, rf_test_predict)

print('---------------- RandomForest Classifier -----------------')
print(f"The predicted train set values for RF are: {rf_train_predict}")
print(f"The predicted test set values for RF are: {rf_test_predict}\n")
print(f"The Training accuracy of RF is: {round(rf_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of RF is: {round(rf_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

---------------- RandomForest Classifier -----------------
The predicted train set values for RF are: [2 2 2 ... 1 1 1]
The predicted test set values for RF are: [2 2 2 ... 1 1 1]

The Training accuracy of RF is: 100.0%
The Testing accuracy of RF is: 76.8%
----------------------------------------------------------

CPU times: user 3.38 s, sys: 16 ms, total: 3.4 s
Wall time: 3.39 s
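
With a large gap between training and testing accuracy, it can help to look at which features the forest relies on. The sketch below reads the `feature_importances_` attribute of a fitted RandomForestClassifier. It runs on synthetic data from `make_classification`, and the feature names are placeholders, so the numbers say nothing about the traffic dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for train_x / train_y (assumption: 5 numeric features).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances: non-negative and summing to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

On the notebook's data, the same call on `rf_model` with `train_x.columns` as the index would show whether a single column dominates the prediction.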


In [11]:
%%time

# 2. DecisionTree Classifier model.

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(train_x, train_y)

# Now, let's predict on the training and testing sets
# using the model trained above.
dt_train_predict = dt_model.predict(train_x)
dt_test_predict = dt_model.predict(test_x)


# Finally, let's measure the accuracy of the model on both
# datasets.
dt_train_accuracy = accuracy_score(train_y, dt_train_predict)
dt_test_accuracy = accuracy_score(test_y, dt_test_predict)

print('---------------- DecisionTree Classifier -----------------')
print(f"The predicted train set values for DT are: {dt_train_predict}")
print(f"The predicted test set values for DT are: {dt_test_predict}\n")
print(f"The Training accuracy of DT is: {round(dt_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of DT is: {round(dt_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')


---------------- DecisionTree Classifier -----------------
The predicted train set values for DT are: [2 2 2 ... 1 1 1]
The predicted test set values for DT are: [2 2 2 ... 1 1 1]

The Training accuracy of DT is: 100.0%
The Testing accuracy of DT is: 57.12%
----------------------------------------------------------

CPU times: user 113 ms, sys: 18 µs, total: 113 ms
Wall time: 111 ms
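
The model cells in this notebook all follow the same fit, predict and score pattern. A small helper, sketched below, could express that pattern once. The function name `evaluate` is hypothetical, and the data is a synthetic stand-in generated with `make_classification`, not the traffic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, train_x, train_y, test_x, test_y):
    """Fit the model and return its training and testing accuracy."""
    model.fit(train_x, train_y)
    return {
        "train": accuracy_score(train_y, model.predict(train_x)),
        "test": accuracy_score(test_y, model.predict(test_x)),
    }


# Synthetic stand-in data (assumption: not the traffic dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {"DT": DecisionTreeClassifier(random_state=0), "GNB": GaussianNB()}
results = {name: evaluate(m, train_x, train_y, test_x, test_y)
           for name, m in models.items()}

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.2%}, test={scores['test']:.2%}")
```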


In [12]:
%%time

# 3. Gaussian Naive Bayes model.

from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(train_x, train_y)

# Now, let's predict on the training and testing sets
# using the model trained above.
gnb_train_predict = gnb_model.predict(train_x)
gnb_test_predict = gnb_model.predict(test_x)


# Finally, let's measure the accuracy of the model on both
# datasets.
gnb_train_accuracy = accuracy_score(train_y, gnb_train_predict)
gnb_test_accuracy = accuracy_score(test_y, gnb_test_predict)

print('----------------- Naive Bayes Classifier -----------------')
print(f"The predicted train set values for GNB are: {gnb_train_predict}")
print(f"The predicted test set values for GNB are: {gnb_test_predict}\n")
print(f"The Training accuracy of GNB is: {round(gnb_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of GNB is: {round(gnb_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

----------------- Naive Bayes Classifier -----------------
The predicted train set values for GNB are: [2 2 2 ... 1 1 1]
The predicted test set values for GNB are: [2 2 2 ... 1 1 1]

The Training accuracy of GNB is: 68.25%
The Testing accuracy of GNB is: 59.01%
----------------------------------------------------------

CPU times: user 26.7 ms, sys: 46 µs, total: 26.7 ms
Wall time: 24.5 ms


In [13]:

%%time

# 4. K-Nearest Neighbors Classifier model.

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_model.fit(train_x, train_y)

# Now, let's predict on the training and testing sets
# using the model trained above.
knn_train_predict = knn_model.predict(train_x)
knn_test_predict = knn_model.predict(test_x)


# Finally, let's measure the accuracy of the model on both
# datasets.
knn_train_accuracy = accuracy_score(train_y, knn_train_predict)
knn_test_accuracy = accuracy_score(test_y, knn_test_predict)

print('--------------------- KNN Classifier ---------------------')
print(f"The predicted train set values for KNN are: {knn_train_predict}")
print(f"The predicted test set values for KNN are: {knn_test_predict}\n")
print(f"The Training accuracy of KNN is: {round(knn_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of KNN is: {round(knn_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

--------------------- KNN Classifier ---------------------
The predicted train set values for KNN are: [2 2 2 ... 1 1 1]
The predicted test set values for KNN are: [2 2 2 ... 1 1 1]

The Training accuracy of KNN is: 100.0%
The Testing accuracy of KNN is: 35.86%
----------------------------------------------------------

CPU times: user 4.25 s, sys: 8 ms, total: 4.26 s
Wall time: 4.24 s
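
K-nearest neighbours relies on raw distances, so features on very different scales (here, for instance, `kilopost` in the thousands next to `lon` and `lat` changes in the sixth decimal place) can dominate the neighbour search. The sketch below shows the effect of standardising features in a pipeline on synthetic data. Whether scaling improves the score on the traffic dataset is not established here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)

# Small-scale feature that separates the classes.
informative = y + rng.normal(0.0, 0.1, size=400)
# Large-scale feature that is unrelated to the label.
noise = rng.normal(0.0, 1000.0, size=400)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

raw_knn = KNeighborsClassifier().fit(X_train, y_train)
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier()).fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
print(f"Unscaled KNN test accuracy: {raw_acc:.2%}")
print(f"Scaled KNN test accuracy:   {scaled_acc:.2%}")
```

Putting the scaler inside the pipeline ensures that its mean and variance are learned from the training split only.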


In [14]:
%%time

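# 5. K-Means clustering model.
# Note: KMeans is an unsupervised clustering algorithm, not a
# classifier. It ignores the labels during fitting, and its cluster
# IDs (0, 1, 2) are arbitrary, so comparing them directly with the
# lane labels gives only a rough indication.

from sklearn.cluster import KMeans

km_model = KMeans(n_clusters=3)
km_model.fit(train_x)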

# Now, let's predict on the training and testing sets
# using the model trained above.
km_train_predict = km_model.predict(train_x)
km_test_predict = km_model.predict(test_x)


# Finally, let's measure the accuracy of the model on both
# datasets.
km_train_accuracy = accuracy_score(train_y, km_train_predict)
km_test_accuracy = accuracy_score(test_y, km_test_predict)

print('------------------- K-Means Classifier --------------------')
print(f"The predicted train set values for KM are: {km_train_predict}")
print(f"The predicted test set values for KM are: {km_test_predict}\n")
print(f"The Training accuracy of KM is: {round(km_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of KM is: {round(km_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

------------------- K-Means Classifier --------------------
The predicted train set values for KM are: [2 2 2 ... 1 1 1]
The predicted test set values for KM are: [2 2 2 ... 1 1 1]

The Training accuracy of KM is: 46.56%
The Testing accuracy of KM is: 43.42%
----------------------------------------------------------

CPU times: user 2.75 s, sys: 57.1 ms, total: 2.81 s
Wall time: 310 ms
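
K-Means cluster IDs carry no relation to the lane numbers, so a direct `accuracy_score` between the two depends on how the clusters happen to be numbered. One common workaround, sketched below, maps each cluster to the most frequent true label among its members before scoring. The helper name `map_clusters_to_labels` and the toy arrays are hypothetical.

```python
import numpy as np


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster ID to the majority true label among its members."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    mapping = {}
    for cluster in np.unique(cluster_ids):
        labels, counts = np.unique(true_labels[cluster_ids == cluster],
                                   return_counts=True)
        mapping[int(cluster)] = int(labels[np.argmax(counts)])
    return mapping


# Hypothetical cluster assignments and lane labels.
clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
lanes = np.array([2, 2, 1, 3, 3, 3, 1, 1, 2])

mapping = map_clusters_to_labels(clusters, lanes)
mapped = np.array([mapping[int(c)] for c in clusters])

raw_acc = float(np.mean(clusters == lanes))
mapped_acc = float(np.mean(mapped == lanes))
print(f"Mapping: {mapping}")
print(f"Accuracy before mapping: {raw_acc:.2%}")
print(f"Accuracy after mapping:  {mapped_acc:.2%}")
```

The mapping should be learned on the training split and then applied unchanged to the test predictions, otherwise the test score is optimistic.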
