<h1> Ship Type Prediction </h1>

<p align='justify'>

This Jupyter notebook contains a classification example built
with the Scikit-Learn library. The following steps are performed
in this notebook:
</p>

<ol align='justify'>
    <li> The preprocessing i.e. feature generation, filtering and
         interpolation of the data is carried out using the
         PTRAIL Library.
    </li>
    <li> Next, several models (RandomForestClassifier,
         DecisionTreeClassifier, GaussianNB, KNeighborsClassifier and
         the KMeans clustering model) are trained with the
         Scikit-Learn library on the cleaned dataset.
    </li>
    <li>
        Finally, the ship types are predicted on the interpolated
        dataset and the accuracy of each model is checked.
    </li>
</ol>

In [1]:
# Import the dataset.

import pandas as pd
from ptrail.core.TrajectoryDF import PTRAILDataFrame

pdf = pd.read_csv('./data/ships.csv')
np_ships = PTRAILDataFrame(data_set=pdf,
                           latitude='lat',
                           longitude='lon',
                           datetime='Timestamp',
                           traj_id='Name')
np_ships.head()

traj_id,DateTime,lat,lon,MMSI,NavStatus,SOG,COG,ShipType
AB RAMANTENN,2017-05-07 00:13:05,11.905735,57.681092,265902200,Moored,0.1,170.7,Undefined
AB RAMANTENN,2017-05-07 00:25:04,11.90574,57.68107,265902200,Moored,0.1,170.7,Undefined
AB RAMANTENN,2017-05-07 00:31:05,11.905792,57.68106,265902200,Moored,0.1,177.4,Undefined
AB RAMANTENN,2017-05-07 01:01:05,11.90565,57.681127,265902200,Moored,0.0,175.6,Undefined
AB RAMANTENN,2017-05-07 01:07:05,11.9057,57.681107,265902200,Moored,0.1,180.8,Undefined


In [2]:
%%time

# Now, using PTRAIL, generate the distance feature. The Hampel
# filter is run on this column in the next cell to remove outliers.
from ptrail.features.spatial_features import SpatialFeatures
from ptrail.preprocessing.filters import Filters

dist_ships = SpatialFeatures.create_distance_between_consecutive_column(np_ships)
dist_ships.head()

CPU times: user 196 ms, sys: 8.01 ms, total: 204 ms
Wall time: 203 ms


traj_id,DateTime,lat,lon,MMSI,NavStatus,SOG,COG,ShipType,Distance_prev_to_curr
AB RAMANTENN,2017-05-07 00:13:05,11.905735,57.681092,265902200,Moored,0.1,170.7,Undefined,
AB RAMANTENN,2017-05-07 00:25:04,11.90574,57.68107,265902200,Moored,0.1,170.7,Undefined,2.457384
AB RAMANTENN,2017-05-07 00:31:05,11.905792,57.68106,265902200,Moored,0.1,177.4,Undefined,5.883613
AB RAMANTENN,2017-05-07 01:01:05,11.90565,57.681127,265902200,Moored,0.0,175.6,Undefined,17.391237
AB RAMANTENN,2017-05-07 01:07:05,11.9057,57.681107,265902200,Moored,0.1,180.8,Undefined,5.970428
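
The values in `Distance_prev_to_curr` are consistent with a great-circle distance in metres between consecutive fixes. The sketch below uses the haversine formula to reproduce the second row of the table. Whether PTRAIL computes the column in exactly this way is an assumption, and `haversine_m` and `EARTH_RADIUS_M` are illustrative names that do not come from the library.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# First two fixes of 'AB RAMANTENN', using the lat/lon columns as given.
d = haversine_m(11.905735, 57.681092, 11.90574, 57.68107)
print(f"{d:.2f} m")  # close to the 2.457384 reported in the table
```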


In [3]:
%%time

# Now, filter out the outliers using the Hampel filter.

filt_ships = Filters.hampel_outlier_detection(dist_ships,
                                              column_name='Distance_prev_to_curr')
print(f"Length of original DF: {len(dist_ships)}")
print(f"Length of Filtered DF: {len(filt_ships)}")

Length of original DF: 84702
Length of Filtered DF: 61394
CPU times: user 128 ms, sys: 64.2 ms, total: 192 ms
Wall time: 7.29 s
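
To make the filtering step more concrete, here is a minimal, dependency-free sketch of a Hampel filter: a value is flagged when it deviates from the median of its local window by more than a few scaled median absolute deviations (MAD). The window size, the threshold and the function name `hampel_outlier_indices` are assumptions chosen for illustration; PTRAIL's own defaults and implementation may differ.

```python
from statistics import median


def hampel_outlier_indices(values, half_window=3, n_sigmas=3.0):
    """Return the indices of values flagged as outliers by a Hampel filter."""
    k = 1.4826  # makes the MAD comparable to a standard deviation for Gaussian data
    outliers = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(values[i] - med) > n_sigmas * k * mad:
            outliers.append(i)
    return outliers


# Made-up distances in metres with one obvious jump.
distances = [5.0, 6.0, 5.5, 6.2, 250.0, 5.8, 6.1, 5.9]
print(hampel_outlier_indices(distances))  # [4]
```

Because the filter relies on medians rather than means, a single large jump does not inflate the threshold that is used to detect it.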




In [4]:
# Furthermore, remove the duplicate points from the
# trajectories.

fp_filt_ships = Filters.remove_duplicates(filt_ships)
print(f"Length of original DF: {len(filt_ships)}")
print(f"Length of Filtered DF: {len(fp_filt_ships)}")

Length of original DF: 61394
Length of Filtered DF: 61102


In [5]:
# Now, remove the trajectories that have fewer than 3 points.
print(f"Before: {len(fp_filt_ships)}")
fp_filt_ships = Filters.remove_trajectories_with_less_points(fp_filt_ships)
print(f"After: {len(fp_filt_ships)}")

Before: 61102
After: 61097


In [6]:
# Now, since the model fitting does not take string values,
# convert the ship types to integers.

fp_filt_ships["ShipType"] = fp_filt_ships["ShipType"].str.strip()
# Map each known ship type to its integer code. Any other
# value (including a missing one) is encoded as 15.
ship_type_codes = {
    'Tanker': 0, 'Passenger': 1, 'HSC': 2, 'Pilot': 3,
    'SAR': 4, 'Tug': 5, 'Cargo': 6, 'Pleasure': 7,
    'Undefined': 8, 'Sailing': 9, 'Law enforcement': 10,
    'Spare 2': 11, 'Diving': 12, 'Fishing': 13, 'Other': 14,
}
types = fp_filt_ships['ShipType'].tolist()
int_test = [ship_type_codes.get(ship_type, 15) for ship_type in types]
fp_filt_ships['ShipType'] = int_test
fp_filt_ships.head()

traj_id,DateTime,index,lat,lon,MMSI,NavStatus,SOG,COG,ShipType,Distance_prev_to_curr
AB RAMANTENN,2017-05-07 00:13:05,0,11.905735,57.681092,265902200,Moored,0.1,170.7,8,
AB RAMANTENN,2017-05-07 00:25:04,1,11.90574,57.68107,265902200,Moored,0.1,170.7,8,2.457384
AB RAMANTENN,2017-05-07 00:31:05,2,11.905792,57.68106,265902200,Moored,0.1,177.4,8,5.883613
AB RAMANTENN,2017-05-07 01:07:05,3,11.9057,57.681107,265902200,Moored,0.1,180.8,8,5.970428
AB RAMANTENN,2017-05-07 01:31:04,4,11.905708,57.681045,265902200,Moored,0.1,173.2,8,6.804183
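
Since the models later print integer codes such as `[8 8 8 ... 1 1 0]`, it can help to translate them back to ship type names. The following sketch inverts the encoding used in the cell above. The names `SHIP_TYPE_CODES`, `decode_ship_types` and the `'Unknown'` label for code 15 are hypothetical helpers, not part of the notebook or of PTRAIL.

```python
# Same codes as in the encoding cell above.
SHIP_TYPE_CODES = {
    'Tanker': 0, 'Passenger': 1, 'HSC': 2, 'Pilot': 3,
    'SAR': 4, 'Tug': 5, 'Cargo': 6, 'Pleasure': 7,
    'Undefined': 8, 'Sailing': 9, 'Law enforcement': 10,
    'Spare 2': 11, 'Diving': 12, 'Fishing': 13, 'Other': 14,
}
CODE_TO_SHIP_TYPE = {code: name for name, code in SHIP_TYPE_CODES.items()}


def decode_ship_types(codes):
    """Translate integer codes back to names; code 15 and others become 'Unknown'."""
    return [CODE_TO_SHIP_TYPE.get(int(code), 'Unknown') for code in codes]


print(decode_ship_types([8, 8, 1, 0, 15]))
# ['Undefined', 'Undefined', 'Passenger', 'Tanker', 'Unknown']
```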


In [7]:
# Finally, for cubic interpolation, 2 values having
# same traj_id and datetime are not allowed. Hence
# drop those points and convert the DF back to PTRAILDataFrame.

fp_filt_ships = fp_filt_ships.reset_index().drop_duplicates(subset=['traj_id', 'DateTime'])
print(fp_filt_ships.columns)
fp_filt_ships = fp_filt_ships[['traj_id', 'DateTime', 'lat', 'lon', 'ShipType']]
fp_filt_ships = PTRAILDataFrame(data_set=fp_filt_ships,
                              latitude='lat',
                              longitude='lon',
                              datetime='DateTime',
                              traj_id='traj_id')
fp_filt_ships.head()

Index(['traj_id', 'DateTime', 'index', 'lat', 'lon', 'MMSI', 'NavStatus',
       'SOG', 'COG', 'ShipType', 'Distance_prev_to_curr'],
      dtype='object')


traj_id,DateTime,lat,lon,ShipType
AB RAMANTENN,2017-05-07 00:13:05,11.905735,57.681092,8
AB RAMANTENN,2017-05-07 00:25:04,11.90574,57.68107,8
AB RAMANTENN,2017-05-07 00:31:05,11.905792,57.68106,8
AB RAMANTENN,2017-05-07 01:07:05,11.9057,57.681107,8
AB RAMANTENN,2017-05-07 01:31:04,11.905708,57.681045,8


In [8]:
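# Check how many consecutive points are more than 30 seconds
# apart, to get a sense of how many gaps interpolation can fill.
# Note: diff() runs over the whole frame, so the first point of
# each trajectory is compared with the last point of the previous one.
time_gaps = fp_filt_ships.reset_index()["DateTime"].diff().dt.total_seconds()
(time_gaps > 30).value_counts()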

False    47968
True     12980
Name: DateTime, dtype: int64
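
The check above takes differences over the whole frame, so each trajectory's first point is compared with the last point of the previous trajectory. A minimal sketch of a per-trajectory variant is shown below. It is dependency-free, assumes the points are already sorted by trajectory id and time, and uses made-up sample data and a hypothetical helper name, `count_gaps`.

```python
from datetime import datetime
from itertools import groupby


def count_gaps(points, threshold_s=30):
    """Count (large, small) gaps between consecutive points of the same trajectory.

    `points` is a list of (traj_id, datetime) tuples sorted by id, then time.
    """
    large = small = 0
    for _, group in groupby(points, key=lambda p: p[0]):
        times = [t for _, t in group]
        for prev, curr in zip(times, times[1:]):
            if (curr - prev).total_seconds() > threshold_s:
                large += 1
            else:
                small += 1
    return large, small


# Illustrative data: two trajectories, 'A' and 'B'.
sample = [
    ('A', datetime(2017, 5, 7, 0, 13, 5)),
    ('A', datetime(2017, 5, 7, 0, 13, 20)),
    ('A', datetime(2017, 5, 7, 0, 25, 4)),
    ('B', datetime(2017, 5, 7, 0, 0, 0)),
    ('B', datetime(2017, 5, 7, 0, 0, 10)),
]
print(count_gaps(sample))  # (1, 2)
```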

In [9]:
# Now, interpolate the filtered dataframe and add points
# to make the trajectories smoother.

from ptrail.preprocessing.interpolation import Interpolation as ip

ip_ships = ip.interpolate_position(fp_filt_ships,
                                   time_jump=15,
                                   ip_type='cubic')

print(f"Length of original DF: {len(fp_filt_ships)}")
print(f"Length of interpolated DF: {len(ip_ships)}")

Length of original DF: 60948
Length of interpolated DF: 83662
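
As a rough illustration of what interpolation adds, the sketch below inserts one synthetic fix `time_jump_s` seconds after a point whenever the gap to the next point is larger than that. It deliberately uses linear interpolation instead of the cubic interpolation used in the cell above, purely to keep the example short. The insertion strategy and the name `interpolate_linear` are assumptions and do not describe PTRAIL's internals.

```python
from datetime import datetime, timedelta


def interpolate_linear(p0, p1, time_jump_s=15):
    """Return a point `time_jump_s` seconds after p0 on the line from p0 to p1.

    Each point is a (datetime, lat, lon) tuple. Returns None when the gap
    between the two points is not larger than `time_jump_s`.
    """
    t0, lat0, lon0 = p0
    t1, lat1, lon1 = p1
    gap = (t1 - t0).total_seconds()
    if gap <= time_jump_s:
        return None
    frac = time_jump_s / gap
    return (t0 + timedelta(seconds=time_jump_s),
            lat0 + frac * (lat1 - lat0),
            lon0 + frac * (lon1 - lon0))


# Made-up fixes that are 60 seconds apart.
start = (datetime(2017, 5, 7, 0, 0, 0), 10.0, 20.0)
end = (datetime(2017, 5, 7, 0, 1, 0), 14.0, 24.0)
new_point = interpolate_linear(start, end)
print(new_point)
```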


In [10]:
# Now, fixing the ShipType column for the interpolated ships dataset.

# The logic behind this is that when a point is added for a
# particular traj_id, then it will retain the ship type as well
# since interpolation is performed on each trajectory. As a result
# the ShipType remains the same and they can be reassigned to
# new points as well.

# Flatten both dataframes once, then create a list of all unique ids.
ip_flat = ip_ships.reset_index()
orig_flat = fp_filt_ships.reset_index()
ids_ = list(ip_flat['traj_id'].value_counts().keys())

df_chunks = []
# Create a small chunk for each ID, then for the same ID in the
# original dataset, grab the most frequent ship type and assign
# it to every point of the interpolated chunk.
for traj_id in ids_:
    # Copy the chunk so that the assignment below does not write to a view.
    small = ip_flat.loc[ip_flat['traj_id'] == traj_id].copy()
    spec = orig_flat.loc[orig_flat['traj_id'] == traj_id, 'ShipType']
    small['ShipType'] = spec.value_counts().idxmax()
    df_chunks.append(small)

# Now, convert the dataframe with the reassigned ship types to
# PTRAILDataFrame.
ip_ships = PTRAILDataFrame(data_set=pd.concat(df_chunks),
                         latitude='lat',
                         longitude='lon',
                         traj_id='traj_id',
                         datetime='DateTime')
ip_ships.head()

traj_id,DateTime,lat,lon,ShipType
AB RAMANTENN,2017-05-07 00:13:05,11.905735,57.681092,8
AB RAMANTENN,2017-05-07 00:13:20,11.905732,57.681092,8
AB RAMANTENN,2017-05-07 00:25:04,11.90574,57.68107,8
AB RAMANTENN,2017-05-07 00:25:19,11.905742,57.681069,8
AB RAMANTENN,2017-05-07 00:31:05,11.905792,57.68106,8


In [11]:
%%time

import datetime as dt


# Convert a time of day to the number of seconds since midnight.
def dtt2timestamp(dtt):
    ts = (dtt.hour * 60 + dtt.minute) * 60 + dtt.second
    # Add the fractional part so that microseconds are kept as well.
    ts += dtt.microsecond * 10**(-6)
    return ts


# Now, on both the train and test datasets, generate the date
# and time features and convert them to numbers (ordinal day
# number and seconds since midnight) so that the models can use them.
from ptrail.features.temporal_features import TemporalFeatures

# 1. Train dataset.
fp_filt_ships = TemporalFeatures.create_date_column(fp_filt_ships)
fp_filt_ships = TemporalFeatures.create_time_column(fp_filt_ships)

fp_filt_ships['Time'] = fp_filt_ships['Time'].apply(dtt2timestamp)
fp_filt_ships['Date'] = fp_filt_ships['Date'].map(dt.datetime.toordinal)

# 2. Test dataset.
ip_ships = TemporalFeatures.create_time_column(ip_ships)
ip_ships = TemporalFeatures.create_date_column(ip_ships)

ip_ships['Date'] = ip_ships['Date'].map(dt.datetime.toordinal)
ip_ships['Time'] = ip_ships['Time'].apply(dtt2timestamp)

CPU times: user 750 ms, sys: 32.1 ms, total: 782 ms
Wall time: 781 ms
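
For reference, here is what the two encodings produce for the first timestamp in the dataset. This is a standalone sketch that mirrors the logic of `dtt2timestamp` and `toordinal` above without PTRAIL; `seconds_since_midnight` is an illustrative name.

```python
import datetime as dt


def seconds_since_midnight(t):
    """Encode a time of day as seconds elapsed since 00:00:00."""
    return (t.hour * 60 + t.minute) * 60 + t.second + t.microsecond * 1e-6


stamp = dt.datetime(2017, 5, 7, 0, 13, 5)
time_feature = seconds_since_midnight(stamp.time())
date_feature = stamp.date().toordinal()  # days since 0001-01-01 (which is day 1)

print(time_feature)  # 785.0
print(dt.date.fromordinal(date_feature))  # 2017-05-07
```

Both features are plain numbers, which is what the estimators expect, but they live on very different scales from `lat` and `lon`.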


In [12]:
# Now, split the datasets into the following 4 parts:
#   1. Training:
#       1.1 train_x
#       1.2 train_y
#   2. Testing:
#       2.1 test_x
#       2.2 test_y

train_x = fp_filt_ships.drop(columns=['ShipType'])
train_y = fp_filt_ships.reset_index()['ShipType']

test_x = ip_ships.drop(columns=['ShipType'])
test_y = ip_ships.reset_index()['ShipType']
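
The interpolated test set appears to contain the original fixes as well (compare the first rows of the two tables above), so part of the test data has already been seen during training. One possible alternative, sketched here under that assumption, is to hold out whole trajectories. The helper `split_by_trajectory` and the ship names are hypothetical.

```python
import random


def split_by_trajectory(traj_ids, test_fraction=0.2, seed=0):
    """Split unique trajectory ids into disjoint train and test id sets."""
    unique_ids = sorted(set(traj_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


# Made-up trajectory ids; each id would normally appear once per fix.
all_ids = [f"SHIP {i}" for i in range(10)] * 3
train_ids, test_ids = split_by_trajectory(all_ids)
print(len(train_ids), len(test_ids))  # 8 2
```

Rows would then be assigned to the train or test frame depending on which set their `traj_id` falls into, so no ship contributes points to both.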

In [13]:
%%time

# Now it is time to train some models and predict
# the ship types.

# 1. RandomForestClassifier model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier()
rf_model.fit(train_x, train_y)

# Now, let's predict on the training and testing set
# using the above trained model.
rf_train_predict = rf_model.predict(train_x)
rf_test_predict = rf_model.predict(test_x)


# Finally, let's test the accuracy of the model on both
# the datasets.
rf_train_accuracy = accuracy_score(train_y, rf_train_predict)
rf_test_accuracy = accuracy_score(test_y, rf_test_predict)

print('---------------- RandomForest Classifier -----------------')
print(f"The predicted train set values for RF are: {rf_train_predict}")
print(f"The predicted test set values for RF are: {rf_test_predict}\n")
print(f"The Training accuracy of RF is: {round(rf_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of RF is: {round(rf_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

---------------- RandomForest Classifier -----------------
The predicted train set values for RF are: [8 8 8 ... 9 9 9]
The predicted test set values for RF are: [8 8 8 ... 1 1 0]

The Training accuracy of RF is: 100.0%
The Testing accuracy of RF is: 80.09%
----------------------------------------------------------

CPU times: user 8.91 s, sys: 63.7 ms, total: 8.98 s
Wall time: 8.98 s
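
A single accuracy figure can hide how the model does on individual ship types, especially when some types are much more frequent than others. The sketch below computes per-class recall with the standard library only. The labels are invented for illustration and are not results from this notebook, and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}


# Invented labels: class 8 dominates and is always predicted correctly.
y_true = [8, 8, 8, 8, 1, 1, 0, 0]
y_pred = [8, 8, 8, 8, 1, 0, 8, 8]
print(per_class_recall(y_true, y_pred))  # {0: 0.0, 1: 0.5, 8: 1.0}
```

In this toy case the overall accuracy is 62.5%, even though class 0 is never predicted correctly.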


In [14]:
%%time

# 2. DecisionTree Classifier model.

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(train_x, train_y)

# Now, let's predict on the training and testing set
# using the above trained model.
dt_train_predict = dt_model.predict(train_x)
dt_test_predict = dt_model.predict(test_x)


# Finally, let's test the accuracy of the model on both
# the datasets.
dt_train_accuracy = accuracy_score(train_y, dt_train_predict)
dt_test_accuracy = accuracy_score(test_y, dt_test_predict)

print('---------------- DecisionTree Classifier -----------------')
print(f"The predicted train set values for DT are: {dt_train_predict}")
print(f"The predicted test set values for DT are: {dt_test_predict}\n")
print(f"The Training accuracy of DT is: {round(dt_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of DT is: {round(dt_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

---------------- DecisionTree Classifier -----------------
The predicted train set values for DT are: [8 8 8 ... 9 9 9]
The predicted test set values for DT are: [8 8 8 ... 1 1 1]

The Training accuracy of DT is: 100.0%
The Testing accuracy of DT is: 78.68%
----------------------------------------------------------

CPU times: user 165 ms, sys: 4.02 ms, total: 169 ms
Wall time: 169 ms


In [15]:
%%time

# 3. Gaussian Naive Bayes model.

from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model.fit(train_x, train_y)

# Now, let's predict on the training and testing set
# using the above trained model.
gnb_train_predict = gnb_model.predict(train_x)
gnb_test_predict = gnb_model.predict(test_x)


# Finally, let's test the accuracy of the model on both
# the datasets.
gnb_train_accuracy = accuracy_score(train_y, gnb_train_predict)
gnb_test_accuracy = accuracy_score(test_y, gnb_test_predict)

print('----------------- Naive Bayes Classifier -----------------')
print(f"The predicted train set values for GNB are: {gnb_train_predict}")
print(f"The predicted test set values for GNB are: {gnb_test_predict}\n")
print(f"The Training accuracy of GNB is: {round(gnb_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of GNB is: {round(gnb_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

----------------- Naive Bayes Classifier -----------------
The predicted train set values for GNB are: [1 1 1 ... 0 0 0]
The predicted test set values for GNB are: [6 6 6 ... 6 6 6]

The Training accuracy of GNB is: 27.61%
The Testing accuracy of GNB is: 4.25%
----------------------------------------------------------

CPU times: user 75.4 ms, sys: 3.99 ms, total: 79.4 ms
Wall time: 78.7 ms


In [16]:
%%time

# 4. K-Nearest Neighbors Classifier model.

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_model.fit(train_x, train_y)

# Now, let's predict on the training and testing set
# using the above trained model.
knn_train_predict = knn_model.predict(train_x)
knn_test_predict = knn_model.predict(test_x)


# Finally, let's test the accuracy of the model on both
# the datasets.
knn_train_accuracy = accuracy_score(train_y, knn_train_predict)
knn_test_accuracy = accuracy_score(test_y, knn_test_predict)

print('--------------------- KNN Classifier ---------------------')
print(f"The predicted train set values for KNN are: {knn_train_predict}")
print(f"The predicted test set values for KNN are: {knn_test_predict}\n")
print(f"The Training accuracy of KNN is: {round(knn_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of KNN is: {round(knn_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

--------------------- KNN Classifier ---------------------
The predicted train set values for KNN are: [8 8 0 ... 0 1 0]
The predicted test set values for KNN are: [0 0 0 ... 0 0 0]

The Training accuracy of KNN is: 45.32%
The Testing accuracy of KNN is: 26.64%
----------------------------------------------------------

CPU times: user 5.4 s, sys: 11.9 ms, total: 5.41 s
Wall time: 5.42 s
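
K-Nearest Neighbors relies on distances between feature vectors, so features with large numeric ranges (here `Time`, in seconds) can dominate small-range features such as `lat` and `lon`. This may partly explain the low score above, although that has not been verified in this notebook. The sketch below shows z-score standardization in plain Python on made-up rows; with Scikit-Learn, `sklearn.preprocessing.StandardScaler` serves the same purpose. `standardize` is an illustrative helper name.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows of [lat, lon, seconds since midnight].
rows = [
    [11.90, 57.68, 785.0],
    [11.95, 57.70, 40000.0],
    [11.80, 57.60, 86000.0],
]
scaled = standardize(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

The scaler should be fitted on the training data only and then applied unchanged to the test data.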


In [17]:
%%time

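# 5. K-Means model.
# Note: KMeans is an unsupervised clustering algorithm, not a
# classifier. fit() ignores train_y, and the cluster ids it
# returns are arbitrary, so they are not ship type codes.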

from sklearn.cluster import KMeans

km_model = KMeans(15)
km_model.fit(train_x, train_y)

# Now, let's assign the training and testing points to
# clusters using the above fitted model.
km_train_predict = km_model.predict(train_x)
km_test_predict = km_model.predict(test_x)


# Finally, compare the cluster ids with the ship type codes.
# Because the cluster ids are arbitrary, this score is not a
# meaningful classification accuracy.
km_train_accuracy = accuracy_score(train_y, km_train_predict)
km_test_accuracy = accuracy_score(test_y, km_test_predict)

print('------------------- K-Means Classifier --------------------')
print(f"The predicted train set values for KM are: {km_train_predict}")
print(f"The predicted test set values for KM are: {km_test_predict}\n")
print(f"The Training accuracy of KM is: {round(km_train_accuracy*100, 2)}%")
print(f"The Testing accuracy of KM is: {round(km_test_accuracy*100, 2)}%")
print('----------------------------------------------------------\n')

------------------- K-Means Classifier --------------------
The predicted train set values for KM are: [ 4  4  4 ... 11 11 11]
The predicted test set values for KM are: [5 5 5 ... 5 5 5]

The Training accuracy of KM is: 7.63%
The Testing accuracy of KM is: 5.08%
----------------------------------------------------------

CPU times: user 19.3 s, sys: 492 ms, total: 19.8 s
Wall time: 1.81 s
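
Because K-Means cluster ids carry no relation to the ship type codes, comparing them directly with `accuracy_score` says little about cluster quality. One common workaround is to map each cluster to the most frequent true label among its members and score the mapped labels instead. The sketch below illustrates this with invented data. `map_clusters_to_labels` is a hypothetical helper, and this is not what the cell above computes.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map every cluster id to the majority true label of its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Invented cluster assignments and ship type codes.
clusters = [4, 4, 4, 11, 11, 2]
labels = [8, 8, 1, 9, 9, 0]

mapping = map_clusters_to_labels(clusters, labels)
mapped = [mapping[c] for c in clusters]
accuracy = sum(m == y for m, y in zip(mapped, labels)) / len(labels)
print(mapping)             # {4: 8, 11: 9, 2: 0}
print(round(accuracy, 3))  # 0.833
```

The mapping should be learned on the training set and then reused on the test set, otherwise the score is optimistic.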
