## Business problem

Port terminals constantly strive to improve the efficiency of their operations through careful
management of their berth facilities, machinery and personnel.

The most important variable when planning terminal operations is knowing which vessels will arrive at the terminal and when.


MarineTraffic aims to be the best visibility provider by offering up-to-date vessel tracking data (using AIS), as well as additional derived information such as the estimated time of arrival (ETA) of a vessel at a port of interest.


AIS messages contain information on the port that the vessel is traveling to as well as the estimated time of arrival.

However, since ports may consist of more than one terminal, the exact terminal that the
vessel will visit is not known in advance. This makes it difficult for MarineTraffic to assign future arrivals to terminals, which in turn limits its ability to measure terminal congestion and calculate more accurate terminal arrival times.



A model that predicts the terminal a vessel will travel to has the potential to help all parties involved in a port call to plan their operations more effectively.

# Data Description

A dataset has been extracted containing container calls at terminals that took place during the past 3
years for the Port of **Hamburg and Port of Los Angeles**. The dataset contains the following fields:


* **last_port**: Port where the *last* terminal call by vessel is recorded.



* **last_terminal**: The immediately previous *terminal* call of the vessel.


* **last_terminal_doc_timestamp**: Timestamp of previous terminal call.


* **current_port**: Port where the current terminal call by vessel is recorded.


* **current_terminal**: Current terminal call of the vessel.


* **shipname**: Name of the vessel.


* **doc_timestamp**: Timestamp of current terminal call.


* **GRT**: Vessel capacity (gross tonnage unit).


* **TEU**: Vessel capacity (twenty-foot equivalent units).


* **length**: Vessel length.


* **width**: Vessel width.

# Goals and Deliverable


The goal of this task is to implement a solution that predicts the terminal
that a vessel will call at, and to evaluate its accuracy.
**The product team claims that the history of terminals visited by a vessel in the past is a critical factor
that should be incorporated into the model**.

Some important steps that your solution would be expected to address and describe are the following:
* What features have you finally selected and engineered for your modeling approach? What led you to these choices? Why & how have you processed them?
* Which features seem to be the most important & how did you evaluate their importance?
* Do your findings agree with the product team’s insights discussed above? Before developing an ML model, how would you evaluate the importance/predictive power of one of the product-identified features as an independent variable?
* What type of prediction/training model have you chosen and why?
* How well does your predictive solution perform in terms of predicting the terminal a vessel will call?
* What different metrics/graphs can you use in order to understand when & why the algorithm fails/succeeds?
* What would be your baseline (i.e. a “naive” approach) to compare against?
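
One way to gauge the predictive power of a single categorical feature before any model is trained (the question raised in the list above) is a contingency-table statistic such as Cramér's V. The sketch below is only an illustration on made-up data: the `prev_terminal` column and the helper `cramers_v` are hypothetical names and are not part of the dataset or the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(feature: pd.Series, target: pd.Series) -> float:
    """Cramér's V: 0 means no association, 1 means perfect association."""
    table = pd.crosstab(feature, target)
    # Uncorrected chi-squared statistic of the contingency table.
    chi2 = chi2_contingency(table.to_numpy(), correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# Toy data (hypothetical): previously visited terminal vs. terminal actually called.
toy = pd.DataFrame({
    "prev_terminal": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "current_terminal": ["A", "A", "B", "B", "A", "B", "B", "A"],
})
v = cramers_v(toy["prev_terminal"], toy["current_terminal"])
print(f"Cramér's V = {v:.2f}")
```

A value close to 1 would support the product team's claim for that feature, while a value close to 0 would suggest it carries little information on its own.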

# Solution Procedure

### Problem understanding & EDA

In [99]:
# In order to understand the problem we have to dive into the data

In [100]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

In [101]:
raw_data = pd.read_csv("data/mt_terminal_calls.csv")

In [102]:
raw_data.head(5)

Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
0,BREMERHAVEN,North Sea Terminal,2020-01-02 16:12:00.000,HAMBURG,Eurogate Container Terminal Hamburg,HEINRICH EHLER,2020-01-03 06:20:00.000,17488,1421,168.11,26.8
1,TILBURY,London Container Terminal,2020-01-01 21:56:00.000,HAMBURG,C. Steinweg Multipurpose Terminal,HENNEKE RAMBOW,2020-01-03 07:45:00.000,9981,868,134.4,22.74
2,ROTTERDAM MAASVLAKTE,Rotterdam World Gateway Terminal,2020-01-02 12:06:00.000,HAMBURG,HHLA Container Terminal Burchardkai,CMA CGM TANGER,2020-01-03 20:20:00.000,9966,1118,147.8,23.28
3,ROTTERDAM WAALHAVEN,RST Waalhaven,2020-01-01 16:05:00.000,HAMBURG,HHLA Container Terminal Burchardkai,NIEVES B,2020-01-03 20:28:00.000,10318,1036,151.72,23.4
4,GDYNIA,Gdynia Container Terminal,2020-01-01 14:45:00.000,HAMBURG,HHLA Container Terminal Burchardkai,JUDITH,2020-01-04 05:05:00.000,16023,1440,170.02,25.19


To make sense of the product team's claim about the importance of each ship's history,
we first check whether any of the ships have actually performed a "full circle" from HAMBURG/LA back to HAMBURG/LA.

In [103]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12054 entries, 0 to 12053
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   last_port                    12054 non-null  object 
 1   last_terminal                12054 non-null  object 
 2   last_terminal_doc_timestamp  12054 non-null  object 
 3   current_port                 12054 non-null  object 
 4   current_terminal             12054 non-null  object 
 5   shipname                     12054 non-null  object 
 6   doc_timestamp                12054 non-null  object 
 7   grt                          12054 non-null  int64  
 8   teu                          12054 non-null  int64  
 9   length                       12054 non-null  float64
 10  width                        12054 non-null  float64
dtypes: float64(2), int64(2), object(7)
memory usage: 1.0+ MB


In [104]:
last_ports = raw_data.loc[:, "last_port"].unique().tolist()

In [105]:
current_port = raw_data.loc[:, "current_port"].unique().tolist()

In [106]:
print(f"{len(current_port)}")

2


In [107]:
print(f"{len(last_ports)}")

165


In [108]:
set(current_port).intersection(last_ports)

set()

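The empty set shows that neither HAMBURG nor LOS ANGELES ever appears as a `last_port`, so no ship in the dataset sails directly from one of these ports back to either of them.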

### Let's investigate the timeline of some ships

In [109]:
print(f"the unique ships in this dataset are {len(raw_data.shipname.unique())}")

the unique ships in this dataset are 1397


In [110]:
ships_gone_both = []

In [111]:
for ship in raw_data.shipname.unique():
    dest = len(raw_data.loc[raw_data["shipname"]==ship,"current_port"].unique())
    if dest !=1:
        ships_gone_both.append(ship)

In [112]:
print(f"{len(ships_gone_both)} have gone to both ports.")

101 have gone to both ports.
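
The loop above can also be written as a single grouped aggregation. The following is a minimal sketch on a toy frame that uses the same column names as the dataset; the ship names are made up.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (ship names are invented).
toy = pd.DataFrame({
    "shipname": ["X", "X", "Y", "Z", "Z"],
    "current_port": ["HAMBURG", "LOS ANGELES", "HAMBURG", "HAMBURG", "HAMBURG"],
})

# Count distinct destination ports per ship, then keep ships with more than one.
ports_per_ship = toy.groupby("shipname")["current_port"].nunique()
ships_both = ports_per_ship[ports_per_ship > 1].index.tolist()
print(ships_both)
```

On the real data, replacing `toy` with `raw_data` should yield the same list as `ships_gone_both`, although possibly in a different order.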


In [113]:
user_defined = 5
for ship_no,(name,group) in enumerate(raw_data.groupby("shipname")):
    print(f"ship: {name}")
    display(group.sort_values(by="last_terminal_doc_timestamp"))
    if ship_no > user_defined -1:
        break

ship: A DAISEN


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
7497,SHANGHAI,Shanghai Mingdong Container Terminal,2021-09-08 06:13:00.000,LOS ANGELES,Everport Services Terminal,A DAISEN,2021-10-07 11:55:00.000,18326,1740,175.46,27.69
9058,QINGDAO,QQCT,2022-01-12 07:56:00.000,LOS ANGELES,TraPac Terminal,A DAISEN,2022-02-17 11:00:00.000,18326,1740,175.46,27.69
9666,NINGBO,Beilun International Container Terminals,2022-03-22 14:58:00.000,LOS ANGELES,TraPac Terminal,A DAISEN,2022-04-12 10:23:00.000,18326,1740,175.46,27.69
10272,NINGBO,Beilun International Container Terminals,2022-05-13 15:25:00.000,LOS ANGELES,Yusen Terminal Inc.,A DAISEN,2022-06-03 22:19:00.000,18326,1740,175.46,27.69


ship: A LA MARINE


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
122,ST PETERSBURG,Container Terminal Saint-Petersburg,2020-01-06 15:48:00.000,HAMBURG,Eurogate Container Terminal Hamburg,A LA MARINE,2020-01-14 22:49:00.000,16023,1440,170.06,25.2
287,ST PETERSBURG,Container Terminal Saint-Petersburg,2020-01-22 23:14:00.000,HAMBURG,Eurogate Container Terminal Hamburg,A LA MARINE,2020-01-28 23:34:00.000,16023,1440,170.06,25.2
446,ST PETERSBURG,Bulk cargo Quay,2020-02-05 22:01:00.000,HAMBURG,Eurogate Container Terminal Hamburg,A LA MARINE,2020-02-12 01:00:00.000,16023,1440,170.06,25.2
586,ST PETERSBURG,Container Terminal Saint-Petersburg,2020-02-20 20:52:00.000,HAMBURG,Cruise Center Steinwerder,A LA MARINE,2020-02-25 11:22:00.000,16023,1440,170.06,25.2
771,ST PETERSBURG,Container Terminal Saint-Petersburg,2020-03-06 22:48:00.000,HAMBURG,HHLA Container Terminal Burchardkai,A LA MARINE,2020-03-12 10:42:00.000,16023,1440,170.06,25.2
917,BREMERHAVEN,EUROGATE CTB,2020-03-24 07:17:00.000,HAMBURG,HHLA Container Terminal Tollerort,A LA MARINE,2020-03-25 13:31:00.000,16023,1440,170.06,25.2
1034,BREMERHAVEN,North Sea Terminal,2020-04-05 02:03:00.000,HAMBURG,HHLA Container Terminal Burchardkai,A LA MARINE,2020-04-05 19:26:00.000,16023,1440,170.06,25.2
1085,BREMERHAVEN,EUROGATE CTB,2020-04-08 21:15:00.000,HAMBURG,Eurogate Container Terminal Hamburg,A LA MARINE,2020-04-09 12:05:00.000,16023,1440,170.06,25.2
1182,COPENHAGEN,CMP CONTAINER TERMINAL,2020-04-17 06:45:00.000,HAMBURG,HHLA Container Terminal Altenwerder,A LA MARINE,2020-04-18 23:34:00.000,16023,1440,170.06,25.2
1264,BREMERHAVEN,EUROGATE CTB,2020-04-20 20:50:00.000,HAMBURG,Eurogate Container Terminal Hamburg,A LA MARINE,2020-04-25 03:51:00.000,16023,1440,170.06,25.2


ship: A.IDEFIX


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
12036,NINGBO,Daxie China Merchants ICT,2022-11-05 05:47:00.000,LOS ANGELES,TraPac Terminal,A.IDEFIX,2022-11-23 12:04:00.000,18263,1700,182.0,26.0


ship: ADAMS


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
878,ANTWERP,PSA Europa Terminal,2020-03-20 00:37:00.000,HAMBURG,HHLA Container Terminal Altenwerder,ADAMS,2020-03-22 12:14:00.000,66462,5928,279.7,40.0
1430,ANTWERP,PSA Europa Terminal,2020-05-08 00:39:00.000,HAMBURG,HHLA Container Terminal Altenwerder,ADAMS,2020-05-11 02:02:00.000,66462,5928,279.7,40.0
1973,ANTWERP,PSA Europa Terminal,2020-06-26 23:00:00.000,HAMBURG,HHLA Container Terminal Altenwerder,ADAMS,2020-06-29 12:16:00.000,66462,5928,279.7,40.0


ship: ADELINA D


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
7354,BREMERHAVEN,MSC Gate,2021-09-24 11:11:00.000,HAMBURG,Eurogate Container Terminal Hamburg,ADELINA D,2021-09-25 05:21:00.000,15487,1578,168.0,25.3
7479,BREMERHAVEN,North Sea Terminal,2021-10-05 04:45:00.000,HAMBURG,Eurogate Container Terminal Hamburg,ADELINA D,2021-10-05 22:17:00.000,15487,1578,168.0,25.3
7627,BREMERHAVEN,North Sea Terminal,2021-10-16 23:36:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2021-10-17 16:42:00.000,15487,1578,168.0,25.3
7743,NORVIK,Stockholm Norvik Container Terminal,2021-10-25 05:24:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2021-10-27 12:19:00.000,15487,1578,168.0,25.3
7886,BREMERHAVEN,North Sea Terminal,2021-11-06 07:23:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2021-11-07 09:47:00.000,15487,1578,168.0,25.3
8000,GAVLE,Yilport Container Terminal,2021-11-12 12:34:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2021-11-17 14:04:00.000,15487,1578,168.0,25.3
8138,BREMERHAVEN,EUROGATE CTB,2021-11-28 11:28:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2021-11-29 04:07:00.000,15487,1578,168.0,25.3
8312,BREMERHAVEN,EUROGATE CTB,2021-12-13 00:14:00.000,HAMBURG,HHLA Container Terminal Tollerort,ADELINA D,2021-12-14 06:17:00.000,15487,1578,168.0,25.3
8430,BREMERHAVEN,EUROGATE CTB,2021-12-22 21:06:00.000,HAMBURG,Eurogate Container Terminal Hamburg,ADELINA D,2021-12-23 22:16:00.000,15487,1578,168.0,25.3
8554,GAVLE,Yilport Container Terminal,2021-12-31 13:16:00.000,HAMBURG,HHLA Container Terminal Burchardkai,ADELINA D,2022-01-06 01:44:00.000,15487,1578,168.0,25.3


ship: ADILIA I


Unnamed: 0,last_port,last_terminal,last_terminal_doc_timestamp,current_port,current_terminal,shipname,doc_timestamp,grt,teu,length,width
3217,ROTTERDAM MAASVLAKTE,Hutchison Ports Delta II,2020-10-07 20:12:00.000,HAMBURG,HHLA Container Terminal Altenwerder,ADILIA I,2020-10-09 04:33:00.000,9701,830,140.48,22.84
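
The timelines above show that some vessels return to the same terminal on consecutive calls (ADAMS, for example, always calls at HHLA Container Terminal Altenwerder). One possible way to turn that history into a feature, sketched here on toy data, is to look up the terminal of the vessel's previous call at the same port. The column name `prev_terminal` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy calls with the dataset's column names (values are invented).
toy = pd.DataFrame({
    "shipname": ["S1", "S1", "S1", "S2", "S2"],
    "current_port": ["HAMBURG"] * 5,
    "current_terminal": ["T1", "T1", "T2", "T3", "T3"],
    "doc_timestamp": pd.to_datetime([
        "2020-01-01", "2020-02-01", "2020-03-01", "2020-01-15", "2020-02-15",
    ]),
})

# Order each vessel's calls in time, then shift by one call within (ship, port)
# so every row only sees information from strictly earlier calls.
toy = toy.sort_values(["shipname", "doc_timestamp"])
toy["prev_terminal"] = (
    toy.groupby(["shipname", "current_port"])["current_terminal"].shift(1)
)

# Share of calls (with known history) that repeat the previous terminal.
known = toy.dropna(subset=["prev_terminal"])
repeat_rate = (known["prev_terminal"] == known["current_terminal"]).mean()
print(f"repeat rate: {repeat_rate:.2f}")
```

The repeat rate doubles as a quick sanity check of the product team's claim: a rate well above the share of the most common terminal would indicate that history is informative.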


In [114]:
terminal_ham = raw_data.loc[raw_data["current_port"]=="HAMBURG", "current_terminal"].unique().tolist()

In [115]:
terminal_la = raw_data.loc[raw_data["current_port"]=="LOS ANGELES", "current_terminal"].unique().tolist()

In [116]:
# Essentially these here are our labels.

In [117]:
set(terminal_ham).intersection(set(terminal_la)),set(terminal_la).intersection(set(terminal_ham))

(set(), set())

In [118]:
print("possible final terminals", len(terminal_ham)+len(terminal_la))

possible final terminals 19


In [119]:
raw_data.columns

Index(['last_port', 'last_terminal', 'last_terminal_doc_timestamp',
       'current_port', 'current_terminal', 'shipname', 'doc_timestamp', 'grt',
       'teu', 'length', 'width'],
      dtype='object')

In [120]:
raw_data["port_terminal"] = raw_data.apply(lambda x: f"{x['current_port']}_{x['current_terminal']}",axis=1)

In [121]:
print(raw_data["port_terminal"].unique().shape[0])

19


In [122]:
raw_data.apply(lambda x: f"{x['last_port']}_{x['last_terminal']}",axis=1).unique().shape

(269,)

In [123]:
# If we encode the last port together with the last terminal, we get a very sparse matrix,
# since that feature would need a one-hot representation with 269 columns.
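
One common way to keep a high-cardinality categorical feature while limiting sparsity is to merge rare categories into a single bucket before one-hot encoding. The sketch below is an assumption about how that could look; the helper `collapse_rare`, the threshold and the `"OTHER"` label are hypothetical and do not appear in the notebook.

```python
import pandas as pd


def collapse_rare(values: pd.Series, min_count: int, other: str = "OTHER") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    return values.where(values.isin(frequent), other)


# Toy port_terminal strings (invented): two frequent pairs and two rare ones.
toy = pd.Series(["P1_T1"] * 4 + ["P2_T2"] * 3 + ["P3_T3", "P4_T4"])
collapsed = collapse_rare(toy, min_count=3)
print(collapsed.nunique())  # number of categories left after collapsing
```

To avoid leakage, the frequency counts would need to come from the training split only.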

In [124]:
# first attempt

In [125]:
raw_data.columns

Index(['last_port', 'last_terminal', 'last_terminal_doc_timestamp',
       'current_port', 'current_terminal', 'shipname', 'doc_timestamp', 'grt',
       'teu', 'length', 'width', 'port_terminal'],
      dtype='object')

In [126]:
kept_cols =['last_terminal_doc_timestamp', 'current_port', 'current_terminal', 'shipname', 'doc_timestamp', 'grt','teu', 'length', 'width', 'port_terminal']

In [127]:
# The travel time between the two calls (a rough proxy for distance) could potentially be a stronger feature than everything else.

In [128]:
kept_cols.append("diff")

In [129]:
raw_data["diff"] = (pd.to_datetime(raw_data.loc[:, "doc_timestamp"]) - pd.to_datetime(raw_data.loc[:, "last_terminal_doc_timestamp"]))

In [130]:
raw_data = raw_data.loc[:, kept_cols]

In [131]:
raw_data = raw_data.loc[:, ["grt","teu","length", "width", "diff", "port_terminal"]]

In [132]:
raw_data.describe()

Unnamed: 0,grt,teu,length,width,diff
count,12054.0,12054.0,12054.0,12054.0,12054
mean,60620.134063,5541.533018,244.204202,34.706638,6 days 17:26:49.094076655
std,57664.198201,5446.331962,90.141575,11.647131,9 days 00:47:17.448698429
min,3992.0,508.0,99.95,17.9,0 days 09:39:00
25%,11662.0,1036.0,154.52,23.53,1 days 21:54:15
50%,40146.0,4045.0,259.0,32.2,2 days 22:20:30
75%,95680.0,8566.0,333.0,45.6,6 days 22:08:15
max,236583.0,24004.0,400.0,62.0,197 days 21:37:00


In [133]:
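# NOTE: Timedelta.seconds is only the sub-day remainder (0 to 86399); whole days are
# discarded. Timedelta.total_seconds() would return the full elapsed time instead.
raw_data["diff"] = raw_data.apply(lambda x: x['diff'].seconds,axis=1)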

In [134]:
raw_data

Unnamed: 0,grt,teu,length,width,diff,port_terminal
0,17488,1421,168.11,26.80,50880,HAMBURG_Eurogate Container Terminal Hamburg
1,9981,868,134.40,22.74,35340,HAMBURG_C. Steinweg Multipurpose Terminal
2,9966,1118,147.80,23.28,29640,HAMBURG_HHLA Container Terminal Burchardkai
3,10318,1036,151.72,23.40,15780,HAMBURG_HHLA Container Terminal Burchardkai
4,16023,1440,170.02,25.19,51600,HAMBURG_HHLA Container Terminal Burchardkai
...,...,...,...,...,...,...
12049,39753,3100,228.15,32.25,71760,HAMBURG_Eurogate Container Terminal Hamburg
12050,35835,3078,223.30,32.20,15720,HAMBURG_Eurogate Container Terminal Hamburg
12051,17982,1380,169.95,26.90,59400,HAMBURG_HHLA Container Terminal Altenwerder
12052,10585,1000,151.72,23.40,69300,HAMBURG_Eurogate Container Terminal Hamburg
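
The `diff` column above holds only the sub-day part of each interval, because `Timedelta.seconds` drops whole days. Row 1 of the dataset (TILBURY to HAMBURG) illustrates the difference between `.seconds` and `.total_seconds()`; the timestamps below are copied from that row.

```python
import pandas as pd

# Timestamps of row 1: last call 2020-01-01 21:56, current call 2020-01-03 07:45.
delta = pd.Timestamp("2020-01-03 07:45") - pd.Timestamp("2020-01-01 21:56")

print(delta)                  # 1 days 09:49:00
print(delta.seconds)          # 35340 (only the 09:49:00 part)
print(delta.total_seconds())  # 121740.0 (includes the full day)

# Vectorised form for a whole column of Timedelta values.
full_seconds = pd.Series([delta]).dt.total_seconds()
```

Note that switching the notebook to `.dt.total_seconds()` would change the `diff` values shown above and, with them, the scores reported further down.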


# First simple baseline

In [135]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

encoder.fit(raw_data.loc[:,["port_terminal"]].values.reshape(-1,1))



In [136]:
# since we actually do know the destination port, and those two ports do have different
# number of

In [137]:
model = LogisticRegression()

In [138]:
from sklearn.model_selection import train_test_split

In [139]:
X_train, X_test, y_train, y_test = train_test_split(raw_data.loc[:, raw_data.columns!="port_terminal"], raw_data.loc[:,["port_terminal"]], test_size=0.2, random_state=42)

In [140]:
y_train = encoder.transform(y_train)
y_train = np.argmax(y_train, axis=1)
y_train = pd.DataFrame(y_train,columns=["label"])
y_test = encoder.transform(y_test)
y_test = np.argmax(y_test, axis=1)
y_test = pd.DataFrame(y_test,columns=["label"])
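
Since the target is a single categorical column, the one-hot encoding followed by `argmax` can be replaced by scikit-learn's `LabelEncoder`, which maps each class directly to an integer code. A minimal sketch on invented labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Invented port_terminal labels in the same "PORT_Terminal" style.
labels = np.array(["HAMBURG_T2", "LOS ANGELES_T9", "HAMBURG_T1", "HAMBURG_T2"])

le = LabelEncoder()
codes = le.fit_transform(labels)        # integer codes, classes sorted alphabetically
decoded = le.inverse_transform(codes)   # back to the original strings

print(codes)
print(le.classes_)
```

The resulting 1-D array can be passed to `fit` as is, which would also avoid the `column_or_1d` warning raised when a one-column DataFrame is used as `y`.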



In [142]:
model.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
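
The warning above suggests scaling the features or raising `max_iter`. A possible remedy, sketched here on synthetic data rather than the real dataset, is to wrap the already imported `RobustScaler` and the classifier in a pipeline. The feature scales and the `max_iter` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (loosely like grt vs. width).
X = np.column_stack([
    rng.normal(60_000, 50_000, 300),
    rng.normal(35, 10, 300),
])
y = (X[:, 1] > 35).astype(int)  # 1-D target

# The scaler is fitted on the training data only, inside the pipeline.
clf = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
train_acc = clf.score(X, y)
print(f"training accuracy: {train_acc:.2f}")
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time.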


In [143]:
y_pred = model.predict(X_test)

In [144]:
from sklearn.metrics import f1_score

In [147]:
print(f1_score(y_test, y_pred, average="micro"))

0.24678556615512234


In [148]:
print(f1_score(y_test, y_pred, average="macro"))

0.05239921087671952


In [149]:
from sklearn.metrics import accuracy_score

In [150]:
accuracy_score(y_test, y_pred)

0.24678556615512234
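
For single-label multiclass problems the micro-averaged F1 score equals accuracy, which is why the two numbers above are identical. A macro F1 far below that value usually means that only a few classes are ever predicted. The toy example below (invented labels) shows both effects and how a confusion matrix exposes them.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 0])  # a model that always predicts class 0

acc = accuracy_score(y_true, y_hat)
micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

print(acc, micro, round(macro, 3))
print(cm)
```

Applying `confusion_matrix` to the notebook's `y_test` and `y_pred` would show which terminals the model actually predicts and which it never does.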

### Establish simple baselines
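
A candidate naive baseline, given that the destination port is known in advance, is to always predict the most frequent terminal of that port in the training data. The sketch below uses invented rows with the dataset's column names and is only one of several reasonable baselines.

```python
import pandas as pd

# Invented training and test calls using the dataset's column names.
train = pd.DataFrame({
    "current_port": ["HAMBURG", "HAMBURG", "HAMBURG", "LOS ANGELES", "LOS ANGELES"],
    "current_terminal": ["T1", "T1", "T2", "T9", "T9"],
})
test = pd.DataFrame({
    "current_port": ["HAMBURG", "LOS ANGELES", "HAMBURG"],
    "current_terminal": ["T1", "T9", "T2"],
})

# Most frequent terminal per port, learned from the training split only.
most_common = (
    train.groupby("current_port")["current_terminal"]
    .agg(lambda s: s.mode().iloc[0])
)

# Predict by looking up the port of each test call.
baseline_pred = test["current_port"].map(most_common)
baseline_acc = (baseline_pred == test["current_terminal"]).mean()
print(f"baseline accuracy: {baseline_acc:.2f}")
```

Any trained model would need to beat this number to justify its added complexity.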

### Can we do better?

# Possible future work

* Given the accuracy of your model, what would be the next steps in terms of deploying a solution such that the model’s value to the business is maximised?