## Business problem

Port terminals constantly strive to improve the efficiency of their operations through careful
management of their berth facilities, machinery, and personnel.



The most important input when planning terminal operations is knowing which vessels will arrive at the terminal and when.


MarineTraffic aims to be the best visibility provider by offering up-to-date vessel tracking data (using AIS), as well as derived information such as the estimated time of arrival (ETA) of a vessel at a port of interest.


AIS messages contain information on the port that the vessel is traveling to as well as the estimated time of arrival.

However, since a port may consist of more than one terminal, the exact terminal that the
vessel will visit is not known in advance. This makes it difficult for MarineTraffic to assign future arrivals to terminals, which in turn limits the ability to measure terminal congestion and calculate more accurate terminal arrival times.



A model that predicts the terminal a vessel will travel to has the potential to help all parties involved in a port call plan their operations more effectively.

# Data Description

A dataset has been extracted containing container calls at terminals that took place during the past 3
years for the **Port of Hamburg** and the **Port of Los Angeles**. The dataset contains the following fields:



* **last_port**: Port where the *last* terminal call by vessel is recorded.



* **last_terminal**: The immediately previous *terminal* call of the vessel.


* **last_terminal_doc_timestamp**: Timestamp of previous terminal call.


* **current_port**: Port where the current terminal call by vessel is recorded.


* **current_terminal**: Current terminal call of the vessel.


* **shipname**: Name of the vessel.


* **doc_timestamp**: Timestamp of current terminal call.


* **GRT**: Vessel capacity (gross tonnage unit).


* **TEU**: Vessel capacity (twenty-foot equivalent units).


* **length**: Vessel length.


* **width**: Vessel width.

# Goals and Deliverable


The goal of this task is to implement and evaluate the accuracy of a solution that predicts the terminal
that a vessel will call at.
**The product team claims that the history of terminals visited by a vessel in the past is a critical factor
that should be incorporated into the model**.

Some important steps that your solution would be expected to address and describe are the following:
* What features have you finally selected and engineered for your modeling approach? What led you to these choices? Why & how have you processed them?
* Which features seem to be the most important & how did you evaluate their importance?
* Do your findings agree with the product team’s insights discussed above? Before developing an ML model, how would you evaluate the importance/predictive power of one of the product-identified features as an independent variable?
* What type of prediction/training model have you chosen and why?
* How well does your predictive solution perform in terms of predicting the terminal a vessel will call?
* What different metrics/graphs can you use in order to understand when & why the algorithm fails/succeeds?
* What would be your baseline (i.e. a “naive” approach) to compare against?

# Solution Procedure

### Problem understanding & EDA

To understand the problem, we first have to dive into the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier

In [2]:
raw_data = pd.read_csv("data/mt_terminal_calls.csv")

In [3]:
raw_data.head(5)

| | last_port | last_terminal | last_terminal_doc_timestamp | current_port | current_terminal | shipname | doc_timestamp | grt | teu | length | width |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BREMERHAVEN | North Sea Terminal | 2020-01-02 16:12:00.000 | HAMBURG | Eurogate Container Terminal Hamburg | HEINRICH EHLER | 2020-01-03 06:20:00.000 | 17488 | 1421 | 168.11 | 26.8 |
| 1 | TILBURY | London Container Terminal | 2020-01-01 21:56:00.000 | HAMBURG | C. Steinweg Multipurpose Terminal | HENNEKE RAMBOW | 2020-01-03 07:45:00.000 | 9981 | 868 | 134.4 | 22.74 |
| 2 | ROTTERDAM MAASVLAKTE | Rotterdam World Gateway Terminal | 2020-01-02 12:06:00.000 | HAMBURG | HHLA Container Terminal Burchardkai | CMA CGM TANGER | 2020-01-03 20:20:00.000 | 9966 | 1118 | 147.8 | 23.28 |
| 3 | ROTTERDAM WAALHAVEN | RST Waalhaven | 2020-01-01 16:05:00.000 | HAMBURG | HHLA Container Terminal Burchardkai | NIEVES B | 2020-01-03 20:28:00.000 | 10318 | 1036 | 151.72 | 23.4 |
| 4 | GDYNIA | Gdynia Container Terminal | 2020-01-01 14:45:00.000 | HAMBURG | HHLA Container Terminal Burchardkai | JUDITH | 2020-01-04 05:05:00.000 | 16023 | 1440 | 170.02 | 25.19 |


To make sense of the product team's claim about the importance of each ship's history,
we first check whether any of the ships has actually performed a "full circle" from HAMBURG/LA back to HAMBURG/LA.

In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12054 entries, 0 to 12053
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   last_port                    12054 non-null  object 
 1   last_terminal                12054 non-null  object 
 2   last_terminal_doc_timestamp  12054 non-null  object 
 3   current_port                 12054 non-null  object 
 4   current_terminal             12054 non-null  object 
 5   shipname                     12054 non-null  object 
 6   doc_timestamp                12054 non-null  object 
 7   grt                          12054 non-null  int64  
 8   teu                          12054 non-null  int64  
 9   length                       12054 non-null  float64
 10  width                        12054 non-null  float64
dtypes: float64(2), int64(2), object(7)
memory usage: 1.0+ MB


In [5]:
last_ports = raw_data.loc[:, "last_port"].unique().tolist()

In [6]:
current_port = raw_data.loc[:, "current_port"].unique().tolist()

In [7]:
print(f"{len(current_port)}")

2


In [8]:
print(f"{len(last_ports)}")

165


In [9]:
set(current_port).intersection(last_ports)

set()

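The empty set here means that no port appears both as a current port and as a last port, so the dataset contains no direct round trips: no recorded call has HAMBURG or LOS ANGELES as its last port.

It is therefore rather safe to assume that no "last terminals" exist among the "current terminals" and vice versa.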


##### The product team claims that the history of terminals visited by a vessel in the past is a critical factor that should be incorporated into the model.

This claim would make a lot of sense if we could actually use

In [None]:
grouped_by_ship = raw_data.groupby(by="shipname")
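
One way to turn the product team's claim into something measurable is to derive, for every call, the terminal the same vessel used on its previous call at the same port, and then check how often the vessel simply repeats it. The sketch below uses a small invented frame with the dataset's column names; the feature name `prev_terminal_same_port` is hypothetical and not part of the original data.

```python
import pandas as pd

# Toy data with the same column names as the dataset; the values are invented.
calls = pd.DataFrame({
    "shipname": ["A", "A", "A", "B", "B", "A"],
    "current_port": ["HAMBURG"] * 5 + ["LOS ANGELES"],
    "current_terminal": ["T1", "T1", "T2", "T3", "T3", "T9"],
    "doc_timestamp": pd.to_datetime([
        "2020-01-01", "2020-02-01", "2020-03-01",
        "2020-01-15", "2020-02-15", "2020-04-01",
    ]),
})

# Sort chronologically so that shift(1) really returns the previous call.
calls = calls.sort_values("doc_timestamp").reset_index(drop=True)

# Hypothetical feature: terminal used on the vessel's previous call at the same port.
calls["prev_terminal_same_port"] = (
    calls.groupby(["shipname", "current_port"])["current_terminal"].shift(1)
)

# Share of calls (with known history) that repeat the previous terminal.
has_history = calls["prev_terminal_same_port"].notna()
repeat_rate = (
    calls.loc[has_history, "prev_terminal_same_port"]
    == calls.loc[has_history, "current_terminal"]
).mean()
print(f"repeat rate: {repeat_rate:.2f}")
```

A high repeat rate on the real data would support the claim before any model is trained; a rate close to the share of the most common terminal would suggest the history adds little.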

### Let's investigate the timeline of some ships

In [None]:
print(f"the unique ships in this dataset are {len(raw_data.shipname.unique())}")

In [None]:
ships_gone_both = []

In [None]:
for ship in raw_data.shipname.unique():
    dest = len(raw_data.loc[raw_data["shipname"]==ship,"current_port"].unique())
    if dest !=1:
        ships_gone_both.append(ship)

In [None]:
print(f"{len(ships_gone_both)} have gone to both ports.")

In [None]:
user_defined = 5  # number of ships to display
for ship_no, (name, group) in enumerate(raw_data.groupby("shipname")):
    # Stop before displaying more than `user_defined` ships.
    if ship_no >= user_defined:
        break
    print(f"ship: {name}")
    display(group.sort_values(by="last_terminal_doc_timestamp"))

It is evident that those ships

In [None]:
terminal_ham = raw_data.loc[raw_data["current_port"]=="HAMBURG", "current_terminal"].unique().tolist()

In [None]:
terminal_la = raw_data.loc[raw_data["current_port"]=="LOS ANGELES", "current_terminal"].unique().tolist()

In [None]:
# Essentially these here are our labels.

In [None]:
set(terminal_ham).intersection(set(terminal_la)),set(terminal_la).intersection(set(terminal_ham))

In [None]:
print("possible final terminals", len(terminal_ham)+len(terminal_la))

In [None]:
raw_data.columns

In [None]:
# Build the label by concatenating port and terminal (vectorised string concatenation).
raw_data["port_terminal"] = raw_data["current_port"] + "_" + raw_data["current_terminal"]

In [None]:
print(raw_data["port_terminal"].unique().shape[0])

In [None]:
raw_data.apply(lambda x: f"{x['last_port']}_{x['last_terminal']}",axis=1).unique().shape

In [None]:
# If we keep both the last port and the last terminal as a feature, we end up with a very sparse matrix,
# since we would have to create a one-hot representation for that feature.
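
One possible way to limit that sparsity is to pool rare port/terminal combinations into a single bucket before one-hot encoding. The sketch below is only an illustration: the values are invented, and the threshold `min_count` and the bucket name `OTHER` are assumptions that would need tuning on the real data.

```python
import pandas as pd

# Invented values for illustration; the series mimics "<last_port>_<last_terminal>".
last_stop = pd.Series(
    ["ROTTERDAM_RWG"] * 5 + ["BREMERHAVEN_NST"] * 4 + ["GDYNIA_GCT", "TILBURY_LCT"],
    name="last_stop",
)

min_count = 3  # hypothetical threshold; tune it on the real data
counts = last_stop.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories and pool everything else into a single bucket.
collapsed = last_stop.where(last_stop.isin(frequent), "OTHER")

# One column per remaining category instead of one per raw combination.
one_hot = pd.get_dummies(collapsed, prefix="last_stop")
print(one_hot.shape)
```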

In [None]:
# first attempt

What features have you finally selected and engineered for your modeling approach? What led you to these choices? Why & how have you processed them?

In [None]:
raw_data

In [None]:
raw_data.columns

In [None]:
kept_cols =['last_terminal_doc_timestamp', 'current_port', 'current_terminal', 'shipname', 'doc_timestamp', 'grt','teu', 'length', 'width', 'port_terminal']

In [None]:
# distance could potentially be a much greater feature than everything else?

In [None]:
kept_cols.append("diff")

In [None]:
raw_data["diff"] = (pd.to_datetime(raw_data.loc[:, "doc_timestamp"]) - pd.to_datetime(raw_data.loc[:, "last_terminal_doc_timestamp"]))

In [None]:
raw_data = raw_data.loc[:, kept_cols]

In [None]:
raw_data = raw_data.loc[:, ["grt","teu","length", "width", "diff", "port_terminal"]]

In [None]:
raw_data.describe()

In [None]:
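# Convert the timedelta to a number of seconds. Timedelta.seconds only holds the
# sub-day component and silently drops whole days, so use total_seconds() instead.
raw_data["diff"] = raw_data["diff"].dt.total_seconds()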
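
A note on turning timedeltas into numbers: `Timedelta.seconds` is only the seconds component within a day, while `total_seconds()` is the full duration. The small check below uses the two timestamps of the second sample row shown earlier; the variable names are only for illustration.

```python
import pandas as pd

# Timestamps taken from the second sample row (HENNEKE RAMBOW).
start = pd.Timestamp("2020-01-01 21:56:00")
end = pd.Timestamp("2020-01-03 07:45:00")
delta = end - start  # 1 day, 9 hours and 49 minutes

print(delta.seconds)          # 35340: the 9h49m part only, the full day is dropped
print(delta.total_seconds())  # 121740.0: the full duration

# The same conversion on a whole column goes through the .dt accessor.
diffs = pd.Series([delta])
hours = diffs.dt.total_seconds() / 3600
print(hours.iloc[0])
```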

In [None]:
raw_data

# first simple baseline

In [None]:
encoder = LabelEncoder()
# LabelEncoder expects a 1-D array of labels, so pass the column directly.
encoder.fit(raw_data["port_terminal"])

In [None]:
# since we actually do know the destination port, and those two ports do have different
# number of

In [None]:
model = LogisticRegression()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(raw_data.loc[:, raw_data.columns!="port_terminal"], raw_data.loc[:,["port_terminal"]], test_size=0.2, random_state=42)
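
Since the model would be used to predict future calls, a chronological split may give a more realistic estimate than a random one. The sketch below is an alternative under that assumption, not what this notebook does; it also assumes the `doc_timestamp` column is still available at split time, which would require keeping it around longer than in the cells above. The frame is invented.

```python
import pandas as pd

# Toy frame; only the column names match the dataset.
calls = pd.DataFrame({
    "doc_timestamp": pd.date_range("2020-01-01", periods=10, freq="D"),
    "teu": list(range(10)),
})

calls = calls.sort_values("doc_timestamp")
cutoff = int(len(calls) * 0.8)  # first 80% for training, mirroring test_size=0.2

train = calls.iloc[:cutoff]
test = calls.iloc[cutoff:]

# Every training call precedes every test call.
print(train["doc_timestamp"].max() < test["doc_timestamp"].min())
```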

In [None]:
# Encode the terminal labels as integers; LabelEncoder expects 1-D input.
y_train = encoder.transform(y_train.values.ravel())
y_train = pd.DataFrame(y_train, columns=["label"])
y_test = encoder.transform(y_test.values.ravel())
y_test = pd.DataFrame(y_test, columns=["label"])

In [None]:
# Pass the labels as a 1-D array to avoid scikit-learn's column-vector warning.
model.fit(X_train, y_train.values.ravel())

In [None]:
y_pred = model.predict(X_test)

In [None]:
print(f1_score(y_test, y_pred, average="micro"))

In [None]:
print(f1_score(y_test, y_pred, average="macro"))

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
model = SVC()
model.fit(X_train, y_train.values.ravel())

y_pred = model.predict(X_test)

print(f1_score(y_test, y_pred, average="micro"))
print(f1_score(y_test, y_pred, average="macro"))

In [None]:
model = RandomForestClassifier()

In [None]:
# Pass the labels as a 1-D array to avoid scikit-learn's column-vector warning.
model.fit(X_train, y_train.values.ravel())
y_pred = model.predict(X_test)

In [None]:
print(f1_score(y_test, y_pred, average="micro"))

In [None]:
print(f1_score(y_test, y_pred, average="macro"))

In [None]:
train_scaler = RobustScaler()

In [None]:
X_train = train_scaler.fit_transform(X_train)

In [None]:
X_train

In [None]:
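# Reuse the scaler fitted on the training split. Fitting a second scaler on the test
# split would scale the two splits differently and make the evaluation unreliable.
X_test = train_scaler.transform(X_test)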
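
One way to keep the scaling consistent between the splits is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaling statistics are always learned from whatever data `fit` receives. The sketch below uses synthetic data as a stand-in for the numeric features; it illustrates the pattern rather than the notebook's actual results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the numeric features and the encoded terminal labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training split only;
# predict() and score() reuse them on the test split.
clf = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```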

In [None]:
model = RandomForestClassifier()

In [None]:
model.fit(X_train, y_train.values.ravel())

y_pred = model.predict(X_test)

In [None]:
print(f1_score(y_test, y_pred, average="micro"))

In [None]:
print(f1_score(y_test, y_pred, average="macro"))
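
To get a first idea of which features matter most, the impurity-based importances of the random forest can be compared with permutation importances computed on held-out data. The sketch below uses synthetic data in which, by construction, only `teu` carries signal; the feature names mirror the notebook, everything else is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = ["grt", "teu", "length", "width", "diff"]
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=features)
y = (X["teu"] > 0).astype(int)  # by construction only "teu" is informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances come for free with the fitted forest.
impurity = pd.Series(forest.feature_importances_, index=features)

# Permutation importance: drop in test score when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permuted = pd.Series(perm.importances_mean, index=features)

summary = pd.DataFrame({"impurity": impurity, "permutation": permuted})
print(summary.sort_values("permutation", ascending=False))
```

Impurity-based importances tend to favour features with many distinct values, so checking them against permutation importances is a reasonable sanity check.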

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())

y_pred = model.predict(X_test)

print(f1_score(y_test, y_pred, average="micro"))
print(f1_score(y_test, y_pred, average="macro"))
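
Micro and macro F1 are single numbers, so they do not show where the model fails. A confusion matrix and a per-class report are one way to see which terminals get mixed up. The labels below are invented; in the notebook they would correspond to `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration only.
y_true = ["T1", "T1", "T1", "T2", "T2", "T3"]
y_hat = ["T1", "T1", "T2", "T2", "T2", "T1"]
labels = ["T1", "T2", "T3"]

# Rows are true terminals, columns are predicted terminals.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Per-terminal precision, recall and F1; zero_division=0 silences the warning
# for terminals that are never predicted (T3 in this toy example).
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```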

### Establish simple baselines
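
A possible naive baseline, given that the destination port is known, is to always predict the most frequent terminal of that port as seen in the training split. The sketch below shows the idea on invented data with the dataset's column names; it is not a result from the real dataset.

```python
import pandas as pd

# Toy data with the dataset's column names; the values are invented.
train = pd.DataFrame({
    "current_port": ["HAMBURG"] * 4 + ["LOS ANGELES"] * 3,
    "current_terminal": ["T1", "T1", "T2", "T1", "T8", "T9", "T9"],
})
test = pd.DataFrame({
    "current_port": ["HAMBURG", "HAMBURG", "LOS ANGELES"],
    "current_terminal": ["T1", "T2", "T9"],
})

# Most frequent terminal per port, learned on the training split only.
most_frequent = (
    train.groupby("current_port")["current_terminal"]
    .agg(lambda s: s.value_counts().idxmax())
)

# Predict that terminal for every test call at the same port.
baseline_pred = test["current_port"].map(most_frequent)
baseline_acc = (baseline_pred == test["current_terminal"]).mean()
print(f"baseline accuracy: {baseline_acc:.2f}")
```

Any trained model should beat this number to justify its added complexity.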

### Can we do better?

# Possible future work

* Given the accuracy of your model, what would be the next steps in terms of deploying a solution such that the model’s value to the business is maximised?