##### *13. Describe or show how you would create a Machine Learning Model to predict “dwell” times for the region.*

In [1]:
import pandas as pd

The crucial aspect here is preparing the dataset correctly. Our goal is to predict dwell times, so before anything else we need to derive the target itself (that is, calculate the dwell time for each vessel, identified by `mmsi`).

It is also important to keep only a reasonable list of features for model training. For example, in the previous stages we learnt that dwell time can differ depending on the navigation description. Many other features, on the other hand, are irrelevant in our context: our data represents shipments from a single port, so it makes little sense to train on location parameters (lat, lon), because they are nearly identical across records and carry no useful signal. The same applies to the port name and so forth.

Let's now get our hands dirty.

### Read the dataset

In [2]:
df = pd.read_parquet('../data/parquet')

nav_details = pd.DataFrame(df['navigation'].tolist())
vessel_details = pd.DataFrame(df['vesselDetails'].tolist())
position = pd.DataFrame(df['position'].tolist())

Drop irrelevant features, as well as those packed in JSON structures (some of these will be added back later in a flat format).

In [3]:
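# Take an explicit copy so that the column assignments below do not trigger SettingWithCopyWarning
df = df[['epochMillis', 'mmsi']].copy()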

### Feature Engineering

In [4]:
df['timestamp'] = pd.to_datetime(df['epochMillis'], unit='ms')
df['navCode'] = nav_details['navCode']

df = df.sort_values(by=['mmsi', 'timestamp'])
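# Gap between records in minutes. Note: diff() runs within each navCode group,
# while shift(-1) is then applied to the whole column in row order.
df['lead_time'] = df.groupby('navCode')['timestamp'].diff().shift(-1).dt.total_seconds() / 60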
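
Since the stated goal is a dwell time per vessel (by `mmsi`), it may help to see what a strictly per-vessel gap looks like. The following is a minimal sketch on invented toy data, not the computation used for the results in this notebook: it shifts timestamps inside each `mmsi` group, so a gap never spans two different vessels. The toy values and the `next_ts` name are assumptions made for illustration.

```python
import pandas as pd

# Toy AIS-like records; the values are invented for illustration only.
toy = pd.DataFrame({
    'mmsi': [111, 111, 111, 222, 222],
    'epochMillis': [0, 60_000, 300_000, 30_000, 150_000],
})
toy['timestamp'] = pd.to_datetime(toy['epochMillis'], unit='ms')
toy = toy.sort_values(by=['mmsi', 'timestamp'])

# Timestamp of the next record of the same vessel (NaT for a vessel's last record).
next_ts = toy.groupby('mmsi')['timestamp'].shift(-1)

# Minutes until the vessel's next record; never crosses from one vessel to another.
toy['lead_time'] = (next_ts - toy['timestamp']).dt.total_seconds() / 60

print(toy[['mmsi', 'lead_time']])
```

The last record of each vessel gets `NaN`, which a later `dropna(subset=['lead_time'])` would remove.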

Another important aspect is dealing with outliers. Our dataset is big enough that, as an experimental starting point, we can simply remove the outliers (values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`).

In [5]:
Q1 = df['lead_time'].quantile(0.25)
Q3 = df['lead_time'].quantile(0.75)
IQR = Q3 - Q1

# Flag rows outside the [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] range once and reuse the mask
is_outlier = (df['lead_time'] < (Q1 - 1.5 * IQR)) | (df['lead_time'] > (Q3 + 1.5 * IQR))
outliers = df[is_outlier]
print(outliers.shape)

df = df[~is_outlier]

(444342, 5)


We have removed 444342 outliers. This may seem like a lot at first glance, but relative to the size of the whole dataset it is only about 13%.
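
To make the IQR rule easier to reuse and to check the removed share directly, the filter could be wrapped in a small helper. This is a sketch under assumptions: the function name `iqr_mask`, the parameter `k`, and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Invented toy values with one obvious extreme.
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])
mask = iqr_mask(toy)

# mask.mean() is the share of flagged rows, since True counts as 1.
print(f'{mask.sum()} of {len(toy)} values flagged ({mask.mean():.1%})')
```

Missing values are never flagged, because comparisons with `NaN` evaluate to `False`; they would still need to be dropped separately.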

### Feature Selection

In [6]:
df['typeName'] = vessel_details['typeName']
df['navDesc'] = nav_details['navDesc']
df['courseOverGround'] = nav_details['courseOverGround']
df['speedOverGround'] = nav_details['speedOverGround']

In [7]:
df = df.dropna(subset=['lead_time'])

The four features I selected are navigation description, course over ground, speed over ground, and vessel type.

In [8]:
X = df[['navDesc', 'courseOverGround', 'speedOverGround', 'typeName']]
y = df['lead_time']

### Transforming Categorical Variables (One-Hot encoding)

In [9]:
X = pd.get_dummies(X, columns=['navDesc', 'typeName'])
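
To illustrate what this step produces, here is a small sketch on invented values (the category labels are assumptions, not taken from the dataset). It also shows one possible way to align new data to the training columns with `reindex`, since categories that are absent from new data would otherwise leave columns missing.

```python
import pandas as pd

# Invented toy rows; real navDesc labels in the dataset may differ.
toy = pd.DataFrame({
    'navDesc': ['Moored', 'Under way', 'Moored'],
    'speedOverGround': [0.0, 9.5, 0.1],
})

# Each distinct navDesc value becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['navDesc'])
print(encoded.columns.tolist())

# New data containing only one category: align it to the training columns,
# filling indicator columns that did not appear with 0.
new_rows = pd.DataFrame({'navDesc': ['Moored'], 'speedOverGround': [0.2]})
new_encoded = pd.get_dummies(new_rows, columns=['navDesc'])
new_encoded = new_encoded.reindex(columns=encoded.columns, fill_value=0)
print(new_encoded.columns.tolist())
```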

### Split the dataset and train the model

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=3)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [11]:
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score, mean_squared_error

def evaluate(y_pred, y_test):
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    medae = median_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f'Mean Squared Error (MSE): {mse}')
    print(f'Mean Absolute Error (MAE): {mae}')
    print(f'R-squared (R²): {r2}')
    print(f'Median Absolute Error: {medae}')

    return mse, mae, medae, r2

In [12]:
metrics = evaluate(y_pred, y_test)

Mean Squared Error (MSE): 22.933907487749178
Mean Absolute Error (MAE): 3.3828172567205357
R-squared (R²): 0.04388435618894415
Median Absolute Error: 1.6950060070939053


Looking at the metrics, the model is off by about 3.4 minutes of dwell time on average (MAE). $R^2$ is low (about 0.04), indicating that the model explains only a small portion of the variance in the target variable. Finally, the Median Absolute Error (about 1.7 minutes) being lower than the MAE indicates that most errors are relatively small, while a few larger errors pull the MAE up (there are still outliers that affect performance). In summary, this model can likely be improved through more extensive feature engineering and better handling of outliers. It would also be worth experimenting with different, more complex models (such as Random Forests or XGBoost).
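
As a starting point for such an experiment, a comparison loop could look like the sketch below. It runs on synthetic data generated on the spot (the features, target formula, and hyperparameters are all assumptions), so the numbers it prints say nothing about the dwell-time dataset; it only shows the mechanics of fitting several candidates on the same split and comparing their MAE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, invented purely to make the sketch runnable.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(500, 3))
y_toy = 2 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Candidate models: the shallow tree used above and a random forest.
candidates = {
    'tree_depth3': DecisionTreeRegressor(max_depth=3, random_state=42),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each candidate on the same training split and score it on the same test split.
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, estimator.predict(X_te))
    print(f'{name}: MAE = {scores[name]:.3f}')
```

On the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the split above would take the place of the synthetic arrays.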