# Workflow of the entire project
As detailed in the README file, the workflow of this project consists of four scripts which should be run one after another. This can be thought of as a DAG representing a sequence of steps.

The scripts are run in the following order:
#### DataRetrieval.ipynb
This script downloads the locations of all ships with DWT $>$ 150000.
#### GetApproximatePortCoordinates.ipynb
This script processes the locations of each vessel independently to infer the coordinates where the ship has been stationary for more than 2 hours.
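The stationary-ship idea can be illustrated with a minimal sketch. It is not the code from GetApproximatePortCoordinates.ipynb: the function name `stationary_points`, the tuple layout of `track` and the tolerance `tol_deg` are assumptions made for illustration only; the 2-hour threshold comes from the description above.

```python
from datetime import datetime, timedelta


def stationary_points(track, min_duration=timedelta(hours=2), tol_deg=0.01):
    """Return (lat, lon) averages of runs where a ship barely moves.

    track: list of (timestamp, lat, lon) tuples sorted by timestamp.
    tol_deg: hypothetical tolerance (in degrees) around the first fix of a run.
    """
    stops = []
    i, n = 0, len(track)
    while i < n:
        t0, lat0, lon0 = track[i]
        j = i
        # Extend the run while the next fix stays close to the run's first fix.
        while (j + 1 < n
               and abs(track[j + 1][1] - lat0) <= tol_deg
               and abs(track[j + 1][2] - lon0) <= tol_deg):
            j += 1
        # Keep the run only if it lasted longer than the threshold.
        if track[j][0] - t0 > min_duration:
            run = track[i:j + 1]
            stops.append((sum(p[1] for p in run) / len(run),
                          sum(p[2] for p in run) / len(run)))
        i = j + 1
    return stops


start = datetime(2020, 1, 1)
example_track = [(start + timedelta(hours=h), 1.0, 103.0) for h in range(4)]
example_track += [(start + timedelta(hours=4), 1.5, 103.5),
                  (start + timedelta(hours=5), 2.0, 104.0)]
example_stops = stationary_points(example_track)
```

Because each ship's track is handled on its own, a function of this shape can be applied to every ship independently.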
#### InferPortCentroidsAndShipCounts.ipynb
This script uses the HDBSCAN algorithm to cluster the stationary ship locations, infer the port coordinates and count the number of distinct ship arrivals per port per day. To enable HDBSCAN to run on this massive dataset, I have partitioned the locations into geohashes (precision 3); the geohashes are processed one after another to determine the port locations within each geohash envelope.
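The geohash partitioning step can be sketched as follows. This is a plain-Python illustration of the standard geohash encoding, not the code from InferPortCentroidsAndShipCounts.ipynb; the names `geohash_encode` and `bucket_by_geohash` are hypothetical, and the clustering call itself is left out.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=3):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # geohash bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def bucket_by_geohash(points, precision=3):
    """Group (lat, lon) points by geohash so each cell can be clustered alone."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[geohash_encode(lat, lon, precision)].append((lat, lon))
    return dict(buckets)


cells = bucket_by_geohash([(57.64911, 10.40744), (57.65, 10.41), (1.26, 103.8)])
# A clustering routine would then be run on cells[key] for one key at a time.
```

Processing one cell at a time bounds the number of points the clustering step sees at once. The trade-off is that a port lying on a cell boundary may be split across two cells.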
#### PredictNumberOfArrivals.ipynb
The final script predicts the number of ship arrivals for one of the ports using the FBProphet time-series library. Note that I have fit the model for only one port; because the algorithm is lightweight, the same time-series model can easily be applied to each port location independently.

#### What would I have done differently in a production / real-world use case?
I would have gone for a Spark-based pipeline leveraging AWS EMR.
All the steps detailed above are embarrassingly parallel: each is a key (ship_id, geohash, port-arrivals) --> dataframe operation which can be trivially parallelized using repartition and pandas UDF operations.
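The "key --> group operation" shape can be sketched with the standard library alone. This is a stand-in, not a Spark job: `ThreadPoolExecutor` replaces the cluster, and the names `group_by_key`, `count_fixes` and `process_groups` are hypothetical. On Spark, the same shape would roughly correspond to grouping by the key and applying a pandas function per group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_key(records, key):
    """Collect records (dicts) into lists keyed by records[key]."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return groups


def count_fixes(item):
    """Example per-group function: number of location fixes for one ship."""
    ship_id, rows = item
    return ship_id, len(rows)


def process_groups(records, key, func, max_workers=4):
    """Apply func to every (key, rows) group independently of the others."""
    groups = group_by_key(records, key)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(func, groups.items()))


records = [
    {"ship_id": 1, "lat": 1.0, "lon": 103.0},
    {"ship_id": 1, "lat": 1.1, "lon": 103.1},
    {"ship_id": 2, "lat": 51.9, "lon": 4.1},
]
fix_counts = process_groups(records, "ship_id", count_fixes)
```

No group depends on any other, which is what makes the steps straightforward to distribute.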

#### Final Note
None of the above operations consumed more than 4 GB of RAM on my machine; this can be easily verified from the code.


## Objective of this script
In this script I retrieve the locations of all ships with a deadweight $>$ 150,000. The tradeoff of not including ships with a deadweight $\le$ 150,000 is that the inferred port locations will be a bit off. I chose to download only the locations of ships with a deadweight $>$ 150,000 to ensure that the script runs in a limited timeframe; the entire dataset could have been downloaded had I let my PC run overnight. Some points to note:

* I have used PyAthena, the boto3 wrapper for Athena, to download the ship locations.
* The ship locations are stored in parquet format, partitioned by ship ID. The reason for this partitioning is that I process each ship's locations independently (in the next script) to figure out where the ship has been stationary for a long time, which would correspond to port locations.
* Finally, I have used joblib to parallelize the entire operation.
* This script took around 2 hours to download the dataset for these ships (as parquet) to a local directory, where it occupies 1.06 GB.
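The download is issued in small batches of ship IDs, one Athena query per batch. The batching can be sketched as below; `chunked` and `build_query` are hypothetical helpers, not part of this notebook, and the column list is copied from the query used in the download function. Unlike the `zip(*[iter(...)]*4)` idiom, this version keeps a trailing partial batch and renders a single-element batch as valid SQL.

```python
def chunked(items, size):
    """Yield consecutive tuples of at most `size` items, keeping any remainder."""
    for start in range(0, len(items), size):
        yield tuple(items[start:start + size])


def build_query(imo_batch):
    """Render one batch of IMO numbers as a query with a SQL IN list."""
    # Joining explicitly avoids str((5,)) == "(5,)", which is not valid SQL.
    in_list = ", ".join(str(int(imo)) for imo in imo_batch)
    return (
        "select movementdatetime, movestatus, latitude, longitude, length, lrimoshipno "
        f"from ais where lrimoshipno IN ({in_list})"
    )


batches = list(chunked([1, 2, 3, 4, 5], 4))
queries = [build_query(batch) for batch in batches]
```

Each resulting query string could then be handed to a worker in the same way the notebook passes `str(ship_nos)` to its download function.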

#### Libraries used
Pandas, PyAthena and joblib

In [1]:
import pandas as pd
import numpy as np
import datetime
import time
import json
from tqdm import tqdm
from os.path import expanduser
from IPython.display import display, HTML
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from joblib import Parallel, delayed
%matplotlib inline

In [2]:
cursor = connect(profile_name="abhinav.sunderrajan",
                     s3_staging_dir='s3://data-science-athena-challenge/athena-output/',
                     schema_name='prophesea_staging',
                     region_name='eu-west-2',
                     work_group="data-candidate-1",
                     cursor_class=PandasCursor).cursor()

In [3]:
query = "select imo from vessel_temp where dwt>150000"
imos = cursor.execute(query).as_pandas()
cursor.close()
imos_list = list(imos.iloc[:, 0])
len(imos_list)

1648

In [4]:
def get_data_from_athena_and_write_to_parquet(ship_no):
    # ship_no is the string form of a tuple of IMO numbers, so it can be
    # dropped straight into the SQL IN clause.
    query = f"""
    select movementdatetime,movestatus,latitude,
    longitude,length,lrimoshipno
    from ais where lrimoshipno IN {ship_no}
    """

    # Each call opens (and closes) its own connection.
    cursor = connect(profile_name="abhinav.sunderrajan",
                     s3_staging_dir='s3://data-science-athena-challenge/athena-output/',
                     schema_name='prophesea_staging',
                     region_name='eu-west-2',
                     work_group="data-candidate-1",
                     cursor_class=PandasCursor).cursor()
    vessel_locations = cursor.execute(query).as_pandas()
    cursor.close()
    # Cast pandas nullable Int64 columns to plain int
    # (this raises if a column contains missing values).
    for column, dtype in vessel_locations.dtypes.items():
        if "Int64" in str(dtype):
            vessel_locations[column] = vessel_locations[column].astype(int)

    vessel_locations.drop_duplicates(inplace=True)
    if vessel_locations.shape[0] > 0:
        # Write to a parquet dataset partitioned by ship ID.
        vessel_locations.to_parquet(
            "data/vessel_locations", partition_cols=["lrimoshipno"],
            engine='pyarrow', compression='gzip')

In [6]:
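# zip(*[iter(imos_list)]*4) yields the IMO numbers in tuples of 4; a trailing
# partial group would be dropped, but 1648 is divisible by 4.
Parallel(n_jobs=6)(delayed(get_data_from_athena_and_write_to_parquet)(str(ship_nos))
                   for ship_nos in tqdm(zip(*[iter(imos_list)]*4)))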


0it [00:00, ?it/s]
12it [02:00, 10.01s/it]
13it [02:48, 21.46s/it]
14it [02:52, 16.23s/it]
15it [03:03, 14.60s/it]
16it [03:16, 14.15s/it]
17it [03:35, 15.69s/it]
18it [05:01, 36.84s/it]
19it [05:22, 31.90s/it]
20it [05:29, 24.53s/it]
21it [05:41, 20.73s/it]
22it [06:29, 29.00s/it]
23it [06:44, 24.74s/it]
24it [07:39, 33.97s/it]
25it [07:40, 23.95s/it]
26it [07:46, 18.50s/it]
27it [08:36, 27.89s/it]
28it [08:52, 24.53s/it]
29it [09:08, 21.85s/it]
30it [09:24, 20.26s/it]
31it [10:35, 35.23s/it]
32it [10:43, 27.10s/it]
33it [11:26, 31.94s/it]
34it [11:33, 24.43s/it]
35it [11:55, 23.77s/it]
36it [12:09, 20.70s/it]
37it [13:17, 35.15s/it]
38it [13:19, 25.14s/it]
39it [13:49, 26.48s/it]
40it [14:17, 26.86s/it]
41it [14:24, 21.14s/it]
42it [14:47, 21.68s/it]
43it [15:37, 30.09s/it]
44it [15:47, 23.91s/it]
45it [16:16, 25.63s/it]
46it [16:29, 21.88s/it]
47it [17:17, 29.52s/it]
48it 

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,