# Notebook to run the Porto Taxi Trajectory Similarity experiment

### Step 1

Open the code/global_variables.py file, [or just click here](global_variables.py), and edit the values to fit the given experiment: the name of the chosen subset ("subset-*size*"), the size of the subset, and the coordinates of the geographical area.


### Step 2

Make sure you have the needed files/folders for the chosen subset.
 - in data/raw_data there must be a .csv file with the subset of the chosen size. If not, it must be uploaded.
 - in data/raw_data there must be a .csv file with the bus routes to be used. If not, it must be uploaded.

 - in data/chosen_data there must be a folder with the same name as global_variables.CHOSEN_SUBSET_NAME. If not, create this empty folder.
 - in data/hashed_data there must be a folder with the same name as global_variables.CHOSEN_SUBSET_NAME. If not, create this empty folder.
 - in data there must be a folder called bus_data. If not, create this empty folder.
 - in code/experiments/results there must be a folder with the same name as global_variables.CHOSEN_SUBSET_NAME. If not, create this folder.
    - Inside this folder there must be a folder called lists and a folder called plots. If not, create these empty folders.
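
The empty folders listed above can also be created with a short script. This is a minimal sketch, not part of the project: the helper name `create_experiment_folders` is hypothetical, and it assumes that `project_root` points at the folder that contains both `code` and `data`. The .csv files in data/raw_data still have to be uploaded by hand.

```python
from pathlib import Path


def create_experiment_folders(project_root, subset_name):
    """Create the empty folders from Step 2; existing folders are left untouched."""
    root = Path(project_root)
    # Assumption: results live under code/experiments/results/<subset_name>
    results = root / "code" / "experiments" / "results" / subset_name
    folders = [
        root / "data" / "chosen_data" / subset_name,
        root / "data" / "hashed_data" / subset_name,
        root / "data" / "bus_data",
        results / "lists",
        results / "plots",
    ]
    for folder in folders:
        # parents=True also creates missing parent folders
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

If the notebook is run from the code folder, a call such as `create_experiment_folders("..", global_variables.CHOSEN_SUBSET_NAME)` should cover all the folders in this step.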

### Step 3

Run the cells in [code/porto-data.ipynb](porto-data.ipynb), or just run the cell below.
This will load the data from the chosen subset into the folder data/chosen_data/subset-'size', with each row in the dataset written to its own text file. It also creates a META-file which contains the names of all the text files in the subset.

(This might require installing nbformat: "pip install nbformat")

In [1]:
%run "porto-data.ipynb"

Check the folder: data/chosen_data/subset-2500-without-frechet. Files should have been generated.


### Step 4

Run the cells in [code/bus-data.ipynb](bus-data.ipynb), or just run the cell below. This will load the bus data into the folder data/bus_data, with each bus route written to its own file. It also creates a META-file which contains the names of all the text files in the subset.

In [2]:
%run "bus-data.ipynb"

Check the folder: data/bus_data/. Files should have been generated.
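
The folder checks in Steps 3 and 4 can also be done from code. The sketch below only counts the files in a folder; the helper name `count_generated_files` is hypothetical, and the relative paths assume the notebook is run from the code folder.

```python
from pathlib import Path


def count_generated_files(folder):
    """Return the number of regular files directly inside `folder` (0 if it does not exist)."""
    path = Path(folder)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_file())


# Assumption: paths are relative to the code folder
for folder in ["../data/chosen_data/subset-2500-without-frechet", "../data/bus_data"]:
    print(folder, "contains", count_generated_files(folder), "files")
```

A count of 0 suggests that the corresponding notebook did not run or wrote its output somewhere else.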


### Step 5

Run the code below to see the clusters created. It reuses the similarity matrix (LSH) produced by the original subset-2500 run, so that the exact same clusters are obtained (same starting point).

In [None]:
from math import ceil
from experiments.hierarchical_clustering import HCA
from experiments.davies_bouldin import davies_bouldin

import global_variables

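# The Davies-Bouldin index is lower for better clusterings, so start from infinity.
best_db_value = float("inf")
BestGrid = None

# Try every cluster count from 5 up to (but not including) the upper bound.
highest_number_of_clusters = ceil(global_variables.CHOSEN_SUBSET_SIZE / global_variables.THRESHOLD_NUMBER_OF_TRAJECTORIES)
for i in range(5, highest_number_of_clusters):
    PortoGrid = HCA("Porto", "../code/experiments/similarities/grid_porto-subset-2500-original.csv", i)
    result, _, _ = davies_bouldin(PortoGrid.distances, PortoGrid.clusters)
    # Keep the clustering with the lowest index seen so far.
    if result < best_db_value:
        best_db_value = result
        BestGrid = PortoGrid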

print("Best number of clusters is: " + str(BestGrid.n_clusters))
BestGrid.plot_clusters("Porto - Grid")
clusters_dict = BestGrid.get_cluster_dictionary()
print("Here is the dictionary with the clusters:")
print(clusters_dict)
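
The loop above keeps the clustering with the lowest Davies-Bouldin value. To illustrate why a lower value indicates better-separated clusters, here is a minimal, self-contained sketch of the index computed from a pairwise distance matrix. It is not the project's `experiments.davies_bouldin` implementation: the function name `medoid_davies_bouldin` is hypothetical, and cluster centres are approximated by medoids, since only distances are available. It needs at least two clusters, and medoids of different clusters must not be at distance 0.

```python
def medoid_davies_bouldin(distances, labels):
    """Davies-Bouldin index from a pairwise distance matrix, using medoids as cluster centres."""
    # Group item indices by cluster label
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    medoids, scatter = {}, {}
    for label, members in clusters.items():
        # Medoid: the member with the smallest total distance to the members of its cluster
        medoid = min(members, key=lambda m: sum(distances[m][o] for o in members))
        medoids[label] = medoid
        # Scatter: average distance from the members to the medoid
        scatter[label] = sum(distances[medoid][o] for o in members) / len(members)

    total = 0.0
    for a in clusters:
        # Worst-case ratio of within-cluster scatter to between-cluster separation
        total += max(
            (scatter[a] + scatter[b]) / distances[medoids[a]][medoids[b]]
            for b in clusters
            if b != a
        )
    return total / len(clusters)


# Four points on a line at positions 0, 1, 10 and 11
points = [0, 1, 10, 11]
distances = [[abs(p - q) for q in points] for p in points]

good = medoid_davies_bouldin(distances, [0, 0, 1, 1])  # neighbours grouped together
bad = medoid_davies_bouldin(distances, [0, 1, 0, 1])   # neighbours split apart
print(good < bad)  # the tighter grouping gets the lower index
```

In this toy example the grouping that keeps neighbouring points together gets the lower index, which is the same criterion the loop uses to pick the number of clusters.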

### Step 6
Run the cell below to run the Frechet algorithm (with a twist) ONLY to match the clusters with the buses. It uses the clusters already created from LSH and compares them to the bus routes to check for similarity. A cluster from LSH is immediately defined as a well-used taxi route, as long as it has enough trajectories.

In [None]:
from experiments.frechet_for_taxi_case import do_whole_experiment_with_only_lsh_to_find_well_used_routes

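# raw_df and raw_df_bus are not defined in this cell; presumably they come from the notebooks run in Steps 3 and 4.
do_whole_experiment_with_only_lsh_to_find_well_used_routes(clusters_dict, raw_df, raw_df_bus)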

### Step 7
Run the cells in [code/results-visualisation.ipynb](results-visualisation.ipynb), or just run the cell below. This will create visualisations of the results and save them to the folder "code/experiments/results/subset-'size'/plots" as HTML pages, which can be opened in a web browser.

In [None]:
%run "results-visualisation.ipynb"

### (Step 8)
If you want to see the visualisations of the results in this notebook, run the cell below. Update the name of the result you want to see; check the folder code/experiments/results/subset-'size'/lists to see all results that can be plotted.

In [None]:
# Update this parameter depending on which result you want to view.
# Any file in the folder code/experiments/results/subset-'size'/lists can be chosen, e.g. match-0.csv or not-match-0.csv.
#name_of_file = "match-0.csv"

#plot = plot_result(f"experiments/results/{global_variables.CHOSEN_SUBSET_NAME}/lists/{name_of_file}")
#plot