# File to do Porto Taxi Trajectory Similarity

### Step 1

Open the code/global_variables.py file ([or just click here](global_variables.py)) and edit the values to fit the given experiment: the name of the chosen subset ("subset-*size*"), the size of the subset, and the coordinates of the geographical area.
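
As a rough illustration of the kind of values to edit, the sketch below shows what such a settings file might look like. Only `CHOSEN_SUBSET_NAME` is a name that appears elsewhere in this notebook; `CHOSEN_SUBSET_SIZE`, the four bounding-box names, and all values are placeholder assumptions, so use the names actually found in global_variables.py.

```python
# Hypothetical sketch of code/global_variables.py.
# Only CHOSEN_SUBSET_NAME is a name used elsewhere in this notebook;
# every other name and every value below is a placeholder assumption.

CHOSEN_SUBSET_SIZE = 100                             # number of trajectories in the subset
CHOSEN_SUBSET_NAME = f"subset-{CHOSEN_SUBSET_SIZE}"  # e.g. "subset-100"

# Bounding box of the geographical area (placeholder values, roughly around Porto)
MIN_LAT, MAX_LAT = 41.10, 41.20
MIN_LON, MAX_LON = -8.70, -8.55
```

Deriving the subset name from the size, as done here, is one way to keep the two values from drifting apart.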


### Step 2

Make sure you have the needed files/folders for the chosen subset.
 - in data/raw_data there must be a .csv file with the subset of the chosen size. If not, it must be uploaded.
 - in data/chosen_data there must be a folder with the same name as global_variables.CHOSEN_SUBSET_NAME. If not, create this empty folder.
 - in data/hashed_data there must be a folder with the same name as global_variables.CHOSEN_SUBSET_NAME. If not, create this empty folder.
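
The folder checks above can also be done in code. The following is a minimal sketch, not part of the repository: `prepare_subset_folders` is a hypothetical helper, and it assumes that any .csv file in data/raw_data counts as the subset file.

```python
from pathlib import Path


def prepare_subset_folders(data_root, subset_name):
    """Create the per-subset folders and report whether raw data is present.

    Returns True if at least one .csv file exists in <data_root>/raw_data.
    """
    data_root = Path(data_root)
    for parent in ("chosen_data", "hashed_data"):
        # exist_ok=True makes the call safe to repeat
        (data_root / parent / subset_name).mkdir(parents=True, exist_ok=True)

    raw_dir = data_root / "raw_data"
    # Assumption: any .csv in raw_data counts; the exact file name is not checked
    return raw_dir.is_dir() and any(raw_dir.glob("*.csv"))
```

If the notebook is run from the code folder, a call such as `prepare_subset_folders("../data", global_variables.CHOSEN_SUBSET_NAME)` would create the two empty folders and return `False` when the raw .csv file still has to be uploaded.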

### Step 3

Run the cells in [code/porto-data.ipynb](porto-data.ipynb), or just run the cell below.
This loads the data from the chosen subset into the folder data/chosen_data/subset-'size', writing each row of the dataset to its own text file. It also creates a META file that contains the names of all the text files in the subset.

(This might require installing nbformat: "pip install nbformat")

In [1]:
%run "porto-data.ipynb"

Check the folder: data/chosen_data/subset-100. Files should have been generated.
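
The check can also be scripted. The sketch below is an illustration only: `list_generated_files` is a hypothetical helper, and the folder path passed to it is whatever matches your subset.

```python
from pathlib import Path


def list_generated_files(folder):
    """Return the sorted names of all regular files in `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

Since each row is written to its own text file and one META file is added, the number of names returned should be about the subset size plus one, assuming nothing else is stored in the folder.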


### Step 4

Run the cells in [code/lsh-grid.ipynb](lsh-grid.ipynb), or just run the cell below. This represents each of the rows/trajectories as a hash and creates a text file for each hashed trajectory in the folder data/hashed_data/subset-'size', as well as a META file.

In [2]:
%run "lsh-grid.ipynb"

       Average runtime  Minimum runtime  Maximum runtime
porto         0.021477         0.021013         0.022953
                 Average runtime  Minimum runtime  Maximum runtime
porto_naive             4.180326         4.131812          4.22105
porto_quadrants         2.111735         1.977048          2.29845
porto_kd_tree           2.121689         2.065953          2.19822

Check the folder: data/hashed_data/subset-100. Files should have been generated.
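
To give an idea of what hashing a trajectory onto a grid can mean, here is a minimal sketch of one common approach: each point is mapped to the grid cell that contains it, and consecutive repeats of the same cell are dropped. This is not necessarily what lsh-grid.ipynb does. The function name, its parameters, and the choice of degrees as the cell unit are all assumptions made for illustration.

```python
def grid_hash(trajectory, min_lat, min_lon, cell_size):
    """Map a trajectory to the sequence of grid cells it passes through.

    trajectory: list of (lat, lon) tuples
    min_lat, min_lon: south-west corner of the grid
    cell_size: side length of a cell, in degrees (an illustrative unit choice)
    """
    cells = []
    for lat, lon in trajectory:
        row = int((lat - min_lat) // cell_size)
        col = int((lon - min_lon) // cell_size)
        cell = (row, col)
        # Skip consecutive duplicates so dwell time in a cell is ignored
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells
```

Two trajectories that follow the same streets then produce similar cell sequences, even if their raw GPS points differ slightly.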


### Step 5
Calculate the similarities by running the following code.

In [3]:
%run similarities-only-grid.ipynb

Check ../code/experiments/similarities/: there should be a file named grid_porto-subset-100.csv, which contains the similarities in the dataset.
Check ../code/experiments/timing/: there should be a file named similarity_runtimes_grid_porto-{global_variables.CHOSEN_SUBSET_NAME}.csv, which should contain the time spent computing the hash similarities.
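
Both checks can be automated. The sketch below builds the two expected file paths from the subset name and reports which ones are missing. The helper names are hypothetical, and the default `experiments_dir` assumes the relative path used in this notebook.

```python
from pathlib import Path


def expected_result_files(subset_name, experiments_dir="../code/experiments"):
    """Build the paths of the two CSV files Step 5 is expected to write."""
    experiments_dir = Path(experiments_dir)
    return {
        "similarities": experiments_dir / "similarities" / f"grid_porto-{subset_name}.csv",
        "timing": experiments_dir / "timing" / f"similarity_runtimes_grid_porto-{subset_name}.csv",
    }


def missing_result_files(subset_name, experiments_dir="../code/experiments"):
    """Return the labels of expected result files that do not exist yet."""
    paths = expected_result_files(subset_name, experiments_dir)
    return [label for label, path in paths.items() if not path.is_file()]
```

An empty list from `missing_result_files(global_variables.CHOSEN_SUBSET_NAME)` would indicate that both files were written.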


### Step 6

Run the code below to see the clustering of the trajectories. Decide the number of clusters you want by updating number_of_clusters.

In [4]:
#Change this to the number of clusters you want (if wanted number is more than 30: update in def plot_clusters() in hierarchical_clustering.py)
number_of_clusters = 10

In [5]:
import global_variables  # provides CHOSEN_SUBSET_NAME, used in the file path below
from experiments.hierarchical_clustering import HCA
from experiments import davies_bouldin as DB
from sklearn import metrics as mcs

# Porto Grid similarities
#TODO: remove city
PortoGrid = HCA("Porto", f"../code/experiments/similarities/grid_porto-{global_variables.CHOSEN_SUBSET_NAME}.csv", number_of_clusters )
print(PortoGrid.clusters)
#PortoGrid.plot_clusters("Porto - Grid")
clusters_dict = PortoGrid.get_cluster_dictionary()
print(clusters_dict)

[0 8 1 4 0 2 3 5 9 5 1 6 0 9 1 1 4 9 0 6 9 2 6 0 0 1 6 5 9 1 1 1 3 0 9 6 5
 1 6 0 2 6 2 6 6 9 4 8 7 9 0 0 2 6 2 0 0 4 8 0 9 0 9 1 1 9 6 2 8 2 5 5 9 0
 0 4 0 1 1 0 0 1 0 0 9 0 5 9 4 8 6 0 6 0 9 1 1 1 0 1]
TEST123
['P_AAAA', 'P_AAAB', 'P_AAAC', 'P_AAAD', 'P_AAAE', 'P_AAAF', 'P_AAAG', 'P_AAAH', 'P_AAAI', 'P_AAAJ', 'P_AAAK', 'P_AAAL', 'P_AAAM', 'P_AAAN', 'P_AAAO', 'P_AAAP', 'P_AAAQ', 'P_AAAR', 'P_AAAS', 'P_AAAT', 'P_AAAU', 'P_AAAV', 'P_AAAW', 'P_AAAX', 'P_AAAY', 'P_AAAZ', 'P_AABA', 'P_AABB', 'P_AABC', 'P_AABD', 'P_AABE', 'P_AABF', 'P_AABG', 'P_AABH', 'P_AABI', 'P_AABJ', 'P_AABK', 'P_AABL', 'P_AABM', 'P_AABN', 'P_AABO', 'P_AABP', 'P_AABQ', 'P_AABR', 'P_AABS', 'P_AABT', 'P_AABU', 'P_AABV', 'P_AABW', 'P_AABX', 'P_AABY', 'P_AABZ', 'P_AACA', 'P_AACB', 'P_AACC', 'P_AACD', 'P_AACE', 'P_AACF', 'P_AACG', 'P_AACH', 'P_AACI', 'P_AACJ', 'P_AACK', 'P_AACL', 'P_AACM', 'P_AACN', 'P_AACO', 'P_AACP', 'P_AACQ', 'P_AACR', 'P_AACS', 'P_AACT', 'P_AACU', 'P_AACV', 'P_AACW', 'P_AACX', 'P_AACY', 'P_AACZ', 'P_AADA

  return linkage(y, method='ward', metric='euclidean')
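
The printed array holds one cluster label per trajectory. To illustrate how such labels can be turned into a mapping from cluster to trajectory names, here is a minimal sketch. It is not the implementation of `get_cluster_dictionary`, whose return format is not shown in full here. The function name is hypothetical, and it assumes that labels and names are aligned by position.

```python
from collections import defaultdict


def group_by_cluster(labels, names):
    """Group trajectory names by their cluster label.

    labels: one integer cluster label per trajectory
    names:  trajectory names in the same order as `labels`
    """
    if len(labels) != len(names):
        raise ValueError("labels and names must have the same length")
    groups = defaultdict(list)
    for label, name in zip(labels, names):
        groups[int(label)].append(name)
    return dict(groups)
```

Converting each label with `int()` lets the function accept both plain Python lists and NumPy integer arrays like the one printed above.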


In [6]:
from experiments.frechet_for_taxi_case import find_similarity_in_clusters
from experiments.frechet_for_taxi_case import find_similarity_in_cluster
from experiments.frechet_for_taxi_case import frechet_similar_taxi_trajectories
from experiments.frechet_for_taxi_case import frechet_similar_taxi_and_bus_trajectories


result = find_similarity_in_clusters(clusters_dict)
print(result)

[[['P_AABH', 'P_AACH', 'P_AAAY', 'P_AACJ', 'P_AAAA', 'P_AAAS', 'P_AAAA', 'P_AAAX', 'P_AAAS', 'P_AADE', 'P_AAAX', 'P_AADP', 'P_AABN', 'P_AADE', 'P_AACJ', 'P_AADE', 'P_AADB', 'P_AADP', 'P_AADP', 'P_AADU', 'P_AAAY', 'P_AADC', 'P_AAAY', 'P_AADF', 'P_AAAY', 'P_AADU', 'P_AABN', 'P_AABZ', 'P_AABN', 'P_AACD', 'P_AABN', 'P_AACH', 'P_AABN', 'P_AACJ', 'P_AABY', 'P_AACJ', 'P_AABY', 'P_AACV', 'P_AABZ', 'P_AADH', 'P_AACD', 'P_AACJ', 'P_AACE', 'P_AADC', 'P_AACH', 'P_AACY', 'P_AACV', 'P_AADC', 'P_AACV', 'P_AADH', 'P_AACW', 'P_AADB', 'P_AACW', 'P_AADC', 'P_AACY', 'P_AADB', 'P_AACY', 'P_AADC', 'P_AADB', 'P_AADN', 'P_AABH', 'P_AACY'], [], [], [], [], [], [], []], [['P_AAAO', 'P_AAAZ', 'P_AAAC', 'P_AAAK', 'P_AAAK', 'P_AACM', 'P_AAAK', 'P_AACZ', 'P_AAAK', 'P_AADS', 'P_AAAP', 'P_AACM', 'P_AAAZ', 'P_AACZ', 'P_AAAZ', 'P_AADT', 'P_AABD', 'P_AACM', 'P_AABE', 'P_AACZ', 'P_AABE', 'P_AADS', 'P_AABF', 'P_AACM', 'P_AACM', 'P_AADT'], [], [], [], ['P_AADA', 'P_AADR']], [['P_AAAV', 'P_AACP'], ['P_AABO', 'P_AACC'], ['P_
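
The module name frechet_for_taxi_case suggests that trajectories within a cluster are compared using the Fréchet distance. For readers unfamiliar with it, the sketch below computes the discrete Fréchet distance between two point sequences with the standard dynamic-programming recurrence. It is an independent illustration, not the code in experiments/frechet_for_taxi_case.py. The function name is hypothetical, and the plain Euclidean distance between coordinates is a simplifying assumption.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty point sequences.

    p, q: lists of (x, y) tuples. Euclidean distance is used between points,
    which ignores the Earth's curvature (a simplification for small areas).
    """
    if not p or not q:
        raise ValueError("both trajectories must contain at least one point")
    n, m = len(p), len(q)
    # ca[i][j] = discrete Fréchet distance between p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]
```

A small distance means the two trajectories can be traversed together while staying close at every step, which is why a threshold on this value is a common way to call two trajectories similar.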