## Hotels Challenge I.

given a database of hotels and a set of input coordinates, find the hotel closest to each coordinate

a solution is represented by a directory

the directory must contain one [yaml](https://en.wikipedia.org/wiki/YAML) file:
`commands.yaml`

at the top level of this file, 4 keys can have values: `setup-env-command`, `etl-command`, `process-command`
and `cleanup-command`.

- `setup-env-command` sets up the environment in which the other commands run. it can assume that python3.7 and pip are present
- `etl-command` runs once the data is accessible to the solution, in this case as `hotel_table.csv` in the root of the solution. the etl command can do whatever it wants with the data to prepare it for the process command
- when `process-command` runs, an additional `inputs.json` file is also present in the solution root. your task is to make this command write the answers to the queries found in `inputs.json` into an `outputs.json` file in the root of the solution, as fast as possible. this is the only mandatory key
- `cleanup-command` runs after everything is done
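
as an illustration of the layout described above, the sketch below writes a `commands.yaml` with all four keys. only the key names come from the task description; the command lines themselves (`requirements.txt`, `etl.py`, `process.py`) and the helper `write_commands_file` are hypothetical placeholders for whatever your solution actually ships.

```python
from pathlib import Path

# Hypothetical command lines; only the four key names come from the task description.
COMMANDS_YAML = """\
setup-env-command: pip install -r requirements.txt
etl-command: python etl.py
process-command: python process.py
cleanup-command: rm -f filtered.csv
"""


def write_commands_file(solution_dir):
    """Write commands.yaml into the given solution directory and return its path."""
    target = Path(solution_dir) / "commands.yaml"
    target.write_text(COMMANDS_YAML, encoding="utf-8")
    return target
```

since `process-command` is the only mandatory key, a minimal solution could presumably drop the other three lines.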


solutions will be evaluated based on:
- scaling with size of input
- scaling with data size

there are 4 levels for evaluation:
- 10k hotels - 1, 2, 5, 10 queries
- 5 queries - 10k, 50k, 100k, 200k hotels
- 50k hotels - 1, 10, 100, 1000 queries
- 500k hotels - 1, 10, 100, 1000 queries
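
to get a feel for both kinds of scaling before submitting, a rough local timing loop can help. the sketch below is not the evaluator: the synthetic points, the brute-force `nearest` stand-in and the function names are assumptions made here, and the sizes are kept small so it finishes quickly.

```python
import random
import time


def nearest(hotels, query):
    """Stand-in solver: brute-force nearest hotel by squared distance."""
    return min(hotels, key=lambda h: (h[0] - query[0]) ** 2 + (h[1] - query[1]) ** 2)


def random_points(n, rng):
    """Generate n synthetic (lon, lat) pairs."""
    return [(rng.uniform(-180, 180), rng.uniform(-90, 90)) for _ in range(n)]


def time_grid(hotel_counts, query_counts, seed=0):
    """Return {(n_hotels, n_queries): seconds} for every combination."""
    rng = random.Random(seed)
    timings = {}
    for n_hotels in hotel_counts:
        hotels = random_points(n_hotels, rng)
        for n_queries in query_counts:
            queries = random_points(n_queries, rng)
            start = time.perf_counter()
            for q in queries:
                nearest(hotels, q)
            timings[(n_hotels, n_queries)] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # small sizes so the sketch finishes quickly
    for key, seconds in time_grid([1_000, 10_000], [1, 2, 5, 10]).items():
        print(key, f"{seconds:.4f}s")
```

swapping `nearest` for a call into your own process step would show how its running time grows along each axis.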

### install package for data downloading and evaluation

In [None]:
!pip install --upgrade git+https://github.com/endreMBorza/jkg_evaluators

In [1]:
from jkg_evaluators.challenges.data.hotels import get_hotel_data, dump_hotel_input
import shutil
import os

### download practice data

In [2]:
get_hotel_data()

### select one and move to notebook root

In [3]:
data_size_to_copy = 10000
shutil.copyfile(os.path.join("data", 
                             f"{data_size_to_copy}.csv"), 
                "data.csv")

'data.csv'

### generate some inputs

In [4]:
dump_hotel_input(size=10, path="inputs.json")

## base solution ETL

In [5]:
%%time
import pandas as pd

data_file_path = "data.csv"

df = pd.read_csv(data_file_path)

# keep only the columns the process step needs, so it reads a smaller file
df.loc[:, ['lon', 'lat', 'name']].to_csv('filtered.csv', index=False)

CPU times: user 87 ms, sys: 14 ms, total: 101 ms
Wall time: 109 ms


## base solution process

In [78]:
%%time
import pandas as pd
import numpy as np
import json

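# load the query coordinates: a list of {'lon': ..., 'lat': ...} records
with open('inputs.json', 'r') as fp:
    input_locations = json.load(fp)

df = pd.read_csv('filtered.csv')

answers = []

# brute force: compare every query with every hotel
for place in input_locations:
    min_distance = np.inf
    closest_place = {}
    for idx, row in df.iterrows():
        # plain euclidean distance on the raw lon/lat degrees
        distance = ((place['lon'] - row['lon']) ** 2 + (place['lat'] - row['lat']) ** 2) ** 0.5
        if distance < min_distance:
            min_distance = distance
            closest_place = row[['lon', 'lat', 'name']].to_dict()
    answers.append(closest_place.copy())

# the task description expects the answers in outputs.json
with open('outputs.json', 'w') as fp:
    json.dump(answers, fp)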

CPU times: user 11.7 s, sys: 32.4 ms, total: 11.8 s
Wall time: 11.8 s


In [67]:
answers

[{'lon': 2.1137490000000003, 'lat': 13.503765, 'name': 'Hôtel Terminus'},
 {'lon': 106.93460900000001, 'lat': 47.922101, 'name': 'Chinggis Khaan Hotel'},
 {'lon': 27.9266,
  'lat': -32.99568,
  'name': 'The Hill Boutique Bed & Breakfast'},
 {'lon': -90.966089,
  'lat': -0.9570209999999999,
  'name': 'Cormorant Beach House'},
 {'lon': 32.418676, 'lat': -28.37433900000001, 'name': 'Fishermans Flat'},
 {'lon': -59.56523000000001, 'lat': 13.065819, 'name': 'Melbourne Inn'},
 {'lon': -49.37167, 'lat': -28.7188, 'name': 'Passione - estadia e Lazer'},
 {'lon': 64.39622800000001, 'lat': 39.773229, 'name': 'Daryo Hostel'},
 {'lon': -49.962192, 'lat': -9.270277, 'name': 'Pousada Sonho Meu'},
 {'lon': -23.757431, 'lat': 15.274973000000001, 'name': 'King Fisher Village'}]

In [85]:
%%time
import pandas as pd
import numpy as np
import json
from sklearn.neighbors import BallTree

with open('inputs.json', 'r') as fp:
    input_locations = json.load(fp)

df = pd.read_csv('filtered.csv')
df = df[['name', 'lon', 'lat']]

# collect the query coordinates in the same (lat, lon) order as the tree
query_lats = [place['lat'] for place in input_locations]
query_lons = [place['lon'] for place in input_locations]

# BallTree is built with its default (euclidean) metric, so hotels are
# ranked the same way as in the base solution
bt = BallTree(np.deg2rad(df[['lat', 'lon']].values))
distances, indices = bt.query(np.deg2rad(np.c_[query_lats, query_lons]))

# indices has shape (n_queries, 1); flatten it to pick one row per query
nearest_cities = df.iloc[indices.ravel()]
answers = [{'lat': x['lat'], 'lon': x['lon'], 'name': x['name']} for index, x in nearest_cities.iterrows()]

# the task description expects the answers in outputs.json
with open('outputs.json', 'w') as fp:
    json.dump(answers, fp)

CPU times: user 22 ms, sys: 5.85 ms, total: 27.8 ms
Wall time: 32.1 ms
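
one thing worth knowing about the cell above: the coordinates are converted to radians, but the tree is built with its default metric, so it still measures straight-line distance, just like the base solution. scikit-learn also documents a `haversine` metric for `BallTree`, which measures distance along the globe instead. which of the two the evaluator expects is not stated here, so the sketch below only shows that they can disagree. the points are made up and `haversine_km` / `euclid_deg` are names chosen for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def euclid_deg(lat1, lon1, lat2, lon2):
    """Straight-line distance on raw degrees, as in the base solution."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# made-up points at high latitude, where a degree of longitude is short
query = (80.0, 0.0)
east = (80.0, 10.0)   # 10 degrees of longitude away
south = (75.0, 0.0)   # 5 degrees of latitude away

# on raw degrees the southern point looks nearer
print(euclid_deg(*query, *east) > euclid_deg(*query, *south))
# along the globe the eastern point is nearer
print(haversine_km(*query, *east) < haversine_km(*query, *south))
```

near the equator the two measures mostly agree on which hotel is closest; the gap grows with latitude.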


In [84]:
[{'lat': x['lat'],'lon':x['lon'],'name':x['name']} for index, x in nearest_cities.iterrows()]


[{'lat': 13.503765, 'lon': 2.1137490000000003, 'name': 'Hôtel Terminus'},
 {'lat': 47.922101, 'lon': 106.93460900000001, 'name': 'Chinggis Khaan Hotel'},
 {'lat': -32.99568,
  'lon': 27.9266,
  'name': 'The Hill Boutique Bed & Breakfast'},
 {'lat': -0.9570209999999999,
  'lon': -90.966089,
  'name': 'Cormorant Beach House'},
 {'lat': -28.37433900000001, 'lon': 32.418676, 'name': 'Fishermans Flat'},
 {'lat': 13.065819, 'lon': -59.56523000000001, 'name': 'Melbourne Inn'},
 {'lat': -28.7188, 'lon': -49.37167, 'name': 'Passione - estadia e Lazer'},
 {'lat': 39.773229, 'lon': 64.39622800000001, 'name': 'Daryo Hostel'},
 {'lat': -9.270277, 'lon': -49.962192, 'name': 'Pousada Sonho Meu'},
 {'lat': 15.274973000000001, 'lon': -23.757431, 'name': 'King Fisher Village'}]
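
the two answer lists printed so far differ only in key order, which is easy to miss by eye. a small check like the one below can confirm that a faster solution still returns the same hotels as the base solution. the helper names and the rounding to 6 decimals are choices made for this sketch; the sample records are the first two answers from each run above.

```python
def normalize(answer):
    """Reduce an answer to (name, lon, lat); key order and tiny float noise are ignored."""
    return (answer["name"], round(answer["lon"], 6), round(answer["lat"], 6))


def same_answers(left, right):
    """True when both lists name the same hotel for every query, in order."""
    return len(left) == len(right) and all(
        normalize(a) == normalize(b) for a, b in zip(left, right)
    )


# first two answers of each run, copied from the outputs above
base = [
    {"lon": 2.1137490000000003, "lat": 13.503765, "name": "Hôtel Terminus"},
    {"lon": 106.93460900000001, "lat": 47.922101, "name": "Chinggis Khaan Hotel"},
]
ball_tree = [
    {"lat": 13.503765, "lon": 2.1137490000000003, "name": "Hôtel Terminus"},
    {"lat": 47.922101, "lon": 106.93460900000001, "name": "Chinggis Khaan Hotel"},
]

print(same_answers(base, ball_tree))
```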

In [83]:
for index,x in nearest_cities.iterrows():
    print(x)

name    Hôtel Terminus
lon            2.11375
lat            13.5038
Name: 1298, dtype: object
name    Chinggis Khaan Hotel
lon                  106.935
lat                  47.9221
Name: 8933, dtype: object
name    The Hill Boutique Bed & Breakfast
lon                               27.9266
lat                              -32.9957
Name: 7774, dtype: object
name    Cormorant Beach House
lon                  -90.9661
lat                 -0.957021
Name: 3360, dtype: object
name    Fishermans Flat
lon             32.4187
lat            -28.3743
Name: 1721, dtype: object
name    Melbourne Inn
lon          -59.5652
lat           13.0658
Name: 1514, dtype: object
name    Passione - estadia e Lazer
lon                       -49.3717
lat                       -28.7188
Name: 1174, dtype: object
name    Daryo Hostel
lon          64.3962
lat          39.7732
Name: 1233, dtype: object
name    Pousada Sonho Meu
lon              -49.9622
lat              -9.27028
Name: 8361, dtype: object
name    Ki