## Hotels Challenge I.

given a database of hotels and a set of input coordinates, find the hotel closest to each coordinate

a solution is represented by a directory

the directory must contain one [yaml](https://en.wikipedia.org/wiki/YAML) file:
`commands.yaml`

in the file, 4 top-level keys can have values: `setup-env-command`, `etl-command`, `process-command`
and `cleanup-command`.

- `setup-env-command` sets up the environment where the other commands can run. it can assume the presence of python3.7 and pip
- `etl-command` runs once the data is accessible to the solution, in this case as `hotel_table.csv` in the root of the solution. the etl command can do whatever it wants with the data to prepare it for the process command
- when `process-command` runs, an additional `inputs.json` file is also present in the solution root. your task is to make this command write out the answers to the queries found in inputs into an `outputs.json` file in the root of the solution, as fast as possible. this is the only mandatory value
- `cleanup-command` runs after everything is done
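
as an illustration of the layout described above, here is a minimal sketch that renders such a file from python. only the four key names come from the task text; the command lines (`python etl.py`, `python process.py` and so on) and the helper `render_commands_yaml` are hypothetical placeholders, not part of the challenge.

```python
import json

# Hypothetical command lines; only the four key names come from the task text.
commands = {
    "setup-env-command": "pip install pandas numpy",
    "etl-command": "python etl.py",
    "process-command": "python process.py",
    "cleanup-command": "rm -f filtered.csv",
}


def render_commands_yaml(commands):
    """Render a flat mapping as top-level YAML key/value lines."""
    # json.dumps yields a double-quoted string, which YAML also accepts
    lines = [f"{key}: {json.dumps(value)}" for key, value in commands.items()]
    return "\n".join(lines) + "\n"


yaml_text = render_commands_yaml(commands)
print(yaml_text)
```

writing `yaml_text` to `commands.yaml` in the solution directory would give a file with all four keys set; since only `process-command` is mandatory, the other three entries could be dropped.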


solutions will be evaluated based on:
- scaling with size of input
- scaling with data size

there are 4 levels for evaluation:
- 10k hotels - 1, 2, 5, 10 queries
- 5 queries - 10k, 50k, 100k, 200k hotels
- 50k hotels - 1, 10, 100, 1000 queries
- 500k hotels - 1, 10, 100, 1000 queries
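
to get a rough local feel for how a solution scales over one of these levels, something like the sketch below could be used. it is an assumption-laden approximation: it times a hypothetical in-process brute-force function (`closest_hotel`) on synthetic random points, while the real evaluation presumably times the whole `process-command`, including startup and file reading.

```python
import random
import time


def closest_hotel(query, hotels):
    """Brute-force stand-in for a real process step (hypothetical helper)."""
    return min(
        hotels,
        key=lambda h: (h["lon"] - query["lon"]) ** 2 + (h["lat"] - query["lat"]) ** 2,
    )


def random_points(n, rng):
    """Generate n synthetic points; real runs would use the downloaded data."""
    return [{"lon": rng.uniform(-180, 180), "lat": rng.uniform(-90, 90)} for _ in range(n)]


def time_queries(hotel_count, query_counts, seed=0):
    """Return {query_count: seconds} for one hotel table size."""
    rng = random.Random(seed)
    hotels = random_points(hotel_count, rng)
    timings = {}
    for n_queries in query_counts:
        queries = random_points(n_queries, rng)
        start = time.perf_counter()
        for query in queries:
            closest_hotel(query, hotels)
        timings[n_queries] = time.perf_counter() - start
    return timings


# first evaluation level from the list above: 10k hotels, 1 to 10 queries
timings = time_queries(10_000, [1, 2, 5, 10])
for n_queries, seconds in timings.items():
    print(f"{n_queries:>3} queries: {seconds:.4f} s")
```

swapping `closest_hotel` for the function under test and changing `hotel_count` would cover the other levels in the same way.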

### install package for data downloading and evaluation

In [None]:
#!pip install --upgrade git+https://github.com/endreMBorza/jkg_evaluators

In [1]:
from jkg_evaluators.challenges.data.hotels import get_hotel_data, dump_hotel_input
import shutil
import os

### download practice data

In [2]:
get_hotel_data()

### select one and move to notebook root

In [9]:
data_size_to_copy = 10000
shutil.copyfile(os.path.join("data", 
                             f"{data_size_to_copy}.csv"), 
                "data.csv")

PermissionError: [Errno 13] Permission denied: 'data.csv'

### generate some inputs

In [None]:
dump_hotel_input(size=10, path="inputs.json")
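
judging from the base process cell further down, each entry of `inputs.json` carries a `lon` and a `lat`, and each answer is written with `lon`, `lat` and `name`. a small sanity check along these lines might help catch shape mistakes before submitting. the helper `check_outputs` and the sample values are made up for illustration, and the evaluator may well check more than this.

```python
import json

REQUIRED_KEYS = {"lon", "lat", "name"}  # keys the base process cell writes


def check_outputs(inputs, outputs):
    """Return a list of problems; an empty list means the shapes line up."""
    problems = []
    if len(inputs) != len(outputs):
        problems.append(f"expected {len(inputs)} answers, got {len(outputs)}")
    for i, answer in enumerate(outputs):
        missing = REQUIRED_KEYS - set(answer)
        if missing:
            problems.append(f"answer {i} is missing {sorted(missing)}")
    return problems


# made-up values, parsed from JSON the way the real files would be
inputs = json.loads('[{"lon": 19.04, "lat": 47.5}, {"lon": 104.92, "lat": 11.55}]')
outputs = json.loads('[{"lon": 19.05, "lat": 47.49, "name": "Example Hotel"}]')
problems = check_outputs(inputs, outputs)
print(problems)
```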

## base solution ETL

In [86]:
%%time
import pandas as pd

data_file_path = "data.csv"

df = pd.read_csv(data_file_path)

df.loc[:, ['lon','lat','name', 'current-price', 'stars']].to_csv('filtered.csv',index=None)

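# strip every non-digit character from the price, then convert it to an integer
# (assigning the result back avoids relying on an in-place edit of a column)
df['current-price'] = df['current-price'].replace(to_replace=r'\D', value='', regex=True)
df['current-price'] = df['current-price'].astype('int64')
df = df.sort_values(by='current-price')
df = df.drop_duplicates()
# dropna returns a new frame, so the result has to be kept
df = df.dropna(how="all")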

# one sub-table per star rating, keyed by the rating
starunique = sorted(df['stars'].unique())
stardict = {star: df[df['stars'] == star] for star in starunique}

Wall time: 220 ms


In [90]:
stardict[2]

Unnamed: 0,lat,lon,name,tagline,postal-addr,current-price,stars
5434,11.547344,104.918427,Stay Inn Hostel,Walking distance from Tuol Sleng Genocide Museum,"17C, Street 368, Sangkat BKK III, Khan Chamkar...",5,2.0
261,-1.259490,116.868262,OYO 1685 Garuda Guest House,No-frills hotel in Balikpapan,"Jl. Manunggal No.53, Damai, Balikpapan Kota, B...",6,2.0
1972,9.141832,99.319198,Sleep Box Hostel Suratthani,City-center Surat Thani capsule hotel with bar...,"37/8 Moo 1, Mai Mueang, Surat Thani, Surat Tha...",6,2.0
7097,0.455788,101.424714,SPOT ON 2252 Ranira Homestay,No-frills hostel in Pekanbaru,"Jl. Inpres, Pekanbaru, Pekanbaru, Riau, 28289,...",7,2.0
3975,4.636354,-75.570443,Hostal La Casa de Lili,No-frills hostel in Salento,"Carrera 6 #3-45, Salento, 631020, Colombia",7,2.0
...,...,...,...,...,...,...,...
2049,1.080710,34.167430,LACAM LODGE,Mbale hostel with restaurant and bar/lounge,"SIPI FALLS, Mbale, Uganda",223,2.0
4777,24.162418,-110.314585,Nahuala Hostal,Property in La Paz,"Zona Comercial, La Paz, BCS, 23000, Mexico",351,2.0
1880,34.053192,-118.278907,Americas Best Value Inn Los Angeles at S Alvar...,"Motel with restaurant, near Los Angeles Conven...","906 S. Alvarado Street, Los Angeles, CA, 90006...",404,2.0
1577,6.141572,100.355141,OYO 89586 Hotel MNY Wangsa Inn,No-frills hotel in Alor Setar,"No. 54, Kawasan, 5 Jalan Shahab, 1, Shahab Per...",474,2.0


## base solution process

In [None]:
%%time
import pandas as pd
import numpy as np
import json

# read the queries; the context manager closes the file afterwards
with open('inputs.json', 'r') as f:
    input_locations = json.load(f)

df = pd.read_csv('filtered.csv')

answers = []

for place in input_locations:
    min_distance = np.inf
    closest_place = {}
    for idx,row in df.iterrows():
        distance = ((place['lon']-row['lon']) ** 2 + (place['lat']-row['lat']) ** 2) ** 0.5
        if distance < min_distance:
            min_distance = distance
            closest_place = row[['lon','lat','name']].to_dict()
    answers.append(closest_place.copy())

# the task description expects the answers in outputs.json
with open('outputs.json', 'w') as f:
    json.dump(answers, f)
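
the nested loop above visits every hotel row once per query through `iterrows`, which tends to be slow. one possible speedup, sketched here on made-up data rather than the real files, is to let numpy compute all distances at once. this is not the reference solution; it keeps the same plain lon/lat distance as the base cell, and the arrays and hotel names below are synthetic stand-ins.

```python
import numpy as np

# synthetic stand-ins for the lon/lat columns of filtered.csv and for inputs.json
hotel_lon = np.array([19.05, 104.92, -75.57])
hotel_lat = np.array([47.49, 11.55, 4.64])
hotel_names = ["Hotel A", "Hotel B", "Hotel C"]  # made-up names

queries = [{"lon": 19.0, "lat": 47.5}, {"lon": -75.0, "lat": 4.0}]
query_lon = np.array([q["lon"] for q in queries])
query_lat = np.array([q["lat"] for q in queries])

# broadcasting builds a (n_queries, n_hotels) matrix of squared distances;
# the square root is skipped because it does not change which hotel is closest
squared = (query_lon[:, None] - hotel_lon[None, :]) ** 2 + (
    query_lat[:, None] - hotel_lat[None, :]
) ** 2
closest_idx = squared.argmin(axis=1)

answers = [
    {"lon": float(hotel_lon[i]), "lat": float(hotel_lat[i]), "name": hotel_names[i]}
    for i in closest_idx
]
print(answers)
```

for the largest level (1000 queries against 500k hotels) the full matrix would hold 5e8 float64 values, roughly 4 GB, so the queries would probably have to be processed in smaller chunks.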