# Whereabouts: calculation of speed, accuracy and comparison with other geocoders

06 09 24

## Description
This notebook contains the code that produces the accuracy and speed results for the whereabouts package (https://www.github.com/ajl2718/whereabouts) and compares them with three other geocoders: Google, Mapbox and Nominatim.


## Test datasets
There are three datasets used for testing, two for accuracy and one for speed:
- Locations of Guzman y Gomez stores in Australia
- Random residential locations from parts of Australia
- Licensed venues in Victoria, Australia

These datasets are available here: https://github.com/ajl2718/python_learning/whereabouts_testing/

## Methodology
Accuracy is assessed at four levels of geographic granularity:
- Apartment number: All components of the address must be correct including level number, building number, shop number where these appear
- House number: The main street number has to be correct
- Street name: The address is correct up to the street name
- Suburb name: The correct suburb of the location is found

For each level, an address scores 1 if the geocoded result matches at that level and 0 otherwise. The accuracy for each level is then the number of correct addresses divided by the total number of addresses.
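
As a minimal sketch of this scoring, assume the manually labelled outputs are stored with one 0/1 column per level. The column names and values below are hypothetical and are not taken from the labelled files.

```python
import pandas as pd

# Hypothetical manual labels: 1 = the geocoded result matches at that level, 0 = it does not.
labels = pd.DataFrame({
    'apartment': [1, 0, 1, 0],
    'house':     [1, 1, 1, 0],
    'street':    [1, 1, 1, 0],
    'suburb':    [1, 1, 1, 1],
})

# Accuracy per level = number of correct / number of addresses,
# which is the mean of each 0/1 column.
accuracy = labels.mean()
print(accuracy)
```

Since the labels are 0 or 1, the column mean is exactly the fraction of addresses that match at that level.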

In [7]:
from time import time, sleep
import requests

import pandas as pd
import numpy as np

import math

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [8]:
sns.set_style("whitegrid")

## Data sources

In [9]:
# load Guzman y Gomez
df = pd.read_csv('data/gyg_210624_geocoded.csv')
addresses_gyg = df['address'].values

In [10]:
# load residential addresses
df = pd.read_csv('data/rea_130824.csv', sep='\t')
addresses_rea = df['address'].sample(128, random_state=42).values

## Geocoding: whereabouts

In [14]:
from whereabouts.Matcher import Matcher
from whereabouts.MatcherPipeline import MatcherPipeline

In [15]:
matcher1 = Matcher('au_all_lg', how='standard')
matcher2 = Matcher('au_all_lg', how='trigram')

pipeline = MatcherPipeline([matcher1, matcher2])

In [20]:
# geocode the retail addresses with the standard matcher (matcher1) only
results_gyg = matcher1.geocode(list(addresses_gyg))

In [22]:
df_results = pd.DataFrame(data=results_gyg)
df_results.to_csv('gyg_experiment_whereabouts_030924.csv', index=False)

In [24]:
results_rea = pipeline.geocode(list(addresses_rea))

In [25]:
df_results = pd.DataFrame(data=results_rea)
df_results.to_csv('rea_experiment_whereabouts_030924.csv', index=False)

## Nominatim

In [27]:
headers = {
    "User-Agent": "" # change this
}


In [28]:
# add country name to addresses
addresses_gyg = [address + ' AUSTRALIA' for address in addresses_gyg]
addresses_rea = [address + ' AUSTRALIA' for address in addresses_rea]

In [None]:
url_base = 'https://nominatim.openstreetmap.org/search'
all_results = []

for address in addresses_rea:
    sleep(1)  # stay within Nominatim's limit of one request per second
    print(f'Geocoding {address}')
    # requests URL-encodes the query, so pass the address unchanged;
    # replacing spaces with '+' first would send literal plus signs
    params = {'q': address,
              'format': 'json',
              'addressdetails': 1}
    page = requests.get(url_base, params=params, headers=headers)
    results = page.json()
    if isinstance(results, list) and len(results) > 0:
        # keep only the top-ranked match
        all_results.append(results[0])
    elif isinstance(results, dict):
        all_results.append(results)
    else:
        # no match: keep an empty record so rows stay aligned with the inputs
        all_results.append({})

In [21]:
# write the results
(
    pd.DataFrame(all_results)
    .assign(input_address=addresses_rea)
    .to_csv('rea_experiments_nominatim_030924.csv', index=False)
)

## Google Maps API

In [90]:
api_key = '' # set this

In [None]:
url_base = 'https://maps.googleapis.com/maps/api/geocode/json'

In [None]:
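dfs = []

for address in addresses_rea:
    print(f'Geocoding {address}')
    # requests URL-encodes the parameters, so pass the address unchanged
    params = {'address': address, 'key': api_key}
    page = requests.get(url=url_base, params=params)
    results = page.json().get('results', [])
    if len(results) == 0:
        # no match: keep a placeholder row so outputs stay aligned with the inputs
        dfs.append(pd.DataFrame({'index': ['long_name']}))
        continue
    # one row per address: long names of the top result's components, in the order returned
    df_temp = (
        pd.DataFrame(results[0]['address_components'])[['long_name']]
        .T
        .reset_index()
        .rename(columns={0: 'number', 1: 'street', 2: 'name', 3: 'suburb', 4: 'state', 5: 'country', 6: 'postcode'})
    )
    dfs.append(df_temp)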

In [1]:
# write the results
(
    pd.concat(dfs)
    .assign(input_address=addresses_rea)
    .to_csv('rea_experiments_google_030924.csv', index=False)
)
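
The positional renaming above assumes that every response lists its address components in the same order, which may not hold when a result includes extra components such as a unit number. A hedged alternative is to select components by their `types` entry instead. The sketch below runs on a made-up, abridged result; the example address, the `FIELD_BY_TYPE` mapping and the output column names are illustrative assumptions, not part of the original analysis.

```python
# Hypothetical, abridged entry of the 'results' list in a geocoding response.
example_result = {
    'address_components': [
        {'long_name': '5', 'short_name': '5', 'types': ['subpremise']},
        {'long_name': '12', 'short_name': '12', 'types': ['street_number']},
        {'long_name': 'Example Street', 'short_name': 'Example St', 'types': ['route']},
        {'long_name': 'Carlton', 'short_name': 'Carlton', 'types': ['locality', 'political']},
        {'long_name': 'Victoria', 'short_name': 'VIC',
         'types': ['administrative_area_level_1', 'political']},
        {'long_name': 'Australia', 'short_name': 'AU', 'types': ['country', 'political']},
        {'long_name': '3053', 'short_name': '3053', 'types': ['postal_code']},
    ]
}

# Illustrative mapping from component type to an output column name.
FIELD_BY_TYPE = {
    'subpremise': 'unit',
    'street_number': 'number',
    'route': 'street',
    'locality': 'suburb',
    'administrative_area_level_1': 'state',
    'country': 'country',
    'postal_code': 'postcode',
}


def components_by_type(result):
    """Return a dict of address fields, keyed by component type rather than position."""
    row = {}
    for component in result.get('address_components', []):
        for component_type in component['types']:
            if component_type in FIELD_BY_TYPE:
                row[FIELD_BY_TYPE[component_type]] = component['long_name']
    return row


row = components_by_type(example_result)
print(row)
```

A list of such dictionaries can be passed straight to `pd.DataFrame`, with missing fields appearing as NaN rather than shifting the remaining columns.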

## Mapbox


In [77]:
access_token = '' # set this

In [78]:
url_base = 'https://api.mapbox.com/search/geocode/v6/forward'

In [None]:
dfs = []

for address in addresses_rea:
    print(f'Geocoding {address}')
    # let requests URL-encode the query string rather than building it by hand
    params = {'q': address, 'access_token': access_token}
    page = requests.get(url=url_base, params=params)
    # keep the properties of the top-ranked feature
    address_result = pd.DataFrame(page.json()['features'][0])['properties']
    dfs.append(address_result)

In [86]:
# write the data
(
    pd.DataFrame(dfs)
    .assign(input_address=addresses_rea)
    .to_csv('rea_experiments_mapbox_030924.csv', index=False)
)

## Overall accuracy results

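The accuracy values below were obtained by manually reviewing the outputs of each geocoder. See the labelled outputs for the original data used to calculate them. In each list, the first four values are for the retail dataset and the last four for the residential dataset, ordered by level: apartment, house, street, suburb.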

In [None]:
overall_results_data = {'google': [0.675257732, 0.8969072165, 0.9742268041, 0.9896907216, 
                                   0.984375, 1, 1, 1],
                        'whereabouts': [0.6701030928, 0.9536082474, 0.9484536082, 0.9536082474,
                                        0.90625, 0.984375, 0.984375, 1],
                        'mapbox': [0.4639175258, 0.7731958763, 0.8298969072, 0.8453608247, 
                                   0.5234375, 0.578125, 0.984375, 0.984375],
                        'nominatim': [0.1494845361, 0.1494845361, 0.2164948454, 0.2164948454, 
                                      0.1171875, 0.1171875, 0.7265625, 0.71875]}

overall_results_labels1 = ['apartment', 'house', 'street', 'suburb'] + ['apartment', 'house', 'street', 'suburb']
overall_results_labels2 = ['retail', 'retail', 'retail', 'retail'] + ['residential', 'residential', 'residential', 'residential']

df_overall_results = (
    pd.DataFrame(data=overall_results_data)
    .assign(address_level=overall_results_labels1,
            location_type=overall_results_labels2)
    .melt(id_vars=['address_level', 'location_type'], 
          value_vars=['google', 'whereabouts', 'mapbox', 'nominatim'])
)

custom_palette = {
    'whereabouts': '#c34a36',   
    'google': '#b0a8b9', 
    'mapbox': '#bea6a0', 
    'nominatim': '#b9b2d9'
}
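
Because each accuracy is a count of correct addresses divided by the number of addresses, the residential values can be sanity-checked against the sample of 128 residential addresses drawn when the data was loaded. The sketch below repeats the last four values of each list above; the variable names are illustrative.

```python
N_RESIDENTIAL = 128  # size of the residential sample

# Residential accuracies (apartment, house, street, suburb), copied from the cell above.
residential_accuracy = {
    'google': [0.984375, 1, 1, 1],
    'whereabouts': [0.90625, 0.984375, 0.984375, 1],
    'mapbox': [0.5234375, 0.578125, 0.984375, 0.984375],
    'nominatim': [0.1171875, 0.1171875, 0.7265625, 0.71875],
}

residential_counts = {}
for geocoder, values in residential_accuracy.items():
    counts = [value * N_RESIDENTIAL for value in values]
    # Each product should be a whole number of correctly geocoded addresses.
    assert all(count == int(count) for count in counts), geocoder
    residential_counts[geocoder] = [int(count) for count in counts]

print(residential_counts)
```

A non-integer product would indicate a transcription error in the accuracy values or a different denominator.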

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(df_overall_results.query('location_type=="retail"'),
            x='address_level',
            y='value',
            hue='variable', 
            palette=custom_palette)
plt.xlabel('Geographic granularity')
plt.ylabel('Accuracy')
plt.title('Accuracy comparison for retail dataset')
plt.legend(loc='lower right')
plt.savefig('geocoder_comparison_retail_050924.png')

## Speed comparison

The `%%timeit` cell magic is used, which by default carries out 7 runs and reports the mean time per run.

These calculations are done for two databases:
- The large Victoria database: this handles a broader range of spelling errors
- The small Australian database: country wide but with less tolerance for spelling errors
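
Outside a notebook, a comparable measurement can be sketched with the standard library's `timeit.repeat`. The `geocode_stub` function and the generated addresses below are hypothetical placeholders so that the example runs on its own; in practice the call being timed would be the matcher's `geocode` method on the sampled addresses.

```python
import timeit
from statistics import mean, stdev


def geocode_stub(addresses):
    # Hypothetical stand-in for a call such as matcher.geocode(addresses).
    return [address.upper() for address in addresses]


# Made-up inputs, the same size as the 8192-address sample used in this section.
addresses = [f'{i} example st melbourne 3000' for i in range(8192)]

# 7 runs of one call each. Unlike %%timeit, the number of loops per run
# is fixed at 1 here rather than chosen automatically.
run_times = timeit.repeat(lambda: geocode_stub(addresses), repeat=7, number=1)

mean_time = mean(run_times)
throughput = len(addresses) / mean_time  # addresses geocoded per second
print(f'{mean_time:.4f} s ± {stdev(run_times):.4f} s per run, '
      f'{throughput:,.0f} addresses per second')
```

Dividing the number of addresses by the mean run time gives a throughput figure that is easier to compare across batch sizes than the raw run time.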

In [67]:
# load the data
df = pd.read_excel('data/liquor.xlsx', skiprows=3)

In [68]:
# calculate the full address
df = (
    df
    .dropna(subset=['Postcode'])
    .assign(Postcode=lambda df_: df_.Postcode.astype(int))
    .loc[:, ['Address', 'Suburb', 'Postcode']]
    .assign(full_address=lambda df_: df_.Address + ' ' + df_.Suburb + ' ' + df_.Postcode.astype(str))
)

In [69]:
# take a random sample of 8192 addresses
addresses = df.full_address.sample(2**13, random_state=42).values

In [76]:
# 'au_all_sm' is the small Australian database
matcher_au_sm = Matcher('au_all_sm')

In [None]:
%%timeit
results = matcher_au_sm.geocode(addresses)