## Evaluation of reverse geocoding (countries)

First, download [the dataset with locations of countries](https://www.kaggle.com/datasets/liewyousheng/geolocation/code) and put the CSV files into `datasets/geolocations`.

In [17]:
import pandas as pd

from geocoding import settings

df = pd.read_csv(settings.DATASETS_DIR / "geolocations" / "countries.csv")

df.head()

Unnamed: 0,id,name,iso3,iso2,numeric_code,phone_code,capital,currency,currency_name,currency_symbol,tld,native,region,subregion,timezones,latitude,longitude,emoji,emojiU
0,1,Afghanistan,AFG,AF,4,93,Kabul,AFN,Afghan afghani,؋,.af,افغانستان,Asia,Southern Asia,"[{zoneName:'Asia\/Kabul',gmtOffset:16200,gmtOf...",33.0,65.0,🇦🇫,U+1F1E6 U+1F1EB
1,2,Aland Islands,ALA,AX,248,+358-18,Mariehamn,EUR,Euro,€,.ax,Åland,Europe,Northern Europe,"[{zoneName:'Europe\/Mariehamn',gmtOffset:7200,...",60.116667,19.9,🇦🇽,U+1F1E6 U+1F1FD
2,3,Albania,ALB,AL,8,355,Tirana,ALL,Albanian lek,Lek,.al,Shqipëria,Europe,Southern Europe,"[{zoneName:'Europe\/Tirane',gmtOffset:3600,gmt...",41.0,20.0,🇦🇱,U+1F1E6 U+1F1F1
3,4,Algeria,DZA,DZ,12,213,Algiers,DZD,Algerian dinar,دج,.dz,الجزائر,Africa,Northern Africa,"[{zoneName:'Africa\/Algiers',gmtOffset:3600,gm...",28.0,3.0,🇩🇿,U+1F1E9 U+1F1FF
4,5,American Samoa,ASM,AS,16,+1-684,Pago Pago,USD,US Dollar,$,.as,American Samoa,Oceania,Polynesia,"[{zoneName:'Pacific\/Pago_Pago',gmtOffset:-396...",-14.333333,-170.0,🇦🇸,U+1F1E6 U+1F1F8


Keep only the required columns (coordinates, country name and ISO3 code) and drop rows with missing values

In [18]:
source_df = df[["latitude", "longitude", "name", "iso3"]].dropna()
source_df.head()

Unnamed: 0,latitude,longitude,name,iso3
0,33.0,65.0,Afghanistan,AFG
1,60.116667,19.9,Aland Islands,ALA
2,41.0,20.0,Albania,ALB
3,28.0,3.0,Algeria,DZA
4,-14.333333,-170.0,American Samoa,ASM


Extract coordinates into a separate list

In [19]:
points = list(zip(source_df["latitude"], source_df["longitude"]))

Run reverse geocoding for these points

In [20]:
from geocoding.geocoder import get_countries_by_points

countries = get_countries_by_points(points)
countries

{(33.0,
  65.0): HexCountry(hex_id=600163863987486719, id=1, name='Afghanistan', code='AFG'),
 (60.116667,
  19.9): HexCountry(hex_id=599128543867174911, id=4, name='Åland', code='ALA'),
 (41.0,
  20.0): HexCountry(hex_id=599521116762931199, id=5, name='Albania', code='ALB'),
 (28.0,
  3.0): HexCountry(hex_id=599980501096202239, id=64, name='Algeria', code='DZA'),
 (-14.33333333, -170.0): None,
 (42.5,
  1.5): HexCountry(hex_id=599988255659655167, id=69, name='Spain', code='ESP'),
 (-12.5,
  18.5): HexCountry(hex_id=601622808543363071, id=2, name='Angola', code='AGO'),
 (18.25, -63.16666666): None,
 (-74.65,
  4.48): HexCountry(hex_id=603050037986983935, id=11, name='Antarctica', code='ATA'),
 (17.05,
  -61.8): HexCountry(hex_id=600638818650947583, id=13, name='Antigua and Barbuda', code='ATG'),
 (-34.0,
  -64.0): HexCountry(hex_id=602414522413613055, id=8, name='Argentina', code='ARG'),
 (40.0,
  45.0): HexCountry(hex_id=599752897894285311, id=9, name='Armenia', code='ARM'),
 (12.5,
 

In [21]:
result_df = pd.DataFrame(
    [
        (
            *point,
            c.name if (c := countries[point]) else None,
            c.code if c else None,
        )
        for point in points
    ],
    columns=["latitude", "longitude", "country_name", "country_code"],
)
result_df.head()

Unnamed: 0,latitude,longitude,country_name,country_code
0,33.0,65.0,Afghanistan,AFG
1,60.116667,19.9,Åland,ALA
2,41.0,20.0,Albania,ALB
3,28.0,3.0,Algeria,DZA
4,-14.333333,-170.0,,


Merge the two datasets on coordinates

In [22]:
merged_df = pd.merge(
    source_df, result_df, on=["latitude", "longitude"], suffixes=("_source", "_result")
)
merged_df.head()

Unnamed: 0,latitude,longitude,name,iso3,country_name,country_code
0,33.0,65.0,Afghanistan,AFG,Afghanistan,AFG
1,60.116667,19.9,Aland Islands,ALA,Åland,ALA
2,41.0,20.0,Albania,ALB,Albania,ALB
3,28.0,3.0,Algeria,DZA,Algeria,DZA
4,-14.333333,-170.0,American Samoa,ASM,,
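
The merge above joins on floating-point coordinates, which works here because both frames were built from the same values. As a hedged sketch (the two small frames below are stand-ins, not the notebook's data), the `validate` argument of `pd.merge` can guard against duplicated coordinates silently multiplying rows:

```python
import pandas as pd

# Stand-in frames with the same key columns as source_df and result_df (illustrative only)
source = pd.DataFrame(
    {"latitude": [33.0, 41.0], "longitude": [65.0, 20.0], "iso3": ["AFG", "ALB"]}
)
result = pd.DataFrame(
    {"latitude": [33.0, 41.0], "longitude": [65.0, 20.0], "country_code": ["AFG", "ALB"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if a coordinate pair
# appears more than once on either side; how="left" keeps every source row.
checked = pd.merge(
    source,
    result,
    on=["latitude", "longitude"],
    how="left",
    validate="one_to_one",
)
print(len(checked) == len(source))
```

If the check fails on real data, deduplicating the coordinate pairs before merging is one possible fix.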


Evaluate results

In [23]:
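total_num = len(merged_df)
# Points for which the geocoder returned no country
nulls_num = merged_df["country_code"].isnull().sum()
# Points resolved to a country whose code differs from the expected iso3
wrong_num = (
    merged_df["country_code"].notnull()
    & (merged_df["iso3"] != merged_df["country_code"])
).sum()
correct_num = total_num - nulls_num - wrong_num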
print(f"Failed (nulls): {nulls_num}/{total_num}")
print(f"Wrong: {wrong_num}/{total_num}")
print(f"Correct: {correct_num}/{total_num} ({correct_num / total_num * 100:.2f}%)")

Failed (nulls): 58/250
Wrong: 5/250
Correct: 187/250 (74.80%)
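
The overall percentage mixes two different effects: points that were not resolved at all and points resolved to the wrong country. A minimal sketch of separating them is shown below; the `summarize` helper and its metric names are hypothetical and not part of the `geocoding` package, and the counts are the ones printed above.

```python
def summarize(total: int, nulls: int, wrong: int) -> dict:
    """Split evaluation counts into coverage and accuracy (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    resolved = total - nulls    # points for which some country was returned
    correct = resolved - wrong  # resolved points whose code matches iso3
    return {
        "coverage": resolved / total,
        "accuracy_on_resolved": correct / resolved if resolved else 0.0,
        "overall_accuracy": correct / total,
    }


# Counts taken from the evaluation output above
metrics = summarize(total=250, nulls=58, wrong=5)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

With these counts, coverage is 192/250 (76.80%) and accuracy on resolved points is 187/192 (97.40%), so most of the error comes from points for which no country was returned.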


Show points that failed (no result) or were reverse geocoded incorrectly

In [24]:
merged_df[merged_df["iso3"] != merged_df["country_code"]]

Unnamed: 0,latitude,longitude,name,iso3,country_name,country_code
4,-14.333333,-170.000000,American Samoa,ASM,,
5,42.500000,1.500000,Andorra,AND,Spain,ESP
7,18.250000,-63.166667,Anguilla,AIA,,
16,24.250000,-76.000000,Bahamas The,BHS,,
30,-54.433333,3.400000,Bouvet Island,BVT,,
...,...,...,...,...,...,...
244,18.340000,-64.930000,Virgin Islands (US),VIR,,
245,-13.300000,-176.200000,Wallis And Futuna Islands,WLF,,
247,15.000000,48.000000,Yemen,YEM,,
248,-15.000000,30.000000,Zambia,ZMB,,
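
The filter above returns both kinds of misses in one table: rows with no result and rows resolved to a different country. A minimal sketch of splitting them is shown below; `sample_df` is a stand-in with the same column names as `merged_df`, using three rows copied from the outputs above.

```python
import pandas as pd

# Stand-in for merged_df (illustrative subset of the rows shown above)
sample_df = pd.DataFrame(
    {
        "name": ["Afghanistan", "American Samoa", "Andorra"],
        "iso3": ["AFG", "ASM", "AND"],
        "country_name": ["Afghanistan", None, "Spain"],
        "country_code": ["AFG", None, "ESP"],
    }
)

resolved = sample_df["country_code"].notna()
mismatch = sample_df["iso3"] != sample_df["country_code"]

failed_df = sample_df[~resolved]           # no country returned
wrong_df = sample_df[resolved & mismatch]  # a different country returned

print(failed_df["name"].tolist())
print(wrong_df["name"].tolist())
```

Applied to `merged_df`, the same two masks should give the 58 failed rows and the 5 wrong rows counted in the evaluation step.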
