## Evaluation of reverse geocoding (country subdivisions)

First, download [the dataset with locations of countries](https://www.kaggle.com/datasets/liewyousheng/geolocation/code) and put its CSV files into `datasets/geolocations`. This notebook uses `states.csv`.


In [53]:
import pandas as pd

from geocoding import settings

df = pd.read_csv(settings.DATASETS_DIR / "geolocations" / "states.csv")

df.head()

,id,name,country_id,country_code,country_name,state_code,type,latitude,longitude
0,3901,Badakhshan,1,AF,Afghanistan,BDS,,36.734772,70.811995
1,3871,Badghis,1,AF,Afghanistan,BDG,,35.167134,63.769538
2,3875,Baghlan,1,AF,Afghanistan,BGL,,36.178903,68.745306
3,3884,Balkh,1,AF,Afghanistan,BAL,,36.75506,66.897537
4,3872,Bamyan,1,AF,Afghanistan,BAM,,34.810007,67.82121


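Keep only the columns needed for the evaluation (coordinates, subdivision name, country name, state code) and drop incomplete rows. Note that `dropna()` removes a row if any of the selected columns is empty, not only the coordinates.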

In [54]:
source_df = df[["latitude", "longitude", "name", "country_name", "state_code"]].dropna()
source_df.head()

,latitude,longitude,name,country_name,state_code
0,36.734772,70.811995,Badakhshan,Afghanistan,BDS
1,35.167134,63.769538,Badghis,Afghanistan,BDG
2,36.178903,68.745306,Baghlan,Afghanistan,BGL
3,36.75506,66.897537,Balkh,Afghanistan,BAL
4,34.810007,67.82121,Bamyan,Afghanistan,BAM


Extract the coordinates into a separate list of `(latitude, longitude)` tuples.

In [None]:
points = list(zip(source_df["latitude"], source_df["longitude"]))

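Run reverse geocoding for these points. The result is looked up by point in the next cell (`subdivisions[point]`), and points without a match are stored as `None`.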

In [None]:
from geocoding.geocoder import get_countries_subdivisions_by_points

subdivisions = get_countries_subdivisions_by_points(points)
subdivisions

In [None]:
result_df = pd.DataFrame(
    [
        (
            *point,
            s.name if (s := subdivisions[point]) else None,
            s.code if s else None,
            s.hasc_code if s else None,
        )
        for point in points
    ],
    columns=[
        "latitude",
        "longitude",
        "subdivision_name",
        "subdivision_code",
        "hasc_code",
    ],
)
result_df.head()

In [None]:
merged_df = pd.merge(
    source_df, result_df, on=["latitude", "longitude"], suffixes=("_source", "_result")
)
merged_df.head()
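One thing worth checking at this step: `result_df` has one row per input point, so a coordinate pair that occurs more than once in `source_df` occurs more than once on both sides of the merge, and pandas then returns every combination of the matching rows. I have not verified how many such duplicates the dataset contains. The sketch below uses made-up toy frames (the names `source`, `result`, `naive`, and `deduplicated` are illustrative only) to show the effect and one possible way around it.

```python
import pandas as pd

# Toy data (made up): the coordinate (1.0, 2.0) appears twice in the source.
source = pd.DataFrame(
    {
        "latitude": [1.0, 1.0, 3.0],
        "longitude": [2.0, 2.0, 4.0],
        "name": ["A", "B", "C"],
    }
)
# One result row per input point, so the duplicated coordinate appears twice here too.
result = pd.DataFrame(
    {
        "latitude": [1.0, 1.0, 3.0],
        "longitude": [2.0, 2.0, 4.0],
        "subdivision_name": ["X", "X", "Y"],
    }
)

keys = ["latitude", "longitude"]

# Duplicated keys on both sides: 2 x 2 rows for (1.0, 2.0) plus 1 row for (3.0, 4.0).
naive = pd.merge(source, result, on=keys)

# Keeping one result row per coordinate preserves the source row count.
deduplicated = pd.merge(source, result.drop_duplicates(subset=keys), on=keys)

print(len(source), len(naive), len(deduplicated))  # 3 5 3
```

Passing `validate="many_to_one"` to `pd.merge` is another option: it raises an error instead of silently multiplying rows when the right-hand keys are not unique.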

In the source dataset, the format of the subdivision code varies between countries, so I compare subdivision names instead. Before the comparison, generic words such as "state" or "province" are removed from both names. The names are then compared using the Jaro similarity of the two strings, which ranges from 0 (nothing in common) to 1 (identical). I use 0.75 as the threshold: if the similarity is at least 0.75, I treat the two names as referring to the same subdivision.
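To make the metric less of a black box, here is a minimal pure-Python sketch of the textbook Jaro similarity. It is only an illustration of the definition: the notebook itself calls `jellyfish.jaro_similarity`, and I have not checked that this sketch agrees with the library in every edge case (for example, how two empty strings are scored).

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return the Jaro similarity of two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0  # convention chosen for this sketch, also for two empty strings
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters count as a match only if they are at most
    # `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2

    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


# Classic textbook pairs.
print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
print(round(jaro_similarity("DIXON", "DICKSONX"), 4))  # 0.7667
```

With a threshold of 0.75, both textbook pairs above would count as the same name, which gives a feel for how permissive the threshold is.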

In [None]:
import jellyfish

SUBDIVISIONS_NAMES = (
    "state",
    "province",
    "land",
    "oblast",
    "governorate",
    "canton",
    "prefectur",
    "region",
    "department",
    "emirate",
    "circuit",
    "count",
    "comarca",
    "raion",
    "judet",
    "district",
    "municipalit",
    "commun",
)


def clean_name(name):
    """Remove "state", "province" etc from the name."""
    name = name or ""
    return "".join(
        [
            word
            for word in name.split()
            if not word.lower().startswith(SUBDIVISIONS_NAMES)
        ]
    )


def similar(s1, s2):
    """Check if two names are similar."""
    return jellyfish.jaro_similarity(clean_name(s1), clean_name(s2)) >= 0.75


def similar_row(row):
    """Check name and subdivision_name are similar."""
    return similar(row["name"], row["subdivision_name"])
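The cleaning step has two side effects that are easy to miss: the remaining words are joined without spaces, and the prefix test removes any word that merely starts with one of the listed terms. The sketch below reproduces the same idea with a shortened prefix list. The function name `strip_subdivision_words` and the sample names are illustrative assumptions; I have not checked whether names like "Landes" occur in the dataset.

```python
# Shortened version of the prefix list, for illustration only.
PREFIXES = ("state", "province", "land", "count", "region", "district")


def strip_subdivision_words(name):
    """Drop words starting with a generic subdivision term; join the rest without spaces."""
    return "".join(
        word
        for word in (name or "").split()
        if not word.lower().startswith(PREFIXES)
    )


print(strip_subdivision_words("Eastern Province"))  # Eastern
print(strip_subdivision_words("Bulqizë District"))  # Bulqizë
print(strip_subdivision_words("New York"))          # NewYork
print(repr(strip_subdivision_words("Landes")))      # '' because "landes" starts with "land"
print(repr(strip_subdivision_words(None)))          # ''
```

A name that is reduced to an empty string can never reach the 0.75 threshold against a non-empty name, so over-eager prefixes show up as false "wrong" results rather than as errors.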

In [66]:
total_num = len(merged_df)
nulls_num = merged_df["subdivision_name"].isnull().sum()
# Wrong: a subdivision was found, but its name is NOT similar to the source name.
wrong_num = (
    merged_df["subdivision_name"].notnull()
    & ~merged_df.apply(similar_row, axis=1)
).sum()
correct_num = total_num - nulls_num - wrong_num
print(f"Failed (nulls): {nulls_num}/{total_num}")
print(f"Wrong: {wrong_num}/{total_num}")
print(f"Correct: {correct_num}/{total_num} ({correct_num / total_num * 100:.2f}%)")

Failed (nulls): 274/5006
Wrong: 2075/5006
Correct: 2657/5006 (53.08%)
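The three outcomes (no result, wrong name, correct name) are mutually exclusive, so another way to compute the tally is to label each row exactly once and count the labels. This is only a sketch on made-up rows: `classify` and `same_ignoring_case` are hypothetical helpers, and case-insensitive equality stands in for the Jaro-based check.

```python
from collections import Counter


def classify(expected, found, is_similar):
    """Label one row: 'failed' (no result), 'correct' (names similar) or 'wrong'."""
    if found is None:
        return "failed"
    return "correct" if is_similar(expected, found) else "wrong"


def same_ignoring_case(a, b):
    """Stand-in similarity check used only in this sketch."""
    return a.lower() == b.lower()


# Made-up (source name, geocoded name) pairs.
rows = [
    ("Balkh", "Balkh"),
    ("Helmand", "Maryland"),
    ("Bamyan", None),
    ("Lusaka", "lusaka"),
]

counts = Counter(
    classify(expected, found, same_ignoring_case) for expected, found in rows
)
print(counts["failed"], counts["wrong"], counts["correct"])  # 1 1 2
```

Because every row gets exactly one label, the three counts always add up to the number of rows, which makes an inverted condition easier to spot.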


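Show the rows whose geocoded subdivision name is not similar to the source name. Rows with no result are included as well, because an empty name never reaches the threshold.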

In [67]:
merged_df[merged_df.apply(lambda row: not similar_row(row), axis=1)]

,latitude,longitude,name,country_name,state_code,subdivision_name,subdivision_code,hasc_code
10,39.298936,-76.616047,Helmand,Afghanistan,HEL,Maryland,USA.21_1,US.MD
26,38.880239,-77.171724,Panjshir,Afghanistan,PAN,Virginia,USA.47_1,US.VA
35,41.494259,20.214716,Bulqizë District,Albania,BU,Dibër,ALB.2_1,AL.DB
36,39.948136,20.095589,Delvinë District,Albania,DL,Vlorë,ALB.12_1,AL.VR
37,40.644735,20.950664,Devoll District,Albania,DV,Korçë,ALB.7_1,AL.KE
...,...,...,...,...,...,...,...,...
4989,23.166969,49.365315,Eastern Province,Zambia,03,Ash Sharqiyah,SAU.8_1,SA.SH
4992,-15.382193,28.261580,Muchinga Province,Zambia,10,Lusaka,ZMB.5_1,ZM.LS
4994,6.237375,80.543845,Southern Province,Zambia,07,Matara,LKA.17_1,LK.MH
4995,6.901609,80.008775,Western Province,Zambia,01,Colombo,LKA.5_1,LK.CO
