In [1]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from geopy.geocoders import Nominatim

In [3]:
hd = ["loc"]
df = pd.read_fwf('tweet.txt', names = hd)
df["id"] = df.index
df["query"] = df["loc"]
df

Unnamed: 0,loc,id,query
0,@ShortStintatPlanet-Earth-UK,0,@ShortStintatPlanet-Earth-UK
1,"**302** Wilmington, Delaware",1,"**302** Wilmington, Delaware"
2,#Blockchain,2,#Blockchain
3,#Cryptoverse!,3,#Cryptoverse!
4,#Earth #Europe #Germany #NRW,4,#Earth #Europe #Germany #NRW
...,...,...,...
495,भारत,495,भारत
496,চাঁপাইনবাবগঞ্জ,496,চাঁপাইনবাবগঞ্জ
497,スワップ部屋,497,スワップ部屋
498,太阳系第三行星-中国四川,498,太阳系第三行星-中国四川


In [4]:
df["location_lat"]=""
df["location_long"]=""
df["location_address"]=""
df.head()

Unnamed: 0,loc,id,query,location_lat,location_long,location_address
0,@ShortStintatPlanet-Earth-UK,0,@ShortStintatPlanet-Earth-UK,,,
1,"**302** Wilmington, Delaware",1,"**302** Wilmington, Delaware",,,
2,#Blockchain,2,#Blockchain,,,
3,#Cryptoverse!,3,#Cryptoverse!,,,
4,#Earth #Europe #Germany #NRW,4,#Earth #Europe #Germany #NRW,,,


The cell below takes approximately 3 to 5 minutes to finish.

In [5]:
geolocator = Nominatim(user_agent="myApp")

for i in df.index:
    try:
        #try to fetch the address from geopy; geocode() returns None when nothing matches
        location = geolocator.geocode(df['query'][i])
    except Exception:
        #treat network errors and timeouts as a missing result
        location = None

    if location is not None:
        #write lat/long/address into the row using dataframe location
        df.loc[i,'location_lat'] = location.latitude
        df.loc[i,'location_long'] = location.longitude
        df.loc[i,'location_address'] = location.address
    else:
        #no match: leave the coordinates empty and flag the address
        df.loc[i,'location_lat'] = ""
        df.loc[i,'location_long'] = ""
        df.loc[i,'location_address'] = "MISSING"

#print first rows as sample
df.head()

Unnamed: 0,loc,id,query,location_lat,location_long,location_address
0,@ShortStintatPlanet-Earth-UK,0,@ShortStintatPlanet-Earth-UK,,,MISSING
1,"**302** Wilmington, Delaware",1,"**302** Wilmington, Delaware",39.720746,-75.542273,"302, Wilmington Avenue, Southbridge, Middlebor..."
2,#Blockchain,2,#Blockchain,44.646598,34.400734,"BLOCKCHAIN, Набережная улица, Профессорский уг..."
3,#Cryptoverse!,3,#Cryptoverse!,,,MISSING
4,#Earth #Europe #Germany #NRW,4,#Earth #Europe #Germany #NRW,,,MISSING
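
Profile location strings are likely to repeat across users, so one way to shorten the run is to look each distinct query up only once. This is a minimal sketch, not part of the original notebook. `make_cached_geocoder` is a hypothetical helper that wraps any geocoding callable, such as `geolocator.geocode`. A stand-in function with made-up coordinates is used so the example runs without network access.

```python
from functools import lru_cache

def make_cached_geocoder(geocode):
    """Wrap a geocoding callable so each distinct query is looked up once."""
    @lru_cache(maxsize=None)
    def cached(query):
        return geocode(query)
    return cached

# Stand-in for geolocator.geocode (assumption: same call shape, None on a miss).
# The coordinates are illustrative, not real geocoder output.
calls = []
def fake_geocode(query):
    calls.append(query)
    return {"London": (51.5, -0.1)}.get(query)

cached_geocode = make_cached_geocoder(fake_geocode)
results = [cached_geocode(q) for q in ["London", "London", "#Cryptoverse!"]]

# The backend is hit once per distinct query, not once per row.
print(len(calls), results)
```

In the loop above, `geolocator.geocode` could be replaced by the wrapped callable; misses are cached as well, so a repeated unmatched string is not retried.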


In [6]:
#checkpoint
#df.to_excel('raw_data.xls', index=False)

In [7]:
#df = pd.read_excel('raw_data.xls')
#df.head()

We can see that geopy produces many false matches:
1. crypto terms used as tongue-in-cheek "locations" (like "Blockchain" or "Decentralized") are misinterpreted as real places
2. astronomical locations ("Aldebaran", "Andromeda", etc.) are assigned to eponymous places all over Earth
3. words and phrases with locational meaning ("home", "somewhere out there", "anywhere", "over there") are interpreted incorrectly, usually as (parts of) names of public places; for example, and funnily enough, "Anywhere" is interpreted as "Go Anywhere Slab" in Malawi (see https://peakvisor.com/peak/go-anywhere-slab.html for hiking suggestions)
4. general geopy mistakes ("Albany, New York" is assigned to Albany Avenue in Connecticut, even though the input is unambiguous and should be easy for a machine to resolve)

In my view, the best way to address this problem is to maximize the accuracy and reliability of the predictions rather than the number of entries the model predicts. My guess is that it is safer business-wise to know for certain that the user's location is fuzzy or unidentified than to assume that "#Blockchain" refers to a nightclub in Alushta, Crimea (https://www.instagram.com/blockchain_moreclub/), as geopy does, and to build mistakenly on this assumption. In short, we prioritize precision over recall.
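
For reference, these are the usual definitions, with TP, FP and FN denoting the counts of true positives, false positives and false negatives:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
```

Discarding an ambiguous input such as "#Blockchain" removes a potential false positive, which can only help precision. The price is a possible extra false negative, which lowers recall.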

To get a quality model, we exclude any inputs containing:
- crypto-terms
- astronomical names
- non-geographical locational expressions
- hashtags
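
The exclusion rules above could be sketched as a simple pre-filter. This is an illustration only: `is_excluded` and the three term sets are hypothetical, seeded with the examples mentioned in this notebook. A real list would have to be built from the data.

```python
import re

# Hypothetical term lists for illustration, taken from examples in the text.
CRYPTO_TERMS = {"blockchain", "decentralized", "cryptoverse"}
ASTRO_TERMS = {"aldebaran", "andromeda", "moon"}
VAGUE_PHRASES = {"home", "anywhere", "over there", "somewhere out there"}

def is_excluded(text):
    """Return True if the input should be dropped before geotagging."""
    lowered = text.lower()
    if "#" in lowered:                       # hashtags
        return True
    if lowered.strip() in VAGUE_PHRASES:     # non-geographical expressions
        return True
    words = set(re.findall(r"[a-z]+", lowered))
    return bool(words & (CRYPTO_TERMS | ASTRO_TERMS))

queries = ["#Blockchain", "Anywhere", "Aldebaran", "Lagos, Nigeria"]
print([q for q in queries if not is_excluded(q)])
```

A keyword list like this only catches terms someone has thought of in advance, which is one reason to lean on an entity recognizer instead.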

The geopy baseline occasionally mislocates genuine geographical entries. To exclude non-geopolitical entries and provide an alternative to geopy, we use spacy and keep only geopolitical entities (GPE) as predictions.

In [8]:
import spacy
from spacy import displacy

In [9]:
# requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

We keep only the entities that the spacy model labels as geopolitical (cities, countries, etc.), which reduces the number of false positives.

In [10]:
queries = df['query'].tolist()

#for each query, keep only the entities spacy labels as geopolitical (GPE)
entities_by_query = []
for doc in nlp.pipe(queries):
  gpe_entities = []
  for ent in doc.ents:
    if ent.label_ == "GPE":
      gpe_entities.append(ent)
  entities_by_query.append(gpe_entities)

df['geopy_pred'] = df["location_address"]
df['spacy_pred'] = pd.Series(entities_by_query)

In [11]:
from IPython.display import display

display(df)

Unnamed: 0,loc,id,query,location_lat,location_long,location_address,geopy_pred,spacy_pred
0,@ShortStintatPlanet-Earth-UK,0,@ShortStintatPlanet-Earth-UK,,,MISSING,MISSING,[]
1,"**302** Wilmington, Delaware",1,"**302** Wilmington, Delaware",39.720746,-75.542273,"302, Wilmington Avenue, Southbridge, Middlebor...","302, Wilmington Avenue, Southbridge, Middlebor...","[(Wilmington), (Delaware)]"
2,#Blockchain,2,#Blockchain,44.646598,34.400734,"BLOCKCHAIN, Набережная улица, Профессорский уг...","BLOCKCHAIN, Набережная улица, Профессорский уг...",[]
3,#Cryptoverse!,3,#Cryptoverse!,,,MISSING,MISSING,[]
4,#Earth #Europe #Germany #NRW,4,#Earth #Europe #Germany #NRW,,,MISSING,MISSING,[(Germany)]
...,...,...,...,...,...,...,...,...
495,भारत,495,भारत,22.351115,78.667743,India,India,[]
496,চাঁপাইনবাবগঞ্জ,496,চাঁপাইনবাবগঞ্জ,24.595443,88.270761,"চাঁপাইনবাবগঞ্জ, Nawabganj, চাপাইনবাবগঞ্জ জেলা,...","চাঁপাইনবাবগঞ্জ, Nawabganj, চাপাইনবাবগঞ্জ জেলা,...",[]
497,スワップ部屋,497,スワップ部屋,,,MISSING,MISSING,[]
498,太阳系第三行星-中国四川,498,太阳系第三行星-中国四川,,,MISSING,MISSING,[(太阳系第三行星)]


In [12]:
#df.to_excel('processed_full.xls', index=False)

In [13]:
#drop() returns a new DataFrame, so later column assignments do not raise SettingWithCopyWarning
df = df.drop(columns=['id', 'query', 'location_lat', 'location_long', 'location_address'])

The issue with geopy is that its output needs to be verified by a human. This makes the comparison of geopy and my model costly, as it requires a person to review the models' outputs case by case and judge which one provides the better response. On the task data with 500 entries, such an evaluation is possible and demonstrates that geopy can be improved. If the composition of the data does not change much with further additions, focusing exclusively on geopolitical names may be the best way forward.

The model I suggest returns only the most certain geographical matches and discards everything else, including hashtags, star names, general phrases, etc. One could view it as a more precise estimator of location, even though it gives up the observations about which we cannot be certain without taking a considerable risk. In my view, this is still a better way of geotagging in situations where false positives should be avoided at all costs, including the cost of forgoing some true positives.

In [14]:
import numpy as np
df["geopy_pred"].value_counts(dropna = False)

MISSING                                                                                                             69
London, Greater London, England, United Kingdom                                                                      4
Nederland                                                                                                            4
Berlin, Deutschland                                                                                                  4
City of New York, New York, United States                                                                            4
                                                                                                                    ..
Guayaquil, Guayas, Ecuador                                                                                           1
Grafton, Worcester County, Massachusetts, United States                                                              1
GLOBAL, Jalan Raya Ngawi - Maospati, Bayemtaman,

In [15]:
df['spacy_pred'] = df['spacy_pred'].str[0]
df["spacy_pred"].value_counts(dropna = False)

NaN                 213
(Wilmington)          1
(Österreich)          1
(France)              1
(Orange, County)      1
                   ... 
(Gurgaon)             1
(Gujarat)             1
(Indonesia)           1
(Glasgow)             1
(太阳系第三行星)             1
Name: spacy_pred, Length: 288, dtype: int64

The geopy model makes 431 predictions, some of them credible-looking false positives, as we have already demonstrated ("Aldebaran", "Anywhere", etc.). The spacy model makes 288 predictions, which is 143 fewer than geopy, but these predictions are restricted to what spacy labels as geopolitical entities, such as countries or cities, rather than phrases or star-named bars or hotels.

Admittedly, the predictions from spacy are less specific, and sometimes it does not recognize places that geopy does identify correctly, such as smaller villages or cities. It is possible to compensate for this by running geopy on the spacy results in order to retrieve the coordinates for the places spacy has accepted. Still, spacy evidently does better than geopy at rejecting non-locations (true negatives), and therefore might offer a safer approach. Additionally, it makes predictions more quickly than geopy, especially if coordinates are not required.
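
Running geopy on the spacy results could look roughly like the sketch below. `geocode_entities` is a hypothetical helper, not code from this notebook. It accepts any callable that returns an object with `.latitude` and `.longitude`, or None on a miss, which is how `Nominatim.geocode` behaves. A stand-in geocoder with made-up coordinates keeps the example offline.

```python
from types import SimpleNamespace

def geocode_entities(entity_texts, geocode):
    """Geocode only the strings that were accepted as geopolitical entities.

    Rows without an entity (None) are passed through with empty coordinates.
    """
    rows = []
    for text in entity_texts:
        hit = geocode(text) if text else None
        if hit is None:
            rows.append((text, None, None))
        else:
            rows.append((text, hit.latitude, hit.longitude))
    return rows

# Stand-in geocoder so the sketch runs offline; the coordinates are made up.
def fake_geocode(text):
    known = {"Glasgow": SimpleNamespace(latitude=55.9, longitude=-4.3)}
    return known.get(text)

print(geocode_entities(["Glasgow", None, "Dreamland"], fake_geocode))
```

With real data, the first argument would be the text of each spacy entity (for example `ent.text`) and the second would be `geolocator.geocode`.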

In [16]:
df.to_excel('processed.xls', index=False)


In [23]:
#random.seed() does not affect DataFrame.sample, so the seed is passed via random_state
sample = df.sample(n = 100, random_state = 2022)

In [25]:
sample.to_excel('sample.xls', index=False)


In [40]:
url = 'https://github.com/avarsenev/INCA/blob/main/sample_evaluated.xls?raw=true'
df_sample = pd.read_excel(url, engine='xlrd')

In [40]:
df_sample

Unnamed: 0,loc,geopy_pred,spacy_pred,g,s
0,"Rotorua District, New Zealand","Rotorua Lakes District, Bay of Plenty, New Zea...",New Zealand,1,1
1,"Puerto Rico, USA","Puerto Rico, United States",Puerto Rico,1,1
2,"Virginia, USA","Virginia, United States",Virginia,1,1
3,Kansas City,"Kansas City, Jackson County, Missouri, United ...",Kansas City,1,1
4,"Lagos, Nigeria","Lagos, Lagos Island, Lagos, 100242, Nigeria",Lagos,1,1
...,...,...,...,...,...
96,"Malta, Österreich","Malta, Bezirk Spittal an der Drau, Kärnten, 98...",Malta,1,1
97,"College Station, TX","College Station, Brazos County, Texas, United ...",,1,0
98,"England, United Kingdom","England, United Kingdom",England,1,1
99,Österreich,Österreich,Österreich,1,1


I take a sample of 100 entries and evaluate the outputs of the two models by hand. A "0" marks either an incorrect prediction (e.g., the same city in a different state) or a missed prediction, where the input names a real geopolitical location that the model could have predicted but did not. A "1" marks everything else (a fully correct prediction, a prediction only at the country level, a true negative, etc.). Geopy returns 84 correct responses out of 100; spacy returns 75 out of 100. But, as I have suggested before, the issue is not just a matter of accuracy but also of minimizing the number of false positives.
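
The accuracy figure is simply the share of rows labelled 1. As a small illustration, the sketch below uses only the nine sample rows visible in the table above (indices 0-4 and 96-99). The remaining 91 rows are not reproduced here, so the printed numbers describe those nine rows and not the full sample.

```python
# Hand labels for the nine visible sample rows (indices 0-4 and 96-99).
g = [1, 1, 1, 1, 1, 1, 1, 1, 1]   # geopy prediction judged acceptable?
s = [1, 1, 1, 1, 1, 1, 0, 1, 1]   # spacy prediction judged acceptable?

def accuracy(labels):
    """Share of rows marked 1 (correct or acceptable)."""
    return sum(labels) / len(labels)

print(f"geopy: {accuracy(g):.2f}, spacy: {accuracy(s):.2f}")
```

On the full sample, `df_sample[["g", "s"]].mean()` computes the same ratio for each column.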

Notably, spacy is less precise in the substantive sense and sometimes predicts, e.g., only the country when the data is given down to the city and postcode level. At the same time, spacy is more conservative than geopy, and in the sample it has only ONE false positive, Dreamland. Neither model has a way of knowing that this is not a real place: there are places with that name in the world, and the idea that the user means an unreal Dreamland rests on my own human assumption.

In comparison, the geopy model suffers a lot from false triggers ("moon", "Aldebaran", "Planet Earth", "Decentralized", etc., all attributed to some geographical location), notably mistaking almost ALL of the ambiguous entries. That is, baseline geopy simply ploughs through the data and returns confident matches for inputs that are not places at all.

My final suggestion is that spacy might be preferable as a more reliable estimator of true geopolitical entities. Furthermore, geopy could be run on spacy's responses, along with the initial data, as a way to check whether the entry is real and only then to derive its coordinates and a more precise location than the one spacy is capable of giving. If the cost of a mistake is higher than the cost of an omission, spacy can be preferred as the more conservative approach.