# Anonymise Places

In this notebook I illustrate how to identify and anonymise places in Python without using NLP techniques such as Named Entity Recognition.

Place identification is based on a gazetteer built from the [Geonames Database](http://www.geonames.org/). Geonames is a Web service that contains (almost) all the places in the world. The Geonames database can be downloaded for free at this [link](https://download.geonames.org/export/dump/). You can download the full database, which covers all countries, or the file for a single country.


## Import the Geonames Database

In [66]:
import pandas as pd

df = pd.read_csv('source/IT.txt', sep='\t', header=None)
df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,781057,Fosso di San Antonio,Fosso di San Antonio,,41.48333,27.76667,H,STM,IT,,0.0,,,,0,,171,Europe/Rome,1993-12-10
1,781059,Colognole,Colognole,,43.50972,10.44833,P,PPL,IT,,16.0,LI,49008.0,,128,210.0,208,Europe/Rome,2014-01-19
2,781060,Casale Sant'Antonio,Casale Sant'Antonio,,44.61907,11.02235,P,PPL,IT,,5.0,MO,36006.0,,59,,35,Europe/Rome,2014-05-04
3,2522617,Graham Island,Graham Island,"Banco Graham,Banco Grahm,Ferdinandea Bank,Ferd...",37.14266,12.88126,U,SHLU,IT,,15.0,,,,0,,-9999,,2021-04-27
4,2522676,Zungti,Zungti,,38.65,15.98333,P,PPL,IT,,3.0,VV,102050.0,,0,,511,Europe/Rome,2011-09-11
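
Because the dump has no header row, the columns above are only numbered 0 to 18. As a minimal sketch, the file could instead be read with named columns. The column names below follow the field list in the Geonames readme for the main table, spelled in snake_case; `GEONAMES_COLUMNS` and `load_geonames` are hypothetical names introduced here, and the sample row is copied from the output above.

```python
import csv
import io

import pandas as pd

# Field names as listed in the Geonames readme (snake_case spelling is my own).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1_code", "admin2_code",
    "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def load_geonames(source):
    """Read a tab-separated Geonames dump into a DataFrame with named columns."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=GEONAMES_COLUMNS,
        dtype=str,               # read every field as text to avoid mixed-type columns
        keep_default_na=False,   # keep empty fields as "" instead of NaN
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )


# One row from the Italian file, rebuilt as a tab-separated line for illustration.
sample_row = "\t".join([
    "781059", "Colognole", "Colognole", "", "43.50972", "10.44833",
    "P", "PPL", "IT", "", "16", "LI", "49008", "", "128", "210", "208",
    "Europe/Rome", "2014-01-19",
])
df_named = load_geonames(io.StringIO(sample_row + "\n"))
print(df_named.loc[0, ["name", "feature_code", "country_code"]].tolist())
```

With named columns, the gazetteer would be built from `df_named["name"]` rather than from column `1`.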


In [67]:
gaz = df[1]
gaz = gaz.tolist()

In [68]:
len(gaz)

119539
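
A lookup on a Python list scans its entries one by one, so every test against `gaz` may touch all 119539 names. A minimal sketch of an alternative, assuming only membership matters and not the position of a name, is to store the names in a set. `build_gazetteer` and the toy list are hypothetical and stand in for `gaz`.

```python
def build_gazetteer(names):
    """Return a set of unique, non-empty place names for fast membership tests."""
    return {
        name.strip()
        for name in names
        if isinstance(name, str) and name.strip()
    }


# Toy list standing in for `gaz`; float("nan") mimics a missing value from pandas.
toy_names = ["Roma", "Milano", "Roma", " Colognole ", float("nan"), ""]
gaz_set = build_gazetteer(toy_names)

print("Roma" in gaz_set)      # hash lookup instead of a linear scan
print(sorted(gaz_set))
```

A set also removes duplicate names, which a gazetteer used only for lookups does not need.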

## Identify Places

In [193]:
from nltk import ngrams
import re

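def get_places(txt):
    # remove punctuation, keeping apostrophes (e.g. Casale Sant'Antonio)
    txt = re.sub(r"[^\w'\s]+", '', txt)
    n = 5  # longest n-gram (in words) to look up
    places = []

    # try longer n-grams first so multi-word names win over their parts
    for i in range(n, 0, -1):
        tokens = ngrams(txt.split(), i)
        for t in tokens:
            token = " ".join(t)
            # membership test, so the entry at index 0 of gaz is matched too
            if token in gaz:
                places.append(token)
                # drop the match so its words are not matched again as shorter n-grams
                txt = txt.replace(token, "")
    return places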

In [195]:
txt = 'Oggi sono andata a Roma e a Milano.'
get_places(txt)

['Roma', 'Milano']
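
To make the longest-first matching order easier to follow, here is a small self-contained sketch of the same idea. It is not the function above: it builds word n-grams in plain Python instead of using `nltk`, uses a toy set as the gazetteer, and tracks token positions so that a word covered by a longer match is not reused. The names `word_ngrams`, `find_places` and `toy_gazetteer` are hypothetical.

```python
import re


def word_ngrams(tokens, n):
    """Yield (start_index, n-gram tuple) for every run of n consecutive tokens."""
    for start in range(len(tokens) - n + 1):
        yield start, tuple(tokens[start:start + n])


def find_places(text, gazetteer, max_n=5):
    """Return gazetteer entries found in text, trying longer n-grams first."""
    tokens = re.sub(r"[^\w'\s]+", "", text).split()
    used = [False] * len(tokens)  # tokens already covered by a longer match
    places = []
    for n in range(max_n, 0, -1):
        for start, gram in word_ngrams(tokens, n):
            if any(used[start:start + n]):
                continue  # overlaps a place that was already found
            candidate = " ".join(gram)
            if candidate in gazetteer:
                places.append(candidate)
                used[start:start + n] = [True] * n
    return places


toy_gazetteer = {"San Severino Marche", "San Severino", "Marche", "Roma"}
found = find_places("Da Roma a San Severino Marche.", toy_gazetteer)
print(found)  # ['San Severino Marche', 'Roma']
```

The three-word name is found first, so neither "San Severino" nor "Marche" is reported separately.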

## Anonymise Places

In [186]:
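def anonymise_places(txt):
    # remove punctuation from a working copy; the original text is kept for the output
    temp_txt = re.sub(r"[^\w'\s]+", '', txt)
    n = 5  # longest n-gram (in words) to look up
    #places = []
    for i in range(n, 0, -1):
        tokens = ngrams(temp_txt.split(), i)
        for t in tokens:
            token = " ".join(t)
            # membership test, so the entry at index 0 of gaz is matched too
            if token in gaz:
                # mask the place in the original text
                txt = txt.replace(token, "X")
                # drop it from the working copy so its parts are not matched again
                temp_txt = temp_txt.replace(token, "")
                #places.append(token)
    return txt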

In [187]:
txt = 'Oggi sono andata a Roma e a Macerata, poi sono passata da San Severino Lucano, San Severino Marche e Francavilla in Sinni'
anonymise_places(txt)

['San Severino Lucano', 'San Severino Marche', 'Francavilla in Sinni', 'Roma', 'Macerata']


'Oggi sono andata a X e a X, poi sono passata da X, X e X'
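
One caveat of `str.replace` is that it substitutes every occurrence of the substring, so a match on "Roma" would also alter a longer word such as "Romania". A possible refinement, sketched here under the assumption that places should only be masked as whole words, uses a regular expression with lookarounds. `mask_place` is a hypothetical helper, not part of the notebook.

```python
import re


def mask_place(text, place, mask="X"):
    """Replace whole-word occurrences of `place` in `text` with `mask`."""
    # (?<!\w) and (?!\w) require that no word character touches the match.
    pattern = r"(?<!\w)" + re.escape(place) + r"(?!\w)"
    # A function replacement inserts `mask` literally, without escape processing.
    return re.sub(pattern, lambda _match: mask, text)


masked = mask_place("Da Roma alla Romania, poi ancora Roma.", "Roma")
print(masked)  # Da X alla Romania, poi ancora X.
```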

## Test the Anonymiser

In [108]:
import gradio as gr

iface = gr.Interface(
    anonymise_places,
    gr.inputs.Textbox(placeholder="Enter sentence here..."),
    gr.outputs.HTML(),
    examples=[
        ["Roma è la capitale d'Italia"],
        ["Dove vai? A Volterra."],
    ]
)

In [109]:
iface.launch()

Running locally at: http://127.0.0.1:7913/
To create a public link, set `share=True` in `launch()`.
Interface loading below...


(<Flask 'gradio.networking'>, 'http://127.0.0.1:7913/', None)

## Extend to the Whole World

Warning: download the file allCountries.zip from Geonames, extract it, and put the resulting allCountries.txt in the source directory.

In [197]:
df_all = pd.read_csv('source/allCountries.txt', sep='\t', header=None)
df_all[1].to_csv('source/places.csv')

KeyboardInterrupt: 
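
The full dump has several million rows, and the run recorded above ended in a `KeyboardInterrupt`. Since only the name column is needed, one option is to read just that column, in chunks, instead of loading all 19 columns at once. The sketch below is an assumption about how this could be done, not the notebook's method: `collect_place_names` is a hypothetical helper, and it is demonstrated on a tiny in-memory sample rather than on `allCountries.txt`.

```python
import csv
import io

import pandas as pd


def collect_place_names(source, chunksize=100_000):
    """Collect unique place names (column 1) from a Geonames dump, chunk by chunk."""
    names = set()
    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,
        usecols=[1],             # only the name column is parsed
        dtype=str,
        keep_default_na=False,   # keep names such as "NA" as plain strings
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        chunksize=chunksize,     # yields DataFrames of at most `chunksize` rows
    )
    for chunk in reader:
        names.update(chunk[1])
    names.discard("")            # ignore rows with an empty name
    return names


# Tiny stand-in for the real file: three rows, one duplicated name.
sample = io.StringIO("1\tRoma\tx\n2\tMilano\ty\n3\tRoma\tz\n")
place_names = collect_place_names(sample, chunksize=2)
print(sorted(place_names))  # ['Milano', 'Roma']
```

On the real file the call would take the path to the extracted dump in place of the `StringIO` object.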

In [137]:
df_all = pd.read_csv('source/places.csv')
df_all.head()

Unnamed: 0.1,Unnamed: 0,1
0,0,Pic de Font Blanca
1,1,Roc Mélé
2,2,Pic des Langounelles
3,3,Pic de les Abelletes
4,4,Estany de les Abelletes


In [174]:
df_all[df_all['1'] == 'Kuala Lumpur']

Unnamed: 0.1,Unnamed: 0,1
3567996,3567996,Kuala Lumpur
6483193,6483193,Kuala Lumpur
6485291,6485291,Kuala Lumpur
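
The same name can appear several times in the world file, as the three "Kuala Lumpur" rows show. For a gazetteer that only answers whether a string is a place, the repeats add nothing. A minimal sketch of removing them with pandas before building the list follows; the toy Series stands in for `df_all['1']`.

```python
import pandas as pd

# Toy Series standing in for df_all['1']; None mimics a missing name.
names = pd.Series(["Kuala Lumpur", "Roma", "Kuala Lumpur", None, "Kuala Lumpur"])

# Drop missing values first, then keep the first occurrence of each name.
unique_names = names.dropna().drop_duplicates()
print(unique_names.tolist())  # ['Kuala Lumpur', 'Roma']
```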


In [140]:
gaz_all = df_all['1']
gaz_all = gaz_all.tolist()

In [175]:
def anonymise_places_all(txt):
    # remove punctuation from a working copy; the original text is kept for the output
    temp_txt = re.sub(r"[^\w'\s]+", '', txt)
    n = 5  # longest n-gram (in words) to look up
    #places = []
    for i in range(n, 0, -1):
        tokens = ngrams(temp_txt.split(), i)
        for t in tokens:
            token = " ".join(t)
            # membership test, so the entry at index 0 of gaz_all is matched too
            if token in gaz_all:
                #places.append(token)
                # mask the place in the original text
                txt = txt.replace(token, "X")
    #print(places)
    return txt

In [179]:
txt = 'Oggi sono andata a Roma e a Macerata, poi sono passata da San Severino Lucano, San Severino Marche e Francavilla in Sinni'
anonymise_places_all(txt)

['San Severino Lucano', 'San Severino Marche', 'Francavilla in Sinni', 'San Severino', 'San Severino', 'Roma', 'Macerata', 'San', 'Severino', 'Lucano', 'San', 'Severino', 'Marche', 'Sinni']


'Oggi sono andata a X e a X, poi sono passata da X, X e X'

## Test the Anonymiser

In [178]:
import gradio as gr

iface = gr.Interface(
    anonymise_places_all,
    gr.inputs.Textbox(placeholder="Enter sentence here..."),
    gr.outputs.HTML(),
    examples=[
        ["Kuala Lumpur è una bella città"],
        ["New York si trova negli USA."],
    ]
)
iface.launch()

Running locally at: http://127.0.0.1:7919/
To create a public link, set `share=True` in `launch()`.
Interface loading below...


(<Flask 'gradio.networking'>, 'http://127.0.0.1:7919/', None)

['Kuala Lumpur', 'Kuala', 'Lumpur']
