## Extracting and Mapping Place Names
### This notebook demonstrates the power and the pitfalls of automated geoparsing.

In this notebook we:
- fetch a text from Project Gutenberg
- extract place names from that text with NLTK (geoparsing)
- geocode the place names with geopy and GeoNames (a type of georeferencing)
- map the place locations with the datascience maps module

Learning goals:
- Think about the differences in place name vs. coordinate representations of locations.
- Consider some of the things you can do with the place names once they are geocoded.

In [1]:
# Run but don't change!
from datascience import *
from datascience.predicates import are
import numpy as np
from scipy import stats
from scipy import misc

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

from itertools import groupby

from geopy.geocoders import GeoNames

In [5]:
# Create an NLTK location parsing function
# Source: http://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
def get_continuous_chunks(text):
    # Tokenize, tag parts of speech, then chunk named entities into a tree
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # place names found so far (duplicates kept for counting)
    current_chunk = []     # adjacent GPE chunks being merged into one name
    for i in chunked:
        # GPE = geo-political entity (cities, states, countries)
        if isinstance(i, Tree) and i.label() == 'GPE':
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-GPE token ends the current run of place chunks
            continuous_chunk.append(" ".join(current_chunk))
            current_chunk = []
    # Flush a place name that ends the text
    if current_chunk:
        continuous_chunk.append(" ".join(current_chunk))
    return continuous_chunk
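
The merging logic in the function above can be tried without downloading any NLTK models. The sketch below is a simplified stand-in, not the NLTK pipeline: it assumes the text has already been reduced to hand-labeled `(token, label)` pairs, and the name `merge_adjacent_places` is made up for this illustration.

```python
def merge_adjacent_places(labeled_tokens):
    """Join runs of adjacent GPE-labeled tokens into single place names."""
    places = []
    current = []
    for token, label in labeled_tokens:
        if label == 'GPE':
            current.append(token)
        elif current:
            # A non-place token closes the current run
            places.append(' '.join(current))
            current = []
    # Close a run that reaches the end of the input
    if current:
        places.append(' '.join(current))
    return places

# Hand-labeled sample (an assumption, not real NLTK output)
labeled = [('officers', None), ('in', None), ('New', 'GPE'), ('York', 'GPE'),
           ('and', None), ('Brooklyn', 'GPE')]
merged = merge_adjacent_places(labeled)
print(merged)
```

Note that the last place name has no token after it, so the final check after the loop is what keeps it from being lost.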

In [6]:
# Test the NLTK location parsing function
my_sent = "Washington -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Similar feelings were reported in Washington."

my_sent

'Washington -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Similar feelings were reported in Washington.'

In [7]:
my_locs = get_continuous_chunks(my_sent)

In [8]:
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(my_locs))]
print(loc_count)
(a,b) = zip(*loc_count) # this unzips the pairs into two tuples that can then be columns in tables
print(a)
print(b)

[('Brooklyn', 1), ('New York', 1), ('Washington', 2)]
('Brooklyn', 'New York', 'Washington')
(1, 1, 2)
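
An alternative way to get the same counts is `collections.Counter` from the standard library, which does not need the input to be sorted first. This is a minimal sketch; the contents and order of `my_locs` below are assumed for illustration rather than copied from a real run.

```python
from collections import Counter

# Assumed parser output for the test sentence
my_locs = ['Washington', 'New York', 'Brooklyn', 'Washington']

# Count each place, then sort alphabetically to match the groupby version
loc_count = sorted(Counter(my_locs).items())
places, counts = zip(*loc_count)

print(loc_count)
print(places)
print(counts)
```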


In [60]:
# Define a geocoder
# IMPORTANT: replace 'XXXXXX' with your own GeoNames username
gloc = GeoNames(username='XXXXXX', timeout=10)

In [10]:
# Test the geocoder
x = gloc.geocode('Berkeley')
x

Location(Berkeley, CA, US, (37.87159, -122.27275, 0.0))

In [11]:
# geocoder output
print(x.raw)
print(x.raw['name'])
print(x.latitude)

{'fclName': 'city, village,...', 'lng': '-122.27275', 'adminCode1': 'CA', 'toponymName': 'Berkeley', 'fcode': 'PPL', 'fcl': 'P', 'lat': '37.87159', 'population': 112580, 'name': 'Berkeley', 'countryName': 'United States', 'countryCode': 'US', 'geonameId': 5327684, 'countryId': '6252001', 'adminName1': 'California', 'fcodeName': 'populated place'}
Berkeley
37.87159
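
One detail worth noticing in the raw output above: `lat` and `lng` are quoted, so they are strings and need converting before any arithmetic or mapping. A small sketch of that conversion follows, using a hand-copied subset of the dictionary printed above; the helper name `to_point` is made up here.

```python
# A subset of the raw GeoNames record shown above
raw = {'name': 'Berkeley', 'lat': '37.87159', 'lng': '-122.27275',
       'fcode': 'PPL', 'population': 112580}

def to_point(raw_record):
    """Return (latitude, longitude) as floats from a raw record."""
    return (float(raw_record['lat']), float(raw_record['lng']))

point = to_point(raw)
print(point)
```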


In [100]:
# Read in a text to Geoparse

#huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
#huck_finn_text = read_url(huck_finn_url)
#text_url = huck_finn_url # NOT good for geocoding

#text_url = 'http://www.gutenberg.org/cache/epub/541/pg541.txt' # The Age of Innocence
#text_url = 'http://www.gutenberg.org/cache/epub/3029/pg3029.txt'
text_url = 'http://www.gutenberg.org/cache/epub/103/pg103.txt' # Around the World in 80 Days

the_text = read_url(text_url)
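
Project Gutenberg files usually wrap the book in license text, and any place names in that text would be counted along with the novel's. The sketch below trims everything outside the start and end markers. The exact marker wording is an assumption that varies between files, so inspect your download before relying on it; the sample string is synthetic and `strip_gutenberg_boilerplate` is a made-up helper name.

```python
import re

# Assumed marker wording; check the actual file
START = re.compile(r'\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*',
                   re.IGNORECASE | re.DOTALL)
END = re.compile(r'\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK',
                 re.IGNORECASE)

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the start and end markers, if present."""
    start = START.search(text)
    if start:
        text = text[start.end():]
    end = END.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()

# Synthetic sample, not the real file
sample = ('License header mentioning the United States. '
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE TITLE *** '
          'Body text about Saville Row. '
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE TITLE *** '
          'More license text.')
body = strip_gutenberg_boilerplate(sample)
print(body)
```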

In [101]:
# this can take a few minutes...
text_locs = get_continuous_chunks(the_text)

In [102]:
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(text_locs))]

In [103]:
loc_count

[('Aden', 5),
 ('Africa', 3),
 ('Ahmehnagara', 1),
 ('Allahabad', 9),
 ('America', 14),
 ('American', 27),
 ('Americans', 3),
 ('Anam', 1),
 ('Aouda', 33),
 ('Apiece', 1),
 ('Arabic', 1),
 ('Armenian', 1),
 ('Asia', 4),
 ('Asian', 1),
 ('Athens', 1),
 ('Auburn', 1),
 ('Banyans', 1),
 ('Barnum', 1),
 ('Behar', 2),
 ('Benares', 1),
 ('Bengal', 4),
 ('Biblical', 1),
 ('Birmingham', 1),
 ('Bombay', 54),
 ('Bordeaux', 4),
 ('Brahmins', 1),
 ('Brazil', 1),
 ('Brigham Young', 1),
 ('Brindisi', 9),
 ('British', 7),
 ('British India', 1),
 ('Broadway', 2),
 ('Buddhism', 1),
 ('Buddhist', 1),
 ('Bundelcund', 7),
 ('Burdwan', 1),
 ('Burhampoor', 1),
 ('Buxar', 1),
 ('Byronic', 1),
 ('Calais', 2),
 ('Calcutta', 29),
 ('California', 3),
 ('Californian', 1),
 ('Cambray', 1),
 ('Cancer', 1),
 ('Cani', 1),
 ('Captain', 1),
 ('Captain Speedy', 1),
 ('Cardiff', 1),
 ('Carnatic', 1),
 ('Certain', 1),
 ('Ceylon', 1),
 ('Ceylonese', 1),
 ('Chancery', 1),
 ('Chandernagor', 2),
 ('Cheshire', 1),
 ('Chicago',

In [109]:
# Create an empty table
loc_table = Table(['place','count'])

In [110]:
# Put the location data into the table
(loc_table['place'], loc_table['count']) = zip(*loc_count)

In [111]:
loc_table

place,count
Aden,5
Africa,3
Ahmehnagara,1
Allahabad,9
America,14
American,27
Americans,3
Anam,1
Aouda,33
Apiece,1


In [112]:
# Optional and arbitrary - remove minor mentions
top_locs = loc_table.where(loc_table['count'] > 5)
top_locs.sort('count', descending=True).show()

place,count
London,62
Hong Kong,57
English,56
Bombay,54
Passepartout,51
India,34
Aouda,33
Indian,31
England,30
Calcutta,29


In [113]:
# Examine some of the non-place "locations" and how the geocoder disambiguates them
x = gloc.geocode('Portuguese')

In [114]:
x.raw

{'adminCode1': '00',
 'adminName1': '',
 'countryCode': 'MZ',
 'countryId': '1036973',
 'countryName': 'Mozambique',
 'fcl': 'A',
 'fclName': 'country, state, region,...',
 'fcode': 'PCLI',
 'fcodeName': 'independent political entity',
 'geonameId': 1036973,
 'lat': '-18.25',
 'lng': '35',
 'name': 'Mozambique',
 'population': 22061451,
 'toponymName': 'Republic of Mozambique'}

In [115]:
if x.raw['fcode'] == 'PPLC':  # PPLC = capital of a political entity
    print('yes')
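
The 'Portuguese' lookup came back as Mozambique, which suggests one cheap sanity check: compare the name that was asked for with the names that came back. The sketch below is one possible heuristic, not a recommended rule. It works on hand-copied subsets of the raw records shown in this notebook, and `name_matches` is a made-up helper.

```python
def name_matches(query, raw_record):
    """True when the query equals the returned name or toponym, ignoring case."""
    wanted = query.casefold()
    returned = (raw_record.get('name', ''), raw_record.get('toponymName', ''))
    return any(wanted == name.casefold() for name in returned)

# Subsets of raw records printed earlier in this notebook
portuguese_raw = {'name': 'Mozambique', 'toponymName': 'Republic of Mozambique',
                  'fcode': 'PCLI'}
berkeley_raw = {'name': 'Berkeley', 'toponymName': 'Berkeley', 'fcode': 'PPL'}

print(name_matches('Portuguese', portuguese_raw))
print(name_matches('Berkeley', berkeley_raw))
```

An exact match is strict: a historical spelling that the gazetteer resolves to a modern name would also be flagged, so a check like this is better used to mark rows for review than to delete them.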

In [116]:
# A half-hearted attempt to clean up locations: an example of how one might move forward.
# Regular expressions could generalize this to names ending in "an", "ans", "ish" or "ese".
# For now, remove the specific demonyms and language names that NLTK tagged as places.
not_places = ['French', 'American', 'British', 'English', 'Chinese', 'Portuguese',
              'Indian', 'Indians', 'Japanese', 'Mexican', 'Servian']
top_locs = top_locs.where(~np.isin(top_locs['place'], not_places))
top_locs.show()

place,count
Allahabad,9
America,14
Aouda,33
Bombay,54
Brindisi,9
Bundelcund,7
Calcutta,29
Chicago,8
China,13
England,30


In [117]:
# A function to geocode the places one by one and
# return output that will load nicely into our table
def getGeocodeInfo(place):
    print('geocoding...', place)
    x = gloc.geocode(place)
    if x is not None:
        mystuff = [float(x.raw['lng']), float(x.raw['lat']), x.raw['fcl'], x.raw['fclName']]
    else:
        # what to return when a place can't be geocoded
        mystuff = [0, 0, "none", "none"]
    return mystuff

In [118]:
# testing function
getGeocodeInfo('Berkeley')

geocoding... Berkeley


[-122.27275, 37.87159, 'P', 'city, village,...']

In [119]:
# test data that won't geocode
t = gloc.geocode('Bundelcund')
print(t)
#print(t.raw)

getGeocodeInfo('Bundelcund')


None
geocoding... Bundelcund


[0, 0, 'none', 'none']
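
Using 0, 0 as the "not found" value has a catch, since 0, 0 is itself a valid coordinate. A possible variation, sketched below, returns NaN for failed lookups so they cannot be confused with a real point. To keep the example runnable offline, the geocoder is passed in as a function and replaced by a stub; `geocode_info` and `fake_geocode` are hypothetical names, and the stub only imitates the `.raw` attribute seen earlier.

```python
import math
from types import SimpleNamespace

def geocode_info(place, geocode):
    """Return [lng, lat, fcl, fclName], with NaN coordinates when nothing is found."""
    hit = geocode(place)
    if hit is None:
        return [math.nan, math.nan, None, None]
    raw = hit.raw
    return [float(raw['lng']), float(raw['lat']), raw['fcl'], raw['fclName']]

def fake_geocode(place):
    """Offline stand-in for gloc.geocode that knows a single place."""
    known = {'Berkeley': {'lng': '-122.27275', 'lat': '37.87159',
                          'fcl': 'P', 'fclName': 'city, village,...'}}
    raw = known.get(place)
    return SimpleNamespace(raw=raw) if raw else None

rows = [geocode_info(p, fake_geocode) for p in ['Berkeley', 'Bundelcund']]
# Keep only the rows that actually have coordinates
located = [row for row in rows if not math.isnan(row[0])]
print(located)
```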

In [120]:
## THIS DIDN'T WORK! But it would be nice to get it to
#(loc_table['place'], loc_table['count']) = zip(*loc_count)
#(top_locs['lng'], top_locs['lat'], top_locs['fpl'], top_locs['fcl_name']) = top_locs.apply(lambda x: zip(*getGeocodeInfo(x)), ['place'])

In [121]:
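# GEOCODE ALL PLACES IN THE TOP_LOCS TABLE
def getAll():
    # Build a fresh list on every call so re-running the cell does not duplicate rows
    results = []
    for i in top_locs['place']:
        results.append(getGeocodeInfo(i))
    return results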
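
Because every lookup is a web request, it can help to remember answers so that re-running a cell does not ask the service for the same place again. The sketch below shows one way to wrap any geocoding function in a dictionary cache. It uses a stub in place of the real geocoder, and both `make_cached_geocoder` and `fake_geocode` are made-up names.

```python
def make_cached_geocoder(geocode):
    """Wrap a geocoding function so each distinct place is looked up only once."""
    cache = {}
    def cached(place):
        if place not in cache:
            cache[place] = geocode(place)
        return cache[place]
    return cached

calls = []  # records which places reached the (fake) service

def fake_geocode(place):
    """Offline stand-in for a real geocoder; returns a dummy value."""
    calls.append(place)
    return place.upper()

lookup = make_cached_geocoder(fake_geocode)
lookup('London')
lookup('London')   # answered from the cache
lookup('Paris')
print(calls)
```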

In [122]:
x = getAll()
x

geocoding... Allahabad
geocoding... America
geocoding... Aouda
geocoding... Bombay
geocoding... Brindisi
geocoding... Bundelcund
geocoding... Calcutta
geocoding... Chicago
geocoding... China
geocoding... England
geocoding... Europe
geocoding... European
geocoding... Fix
geocoding... Frenchman
geocoding... Hong Kong
geocoding... India
geocoding... Japan
geocoding... London
geocoding... Mongolia
geocoding... New York
geocoding... Omaha
geocoding... Paris
geocoding... Parsee
geocoding... Passepartout
geocoding... Pillaji
geocoding... San
geocoding... San Francisco
geocoding... Saville
geocoding... Shanghai
geocoding... Singapore
geocoding... Suez
geocoding... United States
geocoding... Yokohama


[[81.84322, 25.44478, 'P', 'city, village,...'],
 [-105.64453, 54.77535, 'L', 'parks,area, ...'],
 [0.82273, 35.18509, 'P', 'city, village,...'],
 [72.88261, 19.07283, 'P', 'city, village,...'],
 [17.93607, 40.63215, 'P', 'city, village,...'],
 [0, 0, 'none', 'none'],
 [88.36304, 22.56263, 'P', 'city, village,...'],
 [-87.65005, 41.85003, 'P', 'city, village,...'],
 [105.0, 35.0, 'A', 'country, state, region,...'],
 [-70.3064, 44.2056, 'A', 'country, state, region,...'],
 [28.38867, 51.72703, 'L', 'parks,area, ...'],
 [15.97753, 45.81313, 'P', 'city, village,...'],
 [3.66832, 45.14264, 'P', 'city, village,...'],
 [174.5275, -35.86083, 'T', 'mountain,hill,rock,... '],
 [114.15769, 22.28552, 'P', 'city, village,...'],
 [79.0, 22.0, 'A', 'country, state, region,...'],
 [139.75309, 35.68536, 'A', 'country, state, region,...'],
 [-0.12574, 51.50853, 'P', 'city, village,...'],
 [105.0, 46.0, 'A', 'country, state, region,...'],
 [-74.00597, 40.71427, 'P', 'city, village,...'],
 [-95.93779, 41

In [123]:
#MAKE SURE THE NUMBER OF GEOCODED LOCS IS SAME AS NUMBER OF ROWS IN OUR TABLE
top_locs.num_rows == len(x)

True

In [124]:
# Add geocoded location data to the table
(top_locs['longitude'], top_locs['latitude'], top_locs['fpl'], top_locs['fcl_name']) = zip(*x)
top_locs.show()

place,count,longitude,latitude,fpl,fcl_name
Allahabad,9,81.8432,25.4448,P,"city, village,..."
America,14,-105.645,54.7754,L,"parks,area, ..."
Aouda,33,0.82273,35.1851,P,"city, village,..."
Bombay,54,72.8826,19.0728,P,"city, village,..."
Brindisi,9,17.9361,40.6322,P,"city, village,..."
Bundelcund,7,0.0,0.0,none,none
Calcutta,29,88.363,22.5626,P,"city, village,..."
Chicago,8,-87.65,41.85,P,"city, village,..."
China,13,105.0,35.0,A,"country, state, region,..."
England,30,-70.3064,44.2056,A,"country, state, region,..."


In [125]:
# Set the color and radius for each point we will map
top_locs['radius'] = 1000 * top_locs['count']
top_locs['color'] = 'red'
top_locs

place,count,longitude,latitude,fpl,fcl_name,radius,color
Allahabad,9,81.8432,25.4448,P,"city, village,...",9000,red
America,14,-105.645,54.7754,L,"parks,area, ...",14000,red
Aouda,33,0.82273,35.1851,P,"city, village,...",33000,red
Bombay,54,72.8826,19.0728,P,"city, village,...",54000,red
Brindisi,9,17.9361,40.6322,P,"city, village,...",9000,red
Bundelcund,7,0.0,0.0,none,none,7000,red
Calcutta,29,88.363,22.5626,P,"city, village,...",29000,red
Chicago,8,-87.65,41.85,P,"city, village,...",8000,red
China,13,105.0,35.0,A,"country, state, region,...",13000,red
England,30,-70.3064,44.2056,A,"country, state, region,...",30000,red


In [126]:
# Create descriptive text for popup
top_locs['description'] = top_locs.apply(lambda x,y: "%s, %s mentions"% (x, str(y)), ['place', 'count'])


In [127]:
# Drop places that failed to geocode (stored as 0,0; this assumes 0,0 is never a real location),
# then select only the columns that will be used to map the points
top_locs = top_locs.where((top_locs['latitude'] != 0) & (top_locs['longitude'] != 0))
locmap = top_locs.select(['latitude', 'longitude','description','color','radius'])
locmap

latitude,longitude,description,color,radius
25.4448,81.8432,"Allahabad, 9 mentions",red,9000
54.7754,-105.645,"America, 14 mentions",red,14000
35.1851,0.82273,"Aouda, 33 mentions",red,33000
19.0728,72.8826,"Bombay, 54 mentions",red,54000
40.6322,17.9361,"Brindisi, 9 mentions",red,9000
22.5626,88.363,"Calcutta, 29 mentions",red,29000
41.85,-87.65,"Chicago, 8 mentions",red,8000
35.0,105.0,"China, 13 mentions",red,13000
44.2056,-70.3064,"England, 30 mentions",red,30000
51.727,28.3887,"Europe, 13 mentions",red,13000


In [128]:
# Create the map, using larger circles (10000 x mentions instead of the earlier 1000 x)
locmap['radius'] =  10000 * top_locs['count']
Circle.map_table(locmap)
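
The radius above grows linearly with the mention count, which means circle area grows with the square of the count and frequent places can look far more dominant than they are. One common alternative, sketched here as an option rather than a correction, is to scale the radius by the square root of the count so that area is proportional to mentions. The `base` value is an arbitrary assumption.

```python
import math

def scaled_radius(count, base=10000):
    """Radius whose circle area is proportional to the mention count."""
    return base * math.sqrt(count)

for count in (1, 4, 9):
    print(count, scaled_radius(count))
```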

In [134]:
top_locs.where('place','San Francisco')

place,count,longitude,latitude,fpl,fcl_name,radius,color,description
San Francisco,6,-84.1293,9.99299,P,"city, village,...",6000,red,"San Francisco, 6 mentions"


In [132]:
gloc.geocode('San Francisco')

Location(San Francisco, 04, CR, (9.99299, -84.12934, 0.0))

In [133]:
gloc.geocode('San Francisco, CA')

Location(San Francisco, CA, US, (37.77493, -122.41942, 0.0))
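
The two lookups above show that the bare name resolves to Costa Rica while the qualified name resolves to California. If several candidates are available for a name, one simple way to choose is to prefer countries that the text is known to visit. The sketch below assumes the candidates have already been collected as dictionaries; the two records are built from the outputs above, and `pick_candidate` is a hypothetical helper. Some geopy versions may also accept a country hint directly in `geocode`, so check the documentation for the version you have installed.

```python
# Candidate records assembled from the two lookups above
candidates = [
    {'name': 'San Francisco', 'countryCode': 'CR', 'lat': '9.99299', 'lng': '-84.12934'},
    {'name': 'San Francisco', 'countryCode': 'US', 'lat': '37.77493', 'lng': '-122.41942'},
]

def pick_candidate(candidates, preferred_countries):
    """Return the first candidate from the most preferred country, else the first one."""
    for code in preferred_countries:
        for candidate in candidates:
            if candidate['countryCode'] == code:
                return candidate
    return candidates[0] if candidates else None

best = pick_candidate(candidates, ['US', 'GB'])
print(best['countryCode'], best['lat'], best['lng'])
```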

## Questions:

- Name some of the characteristics of place names that geocoded well and of those that geocoded poorly.
- Name some of the reasons why place name geoparsing (named entity recognition, or NER, for locations) is difficult.
- Similarly, why is place name "data cleaning" difficult?
- What are some of the benefits of automated geoparsing?
- Which types of texts would geocode better than others? Which would do worse?
- What are the alternatives to automated geoparsing?
- Discuss the difference between georeferencing that big city across the bay as 'San Francisco' vs. as (37.77493, -122.41942).
- What can you do with the results of automated geoparsing? How might they be used?
- What can you do with the results of automated geoparsing? How might they be used?
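
As one possible starting point for the last two questions: coordinates support arithmetic that names do not. The sketch below estimates the great-circle distance between the Berkeley and San Francisco points returned earlier in this notebook, using the haversine formula. The Earth radius is a rounded mean value, so treat the result as approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # rounded mean Earth radius (an approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Coordinates as returned by the geocoder earlier in this notebook
berkeley = (37.87159, -122.27275)
san_francisco = (37.77493, -122.41942)

distance = haversine_km(*berkeley, *san_francisco)
print(round(distance, 1), 'km')
```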
