## Extracting and Mapping Place Names
### This notebook demonstrates the power of, and the problems with, automated geoparsing.

In this notebook we:
- fetch a text from Project Gutenberg
- extract place names from that text with NLTK (geoparsing)
- geocode the place names with geopy and its GoogleV3 geocoder (a type of georeferencing)
- map the place locations with folium

Learning goals:
- Think about the differences in place name vs. coordinate representations of locations.
- Consider some of the things you can do with the place names once they are geocoded.

In [52]:
# Run but don't change!
from datascience import *
from datascience.predicates import are
import numpy as np
from scipy import stats
from scipy import misc

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

from itertools import groupby

from geopy.geocoders import GoogleV3

import folium

In [53]:
# Create an NLTK location parsing function
# Source: http://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
def get_continuous_chunks(text):
    # Tokenize, POS-tag, and chunk the text into a tree of named entities
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # all place names found, duplicates included
    current_chunk = []     # adjacent GPE chunks that together form one name
    for i in chunked:
        # Keep only subtrees labeled GPE (geo-political entity)
        if isinstance(i, Tree) and i.label() == 'GPE':
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # Any non-GPE item ends the current run of GPE chunks
            continuous_chunk.append(" ".join(current_chunk))
            current_chunk = []
    # Flush a place name that falls at the very end of the text
    if current_chunk:
        continuous_chunk.append(" ".join(current_chunk))
    return continuous_chunk

In [54]:
# Test the NLTK location parsing function
my_sent = "Washington -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Similar feelings were reported in Washington."

my_sent

'Washington -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Similar feelings were reported in Washington.'

In [55]:
my_locs = get_continuous_chunks(my_sent)

In [56]:
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(my_locs))]
print(loc_count)
(a,b) = zip(*loc_count) # this returns two tuples that can then be columns in tables
print(a)
print(b)

[('Brooklyn', 1), ('New York', 1), ('Washington', 2)]
('Brooklyn', 'New York', 'Washington')
(1, 1, 2)
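
The same tally can be produced without sorting and grouping. The following is a minimal sketch using `collections.Counter` from the standard library; the `my_locs` list here is a stand-in with the same counts as the output above, not the actual return value of `get_continuous_chunks`.

```python
from collections import Counter

# Stand-in list with the same counts as the output above (order is assumed)
my_locs = ['Washington', 'New York', 'Brooklyn', 'Washington']

loc_counter = Counter(my_locs)            # maps each place name to its count
loc_count = sorted(loc_counter.items())   # same shape as the groupby result
print(loc_count)
print(loc_counter.most_common(1))         # the most frequently mentioned place
```

`Counter` does not require the input to be sorted first, which `groupby` does.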


In [57]:
# Define a geocoder 
# IMPORTANT: if you pass credentials (for example api_key=...), use your own, not mine!
gloc = GoogleV3()

In [58]:
# Test the geocoder
x = gloc.geocode('Berkeley')
x #basic geocoder output

Location(Berkeley, CA, USA, (37.8715926, -122.272747, 0.0))

In [59]:
# Full geocoder output
print(x.raw)

{'formatted_address': 'Berkeley, CA, USA', 'address_components': [{'long_name': 'Berkeley', 'types': ['locality', 'political'], 'short_name': 'Berkeley'}, {'long_name': 'Alameda County', 'types': ['administrative_area_level_2', 'political'], 'short_name': 'Alameda County'}, {'long_name': 'California', 'types': ['administrative_area_level_1', 'political'], 'short_name': 'CA'}, {'long_name': 'United States', 'types': ['country', 'political'], 'short_name': 'US'}], 'types': ['locality', 'political'], 'place_id': 'ChIJ00mFOjZ5hYARk-l1ppUV6pQ', 'geometry': {'location': {'lng': -122.272747, 'lat': 37.8715926}, 'viewport': {'southwest': {'lng': -122.3270651, 'lat': 37.8462261}, 'northeast': {'lng': -122.234179, 'lat': 37.9056681}}, 'bounds': {'southwest': {'lng': -122.3270651, 'lat': 37.8462261}, 'northeast': {'lng': -122.234179, 'lat': 37.9056681}}, 'location_type': 'APPROXIMATE'}}


In [60]:
# Read in a text to Geoparse

#huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
#huck_finn_text = read_url(huck_finn_url)
#text_url = huck_finn_url # NOT good for geocoding

#text_url = 'http://www.gutenberg.org/cache/epub/541/pg541.txt' # age of innocence
#text_url = 'http://www.gutenberg.org/cache/epub/3029/pg3029.txt'
text_url = 'http://www.gutenberg.org/cache/epub/103/pg103.txt' #around the world in 80 days

the_text = read_url(text_url)
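
Project Gutenberg files wrap the book in license text, which itself mentions places and can inflate the counts. The sketch below shows one way to keep only the body. It assumes the file contains `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; the exact wording varies between files, so the patterns and the helper name `strip_gutenberg_boilerplate` are assumptions for illustration.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if both are found."""
    start = re.search(
        r'\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*', text)
    end = re.search(
        r'\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK', text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text  # markers not found: leave the text unchanged

# Made-up sample in the single-line form that read_url produces
sample = ('License text. *** START OF THIS PROJECT GUTENBERG EBOOK '
          'AROUND THE WORLD IN 80 DAYS *** Mr. Phileas Fogg lived in '
          'Saville Row. *** END OF THIS PROJECT GUTENBERG EBOOK AROUND '
          'THE WORLD IN 80 DAYS *** More license text.')
body = strip_gutenberg_boilerplate(sample)
print(body)
```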

In [61]:
# this can take a few minutes...
text_locs = get_continuous_chunks(the_text)

In [62]:
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(text_locs))]

In [63]:
loc_count

[('Aden', 5),
 ('Africa', 3),
 ('Ahmehnagara', 1),
 ('Allahabad', 9),
 ('America', 14),
 ('American', 27),
 ('Americans', 3),
 ('Anam', 1),
 ('Aouda', 33),
 ('Apiece', 1),
 ('Armenian', 1),
 ('Asia', 4),
 ('Asian', 1),
 ('Athens', 1),
 ('Auburn', 1),
 ('Banyans', 1),
 ('Barnum', 1),
 ('Behar', 2),
 ('Benares', 1),
 ('Bengal', 4),
 ('Biblical', 1),
 ('Birmingham', 1),
 ('Bombay', 52),
 ('Bordeaux', 5),
 ('Brahmins', 1),
 ('Brazil', 1),
 ('Brigham Young', 1),
 ('Brindisi', 9),
 ('British', 7),
 ('British India', 1),
 ('Broadway', 2),
 ('Buddhism', 1),
 ('Buddhist', 1),
 ('Bundelcund', 7),
 ('Burdwan', 1),
 ('Burhampoor', 1),
 ('Buxar', 1),
 ('Byronic', 1),
 ('Calais', 2),
 ('Calcutta', 29),
 ('California', 3),
 ('Californian', 1),
 ('Cambray', 1),
 ('Cancer', 1),
 ('Cani', 1),
 ('Captain', 1),
 ('Captain Speedy', 1),
 ('Cardiff', 1),
 ('Carnatic', 1),
 ('Certain', 1),
 ('Ceylon', 1),
 ('Ceylonese', 1),
 ('Chancery', 1),
 ('Chandernagor', 2),
 ('Cheshire', 1),
 ('Chicago', 8),
 ('Chili', 

In [64]:
# Create an empty table
loc_table = Table(['place','count'])

In [65]:
# Put the location data into the table
(loc_table['place'], loc_table['count']) = zip(*loc_count)

In [66]:
loc_table

place,count
Aden,5
Africa,3
Ahmehnagara,1
Allahabad,9
America,14
American,27
Americans,3
Anam,1
Aouda,33
Apiece,1


In [67]:
# Optional and arbitrary - remove minor mentions
top_locs = loc_table.where(loc_table['count'] > 5)
top_locs.sort('count', descending=True).show()

place,count
London,62
English,60
Hong Kong,57
Passepartout,55
Bombay,52
India,34
Aouda,33
England,30
Calcutta,29
New York,28


In [89]:
# Examining some of the non-loc locs and how the geocoder will disambiguate
x = gloc.geocode('Portuguese')
#x = gloc.geocode('European')
x

Location(Portuguesa, Rio de Janeiro - State of Rio de Janeiro, Brazil, (-22.7964731, -43.2077641, 0.0))

In [99]:
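# Types of the first address component, joined into one string
g = x.raw['address_components'][0]['types']
g1 = ','.join(map(str, g))
# Note: find() matches substrings, so 'locality' is found inside 'sublocality'
g1.find('locality')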

13
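
The result 13 is worth a second look: the 'Portuguese' match is a sublocality, not a locality, yet `find` reports a hit. A short sketch of the difference between substring search and exact membership, using the types printed for this result:

```python
# Types copied from the 'Portuguese' geocoder result
types = ['political', 'sublocality', 'sublocality_level_1']
joined = ','.join(types)

# Substring search: 'locality' is found inside 'sublocality'
print(joined.find('locality'))

# Exact membership: no element of the list is exactly 'locality'
print('locality' in types)
```

Testing membership in the list (or in `joined.split(',')`) avoids treating sublocalities as localities.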

In [90]:
print(x.raw)
print(x.raw['formatted_address'])
print(str(x.raw['address_components'][0]['types']))
print(x.raw['address_components'][0]['types'][0])
print(x.raw['geometry']['location']['lat'])
print(x.raw['geometry']['location']['lng'])


{'formatted_address': 'Portuguesa, Rio de Janeiro - State of Rio de Janeiro, Brazil', 'address_components': [{'long_name': 'Portuguesa', 'types': ['political', 'sublocality', 'sublocality_level_1'], 'short_name': 'Portuguesa'}, {'long_name': 'Rio de Janeiro', 'types': ['locality', 'political'], 'short_name': 'Rio de Janeiro'}, {'long_name': 'Rio de Janeiro', 'types': ['administrative_area_level_2', 'political'], 'short_name': 'Rio de Janeiro'}, {'long_name': 'State of Rio de Janeiro', 'types': ['administrative_area_level_1', 'political'], 'short_name': 'RJ'}, {'long_name': 'Brazil', 'types': ['country', 'political'], 'short_name': 'BR'}], 'types': ['political', 'sublocality', 'sublocality_level_1'], 'place_id': 'ChIJjyWZHId3mQARg8SXKHGphUo', 'geometry': {'location': {'lng': -43.2077641, 'lat': -22.7964731}, 'viewport': {'southwest': {'lng': -43.2127538, 'lat': -22.8059195}, 'northeast': {'lng': -43.2004043, 'lat': -22.7919554}}, 'bounds': {'southwest': {'lng': -43.2127538, 'lat': -22.8

In [70]:
# Half-hearted attempt to clean up locations - an example of how one might move forward.
# Could use regular expressions to clean this.
# Remove a hand-picked list of demonyms: words ending in "an", "ans", "ish", or "ese", plus "French"
top_locs = top_locs.where((top_locs['place'] !=('French')) & (top_locs['place'] !=('American')))
top_locs = top_locs.where((top_locs['place'] !=('British')) & (top_locs['place'] !=('English')))
top_locs = top_locs.where((top_locs['place'] !=('Chinese')) & (top_locs['place'] !=('Portuguese')))
top_locs = top_locs.where((top_locs['place'] !=('Indian')) & (top_locs['place'] !=('Indians')))
top_locs = top_locs.where((top_locs['place'] !=('Japanese')) & (top_locs['place'] !=('Mexican')))
top_locs = top_locs.where((top_locs['place'] !=('Servian'))) 
top_locs.show()

place,count
Allahabad,9
America,14
Aouda,33
Bombay,52
Brindisi,9
Bundelcund,7
Calcutta,29
Chicago,8
China,13
England,30
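
The regular-expression cleanup mentioned in the cell above could look roughly like this. It is a sketch, not a complete solution: the pattern and the `KEEP` allow-list are assumptions, and the allow-list exists because a naive suffix rule also throws out real places such as Japan.

```python
import re

# Assumed pattern: the demonym-like endings named above, plus the word "French"
DEMONYM = re.compile(r'(?:ans?|ish|ese)$|^French$')

# Hypothetical allow-list of real places that happen to match the pattern
KEEP = {'Japan'}

def looks_like_place(name):
    """Return False for names that look like demonyms rather than places."""
    return name in KEEP or not DEMONYM.search(name)

names = ['American', 'Americans', 'British', 'Chinese',
         'French', 'Japan', 'London', 'Bombay']
kept = [n for n in names if looks_like_place(n)]
print(kept)
```

A rule like this still needs manual review: it has no way to know that Aouda and Passepartout are characters rather than places.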


In [101]:
# A function to geocode the places one by one and 
# return output that will load nicely into our table
def getGeocodeInfo(place):
    print('geocoding...', place)
    x = gloc.geocode(place)
    if x is not None:
        mylat = float(x.raw['geometry']['location']['lat'])
        mylng = float(x.raw['geometry']['location']['lng'])
        # Types of the first address component, as a comma-separated string
        myloctype = ','.join(map(str, x.raw['address_components'][0]['types']))
        mylocname = x.raw['formatted_address']
        mystuff = [mylng, mylat, myloctype, mylocname]
    else:
        # What to return when a place can't be geocoded
        mystuff = [0, 0, "none", "none"]
    return mystuff

In [102]:
# testing function
getGeocodeInfo('Berkeley')

geocoding... Berkeley


[-122.272747, 37.8715926, 'locality,political', 'Berkeley, CA, USA']

In [104]:
# Test data that is not a place name; note that the geocoder still returns a match
t = gloc.geocode('European')
print(t)
print(t.raw)

getGeocodeInfo('European')


Yevropeis'ka Square, Kyiv, Ukraine
{'formatted_address': "Yevropeis'ka Square, Kyiv, Ukraine", 'address_components': [{'long_name': "Yevropeis'ka Square", 'types': ['route'], 'short_name': "Yevropeis'ka Square"}, {'long_name': 'Kyiv', 'types': ['locality', 'political'], 'short_name': 'Kyiv'}, {'long_name': 'Kyiv City', 'types': ['administrative_area_level_2', 'political'], 'short_name': 'Kyiv City'}, {'long_name': 'Kyiv city', 'types': ['administrative_area_level_1', 'political'], 'short_name': 'Kyiv city'}, {'long_name': 'Ukraine', 'types': ['country', 'political'], 'short_name': 'UA'}], 'types': ['route'], 'place_id': 'ChIJwwXZJ07O1EAR6ii2949BS34', 'geometry': {'location': {'lng': 30.527185, 'lat': 50.452217}, 'viewport': {'southwest': {'lng': 30.5261012197085, 'lat': 50.4508852697085}, 'northeast': {'lng': 30.5287991802915, 'lat': 50.4535832302915}}, 'bounds': {'southwest': {'lng': 30.5266489, 'lat': 50.451461}, 'northeast': {'lng': 30.5282515, 'lat': 50.4530075}}, 'location_type': 

[30.527185, 50.452217, 'route', "Yevropeis'ka Square, Kyiv, Ukraine"]
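
Because the geocoder returns a match even for non-places, one possible safeguard is to keep only results whose type looks like a settlement or region. The sketch below works on the four-item lists that `getGeocodeInfo` returns; the `ACCEPTED_TYPES` set and the function name are assumptions chosen for illustration, and the sample rows are copied from outputs in this notebook.

```python
# Assumed set of result types to treat as plausible places
ACCEPTED_TYPES = {'locality', 'country', 'continent',
                  'administrative_area_level_1'}

def is_plausible_place(geocode_info):
    """Check the comma-separated type string (item 2) against ACCEPTED_TYPES."""
    loctype = geocode_info[2]
    return any(t in ACCEPTED_TYPES for t in loctype.split(','))

results = [
    [-122.272747, 37.8715926, 'locality,political', 'Berkeley, CA, USA'],
    [30.527185, 50.452217, 'route', "Yevropeis'ka Square, Kyiv, Ukraine"],
    [1.0424995, 8.7219638, 'establishment,point_of_interest', 'Aouda, Togo'],
]
plausible = [r[3] for r in results if is_plausible_place(r)]
print(plausible)
```

This removes the street in Kyiv and the establishment in Togo, but it would also drop legitimate places that the geocoder labels differently, so the accepted set needs tuning per text.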

In [74]:
## THIS DIDN'T WORK! But it would be nice to get it to
#(loc_table['place'], loc_table['count']) = zip(*loc_count)
#(top_locs['lng'], top_locs['lat'], top_locs['fpl'], top_locs['fcl_name']) = top_locs.apply(lambda x: zip(*getGeocodeInfo(x)), ['place'])
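
One likely reason the commented-out attempt fails is that `zip(*...)` is applied to a single row inside the lambda, where it tries to iterate over floats. Transposing works when it is applied once to the whole list of rows. A minimal sketch, using a hypothetical `fake_geocode_info` lookup in place of the real geocoder so it runs offline:

```python
# Hypothetical stand-in for getGeocodeInfo, with values from outputs in this notebook
FAKE_RESULTS = {
    'Berkeley': [-122.272747, 37.8715926, 'locality,political', 'Berkeley, CA, USA'],
    'Chicago': [-87.6297982, 41.8781136, 'locality,political', 'Chicago, IL, USA'],
}

def fake_geocode_info(place):
    return FAKE_RESULTS.get(place, [0, 0, 'none', 'none'])

places = ['Berkeley', 'Chicago', 'Nowhere']
rows = [fake_geocode_info(p) for p in places]   # one 4-item list per place
lng, lat, loctype, locname = zip(*rows)          # transpose rows into 4 columns
print(lng)
print(locname)
```

Each of the four resulting tuples has one entry per place, so each can be assigned to a table column.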

In [105]:
# GEOCODE ALL PLACES IN THE TOP_LOCS TABLE
def getAll():
    # Build the list locally so that re-running the cell does not duplicate results
    results = []
    for i in top_locs['place']:
        results.append(getGeocodeInfo(i))
    return results

In [106]:
x = getAll()
x

geocoding... Allahabad
geocoding... America
geocoding... Aouda
geocoding... Bombay
geocoding... Brindisi
geocoding... Bundelcund
geocoding... Calcutta
geocoding... Chicago
geocoding... China
geocoding... England
geocoding... Europe
geocoding... European
geocoding... Fix
geocoding... Frenchman
geocoding... Hong Kong
geocoding... India
geocoding... Japan
geocoding... London
geocoding... Mongolia
geocoding... Mormon
geocoding... New York
geocoding... Omaha
geocoding... Paris
geocoding... Parsee
geocoding... Passepartout
geocoding... Pillaji
geocoding... San
geocoding... San Francisco
geocoding... Saville
geocoding... Shanghai
geocoding... Singapore
geocoding... Suez
geocoding... United States
geocoding... Yokohama


[[81.846311,
  25.4358011,
  'locality,political',
  'Allahabad, Uttar Pradesh 211003, India'],
 [-95.712891, 37.09024, 'country,political', 'United States'],
 [1.0424995, 8.7219638, 'establishment,point_of_interest', 'Aouda, Togo'],
 [72.8776559, 19.0759837, 'locality,political', 'Mumbai, Maharashtra, India'],
 [17.9417616,
  40.6327278,
  'locality,political',
  '72100 Brindisi, Province of Brindisi, Italy'],
 [85.5441967,
  24.8752011,
  'political,sublocality,sublocality_level_1',
  'Bundelkhand, Nawada, Bihar, India'],
 [88.36389500000001,
  22.572646,
  'locality,political',
  'Kolkata, West Bengal 700001, India'],
 [-87.6297982, 41.8781136, 'locality,political', 'Chicago, IL, USA'],
 [104.195397, 35.86166, 'country,political', 'China'],
 [-1.1743197,
  52.3555177,
  'administrative_area_level_1,political',
  'England, UK'],
 [15.2551187, 54.5259614, 'continent,establishment,natural_feature', 'Europe'],
 [30.527185, 50.452217, 'route', "Yevropeis'ka Square, Kyiv, Ukraine"],
 [135

In [108]:
#MAKE SURE THE NUMBER OF GEOCODED LOCS IS SAME AS NUMBER OF ROWS IN OUR TABLE
top_locs.num_rows == len(x)

True

In [109]:
# Add geocoded location data to the table
(top_locs['longitude'], top_locs['latitude'], top_locs['loctype'], top_locs['locname']) = zip(*x)
top_locs.show()

place,count,longitude,latitude,loctype,locname
Allahabad,9,81.8463,25.4358,"locality,political","Allahabad, Uttar Pradesh 211003, India"
America,14,-95.7129,37.0902,"country,political",United States
Aouda,33,1.0425,8.72196,"establishment,point_of_interest","Aouda, Togo"
Bombay,52,72.8777,19.076,"locality,political","Mumbai, Maharashtra, India"
Brindisi,9,17.9418,40.6327,"locality,political","72100 Brindisi, Province of Brindisi, Italy"
Bundelcund,7,85.5442,24.8752,"political,sublocality,sublocality_level_1","Bundelkhand, Nawada, Bihar, India"
Calcutta,29,88.3639,22.5726,"locality,political","Kolkata, West Bengal 700001, India"
Chicago,8,-87.6298,41.8781,"locality,political","Chicago, IL, USA"
China,13,104.195,35.8617,"country,political",China
England,30,-1.17432,52.3555,"administrative_area_level_1,political","England, UK"


In [110]:
# Set the color and radius for each point we will map
top_locs['radius'] = 1000 * top_locs['count']
top_locs['color'] = 'red'
top_locs

place,count,longitude,latitude,loctype,locname,radius,color
Allahabad,9,81.8463,25.4358,"locality,political","Allahabad, Uttar Pradesh 211003, India",9000,red
America,14,-95.7129,37.0902,"country,political",United States,14000,red
Aouda,33,1.0425,8.72196,"establishment,point_of_interest","Aouda, Togo",33000,red
Bombay,52,72.8777,19.076,"locality,political","Mumbai, Maharashtra, India",52000,red
Brindisi,9,17.9418,40.6327,"locality,political","72100 Brindisi, Province of Brindisi, Italy",9000,red
Bundelcund,7,85.5442,24.8752,"political,sublocality,sublocality_level_1","Bundelkhand, Nawada, Bihar, India",7000,red
Calcutta,29,88.3639,22.5726,"locality,political","Kolkata, West Bengal 700001, India",29000,red
Chicago,8,-87.6298,41.8781,"locality,political","Chicago, IL, USA",8000,red
China,13,104.195,35.8617,"country,political",China,13000,red
England,30,-1.17432,52.3555,"administrative_area_level_1,political","England, UK",30000,red


In [136]:
# Create descriptive text for popup
top_locs['description'] = top_locs.apply(lambda x,y,z: "%s, %s mentions (as %s)"% (x, str(y),z), ['locname', 'count', 'place'])

In [137]:
# Drop rows that could not be geocoded (getGeocodeInfo returns 0,0 for those)
top_locs = top_locs.where(top_locs['latitude'] != 0)  # couldn't combine these two conditions
top_locs = top_locs.where(top_locs['longitude'] != 0) # assuming 0,0 is not a valid location

# This filtering with predicates requires most recent version of datascience package
#top_locs.where('loctype', are.equal_to, 'locality')
top_locs

place,count,longitude,latitude,loctype,locname,radius,color,description
Allahabad,9,81.8463,25.4358,"locality,political","Allahabad, Uttar Pradesh 211003, India",9000,red,"Allahabad, Uttar Pradesh 211003, India, 9 mentions (as A ..."
America,14,-95.7129,37.0902,"country,political",United States,14000,red,"United States, 14 mentions (as America)"
Aouda,33,1.0425,8.72196,"establishment,point_of_interest","Aouda, Togo",33000,red,"Aouda, Togo, 33 mentions (as Aouda)"
Bombay,52,72.8777,19.076,"locality,political","Mumbai, Maharashtra, India",52000,red,"Mumbai, Maharashtra, India, 52 mentions (as Bombay)"
Brindisi,9,17.9418,40.6327,"locality,political","72100 Brindisi, Province of Brindisi, Italy",9000,red,"72100 Brindisi, Province of Brindisi, Italy, 9 mentions ..."
Bundelcund,7,85.5442,24.8752,"political,sublocality,sublocality_level_1","Bundelkhand, Nawada, Bihar, India",7000,red,"Bundelkhand, Nawada, Bihar, India, 7 mentions (as Bundel ..."
Calcutta,29,88.3639,22.5726,"locality,political","Kolkata, West Bengal 700001, India",29000,red,"Kolkata, West Bengal 700001, India, 29 mentions (as Calc ..."
Chicago,8,-87.6298,41.8781,"locality,political","Chicago, IL, USA",8000,red,"Chicago, IL, USA, 8 mentions (as Chicago)"
China,13,104.195,35.8617,"country,political",China,13000,red,"China, 13 mentions (as China)"
England,30,-1.17432,52.3555,"administrative_area_level_1,political","England, UK",30000,red,"England, UK, 30 mentions (as England)"


In [138]:
# Select only the columns that will be used to map the points
locmap = top_locs.select(['latitude', 'longitude','description','color','radius'])
locmap

latitude,longitude,description,color,radius
25.4358,81.8463,"Allahabad, Uttar Pradesh 211003, India, 9 mentions (as A ...",red,9000
37.0902,-95.7129,"United States, 14 mentions (as America)",red,14000
8.72196,1.0425,"Aouda, Togo, 33 mentions (as Aouda)",red,33000
19.076,72.8777,"Mumbai, Maharashtra, India, 52 mentions (as Bombay)",red,52000
40.6327,17.9418,"72100 Brindisi, Province of Brindisi, Italy, 9 mentions ...",red,9000
24.8752,85.5442,"Bundelkhand, Nawada, Bihar, India, 7 mentions (as Bundel ...",red,7000
22.5726,88.3639,"Kolkata, West Bengal 700001, India, 29 mentions (as Calc ...",red,29000
41.8781,-87.6298,"Chicago, IL, USA, 8 mentions (as Chicago)",red,8000
35.8617,104.195,"China, 13 mentions (as China)",red,13000
52.3555,-1.17432,"England, UK, 30 mentions (as England)",red,30000


In [139]:
m = folium.Map([45,0], zoom_start=2)
m

In [140]:
for i in range(0,len(top_locs['latitude'])):
    folium.Marker([top_locs['latitude'][i], top_locs['longitude'][i]], popup=top_locs['description'][i]).add_to(m)
m

In [126]:
m = folium.Map([45,0], zoom_start=2)

for i in range(0,len(top_locs['latitude'])):
    folium.CircleMarker([top_locs['latitude'][i], top_locs['longitude'][i]], popup=top_locs['description'][i], radius=top_locs['radius'][i]).add_to(m)
m
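
A note on marker size: in recent folium releases the `radius` of a `CircleMarker` is given in pixels (a `Circle` takes meters), so `1000 * count` may produce circles that cover the whole map. The following sketch scales counts to a modest pixel radius, with circle area proportional to the number of mentions. The function name and the `max_radius` default are assumptions; the counts are taken from the table above.

```python
import math

def pixel_radius(count, max_count, max_radius=30):
    """Scale a mention count to a pixel radius; area grows linearly with count."""
    return max_radius * math.sqrt(count / max_count)

counts = {'London': 62, 'Bombay': 52, 'Chicago': 8}
max_count = max(counts.values())
for place, count in counts.items():
    print(place, round(pixel_radius(count, max_count), 1))
```

The square root keeps a place with four times the mentions at twice the radius, which readers tend to perceive as four times the size.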

## Questions:

- Name some of the characteristics of place names that geocoded well and of those that did poorly.
- Name some of the reasons why place name geoparsing (named entity recognition, or NER, for locations) is difficult.
- Similarly, why is place name "data cleaning" difficult?
- What are some of the benefits to automated geoparsing?
- What types of texts would geocode better than others? worse?
- What are the alternatives to automated geoparsing?
- Discuss the difference between georeferencing that big city across the bay as 'San Francisco' vs. as (37.77493, -122.41942).
- What can you do with the results of automated geoparsing? How might they be used?
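
As a starting point for the last question: once places have coordinates, you can compute with them in ways that names alone do not allow. The sketch below estimates the great-circle distance between two geocoded places with the haversine formula. It assumes a spherical Earth with a mean radius of 6371 km, and the coordinates are the Berkeley and Chicago results from this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

berkeley = (37.8715926, -122.272747)
chicago = (41.8781136, -87.6297982)
distance = haversine_km(*berkeley, *chicago)
print(round(distance), 'km')
```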
