## Extracting and Mapping Place Names
*last updated 01-10-2017*

### Introduction
This notebook demonstrates the power of and problems with automated geoparsing. **Geoparsing** is a term used for two related tasks:
1. **[NER: named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)**, the automatic extraction of named terms from a text document, with a focus here on place names and locations.
2. **[Geocoding](https://en.wikipedia.org/wiki/Geocoding)**, the process of determining the geographic coordinates for place names, codes and street addresses.
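
As a rough illustration of how the two tasks fit together, the sketch below uses a crude capitalized-word pattern in place of real NER and a tiny hand-made lookup table in place of a real geocoder. The names `TOY_GAZETTEER`, `toy_extract`, and `toy_geocode` are invented for this example, and the coordinates are approximate.

```python
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude), approximate values
TOY_GAZETTEER = {
    "London": (51.51, -0.13),
    "Paris": (48.86, 2.35),
    "Suez": (29.97, 32.55),
}

def toy_extract(text):
    """Crude stand-in for NER: return every capitalized word as a candidate name."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)

def toy_geocode(name):
    """Stand-in for geocoding: look the name up, returning None if unknown."""
    return TOY_GAZETTEER.get(name)

sentence = "Fogg left London for Paris and reached Suez on Wednesday."
candidates = toy_extract(sentence)
located = {name: toy_geocode(name) for name in candidates if toy_geocode(name)}

print(candidates)  # includes non-places such as 'Fogg' and 'Wednesday'
print(located)     # only the names found in the gazetteer
```

Even this toy version shows both failure modes discussed in this notebook: the extraction step picks up words that are not places, and the lookup step can only locate names it already knows.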

Why geoparse? One reason might be to determine all locations that are mentioned in a text. Another might be to determine only the most important locations, which might be indicated by how often they are mentioned or where they are mentioned, e.g., in a title or chapter heading. The output of geoparsing can be used to answer many types of questions, like 'What was the geographic evolution of the use of the term *dude*?'

NER and geocoding are complex computational tasks, and their inner workings are beyond the scope of this notebook. Instead, this notebook aims to get you thinking about the complexity and richness of place names and the value of coordinate representations of places, both of which make geoparsing so important.


#### In this notebook we:
1. Fetch the text of the [Project Gutenberg](http://www.gutenberg.org) ebook ['Around the World in 80 Days' by Jules Verne](http://www.gutenberg.org/cache/epub/103/pg103.txt).
2. Extract place name references from that text using the [NLTK](http://www.nltk.org) package.
3. Geocode the place names with the [Geopy](https://geopy.readthedocs.io/en/1.10.0/) package and the [Google V3 Geocoding API](https://developers.google.com/maps/documentation/geocoding/intro).
4. Map the named place locations with the maps module of the [datascience](https://github.com/data-8/datascience) package.

#### Learning goals:
- Think about the differences in place name vs. coordinate representations of locations.
- Experience the power and challenges of automated place name extraction.
- Consider some of the things you can do with the place names once they are geocoded.

#### Caveats:
- This notebook presents a very simple approach to geoparsing using the NLTK. Most references suggest geoparsing with the [Stanford NER (Named Entity Recognizer)](http://nlp.stanford.edu/software/CRF-NER.shtml) extension to NLTK.
<hr/>

### Step 1. Load needed Python Libraries.

- Note: press the **shift-return** keys to execute the code in a cell. You can also select **Cell > Run Cells and Select Below** from the menu.

In [None]:
# HIDDEN - run but do not change
from datascience import *
from datascience.predicates import are
import numpy as np

import matplotlib
matplotlib.use('Agg')  # the warn keyword was dropped because newer matplotlib releases no longer accept it
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

In [None]:
# Run but don't change these libraries which are specific to this notebook

from scipy import stats
from scipy import misc

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

from itertools import groupby

from geopy.geocoders import *

### Step 2. Create a function to extract locations from text

The Natural Language Toolkit (NLTK) is one of the leading Python packages for processing human language data. Our language processing task is to **geoparse** the text to extract all place name references. This task is known as NER, or named entity recognition. But instead of caring about all named entities, we only want locations. The types of named entities that we can parse with NLTK are:

- `FACILITY`
- `GPE` (or geo-political entity)
- `GSP` (or geo-socio-political group)
- `LOCATION`
- `ORGANIZATION`
- `PERSON`

To do this, we create a function, here called **get_placename_chunks** which locates named entities in the text. The function returns a Python list of all of the chunks coded, or tagged, as `GPE`.  



In [None]:
def get_placename_chunks(text, debug_level=0):
    # NLTK NER location (GPE) parsing function
    # After: http://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
    
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    if debug_level == 2: print(chunked) 
    prev = None
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        # This next line gets us the locations only
        # if type(i) == Tree and (str(i).find('GPE') >=0): # GPE is Geo-Political Entity
        if type(i) == Tree and (str(i).find('GPE') >=0) and (str(i).find('NNP') >=0): #GPE is Geo-Political Entity  
            if debug_level == 1: print(i)
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
            if debug_level == 1 : print("......", current_chunk)
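        elif current_chunk:
            named_entity = " ".join(current_chunk)
            continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    # Flush a place name that ends the text, which the loop alone would drop
    if current_chunk:
        continuous_chunk.append(" ".join(current_chunk))
    return continuous_chunk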
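
To follow the merging logic without loading the NLTK models, here is a minimal sketch that runs on hand-built data. The `Chunk` tuple and the `fake_chunked` list are hypothetical stand-ins for the chunk tree that `ne_chunk` produces; they only imitate its shape. The sketch also keeps a place name that falls at the very end of the input.

```python
from collections import namedtuple

# Hypothetical stand-in for an NLTK chunk subtree: a label plus its words
Chunk = namedtuple("Chunk", ["label", "words"])

# Hand-built imitation of chunker output: plain tokens are (word, tag) tuples
fake_chunked = [
    Chunk("GPE", ["New", "York"]),
    ("police", "NN"),
    Chunk("PERSON", ["Loretta", "Lynch"]),
    ("in", "IN"),
    Chunk("GPE", ["Brooklyn"]),
    (".", "."),
]

def merge_gpe(items):
    """Collect GPE chunks, joining any that sit directly next to each other."""
    places, current = [], []
    for item in items:
        if isinstance(item, Chunk) and item.label == "GPE":
            current.append(" ".join(item.words))
        elif current:
            places.append(" ".join(current))
            current = []
    if current:  # keep a place name that ends the input
        places.append(" ".join(current))
    return places

print(merge_gpe(fake_chunked))
```

The `PERSON` chunk is skipped because only chunks labeled `GPE` are collected, which is the same filter the function above applies.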

Now we create some text to test the function before we apply it to our book.

In [None]:
# Some text with place names for testing
## source: http://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
my_text = '''Washington -- In the wake of a string of abuses by New York police officers in the 1990s, 
    Loretta E. Lynch the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken 
    trust that African-Americans felt and said the responsibility for repairing generations of miscommunication 
    and mistrust fell to law enforcement. Similar feelings were reported in Washington.'''





Test the **get_placename_chunks** function on the test data.

In [None]:
my_locs = get_placename_chunks(my_text)
my_locs

If you want to explore the NER process more thoroughly, you can set the `debug_level` argument to print out the processing details. You will then see the chunks that contain proper nouns (tagged `NNP`) and are labeled as geo-political entities (`GPE`).


In [None]:
my_locs = get_placename_chunks(my_text,2)

Now that we have the list of place names, we count the number of times each place was mentioned.

In [None]:
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(my_locs))]
print(loc_count)
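
The counting step relies on `groupby` only grouping items that sit next to each other, which is why the list is sorted first. As a sketch of an alternative, `collections.Counter` from the standard library gives the same tallies without that requirement. The `sample_locs` list is made up for illustration.

```python
from collections import Counter
from itertools import groupby

# Hypothetical sample of extracted names
sample_locs = ["Washington", "New York", "Brooklyn", "Washington"]

# The approach used above: sort so equal names are adjacent, then count each group
by_groupby = [(k, len(list(g))) for k, g in groupby(sorted(sample_locs))]

# An alternative: Counter tallies without sorting; sort the pairs afterwards
by_counter = sorted(Counter(sample_locs).items())

print(by_groupby)
print(by_counter)
```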



The `loc_count` object is a list of place name - count pairs (or tuples). We can use the **zip** function to extract these values into separate lists that are in the same order - one for place names and one for the counts.  This will come in handy.

In [None]:
# What does this zip function do?
(a,b) = zip(*loc_count) # this returns two tuples that can then be columns in tables
print(a)
print(b)

### Geocoding

Geocoding is the process of determining the geographic coordinates for named places, zip codes, or street addresses. Geocoding will allow us to map these locations.

To do this we need to compare our place names with a database of places and geometric representations of these places. We will use the **geocoders** module of the **Geopy** package to do this. Geopy.geocoders provides access to several different geocoding tools, the most popular of which is the **Google Geocoding API**.

In [None]:
# Define a Geopy.geocoder
# Note: Google now requires an API key for this service; pass yours as GoogleV3(api_key=...)
gloc = GoogleV3()

In [None]:
# Test the geocoder on a place name and view the output
geocoded_place = gloc.geocode('Berkeley, CA')
geocoded_place


The `gloc.geocode` method returns a `Location` object. By entering **geocoded_place** we see the basic output of the geocoder. To see the full output you need to reference the Location's raw output as **geocoded_place.raw**, as shown below. This is a Python dictionary from which you can retrieve any of the elements.

In [None]:
# Full geocoder output
print(geocoded_place.raw)
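
The raw output is a nested dictionary, so values are retrieved by chaining keys. The sketch below uses a hand-written, heavily abbreviated stand-in (`sample_raw`) rather than real geocoder output; it keeps only the keys that this notebook reads later, and the values are approximate.

```python
# Hand-written, abbreviated stand-in for a geocoder's raw result
# (not real API output; the values are approximate)
sample_raw = {
    "formatted_address": "Berkeley, CA, USA",
    "address_components": [
        {"long_name": "Berkeley", "short_name": "Berkeley",
         "types": ["locality", "political"]},
    ],
    "geometry": {"location": {"lat": 37.87, "lng": -122.27}},
}

# Drill into the nested dictionary one key at a time
lat = sample_raw["geometry"]["location"]["lat"]
lng = sample_raw["geometry"]["location"]["lng"]
place_types = ",".join(sample_raw["address_components"][0]["types"])

print(sample_raw["formatted_address"], lat, lng, place_types)
```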

You can also use the Location object's attributes to retrieve the geocoded address, latitude, and longitude, among other elements.

In [None]:
print(geocoded_place.address, '[Longitude: ', geocoded_place.longitude, ', Latitude: ', geocoded_place.latitude,']')
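
One thing coordinates allow that names do not is arithmetic. As a small sketch, the function below estimates the great-circle distance between two points with the haversine formula, assuming a spherical Earth with a radius of 6371 km. The function name and the hand-typed coordinates are illustrative only.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers, assuming a spherical Earth (radius 6371 km)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Approximate coordinates, typed in by hand for illustration
london = (51.51, -0.13)
paris = (48.86, 2.35)

print(round(haversine_km(*london, *paris)), "km")
```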

### Extracting Place Names from our Text Document

We now have the functions we will need to extract place names from our ebook, **Around the World in 80 Days**.

In [None]:
# Read in a text to Geoparse
 
## Downloaded from Project Gutenberg
## http://www.gutenberg.org/cache/epub/103/pg103.txt
text_url = 'https://raw.githubusercontent.com/data-8/geospatial-connector/gh-pages/data/around_world_80days.txt' 
the_text = read_url(text_url)


In [None]:
# Take a look
print(the_text[:1000])

We can see from what prints above that we have the text of our book loaded into the variable `the_text`. Now we can extract place names.

In [None]:
# this can take a few minutes...
text_locs = get_placename_chunks(the_text)

In [None]:
# Let's take a look at the extracted place names - first 10
text_locs[:10]

In [None]:
# Sort the locations and count the number of times each was referenced
loc_count =  [(k, len(list(g))) for k, g in groupby(sorted(text_locs))]

In [None]:
# and take a look at first ten - alpha order
loc_count[:10]

### Datascience Table

Now that we have retrieved the place names and the counts we can start organizing that information in a **Datascience Table**.

In [None]:
# Create an empty table for our locations
loc_table = Table.empty(['place','count'])
loc_table

In [None]:
# Put the location data into the table
(loc_table['place'], loc_table['count']) = zip(*loc_count)

In [None]:
# Take a look at the table
print('Number of places extracted: ', loc_table.num_rows)
loc_table.show()

We can see from the table above that:
1. many place names were extracted - 224, 
2. most of these have only 1 reference, and
3. some of them are not even place names, like `Brigham Young`

Let's try to weed this down to the most important places. To figure out a good cutoff, let's look at the histogram.

In [None]:
loc_table.hist('count')

The histogram tells us that most of the place names have fewer than 5 references. So let's limit our table to the places with more than 5 references.

In [None]:
# Optional and arbitrary - remove minor mentions
top_locs = loc_table.where(loc_table['count'] > 5)
print('Number of places with more than 5 references: ', top_locs.num_rows)
top_locs.sort('count', descending=True).show()

The new table of important places has only 35 named places, down from 224. That's a big improvement in terms of narrowing our focus. But we can see in this table that there are still a lot of terms that we wouldn't consider places. Can you name a few and explain why?

Let's try to remove these types of locations from our table. We should use **regular expressions** for this type of operation but that's a more advanced topic. So, for now we will just remove rows from the table that contain terms we are not interested in. This is just an example of how one might move forward with geoparsing - first figure out what you want to do and then figure out how to code and automate that thinking.
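
For the curious, a regular-expression version of this cleanup might look like the sketch below. The `names` list and the `demonym` pattern are made up for illustration and are not tuned to the book.

```python
import re

# Hypothetical sample of extracted "place names"
names = ["London", "American", "Indians", "British", "Chinese",
         "Frenchman", "French", "Japan", "Hong Kong"]

# Endings that often mark a nationality or language rather than a place
demonym = re.compile(r"(ans?|ish|ese)$|^French$")

kept = [n for n in names if not demonym.search(n)]
dropped = [n for n in names if demonym.search(n)]

print(kept)
print(dropped)
```

Notice that the pattern also drops `Japan`, which is a real place. Pattern-based cleaning is fast, but its results still need a human review.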

In [None]:
# remove rows whose name is a nationality or adjective (e.g. "American", "British", "Chinese"), plus "Europe" and "San"
top_locs = top_locs.where((top_locs['place'] !=('Frenchman')) & (top_locs['place'] !=('American')))
top_locs = top_locs.where((top_locs['place'] !=('British')) & (top_locs['place'] !=('English')))
top_locs = top_locs.where((top_locs['place'] !=('French')) & (top_locs['place'] !=('American')))
top_locs = top_locs.where((top_locs['place'] !=('Chinese')) & (top_locs['place'] !=('Portuguese')))
top_locs = top_locs.where((top_locs['place'] !=('Indian')) & (top_locs['place'] !=('Indians')))
top_locs = top_locs.where((top_locs['place'] !=('Japanese')) & (top_locs['place'] !=('Mexican')))
top_locs = top_locs.where((top_locs['place'] !=('Europe')))
top_locs = top_locs.where((top_locs['place'] !=('San')))

# Remove "Passepartout" and 'Aouda'
top_locs = top_locs.where((top_locs['place'] !=('Passepartout'))  & (top_locs['place'] !=('Aouda')))


In [None]:
print('Number of places with more than 5 references: ', top_locs.num_rows)
top_locs.sort('count', descending=True).show()

### Geocoding, part II

We now have a pretty good list of place names to geocode. Let's create a function to go through the table and geocode each row. This function will return the formatted address, longitude, latitude, and place type of the matched location.

In [None]:
# A function to geocode the places one by one and 
# return output that will load nicely into our table
def getGeocodeInfo(place):
    print('geocoding...', place)
    x = gloc.geocode(place)
    if x is not None:
        mylat = float(x.raw['geometry']['location']['lat'])
        mylng = float(x.raw['geometry']['location']['lng'])
        # place types of the first address component, joined into one string
        myloctype = ','.join(map(str, x.raw['address_components'][0]['types']))
        mylocname = x.raw['formatted_address']

        mystuff = [mylng, mylat, myloctype, mylocname]
    else:
        # what to return when a place can't be geocoded
        mystuff = [0, 0, "none", "none"]

    return mystuff

In [None]:
# test the function
getGeocodeInfo('Berkeley, CA')

In [None]:
# test data that won't geocode
getGeocodeInfo('Berkeley')

In [None]:
# GEOCODE ALL PLACES IN THE TOP_LOCS TABLE
x = []
def getAll():
    for i in top_locs['place']:
        x.append(getGeocodeInfo(i))
    return x


In [None]:
x = getAll()
x

In [None]:
#MAKE SURE THE NUMBER OF GEOCODED LOCS IS SAME AS NUMBER OF ROWS IN OUR TABLE
top_locs.num_rows == len(x)
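
The length check matters because the geocoded rows are later split into columns, and each column must line up with the rows of the table. The sketch below replays that idea offline. `FAKE_RESULTS` and `fake_geocode_info` are hypothetical stand-ins for the real geocoder, and the coordinates are approximate.

```python
# Hypothetical offline stand-in for a geocoder: place -> (lat, lng)
FAKE_RESULTS = {"London": (51.51, -0.13), "Bombay": (19.08, 72.88)}

def fake_geocode_info(place):
    """Mimic the shape of the rows built above: [lng, lat, loctype, locname]."""
    hit = FAKE_RESULTS.get(place)
    if hit is None:
        return [0, 0, "none", "none"]  # placeholder keeps rows aligned
    lat, lng = hit
    return [lng, lat, "locality", place]

places = ["London", "Passepartout", "Bombay"]
rows = [fake_geocode_info(p) for p in places]

# One row per place, so the columns produced by zip line up with the table
assert len(rows) == len(places)
lngs, lats, loctypes, locnames = zip(*rows)

# Filter on the "none" marker, not on (0, 0), which is a valid coordinate
geocoded = [p for p, t in zip(places, loctypes) if t != "none"]
print(geocoded)
```

Returning a placeholder row for failures, rather than skipping them, is what keeps the row count equal to the table's.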

Now that we have all of our geocoded place data, we can add it to our top_locs table. We will then remove the places that we were unable to geocode - those where **loctype = 'none'**.


In [None]:
# Add geocoded location data to the table
(top_locs['longitude'], top_locs['latitude'], top_locs['loctype'], top_locs['locname']) = zip(*x)


In [None]:
# remove rows for places that were not geocoded
top_locs = top_locs.where((top_locs['loctype'] !=('none')))
print('Number of places successfully geocoded: ', top_locs.num_rows)
top_locs.show()

### Evaluating our work

What places were named in the book as being on the itinerary for the journey around the world?


>"These dates were inscribed in an itinerary divided into columns,
indicating the month, the day of the month, and the day for the
stipulated and actual arrivals at each principal point Paris, Brindisi,
Suez, Bombay, Calcutta, Singapore, Hong Kong, Yokohama, San Francisco,
New York, and London--from the 2nd of October to the 21st of December;"

### Mapping our Places

We are now ready to map our locations...

In [None]:
#map the locations
Circle.map_table(top_locs.select(['latitude', 'longitude']))

Those circles on the map are too small to see unless you zoom way in. So, you can't get a sense of the places that were visited during the journey around the world. Let's make a few adjustments. We will:

1. color the points red to make them more visible
2. increase the radius of the points proportional to the counts so that we can see the relative importance
3. add a descriptive text that will display in a popup window when you click on the map


In [None]:
# Set the color and radius for each point we will map
top_locs['radius'] = 10000 * top_locs['count']
top_locs['color'] = 'red'
top_locs

In [None]:
# Create descriptive text for popup
top_locs['description'] = top_locs.apply(lambda x,y,z: "%s, %s mentions (as %s)"% (x, str(y),z), ['locname', 'count', 'place'])

In [None]:
Circle.map_table(top_locs.select(['latitude', 'longitude','description','color','radius']))

And the actual route map is...
<img src="http://kickasstrips.com/wp-content/uploads/2014/06/Around_the_World_in_Eighty_Days_map_Jules_Verne.jpg" width="800"></img>

See: (http://kickasstrips.com/2014/06/around-the-world-in-80-days-phileas-foggs-original-journey/)

## Questions:

1. Compare the place names that we extracted from the text (in the version of our top_locs table shown before the section **Evaluating our work**) with the places listed in the text as the itinerary (quoted in the section **Evaluating our work**). How well did the code work? Are there places that you would like to remove from the table? If yes, why?
2. Name some of the reasons why place name geoparsing is difficult.
3. Why is place name "data cleaning" difficult? We did some of this when we removed rows from the `top_locs` table.
4. Why did we remove "Passepartout" and "Aouda" in that same step? Take a look at the ebook URL for clues.
5. Can you guess what types of documents would geoparse better than others? worse?
6. What are the alternatives to automated geoparsing?
7. Discuss the difference between referencing that big city across the bay as 'San Francisco' vs. (37.77493, -122.41942).

### BONUS Evaluation, part II

We can use the Python **Folium** library to create a route map and see if it looks like the route shown in the image above. Don't worry about understanding the code; just execute it and view the results.

In [None]:
# Sort the locations for our route by longitude (east-west direction).
# This is a little tricky because longitudes range from -180 to +180, and those two values reference the same meridian.
top_locs['sorder'] = 0 # add a new column and set it to zero
top_locs['sorder'] = [(lon+360) if lon < -0.2 else (lon) for lon in top_locs['longitude']]

# display the locations sorted by sorder
top_locs.sort('sorder').show()
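
To see why the shift by 360 is needed, here is a small sketch on a few itinerary stops. The longitudes are approximate and typed in by hand, and the `sort_key` helper is a name invented for this example.

```python
# Approximate longitudes of some itinerary stops, typed in by hand for illustration
stops = {
    "London": -0.13, "Paris": 2.35, "Suez": 32.55, "Bombay": 72.88,
    "Hong Kong": 114.17, "Yokohama": 139.64,
    "San Francisco": -122.42, "New York": -74.01,
}

def sort_key(lon, start=-0.2):
    """Shift longitudes west of the start point by 360 so they sort last."""
    return lon + 360 if lon < start else lon

plain = sorted(stops, key=lambda name: stops[name])
eastward = sorted(stops, key=lambda name: sort_key(stops[name]))

print(plain)     # San Francisco and New York come first
print(eastward)  # London first, the American stops last
```

The cutoff of -0.2 sits just west of London, so the starting city keeps its small negative longitude and stays at the front of the route.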


In [None]:
# save the sort order to a new table
top_locs2 = top_locs.sort('sorder')
top_locs2.show()

In [None]:
import folium

# Simple function to show folium maps inline
from IPython.display import HTML

def inline_map(m, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{0}" '
                 'style="width: 100%; height: {1}px; '
                 'border: none"></iframe>'.format(srcdoc, height))
    return embed

In [None]:
# Create an ordered list of our route locations
book_locs = list(zip(top_locs2['latitude'], top_locs2['longitude']))


In [None]:
m = folium.Map([0,0], zoom_start=2) 

m.line(locations=book_locs[0:16], line_color='red')
m.line(locations=book_locs[16:], line_color='red', line_weight=6)

def mapMyPoint(the_map, lat,lon, popupContent, m_color='blue'):
    the_map.simple_marker(location=(lat,lon), popup=popupContent, marker_color=m_color)

# Add the points along the route
top_locs2.apply(lambda lat,lon , thePopup: mapMyPoint(m, lat,lon, thePopup), ['latitude','longitude','description'])

inline_map(m)

### Bonus Question
What location(s) are messing up the route map, if any? 

And the actual route map is...
<img src="http://kickasstrips.com/wp-content/uploads/2014/06/Around_the_World_in_Eighty_Days_map_Jules_Verne.jpg" width="800"></img>

See: (http://kickasstrips.com/2014/06/around-the-world-in-80-days-phileas-foggs-original-journey/)