In this notebook, we'll explore named entity recognition (NER) through the lens of toponym resolution: we use NER to extract the geopolitical place names mentioned in a text, and then plot those locations on a map with the Folium mapping library (see [here](https://blog.prototypr.io/interactive-maps-with-python-part-1-aa1563dbe5a9) for a Folium tutorial).


In [None]:
!pip install folium

In [None]:
!pip install wikipedia

In [None]:
import folium
import wikipedia
import spacy
from collections import Counter

In [None]:
# spaCy v3 removed the 'en' shortcut, so load the model by its full name.
# If the model is missing, install it with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser'])

There are several good APIs for resolving place names to latitude/longitude coordinates, such as [Nominatim](https://wiki.openstreetmap.org/wiki/Nominatim) from OpenStreetMap and Google's [Geocoding API](https://developers.google.com/maps/documentation/geocoding/).  Those are typically rate-limited or not free, so in this notebook we'll build a simple georeferencer from [GeoNames](http://download.geonames.org/export/dump/) data instead: each mention of a geopolitical entity is assigned to the city with the same name, and in cases of ambiguity (e.g., Cambridge, MA vs. Cambridge, UK) we select the city with the greatest population.

In [None]:
def read_geonames(city_filename, country_filename):
    """ Read the GeoNames city and country dumps (tab-separated, UTF-8).
    Output: list of (name, population, lat, long) city tuples
            and a set of lowercased country names """
    cities=[]
    countries=[]
    
    with open(city_filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip("\n").split("\t")
            name=cols[1].lower()
            lat=float(cols[4])
            long=float(cols[5])
            population=int(cols[14])
 
            cities.append((name, population, lat, long))

    with open(country_filename, encoding="utf-8") as file:
        for line in file:
            # countryInfo.txt begins with comment lines prefixed by "#"
            if line.startswith("#"):
                continue
            cols=line.rstrip("\n").split("\t")
            name=cols[4].lower()
            countries.append(name)
            
    return cities, set(countries)
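
The column indices used in `read_geonames` come from the GeoNames dump layout, where each row has 19 tab-separated fields (name at index 1, latitude at 4, longitude at 5, population at 14). As a minimal sketch of that layout, the block below parses a single made-up row; `City`, `parse_city_line`, and the sample values are hypothetical names and data introduced only for illustration, not part of the notebook or of GeoNames.

```python
from typing import NamedTuple

# Positions of the fields we need in a GeoNames city row (tab-separated).
NAME, LATITUDE, LONGITUDE, POPULATION = 1, 4, 5, 14


class City(NamedTuple):
    """Hypothetical record type mirroring the tuples built by read_geonames."""
    name: str
    population: int
    lat: float
    long: float


def parse_city_line(line):
    """Parse one tab-separated GeoNames row into a City record."""
    cols = line.rstrip("\n").split("\t")
    return City(cols[NAME].lower(), int(cols[POPULATION]),
                float(cols[LATITUDE]), float(cols[LONGITUDE]))


# Made-up row in the 19-column layout (illustrative values, not real GeoNames data).
sample = "\t".join([
    "1", "Exampleville", "Exampleville", "", "12.5", "-45.25", "P", "PPL",
    "XX", "", "", "", "", "", "1234", "", "10", "Etc/UTC", "2020-01-01",
])

city = parse_city_line(sample)
print(city)
```

Naming the column positions once makes it easier to spot an off-by-one mistake than scattering bare indices through the parsing loop.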

In [None]:
cities, countries=read_geonames("../data/cities500.txt", "../data/countryInfo.txt")

In [None]:
def resolve_toponyms(locations, cities, countries, doc):
    """ Resolve a counter of GPE entities to their latitude/longitude coordinates
    Input: 
        - locations: counter mapping GPE entities to their count in a text
        - cities: list of cities containing (placename, population, lat, long) tuples
        - countries: set of country names
        - doc: spaCy-processed document containing all tokens, entities, etc. (not used by this simple resolver)
        
    Output: dict mapping each GPE entity to (lat, long) tuple """
    
    coordinates={}
    
    new_geo={}
    
    for (placename, population, lat, long) in cities:
        if placename in countries:
            continue
            
        # for placenames that refer to multiple cities, just keep the city with biggest population
        if placename in new_geo:
            _, cur_pop, _, _=new_geo[placename]
            if population > cur_pop:
                new_geo[placename]=(placename, population, lat, long)
        else:
            new_geo[placename]=(placename, population, lat, long)
    
    
    for entity in locations:
        if entity in new_geo:
            coordinates[entity]=(new_geo[entity][2], new_geo[entity][3])
    
    return coordinates
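
To see the disambiguation rule in isolation, here is a small sketch of the same idea on a toy gazetteer: keep the most populous city for each name, and skip names that are also country names. The function name `largest_city_by_name` is hypothetical, and the populations and coordinates are rough illustrative values rather than GeoNames data.

```python
def largest_city_by_name(cities, countries):
    """Keep, for each placename, the entry with the largest population."""
    best = {}
    for name, population, lat, long in cities:
        if name in countries:
            continue  # names shared with a country are never resolved to a city
        if name not in best or population > best[name][1]:
            best[name] = (name, population, lat, long)
    return best


# Toy gazetteer: all numbers below are approximate and for illustration only.
toy_cities = [
    ("cambridge", 105000, 42.37, -71.11),  # Cambridge, MA
    ("cambridge", 125000, 52.20, 0.12),    # Cambridge, UK
    ("mexico", 11000, 39.17, -91.88),      # a small town sharing a country's name
]
toy_countries = {"mexico"}

best = largest_city_by_name(toy_cities, toy_countries)
print(best["cambridge"])
print("mexico" in best)
```

With these toy numbers every mention of "cambridge" resolves to the UK city, whatever the text is actually about. That is the main limitation of a population-only heuristic.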
    

In [None]:
def map_toponyms(text, cities, countries):
    doc=nlp(text)
    
    locations=Counter()
    for entity in doc.ents:
        # We'll select just entities that are tagged geopolitical entities (which include cities)
        if entity.label_ == "GPE":
            locations[entity.text.lower()]+=1


    coordinates=resolve_toponyms(locations, cities, countries, doc)

    # Find the most frequently mentioned resolved place
    # (center stays None if nothing could be resolved)
    center=None
    maxcount=0
    for entity in coordinates:
        if locations[entity] > maxcount:
            maxcount=locations[entity]
            center=[coordinates[entity][0], coordinates[entity][1]]
            
    # Create map centered on the most frequently mentioned city
    folium_map = folium.Map(location=center,
                            zoom_start=3,
                            tiles="CartoDB dark_matter")

    # Add locations to map
    for entity in coordinates:
        radius=locations[entity]
        marker = folium.CircleMarker(location=[coordinates[entity][0], coordinates[entity][1]], radius=radius, fill=True, popup=entity)
        marker.add_to(folium_map)
    
    return folium_map
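
In `map_toponyms` the marker radius is the raw mention count, and a `CircleMarker` radius is measured in pixels, so a place mentioned hundreds of times in a long text would cover much of the map. One possible adjustment, which is not part of the original notebook, is to grow the radius logarithmically. The helper `scaled_radius` and its `base` and `scale` defaults are assumptions chosen for illustration.

```python
import math


def scaled_radius(count, base=3.0, scale=4.0):
    """Map a mention count (>= 1) to a marker radius in pixels.

    The radius grows with the logarithm of the count, so frequent places
    stay visibly larger without swamping the map.
    """
    return base + scale * math.log(count)


for count in (1, 10, 100, 1000):
    print(count, round(scaled_radius(count), 1))
```

To try it, replace `radius=locations[entity]` with `radius=scaled_radius(locations[entity])` in the marker loop.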

Let's test our method by pulling articles from Wikipedia and plotting the placenames mentioned in them.  Explore this -- try inputting other Wikipedia articles and visualizing the places.  Let us all know if you find an interesting one!

In [None]:
ucb = wikipedia.page("University of California, Berkeley")
nyc = wikipedia.page("New York City")
ww2 = wikipedia.page("World War II")

In [None]:
folium_map=map_toponyms(nyc.content, cities, countries)

In [None]:
folium_map

Now let's try it with the full text of a book (Mark Twain's travelogue *The Innocents Abroad*).  Running this through spaCy will take a minute.

In [None]:
with open("../data/twain_innocents_abroad.txt") as file:
    data=file.read()
folium_map=map_toponyms(data, cities, countries)

In [None]:
folium_map
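
spaCy refuses to process a single text longer than `nlp.max_length` (1,000,000 characters by default), which a long book can exceed. If you hit that limit, one option is to split the text into smaller pieces and count entities across them. The sketch below is a hypothetical helper, `split_into_chunks`, that groups paragraphs into chunks of bounded size; it assumes paragraphs are separated by blank lines, which may not hold for every file.

```python
def split_into_chunks(text, max_chars=100_000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and size + len(paragraph) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph) + 2  # +2 for the separator that rejoins paragraphs
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Usage sketch (assumes `nlp`, `Counter`, and `data` from the cells above):
# locations = Counter()
# for doc in nlp.pipe(split_into_chunks(data)):
#     for entity in doc.ents:
#         if entity.label_ == "GPE":
#             locations[entity.text.lower()] += 1
```

Splitting on paragraph boundaries keeps sentences intact, so an entity mention is unlikely to be cut in half at a chunk edge.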

Download a text of your own from [Project Gutenberg](https://www.gutenberg.org) and run it through the pipeline above (Project Gutenberg has many works of literature published before 1925).  Generate a visualization for it and be prepared to share your visualization in class.  Are the locations centered around the main setting of the work?