In this notebook, we'll explore named entity recognition through the lens of toponym resolution: we'll use NER to extract the geopolitical place names in a text and then plot those locations on a map with the Folium mapping library (see [this tutorial](https://blog.prototypr.io/interactive-maps-with-python-part-1-aa1563dbe5a9) for an introduction to Folium).

You'll need to install the following (spaCy and one of its English models must also be installed, since the notebook imports `spacy`):
```sh
pip install folium==0.8.3
pip install wikipedia==1.4.0
```

#### Cell 2: Importing Libraries
This cell imports all the necessary Python libraries for the project.
* **folium**: Used to create interactive maps.
* **wikipedia**: Used to fetch text content from Wikipedia articles.
* **spacy**: A powerful library for Natural Language Processing (NLP), used here for Named Entity Recognition (NER).
* **Counter**: A dictionary subclass from the `collections` module for counting hashable objects, perfect for tallying the frequency of locations.
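
As a quick illustration of why `Counter` suits this task, the sketch below tallies a hand-written list of place names. The list is made up for illustration and is not output from the notebook.

```python
from collections import Counter

# Hypothetical place-name mentions, standing in for entities extracted from a text.
mentions = ["berkeley", "oakland", "berkeley", "san francisco", "berkeley"]

locations = Counter()
for name in mentions:
    # Missing keys default to 0, so no initialization is needed.
    locations[name] += 1

print(locations["berkeley"])     # 3
print(locations.most_common(1))  # [('berkeley', 3)]
```

Looking up a name that never occurred (for example `locations["paris"]`) returns 0 rather than raising `KeyError`.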

In [None]:
# Import the folium library for creating interactive maps.
import folium
# Import the wikipedia library to fetch content from Wikipedia articles.
import wikipedia
# Import the spacy library for natural language processing tasks.
import spacy
# From the collections module, import the Counter class to count entity occurrences.
from collections import Counter

#### Cell 3: Loading the spaCy Model
Here, we load a pre-trained English language model from spaCy. We disable the 'parser' component because we only need the named entity recognizer (NER) for this task, which speeds up processing. The `'en'` shortcut only resolves in spaCy 2.x installations where it has been linked to a model; spaCy 3.x removed model shortcuts, so there the full package name (`en_core_web_sm`) must be used, as in the commented-out alternative line.

In [None]:
# Load the default English language model from spaCy.
# We disable the 'parser' pipeline component to speed up processing as we only need NER.
nlp = spacy.load('en', disable=['parser'])

# Workaround in case loading 'en' fails: the shortcut only exists in spaCy 2.x
# (spaCy 3.x removed it), so load the small English model by its full name instead.
# nlp = spacy.load('en_core_web_sm', disable=['parser'])

#### Cell 4: Georeferencing Strategy
There are several good APIs for resolving place names to latitude/longitude coordinates, such as [Nominatim](https://wiki.openstreetmap.org/wiki/Nominatim) from OpenStreetMap and Google's [Geocoding API](https://developers.google.com/maps/documentation/geocoding/). These are typically rate-limited or not free, so this notebook uses a simple homemade georeferencer built on data from [GeoNames](http://download.geonames.org/export/dump/): we assign each mention of a geopolitical entity to the city with the same name and, in cases of ambiguity (e.g., Cambridge, MA vs. Cambridge, UK), select the city with the greatest population.

#### Cell 5: Function to Read GeoNames Data
This cell defines the `read_geonames` function, which reads and parses data from two text files: one for cities and one for countries. It extracts relevant information like name, population, latitude, and longitude for cities, and just the names for countries. This data will be our knowledge base for mapping place names to coordinates.
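
To make the column indices concrete, here is a minimal sketch that parses one row in the tab-separated GeoNames city layout. The row is made up (every value is a placeholder, not real GeoNames data), and `parse_city_line` is a hypothetical helper introduced only for this illustration; it uses the same column positions as `read_geonames`.

```python
import io

# One made-up row in the 19-column, tab-separated GeoNames layout.
# All values are illustrative placeholders, not real GeoNames data.
sample_row = "\t".join([
    "1",                 # 0: geonameid
    "Exampleville",      # 1: name
    "Exampleville",      # 2: asciiname
    "",                  # 3: alternatenames
    "10.5",              # 4: latitude
    "-20.25",            # 5: longitude
    "P", "PPL",          # 6-7: feature class, feature code
    "XX",                # 8: country code
    "", "", "", "", "",  # 9-13: cc2 and admin1-admin4 codes
    "12345",             # 14: population
    "", "",              # 15-16: elevation, dem
    "Etc/UTC",           # 17: timezone
    "2020-01-01",        # 18: modification date
]) + "\n"


def parse_city_line(line):
    """Return (name, population, lat, long) from one GeoNames city row."""
    cols = line.rstrip("\n").split("\t")
    return (cols[1].lower(), int(cols[14]), float(cols[4]), float(cols[5]))


# io.StringIO stands in for an open file so the sketch runs without the data files.
cities = [parse_city_line(line) for line in io.StringIO(sample_row)]
print(cities)
```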

In [None]:
# Define a function to read and parse city and country data from files.
def read_geonames(city_filename, country_filename):
    # Initialize an empty list to store city data.
    cities=[]
    # Initialize an empty list to store country names.
    countries=[]
    
    # Open the city data file (GeoNames dumps are UTF-8 encoded).
    with open(city_filename, encoding="utf-8") as file:
        # Loop through each line in the file.
        for line in file:
            # Split the tab-separated line into columns.
            cols=line.rstrip().split("\t")
            # Get the city name from the second column and convert to lowercase.
            name=cols[1].lower()
            # Get the latitude from the fifth column and convert to a float.
            lat=float(cols[4])
            # Get the longitude from the sixth column and convert to a float.
            long=float(cols[5])
            # Get the population from the 15th column and convert to an integer.
            population=int(cols[14])
 
            # Append a tuple of the extracted city data to the cities list.
            cities.append((name, population, lat, long))

    # Open the country information file (also UTF-8 encoded).
    with open(country_filename, encoding="utf-8") as file:
        # Loop through each line in the file.
        for line in file:
            # Skip header lines that start with '#'.
            if line.startswith("#"):
                continue
            # Split the tab-separated line into columns.
            cols=line.rstrip().split("\t")    
            # Get the country name from the fifth column and convert to lowercase.
            name=cols[4].lower()
            # Add the country name to the countries list.
            countries.append(name)
            
    # Return the list of cities and a set of unique country names for faster lookups.
    return cities, set(countries)

#### Cell 6: Executing the Data Loading
This code calls the `read_geonames` function with the paths to our data files. It loads the city and country information into the `cities` and `countries` variables, making them available for the rest of the script.

In [None]:
# Call the function to read the GeoNames files and load the data into variables.
# Note: The file paths "../data/..." assume the data is in a folder named 'data' one directory up.
cities, countries=read_geonames("../data/cities500.txt", "../data/countryInfo.txt")

#### Cell 7: Function to Resolve Toponyms
The `resolve_toponyms` function is the core of our location resolution logic. It takes the counter of locations found in a text and tries to map each one to geographic coordinates. To handle ambiguity (e.g., Paris, France vs. Paris, Texas), it always chooses the city with the largest population for a given name. It also skips any city whose name matches a country name, so a country mention is never mapped to a same-named city. The `doc` argument is accepted but not used by this baseline.
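
The selection rule can be seen in isolation on toy data. The sketch below is a simplified stand-in for the first loop of `resolve_toponyms`, not a replacement for it; the place names, populations, and coordinates are all invented.

```python
# Toy (placename, population, lat, long) tuples; all values are invented.
toy_cities = [
    ("springfield", 100, 1.0, 1.0),
    ("springfield", 900, 2.0, 2.0),   # largest "springfield"
    ("springfield", 500, 3.0, 3.0),
    ("georgia", 50, 4.0, 4.0),        # shares its name with a country
]
toy_countries = {"georgia"}

best = {}
for name, population, lat, long in toy_cities:
    if name in toy_countries:
        continue  # country names are excluded from the city lookup
    # Keep this entry only if it is the first or the most populous so far.
    if name not in best or population > best[name][1]:
        best[name] = (name, population, lat, long)

print(best)
```

One consequence of this design is that a mention of a country gets no coordinates at all: the lookup contains only cities, and cities that share a country's name are dropped.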

In [None]:
# Define the function to map extracted place names to coordinates.
def resolve_toponyms(locations, cities, countries, doc):
    """ Resolve a counter of GPE entities to their latitude/longitude coordinates
    Input: 
        - locations: counter mapping GPE entities to their count in a text
        - cities: list of cities containing (placename, population, lat, long) tuples
        - countries: set of country names
        - doc: spacy-processed document containing all tokens, entities, etc.
        
    Output: dict mapping each GPE entity to (lat, long) tuple """
    
    # Initialize a dictionary to store the final coordinates for entities found in the text.
    coordinates={}
    
    # Initialize a temporary dictionary to hold the most populous city for each unique place name.
    new_geo={}
    
    # Iterate through all cities from our GeoNames data.
    for (placename, population, lat, long) in cities:
        # Skip this entry if the city name is also a country name to avoid ambiguity.
        if placename in countries:
            continue
            
        # If we have already seen a city with this name...
        if placename in new_geo:
            # ...get its currently stored population.
            _, cur_pop, _, _=new_geo[placename]
            # If the new city's population is greater, replace the old entry.
            if population > cur_pop:
                new_geo[placename]=(placename, population, lat, long)
        # If this is the first time we've seen this city name, add it.
        else:
            new_geo[placename]=(placename, population, lat, long)
    
    
    # Now, iterate through the unique locations found in the input text.
    for entity in locations:
        # If the location from the text exists in our cleaned-up geo dictionary...
        if entity in new_geo:
            # ...add its latitude and longitude to our results.
            coordinates[entity]=(new_geo[entity][2], new_geo[entity][3])
    
    # Return the dictionary mapping found entities to their coordinates.
    return coordinates
    

#### Cell 8: Function to Map Toponyms
The `map_toponyms` function ties everything together. It takes raw text, processes it with spaCy to find all geopolitical entities (GPEs), counts their occurrences, resolves them to coordinates using our `resolve_toponyms` function, and finally generates and returns an interactive Folium map. The map is centered on the most frequently mentioned location that could be resolved, and each resolved location is marked with a circle whose radius (in pixels) equals its mention count.
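
The counting and centering steps can be tried without spaCy or Folium. In the sketch below, the `(text, label)` pairs are an invented stand-in for `doc.ents`, and the coordinates are rough placeholders; it is an illustration of the logic, not the cell's actual code.

```python
from collections import Counter

# (text, label) pairs standing in for spaCy's doc.ents; invented for illustration.
entities = [("Berlin", "GPE"), ("Angela", "PERSON"), ("berlin", "GPE"),
            ("Rome", "GPE"), ("Atlantis", "GPE")]

# Count only GPE entities, lowercased so "Berlin" and "berlin" are merged.
locations = Counter(text.lower() for text, label in entities if label == "GPE")

# Pretend only these names were resolved; the coordinates are rough placeholders.
coordinates = {"berlin": (52.5, 13.4), "rome": (41.9, 12.5)}

# The map center is the resolved name with the highest mention count.
center_name = max(coordinates, key=lambda name: locations[name])
print(center_name, list(coordinates[center_name]))
```

Unlike the explicit loop in the cell, which leaves `center` as `None` when nothing was resolved, `max` raises `ValueError` on an empty `coordinates` dict, so the loop form is the safer choice for arbitrary input.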

In [None]:
# Define the main function to process text and generate a map.
def map_toponyms(text, cities, countries):
    # Process the input text with our spaCy nlp object.
    doc=nlp(text)
    
    # Create a Counter object to store the frequency of each location.
    locations=Counter()
    # Iterate through all named entities found by spaCy.
    for entity in doc.ents:
        # We are only interested in entities labeled as "GPE" (Geopolitical Entity).
        if entity.label_ == "GPE":
            # Increment the count for this location (converted to lowercase).
            locations[entity.text.lower()]+=1


    # Call our previously defined function to get coordinates for the found locations.
    coordinates=resolve_toponyms(locations, cities, countries, doc)

    # Initialize variables to find the most frequent entity to center the map on.
    center=None
    maxentity=None
    maxcount=0
    # Iterate through the locations that we successfully found coordinates for.
    for entity in coordinates:
        # Check if the current entity's count is the highest so far.
        if locations[entity] > maxcount:
            # If so, update the max count.
            maxcount=locations[entity]
            # Set the map's center to this entity's coordinates.
            center=[coordinates[entity][0], coordinates[entity][1]]

            # Keep track of the entity's name.
            maxentity=entity
            
    # Create a Folium map object.
    folium_map = folium.Map(location=center,       # Center the map on the most frequent location.
                            zoom_start=3,          # Set an initial zoom level.
                            tiles="CartoDB dark_matter") # Use a dark-themed map style.

    # Add markers for each location to the map.
    for entity in coordinates:
        # Set the radius of the circle marker based on the location's frequency.
        radius=locations[entity]
        # Create the circle marker with its location, radius, and a popup label.
        marker = folium.CircleMarker(location=[coordinates[entity][0], coordinates[entity][1]],
                                     radius=radius, fill=True, popup=entity)
        # Add the marker to the map.
        marker.add_to(folium_map)
    
    # Return the completed map object.
    return folium_map

#### Cell 9: Testing with Wikipedia Articles
Let's test our method by pulling articles from Wikipedia and plotting the place names mentioned in them. Try other Wikipedia articles and visualize the places they mention. Let us all know if you find an interesting one!

#### Cell 10: Fetching Wikipedia Content
This cell uses the `wikipedia` library to download the full text content of three different articles. These will serve as our sample texts to test the mapping function.

In [None]:
# Get the Wikipedia page object for "University of California, Berkeley".
ucb = wikipedia.page("University of California, Berkeley")
# Get the Wikipedia page object for "New York City".
nyc = wikipedia.page("New York City")
# Get the Wikipedia page object for "World War II".
ww2 = wikipedia.page("World War II")

#### Cell 11: Generating the Map for New York City
Here, we call our main `map_toponyms` function, passing the content of the "New York City" Wikipedia article. This performs all the steps (NER, coordinate resolution, and map creation) and stores the final map object in the `folium_map` variable.

In [None]:
# Generate a map for the content of the New York City Wikipedia page.
folium_map=map_toponyms(nyc.content, cities, countries)

#### Cell 12: Displaying the Map
Simply referencing the `folium_map` object at the end of a cell in a Jupyter environment will cause it to be rendered as an interactive map in the output.

In [None]:
# Display the interactive map generated in the previous cell.
folium_map

#### Cell 13: Testing with a Full Book
Now let's try it with the full text of a book (Mark Twain's travelogue *Innocents Abroad*). Running this through spaCy will take a minute.

#### Cell 14: Processing the Book Text
This cell reads the entire text of Mark Twain's *Innocents Abroad* from a local file and then passes this large string to our `map_toponyms` function to generate a map of all the places mentioned.
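
spaCy refuses texts longer than `nlp.max_length` (1,000,000 characters by default). Should a book exceed that limit, one option is to split it into smaller pieces first. The sketch below is a hypothetical helper, `chunk_paragraphs`, that groups blank-line-separated paragraphs into chunks; it is not part of the notebook, and it assumes paragraphs are separated by blank lines.

```python
def chunk_paragraphs(text, max_chars):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond one.\n\nThird."
print(chunk_paragraphs(sample, max_chars=30))
```

Each chunk could then be processed separately and the per-chunk location counts added together, since `Counter` objects support `+` and `update`.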

In [None]:
# Open the text file containing the book.
with open("../data/twain_innocents_abroad.txt") as file:
    # Read the entire content of the file into the 'data' variable.
    data=file.read()
# Call the mapping function with the book's full text.
folium_map=map_toponyms(data, cities, countries)

#### Cell 15: Displaying the Book Map
As before, this cell displays the map created from the book's text.

In [None]:
# Display the interactive map for "Innocents Abroad".
folium_map

#### Cell 16: Reflection on Errors and Improvements
You can see the kinds of errors that our homemade toponym resolution makes. How would you go about improving it? What information in the text could you use to make it better? Try adapting `resolve_toponyms` to improve it.
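
As one possible starting point (a sketch under stated assumptions, not the intended solution), the countries mentioned in a text could be used as context when a city name is ambiguous. This assumes each city tuple also carries a country code, which `read_geonames` does not currently return, and that mentioned countries have been tallied by code. `pick_candidate` and all of the data below are hypothetical.

```python
from collections import Counter

# Hypothetical candidates: (placename, population, lat, long, country_code).
# The extra country_code field is NOT returned by read_geonames above;
# all values here are invented.
candidates = [
    ("cambridge", 900, 1.0, 1.0, "AA"),
    ("cambridge", 500, 2.0, 2.0, "BB"),
]
# Hypothetical tally of country codes for the countries mentioned in the text.
country_mentions = Counter({"BB": 4})


def pick_candidate(options, country_mentions):
    """Prefer cities in frequently mentioned countries; break ties by population."""
    return max(options, key=lambda c: (country_mentions[c[4]], c[1]))


print(pick_candidate(candidates, country_mentions))
```

With no country context (an empty `Counter`), the rule falls back to the largest-population baseline.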

#### Cell 17: Your Turn to Code
This empty cell is a space for you to experiment with and implement improvements to the `resolve_toponyms` function based on the challenges identified.

In [None]:
# This is an empty cell for you to write your own improved function.