# Mapping Texts: Visualizing Geographic References in Literature

## Introduction

In the previous notebook, we used geoparsing to **derive a single location** for an entire document, answering the question "what place is this document about?" We compared different strategies for selecting the most representative location from multiple place mentions.

In this notebook, we take a different approach. Instead of reducing multiple locations to one, we embrace the richness of geographic references throughout a text. Our goal is to **map the geographic landscape of literature** - to visualize all the places mentioned and see patterns in how they appear.

This is particularly interesting for travel literature, where authors describe journeys through different places. We'll use the [Irchel Geoparser](https://docs.geoparser.app/) to extract and map locations from classic travel books, creating two types of visualizations:

1. **Frequency Map**: Shows which places are mentioned most often
2. **Journey Progression Map**: Shows the order in which places first appear in the narrative

These visualizations can reveal narrative patterns, help readers understand geographic scope, and provide a new way to explore literary texts.


## Setup

First, let's import the required libraries.


In [None]:
import pandas as pd
import numpy as np
import requests
import re
from pathlib import Path
from collections import Counter

# Geoparser imports
from geoparser import Geoparser
from geoparser.modules import SpacyRecognizer, SentenceTransformerResolver

# Visualization imports
import plotly.express as px
import plotly.graph_objects as go

print("Libraries imported successfully!")


## Select a Book

We'll work with classic travel literature from Project Gutenberg. Choose one of these books (or experiment with all three!):

1. **Jules Verne** - *Around the World in Eighty Days* (1873): Phileas Fogg's famous race around the globe
2. **Mark Twain** - *The Innocents Abroad* (1869): A humorous account of American tourists in Europe and the Holy Land
3. **William Makepeace Thackeray** - *Notes on a Journey from Cornhill to Grand Cairo* (1846): Travel sketches from London to Egypt


In [None]:
# Book options
BOOKS = {
    'verne': {
        'title': 'Around the World in Eighty Days',
        'author': 'Jules Verne',
        'url': 'https://www.gutenberg.org/files/103/103-0.txt'
    },
    'twain': {
        'title': 'The Innocents Abroad',
        'author': 'Mark Twain',
        'url': 'https://www.gutenberg.org/files/3176/3176-0.txt'
    },
    'thackeray': {
        'title': 'Notes on a Journey from Cornhill to Grand Cairo',
        'author': 'William Makepeace Thackeray',
        'url': 'https://www.gutenberg.org/files/1863/1863-0.txt'
    }
}

# Select book (change this to 'twain' or 'thackeray' to try other books)
selected_book = 'verne'

book_info = BOOKS[selected_book]
print(f"Selected: {book_info['title']} by {book_info['author']}")


## Download and Extract Chapters

We'll download the book from Project Gutenberg and split it into chapters. Splitting by chapters is important because:
1. It provides natural segmentation for the geoparser
2. It allows us to track when locations first appear in the narrative
3. It keeps individual text chunks manageable for processing
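
The splitting idea can be tried in isolation before running it on a full book. The following is a minimal sketch on a hypothetical miniature text (`sample_text`, `heading_pattern`, and `sample_chapters` are illustrative names, not part of the notebook). It assumes headings look like `CHAPTER I.` on their own line, which may not hold for every Project Gutenberg edition.

```python
import re

# Hypothetical miniature "book": an indented table of contents, then two chapters.
sample_text = (
    " CHAPTER I. IN WHICH A JOURNEY BEGINS\n"
    " CHAPTER II. IN WHICH IT CONTINUES\n"
    "\n"
    "CHAPTER I.\n"
    "Mr. Fogg left London.\n"
    "\n"
    "CHAPTER II.\n"
    "He arrived in Paris.\n"
)

# A heading is a line made up only of "CHAPTER <roman numeral>." starting at column 0.
# re.MULTILINE lets ^ and $ match at line boundaries, so indented TOC entries
# and lines carrying a description after the numeral are skipped.
heading_pattern = re.compile(r"^CHAPTER [IVXLCDM]+\.?[ \t\r]*$", re.MULTILINE)

# Each chapter runs from its heading to the next heading (or to the end of the text).
starts = [m.start() for m in heading_pattern.finditer(sample_text)]
ends = starts[1:] + [len(sample_text)]
sample_chapters = [sample_text[s:e].strip() for s, e in zip(starts, ends)]

for number, chapter in enumerate(sample_chapters, start=1):
    print(number, repr(chapter))
```

Slicing between consecutive heading offsets keeps each heading with its chapter and discards everything before the first heading, including the table of contents.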


In [None]:
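# Download the selected book from Project Gutenberg
url = book_info['url']
response = requests.get(url, timeout=60)
response.raise_for_status()  # Fail early on HTTP errors
response.encoding = 'utf-8'  # Gutenberg "-0.txt" files are UTF-8 encoded
full_text = response.text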

# End marker
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
end_idx = full_text.find(end_marker)
if end_idx != -1:
    full_text = full_text[:end_idx]

# Split into chapters
# Actual chapter headings: "CHAPTER I." on their own line (no leading space)
# TOC entries: " CHAPTER I. IN WHICH..." (leading space, description on same line)
lines = full_text.split('\n')
chapter_positions = []
current_pos = 0

for line in lines:
    stripped = line.rstrip()
    # Match lines that begin with "CHAPTER [ROMAN]" at column 0;
    # re.match anchors at the start, so indented TOC entries are skipped
    if re.match(r'CHAPTER [IVXLCDM]+\b', stripped):
        chapter_positions.append(current_pos)
    current_pos += len(line) + 1

# Extract chapter texts
chapter_texts = []
for i in range(len(chapter_positions)):
    if i < len(chapter_positions) - 1:
        chapter_text = full_text[chapter_positions[i]:chapter_positions[i+1]].strip()
    else:
        chapter_text = full_text[chapter_positions[i]:].strip()
    chapter_texts.append(chapter_text)

print(f"Downloaded book: {len(full_text):,} characters")
print(f"Split into {len(chapter_texts)} chapters")
print(f"\nFirst chapter preview (first 200 characters):")
print(chapter_texts[0][:200] + "...")


## Initialize the Geoparser

We'll use the same geoparser configuration as in the previous notebook:
- [SpacyRecognizer](https://docs.geoparser.app/en/latest/guides/modules.html#spacyrecognizer) for identifying place names
- [SentenceTransformerResolver](https://docs.geoparser.app/en/latest/guides/modules.html#sentencetransformerresolver) for linking them to coordinates


In [None]:
print("Initializing geoparser components...")

# Initialize recognizer (identifies place names in text)
recognizer = SpacyRecognizer(model_name="en_core_web_sm")

# Initialize resolver (links place names to coordinates)
resolver = SentenceTransformerResolver(min_similarity=0.5)

# Create geoparser instance
geoparser = Geoparser(recognizer=recognizer, resolver=resolver)

print("Geoparser ready!")


## Process Chapters with Geoparser

Now we'll process each chapter to extract place names and their coordinates. We pass all chapters as a list to the geoparser, which processes them efficiently while keeping them separate so we can track which chapter each location appears in.


In [None]:
print(f"Processing {len(chapter_texts)} chapters...")
print("This may take several minutes depending on the book length...\n")

# Process all chapters - pass as a list, not concatenated
documents = geoparser.parse(chapter_texts)

print(f"\nProcessed {len(documents)} chapters!")

# Quick summary
total_toponyms = sum(len([t for t in doc.toponyms if t.location is not None]) for doc in documents)
print(f"Total resolved toponyms across all chapters: {total_toponyms}")


## Extract and Structure Toponym Data

Now we'll extract all resolved toponyms from each chapter, keeping track of:
- The location name and coordinates
- Which chapter it appears in
- How many times it's mentioned throughout the book

We'll create a comprehensive dataset that we can use for both visualizations.


In [None]:
# Extract all toponyms with chapter information
all_locations = []

for chapter_idx, doc in enumerate(documents):
    chapter_num = chapter_idx + 1  # 1-indexed for human readability
    
    # Get all resolved toponyms in this chapter
    toponyms = [t for t in doc.toponyms if t.location is not None]
    
    for toponym in toponyms:
        location = toponym.location
        
        # Record the resolved gazetteer entry alongside the original mention;
        # (name, lat, lon) later serves as the key that identifies a unique place
        location_data = {
            'name': location.data.get('name'),
            'lat': location.data.get('latitude'),
            'lon': location.data.get('longitude'),
            'country': location.data.get('country_name'),
            'country_code': location.data.get('country_code'),
            'feature_class': location.data.get('feature_class'),
            'feature_code': location.data.get('feature_code'),
            'chapter': chapter_num,
            'toponym_text': toponym.text
        }
        
        all_locations.append(location_data)

# Create DataFrame
locations_df = pd.DataFrame(all_locations)

print(f"Extracted {len(locations_df)} toponym mentions")
print(f"Unique locations: {locations_df.groupby(['name', 'lat', 'lon']).ngroups}")
print("\nFirst few mentions:")
print(locations_df[['chapter', 'toponym_text', 'name', 'country']].head(10))


## Prepare Data for Visualization

For our maps, we need to aggregate the data:
1. **For the frequency map**: Count how many times each location appears
2. **For the progression map**: Track the first chapter where each location appears
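
To make the two aggregates concrete, here is a minimal plain-Python sketch on hypothetical data (the `mentions` list and its coordinate values are illustrative only). It computes the same two quantities that the pandas `groupby` in the next cell derives, under the assumption that a place is identified by its `(name, lat, lon)` triple.

```python
from collections import Counter

# Hypothetical mentions as (name, lat, lon, chapter) tuples; values are illustrative.
mentions = [
    ("London", 51.51, -0.13, 1),
    ("Paris", 48.85, 2.35, 2),
    ("London", 51.51, -0.13, 3),
    ("London", 51.51, -0.13, 3),
]

frequency = Counter()   # how often each place is mentioned
first_chapter = {}      # earliest chapter in which each place appears

for name, lat, lon, chapter in mentions:
    key = (name, lat, lon)
    frequency[key] += 1
    first_chapter[key] = min(chapter, first_chapter.get(key, chapter))

# List places in order of first appearance.
for key in sorted(first_chapter, key=first_chapter.get):
    print(key[0], "first chapter:", first_chapter[key], "mentions:", frequency[key])
```

Repeated mentions within one chapter raise the frequency but leave the first chapter unchanged, which is why the two maps can tell different stories about the same place.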


In [None]:
# Group by unique location (resolved name + coordinates)
# Spelling variants in the text that resolve to the same gazetteer entry are counted together
location_groups = locations_df.groupby(['name', 'lat', 'lon']).agg({
    'country': 'first',
    'country_code': 'first',
    'chapter': ['min', 'count']  # min = first mention, count = frequency
}).reset_index()

# Flatten column names
location_groups.columns = ['name', 'lat', 'lon', 'country', 'country_code', 'first_chapter', 'frequency']

# Sort by first appearance
location_groups = location_groups.sort_values('first_chapter')

print(f"Unique locations: {len(location_groups)}")
print("\nMost frequently mentioned locations:")
print(location_groups.nlargest(10, 'frequency')[['name', 'country', 'frequency', 'first_chapter']])


## Map 1: Frequency-Based Visualization

This map shows **which places are mentioned most often** in the book. Each marker's diameter scales with the number of times that location is mentioned, with a minimum size so that rarely mentioned places stay visible. This gives a rough indication of which places are most prominent in the narrative.

**Interactive features:**
- Hover over markers to see location name, country code, and frequency count
- Zoom and pan to explore different regions
- Transparent markers allow you to see overlapping locations
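
The marker scaling can be checked with a little arithmetic. This sketch assumes that, with `sizemode='diameter'`, Plotly draws a marker with a diameter of roughly `size / sizeref` pixels; the `frequencies` values are hypothetical, and the effect of `sizemin` on very small markers is not modeled here.

```python
# Hypothetical mention counts for four places.
frequencies = [1, 5, 20, 80]

target_max_px = 40  # desired diameter of the largest marker, matching the "/ 40" in the map cell
sizeref = max(frequencies) / target_max_px

# Assumed scaling for sizemode='diameter': diameter in pixels = size / sizeref.
diameters = [f / sizeref for f in frequencies]
print(diameters)
```

Dividing the maximum frequency by the target diameter keeps the largest marker at about 40 pixels regardless of how long the book is, so maps of different books remain comparable in layout.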


In [None]:
# Create hover text
location_groups['hover_text'] = (
    '<b>' + location_groups['name'] + '</b><br>' +
    'Country: ' + location_groups['country_code'].fillna('Unknown') + '<br>' +
    'Mentions: ' + location_groups['frequency'].astype(str)
)

# Create the frequency map
fig1 = go.Figure()

fig1.add_trace(go.Scattergeo(
    lon=location_groups['lon'],
    lat=location_groups['lat'],
    text=location_groups['hover_text'],
    mode='markers',
    marker=dict(
        size=location_groups['frequency'],
        sizemode='diameter',
        sizeref=location_groups['frequency'].max() / 40,  # Scale factor for visibility
        sizemin=4,
        color='steelblue',
        opacity=0.6,
        line=dict(width=0.5, color='white')
    ),
    hovertemplate='%{text}<extra></extra>'
))

fig1.update_layout(
    title=dict(
        text=f'Location Frequency Map: {book_info["title"]}<br><sub>Marker size indicates number of mentions</sub>',
        x=0.5,
        xanchor='center'
    ),
    geo=dict(
        projection_type='natural earth',
        showland=True,
        landcolor='rgb(243, 243, 243)',
        coastlinecolor='rgb(204, 204, 204)',
        showlakes=True,
        lakecolor='rgb(220, 230, 240)',
        showcountries=True,
        countrycolor='rgb(204, 204, 204)'
    ),
    height=600,
    margin=dict(l=0, r=0, t=80, b=0)
)

fig1.show()

print(f"\nMap shows {len(location_groups)} unique locations")
print(f"Total mentions: {location_groups['frequency'].sum()}")


## Map 2: Journey Progression Visualization

This map shows **when places first appear** in the narrative. The color of each marker represents the chapter number where that location is first mentioned, creating a visual representation of the journey's progression through the book.

**Interactive features:**
- Hover over markers to see location name, country code, and first mention chapter
- Colors follow Plotly's `deep` colorscale, running from light yellow-green (early chapters) to dark blue-purple (later chapters)
- All markers are the same size to focus on temporal sequence rather than frequency
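
As a rough illustration of how chapter numbers become colors, the sketch below assumes the colorscale is applied linearly between `cmin` and `cmax`, so each chapter number maps to a position between 0 (start of the scale) and 1 (end of the scale). The `first_chapters` values are hypothetical.

```python
# Hypothetical first-mention chapters for four places.
first_chapters = [1, 4, 10, 37]

cmin, cmax = min(first_chapters), max(first_chapters)
span = (cmax - cmin) or 1  # avoid division by zero if every place shares one chapter

# Position of each place along the colorscale: 0.0 = earliest, 1.0 = latest.
positions = [(c - cmin) / span for c in first_chapters]
print(positions)
```

Setting `cmin` and `cmax` from the data, as the map cell does, uses the full range of the colorscale even when the first and last chapters mention no new places.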


In [None]:
# Create hover text for progression map
location_groups['hover_text_progression'] = (
    '<b>' + location_groups['name'] + '</b><br>' +
    'Country: ' + location_groups['country_code'].fillna('Unknown') + '<br>' +
    'First mentioned in Chapter: ' + location_groups['first_chapter'].astype(str)
)

# Create the progression map
fig2 = go.Figure()

fig2.add_trace(go.Scattergeo(
    lon=location_groups['lon'],
    lat=location_groups['lat'],
    text=location_groups['hover_text_progression'],
    mode='markers',
    marker=dict(
        size=12,  # Fixed size for all markers
        color=location_groups['first_chapter'],
        colorscale='deep',  # Sequential colorscale: light for early chapters, dark for later ones
        cmin=location_groups['first_chapter'].min(),
        cmax=location_groups['first_chapter'].max(),
        showscale=False,  # Hide the colorbar - chapter info available in hover
        opacity=0.8,
        line=dict(width=0.5, color='white')
    ),
    hovertemplate='%{text}<extra></extra>'
))

fig2.update_layout(
    title=dict(
        text=f'Journey Progression Map: {book_info["title"]}<br><sub>Color indicates chapter of first mention</sub>',
        x=0.5,
        xanchor='center'
    ),
    geo=dict(
        projection_type='natural earth',
        showland=True,
        landcolor='rgb(243, 243, 243)',
        coastlinecolor='rgb(204, 204, 204)',
        showlakes=True,
        lakecolor='rgb(220, 230, 240)',
        showcountries=True,
        countrycolor='rgb(204, 204, 204)'
    ),
    height=600,
    margin=dict(l=0, r=0, t=80, b=0)
)

fig2.show()

print(f"\nMap shows progression from Chapter {location_groups['first_chapter'].min()} to Chapter {location_groups['first_chapter'].max()}")


## Conclusion: From Document Geolocation to Text Mapping

This notebook demonstrates a different application of geoparsing compared to the previous notebook:

### Previous Notebook: Document Geolocation
- **Goal**: Derive a single location for each document
- **Challenge**: Choose the "best" location from multiple mentions
- **Use case**: Spatial search, filtering documents by location
- **Strategy**: We compared first mention, centroid, and domain-specific rules

### This Notebook: Text Mapping
- **Goal**: Visualize all geographic references in a text
- **Challenge**: Present multiple locations meaningfully
- **Use case**: Understanding geographic scope, exploring narrative journeys, analyzing travel literature
- **Strategy**: We created frequency and progression visualizations

### Key Insights

1. **Same technology, different applications**: The geoparser extracts the same information, but we use it differently depending on our goals

2. **Visualization reveals patterns**: The maps show not just where places are, but how authors structure their narratives geographically

3. **Multiple perspectives matter**: The frequency map and progression map tell different stories about the same text

4. **Literature as geographic data**: Travel literature, historical documents, and narratives become explorable through a spatial lens

### Try It Yourself

Change the `selected_book` variable in the "Select a Book" cell and rerun the notebook to explore different books and see how their geographic patterns differ!
