# Lesson 5: Geoparsing and Sentiment Mapping in Python

**🎯 Learning Objectives:**
- Extract geographic locations from text using an advanced geoparser
- Combine location data with sentiment analysis
- Create interactive maps to visualize sentiment by location
- Apply data science techniques to literary analysis


## Overview

In this lesson, we'll take text data from the JMU subreddit and:

1. **Extract Locations**: Use a sophisticated geoparser to find and resolve geographic references
2. **Visualize on Maps**: Create interactive maps showing retrieved locations


---

In [1]:
# Import all required libraries
try:
    from geoparser import Geoparser
    from tqdm.notebook import tqdm
    import pandas as pd
    import plotly.express as px
    import mapclassify as mc
    import warnings
    
    # Suppress warnings for cleaner output
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    print("✅ All libraries imported successfully!")
    
except ImportError as e:
    print(f"❌ Missing library: {e}")
    print("Please run the installation notebook first: lesson_5_0_installation_setup.ipynb")

✅ All libraries imported successfully!


## Part 1: Initialize the Geoparser System

The geoparser is a sophisticated tool that can identify place names in text and resolve them to actual geographic coordinates.

### Step 1.1: Initialize the Geoparser

We'll create a geoparser with optimized settings for accuracy:

In [2]:
try:
    print("Initializing geoparser... (this may take a minute)")
    geo = Geoparser(
        spacy_model='en_core_web_trf',                    # Advanced language model
        transformer_model='dguzh/geo-all-distilroberta-v1', # Geographic transformer
        gazetteer='geonames'                              # Geographic database
    )
    print("✅ Geoparser initialized successfully!")
    
except Exception as e:
    print(f"❌ Error initializing geoparser: {e}")
    print("Make sure you ran the installation notebook first!")

Initializing geoparser... (this may take a minute)
✅ Geoparser initialized successfully!


**What these parameters do:**
- `spacy_model`: Advanced language processing for accurate text understanding
- `transformer_model`: Specialized AI model trained to recognize geographic references  
- `gazetteer`: Database containing millions of place names and their coordinates

### Step 1.2: Test the Geoparser

Let's test the geoparser with some sample sentences. Try changing the text below to include places you know:

In [3]:
# Test with sample sentences - feel free to modify these!
test_sentences = [
    "I traveled from New York to Richmond, Virginia last summer.",
    "The battle took place near Harrisonburg in the Shenandoah Valley.",
    "London and Paris are popular European destinations."
]

try:
    docs = geo.parse(test_sentences)
    print(f"✅ Successfully parsed {len(docs)} sentences!")
    
except Exception as e:
    print(f"❌ Error during parsing: {e}")
    print("Try restarting the kernel and running from the beginning.")

Toponym Recognition...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/66 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

✅ Successfully parsed 3 sentences!


### Step 1.3: Examine the Results

Let's see what locations the geoparser found. Each "toponym" is a place name with detailed geographic information:

In [4]:
print("🗺️  LOCATIONS FOUND:")
print("=" * 50)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    if doc.toponyms:
        for toponym in doc.toponyms:
            print(f"  📍 Found: {toponym}")
    else:
        print("  ❌ No locations found in this sentence")
        
print("\n" + "=" * 50)

🗺️  LOCATIONS FOUND:

Sentence 1: "I traveled from New York to Richmond, Virginia last summer."
  📍 Found: New York
  📍 Found: Richmond
  📍 Found: Virginia

Sentence 2: "The battle took place near Harrisonburg in the Shenandoah Valley."
  📍 Found: Harrisonburg
  📍 Found: the Shenandoah Valley

Sentence 3: "London and Paris are popular European destinations."
  📍 Found: London
  📍 Found: Paris



### Understanding the Data Structure

Each toponym contains detailed geographic information. Here's what a complete location record looks like:

```python
{
    'geonameid': 2867714,
    'name': 'Munich',
    'latitude': 48.13743,
    'longitude': 11.57549,
    'country_name': 'Germany',
    'admin1_name': 'Bavaria',        # State/Province
    'admin2_name': 'Upper Bavaria',  # County/Region
    'feature_name': 'seat of a first-order administrative division',
    'population': 1260391
}
```

We can access specific pieces of information using `.location['key_name']`:

In [5]:
# Extract specific geographic information
print("📍 DETAILED LOCATION DATA:")
print("=" * 60)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    for toponym in doc.toponyms:
        if toponym.location:
            name = toponym.location['name']
            lat = toponym.location['latitude']
            lon = toponym.location['longitude']
            country = toponym.location.get('country_name', 'Unknown')
            
            print(f"  🏛️  Place: {name}")
            print(f"  🌍 Country: {country}")
            print(f"  📐 Coordinates: ({lat:.4f}, {lon:.4f})")
            print()
        else:
            print(f"  ❌ Location '{toponym}' could not be resolved to coordinates")
            print()

📍 DETAILED LOCATION DATA:

Sentence 1: "I traveled from New York to Richmond, Virginia last summer."
  🏛️  Place: New York
  🌍 Country: United States
  📐 Coordinates: (43.0003, -75.4999)

  🏛️  Place: Richmond
  🌍 Country: United States
  📐 Coordinates: (37.5538, -77.4603)

  🏛️  Place: Virginia
  🌍 Country: United States
  📐 Coordinates: (37.5481, -77.4467)


Sentence 2: "The battle took place near Harrisonburg in the Shenandoah Valley."
  🏛️  Place: City of Harrisonburg
  🌍 Country: United States
  📐 Coordinates: (38.4496, -78.8689)

  🏛️  Place: Community Christian School of the Shenandoah Valley
  🌍 Country: United States
  📐 Coordinates: (38.9104, -78.4778)


Sentence 3: "London and Paris are popular European destinations."
  🏛️  Place: London
  🌍 Country: United Kingdom
  📐 Coordinates: (51.5085, -0.1257)

  🏛️  Place: Paris
  🌍 Country: France
  📐 Coordinates: (48.8534, 2.3488)



## Part 2: Load and Process Historical Text Data

Now we'll work with real text data that was prepared in previous lessons.

### Step 2.1: Load `jmu_reddit_toponyms.pickle`

This dataset contains sentences from the JMU subreddit.

In [12]:
try:
    df_jmu_reddit_toponyms = pd.read_pickle('data/jmu_reddit_toponyms.pickle')
    print(f"✅ Loaded {len(df_jmu_reddit_toponyms):,} sentences")
    print(f"📊 Columns: {list(df_jmu_reddit_toponyms.columns)}")

except FileNotFoundError:
    print("❌ Data file not found!")
    print("You may need to run previous lessons first to generate the sentiment data.")
    print("Or check that you're in the correct directory.")

✅ Loaded 30,005 sentences
📊 Columns: ['type', 'date', 'score', 'year_month', 'sentences', 'toponyms']


### Step 2.2: The Geoparsing Function

Here's a streamlined function that processes text and extracts geographic information:

**Key features:**
- Processes multiple sentences at once for efficiency
- Keeps every resolved location without feature filtering, including Administrative areas (Countries, States) and Population centers (Cities)
- Extracts coordinates and place information

In [15]:
def geoparse_dataframe(df, text_column='sentences'):
    """
    Extract geographic locations from text data.
    
    Args:
        df: DataFrame with text data
        text_column: Column containing the text to parse
    
    Returns:
        DataFrame with added location columns
    """
    print(f"🔍 Processing {len(df)} sentences for geographic locations...")
    
    # Convert text column to list for batch processing
    sentences = df[text_column].tolist()
    
    try:
        # Process all sentences at once (more efficient)
        docs = geo.parse(sentences)  # Parse all sentences without feature filtering
        
        # Initialize storage for results
        places, latitudes, longitudes, feature_names = [], [], [], []
        
        # Extract information from each processed document
        for doc in tqdm(docs, desc="Extracting locations"):
            doc_places = []
            doc_latitudes = []
            doc_longitudes = []
            doc_feature_names = []
            
            # Get all toponyms found in this document
            for toponym in doc.toponyms:
                if toponym.location:
                    doc_places.append(toponym.location.get('name'))
                    doc_latitudes.append(toponym.location.get('latitude'))
                    doc_longitudes.append(toponym.location.get('longitude'))
                    doc_feature_names.append(toponym.location.get('feature_name'))
            
            # Store results (empty lists if no locations found)
            places.append(doc_places)
            latitudes.append(doc_latitudes)
            longitudes.append(doc_longitudes)
            feature_names.append(doc_feature_names)
        
        # Add new columns to dataframe
        df_result = df.copy()
        df_result['place'] = places
        df_result['latitude'] = latitudes
        df_result['longitude'] = longitudes
        df_result['feature_name'] = feature_names
        
        print(f"✅ Geoparsing complete!")
        return df_result
        
    except Exception as e:
        print(f"❌ Error during geoparsing: {e}")
        return df

There are several things of note in the data. First, for some sentences the geoparser did not find a toponym, which is indicated by empty lists `[]`. This is because it is a more accurate parser and will likely produce fewer false positives. We will have to remember to remove these rows. Likewise, right now the results include both administrative areas such as countries and states (e.g., the US and Virginia) and population centers (Richmond, Harrisonburg). We will have to think about how to deal with these down the road.

**We can now run the geoparser on all the data, but expect to wait at least an hour!**

In [16]:
# Run the geoparser over the entire 'sentences' column
geoparse_results = geoparse_dataframe(df_jmu_reddit_toponyms)

🔍 Processing 30005 sentences for geographic locations...
Toponym Recognition...


Batches:   0%|          | 0/30005 [00:00<?, ?it/s]

Toponym Resolution...


Batches:   0%|          | 0/1650 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (849 > 512). Running this sequence through the model will result in indexing errors


Batches:   0%|          | 0/266 [00:00<?, ?it/s]

Extracting locations:   0%|          | 0/30005 [00:00<?, ?it/s]

✅ Geoparsing complete!
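
The token-length warning in the output above (849 > 512) indicates that at least one input was longer than the model's maximum sequence length. One way to locate such rows before a long run is to flag unusually long texts. This is a rough sketch only: whitespace-separated words are a proxy for model tokens, not the same thing, and both `find_long_texts` and the 300-word threshold are assumptions chosen for illustration.

```python
def find_long_texts(texts, max_words=300):
    """Return (index, word_count) for texts longer than max_words words."""
    long_items = []
    for i, text in enumerate(texts):
        n_words = len(text.split())
        if n_words > max_words:
            long_items.append((i, n_words))
    return long_items

sample = ["Short sentence about Harrisonburg.", "word " * 400]
print(find_long_texts(sample))
```

In this notebook the function could be applied to `df_jmu_reddit_toponyms['sentences'].tolist()` to see which rows might need splitting or trimming.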


In [22]:
# Display the updated DataFrame with new columns
geoparse_results.sample(50)

Unnamed: 0,type,date,score,year_month,sentences,toponyms,place,latitude,longitude,feature_name
5341,comment,2014-04-18 09:47:00,1,2014-04,"Fair warning though, its very expensive to get...",,[],[],[],[]
1114,post,2020-08-06 11:30:40,124,2020-08,JMU has not announced that it is starting onli...,,[],[],[],[]
3656,comment,2020-10-23 19:16:04,2,2020-10,I work on school work 7 days a week.,,[],[],[],[]
1883,comment,2020-09-04 09:50:26,43,2020-09,*** posted by [@jmu_sga](https://twitter.com/j...,,[],[],[],[]
10715,comment,2016-04-09 11:24:47,2,2016-04,You are required to log urec hours which means...,,[],[],[],[]
6983,comment,2014-06-27 16:29:32,1,2014-06,One of them already had a job offer rescinded ...,,[],[],[],[]
7783,post,2020-12-11 20:23:00,40,2020-12,It just feels like I haven't found my group ye...,,[],[],[],[]
10721,comment,2016-04-22 12:49:26,1,2016-04,I worked at the mall and during the winter my ...,[Massanutten],[],[],[],[]
7101,comment,2011-09-09 00:16:21,1,2011-09,"pretty awesome, man.",,[],[],[],[]
582,comment,2024-03-17 16:23:58,3,2024-03,They still haven’t put ac in the Village?,,[Village],[51.42727],[-0.23147],[None]


In [None]:
# Keep only the rows where at least one location was resolved
df_reddit_geoparsed = geoparse_results[geoparse_results['place'].apply(len) != 0]


---

In [27]:
df_reddit_geoparsed.sample(50)

Unnamed: 0,type,date,score,year_month,sentences,toponyms,place,latitude,longitude,feature_name
9773,post,2020-09-01 20:35:29,30,2020-09,“*Protecting the health of our Harrisonburg an...,"[Harrisonburg, Rockingham County]","[Harrisonburg, Rockingham County]","[38.44957, 42.98454]","[-78.86892, -71.08897]","[None, None]"
10019,comment,2021-12-05 14:07:30,1,2021-12,"Sorry, for this one I meant that they're joini...",[FCS],[Godwin],[42.51269],[-114.57476],[None]
8856,post,2023-04-05 13:07:58,35,2023-04,Hillside is in a decent location in terms of f...,,[Hillside],[41.87781],[-87.90284],[None]
2951,comment,2016-05-10 05:27:18,7,2016-05,"Unlike in 2001, when D-Hall was rebuilt in the...",,"[D and L Plaza Shopping Center, Hall]","[42.8995, 40.43368]","[-78.68447, -79.80199]","[None, None]"
6826,comment,2022-10-16 03:27:58,4,2022-10,I received this emergency notification on my p...,[Harrisonburg],[City of Harrisonburg],[38.44957],[-78.86892],[None]
8144,post,2020-07-07 06:22:32,39,2020-07,If JMU were to switch to an online-only format...,[the United States],[Territories of the United States],[35.27766],[-102.35403],[None]
8964,comment,2020-08-28 19:10:54,8,2020-08,I do agree if we had wide spread testing we’d ...,[America],[United States],[39.76],[-98.5],[None]
6855,post,2022-07-27 18:22:23,45,2022-07,Disabled Duke who needs help moving out of Har...,[Harrisonburg],[City of Harrisonburg],[38.44957],[-78.86892],[None]
1799,post,2020-09-08 17:05:23,105,2020-09,Harrisonburg now has more COVID cases than all...,"[Harrisonburg, New Zealand]","[Nords Ranch, New Zealand]","[33.73559, -42.0]","[-113.53132, 174.0]","[None, None]"
5964,post,2024-08-26 07:55:12,52,2024-08,Eagle Hall is like some fucked up version of a...,[Harrisonburg],"[Eagle Hall, Harrisonburg]","[38.42929, 38.44957]","[-75.96327, -78.86892]","[None, None]"


In [35]:
# Expand the list-valued columns so that each resolved place gets its own row
df_reddit_geoparsed_long = df_reddit_geoparsed.explode(['place', 'latitude', 'longitude', 'feature_name']).reset_index(drop=True)
df_reddit_geoparsed_long.sample(50)

Unnamed: 0,type,date,score,year_month,sentences,toponyms,place,latitude,longitude,feature_name
1087,comment,2020-02-16 11:56:06,4,2020-02,Potomac 2a represent,[Potomac],Potomac,40.15285,-80.52924,
248,comment,2020-10-17 10:39:45,7,2020-10,Just saw on the thread in Virginia subreddit t...,[Virginia],Virginia,37.54812,-77.44675,
243,comment,2020-10-17 08:34:59,17,2020-10,"Yeah, I heard it and felt it at Sunchase, woke...",,Sunchase Iv Resort,26.0908,-97.1652,
327,comment,2020-04-14 14:06:09,1,2020-04,"Also, Virginia’s decrim and North Carolina’s a...","[Virginia, North Carolina]",Virginia,37.54812,-77.44675,
1197,comment,2024-03-18 16:08:33,6,2024-03,I don't have an expert opinion on how much pre...,"[VA, NJ, NC, PA, NY]",New Jersey,40.16706,-74.49987,
1455,comment,2013-10-23 02:06:38,5,2013-10,That's how I got Wilson Hall cupola access bac...,,Wilson Hall,38.82917,-77.30222,
447,comment,2020-09-24 08:12:36,9,2020-09,No one killed but police violence and harassme...,[Harrisonburg],City of Harrisonburg,38.44957,-78.86892,
817,post,2020-06-16 12:58:03,49,2020-06,"To this day every December 7th, students will ...",,Carrier Library,38.43861,-78.87167,
536,comment,2023-10-31 12:04:35,38,2023-10,First floor Wilson Hall between 6-9 PM is the ...,,Wilson Hall,42.13093,-72.79315,
1602,comment,2021-02-15 22:29:37,2,2021-02,And East Salem is the furthest South.,[East Salem],University of South Alabama,30.69588,-88.17691,


## Part 3: Visualizing Spatial Relations

We can make a quick interactive map of the results using `plotly`.

### Step 3.1: Load Pre-processed Data

The complete geoparsing process took over an hour on a modern computer:

![Processing Time](geoparser_completion.png)
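
Because a full run is this slow, it can help to cache the result on disk and reload it in later sessions instead of geoparsing again. The sketch below uses the standard-library `pickle` module; `load_or_compute` and the cache path are hypothetical names, not part of this lesson's data files. With a DataFrame, `df.to_pickle(...)` and `pd.read_pickle(...)` would serve the same purpose.

```python
import os
import pickle

def load_or_compute(cache_path, compute):
    """Load a cached result if it exists; otherwise compute it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

# Hypothetical usage in this notebook:
# geoparse_results = load_or_compute(
#     "data/geoparse_results.pickle",   # hypothetical cache file
#     lambda: geoparse_dataframe(df_jmu_reddit_toponyms),
# )
```

Note that `geoparse_dataframe` returns the input DataFrame unchanged when parsing fails, so it is worth checking that the `place` column exists before trusting a cached file.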

In [37]:
# Count each place while keeping latitude and longitude
df_reddit_geoparsed_count = df_reddit_geoparsed_long.groupby('place').agg({
    'latitude': 'first',    # Keep the first latitude for each place
    'longitude': 'first',   # Keep the first longitude for each place
    'place': 'size'         # Count the occurrences
}).rename(columns={'place': 'count'}).reset_index()

# Filter for places mentioned more than 4 times
df_reddit_geoparsed_count = df_reddit_geoparsed_count[df_reddit_geoparsed_count['count'] > 4]
df_reddit_geoparsed_count.sample(50)

Unnamed: 0,place,latitude,longitude,count
68,Carrier Library,38.43861,-78.87167,6
292,Memorial,29.77303,-95.58442,5
56,Bridgeforth Stadium,38.43528,-78.87278,6
191,Harrison Hall,38.82944,-77.3025,5
467,United States,39.76,-98.5,73
464,Ukmergė,55.25,24.75,6
57,Bridgewater,41.53509,-73.36623,9
356,Pole Tavern,39.61678,-75.22907,6
199,Hillside,40.69601,-74.22866,10
331,Nova,65.65053,12.65206,14
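
One caveat with `'first'`: the same name can resolve to different coordinates in different sentences. In the sample output above, `Wilson Hall` appears at both (38.82917, -77.30222) and (42.13093, -72.79315), and grouping by name alone keeps only one of them. Here is a small sketch of counting by name and coordinates together, using plain tuples instead of the notebook's DataFrame. The `rows` list is a hypothetical stand-in with values copied from the sample output.

```python
from collections import Counter

rows = [
    ("Wilson Hall", 38.82917, -77.30222),
    ("Wilson Hall", 42.13093, -72.79315),
    ("City of Harrisonburg", 38.44957, -78.86892),
    ("City of Harrisonburg", 38.44957, -78.86892),
]

# Count each distinct (place, latitude, longitude) combination.
counts = Counter(rows)

# Collect the names that were resolved to more than one coordinate pair.
coords_per_name = {}
for place, lat, lon in counts:
    coords_per_name.setdefault(place, set()).add((lat, lon))
ambiguous = sorted(name for name, coords in coords_per_name.items() if len(coords) > 1)

print(counts.most_common(1))
print(ambiguous)
```

With pandas, grouping on `['place', 'latitude', 'longitude']` rather than `'place'` alone would give the same separation.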


In [42]:
fig = px.scatter_map(
    df_reddit_geoparsed_count,     # DataFrame with count data
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    hover_name="place",           # Show place name on hover
    size="count",                 # Marker size based on count
    hover_data={
        "count": True,              # Show count in hover
        "latitude": False,          # Hide coordinates in hover
        "longitude": False
    },
    title="Geographic Locations Found in JMU Reddit Data",
    zoom=3,                       # Start with a broad view
    height=600
)

# Update the layout to use the OpenStreetMap tile style
fig.update_layout(
    map_style="open-street-map",  # scatter_map reads `map_*` layout keys; no token needed for this style
    margin={"r":0,"t":50,"l":0,"b":0}  # Remove margins but keep space for title
)

fig.show()

## Part 4: Training a Custom Geoparser Model

You may have noticed that some locations in our map are incorrectly identified or resolved. This is a common issue when working with domain-specific text or regional data. The good news is that we can train a custom model to improve accuracy for our specific use case.

### Why Train a Custom Model?

The pre-trained geoparser works well for general text, but it may struggle with:
- **Domain-specific terminology** (academic jargon, local place names)
- **Regional variations** (local nicknames for places)
- **Context-specific disambiguation** (distinguishing between places with similar names)
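
Before investing in training, a quick sanity check can show how widespread the problem is. The sketch below flags resolved points that fall outside a rough bounding box around Virginia. The box limits are approximate values chosen for illustration, and `VA_BOX` and `in_box` are hypothetical names. The rows are copied from the sample output earlier in this lesson.

```python
# Approximate bounding box for Virginia (illustrative values, not exact borders).
VA_BOX = {"lat_min": 36.5, "lat_max": 39.5, "lon_min": -83.7, "lon_max": -75.2}

def in_box(lat, lon, box=VA_BOX):
    """Return True if the point lies inside the bounding box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lon_min"] <= lon <= box["lon_max"])

# (place, latitude, longitude) rows copied from the sample output.
resolved = [
    ("Carrier Library", 38.43861, -78.87167),
    ("Wilson Hall", 38.82917, -77.30222),
    ("Wilson Hall", 42.13093, -72.79315),
    ("Ukmergė", 55.25, 24.75),
    ("Nova", 65.65053, 12.65206),
]

suspicious = [(place, lat, lon) for place, lat, lon in resolved if not in_box(lat, lon)]
for place, lat, lon in suspicious:
    print(f"Outside Virginia box: {place} ({lat}, {lon})")
```

This only catches the most obvious errors. A point inside the box can still be the wrong place (the first `Wilson Hall` lies well east of Harrisonburg), and legitimate mentions of distant places are flagged too, so the list is best treated as a queue for manual review rather than an automatic filter.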

Let's demonstrate how to create training data and fine-tune the model.

### Step 4.1: Preparing Training Data

Training data must be formatted as a list of dictionaries, where each document contains:
- **text**: The raw text content
- **toponyms**: List of location mentions with their positions and correct location IDs

Here's the required format:

In [None]:
# Example training corpus format for JMU-specific locations
train_corpus = [
    {
        "text": "I'm studying at James Madison University in Harrisonburg, Virginia.",
        "toponyms": [
            {
                "text": "Harrisonburg",
                "start": 44,  # Starting character position
                "end": 56,    # Ending character position
                "loc_id": "4761681"  # GeoNames ID for Harrisonburg, VA
            },
            {
                "text": "Virginia",
                "start": 58,
                "end": 66,
                "loc_id": "6254928"  # GeoNames ID for Virginia state
            }
        ]
    },
    {
        "text": "The campus is near downtown Harrisonburg and the Shenandoah Valley.",
        "toponyms": [
            {
                "text": "Harrisonburg",
                "start": 28,
                "end": 40,
                "loc_id": "4761681"
            },
            {
                "text": "Shenandoah Valley",
                "start": 49,
                "end": 66,
                "loc_id": "4787534"  # GeoNames ID for Shenandoah Valley
            }
        ]
    }
]

print("✅ Training corpus format example created!")
print(f"Number of training documents: {len(train_corpus)}")
print(f"First document text: '{train_corpus[0]['text']}'")
print(f"Number of toponyms in first document: {len(train_corpus[0]['toponyms'])}")
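
Character offsets are easy to get wrong when counted by hand. The following sketch computes offsets with `str.find` and checks that each annotation's `start`/`end` slice matches its `text`. `find_span` and `check_offsets` are hypothetical helpers written for this lesson, not part of the `geoparser` library, and `example` repeats the second training document from the format shown above.

```python
def find_span(text, mention, start_from=0):
    """Return (start, end) character offsets of mention in text, or None."""
    start = text.find(mention, start_from)
    if start == -1:
        return None
    return start, start + len(mention)

def check_offsets(corpus):
    """Return a list of problems where text[start:end] != toponym text."""
    problems = []
    for doc_index, doc in enumerate(corpus):
        for toponym in doc["toponyms"]:
            actual = doc["text"][toponym["start"]:toponym["end"]]
            if actual != toponym["text"]:
                problems.append((doc_index, toponym["text"], actual))
    return problems

example = [{
    "text": "The campus is near downtown Harrisonburg and the Shenandoah Valley.",
    "toponyms": [
        {"text": "Harrisonburg", "start": 28, "end": 40, "loc_id": "4761681"},
        {"text": "Shenandoah Valley", "start": 49, "end": 66, "loc_id": "4787534"},
    ],
}]

print(find_span(example[0]["text"], "Shenandoah Valley"))
print(check_offsets(example))
```

An empty list from `check_offsets` means every annotated span lines up with its text. The check does not verify that the `loc_id` values are correct.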

### Step 4.2: Initialize the GeoparserTrainer

The `GeoparserTrainer` allows us to fine-tune existing models or train from scratch. Key parameters:

- **spacy_model**: Used for tokenization and validating annotations
- **transformer_model**: The model to be fine-tuned 
- **gazetteer**: Must match the knowledge source used for annotations

In [None]:
try:
    from geoparser import GeoparserTrainer
    
    print("Initializing GeoparserTrainer...")
    trainer = GeoparserTrainer(
        spacy_model="en_core_web_trf",                    # Same as our geoparser
        transformer_model="dguzh/geo-all-distilroberta-v1", # Model to fine-tune
        gazetteer="geonames"                              # Knowledge source
    )
    print("✅ GeoparserTrainer initialized successfully!")
    
except ImportError:
    print("❌ GeoparserTrainer not available in this version")
    print("This is a demonstration of the training process")
except Exception as e:
    print(f"❌ Error initializing trainer: {e}")
    print("This is normal - we're demonstrating the training workflow")

### Step 4.3: Training Workflow

Here's the complete workflow for training a custom model:

1. **Load annotations**: Convert training corpus to GeoDoc objects
2. **Train model**: Fine-tune the transformer model
3. **Evaluate**: Test performance on evaluation data
4. **Use custom model**: Deploy the improved model

In [None]:
# STEP 1: Load and annotate training data
print("🔧 TRAINING WORKFLOW DEMONSTRATION")
print("=" * 50)

# This is demonstration code - actual training would require more data
if 'trainer' in locals():
    print("Step 1: Loading annotations...")
    # train_docs = trainer.annotate(train_corpus)
    print("✅ Training corpus would be converted to GeoDoc objects")
    
    print("\nStep 2: Training the model...")
    # trainer.train(
    #     train_docs, 
    #     output_path="models/jmu_custom_geoparser", 
    #     epochs=3, 
    #     batch_size=8
    # )
    print("✅ Model would be fine-tuned and saved")
    
    print("\nStep 3: Evaluating performance...")
    # eval_docs = trainer.annotate(eval_corpus)
    # eval_docs = trainer.resolve(eval_docs) 
    # metrics = trainer.evaluate(eval_docs)
    print("✅ Model performance would be measured")
    
    print("\nStep 4: Using the custom model...")
    # custom_geo = Geoparser(
    #     transformer_model="models/jmu_custom_geoparser",
    #     spacy_model='en_core_web_trf',
    #     gazetteer='geonames'
    # )
    print("✅ Custom model would be loaded for improved accuracy")
    
else:
    print("⚠️  GeoparserTrainer not available - this is a demonstration")
    print("In practice, you would:")
    print("1. Collect 100+ annotated examples")
    print("2. Train for 3-5 epochs") 
    print("3. Evaluate on held-out test data")
    print("4. Deploy the improved model")

print("\n" + "=" * 50)

### Step 4.4: Evaluation Metrics

When training a custom model, you'll get these performance metrics:

- **Accuracy**: Proportion of toponyms resolved to the exact correct location
- **Accuracy@161km**: Proportion resolved within 161km (100 miles) of correct location  
- **MeanErrorDistance**: Average distance in kilometers between predicted and correct locations
- **AreaUnderTheCurve**: Distribution of error distances (lower is better)
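
To show what the distance-based metrics measure, here is a minimal sketch that computes Accuracy@161km and the mean error distance from pairs of predicted and correct coordinates. It is an illustration under stated assumptions, not the library's implementation: `haversine_km` and `distance_metrics` are hypothetical helpers, and the second pair reuses the "Nords Ranch" coordinates that one Harrisonburg mention was resolved to in the sample output.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def distance_metrics(pairs, threshold_km=161.0):
    """Compute Accuracy@threshold and mean error distance for (predicted, gold) pairs."""
    distances = [haversine_km(*pred, *gold) for pred, gold in pairs]
    within = sum(d <= threshold_km for d in distances)
    return {
        "accuracy_at_threshold": within / len(distances),
        "mean_error_km": sum(distances) / len(distances),
    }

harrisonburg = (38.44957, -78.86892)
pairs = [
    (harrisonburg, harrisonburg),              # resolved correctly
    ((33.73559, -113.53132), harrisonburg),    # resolved to "Nords Ranch"
]
metrics = distance_metrics(pairs)
print(metrics)
```

One badly misplaced point is enough to inflate the mean error, which is why the thresholded accuracy is usually reported alongside it.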

### Alternative: Using the Annotator Web App

Instead of manually creating training data, you can use the built-in annotation tool:

```bash
python -m geoparser annotator
```

This launches a web interface where you can:
- Upload your text files
- Click on location mentions to mark them
- Select the correct location from suggestions
- Export annotations in the proper format

### Tips for Creating Good Training Data

**Quality over Quantity:**
- Start with 50-100 carefully annotated examples
- Focus on problematic cases from your actual data
- Include examples of correctly resolved locations too

**Domain-Specific Examples:**
- Local place names and nicknames  
- Ambiguous locations (e.g., "Richmond" could be VA, CA, or UK)
- Institution-specific references (building names, campus locations)

**Geographic ID Sources:**
- Use [GeoNames.org](http://geonames.org) to find correct location IDs
- Search by place name to get the `geonameid`
- Verify coordinates match your intended location

**Common Issues to Address:**
- University buildings vs. city names
- State abbreviations vs. country codes  
- Historical vs. modern place names
- Colloquial names vs. official names