## Training a Custom Geoparser Model

You may have noticed that some locations on our map are incorrectly identified or resolved. This is common when working with domain-specific text or regional data. We can improve accuracy for our use case by training a custom model.

### Why Train a Custom Model?

The pre-trained geoparser works well for general text, but it may struggle with:
- **Domain-specific terminology** (academic jargon, institution-specific references)
- **Regional variations** (local place names and nicknames)
- **Context-specific disambiguation** (distinguishing between places that share a name)

Let's demonstrate how to create training data and fine-tune the model.

### Step 1: Preparing Training Data

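Training data must be formatted as a list of dictionaries, where each dictionary represents one document and contains:
- **text**: The raw text content
- **toponyms**: A list of location mentions, each with its `text`, its `start` and `end` character offsets, and the correct location ID (`loc_id`). In the example below, `end` is exclusive, so `text[start:end]` returns the mention itself.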

Here's the required format:

```python
# Example training corpus format for JMU-specific locations
train_corpus = [
    {
        "text": "I'm studying at James Madison University in Harrisonburg, Virginia.",
        "toponyms": [
            {
                "text": "Harrisonburg",
                "start": 44,  # Starting character position
                "end": 56,    # Ending character position
                "loc_id": "4761681"  # GeoNames ID for Harrisonburg, VA
            },
            {
                "text": "Virginia",
                "start": 58,
                "end": 66,
                "loc_id": "6254928"  # GeoNames ID for Virginia state
            }
        ]
    },
    {
        "text": "The campus is near downtown Harrisonburg and the Shenandoah Valley.",
        "toponyms": [
            {
                "text": "Harrisonburg",
                "start": 28,
                "end": 40,
                "loc_id": "4761681"
            },
            {
                "text": "Shenandoah Valley",
                "start": 49,
                "end": 66,
                "loc_id": "4787534"  # GeoNames ID for Shenandoah Valley
            }
        ]
    }
]

print("✅ Training corpus format example created!")
print(f"Number of training documents: {len(train_corpus)}")
print(f"First document text: '{train_corpus[0]['text']}'")
print(f"Number of toponyms in first document: {len(train_corpus[0]['toponyms'])}")
```
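
Counting character offsets by hand is error-prone, so it can help to compute and check them programmatically. The sketch below is not part of the geoparser library; `make_toponym` and `find_offset_errors` are hypothetical helper names introduced here for illustration, and they assume the dictionary format shown above with an exclusive `end` offset.

```python
def make_toponym(text, mention, loc_id, occurrence=0):
    """Build a toponym dict by locating `mention` inside `text`.

    `occurrence` selects which match to use (0 = first) when the
    mention appears more than once in the document.
    """
    start = -1
    for _ in range(occurrence + 1):
        start = text.find(mention, start + 1)
        if start == -1:
            raise ValueError(f"{mention!r} not found in text")
    return {
        "text": mention,
        "start": start,
        "end": start + len(mention),  # exclusive end, as in Python slicing
        "loc_id": loc_id,
    }


def find_offset_errors(corpus):
    """Return (doc_index, toponym) pairs whose offsets do not match their text."""
    errors = []
    for i, doc in enumerate(corpus):
        for toponym in doc["toponyms"]:
            span = doc["text"][toponym["start"]:toponym["end"]]
            if span != toponym["text"]:
                errors.append((i, toponym))
    return errors


text = "I'm studying at James Madison University in Harrisonburg, Virginia."
doc = {
    "text": text,
    "toponyms": [
        make_toponym(text, "Harrisonburg", "4761681"),
        make_toponym(text, "Virginia", "6254928"),
    ],
}

print(doc["toponyms"][0]["start"], doc["toponyms"][0]["end"])  # 44 56
print(find_offset_errors([doc]))  # []
```

Running `find_offset_errors` over the full corpus before training is a cheap way to catch annotations that drifted after the text was edited.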

### Step 2: Initialize the GeoparserTrainer

The `GeoparserTrainer` allows us to fine-tune existing models or train from scratch. Key parameters:

- **spacy_model**: Used for tokenization and validating annotations
- **transformer_model**: The model to be fine-tuned 
- **gazetteer**: Must match the knowledge source used for annotations

```python
try:
    from geoparser import GeoparserTrainer

    print("Initializing GeoparserTrainer...")
    trainer = GeoparserTrainer(
        spacy_model="en_core_web_trf",                       # Same as our geoparser
        transformer_model="dguzh/geo-all-distilroberta-v1",  # Model to fine-tune
        gazetteer="geonames"                                 # Knowledge source
    )
    print("✅ GeoparserTrainer initialized successfully!")

except ImportError:
    print("❌ GeoparserTrainer not available in this version")
    print("This is a demonstration of the training process")
except Exception as e:
    print(f"❌ Error initializing trainer: {e}")
    print("This is normal - we're demonstrating the training workflow")
```
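
If initialization fails, a frequent cause is a missing package or spaCy model. The following sketch uses only the standard library to report which top-level packages cannot be imported. It assumes the spaCy model was installed as a regular Python package (which is how `python -m spacy download` installs models); `missing_packages` is a hypothetical helper name, not a geoparser function.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Packages the trainer cell above relies on
required = ["geoparser", "en_core_web_trf"]
missing = missing_packages(required)

if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

This check only confirms that the packages can be found; it does not verify versions or that the gazetteer data has been downloaded.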

### Step 3: Training Workflow

Here's the complete workflow for training a custom model:

1. **Load annotations**: Convert training corpus to GeoDoc objects
2. **Train model**: Fine-tune the transformer model
3. **Evaluate**: Test performance on evaluation data
4. **Use custom model**: Deploy the improved model

```python
# Demonstrate the full training workflow (the actual calls are commented out)
print("🔧 TRAINING WORKFLOW DEMONSTRATION")
print("=" * 50)

# This is demonstration code - actual training would require more data
if 'trainer' in locals():
    print("Step 1: Loading annotations...")
    # train_docs = trainer.annotate(train_corpus)
    print("✅ Training corpus would be converted to GeoDoc objects")

    print("\nStep 2: Training the model...")
    # trainer.train(
    #     train_docs,
    #     output_path="models/jmu_custom_geoparser",
    #     epochs=3,
    #     batch_size=8
    # )
    print("✅ Model would be fine-tuned and saved")

    print("\nStep 3: Evaluating performance...")
    # eval_docs = trainer.annotate(eval_corpus)
    # eval_docs = trainer.resolve(eval_docs)
    # metrics = trainer.evaluate(eval_docs)
    print("✅ Model performance would be measured")

    print("\nStep 4: Using the custom model...")
    # custom_geo = Geoparser(
    #     transformer_model="models/jmu_custom_geoparser",
    #     spacy_model='en_core_web_trf',
    #     gazetteer='geonames'
    # )
    print("✅ Custom model would be loaded for improved accuracy")

else:
    print("⚠️  GeoparserTrainer not available - this is a demonstration")
    print("In practice, you would:")
    print("1. Collect 100+ annotated examples")
    print("2. Train for 3-5 epochs")
    print("3. Evaluate on held-out test data")
    print("4. Deploy the improved model")

print("\n" + "=" * 50)
```
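
The evaluation step refers to an `eval_corpus` that the training step must never see. One simple way to obtain it is to hold out a random fraction of the annotated documents. This is a minimal sketch in plain Python; `split_corpus`, the 20% default, and the fixed seed are assumptions chosen for illustration, not part of the geoparser API.

```python
import random


def split_corpus(corpus, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the corpus and split it into (train, eval) lists."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    if len(corpus) < 2:
        raise ValueError("need at least two documents to split")
    docs = list(corpus)                 # leave the caller's list untouched
    random.Random(seed).shuffle(docs)   # fixed seed keeps the split reproducible
    n_eval = max(1, round(len(docs) * eval_fraction))
    return docs[n_eval:], docs[:n_eval]


# Placeholder documents standing in for a real annotated corpus
example_corpus = [{"text": f"Document {i}", "toponyms": []} for i in range(10)]
train_split, eval_split = split_corpus(example_corpus)
print(len(train_split), len(eval_split))  # 8 2
```

Splitting at the document level, rather than at the toponym level, avoids leaking sentences from an evaluation document into training.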

### Step 4: Evaluation Metrics

When you evaluate a custom model, you get the following performance metrics. For the two accuracy metrics, higher is better; for the other two, lower is better.

- **Accuracy**: Proportion of toponyms resolved to exactly the correct gazetteer entry
- **Accuracy@161km**: Proportion resolved within 161 km (100 miles) of the correct location
- **MeanErrorDistance**: Average distance in kilometers between predicted and correct locations
- **AreaUnderTheCurve**: A summary of the whole distribution of error distances rather than a single cutoff (lower is better)
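
To make the distance-based metrics concrete, here is an independent sketch of how they can be computed from predicted and correct coordinates. It is not the library's implementation. The area-under-the-curve formula (trapezoidal area under the sorted log error distances, normalized by the log of roughly half the Earth's circumference) follows a common definition from the geocoding evaluation literature and is an assumption here. Exact-match accuracy is omitted because it compares location IDs rather than distances.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_ERROR_KM = 20039.0  # assumed maximum error: about half the Earth's circumference


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def summarize_errors(distances_km):
    """Compute distance-based metrics from a list of error distances (km)."""
    n = len(distances_km)
    if n == 0:
        raise ValueError("no error distances given")
    log_errors = sorted(math.log(d + 1) for d in distances_km)
    if n == 1:
        area = log_errors[0]
    else:
        # Trapezoidal rule over the sorted log errors, scaled to unit width
        area = sum((a + b) / 2 for a, b in zip(log_errors, log_errors[1:])) / (n - 1)
    return {
        "Accuracy@161km": sum(d <= 161 for d in distances_km) / n,
        "MeanErrorDistance": sum(distances_km) / n,
        "AreaUnderTheCurve": area / math.log(MAX_ERROR_KM),
    }


# One degree of longitude along the equator is about 111 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
print(summarize_errors([0.0, 100.0, 300.0]))
```

Because the area metric works on log-scaled distances, a few very large errors affect it less than they affect the mean error distance.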

### Alternative: Using the Annotator Web App

Instead of manually creating training data, you can use the built-in annotation tool:

```bash
python -m geoparser annotator
```

This launches a web interface where you can:
- Upload your text files
- Click on location mentions to mark them
- Select the correct location from suggestions
- Export annotations in the proper format

### Tips for Creating Good Training Data

**Quality over Quantity:**
- Start with 50-100 carefully annotated examples
- Focus on problematic cases from your actual data
- Include examples of correctly resolved locations too

**Domain-Specific Examples:**
- Local place names and nicknames  
- Ambiguous locations (e.g., "Richmond" could be VA, CA, or UK)
- Institution-specific references (building names, campus locations)

**Geographic ID Sources:**
- Use [GeoNames.org](http://geonames.org) to find correct location IDs
- Search by place name to get the `geonameid`
- Verify coordinates match your intended location
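
Looking up IDs can also be scripted against the GeoNames search web service. The sketch below builds a `searchJSON` request and extracts the fields needed to pick a candidate. It assumes you have registered a free GeoNames account with web services enabled; the helper names and the `your_username` placeholder are illustrative. The network call is left commented out so the block runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "http://api.geonames.org/searchJSON"


def build_search_url(place_name, username, max_rows=5):
    """Build a GeoNames search URL for a place name."""
    query = urlencode({"q": place_name, "maxRows": max_rows, "username": username})
    return f"{SEARCH_ENDPOINT}?{query}"


def parse_candidates(payload):
    """Extract ID, name, region, and coordinates from a search response."""
    return [
        {
            "loc_id": str(item["geonameId"]),  # the training format stores IDs as strings
            "name": item.get("name"),
            "admin1": item.get("adminName1"),
            "country": item.get("countryCode"),
            "lat": float(item["lat"]),
            "lon": float(item["lng"]),
        }
        for item in payload.get("geonames", [])
    ]


def search_geonames(place_name, username, max_rows=5):
    """Query GeoNames and return a list of candidate locations."""
    url = build_search_url(place_name, username, max_rows)
    with urlopen(url, timeout=10) as response:
        return parse_candidates(json.load(response))


print(build_search_url("Harrisonburg", "your_username"))

# Requires network access and your own GeoNames username:
# for candidate in search_geonames("Harrisonburg", username="your_username"):
#     print(candidate)
```

Printing the region and coordinates next to each ID makes it easier to confirm that a candidate is the intended place before copying its `loc_id` into the training data.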

**Common Issues to Address:**
- University buildings vs. city names
- State abbreviations vs. country codes  
- Historical vs. modern place names
- Colloquial names vs. official names