# Lesson 4.2: Geoparsing in Python


## A big thank you!
This lesson is made possible by the [geoparser](https://github.com/dguzh/geoparser) library created by Diego Gomes. 

## Overview

In this lesson, we'll take text data we scraped from Reddit and:

1. **Extract Locations**: Use a sophisticated geoparser to find and resolve geographic references
2. **Visualize on Maps**: Create interactive maps showing retrieved locations


---

In [9]:
# Import all required libraries
try:
    from geoparser import Geoparser
    from tqdm.notebook import tqdm
    import pandas as pd
    import plotly.express as px
    import mapclassify as mc
    import warnings
    
    # Suppress warnings for cleaner output
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    print("✅ All libraries imported successfully!")
    
except ImportError as e:
    print(f"❌ Missing library: {e}")
    print("Please run the installation notebook first: lesson_5_0_installation_setup.ipynb")

✅ All libraries imported successfully!


## Part 1: Initialize the Geoparser System

The geoparser identifies place names in text and resolves them to geographic coordinates. Here **resolve** means working out which real-world place, and therefore which coordinates, a name refers to. The technical term for this is **toponym resolution**, also called **toponym disambiguation**. For example, we found Harrisonburg in our text, but we don't necessarily know which Harrisonburg is meant, since there is also a Harrisonburg in Louisiana. During toponym resolution the geoparser makes an educated guess as to which location is the right one. The guess can draw on a number of signals, including the surrounding context, other places in the corpus, and even the size of the town as recorded in a gazetteer. Harrisonburg, LA has only a few hundred residents, so it is less likely that someone is talking about that place.

Google Maps uses a similar technique to work out which location you mean when you type a name into the search box. Note that when you search for "London", London, UK is the first hit even though London, KY is closer.
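
The idea of an "educated guess" can be sketched with a toy example. This is only an illustration of the reasoning described above, not how the `geoparser` library works internally: the `CANDIDATES` table, the `resolve` function, and the population figures are all hypothetical and chosen for demonstration.

```python
# Toy gazetteer: several candidate places that share one name.
# The population values are rough illustrative numbers, not gazetteer data.
CANDIDATES = {
    "Harrisonburg": [
        {"name": "Harrisonburg", "admin1": "Virginia", "population": 50_000},
        {"name": "Harrisonburg", "admin1": "Louisiana", "population": 300},
    ]
}


def resolve(toponym, context=""):
    """Pick a candidate: prefer one whose state is named in the context,
    otherwise fall back to the most populous candidate."""
    candidates = CANDIDATES.get(toponym, [])
    if not candidates:
        return None  # The name is not in our toy gazetteer

    # Signal 1: surrounding context mentions the state
    for candidate in candidates:
        if candidate["admin1"].lower() in context.lower():
            return candidate

    # Signal 2: larger places are more likely to be mentioned
    return max(candidates, key=lambda c: c["population"])


print(resolve("Harrisonburg")["admin1"])
print(resolve("Harrisonburg", "We drove through Louisiana to Harrisonburg.")["admin1"])
```

With no context the sketch falls back to the larger town, and with "Louisiana" in the sentence it switches to the smaller one. A real geoparser weighs such signals with a trained model rather than hand-written rules.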

### Step 1.1: Initialize the Geoparser

We'll create a geoparser with optimized settings for accuracy:

In [2]:
try:
    print("Initializing geoparser... (this may take a minute)")
    geo = Geoparser(
        spacy_model='en_core_web_trf',                    # Advanced language model
        transformer_model='dguzh/geo-all-distilroberta-v1', # Geographic transformer
        gazetteer='geonames'                              # Geographic database
    )
    print("✅ Geoparser initialized successfully!")
    
except Exception as e:
    print(f"❌ Error initializing geoparser: {e}")
    print("Make sure you ran the installation notebook first!")

Initializing geoparser... (this may take a minute)
❌ Error initializing geoparser: name 'Geoparser' is not defined
Make sure you ran the installation notebook first!


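**What these parameters do:**
- `spacy_model`: The spaCy language model that reads the text and recognizes place names (`en_core_web_trf` is spaCy's transformer-based English pipeline)
- `transformer_model`: A transformer model specialized for geographic text, used to match the place names found in the text to gazetteer entries
- `gazetteer`: The geographic database (here GeoNames) containing millions of place names and their coordinates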

### Step 1.2: Test the Geoparser

Let's test the geoparser with some sample sentences. Try changing the text below to include places you know:

In [11]:
# Test with sample sentences - feel free to modify these!
test_sentences = [
    "I traveled from New York to Richmond, Virginia last summer.",
    "The battle took place near Harrisonburg in the Shenandoah Valley.",
    "London and Paris are popular European destinations."
]

try:
    docs = geo.parse(test_sentences)
    print(f"✅ Successfully parsed {len(docs)} sentences!")
    
except Exception as e:
    print(f"❌ Error during parsing: {e}")
    print("Try restarting the kernel and running from the beginning.")

Toponym Recognition...


Batches: 100%|██████████| 3/3 [00:00<00:00, 29.04it/s]

Toponym Resolution...



Batches: 100%|██████████| 66/66 [00:06<00:00,  9.60it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 14.50it/s]

✅ Successfully parsed 3 sentences!





### Step 1.3: Examine the Results

Let's see what locations the geoparser found. Each "toponym" is a place name with detailed geographic information:

In [4]:
print("🗺️  LOCATIONS FOUND:")
print("=" * 50)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    if doc.toponyms:
        for toponym in doc.toponyms:
            print(f"  📍 Found: {toponym}")
    else:
        print("  ❌ No locations found in this sentence")
        
print("\n" + "=" * 50)

🗺️  LOCATIONS FOUND:

Sentence 1: "I traveled from New York to Richmond, Virginia last summer."
  📍 Found: New York
  📍 Found: Richmond
  📍 Found: Virginia

Sentence 2: "The battle took place near Harrisonburg in the Shenandoah Valley."
  📍 Found: Harrisonburg
  📍 Found: the Shenandoah Valley

Sentence 3: "London and Paris are popular European destinations."
  📍 Found: London
  📍 Found: Paris



### Understanding the Data Structure

Each toponym contains detailed geographic information. Here's what a complete location record looks like:

```python
{
    'geonameid': 2867714,
    'name': 'Munich',
    'latitude': 48.13743,
    'longitude': 11.57549,
    'country_name': 'Germany',
    'admin1_name': 'Bavaria',        # State/Province
    'admin2_name': 'Upper Bavaria',  # County/Region
    'feature_name': 'seat of a first-order administrative division',
    'population': 1260391
}
```

We can access specific pieces of information using `.location['key_name']`:

In [5]:
# Extract specific geographic information
print("📍 DETAILED LOCATION DATA:")
print("=" * 60)

for i, doc in enumerate(docs):
    print(f"\nSentence {i+1}: \"{test_sentences[i]}\"")
    
    for toponym in doc.toponyms:
        if toponym.location:
            name = toponym.location['name']
            lat = toponym.location['latitude']
            lon = toponym.location['longitude']
            country = toponym.location.get('country_name', 'Unknown')
            
            print(f"  🏛️  Place: {name}")
            print(f"  🌍 Country: {country}")
            print(f"  📐 Coordinates: ({lat:.4f}, {lon:.4f})")
            print()
        else:
            print(f"  ❌ Location '{toponym}' could not be resolved to coordinates")
            print()

📍 DETAILED LOCATION DATA:

Sentence 1: "I traveled from New York to Richmond, Virginia last summer."
  🏛️  Place: New York
  🌍 Country: United States
  📐 Coordinates: (43.0003, -75.4999)

  🏛️  Place: Richmond
  🌍 Country: United States
  📐 Coordinates: (37.5538, -77.4603)

  🏛️  Place: Virginia
  🌍 Country: United States
  📐 Coordinates: (37.5481, -77.4467)


Sentence 2: "The battle took place near Harrisonburg in the Shenandoah Valley."
  🏛️  Place: City of Harrisonburg
  🌍 Country: United States
  📐 Coordinates: (38.4496, -78.8689)

  🏛️  Place: Community Christian School of the Shenandoah Valley
  🌍 Country: United States
  📐 Coordinates: (38.9104, -78.4778)


Sentence 3: "London and Paris are popular European destinations."
  🏛️  Place: London
  🌍 Country: United Kingdom
  📐 Coordinates: (51.5085, -0.1257)

  🏛️  Place: Paris
  🌍 Country: France
  📐 Coordinates: (48.8534, 2.3488)
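
Because each location record behaves like a dictionary, a small helper can turn it into a readable label while tolerating missing keys. The function name `describe_location` is hypothetical and not part of the `geoparser` library; the sample record is the Munich example shown earlier.

```python
def describe_location(location):
    """Build a one-line description from a location record (a dict).

    Uses .get() so that a missing key yields None instead of a KeyError.
    """
    if not location:
        return "unresolved"

    # Keep only the name parts that are actually present
    parts = [
        location.get("name"),
        location.get("admin1_name"),
        location.get("country_name"),
    ]
    label = ", ".join(part for part in parts if part)

    lat, lon = location.get("latitude"), location.get("longitude")
    if lat is None or lon is None:
        return label
    return f"{label} ({lat:.4f}, {lon:.4f})"


munich = {
    "geonameid": 2867714,
    "name": "Munich",
    "latitude": 48.13743,
    "longitude": 11.57549,
    "country_name": "Germany",
    "admin1_name": "Bavaria",
}

print(describe_location(munich))  # Munich, Bavaria, Germany (48.1374, 11.5755)
print(describe_location(None))    # unresolved
```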



## Part 2: Load and Process JMU Text Data

Now we'll want to extract locations from the Reddit data. We'll do so by loading the `.pickle` file from the previous lesson.

### Step 2.1: Load `jmu_reddit_toponyms.pickle`

This dataset contains sentences from the JMU subreddit that have already been tagged for locations.

In [None]:
try:
    df_jmu_reddit_toponyms = pd.read_pickle('data/jmu_reddit_toponyms.pickle')
    print(f"✅ Loaded {len(df_jmu_reddit_toponyms):,} sentences")
    print(f"📊 Columns: {list(df_jmu_reddit_toponyms.columns)}")

except FileNotFoundError:
    print("❌ Data file not found!")
    print("You may need to run previous lessons first to generate the toponym data.")
    print("Or check that you're in the correct directory.")

✅ Loaded 30,005 sentences
📊 Columns: ['type', 'date', 'score', 'year_month', 'sentences', 'toponyms']


### Step 2.2: The Geoparsing Function

Here's a streamlined function that processes text and extracts geographic information:

**Key features:**
- Processes multiple sentences at once for efficiency
- Extracts coordinates and place information

In [7]:
def geoparse_dataframe(df, text_column='sentences'):
    """
    Extract geographic locations from text data.
    
    Args:
        df: DataFrame with text data
        text_column: Column containing the text to parse
    
    Returns:
        DataFrame with added location columns
    """
    print(f"🔍 Processing {len(df)} sentences for geographic locations...")
    
    # Convert text column to list for batch processing
    sentences = df[text_column].tolist()
    
    try:
        # Process all sentences at once (more efficient)
        docs = geo.parse(sentences)  # Parse all sentences without feature filtering
        
        # Initialize storage for results
        places, latitudes, longitudes, feature_names = [], [], [], []
        
        # Extract information from each processed document
        for doc in tqdm(docs, desc="Extracting locations"):
            doc_places = []
            doc_latitudes = []
            doc_longitudes = []
            doc_feature_names = []
            
            # Get all toponyms found in this document
            for toponym in doc.toponyms:
                if toponym.location:
                    doc_places.append(toponym.location.get('name'))
                    doc_latitudes.append(toponym.location.get('latitude'))
                    doc_longitudes.append(toponym.location.get('longitude'))
                    doc_feature_names.append(toponym.location.get('feature_name'))
            
            # Store results (empty lists if no locations found)
            places.append(doc_places)
            latitudes.append(doc_latitudes)
            longitudes.append(doc_longitudes)
            feature_names.append(doc_feature_names)
        
        # Add new columns to dataframe
        df_result = df.copy()
        df_result['place'] = places
        df_result['latitude'] = latitudes
        df_result['longitude'] = longitudes
        df_result['feature_name'] = feature_names
        
        print(f"✅ Geoparsing complete!")
        return df_result
        
    except Exception as e:
        print(f"❌ Error during geoparsing: {e}")
        return df

There are several things to note in the data. First, for some sentences the geoparser did not find a toponym, which is indicated by empty lists `[]`. This is because the geoparser is more accurate than the earlier tagging and will likely produce fewer false positives. We will have to remember to remove these rows.

Likewise, the parsing is currently set to include both administrative areas such as countries and states (e.g. the US and Virginia) and population centers (e.g. Richmond and Harrisonburg). We will have to think about how to deal with these down the road.
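
One possible way to separate population centers from administrative areas later is to filter on the `feature_name` column. The sketch below uses a toy table; the helper `is_populated_place` is hypothetical, and apart from the "seat of a first-order administrative division" label seen in the Munich record, the `feature_name` strings are assumed examples of GeoNames-style labels that should be checked against the real data.

```python
import pandas as pd

# Toy long-format table; feature_name values are illustrative assumptions
toy_places = pd.DataFrame({
    "place": ["Virginia", "Richmond", "Harrisonburg"],
    "feature_name": [
        "first-order administrative division",
        "seat of a first-order administrative division",
        "populated place",
    ],
})


def is_populated_place(feature_name):
    """Return True for labels that describe a town or city rather than a region."""
    text = str(feature_name).lower()
    return text.startswith("seat of") or "populated place" in text


# Keep only rows whose feature label looks like a town or city
toy_cities = toy_places[toy_places["feature_name"].apply(is_populated_place)]
print(toy_cities)
```

Inspecting `feature_name` with `value_counts()` on the real results first would show which labels actually occur before settling on a rule like this.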

**Running the geoparser on all of the data can take at least an hour!**

In [8]:
# Run the geoparser over the entire 'sentences' column
geoparse_results = geoparse_dataframe(df_jmu_reddit_toponyms)

🔍 Processing 30005 sentences for geographic locations...
Toponym Recognition...


Batches:   2%|▏         | 673/30005 [00:32<23:37, 20.69it/s]



KeyboardInterrupt: 
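
Since a full run is long and an interrupt discards everything, one option is to parse the sentences in chunks and keep whatever has finished. This is a minimal sketch: `parse_in_chunks` and `fake_parse` are hypothetical names, and `fake_parse` is only a stand-in so the example runs without the geoparser.

```python
def parse_in_chunks(sentences, parse_fn, chunk_size=1000):
    """Call parse_fn on successive slices of sentences and collect the results.

    If the run is interrupted (Ctrl+C or the notebook stop button),
    the results gathered so far are returned instead of being lost.
    """
    results = []
    try:
        for start in range(0, len(sentences), chunk_size):
            chunk = sentences[start:start + chunk_size]
            results.extend(parse_fn(chunk))
            done = min(start + chunk_size, len(sentences))
            print(f"Parsed {done} of {len(sentences)} sentences")
    except KeyboardInterrupt:
        print(f"Interrupted; keeping {len(results)} results")
    return results


def fake_parse(chunk):
    """Stand-in for a real parser: upper-cases each sentence."""
    return [sentence.upper() for sentence in chunk]


demo = parse_in_chunks(["a", "b", "c", "d", "e"], fake_parse, chunk_size=2)
print(demo)
```

In this lesson `parse_fn` would presumably be `geo.parse`, assuming its return value can be iterated one document per sentence, as the loops above suggest. Trying a small slice such as the first few hundred sentences is also a quick way to check the pipeline before committing to the full run.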

In [None]:
# Display the updated DataFrame with new columns
geoparse_results.sample(50)

In [3]:
df_reddit_geoparsed = geoparse_results[geoparse_results['place'].apply(len) != 0]


NameError: name 'geoparse_results' is not defined

---

In [None]:
df_reddit_geoparsed.sample(50)

In [None]:
df_reddit_geoparsed_long = df_reddit_geoparsed.explode(['place', 'latitude', 'longitude', 'feature_name']).reset_index(drop=True)
df_reddit_geoparsed_long.sample(50)
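
To see what `explode` does with several list columns at once, here is a toy version of the same step. The table `toy` is made up for illustration, with coordinates taken from the test output earlier in the lesson. Passing a list of columns to `explode` requires pandas 1.3 or newer.

```python
import pandas as pd

# Each row holds parallel lists: one entry per place found in the sentence
toy = pd.DataFrame({
    "sentences": [
        "I traveled from New York to Richmond.",
        "London is far away.",
    ],
    "place": [["New York", "Richmond"], ["London"]],
    "latitude": [[43.0003, 37.5538], [51.5085]],
    "longitude": [[-75.4999, -77.4603], [-0.1257]],
})

# Explode the list columns together so each place gets its own row,
# with the sentence repeated alongside it
toy_long = toy.explode(["place", "latitude", "longitude"]).reset_index(drop=True)
print(toy_long[["place", "latitude", "longitude"]])
```

Two sentences become three rows, one per place. The lists in each row must have matching lengths, which holds here because the geoparsing function appends to all of them together.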

In [None]:
df_reddit_geoparsed_long.to_pickle('data/jmu_reddit_geoparsed_long.pickle')

## Part 3: Visualizing Spatial Relations

We can make a quick interactive map of the results using `plotly`'s `scatter_map`.

### Step 3.1: Load Pre-processed Data

The complete geoparsing process took over an hour on a modern computer.
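
Given that run time, reloading the results saved in Part 2 is usually preferable to parsing again. A minimal sketch, assuming the pickle was written to `data/jmu_reddit_geoparsed_long.pickle` as above; the helper name `load_geoparsed` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_geoparsed(path="data/jmu_reddit_geoparsed_long.pickle"):
    """Return the saved long-format results, or None if the file is missing."""
    path = Path(path)
    if not path.exists():
        print(f"❌ {path} not found. Run Part 2 first to create it.")
        return None
    return pd.read_pickle(path)


# Example use in a fresh session:
# df_reddit_geoparsed_long = load_geoparsed()
```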

In [None]:
# Count each place while keeping latitude and longitude
df_reddit_geoparsed_count = df_reddit_geoparsed_long.groupby('place').agg({
    'latitude': 'first',    # Keep the first latitude for each place
    'longitude': 'first',   # Keep the first longitude for each place
    'place': 'size'         # Count the occurrences
}).rename(columns={'place': 'count'}).reset_index()

# Filter for places mentioned more than 4 times
df_reddit_geoparsed_count = df_reddit_geoparsed_count[df_reddit_geoparsed_count['count'] > 4]
df_reddit_geoparsed_count.sample(50)
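
The same count can also be written with pandas named aggregation, which names each output column directly and avoids the separate `rename` step. This is an alternative sketch on a toy table, not a replacement for the cell above; `toy_long` and `toy_counts` are made-up names.

```python
import pandas as pd

toy_long = pd.DataFrame({
    "place": ["Richmond", "London", "Richmond"],
    "latitude": [37.5538, 51.5085, 37.5538],
    "longitude": [-77.4603, -0.1257, -77.4603],
})

toy_counts = (
    toy_long.groupby("place")
    .agg(
        latitude=("latitude", "first"),    # First latitude seen for the place
        longitude=("longitude", "first"),  # First longitude seen for the place
        count=("latitude", "size"),        # Number of rows in the group
    )
    .reset_index()
)
print(toy_counts)
```

Here `size` counts rows per group regardless of which column it is attached to, so the result matches the number of mentions of each place.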

In [None]:
fig = px.scatter_map(
    df_reddit_geoparsed_count,     # DataFrame with count data
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    hover_name="place",           # Show place name on hover
    size="count",                 # Marker size based on count
    hover_data={
        "count": True,              # Show count in hover
        "latitude": False,          # Hide coordinates in hover
        "longitude": False
    },
    title="Geographic Locations Found in JMU Reddit Data",
    zoom=3,                       # Start with a broad view
    height=600
)

# Update the layout to use the OpenStreetMap basemap
# (scatter_map figures are styled with map_style, not mapbox_style)
fig.update_layout(
    map_style="open-street-map",  # No token needed for this style
    margin={"r":0,"t":50,"l":0,"b":0}  # Remove margins but keep space for title
)

fig.show()

#### Critical Questions

- What does the map show about the locations? 
- Where does it do well? 
- Where are the mistakes?