# Lesson 4: Extracting Toponyms in Texts

## Overview

This lesson will cover three sections:

- Prepping the data
- Running NER
- Visualizing the data



## Introduction

In this lesson we will learn how to extract place names from text. We will do so in two ways:

- Using known place names
- Natural Language Processing

Natural Language Processing (NLP) teaches computers how to "read" and understand text like humans do. When we want to find place names in text, this becomes tricky because the same word can mean different things depending on how it's used.

For example: 

- I visited **Washington** last summer and saw the Capitol building.
- **Washington** led the Continental Army during the Revolutionary War.

In the first sentence, "Washington" refers to Washington D.C. (a place). In the second sentence, "Washington" refers to George Washington (a person, not a place). Humans can easily tell the difference, but computers need help figuring this out.

If we had a system where we simply gave computers a list of place names, we would get a lot of false positives. Therefore linguists started teaching computers the rules of grammar so they could understand **parts of speech**. By knowing *how* a word is being used, computers get more accurate in predicting what the word means in that context. 
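To make the false-positive problem concrete, here is a minimal sketch of the "list of place names" approach. The helper `naive_place_matches` and the tiny `known_places` set are hypothetical names invented for this illustration; the sentences are the two Washington examples from above.

```python
# Naive lookup: flag a sentence if it contains any word from a list of known places.
# The helper name and the place list are assumptions made for this illustration.
known_places = {"Washington", "London"}

sentences = [
    "I visited Washington last summer and saw the Capitol building.",
    "Washington led the Continental Army during the Revolutionary War.",
]

def naive_place_matches(sentence, places):
    """Return the known place names that appear as words in the sentence."""
    words = [w.strip(".,!?") for w in sentence.split()]
    return [w for w in words if w in places]

for s in sentences:
    print(naive_place_matches(s, known_places), "<-", s)
```

Both sentences are flagged, even though only the first one is about a place. The list alone cannot tell the city from the general.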

By having a basic grammar it is easier to figure out what a word means:

- The weather in London was cold and foggy.
- Jack London wrote *Call of the Wild*.

Here, a computer can figure out that the first "London" is a place (because we say "in London"), while "Jack London" is a person's name because it is performing the verb "write", something cities generally don't do.
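The "in London" clue can be written as a toy rule. This is only a sketch of the idea of using grammatical context, not how spaCy or any real NER model works; the pattern `LOCATIVE` and the function `places_by_preposition` are hypothetical names for this example.

```python
import re

# Toy rule (an assumption for illustration, not how spaCy works):
# treat a capitalized word as a place only when it directly follows
# a locative preposition such as "in", "at", "to", or "from".
LOCATIVE = re.compile(r"\b(?:in|at|to|from)\s+([A-Z][a-z]+)")

def places_by_preposition(sentence):
    """Return capitalized words that directly follow a locative preposition."""
    return LOCATIVE.findall(sentence)

print(places_by_preposition("The weather in London was cold and foggy."))
print(places_by_preposition("Jack London wrote Call of the Wild."))
```

The first call finds `London` and the second finds nothing. A hand-written rule like this breaks easily, though: "I gave it to Jack" would return `Jack` as a place, which is one reason statistical models replaced rule lists.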

Over time, more sophisticated models have developed to help computers understand how language is being used. This process of finding locations, people, organizations, and other important information in text is called Named Entity Recognition (NER). It's very useful because it helps us understand what a text is really about, beyond just counting words.

## 1 Prepping the data
This unit will require `spacy` and `nltk` to be installed. Please see the preparation instructions before proceeding.

In [3]:
import pandas as pd
import nltk
import re

#### 1.1.1 Load the data
The data folder should contain the `.pickle` file from last lesson with all of the type modifications we made.

In [4]:
df_reddit = pd.read_pickle('data/jmu_reddit.pickle')

#### 1.1.2 Check the data

It's always good to take a look at the data before processing it. In previous lessons we used `.head()`, which shows the first 5 rows of the data frame by default. Now we'll use `.sample()`, which returns a random sample of rows. I have set the parameter `n` to `5`, meaning five rows. I have also included the `random_state` parameter. By setting it, I get the exact same sample every time. This is for presentation purposes only: it makes sure that your results and my results are the same.
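To see what `random_state` does, here is a small sketch on a toy data frame. The data frame `df` and its contents are made up for illustration and stand in for `df_reddit`.

```python
import pandas as pd

# Toy DataFrame standing in for df_reddit (values are made up for illustration)
df = pd.DataFrame({"text": [f"comment {i}" for i in range(20)]})

# Without random_state, each call may return different rows
print(df.sample(n=3))

# With a fixed random_state, the same rows come back every time
first = df.sample(n=3, random_state=43)
second = df.sample(n=3, random_state=43)
print(first.equals(second))
```

The number itself (43 here) has no special meaning. Any fixed integer gives a repeatable sample.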

In [6]:
df_reddit.sample(n= 5, random_state= 43)

Unnamed: 0,type,title,text,date,score,year_month
2238,comment,Explosion?,I saw it firsthand drove by it,2020-10-17 17:11:22,2,2020-10
8929,comment,Frustrated Student,"Oh I absolutely agree. But these ""kids"" also n...",2020-10-04 19:10:22,3,2020-10
6629,comment,JMU President,"Like don’t get me wrong, I liked Alger. Grante...",2025-08-30 21:15:12,2,2025-08
412,comment,Consider: the Duke Dog with no eyebrows,Homie is vibing hard,2019-11-29 01:30:41,12,2019-11
3548,comment,JMU is going to go online and you all should p...,"I agree, wholeheartedly. Many of my professors...",2020-07-22 12:32:05,3,2020-07


#### 1.1.3 Download the 'punkt' tokenizer from nltk 
This tokenizer model helps us split text into sentences. It only needs to be downloaded once; after that, `nltk` finds the local copy.

In [7]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\burgerjx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\burgerjx\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

## 2 Split `text` into sentences

To make our lives a bit easier, we will first split each text into individual sentences. When we search for locations later on, the process will be quicker, because very long strings are slow to process. Also, since we will eventually want to know the emotions of the sentences where toponyms occur, organizing the data by sentence now anticipates that step.

The following lines of code take all of the texts and turn them into sentences. First, we `.apply` the `nltk.sent_tokenize` function from the `nltk` (Natural Language Toolkit) library. This splits each text into its sentences based on punctuation and puts them into a list in the column `sentences`. We then take that list and `explode` it. That sounds more dramatic than it really is! All we're doing is telling pandas to make a new row for each sentence in the list. It is easier to see what's going on by simply running the code.
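The two steps can be tried on a toy data frame first. In this sketch the function `simple_sent_split` is a rough, hypothetical stand-in for `nltk.sent_tokenize` (so the example runs without downloading anything), and the texts are made up.

```python
import re
import pandas as pd

# Toy data standing in for df_reddit (texts are made up for illustration)
df = pd.DataFrame({
    "type": ["post", "comment"],
    "text": ["Welcome to JMU. You will love it!", "Don't remind me."],
})

def simple_sent_split(text):
    """Rough stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Step 1: one list of sentences per row
df = df.assign(sentences=df["text"].apply(simple_sent_split))

# Step 2: one row per sentence; the values in the other columns are repeated
df_sentences = df.explode("sentences")
print(df_sentences[["type", "sentences"]])
```

Note that `explode` keeps the original index, so the two sentences from the first text both carry index `0`. If unique row labels are needed, `reset_index(drop=True)` can be applied afterwards.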

In [8]:
# Step 1: Split each text into individual sentences using NLTK
df_with_sentence_lists = df_reddit.assign(sentences=df_reddit['text'].apply(nltk.sent_tokenize))

# Step 2: Create a new row for each sentence (explode the lists)
df_reddit_sentences = df_with_sentence_lists.explode('sentences')

In [None]:
# Display a random sample of 5 sentences to see the results
df_reddit_sentences.sample(n=5, random_state=43)

Unnamed: 0,type,title,text,date,score,year_month,sentences
7515,comment,I was just accepted!,"Congratz, you're gonna love it. Pick the quad ...",2014-01-09 17:09:40,8,2014-01,Pick the quad as your first choice for housing
4435,comment,Missing Dog,I don't know how? This dog is now safely home ...,2020-09-16 11:32:26,2,2020-09,This dog is now safely home but would apprecia...
2906,comment,"1 Year Ago, we thought it would only be 2 week...",Don't remind me.,2021-03-11 22:12:23,7,2021-03,Don't remind me.
8214,comment,Its so hard to find friends as an older JMU st...,My daughter went there as transfer student whe...,2025-02-09 21:58:25,2,2025-02,"She ended up joining some ""be kind"" group (som..."
1911,comment,President Alger Mishap!,Holy shit dude I thought I was the only one! M...,2019-08-27 21:36:40,23,2019-08,He just stared at us in the darkness for a goo...


### 2.1 Drop `text` and `title`

Since the full `text` (and its `title`) is now repeated on every sentence row, it is good practice to `.drop` those columns. The information we need is already in the new `sentences` column.

In [9]:
df_reddit_sentences = df_reddit_sentences.drop(columns=['text', 'title'])

In [10]:
df_reddit_sentences.head(5)

Unnamed: 0,type,date,score,year_month,sentences
0,post,2024-03-18 12:47:10,358,2024-03,President Alger leaving to take same job at Am...
1,comment,2024-03-18 12:49:04,82,2024-03,"Like him or not, he did help transform this sc..."
1,comment,2024-03-18 12:49:04,82,2024-03,Applications to JMU have drastically increased...
2,comment,2024-03-18 12:50:05,34,2024-03,Massive changes happening at JMU this year.
2,comment,2024-03-18 12:50:05,34,2024-03,"Alger stepping down, AD Bourne retiring, Cigne..."


## 3 Filtering by list

The most basic way to figure out if the texts contain places is to go through each text and filter it with a list of known place names. For example, we can create a list of Virginia places:

- 'Richmond'
- 'Harrisonburg'
- 'Lynchburg'
- 'Roanoke'
- 'Charlottesville'


We can then use the `str.contains()` method to keep only the sentences that contain any of the words above.

In [12]:
# Define the Virginia cities we want to search for
virginia_cities = ['Richmond', 'Harrisonburg', 'Lynchburg', 'Roanoke', 'Charlottesville']

# Filter sentences that contain any of these city names
# The '|' symbol means "OR" - so we're looking for sentences with ANY of these cities
df_reddit_words = df_reddit_sentences[df_reddit_sentences.sentences.str.contains('|'.join(virginia_cities))]

In [13]:
df_reddit_words['sentences'].sample(5, random_state=43)

6523    Harrisonburg is a major hub for the undergroun...
9773    “*Protecting the health of our Harrisonburg an...
7852    Harrisonburg honestly has a great variety of f...
1973    She had a major impact on the students who wen...
4621    Then, you could narrow it down to that individ...
Name: sentences, dtype: object
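
One caveat with list filtering: `str.contains()` treats its pattern as a regular expression, is case-sensitive by default, and matches substrings inside longer words. The sketch below compares the plain pattern with a stricter one on made-up sentences. The variable names and the stricter pattern are illustrative assumptions, not part of the lesson's pipeline.

```python
import re
import pandas as pd

cities = ["Richmond", "Harrisonburg"]

# Toy sentences (made up for illustration)
s = pd.Series([
    "I grew up in Richmond.",
    "harrisonburg has great food",     # lowercase spelling
    "My roommate is a Richmonder.",    # city name inside a longer word
    None,                              # missing text
])

# Plain substring pattern; na=False is added so the missing row counts as "no match"
plain = s.str.contains("|".join(cities), na=False)

# Stricter pattern: escape each name, require word boundaries, ignore case
pattern = r"\b(?:" + "|".join(re.escape(c) for c in cities) + r")\b"
strict = s.str.contains(pattern, case=False, na=False)

print(plain.tolist())
print(strict.tolist())
```

The plain pattern misses the lowercase `harrisonburg` and accepts `Richmonder`, while the stricter pattern does the opposite. Neither can tell Richmond the city from a person named Richmond, which is the problem the next section addresses.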

## 4 Using Named Entity Recognition

Working with location names represents a unique computational problem. Since we do not know in advance what the names might be, and since some proper nouns can be both places and people (e.g. Jefferson, Washington, Lincoln), we need the computer to have some concept of what a place is. This is where language models come in.

Language models have been trained on a lot of text. Through statistical inference they learn to identify the different parts of speech in a text (verb, noun, adjective, etc.), and using this basic understanding of grammar they can infer when a word is being used as a location name and when it is being used as a person's name.

One of the main Python libraries for this task is `spacy`. The sample below goes through the basic procedure for extracting entities.

**Don't worry about how the code works, just look at the result**

Notice how accurately it is able to distinguish between an organization in Virginia (UVA), a person Jefferson, and the place Jefferson (geopolitical entity).

In [29]:
# Clear any cached spaCy models and reload fresh
import sys
if 'spacy' in sys.modules:
    del sys.modules['spacy']
if 'nlp' in locals():
    del nlp

import spacy
import importlib.util
import subprocess

# Check if the model is installed
model_name = "en_core_web_sm"
if importlib.util.find_spec(model_name) is None:
    # Use the interpreter running this notebook so the model installs into the same environment
    subprocess.run([sys.executable, "-m", "spacy", "download", model_name])

# Load spaCy's English model fresh
nlp = spacy.load('en_core_web_sm')
print(f"✅ Loaded fresh spaCy model: {model_name}")

# The sentence to process - clear example text
text = """The University of Virginia is in the town of Charlottesville. 
          It was created by Jefferson. In the 1950s, William Faulkner gave a series of lectures there 
          about his fiction, most of which is set in Jefferson."""

print(f"\n📝 Processing text:\n{text}")

# Process the text with spaCy
doc = nlp(text)

print("\n🔍 Named entities found:")
# Extract and display named entities along with their labels
for ent in doc.ents:
    print(f"  '{ent.text}' -> {ent.label_}")
    
print("\n📚 Entity Label Meanings:")
print("  ORG = Organization")
print("  GPE = Geopolitical Entity (places)")  
print("  PERSON = Person")
print("  DATE = Date or time period")

✅ Loaded fresh spaCy model: en_core_web_sm

📝 Processing text:
The University of Virginia is in the town of Charlottesville. 
          It was created by Jefferson. In the 1950s, William Faulkner gave a series of lectures there 
          about his fiction, most of which is set in Jefferson.

🔍 Named entities found:
  'The University of Virginia' -> ORG
  'Charlottesville' -> GPE
  'Jefferson' -> PERSON
  'the 1950s' -> DATE
  'William Faulkner' -> PERSON
  'Jefferson' -> GPE

📚 Entity Label Meanings:
  ORG = Organization
  GPE = Geopolitical Entity (places)
  PERSON = Person
  DATE = Date or time period


spaCy is only as accurate as the data provided to it. If the text data is garbled or too short, it will likely have trouble. Undoubtedly, there are sentences in our `sentences` column that are not going to be read properly, but what we are relying on is the sheer volume of text. Even with some false positives and false negatives, we should be able to build a pretty good overview of the most mentioned places.

## 5 Extract Entities in all `sentences`

Running one extraction on one sentence in `spacy` is pretty straightforward. We simply call `nlp()` on the sentence we want to analyze and save the result to a new variable, usually called `doc`. When we run this on a column with thousands of sentences, we start to run into performance issues because the sentences are processed one at a time, and we also get no feedback on how far along the process is. The functions below modify the procedure above: they ask the computer to process the sentences in batches across multiple processor cores, and they display a progress bar. Finally, instead of the small model we used above (`en_core_web_sm`), we will now use a slightly bigger model called `en_core_web_md`, which will hopefully help us find more locations!

In [27]:
from tqdm import tqdm
tqdm.pandas()

In [30]:
# Clear any cached spaCy models and reload fresh
import sys
if 'spacy' in sys.modules:
    del sys.modules['spacy']
if 'nlp' in locals():
    del nlp

import spacy
import importlib.util
import subprocess

# Check if the model is installed
model_name = "en_core_web_md"
if importlib.util.find_spec(model_name) is None:
    # Use the interpreter running this notebook so the model installs into the same environment
    subprocess.run([sys.executable, "-m", "spacy", "download", model_name])

# Load spaCy's English model fresh
nlp = spacy.load('en_core_web_md')
print(f"✅ Loaded fresh spaCy model: {model_name}")


# Function to extract GPE (Geopolitical Entities) from a batch of docs
def extract_gpe_from_docs(docs):
    return [[ent.text for ent in doc.ents if ent.label_ == 'GPE'] or None for doc in docs]

# Use nlp.pipe() for faster batch processing with multiple cores
def process_sentences_in_batches(sentences, batch_size=50, n_process=-1):
    # Process sentences using nlp.pipe with batch processing and multi-processing
    gpe_results = []
    for doc in tqdm(nlp.pipe(sentences, batch_size=batch_size, n_process=n_process), total=len(sentences)):
        gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
        gpe_results.append(gpes if gpes else None)
    return gpe_results

✅ Loaded fresh spaCy model: en_core_web_md


We are now ready to apply this function to `sentences` and create a new column called `toponyms`. 

**Warning: this process will take a couple of minutes.**

If this does not work, I have saved the results as `jmu_reddit_toponyms.pickle` and preloaded it in your data folder. You can simply keep running the code below on that imported file.

In [31]:
# Apply the function with tqdm progress bar and nlp.pipe() for batch processing
df_reddit_sentences['toponyms'] = process_sentences_in_batches(df_reddit_sentences['sentences'])

100%|██████████| 30005/30005 [01:59<00:00, 250.70it/s] 



In [32]:
# Display the result
df_reddit_sentences[['sentences', 'toponyms']].sample(15, random_state = 4)

Unnamed: 0,sentences,toponyms
3919,Breonna Taylor's boyfriend shot at the police,
183,We Re-Opened Campus!,
2466,"I’m not surprised they handled it this way, bu...",
9675,That or walk to Sheetz.,[Sheetz]
7196,I agree completely!,
5319,!,
7030,"I went to my general doctor, told her I think ...",
10716,I have no idea if it's good or bad,
5028,Next to be received is my high school one,
8912,Then go ahead and join the server,


Because not every sentence includes a toponym, many rows hold a missing value (`None`) in the `toponyms` column. We will want to eliminate those rows, because they are not relevant and they take up memory. Still, we might want to peek inside first and calculate what percentage of sentences actually have toponyms.
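On a toy column the percentage is easy to verify by hand. This sketch assumes each cell holds either a non-empty list or `None`, and uses a shortcut (`notna().mean()`) rather than the step-by-step calculation in the cell that follows; the values are made up.

```python
import pandas as pd

# Toy column standing in for df_reddit_sentences['toponyms']:
# a list of place names where some were found, None otherwise
toponyms = pd.Series([["Harrisonburg"], None, None, ["VA", "Harrisonburg"], None])

# notna() marks rows that hold a list; the mean of True/False values is the share of True
share_with_toponyms = toponyms.notna().mean() * 100
print(f"{share_with_toponyms:.2f}%")
```

Two of the five toy rows hold a list, so the share is 40%.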

In [21]:
# Calculate the number of sentences with toponyms (not None or empty)
num_sentences_with_toponyms = df_reddit_sentences['toponyms'].dropna().apply(lambda x: len(x) > 0).sum()

# Calculate the total number of sentences
total_sentences = len(df_reddit_sentences)

# Calculate the percentage of sentences with toponyms
percent_with_toponyms = (num_sentences_with_toponyms / total_sentences) * 100

# Display the result
print(f"Percentage of sentences with toponyms: {percent_with_toponyms:.2f}%")



Percentage of sentences with toponyms: 4.06%


The total is pretty low when we compare it to the Hauser text, and keep in mind there may still be false positives here. Let's clear out all of the rows without toponyms. We can use a filter to keep the rows that have data in the `toponyms` column (`.notna()`) and, where there is a list, to check that the list actually contains values (`.str.len() > 0`).

In [33]:
# Each condition needs its own parentheses because & binds more tightly than >
df_reddit_toponyms = df_reddit_sentences[
    df_reddit_sentences['toponyms'].notna() & (df_reddit_sentences['toponyms'].str.len() > 0)
]

# Display a random sample of 10 rows from the new DataFrame
df_reddit_toponyms.sample(10, random_state = 4)

Unnamed: 0,type,date,score,year_month,sentences,toponyms
9015,post,2018-11-03 00:20:12,36,2018-11,"Hey JMU, if you're registered to vote in VA, b...","[VA, Harrisonburg]"
10127,post,2021-09-06 10:45:22,32,2021-09,anyone know where they moved the condom rack t...,[UREC]
763,comment,2020-08-13 10:47:20,-37,2020-08,Just remember this as you grow into adults tha...,[America]
6307,post,2024-09-25 08:12:43,48,2024-09,I read the owner of Dave’s now owns Francesco’...,"[Bridgewater, Harrisonburg]"
9442,comment,2021-04-28 11:36:37,23,2021-04,Yeah my place is managed by forrest hills and ...,[forrest hills]
3988,comment,2020-09-01 11:23:52,17,2020-09,"In this area of Virginia, per the VDH dashboar...",[Virginia]
6838,comment,2022-10-16 08:43:06,10,2022-10,If they’re getting airlifted to Charlottesvill...,[Charlottesville]
9013,comment,2020-06-06 17:59:30,1,2020-06,"However, nearly everyone in the state of VA kn...",[VA]
8640,comment,2021-05-20 21:43:29,2,2021-05,"Once new shrubs are put in front of Burruss, t...",[Burruss]
1602,comment,2020-08-27 12:39:26,98,2020-08,JMU knowingly and actively endangered the heal...,[Harrisonburg]
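
The two-part filter can be checked on a toy column that contains all three cases: a non-empty list, `None`, and an empty list. The data frame and the variable names `has_value` and `non_empty` are made up for this sketch.

```python
import pandas as pd

# Toy column: a non-empty list, None, and an empty list
df = pd.DataFrame({
    "sentences": ["Go to Harrisonburg.", "I agree!", "Hmm."],
    "toponyms": [["Harrisonburg"], None, []],
})

has_value = df["toponyms"].notna()          # False for None
non_empty = df["toponyms"].str.len() > 0    # False for [] and for None (NaN > 0 is False)

# Each comparison is evaluated on its own before combining,
# because & binds more tightly than > in Python
df_with_places = df[has_value & non_empty]
print(df_with_places)
```

Only the first row survives. Storing each condition in its own variable, as here, is one way to sidestep operator-precedence surprises when combining filters.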


## 6 Counting Toponyms

We can gain more insights into the data if we visualize them. Since this takes some complicated coding, don't worry about how this result is achieved for now. Suffice to say that what the code does is count the total number of each toponym in the corpus to get the raw value.
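For readers who do want a peek at the idea, here is a minimal sketch of the counting step using only the standard library. The nested list is made up, and `chain.from_iterable` plays the role that `explode` plays on a pandas column.

```python
from collections import Counter
from itertools import chain

# Toy lists of toponyms, one list per sentence (made up for illustration)
toponyms_per_sentence = [["VA", "Harrisonburg"], ["Harrisonburg"], ["Virginia"]]

# Flatten the nested lists into one long sequence, then tally each name
toponym_counts = Counter(chain.from_iterable(toponyms_per_sentence))

print(toponym_counts.most_common(2))
```

`Counter` behaves like a dictionary whose values are counts, so `toponym_counts["Harrisonburg"]` is `2` here.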

In [34]:
from collections import Counter

In [35]:
# Step 1: Unnest the list (flatten the 'toponyms' column)
unnested_toponyms = df_reddit_toponyms['toponyms'].explode()

In [36]:
# Step 2: Drop any NaN values (if any exist)
unnested_toponyms = unnested_toponyms.dropna()

In [37]:
# Step 3: Collapse the unnested toponyms by count
toponym_counts = Counter(unnested_toponyms)


In [38]:
# Convert to a DataFrame for easy viewing
toponym_counts_df = pd.DataFrame(toponym_counts.items(), columns=['Toponym', 'Count']).sort_values(by='Count', ascending=False)
toponym_counts_df.head(50)

Unnamed: 0,Toponym,Count
4,Harrisonburg,240
7,Virginia,114
8,VT,47
40,VA,42
19,US,41
72,Florida,21
66,harrisonburg,20
23,America,19
65,Breeze,19
47,FCS,19


Not surprisingly, Harrisonburg is the top toponym, but there are also some "garbage" locations like `FCS` and `Breeze` that are throwing off the results.
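One possible cleanup, sketched here under stated assumptions, is to merge spellings that differ only in case and to drop labels we have decided are not places. The stoplist `not_places` is a hypothetical hand-made list, and the counts are copied from the table above.

```python
from collections import Counter

# Raw counts copied from the table above
raw_counts = Counter({"Harrisonburg": 240, "Virginia": 114, "harrisonburg": 20,
                      "Breeze": 19, "FCS": 19})

# Hypothetical stoplist of labels we have decided are not places
not_places = {"breeze", "fcs"}

cleaned = Counter()
for name, count in raw_counts.items():
    key = name.lower()
    if key in not_places:
        continue
    # Store under a title-cased key so "harrisonburg" and "Harrisonburg" merge
    cleaned[key.title()] += count

print(cleaned.most_common())
```

Title-casing is a simplification: it would turn abbreviations such as `VA` or `US` into `Va` and `Us`, so a real cleanup would likely need an explicit mapping for those.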

### 6.1  Visualizing Toponyms

It is much easier to look at this data in a chart.

In [39]:
import plotly.express as px

# Take the top 10 most common toponyms for plotting
toponym_counts_top10 = toponym_counts_df.head(10)

# Create the bar chart using Plotly
fig = px.bar(
    toponym_counts_top10,
    x='Toponym',
    y='Count',
    title='Top 10 Most Common Toponyms',
    labels={'Toponym': 'Toponym', 'Count': 'Frequency'},
    text='Count'
)

# Display the plot
fig.show()

### Save Progress

Now that we have extracted the toponyms, we will want to save the newly created data.

In [24]:
pd.to_pickle(df_reddit_sentences, 'data/jmu_reddit_toponyms.pickle')
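
A short sketch of why a pickle file suits this data: the `toponyms` column holds Python lists, and a pickle round trip preserves them, whereas a CSV round trip turns them into plain strings. The toy data frame and the temporary file names below are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy DataFrame with a list-valued column, like `toponyms`
df = pd.DataFrame({"sentences": ["Go to Harrisonburg."], "toponyms": [["Harrisonburg"]]})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "toy.pickle")
    csv_path = os.path.join(tmp, "toy.csv")

    # Write the same data in both formats
    df.to_pickle(pickle_path)
    df.to_csv(csv_path, index=False)

    # Read both files back
    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

print(type(from_pickle.loc[0, "toponyms"]))  # the list survives
print(type(from_csv.loc[0, "toponyms"]))     # the list has become a string
```

Pickle files should only be loaded from sources you trust, since unpickling can run arbitrary code.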
