# Extracting location names and data from Gibbon

In this notebook we will put together and practice many of the skills we have learned so far this term. Starting with just the raw text files from Gibbon's Decline and Fall, we will create a DataFrame containing location names, location counts, and location data.

The code in this notebook may seem complex, but if you read through it carefully, you will likely understand what most of the code is doing.


## Set-up

In [1]:
%%capture
# install necessary libraries. "%%capture" must be the first line of the cell;
# it stops the notebook from printing out all the install output.
# Remove it if you need to troubleshoot.
!pip install stanza


In [2]:
%%capture
# install necessary libraries. "%%capture" must be the first line of the cell;
# it stops the notebook from printing out all the install output.
# Remove it if you need to troubleshoot.
!pip install wget


In [4]:
# import necessary libraries
import os
import pandas as pd
import stanza
import json
import wget



## NLP pipeline
Now that all the necessary libraries have been installed and imported into our project, we need to set up our nlp pipeline. We will use [Stanza](https://stanfordnlp.github.io/stanza/).

In [5]:
# load stanza nlp pipeline that tokenizes and performs Named Entity Recognition
nlp_ner = stanza.Pipeline(lang='en', processors='tokenize,ner')

2023-11-06 01:26:46 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-11-06 01:26:47 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2023-11-06 01:26:47 INFO: Using device: cpu
2023-11-06 01:26:47 INFO: Loading: tokenize
2023-11-06 01:26:48 INFO: Loading: ner
2023-11-06 01:26:49 INFO: Done loading processors!


## Load text data
If you are using a **Colab Notebook** you will need to run the cell below to get the text files.

Otherwise, you should have all of the text files for Gibbon's _Decline and Fall of the Roman Empire_ already downloaded from Canvas.

In [6]:
# load text files, Colab only
! git clone https://github.com/jdeen33/Gibbon_text.git

Cloning into 'Gibbon_text'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 74 (delta 0), reused 0 (delta 0), pack-reused 71
Receiving objects: 100% (74/74), 4.20 MiB | 16.93 MiB/s, done.


## Extract location information from text file(s)

In [7]:
# create a function that takes a text string as input and returns a dictionary
# of the locations and location counts found in that string
def get_locations_from_text(text):
    locations_dict = {}
    doc = nlp_ner(text)
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # 'S-GPE' marks a single-token geopolitical entity, so
            # multi-token place names are not counted here
            if token.ner == 'S-GPE':
                if token.text not in locations_dict:
                    locations_dict[token.text] = 1
                else:
                    locations_dict[token.text] += 1
    return locations_dict
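
The function above counts only tokens tagged `S-GPE`, so a place name spread over several tokens is skipped. The sketch below shows one way the other position tags (`B-`, `I-`, `E-`) could be joined into a single name. It is an illustration only: `count_gpe_entities` is a hypothetical helper, and the tagged tokens are written by hand rather than produced by Stanza, so it runs without loading the pipeline.

```python
from collections import Counter

def count_gpe_entities(tagged_tokens):
    """Count GPE entities from (text, tag) pairs, joining multi-token names."""
    counts = Counter()
    current = []  # tokens of a multi-token entity that is still open
    for text, tag in tagged_tokens:
        if tag == 'S-GPE':                    # single-token entity
            counts[text] += 1
            current = []
        elif tag == 'B-GPE':                  # first token of a longer entity
            current = [text]
        elif tag == 'I-GPE' and current:      # middle token
            current.append(text)
        elif tag == 'E-GPE' and current:      # last token: close the entity
            current.append(text)
            counts[' '.join(current)] += 1
            current = []
        else:                                 # any other tag resets the state
            current = []
    return counts

# hand-written sample, not real Stanza output
sample = [('Rome', 'S-GPE'), ('and', 'O'), ('Asia', 'B-GPE'), ('Minor', 'E-GPE'),
          (',', 'O'), ('then', 'O'), ('Rome', 'S-GPE')]
gpe_counts = count_gpe_entities(sample)
```

With real data, the pairs could be built from `token.text` and `token.ner` inside the same two loops the function above already uses.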

You will need to choose which chapter you would like to extract locations from. For this example I will use Chapter 10.

For **Colab** it will look something like this:
`/content/Gibbon_text/gibbon_decline_volume1_chap10.txt`

For **Jupyter** it will look something like this:
`../text/gibbon_decline_and_fall/gibbon_decline_volume1_chap10.txt`

In [15]:
# identify the path to the text file you want to use
path_to_file = './Gibbon_text/gibbon_decline_volume1_chap10.txt' # <-- Insert path to chosen file
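
If you expect to switch chapters often, the path could be built from a volume and chapter number instead of typed by hand. This is a minimal sketch: `chapter_path` is a hypothetical helper, and it assumes every file follows the `gibbon_decline_volume<V>_chap<C>.txt` pattern seen in the paths above, which is not verified here for other volumes.

```python
from pathlib import Path

def chapter_path(base_dir, volume, chapter):
    """Build the path to one chapter file from its volume and chapter numbers."""
    return Path(base_dir) / f'gibbon_decline_volume{volume}_chap{chapter}.txt'

candidate_path = chapter_path('./Gibbon_text', 1, 10)

# check that the file is really there before trying to read it
file_found = candidate_path.is_file()
```

A `Path` object can be passed to `open()` in the same way as a string path.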

In [16]:
# read text from text file
with open(path_to_file, encoding='utf-8', mode='r') as f:
    text = f.read()

In [17]:
# apply function to get locations and location counts
# this will take a few minutes
locations = get_locations_from_text(text)

In [18]:
# sanity check
locations

{'Italy': 15,
 'Verona': 1,
 'Rome': 27,
 'Gaul': 1,
 'Spain': 3,
 'Scandinavia': 3,
 'Ravenna': 3,
 'Scythia': 1,
 'Sweden': 4,
 'Gothland': 1,
 'Iceland': 1,
 'Denmark': 1,
 'Maetois': 1,
 'Mithridates': 1,
 'Prussia': 5,
 'Carlscrona': 1,
 'Pomerania': 2,
 'Thorn': 1,
 'Elbing': 1,
 'Koningsberg': 1,
 'Dantzic': 1,
 'Mecklenburg': 1,
 'Dacia': 3,
 'Germany': 6,
 'Poland': 1,
 'Russia': 1,
 'Finland': 1,
 'Sarmatia': 2,
 'Japan': 1,
 'Ukraine': 4,
 'Dniester': 1,
 'Maesia': 3,
 'Marcianopolis': 1,
 'Philippopolis': 3,
 'Thrace': 2,
 'Pannonia': 2,
 'Spoleto': 2,
 'Westphalia': 1,
 'Brunswick': 1,
 'Luneburg': 1,
 'Gallienus': 3,
 'Tarragona': 1,
 'Mauritania': 1,
 'Suevi': 2,
 'Lombardy': 1,
 'Milan': 1,
 'Macedonia': 1,
 'Circassia': 1,
 'Pityus': 2,
 'Trebizond': 3,
 'Colchis': 1,
 'Pontus': 3,
 'Chalcedon': 3,
 'Bithynia': 3,
 'Nicomedia': 2,
 'Prusa': 1,
 'Apamaea': 1,
 'Cius': 1,
 'Cyzicus': 3,
 'Rhyndacus': 1,
 'Heraclea': 1,
 'Nice': 1,
 'Greece': 4,
 'Piraeus': 2,
 'Athens': 

In [19]:
# you may want to save the locations dictionary
path = './' # <-- Path of your choosing
file_name = 'locations_data.json'
with open(os.path.join(path, file_name), encoding='utf-8', mode='w') as f:
    json.dump(locations, f)
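
Saving the dictionary is mainly useful because it can be read back later without rerunning the slow NER step. The sketch below shows that round trip with a three-entry sample dictionary. It writes to a temporary directory (an assumption made so the example cleans up after itself); in the notebook you would use your own path instead.

```python
import json
import os
import tempfile

# small sample standing in for the full locations dictionary
locations = {'Italy': 15, 'Rome': 27, 'Verona': 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'locations_data.json')

    # write the dictionary to disk
    with open(json_path, encoding='utf-8', mode='w') as f:
        json.dump(locations, f)

    # read it back into a new dictionary
    with open(json_path, encoding='utf-8', mode='r') as f:
        reloaded = json.load(f)
```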

In [20]:
# convert dictionary to dataframe for easier processing
location_count_df = pd.DataFrame.from_dict(locations, orient='index').reset_index().rename(columns={'index':'place_name', 0:'count'})
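
The one-line conversion above chains several steps together. An alternative that may be easier to read builds the DataFrame from the dictionary's `(name, count)` pairs and names the columns directly. The sketch uses a three-entry sample dictionary and a hypothetical `example_df` name; the sort at the end is an extra step to show the most frequently mentioned places first.

```python
import pandas as pd

# small sample standing in for the full locations dictionary
locations = {'Italy': 15, 'Verona': 1, 'Rome': 27}

# each (key, value) pair becomes one row
example_df = pd.DataFrame(list(locations.items()), columns=['place_name', 'count'])

# optional: order the rows by count, largest first
most_mentioned = example_df.sort_values('count', ascending=False).reset_index(drop=True)
```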


In [21]:
# preview DataFrame
location_count_df.head()

Unnamed: 0,place_name,count
0,Italy,15
1,Verona,1
2,Rome,27
3,Gaul,1
4,Spain,3


## Load data from Pleiades

In [22]:
# data from Pleiades, thanks to Peter Nadel!
if not os.path.isfile('places.csv'):  # check whether we already have this file
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/places.csv')
if not os.path.isfile('names.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/names.csv')
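
If the `wget` package is not available, the same "download only if missing" pattern can be written with the standard library. This is a sketch, not a replacement for the cell above: `download_if_missing` is a hypothetical helper, and the demonstration deliberately points at a file that already exists so that no network request is made.

```python
import os
import tempfile
import urllib.request

def download_if_missing(url, file_name):
    """Download url to file_name unless the file exists. Return True if downloaded."""
    if os.path.isfile(file_name):
        return False
    urllib.request.urlretrieve(url, file_name)
    return True

places_url = 'https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/places.csv'

# demonstrate the "already have it" branch with a placeholder file
with tempfile.TemporaryDirectory() as tmp_dir:
    existing_file = os.path.join(tmp_dir, 'places.csv')
    with open(existing_file, 'w', encoding='utf-8') as f:
        f.write('id,title\n')
    downloaded = download_if_missing(places_url, existing_file)
```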

In [23]:
# load and preview places DataFrame
places_df = pd.read_csv('places.csv')
places_df.head()

Unnamed: 0,created,description,details,provenance,title,uri,id,representative_latitude,representative_longitude,bounding_box_wkt
0,2010-06-24T14:11:06Z,"The ancient region known to the Romans as ""Gal...",<p>The Barrington Atlas Directory notes: FRA</p>,Barrington Atlas: BAtlas 1 D1 Gallia,Gallia,https://pleiades.stoa.org/places/993,993,46.705437,1.013706,"POLYGON ((8.672222 42.4395125, 8.672222 51.981..."
1,2020-01-10T20:52:00Z,"A Roman house in Pompeii (I, 6, 15), also know...",<p>The house was excavated in 1913 and 1914. T...,Pleiades,House of the Ceii,https://pleiades.stoa.org/places/999909607,999909607,40.75001,14.489506,"POLYGON ((14.4895058 40.7500099, 14.4895058 40..."
2,2015-05-28T11:48:45Z,"The so-called ""House of the Lararium of Achill...",,Pleiades,House of the Lararium of Achilles,https://pleiades.stoa.org/places/999909608,999909608,40.750362,14.489286,"POLYGON ((14.489286 40.750362, 14.489286 40.75..."
3,2020-03-20T15:26:53Z,A necropolis with inhumations dating to the fi...,<p>A necropolis located close to Monte Bibele ...,Pleiades,Monte Tamburino necropolis,https://pleiades.stoa.org/places/999917206,999917206,44.272322,11.37588,"POLYGON ((11.3758803 44.2723222, 11.3758803 44..."
4,2018-05-30T03:08:22Z,The megalithic defensive circuit of Rusellae d...,,Pleiades,Circuit wall of Rusellae,https://pleiades.stoa.org/places/999951524,999951524,42.829089,11.163588,"POLYGON ((11.1635884 42.8290888, 11.1635884 42..."


In [24]:
# load and preview names DataFrame
names_df = pd.read_csv('names.csv')
names_df.head()

Unnamed: 0,created,description,details,provenance,title,uri,id,place_id,name_type,language_tag,attested_form,romanized_form_1,romanized_form_2,romanized_form_3,association_certainty,transcription_accuracy,transcription_completeness,year_after_which,year_before_which
0,2010-06-24T14:11:06Z,,,Barrington Atlas: BAtlas 1 D1 Gallia,Gallia,https://pleiades.stoa.org/places/993/gallia,gallia,993,geographic,,,Gallia,,,certain,accurate,complete,,
1,2010-06-24T14:11:07Z,,,Barrington Atlas: BAtlas 1 D1 Gallia,Galli,https://pleiades.stoa.org/places/993/galli,galli,993,geographic,,,Galli,,,certain,accurate,complete,,
2,2015-06-23T17:19:22Z,,,Pleiades,House of the Ceii,https://pleiades.stoa.org/places/999909607/nam...,name.2015-06-23.9071239510,999909607,geographic,en,House of the Ceii,House of the Ceii,,,certain,accurate,complete,1700.0,2100.0
3,2017-06-26T23:54:15Z,,,Pleiades,Casa dei Ceii,https://pleiades.stoa.org/places/999909607/nam...,name.2017-06-26.2201864114,999909607,geographic,it,Casa dei Ceii,Casa dei Ceii,,,certain,accurate,complete,1700.0,2100.0
4,2020-12-26T17:59:50Z,,,Pleiades,House of L. Ceius Secundus,https://pleiades.stoa.org/places/999909607/hou...,house-of-l-ceius-secundus,999909607,geographic,en,House of L. Ceius Secundus,House of L. Ceius Secundus,,,certain,accurate,complete,,


In [25]:
# quick example: find 'Roma' in places DataFrame
places_df.loc[places_df['title'] == 'Roma']

Unnamed: 0,created,description,details,provenance,title,uri,id,representative_latitude,representative_longitude,bounding_box_wkt
21483,2018-06-07T19:48:13Z,The capital of the Roman Republic and Empire.,<p>The Barrington Atlas Directory notes: Roma/...,Barrington Atlas: BAtlas 43 B2 Roma,Roma,https://pleiades.stoa.org/places/423025,423025,41.891775,12.486137,"POLYGON ((12.486137 41.891775, 12.486137 41.89..."


In [26]:
# quick example: find 'Rome' in names DataFrame
names_df.loc[names_df['romanized_form_1'] == 'Rome']

Unnamed: 0,created,description,details,provenance,title,uri,id,place_id,name_type,language_tag,attested_form,romanized_form_1,romanized_form_2,romanized_form_3,association_certainty,transcription_accuracy,transcription_completeness,year_after_which,year_before_which
20810,2012-02-04T23:39:48Z,The modern English appellation,,Barrington Atlas: BAtlas 43 B2 Roma,Rome,https://pleiades.stoa.org/places/423025/rome,rome,423025,geographic,en,Rome,Rome,,,certain,accurate,complete,1700.0,2100.0


## Extract data from Pleiades data
For each location we identified in the text, we will extract the longitude, latitude, and a description. First we need to find each location in the Pleiades data.

In [27]:
def get_pleiades_id(location):
    """
    Checks each name column in the names.csv data, in order, for the location.
    Returns the place id from the first column where exactly one row matches.
    Returns None if no column gives a single match.
    """
    name_columns = ['attested_form', 'romanized_form_1', 'romanized_form_2', 'romanized_form_3']
    for column in name_columns:
        name_row = names_df.loc[names_df[column] == location]
        if len(name_row) == 1:
            return int(name_row.place_id.iloc[0])
    return None
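
The lookup above accepts a name only when it matches exactly one row, which avoids guessing between places that share a name. The sketch below illustrates the same idea with plain Python structures: it builds a name-to-ids index once and returns an id only when the name is unambiguous. `build_name_index` and `lookup_unique_id` are hypothetical helpers, and the two 'Heraclea' rows use invented ids purely to show an ambiguous case. Note one difference from the function above: the index pools all four name columns together rather than checking them one at a time.

```python
from collections import defaultdict

NAME_COLUMNS = ['attested_form', 'romanized_form_1', 'romanized_form_2', 'romanized_form_3']

def build_name_index(name_rows):
    """Map every known name to the set of place ids that use it."""
    index = defaultdict(set)
    for row in name_rows:
        for column in NAME_COLUMNS:
            name = row.get(column)
            if name:  # skip missing or empty names
                index[name].add(row['place_id'])
    return index

def lookup_unique_id(index, location):
    """Return the place id only if the name points to exactly one place."""
    place_ids = index.get(location, set())
    if len(place_ids) == 1:
        return next(iter(place_ids))
    return None

name_rows = [
    {'place_id': 423025, 'attested_form': 'Rome', 'romanized_form_1': 'Rome'},
    {'place_id': 993, 'attested_form': None, 'romanized_form_1': 'Gallia'},
    # invented ids, only to illustrate a name shared by two places
    {'place_id': 1, 'romanized_form_1': 'Heraclea'},
    {'place_id': 2, 'romanized_form_1': 'Heraclea'},
]

name_index = build_name_index(name_rows)
rome_id = lookup_unique_id(name_index, 'Rome')
heraclea_id = lookup_unique_id(name_index, 'Heraclea')
italy_id = lookup_unique_id(name_index, 'Italy')
```

Here `rome_id` is a single id, while the ambiguous and the unknown name both come back as `None`, which is the same signal the notebook later turns into `NaN`.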

In [28]:
# apply the above function to each row in our location count DataFrame and then
# add a new column with the Pleiades id
location_count_df['pleiades_id'] = location_count_df['place_name'].apply(get_pleiades_id)

In [29]:
# preview new location count DataFrame.
# the NaN means we were unable to find the location in the Pleiades data.
location_count_df.head()

Unnamed: 0,place_name,count,pleiades_id
0,Italy,15,
1,Verona,1,383816.0
2,Rome,27,423025.0
3,Gaul,1,
4,Spain,3,


In [30]:
# we can drop the rows with NaN values
location_count_df = location_count_df.dropna().reset_index(drop=True)

In [31]:
# preview updated location count DataFrame
location_count_df.head()

Unnamed: 0,place_name,count,pleiades_id
0,Verona,1,383816.0
1,Rome,27,423025.0
2,Ravenna,3,393480.0
3,Sarmatia,2,825371.0
4,Marcianopolis,1,216878.0


Now that we have a `pleiades_id` from names.csv for each location, we can use it to get more data from places.csv. It would be possible to combine the functions below into one, but I have separated them out for clarity.

In [32]:
def get_description(pleiades_id):
    """return description from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.description.iloc[0]

In [33]:
def get_uri(pleiades_id):
    """return uri from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.uri.iloc[0]

In [34]:
def get_latitude(pleiades_id):
    """return latitude from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.representative_latitude.iloc[0]

In [36]:
# Challenge: Can you write a function to get the longitude data?
def get_longitude(pleiades_id):
    """return longitude from a pleiades id"""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row.representative_longitude.iloc[0]
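
The four functions above differ only in the column they return, so they could be folded into one helper that takes the column name as a parameter. The sketch below is one way to do that, not the notebook's own approach: `get_place_field` is a hypothetical name, and the small `places_df` here is a stand-in built from two rows shown elsewhere in this notebook so that the example runs on its own.

```python
import pandas as pd

# stand-in for the Pleiades places table, using values shown in this notebook
places_df = pd.DataFrame({
    'id': [423025, 383816],
    'uri': ['https://pleiades.stoa.org/places/423025',
            'https://pleiades.stoa.org/places/383816'],
    'representative_latitude': [41.891775, 45.44213],
    'representative_longitude': [12.486137, 10.995736],
})

def get_place_field(pleiades_id, field):
    """Return one column value for a Pleiades id, or None unless exactly one row matches."""
    places_row = places_df.loc[places_df['id'] == pleiades_id]
    if len(places_row) == 1:
        return places_row[field].iloc[0]
    return None

rome_latitude = get_place_field(423025.0, 'representative_latitude')
verona_uri = get_place_field(383816.0, 'uri')
missing = get_place_field(1, 'uri')
```

With a DataFrame column of ids, the helper could be applied with a small lambda, for example `.apply(lambda i: get_place_field(i, 'uri'))`.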




In [37]:
# add new column for description
location_count_df['description'] = location_count_df['pleiades_id'].apply(get_description)

In [41]:
# Challenge: can you write the code to add a column for the uri?
location_count_df['uri'] = location_count_df['pleiades_id'].apply(get_uri)
location_count_df

Unnamed: 0,place_name,count,pleiades_id,description,uri
0,Verona,1,383816.0,Verona was an ancient settlement that became a...,https://pleiades.stoa.org/places/383816
1,Rome,27,423025.0,The capital of the Roman Republic and Empire.,https://pleiades.stoa.org/places/423025
2,Ravenna,3,393480.0,A city of northern Adriatic Italy that served ...,https://pleiades.stoa.org/places/393480
3,Sarmatia,2,825371.0,"An ancient place, cited: BAtlas 84 E3 Sarmatia",https://pleiades.stoa.org/places/825371
4,Marcianopolis,1,216878.0,"An ancient place, cited: BAtlas 22 E5 Marciano...",https://pleiades.stoa.org/places/216878
5,Tarragona,1,246349.0,An Iberian site in contact with Greek and Phoe...,https://pleiades.stoa.org/places/246349
6,Milan,1,383706.0,"A Celtic settlement and crossroads, later a We...",https://pleiades.stoa.org/places/383706
7,Macedonia,1,491656.0,The region of ancient Macedonia in the Balkan ...,https://pleiades.stoa.org/places/491656
8,Trebizond,3,857359.0,An ancient settlement of Asia Minor on the sou...,https://pleiades.stoa.org/places/857359
9,Pontus,3,857287.0,A region of northeastern Anatolia consisting o...,https://pleiades.stoa.org/places/857287


In [43]:
# add new column for latitude
location_count_df['latitude'] = location_count_df['pleiades_id'].apply(get_latitude)

In [44]:
# Challenge: can you write the code to add a column for the longitude?
location_count_df['longitude'] = location_count_df['pleiades_id'].apply(get_longitude)
location_count_df

Unnamed: 0,place_name,count,pleiades_id,description,uri,longitude,latitude
0,Verona,1,383816.0,Verona was an ancient settlement that became a...,https://pleiades.stoa.org/places/383816,10.995736,45.44213
1,Rome,27,423025.0,The capital of the Roman Republic and Empire.,https://pleiades.stoa.org/places/423025,12.486137,41.891775
2,Ravenna,3,393480.0,A city of northern Adriatic Italy that served ...,https://pleiades.stoa.org/places/393480,12.196604,44.415718
3,Sarmatia,2,825371.0,"An ancient place, cited: BAtlas 84 E3 Sarmatia",https://pleiades.stoa.org/places/825371,39.5,45.5
4,Marcianopolis,1,216878.0,"An ancient place, cited: BAtlas 22 E5 Marciano...",https://pleiades.stoa.org/places/216878,27.585033,43.22504
5,Tarragona,1,246349.0,An Iberian site in contact with Greek and Phoe...,https://pleiades.stoa.org/places/246349,1.258338,41.116892
6,Milan,1,383706.0,"A Celtic settlement and crossroads, later a We...",https://pleiades.stoa.org/places/383706,9.18806,45.463746
7,Macedonia,1,491656.0,The region of ancient Macedonia in the Balkan ...,https://pleiades.stoa.org/places/491656,21.75,41.25
8,Trebizond,3,857359.0,An ancient settlement of Asia Minor on the sou...,https://pleiades.stoa.org/places/857359,39.723312,41.004269
9,Pontus,3,857287.0,A region of northeastern Anatolia consisting o...,https://pleiades.stoa.org/places/857287,34.742551,43.078685


Now that we have all the data we need, I am going to make a few little changes to the DataFrame.

In [45]:
# now that we have a uri we don't need the pleiades_id
location_count_df = location_count_df.drop(columns=['pleiades_id'])

In [46]:
# for our purposes we don't really need an index, so I will make the place_name column the index
location_count_df.set_index('place_name', inplace=True)

In [47]:
# final sanity check
location_count_df

place_name,count,description,uri,longitude,latitude
Verona,1,Verona was an ancient settlement that became a...,https://pleiades.stoa.org/places/383816,10.995736,45.44213
Rome,27,The capital of the Roman Republic and Empire.,https://pleiades.stoa.org/places/423025,12.486137,41.891775
Ravenna,3,A city of northern Adriatic Italy that served ...,https://pleiades.stoa.org/places/393480,12.196604,44.415718
Sarmatia,2,"An ancient place, cited: BAtlas 84 E3 Sarmatia",https://pleiades.stoa.org/places/825371,39.5,45.5
Marcianopolis,1,"An ancient place, cited: BAtlas 22 E5 Marciano...",https://pleiades.stoa.org/places/216878,27.585033,43.22504
Tarragona,1,An Iberian site in contact with Greek and Phoe...,https://pleiades.stoa.org/places/246349,1.258338,41.116892
Milan,1,"A Celtic settlement and crossroads, later a We...",https://pleiades.stoa.org/places/383706,9.18806,45.463746
Macedonia,1,The region of ancient Macedonia in the Balkan ...,https://pleiades.stoa.org/places/491656,21.75,41.25
Trebizond,3,An ancient settlement of Asia Minor on the sou...,https://pleiades.stoa.org/places/857359,39.723312,41.004269
Pontus,3,A region of northeastern Anatolia consisting o...,https://pleiades.stoa.org/places/857287,34.742551,43.078685


## Save location data for further use

In [49]:
# create path and file name variables
path = os.getcwd() # <-- set path variable (not necessary for Colab)
file_name = 'location_data.csv' # <-- set file_name variable

In [50]:
# save DataFrame to a .csv file
# place_name is now the index, so the index must be written out to keep the place names
location_count_df.to_csv(os.path.join(path, file_name))

In [51]:
# Colab only: the files module must be imported before it can be used
from google.colab import files
files.download(file_name)