# NER Assignment
Georeferencing automatically detected place names in Edward Gibbon's *Decline and Fall of the Roman Empire* using the Pleiades gazetteer and `spaCy`.

Your assignment this week is to output a `CSV` file of place names, frequencies, and coordinates for a chapter of Gibbon of your choice, as we did in class. Try to find a chapter with a lot of places, as we will be turning this data into an online map next week. The steps are:

* Choose a chapter from the text
* Begin a function that parses the input text with `spaCy`
* Create a dictionary of place entities and their frequencies from the `spaCy` output
* Use the Pleiades data from class to find coordinates for each place name
* Run your function on your chosen chapter
* Save your final `CSV`
* Come to class on Monday, ready to use it
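
The counting step in the list above can be sketched without loading a model. This is a minimal illustration, assuming entities arrive as `(text, label)` pairs; the pairs below are hand-written stand-ins for `spaCy`'s `doc.ents`, and `count_places` and `PLACE_LABELS` are hypothetical names, not part of the class materials.

```python
from collections import Counter

# Hand-written (text, label) pairs standing in for spaCy's doc.ents.
# Real entities expose these values as entity.text and entity.label_.
sample_entities = [
    ("Rome", "GPE"),
    ("Trajan", "PERSON"),
    ("Britain", "GPE"),
    ("Rome", "GPE"),
    ("Euphrates", "LOC"),
]

# The two labels the assignment treats as places.
PLACE_LABELS = {"GPE", "LOC"}


def count_places(entities):
    """Count how often each place-like entity text occurs."""
    return Counter(text for text, label in entities if label in PLACE_LABELS)


place_freq = count_places(sample_entities)
print(place_freq.most_common())
```

A `Counter` does the same job as a `defaultdict(int)` here and adds `most_common()`, which can help when picking a chapter with many places.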

In [15]:
gibbon_by_chapter['StringText'][70]

"\nIn the last days of Pope Eugenius the Fourth, two of his servants, the learned Poggius and a friend, ascended the Capitoline Hill; reposed themselves among the ruins of columns and temples; and viewed, from that commanding spot, the wide and various prospect of desolation. The place and the object gave ample scope for moralising on the vicissitudes of fortune, which spares neither man nor the proudest of his works, which buries empires and cities in a common grave; and it was agreed that in proportion to her former greatness the fall of Rome was the more awful and deplorable. Her primeval state, such as she might appear in a remote age, when Evander entertained the stranger of Troy, has  been delineated by the fancy of Virgil. This Tarpeian rock was then a savage and solitary thicket: in the time of the poet, it was crowned with the golden roofs of a temple: the temple is overthrown, the gold has been pillaged, the wheel of fortune has accomplished her revolution, and the sacred gro

In [26]:
# import and download any relevant data
!python -m spacy download en_core_web_sm
import collections
import os

import pandas as pd
import spacy
import wget

# download the Gibbon text only if it is not already present
if not os.path.isfile('gibbon_text.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbon_text.csv')

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [27]:
# any helper functions you may need
# make sure to run 'import collections'

# load the text (one row per chapter) and the spaCy model before using them
gibbon_by_chapter = pd.read_csv('gibbon_text.csv').rename(columns={'Unnamed: 0':'chapter'})
nlp = spacy.load('en_core_web_sm')

# chapter = ''  # some string
first_chapter = gibbon_by_chapter['StringText'][0]
first_chapter_doc = nlp(first_chapter)
for entity in first_chapter_doc.ents:
    # GPE and LOC are the spaCy labels treated as places here
    if (entity.label_ == 'GPE') or (entity.label_ == 'LOC'):
        print(entity.text)


def get_pleiades_id(term):
    """
    Checks each name column of the names.csv data in turn for an exact match.
    Returns the place ID from the first column with exactly one matching row,
    or None if no column matches uniquely.
    Expects a module-level `names` DataFrame.
    """
    for column in ['attested_form', 'romanized_form_1', 'romanized_form_2', 'romanized_form_3']:
        name_row = names.loc[names[column] == term]
        if len(name_row) == 1:
            return int(name_row.place_id.iloc[0])
    return None


def get_lat(pl_id):
    # returns None unless exactly one row of `places` has this ID
    places_row = places.loc[places['id'] == pl_id]
    if len(places_row) == 1:
        return places_row.representative_latitude.iloc[0]


def get_long(pl_id):
    # returns None unless exactly one row of `places` has this ID
    places_row = places.loc[places['id'] == pl_id]
    if len(places_row) == 1:
        return places_row.representative_longitude.iloc[0]

Rome
Nerva
Trajan
Marcus Antoninus
Rome
Rome
Aethiopia
Europe
Germany
the Atlantic Ocean
Euphrates
Arabia
Africa
Britain
Britain
Boadicea
Britain
Ireland
Britain
Scotland
Antoninus Pius
Antoninus
Edinburgh
gloomy hills
Rome
Trajan
the Euxine Sea
Trajan
Philip
Tigris
Armenia
the Persian gulf
Arabia
India
Bosphorus
Colchos
Iberia
Albania
Armenia
Mesopotamia
Assyria
Jupiter
Jupiter
Armenia
Mesopotamia
Assyria
Euphrates
Trajan
Antoninus
Caledonia
the Upper Egypt
Italy
Rome
Antoninus
Marcus
Euphrates
Europe
Rome
Italy
Spain
East
oblong
Rome
Britain
Lower
the Upper Germany
Noricum
Pannonia
Maesia
Syria
Egypt
Africa
Spain
Italy
Marseilles
Mediterranean
Italy
Misenum
Naples
Liburnians
Misenum
Mediterranean
Provence
Britain
Spain
Europe
Mediterranean
the Atlantic Ocean
Lusitania
Baetica
Portugal
East
North
Grenada
Andalusia
Baetica
Spain
Asturias
Biscay
Castilles
Murcia
Valencia
Catalonia
Arragon
Tarragona
Rome
Alps
Rhine
Ocean
France
Alsace
Switzerland
Rhine
Liege
Luxemburg
Mediterranean
Langu
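
Several entities in the output above carry a leading article ("the Atlantic Ocean", "the Euxine Sea", "the Upper Egypt"), and an exact-match lookup against a gazetteer column may not find them in that form. A minimal sketch of one possible cleanup step, assuming that dropping a leading "the" is acceptable for your chapter; `normalize_place_name` is a hypothetical helper, not something from class.

```python
def normalize_place_name(text):
    """Strip surrounding whitespace and one leading 'the ' (any case)."""
    cleaned = text.strip()
    if cleaned.lower().startswith("the "):
        cleaned = cleaned[4:].lstrip()
    return cleaned


for raw in ["the Atlantic Ocean", "the Euxine Sea", "Rome"]:
    print(normalize_place_name(raw))
```

Names that merely begin with the letters "The", such as "Thebes", are left alone because the check requires a space after the article.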

In [28]:
# final functions
def get_locations(chapter, output_name):
    # the helper functions look up `places` and `names` as module-level
    # variables, so the tables are loaded into the global namespace here
    global places, names
    chapter_doc = nlp(chapter)
    place_freq = collections.defaultdict(int)
    for entity in chapter_doc.ents: # use the chapter passed in, not first_chapter_doc
        if (entity.label_ == 'GPE') or (entity.label_ == 'LOC'):
            place_freq[entity.text] += 1 # the utility of defaultdict!
    place_freq = dict(place_freq)
    place_freq_df = pd.DataFrame.from_dict(place_freq, orient='index').reset_index().rename(columns={'index':'place_name',0:'frequency'})
    places = pd.read_csv('places.csv')
    names = pd.read_csv('names.csv')
    place_freq_df['pleiades_id'] = place_freq_df['place_name'].apply(get_pleiades_id)
    # keep only the places that matched a Pleiades ID
    place_freq_final = place_freq_df.dropna().reset_index(drop=True)
    place_freq_final['lat'] = place_freq_final['pleiades_id'].apply(get_lat)
    place_freq_final['long'] = place_freq_final['pleiades_id'].apply(get_long)
    place_freq_final.to_csv(f'{output_name}.csv')
    return place_freq_final
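
The coordinate lookup in `get_locations` filters the places table once per row for latitude and again for longitude. As an alternative, here is a minimal sketch that attaches both columns with a single `merge`. The tables are toy data: every name, ID, and coordinate below is invented for illustration and is not a real Pleiades record, and `add_coordinates` is a hypothetical helper.

```python
import pandas as pd

# Toy tables: names, IDs, and coordinates are invented for illustration.
freq_df = pd.DataFrame({
    'place_name': ['Alpha', 'Beta', 'Gamma'],
    'frequency': [3, 1, 2],
    'pleiades_id': [1, 2, None],  # None: no gazetteer match
})
places_df = pd.DataFrame({
    'id': [1, 2],
    'representative_latitude': [10.0, 20.0],
    'representative_longitude': [30.0, 40.0],
})


def add_coordinates(freq_df, places_df):
    """Attach lat/long to every matched place with one merge."""
    # Drop unmatched places and restore integer IDs (NaN forces floats).
    matched = freq_df.dropna(subset=['pleiades_id']).copy()
    matched['pleiades_id'] = matched['pleiades_id'].astype(int)
    coords = (
        places_df
        # Skip IDs that appear more than once, mirroring the
        # "exactly one row" check in get_lat and get_long.
        .drop_duplicates(subset='id', keep=False)
        .rename(columns={
            'id': 'pleiades_id',
            'representative_latitude': 'lat',
            'representative_longitude': 'long',
        })[['pleiades_id', 'lat', 'long']]
    )
    return matched.merge(coords, on='pleiades_id', how='left')


with_coords = add_coordinates(freq_df, places_df)
print(with_coords)
```

The left join keeps every matched place even when the places table has no usable row for its ID, in which case `lat` and `long` are left empty.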

In [31]:
# try your function out (places.csv and names.csv, the Pleiades data from class, must be in the working directory)
get_locations(gibbon_by_chapter['StringText'][38], 'gibbon_chapter_38_places')

FileNotFoundError: [Errno 2] No such file or directory: 'names.csv'
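
The error above means `names.csv` is not in the notebook's working directory. A minimal sketch of a check to run before calling `get_locations`, reusing the `os.path.isfile` pattern from the import cell; `find_missing` and `REQUIRED_FILES` are hypothetical names, and the file list assumes the two Pleiades tables are the only inputs the function reads.

```python
import os

# The two tables get_locations reads with pd.read_csv.
REQUIRED_FILES = ["places.csv", "names.csv"]


def find_missing(paths):
    """Return the paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]


missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All data files found.")
```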

In [30]:
# save result as CSV
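
A minimal sketch of the save step, assuming the result is a `pandas` DataFrame with the columns `get_locations` produces. The rows below are invented stand-ins, and an in-memory buffer replaces a file on disk so the round trip can be checked without writing anything.

```python
import io

import pandas as pd

# Invented rows standing in for the DataFrame returned by get_locations.
result = pd.DataFrame({
    'place_name': ['Alpha', 'Beta'],
    'frequency': [3, 1],
    'pleiades_id': [1, 2],
    'lat': [10.0, 20.0],
    'long': [30.0, 40.0],
})

buffer = io.StringIO()  # in-memory stand-in for a file on disk
result.to_csv(buffer, index=False)

# Read the CSV back to confirm the columns survive the round trip.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

With a real file, the call would be something like `result.to_csv('gibbon_chapter_38_places.csv', index=False)`. Without `index=False`, the row index is written as an extra unnamed column, which is the kind of `Unnamed: 0` column renamed when `gibbon_text.csv` is loaded.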
