Now that we have worked at the chapter level, let's move up to the book level! Let's map the geographic world of Jules Verne by extracting the GPE and LOC entities from **Around the World in 80 Days**.

# 1. We import our libraries

In [1]:
import spacy

# 2. We get our data

This data has not been cleaned or pre-processed, so it may confuse the parser (only the \r\n characters have been removed!)

In [2]:
with open("around_the_world.txt", "r", encoding="utf-8") as f:
    data = f.read()
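
The fused words that appear in the entity output of this notebook (for example `BurlingtonGardens` or `theEcclesiastical`) suggest that the line breaks were deleted rather than replaced by spaces. Here is a minimal sketch of a gentler normalization, assuming you have a raw copy of the text that still contains its line breaks. The helper name `normalize_line_breaks` and the sample string are illustrative only and are not part of the course material.

```python
import re

def normalize_line_breaks(raw_text):
    """Replace each run of line breaks (plus surrounding spaces/tabs) with one space."""
    # Note: paragraph breaks are flattened too, which is fine for entity extraction.
    return re.sub(r"[ \t]*(?:\r?\n)+[ \t]*", " ", raw_text).strip()

# Hypothetical sample with Windows-style line breaks
sample = "Saville Row, Burlington\r\nGardens, the house in which\r\nSheridan died in 1814."
print(normalize_line_breaks(sample))
```

With this approach, "Burlington" and "Gardens" stay separate tokens instead of being glued together.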

# 3. We import the English pipeline

In [3]:
nlp = spacy.load("en_core_web_sm")

# 4. We create the spaCy Doc object

In [4]:
doc = nlp(data)

# 5. We inspect the English model labels

Let's recall the entity labels that this spaCy model can assign:

In [5]:
nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

# 6. We print the entities

In [6]:
for ent in doc.ents:
    print(ent.text, ent.label_)

CHAPTER ORG
ONE CARDINAL
1872 DATE
Saville Row PERSON
BurlingtonGardens ORG
Sheridan PERSON
1814 DATE
the Reform Club ORG
Byron ORG
Byronic ORG
wasa GPE
Byron ORG
a thousand years DATE
Phileas Fogg wasa Londoner ORG
Bank ORG
London GPE
the Inns of Court ORG
Temple GPE
Gray’s Inn ORG
Court ORG
Exchequer ORG
theEcclesiastical Courts ORG
the Royal Institution ORG
LondonInstitution ORG
the Artisan’s Association ORG
the Institution of Arts andSciences ORG
English NORP
Harmonic LOC
theEntomologists ORG
Fogg PERSON
Reform ORG
Barings ORG
Fogg PERSON
daily DATE
thousand CARDINAL
second ORDINAL
many years DATE
Fogg PERSON
Fogg PERSON
SavilleRow ORG
hours TIME
Reform ORG
hours TIME
twenty-four CARDINAL
Saville Row GPE
mosaic PRODUCT
twenty CARDINAL
American NORP
Saville Row GPE
this very 2nd DATE
October DATE
James Forster PERSON
eighty-four CARDINAL
Fahrenheit PERSON
the housebetween eleven and half EVENT
the hours TIME
the minutes, the seconds TIME
the days DATE
the months DATE
the years DATE


# 7. We create one list with the GPE entities

While LOC is a label that may also contain interesting information, this is an introductory DH course, so let's just focus on GPE!

In [7]:
GPE = []

for ent in doc.ents:
    if ent.label_ == "GPE":
        GPE.append([ent.text, ent.label_])
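
If you later want to include LOC as well, the same loop can be generalized to accept several labels. This is a minimal sketch, not the course's method: the helper name `collect_entities` is hypothetical, and the stand-in objects below only mimic the `.text` and `.label_` attributes of spaCy entities so that the example runs without loading a model.

```python
from types import SimpleNamespace

def collect_entities(ents, labels=("GPE", "LOC")):
    """Return [text, label] pairs for entities whose label is in `labels`."""
    wanted = set(labels)
    return [[ent.text, ent.label_] for ent in ents if ent.label_ in wanted]

# Stand-ins for spaCy entities (assumption: only .text and .label_ are needed)
fake_ents = [
    SimpleNamespace(text="London", label_="GPE"),
    SimpleNamespace(text="Fogg", label_="PERSON"),
    SimpleNamespace(text="Harmonic", label_="LOC"),
]

places = collect_entities(fake_ents)
print(places)
```

In the notebook itself, you would call `collect_entities(doc.ents)` instead of passing the stand-in list.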

In [8]:
GPE

[['wasa', 'GPE'],
 ['London', 'GPE'],
 ['Temple', 'GPE'],
 ['Saville Row', 'GPE'],
 ['Saville Row', 'GPE'],
 ['Blondin', 'GPE'],
 ['France', 'GPE'],
 ['England', 'GPE'],
 ['theUnited Kingdom', 'GPE'],
 ['Passepartout', 'GPE'],
 ['Saville Row', 'GPE'],
 ['Tussaud', 'GPE'],
 ['London', 'GPE'],
 ['Leroy', 'GPE'],
 ['Paris', 'GPE'],
 ['England', 'GPE'],
 ['Saville Row', 'GPE'],
 ['Liverpool', 'GPE'],
 ['Glasgow', 'GPE'],
 ['Brindisi', 'GPE'],
 ['New York', 'GPE'],
 ['London', 'GPE'],
 ['Allahabad', 'GPE'],
 ['London', 'GPE'],
 ['Brindisi', 'GPE'],
 ['Bombay', 'GPE'],
 ['Bombay', 'GPE'],
 ['Hong Kong', 'GPE'],
 ['Hong Kong', 'GPE'],
 ['Yokohama', 'GPE'],
 ['Japan', 'GPE'],
 ['San Francisco', 'GPE'],
 ['New York', 'GPE'],
 ['New York', 'GPE'],
 ['London', 'GPE'],
 ['Hindoos', 'GPE'],
 ['’s', 'GPE'],
 ['Flanagan', 'GPE'],
 ['Dover', 'GPE'],
 ['London', 'GPE'],
 ['Saville Row', 'GPE'],
 ['Dover', 'GPE'],
 ['Calais', 'GPE'],
 ['Dover', 'GPE'],
 ['France', 'GPE'],
 ['Paris', 'GPE'],
 ['Paris', '

Now let's drop the duplicates in there!

In [9]:
GPE_places = []

# Keep only the entity text (a) and discard the label (b)
for a, b in GPE:
    GPE_places.append(a)

In [10]:
unique_GPE = set(GPE_places)
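
A set removes the duplicates, but it also discards how often each place is mentioned and the order in which places appear, both of which can be useful when mapping a journey. Here is a minimal sketch using `collections.Counter` from the standard library. The short list is just the first seven entries of the GPE output above, not the full result.

```python
from collections import Counter

# First seven entity texts from the GPE list above (noise included)
GPE_sample = ["wasa", "London", "Temple", "Saville Row", "Saville Row", "Blondin", "France"]

# Count how many times each place name occurs
place_counts = Counter(GPE_sample)
print(place_counts.most_common(1))

# Unique names in order of first appearance (Counter keeps insertion order)
unique_in_order = list(place_counts)
print(unique_in_order)
```

In the notebook, `Counter(GPE_places)` would give the counts for the whole book.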

In [11]:
unique_GPE

{'AFABULOUS',
 'Aden',
 'Allahabad',
 'America',
 'Aoudainto',
 'Arkansas',
 'Athens',
 'Aurungabad',
 'Bab-el-Mandeb',
 'Behar',
 'Benares',
 'Bengal',
 'Birmingham',
 'Blondin',
 'Bombay',
 'Bordeaux',
 'Brazil',
 'Brindisi',
 'Bundelcund',
 'Burhampoor',
 'Buxar',
 'Calais',
 'Calcutta',
 'Calcuttamay',
 'California',
 'Camerfield',
 'Chicago',
 'Chili',
 'China',
 'Cisco',
 'Colorado',
 'Columbus',
 'Denver',
 'Des Moines',
 'Dover',
 'Dublin',
 'Edinburgh',
 'Egypt',
 'Elephanta',
 'Elko',
 'England',
 'Fixseemed',
 'Flanagan',
 'Fogg',
 'Formosa',
 'Fort Saunders',
 'Fort Wayne',
 'France',
 'Frenchmen',
 'Frenchover',
 'Glasgow',
 'Golconda',
 'Gour',
 'Great Island',
 'Great Salt Lake',
 'Green Creek',
 'Hamburg',
 'Hewould',
 'Hindoos',
 'Holland',
 'Hong Kong',
 'Hong Kong andCalcutta',
 'Hong Kong?”“Why',
 'Hong Kong—”“I',
 'HongKong',
 'ITHong Kong',
 'Idon’t',
 'Illinois',
 'Independence',
 'India',
 'Indiana',
 'Indus',
 'Iowa',
 'Iowa City',
 'Japan',
 'Japanese Empire',
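
As the output shows, the set still contains noise such as fused words and stray punctuation. Here is a rough heuristic sketch for filtering out the most obvious cases. The function name `looks_like_place` and its rules are assumptions made for illustration, and they will not catch everything: an entry like `AFABULOUS` still slips through, and legitimate names with internal capitals would be wrongly rejected.

```python
import re

def looks_like_place(name):
    """Heuristic filter: letters, spaces, hyphens and periods only,
    and no lowercase letter directly followed by an uppercase one."""
    if not name or not name[0].isalpha():
        return False
    if not all(ch.isalpha() or ch in " -." for ch in name):
        return False
    # A lowercase-then-uppercase pair usually signals two fused words
    return re.search(r"[a-z][A-Z]", name) is None

# A few entries copied from the unique_GPE output above
candidates = [
    "Bab-el-Mandeb",
    "Hong Kong",
    "HongKong",
    "Hong Kong andCalcutta",
    "Hong Kong?”“Why",
    "Idon’t",
    "AFABULOUS",
]

kept = [name for name in candidates if looks_like_place(name)]
print(kept)
```

A manual review of the remaining names is still needed before putting them on a map.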

Let's save our values!

In [12]:
with open("GPE_aroundtheworld.txt", "w", encoding="utf-8") as f:
    f.write(str(unique_GPE))
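
Note that `str(unique_GPE)` writes Python's text representation of the set on a single line, which is awkward to read back later. An alternative is to write one place per line, as sketched below. The helper names `save_places` and `load_places`, the sample set, and the demo file in the temporary directory are all hypothetical.

```python
import os
import tempfile

def save_places(places, path):
    """Write one place name per line, sorted alphabetically."""
    with open(path, "w", encoding="utf-8") as f:
        for place in sorted(places):
            f.write(place + "\n")

def load_places(path):
    """Read the file back into a set, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}

# Small hypothetical sample, saved to a temporary location
sample_places = {"London", "Bombay", "Hong Kong"}
demo_path = os.path.join(tempfile.gettempdir(), "GPE_demo.txt")

save_places(sample_places, demo_path)
restored = load_places(demo_path)
print(restored == sample_places)
```

A file in this shape also opens cleanly in a spreadsheet or a geocoding tool.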

# Exercise 3