# NLP Workshop
2019-11-07

## Hello!

In this notebook, we provide an introductory overview of named entity extraction on unstructured textual data using a contemporary NLP pipeline. We are primarily interested in the study of literary texts, but the overall pipeline should be applicable to any kind of (English) text.

**TO COMPLETE - More on NER?**

# Intro to `spaCy`

In the field of NLP, there are a number of competing frameworks such as NLTK, Stanford's CoreNLP, and UDPipe - to name only a few well-known examples.

In order to sidestep a discussion of the relative merits of each of these frameworks, we'll be working with the NLP package known as <i>spaCy</i>, which bills itself as 'industrial strength natural language processing'.

To begin, we import spaCy and load the English language model, which you should have downloaded in advance.

(Press Shift+Enter to run cells and work through the notebook!)


In [21]:
import spacy
import en_core_web_sm
nlp = spacy.load("en_core_web_sm")

With the model now loaded, we can begin to do some very simple NLP tasks. 

Here, we create a spaCy object called a <i>doc</i>. This doc comprises smaller objects of two kinds - <i>tokens</i> and <i>entities</i>. These objects, in turn, have certain attributes and methods attached to them. A full outline of what is available can be found on the spaCy website.

In this case, for all token objects, let's return the token itself (token.text); its Part-of-Speech tag (token.pos_); and the grammatical dependency relations between the tokens (token.dep_).

In [4]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


We can inspect named entities in a similar manner. 

For each entity, we can see the entity itself (entity.text) and the type of entity it is (entity.label_).

In [5]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


spaCy has returned three entities in this case - an organisation (ORG); a geopolitical entity (GPE); and a term describing a quantity of money (MONEY).

## The corpus

For this workshop, we'll be working with a pre-defined corpus of literary texts, which can be found on the workshop's GitHub repo.

You'll notice that the filenames of these novels feature regular descriptive patterns and contain metadata related to the novels. The pattern is as follows:

<i>Nationality-Author-Novel-Year-Gender.txt</i>

This means that we can easily read the filename 'B-Bronte_E-Wuthering_Heights-1847-F.txt' as being the 1847 novel <i>Wuthering Heights</i> by the British, female author Emily Brontë.
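
Because the pattern is regular, the metadata can also be read programmatically. The following is a minimal sketch, assuming every filename has exactly five hyphen-separated fields as in the pattern above; `NovelMeta` and `parse_filename` are hypothetical helper names, not part of the workshop code.

```python
from typing import NamedTuple


class NovelMeta(NamedTuple):
    """Metadata encoded in a corpus filename (hypothetical helper type)."""
    nationality: str
    author: str
    title: str
    year: int
    gender: str


def parse_filename(filename: str) -> NovelMeta:
    """Split 'Nationality-Author-Novel-Year-Gender.txt' into its five fields."""
    stem = filename.rsplit('.', 1)[0]  # drop the '.txt' extension
    nationality, author, title, year, gender = stem.split('-')
    # Underscores stand in for spaces in the title part of the filename
    return NovelMeta(nationality, author, title.replace('_', ' '), int(year), gender)


meta = parse_filename('B-Bronte_E-Wuthering_Heights-1847-F.txt')
print(meta.author, meta.title, meta.year)
```

Keeping the fields in a named tuple means later steps could group or filter results by nationality, year or gender without re-parsing the filename.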

### **NB!**
While spaCy often comes top of the pack relative to other NLP frameworks in terms of speed, tagging long texts is still a computationally expensive task. With that in mind, we're going to work with smaller chunks of the full-length novels. For those of you interested in tagging the full texts in your own time, you can find them on GitHub.
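
If you do want to work through a full-length text in manageable pieces, one option is to split it on paragraph breaks. The workshop does not say how its shortened texts were produced, so the following is only a sketch of one possible approach; `chunk_text` and its `max_chars` default are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 100_000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks, current, size = [], [], 0
    for para in text.split('\n\n'):
        added = len(para) + (2 if current else 0)  # +2 for the '\n\n' separator
        if current and size + added > max_chars:
            # Adding this paragraph would overflow, so close the current chunk
            chunks.append('\n\n'.join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append('\n\n'.join(current))
    return chunks


sample = 'aaaa\n\nbbbb\n\ncccc'
print(chunk_text(sample, max_chars=10))
```

Splitting on paragraph boundaries rather than at a fixed character offset avoids cutting a sentence, and therefore a possible entity, in half.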

In [15]:
import requests

base_url = 'https://raw.githubusercontent.com/centre-for-humanities-computing/NER_workshop/master/texts_short/'
filenames = [
  'A-Alcott-Little_Women-1868-F.txt',
  'A-Cather-Antonia-1918-F.txt',
  'A-Chesnutt-Marrow-1901-M.txt',
  'A-Chopin-Awakening-1899-F.txt',
  'A-Crane-Maggie-1893-M.txt',
  'A-Davis-Life_Iron_Mills-1861-F.txt',
  'A-Dreiser-Sister_Carrie-1900-M.txt',
  'A-Freeman-Pembroke-1894-F.txt',
  'A-Gilman-Herland-1915-F.txt',
  'A-Harper-Iola_Leroy-1892-F.txt',
  'A-Hawthorne-Scarlet_Letter-1850-M.txt',
  'A-Howells-Silas_Lapham-1885-M.txt',
  'A-James-Golden_Bowl-1904-M.txt',
  'A-Jewett-Pointed_Firs-1896-F.txt',
  'A-London-Call_Wild-1903-M.txt',
  'A-Melville-Moby_Dick-1851-M.txt',
  'A-Norris-Pit-1903-M.txt',
  'A-Stowe-Uncle_Tom-1852-F.txt',
  'A-Twain-Huck_Finn-1885-M.txt',
  'A-Wharton-Age_Innocence-1920-F.txt',
  'B-Austen-Pride_Prejudice-1813-F.txt',
  'B-Bronte_C-Jane_Eyre-1847-F.txt',
  'B-Bronte_E-Wuthering_Heights-1847-F.txt',
  'B-Burney-Evelina-1778-F.txt',
  'B-Conrad-Heart_Darkness-1902-M.txt',
  'B-Dickens-Bleak_House-1853-M.txt',
  'B-Disraeli-Sybil-1845-M.txt',
  'B-Eliot-Middlemarch-1869-F.txt',
  'B-Forster-Room_View-1908-M.txt',
  'B-Gaskell-North_South-1855-F.txt',
  'B-Gissing-Grub_Street-1893-M.txt',
  'B-Hardy-Tess-1891-M.txt',
  'B-Mitford-Our_Village-1826-F.txt',
  'B-Radcliffe-Mysteries_Udolpho-1794-F.txt',
  'B-Shelley-Frankenstein-1818-F.txt',
  'B-Stevenson-Treasure_Island-1883-M.txt',
  'B-Thackeray-Vanity_Fair-1848-M.txt',
  'B-Trollope-Live_Now-1875-M.txt',
  'B-Wells-Time_Machine-1895-M.txt',
  'B-Woolf-Mrs_Dalloway-1925-F.txt'
]

# Use a dictionary to store full texts, keyed to file name
texts = {}

for f in filenames:
    texts[f] = requests.get(base_url+f).text

## Run `spaCy` over corpus 

Now that the corpus has been loaded, we can run each of the novels through the spaCy pipeline.

We're interested primarily in locations, which in the spaCy framework correspond to geopolitical entities, i.e. those entities tagged GPE.

In [22]:
# Run NLP on texts
nlp.max_length = 10_000_000 # Max doc length in characters (an integer)

from collections import Counter

gpe_counts = {} # Dict to hold output entity counts

for t in texts:
    counts = Counter()               # Entity counts per doc
    doc = nlp(texts[t])              # Perform NLP using spaCy
    print(t)                         # Inform user of progress
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            counts[ent.text] += 1        # Increment by 1 for every GPE
    gpe_counts[t] = counts

A-Alcott-Little_Women-1868-F.txt
A-Cather-Antonia-1918-F.txt
A-Chesnutt-Marrow-1901-M.txt
A-Chopin-Awakening-1899-F.txt
A-Crane-Maggie-1893-M.txt
A-Davis-Life_Iron_Mills-1861-F.txt
A-Dreiser-Sister_Carrie-1900-M.txt
A-Freeman-Pembroke-1894-F.txt
A-Gilman-Herland-1915-F.txt
A-Harper-Iola_Leroy-1892-F.txt
A-Hawthorne-Scarlet_Letter-1850-M.txt
A-Howells-Silas_Lapham-1885-M.txt
A-James-Golden_Bowl-1904-M.txt
A-Jewett-Pointed_Firs-1896-F.txt
A-London-Call_Wild-1903-M.txt
A-Melville-Moby_Dick-1851-M.txt
A-Norris-Pit-1903-M.txt
A-Stowe-Uncle_Tom-1852-F.txt
A-Twain-Huck_Finn-1885-M.txt
A-Wharton-Age_Innocence-1920-F.txt
B-Austen-Pride_Prejudice-1813-F.txt
B-Bronte_C-Jane_Eyre-1847-F.txt
B-Bronte_E-Wuthering_Heights-1847-F.txt
B-Burney-Evelina-1778-F.txt
B-Conrad-Heart_Darkness-1902-M.txt
B-Dickens-Bleak_House-1853-M.txt
B-Disraeli-Sybil-1845-M.txt
B-Eliot-Middlemarch-1869-F.txt
B-Forster-Room_View-1908-M.txt
B-Gaskell-North_South-1855-F.txt
B-Gissing-Grub_Street-1893-M.txt
B-Hardy-Tess-1891-M.

We can then transform the results into a pandas dataframe for easy manipulation and exploration.



In [17]:
# Examine some output
import pandas as pd
# Create dataframe from dict
locations = pd.DataFrame.from_dict(gpe_counts, orient='index')
# Stack df and reset index
locations = locations.stack().rename('occurs').astype(int)
locations.index.set_names(['text', 'location'], inplace=True)
locations = locations.to_frame().reset_index()
print(locations.shape)

(755, 3)
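
The reshaping in the cell above goes from one column per location to one row per (text, location) pair. As a rough picture of what that long format contains, here is the same idea in plain Python. The counts in `toy_counts` are made up for illustration and are not workshop output.

```python
from collections import Counter

# Toy stand-in for gpe_counts (made-up numbers, not workshop output)
toy_counts = {
    'novel_a.txt': Counter({'London': 3, 'Paris': 1}),
    'novel_b.txt': Counter({'London': 2}),
}

# One record per (text, location) pair that actually occurs
rows = [
    {'text': text, 'location': location, 'occurs': n}
    for text, counts in toy_counts.items()
    for location, n in counts.items()
]

for row in rows:
    print(row)
```

Note that 'Paris' produces no row for `novel_b.txt`: only pairs that actually occur end up in the long table, which is why the dataframe has far fewer rows than texts multiplied by locations.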


Run the following cell a couple of times in order to see some random samples of the data. 

What do you think of the results? Do you notice any issues? Any patterns?

In [18]:
# Show a random sample of 10 entities
display(locations.sample(10))

Unnamed: 0,text,location,occurs
196,A-Melville-Moby_Dick-1851-M.txt,Greenland,12
410,B-Dickens-Bleak_House-1853-M.txt,Devon,1
728,B-Trollope-Live_Now-1875-M.txt,Islington,3
701,B-Trollope-Live_Now-1875-M.txt,Rome,1
687,B-Thackeray-Vanity_Fair-1848-M.txt,Podder,1
139,A-Howells-Silas_Lapham-1885-M.txt,France,1
661,B-Thackeray-Vanity_Fair-1848-M.txt,Porteus,1
512,B-Gaskell-North_South-1855-F.txt,Miss.,1
740,B-Wells-Time_Machine-1895-M.txt,Dresden,1
698,B-Thackeray-Vanity_Fair-1848-M.txt,Iphigenia,1


We can then group locations together and order them by total occurrences, in order to see which locations are mentioned most often across the corpus.

In [19]:
# Top locations
locations.groupby('location').occurs.sum().sort_values(ascending=False).head(10)

location
Charlotte     142
Rebecca       107
London        102
Ada            97
Drouet         91
\n             78
Valancourt     72
Rome           54
Amelia         53
England        51
Name: occurs, dtype: int64

In some cases, locations are just not very informative. For example, they may not occur regularly, or may occur in only a small number of texts in the corpus.

What happens if we choose to filter the results based on user-defined thresholds?

In [20]:
# Cull locations that are infrequent or occur in too few books
min_occurs = 5 # Must occur at least this many times total in the corpus
min_vols = 3   # Must occur in at least this many volumes

# Create a new dataframe, leaving the original unmodified
place_counts = pd.DataFrame(locations.groupby('location').occurs.sum())
place_counts['volumes'] = locations.groupby('location').occurs.size()

print("Total location occurrences before culling:", place_counts.occurs.sum())
print(place_counts.describe())

# Cull placenames using thresholds
place_counts = place_counts.loc[
                                (place_counts['occurs']>=min_occurs)
                                & (place_counts['volumes']>=min_vols)
                                ]
print("\nTotal location occurrences after culling:", place_counts.occurs.sum())
print(place_counts.describe())

Total location occurrences before culling: 2307
           occurs     volumes
count  547.000000  547.000000
mean     4.217550    1.380256
std     12.335445    2.067057
min      1.000000    1.000000
25%      1.000000    1.000000
50%      1.000000    1.000000
75%      3.000000    1.000000
max    142.000000   40.000000

Total location occurrences after culling: 665
           occurs    volumes
count   21.000000  21.000000
mean    31.666667   7.809524
std     35.951820   8.195237
min      5.000000   3.000000
25%     10.000000   4.000000
50%     16.000000   5.000000
75%     40.000000   8.000000
max    142.000000  40.000000
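
The two thresholds used above can be easier to reason about on a tiny example. The sketch below applies the same rule (a minimum total count and a minimum number of texts) in plain Python. The counts and the threshold values are made up for illustration.

```python
from collections import Counter

# Toy per-text counts (made-up numbers, not workshop output)
toy_counts = {
    'novel_a.txt': Counter({'London': 3, 'Paris': 1}),
    'novel_b.txt': Counter({'London': 2, 'Rome': 5}),
    'novel_c.txt': Counter({'Paris': 1}),
}
min_occurs, min_vols = 3, 2  # illustrative thresholds

totals = Counter()   # total mentions of each location across all texts
volumes = Counter()  # number of texts that mention each location
for counts in toy_counts.values():
    totals.update(counts)          # adds the per-text counts
    volumes.update(counts.keys())  # adds 1 per text in which the name appears

kept = {
    loc: (totals[loc], volumes[loc])
    for loc in totals
    if totals[loc] >= min_occurs and volumes[loc] >= min_vols
}
print(kept)
```

Here 'Rome' is frequent but confined to one text, and 'Paris' is spread across texts but rare, so both are dropped. The two thresholds filter out different kinds of noise.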


In [14]:
# Show culled data
place_counts.sort_values(by='occurs', ascending=False).head(10)

Unnamed: 0_level_0,occurs,volumes
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Charlotte,142,5
London,102,16
\n,78,40
Rome,54,8
England,51,14
New York,40,6
Venice,24,3
Paris,22,7
France,20,10
Brighton,20,3


Better, but not perfect. See, e.g., "Charlotte". There exists a well-known Charlotte, N.C. but in this case, Charlotte is almost certainly the name of a person, not a location.
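
One pragmatic response is to clean the entity strings and remove names that a reader has judged not to be places. The following is a minimal sketch under that assumption. `NOT_PLACES` is a hypothetical, hand-made stoplist and the counts are toy values loosely modelled on the table above.

```python
from collections import Counter

# Hypothetical stoplist: names a reader has judged to be people, not places.
# Extend it by hand after inspecting the results.
NOT_PLACES = {'Charlotte'}


def clean_counts(counts, stoplist=NOT_PLACES):
    """Normalise whitespace in entity names and drop empty or stoplisted ones."""
    cleaned = Counter()
    for name, n in counts.items():
        name = ' '.join(name.split())  # collapse whitespace, including newlines
        if name and name not in stoplist:
            cleaned[name] += n
    return cleaned


# Toy counts for illustration only
toy = Counter({'Charlotte': 142, 'London': 100, '\n': 78, 'London\n': 2})
print(clean_counts(toy))
```

A hand-made stoplist does not scale and encodes the reader's own judgement, so it is worth recording which names were removed and why.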

Why might it be the case that some entities are miscategorised?

Despite the small errors in the results, they give us enough to work with for the moment. In the next section, we'll assign geographical coordinates to the places we've extracted from the novels.

In [25]:
# Create a list of locations to geocode
locations_to_geocode = place_counts.index.to_list()
locations_to_geocode[:5]

['America', 'Brighton', 'Canada', 'Charlotte', 'Chicago']


## Geocoding

<i>Geocoding</i> is the term used to describe the process of assigning geographical coordinates to locations, along with any additional information that might be bound to that location.

As with NLP frameworks, there exists a bewildering diversity of APIs which can be used for geocoding. Each of these offers different levels of accuracy and scale.

Examples of geocoding providers:

* MapQuest
* Google Places ($)
* OpenCage Geocoder
* LocationIQ
* NetToolKit

In this workshop, we'll be demonstrating how to use both MapQuest and Google's geocoding service. Each of these comes with some unfortunate baggage. Both require you to sign up for an API key and both have usage limits. For MapQuest, this limit is around 15,000 queries a month.

For Google, you have $200 worth of query credits per month. (See bibliography for links to Google's pricing scheme). Beyond this, you will be billed for additional queries. Hence, Google requires a billing account when you sign up. 

For our purposes, we're not going to exceed these limits in the workshop. However, if you do not want to sign up to Google's geocoding API, that's fine - you can still work with MapQuest.
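
With larger corpora, one simple way to stay inside such limits is to query each distinct place name only once and reuse the answer. The sketch below shows the idea with a small cache. `make_cached` is a hypothetical helper, and `fake_lookup` merely stands in for a real geocoding request so that the example runs without an API key.

```python
def make_cached(lookup):
    """Wrap a lookup function so that each distinct name is queried only once."""
    cache = {}

    def cached(name):
        if name not in cache:
            cache[name] = lookup(name)  # only reached on a cache miss
        return cache[name]

    return cached


calls = []  # records every query that reaches the (fake) service


def fake_lookup(name):
    """Hypothetical stand-in for a real geocoding request."""
    calls.append(name)
    return {'query': name}


geocode = make_cached(fake_lookup)
results = [geocode(name) for name in ['London', 'Rome', 'London']]
print(len(results), 'results from', len(calls), 'queries')
```

This is also why the pipeline geocodes the list of unique location names rather than every individual mention in the texts.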

## Using `MapQuest`

First, we install the geocoder package. You can comment out the pip command if you're running this remotely and you have already installed the package in advance.

In [35]:
# Install and import geocoder package
!pip install geocoder
import geocoder



Next, upload your MapQuest API key, which you should have saved as mapquest-key.txt

In [36]:
# Upload MapQuest key (files.upload() comes from the Colab helper module)
from google.colab import files
mapquest_file = files.upload()

Saving mapquest-key.txt to mapquest-key.txt


In [0]:
# read key as string
mapquest_key = mapquest_file['mapquest-key.txt'].decode('utf-8').strip()

We're then able to query the MapQuest API by invoking geocoder.mapquest() on our predefined list of extracted locations.

In [0]:
# Perform batch geocode
mapquest_geocode_output = geocoder.mapquest(locations_to_geocode, key=mapquest_key, method='batch')

We then do some quick post-processing of the results. This is primarily to bring the output into line with the output from the Google API, allowing for reproducible results further down the line.

In [0]:
# Convert geocoder output to format matching Google
mapquest_geocode_data = []
for i in mapquest_geocode_output:
    try:
        result = {
            'formatted_address': i.address,
            'location_type': i.quality,
            'country': i.country,
            'admin_1': i.state,
            'admin_2': i.county,
            'locality': i.locality,
            'lat': i.lat,
            'lon': i.lng,
        }
        mapquest_geocode_data.append(result)
    except Exception:
        # Use getattr with a default in case `address` is the missing attribute
        print(f"Bad geocode for result {getattr(i, 'address', None)}")
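
The renaming step can also be written as a field map, which keeps the target schema in one place. The following is a minimal sketch rather than the workshop's method: `FIELD_MAP` and `normalise` are hypothetical names, and the `SimpleNamespace` object is a hand-made stand-in for a geocoder result so the example runs without an API key.

```python
from types import SimpleNamespace

# Target column name -> attribute name on the geocoder result
FIELD_MAP = {
    'formatted_address': 'address',
    'location_type': 'quality',
    'country': 'country',
    'admin_1': 'state',
    'admin_2': 'county',
    'locality': 'locality',
    'lat': 'lat',
    'lon': 'lng',
}


def normalise(result):
    """Return a dict in the shared schema, using None for missing attributes."""
    return {target: getattr(result, source, None)
            for target, source in FIELD_MAP.items()}


# Hand-made stand-in for one geocoder result (not a real API response)
fake = SimpleNamespace(address='Brighton', quality='CITY', country='GB',
                       lat=50.82204, lng=-0.137406)
row = normalise(fake)
print(row)
```

Because `normalise` always returns a row, even for an incomplete result, the list of rows stays the same length as the list of queried locations, which matters when that list is later used as the dataframe index.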

If we create a dataframe from these results and inspect the first few results, you'll see the kind of information available via geocoding. 

Not only do we have precise geographical coordinates, we have information about the type of location (city, state, country, etc), as well as more detailed local administrative information such as county.

In [55]:
# create DataFrame and inspect first 5 entries
mapquest_geodata = pd.DataFrame(mapquest_geocode_data, index=locations_to_geocode)
mapquest_geodata.head()

Unnamed: 0,formatted_address,location_type,country,admin_1,admin_2,locality,lat,lon
America,America,CITY,US,OK,McCurtain County,America,33.8153,-94.548302
Brighton,Brighton,CITY,GB,,East Sussex,Brighton,50.82204,-0.137406
Canada,CA,COUNTRY,CA,,,,56.511018,-105.908203
Charlotte,Charlotte,CITY,US,NC,Mecklenburg County,Charlotte,35.222936,-80.840161
Chicago,Chicago,CITY,US,IL,Cook County,Chicago,41.883229,-87.632398


Now that we have the geocoded data, we can merge these results with our earlier dataframe called locations which showed which novel the entity came from. 

This gives us a new dataframe which contains information about the text, the locations mentioned, the number of occurrences, and the new geocoding data.

In [59]:
# Assemble NER and geo data
mapquest_data = locations.merge(mapquest_geodata, left_on='location', right_index=True).dropna(subset=['formatted_address'])
display(mapquest_data.sample(5))
print("Total location occurrences:", mapquest_data.occurs.sum())

Unnamed: 0,text,location,occurs,formatted_address,location_type,country,admin_1,admin_2,locality,lat,lon
525,B-Austen-Pride_Prejudice-1813-F.txt,London,4,London,CITY,GB,,Westminster,London,51.507276,-0.12766
596,B-Stevenson-Treasure_Island-1883-M.txt,England,5,GB,STATE,GB,ENGLAND,,,52.795479,-0.54024
577,B-Gaskell-North_South-1855-F.txt,England,3,GB,STATE,GB,ENGLAND,,,52.795479,-0.54024
9,A-Alcott-Little_Women-1868-F.txt,America,4,America,CITY,US,OK,McCurtain County,America,33.8153,-94.548302
430,B-Dickens-Bleak_House-1853-M.txt,India,2,IN,COUNTRY,IN,,,,27.0858,80.314003


Total location occurrences: 461


In [60]:
# Group by precise location
mapquest_map_data = pd.DataFrame(mapquest_data.groupby(['formatted_address', 'lat', 'lon', 'location_type']).occurs.sum()).reset_index()
print(mapquest_map_data.shape)
display(mapquest_map_data.head())

(18, 5)


Unnamed: 0,formatted_address,lat,lon,location_type,occurs
0,America,33.8153,-94.548302,CITY,15
1,Brighton,50.82204,-0.137406,CITY,18
2,CA,56.511018,-105.908203,COUNTRY,6
3,Charlotte,35.222936,-80.840161,CITY,38
4,Chicago,41.883229,-87.632398,CITY,10


What do you think of these results? Does anything seem wrong? 

## Visualise on map

Leaving aside our issues and concerns right now, we can visualise the results by plotting them on a map of Earth. We do this using the Plotly package.

In [0]:
# Show locations on a map
import plotly.express as px
fig = px.scatter_geo(mapquest_map_data, lat='lat', lon='lon',
                     hover_name="formatted_address", size="occurs",
                     color='location_type',
                     projection="natural earth")
fig.show()

What do these results show? What might they tell you about the literary texts from which they are extracted?

## Using `Google Places`

The section that follows reproduces the same kind of pipeline as above. In this case, though, we're using Google's API rather than MapQuest's. The following few cells are more-or-less identical to those at the start of the MapQuest pipeline.

In [62]:
!pip install googlemaps
import googlemaps



In [64]:
# Upload personal key file
from google.colab import files
uploaded = files.upload()

Saving google-api.txt to google-api.txt


In [65]:
# Examine uploaded file(s)
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "google-api.txt" with length 39 bytes


In [66]:
# Save key content to variable for use later
google_key = uploaded['google-api.txt'].decode('utf-8').strip()
print(len(google_key))

39


To use Google's services, we need to initialise the Geocoding API and the Places API via the googlemaps package. Notice how, in this case, we're setting limits on the number of queries that we are performing every second!

In [0]:
# Set up googlemaps clients
gc_rate  =     50 # Geocoding queries per second
pl_rate  =      5 # Places queries per second

gc_client = googlemaps.Client(key=google_key, queries_per_second=gc_rate) # For Geocoding API
pl_client = googlemaps.Client(key=google_key, queries_per_second=pl_rate) # For Places API
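
The `queries_per_second` argument asks the client to throttle its own requests. How the googlemaps package does this internally is not covered here. The following is only a generic sketch of the idea, with a hypothetical `RateLimiter` class that sleeps just long enough between calls to respect a given rate.

```python
import time


class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate  # minimum seconds between two calls
        self.clock = clock              # injectable, which makes testing easy
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


limiter = RateLimiter(rate=50)  # same rate as gc_rate above
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # a real query would be issued after this returns
print(f"3 calls took at least {time.monotonic() - start:.3f} s")
```

A lower rate makes a batch take longer but reduces the chance of a service rejecting requests for arriving too quickly.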

We'll need some user-defined functions in order to manipulate the results from the API calls.

Due to time constraints, we probably won't be able to go into the inner workings of each function right now, but feel free to ask if you want to go into detail.

The multiline comment within each function explains how it works. If you're competent in Python, it should be fairly straightforward to follow the logic. Take a minute to read the multiline comments and see if you can see what's happening.

Does the logic seem reasonable to you? Can you see anything which might be problematic?

In [0]:
# Functions to work with Google maps client data
def get_placeid(string, api_client):
    '''Takes a string and an established googlemaps places API client.
       Returns first place_id associated with that string.
       Returns None if no place_id is found, the status is not "OK", or the API call fails.'''
    try:
        place = api_client.places(string)
        if place['status'] == 'OK':
            place_id = place['results'][0]['place_id']
        else:
            # Covers 'ZERO_RESULTS' and any other non-OK status
            place_id = None
    except Exception:
        place_id = None
    return place_id

def process_id(placeid, api_client):
    '''Takes a Google place_id and an established googlemaps geocoding API client.
        Looks up and parses geo data for placeid.
        Returns int code on error, else dictionary of geo data.
    '''
    # Define all fields, initialised to None
    result = {
    'formatted_address' : None,
    'location_type' : None,
    'country' : None,
    'admin_1' : None,
    'admin_2' : None,
    'locality' : None,
    'lat' : None,
    'lon' : None,
    'partial' : None,
    }
    # Perform reverse geocode.
    try:
        data = api_client.reverse_geocode(placeid)  # use the client passed in, not the global
    except:
        return 1 # Problem with geocoding API call
    
    # Use the first result. Should only be one when reverse geocoding with place_id.
    try:
        data = data[0]
        result['formatted_address'] = data['formatted_address']
        result['location_type'] = data['types'][0]
        result['lat'] = data['geometry']['location']['lat']
        result['lon'] = data['geometry']['location']['lng']
        # 'partial_match' lives on the API response (data), not on result
        result['partial'] = data.get('partial_match', False)
    except:
        print("   Bad geocode for place_id %s" % (placeid))
        return 2 # Problem with basic geocode result
    
    try:
        for addr_comp in data['address_components']:
            comp_type = addr_comp['types'][0]
            if comp_type == 'locality':
                result['locality'] = addr_comp['long_name']
            elif comp_type == 'country':
                result['country'] = addr_comp['long_name']
            elif comp_type == 'administrative_area_level_1':
                result['admin_1'] = addr_comp['long_name']
            elif comp_type == 'administrative_area_level_2':
                result['admin_2'] = addr_comp['long_name']
    except:
        return 3 # Problem with address components
    
    return result
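
The address-component loop is the part of `process_id` that is easiest to check without calling the API. Below is a minimal sketch that isolates that logic in a hypothetical `parse_components` function and runs it on a hand-written sample. The sample only mimics the fields the loop reads (`types` and `long_name`) and is not a real API response.

```python
def parse_components(components):
    """Pick out locality, country and admin areas from address components."""
    wanted = {
        'locality': 'locality',
        'country': 'country',
        'administrative_area_level_1': 'admin_1',
        'administrative_area_level_2': 'admin_2',
    }
    parsed = dict.fromkeys(wanted.values())  # every field starts as None
    for comp in components:
        key = wanted.get(comp['types'][0])   # only the first type is checked
        if key:
            parsed[key] = comp['long_name']
    return parsed


# Hand-written sample for illustration (not a real API response)
sample = [
    {'long_name': 'Charlotte', 'types': ['locality', 'political']},
    {'long_name': 'Mecklenburg County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'North Carolina',
     'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States', 'types': ['country', 'political']},
]
print(parse_components(sample))
```

As in `process_id`, only the first entry of `types` is inspected, so a component whose relevant type is listed second would be skipped. That is one of the design decisions worth questioning.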

From this point on, the code again resembles the MapQuest pipeline above. Notice how it was beneficial to reshape the MapQuest results in the first place, so that we can recycle code!

In [69]:
%%time
# Perform geocoding
google_geocode_output = {}
for loc in locations_to_geocode:
  placeid = get_placeid(loc, pl_client)
  if placeid:
    geo = process_id(placeid, gc_client)
    if isinstance(geo, dict):  # process_id returns an int code on error
      google_geocode_output[loc] = geo
print("Successful geocodes:", len(google_geocode_output.keys()))

Successful geocodes: 18
CPU times: user 127 ms, sys: 5.05 ms, total: 132 ms
Wall time: 13.4 s


In [71]:
# Examine output
google_geodata = pd.DataFrame.from_dict(google_geocode_output, orient='index')
google_geodata.head()

Unnamed: 0,formatted_address,location_type,country,admin_1,admin_2,locality,lat,lon,partial
America,United States,country,United States,,,,37.09024,-95.712891,False
Brighton,"5064 W 119th St, Leawood, KS 66209, USA",clothing_store,United States,Kansas,Johnson County,Leawood,38.914475,-94.644156,False
Canada,Canada,country,Canada,,,,56.130366,-106.346771,False
Charlotte,"Charlotte, NC, USA",locality,United States,North Carolina,Mecklenburg County,Charlotte,35.227087,-80.843127,False
Chicago,"Chicago, IL, USA",locality,United States,Illinois,Cook County,Chicago,41.878114,-87.629798,False


In [80]:
# Assemble NER and geo data
google_data = locations.merge(google_geodata, left_on='location', right_index=True).dropna(subset=['formatted_address'])
display(google_data.sample(5))
print("Total location occurrences:", google_data.occurs.sum())

Unnamed: 0,text,location,occurs,formatted_address,location_type,country,admin_1,admin_2,locality,lat,lon,partial
544,B-Radcliffe-Mysteries_Udolpho-1794-F.txt,France,9,France,country,France,,,,46.227638,2.213749,False
459,B-Woolf-Mrs_Dalloway-1925-F.txt,London,4,"London, UK",locality,United Kingdom,England,Greater London,London,51.507351,-0.127758,False
497,A-Chesnutt-Marrow-1901-M.txt,Washington,1,"Washington, USA",administrative_area_level_1,United States,Washington,,,47.751074,-120.740139,False
78,B-Eliot-Middlemarch-1869-F.txt,England,4,"England, UK",administrative_area_level_1,United Kingdom,England,,,52.355518,-1.17432,False
249,A-Stowe-Uncle_Tom-1852-F.txt,Providence,4,"1902 S 8th St, Rogers, AR 72758, USA",establishment,United States,Arkansas,Benton County,Rogers,36.311004,-94.127431,False


Total location occurrences: 461


In [82]:
# Group data by location for plotting
google_map_data = pd.DataFrame(google_data.groupby(['formatted_address', 'lat', 'lon', 'location_type']).occurs.sum()).reset_index()
print(google_map_data.shape)
google_map_data

(18, 5)


Unnamed: 0,formatted_address,lat,lon,location_type,occurs
0,"1902 S 8th St, Rogers, AR 72758, USA",36.311004,-94.127431,establishment,6
1,"5064 W 119th St, Leawood, KS 66209, USA",38.914475,-94.644156,clothing_store,18
2,Canada,56.130366,-106.346771,country,6
3,"Charlotte, NC, USA",35.227087,-80.843127,locality,38
4,"Chicago, IL, USA",41.878114,-87.629798,locality,10
5,"England, UK",52.355518,-1.17432,administrative_area_level_1,51
6,France,46.227638,2.213749,country,20
7,India,20.593684,78.96288,country,10
8,Italy,41.87194,12.56738,country,16
9,"London, UK",51.507351,-0.127758,locality,102


How do these results from Google Places compare to those from MapQuest?
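
One way to make the comparison concrete is to measure how far apart the two services place the same name. The sketch below uses the haversine formula on a spherical Earth (mean radius of 6371 km, an approximation) and the Brighton and London coordinates shown in the outputs above. `haversine_km` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points on a spherical Earth."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# (MapQuest lat/lon, Google lat/lon), copied from the outputs above
pairs = {
    'Brighton': ((50.82204, -0.137406), (38.914475, -94.644156)),
    'London': ((51.507276, -0.12766), (51.507351, -0.127758)),
}
for name, (mq, gg) in pairs.items():
    print(name, round(haversine_km(*mq, *gg), 2), 'km apart')
```

A small distance suggests the services agree on the referent. A very large one, as with Brighton, signals that at least one of them has resolved the name to something other than the place the novel means.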

## Visualise on map



In [0]:
# Show locations on a map
import plotly.express as px
fig = px.scatter_geo(google_map_data, lat='lat', lon='lon',
                     hover_name="formatted_address", size="occurs",
                     color='location_type',
                     projection="natural earth")
fig.show()

## Comparison

Good news - there's no more code! Bad news - now we have to contextualise!

In pairs or small groups, discuss what you've just done. Here are some conversation starters to get you going.

At a general level:
* How might these results be used in the analysis of literary texts?
* How might this pipeline be fruitfully employed on other texts? Are some texts more interesting than others? Why?
* In what ways might text metadata be used in this pipeline? What would that show?

On a more technical level:
* What design decisions are most important in determining the output? 
* How important are the errors in the entity extraction? How might we minimise them?
* Note the depth of additional geo data from both services. How might this affect future analysis built upon these outputs?

# Links to packages and further reading

### NLP/NER Frameworks
* spaCy - https://spacy.io
* Stanford CoreNLP (Java) - https://stanfordnlp.github.io/CoreNLP/
* Polyglot - https://polyglot.readthedocs.io/en/latest/
* NLTK - https://www.nltk.org/

### Geocoding frameworks
* Google Places - https://developers.google.com/places/web-service/intro
* MapQuest - https://developer.mapquest.com/
* OpenCage - https://opencagedata.com/

### Other packages used in this notebook (not including those from Python standard library):
* pandas - https://pandas.pydata.org/pandas-docs/stable/
* googlemaps - https://github.com/googlemaps/google-maps-services-python
* plotly - https://plot.ly/python/
* geocoder - https://geocoder.readthedocs.io/

### Other mapping libraries not used here:
* cartopy - https://github.com/SciTools/cartopy
* folium - https://python-visualization.github.io/folium/

### NaN, NaN
* Null Island - https://blogs.loc.gov/maps/2016/04/the-geographical-oddity-of-null-island/

# Goodbye!