### Mapping Locations in Poetry
I went down a rabbit hole with this assignment. I assumed that there might be a library that would allow me to extract locations from text which I could then map.
This proved to be a much more complex problem then I initially thought. SpaCy and the NLTK libraries are both capable of recognizing locations in text, but they were not able to contextualize the locations.

For example, if I were to use the following text:
> "I went to a store in London and then went to Lake Erie to go fishing."

SpaCy and NLTK would both recognize London as a location, but are not able to contextualize that information. While in the text London refers to a city in Ontario, Canada, SpaCy and NLTK would not be able to tell you that. Not without another layer of processing.

I came across Mordecai, a project authored by MSU professor Andy Halterman. It uses SpaCy to extract locations and a database of locations to extract locations and their context from text.
Documentation for the project can be found here: https://andrewhalterman.com/post/mordecai3/

Process:
1. Setup Mordecai
    - initalize docker container for elastic search database with geonames gazetteer data
2. Extract locations from text using Mordecai and save to a dataframe
3. Export dataframe to csv
4. Map locations using Plotly or Tableau


In [124]:
from mordecai3 import Geoparser
import pprint
import pandas as pd

# set up pretty printer
pp = pprint.PrettyPrinter(indent=4)

# set up geoparser
geo = Geoparser()

test_text = 'I went to the store in London and then went to Thames River to go fishing.'

In [125]:
output = geo.geoparse_doc(test_text)
pp.pprint(output)
# print(output['doc_text'], '\nCity: ' + output['geolocated_ents'][0]['city_name'], '\nCountry: ' + output['geolocated_ents'][0]['country_code3'], '\nLatitude: ',output['geolocated_ents'][0]['lat'], '\nLongitude: ', output['geolocated_ents'][0]['lon'])

2023-09-19 04:54:01,443 elasticsearch INFO     GET http://localhost:9200/ [status:200 request:0.022s]
2023-09-19 04:54:01,518 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.074s]
2023-09-19 04:54:01,539 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.010s]


{   'doc_text': 'I went to the store in London and then went to Thames River '
                'to go fishing.',
    'event_location_raw': '',
    'geolocated_ents': [   {   'adm1_count': 1.0,
                               'admin1_code': 'ENG',
                               'admin1_name': 'England',
                               'admin1_parent_match': 0,
                               'admin2_code': 'GLA',
                               'admin2_name': 'Greater London',
                               'alt_name_length': 4.852030263919617,
                               'ascii_dist': 0.0,
                               'avg_dist': 0.09768606870229009,
                               'city_id': '2643743',
                               'city_name': 'London',
                               'country_code3': 'GBR',
                               'country_code_parent_match': 0,
                               'country_count': 1.0,
                               'end_char': 29,
               

With the test text, Mordecai correctly extracts the location and context as well as the latitude and longitude.

In the next step, we'll create a function to save the extracted locations to a dataframe.

In [86]:
df = pd.DataFrame(columns=['text', 'lat', 'lon', 'city', 'country', 'search'])
def extract_locations(text):
    output = geo.geoparse_doc(text)
    doctext = output['doc_text']
    locs = output['geolocated_ents']
    for i in range(len(locs)):
        pd.concat([df, pd.DataFrame([[doctext, locs[i]['lat'], locs[i]['lon'], locs[i]['city_name'], locs[i]['country_code3'], locs[i]['search_name']]], columns=['text', 'lat', 'lon', 'city', 'country', 'search'])])

In [71]:
# read leavesofgrass.txt
def read_text(text):
    with open(text, 'r') as f:
        text = f.read()
        text = text.replace('\n', ' ')
        text = ' '.join(text.split())
        return text


In [127]:
first = read_text('1.txt')
second = read_text('2.txt')
third = read_text('3.txt')
fourth = read_text('4.txt')
fifth = read_text('5.txt')

text = first

In [128]:
df = pd.DataFrame(columns=['text', 'lat', 'lon', 'city', 'country', 'search'])

output = geo.geoparse_doc(text)
doctext = output['doc_text']
locs = output['geolocated_ents']
print(locs)
# for i in range(len(locs)):
    # df = df.append(pd.DataFrame([[doctext, locs[i]['lat'], locs[i]['lon'], locs[i]['city_name'], locs[i]['country_code3'], locs[i]['search_name']]], columns=['text', 'lat', 'lon', 'city', 'country', 'search']))

2023-09-19 05:32:31,675 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.079s]
2023-09-19 05:32:31,736 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.055s]
2023-09-19 05:32:31,777 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.028s]


[{'feature_code': 'PPLX', 'feature_class': 'P', 'country_code3': 'JAM', 'lat': 18.42818, 'lon': -77.20451, 'name': 'New England', 'admin1_code': '09', 'admin1_name': 'St Ann', 'admin2_code': '606', 'admin2_name': "St. Ann's Bay", 'geonameid': '12572068', 'admin1_parent_match': 0, 'country_code_parent_match': 0, 'alt_name_length': 0.6931471805599453, 'min_dist': 0.0, 'max_dist': 0.19642857142857142, 'avg_dist': 0.13414634146341464, 'ascii_dist': 0.0, 'adm1_count': 0.6666666666666666, 'country_count': 0.6666666666666666, 'score': 0.9986833930015564, 'search_name': 'New England', 'start_char': 2160, 'end_char': 2175, 'city_id': 'New England', 'city_name': '12572068'}, {'feature_code': 'PPL', 'feature_class': 'P', 'country_code3': 'USA', 'lat': 41.64366, 'lon': -83.48688, 'name': 'Oregon', 'admin1_code': 'OH', 'admin1_name': 'Ohio', 'admin2_code': '095', 'admin2_name': 'Lucas County', 'geonameid': '5165734', 'admin1_parent_match': 0, 'country_code_parent_match': 0, 'alt_name_length': 2.079

In [112]:
df.head()

Unnamed: 0,text,lat,lon,city,country,search
0,Winter Solstice BY HILDA MORLEY A cold night c...,42.65258,-73.75623,Albany,USA,new york


In [59]:
df = extract_locations(third, df)

2023-09-19 03:13:44,918 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.075s]
2023-09-19 03:13:44,978 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.046s]
2023-09-19 03:13:45,029 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.039s]
2023-09-19 03:13:45,045 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.005s]
2023-09-19 03:13:45,127 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.082s]
2023-09-19 03:13:45,164 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.007s]
2023-09-19 03:13:45,202 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.038s]
2023-09-19 03:13:45,252 elasticsearch INFO     POST http://localhost:9200/geonames/_search [status:200 request:0.038s]
2023-09-19 03:13:45,317 elasticsearch INFO     P

In [116]:
df.head()

Unnamed: 0,text,lat,lon,city,country,search


In [119]:
df.to_csv('black.csv')