## Mapping locations on a map


In this lesson, we will carry out a common and useful exercise: mapping the places mentioned in a corpus. This serves several purposes. First, it helps you discover the contents of an unfamiliar corpus. It also lets you see at a glance which places are mentioned most often. The lesson builds on the concepts covered in the text processing and information extraction lessons (Unit 3).

First, download the pre-trained English spaCy model.

In [24]:
!python -m spacy download en_core_web_sm
!pip install pandas

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Import the required libraries and load the spaCy English model.

In [6]:
import spacy
import os
import glob
from spacy import displacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

Here, we read the text extracted from our dataset. If your dataset is stored in JSON format rather than as a directory of .txt files, you will need to adapt the code accordingly (see the examples provided in other lessons in this unit).
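
As a rough sketch of that adaptation, assume a hypothetical JSON file holding a list of records, each with a `text` field. The file layout, the field name, and the helper `load_texts_from_json` are illustrative assumptions, not part of the lesson's dataset.

```python
import json
import os
import tempfile

def load_texts_from_json(json_path, text_field="text"):
    """Return the non-empty texts found in a JSON file containing a list of records."""
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return [record[text_field] for record in records if record.get(text_field)]

# Build a tiny sample file so the sketch can be run without the real dataset
sample_records = [
    {"id": "a1", "text": "Influenza reported in Madrid."},
    {"id": "a2", "text": ""},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_records, f)
    texts = load_texts_from_json(sample_path)

print(texts)
```

Each string in `texts` can then be passed to `nlp()` in the same way as the content of a .txt file.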


Here we work with a folder of texts in .txt format.

In [7]:
# Path to the folder containing our dataset
folder_path = '../data/spanish_flu_light'

# Get a list of all the .txt files in the folder
file_list = glob.glob(os.path.join(folder_path, '*.txt'))

Then we can iterate over the list of files and visualize the entities discovered in each one.

In [8]:
from spacy import displacy

# Loop through each file and extract named entities
for file_path in file_list:
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Optionally print the file where entities are discovered
    print(file_path)
    # Run the spaCy model on the text
    doc = nlp(text)
    for sent in doc.sents:
        # Render only the sentences that contain at least one entity
        if len(sent.ents) > 0:
            displacy.render(sent, style='ent', jupyter=True)

../data/spanish_flu_light/1918-06-30_new_york_herald_12148-bd6t547187_article_208.txt


../data/spanish_flu_light/1918-06-26_new_york_herald_12148-bd6t547140_article_56.txt




../data/spanish_flu_light/1918-04-28_new_york_herald_12148-bd6t546555_article_210.txt


../data/spanish_flu_light/1918-03-16_new_york_herald_12148-bd6t546124_article_83.txt


../data/spanish_flu_light/1918-05-28_new_york_herald_12148-bd6t546850_article_30.txt


../data/spanish_flu_light/1918-03-27_new_york_herald_12148-bd6t54623q_article_104.txt


../data/spanish_flu_light/1918-02-02_new_york_herald_12148-bd6t54570x_article_1.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_8.txt


../data/spanish_flu_light/1918-01-17_new_york_herald_12148-bd6t54554m_article_233.txt


../data/spanish_flu_light/1918-02-20_new_york_herald_12148-bd6t54588p_article_63.txt
../data/spanish_flu_light/1918-06-27_new_york_herald_12148-bd6t547159_article_89.txt


../data/spanish_flu_light/1918-04-18_new_york_herald_12148-bd6t54645w_article_100.txt


../data/spanish_flu_light/1918-05-26_new_york_herald_12148-bd6t54683c_article_218.txt


../data/spanish_flu_light/1918-06-07_new_york_herald_12148-bd6t546958_article_188.txt


../data/spanish_flu_light/1918-05-26_new_york_herald_12148-bd6t54683c_article_219.txt


../data/spanish_flu_light/1918-06-23_new_york_herald_12148-bd6t547112_article_193.txt


../data/spanish_flu_light/1918-05-22_new_york_herald_12148-bd6t54679z_article_5.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_55.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_54.txt


../data/spanish_flu_light/1918-06-06_new_york_herald_12148-bd6t54694z_article_207.txt


../data/spanish_flu_light/1918-06-15_new_york_herald_12148-bd6t54703d_article_146.txt


../data/spanish_flu_light/1918-04-15_new_york_herald_12148-bd6t54642z_article_65.txt


../data/spanish_flu_light/1918-06-12_new_york_herald_12148-bd6t54700g_article_50.txt


Alternatively, we can load our dataset from a CSV or JSON file into a dataframe and process each row.

Here, a new column containing the named entities of each text is added to the dataframe.

In [12]:
import pandas as pd

df = pd.read_csv('./data/spanish_flu_csv.csv', encoding='utf-8')
#display(df)

# Function to extract named entities from text
def extract_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    return entities

# Apply the extract_entities function to the 'text' column
df['named_entities'] = df['text'].apply(extract_entities)

# Print the DataFrame with named entities
display(df)

Unnamed: 0,id,type,date,text,named_entities
0,new_york_herald_12148-bd6t525546_article_23,article,1920-10-04,"(By Special Cable to the Herald.)\nCnicaco, Su...","[(Special Cable, ORG), (Herald, LOC), (Cnicaco..."
1,new_york_herald_12148-bd6t525546_article_192,article,1920-10-04,"phlegm,'' she says.No Ameriran audience,\nwith...","[(Ameriran, PERSON), (New Vork, GPE), (London,..."
2,new_york_herald_12148-bd6t52949f_article_168,article,1919-11-03,Fesigus are already erident that an\ninfluenza...,"[(Paris, GPE), (last autumn, DATE), (days, DAT..."
3,new_york_herald_12148-bd6t52466v_article_4,article,1920-07-08,For some days rumor has been busy- concerning ...,"[(some days, DATE), (Crown, PRODUCT), (fourtee..."
4,new_york_herald_12148-bd6t51517t_article_43,article,1919-06-06,"Sir Boverton Redwood, the petroleum\nexpert, d...","[(Boverton Redwood, PERSON), (London, GPE), (W..."
...,...,...,...,...,...
313,new_york_herald_12148-bd6t51418w_article_62,article,1919-02-26,"sonville, Pierre Loti, Duc de Bisaccia\nand ot...","[(Pierre Loti, PERSON), (Duc de Bisaccia, PERS..."
314,new_york_herald_12148-bd6t51418w_article_187,article,1919-02-26,The conditions are cxactiy these:\nThere are 6...,"[(63,000, CARDINAL), (43,000, CARDINAL), (20.0..."
315,new_york_herald_12148-bd6t54710r_article_8,article,1918-06-22,"Barrisu Fnorr, Friday.—The lull on\nthe whole ...","[(Barrisu Fnorr, PERSON), (Friday.—The, GPE), ..."
316,new_york_herald_12148-bd6t54710r_article_54,article,1918-06-22,"(FROM THE HIRRALD'S CORRESPONDENT.)\nLonpon, F...","[(HIRRALD, ORG), (CORRESPONDENT, ORG), (Lonpon..."


As you can see, many named entities are found. For our purposes, however, we keep only the entities labelled as locations: GPE (countries, cities, states) and LOC (other locations).

The code below works on the folder of .txt files; it is easy to adapt it to the CSV example.

In [14]:
locations = []

# Loop through each file and extract named entities
for file_path in file_list:
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Optionally print the file where entities are discovered
    #print(file_path)
    # Run the spaCy model on the text
    doc = nlp(text)

    # Loop through each named entity in the doc
    for ent in doc.ents:
        # Keep entity if it is a location
        if ent.label_ in ['GPE', 'LOC']:
            locations.append(ent.text)

In [15]:
print(locations)

['Belfast', 'Belfast', 'Fortwillium Park', 'Belfast', 'Marlborough', 'Blenheim', 'HERALD', 'Darlington', 'France', 'Spain', 'Loxnos', 'LONDON', 'Losros', 'Kansas City', 'Norfolk', 'Va.', 'Bardwell', 'Ky.', 'Bonlee', 'N.C.', 'Rockville', 'America', 'Hensley Park', 'Greensburg', 'Ind.', 'Camden', 'Ala.', 'Crimea', 'Paris', 'Liége', 'Paris', 'Geneva', 'Spain', 'Friday.—The', 'Haras', 'Arbigland', 'M.C.', 'Melbourne', 'Australia', 'R.A.M.C.', 'London', 'Cannes', 'Egmont', 'Brandt', 'Nice', 'Nice', 'Lisbon', 'Paris', 'the Governments of Portugal', 'France', 'Caracas', 'Venezuela', "United States\nGovernment '", 'Latina', 'Madrid', 'Vienna', 'Berlin', 'Governments', 'Havas', 'Spain', 'Madrid', 'Madrid', 'Saint-Denis', 'Lonpox', 'Spain', 'Southwark', 'Central Europe', 'Russia', 'Austria', 'Germäny', 'Stronnaubruch', 'Bromberg', 'Mesopotamia', 'Londen', 'Ballinasloe', 'Helfast', 'Madrid', 'Mäcon', 'Spain', 'Scandinavia', 'Bingham', 'U.S.', 'Paris', 'Neuilly']
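
For the CSV variant, the same filter can be applied to the `(text, label)` pairs stored in the `named_entities` column. The following is a minimal sketch; the helper name `keep_locations` and the constant `LOCATION_LABELS` are hypothetical, and the sample pairs are copied from different rows of the table shown earlier.

```python
# Labels treated as locations, matching the filter used for the .txt files
LOCATION_LABELS = {"GPE", "LOC"}

def keep_locations(entities):
    """Return the text of every (text, label) pair whose label is a location label."""
    return [text for text, label in entities if label in LOCATION_LABELS]

# Sample in the shape returned by extract_entities()
sample_entities = [
    ("Boverton Redwood", "PERSON"),
    ("London", "GPE"),
    ("Herald", "LOC"),
    ("last autumn", "DATE"),
]
print(keep_locations(sample_entities))
```

With the dataframe, this could be applied as `df['named_entities'].apply(keep_locations)`. Note that "Herald" passes the filter because the model labelled it LOC, which is the kind of error discussed at the end of this lesson.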


Install the geopy library, which we use to find the geographical coordinates of each location before displaying them on a map.

In [16]:
!pip install geopy

In [17]:
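from geopy.geocoders import Nominatim

# Nominatim requires a user agent that identifies your application
geolocator = Nominatim(user_agent="dimpah-test")

def geolocate(place):
    """Return (latitude, longitude) for a place name, or raise ValueError if nothing is found."""
    loc = geolocator.geocode(place)
    if loc is None:
        raise ValueError(f"No geocoding result for {place!r}")
    return (loc.latitude, loc.longitude)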

In [20]:
import time

# Remove duplicate entries
# print(locations)
unique_locations = list(set(locations))
print(unique_locations)

 


['Egmont', 'Belfast', 'Australia', 'Caracas', 'Geneva', 'Vienna', 'Norfolk', 'Marlborough', 'Losros', 'America', 'Nice', 'Spain', 'Berlin', 'Germäny', 'Mäcon', 'Rockville', 'Greensburg', 'Loxnos', 'Melbourne', 'Hensley Park', 'Central Europe', 'Madrid', 'LONDON', 'Neuilly', 'Ballinasloe', 'HERALD', 'Bardwell', 'Bonlee', "United States\nGovernment '", 'Latina', 'Helfast', 'Lisbon', 'Saint-Denis', 'M.C.', 'Arbigland', 'France', 'Venezuela', 'Kansas City', 'Paris', 'Cannes', 'Brandt', 'Ky.', 'London', 'Russia', 'Stronnaubruch', 'Governments', 'Lonpox', 'R.A.M.C.', 'Havas', 'U.S.', 'Liége', 'Crimea', 'Bromberg', 'Londen', 'Haras', 'Friday.—The', 'Mesopotamia', 'Va.', 'Scandinavia', 'Bingham', 'Austria', 'Darlington', 'the Governments of Portugal', 'Ala.', 'Ind.', 'N.C.', 'Fortwillium Park', 'Southwark', 'Camden', 'Blenheim']
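
Deduplicating with `set` discards how often each place was mentioned. If those frequencies are needed later, for example to replace a placeholder count on the map, one possible approach is to count the mentions with `collections.Counter` (imported at the start of this lesson). This is a sketch on a small sample standing in for the full `locations` list; the variable name `location_counts` is an assumption.

```python
from collections import Counter

# Small sample standing in for the `locations` list built above
locations = ['Belfast', 'Belfast', 'Paris', 'Spain', 'Paris', 'Belfast', 'LONDON', 'London']

# Count how many times each place name occurs
location_counts = Counter(locations)

# The distinct names are the keys of the counter
unique_locations = list(location_counts)

print(location_counts.most_common(2))
print(unique_locations)
```

The counter is case-sensitive, so 'LONDON' and 'London' are counted as two different names unless they are normalised first.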


Here, we use a library that calls an external service to find the geographical coordinates of each location. Such geocoding APIs are often paid. We use a free one (Nominatim), and to comply with its usage rules we pause the program for a few seconds after each place name. This slows down the whole process, so with a large dataset this step can take a long time.
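
Because every request costs a pause, it helps to never look up the same name twice. The sketch below shows one way to structure such a loop. It is not the lesson's implementation: `geocode_all` is a hypothetical helper, and `fake_geocode` is a stub that stands in for the real service so that the example runs offline (its coordinates are copied from the output of this lesson).

```python
import time

def geocode_all(places, geocode, delay_seconds=1.0):
    """Geocode each distinct place once, pausing after every request.

    `geocode` is any callable that returns (latitude, longitude)
    or raises an exception when the place cannot be found.
    """
    results = {}
    failures = []
    for place in places:
        if place in results or place in failures:
            continue  # already looked up, so no new request is needed
        try:
            results[place] = geocode(place)
        except Exception:
            failures.append(place)
        finally:
            # Pause whether or not the request succeeded
            time.sleep(delay_seconds)
    return results, failures

# Stub standing in for a real geocoding service
KNOWN_PLACES = {
    "Belfast": (54.596391, -5.9301829),
    "Madrid": (40.4167047, -3.7035825),
}

def fake_geocode(place):
    return KNOWN_PLACES[place]  # raises KeyError for unknown names

results, failures = geocode_all(
    ["Belfast", "Madrid", "Helfast", "Belfast"], fake_geocode, delay_seconds=0
)
print(results)
print(failures)
```

In real use, the `geocode` argument would be a function that queries the service, and `delay_seconds` would be set to whatever pause the service requires.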

In [21]:
positions = []

for location in unique_locations:
    try:
        latitude, longitude = geolocate(location)
        # "Cases" stands for the number of mentions of this location; hardcoded here as an example
        item = {"longitude": longitude, "latitude": latitude, "location": location, "Cases": 5}
        positions.append(item)
        print(location, longitude, latitude)
    except Exception:
        print("Couldn't retrieve location:", location)
    finally:
        # Pause after every request, successful or not, to respect the Nominatim rate limit
        time.sleep(5)

Egmont -123.9319648 49.7498134
Belfast -5.9301829 54.596391
Australia 134.755 -24.7761086
Caracas -66.9146008 10.5060934
Geneva 6.1466014 46.2017559
Vienna 16.3725042 48.2083537
Norfolk 1.0 52.666667
Marlborough -71.5522874 42.3459271
Couldn't retrieve location: Losros
America -100.445882 39.7837304
Nice 7.2683912 43.7009358
Spain -4.8379791 39.3260685
Berlin 13.3888599 52.5170365
Germäny 10.4478313 51.1638175
Mäcon 4.8322266 46.3036683
Rockville -77.1516844 39.0817985
Greensburg -79.5389289 40.3014581
Couldn't retrieve location: Loxnos
Melbourne 144.9631732 -37.8142454
Hensley Park 142.04422081005248 -37.623772
Central Europe -95.167241 35.670119
Madrid -3.7035825 40.4167047
LONDON -0.12765 51.5073359
Neuilly 1.4209125 48.9321383
Ballinasloe -8.2401846 53.3363272
HERALD -121.2443919 38.2957474
Bardwell -96.6961027 32.2690354
Bonlee -79.4144654 35.6459769
United States
Government ' -77.9662258 39.4563185
Latina 13.012591212188894 41.45952605
Couldn't retrieve location: Helfast
Lisbon -

In [23]:
print(positions)


[{'longitude': -123.9319648, 'latitude': 49.7498134, 'location': 'Egmont', 'Cases': 5}, {'longitude': -5.9301829, 'latitude': 54.596391, 'location': 'Belfast', 'Cases': 5}, {'longitude': 134.755, 'latitude': -24.7761086, 'location': 'Australia', 'Cases': 5}, {'longitude': -66.9146008, 'latitude': 10.5060934, 'location': 'Caracas', 'Cases': 5}, {'longitude': 6.1466014, 'latitude': 46.2017559, 'location': 'Geneva', 'Cases': 5}, {'longitude': 16.3725042, 'latitude': 48.2083537, 'location': 'Vienna', 'Cases': 5}, {'longitude': 1.0, 'latitude': 52.666667, 'location': 'Norfolk', 'Cases': 5}, {'longitude': -71.5522874, 'latitude': 42.3459271, 'location': 'Marlborough', 'Cases': 5}, {'longitude': -100.445882, 'latitude': 39.7837304, 'location': 'America', 'Cases': 5}, {'longitude': 7.2683912, 'latitude': 43.7009358, 'location': 'Nice', 'Cases': 5}, {'longitude': -4.8379791, 'latitude': 39.3260685, 'location': 'Spain', 'Cases': 5}, {'longitude': 13.3888599, 'latitude': 52.5170365, 'location': '

Display the data in a convenient tabular form.



In [32]:
import pandas as pd
df = pd.DataFrame.from_dict(positions)

# Show up to the first 100 rows
df.head(100)

Unnamed: 0,longitude,latitude,location,Cases
0,-123.931965,49.749813,Egmont,5
1,-5.930183,54.596391,Belfast,5
2,134.755000,-24.776109,Australia,5
3,-66.914601,10.506093,Caracas,5
4,6.146601,46.201756,Geneva,5
...,...,...,...,...
58,-86.174693,40.327013,Ind.,5
59,-79.039292,35.672964,N.C.,5
60,-0.104966,51.503925,Southwark,5
61,-75.119891,39.944840,Camden,5


In [26]:
# If you want, you can export your dataset in csv format
df.to_csv('geo_ex.csv')


Install the dependency needed to display the locations on a map.

In [27]:
!pip install folium



In [33]:
import folium
from folium.plugins import MarkerCluster

world_map = folium.Map(tiles="cartodbpositron")
marker_cluster = MarkerCluster().add_to(world_map)

# For each geocoded location, add a circle marker whose popup shows its name and "Cases" value
for _, row in df.iterrows():
    popup_text = "{}<br>{}<br>".format(row['location'], row['Cases'])
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        popup=popup_text,
        fill=True,
    ).add_to(marker_cluster)

# Show the map
world_map

The places, cities and countries mentioned in our dataset are spread all around the world. We cannot, however, draw immediate conclusions from this, such as "the Spanish flu is a global problem": many parameters need to be verified, articles are not always properly segmented, we only have a single newspaper title, and so on. Still, this information can prompt questions and serve as a starting point for more detailed analysis.

That being said, if your dataset is well controlled and you understand how it was built, you can produce this representation quickly and easily. Note, however, that errors may occur in the step that converts place names into geographical coordinates. In the output above, for instance, "Central Europe" was resolved to a point at longitude -95.17, latitude 35.67, which lies in North America. If you observe unusual results, manual verification and correction may be necessary. Feel free to practice on your own data and to try making this code work for other languages as well.
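
One way to organise manual correction is a small lookup table applied to the place names before geocoding. The sketch below is only an illustration: the names `CORRECTIONS`, `NOT_A_PLACE`, and `clean_locations` are hypothetical, and each mapping is a guess based on the output of this lesson that should be checked against the source articles.

```python
# Hypothetical corrections for OCR-damaged place names (guesses, to be verified by hand)
CORRECTIONS = {
    "Lonpox": "London",
    "Londen": "London",
    "Helfast": "Belfast",
    "Germäny": "Germany",
}

# Strings the model tagged as locations that do not look like place names
NOT_A_PLACE = {"HERALD", "Governments", "R.A.M.C.", "M.C."}

def clean_locations(locations):
    """Drop known non-places and replace known OCR variants with their corrected form."""
    cleaned = []
    for name in locations:
        if name in NOT_A_PLACE:
            continue
        cleaned.append(CORRECTIONS.get(name, name))
    return cleaned

sample = ['Belfast', 'HERALD', 'Lonpox', 'Helfast', 'Governments', 'Germäny', 'Paris']
print(clean_locations(sample))
```

Keeping the corrections in one table makes them easy to review and to extend as new errors are spotted on the map.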