## Mapping locations on a map


In this lesson, we carry out a common and useful exercise: mapping the places mentioned in a corpus. A map helps you discover the contents of an unfamiliar corpus and shows at a glance which places are mentioned most often. This lesson builds on the concepts you have learned in the text processing and information extraction lessons (Unit 3).

First, download the pre-trained English spaCy model.

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Import the required libraries and load the English spaCy model.

In [2]:
import spacy
import os
import glob
from spacy import displacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

Here, we read the text files of our dataset. If your dataset is stored in JSON format rather than as a directory of .txt files, you will need to adapt the code accordingly (refer to the examples provided in other lessons within this unit).
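
As a minimal sketch of such an adaptation, assume a hypothetical JSON file that holds a list of objects, each with a `text` field. Both the helper name `load_texts_from_json` and the field name are assumptions for illustration, not part of this dataset:

```python
import json

def load_texts_from_json(json_path, text_key="text"):
    """Return the article texts stored in a JSON file.

    Assumes the file contains a list of objects, each with a `text_key` field.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    # Skip records that have no text field or an empty one
    return [record[text_key] for record in records if record.get(text_key)]
```

Each returned string can then be passed to `nlp()` in the same way as the content of a .txt file.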


In [None]:
# TODO: examples in JSON and CSV (show the difference): same data, but pros/cons

In [3]:
# Path to the folder containing our dataset
folder_path = '../data/spanish_flu_light'

# Get a list of all the .txt files in the folder
file_list = glob.glob(os.path.join(folder_path, '*.txt'))

If you need to visualize the discovered entities, you can do so with displaCy, spaCy's built-in visualizer.

In [4]:
from spacy import displacy

# Loop through each file and display its named entities
for file_path in file_list:
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Optionally print the file where entities are discovered
    print(file_path)
    # Run the spaCy model on the text
    doc = nlp(text)
    # Render only the sentences that contain at least one entity,
    # reusing the already parsed sentence instead of running the model again
    for sent in doc.sents:
        if len(sent.ents) > 0:
            displacy.render(sent, style='ent', jupyter=True)

../data/spanish_flu_light/1918-06-30_new_york_herald_12148-bd6t547187_article_208.txt


../data/spanish_flu_light/1918-06-26_new_york_herald_12148-bd6t547140_article_56.txt




../data/spanish_flu_light/1918-04-28_new_york_herald_12148-bd6t546555_article_210.txt


../data/spanish_flu_light/1918-03-16_new_york_herald_12148-bd6t546124_article_83.txt


../data/spanish_flu_light/1918-05-28_new_york_herald_12148-bd6t546850_article_30.txt


../data/spanish_flu_light/1918-03-27_new_york_herald_12148-bd6t54623q_article_104.txt


../data/spanish_flu_light/1918-02-02_new_york_herald_12148-bd6t54570x_article_1.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_8.txt


../data/spanish_flu_light/1918-01-17_new_york_herald_12148-bd6t54554m_article_233.txt


../data/spanish_flu_light/1918-02-20_new_york_herald_12148-bd6t54588p_article_63.txt
../data/spanish_flu_light/1918-06-27_new_york_herald_12148-bd6t547159_article_89.txt


../data/spanish_flu_light/1918-04-18_new_york_herald_12148-bd6t54645w_article_100.txt


../data/spanish_flu_light/1918-05-26_new_york_herald_12148-bd6t54683c_article_218.txt


../data/spanish_flu_light/1918-06-07_new_york_herald_12148-bd6t546958_article_188.txt


../data/spanish_flu_light/1918-05-26_new_york_herald_12148-bd6t54683c_article_219.txt


../data/spanish_flu_light/1918-06-23_new_york_herald_12148-bd6t547112_article_193.txt


../data/spanish_flu_light/1918-05-22_new_york_herald_12148-bd6t54679z_article_5.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_55.txt


../data/spanish_flu_light/1918-06-22_new_york_herald_12148-bd6t54710r_article_54.txt


../data/spanish_flu_light/1918-06-06_new_york_herald_12148-bd6t54694z_article_207.txt


../data/spanish_flu_light/1918-06-15_new_york_herald_12148-bd6t54703d_article_146.txt


../data/spanish_flu_light/1918-04-15_new_york_herald_12148-bd6t54642z_article_65.txt


../data/spanish_flu_light/1918-06-12_new_york_herald_12148-bd6t54700g_article_50.txt


As you can see, the texts contain many named entities of different types. For our purposes, we keep only those labelled as locations: `GPE` (countries, cities, states) and `LOC` (other locations such as mountain ranges or bodies of water).

In [5]:
locations = []

# Loop through each file and extract the location entities
for file_path in file_list:
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Optionally print the file where entities are discovered
    #print(file_path)
    # Run the spaCy model on the text
    doc = nlp(text)

    # Loop through each named entity in the doc
    for ent in doc.ents:
        # Keep the entity only if it is a location (GPE or LOC)
        if ent.label_ in ('GPE', 'LOC'):
            #print('Entity:', ent.text, '---', 'Entity Type:', ent.label_)
            locations.append(ent.text)

In [6]:
print(locations)

['Belfast', 'Belfast', 'Fortwillium Park', 'Belfast', 'Marlborough', 'Blenheim', 'HERALD', 'Darlington', 'France', 'Spain', 'Loxnos', 'LONDON', 'Losros', 'Kansas City', 'Norfolk', 'Va.', 'Bardwell', 'Ky.', 'Bonlee', 'N.C.', 'Rockville', 'America', 'Hensley Park', 'Greensburg', 'Ind.', 'Camden', 'Ala.', 'Crimea', 'Paris', 'Liége', 'Paris', 'Geneva', 'Spain', 'Friday.—The', 'Haras', 'Arbigland', 'M.C.', 'Melbourne', 'Australia', 'R.A.M.C.', 'London', 'Cannes', 'Egmont', 'Brandt', 'Nice', 'Nice', 'Lisbon', 'Paris', 'the Governments of Portugal', 'France', 'Caracas', 'Venezuela', "United States\nGovernment '", 'Latina', 'Madrid', 'Vienna', 'Berlin', 'Governments', 'Havas', 'Spain', 'Madrid', 'Madrid', 'Saint-Denis', 'Lonpox', 'Spain', 'Southwark', 'Central Europe', 'Russia', 'Austria', 'Germäny', 'Stronnaubruch', 'Bromberg', 'Mesopotamia', 'Londen', 'Ballinasloe', 'Helfast', 'Madrid', 'Mäcon', 'Spain', 'Scandinavia', 'Bingham', 'U.S.', 'Paris', 'Neuilly']


Install geopy, which we will use to look up the coordinates of each location before displaying them on a map.

In [7]:
!pip install geopy

In [14]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="dimpah-test")

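def geolocate(place_name):
    """Return (latitude, longitude) for a place name; raise ValueError if not found."""
    loc = geolocator.geocode(place_name)
    if loc is None:  # geocode() returns None when there is no match
        raise ValueError(f"No geocoding result for {place_name!r}")
    return (loc.latitude, loc.longitude)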

In [15]:
import time

# Remove duplicate entries
#print(locations)
unique_locations = list(set(locations))
print(unique_locations)

 


['Bromberg', 'Kansas City', 'Rockville', 'Stronnaubruch', 'Norfolk', 'Hensley Park', 'Geneva', 'Arbigland', 'the Governments of Portugal', 'Crimea', 'Vienna', 'Havas', 'Germäny', 'Mesopotamia', 'Neuilly', 'Madrid', 'Venezuela', 'N.C.', 'Egmont', 'Haras', 'Ala.', 'Latina', 'Austria', 'Ind.', 'Darlington', 'U.S.', 'Brandt', 'Lisbon', 'Belfast', 'Friday.—The', 'Londen', 'Scandinavia', 'Ky.', 'Loxnos', 'Fortwillium Park', 'R.A.M.C.', 'Marlborough', 'Mäcon', 'Bingham', 'Bonlee', 'Southwark', 'Camden', 'France', 'Berlin', 'London', 'Governments', 'Helfast', 'LONDON', 'Caracas', 'Bardwell', 'Cannes', 'Russia', 'Melbourne', 'Greensburg', 'M.C.', 'Va.', 'Spain', 'Saint-Denis', 'Blenheim', 'Nice', 'America', 'Central Europe', 'HERALD', 'Ballinasloe', 'Paris', 'Lonpox', 'Losros', 'Australia', 'Liége', "United States\nGovernment '"]


Here, we use a library that calls an external service to find the geographical coordinates of each location. Such geocoding APIs are often paid. In our case, we use a free one (Nominatim, the OpenStreetMap geocoder that geopy gives access to), but to comply with its usage rules we pause the program briefly after each place name. This slows down the overall process, so geocoding a large dataset can take a long time.
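
One possible way to keep such a run polite, and a little faster, is to wait a fixed delay between requests and to cache answers so that no place name is looked up twice. The sketch below is an illustration rather than the method used in this notebook: `make_throttled_geocoder` and its parameters are hypothetical names, and `geocode_fn` stands for any function that takes a place name and returns a result (for instance the `geolocate` function defined above). A stand-in function replaces the real service so that the example runs offline; its coordinates for Madrid are taken from the output further down.

```python
import time

def make_throttled_geocoder(geocode_fn, min_delay_seconds=1.0):
    """Wrap geocode_fn so that results are cached and requests are spaced out."""
    cache = {}
    last_request = None  # time of the previous real request, if any

    def geocode(place_name):
        nonlocal last_request
        if place_name in cache:
            return cache[place_name]  # already known: no request needed
        if last_request is not None:
            # Wait until at least min_delay_seconds have passed since the last request
            wait = min_delay_seconds - (time.monotonic() - last_request)
            if wait > 0:
                time.sleep(wait)
        try:
            result = geocode_fn(place_name)
        except Exception:
            result = None  # remember failures too, so they are not retried
        last_request = time.monotonic()
        cache[place_name] = result
        return result

    return geocode

# Stand-in for a real geocoding function, used here to run without network access
def fake_geocode(place_name):
    known = {"Madrid": (40.4167047, -3.7035825)}
    return known[place_name]  # raises KeyError for unknown names

geocode = make_throttled_geocoder(fake_geocode, min_delay_seconds=0.1)
for name in ["Madrid", "Stronnaubruch", "Madrid"]:
    print(name, geocode(name))
```

With the real service, `make_throttled_geocoder(geolocate)` would keep the default delay of one second between requests. geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` that serves a similar purpose.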

In [23]:
positions = []

for location in unique_locations:
    try:
        latitude, longitude = geolocate(location)
        # "Cases" should hold the number of mentions of this location;
        # it is hardcoded to 5 here for the sake of the example
        item = {"longitude": longitude, "latitude": latitude, "location": location, "Cases": 5}
        positions.append(item)
        print(location, longitude, latitude)
    except Exception:
        # No geocoding result (or a network error) for this place name
        print("Couldn't retrieve location:", location)
    finally:
        # Nominatim's usage policy allows at most one request per second,
        # so pause after every request, whether it succeeded or not
        time.sleep(1)

Bromberg 18.0002529 53.1219648
Kansas City -94.5781416 39.100105
Rockville -77.1516844 39.0817985
Couldn't retrieve location: Stronnaubruch
Norfolk 1.0 52.666667
Hensley Park 142.04422081005248 -37.623772
Geneva 6.1466014 46.2017559
Arbigland -3.5759515 54.9013488
Couldn't retrieve location: the Governments of Portugal
Crimea 34.20081877526554 45.28350435
Vienna 16.3725042 48.2083537
Havas 19.7996291 48.1377277
Germäny 10.4478313 51.1638175
Mesopotamia 42.059646594808214 34.9019732
Neuilly 1.4209125 48.9321383
Madrid -3.7035825 40.4167047
Venezuela -66.1109318 8.0018709
N.C. -79.0392919 35.6729639
Egmont -123.9319648 49.7498134
Haras -45.4380301 -22.9529864
Ala. -86.8295337 33.2588817
Latina 13.012591212188894 41.45952605
Austria 14.12456 47.59397
Ind. -86.1746933 40.3270127
Darlington -1.5555812 54.5242081
U.S. 9.009657397144327 48.684055799999996
Brandt -84.0918824 39.9020023
Lisbon -9.1365919 38.7077507
Belfast -5.9301829 54.596391
Friday.—The -97.5660374 35.579528
Londen -0.12765 5
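
The `Cases` value in the loop above is hardcoded to 5. A possible way to replace it with the real number of mentions is to count the entries of the `locations` list before duplicates are removed, for example with `collections.Counter`. The sketch below uses a small made-up sample in place of the real list, and the helper name `count_mentions` is hypothetical.

```python
from collections import Counter

def count_mentions(locations):
    """Count how often each place name occurs in the list of extracted entities."""
    return Counter(locations)

# Small illustrative sample; in the notebook this would be the `locations` list
sample_locations = ["Belfast", "Belfast", "Paris", "Madrid", "Paris", "Belfast"]
mention_counts = count_mentions(sample_locations)

# List the places from the most to the least mentioned
for place, count in mention_counts.most_common():
    print(place, count)
```

`mention_counts[location]` could then be stored under `"Cases"` instead of the constant 5.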

In [17]:
print(positions)


[{'longitude': 18.0002529, 'latitude': 53.1219648, 'location': 'Bromberg', 'Cases': 5}, {'longitude': -94.5781416, 'latitude': 39.100105, 'location': 'Kansas City', 'Cases': 5}, {'longitude': -77.1516844, 'latitude': 39.0817985, 'location': 'Rockville', 'Cases': 5}, {'longitude': 1.0, 'latitude': 52.666667, 'location': 'Norfolk', 'Cases': 5}, {'longitude': 142.04422081005248, 'latitude': -37.623772, 'location': 'Hensley Park', 'Cases': 5}, {'longitude': 6.1466014, 'latitude': 46.2017559, 'location': 'Geneva', 'Cases': 5}, {'longitude': -3.5759515, 'latitude': 54.9013488, 'location': 'Arbigland', 'Cases': 5}, {'longitude': 34.20081877526554, 'latitude': 45.28350435, 'location': 'Crimea', 'Cases': 5}, {'longitude': 16.3725042, 'latitude': 48.2083537, 'location': 'Vienna', 'Cases': 5}, {'longitude': 19.7996291, 'latitude': 48.1377277, 'location': 'Havas', 'Cases': 5}, {'longitude': 10.4478313, 'latitude': 51.1638175, 'location': 'Germäny', 'Cases': 5}, {'longitude': 42.059646594808214, 'l

Display the data in a convenient way, as a pandas table.

In [22]:
# Install pandas library
!pip install pandas



In [19]:
import pandas as pd
df = pd.DataFrame.from_dict(positions)

# Show the first 20 rows
df.head(20)

| | longitude | latitude | location | Cases |
|---|---|---|---|---|
| 0 | 18.000253 | 53.121965 | Bromberg | 5 |
| 1 | -94.578142 | 39.100105 | Kansas City | 5 |
| 2 | -77.151684 | 39.081798 | Rockville | 5 |
| 3 | 1.0 | 52.666667 | Norfolk | 5 |
| 4 | 142.044221 | -37.623772 | Hensley Park | 5 |
| 5 | 6.146601 | 46.201756 | Geneva | 5 |
| 6 | -3.575951 | 54.901349 | Arbigland | 5 |
| 7 | 34.200819 | 45.283504 | Crimea | 5 |
| 8 | 16.372504 | 48.208354 | Vienna | 5 |
| 9 | 19.799629 | 48.137728 | Havas | 5 |


In [None]:
# TODO: prepare the full dataset with lat/long, just in case ...

Install folium to display the locations on a map.

In [None]:
!pip install folium

In [25]:
import folium
from folium.plugins import MarkerCluster

world_map = folium.Map(tiles="cartodbpositron")
marker_cluster = MarkerCluster().add_to(world_map)

# Add one circle marker per geocoded location
for _, row in df.iterrows():
    # The popup shows the place name and its (hardcoded) number of mentions
    popup_text = "{}<br>{}<br>".format(row['location'], row['Cases'])
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        popup=popup_text,
        fill=True,
    ).add_to(marker_cluster)

# Show the map
world_map

The places, cities, and countries mentioned in our dataset are spread all around the world. Of course, we cannot draw immediate conclusions from this, such as "the Spanish flu is a global problem": many parameters would need to be verified first, articles are not always properly segmented, we only have a single newspaper title, and so on. However, this information can raise questions and serve as a starting point for a more detailed analysis.

That being said, if your dataset is well-controlled and you have a good understanding of its construction and analysis, then you can quickly and easily produce this representation. However, note that errors may occur during the stage that transforms the name of the location into geographical coordinates. If you observe unusual results, manual verification and correction may be necessary. Feel free to practice on your own data and try to make this code work for other languages as well.
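
Manual correction can be kept reproducible by recording it in a small lookup table that is applied to the place names before geocoding. The sketch below assumes a hypothetical `CORRECTIONS` dictionary and helper `clean_locations`; the entries are illustrative guesses based on variants visible in the extracted list (such as 'Londen' or 'Helfast') and should be checked against the source articles before use.

```python
# Hypothetical corrections: keys are strings as extracted, values are the
# names to send to the geocoder. None marks an entry to discard.
CORRECTIONS = {
    "Londen": "London",
    "LONDON": "London",
    "Helfast": "Belfast",
    "Germäny": "Germany",
    "U.S.": "United States",
    "HERALD": None,
}

def clean_locations(locations, corrections=CORRECTIONS):
    """Apply manual corrections, drop discarded entries, and remove duplicates."""
    cleaned = []
    seen = set()
    for name in locations:
        # Collapse line breaks and repeated spaces left by the text extraction
        name = " ".join(name.split())
        name = corrections.get(name, name)
        if name is None or name in seen:
            continue
        seen.add(name)
        cleaned.append(name)
    return cleaned

print(clean_locations(["Londen", "LONDON", "Helfast", "HERALD", "Paris"]))
# ['London', 'Belfast', 'Paris']
```

Unlike `list(set(...))`, this keeps the places in their order of first appearance, which makes successive runs easier to compare.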