# GeoParser NLP

A Python Jupyter Notebook text geoparser that showcases Natural Language Processing (spaCy) capabilities in 
Named Entity Recognition (NER), along with some geo packages for geocoding and map display.

GeoParser takes text as input and uses spaCy's NER, via a pipeline trained on Wikipedia data, to identify potential geo-locations 
in the text. It then geocodes those locations to retrieve map coordinates, which are plotted on a 
map for display.

In [None]:
# Import NLP package
import spacy
from spacy import displacy 

In [None]:
# Import pandas
import pandas as pd

In [None]:
# Import the geocoding package (geopy), matplotlib, and geopy's rate limiter
import geopy 
import matplotlib.pyplot as plt
from geopy.extra.rate_limiter import RateLimiter

In [None]:
# import python library wrapper for Leaflet.js
import folium
from folium.plugins import FastMarkerCluster

Load the trained model containing the NER pipeline

In [None]:
nlp = spacy.load('xx_ent_wiki_sm')

In [None]:
# Show available pipelines
nlp.pipe_names

✨ For simplicity, the sample text is specified directly in this notebook. You could instead read it from multiple text files in a local directory or from a web resource.
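
If the text lives in files rather than in the notebook, something like the following could collect it before calling `nlp`. This is a minimal sketch and not part of the original notebook: the helper name `load_texts`, the `*.txt` pattern, and the UTF-8 encoding are all assumptions.

```python
from pathlib import Path


def load_texts(directory, pattern="*.txt", encoding="utf-8"):
    """Return {file name: file contents} for matching files in a directory.

    Hypothetical helper: the name, glob pattern and encoding are assumptions.
    """
    texts = {}
    # Sort so the result order is stable across runs and platforms.
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            texts[path.name] = path.read_text(encoding=encoding)
    return texts
```

Each value could then be passed to `nlp(...)` in place of the inline sample, or all values could be streamed through `nlp.pipe(...)`.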

In [None]:
# Load the doc with NER annotations
doc = nlp("""Calabria, known in antiquity as Bruttium (US: /ˈbrʊtiəm, ˈbrʌt-/),[7][8] is an administrative region of Italy. Located in the south of the Italian Peninsula, separated from Sicily by the Strait of Messina. As of 2019, the region has a population of around 2,000,000 people.

The capital city of Calabria is Catanzaro. The Regional Council of Calabria is based at the Palazzo Campanella in the city of Reggio Calabria. The region is bordered to the north by the Basilicata Region, to the west by the Tyrrhenian Sea, and to the east by the Ionian Sea. The Strait of Messina separates it from the island of Sicily. The region covers 15,080 km2 (5,822 sq mi) and has a population of just under 2 million. The demonym of Calabria is calabrese in Italian and Calabrian in English.

In antiquity the name Calabria referred, not as in modern times to the toe, but to the heel tip of Italy, from Tarentum southwards,[9] a region nowadays known as Salento.

The region is generally known as the "toe" of the "boot" of Italy and is a long and narrow peninsula which stretches from north to south for 248 km (154 mi), with a maximum width of 110 km (68 mi). Some 42% of Calabria's area, corresponding to 15,080 km2, is mountainous, 49% is hilly, while plains occupy only 9% of the region's territory. It is surrounded by the Ionian and Tyrrhenian seas. It is separated from Sicily by the Strait of Messina, where the narrowest point between Capo Peloro in Sicily and Punta Pezzo in Calabria is only 3.2 km (2 mi).

Three mountain ranges are present: Pollino, La Sila and Aspromonte, each with its own flora and fauna. The Pollino Mountains in the north of the region are rugged and form a natural barrier separating Calabria from the rest of Italy. Parts of the area are heavily wooded, while others are vast, wind-swept plateaus with little vegetation. These mountains are home to a rare Bosnian Pine variety and are included in the Pollino National Park, which is the largest national park in Italy, covering 1,925.65 square kilometres.

La Sila, which has been referred to as the "Great Wood of Italy",[16][17][18] is a vast mountainous plateau about 1,200 metres (3,900 feet) above sea level and stretches for nearly 2,000 square kilometres (770 square miles) along the central part of Calabria. The highest point is Botte Donato, which reaches 1,928 metres (6,325 feet). The area boasts numerous lakes and dense coniferous forests. La Sila also has some of the tallest trees in Italy which are called the "Giants of the Sila" and can reach up to 40 metres (130 feet) in height.[19][20][21] The Sila National Park is also known to have the purest air in Europe.[22]

The Aspromonte massif forms the southernmost tip of the Italian peninsula bordered by the sea on three sides. This unique mountainous structure reaches its highest point at Montalto, at 1,995 metres (6,545 feet), and is full of wide, man-made terraces that slope down towards the sea.

Most of the lower terrain in Calabria has been agricultural for centuries, and exhibits indigenous scrubland as well as introduced plants such as the prickly pear cactus. The lowest slopes are rich in vineyards and orchards of citrus fruit, including the Diamante citron. Further up, olives and chestnut trees appear while in the higher regions there are often dense forests of oak, pine, beech and fir trees.""")

In [None]:
# Show which named entities were captured and the labels applied
displacy.render(doc, style="ent")

In [None]:
# Capture in a list only the entities labeled 'LOC'
locations = [[ent.text, ent.start, ent.end] for ent in doc.ents if ent.label_ == 'LOC']
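
The filter above keeps only `LOC` because that is the label the multilingual `xx_ent_wiki_sm` pipeline assigns to places. Other spaCy pipelines, such as the English ones, also use `GPE` for countries and cities, so the label set may need to change with the model. The sketch below makes the labels a parameter; `extract_locations` and the `FakeEnt` stand-in are hypothetical names introduced here so the example runs without a model.

```python
from collections import namedtuple


def extract_locations(ents, labels=("LOC",)):
    """Collect [text, start, end] for entities whose label is in `labels`."""
    return [[ent.text, ent.start, ent.end] for ent in ents if ent.label_ in labels]


# Stand-in for spaCy Span objects (illustrative values, not real model output).
FakeEnt = namedtuple("FakeEnt", "text label_ start end")
sample_ents = [
    FakeEnt("Calabria", "LOC", 0, 1),
    FakeEnt("Regional Council of Calabria", "ORG", 60, 64),
    FakeEnt("Strait of Messina", "LOC", 30, 33),
]
locations_demo = extract_locations(sample_ents)
```

With a real document this would be called as `extract_locations(doc.ents)`, or as `extract_locations(doc.ents, labels=("GPE", "LOC"))` for a pipeline that distinguishes the two.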

In [None]:
# Populate a pandas DataFrame with the location info above
df = pd.DataFrame(locations, columns=['Location', 'start','end'])

In [None]:
df

In [None]:
# Remove duplicate locations and sort by name
df.drop_duplicates(subset='Location', keep='first', inplace=True)
df.sort_values("Location", inplace=True)

In [None]:
df

In [None]:
# Declare the geolocator using OpenStreetMap's Nominatim service, rate-limited to one request per second
locator = geopy.geocoders.Nominatim(user_agent="geoparser")
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
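
Nominatim's public server asks clients to stay at or below one request per second, which is what `min_delay_seconds=1` enforces. When the notebook is rerun or applied to several documents, the same place names come up repeatedly, so remembering earlier answers can save many slow lookups. The following is a minimal sketch of that idea; `cached` is a hypothetical helper, and the fake geocoder exists only so the example runs offline.

```python
def cached(geocode_fn):
    """Wrap a geocoding callable so each distinct query is looked up once.

    Failed lookups (None) are remembered too, so they are not retried.
    """
    cache = {}

    def lookup(query):
        if query not in cache:
            cache[query] = geocode_fn(query)
        return cache[query]

    return lookup


# Offline stand-in for a real geocoder; coordinates are illustrative only.
calls = []


def fake_geocode(query):
    calls.append(query)
    return None if query == "Nowhere" else (query, 1.0, 2.0)


lookup = cached(fake_geocode)
for name in ["Catanzaro", "Catanzaro", "Nowhere", "Nowhere"]:
    lookup(name)
```

In the notebook, the wrapper could be layered on top of the rate limiter, for example `geocode = cached(RateLimiter(locator.geocode, min_delay_seconds=1))`.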

In [None]:
# Geocode locations, assigning the results to a new column
df["address"] = df["Location"].apply(geocode)

In [None]:
df

In [None]:
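# Extract lat/lon/alt from the address column; failed lookups become (None, None, None)
# so the three-column expansion still works when the first (or every) lookup fails
df['coordinates'] = df['address'].apply(lambda loc: tuple(loc.point) if loc else (None, None, None))
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['coordinates'].tolist(), index=df.index)
# Report how many locations could not be geocoded, then drop them
print(df['latitude'].isnull().sum(), "location(s) could not be geocoded")
df = df[pd.notnull(df["latitude"])]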

In [None]:
# Should now have lat/lon/alt available
df

In [None]:
# declare the folium map (folium is a python wrapper to LeafletJS)
folium_map = folium.Map(location=[43,11], zoom_start=5, tiles='CartoDB dark_matter')
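
The fixed center `[43, 11]` sits over central Italy, so a different corpus would need a different starting view. One option is to derive the center from the geocoded points themselves. The sketch below takes a plain arithmetic mean, which is reasonable for a regional cluster but not for points that straddle the antimeridian; `map_center` and its default are hypothetical.

```python
def map_center(points, default=(43.0, 11.0)):
    """Return the mean (lat, lon) of `points`, or `default` if there are none."""
    points = list(points)
    if not points:
        return default
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return (lat, lon)


# Approximate coordinates, for illustration only.
demo_points = [(38.0, 16.0), (39.0, 17.0), (40.0, 15.0)]
center = map_center(demo_points)
```

In the notebook this could feed `folium.Map(location=map_center(df[['latitude', 'longitude']].values.tolist()), ...)`. Calling `fit_bounds` on the map is an alternative when the zoom level should adapt as well.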

In [None]:
# Declare a JavaScript callback (passed as a string) that builds each marker and its popup
callback = ('function (row) {' 
                'var marker = L.marker(new L.LatLng(row[0], row[1]), {color: "red"});'
                'var icon = L.AwesomeMarkers.icon({'
                "icon: 'info-sign',"
                "iconColor: 'white',"
                "markerColor: 'green',"
                "prefix: 'glyphicon',"
                "extraClasses: 'fa-rotate-0'"
                    '});'
                'marker.setIcon(icon);'
                "var popup = L.popup({maxWidth: '300'});"
                "const display_text = {text: row[2]};"
                "var mytext = $(`<div id='mytext' class='display_text' style='width: 100.0%; height: 100.0%;'> ${display_text.text}</div>`)[0];"
                "popup.setContent(mytext);"
                "marker.bindPopup(popup);"
                'return marker};')
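
The callback places `row[2]` into the popup markup as-is, so an entity string containing `<` or `&` would be interpreted as HTML. For text from untrusted sources, one cautious option is to escape the labels in Python before handing the rows to `FastMarkerCluster`. This is a sketch under that assumption; `escape_rows` is a hypothetical helper and the rows are illustrative.

```python
from html import escape


def escape_rows(rows):
    """Return [lat, lon, label] rows with the label HTML-escaped."""
    return [[lat, lon, escape(str(label))] for lat, lon, label in rows]


# Illustrative rows; the second label contains characters that need escaping.
demo_rows = [
    [38.9, 16.6, "Catanzaro"],
    [38.1, 15.65, "Reggio <b>Calabria</b> & co"],
]
safe_rows = escape_rows(demo_rows)
```

The escaped rows would then replace the raw list, for example `FastMarkerCluster(escape_rows(df[['latitude', 'longitude', 'Location']].values.tolist()), callback=callback)`.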

In [None]:
# Now add a FastMarkerCluster to the map specifying lat/lon and callback function
folium_map.add_child(FastMarkerCluster(df[['latitude', 'longitude','Location']].values.tolist(), callback=callback))