# Geocoding with Python

This tutorial shows how to map locations from place names using Python. We will use two key Python libraries: [geopy](https://pypi.python.org/pypi/geopy) and [folium](https://pypi.python.org/pypi/folium). Geopy will look up the coordinates of each place name using OpenStreetMap data (via the Nominatim geocoder), and folium will display those coordinates on a Leaflet map.

The dataset used is from the [Italian Academies Project](https://data.bl.uk/iad/iad1.html), which is available for download via data.bl.uk. The XML records it contains include metadata for the Italian academies of the late Renaissance and early modern periods. Within these XML records, tags indicate the name and location of each academy; this is the data that we will extract and use to build our map.

## Prerequisites

This tutorial assumes that you already have [Python](https://www.python.org/) installed and have some familiarity with running Python scripts.

With that in mind, we will start by installing the required software libraries from [PyPI](https://pypi.python.org/pypi). Open up a command-line interface and run the following:

```
pip install geopy folium requests tqdm
```

## Import the libraries

We can now start writing our Python script. 

Create a new file and save it as `run.py`, or open a new Jupyter notebook, then enter the following.

In [13]:
import os
import tqdm
import time
import requests
import zipfile
import folium
from folium.plugins import MarkerCluster
import xml.etree.ElementTree as ET
from geopy.geocoders import Nominatim

We then declare some common variables to be used in various places throughout the notebook. The comment above each variable describes its purpose.

In [14]:
# The directory to which we will download our dataset.
DATA_DIR = './data'

# An HTTP header that we add to identify our application over a network.
USER_AGENT = 'bl-digischol-notebooks'

## Prepare the dataset

We now need to download our dataset and extract the contained files. For more details of how the process works, see [Downloading datasets with Python](downloading_datasets_with_python.ipynb).

In [15]:
# Create the data directory if it does not already exist.
os.makedirs(DATA_DIR, exist_ok=True)

In [16]:
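def download_dataset(url, directory, user_agent):
    """Download the file at url into directory, unless it is already there."""
    download_fn = url.split('/')[-1]
    download_path = os.path.join(directory, download_fn)
    if not os.path.exists(download_path):
        headers = {'User-agent': user_agent}
        r = requests.get(url, stream=True, headers=headers)
        # Fail early on an HTTP error instead of saving an error page as the zip.
        r.raise_for_status()
        # Content-Length can be missing; tqdm accepts total=None in that case.
        content_length = r.headers.get('Content-Length')
        total_size = int(content_length) // 1024 + 1 if content_length else None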
        with open(download_path, 'wb') as f:
            for chunk in tqdm.tqdm(r.iter_content(chunk_size=1024), 
                                   total=total_size, 
                                   desc='Downloading', 
                                   unit='kb',
                                   unit_scale=True, 
                                   miniters=1, 
                                   leave=False): 
                if chunk:
                    f.write(chunk)

download_dataset('https://data.bl.uk/iad/iad-xml.zip', DATA_DIR, USER_AGENT)

In [17]:
def extract_dataset(fn, directory):
    """Extract any files in the zip archive that are not already on disk."""
    basename = os.path.splitext(fn)[0]
    in_path = os.path.join(directory, fn)
    out_path = os.path.join(directory, basename)
    with zipfile.ZipFile(in_path) as archive:
        unextracted = [name for name in archive.namelist()
                       if not os.path.exists(os.path.join(out_path, name))]
        if unextracted:
            for name in tqdm.tqdm(unextracted, desc='Extracting', unit='file', leave=False):
                archive.extract(name, path=out_path)

extract_dataset('iad-xml.zip', DATA_DIR)
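
The skip-if-already-extracted check can be tried without downloading anything. The sketch below builds a tiny zip archive in a temporary directory and extracts it twice; the file names and the `extract_missing` helper are made up for this example and are not part of the dataset.

```python
import os
import tempfile
import zipfile


def extract_missing(zip_path, out_path):
    """Extract only the archive members not already on disk; return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        missing = [name for name in archive.namelist()
                   if not os.path.exists(os.path.join(out_path, name))]
        for name in missing:
            archive.extract(name, path=out_path)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Build a small, hypothetical archive to stand in for the real download.
    zip_path = os.path.join(tmp, 'sample.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('records/a.xml', '<a/>')
        archive.writestr('records/b.xml', '<b/>')

    out_path = os.path.join(tmp, 'sample')
    first_run = extract_missing(zip_path, out_path)   # Extracts both files.
    second_run = extract_missing(zip_path, out_path)  # Finds nothing left to do.

print(first_run)
print(second_run)
```

Because the second call finds every file already in place, re-running the notebook should not repeat the extraction work.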

## Locate the coordinates of the place names

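Now we're ready to locate the coordinates of the place name for each academy. We first read the name and city of each academy from the XML records, then look up the coordinates of each city with Nominatim.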

In [18]:
def get_academy_summaries():
    """Get the name and city for each academy."""
    data = []
    records_dir = os.path.join(DATA_DIR, 'iad-xml', 'records', 'ItacAcademyItem')
    for xml_file in os.listdir(records_dir):
        path = os.path.join(records_dir, xml_file)
        # Pass the path so the parser reads the encoding from the XML declaration.
        root = ET.parse(path).getroot()
        city = root.find(".//*/CityItalianName").text
        name = root.find(".//Name").text
        data.append(dict(name=name, city=city))
    return data
        
academies = get_academy_summaries()
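
To see how the two `find` calls behave, the sketch below parses a small hand-written record. Only the `Name` and `CityItalianName` tags are taken from the code above; the surrounding structure (`Academy`, `Location`), the values, and the `summarise` helper are hypothetical and may not match the real schema of the Italian Academies records.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """
<Academy>
  <Name>Example Academy</Name>
  <Location>
    <CityItalianName>Bologna</CityItalianName>
  </Location>
</Academy>
"""


def summarise(xml_text):
    """Return the name and city in one record, using None for a missing tag."""
    root = ET.fromstring(xml_text.strip())
    # ".//Name" matches the first Name element anywhere below the root.
    name_el = root.find(".//Name")
    # ".//*/CityItalianName" matches a CityItalianName that is the child of
    # some descendant of the root, i.e. at least two levels down.
    city_el = root.find(".//*/CityItalianName")
    return {
        'name': name_el.text if name_el is not None else None,
        'city': city_el.text if city_el is not None else None,
    }


summary = summarise(SAMPLE_RECORD)
print(summary)
```

Checking for `None` before reading `.text` is one way to tolerate records that lack a tag; whether any real record does is not something this sketch establishes.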

In [19]:
def get_markers(academies, user_agent):
    """Geocode each academy's city and return marker settings for folium."""
    geolocator = Nominatim(user_agent=user_agent)
    locations = {}
    markers = []
    for academy in tqdm.tqdm(academies, desc='Locating', unit='academy', leave=False):
        city = academy['city']
        name = academy['name']

        # Look up each city only once, caching failed lookups as None so
        # that they are not requested again.
        if city not in locations:
            location = geolocator.geocode(city)
            locations[city] = (location.latitude, location.longitude) if location else None

            # Comply with usage policy of a maximum of 1 request per second
            time.sleep(1)

        if locations[city] is None:
            continue

        marker = dict(location=locations[city], popup=name)
        markers.append(marker)
    return markers
    
    
markers = get_markers(academies, USER_AGENT)
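
The caching idea in `get_markers` can be isolated from the network. The sketch below accepts any lookup function, so it can be tried with a stub; `geocode_cities`, `fake_geocode` and the coordinates are hypothetical stand-ins rather than part of geopy.

```python
import time


def geocode_cities(cities, geocode, delay=1.0):
    """Look up each distinct city once; return a dict of city -> result or None."""
    cache = {}
    for city in cities:
        if city in cache:
            continue  # Already looked up, so no request and no delay.
        cache[city] = geocode(city)
        time.sleep(delay)  # Pause after every real lookup, successful or not.
    return cache


# Stub standing in for a real geocoder; the coordinates are approximate.
calls = []


def fake_geocode(city):
    calls.append(city)
    return {'Roma': (41.9, 12.5)}.get(city)


coords = geocode_cities(['Roma', 'Roma', 'Nowhere', 'Nowhere'], fake_geocode, delay=0)
print(coords)
print(calls)
```

Each distinct city triggers exactly one lookup, including the one that fails. geopy also provides `geopy.extra.rate_limiter.RateLimiter`, which wraps a geocoding function and enforces a minimum delay between calls; it may be a tidier alternative to calling `time.sleep` by hand.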



In [20]:
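italy_coords = (41.87, 12.56)
# Named academy_map so that it does not shadow Python's built-in map().
academy_map = folium.Map(location=italy_coords, zoom_start=6)

marker_cluster = MarkerCluster().add_to(academy_map)

for marker in markers:
    folium.Marker(**marker).add_to(marker_cluster)

academy_map.save(os.path.join(DATA_DIR, 'iad.html'))

# In a Jupyter notebook, ending the cell with the map object displays it inline.
academy_map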
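
Each marker is stored as a dictionary whose keys match keyword arguments of `folium.Marker`, which is why `**marker` works in the loop above. The stand-in function below mimics that call so the unpacking can be seen without folium installed; `make_marker` is hypothetical and only echoes its arguments, and the coordinates and popup text are invented.

```python
def make_marker(location=None, popup=None):
    """Hypothetical stand-in for folium.Marker that reports what it received."""
    return f"Marker at {location} with popup {popup!r}"


marker = dict(location=(44.49, 11.34), popup='Example Academy')

# These two calls are equivalent: ** expands the dict into keyword arguments.
unpacked = make_marker(**marker)
explicit = make_marker(location=(44.49, 11.34), popup='Example Academy')

print(unpacked)
```

Keeping markers as plain dictionaries means further keyword arguments could be added to each one later without changing the loop that creates them.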