# Geocoding with Python

In this tutorial we will explore a method for mapping locations from a list of place names using Python.

The dataset used is from the [Italian Academies Project](italianacademies.org/), which is available for [download](https://doi.org/10.21250/iad1) via data.bl.uk. The XML records contained in the dataset include metadata for the Italian academies of the late Renaissance and early modern periods. These records contain tags that specify the names and locations of each academy. This is the data that we will extract and use to build our map.

We will use two key Python libraries: [geopy](https://pypi.python.org/pypi/geopy) and [folium](https://pypi.python.org/pypi/folium). Geopy will be used to look up the coordinates of place names using [OpenStreetMap](https://www.openstreetmap.org/) data and folium will be used to display those coordinates on a [Leaflet](http://leafletjs.com/) map.

## Prerequisites

This tutorial assumes that you already have [Python](https://www.python.org/) installed and have some familiarity with running Python scripts.

With that in mind, we will start by installing the required software libraries via [PyPI](https://pypi.python.org/pypi). Open up a command-line interface and run the following:

```
pip install geopy folium requests tqdm
```

## Import the libraries

With the required libraries installed we can now start writing our Python script. Using a text editor, create a new file and save it as `run.py`, then enter the following code to import the required libraries.

```python
import os
import time
import zipfile
import xml.etree.ElementTree as ET

import folium
import requests
import tqdm
from folium.plugins import MarkerCluster
from geopy.geocoders import Nominatim
```

## Declare some common variables

Several variables are used in various places throughout the tutorial. We will declare these below the imports for easier reference. The comment above each variable indicates its purpose.

```python
# The directory to which we will download our dataset.
DATA_DIR = '../data'

# An HTTP header that we add to identify our application over a network.
USER_AGENT = 'bl-digischol-notebooks'

# The name of the dataset collection
COLLECTION = 'iad'

# The name of the dataset
DATASET = 'iad-xml'
```

## Prepare the dataset

We now need to download our dataset and extract the files it contains. The code block below will handle this programmatically. Copy the code into your Python script, save the file, then open up a command-line interface, navigate to the location of your script and run the following:

```
python run.py
```

Assuming the dataset does not already exist in the correct location it will be downloaded and the files extracted. For more details about how the process works, see [Downloading datasets with Python](downloading_datasets_with_python.ipynb).

```python
def create_data_dir(directory):
    """Create the data directory if it does not already exist."""
    os.makedirs(directory, exist_ok=True)


def download_dataset(collection, dataset, directory, user_agent):
    """Download the zipped dataset unless it is already on disk."""
    url = 'https://data.bl.uk/{0}/{1}.zip'.format(collection, dataset)
    download_fn = url.split('/')[-1]
    download_path = os.path.join(directory, download_fn)
    if not os.path.exists(download_path):
        headers = {'User-agent': user_agent}
        r = requests.get(url, stream=True, headers=headers)
        # Stop here on an HTTP error rather than saving an error page.
        r.raise_for_status()
        # Content-Length may be missing, in which case the bar has no total.
        total_length = r.headers.get('Content-Length')
        total_size = int(total_length) // 1024 + 1 if total_length else None
        with open(download_path, 'wb') as f:
            for chunk in tqdm.tqdm(r.iter_content(chunk_size=1024),
                                   total=total_size,
                                   desc='Downloading',
                                   unit='kb',
                                   unit_scale=True,
                                   miniters=1,
                                   leave=False):
                if chunk:
                    f.write(chunk)


def extract_dataset(dataset, data_dir):
    """Extract any files from the archive that are not yet on disk."""
    fn = '{}.zip'.format(dataset)
    in_path = os.path.join(data_dir, fn)
    with zipfile.ZipFile(in_path) as archive:
        unextracted = [name for name in archive.namelist()
                       if not os.path.exists(os.path.join(data_dir, name))]
        for name in tqdm.tqdm(unextracted,
                              desc='Extracting',
                              unit='file',
                              leave=False):
            archive.extract(name, path=data_dir)


create_data_dir(DATA_DIR)
download_dataset(COLLECTION, DATASET, DATA_DIR, USER_AGENT)
extract_dataset(DATASET, DATA_DIR)
```
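
The check that skips files already on disk can be tried in isolation. The following is a minimal sketch using only the standard library; the helper name `missing_members` and the file names are hypothetical and are not part of the dataset.

```python
import os
import tempfile
import zipfile


def missing_members(archive_path, target_dir):
    """Return the archive members that do not yet exist in target_dir."""
    with zipfile.ZipFile(archive_path) as archive:
        return [name for name in archive.namelist()
                if not os.path.exists(os.path.join(target_dir, name))]


# Build a small throwaway archive (hypothetical file names).
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, 'sample.zip')
with zipfile.ZipFile(archive_path, 'w') as archive:
    archive.writestr('records/a.xml', '<Record/>')
    archive.writestr('records/b.xml', '<Record/>')

# Nothing has been extracted yet, so both members are reported.
before = missing_members(archive_path, work_dir)

with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(work_dir)

# After extraction there is nothing left to do.
after = missing_members(archive_path, work_dir)
print(before)
print(after)
```

Because only the missing members are extracted, running the script a second time should do very little work.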

## Extract details of the academies

With our libraries imported and dataset prepared we're now ready to begin extracting the location details for each academy from the XML records. To do this we will use [ElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html), an API for parsing and manipulating XML that is included in the Python standard library.

```python
def get_academy_summaries(data_dir, dataset):
    """Return a list of dicts giving the name and city of each academy."""
    data = []
    records_dir = os.path.join(data_dir, dataset, 'records', 'ItacAcademyItem')
    for xml_file in os.listdir(records_dir):
        path = os.path.join(records_dir, xml_file)
        # Pass the path to ElementTree so that it detects the file encoding.
        root = ET.parse(path).getroot()
        city = root.find(".//*/CityItalianName").text
        name = root.find(".//Name").text
        data.append(dict(name=name, city=city))
    return data


academies = get_academy_summaries(DATA_DIR, DATASET)
```
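
To see how the two `find` expressions behave, it can help to run them against a small record. The XML below is hand-written for illustration: the academy name and the nesting of the tags are assumptions, and the real records may be structured differently. The `Founder` tag is also hypothetical and is deliberately absent, to show how `findtext` copes with a missing element.

```python
import xml.etree.ElementTree as ET

# A hand-written record; the real files may be structured differently.
SAMPLE = """
<Academy>
  <Name>Accademia degli Esempi</Name>
  <Location>
    <City>
      <CityItalianName>Bologna</CityItalianName>
    </City>
  </Location>
</Academy>
"""

root = ET.fromstring(SAMPLE)

# './/Name' matches the first Name element at any depth.
name = root.find(".//Name").text

# './/*/CityItalianName' matches a CityItalianName that is a child of
# any descendant element.
city = root.find(".//*/CityItalianName").text

# findtext returns a default instead of raising when a tag is absent.
founder = root.findtext(".//Founder", default="unknown")

print(name, city, founder)
```

Note that `find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` for a record that lacks the tag; `findtext` with a default is one way to avoid that.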

## Convert the place names into coordinates

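Next we use geopy's Nominatim geocoder to look up the coordinates of each academy's city. Each city is looked up only once and the result is reused for every academy in that city.

```python
def get_markers(academies, user_agent):
    """Geocode each academy's city and return a list of marker details."""
    geolocator = Nominatim(user_agent=user_agent)
    locations = {}
    markers = []
    for academy in tqdm.tqdm(academies, desc='Locating', unit='academy', leave=False):
        city = academy['city']
        name = academy['name']

        # Look up each city only once, remembering failed lookups too.
        if city not in locations:
            location = geolocator.geocode(city + ', Italy')
            locations[city] = ((location.latitude, location.longitude)
                               if location else None)

            # Comply with usage policy of a maximum of 1 request per second
            time.sleep(1)

        # Skip academies whose city could not be found.
        if locations[city] is None:
            continue

        marker = dict(location=locations[city], popup=name)
        markers.append(marker)
    return markers


markers = get_markers(academies, USER_AGENT)
```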
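
Many academies share a city, so caching results by city name keeps the number of geocoding requests low. The sketch below illustrates the idea without touching the network: `geocode_unique` and `fake_geocode` are hypothetical helpers, and the coordinates are rough values made up for the example rather than output from Nominatim.

```python
def geocode_unique(cities, geocode):
    """Look up each distinct city once and return a dict of results.

    `geocode` is any callable that takes a place name and returns a
    (latitude, longitude) tuple, or None when the place is not found.
    """
    locations = {}
    for city in cities:
        if city not in locations:
            locations[city] = geocode(city)
    return locations


# A stand-in geocoder with made-up behaviour, used only for illustration.
calls = []
KNOWN = {'Roma': (41.9, 12.5), 'Napoli': (40.85, 14.27)}


def fake_geocode(city):
    calls.append(city)
    return KNOWN.get(city)


cities = ['Roma', 'Napoli', 'Roma', 'Atlantide', 'Atlantide', 'Roma']
locations = geocode_unique(cities, fake_geocode)

print(len(cities), 'cities,', len(calls), 'lookups')
print(locations)
```

Testing with `city not in locations`, rather than checking whether the stored value is truthy, means a place that could not be found is also remembered and is not requested again.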



## Create our interactive map

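Finally we plot the markers on a folium map, grouping nearby markers into clusters, and save the result as an HTML file.

```python
def build_map(markers):
    """Plot the markers on a clustered map, save it as HTML and return it."""
    italy_coords = (41.87, 12.56)
    academy_map = folium.Map(location=italy_coords, zoom_start=6)

    marker_cluster = MarkerCluster().add_to(academy_map)

    for marker in markers:
        folium.Marker(**marker).add_to(marker_cluster)

    academy_map.save(os.path.join(DATA_DIR, 'iad-academies.html'))

    # Returning the map lets a notebook display it inline.
    return academy_map


build_map(markers)
```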
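
The map above is centred on a fixed point in Italy. As an alternative, the centre could be derived from the markers themselves. The helper below, `mean_centre`, is a hypothetical sketch and the sample markers are made up; a simple average of latitudes and longitudes is assumed to be good enough for an area the size of Italy.

```python
def mean_centre(markers):
    """Return the mean (latitude, longitude) of a list of marker dicts."""
    lats = [marker['location'][0] for marker in markers]
    lons = [marker['location'][1] for marker in markers]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


# Made-up markers in the same shape as those returned by get_markers.
sample_markers = [
    dict(location=(40.0, 10.0), popup='First'),
    dict(location=(44.0, 14.0), popup='Second'),
]

centre = mean_centre(sample_markers)
print(centre)
```

The result could be passed as the `location` argument of `folium.Map`. The function raises a `ZeroDivisionError` for an empty list, so a fallback to fixed coordinates would still be needed if no places were geocoded.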

## Wrap up

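We have extracted the name and city of each academy from the XML records, converted the cities into coordinates with geopy and plotted them on a clustered folium map. The map is saved as `iad-academies.html` in the data directory and can be opened in any web browser.

[View the map](../data/iad-academies.html)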