# Georeferencer Example

This notebook provides an example of how to use the georeferencer module of the nmnh_ms_tools package. This application tries to emulate how a person might georeference a natural history record, based very loosely on the [MaNIS georeferencing guidelines](http://georeferencing.org/manis/GeorefGuide.html) (see also the update to those guidelines [available on GBIF](https://docs.gbif.org/georeferencing-calculator-manual/1.0/en/)). It is very much an unfinished product but can produce decent results, particularly for relatively simple records. One feature I think is nice is that it generates a description of how each georeference was made, including citations for the sources used to make the final determination.

1. Install the nmnh_ms_tools package using the instructions on the [main page of the repository](https://github.com/adamancer/nmnh_ms_tools).
2. Navigate to the examples folder in the cloned repo. Make a copy of the georeferencer folder in a location of your choosing.
3. In the folder you created in step 2, find the .nmtrc file and open it in a text editor. Depending on your OS, that file may be hidden. Add an email address and a GeoNames username (register [here](https://www.geonames.org/login) if you don't have one). The email address will be used to populate the user agent on API requests made by this application. The GeoNames username is required to access the webservices on that website.
4. Download the following files and extract them to data/databases under the georeferencer folder:
    - allCountries.zip from https://download.geonames.org/export/dump/ (local copy of GeoNames to use for querying)
    - natural_earth_vector.sqlite.zip from https://naciscdn.org/naturalearth/packages/ (Natural Earth polygons)
5. Open and run the refresh-data.ipynb notebook. This notebook creates SQLite databases that will be used by the georeferencer. The GeoNames dump is almost 13 million records, so this will probably take a good long while to run.
6. Run this notebook.
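
Before running refresh-data.ipynb in step 5, it can help to confirm that the downloads from step 4 ended up where the notebook expects them. The following is a minimal sketch using only the standard library. The helper name `missing_files` is hypothetical, and the extracted file names are assumptions (allCountries.txt for the GeoNames dump and natural_earth_vector.sqlite for Natural Earth, possibly inside a subfolder, hence the recursive search). Adjust them if your extracted files are named differently.

```python
from pathlib import Path

# Assumed names of the extracted files; adjust if your downloads differ
EXPECTED = ["allCountries.txt", "natural_earth_vector.sqlite"]


def missing_files(db_dir, expected=EXPECTED):
    """Returns the expected file names not found anywhere under db_dir"""
    db_dir = Path(db_dir)
    if not db_dir.is_dir():
        return list(expected)
    # Search recursively in case an archive extracted into a subfolder
    found = {p.name for p in db_dir.rglob("*") if p.is_file()}
    return [name for name in expected if name not in found]


# Path is relative to the copy of the georeferencer folder made in step 2
missing = missing_files("data/databases")
if missing:
    print("Missing from data/databases: " + ", ".join(missing))
else:
    print("All expected files found")
```

An empty result only means the files exist, not that they are complete, so an interrupted download can still cause refresh-data.ipynb to fail.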

By default, this notebook uses a CSV file that is (in theory) used to test the georeferencer application. You can look at this file for hints about how to format other locality data. The location is printed out below. The format is otherwise undocumented but is based on Darwin Core.

All data is EPSG:4326 unless otherwise noted.

**If you are wary but curious,** try running the script on data that already include coordinates. This will give you some idea of the kind of error radii you can expect and help identify records that the application may struggle with. Feel free to report any disastrous results or bugs.
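
One way to make that comparison concrete is to measure the distance between the coordinates a record already has and the point the georeferencer returns, then check whether that distance falls within the reported error radius. The sketch below is not part of nmnh_ms_tools. It uses the standard haversine formula on a spherical Earth, and the function names and sample coordinates are illustrative assumptions. Coordinates are given as (latitude, longitude) in decimal degrees, consistent with EPSG:4326.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Calculates great-circle distance in km between two points in decimal degrees"""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1 to guard against floating-point overshoot near antipodes
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def within_radius(known, estimated, radius_km):
    """Tests whether the known point falls inside the reported error radius"""
    return haversine_km(*known, *estimated) <= radius_km


# Made-up example: coordinates from the record vs. a georeferenced point
known = (38.8913, -77.0261)
estimated = (38.9000, -77.0000)
print(f"Offset: {haversine_km(*known, *estimated):.2f} km")
print(f"Inside a 5 km radius: {within_radius(known, estimated, 5)}")
```

Records where the known point falls outside the radius are good candidates for a closer look, whether the problem lies with the georeference or with the original coordinates.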



In [None]:
import csv
import logging
import os

from nmnh_ms_tools.bots import Bot
from nmnh_ms_tools.config import TEST_DIR
from nmnh_ms_tools.databases.admin import init_db as init_admin_db
from nmnh_ms_tools.databases.custom import init_db as init_custom_db
from nmnh_ms_tools.databases.geohelper import init_db as init_geohelper_db
from nmnh_ms_tools.databases.geonames import init_db as init_geonames_db
from nmnh_ms_tools.databases.georef_job import (
    init_db as init_geojob_db,
    use_observed_uncertainties,
)
from nmnh_ms_tools.records import Site
from nmnh_ms_tools.processes.georeferencer import Georeferencer
from nmnh_ms_tools.processes.georeferencer.pipes import *
from nmnh_ms_tools.utils import configure_log, skip_hashed
from xmu import EMuReader, EMuRecord, write_import

In [None]:
class GeoreferencerOnlyFilter(logging.Filter):
    """Limits log to messages from the georeferencer application"""
    def filter(self, record):
        return record.name.startswith("nmnh_ms_tools.processes.georeferencer")

# Unhash to enable log
#configure_log("DEBUG", filters=[GeoreferencerOnlyFilter()])

In [None]:
# Set required class attributes
Site.pipe = MatchGeoNames()

# Update default uncertainties (broken)
#init_geojob_db(r"data/databases/uncertainties.sqlite")
#use_observed_uncertainties(percentile=50)

# Initialize databases. Paths are defined in config.yml.
init_admin_db()               # db for administrative divisions
init_custom_db()              # db for custom features
init_geohelper_db()           # db with helper tables, for example, with alternative polygons 
init_geonames_db()            # db with GeoNames places
init_geojob_db("job.sqlite")  # db with info about the current job

# Pipes are used to handle different types of strings
pipes = [
    MatchManual(),     # captures manually georeferenced places
    MatchPLSS(),       # section/township/range
    MatchBetween(),    # for example, between A and B
    MatchBorder(),     # for example, border of A and B
    #MatchOffshore(),   # for example, offshore from A (broken)
    MatchDirection(),  # for example, 1 km N of A
    MatchCustom(),     # matches custom places
    MatchGeoNames(),   # matches GeoNames places
]

# Set up tests
tests = None

# Enable caches
if tests is None:
    MatchGeoNames.enable_sqlite_cache("caches/records.sqlite")
    Site.enable_sqlite_cache("caches/localities.sqlite")

In [None]:
# Load test records
sites = []
fp = os.path.join(TEST_DIR, "test_georeferencer.csv")
print(f"CSV: {fp}")
with open(fp, "r", encoding="utf-8-sig", newline="") as f:
    for row in csv.DictReader(skip_hashed(f), dialect="excel"):
        sites.append(Site(row))

In [None]:
# Set up georeferencer
geo = Georeferencer([s.to_dict() for s in sites],
                    pipes=pipes,
                    tests=tests,
                    skip=0,
                    limit=200)

geo.id_key = "location_id"        # key with the locality identifier 
geo.allow_sparse = True           # controls whether to georeference records with limited data
geo.allow_invalid_coords = False  # controls whether to georeference records with bad coordinates
geo.coord_type = "any"            # controls determinations allowed for existing coordinates (any, georeferenced, or measured)
geo.place_type = "any"            # controls place type allowed (any, marine, or terrestrial)
geo.require_coords = False        # controls whether coordinates must be present, for example, for QA
geo.include_failed = True         # controls whether to include failed georeferences when reporting

# Georeference ahoy!
geo.georeference()

In [None]:
with open("results.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, ["record_number", "geometry", "radius_km", "description"])
    writer.writeheader()
    for result in geo.evaluated.values():
        try:
            geom = result["site"].geometry.simplify().geom[0]
            writer.writerow({
                "record_number": result["location_id"],
                "geometry": str(geom),
                "radius_km": result["radius_km"],
                "description": result["description"],
            })
        except KeyError:
            # Skip results that lack one of the expected keys
            pass
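
Once results.csv has been written, a quick summary of the error radii gives a sense of how the run went. The sketch below reads the file back using only the standard library. The helper name `summarize_radii` is hypothetical, and the code assumes only the `radius_km` column written above. Rows where that value is missing or not numeric are skipped.

```python
import csv
import os
import statistics


def summarize_radii(path):
    """Summarizes the radius_km column of a results file"""
    radii = []
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        for row in csv.DictReader(f):
            try:
                radii.append(float(row["radius_km"]))
            except (KeyError, TypeError, ValueError):
                # Skip rows with a missing or non-numeric radius
                continue
    if not radii:
        return {"count": 0}
    return {
        "count": len(radii),
        "median_km": statistics.median(radii),
        "max_km": max(radii),
    }


# Only summarize if the results file from the previous cell exists
if os.path.exists("results.csv"):
    print(summarize_radii("results.csv"))
```

The median is reported rather than the mean because a few very large radii (for example, records matched only to a country) would otherwise dominate the summary.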