# Retrieving Coordinates and Hierarchy from GeoNames

## Part 1: Introduction
This Jupyter Notebook uses the GeoNames API to auto-populate the coordinates and hierarchy of place names.<br>

###  1. What is GeoNames?
- **<a href='http://www.geonames.org/'>GeoNames</a>** is a geographic database that returns information such as bounding boxes, centroids, hierarchy, and children for a given place name.<br> More specifically, every feature is assigned to one of 9 feature classes and further subcategorized into one of 645 <a href='http://www.geonames.org/export/codes.html'>feature codes</a>.
- A Python library called **<a href='https://geocoder.readthedocs.io/providers/GeoNames.html'>Geocoder</a>** supports the GeoNames web services used below.  

### 2. Limitations of GeoNames
- GeoNames does not support exact-match searching, and this script always takes the first result returned, so there is a high chance of a mismatch. 
- Web service requests are subject to daily and hourly limits. 
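
Because the first result is not guaranteed to be the intended place, it can help to compare the returned name with the query before trusting it. The sketch below shows one way to do that using only the standard library. `looks_like_match` is a hypothetical helper, not part of GeoNames or Geocoder, and its comparison rule (case-insensitive equality on the text before the first comma) is an assumption.

```python
def looks_like_match(query, returned_name):
    """Return True if the returned name plausibly matches the queried place.

    Hypothetical helper: compares only the text before the first comma,
    ignoring case and surrounding whitespace.
    """
    def head(text):
        return text.split(',')[0].strip().casefold()

    return head(query) == head(returned_name)


# Illustrative values only; no request is sent to GeoNames
print(looks_like_match('Hellam, Pennsylvania', 'Hellam'))
print(looks_like_match('York County, Pennsylvania', 'New York'))
```

Records that fail such a check could be flagged for manual review instead of being written to the output as they are.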

## Part 2: Preparation
We will use **Jupyter Notebook (Anaconda 3)** to edit and run the script. Information on Anaconda installation can be found <a href='https://docs.anaconda.com/anaconda/install/'>here</a>. Please note that this script runs on Python 3.

Before running the script, you may need to:
### 1. Install Libraries
- <a href='https://geocoder.readthedocs.io/providers/GeoNames.html'>Geocoder</a> `pip install geocoder`
- <a href='https://chardet.readthedocs.io/en/latest/index.html'>chardet</a> auto-detects the character encoding. It is included to handle non-English metadata. `pip install chardet`

### 2. Restructure Directories
- ***fetch.ipynb***
- ***fetch.py***
- **data** folder
    - **code** folder
        - ***code.csv*** formatted in GBL Metadata Template
        
### 3. Inspect Output
To avoid exceeding the hourly/daily limit, this script includes a function that splits the csv file into multiple smaller ones, so users run part of the script several times to fetch the GeoNames information. The final product is a single csv file named ***code_done.csv***. 

> Original created on Jan 24 2021 <br>
> @author: Yijing Zhou @YijingZhou33

## Part 3: Get Started

### Step 1: Import modules

In [None]:
import geocoder.geonames as geonames
import os
import pandas as pd
import re
import chardet
import csv

### Step 2: Manual items to change
Note that you need a GeoNames user account; you can register a free one <a href='http://www.geonames.org/login'>here</a>.<br>
According to the `Terms and Conditions`, the hourly limit for a personal account is 1000 credits, and 1 credit corresponds to 1 web service request. 

In [None]:
# GeoNames account name
username = ''
try:
    from config import * 
except ImportError:
    pass

# code/name of the rawdata
code = 'sample'

# The number of records per file after splitting  
# The recommended rowlimit is 250.  
rowlimit = 250
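
The row limit and the hourly credit limit are related by simple arithmetic, which the following sketch makes explicit. `rows_per_hour` is a hypothetical helper. The figure of four requests per place name is an assumption chosen for illustration; the real number depends on how many searches each place name needs.

```python
def rows_per_hour(hourly_credits, requests_per_place, places_per_row=1):
    """Estimate how many rows fit in one hour, at 1 credit per request."""
    return hourly_credits // (requests_per_place * places_per_row)


# Assumed for illustration: 4 requests per place name, 1 place name per row
print(rows_per_hour(1000, 4))
# Rows listing two place names each use up the credits twice as fast
print(rows_per_hour(1000, 4, places_per_row=2))
```

Under these assumptions, a file of 250 rows uses about one hour of credits, so a lower `rowlimit` may be safer for records with several place names.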

## Part 4: Process Raw Data

### Step 3: Convert GeoBlackLight Metadata csv file to dataframe

In [None]:
filepath = os.path.join('data', code, code + '.csv')

df = pd.read_csv(filepath)
## List of metadata fields from the GBL metadata template required in the final product.
collist = ['Slug', 'Title', 'Spatial Coverage']
## Alternative columns
## Check if exists, if so then add it to the list.
## Also more properties can be added here!
altlist = ['Information', 'Download', 'Static image']
collist = collist + [i for i in altlist if i in df]

df = df[collist]
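
The column selection above can be illustrated without pandas. This is a minimal sketch: `select_columns` is a hypothetical helper and the sample header is made up. It also reports a missing required column by name, which the cell above leaves to pandas.

```python
def select_columns(header, required, optional):
    """Return the required columns plus any optional ones present in header."""
    missing = [c for c in required if c not in header]
    if missing:
        raise KeyError(f'missing required columns: {missing}')
    return required + [c for c in optional if c in header]


# Made-up header for illustration
header = ['Slug', 'Title', 'Spatial Coverage', 'Download', 'Publisher']
required = ['Slug', 'Title', 'Spatial Coverage']
optional = ['Information', 'Download', 'Static image']
print(select_columns(header, required, optional))
```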

### Step 4: Predict encoding 

In [None]:
def predict_encoding(file_path):
    ## Read the raw bytes and let chardet guess the encoding
    with open(file_path, 'rb') as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    charenc = result['encoding']
    return charenc

encoding = predict_encoding(filepath)
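
The detected encoding is later used when writing the output files, so two cases may deserve care. The detector may report no encoding at all, or it may report plain ASCII, which cannot hold accented place names returned by GeoNames. The sketch below is one possible safeguard, assuming UTF-8 is an acceptable fallback. `resolve_encoding` is a hypothetical helper and is not part of chardet.

```python
def resolve_encoding(detected, default='utf-8'):
    """Fall back to a default when detection is missing or too narrow.

    UTF-8 is a superset of ASCII, so replacing 'ascii' loses nothing
    and allows non-ASCII characters in the output.
    """
    if detected is None or detected.lower() == 'ascii':
        return default
    return detected


print(resolve_encoding(None))
print(resolve_encoding('ascii'))
print(resolve_encoding('ISO-8859-1'))
```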

### Step 5: Extract keyword(s) from place name

In [None]:
def extract_placename(df):
    def to_locator(value):
        if pd.isna(value):
            #### e.g. 'Spatial Coverage' is empty
            #### return []
            return []
        if '|' in value:
            if ',' in value:
                #### e.g. Hellam, Pennsylvania|York County, Pennsylvania|Pennsylvania
                #### return ['Hellam, Pennsylvania', 'York County, Pennsylvania']
                return value.split('|')[0:-1]
            #### e.g. Asia|Europe|Arctic Circle|Arctic
            #### return ['Asia', 'Europe', 'Arctic Circle', 'Arctic']
            return value.split('|')
        #### e.g. Wisconsin
        #### return ['Wisconsin']
        return [value]

    ## Build the column in one pass: assigning to the rows yielded by
    ## iterrows() may only change a copy and leave df untouched
    df['Locator'] = df['Spatial Coverage'].apply(to_locator)
    return df
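
The extraction rules can be tried on single values without a dataframe. The sketch below mirrors the three cases for one string. `locators` is a hypothetical name, and an empty value is represented here by an empty string or `None`, not by NaN.

```python
def locators(spatial_coverage):
    """Return the place names to look up for one 'Spatial Coverage' value."""
    if not spatial_coverage:
        return []
    parts = spatial_coverage.split('|')
    if len(parts) > 1 and ',' in spatial_coverage:
        # The last entry is the enclosing region, so it is dropped
        return parts[:-1]
    return parts


print(locators('Hellam, Pennsylvania|York County, Pennsylvania|Pennsylvania'))
print(locators('Asia|Europe|Arctic Circle|Arctic'))
print(locators('Oneida County, Wisconsin|Wisconsin'))
```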

### Step 6: Split raw data into multiple smaller files

In [None]:
## Returns the list of file suffixes after splitting
## Appends the corresponding number to each file name
## e.g. code_1.csv, code_2.csv, ...
def split_csvs():
    suffix = 1
    suffixlist = []
    for i in range(len(df)):
        if i % rowlimit == 0:
            suffixlist.append(suffix)
            splitpath = os.path.join('data', code, f'{code}_{suffix}.csv')
            df[i:i + rowlimit].to_csv(splitpath, index = False, encoding = encoding)
            suffix += 1
    return suffixlist

splitlist = split_csvs()
print(splitlist)
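
To see which rows end up in which file, the split boundaries can be computed directly. This is a minimal sketch. `chunk_bounds` is a hypothetical helper and the row count of 600 is made up.

```python
def chunk_bounds(n_rows, rowlimit):
    """Return (suffix, start, stop) for each split file, suffixes from 1."""
    return [
        (suffix, start, min(start + rowlimit, n_rows))
        for suffix, start in enumerate(range(0, n_rows, rowlimit), start=1)
    ]


# 600 rows with a limit of 250 give three files; the last holds 100 rows
print(chunk_bounds(600, 250))
```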

## Part 5: Fetch URI, coordinates, and hierarchy of place name from GeoNames

### Step 7: Fetch URI, coordinates, and hierarchy of place name from GeoNames

In [None]:
def geoid(geoname, featureClass):
    if featureClass:
        return geonames(geoname, featureClass = featureClass, key = username)
    else: 
        return geonames(geoname, key = username)
    
def bbox(geonames_id):
    return geonames(geonames_id, method = 'details', key = username).bbox

def hierarchy(geonames_id):
    return geonames(geonames_id, method = 'hierarchy', key = username)
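
Place names often repeat across records, and every repeated lookup costs credits. One option is to cache lookups, as in the sketch below. `fake_search` is a made-up stand-in for a search function and sends no request. Note that `lru_cache` needs hashable arguments, so a list of feature classes would have to be passed as a tuple.

```python
from functools import lru_cache

calls = []  # records every simulated request


def fake_search(name, feature_class):
    """Stand-in for a GeoNames search; returns a made-up id, no network."""
    calls.append((name, feature_class))
    return len(name)  # placeholder id for illustration only


@lru_cache(maxsize=None)
def cached_search(name, feature_class=''):
    """Look a place up once and reuse the answer for identical queries."""
    return fake_search(name, feature_class)


cached_search('Wisconsin', 'A')
cached_search('Wisconsin', 'A')  # served from the cache, no new request
print(len(calls))
```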

***
> Note that the code block below needs to be run once for each smaller csv file. <br>
> Before each execution, manually change the parameter to the next suffix in the list of split files. 

In [None]:
def bbox_hierarchy(suffix): 
    splitpath = os.path.join('data', code, f'{code}_{suffix}.csv')
    outputpath = os.path.join('data', code, f'{code}_done_{suffix}.csv')
    df = pd.read_csv(splitpath)
    extract_placename(df)

    df['GeoNames URI'] = ''
    df['GeoNames Bbox'] = ''
    df['GeoNames Hierarchy'] = ''
    for _, row in df.iterrows():
        geonames_ids = []
        geonames = []
        ## Filter out GeoNames id using place name(['Locator']) and featureClass 
        for x in row['Locator']:
            geonames.append(x)
            #### For more information about featureClass, find here: http://www.geonames.org/export/codes.html
            #### A - country, state, region,...
            #### H - stream, lake, ...
            #### P - city, village,...
            if re.search('river|lake', x, re.IGNORECASE) and ',' not in x:
                geonames_id = geoid(x, 'H').geonames_id
            elif re.search('Antarctica|Arctic', x, re.IGNORECASE):
                #### Send each query once; `or` falls back to an unfiltered search
                result = geoid(x, ['L', 'H', 'T']) or geoid(x, '')
                geonames_id = result.geonames_id
            else:
                #### Try A, then P, then an unfiltered search, one request each
                result = geoid(x, 'A') or geoid(x, 'P') or geoid(x, '')
                geonames_id = result.geonames_id
            
            if geonames_id:
                geonames_ids.append(geonames_id)
            else: 
                pass
        
        ## Inspect each record    
        print(geonames, geonames_ids)
        
        ## If column['Locator'] only includes one place name    
        if len(geonames_ids) == 1:
            ##### GeoNames URI #####
            #### Write through df.at: assigning to `row` may only change a copy
            df.at[row.name, 'GeoNames URI'] = 'https://sws.geonames.org/' + str(geonames_ids[0])
            
            ##### GeoNames Bounding Box #####
            bboxdic = bbox(geonames_ids[0])
            #### Check if bounding box exists since some place names only contain centroids
            if bboxdic:
                bboxlist = [round(x, 2) for x in (bboxdic['southwest'] + bboxdic['northeast'])]
                df.at[row.name, 'GeoNames Bbox'] = ', '.join(str(x) for x in bboxlist)
            else:
                pass
            
            ##### GeoNames Hierarchy #####
            hlist = [r.address for r in hierarchy(geonames_ids[0])][1:]
            #### Check if this place is inside of U.S., if so append 'County' to county name
            if len(hlist) > 3 and hlist[1] == 'United States':
                hlist[3] = hlist[3] + ' County'
            df.at[row.name, 'GeoNames Hierarchy'] = '|'.join(str(x) for x in hlist)
        
        ## If column['Locator'] includes multiple place names
        elif len(geonames_ids) > 1:
            hlists = []
            bboxlists = []
            
            ##### GeoNames URI #####
            urilist = [('https://sws.geonames.org/' + str(x)) for x in geonames_ids]
            df.at[row.name, 'GeoNames URI'] = ', '.join(str(x) for x in urilist)
            
            ##### GeoNames Hierarchy and Bounding Box #####
            for i in geonames_ids:
                hlist = [r.address for r in hierarchy(i)][1:]
                if len(hlist) > 3 and hlist[1] == 'United States':
                    hlist[3] = hlist[3] + ' County'
                hlists.append(hlist)
                
                bboxdic = bbox(i)
                if bboxdic:
                    bboxlist = bboxdic['southwest'] + bboxdic['northeast']
                    bboxlists.append(bboxlist)
                else:
                    pass
            #### Find the largest extent
            if bboxlists:
                minX = min(x[0] for x in bboxlists)
                minY = min(x[1] for x in bboxlists)
                maxX = max(x[2] for x in bboxlists)
                maxY = max(x[3] for x in bboxlists)
                combinedBbox = [round(x, 2) for x in [minX, minY, maxX, maxY]]
                df.at[row.name, 'GeoNames Bbox'] = ', '.join(str(x) for x in combinedBbox)
            else: 
                pass               
            
            ##### GeoNames Hierarchy #####
            cols = list(range(0, max(len(i) for i in hlists)))
            df_h = pd.DataFrame(columns = cols, data = hlists)
            #### Merge the same hierarchy
            #### e.g. ['Hellam, Pennsylvania', 'York County, Pennsylvania']
            #### return North America|United States|Pennsylvania|York County|Township of Hellam
            nonan = [','.join(list(filter(None, set(df_h[col])))) for col in cols]
            df.at[row.name, 'GeoNames Hierarchy'] = '|'.join(str(x) for x in nonan)
    
    print('-------- End --------')
    df_clean = df.drop(columns = ['Locator'])
    df_clean.to_csv(outputpath, index = False, encoding = encoding)
    
bbox_hierarchy(1)
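
The two merging steps for records with several place names can be tried in isolation. The sketch below is a pure-Python variant that needs no dataframe. `merge_bboxes` and `merge_hierarchies` are hypothetical names, the coordinates are made up, and each box is assumed to be ordered `[south, west, north, east]`. Unlike the `set`-based merge in the cell above, this version keeps names in first-seen order.

```python
from itertools import zip_longest


def merge_bboxes(bboxes):
    """Smallest box covering all boxes; each is [south, west, north, east]."""
    south = min(b[0] for b in bboxes)
    west = min(b[1] for b in bboxes)
    north = max(b[2] for b in bboxes)
    east = max(b[3] for b in bboxes)
    return [round(v, 2) for v in (south, west, north, east)]


def merge_hierarchies(hierarchies):
    """Merge hierarchies level by level, keeping first-seen order."""
    levels = []
    for names in zip_longest(*hierarchies):
        # dict.fromkeys removes duplicates while preserving order
        unique = list(dict.fromkeys(n for n in names if n))
        levels.append(','.join(unique))
    return '|'.join(levels)


# Made-up coordinates for illustration
boxes = [[39.9, -76.7, 40.1, -76.5], [39.7, -77.1, 40.2, -76.2]]
print(merge_bboxes(boxes))

hierarchies = [
    ['North America', 'United States', 'Pennsylvania', 'York County',
     'Township of Hellam'],
    ['North America', 'United States', 'Pennsylvania', 'York County'],
]
print(merge_hierarchies(hierarchies))
```

Taking the minimum and maximum per coordinate does not handle boxes that cross the antimeridian, which would need separate treatment.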

## Part 6: Merge smaller output files into one

In [None]:
def combined_csvs():
    ## Combine all output files into one called 'code_done.csv'
    combined_csv = pd.concat([pd.read_csv(os.path.join('data', code, f'{code}_done_{suffix}.csv')) for suffix in splitlist])
    outputpath = os.path.join('data', code, code + '_done.csv')
    combined_csv.to_csv(outputpath, index = False, encoding = encoding)
    ## Remove the intermediate split files and per-split output files
    for suffix in splitlist:
        os.remove(os.path.join('data', code, f'{code}_{suffix}.csv'))
        os.remove(os.path.join('data', code, f'{code}_done_{suffix}.csv'))

combined_csvs()
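
The merge step can also be sketched with the standard library alone, which makes the header handling explicit. `merge_csv_texts` is a hypothetical helper working on in-memory strings. Unlike `pd.concat`, it assumes every part has the same header and raises an error otherwise.

```python
import csv
import io


def merge_csv_texts(parts):
    """Concatenate CSV texts that share a header, writing the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    header = None
    for text in parts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty parts
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError('CSV parts have different headers')
        writer.writerows(rows[1:])
    return out.getvalue()


# Made-up rows for illustration
parts = ['Slug,Title\na,Alpha\n', 'Slug,Title\nb,Beta\n']
print(merge_csv_texts(parts))
```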