# Trucial Coast Towns: Building a Historical Gazetteer Dataset with GeoNames


### Prerequisites:

This tutorial assumes a basic understanding of how to work with Jupyter Notebook. We strongly recommend that you complete the `Getting Started with Jupyter Notebook` tutorial before starting this one.


## 1. Introduction
Gazetteers are structured lists of place names that record each place's location, name variants, and historical context. In this tutorial, we will create a small dataset of historical places from the Trucial Coast (the region that became the United Arab Emirates in 1971) and enrich it using the GeoNames API.


## 2. Objective
We will:
- Start from a CSV file containing a few historical town names
- Use the GeoNames API to retrieve coordinates and metadata
- Save the enriched dataset to a new CSV file

## 3. Setup: Install Required Libraries

We’ll be using the `pandas` and `requests` libraries in this tutorial. You can install them by running the following command.

> **Note:** These libraries might already be installed in your environment.
In that case, running the command will display: `Requirement already satisfied`

In [None]:
!pip install pandas requests



## 4. Load the Input CSV

Your input file should be named `trucial_towns.csv` and have at least the following column:
- `name`: historical place name

Additional columns (optional but recommended):
- `type`: A description of the kind of place, e.g., fort, settlement, port
- `source`: A note about where the name came from, e.g., a historical map, archive, or article

The `type` and `source` columns are not used in the GeoNames query itself, but they are useful for organizing, filtering, or analyzing your results later.

Example row:
```
Ras Al Khaimah,port,Lorimer Gazetteer
```
📌 Note: You do not need to include a country code to make a query. Without one, the GeoNames API searches globally using only the place name.
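
If you do not have an input file yet, the layout described above can be sketched with Python's built-in `csv` module. This is only an illustration: `write_sample_csv` and `SAMPLE_ROWS` are hypothetical names, and the rows are taken from the preview table shown later in this tutorial. Note how the writer quotes a `source` value that itself contains a comma.

```python
import csv
import io

# Rows borrowed from this tutorial's preview table; the helper names are hypothetical.
SAMPLE_ROWS = [
    {"name": "Ras Al Khaimah", "type": "port", "source": "Lorimer, Gazetteer of the Persian Gulf"},
    {"name": "Sharjah", "type": "port", "source": "Lorimer, Gazetteer of the Persian Gulf"},
    {"name": "Dibba", "type": "coastal town", "source": "Arabian Gulf Studies, vol. 2"},
]

def write_sample_csv(handle, rows=SAMPLE_ROWS):
    """Write the rows to an open text handle as CSV, starting with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["name", "type", "source"])
    writer.writeheader()
    writer.writerows(rows)

# Demo: write to an in-memory buffer so nothing on disk is touched.
buffer = io.StringIO()
write_sample_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To create a real file, you could open `trucial_towns.csv` with `open(..., "w", newline="", encoding="utf-8")` and pass the handle to `write_sample_csv`.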


**Load the CSV:**

Make sure to save `trucial_towns.csv` in the same folder as this notebook file (.ipynb) so it can be loaded correctly.

The following code imports the `pandas` library and then loads the `trucial_towns.csv` file into a DataFrame called `df`. Finally, it displays the first five rows of the DataFrame to give you a preview of the loaded data.

In [22]:
import pandas as pd

# Load input CSV
input_file = "trucial_towns.csv"  # Make sure this file is in the same folder as your notebook
df = pd.read_csv(input_file)

# Display the data
df.head()

Unnamed: 0,name,country_code,type,notes,source
0,Ras Al Khaimah,AE,port,Base of Qawasim naval power in the 19th century,"Lorimer, Gazetteer of the Persian Gulf"
1,Sharjah,AE,port,Major settlement on the Trucial Coast,"Lorimer, Gazetteer of the Persian Gulf"
2,Umm Al Quwain,AE,town,Historically under Al Ali rule,"Lorimer, Gazetteer of the Persian Gulf"
3,Fujairah,AE,town,Political independence recognized in early 20t...,UAE National Archives
4,Dibba,AE,coastal town,"Split among three rulers, major trade point","Arabian Gulf Studies, vol. 2"
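
The `Unnamed: 0` column in the preview typically appears when a CSV was saved together with its DataFrame index. Below is a minimal sketch of an optional clean-up step before querying. `prepare_places` is a hypothetical helper, not part of the original notebook, and the demo rows are made up to show a blank name being dropped.

```python
import io
import pandas as pd

def prepare_places(df):
    """Return a cleaned copy: require 'name', drop leftover index columns and blank names."""
    if "name" not in df.columns:
        raise ValueError("Input CSV must contain a 'name' column")
    # Columns such as 'Unnamed: 0' usually come from a CSV saved with its index.
    leftover = [c for c in df.columns if str(c).startswith("Unnamed:")]
    cleaned = df.drop(columns=leftover).dropna(subset=["name"]).copy()
    # Strip stray whitespace so the API query uses the bare place name.
    cleaned["name"] = cleaned["name"].astype(str).str.strip()
    cleaned = cleaned[cleaned["name"] != ""]
    return cleaned.reset_index(drop=True)

# Demo on a small in-memory CSV (hypothetical rows, one with a missing name).
demo_csv = io.StringIO(",name,type\n0, Sharjah ,port\n1,,town\n2,Dibba,coastal town\n")
demo = prepare_places(pd.read_csv(demo_csv))
print(demo)
```

With the tutorial's data, this would be called as `df = prepare_places(df)` right after loading the file.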


## 5. Define the GeoNames Query Function
We will use the free GeoNames API to retrieve information for each town. Before proceeding, register on the GeoNames website at [https://www.geonames.org/login](https://www.geonames.org/login) to obtain a username.

The `query_geonames` function sends a request to the GeoNames API for a given place name, parses the JSON response, and returns the latitude, longitude, GeoNames ID, feature class, feature code, and a semicolon-separated string of alternate names. Alternate names that are not really names are filtered out: URLs, airport codes (IATA, ICAO, FAAC), postal codes, Wikidata IDs, and short, all-caps codes. If the request fails or nothing matches, the function returns six empty strings.

The following code defines the `query_geonames` function. Remember to replace `"yourGeonamesUsername"` with your actual GeoNames username.

In [40]:
import requests # For making HTTP requests to web services.
import time     # For pausing execution to avoid hitting API rate limits.

GEONAMES_USERNAME = "yourGeonamesUsername"  # CHANGE THIS with your GeoNames username

def query_geonames(place_name): # Function to query the GeoNames API for a given place name.
    base_url = "http://api.geonames.org/searchJSON" # Base URL for GeoNames search API.
    params = {
        'q': place_name,    # Query parameter: the place name to search.
        'maxRows': 1,       # Limit results to the top 1 match.
        'username': GEONAMES_USERNAME, # Your GeoNames username.
        'style': 'FULL'     # Request full details in the response.
    }
    
    try:
        response = requests.get(base_url, params=params, timeout=10) # Send GET request; give up after 10 seconds.
        response.raise_for_status()                     # Raise an exception for bad status codes (4xx or 5xx).
        
        print(f"Querying: {place_name}")               # Debug print: shows current query.
        print(f"Status Code: {response.status_code}") # Debug print: shows HTTP status.
        
        results = response.json()                       # Parse JSON response into a Python dictionary.
        # print(f"Response JSON: {results}")             # Debug print: shows full API response.
        geonames_data = results.get('geonames', [])     # Extract 'geonames' list, default to empty list if not found.
        
        if geonames_data:
            top = geonames_data[0]                      # Get the first (and only) result.
            
            filtered_alt_names = []                     # Initialize list for cleaned alternate names.
            if 'alternateNames' in top:
                for alt_obj in top['alternateNames']:
                    name = alt_obj.get('name', '')      # Get alternate name.
                    lang = alt_obj.get('lang', '')      # Get language code.

                    # Filter rules for alternate names:
                    if name.startswith('http'): continue          # Exclude URLs.
                    if lang in ['iata', 'icao', 'faac', 'post', 'wkdt']: continue # Exclude specific codes.
                    if len(name) <= 5 and name.isupper(): continue # Exclude short, all-caps codes.
                    if name.startswith('Q') and name[1:].isdigit(): continue # Exclude Wikidata Q-codes.

                    filtered_alt_names.append(name)     # Add valid alternate name.
            
            unique_alt_names = list(dict.fromkeys(filtered_alt_names)) # Remove duplicates while preserving order.
            alt_names_str = "; ".join(unique_alt_names) # Join unique names with semicolon.
            
            return (top.get('lat'),                     # Return latitude.
                    top.get('lng'),                     # Return longitude.
                    top.get('geonameId'),               # Return GeoNames ID.
                    top.get('fcl'),                     # Return feature class.
                    top.get('fcode'),                   # Return feature code.
                    alt_names_str)                      # Return filtered alternate names string.

    except requests.exceptions.RequestException as e:
        print(f"Error querying for '{place_name}': {e}") # Print error message for request failures.
        
    return '', '', '', '', '', '' # Return empty strings on error or no results.
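
To see what the alternate-name filter does without calling the API, the same four rules can be pulled into a small standalone function. This is a sketch for experimentation: `keep_alternate_name` and `EXCLUDED_LANGS` are hypothetical names, and the sample records are made up to resemble entries in the `alternateNames` list.

```python
EXCLUDED_LANGS = {"iata", "icao", "faac", "post", "wkdt"}

def keep_alternate_name(name, lang=""):
    """Apply the same filter rules as query_geonames to one alternate name."""
    if name.startswith("http"):
        return False  # URLs
    if lang in EXCLUDED_LANGS:
        return False  # airport, postal, and Wikidata codes
    if len(name) <= 5 and name.isupper():
        return False  # short, all-caps codes
    if name.startswith("Q") and name[1:].isdigit():
        return False  # Wikidata-style Q-codes without a language tag
    return True

# Hypothetical records shaped like GeoNames 'alternateNames' entries.
sample_alternates = [
    {"name": "Sharjah", "lang": "en"},
    {"name": "الشارقة", "lang": "ar"},
    {"name": "SHJ", "lang": "iata"},
    {"name": "https://example.org/sharjah", "lang": "link"},
    {"name": "Q12345"},
    {"name": "Sharjah"},
]

kept = [a["name"] for a in sample_alternates
        if keep_alternate_name(a["name"], a.get("lang", ""))]
unique = list(dict.fromkeys(kept))  # drop duplicates, keep first-seen order
alt_names_str = "; ".join(unique)
print(alt_names_str)
```

Only the English and Arabic names survive here; the airport code, the URL, the Q-code, and the repeated name are all dropped.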

## 6. Enrich the Dataset

This section iterates through each row of the loaded dataset (`df`). For each historical place name, it calls the `query_geonames()` function to fetch relevant geographical data from the GeoNames API. 

The retrieved data, including latitude, longitude, GeoNames ID, feature class, feature code, and alternate names, is then added as new columns to the DataFrame. 

A one-second pause (`time.sleep(1)`) is incorporated between API requests to prevent exceeding GeoNames' rate limits. 
After processing all entries, the updated DataFrame with the newly added information is displayed.

In [41]:
df['latitude'] = ''
df['longitude'] = ''
df['geonames_id'] = ''
df['feature_class'] = ''
df['feature_code'] = ''
df['alternate_names'] = ''

# Query each place
for idx, row in df.iterrows():
    # Unpack the six values returned by query_geonames
    lat, lon, gid, fcl, fcode, alt_names = query_geonames(row['name']) 
    
    df.at[idx, 'latitude'] = lat
    df.at[idx, 'longitude'] = lon
    df.at[idx, 'geonames_id'] = gid
    df.at[idx, 'feature_class'] = fcl
    df.at[idx, 'feature_code'] = fcode
    df.at[idx, 'alternate_names'] = alt_names # Populate the new column
    
    time.sleep(1)  # Pause to avoid rate limits

# Show the first five entries of the table
print(df.head())

Querying: Ras Al Khaimah
Status Code: 200
Querying: Sharjah
Status Code: 200
Querying: Umm Al Quwain
Status Code: 200
Querying: Fujairah
Status Code: 200
Querying: Dibba
Status Code: 200
             name country_code          type  \
0  Ras Al Khaimah           AE          port   
1         Sharjah           AE          port   
2   Umm Al Quwain           AE          town   
3        Fujairah           AE          town   
4           Dibba           AE  coastal town   

                                               notes  \
0    Base of Qawasim naval power in the 19th century   
1              Major settlement on the Trucial Coast   
2                     Historically under Al Ali rule   
3  Political independence recognized in early 20t...   
4        Split among three rulers, major trade point   

                                   source  latitude longitude geonames_id  \
0  Lorimer, Gazetteer of the Persian Gulf  25.78953   55.9432      291074   
1  Lorimer, Gazetteer of the Pers
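
A `Status Code: 200` line does not by itself guarantee a usable answer. To our understanding, GeoNames reports problems such as an unknown username or an exceeded quota inside the JSON body, in a `status` object with a `message` and a numeric `value`. Assuming that response shape, a small check could look like the sketch below. `geonames_error` is a hypothetical helper, and both payloads are made-up examples.

```python
def geonames_error(payload):
    """Return a readable error string if the parsed JSON has a 'status' object, else None."""
    status = payload.get("status")
    if isinstance(status, dict):
        return f"GeoNames error {status.get('value')}: {status.get('message')}"
    return None

# Hypothetical response bodies, already parsed from JSON.
ok_payload = {"totalResultsCount": 1, "geonames": [{"name": "Sharjah"}]}
bad_payload = {"status": {"message": "user does not exist.", "value": 10}}

print(geonames_error(ok_payload))
print(geonames_error(bad_payload))
```

Inside `query_geonames`, such a check would sit right after `results = response.json()`, so that an account problem is reported instead of silently producing empty columns.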

## 7. Save the Result into a CSV file

After the dataset has been enriched with information from GeoNames, the next step is to save the updated DataFrame to a new CSV file. This ensures that the retrieved data is persistently stored and can be used for further analysis or visualization.

The following code block saves the `df` DataFrame to a new CSV file named `trucial_towns_enriched.csv`. 
The `index=False` argument prevents pandas from writing the DataFrame index as a column in the CSV file. 
A confirmation message is then printed to indicate that the file has been saved.

In [29]:
output_file = "trucial_towns_enriched.csv"
df.to_csv(output_file, index=False) # Save the output as a csv file
print(f"Saved enriched dataset to {output_file}")

Saved enriched dataset to trucial_towns_enriched.csv
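
Because `query_geonames` returns empty strings when nothing is found, the coordinate columns hold text rather than numbers. Before mapping or analysis, one option is to convert them with `pd.to_numeric`, as in this sketch. The second row is a hypothetical unmatched place added for illustration.

```python
import pandas as pd

# Small stand-in for the enriched table; the second row simulates a failed lookup.
enriched = pd.DataFrame({
    "name": ["Ras Al Khaimah", "Hypothetical unmatched place"],
    "latitude": ["25.78953", ""],
    "longitude": ["55.9432", ""],
})

# errors="coerce" turns values that cannot be parsed (such as '') into NaN.
for col in ["latitude", "longitude"]:
    enriched[col] = pd.to_numeric(enriched[col], errors="coerce")

# Rows without coordinates are the ones GeoNames could not match.
unmatched = enriched[enriched["latitude"].isna()]
print(unmatched["name"].tolist())
```

Listing the unmatched rows this way gives you a short review queue of names to check by hand.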


## 8. Querying by Country, Feature Class, and Feature Code

Sometimes you want to find specific types of geographical features within a particular country, rather than searching for a named place globally. The GeoNames API lets you refine a search with these parameters:
- `country` (two-letter ISO country code)
- `featureClass` (a broad category, such as 'H' for hydrographic features or 'P' for populated places)
- `featureCode` (a more specific type within a feature class, such as 'STM' for stream or 'LK' for lake)

You can get the list of featureCodes that belong to each featureClass [here](https://www.geonames.org/export/codes.html).

This section demonstrates how to query for all hydrographic features (rivers, lakes, etc.) within a specific country (e.g., Turkey) and then save these results to a CSV file.

First, let's define a new function `query_geonames_by_criteria` that takes `country_code`, `feature_class`, and an optional `feature_code` as arguments. This function will fetch results based on these criteria.

In [34]:
import requests
import pandas as pd # Import pandas to work with DataFrames and save to CSV

GEONAMES_USERNAME = "yourGeonamesUsername"  # Replace with your actual GeoNames username

def query_geonames_by_criteria(country_code, feature_class, feature_code=None, max_rows=1000):
    # This function queries GeoNames for features based on country, feature class, and optional feature code
    url = "http://api.geonames.org/searchJSON"
    params = {
        'country': country_code,     # ISO 2-letter country code (e.g., 'TR' for Turkey)
        'featureClass': feature_class, # Feature Class (e.g., 'H' for hydrographic, 'P' for populated place)
        'maxRows': max_rows,         # Maximum number of results to retrieve (up to 1000 for free account)
        'username': GEONAMES_USERNAME
    }
    
    if feature_code:
        params['featureCode'] = feature_code # Add featureCode to parameters if provided.

    print(f"Querying GeoNames for features in {country_code} (Class: {feature_class}, Code: {feature_code if feature_code else 'Any'})...")
    
    try:
        response = requests.get(url, params=params, timeout=30) # Send GET request; give up after 30 seconds
        response.raise_for_status() # Raise an exception for HTTP errors
        
        print(f"Status Code: {response.status_code}") # Print HTTP status
        data = response.json() # Parse JSON response
        results = data.get('geonames', []) # Extract 'geonames' list
        
        print(f"Found {len(results)} features.") # Print number of results
        
        # Optionally print a sample of the results
        for i, place in enumerate(results[:5]):
            print(f"  Sample {i+1}: Name: {place.get('name')}, Code: {place.get('fcode')}, Lat: {place.get('lat')}, Lng: {place.get('lng')}")
            
        return results
        
    except requests.exceptions.RequestException as e: # Catch errors as exceptions
        print(f"Error querying GeoNames: {e}") # Print error message
        return []

# Example Usage: Query for hydrographic features in Turkey
print("\n--- Querying Hydrographic Features in Turkey ---")
tr_hydro_features = query_geonames_by_criteria(country_code='TR', feature_class='H')

# Example Usage: Query for a specific feature code - Lakes (LK) in Turkey
print("\n--- Querying Lakes (LK) in Turkey ---")
tr_lakes = query_geonames_by_criteria(country_code='TR', feature_class='H', feature_code='LK')


--- Querying Hydrographic Features in Turkey ---
Querying GeoNames for features in TR (Class: H, Code: Any)...
Status Code: 200
Found 1000 features.
  Sample 1: Name: Lake Van, Code: LK, Lat: 38.62457, Lng: 42.90604
  Sample 2: Name: Bosporus, Code: STRT, Lat: 41.10164, Lng: 29.06097
  Sample 3: Name: Lake Tuz, Code: LK, Lat: 38.72044, Lng: 33.38254
  Sample 4: Name: Aegean Sea, Code: SEA, Lat: 39, Lng: 25
  Sample 5: Name: Lake İznik, Code: LK, Lat: 40.43361, Lng: 29.51861

--- Querying Lakes (LK) in Turkey ---
Querying GeoNames for features in TR (Class: H, Code: LK)...
Status Code: 200
Found 257 features.
  Sample 1: Name: Lake Van, Code: LK, Lat: 38.62457, Lng: 42.90604
  Sample 2: Name: Lake Tuz, Code: LK, Lat: 38.72044, Lng: 33.38254
  Sample 3: Name: Kartsakhi Lake, Code: LK, Lat: 41.20741, Lng: 43.2091
  Sample 4: Name: Lake İznik, Code: LK, Lat: 40.43361, Lng: 29.51861
  Sample 5: Name: Abant Gölü, Code: LK, Lat: 40.605, Lng: 31.27972
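
The first query returned exactly 1000 features, the same number as `max_rows`, so more matches may exist than were fetched. As far as we know, the GeoNames search service supports paging through a `startRow` offset and reports the overall match count in `totalResultsCount`; please confirm both, and the limits of the free service, in the GeoNames documentation. Under that assumption, the request parameters for each page could be prepared like this. `build_page_params` is a hypothetical helper, and the total of 2500 is invented.

```python
def build_page_params(country_code, feature_class, total, page_size=1000, feature_code=None):
    """Return one parameter dict per page needed to cover `total` results."""
    pages = []
    for start_row in range(0, total, page_size):
        params = {
            "country": country_code,
            "featureClass": feature_class,
            "maxRows": page_size,
            "startRow": start_row,  # assumed paging offset parameter
        }
        if feature_code:
            params["featureCode"] = feature_code
        pages.append(params)
    return pages

# Hypothetical total: in practice it would be read from the first response.
pages = build_page_params("TR", "H", total=2500)
print([p["startRow"] for p in pages])
```

Each dictionary still needs the `username` entry before it is passed to `requests.get`, and a pause between pages keeps the requests within the rate limits.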


Now that we have a function to query based on specific criteria, let's save the results into a CSV file for further use. We'll convert the list of dictionaries returned by the function into a pandas DataFrame and then save it.

In [36]:
# Define the desired column order
desired_columns = ['name', 'lat', 'lng', 'countryName', 'countryCode', 'fcl', 'fclName', 'fcode', 'fcodeName', 'adminName1']

# Process and save hydrographic features for Turkey
if tr_hydro_features:
    df_tr_hydro = pd.DataFrame(tr_hydro_features) # Convert results to DataFrame
    df_tr_hydro_selected = df_tr_hydro[desired_columns]     # Select and reorder columns
    output_filename_hydro_tr = "tr_hydrographic_features.csv" # Define output filename
    df_tr_hydro_selected.to_csv(output_filename_hydro_tr, index=False) # Save to CSV, no index
    print(f"\nSaved {len(tr_hydro_features)} hydrographic features from Turkey to {output_filename_hydro_tr} with specified columns.")

# Process and save lakes for Turkey
if tr_lakes:
    df_tr_lakes = pd.DataFrame(tr_lakes)
    df_tr_lakes_selected = df_tr_lakes[desired_columns]
    output_filename_lakes_tr = "tr_lakes.csv"
    df_tr_lakes_selected.to_csv(output_filename_lakes_tr, index=False)
    print(f"Saved {len(tr_lakes)} lakes from Turkey to {output_filename_lakes_tr} with specified columns.")


Saved 1000 hydrographic features from Turkey to tr_hydrographic_features.csv with specified columns.
Saved 257 lakes from Turkey to tr_lakes.csv with specified columns.
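
Selecting columns with `df[desired_columns]` raises a `KeyError` if none of the returned records contains one of the listed fields. A more forgiving alternative, sketched here on a single record based on the sample output above, is `DataFrame.reindex`, which creates any missing column and fills it with `NaN`. The shortened column list is for illustration only.

```python
import pandas as pd

desired_columns = ["name", "lat", "lng", "fcode", "adminName1"]

# One record without 'adminName1', to simulate a field missing from the results.
records = [{"name": "Lake Van", "lat": "38.62457", "lng": "42.90604", "fcode": "LK"}]
df_demo = pd.DataFrame(records)

# reindex keeps the requested order and adds absent columns filled with NaN.
selected = df_demo.reindex(columns=desired_columns)
print(selected)
```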


## 9. What's Next?
Now that you have coordinates and metadata, you can:
- Import into QGIS or Google My Maps for visualization
- Align with the World Historical Gazetteer (WHG)
- Add temporal information and submit your dataset to WHG for educational use

You can also extend this notebook to:
- Query for multiple alternate names
- Visualize the towns using Python libraries like `folium` or `plotly`
- Compare coverage with Wikidata or WHG
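
As one possible starting point for the QGIS import mentioned above, the enriched rows can be written out as GeoJSON, a format QGIS can open directly. The sketch below uses only the standard library. `to_geojson` is a hypothetical helper, and the second row is a made-up unmatched place showing that rows without coordinates are skipped.

```python
import json

def to_geojson(rows):
    """Build a GeoJSON FeatureCollection of points from rows with latitude/longitude."""
    features = []
    for row in rows:
        try:
            lon, lat = float(row["longitude"]), float(row["latitude"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": row.get("name"), "geonames_id": row.get("geonames_id")},
        })
    return {"type": "FeatureCollection", "features": features}

rows = [
    {"name": "Ras Al Khaimah", "latitude": "25.78953", "longitude": "55.9432", "geonames_id": 291074},
    {"name": "Hypothetical unmatched place", "latitude": "", "longitude": ""},
]
collection = to_geojson(rows)
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
print(geojson_text)
```

With the tutorial's DataFrame, the rows could come from `df.to_dict("records")`, and `geojson_text` could be written to a `.geojson` file.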