# Purpose

This script scans the DCAT 1.1 APIs of ArcGIS Hubs and returns the metadata for all suitable items. It produces two CSV files:
1. A CSV with all metadata except the link fields
2. A CSV with only the link fields (landing page, download, and REST services)
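
For orientation, the sketch below shows the general shape of a DCAT 1.1 feed as this script reads it: a JSON object whose `dataset` array holds one entry per item. The keys are the ones the script looks up later in the Notebook. Every value is invented for illustration and does not come from a real Hub.

```python
import json

# A trimmed, made-up feed with a single dataset entry (all values are hypothetical)
sample_feed = json.loads("""
{
  "dataset": [
    {
      "identifier": "https://www.arcgis.com/home/item.html?id=abc123&sublayer=0",
      "title": "Parcels 2019",
      "description": "<p>Tax parcel boundaries</p>",
      "landingPage": "https://example.org/datasets/parcels",
      "issued": "2019-05-01T12:00:00.000Z",
      "modified": "2020-01-15T08:30:00.000Z",
      "publisher": {"name": "Example County"},
      "keyword": ["parcels", "cadastral"],
      "distribution": [
        {"title": "Shapefile", "accessURL": "https://example.org/parcels.zip"},
        {"title": "ArcGIS GeoService",
         "accessURL": "https://example.org/arcgis/rest/services/Parcels/FeatureServer/0"}
      ]
    }
  ]
}
""")

entry = sample_feed["dataset"][0]

# The script keeps only the date part of the timestamps
issued_date = entry["issued"].split("T", 1)[0]

# Distribution titles decide the Resource Class and the link fields
distribution_titles = [d["title"] for d in entry["distribution"]]

print(issued_date)
print(distribution_titles)
```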

## Prepare the list of active ArcGIS Hubs

We maintain a list of active ArcGIS Hub sites in GEOMG. (Access to GEOMG requires a login account. External users can create their own list or use one provided in this repository.)

1. Filter for items with these parameters:
   - Resource Class: Websites
   - Accrual Method: DCAT US 1.1
   - [Shortcut query](https://geo.btaa.org/admin/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=DCAT+US+1.1&f%5Bdct_format_s%5D%5B%5D=ArcGIS+Hub&f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites&rows=20&sort=score+desc)
   
2. Export the filtered results as a CSV, rename the downloaded file to `arcHubs.csv`, and move it into the same directory as this Notebook.

Exporting from GEOMG produces a CSV containing all of the metadata associated with each Hub. This script uses only the following fields:

* **ID**: Unique code assigned to each portal. This is transferred to the "Code" and "Is Part Of" fields for each dataset.
* **Identifier**: The URL of the Hub's DCAT 1.1 feed. This is the address the script requests.
* **Title**: The name of the Hub. This is transferred to the "Provider" field for each dataset.
* **Publisher**: The place or administration associated with the portal. This is appended to the title of each dataset in brackets.
* **Spatial Coverage**: A list of place names. These are transferred to the Spatial Coverage for each dataset.
* **Bounding Box**: The extent of the Hub. This is used as the default for any dataset whose own extent cannot be read from the feed.
* **Member Of**: The ID of a larger collection level record. Most of the Hubs are either part of our [Government Open Geospatial Data Collection](https://geo.btaa.org/catalog/ba5cc745-21c5-4ae9-954b-72dd8db6815a) or the [Research Institutes Geospatial Data Collection](https://geo.btaa.org/catalog/b0153110-e455-4ced-9114-9b13250a7093).
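
Before running the scan, it can help to confirm that the exported CSV has every column the script reads. The following is a minimal sketch of such a check, not part of the script itself: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the sample export is made up.

```python
import csv
import io

# Columns the script reads from each row of arcHubs.csv
REQUIRED_COLUMNS = ["ID", "Identifier", "Title", "Publisher",
                    "Spatial Coverage", "Bounding Box", "Member Of"]


def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical export that lacks the Bounding Box column
sample_export = io.StringIO(
    "ID,Identifier,Title,Publisher,Spatial Coverage,Member Of\n"
    "00x-00000,https://example.org/api/feed/dcat-us/1.1.json,"
    "Example Hub,Example County,Example State,collection-id\n"
)
print(missing_columns(sample_export))
```

With a real export, pass an open file instead, for example `open("arcHubs.csv", newline="", encoding="utf-8")`.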


-------------------

In [31]:
directory = "."  # Directory containing arcHubs.csv
hubFile = "arcHubs.csv"  # the name of the CSV file with the list of ArcGIS Hubs

In [32]:
import csv
import json
import os
import re
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs
import sys

import numpy as np
import pandas as pd
import requests

ActionDate = time.strftime('%Y%m%d')

In [33]:
fieldnames = [
    "Title",
    "Alternative Title",
    "Description",
    "Language",
    "Display Note",
    "Creator",
    "Title Source",
    "Resource Class",
    "Resource Type",
    "Keyword",
    "Date Issued",
    "Temporal Coverage",
    "Date Range",
    "Spatial Coverage",
    "Bounding Box",
    "Format",
    "full_layer_description",
    "download",
    "arcgis_dynamic_map_layer",
    "arcgis_feature_layer",
    "arcgis_image_map_layer",
    "arcgis_tiled_map_layer",
    "ID",
    "Identifier",
    "Provider",
    "Code",
    "Member Of",
    "Is Part Of",
    "Rights",
    "Accrual Method",
    "Date Accessioned",
    "Access Rights",
]

In [34]:
class MLStripper(HTMLParser): 
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d): 
        self.fed.append(d)

    def get_data(self): 
        return "".join(self.fed)


def strip_tags(html): 
    s = MLStripper()
    s.feed(html)
    return s.get_data()


def cleanData(value):
    return strip_tags(value)
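
Feed descriptions often arrive as HTML fragments, which is why the helpers above pass every text value through an `HTMLParser` subclass that keeps only the character data. The sketch below repeats the same idea in a self-contained form so it can be run on its own. `TextOnly` and `strip_tags_demo` are illustrative names, and the sample string is made up.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_demo(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still held in the parser's buffer
    return "".join(parser.parts)


sample = "<p>Parcel boundaries &amp; <b>zoning</b> districts</p>"
stripped = strip_tags_demo(sample)
print(stripped)  # Parcel boundaries & zoning districts
```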

In [35]:
def getIdentifiers(data):
    json_ids = {}
    for x in range(len(data["dataset"])):
        json_ids[x] = data["dataset"][x]["identifier"]
    return json_ids


def format_title(alternativeTitle, titleSource):
    year = ''
    try:
        # Matches a range of years (yyyy-yyyy)
        year_range = re.findall(r'\b(\d{4})-(\d{4})\b', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        year_range = ''
    try:
        # Matches a standalone year (1700-2099) that is not part of a longer digit sequence.
        # Because the leading .* is greedy, the last such year in the title is captured.
        single_year = re.match(r'.*(?<!\d)(17\d{2}|18\d{2}|19\d{2}|20\d{2})(?!\d)', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        single_year = ''

    if year_range:   
        year = '-'.join(year_range[0])
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
    elif single_year:  
        year = single_year.group(1)
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')

    altTitle = str(alternativeTitle)
    title = altTitle + ' [{}]'.format(titleSource)   
    if year:
        title += ' {' + year +'}'
    return title
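
To make the title convention concrete, here is a small worked sketch that follows the same steps as `format_title`: pull a year or year range out of the original title, append the publisher in square brackets, then append the year in curly braces. `build_title` is an illustrative stand-in rather than the function used by the script, and the titles and publisher are invented.

```python
import re

YEAR_RANGE = re.compile(r"\b(\d{4})-(\d{4})\b")
SINGLE_YEAR = re.compile(r".*(?<!\d)(17\d{2}|18\d{2}|19\d{2}|20\d{2})(?!\d)")


def build_title(alternative_title, title_source):
    """Return 'Name [Publisher] {year}', or 'Name [Publisher]' when no year is found."""
    year = ""
    range_match = YEAR_RANGE.search(alternative_title)
    single_match = SINGLE_YEAR.match(alternative_title)
    if range_match:
        year = range_match.group(1) + "-" + range_match.group(2)
    elif single_match:
        year = single_match.group(1)
    if year:
        # Remove the year from the name and tidy a trailing comma
        alternative_title = alternative_title.replace(year, "").strip().rstrip(",")
    title = alternative_title + " [" + title_source + "]"
    if year:
        title += " {" + year + "}"
    return title


print(build_title("Parcels 2019", "Example County"))
print(build_title("Land Use, 2010-2015", "Example County"))
print(build_title("Bike Routes", "Example County"))
```

The year in curly braces is what the later Temporal Coverage step looks for. Titles without a year fall back to the "Last modified" date instead.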

In [36]:
def metadataNewItems(newdata, newitem_ids, language, displayNote, titleSource, spatialCoverage, 
                     provider, hubCode, memberOf, isPartOf, accrualMethod, dateAccessioned, accessRights, default_bbox):
    newItemDict = {}
    for y, v in newitem_ids.items():
        identifier = v
        metadata = []
        
        # Alternative Title
        try:
            alternativeTitle = str(cleanData(newdata["dataset"][y]['title']))
        except:
            alternativeTitle = str(newdata["dataset"][y]['title'])

        # Format title
        title = format_title(alternativeTitle, titleSource)

        # Description
        description = cleanData(newdata["dataset"][y]['description'])
        description = description.replace("{{default.description}}", "").replace("{{description}}", "")
        description = re.sub(r'[\n]+|[\r\n]+', ' ', description, flags=re.S)
        description = re.sub(r'\s{2,}', ' ', description)
        description = description.translate({8217: "'", 8220: '"', 8221: '"', 160: "", 183: "", 8226: "", 8211: "-", 8203: ""})

        # Creator
        creator = newdata["dataset"][y]["publisher"]
        for pub in creator.values():
            try:
                creator = pub.replace(u"\u2019", "'")
            except:
                creator = pub

        # Initialize link-related variables
        information = cleanData(newdata["dataset"][y]['landingPage'])
        downloadURL = ""
        mapServer = ""
        featureServer = ""
        imageServer = ""
        tileServer = ""

        # Resource properties
        resourceClass = ""
        formatElement = ""
        resourceType = ""
        keyword_list = '|'.join(newdata["dataset"][y].get("keyword", [])).replace(' ', '')

        dateIssued = cleanData(newdata["dataset"][y]['issued']).split('T', 1)[0] 
        dateModified = cleanData(newdata["dataset"][y]['modified']).split('T', 1)[0]

        # Temporal Coverage
        temporalCoverage = ""
        dateRange = ""
        if re.search(r"\{(.*?)\}", title):     
            temporalCoverage = re.search(r"\{(.*?)\}", title).group(1)
            # If temporalCoverage = YYYY or YYYY-YYYY, dateRange can be set accordingly
            if '-' in temporalCoverage:
                # format: YYYY-YYYY
                dateRange = temporalCoverage
            else:
                # single year
                dateRange = temporalCoverage + '-' + temporalCoverage
        else:
            temporalCoverage = 'Last modified ' + dateModified

        # Rights
        rights = cleanData(newdata["dataset"][y].get('license', ''))

        # Bounding Box
        try:
            scanned_bbox = newdata["dataset"][y]["spatial"]

            # Ensure it has the expected format
            if scanned_bbox.get("type") == "envelope" and "coordinates" in scanned_bbox:
                coordinates = scanned_bbox["coordinates"]
                # Flatten the list of coordinates and round them
                rounded_coordinates = [
                    str(round(coord, 2)) for pair in coordinates for coord in pair
                ]
                # Format the bounding box as a comma-separated string
                bbox = ','.join(rounded_coordinates)
            else:
                bbox = default_bbox  # Use the default bbox from the CSV
        except Exception as e:
            bbox = default_bbox  # Use the default bbox from the CSV in case of errors

        # Determine Resource Class/Type and check for shapefile downloads
        distribution = newdata["dataset"][y].get("distribution", [])
        for dictionary in distribution:
            try:
                dist_title = dictionary.get("title", "")
                # If we find a shapefile distribution
                if dist_title == "Shapefile":
                    resourceClass = "Datasets|Web services"
                    formatElement = "Shapefile"
                    downloadURL = dictionary.get("accessURL", "")
                # ArcGIS GeoService 
                if dist_title == "ArcGIS GeoService":
                    webService = dictionary.get('accessURL', '')
                    if "FeatureServer" in webService:
                        featureServer = webService
                        if resourceClass == "":
                            resourceClass = "Web services"
                            
                    if "MapServer" in webService:
                        mapServer = webService
                        if resourceClass == "":
                            resourceClass = "Web services"
                            
                    if "ImageServer" in webService:
                        imageServer = webService
                        resourceClass = "Imagery|Web services"
                        formatElement = 'Imagery'
                        resourceType = "Raster data"
                    if "TileServer" in webService:
                        tileServer = webService
                        if resourceClass == "":
                            resourceClass = "Web services"

            except:
                continue

        # If LiDAR found in title or description
        if 'LiDAR' in title or 'LiDAR' in description:
            resourceType = 'LiDAR'


        # ID/Identifier
        slug = identifier.split('=', 1)[-1].replace("&sublayer=", "_")
        querystring = parse_qs(urlparse(identifier).query)
        if "id" in querystring:
            identifier_new = "https://hub.arcgis.com/datasets/" + querystring["id"][0]
        else:
            identifier_new = identifier

        metadataList = [
            title, 
            alternativeTitle, 
            description,
            language,
            displayNote,
            creator,
            titleSource,
            resourceClass, 
            resourceType,
            keyword_list, 
            dateIssued, 
            temporalCoverage,
            dateRange, 
            spatialCoverage, 
            bbox,
            formatElement, 
            information, 
            downloadURL,
            mapServer, 
            featureServer,
            imageServer, 
            tileServer,
            slug, 
            identifier_new, 
            provider, 
            hubCode, 
            memberOf, 
            isPartOf, 
            rights,
            accrualMethod,
            dateAccessioned, 
            accessRights
        ]     

        newItemDict[slug] = metadataList

    return newItemDict
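
Two steps inside `metadataNewItems` are easier to follow in isolation: flattening an envelope into a bounding box string, and deriving the ID and Identifier from the feed's `identifier` URL. The sketch below restates both as standalone functions under the same assumptions the script makes. The function names, coordinates, and URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def bbox_from_envelope(spatial, default_bbox):
    """Flatten an envelope's coordinate pairs into a comma-separated string."""
    if (isinstance(spatial, dict)
            and spatial.get("type") == "envelope"
            and "coordinates" in spatial):
        return ",".join(
            str(round(coord, 2)) for pair in spatial["coordinates"] for coord in pair
        )
    # Anything else falls back to the Hub-level bounding box
    return default_bbox


def slug_and_identifier(identifier):
    """Return (ID, Identifier) derived from a feed identifier URL."""
    # Keep everything after the first '=' and join a sublayer with '_'
    slug = identifier.split("=", 1)[-1].replace("&sublayer=", "_")
    query = parse_qs(urlparse(identifier).query)
    if "id" in query:
        return slug, "https://hub.arcgis.com/datasets/" + query["id"][0]
    return slug, identifier


envelope = {"type": "envelope", "coordinates": [[-93.5123, 44.8071], [-92.9049, 45.2]]}
print(bbox_from_envelope(envelope, "-97.5,43.0,-89.0,49.5"))
print(slug_and_identifier("https://www.arcgis.com/home/item.html?id=abc123&sublayer=0"))
```

Note that the sublayer number survives in the ID (`abc123_0`) but not in the rewritten Identifier URL, which is built from the `id` parameter alone.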

In [37]:
# Main execution
allRecords = []

with open(os.path.join(directory, hubFile), newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        hubCode = row['ID']
        url = row['Identifier']
        provider = row['Title']
        titleSource = row['Publisher']
        spatialCoverage = row['Spatial Coverage']
        isPartOf = row['ID']
        memberOf = row['Member Of']
        accrualMethod = "ArcGIS Hub"
        dateAccessioned = time.strftime('%Y-%m-%d')
        accessRights = "Public"
        language = "eng"
        displayNote = ("This dataset was automatically cataloged from the provider's ArcGIS Hub. "
                       "In some cases, information shown here may be incorrect or out-of-date. "
                       "Click the 'Visit Source' button to search for items on the original provider's website.")

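        print("scanning ", hubCode, url)
        try:
            # The 60-second timeout keeps one unresponsive Hub from stalling the whole scan
            response = requests.get(url, timeout=60)
        except requests.RequestException as e:
            print("Request failed: ", e)
            continue
        if response.status_code == 200 and 'application/json' in response.headers.get('Content-Type', ''):
            newdata = response.json()
        else:
            print("Failed to fetch or incorrect content type: ", response.status_code, response.headers.get('Content-Type', ''))
            continue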

        newjson_ids = getIdentifiers(newdata)
        record_dict = metadataNewItems(newdata, newjson_ids, language, displayNote, titleSource, spatialCoverage, 
                               provider, hubCode, memberOf, isPartOf, accrualMethod, dateAccessioned, accessRights, row['Bounding Box'])

        if record_dict:
            allRecords.append(record_dict)

scanning  04b-24003 https://maps.aacounty.org/api/feed/dcat-us/1.1.json
scanning  08b-42003 https://openac-alcogis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  11b-39009 https://data-athgis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
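
The scan accepts a response only when the status code is 200 and the `Content-Type` header mentions JSON. Hubs that redirect to an HTML page or return an error are skipped. The sketch below isolates that test as a pure function so it can be tried without network access. `is_json_response` is a hypothetical helper name, and it takes a plain dictionary, which is case-sensitive, unlike the header object returned by `requests`.

```python
def is_json_response(status_code, headers):
    """Return True when a response looks like a usable JSON feed."""
    # .get() means a missing header counts as a failure instead of raising KeyError
    content_type = headers.get("Content-Type", "")
    return status_code == 200 and "application/json" in content_type


print(is_json_response(200, {"Content-Type": "application/json; charset=utf-8"}))  # True
print(is_json_response(200, {"Content-Type": "text/html"}))                        # False
print(is_json_response(200, {}))                                                   # False
```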


In [38]:
# Flatten allRecords into a list of dicts suitable for DataFrame creation
flat_data = []
for rec in allRecords:
    for slug, values in rec.items():
        row_dict = dict(zip(fieldnames, values))
        flat_data.append(row_dict)

if not flat_data:
    print("No records found, no CSV will be created.")
    sys.exit(0)

df = pd.DataFrame(flat_data)

### Cleanup

In [39]:
# Drop duplicates by ID and Title
df = df.drop_duplicates(subset=['ID'])
df = df.drop_duplicates(subset=['Title'])

# Fix Department of the Interior issue
df['Creator'] = df['Creator'].replace("{'name': 'Department of the Interior'}", "Department of the Interior")

# Drop rows without a Resource Class
df['Resource Class'] = df['Resource Class'].replace(r'^\s*$', np.nan, regex=True)  # convert blank strings to NaN
df = df.dropna(subset=['Resource Class'])

In [40]:
# Create the first CSV (all fields except links)
# Updated fields for links
link_fields = ['full_layer_description', 'download', 'arcgis_dynamic_map_layer', 'arcgis_feature_layer', 'arcgis_image_map_layer', 'arcgis_tiled_map_layer']
df_first_csv = df.drop(columns=link_fields)
df_first_csv.to_csv(f'{ActionDate}_ArcHubs-metadata.csv', index=False)

# Create the second CSV with friendlier_id, reference_type, distribution_url, and label
rows = []
for _, r in df.iterrows():
    slug = r['ID']
    for lf in link_fields:
        if pd.notna(r[lf]) and r[lf] != "":
            rows.append({'friendlier_id': slug, 'reference_type': lf, 'distribution_url': r[lf], 'label': r['Format']})

df_second_csv = pd.DataFrame(rows, columns=['friendlier_id', 'reference_type', 'distribution_url', 'label'])
df_second_csv.to_csv(f'{ActionDate}_ArcHubs-links.csv', index=False)

print("CSV files have been created successfully.")

CSV files have been created successfully.
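
The links file is a wide-to-long reshaping of the link columns: each non-empty link in a record becomes its own row, keyed by the record's ID. The following sketch shows that transformation on a single made-up record using plain dictionaries instead of pandas. `LINK_FIELDS` is shortened to three fields for brevity, and `to_link_rows` is an illustrative name.

```python
# A shortened list of link columns, for illustration only
LINK_FIELDS = ["full_layer_description", "download", "arcgis_feature_layer"]


def to_link_rows(record):
    """Turn one wide record into one row per non-empty link field."""
    return [
        {
            "friendlier_id": record["ID"],
            "reference_type": field,
            "distribution_url": record[field],
            "label": record["Format"],
        }
        for field in LINK_FIELDS
        if record.get(field)  # skip empty or missing links
    ]


sample_record = {
    "ID": "abc123_0",
    "Format": "Shapefile",
    "full_layer_description": "https://example.org/datasets/parcels",
    "download": "https://example.org/parcels.zip",
    "arcgis_feature_layer": "",
}

for link_row in to_link_rows(sample_record):
    print(link_row["friendlier_id"], link_row["reference_type"], link_row["distribution_url"])
```

Here the record yields two rows, because its feature layer field is empty.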
