## Purpose

This script scans the [DCAT 1.1 API](https://resources.data.gov/resources/dcat-us/) of each ArcGIS Hub in a list and writes the metadata for all suitable items to a CSV file that follows the GeoBTAA Metadata Application Profile.

## View full recipe

This Notebook is part of a workflow documented in the [Metadata Handbook Recipes section](https://geobtaa.github.io/metadata/recipes). This is the first recipe, called [ArcGIS](https://geobtaa.github.io/metadata/recipes/R-01_arcgis-hubs/).

## Prepare the list of active ArcGIS Hubs

We maintain a list of active ArcGIS Hub sites in GEOMG. (Access to GEOMG requires a login account. External users can create their own list or use the one provided in this repository.)

1. Filter for items with these parameters:
   - Resource Class: Websites
   - Accrual Method: DCAT US 1.1
   - [Shortcut query](https://geomg.lib.umn.edu/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=DCAT+US+1.1&f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites&rows=20&sort=score+desc)
   
2. Export the filtered list as a CSV, rename the downloaded file to `arcHubs.csv`, and move it into the same directory as this Notebook.


    
Exporting from GEOMG will produce a CSV containing all of the metadata associated with each Hub. For this script, the only fields used are:

* **ID**: Unique code assigned to each portal. This is transferred to the "Code" and "Is Part Of" fields for each dataset.
* **Identifier**: The URL of the Hub's DCAT feed. The script requests this URL to retrieve the Hub's metadata.
* **Title**: The name of the Hub. This is transferred to the "Provider" field for each dataset.
* **Publisher**: The place or administration associated with the portal. This is transferred to the "Title Source" field for each dataset and is intended to be added to the title in brackets.
* **Spatial Coverage**: A list of place names. These are transferred to the "Spatial Coverage" field for each dataset.
* **Member Of**: A larger collection-level record. Most of the Hubs are part of either our [Government Open Geospatial Data Collection](https://geo.btaa.org/catalog/ba5cc745-21c5-4ae9-954b-72dd8db6815a) or the [Research Institutes Geospatial Data Collection](https://geo.btaa.org/catalog/b0153110-e455-4ced-9114-9b13250a7093).
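
As a rough illustration of how these columns are read, the sketch below parses a tiny, made-up export with `csv.DictReader`. The sample row, the `example.com` URL, and the `used_fields` name are assumptions for illustration only; a real GEOMG export contains many more columns. The `Identifier` column is included because the scanning loop later reads it for the feed URL.

```python
import csv
import io

# Hypothetical one-row export; real GEOMG exports contain many more columns.
sample_csv = io.StringIO(
    "ID,Title,Publisher,Spatial Coverage,Member Of,Identifier\n"
    "hub-001,Example County Open Data,Example County,Minnesota,collection-01,"
    "https://example.com/api/feed/dcat-us/1.1.json\n"
)

# Columns the scanning script actually reads from each row
used_fields = ["ID", "Title", "Publisher", "Spatial Coverage", "Member Of", "Identifier"]

hubs = []
for row in csv.DictReader(sample_csv):
    # Keep only the columns the script needs
    hubs.append({field: row[field] for field in used_fields})

print(hubs[0]["Publisher"])
```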


-------------------

## Define the module-level code

This section includes the necessary imports, configuration settings, and function/class definitions that will be used by the rest of the code in the module.

In [10]:
import csv # Provides functionality to read from and write to CSV files.
import decimal # Provides exact decimal arithmetic, used to round bounding box coordinates.
import json # Provides functionality to work with JSON data.
import os # Provides a way of using operating system dependent functionality, like reading or writing the file system.
import re # Provides regular expression matching operations.
import time # Provides functions for working with time, including time conversion, sleep function and timers.
import urllib.request # Provides functions for working with URLs, like opening URLs, reading data from URLs, etc.
from html.parser import HTMLParser # Provides an HTML parsing library that can be used to extract data from HTML docs.
from urllib.parse import urlparse, parse_qs # Provides a way to parse URLs into their components.

import numpy as np # Provides numerical operations and array manipulation tools.
import pandas as pd # Provides data manipulation and analysis functionality.
import requests # Provides HTTP library for sending requests to servers and receiving responses.

**Set up paths and output CSV field names**

In [11]:
directory = "."  # Set to directory containing arcHubs.csv
hubFile = "arcHubs.csv"  # the name of the CSV file with the list of ArcGIS Hubs
fieldnames = [  # DCAT schema fields to be included in report
    "Title",
    "Alternative Title",
    "Description",
    "Language",
    "Creator",
    "Title Source",
    "Resource Class",
    "Resource Type",
    "Keyword",
    "Date Issued",
    "Temporal Coverage",
    "Date Range",
    "Spatial Coverage",
    "Bounding Box",
    "Format",
    "Information",
    "Download",
    "MapServer",
    "FeatureServer",
    "ImageServer",
    "ID",
    "Identifier",
    "Provider",
    "Code",
    "Member Of",
    "Is Part Of",
    "Rights",
    "Accrual Method",
    "Date Accessioned",
    "Access Rights",
]

ActionDate = time.strftime('%Y%m%d') # Generate the current local time with the format like 'YYYYMMDD' and save to the variable named 'ActionDate'

json_ids = {}

**Function to remove HTML tags**

Sometimes, the metadata fields we scrape contain HTML tags, such as links or formatting that do not work in the Geoportal.

In [12]:
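class MLStripper(HTMLParser): # Collects the text content of an HTML string and discards the tags.
    def __init__(self):
        super().__init__(convert_charrefs=True) # Initialize the parser and decode character references such as &amp;
        self.fed = []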

    def handle_data(self, d): 
        self.fed.append(d)

    def get_data(self): # Returns a string of all the data in the list concatenated together.
        return "".join(self.fed)


def strip_tags(html): # Removes HTML tags from a string using MLStripper.
    s = MLStripper()
    s.feed(html)
    return s.get_data()


def cleanData(value): # Calls strip_tags on the input value to remove any HTML tags present.
    return strip_tags(value)
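
A minimal usage sketch of the same idea, assuming a stripped-down parser that behaves like `MLStripper`. The names `TagStripper` and `strip_html` and the sample string are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)


def strip_html(fragment):
    parser = TagStripper()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<p>Parcels for <a href="https://example.com">Example County</a> &amp; vicinity</p>'
print(strip_html(sample))
```

The tags are dropped, the link text is kept, and the `&amp;` character reference is decoded to `&`.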

**Function to generate an output CSV**

This function writes the header row, then iterates over the keys of each Hub's dictionary and writes the corresponding list of values as a row in the CSV.


In [13]:
def printItemReport(report, fields, dictionary):
    with open(report, 'w', newline='', encoding='utf-8') as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(fields)
        for hub in dictionary:
            for keys in hub:
                allvalues = hub[keys]
                csvout.writerow(allvalues)
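
The input this function expects is a list with one dictionary per Hub, where each key is a dataset ID and each value is a list already ordered like the header. The sketch below illustrates that shape with a shortened, hypothetical header and made-up rows, writing to an in-memory buffer instead of a file.

```python
import csv
import io

fields = ["Title", "ID"]  # shortened, hypothetical header

# One dictionary per Hub; each value is a row in the same order as the header
all_records = [
    {"abc123_0": ["Parcels", "abc123_0"], "def456_1": ["Roads", "def456_1"]},
    {"ghi789_2": ["Trails", "ghi789_2"]},
]

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer)
writer.writerow(fields)
for hub in all_records:
    for row in hub.values():
        writer.writerow(row)

print(buffer.getvalue())
```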

**Function to create a dictionary of metadata in the JSONs**

In [14]:
# use the len function to get the number of datasets and the range function to loop through each dataset
        
def getIdentifiers(data):
    json_ids = {}  # Dictionary List
    for x in range(len(data["dataset"])):
        json_ids[x] = data["dataset"][x]["identifier"]
    return json_ids
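
For illustration, the sketch below applies the same position-to-identifier mapping to a made-up feed. The `sample_feed` contents and the `get_identifiers` name are assumptions; only the `dataset` and `identifier` keys come from the code above.

```python
# Hypothetical, heavily shortened DCAT feed
sample_feed = {
    "dataset": [
        {"title": "Parcels", "identifier": "https://www.arcgis.com/home/item.html?id=abc123&sublayer=0"},
        {"title": "Roads", "identifier": "https://www.arcgis.com/home/item.html?id=def456"},
    ]
}


def get_identifiers(feed):
    # Map each dataset's position in the feed to its identifier
    return {position: item["identifier"] for position, item in enumerate(feed["dataset"])}


ids = get_identifiers(sample_feed)
print(ids[1])
```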


**Function to generate the title as: alternativeTitle [place name] {year}**

In [15]:
# The function uses regular expressions to extract the year from the alternative title, and replaces it with an empty string to remove it from the title.

def format_title(alternativeTitle, titleSource):
    # Find whether a year range or a single year exists in alternativeTitle
    year = ''
    try:
        year_range = re.findall(r'(\d{4})-(\d{4})', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        year_range = ''
    try:
        single_year = re.match(r'.*(17\d{2}|18\d{2}|19\d{2}|20\d{2})', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        single_year = ''
    if year_range:   # if a 'yyyy-yyyy' exists
        year = '-'.join(year_range[0])
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
    elif single_year:  # or if a 'yyyy' exists
        year = single_year.group(1)
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
     
    altTitle = str(alternativeTitle)
    title = altTitle + ' [{}]'.format(titleSource)   
    if year:
        title += ' {' + year +'}'       
    return title
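
To make the title pattern concrete, here is a condensed sketch of the same logic with three made-up inputs. The name `format_title_sketch` and the sample titles are hypothetical; the regular expressions are the ones used above.

```python
import re


def format_title_sketch(alternative_title, title_source):
    year = ""
    year_range = re.findall(r"(\d{4})-(\d{4})", alternative_title)
    single_year = re.match(r".*(17\d{2}|18\d{2}|19\d{2}|20\d{2})", alternative_title)
    if year_range:  # a 'yyyy-yyyy' range takes priority
        year = "-".join(year_range[0])
    elif single_year:  # otherwise use the last four-digit year found
        year = single_year.group(1)
    if year:
        # Remove the year from the title text, then tidy trailing spaces and commas
        alternative_title = alternative_title.replace(year, "").strip().rstrip(",")
    title = f"{alternative_title} [{title_source}]"
    if year:
        title += " {" + year + "}"
    return title


print(format_title_sketch("Parcels 2019", "Example County"))
print(format_title_sketch("Land Use, 2010-2015", "Example County"))
print(format_title_sketch("Roads", "Example County"))
```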

**Function to create a dictionary of scanned metadata**

This code defines a function called `metadataNewItems()`, which takes two arguments:

* `newdata` (a dictionary containing the DCAT metadata for the items in a Hub)
* `newitem_ids` (a dictionary that maps each item's position in the DCAT JSON to its identifier, a landing page URL).

The function processes the metadata information for each new item and creates a dictionary containing the formatted metadata.

In [16]:
# This includes blank fields '' for some columns

def metadataNewItems(newdata, newitem_ids):
    newItemDict = {}
    # y = position of the dataset in the DCAT metadata json, v = landing page URLs
    for y, v in newitem_ids.items():
        identifier = v
        metadata = []
        

#ALTERNATIVE TITLE
       
        alternativeTitle = ""
        try:
            alternativeTitle = str(cleanData(newdata["dataset"][y]['title']))
        except:
            alternativeTitle = str(newdata["dataset"][y]['title'])
            
# TITLE
            
        # call the format_title function
#         title = format_title(alternativeTitle, titleSource)
        title = alternativeTitle
            
#DESCRIPTION

        description = cleanData(newdata["dataset"][y]['description'])
        description = description.replace("{{default.description}}", "").replace("{{description}}", "")
        description = re.sub(r'[\n]+|[\r\n]+', ' ', description, flags=re.S)
        description = re.sub(r'\s{2,}', ' ', description)
        description = description.translate({8217: "'", 8220: '"', 8221: '"', 160: "", 183: "", 8226: "", 8211: "-", 8203: ""})


# RESOURCE TYPE

        # if 'LiDAR' exists in Title or Description, add it to Resource Type
        if 'LiDAR' in title or 'LiDAR' in description:
            resourceType = 'LiDAR'
                            
#CREATOR
        creator = newdata["dataset"][y]["publisher"]
        for pub in creator.values():
            try:
                creator = pub.replace(u"\u2019", "'")
            except:
                creator = pub


# DISTRIBUTION

        information = cleanData(newdata["dataset"][y]['landingPage'])

        format_types = []
        resourceClass = ""
        formatElement = ""
        downloadURL = ""
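        # Preserve the 'LiDAR' Resource Type detected above instead of resetting it to blank
        resourceType = "LiDAR" if ("LiDAR" in title or "LiDAR" in description) else ""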
        webService = ""
        featureServer = ""
        mapServer = ""
        imageServer = ""



        distribution = newdata["dataset"][y]["distribution"]
        for dictionary in distribution:
            try:
                # If one of the distributions is a shapefile, change genre/format and get the downloadURL
                format_types.append(dictionary["title"])
                if dictionary["title"] == "Shapefile":
                    resourceClass = "Datasets|Web services"
                    formatElement = "Shapefile"
                    if 'downloadURL' in dictionary.keys():
                        downloadURL = dictionary["downloadURL"].split('?')[0]
                    else:
                        downloadURL = dictionary["accessURL"].split('?')[0]

                    resourceType = "Vector data"

                # If the Rest API is based on an ImageServer, change genre, type, and format to relate to imagery
                if dictionary["title"] == "ArcGIS GeoService":
                    if 'accessURL' in dictionary.keys():
                        webService = dictionary['accessURL']

                        if webService.rsplit('/', 1)[-1] == 'ImageServer':
                            resourceClass = "Imagery|Web services"
                            formatElement = 'Imagery'
                            resourceType = "Satellite imagery"
                    else:
                        resourceClass = ""
                        formatElement = ""
                        downloadURL = ""

            # If the distribution section of the metadata is not structured in a typical way
            except:
                resourceClass = ""
                formatElement = ""
                downloadURL = ""
                continue

        try:
            if "FeatureServer" in webService:
                featureServer = webService
            if "MapServer" in webService:
                mapServer = webService
            if "ImageServer" in webService:
                imageServer = webService
        except:
            print(identifier)



# BOUNDING BOX
        
        # Round each coordinate in the comma-separated "spatial" string to two decimal places
        bbox = ''
        try:
            spatial = cleanData(newdata["dataset"][y]['spatial'])
            twoPlaces = decimal.Decimal("0.01")
            bboxList = [str(decimal.Decimal(coord).quantize(twoPlaces)) for coord in spatial.split(",")]
            bbox = ','.join(bboxList)
        except Exception:
            # Leave the Bounding Box blank if "spatial" is missing or is not a list of numbers
            bbox = ''
            
# KEYWORDS

        # Join the keywords with '|' and remove all spaces
        keyword = newdata["dataset"][y]["keyword"]
        keyword_list = '|'.join(keyword).replace(' ', '')

        
# DATES

        dateIssued = cleanData(newdata["dataset"][y]['issued']).split('T', 1)[0] 
        temporalCoverage = ""
        dateRange = ""

        # auto-generate Temporal Coverage and Date Range
        if re.search(r"\{(.*?)\}", title):     # if title has {YYYY} or {YYYY-YYYY}
            temporalCoverage = re.search(r"\{(.*?)\}", title).group(1)
            dateRange = temporalCoverage[:4] + '-' + temporalCoverage[-4:]
        else:
            temporalCoverage = 'Continually updated resource'
        
#RIGHTS

        rights = cleanData(newdata["dataset"][y]['license']) if 'license' in newdata["dataset"][y] else ""


# IDENTIFIER
        slug = identifier.split('=', 1)[-1].replace("&sublayer=", "_")
        querystring = parse_qs(urlparse(identifier).query)
        identifier_new = "https://hub.arcgis.com/datasets/" + querystring["id"][0]

            
# Define full metadata list

        metadataList = [
            title, 
            alternativeTitle, 
            description, 
            language, 
            creator,
            titleSource,
            resourceClass, 
            resourceType, 
            keyword_list, 
            dateIssued, 
            temporalCoverage,
            dateRange, 
            spatialCoverage, 
            bbox,
            formatElement, 
            information, 
            downloadURL, 
            mapServer, 
            featureServer,
            imageServer, 
            slug, 
            identifier_new, 
            provider, 
            hubCode, 
            memberOf, 
            isPartOf, 
            rights,
            accrualMethod,
            dateAccessioned, 
            accessRights
        ]     

        # Keep only items that have a Resource Class (metadataList[6]); drop the rest
        if resourceClass != "":
            newItemDict[slug] = metadataList
        else:
            newItemDict.pop(slug, None)

    return newItemDict
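
Two of the smaller transformations in this function, building the ID and Identifier from the dataset's identifier URL and rounding the bounding box, can be sketched on their own. The `identifier` and `spatial` values below are made-up samples, and `item_id` and `hub_url` are hypothetical names.

```python
from decimal import Decimal
from urllib.parse import urlparse, parse_qs

# Hypothetical identifier as it might appear in a DCAT feed
identifier = "https://www.arcgis.com/home/item.html?id=abc123&sublayer=4"

# ID: everything after the first '=', with the sublayer joined by an underscore
slug = identifier.split("=", 1)[-1].replace("&sublayer=", "_")

# Identifier: a hub.arcgis.com URL built from the 'id' query parameter
item_id = parse_qs(urlparse(identifier).query)["id"][0]
hub_url = "https://hub.arcgis.com/datasets/" + item_id

# Bounding box: round each coordinate to two decimal places
spatial = "-93.7683,44.4711,-92.7316,45.4145"  # hypothetical west,south,east,north
bbox = ",".join(str(Decimal(c).quantize(Decimal("0.01"))) for c in spatial.split(","))

print(slug, hub_url, bbox)
```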

## Run the executable code

**Declare a list to hold the scanned metadata**

In [17]:
allRecords = []
json_ids = {}

**Scan the metadata for each Hub**

This code reads the list of Hubs from `arcHubs.csv` (the file named in the `hubFile` variable) using `csv.DictReader`. It then iterates over each row and extracts values from specific columns to be used later in the script.

For each row, the script also defines default values for a set of metadata fields. It then requests the URL listed in the CSV file and checks whether the response is JSON. If it is not, the script prints an error message and continues to the next row. Otherwise, it extracts the dataset identifiers from the JSON response and passes the response along with the identifiers to the `metadataNewItems()` function. The resulting dictionary of records for each Hub is then appended to a list called `allRecords`.

In [18]:
with open(os.path.join(directory, hubFile), newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Read in values from arcHubs.csv to be used within the script or as part of the metadata report
        hubCode = row['ID']
        url = row['Identifier']
        provider = row['Title']
        titleSource = row['Publisher']
        spatialCoverage = row['Spatial Coverage']
        isPartOf = row['ID']
        memberOf = row['Member Of']
        
        # Define default values for each record
        accrualMethod = "ArcGIS Hub"
        dateAccessioned = time.strftime('%Y-%m-%d')
        accessRights = "Public"
        language = "eng"

        print("scanning ", hubCode, url)
        
        
        # Check whether the Hub's URL is broken or returns something other than JSON
        try:
            response = urllib.request.urlopen(url)
            if 'application/json' not in response.headers.get('content-type', ''):
                raise ValueError("response is not JSON")
            newdata = json.load(response)
        except (urllib.error.URLError, ValueError):
            print("\n--------------------- Data hub URL does not exist --------------------\n",
                  hubCode, url,  "\n--------------------------------------------------------------------------\n")
            continue


        # Makes a list of dataset identifiers
        newjson_ids = getIdentifiers(newdata)


        allRecords.append(metadataNewItems(newdata, newjson_ids))


scanning  08b-42003 https://openac-alcogis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  04b-24003 https://maps.aacounty.org//api/feed/dcat-us/1.1.json
scanning  10b-55003 https://data-ashlandcountywi.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  11b-39009 https://data-athgis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  04b-24005 https://opendata.baltimorecountymd.gov/api/feed/dcat-us/1.1.json
scanning  05b-27011 https://data-bigstonecounty.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  04b-24009 https://calvert-county-open-data-calvertgis.hub.arcgis.com/api/feed/dcat-us/1.1.json
scanning  10b-55025-01 https://data-carpc.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  04b-24013 https://data-carrollco-md.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  05b-27019 http://data-carver.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  08b-42027 http://gisdata-centrecountygov.opendata.arcgis.com/api/feed/dcat-us/1.1.json
scanning  0
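
The control flow of the scanning loop can be tried offline with a stand-in for the HTTP request. This is only a sketch under stated assumptions: `scan_hubs`, `fake_fetch`, and the sample rows are hypothetical, and the real loop calls `metadataNewItems()` rather than storing the identifiers directly.

```python
import json


def scan_hubs(rows, fetch):
    """Return (records, skipped) for a list of Hub rows.

    rows: dicts with 'ID' and 'Identifier' keys, as read by csv.DictReader
    fetch: hypothetical callable mapping a URL to (content_type, body_text)
    """
    records, skipped = [], []
    for row in rows:
        content_type, body = fetch(row["Identifier"])
        if "application/json" not in content_type:
            skipped.append(row["ID"])  # broken or non-JSON feed: note it and move on
            continue
        feed = json.loads(body)
        identifiers = {i: item["identifier"] for i, item in enumerate(feed["dataset"])}
        records.append({"hub": row["ID"], "identifiers": identifiers})
    return records, skipped


def fake_fetch(url):
    """Stand-in for an HTTP request so the sketch runs without network access."""
    if "broken" in url:
        return "text/html", "<html>Not found</html>"
    body = json.dumps({"dataset": [{"identifier": url + "?id=abc123"}]})
    return "application/json; charset=utf-8", body


rows = [
    {"ID": "hub-001", "Identifier": "https://example.com/feed.json"},
    {"ID": "hub-002", "Identifier": "https://broken.example.com/feed.json"},
]
records, skipped = scan_hubs(rows, fake_fetch)
print(len(records), skipped)
```

Passing the fetcher in as an argument keeps the loop logic separate from the network call, which makes the skip-and-continue behavior easy to check.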

**Write the scanned metadata to a CSV in the GeoBTAA Metadata Profile**

In [19]:
newItemsReport = f"{directory}/{ActionDate}_scannedRecords.csv"
printItemReport(newItemsReport, fieldnames, allRecords)

**Remove Download links with "tif.zip"**

A large number of items in the Download column are causing errors in the Geoportal. These items can be identified by the string "tif.zip". The following code removes Download links for those items.

In [20]:
# Read the dataframe from the file
df_newitems = pd.read_csv(newItemsReport)

# Replace missing values with an empty string
df_newitems['Download'] = df_newitems['Download'].fillna('')

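# Remove Download links that contain the literal string "tif.zip"
# (regex=False so that the "." is not treated as a regular expression wildcard)
df_newitems.loc[df_newitems['Download'].str.contains('tif.zip', na=False, regex=False), 'Download'] = ''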
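
One detail worth knowing here: pandas `Series.str.contains` treats its pattern as a regular expression unless `regex=False` is passed, so the `.` in `tif.zip` matches any character. The sketch below shows the difference with the standard `re` module and made-up URLs, without needing pandas.

```python
import re

# Hypothetical download URLs
downloads = [
    "https://example.com/dem.tif.zip",
    "https://example.com/motif_zipcodes.zip",
    "https://example.com/parcels.zip",
]

# Regular expression match: '.' matches any character, so 'tif_zip' also matches
regex_hits = [url for url in downloads if re.search("tif.zip", url)]

# Literal substring match: only the exact text 'tif.zip' matches
literal_hits = [url for url in downloads if "tif.zip" in url]

print(len(regex_hits), len(literal_hits))
```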

**Drop duplicate items**

ArcGIS Hub administrators can include datasets from other Hubs in their own site. As a result, some datasets are duplicated in other Hubs. However, they always have the same Identifier, so we can use pandas to detect and remove duplicate rows.

In [21]:
# Drop duplicates based on the 'ID' column
df_finalItems = df_newitems.drop_duplicates(subset=['ID'])

# Save the modified dataframe back to the file
df_finalItems.to_csv(newItemsReport, index=False)
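
By default, `drop_duplicates` keeps the first occurrence of each value, so a dataset shared between Hubs stays attributed to the Hub that was scanned first. The pandas-free sketch below reproduces that rule on made-up rows; the sample data and the `unique_rows` name are hypothetical.

```python
import csv
import io

# Hypothetical scanned records; the same dataset appears in two Hubs
sample = io.StringIO(
    "Title,ID,Provider\n"
    "Parcels,abc123_0,Hub A\n"
    "Roads,def456_1,Hub A\n"
    "Parcels,abc123_0,Hub B\n"
)

seen = set()
unique_rows = []
for row in csv.DictReader(sample):
    if row["ID"] not in seen:  # keep only the first row for each ID
        seen.add(row["ID"])
        unique_rows.append(row)

print([(r["ID"], r["Provider"]) for r in unique_rows])
```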