## Purpose

This Notebook scans a list of ArcGIS Hub sites, reads the DCAT-US 1.1 JSON feed of each one, and writes the metadata for all suitable items to a CSV file that follows the GeoBTAA Metadata Application Profile. We can then upload the file to our metadata editor, GEOMG.


### Before you run this Notebook: download the list of currently active portals from GEOMG

1. Filter for items with these parameters:
   - Resource Class: Websites
   - Accrual Method: DCAT US 1.1
   - This link should work: https://geomg.lib.umn.edu/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=DCAT+US+1.1&f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites&rows=20&sort=score+desc
   
2. Rename the downloaded file to `arcPortals.csv` and move it into the same directory as this Notebook.
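
Before starting a harvest, it can help to confirm that the export contains the columns the harvest loop reads. The following is a minimal sketch, not part of the original Notebook: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is taken from the `row[...]` lookups in the harvest cell further down.

```python
import csv

# Columns the harvest loop reads from each row of arcPortals.csv
REQUIRED_COLUMNS = {
    "ID", "Identifier", "Title", "Publisher", "Spatial Coverage", "Member Of",
}


def missing_columns(csv_path):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # first row, or [] for an empty file
    return sorted(REQUIRED_COLUMNS - set(header))
```

Calling `missing_columns("arcPortals.csv")` returns an empty list when every expected column is present.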

In [55]:
import csv
import decimal
import json
import os
import re
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

import numpy as np
import pandas as pd
import requests

In [56]:
# Today's local date as 'YYYYMMDD'; used as the prefix of the output file name
ActionDate = time.strftime('%Y%m%d')

In [57]:
# Define constants and set up paths
directory = "."  # Set to directory containing arcPortals.csv
portalFile = "arcPortals.csv"  # Name of portal list csv file
fieldnames = [  # Column headers of the output report (GeoBTAA metadata fields)
    "Title",
    "Alternative Title",
    "Description",
    "Language",
    "Creator",
    "Resource Class",
    "Resource Type",
    "Keyword",
    "Date Issued",
    "Temporal Coverage",
    "Date Range",
    "Spatial Coverage",
    "Bounding Box",
    "Format",
    "Information",
    "Download",
    "MapServer",
    "FeatureServer",
    "ImageServer",
    "ID",
    "Identifier",
    "Provider",
    "Code",
    "Member Of",
    "Is Part Of",
    "Rights",
    "Accrual Method",
    "Date Accessioned",
    "Access Rights",
]
json_ids = {}

In [58]:
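# Helper that removes HTML tags from text
class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state; character references
        # such as &amp; are converted to text before handle_data is called
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()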


def cleanData(value):
    return strip_tags(value)
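
`cleanData` is later applied to each description, followed by whitespace and punctuation clean-up. The combined effect can be sketched as below. This is an illustration only: `_TextOnly`, `clean_description`, and the sample string are hypothetical, and the translate table is shortened to the curly-quote entries.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_description(raw):
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    text = re.sub(r"[\r\n]+", " ", text)  # line breaks become spaces
    text = re.sub(r"\s{2,}", " ", text)   # collapse runs of whitespace
    # Replace curly quotes with straight ones (subset of the Notebook's table)
    return text.translate({8217: "'", 8220: '"', 8221: '"'})


sample = "<p>County\u2019s road\r\ncenterlines,   updated <b>weekly</b>.</p>"
print(clean_description(sample))  # County's road centerlines, updated weekly.
```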

In [59]:
# Function to generate an output CSV

def printItemReport(report, fields, dictionary):
    with open(report, 'w', newline='', encoding='utf-8') as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(fields)
        for portal in dictionary:
            for keys in portal:
                allvalues = portal[keys]
                csvout.writerow(allvalues)
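
The `dictionary` argument is expected to be a list with one dict per portal, each mapping an item slug to its row of values. A small sketch of that shape, using made-up records, a shortened header, and an in-memory buffer in place of the output file:

```python
import csv
import io

fields = ["Title", "ID"]  # shortened header for illustration
all_records = [           # one dict per portal, keyed by item slug
    {"abc123_0": ["Roads [Example County]", "abc123_0"]},
    {"def456_1": ["Parcels [Sample City] {2021}", "def456_1"]},
]

buffer = io.StringIO()    # stands in for the report file
writer = csv.writer(buffer)
writer.writerow(fields)
for portal in all_records:
    for row in portal.values():
        writer.writerow(row)

print(buffer.getvalue())
```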

In [60]:
# Function to map each dataset's position in the DCAT JSON to its identifier

def getIdentifiers(data):
    json_ids = {}
    for x in range(len(data["dataset"])):
        json_ids[x] = data["dataset"][x]["identifier"]
    return json_ids
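
To make the identifier handling concrete, here is a sketch that builds the same position-to-identifier mapping and then derives the slug and Hub URL the way `metadataNewItems` does later on. The feed below is a hypothetical, heavily trimmed example; the identifier URL pattern is an assumption for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical, heavily trimmed DCAT feed; real feeds carry many more fields
feed = {
    "dataset": [
        {"identifier": "https://www.arcgis.com/home/item.html?id=abc123&sublayer=0"},
        {"identifier": "https://www.arcgis.com/home/item.html?id=def456"},
    ]
}

# Same mapping getIdentifiers builds: list position -> identifier
ids = {pos: item["identifier"] for pos, item in enumerate(feed["dataset"])}

results = []
for pos, identifier in ids.items():
    # Everything after the first '=', with '&sublayer=' turned into '_'
    slug = identifier.split("=", 1)[-1].replace("&sublayer=", "_")
    item_id = parse_qs(urlparse(identifier).query)["id"][0]
    results.append((pos, slug, "https://hub.arcgis.com/datasets/" + item_id))

for entry in results:
    print(entry)
```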

In [61]:
# Function to generate the title as: alternativeTitle [place name] {year}

def format_title(alternativeTitle, titleSource):
    # Look for a year or a year range in alternativeTitle.
    # A non-string title (e.g. None) raises TypeError and is treated as having no year.
    year = ''
    try:
        year_range = re.findall(r'(\d{4})-(\d{4})', alternativeTitle)
    except TypeError:
        year_range = ''
    try:
        single_year = re.match(r'.*(17\d{2}|18\d{2}|19\d{2}|20\d{2})', alternativeTitle)
    except TypeError:
        single_year = ''
    if year_range:   # if a 'yyyy-yyyy' exists
        year = '-'.join(year_range[0])
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
    elif single_year:  # or if a 'yyyy' exists
        year = single_year.group(1)
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
     
    altTitle = str(alternativeTitle)
    title = altTitle + ' [{}]'.format(titleSource)   
    if year:
        title += ' {' + year +'}'       
    return title
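
The two patterns used by `format_title` behave as sketched below. `find_year` is a hypothetical helper written only for this illustration, and the sample titles are invented. Note that the single-year pattern starts with a greedy `.*`, so when a title contains several years it picks the last one.

```python
import re

YEAR_RANGE = re.compile(r"(\d{4})-(\d{4})")
SINGLE_YEAR = re.compile(r".*(17\d{2}|18\d{2}|19\d{2}|20\d{2})")


def find_year(text):
    """Return 'YYYY-YYYY', 'YYYY', or '' using the same patterns as format_title."""
    ranges = YEAR_RANGE.findall(text)
    if ranges:
        return "-".join(ranges[0])
    single = SINGLE_YEAR.match(text)
    return single.group(1) if single else ""


for sample in ["Land Use 2010-2015", "Parcels 2019", "Zoning 2019 and 2021", "Bike Trails"]:
    print(repr(sample), "->", repr(find_year(sample)))
# 'Land Use 2010-2015' -> '2010-2015'
# 'Parcels 2019' -> '2019'
# 'Zoning 2019 and 2021' -> '2021'
# 'Bike Trails' -> ''
```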

In [62]:
# Function to create a dictionary of selected metadata elements
# This includes blank fields '' for some columns

def metadataNewItems(newdata, newitem_ids):
    newItemDict = {}
    # y = position of the dataset in the DCAT metadata JSON, v = the dataset's identifier URL
    for y, v in newitem_ids.items():
        identifier = v
        metadata = []
        

#ALTERNATIVE TITLE
       
        alternativeTitle = ""
        try:
            alternativeTitle = cleanData(newdata["dataset"][y]['title'])
        except:
            alternativeTitle = newdata["dataset"][y]['title']
            
# TITLE
            
        # call the format_title function
        title = format_title(alternativeTitle, titleSource)
            
#DESCRIPTION

        description = cleanData(newdata["dataset"][y]['description'])
        description = description.replace("{{default.description}}", "").replace("{{description}}", "")
        description = re.sub(r'[\r\n]+', ' ', description)
        description = re.sub(r'\s{2,}', ' ', description)
        description = description.translate({8217: "'", 8220: '"', 8221: '"', 160: "", 183: "", 8226: "", 8211: "-", 8203: ""})


#CREATOR
        creator = newdata["dataset"][y]["publisher"]
        for pub in creator.values():
            try:
                creator = pub.replace(u"\u2019", "'")
            except:
                creator = pub


# DISTRIBUTION

        information = cleanData(newdata["dataset"][y]['landingPage'])

        format_types = []
        resourceClass = ""
        formatElement = ""
        downloadURL = ""
        resourceType = ""
        webService = ""
        featureServer = ""
        mapServer = ""
        imageServer = ""



        distribution = newdata["dataset"][y]["distribution"]
        for dictionary in distribution:
            try:
                # If one of the distributions is a shapefile, change genre/format and get the downloadURL
                format_types.append(dictionary["title"])
                if dictionary["title"] == "Shapefile":
                    resourceClass = "Datasets|Web services"
                    formatElement = "Shapefile"
                    if 'downloadURL' in dictionary.keys():
                        downloadURL = dictionary["downloadURL"].split('?')[0]
                    else:
                        downloadURL = dictionary["accessURL"].split('?')[0]

                    resourceType = "Vector data"

                # If the Rest API is based on an ImageServer, change genre, type, and format to relate to imagery
                if dictionary["title"] == "ArcGIS GeoService":
                    if 'accessURL' in dictionary.keys():
                        webService = dictionary['accessURL']

                        if webService.rsplit('/', 1)[-1] == 'ImageServer':
                            resourceClass = "Imagery|Web services"
                            formatElement = 'Imagery'
                            resourceType = "Satellite imagery"
                    else:
                        resourceClass = ""
                        formatElement = ""
                        downloadURL = ""

            # If the distribution section of the metadata is not structured in a typical way
            except:
                resourceClass = ""
                formatElement = ""
                downloadURL = ""
                continue

        try:
            if "FeatureServer" in webService:
                featureServer = webService
            if "MapServer" in webService:
                mapServer = webService
            if "ImageServer" in webService:
                imageServer = webService
        except:
            print(identifier)


# RESOURCE TYPE

        # If 'LiDAR' appears in the Title or Description, use it as the Resource Type.
        # This check runs after the distribution loop because resourceType is
        # initialised and assigned there; running it earlier would have no effect.
        if 'LiDAR' in title or 'LiDAR' in description:
            resourceType = 'LiDAR'



# BOUNDING BOX
                
        try:
            bboxList = []
            bbox = ''
            spatial = cleanData(newdata["dataset"][y]['spatial'])
            typeDmal = decimal.Decimal
            fix4 = typeDmal("0.01")
            for coord in spatial.split(","):
                coordFix = typeDmal(coord).quantize(fix4)
                bboxList.append(str(coordFix))
            bbox = ','.join(bboxList)
        except:
            spatial = ""
            
# KEYWORDS

        keyword = newdata["dataset"][y]["keyword"]
        # Join keywords with '|'; note that this also removes every space, including spaces inside a keyword
        keyword_list = '|'.join(keyword).replace(' ', '')

        
# DATES

        dateIssued = cleanData(newdata["dataset"][y]['issued']).split('T', 1)[0] 
        temporalCoverage = ""
        dateRange = ""

        # auto-generate Temporal Coverage and Date Range
        if re.search(r"\{(.*?)\}", title):     # if title has {YYYY} or {YYYY-YYYY}
            temporalCoverage = re.search(r"\{(.*?)\}", title).group(1)
            dateRange = temporalCoverage[:4] + '-' + temporalCoverage[-4:]
        else:
            temporalCoverage = 'Continually updated resource'
        
#RIGHTS

        rights = cleanData(newdata["dataset"][y]['license']) if 'license' in newdata["dataset"][y] else ""


# IDENTIFIER
        # slug: everything after the first '=' in the identifier, with '&sublayer=' replaced by '_'
        slug = identifier.split('=', 1)[-1].replace("&sublayer=", "_")
        querystring = parse_qs(urlparse(identifier).query)
        identifier_new = "https://hub.arcgis.com/datasets/" + querystring["id"][0]

            
# Define full metadata list

        metadataList = [
            title, 
            alternativeTitle, 
            description, 
            language, 
            creator,
            resourceClass, 
            resourceType, 
            keyword_list, 
            dateIssued, 
            temporalCoverage,
            dateRange, 
            spatialCoverage, 
            bbox,
            formatElement, 
            information, 
            downloadURL, 
            mapServer, 
            featureServer,
            imageServer, 
            slug, 
            identifier_new, 
            provider, 
            portalCode, 
            memberOf, 
            isPartOf, 
            rights,
            accrualMethod,
            dateAccessioned, 
            accessRights
        ]     

        # Keep only items that have a Resource Class; an item without one is dropped
        if resourceClass != "":
            newItemDict[slug] = metadataList
        else:
            newItemDict.pop(slug, None)

    return newItemDict
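
The distribution handling inside `metadataNewItems` decides whether an item is kept at all, since items without a Resource Class are dropped. A simplified, standalone sketch of that decision follows. `classify_distributions` and the sample URLs are hypothetical, and the sketch omits the Notebook's resets for malformed distributions.

```python
def classify_distributions(distributions):
    """Return report values derived from a DCAT distribution list (simplified)."""
    result = {"resourceClass": "", "format": "", "resourceType": "",
              "download": "", "service": ""}
    for dist in distributions:
        title = dist.get("title")
        if title == "Shapefile":
            # Prefer downloadURL, fall back to accessURL; drop any query string
            link = dist.get("downloadURL") or dist.get("accessURL", "")
            result.update({"resourceClass": "Datasets|Web services",
                           "format": "Shapefile",
                           "resourceType": "Vector data",
                           "download": link.split("?")[0]})
        elif title == "ArcGIS GeoService" and "accessURL" in dist:
            result["service"] = dist["accessURL"]
            # A service URL ending in 'ImageServer' is treated as imagery
            if result["service"].rsplit("/", 1)[-1] == "ImageServer":
                result.update({"resourceClass": "Imagery|Web services",
                               "format": "Imagery",
                               "resourceType": "Satellite imagery"})
    return result


sample = [
    {"title": "ArcGIS GeoService",
     "accessURL": "https://example.com/arcgis/rest/services/Roads/FeatureServer/0"},
    {"title": "Shapefile",
     "downloadURL": "https://example.com/datasets/abc123_0.zip?outSR=4326"},
]
print(classify_distributions(sample))
```

An item whose distributions include neither a shapefile nor an ImageServer ends up with an empty Resource Class and would not appear in the report.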

In [63]:
allRecords = []

In [64]:
with open(portalFile, newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Read in values from arcPortals.csv to be used within the script or as part of the metadata report
        portalCode = row['ID']
        url = row['Identifier']
        provider = row['Title']
        titleSource = row['Publisher']
        spatialCoverage = row['Spatial Coverage']
        isPartOf = row['ID']
        memberOf = row['Member Of']
        accrualMethod = "ArcGIS Hub"
        dateAccessioned = time.strftime('%Y-%m-%d')
        accessRights = "Public"
        language = "eng"

        print(portalCode, url)
        
        
        response = urllib.request.urlopen(url)
        # check if data portal URL is broken
        if response.headers['content-type'] != 'application/json; charset=utf-8':
            print("\n--------------------- Data portal URL does not exist --------------------\n",
                  portalCode, url,  "\n--------------------------------------------------------------------------\n")
            continue
        else:
            newdata = json.load(response)


        # Makes a list of dataset identifiers
        newjson_ids = getIdentifiers(newdata)

        allRecords.append(metadataNewItems(newdata, newjson_ids))

08b-42003 https://openac-alcogis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
04b-24003 https://maps.aacounty.org//api/feed/dcat-us/1.1.json
10b-55003 https://data-ashlandcountywi.opendata.arcgis.com/api/feed/dcat-us/1.1.json
11b-39009 https://data-athgis.opendata.arcgis.com/api/feed/dcat-us/1.1.json
04b-24005 https://opendata.baltimorecountymd.gov/api/feed/dcat-us/1.1.json
05b-27011 https://data-bigstonecounty.opendata.arcgis.com/api/feed/dcat-us/1.1.json
04b-24009 https://calvert-county-open-data-calvertgis.hub.arcgis.com/api/feed/dcat-us/1.1.json
10b-55025-01 https://data-carpc.opendata.arcgis.com/api/feed/dcat-us/1.1.json
04b-24013 https://data-carrollco-md.opendata.arcgis.com/api/feed/dcat-us/1.1.json
05b-27019 http://data-carver.opendata.arcgis.com/api/feed/dcat-us/1.1.json
08b-42027 http://gisdata-centrecountygov.opendata.arcgis.com/api/feed/dcat-us/1.1.json
08b-42029 https://chester-county-s-gis-hub-chesco.hub.arcgis.com/api/feed/dcat-us/1.1.json
05b-27023 https://data-chippew

In [65]:
newItemsReport = f"{directory}/{ActionDate}_scannedRecords.csv"
printItemReport(newItemsReport, fieldnames, allRecords)

In [66]:
# reopen the new CSV and drop duplicate items with the same ID

df_newitems = pd.read_csv(newItemsReport)
df_finalItems = df_newitems.drop_duplicates(subset=['ID'])
df_finalItems.to_csv(newItemsReport, index=False)

## Troubleshooting

Hub sites are fairly unstable, and it is likely that one or more of them will fail and interrupt the script. When that happens, check whether the site is down, has moved, or has changed in some other way, and make any updates directly in GEOMG. To track problems, use the Status field in GEOMG, which is plain text and can hold admin notes.

- If a site is missing, unpublish it in GEOMG, fill in the Date Retired field, and make a note in the Status field.
- If a site just isn't working, remove the value "DCAT US 1.1" from the Accrual Method field and make a note in the Status field.

Then edit `arcPortals.csv` (or re-download it) and rerun this Notebook until it completes without errors.
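
To find failing sites in one pass rather than one interruption at a time, each feed URL could be probed before the harvest. The following is a sketch under assumptions, not part of the original workflow: `looks_like_json` and `check_portal` are hypothetical helpers, the 30-second timeout is arbitrary, and the content-type test is deliberately looser than the exact string comparison used in the harvest cell.

```python
import urllib.request


def looks_like_json(content_type):
    """True when a Content-Type value announces JSON, ignoring case and parameters."""
    return (content_type or "").split(";")[0].strip().lower() == "application/json"


def check_portal(url, timeout=30):
    """Return None if the feed answers with JSON, otherwise a short problem description."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content_type = response.headers.get("content-type", "")
    except (OSError, ValueError) as err:
        # URLError, HTTPError, and timeouts are OSError subclasses;
        # a malformed URL raises ValueError
        return f"request failed: {err}"
    if not looks_like_json(content_type):
        return f"unexpected content type: {content_type!r}"
    return None
```

Looping over the `Identifier` column of `arcPortals.csv` and printing every row for which `check_portal` returns a message gives a list of sites to review in GEOMG before rerunning the harvest.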


## How to upload to GEOMG

### Review the previous upload

1. Check the Date Accessioned field of the last harvest and copy it. 


### Upload everything that you just harvested

2. Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.

### Delete items that were retired from ArcGIS Hubs
3. Use the old Date Accessioned value to search for the previous harvest date. This example uses 2023-03-07: (https://geomg.lib.umn.edu/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=ArcGIS+Hub&q=%222023-03-07%22&rows=20&sort=score+desc)
4. Unpublish the items that still have the old date in the Date Accessioned field, and record how many there were in the ticket under Number Deleted.

### Publish items that are new as of the latest harvest
5. Look for records in the uploaded batch that are still "Draft"; these are new records.
6. Publish them and record this number in the GitHub issue ticket under Number Added.