## Purpose

This script will scan a list of Socrata data portals and return the metadata for all suitable items as a CSV file in the GeoBTAA Metadata Application Profile.

## View full recipe

This Notebook is part of a workflow documented in the [Metadata Handbook Recipes section](https://geobtaa.github.io/metadata/recipes). This is the second recipe, called [Socrata](https://geobtaa.github.io/metadata/recipes/R-02_socrata/).

## Prepare the list of active Socrata portals

We maintain a list of active Socrata portals in GBL Admin. (Access to GBL Admin requires a login account. External users can create their own list or use one provided in this repository.)

1. Filter for items with these parameters:
   - Resource Class: Websites
   - Format: Socrata data portal
   - [Shortcut query](https://geo.btaa.org/admin/documents?f%5Bb1g_publication_state_s%5D%5B%5D=published&f%5Bdct_format_s%5D%5B%5D=Socrata+data+portal&f%5Bgbl_resourceClass_sm%5D%5B%5D=Websites&rows=20&sort=score+desc)
   
2. Export the results as a CSV, rename the downloaded file to `socrataPortals.csv`, and move it into the same directory as this Notebook.

    
Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each portal. For this script, the only fields used are:

* **ID**: Unique code assigned to each portal. This is transferred to the "Is Part Of" field for each dataset.
* **Title**: The name of the portal. This is transferred to the "Provider" field for each dataset.
* **Publisher**: The place or administration associated with the portal. This is appended to the title of each dataset in brackets.
* **Spatial Coverage**: A list of place names. These are transferred to the Spatial Coverage for each dataset.
* **Bounding Box**: The Socrata metadata API does not include coordinates, so each dataset inherits the default bounding box for the portal's region.
* **Member Of**: A larger collection-level record. Most of the portals are either part of our [Government Open Geospatial Data Collection](https://geo.btaa.org/catalog/ba5cc745-21c5-4ae9-954b-72dd8db6815a) or the [Research Institutes Geospatial Data Collection](https://geo.btaa.org/catalog/b0153110-e455-4ced-9114-9b13250a7093).
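
As a rough illustration of how these columns feed the scan, the sketch below reads a one-row stand-in for `socrataPortals.csv` from memory. The row values (portal name, bounding box, collection ID, URL) are hypothetical placeholders, and the `Identifier` column is included because the scanning cell later reads the portal's URL from it.

```python
import csv
import io

# Hypothetical one-row export; a real GBL Admin export has many more columns.
sample_export = io.StringIO(
    "ID,Title,Publisher,Spatial Coverage,Bounding Box,Member Of,Identifier\n"
    '01c-01,Example Open Data Portal,City of Example,"Indiana--Example",'
    '"-86.6,39.1,-86.4,39.2",example-collection-id,'
    "https://data.example.gov/data.json\n"
)

portals = []
for row in csv.DictReader(sample_export):
    # Map each portal-level column to the dataset-level field it populates.
    portals.append({
        "isPartOf": row["ID"],
        "provider": row["Title"],
        "titleSource": row["Publisher"],
        "spatialCoverage": row["Spatial Coverage"],
        "bbox": row["Bounding Box"],
        "memberOf": row["Member Of"],
        "url": row["Identifier"],
    })

print(portals[0]["provider"], portals[0]["url"])
```

Every dataset harvested from a portal receives the same portal-level values, which is why they are read once per row rather than once per dataset.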


-------------------

## Define the module-level code

This section includes the necessary imports, configuration settings, and function/class definitions that will be used by the rest of the code in the module.

In [1]:
import csv # Provides functionality to read from and write to CSV files.
import json # Provides functionality to work with JSON data.
import os # Provides a way of using operating system dependent functionality, like reading or writing the file system.
import re # Provides regular expression matching operations.
import time # Provides functions for working with time, including time conversion, sleep function and timers.
import urllib.request # Provides functions for working with URLs, like opening URLs, reading data from URLs, etc.
from html.parser import HTMLParser # Provides an HTML parsing library that can be used to extract data from HTML docs.
from urllib.parse import urlparse, parse_qs # Provides a way to parse URLs into their components.

import numpy as np # Provides numerical operations and array manipulation tools.
import pandas as pd # Provides data manipulation and analysis functionality.
import requests # Provides HTTP library for sending requests to servers and receiving responses.



**Set up paths and output CSV field names**

In [5]:
directory = "."  # Set to directory containing socrataPortals.csv
hubFile = "socrataPortals.csv"  # the name of the CSV file with the list of Socrata portals
fieldnames = [  # GeoBTAA Metadata Application Profile fields written as the CSV header
    "Title",
    "Alternative Title",
    "Description",
    "Language",
    "Creator",
    "Title Source",
    "Resource Class",
    "Resource Type",
    "Date Issued",
    "Temporal Coverage",
    "Date Range",
    "Spatial Coverage",
    "Bounding Box",
    "Format",
    "Information",
    "Download",
    "ID",
    "Identifier",
    "Provider",
    "Code",
    "Member Of",
    "Is Part Of",
    "Rights",
    "Accrual Method",
    "Date Accessioned",
    "Access Rights",
]

ActionDate = time.strftime('%Y%m%d') # Generate the current local time with the format like 'YYYYMMDD' and save to the variable named 'ActionDate'

json_ids = {}

**Function to remove HTML tags**

Sometimes, the metadata fields we scrape contain HTML tags, such as links or formatting that do not work in the Geoportal.

In [6]:
class MLStripper(HTMLParser): 
    def __init__(self):
        super().__init__(convert_charrefs=True)  # Initialize the base parser and decode character references such as &amp;
        self.fed = []  # Text chunks collected between tags

    def handle_data(self, d): 
        self.fed.append(d)

    def get_data(self): # Returns a string of all the data in the list concatenated together.
        return "".join(self.fed)


def strip_tags(html): # Feeds the input through MLStripper and returns only the text content.
    s = MLStripper()
    s.feed(html)
    return s.get_data()


def cleanData(value): # Calls strip_tags on the input value to remove any HTML tags present.
    return strip_tags(value)
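
To show what the stripping step produces, here is a small self-contained sketch of the same technique. The names `TagStripper` and `strip_html` and the sample string are illustrative stand-ins, not part of the notebook.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_html(value):
    parser = TagStripper()
    parser.feed(value)
    parser.close()  # Flush any text still buffered by the parser
    return parser.text()


sample = '<p>Parcel boundaries. See <a href="https://example.org">details</a> &amp; notes.</p>'
cleaned = strip_html(sample)
print(cleaned)  # Parcel boundaries. See details & notes.
```

Because `convert_charrefs` is enabled, entities such as `&amp;` come back as plain characters along with the surrounding text.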

**Function to generate an output CSV**

This function iterates over the keys of each portal's dictionary and writes the corresponding values to the CSV as rows.


In [7]:
def printItemReport(report, fields, dictionary):
    with open(report, 'w', newline='', encoding='utf-8') as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(fields)
        for hub in dictionary:
            for keys in hub:
                allvalues = hub[keys]
                csvout.writerow(allvalues)

**Function to create a dictionary of metadata in the JSONs**

In [8]:
# use the len function to get the number of datasets and the range function to loop through each dataset
        
def getIdentifiers(data):
    json_ids = {}  # Maps each dataset's position in the JSON to its identifier
    for x in range(len(data["dataset"])):
        json_ids[x] = data["dataset"][x]["identifier"]
    return json_ids
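
For orientation, the sketch below applies the same position-to-identifier mapping to a tiny stand-in catalog. The catalog contents are hypothetical; only the `dataset` and `identifier` keys and the `/views/` split mirror what the notebook relies on. `get_identifiers` is an illustrative rewrite using `enumerate`, not the notebook's function.

```python
def get_identifiers(data):
    # Same mapping as above: position in the "dataset" list -> identifier.
    return {position: item["identifier"]
            for position, item in enumerate(data["dataset"])}


# Hypothetical catalog with two datasets.
sample_catalog = {
    "dataset": [
        {"identifier": "https://data.example.gov/api/views/abcd-1234",
         "title": "Bike Routes"},
        {"identifier": "https://data.example.gov/api/views/wxyz-5678",
         "title": "Zoning Districts"},
    ]
}

ids = get_identifiers(sample_catalog)
# The portion after '/views/' is what later becomes the record ID.
slug = ids[0].split('/views/', 1)[-1]
print(ids)
print(slug)
```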


**Function to generate the title as: alternativeTitle [place name] {year}**

In [9]:
# The function uses regular expressions to extract the year from the alternative title, and replaces it with an empty string to remove it from the title.

def format_title(alternativeTitle, titleSource):
    # find if year exist in alternativeTitle
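    year = ''
    try:
        year_range = re.findall(r'(\d{4})-(\d{4})', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        year_range = ''
    try:
        # the greedy '.*' means the last matching four-digit year in the title is captured
        single_year = re.match(r'.*(17\d{2}|18\d{2}|19\d{2}|20\d{2})', alternativeTitle)
    except TypeError:  # alternativeTitle is not a string
        single_year = ''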
    if year_range:   # if a 'yyyy-yyyy' exists
        year = '-'.join(year_range[0])
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
    elif single_year:  # or if a 'yyyy' exists
        year = single_year.group(1)
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
     
    altTitle = str(alternativeTitle)
    title = altTitle + ' [{}]'.format(titleSource)   
    if year:
        title += ' {' + year +'}'       
    return title
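
The two regular expressions behave slightly differently, which the following sketch tries to make visible. The titles are invented examples and the constant names are illustrative; only the patterns come from the function above.

```python
import re

YEAR_RANGE = r'(\d{4})-(\d{4})'
SINGLE_YEAR = r'.*(17\d{2}|18\d{2}|19\d{2}|20\d{2})'

# findall returns one (start, end) tuple per 'yyyy-yyyy' occurrence.
range_hits = re.findall(YEAR_RANGE, "Bike Routes 2015-2019")

# The leading greedy '.*' pushes the capture to the LAST year in the title.
last_year = re.match(SINGLE_YEAR, "Zoning 2010 (revised 2018)").group(1)

# Titles without a year in the 1700-2099 range do not match at all.
no_year = re.match(SINGLE_YEAR, "Street Centerlines")

print(range_hits, last_year, no_year)
```

One consequence worth keeping in mind is that a title containing several years contributes only its final year to the `{year}` suffix.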

**Function to create a dictionary of scanned metadata**

This code defines a function called `metadataNewItems()`, which takes two arguments:

* `newdata` (a dictionary containing metadata information about new items)
* `newitem_ids` (a dictionary that maps each new item's position in the JSON to its identifier URL).

The function processes the metadata information for each new item and creates a dictionary containing the formatted metadata.

In [10]:
# This includes blank fields '' for some columns

def metadataNewItems(newdata, newitem_ids):
    newItemDict = {}
    # y = position of the dataset in the DCAT metadata json, v = dataset identifier URL
    for y, v in newitem_ids.items():
        identifier = v
        metadata = []
        

#ALTERNATIVE TITLE
       
        alternativeTitle = ""
        try:
            alternativeTitle = str(cleanData(newdata["dataset"][y]['title']))
        except:
            alternativeTitle = str(newdata["dataset"][y]['title'])
            
# TITLE
            
        # call the format_title function
        title = format_title(alternativeTitle, titleSource)
#         title = alternativeTitle
            
#DESCRIPTION

        description = cleanData(newdata["dataset"][y]['description'])
        description = description.replace("{{default.description}}", "").replace("{{description}}", "")
        description = re.sub(r'[\n]+|[\r\n]+', ' ', description, flags=re.S)
        description = re.sub(r'\s{2,}', ' ', description)
        description = description.translate({8217: "'", 8220: '"', 8221: '"', 160: "", 183: "", 8226: "", 8211: "-", 8203: ""})


# RESOURCE TYPE

        resourceType = ""

        # if 'LiDAR' exists in Title or Description, set it as the Resource Type
        if 'LiDAR' in title or 'LiDAR' in description:
            resourceType = 'LiDAR'
                            
#CREATOR
        creator = newdata["dataset"][y]["publisher"]
        for pub in creator.values():
            try:
                creator = pub.replace(u"\u2019", "'")
            except AttributeError:  # value is not a string
                creator = pub


# DISTRIBUTION

        information = str(newdata["dataset"][y]['landingPage'])

        formatElement = ""
        downloadURL = ""


        # Only fills metadata for Shapefile downloads
        
        try:
            distribution = newdata["dataset"][y]["distribution"]
            for dictionary in distribution:
                media_type = dictionary["mediaType"]

                if media_type == "application/zip":
                    formatElement = "Shapefile"
                    downloadURL = dictionary["downloadURL"]
                    # Keep a 'LiDAR' Resource Type if one was assigned above
                    if not resourceType:
                        resourceType = "Vector data"

        except (KeyError, TypeError):
            pass  # "distribution", "mediaType", or "downloadURL" is missing for this dataset


        
# DATES

        dateIssued = cleanData(newdata["dataset"][y]['issued']).split('T', 1)[0] 
        temporalCoverage = ""
        dateRange = ""

        # auto-generate Temporal Coverage and Date Range
        if re.search(r"\{(.*?)\}", title):     # if title has {YYYY} or {YYYY-YYYY}
            temporalCoverage = re.search(r"\{(.*?)\}", title).group(1)
            dateRange = temporalCoverage[:4] + '-' + temporalCoverage[-4:]
        else:
            temporalCoverage = 'Continually updated resource'
        
#RIGHTS

        rights = cleanData(newdata["dataset"][y]['license']) if 'license' in newdata["dataset"][y] else ""


# IDENTIFIER
        slug = identifier.split('/views/', 1)[-1]  # the portion after '/views/' becomes the record ID

            
# Define full metadata list

        metadataList = [
            title, 
            alternativeTitle, 
            description, 
            language, 
            creator,
            titleSource,
            resourceClass, 
            resourceType, 
            dateIssued, 
            temporalCoverage,
            dateRange, 
            spatialCoverage, 
            bbox,
            formatElement, 
            information, 
            downloadURL, 
            slug, 
            identifier,
            provider, 
            hubCode, 
            memberOf, 
            isPartOf, 
            rights,
            accrualMethod,
            dateAccessioned, 
            accessRights
        ]     

        # keep only items that have a Format, i.e. a Shapefile download was found
        if formatElement != "":
            newItemDict[slug] = metadataList

    return newItemDict
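
The date handling inside the function can be isolated into a few lines. The helper name `derive_dates` and the sample titles below are illustrative only; the regular expression and slicing are the ones used in the DATES step.

```python
import re


def derive_dates(title):
    """Return (temporalCoverage, dateRange) from a title's {YYYY} or {YYYY-YYYY} suffix."""
    match = re.search(r"\{(.*?)\}", title)
    if match:
        temporal = match.group(1)
        # First four and last four characters give the start and end years.
        return temporal, temporal[:4] + '-' + temporal[-4:]
    return 'Continually updated resource', ''


print(derive_dates("Bike Routes [Chicago] {2015-2019}"))
print(derive_dates("Zoning [Chicago] {2018}"))
print(derive_dates("Parks [Chicago]"))
```

A single year therefore yields a Date Range with the same start and end year, while titles without a year fall back to the "Continually updated resource" label and an empty Date Range.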

## Run the executable code

**Declare a list to hold the scanned metadata**

In [11]:
allRecords = []
json_ids = {}

**Scan the metadata for each Hub**

This code reads data from `socrataPortals.csv` using the `csv.DictReader` function. It then iterates over each row in the file and extracts values from specific columns to be used later in the script.

For each row, the script also defines default values for a set of metadata fields. It then requests the portal's URL from the CSV file and checks that the response is served as JSON. If it is not, the script prints an error message and continues to the next row. Otherwise, it extracts dataset identifiers from the JSON response and passes the response along with the identifiers to a function called `metadataNewItems`. The metadata for each portal is then appended to a list called `allRecords`.

In [12]:
with open(os.path.join(directory, hubFile), newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Read in values from socrataPortals.csv to be used within the script or as part of the metadata report
        hubCode = row['ID']
        url = row['Identifier']
        provider = row['Title']
        titleSource = row['Publisher']
        spatialCoverage = row['Spatial Coverage']
        bbox = row['Bounding Box']
        isPartOf = row['ID']
        memberOf = row['Member Of']
        
        # Define default values for each record
        accrualMethod = "Socrata"
        dateAccessioned = time.strftime('%Y-%m-%d')
        accessRights = "Public"
        language = "eng"
        resourceClass = "Datasets"

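        print("scanning ", hubCode, url)
        
        
        try:
            response = urllib.request.urlopen(url)
        except (urllib.error.URLError, ValueError) as error:
            # Skip portals that cannot be reached instead of stopping the whole scan
            print("\n--------------------- Data hub could not be reached --------------------\n",
                  hubCode, url, error, "\n--------------------------------------------------------------------------\n")
            continue
        # check if the Hub's URL is broken (a working portal returns JSON)
        if response.headers['content-type'] != 'application/json; charset=utf-8':
            print("\n--------------------- Data hub URL does not exist --------------------\n",
                  hubCode, url,  "\n--------------------------------------------------------------------------\n")
            continue
        else:
            newdata = json.load(response)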


        # Makes a list of dataset identifiers
        newjson_ids = getIdentifiers(newdata)


        allRecords.append(metadataNewItems(newdata, newjson_ids))


scanning  01c-01 https://data.bloomington.in.gov/data.json
scanning  02b-17117 https://data.macoupincountyil.gov/data.json
scanning  04b-24027 https://opendata.howardcountymd.gov/data.json
scanning  04b-24033 https://data.princegeorgescountymd.gov/data.json
scanning  11c-01 https://data.cincinnati-oh.gov/data.json
scanning  12b-17031-2 https://datacatalog.cookcountyil.gov/data.json
scanning  12c-01 https://data.cityofchicago.org/data.json
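
The scan above compares the full `content-type` header string, so a portal that returns JSON with a different charset spelling would be skipped. One possible variation, offered only as a sketch and not as the notebook's method, is to compare the media type alone. Header objects returned by `urllib.request.urlopen` inherit `get_content_type()` from `email.message.Message`; the example builds such messages by hand so it runs without network access. `is_json_response` and `make_headers` are hypothetical helpers.

```python
from email.message import Message


def is_json_response(headers):
    # get_content_type() returns the lowercased media type without parameters.
    return headers.get_content_type() == "application/json"


def make_headers(content_type):
    # Stand-in for response.headers, built locally for illustration.
    message = Message()
    message["Content-Type"] = content_type
    return message


print(is_json_response(make_headers("application/json; charset=UTF-8")))
print(is_json_response(make_headers("text/html; charset=utf-8")))
```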


**Write the scanned metadata to a CSV in the GeoBTAA Metadata Profile**

In [13]:
newItemsReport = f"{directory}/{ActionDate}_scannedRecords.csv"
printItemReport(newItemsReport, fieldnames, allRecords)

In [21]:
# Load the scanned records CSV into a DataFrame

df = pd.read_csv(newItemsReport)

# Drop duplicates based on the 'ID' column
df = df.drop_duplicates(subset=['ID'])

In [22]:
# 1. Create the first CSV with all columns except 'Information' and 'Download'
df_first_csv = df.drop(columns=['Information', 'Download'])
df_first_csv.to_csv(f'{ActionDate}_Socrata-metadata.csv', index=False)

# 2. Create the second CSV with 'friendlier_id', 'type', and 'value'
rows = []
for _, row in df.iterrows():
    slug = row['ID']
    
    # Check if 'Information' is not null or empty, then add to rows
    if pd.notna(row['Information']) and row['Information'] != "":
        rows.append({'friendlier_id': slug, 'type': 'Information', 'value': row['Information']})
    
    # Check if 'Download' is not null or empty, then add to rows
    if pd.notna(row['Download']) and row['Download'] != "":
        rows.append({'friendlier_id': slug, 'type': 'Download', 'value': row['Download']})

# Create a DataFrame from the rows and save to CSV
df_second_csv = pd.DataFrame(rows)
df_second_csv.to_csv(f'{ActionDate}_Socrata-links.csv', index=False)

print("CSV files have been created successfully.")


CSV files have been created successfully.


**Drop duplicate items**

Socrata administrators may include datasets from other portals on their own site, so some datasets are duplicated across portals. Duplicates always share the same Identifier, and therefore the same ID derived from it, so we can use pandas to detect and remove duplicate rows.

In [11]:
# Read the dataframe from the file
df_newitems = pd.read_csv(newItemsReport)

# Drop duplicates based on the 'ID' column
df_finalItems = df_newitems.drop_duplicates(subset=['ID'])

# Save the modified dataframe back to the file
df_finalItems.to_csv(newItemsReport, index=False)