## Transform a batch of GBL 1.0 JSON files from UW-Madison

**Purpose: This script reads a batch of GBL-1.0 metadata JSON files and transforms them into a single CSV.**

This version is specific to UW-Madison records.

Metadata records in the [GeoBlacklight](https://opengeometadata.org/docs/gbl-1.0) or [OpenGeoMetadata](https://opengeometadata.org/docs/ogm-aardvark) standards are frequently shared as batches of JSON files. The entire [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) contains repositories full of hundreds of thousands of GeoBlacklight JSONs.

In order to ingest these into the BTAA Geoportal, we need to transform them into a CSV.  


## Part 1: Load the modules and JSON files

### Import python modules

In [508]:
import csv
import json
import os
import pandas as pd
import uuid
import datetime
import re

### Declare the paths and file names

First, move a folder of the JSONs into this directory. Files in the folder can be nested.

In [509]:
json_path = r"edu.wisc" # enter the name of the folder
csv_name = "wiscview" # create a name for the output CSV without the .csv extension

### Load the files into a pandas DataFrame

In [510]:
dataset = [] # empty list

# Walk through all folders, parse each JSON file, and append it to the dataset list
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # Use a context manager so each file is closed after reading
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)

df = pd.DataFrame(dataset) # convert dataset into dataframe
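
As a quick check that nested folders are picked up, the following is a minimal, standard-library-only sketch of the same walk-and-parse loop. The helper name `load_json_records` and the sample file names are hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile


def load_json_records(folder):
    """Recursively collect every parsed .json file under `folder` (hypothetical helper)."""
    records = []
    for path, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            if filename.endswith(".json"):
                with open(os.path.join(path, filename), "rb") as handle:
                    text = handle.read().decode("utf-8", errors="ignore")
                records.append(json.loads(text))
    return records


# Build a throwaway folder with one top-level record, one nested record,
# and one non-JSON file that should be skipped
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested", "deeper"))
    samples = {
        os.path.join(tmp, "a.json"): {"layer_slug_s": "sample-a"},
        os.path.join(tmp, "nested", "deeper", "b.json"): {"layer_slug_s": "sample-b"},
        os.path.join(tmp, "notes.txt"): {"ignored": True},
    }
    for file_path, content in samples.items():
        with open(file_path, "w", encoding="utf-8") as handle:
            json.dump(content, handle)

    records = load_json_records(tmp)

slugs = sorted(record["layer_slug_s"] for record in records)
print(slugs)
```

Only the two `.json` files are loaded, regardless of how deeply they are nested.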

## Part 2: Split multivalued and compound fields

### Split multivalued fields (arrays)

This will remove the punctuation from fields that are formatted as arrays and separate their values with pipes ('|').

In [511]:
# .str.join('|') takes the items of each value, whether list elements or the characters of a string, and joins them with a pipe

df['dc_creator_sm']=df['dc_creator_sm'].str.join('|')
df['dc_subject_sm']=df['dc_subject_sm'].str.join('|')
df['dct_spatial_sm']=df['dct_spatial_sm'].str.join('|')
df['dct_isPartOf_sm']=df['dct_isPartOf_sm'].str.join('|')
df['dct_temporal_sm']=df['dct_temporal_sm'].str.join('|')
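
The join above assumes every value is a list. As a hedged illustration in plain Python, the hypothetical helper `join_multivalued` below shows the intended result and guards against two cases that behave differently: a plain string (which Python's `'|'.join` would split into characters) and a missing value.

```python
def join_multivalued(value, separator="|"):
    """Join a list with pipes; pass strings through; blank out missing values."""
    if isinstance(value, list):
        return separator.join(str(item) for item in value)
    if isinstance(value, str):
        return value  # already a single value, so leave it untouched
    return ""  # None, NaN, or any other non-list, non-string value


print(join_multivalued(["Dane County", "Madison"]))  # Dane County|Madison
print(join_multivalued("Wisconsin"))                 # Wisconsin
print("|".join("Wisconsin"))                         # W|i|s|c|o|n|s|i|n
```

If a field is sometimes stored as a single string rather than an array, a guard like this avoids pipe-separated characters in the output.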


### Split the References into separate columns

This step makes it easier to edit individual links.

In [512]:
import logging

# Configure logging
logging.basicConfig(filename='processing.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def extract_values(row):
#     dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
# #     dct_references_s = json.loads(row['dct_references_s'])
#     return dct_references_s

    try:
        dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
        return dct_references_s
    except json.JSONDecodeError as e:
        # Log the error and the record causing it
        logging.error(f'Error processing record with ID: {row["layer_slug_s"]}, Error: {str(e)}')
        return None

# Apply the function to split the column and expand into separate columns
df = pd.concat([df, df.apply(extract_values, axis=1).apply(pd.Series)], axis=1)

# Rename columns based on keys in the JSON
df = df.rename(columns={
    'http://schema.org/downloadUrl': 'Download',
    'http://schema.org/url': 'Information',
    'http://www.isotc211.org/schemas/2005/gmd/': 'ISO19139',
    'http://www.opengis.net/cat/csw/csdgm': 'FGDC',
    'http://www.w3.org/1999/xhtml': 'HTML',
    'http://lccn.loc.gov/sh85035852': 'Documentation',
    'http://iiif.io/api/image': 'IIIF',
    'http://iiif.io/api/presentation#manifest': 'Manifest',
    'http://www.loc.gov/mods/v3': 'MODS',
    'https://openindexmaps.org': 'Index Map',
    'http://www.opengis.net/def/serviceType/ogc/wms': 'WMS',
    'http://www.opengis.net/def/serviceType/ogc/wfs': 'WFS',
    'urn:x-esri:serviceType:ArcGIS#FeatureLayer': 'FeatureServer',
    'urn:x-esri:serviceType:ArcGIS#TiledMapLayer': 'TileServer',
    'urn:x-esri:serviceType:ArcGIS#DynamicMapLayer': 'MapServer',
    'urn:x-esri:serviceType:ArcGIS#ImageMapLayer': 'ImageServer',
    'http://schema.org/DownloadAction': 'Harvard Download'
    # Add more key-value pairs for renaming columns as needed
})
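
To make the reference-splitting step concrete, here is a small sketch that parses a single `dct_references_s` string with the standard library and relabels its keys. The sample URLs are made up for illustration, and `REFERENCE_LABELS` and `split_references` are hypothetical names; the dictionary is only a subset of the rename table above.

```python
import json

# Hypothetical subset of the rename table used in this notebook
REFERENCE_LABELS = {
    "http://schema.org/downloadUrl": "Download",
    "http://schema.org/url": "Information",
    "http://www.opengis.net/def/serviceType/ogc/wms": "WMS",
}


def split_references(raw):
    """Parse a dct_references_s string into {label: link}; return {} if it is not valid JSON."""
    try:
        references = json.loads(raw.replace('""', '"'))
    except json.JSONDecodeError:
        return {}
    # Keep the original key when no friendly label is known
    return {REFERENCE_LABELS.get(key, key): link for key, link in references.items()}


# Made-up value with doubled quotes, the case the replace() call is meant to repair
raw = '{""http://schema.org/downloadUrl"":""https://example.org/data.zip"",""http://schema.org/url"":""https://example.org/about""}'
links = split_references(raw)
print(links)
```

Each key in the parsed dictionary becomes its own column in the DataFrame, which is why the links can then be edited individually.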

### Reorder coordinates

This will reorder the four bbox coordinates into W,S,E,N, the order that the Klokan Bounding Box tool produces with its CSV export option. The BTAA metadata editor also uses this order when ingesting items. However, Aardvark ultimately uses W,E,N,S, so the coordinates need to be reordered again before converting back to JSON.

In [513]:
# Split solr_geom coordinates and reorder from WENS to WSEN
df[['w','e','n','s']] = df['solr_geom'].str.strip('ENVELOPE()').str.split(',', expand=True)
df['Bounding Box'] = df[['w', 's','e','n']].agg(','.join, axis=1) 
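
The reordering can be checked on a single value. The sketch below uses a hypothetical helper, `envelope_to_bbox`, and a made-up envelope string. It parses the value with a regular expression instead of `str.strip('ENVELOPE()')`, which removes a set of characters rather than a literal prefix (that works in the cell above only because coordinates contain digits, signs, and decimal points).

```python
import re


def envelope_to_bbox(envelope):
    """Convert 'ENVELOPE(W,E,N,S)' to 'W,S,E,N'; return '' when the value does not parse."""
    if not isinstance(envelope, str):
        return ""
    match = re.fullmatch(r"\s*ENVELOPE\((.*)\)\s*", envelope)
    if not match:
        return ""
    parts = [part.strip() for part in match.group(1).split(",")]
    if len(parts) != 4:
        return ""
    west, east, north, south = parts
    return ",".join([west, south, east, north])


print(envelope_to_bbox("ENVELOPE(-92.9, -86.8, 47.1, 42.5)"))  # -92.9,42.5,-86.8,47.1
```

Stripping the whitespace around each coordinate also keeps stray spaces out of the `Bounding Box` value.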


## Part 3: Transform values for fields without a straight crosswalk

In [514]:
#Convert Data Type to Resource Class value
df['Resource Class'] = df['dc_type_s'].apply(lambda x: 'Imagery' if x == 'Image' else 'Datasets')


#Convert Geometry Type to Resource Type value
df['Resource Type'] = df['layer_geom_type_s'].astype(str) + ' data'

# Create the Date Range field
# df['Date Range'] = df.apply(lambda row: f"{row['dct_temporal_sm']}-{row['dct_temporal_sm']}" if pd.notna(row['dct_temporal_sm']) else '', axis=1)

def format_temporal_coverage(row):
    temporal_coverage = row['dct_temporal_sm']
    
    # Check if the value is already a valid date range in yyyy-yyyy format
    if pd.notna(temporal_coverage) and re.match(r'\d{4}-\d{4}', temporal_coverage):
        return temporal_coverage  # Value is already formatted, so no change needed
    
    # Otherwise, duplicate the single value to form a yyyy-yyyy range (empty string if missing)
    return f"{temporal_coverage}-{temporal_coverage}" if pd.notna(temporal_coverage) else ''

# Apply the function to create or update the "Date Range" column
df['Date Range'] = df.apply(format_temporal_coverage, axis=1)
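
A few sample values show what the date-range logic is expected to produce. This sketch works on plain strings rather than DataFrame rows, and `to_date_range` is a hypothetical name. It also uses `re.fullmatch` instead of `re.match`, an assumption on my part that only a complete `yyyy-yyyy` value should count as an existing range.

```python
import re


def to_date_range(value):
    """Keep an existing yyyy-yyyy range, duplicate a single value, blank out missing values."""
    if not isinstance(value, str) or not value:
        return ""
    if re.fullmatch(r"\d{4}-\d{4}", value):
        return value
    return f"{value}-{value}"


for sample in ["2015", "1990-2000", None]:
    print(repr(sample), "->", repr(to_date_range(sample)))
```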

### Check for GeoTIFFs

In [515]:
# Define a function to check if "GeoTIFF" is present in the "dc_format_s" column
def check_geotiff(value):
    if pd.notna(value) and "GeoTIFF" in value:
        return "true"
    else:
        return "false"

# Create the "Georeferenced" column using the check_geotiff function
df["Georeferenced"] = df["dc_format_s"].apply(check_geotiff)


### Concatenate custom field 'uw_supplemental_s' to Description

In [516]:
# Fill empty (NaN) values in the 'uw_supplemental_s' column with empty strings
df['uw_supplemental_s'] = df['uw_supplemental_s'].fillna('')

# Concatenate the 'dc_description_s' and 'uw_supplemental_s' columns with the pipe separator
df['dc_description_s'] = df['dc_description_s'] + '|' + df['uw_supplemental_s']

### Convert values in "dc_subject_sm" and create a new "Theme" column

In [517]:

# Define the conversion mappings from old values to new values
subject_sm_mapping = {
    # Keys must be all lowercase because values are lowercased before lookup
    "farming": "Agriculture",
    "biota": "Biology",
    "boundaries": "Boundaries",
    "climatologymeteorologyatmosphere": "Climate",
    "economy": "Economy",
    "elevation": "Elevation",
    "environment": "Environment",
    "society; climatologymeteorologyatmosphere": "Events",
    "geoscientificinformation": "Geology",
    "health": "Health",
    "imagerybasemapsearthcover": "Imagery",
    "inlandwaters": "Inland Waters",
    "location": "Location",
    "intelligencemilitary": "Military",
    "oceans": "Oceans",
    "planningcadastre": "Property",
    "society": "Society",
    "structure": "Structure",
    "transportation": "Transportation",
    "utilitiescommunication": "Utilities"
    
    # Add more key-value pairs for other conversions as needed
}


# Function to apply the mapping and join the values back together
def convert_and_join(row):
    subject_values = row['dc_subject_sm']
    if pd.notna(subject_values):  # Check for NaN before splitting
        subject_values = subject_values.split('|')
        converted_values = []
        for value in subject_values:
            value_lower = value.lower()
            if value_lower in subject_sm_mapping:
                converted_values.append(subject_sm_mapping[value_lower])
        return '|'.join(converted_values)
    else:
        return ''  # Return an empty string if the value is NaN

# Apply the mapping and create the new "Theme" column
df['Theme'] = df.apply(convert_and_join, axis=1)

# Drop duplicates from the "Theme" column
df['Theme'] = df['Theme'].str.split('|').apply(lambda x: '|'.join(sorted(set(x), key=x.index)))
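
The mapping and de-duplication can be traced on one pipe-delimited value. The following is a minimal sketch in plain Python; `THEME_MAPPING` is a hypothetical three-entry subset of `subject_sm_mapping`, and `subjects_to_themes` is an illustrative name, not part of the notebook.

```python
# Hypothetical subset of subject_sm_mapping, for illustration only
THEME_MAPPING = {
    "farming": "Agriculture",
    "boundaries": "Boundaries",
    "inlandwaters": "Inland Waters",
}


def subjects_to_themes(subjects):
    """Map a pipe-delimited subject string to de-duplicated themes, keeping first-seen order."""
    themes = [
        THEME_MAPPING[subject.strip().lower()]
        for subject in subjects.split("|")
        if subject.strip().lower() in THEME_MAPPING
    ]
    # dict.fromkeys drops repeats while preserving insertion order
    return "|".join(dict.fromkeys(themes))


print(subjects_to_themes("farming|Boundaries|FARMING|Parcels"))  # Agriculture|Boundaries
```

Subjects with no entry in the mapping (here, `Parcels`) are dropped from the theme list rather than copied over.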


## Part 4: Export to a new CSV with Aardvark labels as headers

### Remove unnecessary columns

In [518]:
df = df.drop(columns=[
    'geoblacklight_version',
    'layer_modified_dt', 
#     'thumbnail_path_ss',
    'w','e','n','s', 
    'solr_year_i',
    'layer_geom_type_s',
    'solr_geom',
    'dct_references_s',
    'uw_supplemental_s'
])

### Add some fields with default values

In [519]:
# Get the current date in yyyy-mm-dd format
today_date = datetime.date.today().isoformat()

# Add the "Date Accessioned" column, set to today's date, to the DataFrame
df['Date Accessioned'] = today_date
df['Code'] = "10"
df['Is Part Of'] = "10d-03"
df['Member Of'] = "dc8c18df-7d64-4ff4-a754-d18d0891187d"
df['Accrual Method'] = "GBL-1.0"


### Rename the remaining columns

In [520]:
df = df.rename(columns={
    'dc_title_s': 'Alternative Title', 
    'dc_description_s': 'Description',
    'dc_creator_sm': 'Creator',
    'dct_issued_s': 'Date Issued',
    'dct_issued_dt': 'Date Issued',
    'dc_rights_s' : 'Access Rights',
    'dc_format_s': 'Format',
    'layer_slug_s' : 'ID',
    'layer_id_s' : 'WxS Identifier', 
#     'dc_identifier_s' : 'Identifier',
    'dc_language_s' : 'Language',
    'dct_provenance_s' : 'Provider',
    'dc_publisher_s' : 'Publisher',
    'dc_publisher_sm' : 'Publisher',
    'dc_source_sm' : 'Source',
    'dct_spatial_sm' : 'Spatial Coverage',
    'dct_temporal_sm' : 'Temporal Coverage',
    'dct_isPartOf_sm' : 'Keyword',
    'uw_notice_s' : 'Display Note'
})


In [521]:
df['Identifier'] = "https://geodata.wisc.edu/catalog/" + df['ID']

In [522]:
counties_in_wisconsin = [
    'Adams', 'Ashland', 'Barron', 'Bayfield', 'Brown', 'Buffalo',
    'Burnett', 'Calumet', 'Chippewa', 'Clark', 'Columbia', 'Crawford',
    'Dane', 'Dodge', 'Door', 'Douglas', 'Dunn', 'Eau Claire',
    'Florence', 'Fond du Lac', 'Fond Du Lac', 'Forest', 'Grant', 'Green', 'Green Lake',
    'Iowa', 'Iron', 'Jackson', 'Jefferson', 'Juneau', 'Kenosha',
    'Kewaunee', 'La Crosse', 'Lafayette', 'Langlade', 'Lincoln',
    'Manitowoc', 'Marathon', 'Marinette', 'Marquette', 'Menominee',
    'Milwaukee', 'Monroe', 'Oconto', 'Oneida', 'Outagamie', 'Ozaukee',
    'Pepin', 'Pierce', 'Polk', 'Portage', 'Price', 'Racine',
    'Richland', 'Rock', 'Rusk', 'Sauk', 'Sawyer', 'Shawano',
    'Sheboygan', 'St. Croix', 'Taylor', 'Trempealeau', 'Vernon', 'Vilas',
    'Walworth', 'Washburn', 'Washington', 'Waukesha', 'Waupaca',
    'Waushara', 'Winnebago', 'Wood'
]

cities_in_wisconsin = [
    'Milwaukee', 'Washington', 'Waukesha', 'Appleton', 'Outagamie', 
    'Winnebago', 'Eau Claire', 'Fond du Lac', 'Green Bay', 'Janesville', 
    'Kenosha', 'La Crosse', 'Madison', 'Oshkosh', 'Racine', 'Sheboygan', 
    'Waukesha', 'Wausau', 'Wauwatosa', 'West Allis'
]


def transform_title(alt_title):
    # Function to check if a word is an acronym (three or more capital letters)
    def is_acronym(word):
        return len(word) >= 3 and word.isupper()

    # Split the title into words and apply title casing selectively
    words = alt_title.split()
    alt_title = ' '.join(word if is_acronym(word) else word.title() for word in words)

    # Search for a county name in the title first, then fall back to a city name.
    # count is passed by keyword because passing it positionally is deprecated in recent Python versions.
    for county in counties_in_wisconsin:
        if county in alt_title:
            alt_title = re.sub(f"{county} County, Wi", f"[Wisconsin--{county} County]", alt_title, count=1)
            break
    else:
        for city in cities_in_wisconsin:
            if city in alt_title:
                alt_title = re.sub(f"City Of {city}, Wi", f"[Wisconsin--{city}]", alt_title, count=1)
                break
        else:
            alt_title = re.sub("Wi ", "[Wisconsin] ", alt_title, count=1)

    # Replace the year.
    year = re.findall(r"\b\d{4}\b", alt_title)
    if year:
        alt_title = re.sub(year[0], "{"+year[0]+"}", alt_title)

    # Cleanup phrases post-transformation
    alt_title = re.sub(r",\s*\[", " [", alt_title)
    alt_title = re.sub(r"For \[", "[", alt_title)
    alt_title = re.sub(r"For The \[", "[", alt_title)
    alt_title = re.sub(r"For The City Of \[", "[", alt_title)
#     alt_title = re.sub(r"Plss", "PLSS", alt_title)

    return alt_title

df['Title'] = df['Alternative Title'].apply(transform_title)
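
The bracketing convention is easier to see on a single title. The sketch below is a reduced, hypothetical variant (`bracket_place_and_year`) that handles only counties and the year, and it assumes the title has already been title-cased. Unlike the cell above, it wraps the county pattern in `re.escape` so that the `.` in a name such as `St. Croix` is matched literally, and it braces only the first four-digit number.

```python
import re


def bracket_place_and_year(title, counties):
    """Wrap '<County> County, Wi' in [Wisconsin--...] and the first four-digit year in braces."""
    for county in counties:
        # re.escape keeps the '.' in names such as 'St. Croix' literal
        pattern = re.escape(f"{county} County, Wi")
        if re.search(pattern, title):
            title = re.sub(pattern, f"[Wisconsin--{county} County]", title, count=1)
            break
    # Brace only the first standalone four-digit number
    return re.sub(r"\b(\d{4})\b", r"{\1}", title, count=1)


print(bracket_place_and_year("Parcels St. Croix County, Wi 2020", ["Dane", "St. Croix"]))
# Parcels [Wisconsin--St. Croix County] {2020}
```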


In [523]:
def extract_spatial_coverage(title):
    coverage = re.search(r'\[(.*?)\]', title)
    if coverage:
        coverage = coverage.group(1)
        return coverage if coverage.endswith('Wisconsin') else coverage + "|Wisconsin"
    return "Wisconsin"

# Apply the function to the "Title" column and assign the results to the "Spatial Coverage" column.
df['Spatial Coverage'] = df['Title'].apply(extract_spatial_coverage)


In [524]:
def trim_pipes_and_spaces(value):
    # Remove leading and trailing pipes and spaces from string values only
    if isinstance(value, str):
        return value.strip('| ')
    return value

# Apply the function to the entire DataFrame
df = df.applymap(trim_pipes_and_spaces)

In [525]:
# Define the desired order of columns
desired_order = [
'Title',
'Alternative Title',
'Description',
'Language',
'Display Note',
'Creator',
'Publisher',
'Provider',
'Resource Class',
'Resource Type',
'Theme',
'Subject',
'Keyword',
'Temporal Coverage',
'Date Issued',
'Date Range',
'Spatial Coverage',
'Bounding Box',
'Geometry',
'Member Of',
'Is Part Of',
'Source',
'Format',
'WxS Identifier',
'Georeferenced',
'Documentation',
'Download',
'FeatureServer',
'FGDC',
'Harvard Download',
'HTML',
'IIIF',
'ImageServer',
'Information',
'ISO19139',
'Manifest',
'MapServer',
'MODS',
'oEmbed',
'Index Map',
'TileServer',
'WCS',
'WFS',
'WMS',
'ID',
'Identifier',
'Rights',
'Rights Holder',
'License',
'Access Rights',
'Suppressed',
'Child Record',
'Date Accessioned',
'Code',
'Accrual Method'

# Add more columns as needed in the desired order
]

# Reindex the DataFrame based on the desired order of columns
df = df.reindex(columns=desired_order)

### Check for multiple downloads and create a secondary CSV called "multiple-downloads.csv"

See https://geobtaa.github.io/metadata/recipes/secondary-tables/ for more info.

In [526]:
# Function to check if the value is an array
def is_array_type(value):
    return isinstance(value, list)

# Function to extract the download information and write to "multiple-downloads.csv"
def extract_downloads(row):
    friendlier_id = row["ID"]
    downloads = row["Download"]
    extracted_downloads = []
    if is_array_type(downloads):
        for download in downloads:
            if isinstance(download, dict):
                label = download.get("label", "")
                value = download.get("url", "")
                extracted_downloads.append({"friendlier_id": friendlier_id, "label": label, "value": value})

    return extracted_downloads

# Collect the download rows for every record whose "Download" value is an array
# (a plain loop also works when no record has multiple downloads)
download_list = []
for _, row in df[df["Download"].apply(is_array_type)].iterrows():
    download_list.extend(extract_downloads(row))

# Write the extracted downloads to "multiple-downloads.csv"
with open("multiple-downloads.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["friendlier_id", "label", "value"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(download_list)

# Update the "Download" column in the main DataFrame (df) to remove array-type values
df.loc[df["Download"].apply(is_array_type), "Download"] = ""
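
To show the shape of the secondary table, here is a minimal sketch that flattens one record's downloads and writes them to an in-memory buffer. The record ID, labels, and URLs are made up, `flatten_downloads` is a hypothetical helper, and the `label`/`url` keys are assumed to match the structure read by `extract_downloads` above.

```python
import csv
import io


def flatten_downloads(record_id, downloads):
    """Turn a list of {'label', 'url'} dicts into rows for the secondary table."""
    if not isinstance(downloads, list):
        return []
    return [
        {"friendlier_id": record_id, "label": item.get("label", ""), "value": item.get("url", "")}
        for item in downloads
        if isinstance(item, dict)
    ]


# Made-up record with two download links
rows = flatten_downloads(
    "sample-id",
    [
        {"label": "Shapefile", "url": "https://example.org/data-shp.zip"},
        {"label": "GeoJSON", "url": "https://example.org/data.geojson"},
    ],
)

# Write to an in-memory buffer instead of a file so the sketch has no side effects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["friendlier_id", "label", "value"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each download becomes its own row keyed by the record ID, which is why the `Download` cell in the main CSV is blanked for these records.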



### Write the DataFrame to a CSV file with Aardvark labels
This CSV can be uploaded to GEOMG.

In [None]:
df.to_csv("{}.csv".format(csv_name), index=False, na_rep='')