## Transform a batch of OpenGeoMetadata JSON files

**Purpose: This script reads a batch of GeoBlacklight metadata JSON files in the OGM Aardvark schema and transforms them into a single CSV.**

Metadata records in the [OGM Aardvark](https://opengeometadata.org/docs/aardvark) schema are frequently shared as batches of JSON files. The [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) on GitHub hosts repositories that together contain hundreds of thousands of GeoBlacklight JSON files.

To ingest these records into the BTAA Geoportal, we first need to transform them into a CSV.


## Part 1: Load the modules and JSON files

### Import Python modules

In [1]:
import csv
import json
import os
import pandas as pd



### Declare the paths and file names

First, move the folder of JSON files into this directory. The files can be nested in subfolders, because the loader below walks the folder recursively.

In [2]:
json_path = r"solr" # enter the name of the folder
csv_name = "fixtures" # create a name for the output CSV without the .csv extension

### Load the files into a pandas DataFrame

In [3]:
dataset = [] # empty list

# Walk through all nested folders, parse each JSON file, and append it to the dataset list
for path, _dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # Read as bytes and drop undecodable characters so one bad byte does not stop the batch
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)
            
df = pd.DataFrame(dataset) # convert dataset into dataframe

## Part 2: Split multivalued and compound fields

### Split multivalued fields (arrays)

This step flattens fields that are formatted as arrays: it drops the brackets and quotation marks and joins the values with pipes ('|').

In [4]:
# Function to join array elements into a pipe-separated string
def join_multivalues(val):
    if isinstance(val, list):
        return '|'.join(map(str, val))
    return val

# Apply the function to each column that needs processing
multivalue_columns = [
    'dcat_keyword_sm', 'dcat_theme_sm', 'dct_subject_sm', 'dct_creator_sm',
    'dct_publisher_sm', 'dct_alternative_sm', 'dct_description_sm',
    'dct_language_sm', 'dct_identifier_sm', 'dct_isPartOf_sm', 
    'dct_isReplacedBy_sm', 'dct_isVersionOf_sm', 'dct_relation_sm',
    'dct_replaces_sm', 'dct_source_sm', 'dct_license_sm', 'dct_rights_sm',
    'dct_rightsHolder_sm', 'dct_spatial_sm', 'dct_temporal_sm',
    'gbl_resourceClass_sm', 'gbl_resourceType_sm', 'gbl_displayNote_sm',
    'pcdm_memberOf_sm', 'gbl_indexYear_im'
]

for column in multivalue_columns:
    if column in df.columns:
        df[column] = df[column].apply(join_multivalues)
    else:
        print(f"Column {column} not found in DataFrame.")


Column dct_license_sm not found in DataFrame.
Column dct_rightsHolder_sm not found in DataFrame.
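
To illustrate what the pipe-joining does to a single record, here is a minimal sketch that uses plain dictionaries instead of a DataFrame. The sample record is made up, and `split_multivalues` is a hypothetical helper (not part of this notebook) that shows how the values could be recovered when converting back to JSON.

```python
def join_multivalues(val):
    """Join a list into a pipe-separated string; pass other values through."""
    if isinstance(val, list):
        return "|".join(map(str, val))
    return val


def split_multivalues(val):
    """Hypothetical inverse: split a pipe-separated string back into a list."""
    if isinstance(val, str) and val:
        return val.split("|")
    return []


# Made-up sample record for illustration only
record = {
    "dct_subject_sm": ["Roads", "Transportation"],
    "gbl_indexYear_im": [1990, 1991],
    "dct_title_s": "Road centerlines",
}

flattened = {key: join_multivalues(value) for key, value in record.items()}
print(flattened["dct_subject_sm"])    # Roads|Transportation
print(flattened["gbl_indexYear_im"])  # 1990|1991

# Integers come back as strings, so numeric fields need an explicit cast
years = [int(year) for year in split_multivalues(flattened["gbl_indexYear_im"])]
print(years)  # [1990, 1991]
```

Note that a value that itself contains a pipe character would not survive this round trip, so such values would need to be checked before joining.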


In [5]:
print(df.columns)

Index(['dct_title_s', 'dct_alternative_sm', 'dct_description_sm',
       'dct_language_sm', 'dct_publisher_sm', 'schema_provider_s',
       'gbl_resourceClass_sm', 'gbl_resourceType_sm', 'dct_subject_sm',
       'dcat_theme_sm', 'dcat_keyword_sm', 'dct_temporal_sm', 'dct_issued_s',
       'gbl_indexYear_im', 'gbl_dateRange_drsim', 'dct_spatial_sm',
       'locn_geometry', 'dcat_bbox', 'dcat_centroid', 'dct_accessRights_s',
       'dct_format_s', 'gbl_wxsIdentifier_s', 'dct_references_s', 'id',
       'dct_identifier_sm', 'gbl_mdModified_dt', 'gbl_mdVersion_s',
       'dct_creator_sm', 'pcdm_memberOf_sm', 'dct_isPartOf_sm',
       'dct_source_sm', 'dct_isReplacedBy_sm', 'dct_isVersionOf_sm',
       'dct_relation_sm', 'dct_replaces_sm', 'gbl_georeferenced_b',
       'gbl_suppressed_b', 'dct_rights_sm', 'gbl_fileSize_s',
       'layer_geom_type_s', 'gbl_displayNote_sm'],
      dtype='object')


### Split the References into separate columns

This step parses the JSON string in `dct_references_s` and gives each link type its own column, which makes it easier to edit individual links.

In [6]:
import json
import pandas as pd

def extract_values(row):
    # Check if the value is a string; otherwise, return None or an empty dict
    if isinstance(row['dct_references_s'], str):
        try:
            dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
            return dct_references_s
        except json.JSONDecodeError:
            print(f"JSON decode error in row: {row}")
            return {}
    else:
        return {}

# Apply the function to split the column and expand into separate columns
df_expanded = df.apply(extract_values, axis=1).apply(pd.Series)

# Concatenate the original DataFrame with the expanded DataFrame
df = pd.concat([df, df_expanded], axis=1)

# Rename columns based on keys in the JSON
column_mapping = {
    'http://schema.org/downloadUrl': 'Download',
    'http://schema.org/url': 'Information',
    'http://www.isotc211.org/schemas/2005/gmd/': 'ISO19139',
    'http://www.opengis.net/cat/csw/csdgm': 'FGDC',
    'http://www.w3.org/1999/xhtml': 'HTML',
    'http://lccn.loc.gov/sh85035852': 'Documentation',
    'http://iiif.io/api/image': 'IIIF',
    'http://iiif.io/api/presentation#manifest': 'Manifest',
    'http://www.loc.gov/mods/v3': 'MODS',
    'https://openindexmaps.org': 'Index Map',
    'http://www.opengis.net/def/serviceType/ogc/wms': 'WMS',
    'http://www.opengis.net/def/serviceType/ogc/wfs': 'WFS',
    'http://www.opengis.net/def/serviceType/ogc/wcs': 'WCS',
    'urn:x-esri:serviceType:ArcGIS#FeatureLayer': 'FeatureServer',
    'urn:x-esri:serviceType:ArcGIS#TiledMapLayer': 'TileServer',
    'urn:x-esri:serviceType:ArcGIS#DynamicMapLayer': 'MapServer',
    'urn:x-esri:serviceType:ArcGIS#ImageMapLayer': 'ImageServer',
    'http://schema.org/DownloadAction': 'Harvard Download',
    'https://github.com/cogeotiff/cog-spec': 'COG',
    'https://github.com/protomaps/PMTiles': 'PMTiles',
    'https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames': 'XYZ Tiles',
    'http://schema.org/thumbnailUrl': 'B1G Image',
    'http://www.opengis.net/def/serviceType/ogc/wmts': 'WMTS',
    'https://oembed.com': 'oembed',
    'https://github.com/mapbox/tilejson-spec': 'TileJSON',
    'https://wiki.osgeo.org/wiki/Tile_Map_Service_Specification': 'Tile Map Service'
    }
df.rename(columns=column_mapping, inplace=True)
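
As a rough illustration of the cell above, the following sketch parses one made-up `dct_references_s` value and relabels its keys with a subset of the mapping. The `example.com` URLs are placeholders, and `labels` is a shortened, hypothetical stand-in for `column_mapping`.

```python
import json

# A subset of the key-to-label mapping used above
labels = {
    "http://schema.org/downloadUrl": "Download",
    "http://schema.org/url": "Information",
    "http://www.opengis.net/def/serviceType/ogc/wms": "WMS",
}

# Made-up dct_references_s value; the URLs are placeholders
references = (
    '{"http://schema.org/url": "https://example.com/item/1", '
    '"http://schema.org/downloadUrl": "https://example.com/item/1.zip"}'
)

# Parse the JSON string, then swap each known key for its label;
# unknown keys are kept unchanged
parsed = json.loads(references)
row = {labels.get(key, key): value for key, value in parsed.items()}
print(row)
```

Keys that are missing from a record simply produce no entry, which is why the expanded DataFrame has empty cells for link types a record does not use.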


### Reorder coordinates

This step reorders the four bounding box coordinates from the `ENVELOPE(W,E,N,S)` order stored in `dcat_bbox` into W,S,E,N, the order that the Klokan Bounding Box tool produces with its CSV export option. The BTAA metadata editor also uses this order when ingesting items. Because Aardvark itself uses W,E,N,S, the coordinates need to be reordered again before converting back to JSON.

In [7]:
# Make sure the 'dcat_bbox' column is a string and handle missing or incorrect formats
if 'dcat_bbox' in df.columns:
    # Parse 'ENVELOPE(W,E,N,S)' into four trimmed strings; use Nones for missing or malformed values
    def parse_envelope(x):
        if isinstance(x, str) and 'ENVELOPE(' in x and ')' in x:
            parts = [part.strip() for part in x.strip().strip('ENVELOPE()').split(',')]
            if len(parts) == 4:
                return parts
        return [None, None, None, None]

    df[['w', 'e', 'n', 's']] = df['dcat_bbox'].apply(parse_envelope).tolist()

    # Ensure all elements are strings for the join operation
    df['Bounding Box'] = df[['w', 's', 'e', 'n']].apply(
        lambda row: ','.join(str(item) for item in row if item is not None),
        axis=1
    )
else:
    print("Column 'dcat_bbox' is missing from the DataFrame.")
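
The two coordinate orders are easy to mix up, so here is a minimal standalone sketch of the conversion in both directions. The function names `envelope_to_wsen` and `wsen_to_envelope` are hypothetical helpers for illustration and are not part of this notebook; the coordinates are made up.

```python
def envelope_to_wsen(envelope):
    """Convert 'ENVELOPE(W,E,N,S)' to 'W,S,E,N'; return '' if it cannot be parsed."""
    if not isinstance(envelope, str):
        return ""
    text = envelope.strip()
    if not (text.startswith("ENVELOPE(") and text.endswith(")")):
        return ""
    parts = [part.strip() for part in text[len("ENVELOPE("):-1].split(",")]
    if len(parts) != 4:
        return ""
    w, e, n, s = parts
    return ",".join([w, s, e, n])


def wsen_to_envelope(bbox):
    """Convert 'W,S,E,N' back to the Aardvark 'ENVELOPE(W,E,N,S)' form."""
    parts = [part.strip() for part in bbox.split(",")]
    if len(parts) != 4:
        raise ValueError(f"Expected four coordinates, got: {bbox!r}")
    w, s, e, n = parts
    return f"ENVELOPE({w},{e},{n},{s})"


csv_value = envelope_to_wsen("ENVELOPE(-97.2, -89.5, 49.4, 43.5)")
print(csv_value)                    # -97.2,43.5,-89.5,49.4
print(wsen_to_envelope(csv_value))  # ENVELOPE(-97.2,-89.5,49.4,43.5)
```

The coordinates are kept as strings throughout, so the values are reordered without any rounding or reformatting.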

## Part 3: Export to a new CSV with Aardvark labels as headers

### Rename the remaining columns

In [8]:
# Define the mapping of old field names to new labels
column_mapping = {
    'dcat_keyword_sm': 'Keyword',
    'dcat_theme_sm': 'Theme',
    'dcat_centroid': 'Centroid',
    'dct_subject_sm': 'Subject',
    'dct_creator_sm': 'Creator',
    'dct_publisher_sm': 'Publisher',
    'dct_alternative_sm': 'Alternative Title',
    'dct_description_sm': 'Description',
    'dct_language_sm': 'Language',
    'dct_title_s': 'Title',
    'dct_identifier_sm': 'Identifier',
    'dct_format_s': 'Format',
    'dct_isPartOf_sm': 'Is Part Of',
    'dct_isReplacedBy_sm': 'Is Replaced By',
    'dct_isVersionOf_sm': 'Is Version Of',
    'dct_relation_sm': 'Relation',
    'dct_replaces_sm': 'Replaces',
    'dct_source_sm': 'Source',
    'dct_accessRights_s': 'Access Rights',
    'dct_license_sm': 'License',
    'dct_rights_sm': 'Rights',
    'dct_rightsHolder_sm': 'Rights Holder',
    'dct_spatial_sm': 'Spatial Coverage',
    'dct_issued_s': 'Date Issued',
    'dct_temporal_sm': 'Temporal Coverage',
    'gbl_mdVersion_s': 'Metadata Version',
    'gbl_mdModified_dt': 'Modified',
    'gbl_suppressed_b': 'Suppressed',
    'gbl_resourceClass_sm': 'Resource Class',
    'gbl_resourceType_sm': 'Resource Type',
    'gbl_displayNote_sm': 'Display Note',
    'id': 'ID',
    'gbl_wxsIdentifier_s': 'WxS Identifier',
    'gbl_fileSize_s': 'File Size',
    'gbl_georeferenced_b': 'Georeferenced',
    'gbl_dateRange_drsim': 'Date Range',
    'gbl_indexYear_im': 'Index Year',
    'locn_geometry': 'Geometry',
    'pcdm_memberOf_sm': 'Member Of',
    'schema_provider_s': 'Provider'
}

# Rename the columns
df.rename(columns=column_mapping, inplace=True)

### Check for multiple downloads and create a secondary CSV called "multiple-downloads.csv"

See https://geobtaa.github.io/metadata/recipes/secondary-tables/ for more info.

In [12]:
# Function to check if the value is an array
def is_array_type(value):
    return isinstance(value, list)

# Function to extract the download information and write to "multiple-downloads.csv"
def extract_downloads(row):
    friendlier_id = row["ID"]
    downloads = row["Download"]
    extracted_downloads = []
    if is_array_type(downloads):
        for download in downloads:
            if isinstance(download, dict):
                label = download.get("label", "")
                value = download.get("url", "")
                extracted_downloads.append({"friendlier_id": friendlier_id, "label": label, "value": value})

    return extracted_downloads

# Collect the download entries from every row where "Download" is an array.
# A plain loop avoids errors when no row (or no "Download" column) qualifies.
download_list = []
if "Download" in df.columns:
    for _, row in df[df["Download"].apply(is_array_type)].iterrows():
        download_list.extend(extract_downloads(row))

# Write the extracted downloads to "multiple-downloads.csv"
with open("multiple-downloads.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["friendlier_id", "label", "value"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(download_list)

# Replace array-type values in the "Download" column with the URL of the first entry
def first_download_url(value):
    if is_array_type(value):
        if value and isinstance(value[0], dict):
            return value[0].get("url", "")
        return ""
    return value

if "Download" in df.columns:
    df["Download"] = df["Download"].apply(first_download_url)
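
To show the shape of the secondary table, here is a small sketch that runs the same flattening on two made-up records and writes the result to an in-memory buffer instead of a file. The IDs, labels, and `example.com` URLs are placeholders, and the list of `label`/`url` objects is assumed to match the structure the cell above expects.

```python
import csv
import io

# Made-up records: one with multiple downloads, one with a single URL
records = [
    {"ID": "item-1", "Download": [
        {"label": "Shapefile", "url": "https://example.com/item-1.zip"},
        {"label": "GeoJSON", "url": "https://example.com/item-1.geojson"},
    ]},
    {"ID": "item-2", "Download": "https://example.com/item-2.zip"},
]

# Only array-type downloads produce rows in the secondary table
rows = []
for record in records:
    downloads = record["Download"]
    if isinstance(downloads, list):
        for download in downloads:
            if isinstance(download, dict):
                rows.append({
                    "friendlier_id": record["ID"],
                    "label": download.get("label", ""),
                    "value": download.get("url", ""),
                })

# Write to an in-memory buffer so the example leaves no file behind
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["friendlier_id", "label", "value"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Each download becomes one row keyed by the record's ID, so a record with two downloads contributes two rows and a record with a single URL contributes none.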



### Write the DataFrame to a CSV file with Aardvark labels

The resulting CSV can be uploaded to GBL Admin.

In [13]:
df.to_csv("{}.csv".format(csv_name), index=False, na_rep='')