## Transform a batch of OpenGeoMetadata JSON files

**Purpose: This script will read a batch of GeoBlacklight metadata JSON files and transform them into a single CSV.**

Metadata records in the [GeoBlacklight](https://opengeometadata.org/docs/gbl-1.0) or [OpenGeoMetadata](https://opengeometadata.org/docs/ogm-aardvark) standards are frequently shared as batches of JSON files. The [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) hosts repositories containing hundreds of thousands of GeoBlacklight JSON files.

To ingest these records into the BTAA Geoportal, we need to transform them into a CSV.


## Part 1: Load the modules and JSON files

### Import python modules

In [633]:
import csv
import json
import os
import pandas as pd
import uuid
import datetime

### Declare the paths and file names

First, move a folder of the JSONs into this directory. Files in the folder can be nested.

In [634]:
json_path = r"wi" # enter the name of the folder
csv_name = "new-out" # create a name for the output CSV without the .csv extension

### Load the files into a pandas DataFrame

In [635]:
dataset = [] # empty list

# Walk through all files (including nested folders), parse each JSON, and append it to the dataset list
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # Use a context manager so each file is closed after it is read
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)
            
df = pd.DataFrame(dataset) # convert dataset into dataframe
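
To see what the loading step produces, here is a minimal, self-contained sketch that builds a throwaway nested folder and loads it the same way. The helper name `load_json_records`, the file names, and the sample records are made up for illustration; they are not part of the notebook.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_records(folder):
    """Walk `folder` recursively and return one dict per .json file."""
    records = []
    for path, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            if filename.endswith(".json"):
                with open(os.path.join(path, filename), encoding="utf-8") as f:
                    records.append(json.load(f))
    return records


# Build a small nested folder holding two hypothetical records
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    samples = {
        "a.json": {"layer_slug_s": "demo-1", "dc_title_s": "First"},
        os.path.join("nested", "b.json"): {
            "layer_slug_s": "demo-2",
            "dc_subject_sm": ["Roads"],
        },
    }
    for rel_path, record in samples.items():
        with open(os.path.join(tmp, rel_path), "w", encoding="utf-8") as f:
            json.dump(record, f)
    demo_df = pd.DataFrame(load_json_records(tmp))

print(demo_df)
```

Each JSON key becomes a column, and a key that is absent from a record shows up as NaN in that row, which is why later steps check for missing values.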

## Part 2: Split multivalued and compound fields

### Split multivalued fields (arrays)

This joins the values of fields that are formatted as arrays into a single string separated by pipes ('|'), which removes the brackets and quotes.

In [636]:
# .str.join('|') joins the items of each list with a pipe
# Note: a value stored as a plain string is joined character by character

df['dc_creator_sm']=df['dc_creator_sm'].str.join('|')
df['dc_subject_sm']=df['dc_subject_sm'].str.join('|')
df['dct_spatial_sm']=df['dct_spatial_sm'].str.join('|')
df['dct_isPartOf_sm']=df['dct_isPartOf_sm'].str.join('|')
df['dct_temporal_sm']=df['dct_temporal_sm'].str.join('|')
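
The behavior of `.str.join` is easy to check on a toy Series. The sketch below uses made-up values and also shows a guarded alternative, `join_if_list` (a hypothetical helper, not part of the notebook), for fields that are sometimes stored as a plain string instead of an array.

```python
import pandas as pd

# Lists are joined with the separator; missing values stay missing
demo = pd.Series([["Wisconsin", "Dane County"], ["Minnesota"], None], dtype=object)
joined = demo.str.join("|")
print(joined.tolist())

# A plain string is treated as a sequence of characters
as_string = pd.Series(["1990"], dtype=object).str.join("|")
print(as_string.tolist())


def join_if_list(value, sep="|"):
    """Join lists with `sep`; return any other value unchanged."""
    return sep.join(value) if isinstance(value, list) else value


guarded = pd.Series([["1990", "1995"], "1990"], dtype=object).apply(join_if_list)
print(guarded.tolist())
```

The guarded version is only worth using if a batch mixes arrays and single strings in the same field.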


### Split the References into separate columns

This step makes it easier to edit individual links.

In [637]:
def extract_values(row):
    dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
    return dct_references_s

# Apply the function to split the column and expand into separate columns
df = pd.concat([df, df.apply(extract_values, axis=1).apply(pd.Series)], axis=1)

# Rename columns based on keys in the JSON
df = df.rename(columns={
    'http://schema.org/downloadUrl': 'Download',
    'http://schema.org/url': 'Information',
    'http://www.isotc211.org/schemas/2005/gmd/': 'ISO19139',
    'http://www.opengis.net/cat/csw/csdgm': 'FGDC',
    'http://www.w3.org/1999/xhtml': 'HTML',
    'http://lccn.loc.gov/sh85035852': 'Documentation',
    'http://iiif.io/api/image': 'IIIF',
    'http://iiif.io/api/presentation#manifest': 'Manifest',
    'http://www.loc.gov/mods/v3': 'MODS',
    'https://openindexmaps.org': 'Index Map',
    'http://www.opengis.net/def/serviceType/ogc/wms': 'WMS',
    'http://www.opengis.net/def/serviceType/ogc/wfs': 'WFS',
    'urn:x-esri:serviceType:ArcGIS#FeatureLayer': 'FeatureServer',
    'urn:x-esri:serviceType:ArcGIS#TiledMapLayer': 'TileServer',
    'urn:x-esri:serviceType:ArcGIS#DynamicMapLayer': 'MapServer',
    'urn:x-esri:serviceType:ArcGIS#ImageMapLayer': 'ImageServer',
    'http://schema.org/DownloadAction': 'Harvard Download'
    # Add more key-value pairs for renaming columns as needed
})
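
As an illustration of what this step does, the following sketch parses two made-up `dct_references_s` strings and expands them into link columns. The URLs and IDs are placeholders, and only two of the reference keys from the rename table are used.

```python
import json

import pandas as pd

refs_df = pd.DataFrame({
    "layer_slug_s": ["demo-1", "demo-2"],
    "dct_references_s": [
        '{"http://schema.org/url": "https://example.com/a", '
        '"http://schema.org/downloadUrl": "https://example.com/a.zip"}',
        '{"http://schema.org/url": "https://example.com/b"}',
    ],
})

# Parse each JSON string into a dict, then expand the dicts into columns
parsed = refs_df["dct_references_s"].apply(json.loads)
links = pd.DataFrame(parsed.tolist(), index=refs_df.index)

# Give the reference URIs friendlier column names
links = links.rename(columns={
    "http://schema.org/downloadUrl": "Download",
    "http://schema.org/url": "Information",
})

result = pd.concat([refs_df, links], axis=1)
print(result[["layer_slug_s", "Information", "Download"]])
```

A record that lacks a given reference type gets NaN in that column, so not every row will have every link.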

### Reorder coordinates

This will reorder the four bounding box coordinates into W,S,E,N, the order that the Klokan Bounding Box tool produces with its CSV export option. The BTAA metadata editor uses this order as well when ingesting items. However, Aardvark ultimately uses W,E,N,S, so the coordinates would need to be reordered again before converting back to JSON.

In [638]:
# Split solr_geom coordinates and reorder from WENS to WSEN
df[['w','e','n','s']] = df['solr_geom'].str.strip('ENVELOPE()').str.split(',', expand=True)
df['Bounding Box'] = df[['w', 's','e','n']].agg(','.join, axis=1) 
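
The reordering can also be written as a pair of plain functions, which makes the round trip explicit. This is a sketch under the assumption that `solr_geom` always has the form `ENVELOPE(W,E,N,S)`; the function names and coordinates are made up for illustration.

```python
def envelope_to_wsen(solr_geom):
    """Convert 'ENVELOPE(W,E,N,S)' to a 'W,S,E,N' string."""
    text = solr_geom.strip()
    if not (text.startswith("ENVELOPE(") and text.endswith(")")):
        raise ValueError(f"Not an ENVELOPE string: {solr_geom!r}")
    parts = [part.strip() for part in text[len("ENVELOPE("):-1].split(",")]
    if len(parts) != 4:
        raise ValueError(f"Expected 4 coordinates, got {len(parts)}")
    w, e, n, s = parts
    return ",".join([w, s, e, n])


def wsen_to_envelope(bbox):
    """Convert a 'W,S,E,N' string back to 'ENVELOPE(W,E,N,S)'."""
    w, s, e, n = [part.strip() for part in bbox.split(",")]
    return f"ENVELOPE({w},{e},{n},{s})"


bbox = envelope_to_wsen("ENVELOPE(-92.9, -86.8, 47.1, 42.5)")
print(bbox)
print(wsen_to_envelope(bbox))
```

Unlike `str.strip('ENVELOPE()')`, which removes any of those characters from both ends of the string, this version checks the prefix explicitly and raises an error on malformed input.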


## Part 3: Transform values for fields without a straight crosswalk

In [639]:
# Convert Data Type to Resource Class value
df['Resource Class'] = df['dc_type_s'].apply(lambda x: 'Datasets' if x == 'Dataset' else '')

# Convert Geometry Type to Resource Type value (left empty when the geometry type is missing)
df['Resource Type'] = df['layer_geom_type_s'].apply(lambda x: f'{x} data' if pd.notna(x) else '')

# Create the Date Range field by using the Temporal Coverage value as both start and end
df['Date Range'] = df.apply(lambda row: f"{row['dct_temporal_sm']}-{row['dct_temporal_sm']}" if pd.notna(row['dct_temporal_sm']) else '', axis=1)

### Check for GeoTIFFs

In [640]:
# Define a function to check if "GeoTIFF" is present in the "dc_format_s" column
def check_geotiff(value):
    if pd.notna(value) and "GeoTIFF" in value:
        return "true"
    else:
        return "false"

# Create the "Georeferenced" column using the check_geotiff function
df["Georeferenced"] = df["dc_format_s"].apply(check_geotiff)
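
The same check can be expressed without a helper function by using pandas string methods. This is an equivalent sketch on made-up format values, not a required change.

```python
import pandas as pd

formats = pd.Series(["GeoTIFF", "Shapefile", None, "Paper Map"], dtype=object)

# Literal substring match; missing values count as "not a GeoTIFF"
is_geotiff = formats.str.contains("GeoTIFF", regex=False, na=False)
georeferenced = is_geotiff.map({True: "true", False: "false"})

print(georeferenced.tolist())
```

Note that either version only flags GeoTIFFs; other georeferenced formats are marked "false" and may need manual review.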


### Extract Is Part Of strings and create new Collection level records 

In [641]:
processed_collections = {}  # maps each collection title to its generated ID
new_rows = []  # collection-level records, appended to df after the loop

for index, row in df.iterrows():
    dct_isPartOf_sm = row['dct_isPartOf_sm']
    # Normalize the value to a list of titles; NaN or missing values yield no titles
    if isinstance(dct_isPartOf_sm, str):
        titles = [dct_isPartOf_sm]
    elif isinstance(dct_isPartOf_sm, list):
        titles = dct_isPartOf_sm
    else:
        titles = []
    for title in titles:
        title = title.strip()
        # Skip empty titles and collections that already have a record
        if title and title not in processed_collections:
            new_id = str(uuid.uuid4())  # Generating a new UUID as the ID for the collection
            new_rows.append({
                'dc_title_s': title,
                'layer_slug_s': new_id,
                'Resource Class': 'Collections',
                'dc_rights_s': 'Public'
            })
            processed_collections[title] = new_id

# Append the new rows to the DataFrame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)


### Convert values in "dc_subject_sm" and create a new "Theme" column

In [642]:

# Define the conversion mappings from old values to new values
subject_sm_mapping = {
    "farming": "Agriculture",
    "biota": "Biology",
    "boundaries": "Boundaries",
    "climatologymeteorologyatmosphere": "Climate",
    "economy": "Economy",
    "elevation": "Elevation",
    "environment": "Environment",
    "society; climatologyMeteorologyAtmosphere": "Events",
    "geoscientificinformation": "Geology",
    "health": "Health",
    "imagerybasemapsearthcover": "Imagery",
    "inlandwaters": "Inland Waters",
    "location": "Location",
    "intelligencemilitary": "Military",
    "oceans": "Oceans",
    "planningcadastre": "Property",
    "society": "Society",
    "structure": "Structure",
    "transportation": "Transportation",
    "utilitiescommunication": "Utilities"
    
    # Add more key-value pairs for other conversions as needed
}


# Function to apply the mapping and join the values back together
def convert_and_join(row):
    subject_values = row['dc_subject_sm']
    if pd.notna(subject_values):  # Check for NaN before splitting
        subject_values = subject_values.split('|')
        converted_values = []
        for value in subject_values:
            value_lower = value.lower()
            if value_lower in subject_sm_mapping:
                converted_values.append(subject_sm_mapping[value_lower])
        return '|'.join(converted_values)
    else:
        return ''  # Return an empty string if the value is NaN

# Apply the mapping and create the new "Theme" column
df['Theme'] = df.apply(convert_and_join, axis=1)

# Drop duplicates from the "Theme" column
df['Theme'] = df['Theme'].str.split('|').apply(lambda x: '|'.join(sorted(set(x), key=x.index)))
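
The mapping and de-duplication can be tried on a single pipe-delimited string. The sketch below uses a three-entry subset of the mapping and a made-up subject string; `subjects_to_themes` is a hypothetical helper that combines both steps and uses `dict.fromkeys` to drop duplicates while keeping the original order.

```python
demo_mapping = {
    "farming": "Agriculture",
    "inlandwaters": "Inland Waters",
    "oceans": "Oceans",
}


def subjects_to_themes(subjects, mapping):
    """Map pipe-delimited subjects to themes, dropping unmapped values and duplicates."""
    themes = [
        mapping[subject.lower()]
        for subject in subjects.split("|")
        if subject.lower() in mapping
    ]
    # dict.fromkeys keeps the first occurrence of each theme, in order
    return "|".join(dict.fromkeys(themes))


themes = subjects_to_themes("Farming|inlandWaters|farming|Lakes", demo_mapping)
print(themes)
```

Subjects that are not in the mapping, such as "Lakes" here, are left out of the Theme value but remain in the Subject column.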


## Part 4: Export to a new CSV with Aardvark labels as headers

### Remove unnecessary columns

In [643]:
df = df.drop(columns=[
    'geoblacklight_version',
    'layer_modified_dt', 
#     'thumbnail_path_ss',
    'w','e','n','s', 
    'solr_year_i',
    'layer_geom_type_s',
    'solr_geom',
    'dct_references_s'
])

### Rename the remaining columns

In [644]:
df = df.rename(columns={
    'dc_title_s': 'Title', 
    'dc_description_s': 'Description',
    'dc_creator_sm': 'Creator',
    'dct_issued_s': 'Date Issued',
    'dct_issued_dt': 'Date Issued',
    'dc_rights_s' : 'Access Rights',
    'dc_format_s': 'Format',
    'layer_slug_s' : 'ID',
    'layer_id_s' : 'WxS Identifier', 
    'dc_identifier_s' : 'Identifier',
    'dc_language_s' : 'Language',
    'dct_provenance_s' : 'Provider',
    'dc_publisher_s' : 'Publisher',
    'dc_publisher_sm' : 'Publisher',
    'dc_source_sm' : 'Source',
    'dct_spatial_sm' : 'Spatial Coverage',
    'dc_subject_sm' : 'Subject',
    'dct_temporal_sm' : 'Temporal Coverage',
    'dct_isPartOf_sm' : 'Is Part Of'
})


In [645]:
# Define the desired order of columns
desired_order = [
'Title',
'Alternative Title',
'Description',
'Language',
'Creator',
'Publisher',
'Provider',
'Resource Class',
'Resource Type',
'Subject',
'Theme',
'Keyword',
'Temporal Coverage',
'Date Issued',
'Date Range',
'Spatial Coverage',
'Bounding Box',
'Geometry',
'Member Of',
'Source',
'Format',
'WxS Identifier',
'Georeferenced',
'Documentation',
'Download',
'FeatureServer',
'FGDC',
'Harvard Download',
'HTML',
'IIIF',
'ImageServer',
'Information',
'ISO19139',
'Manifest',
'MapServer',
'MODS',
'oEmbed',
'Index Map',
'TileServer',
'WCS',
'WFS',
'WMS',
'ID',
'Identifier',
'Rights',
'Rights Holder',
'License',
'Access Rights',
'Suppressed',
'Child Record',

# Add more columns as needed in the desired order
]

# Reindex the DataFrame based on the desired order of columns
df = df.reindex(columns=desired_order)
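
It is worth knowing that `reindex` does more than reorder. The short sketch below, using made-up column names and values, shows that columns missing from the DataFrame are created empty and columns missing from the list are dropped.

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": ["demo-1"],
    "Title": ["First"],
    "extra_field": ["not in the list"],
})

ordered = demo.reindex(columns=["Title", "Alternative Title", "ID"])
print(ordered)
```

Any source field that should survive the export therefore needs to appear in `desired_order` under its renamed label.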

### Check for multiple downloads and create a secondary CSV called "multiple-downloads.csv"

See https://geobtaa.github.io/metadata/recipes/secondary-tables/ for more info.

In [646]:
import json
import csv

# Function to check if the value is an array
def is_array_type(value):
    return isinstance(value, list)

# Function to extract the download information and write to "multiple-downloads.csv"
def extract_downloads(row):
    friendlier_id = row["ID"]
    downloads = row["Download"]
    extracted_downloads = []
    if is_array_type(downloads):
        for download in downloads:
            if isinstance(download, dict):
                label = download.get("label", "")
                value = download.get("url", "")
                extracted_downloads.append({"friendlier_id": friendlier_id, "label": label, "value": value})

    return extracted_downloads

# Apply the function to every row; rows where "Download" is not an array contribute nothing
download_list = [download for _, row in df.iterrows() for download in extract_downloads(row)]

# Write the extracted downloads to "multiple-downloads.csv"
with open("multiple-downloads.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["friendlier_id", "label", "value"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(download_list)

# Update the "Download" column in the main DataFrame (df) to remove array-type values
df["Download"] = df["Download"].apply(lambda x: x[0]["url"] if is_array_type(x) else x)
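
To illustrate the shape of the secondary table, here is a small sketch that flattens made-up records and writes the rows to an in-memory CSV. It assumes, as the cell above does, that a multi-download value is a list of dicts with `label` and `url` keys; the helper name `flatten_downloads` and the sample URLs are hypothetical.

```python
import csv
import io

# Made-up records: one with several downloads, one with a single URL
records = [
    {"ID": "demo-1", "Download": [
        {"label": "Shapefile", "url": "https://example.com/a.zip"},
        {"label": "GeoJSON", "url": "https://example.com/a.geojson"},
    ]},
    {"ID": "demo-2", "Download": "https://example.com/b.zip"},
]


def flatten_downloads(records):
    """Return one row per download for records whose Download is a list."""
    rows = []
    for record in records:
        downloads = record["Download"]
        if isinstance(downloads, list):
            for download in downloads:
                if isinstance(download, dict):
                    rows.append({
                        "friendlier_id": record["ID"],
                        "label": download.get("label", ""),
                        "value": download.get("url", ""),
                    })
    return rows


secondary_rows = flatten_downloads(records)

# Write to an in-memory buffer instead of a file for demonstration
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["friendlier_id", "label", "value"])
writer.writeheader()
writer.writerows(secondary_rows)
print(buffer.getvalue())
```

Records with a single download URL produce no rows in the secondary table and keep their value in the main CSV.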



### Write the DataFrame to a CSV file with Aardvark labels

The resulting CSV can be uploaded to GEOMG.

In [647]:
df.to_csv("{}.csv".format(csv_name), index=False, na_rep='')