## Transform a batch of OpenGeoMetadata JSON files

Customized for UW Milwaukee

**Purpose: This script reads a batch of GeoBlacklight metadata JSON files in the OGM Aardvark schema and transforms them into a single CSV.**

Metadata records in the [OGM Aardvark](https://opengeometadata.org/docs/aardvark) schema are frequently shared as batches of JSON files. The entire [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) contains repositories full of hundreds of thousands of GeoBlacklight JSONs.

In order to ingest these into the BTAA Geoportal, we need to transform them into a CSV.  
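
As a rough illustration of the goal, the sketch below flattens two made-up Aardvark-style records into CSV text using only the standard library. The record values and the `records_to_csv` helper are hypothetical; the notebook itself does this work with pandas in the cells that follow.

```python
import csv
import io

# Two made-up records using Aardvark field names (all values are illustrative)
records = [
    {"id": "example-001",
     "dct_title_s": "Roads, Milwaukee County, Wisconsin 2015",
     "dct_spatial_sm": ["Milwaukee County", "Wisconsin"]},
    {"id": "example-002",
     "dct_title_s": "Parks",
     "gbl_resourceClass_sm": ["Datasets"]},
]

def records_to_csv(records):
    """Write a list of dicts to CSV text; list values become pipe-separated strings."""
    # Collect the union of keys, preserving first-seen order
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {key: "|".join(map(str, value)) if isinstance(value, list) else value
               for key, value in record.items()}
        writer.writerow(row)
    return buffer.getvalue()

csv_text = records_to_csv(records)
print(csv_text)
```

Fields that a record lacks are written as empty cells, which is also how the pandas export at the end of this notebook behaves (`na_rep=''`).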


To do:
- fix title formatting script to account for more place name values, including just states
- fix the locations.json loading so that it can run at the beginning

## Part 1: Load the modules and JSON files

### Import python modules

In [None]:
import csv
import json
import os
import re
import pandas as pd

### Declare the paths and file names

First, move a folder of the JSONs into this directory. Files in the folder can be nested.
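
To show what "nested" means here, the following sketch builds a throwaway folder and collects every `.json` file below it with `os.walk`. The `find_json_files` helper and the folder layout are hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile

def find_json_files(root):
    """Return the sorted paths of every .json file under root, at any depth."""
    found = []
    for path, _dirs, files in os.walk(root):
        for filename in files:
            if filename.endswith(".json"):
                found.append(os.path.join(path, filename))
    return sorted(found)

# Build a temporary folder with one top-level record and one nested record
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "county", "roads"))
    for relative in ("a.json", os.path.join("county", "roads", "b.json")):
        with open(os.path.join(root, relative), "w", encoding="utf-8") as f:
            json.dump({"id": relative}, f)
    # A non-JSON file that should be skipped
    with open(os.path.join(root, "README.txt"), "w", encoding="utf-8") as f:
        f.write("not metadata")

    relative_paths = [os.path.relpath(p, root) for p in find_json_files(root)]

print(relative_paths)
```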

In [None]:
json_path = r"uwm" # enter the name of the folder
csv_name = "uwm" # create a name for the output CSV without the .csv extension

In [None]:
# json_path = os.path.join('../../', 'data', 'locations.json')
# with open(json_path, 'r') as file:
#     locations = json.load(file)

# counties_in_wisconsin = locations['counties_in_wisconsin']
# cities_in_wisconsin = locations['cities_in_wisconsin']
# states_in_us = locations['states']
# nations = locations['nations']

### Load the files into a pandas DataFrame

In [None]:
dataset = [] # empty list

# walk through all files (including nested folders), parse each JSON, and append it to the dataset list
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # read as bytes and drop undecodable characters; the file is closed automatically
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)

df = pd.DataFrame(dataset) # convert dataset into dataframe

## Part 2: Split multivalued and compound fields

### Split multivalued fields (arrays)

This step flattens fields that are stored as JSON arrays into single strings, with the values separated by pipes ('|').
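
For a sense of the before and after, here is a small sketch on one made-up record. The `to_pipes` helper is a hypothetical stand-in that follows the same idea as the function in the next cell; the sample values are invented.

```python
def to_pipes(value):
    """Join list values with '|'; leave every other value untouched."""
    if isinstance(value, list):
        return "|".join(map(str, value))
    return value

# Made-up record: two array fields and one single-valued field
sample = {
    "dct_spatial_sm": ["Milwaukee County", "Wisconsin"],
    "gbl_indexYear_im": [2015, 2016],   # integers are converted to strings
    "dct_title_s": "Roads",             # not a list, so it passes through
}

flattened = {key: to_pipes(value) for key, value in sample.items()}
print(flattened)
```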

In [None]:
# Function to join array elements into a pipe-separated string
def join_multivalues(val):
    if isinstance(val, list):
        return '|'.join(map(str, val))
    return val

# Apply the function to each column that needs processing
multivalue_columns = [
    'dcat_keyword_sm', 'dcat_theme_sm', 'dct_subject_sm', 'dct_creator_sm',
    'dct_publisher_sm', 'dct_alternative_sm', 'dct_description_sm',
    'dct_language_sm', 'dct_identifier_sm', 'dct_isPartOf_sm', 
    'dct_isReplacedBy_sm', 'dct_isVersionOf_sm', 'dct_relation_sm',
    'dct_replaces_sm', 'dct_source_sm', 'dct_license_sm', 'dct_rights_sm',
    'dct_rightsHolder_sm', 'dct_spatial_sm', 'dct_temporal_sm',
    'gbl_resourceClass_sm', 'gbl_resourceType_sm', 'gbl_displayNote_sm',
    'pcdm_memberOf_sm', 'gbl_indexYear_im'
]

for column in multivalue_columns:
    if column in df.columns:
        df[column] = df[column].apply(join_multivalues)
    else:
        print(f"Column {column} not found in DataFrame.")


In [None]:
print(df.columns)

### Split the References into separate columns

This step makes it easier to edit individual links.
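
The idea can be sketched on a single value. In Aardvark, `dct_references_s` holds a JSON object serialized as a string, so it has to be parsed before its keys can become columns. The URLs below are made up, `labels` is a two-entry subset of the mapping used in this notebook, and `split_references` is a hypothetical helper.

```python
import json

# A made-up dct_references_s value: a JSON object stored as a string
references = (
    '{"http://schema.org/downloadUrl": "https://example.org/data.zip", '
    '"http://schema.org/url": "https://example.org/item"}'
)

# Subset of the reference-key-to-label mapping
labels = {
    "http://schema.org/downloadUrl": "Download",
    "http://schema.org/url": "Information",
}

def split_references(value):
    """Parse a references string into {label: url}; return {} if it cannot be parsed."""
    if not isinstance(value, str):
        return {}
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    # Keys without a known label keep their original URI as the column name
    return {labels.get(key, key): url for key, url in parsed.items()}

columns = split_references(references)
print(columns)
```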

In [None]:
def extract_values(row):
    # Check if the value is a string; otherwise, return None or an empty dict
    if isinstance(row['dct_references_s'], str):
        try:
            dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
            return dct_references_s
        except json.JSONDecodeError:
            print(f"JSON decode error in row: {row}")
            return {}
    else:
        return {}

# Apply the function to split the column and expand into separate columns
df_expanded = df.apply(extract_values, axis=1).apply(pd.Series)

# Concatenate the original DataFrame with the expanded DataFrame
df = pd.concat([df, df_expanded], axis=1)

# Rename columns based on keys in the JSON
column_mapping = {
    'http://schema.org/downloadUrl': 'Download',
    'http://schema.org/url': 'Information',
#     'http://www.isotc211.org/schemas/2005/gmd/': 'ISO19139',
    'http://www.isotc211.org/schemas/2005/gmd': 'ISO19139',
    'http://www.opengis.net/cat/csw/csdgm': 'FGDC',
    'http://www.w3.org/1999/xhtml': 'HTML',
    'http://lccn.loc.gov/sh85035852': 'Documentation',
    'http://iiif.io/api/image': 'IIIF',
    'http://iiif.io/api/presentation#manifest': 'Manifest',
    'http://www.loc.gov/mods/v3': 'MODS',
    'https://openindexmaps.org': 'Index Map',
    'http://www.opengis.net/def/serviceType/ogc/wms': 'WMS',
    'http://www.opengis.net/def/serviceType/ogc/wfs': 'WFS',
    'http://www.opengis.net/def/serviceType/ogc/wcs': 'WCS',
    'urn:x-esri:serviceType:ArcGIS#FeatureLayer': 'FeatureServer',
    'urn:x-esri:serviceType:ArcGIS#TiledMapLayer': 'TileServer',
    'urn:x-esri:serviceType:ArcGIS#DynamicMapLayer': 'MapServer',
    'urn:x-esri:serviceType:ArcGIS#ImageMapLayer': 'ImageServer',
    'http://schema.org/DownloadAction': 'Harvard Download',
    'https://github.com/cogeotiff/cog-spec': 'COG',
    'https://github.com/protomaps/PMTiles': 'PMTiles',
    'https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames': 'XYZ Tiles',
    'http://schema.org/thumbnailUrl': 'B1G Image',
    'http://www.opengis.net/def/serviceType/ogc/wmts': 'WMTS',
    'https://oembed.com': 'oembed',
    'https://github.com/mapbox/tilejson-spec': 'TileJSON',
    'https://wiki.osgeo.org/wiki/Tile_Map_Service_Specification': 'Tile Map Service'
    }
df.rename(columns=column_mapping, inplace=True)


### Reorder coordinates

This will reorder the 4 bbox coordinates into W,S,E,N, which is what the Klokan Bounding Box tool produces on the CSV export option. The BTAA metadata editor uses this order as well when ingesting items. However, Aardvark ultimately uses W,E,N,S, so these would need to be reordered before converting back to JSON.
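
A minimal sketch of the reordering in both directions, using a made-up bounding box. The two helper functions are hypothetical and are not part of this notebook. The second one shows the reverse step that would be needed before converting back to JSON.

```python
import re

# Matches 'ENVELOPE(W,E,N,S)' and captures the four comma-separated values
ENVELOPE_PATTERN = re.compile(
    r"^\s*ENVELOPE\(\s*([^,]+),\s*([^,]+),\s*([^,]+),\s*([^,)]+)\)\s*$"
)

def envelope_to_wsen(envelope):
    """Convert 'ENVELOPE(W,E,N,S)' to 'W,S,E,N'; return '' if the input is not valid."""
    match = ENVELOPE_PATTERN.match(envelope) if isinstance(envelope, str) else None
    if match is None:
        return ""
    w, e, n, s = (part.strip() for part in match.groups())
    return f"{w},{s},{e},{n}"

def wsen_to_envelope(bbox):
    """Convert 'W,S,E,N' back to 'ENVELOPE(W,E,N,S)'."""
    w, s, e, n = (part.strip() for part in bbox.split(","))
    return f"ENVELOPE({w},{e},{n},{s})"

original = "ENVELOPE(-88.07,-87.86,43.19,42.92)"  # made-up coordinates
csv_bbox = envelope_to_wsen(original)
round_trip = wsen_to_envelope(csv_bbox)
print(csv_bbox)
print(round_trip)
```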

In [None]:
# Parse 'ENVELOPE(W,E,N,S)' strings; missing or malformed values become four Nones
def parse_envelope(value):
    if isinstance(value, str):
        value = value.strip()
        if value.startswith('ENVELOPE(') and value.endswith(')'):
            # Slice off the wrapper instead of str.strip('ENVELOPE()'), which strips characters, not a prefix
            parts = [part.strip() for part in value[len('ENVELOPE('):-1].split(',')]
            if len(parts) == 4:
                return parts
    return [None, None, None, None]

if 'dcat_bbox' in df.columns:
    df[['w', 'e', 'n', 's']] = df['dcat_bbox'].apply(parse_envelope).tolist()

    # Join in W,S,E,N order; rows without a valid bbox get an empty string
    df['Bounding Box'] = df[['w', 's', 'e', 'n']].apply(
        lambda row: ','.join(str(item) for item in row if item is not None),
        axis=1
    )
else:
    print("Column 'dcat_bbox' is missing from the DataFrame.")

## Part 3: Export to a new CSV with Aardvark labels as headers

### Rename the remaining columns

In [None]:
# Define the mapping of old field names to new labels
column_mapping = {
    'dcat_keyword_sm': 'Keyword',
    'dcat_theme_sm': 'Theme',
    'dcat_centroid': 'Centroid',
    'dct_subject_sm': 'Subject',
    'dct_creator_sm': 'Creator',
    'dct_publisher_sm': 'Publisher',
#     'dct_alternative_sm': 'Alternative Title',
    'dct_description_sm': 'Description',
    'dct_language_sm': 'Language',
    'dct_title_s': 'Alternative Title',
    'dct_identifier_sm': 'Identifier',
    'dct_format_s': 'Format',
    'dct_isPartOf_sm': 'Is Part Of',
    'dct_isReplacedBy_sm': 'Is Replaced By',
    'dct_isVersionOf_sm': 'Is Version Of',
    'dct_relation_sm': 'Relation',
    'dct_replaces_sm': 'Replaces',
    'dct_source_sm': 'Source',
    'dct_accessRights_s': 'Access Rights',
    'dct_license_sm': 'License',
    'dct_rights_sm': 'Rights',
    'dct_rightsHolder_sm': 'Rights Holder',
    'dct_spatial_sm': 'Spatial Coverage',
    'dct_issued_s': 'Date Issued',
    'dct_temporal_sm': 'Temporal Coverage',
    'gbl_mdVersion_s': 'Metadata Version',
    'gbl_mdModified_dt': 'Modified',
    'gbl_suppressed_b': 'Suppressed',
    'gbl_resourceClass_sm': 'Resource Class',
    'gbl_resourceType_sm': 'Resource Type',
    'gbl_displayNote_sm': 'Display Note',
    'id': 'ID',
    'gbl_wxsIdentifier_s': 'WxS Identifier',
    'gbl_fileSize_s': 'File Size',
    'gbl_georeferenced_b': 'Georeferenced',
    'gbl_dateRange_drsim': 'Date Range',
    'gbl_indexYear_im': 'Index Year',
    'locn_geometry': 'Geometry',
    'pcdm_memberOf_sm': 'Member Of',
    'schema_provider_s': 'Provider'
}

# Rename the columns
df.rename(columns=column_mapping, inplace=True)


In [None]:
# Load locations.json
json_path = os.path.join('../../', 'data', 'locations.json')
with open(json_path, 'r') as file:
    locations = json.load(file)

counties_in_wisconsin = locations['counties_in_wisconsin']
cities_in_wisconsin = locations['cities_in_wisconsin']

# Define the transform_title function
def transform_title(row):
    alt_title = row['Alternative Title']
    
    # Search for a city or county name in the title.
    for county in counties_in_wisconsin:
        if re.search(f"{county} County", alt_title, re.I):
            alt_title = re.sub(f"{county} County, Wisconsin", f"[Wisconsin--{county} County]", alt_title, flags=re.I, count=1)
            break
    else:
        for city in cities_in_wisconsin:
            if re.search(rf"\b{city}\b", alt_title, re.I):
                alt_title = re.sub(rf"{city}, Wisconsin", f"[Wisconsin--{city}]", alt_title, flags=re.I, count=1)
                break
        else:
            alt_title = re.sub(r"\b(Wisconsin)\b, Wisconsin", "[Wisconsin]", alt_title, flags=re.I, count=1)

    # Capture content in brackets
    bracket_content = re.findall(r'\[(.*?)\]', alt_title)
    
    if bracket_content:
        # Remove bracketed content from original position
        alt_title = re.sub(r'\[.*?\]', '', alt_title).strip()
        
        # Append bracketed content to the end of the title
        alt_title = f"{alt_title} [{bracket_content[0]}]"

    # Remove unwanted dashes at the beginning or just before a bracket
    alt_title = re.sub(r"^\s*-\s*|\s*-+\s*(?=\[)", "", alt_title)
    
    # Make sure first letter is capitalized (slicing avoids an IndexError on an empty title)
    alt_title = alt_title[:1].upper() + alt_title[1:]
    
    # Remove multiple spaces
    alt_title = re.sub(r'\s+', ' ', alt_title).strip()
    
    return alt_title

# Define the format_title function
def format_title(row):
    alternativeTitle = row['Alternative Title']
    year = ''
    try:
        year_range = re.findall(r'(\d{4})-(\d{4})', alternativeTitle)
    except TypeError:  # title is not a string (e.g. NaN)
        year_range = ''
    try:
        single_year = re.match(r'.*(17\d{2}|18\d{2}|19\d{2}|20\d{2})', alternativeTitle)
    except TypeError:  # title is not a string (e.g. NaN)
        single_year = ''
    if year_range:   # if a 'yyyy-yyyy' exists
        year = '-'.join(year_range[0])
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
    elif single_year:  # or if a 'yyyy' exists
        year = single_year.group(1)
        alternativeTitle = alternativeTitle.replace(year, '').strip().rstrip(',')
     
    altTitle = str(alternativeTitle)
    title = altTitle
    if year:
        title += ' {' + year +'}'
    
    # Remove multiple spaces
    title = re.sub(r'\s+', ' ', title).strip()
    
    return title

# Combine both functions to transform and format the title
def combined_title(row):
    row['Alternative Title'] = transform_title(row)
    return format_title(row)

# Apply the combined function to the DataFrame
df['Title'] = df.apply(combined_title, axis=1)

In [None]:
# Reuse the locations dictionary loaded from locations.json in the previous cell
states_in_us = locations['states']
nations = locations['nations']

# Define the function to format the "Spatial Coverage" values
def format_spatial_coverage(value):
    if pd.isna(value):
        return value
    parts = value.split('|')
    formatted_parts = []
    for part in parts:
        if part in states_in_us:
            # If part is a state, keep it as is
            formatted_parts.append(part)
        elif 'County' in part or 'Metro' in part:
            # Format as "state--county" or "state--city"
            state = parts[-1]  # Assume last part is the state
            if state in states_in_us and part != state:  # Avoid duplicating the state in "state--state"
                formatted_parts.append(f'{state}--{part}')
            else:
                formatted_parts.append(part)
        elif part not in states_in_us and part not in nations:
            # Other parts, format as "state--city" for cities
            state = parts[-1]  # Assume last part is the state
            if state in states_in_us and part != state:  # Avoid duplicating the state in "state--state"
                formatted_parts.append(f'{state}--{part}')
            else:
                formatted_parts.append(part)
        else:
            # If part is a nation, keep it as is
            formatted_parts.append(part)
    return '|'.join(formatted_parts)

# Apply the function to the "Spatial Coverage" column
df['Original Spatial Coverage'] = df['Spatial Coverage']
df['Spatial Coverage'] = df['Spatial Coverage'].apply(format_spatial_coverage)

### Write the DataFrame to a CSV file with Aardvark labels

The resulting CSV can be uploaded to GBL Admin.

In [None]:
df.to_csv("{}.csv".format(csv_name), index=False, na_rep='')