## Transform a batch of GBL 1.0 JSON files from UW-Madison

**Purpose: This script reads a batch of GBL-1.0 metadata JSON files and transforms them into a single CSV.**

This version is specific to UW-Madison records.

Metadata records in the [GeoBlacklight](https://opengeometadata.org/docs/gbl-1.0) or [OpenGeoMetadata](https://opengeometadata.org/docs/ogm-aardvark) standards are frequently shared as batches of JSON files. The entire [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) contains repositories full of hundreds of thousands of GeoBlacklight JSONs.

In order to ingest these into the BTAA Geoportal, we need to transform them into a CSV.  


## Part 1: Load the modules and JSON files

### Import python modules

In [1]:
import csv
import json
import os
import pandas as pd
import uuid
import datetime
import re



### Declare the paths and file names

First, move a folder of the JSONs into this directory. Files in the folder can be nested.

In [2]:
json_path = r"/Users/majew030/GitHub/OGM/edu.wisc" # enter the name of the folder
csv_name = "10d-03" # create a name for the output CSV without the .csv extension

### Load the files into a pandas DataFrame

In [3]:
dataset = [] # empty list

# Walk through all folders (including nested ones), load each JSON file, and append it to the dataset list
for path, _dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # Use a context manager so each file is closed after reading
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)
            
df = pd.DataFrame(dataset) # convert dataset into dataframe
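
The loading step can be tried on a throwaway folder before pointing it at a real repository. The following is a minimal sketch, not the notebook's own code: the helper name `load_json_records` and the two sample records are made up for illustration.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_records(root):
    """Collect every .json file under root (recursively) into a list of dicts."""
    records = []
    for path, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if filename.endswith(".json"):
                file_path = os.path.join(path, filename)
                with open(file_path, encoding="utf-8", errors="ignore") as f:
                    records.append(json.load(f))
    return records


# Build a tiny nested folder holding two hypothetical records
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    samples = {
        "a.json": {"layer_slug_s": "demo-a", "dc_title_s": "Demo A"},
        os.path.join("nested", "b.json"): {
            "layer_slug_s": "demo-b",
            "dc_subject_sm": ["boundaries"],
        },
    }
    for rel_path, record in samples.items():
        with open(os.path.join(tmp, rel_path), "w", encoding="utf-8") as f:
            json.dump(record, f)
    demo_df = pd.DataFrame(load_json_records(tmp))

print(demo_df.shape)
```

Fields that appear in only some records become columns that are empty (NaN) for the others, which is why later cells check for missing values.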

## Part 2: Split multivalued and compound fields

### Split multivalued fields (arrays)

This removes the array punctuation from fields that are formatted as arrays and separates their values with pipes ('|').

In [4]:
# .str.join('|') joins the items of each list with a pipe; a plain string would be joined character by character

df['dc_creator_sm']=df['dc_creator_sm'].str.join('|')
df['dc_subject_sm']=df['dc_subject_sm'].str.join('|')
df['dct_spatial_sm']=df['dct_spatial_sm'].str.join('|')
df['dct_isPartOf_sm']=df['dct_isPartOf_sm'].str.join('|')
df['dct_temporal_sm']=df['dct_temporal_sm'].str.join('|')
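
As an illustration of what `.str.join('|')` does to different kinds of cell values, here is a small sketch on made-up data (the `demo` frame and its values are assumptions, not UW-Madison records):

```python
import pandas as pd

demo = pd.DataFrame({
    "dc_subject_sm": [
        ["boundaries", "society"],  # a normal multivalued field
        ["elevation"],              # a single-item list
        "Dane",                     # a plain string, not a list
        None,                       # field missing from the record
    ],
})

demo["joined"] = demo["dc_subject_sm"].str.join("|")
print(demo["joined"].tolist())
```

The plain-string row shows why this step is only safe for fields that are always arrays: a string is iterated one character at a time, so a single-valued field stored as a string would come out with a pipe between every letter.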


### Reorder coordinates

This will reorder the 4 bbox coordinates into W,S,E,N, which is what the Klokan Bounding Box tool produces on the CSV export option. The BTAA metadata editor uses this order as well when ingesting items. However, Aardvark ultimately uses W,E,N,S, so these would need to be reordered before converting back to JSON.

In [5]:
# Split solr_geom coordinates and reorder from WENS to WSEN
df[['w','e','n','s']] = df['solr_geom'].str.strip('ENVELOPE()').str.split(',', expand=True)
df['Bounding Box'] = df[['w', 's','e','n']].agg(','.join, axis=1) 


df = df.drop(columns=['w','e','n','s','solr_geom'])
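
For a single value, the same reordering can be sketched with explicit prefix and suffix removal. This is an alternative illustration rather than the cell above: `envelope_to_wsen` is a hypothetical helper, the coordinates are made up, and `str.removeprefix` needs Python 3.9 or later.

```python
def envelope_to_wsen(solr_geom):
    """Turn 'ENVELOPE(W, E, N, S)' into 'W,S,E,N'."""
    inner = solr_geom.strip().removeprefix("ENVELOPE(").removesuffix(")")
    # Strip each part so spaces after the commas do not end up in the output
    w, e, n, s = (part.strip() for part in inner.split(","))
    return f"{w},{s},{e},{n}"


print(envelope_to_wsen("ENVELOPE(-92.9, -86.2, 47.3, 42.5)"))
```

Unlike `str.strip('ENVELOPE()')`, which removes any of those characters from both ends, this removes only the literal prefix and suffix and also trims whitespace around each coordinate.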

## Part 2a: Distributions

Parse the distribution links.

### Split the References into separate columns

In [6]:
import logging

# Configure logging
logging.basicConfig(filename='processing.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def extract_values(row):
    try:
        dct_references_s = json.loads(row['dct_references_s'].replace('""', '"'))
        return dct_references_s
    except json.JSONDecodeError as e:
        # Log the error and the record causing it
        logging.error(f'Error processing record with ID: {row["layer_slug_s"]}, Error: {str(e)}')
        return None

# Apply the function to split the column and expand into separate columns
df = pd.concat([df, df.apply(extract_values, axis=1).apply(pd.Series)], axis=1)

# Rename columns based on keys in the JSON
df = df.rename(columns={
    'http://schema.org/downloadUrl': 'download',
    'http://schema.org/url': 'full_layer_description',
    'http://www.isotc211.org/schemas/2005/gmd/': 'metadata_iso',
    'http://www.opengis.net/cat/csw/csdgm': 'metadata_fgdc',
    'http://www.w3.org/1999/xhtml': 'metadata_html',
    'http://lccn.loc.gov/sh85035852': 'documentation_download',
    'http://iiif.io/api/image': 'iiif_image',
    'http://iiif.io/api/presentation#manifest': 'iiif_manifest',
    'http://www.loc.gov/mods/v3': 'metadata_mods',
    'https://openindexmaps.org': 'open_index_map',
    'http://www.opengis.net/def/serviceType/ogc/wms': 'wms',
    'http://www.opengis.net/def/serviceType/ogc/wfs': 'wfs',
    'urn:x-esri:serviceType:ArcGIS#FeatureLayer': 'arcgis_feature_layer',
    'urn:x-esri:serviceType:ArcGIS#TiledMapLayer': 'arcgis_tiled_map_layer',
    'urn:x-esri:serviceType:ArcGIS#DynamicMapLayer': 'arcgis_dynamic_map_layer',
    'urn:x-esri:serviceType:ArcGIS#ImageMapLayer': 'arcgis_image_map_layer'
})

# df = df.drop(columns=['dct_references_s'])
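
To make the shape of `dct_references_s` concrete, the sketch below parses one made-up references string and relabels its keys. The `LABELS` dictionary is a two-entry subset of the rename mapping above, and the `example.org` URLs are placeholders.

```python
import json

# Hypothetical value of dct_references_s: a JSON object stored as a string
references = (
    '{"http://schema.org/downloadUrl": "https://example.org/data.zip", '
    '"http://www.opengis.net/def/serviceType/ogc/wms": "https://example.org/wms"}'
)

# Subset of the key-to-column mapping used in the rename step
LABELS = {
    "http://schema.org/downloadUrl": "download",
    "http://www.opengis.net/def/serviceType/ogc/wms": "wms",
}

parsed = json.loads(references)
# Unknown keys keep their original URI so nothing is silently dropped
labelled = {LABELS.get(key, key): value for key, value in parsed.items()}
print(labelled)
```

Each key becomes its own column once the parsed dictionaries are expanded, so a record without a given reference type simply has an empty cell in that column.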

### Write the links to a distributions spreadsheet


In [9]:
# Define the columns in the DataFrame that correspond to distribution types
distribution_columns = [
    'download', 'full_layer_description', 'metadata_iso', 'metadata_fgdc', 'metadata_html',
    'documentation_download', 'iiif_image', 'iiif_manifest', 'metadata_mods',
    'open_index_map', 'wms', 'wfs', 'arcgis_feature_layer',
    'arcgis_tiled_map_layer', 'arcgis_dynamic_map_layer', 'arcgis_image_map_layer'
]

# Function to check if the value is an array
def is_array_type(value):
    return isinstance(value, list)

# Function to extract the download information for rows with multiple downloads
def extract_multiple_downloads(row):
    friendlier_id = row["layer_slug_s"]
    downloads = row.get("download", None)
    extracted_downloads = []
    if is_array_type(downloads):
        for download in downloads:
            if isinstance(download, dict):
                label = download.get("label", "")  # Use the label from the array
                url = download.get("url", "")
                extracted_downloads.append({
                    "friendlier_id": friendlier_id,
                    "label": label,
                    "reference_type": "download",
                    "distribution_url": url
                })
    return extracted_downloads

# Prepare a list to store the rows for the new CSV
distribution_rows = []

# Iterate over each row in the DataFrame
for _, row in df.iterrows():
    friendlier_id = row['layer_slug_s']
    label = row.get('dc_format_s', "")  # Default label for single download
    
    # Handle array-type download values
    if is_array_type(row.get('download', None)):
        # Extract multiple download links
        distribution_rows.extend(extract_multiple_downloads(row))
    elif pd.notnull(row.get('download', None)):  # Single download
        distribution_rows.append({
            'friendlier_id': friendlier_id,
            'label': label,  # Use dc_format_s for single download
            'reference_type': 'download',
            'distribution_url': row['download']
        })
    
    # Process other distribution columns
    for col in distribution_columns:
        if col != "download" and col in df.columns and pd.notnull(row.get(col, None)):
            distribution_rows.append({
                'friendlier_id': friendlier_id,
                'label': "",  # Leave blank for non-download rows
                'reference_type': col,
                'distribution_url': row[col]
            })

# Create a new DataFrame for the distribution links
distribution_df = pd.DataFrame(distribution_rows)

# Save the distribution DataFrame to a CSV file
distribution_df.to_csv('distribution_links.csv', index=False)
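
The two download cases handled above (a single URL versus an array of labelled downloads) can be sketched in isolation. `download_rows` is a hypothetical helper written for this illustration, and the IDs and URLs are invented; the `label` and `url` keys follow the ones read in `extract_multiple_downloads`.

```python
def download_rows(friendlier_id, download, default_label=""):
    """Return one distribution row per download link for a record."""
    if isinstance(download, list):
        # Array form: each item carries its own label and url
        return [
            {
                "friendlier_id": friendlier_id,
                "label": item.get("label", ""),
                "reference_type": "download",
                "distribution_url": item.get("url", ""),
            }
            for item in download
            if isinstance(item, dict)
        ]
    if download:
        # Single URL: fall back to the record's format as the label
        return [{
            "friendlier_id": friendlier_id,
            "label": default_label,
            "reference_type": "download",
            "distribution_url": download,
        }]
    return []


single = download_rows("demo-a", "https://example.org/a.zip", "Shapefile")
multiple = download_rows("demo-b", [
    {"label": "Shapefile", "url": "https://example.org/b-shp.zip"},
    {"label": "GeoTIFF", "url": "https://example.org/b-tif.zip"},
])
print(len(single), len(multiple))
```

The result is a long table with one row per link, which is the layout written to `distribution_links.csv`.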


## Part 3: Transform values for fields without a straight crosswalk

In [26]:
#Convert Data Type to Resource Class value
df['Resource Class'] = df['dc_type_s'].apply(lambda x: 'Imagery' if x == 'Image' else 'Datasets')


#Convert Geometry Type to Resource Type value
df['Resource Type'] = df['layer_geom_type_s'].astype(str) + ' data'

# Create the Date Range field
# Earlier one-line version, kept for reference:
# df['Date Range'] = df.apply(lambda row: f"{row['dct_temporal_sm']}-{row['dct_temporal_sm']}" if pd.notna(row['dct_temporal_sm']) else '', axis=1)

def format_temporal_coverage(row):
    temporal_coverage = row['dct_temporal_sm']
    
    # Check if the value already starts with a date range in yyyy-yyyy format
    if pd.notna(temporal_coverage) and re.match(r'\d{4}-\d{4}', temporal_coverage):
        return temporal_coverage  # Value is already formatted, so no change needed
    
    # Otherwise, duplicate the single value to form a range (for example, 2015 becomes 2015-2015)
    return f"{temporal_coverage}-{temporal_coverage}" if pd.notna(temporal_coverage) else ''

# Apply the function to create or update the "Date Range" column
df['Date Range'] = df.apply(format_temporal_coverage, axis=1)
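
The Date Range rule can be checked on a few values outside the DataFrame. This sketch mirrors the logic above on plain strings; `to_date_range` is a hypothetical stand-in for `format_temporal_coverage`, and an empty string stands in for a missing value.

```python
import re


def to_date_range(temporal):
    """Return a yyyy-yyyy range for a temporal coverage string."""
    if not temporal:
        return ""
    # re.match anchors only at the start, so anything beginning with a range passes through
    if re.match(r"\d{4}-\d{4}", temporal):
        return temporal
    return f"{temporal}-{temporal}"


for value in ["2015", "1990-1995", ""]:
    print(repr(value), "->", repr(to_date_range(value)))
```

Values that are neither a single year nor a range (for example, two years joined by a pipe) are duplicated as-is, so it is worth scanning the output column for anything that does not look like `yyyy-yyyy`.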

### Check for GeoTIFFs

In [27]:
# Define a function to check if "GeoTIFF" is present in the "dc_format_s" column
def check_geotiff(value):
    if pd.notna(value) and "GeoTIFF" in value:
        return "true"
    else:
        return "false"

# Create the "Georeferenced" column using the check_geotiff function
df["Georeferenced"] = df["dc_format_s"].apply(check_geotiff)


## Part 4: Custom actions for UW-Madison records

### Concatenate custom field 'uw_supplemental_s' to Description

In [28]:
# Fill empty (NaN) values in the 'uw_supplemental_s' column with empty strings
df['uw_supplemental_s'] = df['uw_supplemental_s'].fillna('')

# Concatenate the 'dc_description_s' and 'uw_supplemental_s' columns with the pipe separator
df['dc_description_s'] = df['dc_description_s'] + '|' + df['uw_supplemental_s']

# Drop 'uw_supplemental_s' column
df = df.drop(columns=['uw_supplemental_s'])

### Convert values in "dc_subject_sm" and create a new "Theme" column

In [29]:

# Define the conversion mappings from old values to new values
subject_sm_mapping = {
    "farming": "Agriculture",
    "biota": "Biology",
    "boundaries": "Boundaries",
    "climatologymeteorologyatmosphere": "Climate",
    "economy": "Economy",
    "elevation": "Elevation",
    "environment": "Environment",
    "society; climatologymeteorologyatmosphere": "Events",  # key is lowercase because values are lowercased before lookup
    "geoscientificinformation": "Geology",
    "health": "Health",
    "imagerybasemapsearthcover": "Imagery",
    "inlandwaters": "Inland Waters",
    "location": "Location",
    "intelligencemilitary": "Military",
    "oceans": "Oceans",
    "planningcadastre": "Property",
    "society": "Society",
    "structure": "Structure",
    "transportation": "Transportation",
    "utilitiescommunication": "Utilities"
}


# Function to apply the mapping and join the values back together
def convert_and_join(row):
    subject_values = row['dc_subject_sm']
    if pd.notna(subject_values):  # Check for NaN before splitting
        subject_values = subject_values.split('|')
        converted_values = []
        for value in subject_values:
            value_lower = value.lower()
            if value_lower in subject_sm_mapping:
                converted_values.append(subject_sm_mapping[value_lower])
        return '|'.join(converted_values)
    else:
        return ''  # Return an empty string if the value is NaN

# Apply the mapping and create the new "Theme" column
df['Theme'] = df.apply(convert_and_join, axis=1)

# Drop duplicates from the "Theme" column
df['Theme'] = df['Theme'].str.split('|').apply(lambda x: '|'.join(sorted(set(x), key=x.index)))
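
The mapping and de-duplication can be sketched in one pass. This is an illustration under assumptions: `MAPPING` is a three-entry subset of `subject_sm_mapping`, `subjects_to_themes` is a hypothetical helper, and `dict.fromkeys` is swapped in for the `sorted(set(x), key=x.index)` idiom as another way to drop duplicates while keeping first-seen order.

```python
# Three entries from the subject-to-theme mapping, for illustration only
MAPPING = {
    "farming": "Agriculture",
    "boundaries": "Boundaries",
    "planningcadastre": "Property",
}


def subjects_to_themes(subjects):
    """Map pipe-separated subjects to themes, dropping unmapped and repeated values."""
    themes = [
        MAPPING[subject.lower()]
        for subject in subjects.split("|")
        if subject.lower() in MAPPING
    ]
    # dict.fromkeys keeps the first occurrence of each theme, in order
    return "|".join(dict.fromkeys(themes))


print(subjects_to_themes("Boundaries|planningCadastre|boundaries|Parcels"))
```

Subjects with no mapping (here, `Parcels`) are dropped from Theme, so the Theme column only ever contains controlled values.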


### Define County and City names

In [30]:
counties_in_wisconsin = [
    'Adams', 'Ashland', 'Barron', 'Bayfield', 'Brown', 'Buffalo',
    'Burnett', 'Calumet', 'Chippewa', 'Clark', 'Columbia', 'Crawford',
    'Dane', 'Dodge', 'Door', 'Douglas', 'Dunn', 'Eau Claire',
    'Florence', 'Fond du Lac', 'Fond Du Lac', 'Forest', 'Grant', 'Green', 'Green Lake',
    'Iowa', 'Iron', 'Jackson', 'Jefferson', 'Juneau', 'Kenosha',
    'Kewaunee', 'La Crosse', 'Lacrosse', 'Lafayette', 'Langlade', 'Lincoln',
    'Manitowoc', 'Marathon', 'Marinette', 'Marquette', 'Menominee',
    'Milwaukee', 'Monroe', 'Oconto', 'Oneida', 'Outagamie', 'Ozaukee',
    'Pepin', 'Pierce', 'Polk', 'Portage', 'Price', 'Racine',
    'Richland', 'Rock', 'Rusk', 'Sauk', 'Sawyer', 'Shawano',
    'Sheboygan', 'St. Croix', 'St Croix', 'Taylor', 'Trempealeau', 'Vernon', 'Vilas',
    'Walworth', 'Washburn', 'Washington', 'Waukesha', 'Waupaca',
    'Waushara', 'Winnebago', 'Wood'
]

cities_in_wisconsin = [
    'Milwaukee', 'Washington', 'Waukesha', 'Appleton', 'Outagamie', 
    'Winnebago', 'Eau Claire', 'Fitchburg', 'Fond du Lac', 'Fond Du Lac','Green Bay', 'Janesville', 
    'Kenosha', 'La Crosse', 'Lacrosse', 'Madison', 'Oshkosh', 'Racine', 'Rhinelander', 
    'Sheboygan', 'Superior',
    'Waukesha', 'Wausau', 'Wauwatosa', 'West Allis', 'Wisconsin Rapids'
]



### Transform the title into "theme [place] {date}"

In [None]:
def transform_title(alt_title):
    # Function to check if a word is an acronym (three or more capital letters)
    def is_acronym(word):
        return len(word) >= 3 and word.isupper()

    # Split the title into words and apply title casing selectively
    words = alt_title.split()
    alt_title = ' '.join(word if is_acronym(word) else word.title() for word in words)
    
    # Sort counties by length (longest first) to match longer names before shorter ones
    sorted_counties = sorted(counties_in_wisconsin, key=len, reverse=True)


    
    
    # Search for a city name in the title
    city_found = False
    for city in cities_in_wisconsin:
        if f"City of {city}" in alt_title or f"City Of {city}" in alt_title:
            # Pass count as a keyword; a positional count is deprecated in recent Python versions
            alt_title = re.sub(f"City Of {city}, Wi", f"[Wisconsin--{city}]", alt_title, count=1)
            alt_title = re.sub(f"City of {city}, Wi", f"[Wisconsin--{city}]", alt_title, count=1)
            city_found = True
            break

    # Search for a county name. Note that this runs even when a city was matched above,
    # because city_found is not checked here.
    for county in sorted_counties:
        # Flexible pattern: optional comma and whitespace between "County" and "WI";
        # re.escape keeps the period in names such as "St. Croix" literal
        regex_pattern = fr"{re.escape(county)} County,?\s*WI"
        if re.search(regex_pattern, alt_title, re.IGNORECASE):
            alt_title = re.sub(regex_pattern, f"[Wisconsin--{county} County]", alt_title, flags=re.IGNORECASE)
            break
    
    
    alt_title = re.sub("Wi ", "[Wisconsin] ", alt_title, count=1)
    alt_title = re.sub("Wiscsonsin ", "[Wisconsin] ", alt_title, count=1)
    alt_title = re.sub(" Wisconsin ", " [Wisconsin] ", alt_title, count=1)
    alt_title = re.sub(" Wisconsin, ", " [Wisconsin] ", alt_title, count=1)

            
    # Replace a single year or a year range with curly brackets
    date_pattern = r"\b\d{4}(?:-\d{4})?\b"
    dates = re.findall(date_pattern, alt_title)
    if dates:
        # Replace only the first occurrence of the year or year range
        alt_title = re.sub(re.escape(dates[0]), "{" + dates[0] + "}", alt_title, count=1)

    # Cleanup phrases post-transformation
    alt_title = re.sub(r",\s*\[", " [", alt_title)
    alt_title = re.sub(r",\s*\{", " {", alt_title)
    alt_title = re.sub(r":\s*\[", " [", alt_title)
    alt_title = re.sub(r"For \[", "[", alt_title)
    alt_title = re.sub(r"For The \[", "[", alt_title)
    alt_title = re.sub(r"For The City Of \[", "[", alt_title)
    alt_title = re.sub(r"Doqqs", "DOQQs", alt_title)
    alt_title = re.sub(r"Lidar", "LiDAR", alt_title)
    alt_title = re.sub(r"For City Of Superior \(Douglas County\) \[Wisconsin\]", "[Wisconsin--Superior]", alt_title)
    return alt_title

def move_place_and_date(title):
    # Move content in square brackets to the end
    square_bracket_pattern = r"\[.*?\]"
    match = re.search(square_bracket_pattern, title)
    if match:
        content_in_square_brackets = match.group()
        title = re.sub(square_bracket_pattern, "", title).strip()
        title = f"{title} {content_in_square_brackets}"

    # Move content in curly brackets to the end
    curly_bracket_pattern = r"\{.*?\}"
    match = re.search(curly_bracket_pattern, title)
    if match:
        content_in_curly_brackets = match.group()
        title = re.sub(curly_bracket_pattern, "", title).strip()
        title = f"{title} {content_in_curly_brackets}"

    return title


df['Title'] = df['dc_title_s'].apply(transform_title)
df['Title'] = df['Title'].apply(move_place_and_date)
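
The second step, moving the bracketed place and date to the end of the title, can be sketched with one small function applied twice. `move_to_end` is a hypothetical condensed variant of `move_place_and_date` that also collapses the double space left behind, and the sample title is made up.

```python
import re


def move_to_end(title, pattern):
    """Move the first match of pattern to the end of the title."""
    match = re.search(pattern, title)
    if not match:
        return title
    # Cut the match out, then tidy the whitespace it leaves behind
    remainder = title[:match.start()] + title[match.end():]
    remainder = re.sub(r"\s{2,}", " ", remainder).strip()
    return f"{remainder} {match.group()}"


title = "Parcels [Wisconsin--Dane County] {2020} Tax Data"
title = move_to_end(title, r"\[.*?\]")  # place first
title = move_to_end(title, r"\{.*?\}")  # then date, so the date ends up last
print(title)
```

Running the place pattern before the date pattern is what produces the "theme [place] {date}" order.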

### Use the newly formatted title to extract place names for the Spatial Coverage field

In [32]:
def extract_spatial_coverage(title):
    coverage = re.search(r'\[(.*?)\]', title)
    if coverage:
        coverage = coverage.group(1)
        return coverage if coverage.endswith('Wisconsin') else coverage + "|Wisconsin"
    return "Wisconsin"

# Apply the function to the "Title" column and assign the results to the "Spatial Coverage" column.
df['Spatial Coverage'] = df['Title'].apply(extract_spatial_coverage)

### Retain the original title in the Alternative Title field

In [33]:
df = df.rename(columns={'dc_title_s' : 'Alternative Title'})

### Convert the Creator field to the FAST format

In [34]:
def transform_creator(creator):
    # Dictionary mapping of creators for direct transformation
    creator_mappings = {
        "United States General Land Office" : "United States. General Land Office",
        "U.S. Geological Survey" : "Geological Survey (U.S.)",
        "Wisconsin Department of Transportation" : "Wisconsin. Department of Transportation",
        "U.S. Department of Agriculture" : "United States. Department of Agriculture",
        "Wisconsin Department of Natural Resources" : "Wisconsin. Department of Natural Resources",
        "Wisconsin Department of Administration" : "Wisconsin. Department of Administration"
    }
    
    # If a direct mapping is found, return the transformed value
    if creator in creator_mappings:
        return creator_mappings[creator]
    
    # Check for mixed values that start with the same string
#     if creator.startswith("Wisconsin Department of Tra"):
#         return "Wisconsin. Department of Transportation"
    
    # Leave missing (non-string) values untouched so the substring checks below do not fail
    if not isinstance(creator, str):
        return creator

    # Search for a county name in the creator string
    for county in counties_in_wisconsin:
        if county + " County" in creator:
            return f"Wisconsin--{county} County"

    # If no county matched, search for a city name
    for city in cities_in_wisconsin:
        if f"City of {city}" in creator or city == creator:
            return f"Wisconsin--{city}"

    # If no match is found, return the original creator string
    return creator

df['Creator'] = df['dc_creator_sm'].apply(transform_creator)
df = df.drop(columns=['dc_creator_sm'])

## Part 5: Export to a new CSV with Aardvark labels as headers

### Remove unnecessary columns

In [35]:
df = df.drop(columns=[
    'geoblacklight_version',
    'layer_modified_dt',  
    'solr_year_i',
    'layer_geom_type_s',
    'dc_language_s',
    'dc_identifier_s'
])

### Add some fields with default values

In [36]:
# Get the current date in yyyy-mm-dd format
today_date = datetime.date.today().isoformat()

# Add the "Date Accessioned" column with today's date to the DataFrame
df['Date Accessioned'] = today_date
df['Code'] = "10"
df['Is Part Of'] = "10d-03"
df['Member Of'] = "dc8c18df-7d64-4ff4-a754-d18d0891187d"
df['Accrual Method'] = "GBL-1.0"
df['Language'] = "eng"


### Rename the remaining columns

In [37]:
df = df.rename(columns={ 
    'dc_description_s': 'Description',
    'dct_issued_s': 'Date Issued',
    'dc_rights_s' : 'Access Rights',
    'dc_format_s': 'Format',
    'layer_slug_s' : 'ID',
    'layer_id_s' : 'WxS Identifier', 
    'dct_provenance_s' : 'Provider',
    'dc_publisher_s' : 'Publisher',
    'dc_publisher_sm' : 'Publisher',
    'dc_source_sm' : 'Source',
    'dct_temporal_sm' : 'Temporal Coverage',
    'dct_isPartOf_sm' : 'Keyword',
    'uw_notice_s' : 'Display Note'
})


In [38]:
df['Identifier'] = "https://geodata.wisc.edu/catalog/" + df['ID']

In [None]:
def trim_pipes_and_spaces(value):
    if isinstance(value, str):
        # Remove leading and trailing pipes and spaces (one strip call handles both ends)
        value = value.strip('| ')
        # Replace double or more spaces with a single space
        value = re.sub(r'\s{2,}', ' ', value)
        return value
    return value

# Apply the function to every cell of the DataFrame
# (DataFrame.map requires pandas 2.1 or later; older versions use df.applymap)
df = df.map(trim_pipes_and_spaces)
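
A quick way to see what this cleanup does is to run it on a few strings. The sketch below restates the function in compact form; the sample values are made up.

```python
import re


def trim_pipes_and_spaces(value):
    """Strip leading/trailing pipes and spaces and collapse runs of whitespace."""
    if isinstance(value, str):
        return re.sub(r"\s{2,}", " ", value.strip("| "))
    return value  # non-strings (numbers, NaN) pass through unchanged


print(trim_pipes_and_spaces("Some description|"))
print(trim_pipes_and_spaces("| Roads   and  Trails |"))
print(trim_pipes_and_spaces(42))
```

The first case is the one produced earlier when `uw_supplemental_s` is empty: the description ends with a dangling pipe, which this step removes.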

### Write the DataFrame to a CSV file with Aardvark labels

This can be uploaded to GBL Admin.

In [40]:
# Define the desired order of columns
desired_order = [
'Title',
'Alternative Title',
'Description',
'Language',
'Display Note',
'Creator',
'Publisher',
'Provider',
'Resource Class',
'Resource Type',
'Theme',
'Subject',
'Keyword',
'Temporal Coverage',
'Date Issued',
'Date Range',
'Spatial Coverage',
'Bounding Box',
'Geometry',
'Member Of',
'Is Part Of',
'Source',
'Format',
'WxS Identifier',
'Georeferenced',
'ID',
'Identifier',
'Rights',
'Rights Holder',
'License',
'Access Rights',
'Suppressed',
'Child Record',
'Date Accessioned',
'Code',
'Accrual Method'
]

# Reindex the DataFrame based on the desired order of columns
df = df.reindex(columns=desired_order)
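
`reindex(columns=...)` does more than reorder: columns missing from the DataFrame are created empty, and columns not in the list are dropped. A minimal sketch on a made-up frame (the `scratch` column and values are assumptions):

```python
import pandas as pd

demo = pd.DataFrame({"ID": ["demo-a"], "Title": ["Demo A"], "scratch": [1]})

# "Subject" does not exist yet, and "scratch" is not in the desired order
ordered = demo.reindex(columns=["Title", "Subject", "ID"])
csv_text = ordered.to_csv(index=False, na_rep="")
print(csv_text)
```

This is why headers in `desired_order` that were never populated still appear in the output CSV as empty columns, and why any working column left out of the list does not.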

In [41]:
df.to_csv("{}.csv".format(csv_name), index=False, na_rep='')