# Transform GeoBlacklight JSONs into a CSV (Wisconsin)

**Purpose: This script reads a batch of GeoBlacklight metadata JSON files submitted by the University of Wisconsin and transforms them into a single CSV.**

Metadata records in the [GeoBlacklight](https://opengeometadata.org/docs/gbl-1.0) or [OpenGeoMetadata](https://opengeometadata.org/docs/ogm-aardvark) standards are frequently shared as batches of JSON files. The repositories of the [OpenGeoMetadata organization](https://github.com/OpenGeoMetadata) together contain hundreds of thousands of GeoBlacklight JSON files.

In order to ingest these into the BTAA Geoportal, we need to transform them into a CSV.  


## 1. Import Python modules

In [1]:
import csv
import json
import os
import pandas as pd

## 2. Declare the paths and file names

Place the folder of JSON files in the same directory as this notebook. The files can be nested in subfolders.

In [2]:
json_path = r" " # enter the name of the folder
csv_name = " " # create a name for the output CSV without the .csv extension

## 3. Load the files into a pandas DataFrame

In [3]:
dataset = [] # empty list to collect each parsed record

# walk through all folders, parse each JSON file, and append it to the dataset list
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # read as bytes and decode, ignoring any invalid UTF-8 sequences;
            # the with block closes the file once it has been read
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            dataset.append(loaded)

df = pd.DataFrame(dataset) # convert dataset into dataframe
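
Because each JSON file may carry a different set of fields, it helps to see how `pd.DataFrame` treats a list of dictionaries with uneven keys. The sketch below uses two made-up records (the IDs, titles, and creator are placeholders, not Wisconsin data): a key missing from a record becomes `NaN` in that row, and a multivalued field stays a Python list inside its cell.

```python
import pandas as pd

# Two hypothetical records; the second one has no dc_creator_sm key
records = [
    {"layer_slug_s": "example-001", "dc_title_s": "Roads", "dc_creator_sm": ["Agency A"]},
    {"layer_slug_s": "example-002", "dc_title_s": "Parcels"},
]

sample_df = pd.DataFrame(records)

# Columns are the union of all keys, in order of first appearance
print(sample_df.columns.tolist())

# The missing creator shows up as NaN rather than raising an error
print(sample_df["dc_creator_sm"].isna().tolist())
```

This is why the next step has to unwrap list values before the table is written out.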

## 4. Edit the values of various fields

In [4]:
# return the first value of a multivalued cell; this removes the [] and drops any additional values
df['dc_creator_sm'] = df['dc_creator_sm'].str[0]
df['dc_subject_sm'] = df['dc_subject_sm'].str[0]

# remove brackets from Temporal Coverage which is a mix of single values and lists
# other methods like split or .str[0] return weird results because of the mixed values
# .str.join('') takes each item, whether a list or a single character, and joins them with nothing in between
df['dct_temporal_sm']=df['dct_temporal_sm'].str.join('')

# Split solr_geom coordinates and reorder from WENS to WSEN
df[['w', 'e','n','s']] = df['solr_geom'].str.strip('ENVELOPE()').str.split(',', expand=True)
df['Bounding Box'] = df[['w', 's','e','n']].agg(', '.join, axis=1) 

# Convert Geometry Type to Resource Type value
df['Resource Type'] = df['layer_geom_type_s'].astype(str) + ' data'

# Create Date Range field, using the temporal value as both the start and the end
df['Date Range'] = df['dct_temporal_sm'].astype(str) +'-' + df['dct_temporal_sm'].astype(str) 


# To do: figure out how to split the key:value pairs in the references cells
# df['dct_references_s'] = df['dct_references_s'].str.split(',', expand=True)
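
One possible approach to the to-do above: in GeoBlacklight 1.0 the `dct_references_s` value is a JSON object serialized as a string, so splitting on commas is fragile, while `json.loads` recovers the key:value pairs directly. The sketch below is an assumption about how this could be done, not part of the original workflow. The sample URLs are placeholders, and the helper `parse_references` and the output column names `Information` and `Download` are hypothetical choices.

```python
import json
import pandas as pd

# Hypothetical sample: one row with references, one row without
sample = pd.DataFrame({
    "dct_references_s": [
        '{"http://schema.org/url": "https://example.org/info", '
        '"http://schema.org/downloadUrl": "https://example.org/data.zip"}',
        None,
    ]
})

def parse_references(value):
    """Return the references as a dict, or an empty dict if missing or invalid."""
    if not isinstance(value, str):
        return {}
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}

refs = sample["dct_references_s"].apply(parse_references)

# Pull selected reference URIs into their own columns (column names are assumptions)
sample["Information"] = refs.apply(lambda r: r.get("http://schema.org/url"))
sample["Download"] = refs.apply(lambda r: r.get("http://schema.org/downloadUrl"))
```

Rows with no usable references end up with empty cells in the new columns instead of stopping the run.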

## 5. Remove unnecessary columns

In [5]:
df = df.drop(columns=[
    'geoblacklight_version',
    'layer_modified_dt', 
    'thumbnail_path_ss',
    'w','e','n','s', 
    'layer_id_s',
    'solr_year_i',
    'layer_geom_type_s',
    'solr_geom'
])
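
`DataFrame.drop` raises a `KeyError` if any listed column is absent, which can happen when a batch does not include every field. As a hedged alternative, passing `errors="ignore"` skips missing names. The sample data below is invented for illustration.

```python
import pandas as pd

# Hypothetical frame that lacks thumbnail_path_ss
sample = pd.DataFrame({
    "dc_title_s": ["Roads"],
    "solr_geom": ["ENVELOPE(-92, -87, 47, 42)"],
})

unwanted = ["solr_geom", "thumbnail_path_ss"]  # the second column does not exist here

# errors="ignore" drops the columns that exist and silently skips the rest
trimmed = sample.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```

The trade-off is that a misspelled column name will also be skipped silently, so the strict default remains useful when every batch is expected to have the same fields.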

## 6. Rename columns

In [6]:
df = df.rename(columns={
    'dc_title_s': 'Title',
    'dc_description_s': 'Description',
    'dc_creator_sm': 'Creator',
    'dct_issued_s': 'Date Issued',
    'dc_rights_s': 'Access Rights',
    'dc_format_s': 'Format',
    'layer_slug_s': 'ID',
    'dc_identifier_s': 'Identifier',
    'dc_language_s': 'Language',
    'dct_provenance_s': 'Provider',
    'dc_publisher_s': 'Publisher',
    'dc_publisher_sm': 'Publisher',
    'dc_source_sm': 'Source',
    'dct_spatial_sm': 'Spatial Coverage',
    'dc_subject_sm': 'Subject',
    'dct_temporal_sm': 'Temporal Coverage',
})


## 7. Write to a CSV file

In [7]:
df.to_csv("{}.csv".format(csv_name))
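
By default `to_csv` also writes the DataFrame's row index as an unnamed first column. If that extra column is not wanted in the output, `index=False` leaves it out. The sketch below writes a small invented frame to a temporary folder and reads the header back; the file name `sample.csv` is a placeholder.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-column frame standing in for the transformed metadata
sample = pd.DataFrame({"Title": ["Roads"], "ID": ["example-001"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample.csv")

    # index=False omits the row index; encoding is set explicitly to UTF-8
    sample.to_csv(out_path, index=False, encoding="utf-8")

    # Read the first line back to confirm the header has no index column
    with open(out_path, encoding="utf-8") as f:
        header = f.readline().strip()

print(header)
```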