# Curation: Trade Statistics Data Curation Pilot


## About
- This script uploads the files listed in the inventory for the volume `Trade statistics of the treaty ports, for the period 1863-1872` to `demo.dataverse.org`. The curation approach creates one dataset per series name.
- **Created:** 2023/03/28
- **Updated:** 2023/04/03

## Globals
- Global variables for this script.
- Set variable values as needed (e.g., `g_dataverse_api_key`).

In [None]:
# set curation source path
g_module_path = './'

# path to output file
g_dataverse_inventory_file = './trade_statistics_inventory.csv'

# series names
g_series_names = []

# dataset inventories (keyed on series name)
g_series_inventories = {}

# dataset metadata (keyed on series name)
g_dataset_metadata = {}

# dataverse installation
g_dataverse_installation_url = 'https://demo.dataverse.org'

# dataverse API key
g_dataverse_api_key = 'xxxxxx'

# dataverse collection name
g_dataverse_collection = 'trade_statistics'

# dataverse inventory dataframe
g_dataverse_inventory_df = None

# dataset author
g_dataset_author = 'Last, First'

# dataset author affiliation
g_dataset_author_affiliation = 'Harvard Library'

# dataset contact information
g_dataset_contact = 'Last, First'
g_dataset_contact_email = 'last_first@harvard.edu'

# full path to location of datafiles (e.g., ../data/trade_statistics)
g_datafiles_path = 'xxxxxxxx'

# demo dataverse dataset information (keyed on series name)
g_dataverse_dataset_info = {}

# datafile metadata (dataframe of datafile metadata, keyed on series name)
g_datafile_metadata = {}

# datafile description template
g_datafile_description_template = 'File associated with data tables series:'

# dataset batches (array of batches of series to create/upload)
g_batches = []
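
A key typed directly into the globals cell is easy to leak when the notebook is shared. One possible alternative is to read it from the environment. This is a minimal sketch: the environment variable name `DATAVERSE_API_KEY` and the helper `read_api_key` are assumptions for illustration and are not part of the original script.

```python
import os

def read_api_key(env_var='DATAVERSE_API_KEY', default='xxxxxx'):
    """Return the API key from the environment, or a placeholder if unset."""
    # env_var is a hypothetical name; use whatever your environment defines
    return os.environ.get(env_var, default)

# falls back to the placeholder when the variable is not set
g_dataverse_api_key = read_api_key()
```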

## Modules

- Add local modules path to Jupyter system path
- Load all modules including local modules such as `curate`

In [None]:
import sys
if g_module_path not in sys.path:
    sys.path.append(g_module_path)

import curate
import numpy as np
import pandas as pd
import pprint
from pyDataverse.api import NativeApi

## Local Functions

In [None]:
# get a dictionary of dataset pids keyed on series name
def get_dataset_pids(batch, dataset_info):
    pids = {}
    for series_name in batch:
        pids[series_name] = dataset_info[series_name].get('dataset_pid')
    return pids

# get dictionary of datafile inventories keyed on series name
def get_datafile_inventories(batch, datafile_metadata):
    inventories = {}
    for series_name in batch:
        inventories[series_name] = datafile_metadata[series_name]
    return inventories 

# upload the datafiles associated with a batch
def upload_dataset_batch(api, dataverse_url, batch_list, batch_pids, batch_datafile_metadata, data_directory):
    # upload the datafiles associated with each series in the batch
    results = {}
    for series_name in batch_list:
        pid = batch_pids[series_name]
        datafiles_metadata = batch_datafile_metadata[series_name]
        results[series_name] = curate.direct_upload_datafiles(api, dataverse_url, pid, data_directory, datafiles_metadata)
    return results
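
The helper `get_dataset_pids` uses `dict.get`, so a series whose dataset information has no `dataset_pid` entry maps to `None` rather than raising an error. A small illustration with made-up data (the series names and DOI strings below are hypothetical) shows why it may be worth checking for `None` before uploading:

```python
# same behaviour as the helper defined above, repeated so the example runs alone
def get_dataset_pids(batch, dataset_info):
    return {name: dataset_info[name].get('dataset_pid') for name in batch}

# hypothetical dataset info, shaped like the values kept in g_dataverse_dataset_info
example_dataset_info = {
    'Series A': {'dataset_pid': 'doi:10.5072/FK2/AAAAAA'},
    'Series B': {'dataset_pid': 'doi:10.5072/FK2/BBBBBB'},
    'Series C': {},  # no pid was recorded for this series
}

example_pids = get_dataset_pids(['Series A', 'Series C'], example_dataset_info)

# series without a pid cannot receive datafiles, so list them first
example_missing = [name for name, pid in example_pids.items() if pid is None]
print(example_missing)
```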

## Curate Inventory

### 1. Prepare inventory data for curation

#### 1.1 Read `dataverse_inventory`
- Create a `DataFrame` for later use
- Note: The `csv` entries with `table_type` = Missing had to be deleted because those files did not appear in the inventory.
- Note: Two additional file paths were removed from the inventory: `005825557_pt1_00118.innodata.csv` and `005825557_pt2_00127.innodata.csv`. Although these two files appeared in the METS file, they did not appear among the original files named by DRS id, and therefore were not renamed using the owner-supplied naming scheme.
- Note: The `curate.direct_upload_datafiles` function expects all files to be in a single directory (not grouped by file type).

In [None]:
# read the dataverse inventory file
g_dataverse_inventory_df = pd.read_csv(g_dataverse_inventory_file,index_col=None,low_memory=False)
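
Because the upload step expects every datafile in one directory, files that are grouped by type on disk need to be flattened first. The following is a sketch of one way to do that, assuming the flat directory lies outside the source tree; the helper name `flatten_datafiles` is hypothetical and not part of the `curate` module.

```python
import shutil
from pathlib import Path

def flatten_datafiles(source_root, flat_dir):
    """Copy every file under source_root into flat_dir; return skipped name clashes."""
    flat_dir = Path(flat_dir)
    flat_dir.mkdir(parents=True, exist_ok=True)
    clashes = []
    # sorted() fixes the copy order, so the first file with a given name wins
    for path in sorted(Path(source_root).rglob('*')):
        if not path.is_file():
            continue
        target = flat_dir / path.name
        if target.exists():
            # two files share a name; keep the first and report the second
            clashes.append(str(path))
            continue
        shutil.copy2(path, target)
    return clashes
```

Any paths returned in the clash list were not copied and would need to be renamed by hand before an upload.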

#### 1.2 Create Dataset Inventories
- Get the list of series names
- Create a `dict` of file inventories keyed on series name

In [None]:
# get list of series in the full inventory
g_series_names = list(g_dataverse_inventory_df.series_name.unique())

# create series inventories
for name in g_series_names:
    # get series inventory
    g_series_inventories[name] = g_dataverse_inventory_df.loc[g_dataverse_inventory_df['series_name'] == name]

pprint.pprint(g_series_names)
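
The same per-series split can also be expressed with `groupby`. The sketch below uses a made-up inventory: the place names and file names are hypothetical, and only the `series_name` and `filename_osn` column names come from this script.

```python
import pandas as pd

# hypothetical inventory rows for illustration only
example_inventory_df = pd.DataFrame({
    'series_name': ['Amoy', 'Canton', 'Amoy'],
    'filename_osn': ['amoy_1.csv', 'canton_1.csv', 'amoy_2.csv'],
})

# sort=False keeps the series in order of first appearance, as unique() does
example_inventories = {
    name: group
    for name, group in example_inventory_df.groupby('series_name', sort=False)
}

# number of files in each series inventory
print({name: len(group) for name, group in example_inventories.items()})
```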

#### 1.3 Create Dataset Metadata
- Create a `dict` of dataset metadata extracted from each inventory

In [None]:
# for each series name, create dataset metadata
for series_name in g_series_names:
    # get series inventory
    series_inventory = g_series_inventories[series_name]
    md = curate.create_dataset_metadata(g_dataset_author, g_dataset_author_affiliation, 
                                        g_dataset_contact, g_dataset_contact_email,
                                        series_name, series_inventory)
    g_dataset_metadata[series_name] = md

pprint.pprint(g_dataset_metadata)

#### 1.4 Create Datafile Metadata
- Create a `dict` of `DataFrames` containing metadata about individual files

In [None]:
# for each series name, create datafile metadata from its inventory
for series_name in g_series_names:
    # get the series inventory
    series_inventory_df = g_series_inventories[series_name]
    # create datafile metadata
    g_datafile_metadata[series_name] = curate.create_datafile_metadata(series_inventory_df, g_datafile_description_template)

#### 1.5 Create Series Batches
- Create a set of (approximately) equal-length batches of series (to create datasets and upload datafiles)
- Generally, there are too many series in a volume to create the related datasets and then upload all their datafiles in a single tight loop. Therefore, it's useful to create batches of these series and perform the create/upload operation on a single batch at a time.

In [None]:
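# target number of series in a batch (batches may be slightly larger)
batch_size = 5
# the section count must be a whole number of at least 1
n_batches = max(1, len(g_series_names) // batch_size)
g_batches = np.array_split(g_series_names, n_batches)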

pprint.pprint(g_batches)
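
If a hard upper limit on batch size matters, the number of batches can be rounded up instead of down. This is a sketch under that assumption; the helper name `make_batches` and the example series names are hypothetical.

```python
import math
import numpy as np

def make_batches(names, batch_size):
    """Split names into near-equal batches of at most batch_size items."""
    if not names:
        return []
    # rounding up guarantees that no batch exceeds batch_size
    n_batches = math.ceil(len(names) / batch_size)
    # tolist() converts the NumPy arrays back to plain Python lists
    return [batch.tolist() for batch in np.array_split(names, n_batches)]

example_names = ['series_{:02d}'.format(i) for i in range(12)]
example_batches = make_batches(example_names, 5)
print([len(batch) for batch in example_batches])  # [4, 4, 4]
```

Note that rounding up can produce one more batch than rounding down, so the number of upload cells below would need to match.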

### 2. Initialize `pyDataverse` API
- Use `pyDataverse` to create an API client for the Dataverse installation

In [None]:
# set pyDataverse API adapter
g_api = NativeApi(g_dataverse_installation_url, g_dataverse_api_key)

# print results
print('{}'.format(g_api))

### 3. Create Datasets and Upload Datafiles

#### 3.1 Create all datasets
- For each series name, create a dataset and retain status information

In [None]:
# for each series, create a dataset and save its information
for series_name in g_series_names:
    # get the series metadata
    series_metadata = g_dataset_metadata[series_name]
    # create the dataset
    g_dataverse_dataset_info[series_name] = curate.create_dataset(g_api, g_dataverse_collection, series_metadata)

pprint.pprint(g_dataverse_dataset_info)

#### 3.2 Upload dataset datafiles, one batch at a time
- Upload the datafiles associated with each dataset in a batch

In [None]:
# Batch 0
index = 0
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)
pprint.pprint(pids)
pprint.pprint(datafile_metadata)

In [None]:
# Batch 1
index = 1
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 2
index = 2
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 3
index = 3
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 4
index = 4
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 5
index = 5
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 6
index = 6
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 7
index = 7
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 8
index = 8
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 9
index = 9
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 10
index = 10
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 11
index = 11
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 12
index = 12
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 13
index = 13
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)

In [None]:
# Batch 14
index = 14
batch = g_batches[index]
pids = get_dataset_pids(batch, g_dataverse_dataset_info)
datafile_metadata = get_datafile_inventories(batch, g_datafile_metadata)
print('Uploading batch: {}, series: {}'.format(index, batch))
errors = upload_dataset_batch(g_api, g_dataverse_installation_url, 
                              batch, pids, datafile_metadata, g_datafiles_path)
pprint.pprint(errors)
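
The cells above run one batch each, which allows the results of a batch to be inspected before the next one starts. Where unattended runs are acceptable, the repeated cells could be folded into a loop. This is only a sketch: `upload_all_batches`, its `stop_on_error` flag, and the stand-in `fake_upload` are hypothetical names, and no network calls are made here.

```python
def upload_all_batches(batches, upload_batch, stop_on_error=True):
    """Call upload_batch(index, batch) for each batch and collect the results."""
    results = {}
    for index, batch in enumerate(batches):
        print('Uploading batch: {}, series: {}'.format(index, list(batch)))
        try:
            results[index] = upload_batch(index, batch)
        except Exception as exc:
            # record the failure so that this batch can be retried later
            results[index] = exc
            if stop_on_error:
                break
    return results

# stand-in uploader for illustration; it reports no errors for any series
def fake_upload(index, batch):
    return {name: [] for name in batch}

example_results = upload_all_batches([['a', 'b'], ['c']], fake_upload)
```

In this notebook, the callable passed as `upload_batch` could wrap `get_dataset_pids`, `get_datafile_inventories`, and `upload_dataset_batch` for a single batch.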

#### 3.3 Publish datasets

In [None]:
# reload the local curate module to pick up any changes
import importlib
importlib.reload(curate)

# publish the datasets
errors = curate.publish_datasets(g_api, g_dataverse_collection, version='major')

pprint.pprint(errors)

-----------

## Test Curation Process

### Test: Create a single dataset
This test allows users to create a single dataset and upload its related datafiles. 
Useful for troubleshooting and to test other collections.

#### 1. Test: Create datafile metadata

In [None]:
# create datafile metadata
# get the first series
first_series = g_series_names[0]
first_series_metadata = g_dataset_metadata[first_series]
first_series_inventory_df = g_series_inventories[first_series]

# set the template
template = 'File associated with data tables series:'
datafile_metadata_df = curate.create_datafile_metadata(first_series_inventory_df, template)

#### 2. Test: Create the dataset

In [None]:
# create the test dataset
dataset_ret = curate.create_dataset(g_api, g_dataverse_collection, first_series_metadata)
pprint.pprint(dataset_ret)

#### 3. Test: Direct upload the datafiles associated with the dataset (series name)

In [None]:
# upload the series dataset datafiles 
pid = dataset_ret.get('dataset_pid')
ret = curate.direct_upload_datafiles(g_api, g_dataverse_installation_url, pid, g_datafiles_path, datafile_metadata_df)

#### 4. Test: Examine a directory to make certain all files exist before attempting an upload of datafiles

In [None]:
# test to see if all files are there and report the ones that aren't

import os
# maps each expected file path to True if it exists, False otherwise
errors = {}
for _, row in g_dataverse_inventory_df.iterrows():
    filename = row.get('filename_osn')
    filepath = os.path.join(g_datafiles_path, filename)
    if os.path.exists(filepath):
        errors[filepath] = True
    else:
        print('File not found: {}'.format(filepath))
        errors[filepath] = False
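
The same check can be wrapped in a function that returns only the missing names, which is easier to act on than a dictionary of booleans. This is a sketch; the helper name `find_missing_files` is hypothetical.

```python
import os

def find_missing_files(filenames, data_directory):
    """Return the file names that do not exist in data_directory."""
    missing = []
    for filename in filenames:
        if not os.path.isfile(os.path.join(data_directory, filename)):
            missing.append(filename)
    return missing

# possible use in this notebook (not executed here):
# missing = find_missing_files(g_dataverse_inventory_df['filename_osn'], g_datafiles_path)
```

An empty result means that every file named in the inventory is present in the directory.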

#### 5. Test: Delete all the datasets in the collection and start again
- WARNING: This is a permanent operation. Be very certain you want to perform this operation!

In [None]:
# delete all the datasets
# ARE YOU SURE ABOUT THIS? if so, uncomment the next line and execute
#ret = curate.delete_datasets(g_api, g_dataverse_collection)

**End document.**