date normalization

A process to identify and normalize unstructured dates in ArchivesSpace.

Requirements

  • Python 3.4+
  • ArchivesSpace 2.1+
  • Access to ArchivesSpace database
  • Access to ArchivesSpace API
  • MySQL client
  • Ruby 2.0+
  • Timetwister

NOTE: parse_dates.py was written for OS X. It may work on a PC but it has not yet been tested.

Tutorial

Step 1: Identifying unstructured dates with get_unstructured_dates.sql

Returns a report of all date records in ArchivesSpace that have a value in the expression field but no values in the begin or end fields.

The output of this query should be used as the input for parse_dates.py.
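
The selection criterion used by the query can be sketched in Python. This is an illustration only: `find_unstructured_dates`, its `rows` argument, and the sample data are hypothetical, with field names mirroring columns of the ArchivesSpace date table.

```python
def find_unstructured_dates(rows):
    """Return date records that have an expression but no begin or end value.

    `rows` is a list of dicts standing in for rows of the ArchivesSpace
    date table (field names here are illustrative).
    """
    return [
        row for row in rows
        if row.get("expression") and not row.get("begin") and not row.get("end")
    ]

# Hypothetical sample data: only the first record qualifies
rows = [
    {"id": 1, "expression": "circa 1920s", "begin": None, "end": None},
    {"id": 2, "expression": "1941-1945", "begin": "1941", "end": "1945"},
    {"id": 3, "expression": None, "begin": "1950", "end": None},
]
unstructured = find_unstructured_dates(rows)
```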

Step 2: Parsing unstructured dates with parse_dates.py

To run:

$ cd git/yams_data_auditing/dates
$ python parse_dates.py

Follow the prompts in the terminal:

  1. Please enter path to log file - enter the desired file name or path for your log file. Ex: log.log or /Users/username/folder/log.log
  2. Please enter path to CSV - enter the filename or path to your input CSV file.
  3. Please enter path to output CSV - enter the filename or path to your output CSV file.
  4. Enter "Y" to split output into multiple spreadsheets by date type, or any key to continue - indicate whether you'd like a set of subreports produced in addition to the output CSV. These subreports are used as input for the database update scripts described below.

Step 3a: Updating dates via the ArchivesSpace database

This is the most accurate method for updating records in the date table, as it uses the date subrecord's database ID to identify and update the record.

However, subrecords in ArchivesSpace are not persistent - every time a top-level record (resource, archival object, etc.) is saved, all of its subrecords are deleted and recreated, and thus are assigned new database IDs. It is recommended to run the parse_dates.py script as close to the update as possible, so that fewer database IDs will have changed. It may be useful to run parse_dates.py, perform quality control on the results, and then run it again immediately before the update.

Preparing inputs

The parse_dates.py script includes options for several output types. In addition to the master report, you can also elect to split the master report into subreports based on the following criteria:

  1. begin_single.csv - dates with the same value in both the begin and end fields. The date type ID for single dates is 903.
  2. inclusive.csv - dates with values in both the begin and end fields. The date type ID for inclusive dates is 905.
  3. begin_inclusive.csv - dates with a value in the begin field but not the end field. The date type ID for inclusive dates is 905.
  4. end_inclusive.csv - dates with a value in the end field but not the begin field. The date type ID for inclusive dates is 905.
  5. multiples.csv - occasionally Timetwister parses an unstructured date into multiple dates. This report must be reviewed, as some of the parsed dates may be usable while others are errors.
  6. unparsed.csv - all dates that could not be parsed by Timetwister.
  7. errors - all dates that failed to parse due to a program error.

Reports 1-4 have date type IDs added, and can be used as input for the update_dates.sql script. Reports 5-7 are informational.
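
The routing rules for reports 1-4 above can be sketched as a small Python function. This is a simplified illustration, not the actual logic of parse_dates.py; the function name and return shape are assumptions, though the date type IDs (903 and 905) come from the list above.

```python
DATE_TYPE_SINGLE = 903     # ArchivesSpace enum ID for single dates
DATE_TYPE_INCLUSIVE = 905  # ArchivesSpace enum ID for inclusive dates

def classify_parsed_date(begin, end):
    """Return (subreport_name, date_type_id) for one parsed date.

    Mirrors the four update-ready subreports described above. Rows
    belonging in multiples.csv or errors are handled separately by
    the real script.
    """
    if begin and end and begin == end:
        return "begin_single", DATE_TYPE_SINGLE
    if begin and end:
        return "inclusive", DATE_TYPE_INCLUSIVE
    if begin:
        return "begin_inclusive", DATE_TYPE_INCLUSIVE
    if end:
        return "end_inclusive", DATE_TYPE_INCLUSIVE
    return "unparsed", None
```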

Creating temporary tables using create_datetemp_table.sql

Execute this script in a MySQL client or CLI. Be sure to comment out the correct lines depending on the date type. Create different temporary tables for inclusive and single date records.

Once the temporary tables are created, import the input spreadsheets into the tables. The process for doing this will depend on your MySQL client.

Running update_dates.sql

Execute this script to update your database. Be sure to comment out the correct lines depending on the date type. The script must be run separately for each date type; running them all together can introduce erroneous values into the database, due to differences in the number of fields and in the date types used in each spreadsheet.
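
The temp-table-then-update pattern behind create_datetemp_table.sql and update_dates.sql can be demonstrated end to end with an in-memory SQLite database. The schema, table names, and sample data below are simplified stand-ins: the real scripts target MySQL and the actual ArchivesSpace date table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified stand-in for the ArchivesSpace date table
# ("begin" and "end" are quoted because they are SQL keywords)
cur.execute('CREATE TABLE date (id INTEGER PRIMARY KEY, expression TEXT, '
            '"begin" TEXT, "end" TEXT, date_type_id INTEGER)')
cur.execute("INSERT INTO date VALUES (1, 'circa 1920s', NULL, NULL, NULL)")

# Temporary table holding parsed values, as imported from a subreport
cur.execute('CREATE TEMPORARY TABLE datetemp '
            '(id INTEGER, "begin" TEXT, "end" TEXT, date_type_id INTEGER)')
cur.execute("INSERT INTO datetemp VALUES (1, '1920', '1929', 905)")

# Update the date table from the temp table, matching on database ID
cur.execute('''
    UPDATE date
    SET "begin" = (SELECT "begin" FROM datetemp WHERE datetemp.id = date.id),
        "end" = (SELECT "end" FROM datetemp WHERE datetemp.id = date.id),
        date_type_id = (SELECT date_type_id FROM datetemp
                        WHERE datetemp.id = date.id)
    WHERE id IN (SELECT id FROM datetemp)
''')
conn.commit()
```

Matching on the database ID is what makes this method accurate - which is also why it is sensitive to the subrecord ID churn described in Step 3a.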

Step 3b: Updating dates via the ArchivesSpace API

This method is less accurate, as it matches records by URI and date expression (a free-text field) rather than by the date's database ID. However, users may prefer this option if performing a SQL update is undesirable, or if there is concern about changed date IDs.

Preparing inputs

Either the master report or the subreports can be used as input. If using the master report, date type IDs may need to be added, depending on the existing data. parse_dates.py will soon be updated so that this information is added to the master report.

Running update_dates_by_expression.py

To run:

$ cd git/yams_data_auditing/dates
$ python update_dates_by_expression.py

Follow the prompts in the terminal:

  • Please enter path to log file - enter the desired file name or path for your log file. Ex: log.log or /Users/username/folder/log.log
  • Please enter the ArchivesSpace API URL - enter your ArchivesSpace API URL
  • Please enter your username - enter your ArchivesSpace username
  • Please enter your password - enter your ArchivesSpace password
  • Please enter path to CSV - enter the filename or path to your input CSV
  • Please enter path to output directory - enter the path to where you want to store your backup JSON files
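
The core matching step of this approach can be sketched as a pure function over a record's JSON. The function name is hypothetical and the surrounding API calls (authentication, GET, POST, and writing backup JSON) are omitted; the `dates`, `expression`, `begin`, `end`, and `date_type` keys follow the ArchivesSpace date subrecord JSON.

```python
def update_matching_dates(record_json, expression, begin, end, date_type):
    """Update date subrecords whose expression matches a parsed row.

    `record_json` is the JSON of a top-level ArchivesSpace record
    (e.g. an archival object), as retrieved from the API. Returns the
    number of date subrecords updated; the caller is expected to POST
    the modified record back to the API.
    """
    updated = 0
    for date in record_json.get("dates", []):
        # Match on the free-text expression, since subrecord database
        # IDs churn whenever the parent record is saved
        if date.get("expression") == expression:
            if begin:
                date["begin"] = begin
            if end:
                date["end"] = end
            date["date_type"] = date_type
            updated += 1
    return updated
```

Because the expression field is free text, duplicate expressions within one record will all be updated with the same parsed values - one reason this method is less precise than the database update.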

Other Useful Tools

get_aos_without_dates.sql

Retrieves a report of archival objects that lack date subrecords.
