A process to identify and normalize unstructured dates in ArchivesSpace.
Requirements:
- Python 3.4+
- ArchivesSpace 2.1+
- Access to ArchivesSpace database
- Access to ArchivesSpace API
- MySQL client
- Ruby 2.0+
parse_dates.py was written for OS X. It may work on Windows but has not yet been tested.
Step 1: Identifying unstructured dates
This query returns a report of all date records in ArchivesSpace that have a value in the expression field but no values in the begin or end fields.
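The query has roughly the following shape. This is a sketch, not the script shipped with the repository: the table and column names follow the standard ArchivesSpace MySQL schema, but they should be verified against your ArchivesSpace version.

```python
# Sketch of the Step 1 query: find date records that have an expression
# but no begin or end value. `date`, `begin`, and `end` are reserved or
# near-reserved words in MySQL, so they are backtick-quoted.
def build_unstructured_dates_query():
    return """
        SELECT id, expression, `begin`, `end`
        FROM `date`
        WHERE expression IS NOT NULL
          AND `begin` IS NULL
          AND `end` IS NULL
    """
```

The rows returned by this query can be exported to CSV and used as the input file in Step 2.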
The output of this query should be used as the input for parse_dates.py.
Step 2: Parsing unstructured dates with parse_dates.py
$ cd git/yams_data_auditing/dates
$ python parse_dates.py
Follow the prompts in the terminal:
- Please enter path to log file - enter the desired file name or path for your log file. Ex: log.log or /Users/username/folder/log.log
- Please enter path to CSV - enter the filename or path to your input CSV file.
- Please enter path to output CSV - enter the filename or path to your output CSV file.
- Enter "Y" to split output into multiple spreadsheets by date type, or press any other key to continue - indicate whether you'd like a set of subreports to be produced in addition to the output CSV. These subreports are used as input for the database update scripts described below.
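To illustrate the kind of transformation the script performs, here is a minimal sketch of turning an unstructured date expression into machine-readable begin/end values. It is not the actual timetwister logic (which handles far more patterns); the three patterns shown are illustrative assumptions.

```python
import re

def parse_expression(expression):
    """Return (begin, end) year strings for a few simple patterns, else (None, None)."""
    expr = expression.strip().lower()
    # Single year, e.g. "1923"
    m = re.fullmatch(r"(\d{4})", expr)
    if m:
        return (m.group(1), m.group(1))
    # Year range, e.g. "1923-1945"
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", expr)
    if m:
        return (m.group(1), m.group(2))
    # "circa" dates, e.g. "circa 1923" or "ca. 1923"
    m = re.fullmatch(r"(?:circa|ca\.?)\s*(\d{4})", expr)
    if m:
        return (m.group(1), m.group(1))
    # Anything else would land in the unparsed report
    return (None, None)
```

An expression like "circa 1923" would yield a begin and end of "1923", while "undated" would remain unparsed.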
Step 3a: Updating dates via the ArchivesSpace database
This is the most accurate method for updating records in the date table, as it uses the date subrecord's database ID to identify and update the record. However, subrecords in ArchivesSpace are not persistent: every time a top-level record (resource, archival object, etc.) is saved, all of its subrecords are deleted and recreated, and are thus assigned new database IDs. It is recommended to run the parse_dates.py script as close to the update as possible, so that fewer database IDs will have changed. It may be useful to run the parse_dates.py script, perform quality control on the results, and then run it again immediately before the update.
The parse_dates.py script includes options for several output types. In addition to the master report, you can also elect to split the master report into subreports based on the following criteria:
- begin_single.csv - dates with the same value in both the begin and end fields. The date type ID for single dates is 903
- inclusive.csv - dates with different values in the begin and end fields. The date type ID for inclusive dates is 905
- begin_inclusive.csv - dates with a value in the begin field but not the end field. The date type ID for inclusive dates is 905
- end_inclusive.csv - dates with a value in the end field but not the begin field. The date type ID for inclusive dates is 905
- multiples.csv - occasionally timetwister parses an unstructured date into multiple dates. This report must be reviewed, as some of the parsed dates may be usable, while others are errors.
- unparsed.csv - all dates that could not be parsed by timetwister
- errors - all dates that failed to parse due to a program error
Reports 1-4 have date type IDs added, and can be used as input for the update_dates.sql script. Reports 5-7 are informational.
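The routing into subreports described above can be sketched as a small function: given the parsed begin/end values for a row, decide which subreport it belongs to and which ArchivesSpace date type ID applies (903 for single, 905 for inclusive, per the list above).

```python
def classify_row(begin, end):
    """Return (subreport_filename, date_type_id) for one parsed date row."""
    if begin and end:
        if begin == end:
            # Same value in both fields: a single date
            return ("begin_single.csv", 903)
        # Different values in both fields: an inclusive date
        return ("inclusive.csv", 905)
    if begin:
        # Begin only: treated as inclusive
        return ("begin_inclusive.csv", 905)
    if end:
        # End only: treated as inclusive
        return ("end_inclusive.csv", 905)
    # Nothing parsed: route to the informational unparsed report
    return ("unparsed.csv", None)
```

This is an illustration of the splitting criteria, not the repository's implementation; the multiples and errors reports are produced by separate checks during parsing.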
Creating temporary tables
Execute this script in a MySQL client or CLI. Be sure to comment out the correct lines depending on the date type. Create different temporary tables for inclusive and single date records.
Once the temporary tables are created, import the input spreadsheets into the tables. The process for doing this will depend on your MySQL client.
Execute the update_dates.sql script to update your database. Be sure to comment out the correct lines depending on the date type. The script must be run separately for each date type. Running them all together can lead to erroneous values being introduced into the database, due to differences in the number of fields and in the date types used in each spreadsheet.
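The update performed for one date type has roughly the following shape. This sketch builds the statement in Python for illustration; the temporary table name, the date_id join column, and the exact column names are assumptions, so consult the update_dates.sql script shipped with the repository for the real statements.

```python
# Sketch of the per-date-type update: join the date table to a temporary
# table loaded from one subreport, and copy over the parsed values.
def build_update_statement(temp_table, date_type_id):
    return f"""
        UPDATE `date` d
        JOIN {temp_table} t ON d.id = t.date_id
        SET d.`begin` = t.`begin`,
            d.`end` = t.`end`,
            d.date_type_id = {date_type_id}
    """
```

Running one statement per date type, with the matching temporary table and ID (903 or 905), is what keeps mismatched fields out of the database.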
Step 3b: Updating dates via the ArchivesSpace API
This method is less accurate, as it requires matching by URI and date expression, a free-text field, rather than by the date ID. However, users may favor this option if performing an SQL update is undesirable, and if there is concern about changing date IDs.
Either the master report or the subreports can be used as input. If using the master report, date type IDs may need to be added depending on the existing data. A future update to parse_dates.py will add this information to the master report.
$ cd git/yams_data_auditing/dates
$ python update_dates_by_expression.py
Follow the prompts in the terminal:
- Please enter path to log file - enter the desired file name or path for your log file. Ex: log.log or /Users/username/folder/log.log
- Please enter the ArchivesSpace API URL - enter your ArchivesSpace API URL
- Please enter your username - enter your ArchivesSpace username
- Please enter your password - enter your ArchivesSpace password
- Please enter path to CSV - enter the filename or path to your input CSV
- Please enter path to output directory - enter the path to where you want to store your backup JSON files
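The matching step in this method can be sketched as a pure function over a record's JSON: find date subrecords whose expression matches the spreadsheet row, and fill in the parsed values. Field names follow the ArchivesSpace date JSON schema; the function name and signature are illustrative, not the script's actual API.

```python
def update_matching_dates(record_json, expression, begin, end, date_type=None):
    """Mutate record_json in place; return the number of date subrecords updated."""
    updated = 0
    for date in record_json.get("dates", []):
        # Expression is a free-text field, so matching is exact-string and
        # inherently less reliable than matching on a database ID.
        if date.get("expression") == expression:
            date["begin"] = begin
            if end:
                date["end"] = end
            if date_type:
                date["date_type"] = date_type
            updated += 1
    return updated
```

The script saves the original JSON to the output directory as a backup, then posts the modified JSON back to the record's URI via the API.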
Other Useful Tools
Retrieves a report of archival objects that lack date subrecords.