Skip to content

datascience/RDA-WGDC-CSV-Data-Citation-Prototype

Repository files navigation

Data Citation for CSV Data

This tool allows researchers to create citable subsets from CSV data. Arbitrary CSV files can be uploaded and a Web interface supports users at selecting specific subsets. The system automatically monitors the subset creation process and assigns a persistent identifier (PID) to the subset. CSV data sets can be updated (records can be added, deleted or modified) and all data is versioned. Thus earlier versions of a specific subset can be retrieved even if the data is evolving.

Background: Data Citation

Sharing research data is becoming increasingly important as it enables peers to validate and reproduce data driven experiments. As researchers work with different data sets and highly specific individual subsets, the knowledge how subsets have been created is worth preserving. The most common way of creating a subset of a potentially large data set is to make a selection based on filtering out those records which are not needed and only include data which fulfils a given criteria.

The QueryStore

The generic approach to perform such filter operations is to use either use a query language which allows to describe which data to be included and which records should be omitted. SQL is such a general purpose query language. For end users, interfaces exist which can hide the complexity of query languages by providing forms where users can make their selections visually or by entering appropriate values. The QueryStore stored these queries (thus the name) and makes them available for later reuse. Whenever a reseacher uses a query (via an interface) in order to create a subset, the parameters, their order and the sortings applied can be stored. Based on temporal metadata which is collected automatically, the query can be re-executed.

Query Store Features:

The Query store is implemented in Java and provides an API for the most common Tasks. It uses Hibernate to store the entities (details below).

  • Create new queries
  • Add metadata about users
  • Add descriptive text
  • Attach the persistent identifier of the data source (details below)
  • Generate a persistent identifier for the query itself (details below)
  • Append a hash of the result set
  • Calculate a hash of the query to detect duplicate queries
  • Add arbitrary filters in a key value fashion (e.g. 'instrumentName';'tempreatureSensor'). This is used to mapp the interface input fields and their respective values which have been entered.
  • Add arbitrary many sortings either in ascending or desceinding order
  • Detect query duplicates
  • Create timestamps automatically for each insert and update.
  • Full audit log.

Persistent Identifaction Service

Identifying datasets is essential for future reuse. Storing exported datasets on a Web server and providing the URL for its retrieval is not enough. A simle change in the directory structure of the server renders a URL obsolete and the dataset can not be retrieved again. Hence a more sophisticated way of referencing datasets is needed. The concept of persistent identifiers (PID) deals with this problem. Several different PID systems exist and they all follow similar principles. A resolver service is introduced which can resolve an identifier to its URL. The actual URL may change during the course of time, but the identifier remains always the same (it is persistent). Thus whenever a digital object has to be moved to a different location (e.g.. server filesystem updates or similar events), its location may be updated and the identifier can then refer to its new location.

Persistent Identification Service Features

The Query store is implemented in Java and provides an API for the most common Tasks. It uses Hibernate to store the entities (details below). The following features are currently implemented in the API:

  • Create PIDs of the form prefix/identifier.
  • Create alphabetical identifiers of arbitrary length (e.g. zjqpU).
  • Create alphanumeric identifiers of arbitrary length (YouTube style, eg. qsn4zPVRA7a0).
  • Create numeric identifiers of arbitrary length (e.g. 08652).
  • Store one URI with each identifier
  • Update URIs (the identifier can neither be deleted nor updated via the API)
  • Create organizations and prefixes (prefixes are unique).
  • One organization can mit arbitrary many identifiers per prefix
  • Identifiers are uniqie within one prefix and therefore the complete PID is unique.
  • Resolve PIDs to URLs
  • Retrieve latest PID pre organization
  • Retrieve latest PID in the database.
  • Print details
  • Full audit log

CSV Data Citation

Users should be able to upload CSV files which they want to make citable. The CSV files are previously unkown to the server, therefore the files are analyzed and a new table schema is created automatically.

Workfows

The following secenarios are considered.

A new file

The user uploads a new file to the server. The user needs to specify a primary key. If no primary key can be specified, updates can not be detected automatically. The server then server creates a new table schema and adds metadata.

Appending data

The user appends data to an existing file. The file needs to have the same structure as the original file and only new records may be contained in the file. If a record is already there, an error is thrown.

Updating data

A user provides a file which contains updated records or new records. Existing records are identified via their primary key. When a primary key already exists in the database, the records gets updated and a new timestamp gets inserted.

Deleting records

The user uploads a file where rows have been deleted.

How to use this Code?

Dislaimer

This project is a research prototype. This means that this software is provided "as is", without any warranty of any kind. Feedback is welcome, but no support can be given at this point in time.

Software Stack

This project is written in Java and utilises amongst other things the following technologies:

  • Java 1.8
  • Maven 3
  • MySQL 5.6
  • Hibernate 4.3
  • HikariCP
  • Super CSV
  • Apache Tomcat
  • Apache Commons
  • Jersey
  • JSF
  • Primefaces
  • jbcrypt

Modules

The code is structured into several modules, each having its own dependencies.

Batch Modules

Provides a simple interface as alternative to the Web interface.

CSV-Citation-Deployment-Module

As the name implies, this module serves as a metamodule and allows to deploy the complete project within a Apache Tomcat Servlet container. The module also has subfolders containing the SQL data to initialize the database and create the necessary users and permissions. The script "kill_tomcat.sh" is used to launch the Tomca service (killing it if it was running). The scripts mvn_compile_simple and mvn_compile_server are user for compiling the project and uploading it to the Apache server.

CSVTools

This module contains helper tools for handling CSV data.

Database

The database module contains all classes related to database operations. The module consists of three packages:

Authentication

The authentication Module handles the user authentication for the Web application. It provides an API for creating new users. Passwords are stored encrypted.

DatabaseOperations

The operations module comes with different classes containing functions for storing and retrieving data in the MySQL database. It is the data backend of this tool.

Helpers

Simple helper classes, e.g. for String operations.

DatatableModel

This module handles all interactions with the datatable plugin from the Web interface.

Evaluation Module

This module contains the alternative approach based on Git. It is used for evaluating the database solution in terms of storage demand and query execution time.

Persistent Identification tools

The PiD module is a standalone PID service, which is used for creating test identifiers, which can be resolved locally.

Query Store

The query store contains all the queries and their parameters including the metadata of the query execution.

ResultsetVerification

The result set verification module computes hashes of result sets.

WebInterface

The web interface module provides a interface for users.

About

The official RDA WGDC Prorotype for citable CSV data(in development)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published