Code and steps used to generate the Data Citation Corpus dump file

datacite/corpus-data-file

Data Citation Corpus

This project generates data dumps in JSON and CSV formats.

Requirements

Before running this project, make sure the following are installed on your machine:

  • PostgreSQL: if it is not already installed, download it from the official PostgreSQL website.

  • Python 3: if it is not already installed, download it from the official Python website.

Setup

To set up the project, follow these steps:

  1. Clone the repository: git clone git@github.com:datacite/corpus-data-file.git
  2. Navigate to the project directory: cd corpus-data-file
  3. Create a .env file from the example (cp .env.example .env) and add your database credentials

How to run the scripts

Make scripts executable

e.g. chmod +x ./export-script/create_assertion_formatted_table.sh

Create table with formatted data

./export-script/create_assertion_formatted_table.sh

Generate dump files

./export-script/generate_assertion_details.sh
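
For readers unfamiliar with the two output formats, the following is a minimal sketch of writing the same records as both JSON and CSV. It is not the logic of generate_assertion_details.sh; the function name dump_rows and the field names in the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io
import json


def dump_rows(rows, json_file, csv_file):
    """Write the same rows as a JSON array and as CSV with a header row."""
    rows = list(rows)
    json.dump(rows, json_file, indent=2)
    if rows:
        # Column order follows the keys of the first row.
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical sample rows; the real dump columns are not documented here.
sample_rows = [
    {"id": "1", "repository": "example-repo", "accession_number": "ABC123"},
    {"id": "2", "repository": "example-repo", "accession_number": "ABC456"},
]

json_buffer = io.StringIO()
csv_buffer = io.StringIO()
dump_rows(sample_rows, json_buffer, csv_buffer)
```

In-memory buffers are used so the sketch runs without touching the file system; real files opened with open(path, "w", newline="") would work the same way.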

Process behind generating dump files

Accession Number Validation

This script validates the accession numbers in the database against the regular expressions defined for each repository.
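
The idea of per-repository validation can be sketched as below. This is an illustration only, not the contents of accession_number_validation.py: the repository names, the patterns, and the function is_valid_accession are all hypothetical placeholders.

```python
import re

# Hypothetical repository names and patterns, for illustration only.
PATTERNS = {
    "example-repo-a": re.compile(r"[A-Z]{3}\d{6}"),
    "example-repo-b": re.compile(r"EX-\d+"),
}


def is_valid_accession(repository, accession):
    """Return True/False for a known repository, or None if no pattern exists."""
    pattern = PATTERNS.get(repository)
    if pattern is None:
        return None  # no rule available, so validity cannot be decided
    # fullmatch requires the whole string to match, not just a prefix.
    return pattern.fullmatch(accession) is not None
```

Returning None for unknown repositories keeps "invalid" distinct from "not checked", which may matter when reviewing the results.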

Setup

  1. Ensure you have Python 3 installed on your system.

  2. Navigate to the script directory:

    cd accession_number_validation

  3. Install the required Python packages:

    pip install -r requirements.txt

  4. Create a .env file in the project root and add database credentials:

    touch .env

    Open the .env file and add the following lines:

    DB_NAME=<database_name>
    DB_USER=<database_username>
    DB_PASSWORD=<database_password>
    DB_HOST=<database_host>
    DB_PORT=<database_port>
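
As a rough sketch of how the DB_* settings above could be consumed, the snippet below parses .env-style text and assembles a PostgreSQL keyword/value connection string. How the actual script reads its configuration is not stated here; parse_env and build_dsn are hypothetical helpers, and the sample values are placeholders.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def build_dsn(settings):
    """Build a keyword/value connection string from the DB_* settings.

    Assumes the values contain no spaces; such values would need quoting.
    """
    return (
        "host={DB_HOST} port={DB_PORT} dbname={DB_NAME} "
        "user={DB_USER} password={DB_PASSWORD}"
    ).format(**settings)


# Placeholder values for illustration only.
example_env = """
# database settings
DB_NAME=corpus
DB_USER=reader
DB_PASSWORD="secret"
DB_HOST=localhost
DB_PORT=5432
"""

example_settings = parse_env(example_env)
example_dsn = build_dsn(example_settings)
```

A missing key raises KeyError in build_dsn, which surfaces an incomplete .env file early rather than at connection time.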

Running the Script

To run the accession_number_validation.py script, use the following command:

python accession_number_validation.py
