This project generates data dumps in JSON and CSV formats.
Before running this project, please ensure that you have the following requirements installed on your machine:
- PostgreSQL: You will need to have PostgreSQL installed. If you don't have it, you can download it from the official PostgreSQL website.
- Python 3: You will need to have Python 3 installed. If you don't have it, you can download it from the official Python website.
To set up the project, follow these steps:
- Clone the repository:
git clone git@github.com:datacite/corpus-data-file.git
- Navigate to the project directory:
cd corpus-data-file
- Create a .env file from the example and add your database credentials:
cp .env.example .env
- Make the export scripts executable if needed, then run them in order, e.g.:
chmod +x ./export-script/create_assertion_formatted_table.sh
./export-script/create_assertion_formatted_table.sh
./export-script/generate_assertion_details.sh
- Multiple SQL queries create a table and populate it with the related formatted data, following the spec document.
- The bash script create_assertion_formatted_table.sh automates the creation of that table.
- The bash script generate_assertion_details.sh generates the data dump files: it creates JSON dump files from the formatted table built by create_assertion_formatted_table.sh, then converts each file to CSV using the Python script convert_to_csv.py, following the spec document.
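The JSON-to-CSV step can be sketched as follows. This is a minimal illustration, not the contents of convert_to_csv.py: it assumes each JSON dump file holds a list of flat objects, and the function name `json_to_csv` is hypothetical. The actual columns and layout are defined by the spec document.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert a JSON file holding a list of flat objects into a CSV file.

    Assumption: the dump is a JSON array of objects with scalar values.
    Returns the number of records written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not records:
        # Nothing to write; create an empty file so later steps still find it.
        Path(csv_path).write_text("", encoding="utf-8")
        return 0

    # Collect column names in first-seen order so every key gets a column.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        # Keys missing from a record are written as empty cells.
        writer.writerows(records)
    return len(records)
```

Taking the union of keys across all records keeps the header stable even when some records omit optional fields.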
The accession_number_validation.py script validates the accession numbers in our database against a set of regular expressions defined for each repository.
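The idea of per-repository validation might look roughly like the sketch below. The repository names and patterns are invented placeholders, and `is_valid_accession` is a hypothetical helper. The real script reads its records from the database and uses its own pattern set.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the project's rules.
PATTERNS = {
    "example_repo_a": re.compile(r"[A-Z]{2}\d{6}"),
    "example_repo_b": re.compile(r"PRJ\d+"),
}


def is_valid_accession(repository, accession_number):
    """Return True if the accession number fully matches the repository's pattern."""
    pattern = PATTERNS.get(repository)
    if pattern is None:
        # Unknown repository: treated as invalid here (a design assumption).
        return False
    # fullmatch rejects strings with extra leading or trailing characters.
    return pattern.fullmatch(accession_number) is not None


if __name__ == "__main__":
    for repo, number in [("example_repo_a", "AB123456"), ("example_repo_b", "X1")]:
        print(repo, number, is_valid_accession(repo, number))
```

Using `fullmatch` rather than `search` avoids accepting values that merely contain a valid accession number somewhere inside a longer string.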
- Ensure you have Python 3 installed on your system.
- Navigate to the script directory:
cd accession_number_validation
- Install the required Python packages:
pip install -r requirements.txt
- Create a .env file in the project root and add database credentials:
touch .env
Open the .env file and add the following lines:
DB_NAME=<database_name>
DB_USER=<database_username>
DB_PASSWORD=<database_password>
DB_HOST=<database_host>
DB_PORT=<database_port>
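For reference, here is one way such a .env file could be turned into connection settings. It is a sketch under assumptions: `parse_env` and `connection_params` are hypothetical helpers, and the script may well load its settings differently (for example through a package listed in requirements.txt). The output keys follow the libpq keyword names (`dbname`, `user`, `password`, `host`, `port`) that PostgreSQL drivers commonly accept.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def connection_params(env):
    """Map the DB_* variables to libpq-style connection keywords."""
    return {
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
    }


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        params = connection_params(parse_env(handle.read()))
    # Print everything except the password.
    print({key: value for key, value in params.items() if key != "password"})
```

A missing variable raises a KeyError, which surfaces an incomplete .env file early instead of failing later at connection time.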
To run the accession_number_validation.py script, use the following command:
python accession_number_validation.py