Background

This project creates JATS XML files for Publish on Accept (PoA) articles. It is centred around using a set of CSV data files as input, although an intermediary data object could be populated with data from other sources.

It can also transform the files in a zip package into the publishable format for PoA articles: it decapitates (removes) the cover page from a PDF file and rezips the other files into a new supplementary zip file.

The project can also create CrossRef and PubMed deposits from XML file input. An XML parser (parsePoaXml.py) builds an article object from XML rather than from CSV data, and a CrossRef or PubMed batch deposit file is produced from that article.
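As a rough illustration of the XML-to-article step, the sketch below reads a DOI and a title from JATS-style XML using only the standard library. It is not the parser in parsePoaXml.py: the `Article` class, the `parse_article` function, and the sample XML (including its placeholder DOI) are hypothetical and only show the general idea.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the project's article object.
@dataclass
class Article:
    doi: Optional[str]
    title: Optional[str]


def parse_article(xml_string: str) -> Article:
    """Build a minimal Article from JATS-style XML text."""
    root = ET.fromstring(xml_string)
    # JATS stores the DOI in <article-id pub-id-type="doi">.
    doi_tag = root.find(".//article-id[@pub-id-type='doi']")
    title_tag = root.find(".//article-title")
    return Article(
        doi=doi_tag.text if doi_tag is not None else None,
        # itertext() flattens inline markup such as <italic> inside the title.
        title="".join(title_tag.itertext()) if title_tag is not None else None,
    )


# Made-up sample input; the DOI is a placeholder, not a real article.
SAMPLE = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.7554/eLife.00000</article-id>
      <title-group>
        <article-title>An <italic>example</italic> title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""

article = parse_article(SAMPLE)
print(article.doi, "|", article.title)
```

A deposit generator could then read fields from such an object without caring whether it was populated from XML or from CSV data.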

Examples of input and output can be found in the test cases.

Installation

Create a virtualenv, activate it, and install the required libraries:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Copy example-settings.py to a new file named settings.py.

You should be able to run the automated tests at this point.

Configuration

XML generation

To generate article XML from CSV data, set XLS_PATH in settings.py to the directory where the CSV files are stored. If you do not have any CSV files yet and are just testing this project, you can use the sample CSV files from the automated tests by setting:

XLS_PATH = "tests/test_data/"

Run the following command; it should produce XML files in the generated_xml_output directory (the default output folder name):

python xml_generation.py
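Once the script has run, a quick sanity check is to confirm that each generated file is well-formed XML. The sketch below is not part of the project; the function name `check_generated_xml` is hypothetical, and it checks well-formedness only, not validity against a DTD.

```python
import glob
import os
import xml.etree.ElementTree as ET


def check_generated_xml(output_dir="generated_xml_output"):
    """Return (ok_files, bad_files) for the .xml files in output_dir."""
    ok_files, bad_files = [], []
    for path in sorted(glob.glob(os.path.join(output_dir, "*.xml"))):
        try:
            ET.parse(path)  # raises ParseError if the file is not well-formed
            ok_files.append(path)
        except ET.ParseError:
            bad_files.append(path)
    return ok_files, bad_files


if __name__ == "__main__":
    ok, bad = check_generated_xml()
    print("well-formed:", len(ok), "malformed:", len(bad))
```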

CrossRef and PubMed deposit generation

For now, to try out generateCrossrefXml.py and generatePubMedXml.py, edit the XML filenames in the article_xmls list in the main block at the bottom of each file, which runs when the script is executed directly. You can also point the list at automated test data, for example:

article_xmls = ["tests/test_data/elife-02935-v2.xml"]

After running these scripts successfully, there should be new XML deposit files in the tmp directory.

The PubMed deposit has a few options for Publish on Accept (PoA) articles: an article can be deposited at PubMed as "[Epub ahead of print]", and a deposit can replace a previous one by setting the <Replaces> tag. These values are set from external sources, such as the file name or an external data store, before the final PubMed deposit is generated. You will need to read the code to understand the details.
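To make the <Replaces> option more concrete, here is a minimal sketch that adds such a tag to an `<Article>` element using the standard library. It is not the code in generatePubMedXml.py: the function name, the use of an `IdType="doi"` attribute, and the placeholder DOI are assumptions for illustration, and the sketch does not enforce the element order a PubMed deposit requires.

```python
import xml.etree.ElementTree as ET


def add_replaces_tag(article_tag, doi, is_replacement):
    """Append <Replaces> only when the deposit replaces a previous one."""
    if is_replacement:
        replaces = ET.SubElement(article_tag, "Replaces")
        replaces.set("IdType", "doi")  # assumed identifier type
        replaces.text = doi
    return article_tag


# Placeholder DOI, not a real article.
article_tag = ET.Element("Article")
add_replaces_tag(article_tag, "10.7554/eLife.00000", is_replacement=True)
print(ET.tostring(article_tag, encoding="unicode"))
```

In practice the `is_replacement` flag would come from whichever external source records that an earlier deposit exists.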

Others

There are some other functions, such as repackaging article zip files into a new format, decapitating PDF files, and FTP transfer, that are not documented here yet.

## Project dependencies

See requirements.txt.

## Testing

You can run the full automated test suite from the base folder with:

python -m unittest discover tests

or you can run tests with coverage:

coverage run -m unittest discover tests

and then view the coverage report:

coverage report -m

Copyright & Licence

Older readme notes below!

Project outline

#### Data directories

- `sample-xls-input` contains example output from the EJP SQL query.
- `sample-xml-generated-output` is where the XML that our script generates is placed.
- `sample-xml-required-output` contains the XML specification from production; our generated XML should match the format of the XML in this directory.

#### Python scripts

- `example-settings.py` gives an example configuration script. Copy this to settings.py and adjust for your own path structure.
- `xml_generation.py` generates a set of XML files based on articles that we have published.
- `generatePoaXml.py` is a set of classes for modelling the output XML.
- `parseXlsFiles.py` reads data from the provided XLS files and provides a simple interface to the data.

Other files in the repo represent incomplete or earlier work.

Settings

Scripts look for file paths in a settings.py file. An example is provided in example-settings.py. Copy this example to settings.py and configure for your own path structure. It will look for the following information:

- `XLS_PATH` the location of the xls files to be read in.  
- `TARGET_OUTPUT_DIR` a path to a directory for writing generated xml files.  
- `XLS_FILES` a dict giving a label to the files that will be processed in the XLS read phase.
- `XLS_COLUMN_HEADINGS` a dict listing column heading names of interest in the XLS files that we will process.
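A settings.py along these lines would cover the four values above. This is only a sketch: the dict keys, file names, and column headings shown are hypothetical placeholders, and example-settings.py remains the authoritative template.

```python
# settings.py (illustrative sketch; values are placeholders, not project defaults)

# Directory holding the input data files.
XLS_PATH = "tests/test_data/"

# Directory where generated XML files are written.
TARGET_OUTPUT_DIR = "generated_xml_output/"

# Label -> file name for each input file (hypothetical labels and names).
XLS_FILES = {
    "authors": "poa_author.csv",
    "manuscript": "poa_manuscript.csv",
}

# Label -> column heading of interest (hypothetical headings).
XLS_COLUMN_HEADINGS = {
    "author_position": "author_seq",
    "doi": "manuscript_doi",
}
```

Keeping the labels in dicts means the reading code can refer to a stable label while the underlying file or column name changes between data exports.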

Obtaining XLS files to process

These files are generated out of the EJP system via a set of SQL queries. We do not store this data in this repository. Please contact @nathanlisgo to obtain a set for processing. The files are versioned, and the root of each filename indicates what data we expect it to contain. We are currently processing the following files, which you should obtain:

- `poa_author_` : information about eLife authors
- `poa_license_` : licensing information for articles
- `poa_manuscript_` : manuscript details, including reviewing editor information
- `poa_received_` : received dates for manuscripts
- `poa_subject_area_` : information on subject areas for the manuscripts
- `poa_research_organism_` : information on organisms that the manuscripts operate on
- `poa_title_` : manuscript titles
- `poa_abstract_` : manuscript abstracts

Each of these files needs to be placed into the directory located at XLS_PATH in settings.py.
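Before running the generation script, it can help to confirm that a file is present for every expected root. The sketch below is not part of the project; the function name `missing_roots` is hypothetical, and it assumes each versioned file name begins with the root exactly as written above.

```python
import os

# File name roots listed above.
EXPECTED_ROOTS = [
    "poa_author_",
    "poa_license_",
    "poa_manuscript_",
    "poa_received_",
    "poa_subject_area_",
    "poa_research_organism_",
    "poa_title_",
    "poa_abstract_",
]


def missing_roots(directory, roots=EXPECTED_ROOTS):
    """Return the roots that have no matching file name in directory."""
    file_names = os.listdir(directory)
    return [
        root
        for root in roots
        if not any(name.startswith(root) for name in file_names)
    ]


# Example usage (assumes a settings.py with XLS_PATH exists):
#   import settings
#   print(missing_roots(settings.XLS_PATH))
```

An empty list means every expected root has at least one file in the directory.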

#### Generating XML from an XLS file

$ python xml_generation.py

#### Verifying XML file

$ xmllint --noout --loaddtd --valid sample-xml-generated-output/outputName.xml
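To validate every generated file rather than one at a time, the same command can be looped from Python. This sketch is not part of the project; the function names are hypothetical, and it assumes `xmllint` is installed and on the PATH and that Python 3.7 or later is used.

```python
import glob
import os
import subprocess


def xmllint_command(xml_path):
    """Build the xmllint command shown above for one file."""
    return ["xmllint", "--noout", "--loaddtd", "--valid", xml_path]


def validate_directory(directory):
    """Run xmllint on every .xml file; return the paths that failed."""
    failed = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        result = subprocess.run(
            xmllint_command(path), capture_output=True, text=True
        )
        if result.returncode != 0:  # xmllint exits non-zero on errors
            failed.append(path)
    return failed


# Example usage:
#   print(validate_directory("sample-xml-generated-output"))
```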

Project issues

Live code issues are listed as issues in the git repo for this project.

# Version history

- 2013-12-30 robust reviewed script ready.
- 2013-12-09 initial batch of code ready to review.
- 2013-11-26 first proof of concept.
