pmc - Python

Parse full text XML documents from PubMed Central's Open Access subset into structured data.

This is a complete rewrite in Python that achieves similar functionality to tidypmc but which can generate Markdown.

Examples

1. Convert a single XML file to Markdown

Run the helper script pointing at a local PMC XML file and it will generate a Markdown document next to it:

python extract_from_xml.py

(The sample script is pre-configured to read data/PMC2231364.xml and write sample_output.md. Feel free to adjust the paths inside the file.)

2. Process a local TAR archive of PMC XMLs

Have one or more .tar.gz archives in the data/ directory? Convert every eligible article inside them with:

python extract_from_tar.py

The script automatically skips papers that are licensed CC-BY-ND and – by default – keeps only articles that mention obesity, weight-loss or diabetes in their title, abstract or author keywords. You can tweak the regular expression or turn the filter off by editing the constants at the top of the script.

3. Download & extract the latest PMC OA archives from FTP

To work with the newest content without downloading the huge full dataset first, use:

python extract_from_ftp.py

The script will:

Query the NCBI FTP mirror for the most recent Open Access – comm archives.
Stream-download each *.tar.gz file, decompress it on the fly and apply the same licence/topic filters as above.
Write the resulting Markdown files to data/markdown_outputs/.

Adjust the topic regular expression or disable filtering in extract_from_ftp.py just like in the TAR example above.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
data		data
scripts		scripts
src/pmc_python		src/pmc_python
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
extract_from_ftp.py		extract_from_ftp.py
extract_from_tar.py		extract_from_tar.py
extract_from_xml.py		extract_from_xml.py
pyproject.toml		pyproject.toml
sample_output.md		sample_output.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pmc - Python

Examples

1. Convert a single XML file to Markdown

2. Process a local TAR archive of PMC XMLs

3. Download & extract the latest PMC OA archives from FTP

About

Uh oh!

Releases

Packages

Languages

License

casper-hansen/pmc-python

Folders and files

Latest commit

History

Repository files navigation

pmc - Python

Examples

1. Convert a single XML file to Markdown

2. Process a local TAR archive of PMC XMLs

3. Download & extract the latest PMC OA archives from FTP

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages