Parse full text XML documents from PubMed Central's Open Access subset into structured data.
This is a complete rewrite in Python that achieves similar functionality to tidypmc but which can generate Markdown.
Run the helper script pointing at a local PMC XML file and it will generate a Markdown document next to it:
python extract_from_xml.py
(The sample script is pre-configured to read data/PMC2231364.xml
and write sample_output.md
. Feel free to adjust the paths inside the file.)
Have one or more .tar.gz
archives in the data/
directory? Convert every eligible article inside them with:
python extract_from_tar.py
The script automatically skips papers that are licensed CC-BY-ND and – by default – keeps only articles that mention obesity, weight-loss or diabetes in their title, abstract or author keywords. You can tweak the regular expression or turn the filter off by editing the constants at the top of the script.
To work with the newest content without downloading the huge full dataset first, use:
python extract_from_ftp.py
The script will:
- Query the NCBI FTP mirror for the most recent Open Access – comm archives.
- Stream-download each
*.tar.gz
file, decompress it on the fly and apply the same licence/topic filters as above. - Write the resulting Markdown files to
data/markdown_outputs/
.
Adjust the topic regular expression or disable filtering in extract_from_ftp.py
just like in the TAR example above.