Skip to content

The official code to build up dataset PMC-OA

License

Notifications You must be signed in to change notification settings

WeixiongLin/Build-PMC-OA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build PMC-OA

中文版本

This is our pipeline for the development of PMC-OA. You might need further adaptation to use it for your own purpose.

Installation

  1. Setup ENV
conda create -n pubmed python=3.8  # not test for other version
conda activate pubmed

pip install -r requirements.txt

git clone https://gitee.com/lin_wei_hung/build-pmcoa.git
python setup.py develop  # choose developer mode for customization
  1. Run the script
python src/fetch_oa.py --volumes 0 1 2 3 4 5 6 7 8 9  # 10 volumes for PMC OA in total
python src/fetch_oa.py --volumes 0 1 2  # Choose volumes of 0,1,2 only

About PMC-OA

PMC-OA(Pubmed Open Acess Subset) is built with public papers in Pubmed, which can be downloaded from pubmed page.

Due to the issue of copyright, the papers with Non-Commertial-Use liscence and ones with no liscence is not included in PMC-OA. You might customize the repo for your own purpose.

Structure

setup.py
src/
  |--fetch_oa.py: main script for download PMC-OA
  |--args/
  | |--args_oa.py: Configures for pipeline
  |--parser/
  | |--parse_oa.py: Parse web pages into list of <img, caption> pairs
  |--utils/
  | |--io.py: 

Contribution

  1. Fork the repository
  2. Create Feat_xxx branch
  3. Commit your code
  4. Create Pull Request

Limitation

  1. Some of the paper are only presented in pdf formart, the figures in those would not be obtained by this pipeline
  2. Media files other than images might also be downloaded, such as suffix .mp4, .avi.

Cite

@article{lin2023pmc,
  title={PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents},
  author={Lin, Weixiong and Zhao, Ziheng and Zhang, Xiaoman and Wu, Chaoyi and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2303.07240},
  year={2023}
}

About

The official code to build up dataset PMC-OA

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages