Extract text from papers PDFs and abstracts, and remove uninformative words. This is helpful for building a corpus of papers to train a language model.
Based on CVPR_paper_search_tool by Jin Yamanaka. I decided to split the code into multiple projects:
- AI Papers Scrapper - Download papers pdfs and other information from main AI conferences
- this project - Extract text from papers PDFs and abstracts, and remove uninformative words
- AI Papers Search Tool - Automatic paper clustering
- AI Papers Searcher - Web app to search papers by keywords or similar papers
Docker or, for local installation:
- Python 3.11+
- Poetry
To make it easier to run the code, with or without Docker, I created a few helpers. Both ways use start_here.sh
as an entry point. Since there are a few quirks when calling the specific code, I created this file with all the necessary commands to run the code. All you need to do is to uncomment the relevant lines inside the conferences
array and run the script. Also, comment/uncomment the following as needed:
extract_pdfs=1
extract_urls=1
clean_abstracts=1
clean_papers=1
You'll need to download some nltk data. To do this, read the relevant section according to your usage method below.
You first need to install Python Poetry. Then, you can install the dependencies and run the code:
poetry install
bash start_here.sh
To download the nltk data, run the following:
poetry run ipython3
Then, inside the Python shell:
import nltk
nltk.download('stopwords')
To help with the Docker setup, I created a Dockerfile
and a Makefile
. The Dockerfile
contains all the instructions to create the Docker image. The Makefile
contains the commands to build the image, run the container, and run the code inside the container. To build the image, simply run:
make
To call start_here.sh
inside the container, run:
make run
To download the nltk data, run the following:
make RUN_STRING="ipython3" run
Then, inside the Python shell:
import nltk
nltk.download('stopwords')
The best way to check how the cleaning process works for a specific paper is by running the clean_paper.sh script. You can set inside the following variables:
# clean_abstracts=1
clean_papers=1
index=1
# title="Moon IME: Neural-based Chinese Pinyin Aided Input Method with Customizable Association"
conf=aaai
year=2017
To check the abstract cleaning process, uncomment the clean_abstracts
line and comment the clean_papers
line. To check the paper cleaning process, reverse the comments. You need to set the conf
and year
variables to the conference (as displayed in the conferences
array in start_here.sh) and year of your choice, and set one of index
or title
variables. The index
variable is the index of the paper in the abstracts.csv
or pdfs.csv
file, while title
can be a part of the title of the paper. If you set both, the index
variable will be used. To call the clean_paper.sh script, run:
bash clean_paper.sh # if you're running without Docker
make RUN_STRING="bash clean_paper.sh" run # if you're running with Docker