AI Papers Cleaner

Extract text from paper PDFs and abstracts, and remove uninformative words. This is helpful for building a corpus of papers to train a language model.

Based on CVPR_paper_search_tool by Jin Yamanaka. I decided to split the code into multiple projects:

AI Papers Scrapper - Download paper PDFs and other information from the main AI conferences
This project - Extract text from paper PDFs and abstracts, and remove uninformative words
AI Papers Search Tool - Automatic paper clustering
AI Papers Searcher - Web app to search papers by keywords or similar papers

Requirements

Docker or, for local installation:

Python 3.11+
Poetry

Usage

To make the code easier to run, with or without Docker, I created a few helpers. Both methods use start_here.sh as the entry point. Because calling the individual scripts has a few quirks, this file collects all the commands needed to run the code. All you need to do is uncomment the relevant lines inside the conferences array and run the script. Also comment or uncomment the following flags as needed:

extract_pdfs=1
extract_urls=1
clean_abstracts=1
clean_papers=1

You'll also need to download some nltk data. See the section below that matches how you run the code.

Running without Docker

You first need to install Python Poetry. Then, you can install the dependencies and run the code:

poetry install
bash start_here.sh

Downloading nltk data

To download the nltk data, run the following:

poetry run ipython3

Then, inside the Python shell:

import nltk
nltk.download('stopwords')
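
As a rough illustration of what the downloaded stopwords are for, the sketch below filters stopwords out of a sentence. It is not the code in text_cleaner.py: the function names load_stopwords and remove_stopwords are hypothetical, and the small fallback set exists only so the example still runs when nltk or its data is missing.

```python
def load_stopwords() -> set[str]:
    """Return English stopwords from nltk, or a tiny fallback set."""
    try:
        from nltk.corpus import stopwords

        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # nltk is not installed, or the 'stopwords' data was not downloaded.
        # Fallback set is an assumption made for this example only.
        return {"a", "an", "and", "is", "of", "on", "the", "to"}


def remove_stopwords(text: str, stop_words: set[str]) -> str:
    """Lowercase the text, split on whitespace, and drop stopwords."""
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)


if __name__ == "__main__":
    sentence = "The model is trained on a large corpus"
    print(remove_stopwords(sentence, load_stopwords()))
```

If the fallback set is being used, the nltk data has not been found and the download step above should be repeated.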

Running with Docker

To help with the Docker setup, I created a Dockerfile and a Makefile. The Dockerfile contains all the instructions to create the Docker image. The Makefile contains the commands to build the image, run the container, and run the code inside the container. To build the image, simply run:

make

To call start_here.sh inside the container, run:

make run

Downloading nltk data

To download the nltk data, run the following:

make RUN_STRING="ipython3" run

Then, inside the Python shell:

import nltk
nltk.download('stopwords')

Checking the cleaning process

The best way to check how the cleaning process works for a specific paper is to run the clean_paper.sh script. Inside it, you can set the following variables:

# clean_abstracts=1
clean_papers=1

index=1
# title="Moon IME: Neural-based Chinese Pinyin Aided Input Method with Customizable Association"
conf=aaai
year=2017

To check the abstract cleaning process, uncomment the clean_abstracts line and comment out the clean_papers line. To check the paper cleaning process, do the reverse. Set the conf and year variables to the conference (as listed in the conferences array in start_here.sh) and year of your choice, and set either the index or the title variable. The index variable is the index of the paper in the abstracts.csv or pdfs.csv file, while title can be any part of the paper's title. If you set both, index is used. To call the clean_paper.sh script, run:

bash clean_paper.sh # if you're running without Docker
make RUN_STRING="bash clean_paper.sh" run # if you're running with Docker
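
The index/title selection described above can be sketched in Python as follows. This is only an illustration under assumptions, not the project's actual lookup code: the function name select_paper, the "title" column, the comma separator, the 0-based index, and the case-insensitive match are all hypothetical, and the sample data stands in for abstracts.csv.

```python
import csv
import io


def select_paper(rows, index=None, title=None):
    """Pick one row by position or by title substring; index wins if both are set."""
    if index is not None:
        return rows[index]
    if title is not None:
        needle = title.lower()
        for row in rows:
            # Match on any part of the title, ignoring case (an assumption).
            if needle in row["title"].lower():
                return row
        return None
    raise ValueError("set either index or title")


# Hypothetical stand-in for abstracts.csv; the real file's columns may differ.
SAMPLE_CSV = """title,abstract
First Paper,About graphs
Moon IME: Neural-based Chinese Pinyin Aided Input Method,About input methods
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

if __name__ == "__main__":
    print(select_paper(rows, index=1)["title"])
    print(select_paper(rows, title="pinyin")["title"])
    # Both set: the index takes precedence over the title.
    print(select_paper(rows, index=0, title="pinyin")["title"])
```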

george-gca/ai_papers_cleaner