
PubRAG uses your own collection of scientific articles to answer your niche questions and cite sources.

amcrabtree/pub-rag

PubRAG

PubRAG is a retrieval-augmented generation (RAG) app that uses your own folder of publication PDFs to help you answer questions and cite sources. This app was made because large language models (LLMs) like ChatGPT do not provide accurate citations for their responses and instead hallucinate them (i.e., they make up something random that sounds legit).

The current version uses a vector database to store each publication's citation information along with semantic embeddings generated by the SBERT 'all-MiniLM-L6-v2' model. Next, a FAISS (Facebook AI Similarity Search) index is built over the publication embeddings. Finally, the query is embedded and searched against the FAISS index to find the best-matching publication; that match is used to pull citation and filename information from the database, and the PDF text is combined with an engineered standard prompt to produce a response from the LLM. Now using Llama3.1!
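The last step, combining the retrieved PDF text with a standard prompt, might look roughly like the sketch below. The template wording and field names here are illustrative assumptions, not the repo's actual prompt:

```python
# Illustrative only: the repo's real prompt template may differ.
PROMPT_TEMPLATE = """Answer the question using only the excerpt below,
and cite the source given.

Source: {citation}
Excerpt:
{pdf_text}

Question: {question}
Answer:"""

def build_prompt(question: str, citation: str, pdf_text: str) -> str:
    """Fill the standard prompt with the best-matching paper's text."""
    return PROMPT_TEMPLATE.format(
        citation=citation, pdf_text=pdf_text, question=question
    )

prompt = build_prompt(
    question="What datasets were used?",
    citation="Shaban et al., 2023",
    pdf_text="...scraped PDF text...",
)
```

The filled prompt is what gets sent to Llama3.1 via ollama.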

Tip

If you get a bad response, run your question again and it may produce something better.

Environment

PubRAG runs using ollama and some assorted Python libraries, including PDFMiner for scraping text from PDFs. This repo was developed with Python 3.10 on Ubuntu, running on CPU.

pip install -r requirements.txt
curl -fsSL https://ollama.com/install.sh | sh

Data

Option 1: Use your own folder of PDF files and save the path as the bash variable PDF_DIR.

Option 2: Download test data to a folder like the following (if using Ubuntu):

PDF_DIR="test/my_pdf_files"
mkdir -p $PDF_DIR
wget https://www.nature.com/articles/s41467-023-44188-w.pdf -P $PDF_DIR
wget https://www.nature.com/articles/s41467-023-40066-7.pdf -P $PDF_DIR

Prepare database and index files

Create vector database

Creating a vector database simplifies storing and looking up each document's semantic embeddings. This particular database contains each paper's citation info, file path, and SBERT embeddings.

Caution

Once you make your vector database, don't change the filename or location of the PDFs in your directory. If you do, you'll need to create a fresh database.
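One quick way to spot a stale database is to check whether every stored PDF path still exists on disk. The sketch below assumes the paths live in a `filepath` column of a `papers` table; the real schema may differ:

```python
import sqlite3
from pathlib import Path

def find_missing_pdfs(db_path: str) -> list[str]:
    """Return stored PDF paths that no longer exist on disk.

    Assumes a hypothetical `papers(filepath)` table; adjust to the
    actual schema produced by make_vector_db.py.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute("SELECT filepath FROM papers").fetchall()
    finally:
        con.close()
    return [p for (p,) in rows if not Path(p).exists()]
```

If this returns anything, rebuild the database rather than querying against dead paths.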

python make_vector_db.py $PDF_DIR ./test/research_papers.db
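One plausible layout for such a database stores the embedding as raw bytes alongside the citation and file path. This is a sketch of the idea, not necessarily the schema `make_vector_db.py` actually creates:

```python
import sqlite3
import numpy as np

def init_db(db_path: str) -> None:
    """Create a hypothetical one-row-per-paper table."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS papers (
               id INTEGER PRIMARY KEY,
               citation TEXT,    -- e.g. "Shaban et al., 2023"
               filepath TEXT,    -- path to the source PDF
               embedding BLOB    -- float32 SBERT vector (384-dim
                                 -- for all-MiniLM-L6-v2)
           )"""
    )
    con.commit()
    con.close()

def add_paper(db_path: str, citation: str, filepath: str,
              embedding: np.ndarray) -> None:
    """Insert one paper, serializing its embedding to bytes."""
    con = sqlite3.connect(db_path)
    con.execute(
        "INSERT INTO papers (citation, filepath, embedding) VALUES (?, ?, ?)",
        (citation, filepath, embedding.astype(np.float32).tobytes()),
    )
    con.commit()
    con.close()
```

Storing the vector as a BLOB keeps the database to a single SQLite file; it can be decoded back with `np.frombuffer(blob, dtype=np.float32)`.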

Create FAISS index file

Creating a FAISS index array speeds up similarity searches.

python index_vector_db.py ./test/research_papers.db ./test/faiss_index.index
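Conceptually, a flat FAISS index answers "which stored embedding is closest to the query?" A minimal NumPy stand-in for that search (without the FAISS library itself) looks like:

```python
import numpy as np

def nearest_paper(query_vec: np.ndarray, embeddings: np.ndarray) -> int:
    """Return the row index of the embedding closest to the query
    by L2 distance, mimicking what a flat FAISS index computes."""
    dists = np.linalg.norm(embeddings - query_vec, axis=1)
    return int(np.argmin(dists))

# Toy example: three 4-d "embeddings" and a query nearest to row 2.
emb = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0]], dtype=np.float32)
query = np.array([0.1, 0.0, 0.9, 0.0], dtype=np.float32)
best = nearest_paper(query, emb)  # → 2
```

FAISS precomputes index structures so this lookup stays fast as the number of papers grows, which is what `index_vector_db.py` sets up.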

Run PubRAG

Now you can run the query script.

python run_pub_rag.py ./test/research_papers.db ./test/faiss_index.index \
  "What were the datasets used to train the model in the Mahmood MAPS paper?"

A generated response from llama3.1 using PubRAG (took ~5 minutes on a CPU):

According to the Mahmood MAPS paper, the authors used three in-house datasets for training and evaluation of their proposed method. However, they also compared their results with two existing cell phenotyping methods, namely ASTIR and CellSighter, which were re-trained on the same three in-house datasets using the same 5-fold cross-validation splits.

In addition to these in-house datasets, the authors obtained results for CellSighter on a publicly available CRC CODEX dataset from another paper. Therefore, the answer to your question would be that the Mahmood MAPS paper used at least one publicly available dataset (CRC CODEX) and three in-house datasets for training and evaluation.

(Muhammad Shaban et al., 2023: MAPS: pathologist-level cell type annotation from tissue images through machine learning)

TODO

  • Create a vector db to store encoded pubs
  • Clock run times
  • Add database and index file update option for when new pubs are added
  • Figure out ways to make this framework run faster on a CPU

Acknowledgements
