
PubRAG uses your own collection of scientific articles to answer your niche questions and cite sources.

amcrabtree/pub-rag

PubRAG

PubRAG is a retrieval-augmented generation (RAG) app that uses your own folder of publication PDFs to help you answer questions and cite sources. This app was made because large language models (LLMs) like ChatGPT do not provide accurate citations for their responses and instead hallucinate them (i.e., they make up something random that sounds legit).

The current version uses a vector database to store each publication's citation information along with semantic embeddings generated by the SBERT 'all-MiniLM-L6-v2' model. Next, a FAISS (Facebook AI Similarity Search) index is built over the publication embeddings. Finally, the query is embedded and searched against the FAISS index to find the best-matching publication; that match is used to pull citation and filename information from the database, and the PDF text is combined with an engineered standard prompt to produce a response from the LLM. Now using Llama3.1!
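The last step, combining the retrieved PDF text with a standard prompt, might look roughly like the sketch below. The template wording and field names here are illustrative assumptions, not the repo's actual prompt:

```python
# Illustrative only: the repo's real prompt template may differ.
PROMPT_TEMPLATE = """Answer the question using only the excerpt below,
and cite the source given.

Source: {citation}
Excerpt:
{pdf_text}

Question: {question}
Answer:"""

def build_prompt(question: str, citation: str, pdf_text: str) -> str:
    """Fill the standard prompt with the best-matching paper's text."""
    return PROMPT_TEMPLATE.format(
        citation=citation, pdf_text=pdf_text, question=question
    )

prompt = build_prompt(
    question="What datasets were used?",
    citation="Shaban et al., 2023",
    pdf_text="...scraped PDF text...",
)
```

The filled prompt is what gets sent to Llama3.1 via ollama.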

Tip

If you get a bad response, run your question again and it may produce something better.

Environment

PubRAG runs using ollama and some assorted Python libraries, including PDFMiner for scraping text from PDFs. This repo was developed with Python 3.10 on Ubuntu, running on CPU.

pip install -r requirements.txt
curl -fsSL https://ollama.com/install.sh | sh

Data

Option 1: Use your own folder of PDF files and save the path as the bash variable PDF_DIR.

Option 2: Download test data to a folder like the following (if using Ubuntu):

PDF_DIR="test/my_pdf_files"
mkdir -p $PDF_DIR
wget https://www.nature.com/articles/s41467-023-44188-w.pdf -P $PDF_DIR
wget https://www.nature.com/articles/s41467-023-40066-7.pdf -P $PDF_DIR

Prepare database and index files

Create vector database

Creating a vector database simplifies storing and looking up each document's semantic embeddings. This particular database contains each paper's citation info, file path, and SBERT embeddings.

Caution

Once you make your vector database, don't change the filename or location of the PDFs in your directory. If you do, you'll need to create a fresh database.
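One quick way to spot a stale database is to check whether every stored PDF path still exists on disk. The sketch below assumes the paths live in a `filepath` column of a `papers` table; the real schema may differ:

```python
import sqlite3
from pathlib import Path

def find_missing_pdfs(db_path: str) -> list[str]:
    """Return stored PDF paths that no longer exist on disk.

    Assumes a hypothetical `papers(filepath)` table; adjust to the
    actual schema produced by make_vector_db.py.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute("SELECT filepath FROM papers").fetchall()
    finally:
        con.close()
    return [p for (p,) in rows if not Path(p).exists()]
```

If this returns anything, rebuild the database rather than querying against dead paths.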

python make_vector_db.py $PDF_DIR ./test/research_papers.db
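One plausible layout for such a database stores the embedding as raw bytes alongside the citation and file path. This is a sketch of the idea, not necessarily the schema `make_vector_db.py` actually creates:

```python
import sqlite3
import numpy as np

def init_db(db_path: str) -> None:
    """Create a hypothetical one-row-per-paper table."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS papers (
               id INTEGER PRIMARY KEY,
               citation TEXT,    -- e.g. "Shaban et al., 2023"
               filepath TEXT,    -- path to the source PDF
               embedding BLOB    -- float32 SBERT vector (384-dim
                                 -- for all-MiniLM-L6-v2)
           )"""
    )
    con.commit()
    con.close()

def add_paper(db_path: str, citation: str, filepath: str,
              embedding: np.ndarray) -> None:
    """Insert one paper, serializing its embedding to bytes."""
    con = sqlite3.connect(db_path)
    con.execute(
        "INSERT INTO papers (citation, filepath, embedding) VALUES (?, ?, ?)",
        (citation, filepath, embedding.astype(np.float32).tobytes()),
    )
    con.commit()
    con.close()
```

Storing the vector as a BLOB keeps the database to a single SQLite file; it can be decoded back with `np.frombuffer(blob, dtype=np.float32)`.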

Create FAISS index file

Creating a FAISS index array speeds up similarity searches.

python index_vector_db.py ./test/research_papers.db ./test/faiss_index.index
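Conceptually, a flat FAISS index answers "which stored embedding is closest to the query?" A minimal NumPy stand-in for that search (without the FAISS library itself) looks like:

```python
import numpy as np

def nearest_paper(query_vec: np.ndarray, embeddings: np.ndarray) -> int:
    """Return the row index of the embedding closest to the query
    by L2 distance, mimicking what a flat FAISS index computes."""
    dists = np.linalg.norm(embeddings - query_vec, axis=1)
    return int(np.argmin(dists))

# Toy example: three 4-d "embeddings" and a query nearest to row 2.
emb = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0]], dtype=np.float32)
query = np.array([0.1, 0.0, 0.9, 0.0], dtype=np.float32)
best = nearest_paper(query, emb)  # → 2
```

FAISS precomputes index structures so this lookup stays fast as the number of papers grows, which is what `index_vector_db.py` sets up.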

Run PubRAG

Now you can run the query script.

python run_pub_rag.py ./test/research_papers.db ./test/faiss_index.index \
  "What were the datasets used to train the model in the Mahmood MAPS paper?"

A generated response from llama3.1 using PubRAG (took ~5 minutes on a CPU):

According to the Mahmood MAPS paper, the authors used three in-house datasets for training and evaluation of their proposed method. However, they also compared their results with two existing cell phenotyping methods, namely ASTIR and CellSighter, which were re-trained on the same three in-house datasets using the same 5-fold cross-validation splits.

In addition to these in-house datasets, the authors obtained results for CellSighter on a publicly available CRC CODEX dataset from another paper. Therefore, the answer to your question would be that the Mahmood MAPS paper used at least one publicly available dataset (CRC CODEX) and three in-house datasets for training and evaluation.

(Muhammad Shaban et al., 2023: MAPS: pathologist-level cell type annotation from tissue images through machine learning)

TODO

  • Create a vector db to store encoded pubs
  • Clock run times
  • Add database and index file update option for when new pubs are added
  • Figure out ways to make this framework run faster on a CPU

Acknowledgements
